When I attended NAACL, I wanted to do a little test. I had two pitches for my LLM.int8() paper. One pitch is about how I use advanced quantization methods to achieve no performance degradation transformer inference at scale that makes large models more accessible. The other pitch talks about emergent outliers in transformers and how they radically change what transformers learn and how they function.
From that, I learned that quantization research is like printers. Nobody cares about printers. Nobody likes printers. But everybody is happy if printers do their job.[Read more…] about LLM.int8() and Emergent Features