<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:georss="http://www.georss.org/georss"
	xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#"
	>
<channel>
	<title>Comments on: LLM.int8() and Emergent Features</title>
	<atom:link href="https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/feed/" rel="self" type="application/rss+xml" />
	<link>https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/</link>
	<description>Making deep learning accessible.</description>
	<lastBuildDate>Tue, 06 Sep 2022 16:54:14 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.0.11</generator>
	<item>
		<title>By: Vivek Kumar</title>
		<link>https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/comment-page-1/#comment-111821</link>

		<dc:creator><![CDATA[Vivek Kumar]]></dc:creator>
		<pubDate>Tue, 06 Sep 2022 16:54:14 +0000</pubDate>
		<guid isPermaLink="false">https://timdettmers.com/?p=1093#comment-111821</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/comment-page-1/#comment-111811&quot;&gt;Tim Dettmers&lt;/a&gt;.

Thank you, Tim!
We are looking at scattered sparsity to boost inference performance. And it does come with an additional penalty; anything above 20% sparsity is a bonus.]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/comment-page-1/#comment-111811">Tim Dettmers</a>.</p>
<p>Thank you, Tim!<br />
We are looking at scattered sparsity to boost inference performance. And it does come with an additional penalty; anything above 20% sparsity is a bonus.</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>By: Tim Dettmers</title>
		<link>https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/comment-page-1/#comment-111814</link>

		<dc:creator><![CDATA[Tim Dettmers]]></dc:creator>
		<pubDate>Tue, 06 Sep 2022 14:35:13 +0000</pubDate>
		<guid isPermaLink="false">https://timdettmers.com/?p=1093#comment-111814</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/comment-page-1/#comment-111484&quot;&gt;awfidius&lt;/a&gt;.

Thanks for pointing out the error!]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/comment-page-1/#comment-111484">awfidius</a>.</p>
<p>Thanks for pointing out the error!</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>By: Tim Dettmers</title>
		<link>https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/comment-page-1/#comment-111813</link>

		<dc:creator><![CDATA[Tim Dettmers]]></dc:creator>
		<pubDate>Tue, 06 Sep 2022 14:31:58 +0000</pubDate>
		<guid isPermaLink="false">https://timdettmers.com/?p=1093#comment-111813</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/comment-page-1/#comment-110730&quot;&gt;Samuel&lt;/a&gt;.

Thanks for pointing this out!]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/comment-page-1/#comment-110730">Samuel</a>.</p>
<p>Thanks for pointing this out!</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>By: Tim Dettmers</title>
		<link>https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/comment-page-1/#comment-111812</link>

		<dc:creator><![CDATA[Tim Dettmers]]></dc:creator>
		<pubDate>Tue, 06 Sep 2022 14:31:35 +0000</pubDate>
		<guid isPermaLink="false">https://timdettmers.com/?p=1093#comment-111812</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/comment-page-1/#comment-110699&quot;&gt;sva&lt;/a&gt;.

Thank you for noticing — fixed!]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/comment-page-1/#comment-110699">sva</a>.</p>
<p>Thank you for noticing — fixed!</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>By: Tim Dettmers</title>
		<link>https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/comment-page-1/#comment-111811</link>

		<dc:creator><![CDATA[Tim Dettmers]]></dc:creator>
		<pubDate>Tue, 06 Sep 2022 14:30:57 +0000</pubDate>
		<guid isPermaLink="false">https://timdettmers.com/?p=1093#comment-111811</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/comment-page-1/#comment-110744&quot;&gt;Vivek Kumar&lt;/a&gt;.

The sparsity is in the multiplication of the hidden state, so we can only prune dynamically with each example. Theoretically, that is possible, but it's difficult to do efficiently since the pattern is scattered and not well accelerated on hardware. However, while the sparsity can be very high (99%), some samples are relatively dense (40% sparsity).]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/comment-page-1/#comment-110744">Vivek Kumar</a>.</p>
<p>The sparsity is in the multiplication of the hidden state, so we can only prune dynamically with each example. Theoretically, that is possible, but it's difficult to do efficiently since the pattern is scattered and not well accelerated on hardware. However, while the sparsity can be very high (99%), some samples are relatively dense (40% sparsity).</p>
]]></content:encoded>
		
			</item>
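	<!-- A minimal Python sketch (an illustration, not code from the post) of the point made in
	     the reply above: the zero pattern of the hidden state depends on the input, so sparsity
	     can only be measured, and pruned, dynamically per example. The magnitude threshold and
	     tensor shapes here are assumptions.

	import torch

	def per_example_sparsity(h: torch.Tensor, threshold: float = 1e-6) -> torch.Tensor:
	    # h: hidden states of shape (batch, seq_len, dim).
	    # Count entries whose magnitude falls below the threshold as zeros and
	    # return the fraction of such entries for each example in the batch.
	    zeros = h.abs() < threshold
	    return zeros.float().mean(dim=(1, 2))

	# Some examples may come out around 99% sparse while others are closer to
	# 40%, which is why a fixed (static) pruning mask cannot capture the pattern.
	-->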
		<item>
		<title>By: Tim Dettmers</title>
		<link>https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/comment-page-1/#comment-111810</link>

		<dc:creator><![CDATA[Tim Dettmers]]></dc:creator>
		<pubDate>Tue, 06 Sep 2022 14:28:33 +0000</pubDate>
		<guid isPermaLink="false">https://timdettmers.com/?p=1093#comment-111810</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/comment-page-1/#comment-110661&quot;&gt;Ben Harper&lt;/a&gt;.

Thank you :)]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/comment-page-1/#comment-110661">Ben Harper</a>.</p>
<p>Thank you 🙂</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>By: Tim Dettmers</title>
		<link>https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/comment-page-1/#comment-111809</link>

		<dc:creator><![CDATA[Tim Dettmers]]></dc:creator>
		<pubDate>Tue, 06 Sep 2022 14:28:21 +0000</pubDate>
		<guid isPermaLink="false">https://timdettmers.com/?p=1093#comment-111809</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/comment-page-1/#comment-110660&quot;&gt;Jonathan Kummerfeld&lt;/a&gt;.

Yes, it seems that is possible. In a current collaboration, we are altering the transformer architecture and seeing faster training. However, we have not analyzed the dynamics of the outliers.]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/comment-page-1/#comment-110660">Jonathan Kummerfeld</a>.</p>
<p>Yes, it seems that is possible. In a current collaboration, we are altering the transformer architecture and seeing faster training. However, we have not analyzed the dynamics of the outliers.</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>By: awfidius</title>
		<link>https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/comment-page-1/#comment-111484</link>

		<dc:creator><![CDATA[awfidius]]></dc:creator>
		<pubDate>Wed, 31 Aug 2022 10:57:19 +0000</pubDate>
		<guid isPermaLink="false">https://timdettmers.com/?p=1093#comment-111484</guid>

					<description><![CDATA[Couple of buglets in the (very nice) I5-&#062;I3 example. 

Let’s do an example. Let’s say we have the vector [3, 1, 2, 3] in I5, and we want to quantize to I3.

Here the step-by-step recipe for quantization:

We find the absolute maximum value of the vector: [3, 1, 2, 3] -&#062; 3
Then we divide by that value: [3, 1, 2, 3] -&#062; [1, 0.33, 0.66, 1.0]
And now we multiple by the range of the target data type I3, which is 4: [1, 0.33, 0.66, 1.0] -&#062; [4.0, 1.33, 2.66, 4.0]
Now we round to the nearest value: [4.0, 1.33, 2.66, 4.0] -&#062; [4, 2**, 2, 4]
We now converted [3, 1, 2, 3**] in I5 to [4, 2**, 2, 4] in I3. To dequantize, we reverse this process.

Divide by 4: [4, 2*, 2, 4] -&#062; [1.0, 0.5, 0.5, 1.0]
Multiply by the absolute maximum: [1.0, 0.5, 0.5, 1.0] -&#062; [3.0, 1.5, 1.5, 3.0]
Now we round again: [3.0, 0.0, 1.5, 3.0] -&#062; [3, 2, 2, 3]
We see that our dequantization and quantization led to *an* error:
[3, 1, 2, 3] to [3, 2, 2, 3]
The second element changed from 1 to 2. This is a quantization error that leads to the loss of information in terms of how precise the information is encoded. If we have such errors and propagate them through many layers of a neural network, they accumulate, and they may change the result of a prediction and degrade the prediction quality.]]></description>
			<content:encoded><![CDATA[<p>Couple of buglets in the (very nice) I5-&gt;I3 example. </p>
<p>Let’s do an example. Let’s say we have the vector [3, 1, 2, 3] in I5, and we want to quantize to I3.</p>
<p>Here the step-by-step recipe for quantization:</p>
<p>We find the absolute maximum value of the vector: [3, 1, 2, 3] -&gt; 3<br />
Then we divide by that value: [3, 1, 2, 3] -&gt; [1, 0.33, 0.66, 1.0]<br />
And now we multiple by the range of the target data type I3, which is 4: [1, 0.33, 0.66, 1.0] -&gt; [4.0, 1.33, 2.66, 4.0]<br />
Now we round to the nearest value: [4.0, 1.33, 2.66, 4.0] -&gt; [4, 2**, 2, 4]<br />
We now converted [3, 1, 2, 3**] in I5 to [4, 2**, 2, 4] in I3. To dequantize, we reverse this process.</p>
<p>Divide by 4: [4, 2*, 2, 4] -&gt; [1.0, 0.5, 0.5, 1.0]<br />
Multiply by the absolute maximum: [1.0, 0.5, 0.5, 1.0] -&gt; [3.0, 1.5, 1.5, 3.0]<br />
Now we round again: [3.0, 0.0, 1.5, 3.0] -&gt; [3, 2, 2, 3]<br />
We see that our dequantization and quantization led to *an* error:<br />
[3, 1, 2, 3] to [3, 2, 2, 3]<br />
The second element changed from 1 to 2. This is a quantization error that leads to the loss of information in terms of how precise the information is encoded. If we have such errors and propagate them through many layers of a neural network, they accumulate, and they may change the result of a prediction and degrade the prediction quality.</p>
]]></content:encoded>
		
			</item>
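	<!-- A minimal Python sketch of the absmax quantization round trip discussed in the comment
	     above. The function names are illustrative; the target range of 4 corresponds to the I3
	     example, and the values shown assume Python's built-in rounding.

	def absmax_quantize(x, levels=4):
	    # Scale by the absolute maximum, then map onto the target integer range.
	    absmax = max(abs(v) for v in x)                  # [3, 1, 2, 3] gives absmax 3
	    return [round(v / absmax * levels) for v in x], absmax

	def absmax_dequantize(q, absmax, levels=4):
	    # Reverse the mapping: back to the unit range, then rescale and round.
	    return [round(v / levels * absmax) for v in q]

	q, m = absmax_quantize([3, 1, 2, 3])    # q = [4, 1, 3, 4]
	x = absmax_dequantize(q, m)             # x = [3, 1, 2, 3]; any element that does not
	                                        # match the input is a quantization error
	-->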
		<item>
		<title>By: Vivek Kumar</title>
		<link>https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/comment-page-1/#comment-110744</link>

		<dc:creator><![CDATA[Vivek Kumar]]></dc:creator>
		<pubDate>Fri, 19 Aug 2022 19:22:08 +0000</pubDate>
		<guid isPermaLink="false">https://timdettmers.com/?p=1093#comment-110744</guid>

					<description><![CDATA[Thanks for such a detailed blog post. Very helpful!
I have a question about the section &quot;How Emergent Features Emerge&quot;, topic #2, &quot;Attention layers become very sparse&quot;. Is this sparsity &#062; 20%? Could we just use pruning during training to handle the sparsity well? To improve performance during inference, is it good to just use H/W-supported sparsity (block or scattered)?
Any comments on improving inference performance would be helpful.
Thank you.]]></description>
			<content:encoded><![CDATA[<p>Thanks for such a detailed blog post. Very helpful!<br />
I have a question about the section &#8220;How Emergent Features Emerge&#8221;, topic #2, &#8220;Attention layers become very sparse&#8221;. Is this sparsity &gt; 20%? Could we just use pruning during training to handle the sparsity well? To improve performance during inference, is it good to just use H/W-supported sparsity (block or scattered)?<br />
Any comments on improving inference performance would be helpful.<br />
Thank you.</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>By: Samuel</title>
		<link>https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/comment-page-1/#comment-110730</link>

		<dc:creator><![CDATA[Samuel]]></dc:creator>
		<pubDate>Fri, 19 Aug 2022 11:00:25 +0000</pubDate>
		<guid isPermaLink="false">https://timdettmers.com/?p=1093#comment-110730</guid>

					<description><![CDATA[Error in the first quantization example. 

The vector originally posited (first example) was [3,1,2,3], not [3,1,2,4]. This leads the author to erroneously cite two errors when there is only one.]]></description>
			<content:encoded><![CDATA[<p>Error in the first quantization example. </p>
<p>The vector originally posited (first example) was [3,1,2,3], not [3,1,2,4]. This leads the author to erroneously cite two errors when there is only one.</p>
]]></content:encoded>
		
			</item>
	</channel>
</rss>
