Update hf-bitsandbytes-integration.md (huggingface#818)
* Update hf-bitsandbytes-integration.md

Fixes typos :)

* Update hf-bitsandbytes-integration.md
gante authored Feb 3, 2023
1 parent f407082 commit 914c35f
Showing 1 changed file with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions hf-bitsandbytes-integration.md
@@ -44,13 +44,13 @@ Float32 (FP32) stands for the standardized IEEE 32-bit floating point representation
In the float16 (FP16) data type, 5 bits are reserved for the exponent and 10 bits are reserved for the mantissa. This makes the representable range of FP16 numbers much lower than FP32. This exposes FP16 numbers to the risk of overflowing (trying to represent a number that is very large) and underflowing (representing a number that is very small).


-For example, if you do `10k * 10k` you end up with `100k` which is not possible to represent in FP16, as the largest number possible is `64k`. And thus you'd end up with `NaN` (Not a Number) result and if you have sequential computation like in neural networks, all the prior work is destroyed.
+For example, if you do `10k * 10k` you end up with `100M` which is not possible to represent in FP16, as the largest number possible is `64k`. And thus you'd end up with `NaN` (Not a Number) result and if you have sequential computation like in neural networks, all the prior work is destroyed.
Usually, loss scaling is used to overcome this issue, but it doesn't always work well.
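As an illustration of the overflow described above (not part of the diff, just a minimal sketch assuming any recent PyTorch build), FP16's largest finite value is 65504, so squaring `10k` spills over to `inf`, and a later `inf - inf` becomes `NaN`:

```python
import torch

print(torch.finfo(torch.float16).max)   # 65504.0 -- the "64k" ceiling mentioned above

x = torch.tensor(10_000.0, dtype=torch.float16)
y = x * x                               # 100M does not fit in FP16
print(y)                                # tensor(inf, dtype=torch.float16)
print(y - y)                            # tensor(nan, dtype=torch.float16) -- inf - inf is NaN
```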

A new format, bfloat16 (BF16), was created to avoid these constraints. In BF16, 8 bits are reserved for the exponent (which is the same as in FP32) and 7 bits are reserved for the fraction.


-This means that in BF16 we can retain the same dynamic range as FP32. But we lose 3 bits of precision. Now there is absolutely no problem with huge numbers, but the precision is worse than FP16 here.
+This means that in BF16 we can retain the same dynamic range as FP32. But we lose 3 bits of precision with respect to FP16. Now there is absolutely no problem with huge numbers, but the precision is worse than FP16 here.
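For comparison (again an illustrative sketch, not part of the committed file), the same product stays finite in BF16, while its 7-bit fraction rounds nearby values much more coarsely than FP16's 10 bits:

```python
import torch

big = torch.tensor(10_000.0, dtype=torch.bfloat16)
print(big * big)   # a finite value near 1e8 -- no overflow, thanks to the FP32-like exponent range

# BF16 keeps 3 fewer fraction bits than FP16, so values near 1 round more coarsely:
print(torch.tensor(1.004, dtype=torch.float16))   # ~1.0039 (spacing near 1 is 2**-10)
print(torch.tensor(1.004, dtype=torch.bfloat16))  # ~1.0078 (spacing near 1 is 2**-7)
```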

In the Ampere architecture, NVIDIA also introduced [TensorFloat-32](https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/) (TF32) precision format, combining the dynamic range of BF16 and precision of FP16 to only use 19 bits. It's currently only used internally during certain operations.
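In PyTorch, TF32 is not a tensor dtype you allocate; it is a math mode used inside FP32 matrix multiplications and convolutions on Ampere-or-newer GPUs. A minimal sketch of the two switches that control it (defaults have varied across PyTorch releases):

```python
import torch

# Allow TF32 tensor-core math inside FP32 matmuls and cuDNN convolutions
# (only takes effect on Ampere or newer NVIDIA GPUs).
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
```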
