From 914c35f9e88656e1c275ba2f4c93101d3c0d931a Mon Sep 17 00:00:00 2001
From: Joao Gante
Date: Fri, 3 Feb 2023 15:50:04 +0000
Subject: [PATCH] Update hf-bitsandbytes-integration.md (#818)

* Update hf-bitsandbytes-integration.md

Fixes typos :)

* Update hf-bitsandbytes-integration.md
---
 hf-bitsandbytes-integration.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/hf-bitsandbytes-integration.md b/hf-bitsandbytes-integration.md
index d2739713c4..d3d9b26291 100644
--- a/hf-bitsandbytes-integration.md
+++ b/hf-bitsandbytes-integration.md
@@ -44,13 +44,13 @@ Float32 (FP32) stands for the standardized IEEE 32-bit floating point representa
 
 In the float16 (FP16) data type, 5 bits are reserved for the exponent and 10 bits are reserved for the mantissa. This makes the representable range of FP16 numbers much lower than FP32. This exposes FP16 numbers to the risk of overflowing (trying to represent a number that is very large) and underflowing (representing a number that is very small).
 
-For example, if you do `10k * 10k` you end up with `100k` which is not possible to represent in FP16, as the largest number possible is `64k`. And thus you'd end up with `NaN` (Not a Number) result and if you have sequential computation like in neural networks, all the prior work is destroyed.
+For example, if you do `10k * 10k` you end up with `100M` which is not possible to represent in FP16, as the largest number possible is `64k`. And thus you'd end up with `NaN` (Not a Number) result and if you have sequential computation like in neural networks, all the prior work is destroyed.
 Usually, loss scaling is used to overcome this issue, but it doesn't always work well.
 
 A new format, bfloat16 (BF16), was created to avoid these constraints. In BF16, 8 bits are reserved for the exponent (which is the same as in FP32) and 7 bits are reserved for the fraction.
 
-This means that in BF16 we can retain the same dynamic range as FP32. But we lose 3 bits of precision. Now there is absolutely no problem with huge numbers, but the precision is worse than FP16 here.
+This means that in BF16 we can retain the same dynamic range as FP32. But we lose 3 bits of precision with respect to FP16. Now there is absolutely no problem with huge numbers, but the precision is worse than FP16 here.
 
 In the Ampere architecture, NVIDIA also introduced [TensorFloat-32](https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/) (TF32) precision format, combining the dynamic range of BF16 and precision of FP16 to only use 19 bits. It's currently only used internally during certain operations.
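
As an aside to the corrected sentence (not part of the patch itself), here is a minimal sketch, assuming PyTorch as in the rest of the blog post, that demonstrates the `10k * 10k` overflow in FP16 and the range-versus-precision trade-off of BF16 described above:

```python
import torch

# Illustrative sketch only (not part of the patch): reproduce the `10k * 10k`
# overflow discussed in the hunk. FP16's largest finite value is 65504, so
# 100M overflows to inf, and downstream ops can then propagate inf/NaN.
x = torch.tensor(10_000.0, dtype=torch.float16)
print(x * x)                            # inf in FP16
print(torch.finfo(torch.float16).max)   # 65504.0

# BF16 keeps FP32's 8 exponent bits, so the same product stays finite,
# but its 7 mantissa bits give coarser precision than FP16's 10.
y = torch.tensor(10_000.0, dtype=torch.bfloat16)
print(y * y)                            # ~1e8, no overflow
print(torch.finfo(torch.bfloat16).max)  # ~3.39e38, same dynamic range as FP32
```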