BUG: Fix extra decimal places in DataFrame.to_csv() with quoting=csv.QUOTE_NONNUMERIC and float16/float32 dtypes (#60699) #60804

Open
wants to merge 6 commits into
base: main

akj2018
Contributor

@akj2018 akj2018 commented Jan 28, 2025

  1. Resolved by converting float16/float32 values to strings and back to float to preserve their decimal representation.
  2. Removed unnecessary quoting=None logic for float arrays.
  3. Added tests for float16, float32, and float64 cases with mixed values.

Issue

DataFrame.to_csv() writes extra decimal places when quoting=csv.QUOTE_NONNUMERIC, the DataFrame's dtype is float16 or float32, and float_format=None.

Reason

  • DataFrame.to_csv() internally calls get_values_for_csv(). When quoting is set (here csv.QUOTE_NONNUMERIC) and float_format is None, that function converts the NumPy float array to an object array:

    elif values.dtype.kind == "f" and not isinstance(values.dtype, SparseDtype):
        # see GH#13418: no special formatting is desired at the
        # output (important for appropriate 'quoting' behaviour),
        # so do not pass it through the FloatArrayFormatter
        if float_format is None and decimal == ".":
            mask = isna(values)
            if not quoting:
                values = values.astype(str)
            else:
                values = np.array(values, dtype="object")
            values[mask] = na_rep
            values = values.astype(object, copy=False)
            return values

np.array(values, dtype="object") affects float16 and float32 differently from float64:

  • float16 and float32
    • Both have limited precision, so most decimal values are stored as the nearest representable approximation (as float16, 8.57 is stored as 8.5703125).
    • When the array is converted to an object array, each element becomes a Python float (a 64-bit double, equivalent to numpy.float64), which has enough precision to print that stored approximation in full.
    • As a result, extra decimal places appear in the output for dtype=float16 and dtype=float32 after the conversion to dtype=object.
import numpy as np

arr = np.array([8.57, 0.156, -0.312, 123.3, -54.5, np.nan], dtype=np.float16)
print(arr)
# [  8.57    0.156  -0.312 123.3   -54.5       nan]

arr_obj = arr.astype(object)
print(arr_obj)
# [8.5703125 0.156005859375 -0.31201171875 123.3125 -54.5 nan]
  • float64
    • float64 cannot represent most decimal numbers (such as 8.57) exactly either, but a Python float is itself a 64-bit double and its repr is the shortest decimal string that round-trips to the same value, so 8.57 prints as 8.57.
    • When a float64 NumPy array is converted to object, each value is carried over unchanged, so no extra decimals appear in the output.
arr = np.array([8.57, 0.156, -0.312, 123.3, -54.5, np.nan], dtype=np.float64)
print(arr)
# [  8.57    0.156  -0.312 123.3   -54.5       nan]

arr_obj = arr.astype(object)
print(arr_obj)
# [8.57 0.156 -0.312 123.3 -54.5 nan]
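The rounding described above can also be checked without NumPy. The following is a minimal sketch using only the standard library's struct module, whose "e", "f" and "d" format codes pack IEEE 754 half, single and double precision values. The helper name round_trip is hypothetical and exists only for this illustration.

```python
import struct


def round_trip(value: float, fmt: str) -> float:
    """Store value in the given struct format and read it back as a Python float."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


half = round_trip(8.57, "<e")    # IEEE 754 binary16 (float16)
single = round_trip(8.57, "<f")  # IEEE 754 binary32 (float32)
double = round_trip(8.57, "<d")  # IEEE 754 binary64 (float64)

# The narrower formats keep a coarser approximation; printing it as a
# 64-bit Python float exposes the digits that the narrow type rounded to.
print(half)    # 8.5703125
print(single)  # 8.569999694824219
print(double)  # 8.57
```

The float16 result matches the first element of the object array shown above, which suggests the extra digits come from the widening to a Python float and not from the CSV writer itself.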

Fix Implemented

To preserve the decimal representation for dtype=float16 and dtype=float32, we convert the NumPy float array to strings and then convert those strings back to Python floats (64-bit doubles, equivalent to numpy.float64).

  • Conversion to str preserves the decimal representation and avoids exposing the internal binary representation.
  • Conversion back to float is necessary so that the values are not treated as strings (which csv.QUOTE_NONNUMERIC would quote). Parsing the short string as a 64-bit double preserves its printed form, because the repr of the resulting float round-trips to the same string.
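Why the values cannot simply stay as strings can be seen with the standard library's csv module alone. This is a small illustrative sketch, independent of pandas; the buffer and row contents are made up for the example.

```python
import csv
import io

buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")

# A str is non-numeric, so it is quoted; a float is written bare.
writer.writerow(["8.57", 8.57])

print(buffer.getvalue())  # "8.57",8.57
```

Leaving the float16/float32 values as str would therefore change the quoting of every numeric cell, which is why the strings are parsed back into floats.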

Additionally, in the original code, when quoting is None the values are first converted to str and then back to object. This is unnecessary work, because na_rep (a str) can be assigned directly into an object array.

Therefore, the quoting=None branch was removed to streamline the logic.

    elif values.dtype.kind == "f" and not isinstance(values.dtype, SparseDtype):
        # see GH#13418: no special formatting is desired at the
        # output (important for appropriate 'quoting' behaviour),
        # so do not pass it through the FloatArrayFormatter
        if float_format is None and decimal == ".":
            mask = isna(values)

            if values.dtype in [np.float16, np.float32]:
                values = np.array(values, dtype="str")  # preserve decimal representation
                values = values.astype(float, copy=False)  # preserve string representation

            values = values.astype(object, copy=False)
            values[mask] = na_rep
            return values
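The effect of this branch can be tried outside pandas. The sketch below is an approximation under stated assumptions, not the pandas implementation: floats_to_csv_objects is a hypothetical standalone helper, np.isnan stands in for pandas' isna, and the default na_rep of "" is chosen only for the example.

```python
import numpy as np


def floats_to_csv_objects(values: np.ndarray, na_rep: str = "") -> np.ndarray:
    """Hypothetical helper mirroring the branch above, outside pandas."""
    mask = np.isnan(values)  # stand-in for pandas' isna (assumption)
    if values.dtype in (np.float16, np.float32):
        # Format each element as decimal text at its own (narrow) precision.
        values = values.astype(str)
        # Parse that text as float64 so the elements are numbers again.
        values = values.astype(float)
    values = values.astype(object)
    values[mask] = na_rep
    return values


arr = np.array([8.57, 0.156, -0.312, 123.3, -54.5, np.nan], dtype=np.float16)
print(floats_to_csv_objects(arr).tolist())
```

With the float16 input from the earlier example, the elements should come back as Python floats that print with their short decimal form, and the missing value is replaced by na_rep.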

Testing

All existing test cases in test_to_csv.py pass. New tests cover DataFrames with dtype float16, float32 and float64, containing a mix of negative, positive and missing values, written with quoting=csv.QUOTE_NONNUMERIC:

1. {"col": [8.57, 0.156, -0.312, 123.3, -54.5, np.nan]} and dtype="float16"

2. {"col": [8.57, 1.234567, -2.345678, 1e6, -1.5e6, np.nan]} and dtype="float32"

3. {"col": [8.57, 3.141592653589793, -2.718281828459045, 1.01e12, -5.67e11, np.nan]} and dtype="float64"

Successfully merging this pull request may close these issues.

BUG: quoting=csv.QUOTE_NONNUMERIC adds extra decimal places