Bug: h2o.import_file ignores parquet schema when reading parquet files #15690

Closed
wendycwong opened this issue Aug 11, 2023 · 4 comments
@wendycwong
Contributor

Support ticket: https://support.h2o.ai/a/tickets/106137

@Bernard-H2O came up with a workaround that forces a column's type to double by calling col.asnumeric().

However, the column type information is stored in the Parquet metadata, and the H2O parser should use it to determine the column types instead of guessing each column's type from the data.

@wendycwong added the bug label Aug 11, 2023
@wendycwong
Contributor Author

Code from @Bernard-H2O that reproduces the problem and provides a workaround:

import h2o
import pandas as pd
import pyarrow.parquet as pq

# Single path used for both writing and reading the test file
parquet_path = './df.parquet'

# Simple data frame that will be saved to Parquet and reloaded
df = pd.DataFrame(data={'mixed_col': [1.0, 1.1],
                        'uniform_col': [1.00, 1.00]})

# pandas data types are correct
print("Pandas dtypes ---------------------")
print(df.dtypes)
df.to_parquet(parquet_path)

# and so is the Parquet schema
print("\nParquet schema --------------------")
print(pq.read_schema(parquet_path))

# but the recovered H2O types are wrong (2nd column converted to integer)
h2o.init()
h2o_df = h2o.import_file(parquet_path)
print("\nH2O types --------------------")
print(h2o_df.types)

# Using col_types does not fix the issue
# (also tried "numeric", which did not work either)
col_types = {"mixed_col": "real", "uniform_col": "real"}
h2o_df = h2o.import_file(parquet_path, col_types=col_types)
print("\nH2O types --------------------")
print(h2o_df.types)

# Workaround that works: convert the column explicitly with asnumeric()
h2o_df = h2o.import_file(parquet_path)
h2o_df['uniform_col'] = h2o_df['uniform_col'].asnumeric()
print("\nH2O types --------------------")
print(h2o_df.types)

@wendycwong
Contributor Author

Following @arunaryasomayajula's suggestion, I dug into the code and found the problem area. convertType in ParquetParser.java maps INT32, FLOAT, DOUBLE, and INT64 all to Vec.T_NUM. The specific numeric type is determined later, when the values are written into chunks: we use the smallest chunk type that can accommodate the parsed data. Hence, if a column is declared as double but every entry is an integer value, an integer chunk is used to store it, and the column's final type is integer rather than double.
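To make the effect concrete, here is a minimal pure-Python sketch of the "smallest type that fits the data" idea described above. It is only an illustration: `smallest_numeric_type` is a hypothetical helper, not H2O's actual chunk-selection code, and it distinguishes only "int" from "real".

```python
def smallest_numeric_type(values):
    """Hypothetical helper: report "int" when every value is integer-valued,
    otherwise "real". This mimics the idea of choosing the narrowest storage
    that can hold the data; it is not H2O's implementation."""
    return "int" if all(float(v).is_integer() for v in values) else "real"


# Same data as in the reproduction script: both columns are doubles in Parquet.
columns = {"mixed_col": [1.0, 1.1], "uniform_col": [1.00, 1.00]}
types = {name: smallest_numeric_type(vals) for name, vals in columns.items()}
print(types)  # {'mixed_col': 'real', 'uniform_col': 'int'}
```

Under this rule the declared Parquet type plays no role in the outcome, which matches the types reported in the reproduction script.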

@wendycwong
Contributor Author

This behavior is intentional and saves memory. If a column is declared as double but contains only integer values, its type will be integer rather than double, because integers take less space to store than doubles. No rounding or truncation is applied when storing the data, so no precision is lost during parsing.

There is nothing to fix.

@Bernard-H2O

@wendycwong - thanks for the comment.
While it is true that a double column containing only integer values can be stored with no loss of precision, could this still be an issue given that the data is sampled (at 1000 rows)? If the 1000 sampled rows contain a float, there should be no problem, but if they are all integer values, is there a risk of missing float values that occur outside the 1000-row sample?
