Bug: h2o.import_file ignores parquet schema when reading parquet files #15690
Comments
Code from @Bernard-H2O to test the problem, together with a workaround:

    import h2o
    # simple data frame that will be saved to parquet and reloaded
    df = pd.DataFrame(data={'mixed_col': [1.0, 1.1],
    # pandas data types are correct
    print("Pandas dtypes ---------------------")
    # and so is the parquet schema
    print("\nParquet schema --------------------")
    # but the recovered h2o types are wrong (2nd col converted to integer)
    h2o.init()
    # Using col_types does not fix the issue
    col_types = {"mixed_col": "real", "uniform_col": "real"}  # also tried "numeric" which did not work either
    # Bug Fix - this approach is the workaround that works
    h2o_df = h2o.import_file('./df.parquet')
I dug into the code per @arunaryasomayajula's suggestion and found the problem area. In convertType in ParquetParser.java, INT32, FLOAT, DOUBLE, and INT64 are all treated as Vec.T_NUM. The concrete numeric type is only determined later, when the values are written into chunks: the parser uses the smallest chunk type that can accommodate the data being parsed. Hence, if a column is declared as double but every entry is an integer value, an integer chunk is used to store it, and the column's final type is integer rather than double.
This behavior is implemented to save memory. If a column is double but only contains integer values, the column type will be integer and not double because integers take less space to store than doubles. The values are stored without rounding or truncation, so no precision is lost during parsing. There is nothing to fix.
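To make the explanation above concrete, here is a minimal Python sketch of the idea of picking the smallest representation that holds a column without loss. It is not H2O's actual chunk-selection code: the helper `pick_storage_type` is hypothetical, the labels "int" and "real" are used only for illustration, and the values for `uniform_col` are assumed since they are not shown in this thread.

```python
from typing import Dict, Iterable, List


def pick_storage_type(values: Iterable[float]) -> str:
    """Hypothetical helper: return "int" when every value is a whole number,
    otherwise "real". Mimics choosing the smallest lossless representation."""
    return "int" if all(float(v).is_integer() for v in values) else "real"


# mixed_col values come from the test snippet in this thread;
# uniform_col values are an assumption made for illustration.
columns: Dict[str, List[float]] = {
    "mixed_col": [1.0, 1.1],
    "uniform_col": [1.0, 2.0],
}

# Both columns are declared as double, but only one needs a real-valued chunk.
inferred = {name: pick_storage_type(vals) for name, vals in columns.items()}
print(inferred)  # {'mixed_col': 'real', 'uniform_col': 'int'}

# Storing whole-number doubles as integers loses nothing: the round trip is exact.
lossless = all(float(int(v)) == v for v in columns["uniform_col"])
print(lossless)  # True
```

The sketch shows why the declared type is not preserved: the decision is made from the values alone, so the Parquet schema never enters into it.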
@wendycwong - thanks for the comment.
Support ticket: https://support.h2o.ai/a/tickets/106137
@Bernard-H2O was able to come up with a workaround that forces column types to be double using col.asnumeric().
However, the column type information is stored in the Parquet metadata, and the H2O parser should use that to determine the column types instead of guessing what each column type should be.
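The difference between the two approaches can be sketched in a few lines of Python. This is only an illustration of the request, not H2O's parser: the functions `current_behavior` and `schema_driven` and the mapping table are hypothetical, and the target labels "numeric", "int", and "real" are assumed names for the resulting column types.

```python
# Parquet physical types mentioned in this thread.
PARQUET_NUMERIC_TYPES = ("INT32", "INT64", "FLOAT", "DOUBLE")

# Hypothetical schema-driven mapping: the declared Parquet type decides the column type.
SCHEMA_DRIVEN_TYPES = {
    "INT32": "int",
    "INT64": "int",
    "FLOAT": "real",
    "DOUBLE": "real",
}


def current_behavior(parquet_type: str) -> str:
    """As described in this thread: every numeric Parquet type collapses to one
    generic numeric type, and the int/real decision is made later from the data."""
    if parquet_type not in PARQUET_NUMERIC_TYPES:
        raise ValueError(f"not a numeric Parquet type: {parquet_type}")
    return "numeric"


def schema_driven(parquet_type: str) -> str:
    """Requested behavior (sketch): keep the type declared in the Parquet metadata."""
    return SCHEMA_DRIVEN_TYPES[parquet_type]


for t in PARQUET_NUMERIC_TYPES:
    print(t, current_behavior(t), schema_driven(t))
```

With the schema-driven mapping, a DOUBLE column stays real even when all of its values happen to be whole numbers, at the cost of the memory saving described earlier in the thread.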