Bug: h2o.import_file ignores parquet schema when reading parquet files #15690

wendycwong · 2023-08-11T17:50:40Z

Support ticket: https://support.h2o.ai/a/tickets/106137

@Bernard-H2O came up with a workaround that forces column types to double using col.asnumeric().

However, the column type information is stored in the Parquet metadata, and the H2O parser should use it to determine the column types instead of guessing each column's type from the data.

wendycwong · 2023-08-16T20:50:59Z

Code from @Bernard-H2O that reproduces the problem and provides a workaround:

import h2o
import pandas as pd
import pyarrow.parquet as pq

# simple data frame that will be saved to parquet and reloaded

df = pd.DataFrame(data={'mixed_col': [1.0, 1.1],
                        'uniform_col': [1.00, 1.00]})

# pandas data types are correct

print("Pandas dtypes ---------------------")
print(df.dtypes)
df.to_parquet('./df.parquet')

# and so is the parquet schema

print("\nParquet schema --------------------")
print(pq.read_schema('./df.parquet'))

# but the recovered h2o types are wrong (2nd col converted to integer)

h2o.init()
h2o_df = h2o.import_file('./df.parquet')
print("\nH2O types --------------------")
print(h2o_df.types)

# Using col_types does not fix the issue

col_types = {"mixed_col": "real", "uniform_col": "real"}  # also tried "numeric", which did not work either
h2o_df = h2o.import_file('./df.parquet', col_types=col_types)
print("\nH2O types --------------------")
print(h2o_df.types)

# Workaround - this approach works

h2o_df = h2o.import_file('./df.parquet')
h2o_df['uniform_col'] = h2o_df['uniform_col'].asnumeric()
print("\nH2O types --------------------")
print(h2o_df.types)
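
For files with many columns, the same workaround could be driven by the Parquet schema instead of naming each column by hand. The following is only a sketch under stated assumptions: `columns_to_force_real` and `import_with_parquet_floats` are hypothetical helper names, the Parquet type names are the strings pyarrow prints for floating-point types, and it relies on `asnumeric()` behaving as in the reproduction above.

```python
# Parquet floating-point types as printed by pyarrow (str(field_type)).
FLOATING_PARQUET_TYPES = {"halffloat", "float", "double"}


def columns_to_force_real(parquet_types, h2o_types):
    """Return columns that Parquet declares floating-point but H2O reports as 'int'.

    Both arguments are plain dicts mapping column name -> type name.
    """
    return [
        name
        for name, ptype in parquet_types.items()
        if ptype in FLOATING_PARQUET_TYPES and h2o_types.get(name) == "int"
    ]


def import_with_parquet_floats(path):
    """Hypothetical wrapper: import a Parquet file, then apply asnumeric() where needed.

    Imports are local so the pure helper above can be used without a running
    H2O cluster. Assumes h2o.init() has already been called.
    """
    import h2o
    import pyarrow.parquet as pq

    schema = pq.read_schema(path)
    parquet_types = {n: str(t) for n, t in zip(schema.names, schema.types)}
    frame = h2o.import_file(path)
    for name in columns_to_force_real(parquet_types, frame.types):
        frame[name] = frame[name].asnumeric()
    return frame
```

With the frame from the reproduction, only `uniform_col` would be selected for conversion, since `mixed_col` is already parsed as real.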

wendycwong · 2023-08-17T20:18:57Z

I dug into the code per @arunaryasomayajula's suggestion and found the problem area. convertType in ParquetParser.java maps INT32, INT64, FLOAT, and DOUBLE all to Vec.T_NUM. The specific numeric type is determined only later, when the values are written into chunks: we try to use the smallest chunk type that can accommodate the data being parsed. Hence, if a column is declared as DOUBLE but every entry is an integer value, an integer chunk is used to store the column, and its final type is integer rather than double.
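
The effect can be illustrated with a minimal sketch in plain Python. This is not H2O's code: `pick_storage_kind` is a hypothetical name, and the real chunk selection distinguishes many more storage formats. It only mimics the rule "use the narrowest representation that holds every value", ignoring the type declared in the file.

```python
def pick_storage_kind(values):
    """Return "int" when every value is a whole number, otherwise "real".

    Hypothetical stand-in for choosing the smallest chunk type; the declared
    source type (for example Parquet DOUBLE) is never consulted.
    """
    if all(float(v).is_integer() for v in values):
        return "int"
    return "real"


print(pick_storage_kind([1.0, 1.1]))  # values of mixed_col in the reproduction
print(pick_storage_kind([1.0, 1.0]))  # values of uniform_col in the reproduction
```

Under this rule the two columns from the reproduction end up with different kinds even though both are declared as double in the Parquet schema.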

wendycwong · 2023-08-17T20:33:34Z

This behavior is implemented to save memory. If a column is double but contains only integer values, the column type will be integer and not double, because integers take less space to store than doubles. No rounding or truncation is applied when storing the data, so no precision is lost during parsing.

There is nothing to fix.

Bernard-H2O · 2023-08-17T21:50:57Z

@wendycwong - thanks for the comment.
It is true that a column that is double but contains only integer values can be stored with no loss of precision. However, could this still be an issue, since the data is being sampled (at 1000 rows)? If the 1000 sampled rows contain a float, it should not be an issue, but if they are all integer values, is there a risk of missing float values that lie outside the 1000-row sample?
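
To make the question concrete, here is a small sketch in plain Python. It is purely hypothetical: `guess_kind` is an illustrative name and the 1000-row prefix is taken from the question, not from H2O's parser. The earlier comment says the numeric type is settled when values are written into chunks, so this does not claim to reproduce H2O's behavior; it only shows how a guess made from a sample could differ from one made from all rows.

```python
SAMPLE_ROWS = 1000  # sample size mentioned in the question; an assumption here


def guess_kind(values):
    """Return "int" if every value is a whole number, otherwise "real"."""
    return "int" if all(float(v).is_integer() for v in values) else "real"


# 1000 whole-number rows followed by a single fractional value.
column = [1.0] * SAMPLE_ROWS + [2.5]

sample_guess = guess_kind(column[:SAMPLE_ROWS])  # looks only at the sample
full_guess = guess_kind(column)                  # looks at every row
print(sample_guess, full_guess)
```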

wendycwong added the bug label Aug 11, 2023

wendycwong closed this as completed Aug 17, 2023
