
Use the same column data types for all engines in benchmarks #101

Open
MrPowers opened this issue Dec 13, 2024 · 1 comment

Comments


MrPowers commented Dec 13, 2024

Here's a snippet from the Polars groupby benchmarks:

pl.read_csv(src_grp, schema_overrides={"id4": pl.Int32, "id5": pl.Int32, "id6": pl.Int32, "v1": pl.Int32, "v2": pl.Int32})

Looks like id4, id5, id6, v1, and v2 are read as Int32 columns.

Other engines, like Spark, just infer the column types:

x = spark.read.csv(src_grp, header=True, inferSchema='true')

I think we should either have every engine infer the column data types or have every engine specify them explicitly, so the comparison is fair. It's not apples-to-apples if some engines are using Int32 and others are using Int64.
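As a sketch of the "every engine specifies the types" option, one could build an explicit schema for the groupby CSV instead of relying on inference. The column list below mirrors the Int32 overrides in the Polars snippet; the string id columns and the `v3 double` column are assumptions based on the typical db-benchmark groupby layout, not confirmed by this thread.

```python
# Hypothetical explicit schema for the groupby benchmark CSV.
# Columns id1-id3 as strings and v3 as double are assumed; adjust to
# whatever the actual data file contains.
schema_ddl = ", ".join([
    "id1 string", "id2 string", "id3 string",
    "id4 int", "id5 int", "id6 int",
    "v1 int", "v2 int", "v3 double",
])

# With a SparkSession in scope, this would replace inferSchema:
# x = spark.read.csv(src_grp, header=True, schema=schema_ddl)
```

Spark accepts a DDL-formatted string like this for the `schema` argument of `spark.read.csv`, which pins every engine to the same 32-bit integer widths the Polars solution uses.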

Tmonster (Collaborator) commented

I agree that all engines should attempt to use the same types.

It's important to note, however, that some of the aggregations produce answers that overflow Int32 even though all inputs are Int32, so the results need Int64. I think Polars hit this issue somewhere.
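The overflow concern is easy to see with plain arithmetic: every input value can fit comfortably in Int32 while their sum does not, so an engine that keeps the aggregate in Int32 would wrap or error. A minimal illustration (pure Python, with a hypothetical column of values):

```python
# Int32 can hold values up to 2**31 - 1 = 2,147,483,647.
INT32_MAX = 2**31 - 1

# A hypothetical column: every individual value fits in Int32 ...
values = [1_000_000] * 3000

# ... but the sum does not, so the aggregation result needs Int64.
total = sum(values)  # 3,000,000,000 > INT32_MAX

assert all(v <= INT32_MAX for v in values)
assert total > INT32_MAX
```

This is why "use the same types everywhere" has to distinguish input column types (which can be Int32 across engines) from aggregation result types (which may legitimately widen to Int64).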

Also, I only have limited time to work on this benchmark, and that time mostly goes to maintenance and updating solutions. I don't have the bandwidth to go through every solution and ensure the setup for each system is exactly the same, but I am happy to review PRs if they come up.
