
Use the same column data types for all engines in benchmarks #101

Open
MrPowers opened this issue Dec 13, 2024 · 1 comment

Comments


MrPowers commented Dec 13, 2024

Here's a snippet from the Polars groupby benchmarks:

pl.read_csv(src_grp, schema_overrides={"id4": pl.Int32, "id5": pl.Int32, "id6": pl.Int32, "v1": pl.Int32, "v2": pl.Int32})

Looks like id4, id5, id6, v1, and v2 are read as Int32 columns.

Other engines, like Spark, just infer the column types:

x = spark.read.csv(src_grp, header=True, inferSchema='true')

I think we should either have every engine infer the column data types or have every engine specify them explicitly, so the comparison is fair. It's not apples-to-apples if some engines are using Int32 and others are using Int64.
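As a sketch of the "every engine specifies the types" option, one could build an explicit schema for the groupby CSV instead of relying on inference. The column list below mirrors the Int32 overrides in the Polars snippet; the string id columns and the `v3 double` column are assumptions based on the typical db-benchmark groupby layout, not confirmed by this thread.

```python
# Hypothetical explicit schema for the groupby benchmark CSV.
# Columns id1-id3 as strings and v3 as double are assumed; adjust to
# whatever the actual data file contains.
schema_ddl = ", ".join([
    "id1 string", "id2 string", "id3 string",
    "id4 int", "id5 int", "id6 int",
    "v1 int", "v2 int", "v3 double",
])

# With a SparkSession in scope, this would replace inferSchema:
# x = spark.read.csv(src_grp, header=True, schema=schema_ddl)
```

Spark accepts a DDL-formatted string like this for the `schema` argument of `spark.read.csv`, which pins every engine to the same 32-bit integer widths the Polars solution uses.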

Tmonster (Collaborator) commented

I agree that all engines should attempt to use the same types.

It's important to note, however, that some of the aggregations produce answers that overflow Int32 even though all inputs are Int32, so the results need Int64. I think Polars hit this issue somewhere.
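The overflow concern is easy to see with plain arithmetic: every input value can fit comfortably in Int32 while their sum does not, so an engine that keeps the aggregate in Int32 would wrap or error. A minimal illustration (pure Python, with a hypothetical column of values):

```python
# Int32 can hold values up to 2**31 - 1 = 2,147,483,647.
INT32_MAX = 2**31 - 1

# A hypothetical column: every individual value fits in Int32 ...
values = [1_000_000] * 3000

# ... but the sum does not, so the aggregation result needs Int64.
total = sum(values)  # 3,000,000,000 > INT32_MAX

assert all(v <= INT32_MAX for v in values)
assert total > INT32_MAX
```

This is why "use the same types everywhere" has to distinguish input column types (which can be Int32 across engines) from aggregation result types (which may legitimately widen to Int64).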

Also, I only have limited time to work on this benchmark, and that time mostly goes to maintenance and updating solutions. I don't have the bandwidth to go through every solution and ensure the setup for each system is exactly the same, but I am happy to review PRs if they come up.
