Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Benchmark with Parquet #32

Open
srilman opened this issue Sep 4, 2023 · 7 comments
Open

Benchmark with Parquet #32

srilman opened this issue Sep 4, 2023 · 7 comments

Comments

@srilman
Copy link

srilman commented Sep 4, 2023

Is there any interest in using Parquet datasets to benchmark, particularly for the 50GB dataset case? Parquet is very common for large-scale data analytics, and as far as I know, most if not all of the libraries tested support Parquet files. Furthermore, some libraries have special support for Parquet datasets, like reading encoded columns and performing analytics on them directly in memory.

I think it still makes sense to do the 0.5GB and 5GB benchmarks in CSV, but I think it would be helpful to have an additional 50GB benchmark in Parquet (or even replace the existing one). That way, the benchmark doesn't take much longer.

@jangorecki
Copy link

jangorecki commented Sep 5, 2023

Parquet was used at some point in the past. Looking through git history will take you there. Whatever tools can use in-memory data, they are already using it. It is just faster than reading from disk. And then it doesn't matter if it's parquet or csv. Reading data from disk to memory is not part of timings. It is always made at start of the script. Parquet ended up to not be portable as advertised, therefore csv was kept instead. If you want to benchmark tools that runs queries on-disk data, then yes, make sense to look into parquet again. But then only using it for on-disk data cases (size: out of memory OR solution that does not support in-memory model) make sense, not replacing in-memory model.

@jangorecki
Copy link

jangorecki commented Sep 5, 2023

Actually parquet may be still in use for some tools (possibly only for 1e9 rows) In readme batch benchmark run you can read

re-save to binary format where needed

@MrPowers
Copy link

MrPowers commented Oct 7, 2023

Yea, I have argued for this shift in the past. Even better would be 50 1GB Parquet files.

The 50GB benchmarks for a single CSV file are really misleading, especially for the engines that are optimized to perform parallel reads.

@jangorecki
Copy link

jangorecki commented Oct 7, 2023

Benchmarks are not for csv or parquet. They are for in memory data. None of solutions in benchmark uses CSV as it's data model. You may want to reread my previous messages.

@MrPowers
Copy link

MrPowers commented Oct 7, 2023

@jangorecki - yea, I agree but CSV files have limitations that cause memory issues on queries that wouldn't have issues if the data was stored in Parquet. Let's look at an example pandas query:

x.groupby("id1", dropna=False, observed=True).agg({"v1": "sum"})

For the 1 billion row dataset, this query will error out on some machines if CSV is used, but work if Parquet is used. That means that the selected file format is changing the benchmark results for in memory data.

The file format is impacting the distributed compute engine results even more.

Furthermore, the benchmarks run way, way slower than they should because of the CSV file format. When pandas read_csv is used on a large file, usecols should definitely be set, but using read_parquet and setting the columns argument would be way better. This result is totally misleading IMO:

Screenshot 2023-10-07 at 12 30 53 PM

@jangorecki
Copy link

jangorecki commented Oct 7, 2023

query will error out on some machines if CSV is used

Precisely "if in-memory data is used"

Totally agree on that, whenever tool is not able to do in-memory, then fall back to on-disk data is an option. This is how for example spark is now doing join for 1e9 rows:

print("using disk memory-mapped data storage" if on_disk else "using in-memory data storage", flush=True)

can be easily adapted for pandas from there.

Moreover if it is faster to use on-disk rather than in-memory for a specific solution, then on-disk should be preferred as well, but I think it is rather uncommon scenario, because whatever on-disk format is being used, it has to be loaded into memory anyway for making computation on it. Also it should be well investigated it is a general rule, and not only under certain conditions (like super fast disk).

@kszlim
Copy link

kszlim commented Nov 3, 2023

I think it would make sense also including a (maybe separate) benchmark for seeing how fast these engines can query a parquet file (or perhaps a hive partitioned directory of parquet files).

I think that reflects real world use cases pretty well, though I agree once you include too many knobs, it's hard to have a fair and representative benchmarking setting.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants