Vector CSUP search query consumes 75+ GB memory #5550

Open
philrz opened this issue Dec 25, 2024 · 1 comment

Comments

philrz commented Dec 25, 2024

tl;dr

I found that running the Search Test from the super command doc with CSUP in vector runtime consumed 75+ GB of RAM, which was enough to hang the EC2 instance with 32 GB of RAM that I'd successfully used in the past to run the equivalent query with other RDBMSs as well as with sequential super on BSUP input.

Details

Repro is with super commit 4084e01. The test data is available at s3://brim-sampledata/super-cmd-perf/gha.csup, which was generated similarly to the BSUP as shown here, i.e.:

$ super -f csup -o gha.csup gharchive_gz/*.json.gz

The query is run like this:

$ super -version
Version: v1.18.0-206-g4084e011

$ SUPER_VAM=1 super -c "SELECT count()
FROM 'gha.csup'
WHERE grep('in case you have any feedback 😊', payload.pull_request.body)"

In the past I've run these queries successfully on an AWS EC2 m6idn.2xlarge instance, which has 32 GB of RAM (but no swap). That has always been enough to run the equivalent query successfully, as shown in the doc, on DuckDB, ClickHouse, DataFusion, and sequential super with BSUP input. Previously we'd not been able to run the query at all with super in vector runtime, but with the merge of #5523 it was time to give it a go. On the first try on the EC2 instance, it consumed all the memory and hung the system.

To give it a closer look, I re-ran it on my MacBook, which only has 16 GB of RAM but does have swap. Keeping an eye on the process in Activity Monitor as it ran for a couple of hours, I watched it climb to 75+ GB of RAM consumed before it finally finished. Hoping to get a bit more detail for this issue, I attempted to get a memory profile on a re-run:

$ SUPER_VAM=1 super -memprofile=mem.pprof -c "SELECT count()
FROM 'gha.csup'
WHERE grep('in case you have any feedback 😊', payload.pull_request.body)"

However, at the very end, after it showed the correct result, a Killed: 9 appeared and my mem.pprof was zero length, so I'm not sure what to make of that.
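
Since the memory profile came out empty, one rough alternative for getting a number is to ask the OS for the child's peak resident set size after it exits. The following is only a sketch, not something from the super tooling: the helper name `run_and_report_peak_rss` is made up for illustration, and it relies on Python's standard `resource` module (Unix only). Note that peak RSS counts resident pages only, so on a machine that swaps or compresses memory it will understate the footprint that Activity Monitor reports.

```python
import os
import resource
import subprocess
import sys


def run_and_report_peak_rss(argv, extra_env=None):
    """Run argv to completion; return (exit_code, peak_rss_bytes).

    Hypothetical helper for illustration. The peak is the largest
    resident set size among child processes this script has waited for.
    """
    env = dict(os.environ)
    if extra_env:
        env.update(extra_env)
    proc = subprocess.run(argv, env=env)
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    scale = 1 if sys.platform == "darwin" else 1024
    return proc.returncode, usage.ru_maxrss * scale


# Example invocation (assumes `super` is on PATH and gha.csup is local):
#   query = ("SELECT count() FROM 'gha.csup' WHERE "
#            "grep('in case you have any feedback 😊', payload.pull_request.body)")
#   code, peak = run_and_report_peak_rss(["super", "-c", query],
#                                        extra_env={"SUPER_VAM": "1"})
#   print(f"exit={code} peak_rss={peak / 2**30:.1f} GiB")
```

A negative exit code from `subprocess.run` means the child was terminated by a signal (for example, -9 for SIGKILL), which would at least distinguish an OOM-style kill from a normal exit.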

philrz commented Dec 28, 2024

The results in #5552 also show that the same test was able to run successfully with vector runtime within the 32 GB memory footprint if Parquet format was used instead of CSUP.
