Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge the small read io #3072

Closed
WenyXu opened this issue Jan 2, 2024 · 6 comments · Fixed by #4520
Closed

Merge the small read io #3072

WenyXu opened this issue Jan 2, 2024 · 6 comments · Fixed by #4520
Assignees

Comments

@WenyXu
Copy link
Member

WenyXu commented Jan 2, 2024

          Too many small requests could result in an expensive bill; there is an optimization we can do in the future(maybe not in this PR) for object stores like s3. If the `ranges` are almost continuous, we can merge these ranges into a large chunk and fetch this chunk in the preferred size concurrently.

Originally posted by @WenyXu in #2959 (comment)

@L-Fiori
Copy link

L-Fiori commented Jan 7, 2024

Hello, I'm new to the project and I would like to contribute. Not sure if this is a good first issue though.

As I understand it, you're concurrently fetching data from Parquet files using a function that receives a vector of ranges, and your concern is that too many small requests can lead to expensive bills, so an optimization would be to merge ranges that are close together before fetching data, is that right? In this case, what would be a reasonable distance between the ranges? As I am not familiar with the kind of data that is being fetched.

@WenyXu
Copy link
Member Author

WenyXu commented Jan 7, 2024

Hello, I'm new to the project and I would like to contribute. Not sure if this is a good first issue though.

As I understand it, you're concurrently fetching data from Parquet files using a function that receives a vector of ranges, and your concern is that too many small requests can lead to expensive bills, so an optimization would be to merge ranges that are close together before fetching data, is that right? In this case, what would be a reasonable distance between the ranges? As I am not familiar with the kind of data that is being fetched.

Thanks @L-Fiori. My bad; my colleague @QuenKar is working on this. I forget to update this issue.

In this case, what would be a reasonable distance between the ranges?

This is key to this issue, and my colleague is doing some benchmarking to figure it out. We will use these benchmark results to select an optimized range distance(and the benchmark results may be posted in related PRs).

@L-Fiori
Copy link

L-Fiori commented Jan 7, 2024

Interesting! I'll stay tuned for other issues I might want to tackle, thanks for the reply ;)

@tisonkun
Copy link
Collaborator

@WenyXu This issue seems stale. Do we have other updates?

Or @L-Fiori maybe you can just take over this issue and submit a patch >_<

@WenyXu
Copy link
Member Author

WenyXu commented Apr 15, 2024

@WenyXu This issue seems stale. Do we have other updates?

We still send a lot of small IO requests.

@WenyXu WenyXu added the good first issue Good for newcomers label Jul 30, 2024
@WenyXu WenyXu removed the good first issue Good for newcomers label Aug 6, 2024
@WenyXu WenyXu self-assigned this Aug 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants