[Data] Always launch one task for read_sql #48923

Merged: bveeramani merged 14 commits into master from fix-read-sql on Dec 3, 2024

bveeramani (Member) commented Nov 25, 2024

Why are these changes needed?

Each read_sql read task attempts to read a different chunk of data using OFFSET and LIMIT filters. For example, if you have a database with 200 rows, read_sql might launch two tasks that read offset 0 limit 100 and offset 100 limit 100 to read rows 0-99 and 100-199, respectively.

However, if the underlying database doesn’t have a deterministic ordering, read tasks might read duplicate data.

To fix this correctness issue, this PR makes read_sql always launch one task. Since offset typically requires scanning and discarding rows, this PR's code should perform similarly to the original implementation.
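
As an illustration of the chunking scheme described above, here is a minimal sketch that uses Python's built-in sqlite3 module instead of Ray. The `items` table and the `chunk_queries` helper are hypothetical and are not taken from Ray's implementation; the sketch only shows how per-task OFFSET/LIMIT queries could be formed.

```python
import sqlite3


def chunk_queries(sql, num_rows, num_tasks):
    """Build one OFFSET/LIMIT query per task (hypothetical helper, not Ray's code)."""
    chunk = max(1, -(-num_rows // num_tasks))  # ceiling division, at least 1
    return [
        f"SELECT * FROM ({sql}) LIMIT {chunk} OFFSET {offset}"
        for offset in range(0, num_rows, chunk)
    ]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, value TEXT)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)", [(i, f"row-{i}") for i in range(200)]
)

queries = chunk_queries("SELECT * FROM items", num_rows=200, num_tasks=2)

# Each "task" runs its own query. Nothing in these queries fixes the row
# order, so a database is free to order the rows differently for each query.
chunks = [conn.execute(q).fetchall() for q in queries]
rows_read = [row for chunk in chunks for row in chunk]
```

Because neither query has an ORDER BY, the two result sets are only guaranteed to be disjoint if the database happens to return rows in the same order both times, which is the correctness problem this PR addresses.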

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Balaji Veeramani <[email protected]>

alexeykudinkin (Contributor) commented:

@bveeramani

  • It might be worthwhile to preserve parallel fetching for very large datasets.
  • We can still preserve parallelism if we ask the user for a column we can naturally order by (and use offsets).
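
One way the suggestion above could look is sketched below. Note that this sketch swaps in range predicates on the ordering column rather than offsets, so each task's rows are fixed by the predicate instead of by row position. It assumes a non-null integer key column; the `items` table and the `range_queries` helper are hypothetical, and it uses sqlite3 rather than Ray.

```python
import sqlite3


def range_queries(table, key, lo, hi, num_tasks):
    """Split [lo, hi] on `key` into disjoint ranges, one query per task.

    Hypothetical helper: `table` and `key` are assumed to be trusted
    identifiers, and `key` is assumed to be a non-null integer column.
    """
    step = max(1, -(-(hi - lo + 1) // num_tasks))  # ceiling division
    return [
        (
            f"SELECT * FROM {table} WHERE {key} >= ? AND {key} < ? ORDER BY {key}",
            (start, min(start + step, hi + 1)),
        )
        for start in range(lo, hi + 1, step)
    ]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, value TEXT)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)", [(i, f"row-{i}") for i in range(200)]
)

lo, hi = conn.execute("SELECT MIN(id), MAX(id) FROM items").fetchone()
tasks = range_queries("items", "id", lo, hi, num_tasks=4)

# Run the per-task queries one after another here; in a parallel reader each
# (sql, params) pair would be handed to a separate task.
ids = []
for sql, params in tasks:
    ids.extend(row[0] for row in conn.execute(sql, params))
```

Every row satisfies exactly one range predicate, so the tasks cannot overlap regardless of how the database orders its results. The trade-off is that the user has to name a suitable column, and skewed key distributions produce uneven chunks.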

python/ray/data/read_api.py (outdated)
warnings.warn(
    "To ensure correctness, 'read_sql' always launches one task. The "
    "'parallelism' argument you specified will be ignored."
)
Contributor

Maybe just raise an error; a warning is too implicit.
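
To compare the two behaviors under discussion, here is a minimal sketch. The `resolve_parallelism` helper and its `strict` flag are hypothetical and do not appear in this PR, and treating `-1` as "not specified by the user" is an assumption made for illustration.

```python
import warnings


def resolve_parallelism(parallelism=-1, strict=False):
    """Return the number of read tasks to launch (always 1 in this sketch).

    Hypothetical helper. `-1` is assumed to mean "not specified by the user".
    """
    if parallelism != -1:
        if strict:
            # Reviewer's suggestion: fail loudly instead of ignoring the argument.
            raise ValueError(
                "To ensure correctness, 'read_sql' always launches one task. "
                "The 'parallelism' argument isn't supported."
            )
        # Behavior in the excerpt under review: warn and ignore the argument.
        warnings.warn(
            "To ensure correctness, 'read_sql' always launches one task. The "
            "'parallelism' argument you specified will be ignored."
        )
    return 1
```

With the warning, existing callers that pass the argument keep working; with the error, they break immediately but cannot miss the change.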

Contributor

To clarify: if the DB table is huge (1B rows or more), will this be a single-threaded ingest?

Member Author

Yeah. Many DBAPI implementations don't support multithreading.

Contributor

If I understand this right, with just 1 task we may end up with very slow ingest for DBs, and also OOM kills, while for files we are able to support parallel ingests in a scaled-out fashion.

Member Author

That's right.

What can we do as an alternative that's both scalable and correct? Many OFFSET implementations have to scan and discard all the skipped rows. So, OFFSET and LIMIT often perform the same as or worse than a single task that reads the entire database.

bveeramani enabled auto-merge (squash) December 2, 2024 19:08
github-actions bot added the "go" label (add ONLY when ready to merge, run all tests) Dec 2, 2024
bveeramani enabled auto-merge (squash) December 2, 2024 23:50
bveeramani merged commit 52f3e07 into master Dec 3, 2024
6 checks passed
bveeramani deleted the fix-read-sql branch December 3, 2024 00:58