[Data] Always launch one task for `read_sql` #48923

bveeramani · 2024-11-25T18:41:21Z

Why are these changes needed?

Each read_sql read task attempts to read a different chunk of data using OFFSET and LIMIT filters. For example, if you have a database with 200 rows, read_sql might launch two tasks that read offset 0 limit 100 and offset 100 limit 100 to read rows 0-100 and 100-200, respectively.

However, if the underlying database doesn’t have a deterministic ordering, read tasks might read duplicate data.

To fix this correctness issue, this PR makes read_sql always launch one task. Since offset typically requires scanning and discarding rows, this PR's code should perform similarly to the original implementation.
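For illustration only, the chunking described above can be sketched as follows. This is not Ray's implementation: `shard_ranges`, `shard_queries`, and the `movie` table are hypothetical names, and the `AS t` subquery alias is an assumption about what the target database accepts.

```python
from typing import List, Tuple


def shard_ranges(num_rows: int, num_tasks: int) -> List[Tuple[int, int]]:
    """Split num_rows into (offset, limit) pairs, one per read task."""
    base, extra = divmod(num_rows, num_tasks)
    ranges = []
    offset = 0
    for i in range(num_tasks):
        # Spread any remainder over the first `extra` tasks.
        limit = base + (1 if i < extra else 0)
        ranges.append((offset, limit))
        offset += limit
    return ranges


def shard_queries(sql: str, num_rows: int, num_tasks: int) -> List[str]:
    """Wrap a user query so that each task reads only its own slice."""
    return [
        f"SELECT * FROM ({sql}) AS t LIMIT {limit} OFFSET {offset}"
        for offset, limit in shard_ranges(num_rows, num_tasks)
    ]


# Hypothetical 200-row table read by two tasks.
queries = shard_queries("SELECT * FROM movie", num_rows=200, num_tasks=2)
for query in queries:
    print(query)
```

Neither generated query contains an ORDER BY clause, so nothing forces the database to return rows in the same order for both tasks. That is the situation in which the two slices can overlap.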

Related issue number

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

Signed-off-by: Balaji Veeramani <[email protected]>

alexeykudinkin · 2024-11-25T19:55:30Z

@bveeramani

Might be worthwhile to preserve this functionality of parallel fetching for very large datasets
We can still preserve parallelism if we'd ask the user for the column we can naturally order by (and use offsets)
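A minimal sketch of that idea, using Python's built-in `sqlite3` module rather than Ray: the `movie` table, its columns, and the `read_chunk` helper are hypothetical, and the two chunks are read one after another on a single connection instead of in separate tasks. It assumes the ordering column is unique, since ties would make the order ambiguous again.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movie (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO movie (id, title) VALUES (?, ?)",
    [(i, f"title-{i}") for i in range(200)],
)


def read_chunk(conn: sqlite3.Connection, order_by: str, offset: int, limit: int):
    """Read one slice of the table under an explicit, user-chosen ordering."""
    # order_by is interpolated into the SQL text, so it must be a trusted
    # column name, never raw user input.
    query = f"SELECT id, title FROM movie ORDER BY {order_by} LIMIT ? OFFSET ?"
    return conn.execute(query, (limit, offset)).fetchall()


# Two chunks of 100 rows each, ordered by the unique "id" column.
chunks = [read_chunk(conn, "id", offset, 100) for offset in (0, 100)]
ids = [row[0] for chunk in chunks for row in chunk]
```

Because every chunk applies the same ORDER BY on a unique column, the slices are disjoint and together cover the whole table, as long as the table does not change between reads.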

raulchen · 2024-11-26T06:25:27Z

python/ray/data/read_api.py

+        warnings.warn(
+            "To ensure correctness, 'read_sql' always launches one task. The "
+            "'parallelism' argument you specified will be ignored."
+        )


Maybe just raise an error. A warning is implicit.
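To make the two options concrete, here is a small sketch that contrasts warning with raising. `check_parallelism` and its `strict` flag are hypothetical and are not part of Ray's API; the sketch also assumes that `None` means the caller did not set `parallelism`.

```python
import warnings
from typing import Optional


def check_parallelism(parallelism: Optional[int], strict: bool = False) -> int:
    """Return the number of read tasks to launch (always 1 in this sketch)."""
    if parallelism is not None and parallelism != 1:
        message = (
            "To ensure correctness, 'read_sql' always launches one task. "
            f"Got parallelism={parallelism}."
        )
        if strict:
            # Explicit: the caller must fix the call site.
            raise ValueError(message)
        # Implicit: execution continues and the message is easy to miss.
        warnings.warn(message)
    return 1
```

With `strict=False` the call still succeeds and the requested value is silently replaced by 1; with `strict=True` the mismatch surfaces immediately as an exception.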

srinathk10 · 2024-12-02

To understand: if the DB table is huge (1B rows or more), will this be a single-threaded ingest?

bveeramani · 2024-12-02

Yeah. Many DBAPI implementations don't support multithreading.

srinathk10 · 2024-12-02

If I understand this right, we may end up with very slow ingest with just 1 task for DBs, and also OOM kills, while for files we are able to support parallel ingest in a scaled-out fashion.

bveeramani · 2024-12-02

That's right.

What do we do as an alternative that's both scalable and correct? Many OFFSET implementations require scanning the entire database. So, OFFSET and LIMIT often perform the same or worse than a single task that reads the entire database.
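A rough back-of-the-envelope sketch of that argument, under the assumption that a query with OFFSET k scans and discards k rows before returning anything. Real databases vary, and `rows_scanned_per_task` is a hypothetical helper for this cost model only.

```python
from typing import List


def rows_scanned_per_task(num_rows: int, num_tasks: int) -> List[int]:
    """Rows each task touches if OFFSET k means scanning and discarding k rows."""
    if num_rows % num_tasks != 0:
        raise ValueError("this sketch assumes num_rows divides evenly")
    chunk = num_rows // num_tasks
    # A task at offset k scans k discarded rows plus its own chunk.
    return [offset + chunk for offset in range(0, num_rows, chunk)]


per_task = rows_scanned_per_task(200, 2)
slowest = max(per_task)  # rows scanned by the last task
total = sum(per_task)    # rows scanned across all tasks
```

For the 200-row, two-task example, the last task scans all 200 rows under this model, the same number a single task would scan, while the two tasks together scan 300 rows.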

bveeramani added 3 commits November 22, 2024 15:45

Initial commit

cb830b0

Signed-off-by: Balaji Veeramani <[email protected]>

Merge branch 'master' of https://github.com/ray-project/ray

1d9dd2e

Signed-off-by: Balaji Veeramani <[email protected]>

Initial commit

c01297b

Signed-off-by: Balaji Veeramani <[email protected]>

bveeramani requested review from scottjlee, raulchen, stephanie-wang, omatthew98, alexeykudinkin and srinathk10 as code owners November 25, 2024 18:41

bveeramani assigned raulchen Nov 25, 2024

Fix typo

1b4630c

Signed-off-by: Balaji Veeramani <[email protected]>

Merge branch 'master' of https://github.com/ray-project/ray

bed2558

Signed-off-by: Balaji Veeramani <[email protected]>

raulchen approved these changes Nov 26, 2024

bveeramani added 5 commits December 1, 2024 19:20

Merge branch 'master' of https://github.com/ray-project/ray

c40b06d

Signed-off-by: Balaji Veeramani <[email protected]>

Address review comments

4731063

Signed-off-by: Balaji Veeramani <[email protected]>

Fix test

d80b12a

Signed-off-by: Balaji Veeramani <[email protected]>

Fix test

a709fe4

Signed-off-by: Balaji Veeramani <[email protected]>

Fix test

f5f10f5

Signed-off-by: Balaji Veeramani <[email protected]>

bveeramani enabled auto-merge (squash) December 2, 2024 19:08

github-actions bot added the `go` label (add ONLY when ready to merge, run all tests) Dec 2, 2024

Fix bug

9593bc6

Signed-off-by: Balaji Veeramani <[email protected]>

github-actions bot disabled auto-merge December 2, 2024 20:49

bveeramani added 3 commits December 2, 2024 14:12

Merge branch 'master' of https://github.com/ray-project/ray

49a8d71

Signed-off-by: Balaji Veeramani <[email protected]>

Merge branch 'master' into fix-read-sql

7d8aae2

Signed-off-by: Balaji Veeramani <[email protected]>

Fix typo

1f247b3

Signed-off-by: Balaji Veeramani <[email protected]>

bveeramani enabled auto-merge (squash) December 2, 2024 23:50

bveeramani merged commit 52f3e07 into master Dec 3, 2024
6 checks passed

bveeramani deleted the fix-read-sql branch December 3, 2024 00:58
