-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Data] Always launch one task for read_sql
#48923
Conversation
Signed-off-by: Balaji Veeramani <[email protected]>
Signed-off-by: Balaji Veeramani <[email protected]>
Signed-off-by: Balaji Veeramani <[email protected]>
Signed-off-by: Balaji Veeramani <[email protected]>
|
Signed-off-by: Balaji Veeramani <[email protected]>
warnings.warn( | ||
"To ensure correctness, 'read_sql' always launches one task. The " | ||
"'parallelism' argument you specified will be ignored." | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe just raise an error. warning is implicit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To understand, if DB table is huge (1B rows or more), will this be single threaded ingest?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. Many DBAPI implementations don't support multithreading
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand this right, we may end up with very slow ingest with just 1 task for DBs and also OOM kills. While for files, we are able to do support parallel ingests in a scaled out fashion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's right.
What do we do as an alternative that's both scalable and correct? Many OFFSET
implementations require scanning the entire database. So, OFFSET
and LIMIT
often perform the same or worse than a single task that reads the entire database.
Signed-off-by: Balaji Veeramani <[email protected]>
Signed-off-by: Balaji Veeramani <[email protected]>
Signed-off-by: Balaji Veeramani <[email protected]>
Signed-off-by: Balaji Veeramani <[email protected]>
Signed-off-by: Balaji Veeramani <[email protected]>
Signed-off-by: Balaji Veeramani <[email protected]>
Signed-off-by: Balaji Veeramani <[email protected]>
Signed-off-by: Balaji Veeramani <[email protected]>
Signed-off-by: Balaji Veeramani <[email protected]>
Why are these changes needed?
Each
read_sql
read tasks attempts to read a different chunk of data using the offset and limit filters. For example, if you have a database with 200 rows,read_sql
might launch two tasks that readsoffset 0 limit 100
andoffset 100 limit 100
to read rows 0-100 and 100-200, respectively.However, if the underlying database doesn’t have a deterministic ordering, read tasks might read duplicate data.
To fix this correctness issue, this PR makes
read_sql
always launch one task. Sinceoffset
typically requires scanning and discarding rows, this PR's code should perform similarly to the original implementation.Related issue number
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.