You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
We want to reimplement the shuffle mechanism and have it use the Ray object store to store the shuffle output.
Here are some notes on what this may look like.
ShuffleWriterExec and ShuffleReaderExec should be re-implemented in Python rather than Rust (although they can still call Rust code where needed).
Planning
The reader needs to know how to find the output from the writer, so we perhaps need to add some extra data to the distributed plan, such as a UUID for each query stage.
Writer
The new ShuffleWriterExec should execute its child plan to get a stream of record batches and iterate over them and repartition them according to the shuffle partitioning schema.
The smaller repartitioned batches should be written to the object store (although we may want to buffer them in memory first until they are a certain size).
When all batches have been processed, we should be able to write a final object that contains references to the other objects.
Reader
The reader will need to find the final object stored by the writer so that it can then load the other batches, We probably need to remove the objects once they have been read.
The text was updated successfully, but these errors were encountered:
We want to reimplement the shuffle mechanism and have it use the Ray object store to store the shuffle output.
Here are some notes on what this may look like.
ShuffleWriterExec and ShuffleReaderExec should be re-implemented in Python rather than Rust (although they can still call Rust code where needed).
Planning
The reader needs to know how to find the output from the writer, so we perhaps need to add some extra data to the distributed plan, such as a UUID for each query stage.
Writer
The new ShuffleWriterExec should execute its child plan to get a stream of record batches and iterate over them and repartition them according to the shuffle partitioning schema.
The smaller repartitioned batches should be written to the object store (although we may want to buffer them in memory first until they are a certain size).
When all batches have been processed, we should be able to write a final object that contains references to the other objects.
Reader
The reader will need to find the final object stored by the writer so that it can then load the other batches, We probably need to remove the objects once they have been read.
The text was updated successfully, but these errors were encountered: