This issue was moved to a discussion.
You can continue the conversation there. Go to discussion →
Discussion: adopt datafusion-python crate #3144
Comments
@timsaucer thoughts on this?
My general recommendation is to keep the datafusion Rust repo as your primary Rust dependency and (if needed) to use datafusion-python as a Python dependency. But I suspect I don't fully understand the use case here. Can you give an example of something you're currently trying to do that is hampered by not using datafusion-python?
Some folks want to be able to configure the DataFusion SessionContext or register UDFs; neither is possible at the moment. But I also don't want to build this integration from the ground up, so depending on the datafusion-python crate would solve that.
My gut feeling is that our best bet is to expose record batch iterators, ideally for both log and data. Most modern engines and dataframe libraries should be able to consume these. The main challenge would be how to do predicate pushdown when the query would ideally only be issued at the "frontend" library. For some engines native support is also on the way, and hopefully that happens more :). When it comes to UDFs, I guess they only need to be available in the engine that does the downstream processing? But maybe we can expose some sort of Python table provider (factory) that integrates with DataFusion, hoping that this can be done without coupling the datafusion version with datafusion-python. I'd be curious to learn what the most requested config on the session is. Do we have some insights on this?
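To make the predicate pushdown concern concrete, here is a minimal stdlib-only sketch of a batch iterator that accepts a pushed-down range filter and uses per-file min/max statistics to skip files. Everything in it is hypothetical and for illustration only: `FileEntry`, `scan_batches`, and the `id_range` parameter are not part of delta-rs or datafusion-python, and plain lists of dicts stand in for Arrow record batches.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Optional, Tuple

Row = Dict[str, int]
Batch = List[Row]  # stand-in for an Arrow record batch


@dataclass
class FileEntry:
    """Hypothetical stand-in for a log entry carrying min/max stats for `id`."""
    path: str
    min_id: int
    max_id: int
    rows: List[Row]


def scan_batches(
    files: List[FileEntry],
    id_range: Optional[Tuple[int, int]] = None,
    batch_size: int = 2,
) -> Iterator[Batch]:
    """Yield batches lazily, applying an optional inclusive range filter on `id`."""
    for entry in files:
        rows = entry.rows
        if id_range is not None:
            lo, hi = id_range
            # File-level pruning: skip files whose stats cannot match the filter.
            if entry.max_id < lo or entry.min_id > hi:
                continue
            # Row-level filtering for files that may contain matches.
            rows = [r for r in rows if lo <= r["id"] <= hi]
        for start in range(0, len(rows), batch_size):
            yield rows[start:start + batch_size]


files = [
    FileEntry("a.parquet", 1, 3, [{"id": 1}, {"id": 2}, {"id": 3}]),
    FileEntry("b.parquet", 10, 12, [{"id": 10}, {"id": 11}, {"id": 12}]),
]
filtered = list(scan_batches(files, id_range=(2, 10)))
```

The point of the sketch is that the filter has to reach the iterator before any batch is produced. If the consuming library only learns the predicate after it has the stream, the file-level pruning step cannot happen.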
Yeah, I wasn't looking at the reader side, but rather at operations from py->rust. But I agree, datafusion-python (the Python library, not the crate) already has native support. Polars also has somewhat native support; it's not the most optimal, but it's already tons faster than the PyArrow dataset.
Yeah, UDFs in this case are only to be used during our Delta operations that use DataFusion. Then each operation in Python can set the session context as well.
If some users want to modify the session context or register functions, why don't those users get at this via the datafusion Python package? In general, pulling in
@timsaucer They were asking for the functions to be executed within the delta-rs operations (MERGE, write, etc.).
Description
Use Case
We could adopt the datafusion-python crate for deltalake-python. A couple of benefits would be:
Pros:
Cons:
Related Issue(s)