This issue was moved to a discussion.

Discussion: adopt datafusion-python crate #3144

Closed
ion-elgreco opened this issue Jan 19, 2025 · 7 comments
Labels
enhancement New feature or request

Comments

@ion-elgreco
Collaborator

ion-elgreco commented Jan 19, 2025

Description

Use Case
We could adopt the datafusion-python crate for deltalake-python. A couple of benefits and one drawback:

Pros:

  • Makes the SessionContext configurable from Python
  • Allows UDFs to be registered and used during operations
  • Potentially allows plans to be returned as a dry run and explained before running?

Cons:

  • datafusion-python updates at a slower rate than datafusion, which could hold us back and make the already complicated update process even more complicated

Related Issue(s)

@ion-elgreco ion-elgreco added the enhancement New feature or request label Jan 19, 2025
@ion-elgreco
Collaborator Author

@timsaucer thoughts on this?

@timsaucer
Contributor

My general recommendation is to keep the datafusion Rust repo as your primary Rust dependency and (if needed) to use datafusion-python as a Python dependency. But I suspect I don't fully understand the use case of what you're trying to do here.

Can you give an example of what you're currently trying to do that is hampered by not using datafusion-python?

@ion-elgreco
Collaborator Author

Some folks want to be able to configure the DataFusion SessionContext or register UDFs; neither is possible at the moment. But I also don't want to build this integration from the ground up, so depending on the datafusion-python crate would solve that.

@roeap
Collaborator

roeap commented Jan 19, 2025

My gut feeling would be that our best bet is to expose record batch iterators, ideally for both log and data. Most modern engines / dataframe libraries should be able to consume these. The main challenge would be how to do predicate pushdown when the query would ideally only be issued at the "frontend" library.

For some engines native support is also on the way, and hopefully that happens more :).

When it comes to UDFs, I guess they only need to be available in the engine that does the downstream processing? But maybe we can expose some sort of Python table provider (factory) that integrates with datafusion, hoping that this can be done w/o coupling the datafusion version with datafusion-python.

I'd be curious to learn, what the most requested config on the session is - do we have some insights on this?
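
As a rough illustration of the record-batch-iterator idea and the pushdown challenge mentioned above, here is a minimal pure-Python sketch. Everything in it (`FileEntry`, `scan_batches`, the min/max statistics layout) is hypothetical and is not part of delta-rs, datafusion, or any engine. It only shows how a frontend could hand a simple range predicate to the producer so that whole files are skipped before any rows are yielded.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional, Tuple

Row = dict  # a row is a plain dict in this sketch; real code would use Arrow record batches


@dataclass
class FileEntry:
    """Hypothetical stand-in for a data file plus its min/max stats from the log."""
    path: str
    min_id: int
    max_id: int
    rows: List[Row]


def scan_batches(
    files: List[FileEntry],
    id_range: Optional[Tuple[int, int]] = None,
    batch_size: int = 2,
) -> Iterator[List[Row]]:
    """Yield batches of rows, skipping files whose stats cannot match id_range."""
    for f in files:
        rows = f.rows
        if id_range is not None:
            lo, hi = id_range
            # File-level pruning: [min_id, max_id] does not overlap [lo, hi].
            if f.max_id < lo or f.min_id > hi:
                continue
            # Row-level filter, since stats only prove that a file *may* match.
            rows = [r for r in rows if lo <= r["id"] <= hi]
        for start in range(0, len(rows), batch_size):
            yield rows[start:start + batch_size]


files = [
    FileEntry("part-0", 1, 3, [{"id": 1}, {"id": 2}, {"id": 3}]),
    FileEntry("part-1", 4, 6, [{"id": 4}, {"id": 5}, {"id": 6}]),
]
# The "frontend" passes its predicate down instead of filtering after the scan.
batches = list(scan_batches(files, id_range=(5, 10)))
```

In this toy setup `part-0` is never read, which is the benefit a consuming engine would lose if the iterator could not accept a predicate.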

@ion-elgreco
Collaborator Author

> My gut feeling would be that our best bet is to expose record batch iterators, ideally for both log and data. Most modern engines / dataframe libraries should be able to consume these. The main challenge would be how to do predicate pushdown when the query would ideally only be issued at the "frontend" library.
>
> For some engines native support is also on the way, and hopefully that happens more :).

Yeah, I wasn't looking at the reader side, but rather at operations from py->rust. But I agree, datafusion-python (the Python library, not the crate) already has native support. Polars also has somewhat native support; it's not optimal, but it's already tons faster than a PyArrow dataset.

> When it comes to UDFs, I guess they only need to be available in the engine that does the downstream processing? But maybe we can expose some sort of Python table provider (factory) that integrates with datafusion, hoping that this can be done w/o coupling the datafusion version with datafusion-python.
>
> I'd be curious to learn, what the most requested config on the session is - do we have some insights on this?

Yeah, UDFs in this case are only to be used during our Delta operations that use DataFusion. Each operation in Python could then set the session context as well.
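
To make the request concrete, here is a minimal pure-Python sketch of what "an operation that accepts a session and uses registered UDFs" could look like from the caller's side. The names `SessionOptions`, `register_udf`, and `update_column` are hypothetical, invented for illustration only; they are not existing deltalake, delta-rs, or datafusion-python APIs, and the toy operation runs on plain dicts rather than on a Delta table.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SessionOptions:
    """Hypothetical per-operation settings; not an existing deltalake or DataFusion type."""
    target_partitions: int = 1
    udfs: Dict[str, Callable[[object], object]] = field(default_factory=dict)

    def register_udf(self, name: str, func: Callable[[object], object]) -> None:
        # Store the function under a name so an operation can look it up later.
        self.udfs[name] = func


def update_column(
    rows: List[dict], column: str, udf_name: str, session: SessionOptions
) -> List[dict]:
    """Toy stand-in for an operation (e.g. an UPDATE) that applies a registered UDF."""
    if udf_name not in session.udfs:
        raise KeyError(f"UDF {udf_name!r} is not registered in this session")
    func = session.udfs[udf_name]
    # Return new rows with the UDF applied to the chosen column.
    return [{**row, column: func(row[column])} for row in rows]


session = SessionOptions(target_partitions=4)
session.register_udf("normalize", lambda s: s.strip().lower())
result = update_column(
    [{"name": "  Alice "}, {"name": "BOB"}], "name", "normalize", session
)
```

The point of the sketch is the shape of the call: the session (configuration plus UDFs) is built once in Python and passed into each operation, rather than being fixed inside the Rust side.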

@timsaucer
Contributor

If some users want to modify the session context or register functions, why don't those users get at this via the datafusion Python package?

In general, I think pulling in the datafusion-python Rust crate will cause you to re-export a lot of code and will not be performant.

@ion-elgreco
Collaborator Author

@timsaucer They were asking for the functions to be executed within the delta-rs operations (MERGE, write, etc.)

@delta-io delta-io locked and limited conversation to collaborators Jan 24, 2025
@rtyler rtyler converted this issue into discussion #3156 Jan 24, 2025
