Add support for third party table providers #41

Open
ccciudatu opened this issue Nov 3, 2024 · 0 comments
ccciudatu commented Nov 3, 2024

DataFusion Ray can be extended to leverage datasource integrations that are not built into DataFusion itself, but could be brought in as features, such as:

However, adding a custom table provider to a distributed DataFusion engine requires two things:

  1. registering the corresponding TableProviderFactory with the DataFusion SessionContext
  2. for integrations that define custom ExecutionPlan nodes, registering the corresponding PhysicalExtensionCodecs

The current code does not allow for such extensions, because:

  1. the DataFusion SessionContext (which is wrapped in a PySessionContext) is created outside the datafusion-ray library and can only be used through its Python interface (i.e. by invoking named methods on PyAny flavours of the session context and execution plan)
  2. the only extension codec used for plan serialization is the ShuffleCodec, which only handles the shuffle read/write nodes of datafusion-ray itself
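
To make the first limitation more concrete, here is a minimal std-only sketch. It is a loose analogy rather than the actual PyO3 mechanics: `NativeContext`, `ForeignContext`, `try_register_factory` and the `"EXAMPLE"` format name are all hypothetical stand-ins, and `std::any::Any` plays the role of the opaque Python object. The point is only that Rust-side customization is possible when a checked downcast to the type we compiled succeeds, and has to be skipped otherwise.

```rust
use std::any::Any;

/// Hypothetical stand-in for the Rust-side context type compiled into this crate.
#[derive(Debug, Default)]
struct NativeContext {
    table_factories: Vec<String>,
}

/// Hypothetical stand-in for a context object created by some other module.
#[derive(Debug, Default)]
struct ForeignContext;

/// Tries to register a table factory on an opaque context.
/// Returns true only if the checked downcast to `NativeContext` succeeds;
/// a foreign context is left untouched instead of being cast unsafely.
fn try_register_factory(ctx: &mut dyn Any, format: &str) -> bool {
    match ctx.downcast_mut::<NativeContext>() {
        Some(native) => {
            native.table_factories.push(format.to_string());
            true
        }
        None => false,
    }
}
```

Calling `try_register_factory` with a `NativeContext` records the factory name, while calling it with a `ForeignContext` returns `false` and changes nothing.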

The solution that we came up with for addressing the above limitations in our fork involves the following changes:

  1. Add (back) the Rust dependency on datafusion-python and provide a Python function that creates a PySessionContext from within datafusion_ray itself (which can be customized with the enabled table factories and whatnot, and can also be downcast so we can use the Rust reference directly).
    The "external" Python datafusion.SessionContext() will continue to be supported; it will just have no support for any "extensions" that datafusion-ray was compiled with (since we don't want to attempt unsafe downcasts).
  2. Add an Extension trait (not necessarily limited to table providers) that gets implemented by each such "extension" in order to:
    a. customize the SessionContext before it gets returned to python (e.g. registering table provider factories, catalog providers etc.)
    b. provide a list of physical extension codecs required for serializing its custom physical plan nodes, if any
  3. Create a composite Extensions singleton that prepares session contexts created by the new datafusion_ray.extended_session_context() Python function and also maintains a composite PhysicalExtensionCodec to serialize both the built-in shuffle nodes and any custom nodes handled by the codecs that the enabled extensions provide.
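
The shape of points 2 and 3 can be sketched roughly as follows. This is a std-only model, not the code from our fork and not the DataFusion API: `PlanNode`, `Codec`, `Context`, `ExampleExtension` and the one-byte tag format are hypothetical stand-ins for ExecutionPlan, PhysicalExtensionCodec, SessionContext and a real extension, and `Extensions` is a plain struct here rather than a singleton.

```rust
/// Hypothetical stand-in for a physical plan node.
#[derive(Debug, Clone, PartialEq)]
enum PlanNode {
    ShuffleRead { stage: u32 },
    Custom { kind: String, payload: Vec<u8> },
}

/// Hypothetical stand-in for a physical extension codec.
/// `try_encode` returns None for nodes the codec does not own.
trait Codec {
    fn try_encode(&self, node: &PlanNode) -> Option<Vec<u8>>;
    fn try_decode(&self, bytes: &[u8]) -> Option<PlanNode>;
}

/// Hypothetical stand-in for the session context.
#[derive(Default)]
struct Context {
    table_factories: Vec<String>,
}

/// An extension customizes the context and contributes its codecs.
trait Extension {
    fn init(&self, ctx: &mut Context);
    fn codecs(&self) -> Vec<Box<dyn Codec>>;
}

/// Models the built-in codec that only knows the shuffle nodes.
struct ShuffleCodec;
impl Codec for ShuffleCodec {
    fn try_encode(&self, node: &PlanNode) -> Option<Vec<u8>> {
        match node {
            PlanNode::ShuffleRead { stage } => Some(stage.to_le_bytes().to_vec()),
            _ => None,
        }
    }
    fn try_decode(&self, bytes: &[u8]) -> Option<PlanNode> {
        if bytes.len() != 4 {
            return None;
        }
        let mut buf = [0u8; 4];
        buf.copy_from_slice(bytes);
        Some(PlanNode::ShuffleRead { stage: u32::from_le_bytes(buf) })
    }
}

/// Tries each codec in order and prefixes the payload with the index of the
/// codec that accepted the node, so decoding can dispatch to the same codec.
/// The one-byte tag limits this sketch to 256 codecs.
struct CompositeCodec {
    codecs: Vec<Box<dyn Codec>>,
}
impl CompositeCodec {
    fn encode(&self, node: &PlanNode) -> Option<Vec<u8>> {
        self.codecs.iter().enumerate().find_map(|(i, c)| {
            let mut out = vec![i as u8];
            out.extend(c.try_encode(node)?);
            Some(out)
        })
    }
    fn decode(&self, bytes: &[u8]) -> Option<PlanNode> {
        let (tag, rest) = bytes.split_first()?;
        self.codecs.get(*tag as usize)?.try_decode(rest)
    }
}

/// Composite of all enabled extensions.
struct Extensions {
    enabled: Vec<Box<dyn Extension>>,
}
impl Extensions {
    /// Creates a context and lets every enabled extension customize it.
    fn new_context(&self) -> Context {
        let mut ctx = Context::default();
        for ext in &self.enabled {
            ext.init(&mut ctx);
        }
        ctx
    }
    /// Builds the composite codec: shuffle codec first, then extension codecs.
    fn codec(&self) -> CompositeCodec {
        let mut codecs: Vec<Box<dyn Codec>> = vec![Box::new(ShuffleCodec)];
        for ext in &self.enabled {
            codecs.extend(ext.codecs());
        }
        CompositeCodec { codecs }
    }
}

/// Hypothetical extension with one table factory and one custom node kind.
struct ExampleExtension;
struct ExampleCodec;
impl Codec for ExampleCodec {
    fn try_encode(&self, node: &PlanNode) -> Option<Vec<u8>> {
        match node {
            PlanNode::Custom { kind, payload } if kind.as_str() == "example" => {
                Some(payload.clone())
            }
            _ => None,
        }
    }
    fn try_decode(&self, bytes: &[u8]) -> Option<PlanNode> {
        Some(PlanNode::Custom { kind: "example".to_string(), payload: bytes.to_vec() })
    }
}
impl Extension for ExampleExtension {
    fn init(&self, ctx: &mut Context) {
        ctx.table_factories.push("EXAMPLE".to_string());
    }
    fn codecs(&self) -> Vec<Box<dyn Codec>> {
        vec![Box::new(ExampleCodec)]
    }
}
```

With `ExampleExtension` enabled, `new_context()` yields a context that already lists the `"EXAMPLE"` factory, and the composite codec round-trips both shuffle nodes and the extension's custom nodes, while a node that no codec recognizes fails to encode. Tagging the payload with the codec index is just one possible dispatch scheme; trying each codec in turn on decode would be another.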

I'd be glad to open a PR for contributing this, unless the feature request is out of scope or there are plans to address it differently.
