You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
DataFusion Ray can be extended to leverage datasource integrations that are not built into DataFusion itself, but could be brought in as features, such as:
However, adding a custom table provider to a distributed DataFusion engine requires two things:
registering the corresponding TableProviderFactory with the DataFusion SessionContext
for integrations that define custom ExecutionPlan nodes, registering the corresponding PhysicalExtensionCodecs
The current code does not allow for such extensions, because:
the datafusion SessionContext (which is wrapped in a PySessionContext) is created outside the datafusion-ray library and can only be used via its python interface (i.e. invoking named methods on PyAny flavours of session context and execution plan)
the only extension codec used for plan serialization is the ShuffleCodec, which only handles the shuffle read/write nodes of datafusion-ray itself
The solution that we came up with for addressing the above limitations in our fork involves the following changes:
Add (back) the rust dependency on datafusion-python and provide a python function that creates a PySessionContext from within datafusion_ray itself (which can be customized with the enabled table factories and whatnot and can also be downcast so we can use the rust reference directly).
The "external" python datafusion.SessionContext() will continue to be supported, it will just have no support for any "extensions" that datafusion-ray was compiled with (since we don't want to attempt unsafe downcasts).
Add an Extension trait (not necessarily limited to table providers) that gets implemented by each such "extension" in order to:
a. customize the SessionContext before it gets returned to python (e.g. registering table provider factories, catalog providers etc.)
b. provide a list of physical extension codecs required for serializing its custom physical plan nodes, if any
create a composite Extensions singleton that prepares session contexts created by the new datafusion_ray.extended_session_context() python function and also maintains a composite PhysicalExtensionCodec to serialize both the built-in shuffle nodes as well as any additional codecs provided by the enabled extensions.
I'd be glad to open a PR for contributing this, unless the feature request is out of scope or there are plans to address it differently.
The text was updated successfully, but these errors were encountered:
DataFusion Ray can be extended to leverage datasource integrations that are not built into DataFusion itself, but could be brought in as features, such as:
However, adding a custom table provider to a distributed DataFusion engine requires two things:
TableProviderFactory
with the DataFusionSessionContext
ExecutionPlan
nodes, registering the correspondingPhysicalExtensionCodec
sThe current code does not allow for such extensions, because:
SessionContext
(which is wrapped in aPySessionContext
) is created outside the datafusion-ray library and can only be used via its python interface (i.e. invoking named methods onPyAny
flavours of session context and execution plan)ShuffleCodec
, which only handles the shuffle read/write nodes of datafusion-ray itselfThe solution that we came up with for addressing the above limitations in our fork involves the following changes:
datafusion-python
and provide a python function that creates aPySessionContext
from withindatafusion_ray
itself (which can be customized with the enabled table factories and whatnot and can also be downcast so we can use the rust reference directly).The "external" python
datafusion.SessionContext()
will continue to be supported, it will just have no support for any "extensions" that datafusion-ray was compiled with (since we don't want to attempt unsafe downcasts).Extension
trait (not necessarily limited to table providers) that gets implemented by each such "extension" in order to:a. customize the
SessionContext
before it gets returned to python (e.g. registering table provider factories, catalog providers etc.)b. provide a list of physical extension codecs required for serializing its custom physical plan nodes, if any
Extensions
singleton that prepares session contexts created by the newdatafusion_ray.extended_session_context()
python function and also maintains a compositePhysicalExtensionCodec
to serialize both the built-in shuffle nodes as well as any additional codecs provided by the enabled extensions.I'd be glad to open a PR for contributing this, unless the feature request is out of scope or there are plans to address it differently.
The text was updated successfully, but these errors were encountered: