-
Notifications
You must be signed in to change notification settings - Fork 90
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: remove DataFusion pyarrow feat #1000
feat: remove DataFusion pyarrow feat #1000
Conversation
FWIW this is also one of the reasons why I created pyo3-arrow: https://docs.rs/pyo3-arrow/latest/pyo3_arrow/#why-not-use-arrow-rss-python-integration |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The majority of changes here are caused by no longer having a conversion directly from DataFusionError
to PyErr
, which means having to map to PyDataFusionError
first. This touches a lot of places, but the change is straightforward and looks good to me.
My one comment is that the new PyScalarValue
wrapper incurs clone overhead in conversions when you only have a ScalarValue
available. This is the case in expr.rs, where a ScalarValue
needs to be wrapped into a PyScalarValue
just to get access to the to_pyarrow()
function, which itself really only requires a borrow to the ScalarValue
internally.
Therefore my one request is that this functionality should still be available on ScalarValue
directly, and the clone removed.
src/expr.rs
Outdated
@@ -356,7 +355,7 @@ impl PyExpr { | |||
/// Extracts the Expr value into a PyObject that can be shared with Python | |||
pub fn python_value(&self, py: Python) -> PyResult<PyObject> { | |||
match &self.expr { | |||
Expr::Literal(scalar_value) => Ok(scalar_value.to_pyarrow(py)?), | |||
Expr::Literal(scalar_value) => Ok(PyScalarValue(scalar_value.clone()).to_pyarrow(py)?), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is now a double clone, as to_pyarrow()
is also cloning eventually. I see that this previously called into the pyarrow feature-enabled code, and this now got replaced by a version that works on PyScalarValue
instead. However, we could still have a version of this code that works directly on ScalarValue
. See notes below at pyarrow_util.rs.
src/pyarrow_util.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it would be good to have the conversion functions in this file also available directly on a ScalarValue
, which is how it was provided by the pyarrow feature. This to prevent having to clone ScalarValue
s, just to wrap them in a PyScalarValue
to get access to these functions.
I understand we can't implement traits like ToPyArrow
directly on ScalarValue
as it is not in this crate, but we can leave it as an ordinary utility function.
pub fn scalar_to_pyarrow(scalar: &ScalarValue) -> PyResult<PyObject> {
...
}
Then call it from ToPyArrow
as a common code path.
src/config.rs
Outdated
@@ -40,7 +42,7 @@ impl PyConfig { | |||
#[staticmethod] | |||
pub fn from_env() -> PyResult<Self> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of always calling map_err
to a PyDataFusionError
and then coercing to PyResult
, you can just return a PyDataFusionResult
(i.e. Result<T, PyDataFusionError>
). You just need to implement From<PyDataFusionError> for PyErr
. https://github.com/developmentseed/obstore/blob/25f51361d8bf0634e8fe851b76d086ab0d23d0af/pyo3-object_store/src/error.rs#L102
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent suggestion! Applied across the entire branch.
…rename DataFusionError to PyDataFusionError to be less confusing
…ython. Also removed the rust unit tests copied over from upstream repo that were failing due to apache#941 in pyo3
53ddec3
to
700e8c5
Compare
Which issue does this PR close?
This addresses part of apache/datafusion#14197
Closes #971
Rationale for this change
By removing the
pyarrow
dependency of DataFusion we can updatepyo3
in without requiring corresponding updates to the DataFusion core repository. This does add in a few additional pieces, such as adding a wrapper aroundScalarValue
, but it will simplify the core DataFusion repo to not have pyo3 in it.What changes are included in this PR?
pyarrow
feature of DataFusion core repoPyScalarValue
which is a simple wrapper onScalarValue
so we can do things like implement traits on it that are currently implemented upstream in DataFusion.DataFusionError
toPyDataFusionError
so there is not confusion with the enum defined upstream.Are there any user-facing changes?
No user facing changes.