Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: remove DataFusion pyarrow feat #1000

Merged

Conversation

timsaucer
Copy link
Contributor

@timsaucer timsaucer commented Jan 20, 2025

Which issue does this PR close?

This addresses part of apache/datafusion#14197

Closes #971

Rationale for this change

By removing the pyarrow dependency of DataFusion we can update pyo3 in without requiring corresponding updates to the DataFusion core repository. This does add in a few additional pieces, such as adding a wrapper around ScalarValue, but it will simplify the core DataFusion repo to not have pyo3 in it.

What changes are included in this PR?

  • Removes pyarrow feature of DataFusion core repo
  • Adds PyScalarValue which is a simple wrapper on ScalarValue so we can do things like implement traits on it that are currently implemented upstream in DataFusion.
  • Renames DataFusionError to PyDataFusionError so there is not confusion with the enum defined upstream.

Are there any user-facing changes?

No user facing changes.

@kylebarron
Copy link
Contributor

By removing the pyarrow dependency of DataFusion we can update pyo3 in without requiring corresponding updates to the DataFusion core repository.

FWIW this is also one of the reasons why I created pyo3-arrow: https://docs.rs/pyo3-arrow/latest/pyo3_arrow/#why-not-use-arrow-rss-python-integration

Copy link

@matko matko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The majority of changes here are caused by no longer having a conversion directly from DataFusionError to PyErr, which means having to map to PyDataFusionError first. This touches a lot of places, but the change is straightforward and looks good to me.

My one comment is that the new PyScalarValue wrapper incurs clone overhead in conversions when you only have a ScalarValue available. This is the case in expr.rs, where a ScalarValue needs to be wrapped into a PyScalarValue just to get access to the to_pyarrow() function, which itself really only requires a borrow to the ScalarValue internally.

Therefore my one request is that this functionality should still be available on ScalarValue directly, and the clone removed.

src/expr.rs Outdated
@@ -356,7 +355,7 @@ impl PyExpr {
/// Extracts the Expr value into a PyObject that can be shared with Python
pub fn python_value(&self, py: Python) -> PyResult<PyObject> {
match &self.expr {
Expr::Literal(scalar_value) => Ok(scalar_value.to_pyarrow(py)?),
Expr::Literal(scalar_value) => Ok(PyScalarValue(scalar_value.clone()).to_pyarrow(py)?),
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is now a double clone, as to_pyarrow() is also cloning eventually. I see that this previously called into the pyarrow feature-enabled code, and this now got replaced by a version that works on PyScalarValue instead. However, we could still have a version of this code that works directly on ScalarValue. See notes below at pyarrow_util.rs.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be good to have the conversion functions in this file also available directly on a ScalarValue, which is how it was provided by the pyarrow feature. This to prevent having to clone ScalarValues, just to wrap them in a PyScalarValue to get access to these functions.

I understand we can't implement traits like ToPyArrow directly on ScalarValue as it is not in this crate, but we can leave it as an ordinary utility function.

pub fn scalar_to_pyarrow(scalar: &ScalarValue) -> PyResult<PyObject> {
    ...
}

Then call it from ToPyArrow as a common code path.

src/config.rs Outdated
@@ -40,7 +42,7 @@ impl PyConfig {
#[staticmethod]
pub fn from_env() -> PyResult<Self> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of always calling map_err to a PyDataFusionError and then coercing to PyResult, you can just return a PyDataFusionResult (i.e. Result<T, PyDataFusionError>). You just need to implement From<PyDataFusionError> for PyErr. https://github.com/developmentseed/obstore/blob/25f51361d8bf0634e8fe851b76d086ab0d23d0af/pyo3-object_store/src/error.rs#L102

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent suggestion! Applied across the entire branch.

@timsaucer timsaucer force-pushed the feat/remove-datafusion-pyarrow-feat branch from 53ddec3 to 700e8c5 Compare February 1, 2025 12:54
@timsaucer timsaucer merged commit 8b51390 into apache:main Feb 1, 2025
15 checks passed
@timsaucer timsaucer deleted the feat/remove-datafusion-pyarrow-feat branch February 1, 2025 13:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add instructions to reduce rebuild time
3 participants