Add trait based ScalarUDF API #8578

alamb · 2023-12-18T16:47:30Z

Which issue does this PR close?

Closes #8568

Rationale for this change

This PR is a step towards #8045:

I want to make it easier to extend DataFusion's function packages (so that we can support many more different implementations).
Splitting out functions into packages (e.g. datafusion-functions) needs an API that is easier to implement.
We need a place to put more advanced ScalarUDF features (such as "Specialized / Pre-compiled / Prepared ScalarUDFs" #8051 and "add examples and description to scalar/aggregate functions?" #8366), and the current lower level API is very hard to extend in a backwards compatible way.

What changes are included in this PR?

Introduce a trait based API for scalar functions (ScalarUDFImpl -- better names welcomed)
Add an example of how to use the trait based API for more advanced implementations (advanced_udf.rs)

If this PR is accepted, I plan to file tickets to track

Clean up internal implementation of ScalarUDF (e.g. make it use the trait based API rather than the current function pointers)
Add a similar trait for AggregateUDF and WindowUDF, for the same reasons

Are these changes tested?

Yes, both new tests as well as updated existing tests

Are there any user-facing changes?

There is a new way to define ScalarUDFs and additional documentation.
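
As a rough illustration of what "a new way to define ScalarUDFs" looks like, here is a minimal self-contained sketch. It uses simplified stand-in types (`DataType`, `ColumnarValue`, `Signature`) and a cut-down `ScalarUDFImpl` rather than the real `datafusion_expr` definitions; the exact method set and the `AddOne` function are assumptions made for illustration only.

```rust
// Stand-in types: simplified placeholders, NOT the real datafusion_expr types.
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Int64,
    Float64,
}

#[derive(Debug, Clone, PartialEq)]
pub enum ColumnarValue {
    /// A single constant value
    Scalar(Option<i64>),
    /// One value per row
    Array(Vec<Option<i64>>),
}

#[derive(Debug, Clone, PartialEq)]
pub struct Signature {
    pub arg_types: Vec<DataType>,
}

/// Cut-down sketch of the trait a UDF author implements (assumed shape).
pub trait ScalarUDFImpl {
    fn name(&self) -> &str;
    fn signature(&self) -> &Signature;
    fn return_type(&self, arg_types: &[DataType]) -> Result<DataType, String>;
    fn invoke(&self, args: &[ColumnarValue]) -> Result<ColumnarValue, String>;
}

/// Hypothetical UDF that adds one to its single Int64 argument.
pub struct AddOne {
    signature: Signature,
}

impl AddOne {
    pub fn new() -> Self {
        Self {
            signature: Signature { arg_types: vec![DataType::Int64] },
        }
    }
}

impl ScalarUDFImpl for AddOne {
    fn name(&self) -> &str {
        "add_one"
    }

    fn signature(&self) -> &Signature {
        &self.signature
    }

    fn return_type(&self, arg_types: &[DataType]) -> Result<DataType, String> {
        match arg_types {
            [DataType::Int64] => Ok(DataType::Int64),
            other => Err(format!("add_one expects one Int64 argument, got {other:?}")),
        }
    }

    fn invoke(&self, args: &[ColumnarValue]) -> Result<ColumnarValue, String> {
        match args {
            // Constant input: compute a single value
            [ColumnarValue::Scalar(v)] => Ok(ColumnarValue::Scalar(v.map(|x| x + 1))),
            // Array input: compute one value per row, keeping nulls as nulls
            [ColumnarValue::Array(vs)] => Ok(ColumnarValue::Array(
                vs.iter().map(|v| v.map(|x| x + 1)).collect(),
            )),
            _ => Err("add_one expects exactly one argument".to_string()),
        }
    }
}
```

Everything about the function (name, signature, return type, evaluation) lives in one ordinary `impl` block, which is the property the description above is arguing for.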

alamb · 2023-12-18T18:34:01Z

datafusion-examples/examples/advanced_udf.rs

+use datafusion_expr::{ColumnarValue, ScalarUDF, ScalarUDFImpl, Signature};
+use std::sync::Arc;
+
+/// This example shows how to use the full ScalarUDFImpl API to implement a user


I wanted to create an example that shows how to make a more advanced UDF that special cases constant values.

This also shows how to create a ScalarUDF using a trait (rather than free functions and closures)
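
To make "special cases constant values" concrete, the following is a small sketch and not the code from advanced_udf.rs. It assumes a simplified stand-in `ColumnarValue` over `f64`, a hypothetical `pow` function, and a toy `unary` helper that mimics a vectorized kernel by applying an operation to every non-null row.

```rust
/// Stand-in for the real ColumnarValue: one constant, or one value per row.
#[derive(Debug, Clone, PartialEq)]
pub enum ColumnarValue {
    Scalar(Option<f64>),
    Array(Vec<Option<f64>>),
}

/// Toy "unary kernel": apply `op` to every non-null element, keep nulls.
pub fn unary(values: &[Option<f64>], op: impl Fn(f64) -> f64) -> Vec<Option<f64>> {
    values.iter().map(|v| v.map(&op)).collect()
}

/// Hypothetical `pow(base, exponent)` with fast paths for constant arguments.
pub fn pow(base: &ColumnarValue, exponent: &ColumnarValue) -> Result<ColumnarValue, String> {
    use ColumnarValue::{Array, Scalar};
    match (base, exponent) {
        // Both constant: compute once and return a constant
        (Scalar(b), Scalar(e)) => Ok(Scalar(b.zip(*e).map(|(b, e)| b.powf(e)))),
        // Constant exponent: no need to expand it into an array
        (Array(bs), Scalar(Some(e))) => Ok(Array(unary(bs, |b| b.powf(*e)))),
        // Constant NULL exponent: every row is NULL
        (Array(bs), Scalar(None)) => Ok(Array(vec![None; bs.len()])),
        // Constant base, per-row exponent
        (Scalar(Some(b)), Array(es)) => Ok(Array(unary(es, |e| b.powf(e)))),
        (Scalar(None), Array(es)) => Ok(Array(vec![None; es.len()])),
        // General case: both arguments vary per row
        (Array(bs), Array(es)) => {
            if bs.len() != es.len() {
                return Err(format!(
                    "length mismatch: {} bases vs {} exponents",
                    bs.len(),
                    es.len()
                ));
            }
            Ok(Array(
                bs.iter()
                    .zip(es)
                    .map(|(b, e)| b.zip(*e).map(|(b, e)| b.powf(e)))
                    .collect(),
            ))
        }
    }
}
```

The point of the constant-exponent arm is that the exponent is read once and captured by the closure, instead of being materialized as an array with one copy per row.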

alamb · 2023-12-18T18:36:00Z

datafusion/expr/src/expr.rs

-            &return_type,
-            &fun,
-        ));
+        struct TestScalarUDF {


This shows an example of the difference between the trait based API and the low level ScalarUDF::new API that I propose to deprecate

While the trait requires more lines, I think it is much easier to implement, as it is simply a standard trait implementation, which I believe is far more common than Arc'd closures

alamb · 2023-12-18T18:38:02Z

datafusion/expr/src/udf.rs

+    ///
+    /// See  [`ScalarUDFImpl`] for a more convenient way to create a
+    /// `ScalarUDF` using trait objects
+    #[deprecated(since = "34.0.0", note = "please implement ScalarUDFImpl instead")]


I think this low level API is quite awkward to use and very hard to extend in backwards compatible ways. The trait is easier to use and easier to extend.

Thus I propose marking this API as deprecated (note that most of the examples in the codebase use create_udf rather than ScalarUDF::new() directly, so I think the impact will be limited)

Agreed that the current low-level API looks awkward to use. Ideally, a trait defining what a UDF should implement would be a better solution.
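
One way to see the "easier to extend" argument: a capability added later can be a trait method with a default body, so implementations written earlier keep compiling unchanged, whereas adding a parameter to a constructor that takes closures changes its signature for every caller. The sketch below is an assumption-laden illustration with hypothetical names (`UdfImpl`, `Upper`, `Lower`, `all_names`); it is not the DataFusion definition.

```rust
/// Stand-in trait (hypothetical, not the real `ScalarUDFImpl`).
pub trait UdfImpl {
    fn name(&self) -> &str;
    fn invoke(&self, arg: &str) -> String;

    /// Imagine this was added in a later release. The default body means
    /// existing implementations do not need to change.
    fn aliases(&self) -> &[String] {
        &[]
    }
}

/// Written "before" `aliases` existed; still compiles as-is.
pub struct Upper;

impl UdfImpl for Upper {
    fn name(&self) -> &str {
        "upper"
    }
    fn invoke(&self, arg: &str) -> String {
        arg.to_uppercase()
    }
}

/// Opts in to the newer method by overriding the default.
pub struct Lower {
    aliases: Vec<String>,
}

impl Lower {
    pub fn new() -> Self {
        Self { aliases: vec!["lcase".to_string()] }
    }
}

impl UdfImpl for Lower {
    fn name(&self) -> &str {
        "lower"
    }
    fn invoke(&self, arg: &str) -> String {
        arg.to_lowercase()
    }
    fn aliases(&self) -> &[String] {
        &self.aliases
    }
}

/// Collect every name a function answers to: its name plus any aliases.
pub fn all_names(udf: &dyn UdfImpl) -> Vec<String> {
    let mut names = vec![udf.name().to_string()];
    names.extend(udf.aliases().iter().cloned());
    names
}
```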

alamb · 2023-12-18T18:38:34Z

datafusion/expr/src/udf.rs

+    where
+        F: ScalarUDFImpl + Send + Sync + 'static,
+    {
+        // TODO change the internal implementation to use the trait object


I plan to improve the internal representation as a follow on PR

#8713 is the follow on PR

alamb · 2023-12-18T18:41:04Z

datafusion/expr/src/udf.rs

+    }
+}
+
+/// Trait for implementing [`ScalarUDF`].


Here is the proposed new trait. I think we can use this trait to add things such as "pre-compiling" arguments (#8051) and better examples / documentation ("add examples and description to scalar/aggregate functions" #8366).

cc @universalmind303 for your comments

I think this looks much more intuitive than the current implementation. I actually just commented on an open issue about the api before reviewing this & my suggestion was nearly identical!

#8568 (comment)

alamb · 2023-12-18T18:53:27Z

cc @2010YOUY01, @thinkharderdev, @viirya and @andygrove -- in case you have comments about the proposed way of implementing ScalarUDF.

This PR doesn't make any API changes, but it does deprecate ScalarUDF::new()

viirya · 2023-12-18T19:54:09Z

datafusion/expr/src/udf.rs

+    /// Create a new `ScalarUDF` from a `[ScalarUDFImpl]` trait object
+    ///
+    /// Note this is the same as using the `From` impl (`ScalarUDF::from`)
+    pub fn new_from_trait<F>(fun: F) -> ScalarUDF


new_from_impl?
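
For context on the doc comment's note that the constructor is "the same as using the `From` impl", a sketch of how the two entry points can share one code path is shown here. It uses the name `new_from_impl` suggested above; the cut-down trait, the `Arc<dyn ...>` field, and the `Noop` type are assumptions for illustration (the diff in this PR notes the internal representation is still to be changed).

```rust
use std::sync::Arc;

/// Cut-down stand-in for the trait (assumed shape).
pub trait ScalarUDFImpl {
    fn name(&self) -> &str;
}

/// Stand-in wrapper that owns the implementation as a trait object.
pub struct ScalarUDF {
    inner: Arc<dyn ScalarUDFImpl + Send + Sync>,
}

impl ScalarUDF {
    /// Explicit constructor
    pub fn new_from_impl<F>(fun: F) -> ScalarUDF
    where
        F: ScalarUDFImpl + Send + Sync + 'static,
    {
        ScalarUDF { inner: Arc::new(fun) }
    }

    pub fn name(&self) -> &str {
        self.inner.name()
    }
}

/// `ScalarUDF::from(x)` and `x.into()` delegate to the constructor above.
impl<F> From<F> for ScalarUDF
where
    F: ScalarUDFImpl + Send + Sync + 'static,
{
    fn from(fun: F) -> Self {
        Self::new_from_impl(fun)
    }
}

/// Hypothetical implementation used only to exercise the conversions.
pub struct Noop;

impl ScalarUDFImpl for Noop {
    fn name(&self) -> &str {
        "noop"
    }
}
```

With this shape, `ScalarUDF::new_from_impl(Noop)`, `ScalarUDF::from(Noop)`, and `Noop.into()` all build the same value, so the explicit constructor mostly serves discoverability.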

viirya · 2023-12-18T19:57:21Z

datafusion/expr/src/udf.rs

+/// can be used to implement any function.
+///
+/// See [`advanced_udf.rs`] for a full example with implementation. See
+/// [`ScalarUDF`] for details on a simpler API.


For simpler API, do you mean create_udf?

Yeah, I was trying to avoid replicating the same content (e.g. with links to create_udf, and simple example) all over the place (and just have it linked on ScalarUDF). I have tried to make this clearer

viirya · 2023-12-18T20:01:39Z

docs/source/library-user-guide/adding-udfs.md

@@ -76,7 +76,8 @@ The challenge however is that DataFusion doesn't know about this function. We ne

 ### Registering a Scalar UDF


Do we want to add a link to the advanced example (advanced_udf.rs) to this document, and also document ScalarUDFImpl there? Maybe as a follow up.

viirya · 2023-12-18T20:05:09Z

datafusion/expr/src/udf.rs

+    /// # Performance
+    /// Many functions can be optimized for the case when one or more of their
+    /// arguments are constant values [`ColumnarValue::Scalar`].


For this performance section, does it mean the implementations should optimize the case or DataFusion will optimize the case? Looks a bit unclear to me.

It means that the implementations should optimize the case -- I have tried to clarify the comments in this regard.

datafusion/expr/src/udf.rs

alamb

Thank you for the (as always) insightful review @viirya

alamb · 2023-12-18T21:06:04Z

datafusion/expr/src/udf.rs

+/// can be used to implement any function.
+///
+/// See [`advanced_udf.rs`] for a full example with implementation. See
+/// [`ScalarUDF`] for details on a simpler API.


Yeah, I was trying to avoid replicating the same content (e.g. with links to create_udf, and simple example) all over the place (and just have it linked on ScalarUDF). I have tried to make this clearer

alamb · 2023-12-18T21:08:26Z

datafusion/expr/src/udf.rs

+    /// # Performance
+    /// Many functions can be optimized for the case when one or more of their
+    /// arguments are constant values [`ColumnarValue::Scalar`].


It means that the implementations should optimize the case -- I have tried to clarify the comments in this regard.

viirya · 2023-12-19T22:30:03Z

datafusion-examples/examples/advanced_udf.rs

+                        // calculate the result for every row. The `unary` very
+                        // fast,  "vectorized" code and handles things like null
+                        // values for us.


Not sure if I read it correctly:

Suggested change

-// calculate the result for every row. The `unary` very
-// fast, "vectorized" code and handles things like null
-// values for us.
+// calculate the result for every row. The `unary` is very
+// fast "vectorized" code and handles things like null
+// values for us.

datafusion-examples/examples/advanced_udf.rs

viirya · 2023-12-19T22:39:47Z

datafusion/expr/src/udf.rs

    pub fn signature(&self) -> &Signature {
        &self.signature
    }

-    /// Return the type of the function given its input types
+    /// The datatype this function returns given the input argument input types


Maybe?

Suggested change

-/// The datatype this function returns given the input argument input types
+/// The datatype this function returns given the input argument types

viirya · 2023-12-19T22:44:01Z

docs/source/library-user-guide/adding-udfs.md

@@ -93,6 +95,11 @@ let udf = create_udf(
 );
 ```

+[`scalarudf`]: https://docs.rs/datafusion/latest/datafusion/logical_expr/struct.ScalarUDF.html


Suggested change

-[`scalarudf`]: https://docs.rs/datafusion/latest/datafusion/logical_expr/struct.ScalarUDF.html
+[`ScalarUDF`]: https://docs.rs/datafusion/latest/datafusion/logical_expr/struct.ScalarUDF.html

For some reason this lower casing is done by prettier, so I can't implement this suggestion without causing CI to fail 😬

andrewlamb@Andrews-MacBook-Pro:~/Software/arrow-datafusion$ git diff
diff --git a/docs/source/library-user-guide/adding-udfs.md b/docs/source/library-user-guide/adding-udfs.md
index c51e4de32..1d2cc0a12 100644
--- a/docs/source/library-user-guide/adding-udfs.md
+++ b/docs/source/library-user-guide/adding-udfs.md
@@ -95,7 +95,7 @@ let udf = create_udf(
 );
-[`scalarudf`]: https://docs.rs/datafusion/latest/datafusion/logical_expr/struct.ScalarUDF.html
+[`ScalarUDF`]: https://docs.rs/datafusion/latest/datafusion/logical_expr/struct.ScalarUDF.html
 [`create_udf`]: https://docs.rs/datafusion/latest/datafusion/logical_expr/fn.create_udf.html
 [`make_scalar_function`]: https://docs.rs/datafusion/latest/datafusion/physical_expr/functions/fn.make_scalar_function.html
 [`advanced_udf.rs`]: https://github.com/apache/arrow-datafusion/blob/main/datafusion-examples/examples/advanced_udf.rs

andrewlamb@Andrews-MacBook-Pro:~/Software/arrow-datafusion$ npx [email protected] --check '{datafusion,datafusion-cli,datafusion-examples,dev,docs}/**/*.md' '!datafusion/CHANGELOG.md' README.md CONTRIBUTING.md
Checking formatting...
[warn] docs/source/library-user-guide/adding-udfs.md
[warn] Code style issues found in the above file. Forgot to run Prettier?
andrewlamb@Andrews-MacBook-Pro:~/Software/arrow-datafusion$ git reset --hard
HEAD is now at 3ce1802df Improve docs for aliases
andrewlamb@Andrews-MacBook-Pro:~/Software/arrow-datafusion$ npx [email protected] --check '{datafusion,datafusion-cli,datafusion-examples,dev,docs}/**/*.md' '!datafusion/CHANGELOG.md' README.md CONTRIBUTING.md
Checking formatting...
All matched files use Prettier code style!
andrewlamb@Andrews-MacBook-Pro:~/Software/arrow-datafusion$

Co-authored-by: Liang-Chi Hsieh <[email protected]>

alamb · 2023-12-26T13:41:22Z

Update: I plan to merge this tomorrow unless anyone would like more time to review

thinkharderdev · 2023-12-27T14:06:16Z

Nice! It would be very useful to be able to handle serde as well for custom implementations (perhaps in a different PR?). I think this could fit relatively easily into LogicalExtensionCodec

alamb · 2023-12-28T20:07:07Z

I have several follow-on tasks that I will do shortly:

File a ticket for handling serde for custom implementations: Handle Serde for Custom ScalarUDFImpl traits #8706
File a ticket for applying the same pattern to AggregateUDF: Implement trait based API for defining AggregateUDF #8710 (and epic: [Epic] Unify AggregateFunction Interface (remove built in list of AggregateFunctions), improve the system #8708)
File a ticket for applying the same pattern to WindowUDF: Implement trait based API for defining WindowUDF #8711 (and epic: [Epic] Unify WindowFunction Interface (remove built in list of BuiltInWindowFunctions) #8709)
File a ticket for cleaning the internal implementation of ScalarUDF to use the trait (rather than the function pointers): Clean internal implementation of ScalarUDF to use ScalarUDFImpl (rather than the function pointers) #8712
PR that starts breaking out functions into their own crate: Create datafusion-functions crate, extract encode and decode to #8705

alamb · 2024-01-01T18:31:42Z

Nice! It would be very useful to be able to handle serde as well for custom implementations (perhaps in a different PR?). I think this could fit relatively easily into LogicalExtensionCodec

Filed #8706

SteveLauC · 2024-02-28T07:36:36Z

datafusion/expr/src/udf.rs

 ///
+/// 1. For simple (less performant) use cases, use [`create_udf`] and [`simple_udf.rs`].


less performant

Hi, is there anyone who would like to explain a bit about why create_udf() is less performant than the UDFs created by ScalarUDFImpl?

The reason is that create_udf() always converts its arguments to ArrayRef and thus you can't implement special cases for constant values (ScalarValue) -- instead the scalar value is always converted into an array.

Update: this does not seem to be correct. I will do some more investigation

Filed #9384 to clarify docs

Using create_udf creates an extra indirection. Under the hood it's creating a

pub struct SimpleScalarUDF {
    name: String,
    signature: Signature,
    return_type: DataType,
    fun: ScalarFunctionImplementation,
}

impl ScalarUDFImpl for SimpleScalarUDF {
    fn invoke(&self, args: &[ColumnarValue]) -> Result<ColumnarValue> {
        (self.fun)(args)
    }
}

so it adds an extra call for every batch processed through the UDF

That makes sense 👍 -- I don't think the overhead of a single function call is worth calling out in the docs however (I think it is more confusing than helpful), though please let me know if you disagree on #9384

Nope. I agree it's not meaningful enough to call out in the docs

github-actions bot added the logical-expr (Logical plan and expressions) and optimizer (Optimizer rules) labels Dec 18, 2023

Introduce new trait based ScalarUDF API

7e7dae7

alamb force-pushed the alamb/better_scalar_api branch from bdd8ca1 to 7e7dae7 Compare December 18, 2023 18:32

alamb commented Dec 18, 2023

alamb marked this pull request as ready for review December 18, 2023 18:41

Merge remote-tracking branch 'apache/main' into alamb/better_scalar_api

0d9bd8d

viirya reviewed Dec 18, 2023

alamb added 3 commits December 18, 2023 15:28

Merge remote-tracking branch 'apache/main' into alamb/better_scalar_api

879746c

change name to Self::new_from_impl

ae9d42c

Improve documentation, add link to advanced_udf.rs in the user guide

7fffd60

alamb commented Dec 18, 2023

alamb added 2 commits December 18, 2023 16:14

typo

3c86513

Improve docs for aliases

3ce1802

viirya reviewed Dec 19, 2023

viirya reviewed Dec 19, 2023

viirya reviewed Dec 19, 2023

viirya approved these changes Dec 19, 2023

alamb and others added 4 commits December 20, 2023 15:39

Apply suggestions from code review

0cade3f

Co-authored-by: Liang-Chi Hsieh <[email protected]>

improve docs

a7f2903

Merge remote-tracking branch 'apache/main' into alamb/better_scalar_api

d2fe8e7

Merge branch 'alamb/better_scalar_api' of github.com:alamb/arrow-data…

3904129

…fusion into alamb/better_scalar_api

alamb mentioned this pull request Dec 20, 2023

Specialized / Pre-compiled / Prepared ScalarUDFs #8051

Open

universalmind303 approved these changes Dec 21, 2023

alamb mentioned this pull request Dec 22, 2023

Implement trait based API for defining ScalarUDFs #8568

Closed

Merge remote-tracking branch 'apache/main' into alamb/better_scalar_api

e6bb42a

Merge remote-tracking branch 'apache/main' into alamb/better_scalar_api

d14b43d

alamb merged commit b2cbc78 into apache:main Dec 28, 2023
23 checks passed

alamb deleted the alamb/better_scalar_api branch December 28, 2023 20:07

alamb mentioned this pull request Jan 1, 2024

Handle Serde for Custom ScalarUDFImpl traits #8706

Closed

matthewgapp mentioned this pull request Jan 11, 2024

matt/feat/recursive ctes/config flag matthewgapp/arrow-datafusion#3

Closed

SteveLauC reviewed Feb 28, 2024

alamb mentioned this pull request Feb 28, 2024

Minor: clarify performance in docs for ScalarUDF, ScalarUDAF and ScalarUDWF #9384

Merged
