-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ScalarUDFImpl::invoke_with_args
to support passing the return type created for the udf instance
#13290
Conversation
datafusion/expr/src/udf.rs
Outdated
&self, | ||
args: &[ColumnarValue], | ||
number_rows: usize, | ||
return_type: &DataType, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about we add InvokeUDFArg
struct that includes these to avoid breaking change in the future?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think we should roll all the values into this struct or just number_rows and return_type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
including args
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any thoughts on the new function name, if we roll these in together. Since this subsumes invoke_batch
, would we then want to deprecate that too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I personally prefer single invoke
function, but that will be a large change so not required to roll them together.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a better name than
pub fn invoke_batch_with_return_type(
&self,
args: InvokeArgs
) -> Result<ColumnarValue> {
is needed, maybe invoke_with_args(), I don't like it that much, but not sure on the name.
pub fn invoke_with_args(
&self,
args: InvokeArgs
) -> Result<ColumnarValue> {
In an ideal worlds, UDF would have single invoke method. This is why #13238 / #13064 / #13345 consolidation takes place. In #12819 (comment) i am attempting to clarify why If not simplify, what if we had something like |
What if we just changed the invoke_batch API (again?) like in #12819 (comment) Maybe I think that is about the best we can hope for -- it would be API churn on downstream users, but at least it will be mechanical and the deprecation warning will give them time to adjust |
if constructing |
Cool, Are you sure we want to reuse invoke, this will stop anyone updating their DF version unless they remove invoke_no_args and then use ScalarFunctionArgs |
@findepi Would you prefer a breaking rename to invoke? |
Given @alamb 's proposal |
It seems to me that https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.ScalarUDFImpl.html#method.invoke So I propose we change the API "one last time". Note if we are going to do this I would like to also change the function to get owned arguments (so long term we can potentially do optimizations like reusing the underlying buffers with kernels like unary_mut So something like struct ScalarFunctionArgs<'a> {
pub args: Vec<ColumnarValue>,
pun num_rows: usize,
pub return_type: &'a DataType,
} Thank you for bearing with us @joseph-isaacs |
@@ -141,7 +141,11 @@ impl PhysicalExpr for ScalarFunctionExpr { | |||
.collect::<Result<Vec<_>>>()?; | |||
|
|||
// evaluate the function | |||
let output = self.fun.invoke_batch(&inputs, batch.num_rows())?; | |||
let output = self.fun.invoke_with_args(ScalarFunctionArgs { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we can pass in inputs
directly here (not as a slice but as the Vec) that would be ideal
This changes will require over 1000 test changes. Can we make this change incrementaily? |
Sure, what increments would you propose? Perhaps we can start with the removal of current |
Definitely |
Merge a PR with But it does look like migrating to invoke_batch would help. |
…invoke is passed the return type created for the udf instance
642928e
to
ae73371
Compare
Here is what I suggest to get this PR in:
Then we can migrate the code in DataFusion over to use |
I feel bad about this PR stalling. I am working on a proposal to get it going and in. Sorry for the delay @joseph-isaacs |
@joseph-isaacs here is a proposed patch that targets this PR that should get CI clean With that change I think we can merge this PR in Thank you for your patience |
ScalarUDFImpl::invoke_with_return_type
where the invoke is passed the return type created for the udf instanceScalarUDFImpl::invoke_with_args
to support passing the return type created for the udf instance
@@ -324,6 +327,16 @@ where | |||
} | |||
} | |||
|
|||
pub struct ScalarFunctionArgs<'a> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this structure also provides a potential place for "preparing" a scalar function (e.g. to pre-compile a regex 🤔 ) - #8051
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you suggest having a opaque block returned from a (new interface method) which is that that value to invoke?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I was thinking something like
trait ScalarUDFImpl {
/// prepares to run the function, returning any state (such as
/// a precompiled regex). Called once per instance of function in the query
fn prepare(&self, args: &ScalarFunctionArgs) -> Option<Box<dyn Any>> { None }
/// `prepared` field in ScalarFunctonArgs has the result of calling `prepare`
fn invoke_with_args(&self, args: &ScalarFunctionArgs) -> ...
pub struct ScalarFunctionArgs<'a> {
...
/// The result from a call to `prepare` for this function instance
prepared: Option<Box<dyn Any>>,
}
Do not yet deprecate `invoke_batch`, add docs to invoke_with_args
Thannks @alamb |
I have filed a follow on ticket to deprecate I think once the CI has passed this PR will be good to go Thank you for @joseph-isaacs as well as your help @jayzhan211 and @findepi |
I also filed Which I think would be quite cool |
pub number_rows: usize, | ||
// The return type of the scalar function returned (from `return_type` or `return_type_from_exprs`) | ||
// when creating the physical expression from the logical expression | ||
pub return_type: &'a DataType, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should also be easy to add as a follow-up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, as long as it's done in this release cycle so that we don't churn the api even more between cycles
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes -- I will file a ticket
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tracking with #13519
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(BTW the change in the design to a struct I think allows for much easier non-breaking API additions in the future)
Let's get this in to unblock the next work! |
The more I think about it the less ideal I think renaming to Thank you for #13491 👀 |
Which issue does this PR close?
Closes #12819.
Rationale for this change
There is currently no way to determine the schema-defined nullability of the arrays from
args: &[ColumnarValue],
from theinvoke_batch
function. The array nullabilityArray::is_nullable()
(which is available) only checks the array null count.If the return type (from
ScalarUDFImpl::return_type
) is passed to the newly defined invoke method then this datatype can be inspected and any nullability information can be inferred.What changes are included in this PR?
Define a new trait function
invoke_batch_with_return_type
which defaults to calling toinvoke_batch
.Example usage can be seen here.
Are these changes tested?
Are there any user-facing changes?
There are no breaking changes, but a new optional method is added.