Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ScalarUDFImpl::invoke_with_args to support passing the return type created for the udf instance #13290

Merged
merged 4 commits into from
Nov 21, 2024

Conversation

joseph-isaacs
Copy link
Contributor

@joseph-isaacs joseph-isaacs commented Nov 7, 2024

Which issue does this PR close?

Closes #12819.

Rationale for this change

There is currently no way to determine the schema-defined nullability of the arrays from args: &[ColumnarValue],from the invoke_batch function. The array nullability Array::is_nullable() (which is available) only checks the array null count.

If the return type (from ScalarUDFImpl::return_type) is passed to the newly defined invoke method then this datatype can be inspected and any nullability information can be inferred.

What changes are included in this PR?

Define a new trait function invoke_batch_with_return_type which defaults to calling to invoke_batch.

Example usage can be seen here.

Are these changes tested?

Are there any user-facing changes?

There are no breaking changes, but a new optional method is added.

@github-actions github-actions bot added logical-expr Logical plan and expressions physical-expr Physical Expressions labels Nov 7, 2024
@joseph-isaacs joseph-isaacs marked this pull request as ready for review November 7, 2024 15:27
&self,
args: &[ColumnarValue],
number_rows: usize,
return_type: &DataType,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about we add InvokeUDFArg struct that includes these to avoid breaking change in the future?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we should roll all the values into this struct or just number_rows and return_type?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

including args

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any thoughts on the new function name, if we roll these in together. Since this subsumes invoke_batch, would we then want to deprecate that too?

Copy link
Contributor

@jayzhan211 jayzhan211 Nov 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally prefer single invoke function, but that will be a large change so not required to roll them together.

Copy link
Contributor Author

@joseph-isaacs joseph-isaacs Nov 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a better name than

 pub fn invoke_batch_with_return_type(
        &self,
        args: InvokeArgs
    ) -> Result<ColumnarValue> {

is needed, maybe invoke_with_args(), I don't like it that much, but not sure on the name.

 pub fn invoke_with_args(
        &self,
        args: InvokeArgs
    ) -> Result<ColumnarValue> {

@findepi
Copy link
Member

findepi commented Nov 11, 2024

In an ideal worlds, UDF would have single invoke method. This is why #13238 / #13064 / #13345 consolidation takes place.
Now, we add new invoke variant. Would we want all implementations to consolidate on this new variant, or would it feel awkward when doing so?
Passing back the return_type: &DataType might be sometimes useful, but generally feels redundant.

In #12819 (comment) i am attempting to clarify why simplify isn't useful for the use-case.

If not simplify, what if we had something like bind_return_type method that's guaranteed to be invoked before invoke / invoke_batch and which may return new UDF instance (if state changes) or nothing (per default impl). This way we would provide exactly and directly what's needed for the use-case, without changing invoke itself.

@joseph-isaacs
Copy link
Contributor Author

I would also be happy to scope out another prepare/bind method on a UDF. But It seems like we need a consensus on the preferred solution @findepi @alamb.

@alamb
Copy link
Contributor

alamb commented Nov 13, 2024

I would also be happy to scope out another prepare/bind method on a UDF. But It seems like we need a consensus on the preferred solution @findepi @alamb.

What if we just changed the invoke_batch API (again?) like in #12819 (comment)

Maybe invoke_with_args(args: ScalarFunctionArgs) and mark invoke_batch deprecated 😬

I think that is about the best we can hope for -- it would be API churn on downstream users, but at least it will be mechanical and the deprecation warning will give them time to adjust

@findepi
Copy link
Member

findepi commented Nov 13, 2024

if constructing ScalarFunctionArgs is cheap (a wrapper around information already possessed), i'm fine with that.
putting the args in a struct makes future-proof
btw we might be able to use invoke name again on the hope it's the last time we change the API 🤞

@joseph-isaacs
Copy link
Contributor Author

Cool, Are you sure we want to reuse invoke, this will stop anyone updating their DF version unless they remove invoke_no_args and then use ScalarFunctionArgs

@joseph-isaacs
Copy link
Contributor Author

@findepi Would you prefer a breaking rename to invoke?

@findepi
Copy link
Member

findepi commented Nov 16, 2024

Given @alamb 's proposal invoke...(args: ScalarFunctionArgs), packing args into a struct could mean this is the last time we think about the name of the invoke function. In 5 years from now perspective, invoke sounds like a better name than inoke_with_args.

@alamb
Copy link
Contributor

alamb commented Nov 18, 2024

Cool, Are you sure we want to reuse invoke, this will stop anyone updating their DF version unless they remove invoke_no_args and then use ScalarFunctionArgs

It seems to me that invoke was deprecated in 42, and we have now released 43. I think it would be ok to reuse invoke now -- maybe with a not in the documentation that it used to have a different signature.

https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.ScalarUDFImpl.html#method.invoke

So I propose we change the API "one last time". Note if we are going to do this I would like to also change the function to get owned arguments (so long term we can potentially do optimizations like reusing the underlying buffers with kernels like unary_mut

So something like

struct ScalarFunctionArgs<'a> {
  pub args: Vec<ColumnarValue>,
  pun num_rows: usize, 
  pub return_type: &'a DataType,
}

Thank you for bearing with us @joseph-isaacs

@@ -141,7 +141,11 @@ impl PhysicalExpr for ScalarFunctionExpr {
.collect::<Result<Vec<_>>>()?;

// evaluate the function
let output = self.fun.invoke_batch(&inputs, batch.num_rows())?;
let output = self.fun.invoke_with_args(ScalarFunctionArgs {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we can pass in inputs directly here (not as a slice but as the Vec) that would be ideal

@joseph-isaacs
Copy link
Contributor Author

This changes will require over 1000 test changes. Can we make this change incrementaily?

@findepi
Copy link
Member

findepi commented Nov 19, 2024

Sure, what increments would you propose? Perhaps we can start with the removal of current invoke.

@jayzhan211
Copy link
Contributor

This changes will require over 1000 test changes. Can we make this change incrementaily?

Definitely

@joseph-isaacs
Copy link
Contributor Author

joseph-isaacs commented Nov 19, 2024

Merge a PR with invoke_with_args, and then work to rename to invoke. But this is quite an annoying reason to further delay the original intent of this PR.

But it does look like migrating to invoke_batch would help.

…invoke is passed the return type created for the udf instance
@joseph-isaacs joseph-isaacs force-pushed the ji/add-invoke-with-return branch from 642928e to ae73371 Compare November 19, 2024 15:54
@alamb
Copy link
Contributor

alamb commented Nov 19, 2024

Merge a PR with invoke_with_args, and then work to rename to invoke. But this is quite an annoying reason to further delay the original intent of this PR.

Here is what I suggest to get this PR in:

  1. Don't deprecate invoke_batch (at least for a while and not in this PR). People could use that just fine and then eventually if it gets annoying we can rename invoke_with_args

Then we can migrate the code in DataFusion over to use invoke_batch / invoke_with_args slowly without affecting other downstream users.

@alamb
Copy link
Contributor

alamb commented Nov 21, 2024

I feel bad about this PR stalling. I am working on a proposal to get it going and in. Sorry for the delay @joseph-isaacs

@alamb
Copy link
Contributor

alamb commented Nov 21, 2024

@joseph-isaacs here is a proposed patch that targets this PR that should get CI clean

With that change I think we can merge this PR in

Thank you for your patience

@alamb alamb changed the title Added support for ScalarUDFImpl::invoke_with_return_type where the invoke is passed the return type created for the udf instance Add ScalarUDFImpl::invoke_with_args to support passing the return type created for the udf instance Nov 21, 2024
@@ -324,6 +327,16 @@ where
}
}

pub struct ScalarFunctionArgs<'a> {
Copy link
Contributor

@alamb alamb Nov 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this structure also provides a potential place for "preparing" a scalar function (e.g. to pre-compile a regex 🤔 ) - #8051

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you suggest having a opaque block returned from a (new interface method) which is that that value to invoke?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I was thinking something like

trait ScalarUDFImpl { 
  /// prepares to run the function, returning any state (such as 
  /// a precompiled regex). Called once per instance of function in the query
  fn prepare(&self, args: &ScalarFunctionArgs) -> Option<Box<dyn Any>> { None }

  /// `prepared` field in ScalarFunctonArgs has the result of calling `prepare`
  fn invoke_with_args(&self, args: &ScalarFunctionArgs) -> ...
pub struct ScalarFunctionArgs<'a> {
  ...
  /// The result from a call to `prepare` for this function instance
  prepared: Option<Box<dyn Any>>,
}

Do not yet deprecate `invoke_batch`, add docs to invoke_with_args
@joseph-isaacs
Copy link
Contributor Author

Thannks @alamb

@alamb
Copy link
Contributor

alamb commented Nov 21, 2024

I have filed a follow on ticket to deprecate invoke_batch and invoke_with_args:

I think once the CI has passed this PR will be good to go

Thank you for @joseph-isaacs as well as your help @jayzhan211 and @findepi

@alamb
Copy link
Contributor

alamb commented Nov 21, 2024

I also filed

Which I think would be quite cool

pub number_rows: usize,
// The return type of the scalar function returned (from `return_type` or `return_type_from_exprs`)
// when creating the physical expression from the logical expression
pub return_type: &'a DataType,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How did I not see this till now?

Please please please add SessionContext or at least SessionConfig to this. It would allow us to unblock so so many tickets

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should also be easy to add as a follow-up?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, as long as it's done in this release cycle so that we don't churn the api even more between cycles

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes -- I will file a ticket

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracking with #13519

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(BTW the change in the design to a struct I think allows for much easier non-breaking API additions in the future)

@alamb
Copy link
Contributor

alamb commented Nov 21, 2024

Let's get this in to unblock the next work!

@alamb alamb merged commit edbd93a into apache:main Nov 21, 2024
25 checks passed
@joseph-isaacs
Copy link
Contributor Author

joseph-isaacs commented Nov 22, 2024

In #13491 I have migrated over to using invoke_batch in all test. I think if we are quick we can rename invoke_with_args back to invoke. @alamb @findepi

@alamb
Copy link
Contributor

alamb commented Nov 22, 2024

In #13491 I have migrated over to using invoke_batch in all test. I think if we are quick we can rename invoke_with_args back to invoke. @alamb @findepi

The more I think about it the less ideal I think renaming to invoke is -- I think it will just cause a bunch of confusion downstream.

Thank you for #13491 👀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
functions logical-expr Logical plan and expressions physical-expr Physical Expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Access children DataType or return-type in ScalarUDFImpl::invoke
5 participants