-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
UDFs marked as volatile do not appear to evaluate multiple times for each output row. #8866
Comments
I will handle this issue 😄 |
@alamb Could be. My issue is observed in UDFs, while the other ticket, the issue is seen in native functions. It could very well be the fix in the planned release fixes this also. I can actually confirm this in a bit. |
@alamb I updated the repo to depend on main branch, see dadepo/df-repro@38acb9f But when I ran it, I still got an output that indicates the issue still persists.
|
@alamb I updated to include native random function dadepo/df-repro@00feaed and running that I get
Which indicates that, this works fine with native random function. Could it also be the way UDF's work? Because from my understanding when a udf has no arguments, when called, the arguments it is defined with is set to:
ie one Array, regardless of the number of rows in the table. Could this be a factor? |
It is not because if it is volatile or not. A ScalarUDF which is defined as no argument one or all scalar inputs, DataFusion will assume its output is a scalar (you also return a one element array from That's why |
If you want |
But this means, passing an argument to a UDF, not because the function needs it, but because you want the UDF to be evaluated per output row in the query? This sounds like a limitation. The native function did not require passing such an argument for it to be evaluated per output row, I was expecting there would be away to define a UDF with this behaviour. |
Well, if you don't use |
Or you want to stick with function definition instead of |
Maybe we need to update the code that invokes a scalar UDF and if it is marked as volatile it needs to have its arguments expanded into an array before being invoked 🤔 (or know enough information so it can do that expansion itself) |
I feel that it is not so reasonable to bind the nature of a volatile scalar function to some special way on expanding its arguments or how DataFusion treats its output. The definition of volatile scalar function doesn't include such spec so by doing this we might create something weird to understand by outside DataFusion. The special treatment of argument/output is specified to The good news is we have |
I think the key thing a UDF writer needs to have is access to the number of output rows to produce. As long as they have this information we can leave it to them to implement the volatile semantics internally |
@alamb Any pointers on how to get this info? I have switched to an implementation that does not use
|
Switched the implementation to
When I dbg! the args passed to a zero parameter UDF I am implementing, I get
Which suggests to me, that a Scalar value is being passed, and not an Array (I was expecting null array) as suggested in the documentation, and hence the length cannot be deduced from this. |
🤔 -- here is how the built in random function does this: https://github.com/apache/arrow-datafusion/blob/51b898288830c224b825523f9be1d54974f15d2f/datafusion/physical-expr/src/math_expressions.rs#L375-L383 And it seems to work as expected: ❯ select x, random() from foo;
+---+---------------------+
| x | random() |
+---+---------------------+
| 1 | 0.8579996222450448 |
| 2 | 0.11611126693245999 |
+---+---------------------+
2 rows in set. Query took 0.003 seconds. I wonder if you can follow what random is doing / getting? If the scalar UDFs aren't working the same for some reason, we should fix it so they are. I checked and I don't see any obvious special cases for Random 🤔 |
So I updated the reproduction repo to implement See the diff here dadepo/df-repro@d97cf9c When I run this, it fails:
And it does look like a |
Describe the bug
It seems a UDF with no arguments are only called once, even if signature is defined with
Volatility::Volatile
and also queried in the context of a table with multiple rows.By this I mean, for example:
select random_udf() from many_rows_table
There is a minimal repro here https://github.com/dadepo/df-repro
When ran, the output could be:
In the run above, the
25.0
is the result of the udf and it is repeated.To Reproduce
A minimal reproduction can be found here https://github.com/dadepo/df-repro
Expected behavior
The UDF to be evaluated per each row output.
Additional context
No response
The text was updated successfully, but these errors were encountered: