Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Make it easier to call window functions via expression API (and add example) #6746

Closed
wants to merge 1 commit into from

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Jun 22, 2023

before polishing this PR up I wanted to get some feedback on the approach

Which issue does this PR close?

Related to #5781
Closes: #6747

Rationale for this change

@mustafasrepo suggested adding an example for DataFrame::window here: #6703 (comment)

When I tried to do so it turns out that creating an Expr::Window is quite challenging, so I took a step back and tried to clean up the API a little

What changes are included in this PR?

  1. Change WindowExpr to be a builder style interface and thus making it easier to construct Expr::WindowFunction variants. This is modeled after the CaseBuilder
  2. Add the lead, lag, etc. in expr_fns to follow the model of the other types of functions

Are these changes tested?

Yes

TOOD: I need to write tests for all the code in expr_fns which I will add to https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/tests/dataframe/dataframe_functions.rs if people like this basic approach

Are there any user-facing changes?

yes, the APIs to create window functions are improved

@github-actions github-actions bot added core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules sql SQL Planner labels Jun 22, 2023
/// let df = ctx.read_csv("tests/data/example.csv", CsvReadOptions::new()).await?;
///
/// // The following is the equivalent of "SELECT FIRST_VALUE(b) OVER(PARTITION BY a)"
/// let first_value = first_value(col("b"))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rest of the PR is motivated by trying to write this example. It turned out to be quite a pain to create the right Expr.

) -> Self {
/// Create a new Window expression with the specified argument an
/// empty `OVER` clause
pub fn new(fun: impl Into<window_function::WindowFunction>, args: Vec<Expr>) -> Self {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the major change -- so that WindowFunction acts like a Builder which can be used to set partition by, order by and window_frame with a more self documenting style

@tustvold tustvold changed the base branch from main to backup-main June 27, 2023 11:12
@tustvold tustvold changed the base branch from backup-main to main June 27, 2023 11:13
@alamb alamb self-assigned this Nov 6, 2023
@timsaucer
Copy link
Contributor

As a new user, I like this change very much. I found creating lead and lag functions to be difficult to use. Suppose I wanted to call lead but provide the other two optional arguments - shift offset and default value. Would the best way to do that with this proposal to create the WindowFunction directly?

let single_lead = lead(col("mycol"));
let two_lead_with_default = expr::WindowFunction::new(BuiltInWindowFunction::Lead, vec![col("mycol"), lit(2), lit(mydefault)]);

@alamb
Copy link
Contributor Author

alamb commented May 3, 2024

As a new user, I like this change very much. I found creating lead and lag functions to be difficult to use. Suppose I wanted to call lead but provide the other two optional arguments - shift offset and default value. Would the best way to do that with this proposal to create the WindowFunction directly?

I think it would look something like this

let single_lead = lead(col("mycol"))
  .
let two_lead_with_default = lead.call(col("mycol"))
  .offset(lit(2))
  .partition_by(lit(mydefault)]))
  .build()

@timsaucer
Copy link
Contributor

I'm looking at the arguments we end up passing. I think cume_dist takes no arguments. Lead and lag take anywhere between 1 and 3. nth_value takes two. I think it might be easier to use with something like

pub fn lead(arg: Expr, offset: Option<i64>, default: Option<ScalarValue>) -> expr::WindowFunction {
    let offset = lit(ScalarValue::Int64(offset));
    let default_value = match default {
        Some(v) => lit(v),
        None => lit(ScalarValue::Null),
    };
    expr::WindowFunction::new(BuiltInWindowFunction::Lead, vec![arg, offset, default_value])
}

and for nth value it's easier

pub fn nth_value(arg: Expr, n: i64) -> expr::WindowFunction {
    expr::WindowFunction::new(BuiltInWindowFunction::NthValue, vec![arg, lit(n)])
}

@alamb
Copy link
Contributor Author

alamb commented May 6, 2024

I think it might be easier to use with something like

That makes sense to me @timsaucer -- thank you for the feedback.

I don't think I will have time to work on this PR / issue for a while, but I tried to give it some visibility in #10395 and hopefully we can find someone else who is interested in helping make it work

@alamb
Copy link
Contributor Author

alamb commented May 6, 2024

(or of course if you have the time it would be an amazing contribution 🙏 )

@timsaucer
Copy link
Contributor

I'm willing to work on this, but I'd like to wrap up the examples I have going into datafusion-python first. I think those will have more impact. I only have an hour or so a day to work on this, so I can't promise how fast I'll get to it.

@alamb
Copy link
Contributor Author

alamb commented May 7, 2024

I'm willing to work on this, but I'd like to wrap up the examples I have going into datafusion-python first. I think those will have more impact. I only have an hour or so a day to work on this, so I can't promise how fast I'll get to it.

Thank you very much 🙏

I totally understand re time constraints. Maybe we'll find some other people to help too. One can hope

Copy link

github-actions bot commented Jul 7, 2024

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale PR has not had any activity for some time label Jul 7, 2024
@alamb alamb closed this Jul 7, 2024
@alamb
Copy link
Contributor Author

alamb commented Jul 7, 2024

We can revive this PR / its API when someone has time to work on it

In case anyone is following along, @jayzhan211 added a really nice trait for working with aggregate functions. Maybe we can do something similar for window functions eventually

/// Extensions for configuring [`Expr::AggregateFunction`]
///
/// Adds methods to [`Expr`] that make it easy to set optional aggregate options
/// such as `ORDER BY`, `FILTER` and `DISTINCT`
///
/// # Example
/// ```no_run
/// # use datafusion_common::Result;
/// # use datafusion_expr::{AggregateUDF, col, Expr, lit};
/// # use sqlparser::ast::NullTreatment;
/// # fn count(arg: Expr) -> Expr { todo!{} }
/// # fn first_value(arg: Expr) -> Expr { todo!{} }
/// # fn main() -> Result<()> {
/// use datafusion_expr::AggregateExt;
///
/// // Create COUNT(x FILTER y > 5)
/// let agg = count(col("x"))
/// .filter(col("y").gt(lit(5)))
/// .build()?;
/// // Create FIRST_VALUE(x ORDER BY y IGNORE NULLS)
/// let sort_expr = col("y").sort(true, true);
/// let agg = first_value(col("x"))
/// .order_by(vec![sort_expr])
/// .null_treatment(NullTreatment::IgnoreNulls)
/// .build()?;
/// # Ok(())
/// # }
/// ```
pub trait AggregateExt {
/// Add `ORDER BY <order_by>`
///
/// Note: `order_by` must be [`Expr::Sort`]
fn order_by(self, order_by: Vec<Expr>) -> AggregateBuilder;
/// Add `FILTER <filter>`
fn filter(self, filter: Expr) -> AggregateBuilder;
/// Add `DISTINCT`
fn distinct(self) -> AggregateBuilder;
/// Add `RESPECT NULLS` or `IGNORE NULLS`
fn null_treatment(self, null_treatment: NullTreatment) -> AggregateBuilder;
}

@timsaucer
Copy link
Contributor

I should have time to work on this near the end of the week or early next week. I'm in nearing the end of a large PR for datafusion-python and can dive back on this topic in once that's up for review.

@alamb
Copy link
Contributor Author

alamb commented Jul 9, 2024

Thanks @timsaucer -- looking forward to it.

I think the most useful part of this PR are the doc examples

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules sql SQL Planner Stale PR has not had any activity for some time
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make it easier to create WindowFunctions with the Expr API
2 participants