RFC: Add `Precision:AtLeast` and `Precision::AtMost` for more `Statistics`… precision #13293

alamb · 2024-11-07T14:25:44Z

Which issue does this PR close?

Closes Introduce a way to represent constrained statistics / bounds on values in Statistics #8078
Related to Optimized version of SortPreservingMerge that doesn't actually compare sort keys of the key ranges are ordered #10316

Discussion:

This is a Request for Comment (and maybe also a POC)

I hacked very briefly on a different approach (Precision::Interval) but found:

ColumnStatistics already has a min and max (basically a bound), so introducing an interval into Precision would likely mean changing ColumnStatistics to have a value: Precision<..> as well. That might be a better choice, but it would be a bigger change.
The Interval is defined in a different crate, so we can't easily use it in Statistics without a bunch of code.

Rationale for this change

For the analysis described on #10316, we need to know if a value is constrained to a range to avoid merging. However, the current Statistics are either Exact or Inexact, so once the precision becomes Inexact we lose the information that the possible minimum / maximum values do not change.

This came up twice for me recently:

@suremarc, @findepi and I spoke about this here: Epic: Statistics improvements #8227 (comment) where we need to know if a value is constrained to a range to avoid merging (see Optimized version of SortPreservingMerge that doesn't actually compare sort keys of the key ranges are ordered #10316)
Yesterday I happened to be working on code in InfluxDB 3.0 that relies on knowing min/max values and I hit influxdata@30d4368 which marks the statistics as "inexact" when filter pushdown is on for parquet, but that loses the key information that the possible minimum / maximum values do not change

I hacked around it downstream, but I think this is all sounding like it is time to add a new Precision enum that allows for this use case

What changes are included in this PR?

Introduce Precision::AtMost, Precision::AtLeast
Add ColumnStatistics::min and ColumnStatistics::max that return the correct value
Update code to handle new precision
Update ParquetExec to use the new Precision variants when filters are on
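
A minimal sketch of how the proposed variants might behave, assuming a simplified stand-in for `Precision` (the real enum in `datafusion/common/src/stats.rs` has more methods and may differ). The helper names `after_filter_max` and `after_filter_min` are hypothetical; they only illustrate why a pushed-down filter could turn an exact maximum into an `AtMost` bound rather than `Inexact`.

```rust
/// Simplified stand-in for the enum discussed in this PR (not the real definition).
#[derive(Debug, Clone, PartialEq)]
enum Precision<T> {
    Exact(T),
    Inexact(T),
    AtMost(T),
    AtLeast(T),
    Absent,
}

/// Hypothetical helper: what is still known about a column maximum after a
/// filter may have removed rows. Filtering can never increase the maximum,
/// so a known upper limit stays valid as an inclusive upper bound.
fn after_filter_max<T>(max: Precision<T>) -> Precision<T> {
    match max {
        Precision::Exact(v) | Precision::AtMost(v) => Precision::AtMost(v),
        // An estimate stays an estimate.
        Precision::Inexact(v) => Precision::Inexact(v),
        // "max >= v" is no longer guaranteed once rows may be removed.
        Precision::AtLeast(_) | Precision::Absent => Precision::Absent,
    }
}

/// Mirror image for the minimum: filtering can never decrease it.
fn after_filter_min<T>(min: Precision<T>) -> Precision<T> {
    match min {
        Precision::Exact(v) | Precision::AtLeast(v) => Precision::AtLeast(v),
        Precision::Inexact(v) => Precision::Inexact(v),
        Precision::AtMost(_) | Precision::Absent => Precision::Absent,
    }
}
```

With this shape, a consumer that only needs a guaranteed range (for example to check whether two partitions overlap) can still use the value after filtering.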

Are these changes tested?

Are there any user-facing changes?

alamb · 2024-11-07T14:26:13Z

datafusion/common/src/stats.rs

+    /// The value is known to be at most this value (inclusive).
+    ///
+    /// The actual value may be smaller, but it is never greater.
+    AtMost(T),


This is the main change -- adding these variants

suremarc · 2024-11-07T15:19:11Z

Without knowing too much about the use case for inexact statistics, is it possible we may need both inexact and "precise" upper/lower bounds for column statistics? I.e. a tight, inexact lower/upper bound, and then a looser "real" upper & lower bound.

I can see this causing tension between parts of the codebase that benefit from tighter but inexact bounds and parts that benefit from having correct bounds.

alamb · 2024-11-07T17:22:57Z

> Without knowing too much about the use case for inexact statistics, is it possible we may need both inexact and "precise" upper/lower bounds for column statistics? I.e. a tight, inexact lower/upper bound, and then a looser "real" upper & lower bound.
>
> I can see this causing tension between parts of the codebase that benefit from tighter but inexact bounds and parts that benefit from having correct bounds.

I am also not super sure about the use case for inexact statistics. I think there was some idea that knowing a value was likely close to 1M would be more helpful than simply discarding the values.

However, almost all the operations I can think of (filtering, limit, aggregation) don't make the output range larger than the input.

Maybe we could consider simply removing Precision::Inexact entirely 🤔 so we would only have

enum Precision<T> {
  Exact(T),
  AtMost(T),
  AtLeast(T),
  Unknown,
}

I still do feel like having Precision::Bounded would be ideal to reuse all the existing Interval logic but that feels like too large a change to me. But maybe not

I wonder if @berkaysynnada has any thoughts or insights?

Dandandan · 2024-11-07T18:48:00Z

Inexact is useful for (and used by) join reordering (number of rows or size in bytes): depending on the estimated filter selectivity, we can estimate which side of the join is smaller (e.g. after the filter).
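
As a rough illustration of this use of point estimates, here is a minimal sketch. `RowCount`, `with_selectivity`, and `left_is_smaller` are hypothetical names, not DataFusion APIs, and the selectivity value is simply assumed to come from some estimator.

```rust
/// Hypothetical, simplified row-count statistic: either exact or an estimate.
#[derive(Debug, Clone, Copy, PartialEq)]
enum RowCount {
    Exact(usize),
    Inexact(usize),
}

impl RowCount {
    /// The stored number, regardless of how trustworthy it is.
    fn value(self) -> usize {
        match self {
            RowCount::Exact(n) | RowCount::Inexact(n) => n,
        }
    }

    /// Apply an assumed filter selectivity in [0.0, 1.0].
    /// The result is only ever an estimate, even if the input was exact.
    fn with_selectivity(self, selectivity: f64) -> RowCount {
        let estimated = (self.value() as f64 * selectivity).round() as usize;
        RowCount::Inexact(estimated)
    }
}

/// Returns true if the left input is estimated to be no larger than the right
/// one, which a join-reordering rule could use to pick the smaller side.
fn left_is_smaller(left: RowCount, right: RowCount) -> bool {
    left.value() <= right.value()
}
```

For example, a 1,000,000 row input with an assumed selectivity of 0.01 is estimated at 10,000 rows and would be treated as smaller than an unfiltered 50,000 row input. A bound such as "at most 1,000,000 rows" would not be enough to make that choice.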

crepererum · 2024-11-08T11:20:46Z

I wonder if a range semantics would be nicer, i.e.:

struct Precision<T> {
  lower: Option<T>,
  upper: Option<T>,
}

And then you have:

| Big Enum | Lower | Upper |
|---|---|---|
| `Exact` | `Some(x)` | `Some(x)` |
| `Inexact` | -- | -- |
| `AtMost` | `None` | `Some(x)` |
| `AtLeast` | `Some(x)` | `None` |
| `Absent` | `None` | `None` |
| -- | `Some(x)` | `Some(y)` |
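
The mapping in the table above could be sketched as constructors on a range type. This is only an illustration: `ValueRange` and its methods are hypothetical names, and both ends are assumed to be inclusive.

```rust
/// Sketch of the range-based representation: a missing end means "unbounded".
#[derive(Debug, Clone, PartialEq)]
struct ValueRange<T> {
    lower: Option<T>,
    upper: Option<T>,
}

impl<T: Clone + PartialOrd> ValueRange<T> {
    /// `Exact(x)`: both ends are the same value.
    fn exact(x: T) -> Self {
        Self { lower: Some(x.clone()), upper: Some(x) }
    }

    /// `AtMost(x)`: only the upper end is known.
    fn at_most(x: T) -> Self {
        Self { lower: None, upper: Some(x) }
    }

    /// `AtLeast(x)`: only the lower end is known.
    fn at_least(x: T) -> Self {
        Self { lower: Some(x), upper: None }
    }

    /// `Absent`: nothing is known.
    fn absent() -> Self {
        Self { lower: None, upper: None }
    }

    /// Returns the single value if both ends coincide.
    fn as_exact(&self) -> Option<&T> {
        match (&self.lower, &self.upper) {
            (Some(l), Some(u)) if l == u => Some(l),
            _ => None,
        }
    }

    /// Whether `v` may lie in the range (inclusive ends).
    fn may_contain(&self, v: &T) -> bool {
        self.lower.as_ref().map_or(true, |l| l <= v)
            && self.upper.as_ref().map_or(true, |u| v <= u)
    }
}
```

The last table row (`Some(x)`, `Some(y)`) needs no special constructor here; it is simply a value with two different ends, which the big enum cannot express.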

If you really want to support inexact estimations, you could extend it like this:

struct Precision<T> {
  lower: Bound<T>,
  upper: Bound<T>,
}

enum Bound<T> {
  Absent,
  Inexact(T),
  Exact(T),
}

alamb · 2024-11-08T12:20:02Z

> I wonder if a range semantics would be nicer, i.e.:

That is a good idea.

One thing is that we already have a version of that type of Interval logic (that handles open/closed upper/lower bounds, etc.) here: https://docs.rs/datafusion-expr-common/42.2.0/src/datafusion_expr_common/interval_arithmetic.rs.html#160-163

https://docs.rs/datafusion/latest/datafusion/logical_expr/interval_arithmetic/struct.Interval.html

However, that is hard coded to use ScalarValue where Precision is generic (used for usize and ScalarValue).

But maybe we can just provide a conversion to/from Interval 🤔

findepi · 2024-11-08T15:47:23Z

With a single estimated value, the conceptual idea is that the random variable we're estimating has a distribution "condensed" around that value. One can imagine this being a normal distribution with the value being the mean.

Obviously this is an over-simplification. Not everything has a normal distribution, for example non-negative numbers. But this mental model is easy to work with.

While exact ranges are super useful (for example for predicate derivation and pruning), inexact ranges as a statistics model pose the problem of how to interpret the value, e.g. when judging which side of the join is smaller, or when computing a filter on top of some other computation. It's tempting to capture uncertainty as ever-widening ranges and to finally interpret the range as its middle value.

This is definitely a more complex mental model and it will surely result in the code being more complex. Will it also result in better estimates? Maybe.

There are also other alternatives to consider:

histograms. Some optimizers (e.g. MySQL 8) use them to capture ranges of values together with their cardinality, which allows deriving histograms after applying a range filter
run-time adaptivity. There is only so much the optimizer can do a priori, before seeing the data. At some point of maturity, making the optimizer smarter doesn't result in queries returning faster. However, managing run-time detected skew or being able to replan are other, very powerful techniques.

Dandandan · 2024-11-09T01:49:03Z

Perhaps it makes sense to somehow separate estimated statistics vs bounds (e.g. compute / return both).

In some situations you would like to know a bound (avoiding merge), while in other situations (e.g. join order) you would like to have a point estimate.

findepi · 2024-11-10T17:46:44Z

Absolutely, the exact bounds must not be mistaken for estimates/statistics and need to be treated as two different things, even though they look similar.
Known data bounds (min/max or lower/upper bounds) allow pruning.
Statistics are estimates and don't allow pruning.
During planning we should be able to express both: known bounds and estimates (bounds or not)
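
To make the distinction concrete, here is a minimal sketch under the assumption that bounds and estimates are stored in separate fields. `ColumnInfo` and `can_prune_greater_than` are hypothetical names used only for illustration.

```rust
/// Hypothetical split between guaranteed bounds and a best-effort estimate.
#[derive(Debug, Clone, PartialEq)]
struct ColumnInfo {
    /// Guaranteed inclusive bounds: every value v satisfies min <= v <= max.
    known_min: Option<i64>,
    known_max: Option<i64>,
    /// Best-effort guess; it must never be used to skip data.
    estimated_max: Option<i64>,
}

/// Can a container (file, row group, partition) be skipped entirely for the
/// predicate `col > threshold`? Only a guaranteed bound may justify this.
fn can_prune_greater_than(info: &ColumnInfo, threshold: i64) -> bool {
    match info.known_max {
        // No value can exceed `max`, so nothing can exceed `threshold` either.
        Some(max) => max <= threshold,
        // Without a guaranteed bound we must read the data,
        // whatever `estimated_max` says.
        None => false,
    }
}
```

An estimate of 8 for the maximum does not allow skipping data for `col > 100`, because the estimate may be wrong; a known maximum of 10 does.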

crepererum · 2024-11-11T10:30:08Z

Based on the discussion, another idea:

/// ...
///
/// # Note
/// Unknown ranges are modeled as `Precision::Range{lower: None, upper: None}`.
enum Precision<T> {
  /// Estimated, but very close to the given point.
  ///
  /// This can be treated as an over-simplified normal distribution.
  PointEstimation(T),
  
  /// Value is for sure in the given open/half-open/closed range.
  ///
  /// If `lower`/`upper` are given, they are INCLUSIVE ends.
  Range {
    lower: Option<T>,
    upper: Option<T>,
  },
}

(this might need some more docs and helper methods, but I think you get the idea)

berkaysynnada · 2024-11-11T10:50:17Z

I prefer going with

pub enum Estimate {
    Range { bounds: Interval, value: ScalarValue },
    /// Nothing is known about the value
    #[default]
    Absent,
}

as mentioned there.

We will both be making use of interval arithmetic and eliminating the need for separate bound and estimation statistics. It does not require choosing which kind of stats (range or estimate) you keep.

The only challenge is that we need to guard the internal values and provide all the APIs someone may require. It should never be in an inconsistent state (like an estimate value that is outside the bounds).
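
One way to guard the internal values is to keep the fields private and validate them in every constructor. The sketch below uses plain `i64` values instead of `Interval` and `ScalarValue` to stay self-contained; `GuardedEstimate` and its methods are hypothetical names, not part of any proposed API.

```rust
/// Hypothetical guarded type: inclusive bounds plus a point estimate.
/// Invariant: lower <= value <= upper.
#[derive(Debug, Clone, PartialEq)]
struct GuardedEstimate {
    lower: i64,
    upper: i64,
    value: i64,
}

impl GuardedEstimate {
    /// Returns `None` if the estimate would fall outside the bounds
    /// (this also rejects lower > upper).
    fn try_new(lower: i64, upper: i64, value: i64) -> Option<Self> {
        if lower <= value && value <= upper {
            Some(Self { lower, upper, value })
        } else {
            None
        }
    }

    fn value(&self) -> i64 {
        self.value
    }

    fn bounds(&self) -> (i64, i64) {
        (self.lower, self.upper)
    }

    /// Tighten the upper bound, clamping the estimate so the invariant holds.
    /// Returns `None` if the new bound would make the range empty.
    fn with_upper(self, new_upper: i64) -> Option<Self> {
        let upper = self.upper.min(new_upper);
        if upper < self.lower {
            return None;
        }
        Some(Self { lower: self.lower, upper, value: self.value.min(upper) })
    }
}
```

Because callers cannot set the fields directly, every value that exists has passed the check, which is the property asked for above.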

crepererum · 2024-11-11T14:05:07Z

Why does Estimate::Range have both bounds and a value? What's the value meant to be, e.g. if a parquet data source tells you that the interval is 42..=1337?

berkaysynnada · 2024-11-11T15:27:38Z

> Why does Estimate::Range have both bounds and a value? What's the value meant to be, e.g. if a parquet data source tells you that the interval is 42..=1337?

If only bounds are given, the value could be its mean value maybe? Assuming a uniform distribution should not hurt.

crepererum · 2024-11-12T08:57:16Z

> > Why does Estimate::Range have both bounds and a value? What's the value meant to be, e.g. if a parquet data source tells you that the interval is 42..=1337?
>
> If only bounds are given, the value could be its mean value maybe? Assuming a uniform distribution should not hurt.

I don't think we get a mean value from parquet for example. So that would be a rather opinionated assumption. Also note that this is somewhat hard or even impossible to calculate for some types (e.g. strings)

berkaysynnada · 2024-11-12T10:54:57Z

> I don't think we get a mean value from parquet for example. So that would be a rather opinionated assumption. Also note that this is somewhat hard or even impossible to calculate for some types (e.g. strings)

Then do we need 4 states? -- Both bounds and estimation, only bounds, only estimation, and neither one

crepererum · 2024-11-12T11:53:02Z

> > I don't think we get a mean value from parquet for example. So that would be a rather opinionated assumption. Also note that this is somewhat hard or even impossible to calculate for some types (e.g. strings)
>
> Then do we need 4 states? -- Both bounds and estimation, only bounds, only estimation, and neither one

Yeah, if you want to have bounds AND a point estimator, then you need a larger state space, something like:

struct Precision<T> {
  /// Actual values are very close to the given point.
  ///
  /// This can be treated as an over-simplified normal distribution.
  point_estimation: Option<T>,
  
  /// Lower bound for an open/half-open/closed range.
  ///
  /// If given, the bound is INCLUSIVE. The bounds may be 
  /// overestimated (i.e. the actual lower value may be larger) 
  /// but if provided, all values are included in this range.
  lower: Option<T>,

  /// Upper bound for an open/half-open/closed range.
  ///
  /// If given, the bound is INCLUSIVE. The bounds may be 
  /// overestimated (i.e. the actual upper value may be smaller) 
  /// but if provided, all values are included in this range.
  upper: Option<T>,
}
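
As a sketch of how such a struct might be combined across inputs (for example, the same column in two files), the following keeps only what is still guaranteed. `Stat`, `merge_lower`, `merge_upper`, and `merge_column_bounds` are hypothetical names, and dropping the point estimate on merge is an assumption made for simplicity.

```rust
/// Mirrors the three-field struct above under a different (hypothetical) name.
#[derive(Debug, Clone, PartialEq)]
struct Stat<T> {
    point_estimation: Option<T>,
    lower: Option<T>,
    upper: Option<T>,
}

/// The merged lower bound is the smaller of the two, and is known only if
/// both inputs have one.
fn merge_lower<T: PartialOrd>(a: Option<T>, b: Option<T>) -> Option<T> {
    match (a, b) {
        (Some(x), Some(y)) => Some(if x <= y { x } else { y }),
        _ => None,
    }
}

/// The merged upper bound is the larger of the two, and is known only if
/// both inputs have one.
fn merge_upper<T: PartialOrd>(a: Option<T>, b: Option<T>) -> Option<T> {
    match (a, b) {
        (Some(x), Some(y)) => Some(if x >= y { x } else { y }),
        _ => None,
    }
}

/// Merge the statistics of the same column from two inputs.
fn merge_column_bounds<T: PartialOrd>(a: Stat<T>, b: Stat<T>) -> Stat<T> {
    Stat {
        // Assumption: there is no generic way to combine two point estimates
        // of an arbitrary type (e.g. strings), so the estimate is dropped.
        point_estimation: None,
        lower: merge_lower(a.lower, b.lower),
        upper: merge_upper(a.upper, b.upper),
    }
}
```

Note that a missing bound on either side makes the merged bound unknown, since the unbounded input could contain any value.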

suremarc · 2024-12-11T15:56:10Z

It's been a month and I haven't seen any new proposals. IIUC the main use case for inexact statistics is to estimate num_rows and total_byte_size using estimated selectivity, which itself is "inexact". So basically we need point estimates for those attributes, and exact bounds for the column min/maxes.

Unless I'm misunderstanding, it seems like @crepererum's proposed API accommodates both of these use cases. Another open question is if we should try to unify Interval with Precision, but I think if we guard the internal values we will at least have the option to make this change going forward without breaking anything.

I am interested in getting this change in so that I can resume work on #13296, so I am going to start pre-emptively working on a PR with the new Precision API.

alamb · 2024-12-12T21:28:55Z

> I am interested in getting this change in so that I can resume work on #13296, so I am going to start pre-emptively working on a PR with the new Precision API.

Thank you @suremarc -- I am sorry I am so behind

Note here is another recent potentially related PR from @gatesn

Add sum statistics and PhysicalExpr::column_statistics #13736

alamb · 2024-12-12T21:31:46Z

@suremarc if you are going to work on Statistics, here are some properties I think would be most useful:

Minimize the downstream API impact as much as possible (aka give downstream users a chance to adjust)
Ensure that Statistics are cheaply cloneable (as it is, copying Statistics for tables with many strings shows up often in our profiles for short queries)

It would be really great to consolidate the statistics aggregation code (e.g. that combines statistics across files) into a single struct / location (but that is a good follow on perhaps)
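
On the cheap-clone property, one possible approach (an assumption for illustration, not something proposed in this thread) is to put the per-column data behind an `Arc` so that cloning copies a pointer, and to copy the data only when a clone is modified. `ColumnStats` and `SharedStatistics` are hypothetical names.

```rust
use std::sync::Arc;

/// Hypothetical per-column statistics holding heap-allocated strings.
#[derive(Debug, Clone, PartialEq)]
struct ColumnStats {
    min: Option<String>,
    max: Option<String>,
}

/// Hypothetical statistics container whose clone is a reference-count bump.
#[derive(Debug, Clone)]
struct SharedStatistics {
    num_rows: Option<usize>,
    columns: Arc<Vec<ColumnStats>>,
}

impl SharedStatistics {
    fn new(num_rows: Option<usize>, columns: Vec<ColumnStats>) -> Self {
        Self { num_rows, columns: Arc::new(columns) }
    }

    /// Copy-on-write access: the vector is cloned only if it is shared.
    fn columns_mut(&mut self) -> &mut Vec<ColumnStats> {
        Arc::make_mut(&mut self.columns)
    }
}
```

Cloning a `SharedStatistics` shares the column vector; calling `columns_mut` on one clone leaves the other untouched. The trade-off is an extra indirection and a deep copy on the first write to a shared value.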

alamb added 2 commits November 7, 2024 09:22

Add Precision:AtLeast and Precision::AtMost for more Statistics…

3cc8d8a

… precision

Add column statistics min

f11ecc5

github-actions bot added the common Related to common crate label Nov 7, 2024

alamb mentioned this pull request Nov 7, 2024

Epic: Statistics improvements #8227

Open

19 tasks

alamb mentioned this pull request Nov 8, 2024

Introduce a way to represent constrained statistics / bounds on values in Statistics #8078

Open

suremarc mentioned this pull request Nov 25, 2024

feat: Optimize SortPreservingMergeExec to avoid merging non-overlapping partitions #13296

Draft
