RFC: Add Precision::AtLeast and Precision::AtMost for more Statistics… precision #13293

Draft
wants to merge 2 commits into main
Conversation

alamb
Contributor

alamb commented Nov 7, 2024

Which issue does this PR close?

Discussion:

This is a Request for Comment (and maybe also a POC)

I hacked very briefly on a different approach (Precision::Interval) but found:

  • ColumnStatistics already has a min and max (basically a bound), so introducing an interval into Precision would likely mean we would need to change ColumnStatistics to have a value: Precision<..> as well -- which might be a better choice, but it would be a bigger change
  • The Interval is defined in a different crate, so we can't easily use it in Statistics without a bunch of code

Rationale for this change

For the analysis described in #10316, we need to know if a value is constrained to a range to avoid merging. However, the current Statistics are either Exact or Inexact, so once the precision becomes Inexact we lose the information that the possible minimum / maximum values do not change.

This came up twice for me recently:

  1. @suremarc, @findepi and I spoke about this here: Epic: Statistics improvements #8227 (comment), where we need to know if a value is constrained to a range to avoid merging (see Optimized version of SortPreservingMerge that doesn't actually compare sort keys if the key ranges are ordered #10316)
  2. Yesterday I happened to be working on code in InfluxDB 3.0 that relies on knowing min/max values and I hit influxdata@30d4368 which marks the statistics as "inexact" when filter pushdown is on for parquet, but that loses the key information that the possible minimum / maximum values do not change

I hacked around it downstream, but I think this is all sounding like it is time to add a new Precision enum that allows for this use case.

What changes are included in this PR?

  • Introduce Precision::AtMost and Precision::AtLeast (sketched below)
  • Add ColumnStatistics::min and ColumnStatistics::max that return the correct value
  • Update code to handle new precision
  • Update ParquetExec to use the new Precision variants when filters are on
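
For reference, a hedged sketch of what the extended enum might look like (trait bounds on T omitted; Exact, Inexact, and Absent mirror the existing Precision, while AtMost and AtLeast are the new variants):

pub enum Precision<T> {
    /// The exact value is known.
    Exact(T),
    /// The value is known approximately; the actual value may be
    /// smaller or larger than this estimate.
    Inexact(T),
    /// The value is known to be at most this value (inclusive);
    /// the actual value may be smaller, but never greater.
    AtMost(T),
    /// The value is known to be at least this value (inclusive);
    /// the actual value may be greater, but never smaller.
    AtLeast(T),
    /// Nothing is known about the value.
    Absent,
}

The ParquetExec change then follows from the fact that filtering only removes rows: an exact column minimum m remains valid as AtLeast(m) and an exact maximum M as AtMost(M), rather than both degrading to Inexact.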

Are these changes tested?

Are there any user-facing changes?

github-actions bot added the common (Related to common crate) label Nov 7, 2024
/// The value is known to be at most this value (inclusive).
///
/// The actual value may be smaller, but it is never greater.
AtMost(T),
Contributor Author


This is the main change -- adding these variants

alamb mentioned this pull request Nov 7, 2024
@suremarc
Contributor

suremarc commented Nov 7, 2024

Without knowing too much about the use case for inexact statistics, is it possible we may need both inexact and "precise" upper/lower bounds for column statistics? I.e. a tight, inexact lower/upper bound, and then a looser "real" upper & lower bound.

I can see this causing tension between parts of the codebase that benefit from tighter but inexact bounds and parts that benefit from having correct bounds.

@alamb
Contributor Author

alamb commented Nov 7, 2024

Without knowing too much about the use case for inexact statistics, is it possible we may need both inexact and "precise" upper/lower bounds for column statistics? I.e. a tight, inexact lower/upper bound, and then a looser "real" upper & lower bound.

I can see this causing tension between parts of the codebase that benefit from tighter but inexact bounds and parts that benefit from having correct bounds.

I am also not super sure about the use case for inexact statistics. I think there was some idea that knowing a value was likely close to 1M would be more helpful than simply discarding the values.

However, almost all the operations I can think of (filtering, limit, aggregation) don't make the output range larger than the input.

Maybe we could consider simply removing Precision::Inexact entirely 🤔 So we would only have:

enum Precision {
  Exact,
  AtMost,
  AtLeast,
  Unknown
}

I still do feel like having Precision::Bounded would be ideal to reuse all the existing Interval logic, but that feels like too large a change to me. But maybe not.

I wonder if @berkaysynnada has any thoughts or insights?

@Dandandan
Contributor

Dandandan commented Nov 7, 2024

Inexact is useful for / used by join reordering (number of rows or size in bytes): depending on estimated filter selectivity, we can estimate which side of the join is smaller (e.g. after the filter).
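
To make that concrete, a minimal sketch, assuming Precision::get_value (which returns the inner value for both Exact and Inexact); JoinSide here is a local stand-in for illustration, not DataFusion's type:

// Local stand-in type, for illustration only.
enum JoinSide { Left, Right }

/// Pick the probable smaller (build) side from possibly-inexact row
/// counts; `None` means at least one side has no estimate at all.
fn smaller_side(left: &Precision<usize>, right: &Precision<usize>) -> Option<JoinSide> {
    match (left.get_value(), right.get_value()) {
        // even an inexact estimate beats having no information
        (Some(l), Some(r)) => Some(if l <= r { JoinSide::Left } else { JoinSide::Right }),
        _ => None,
    }
}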

@crepererum
Contributor

crepererum commented Nov 8, 2024

I wonder if range semantics would be nicer, i.e.:

struct Precision<T> {
  lower: Option<T>,
  upper: Option<T>,
}

And then you have:

Big Enum   Lower     Upper
Exact      Some(x)   Some(x)
Inexact    --        --
AtMost     None      Some(x)
AtLeast    Some(x)   None
Absent     None      None
--         Some(x)   Some(y)

If you really want to support inexact estimations, you could extend it like this:

struct Precision<T> {
  lower: Bound<T>,
  upper: Bound<T>,
}

enum Bound<T> {
  Absent,
  Inexact(T),
  Exact(T),
}
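
A hedged sketch of how the two parts could compose (helper name hypothetical):

impl<T: PartialEq> Precision<T> {
    /// The value is exactly known only when both bounds are `Exact`
    /// and coincide.
    fn exact_value(&self) -> Option<&T> {
        match (&self.lower, &self.upper) {
            (Bound::Exact(l), Bound::Exact(u)) if l == u => Some(l),
            _ => None,
        }
    }
}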

@alamb
Contributor Author

alamb commented Nov 8, 2024

I wonder if range semantics would be nicer, i.e.:

That is a good idea.

One thing is that we already have a version of that type of Interval logic (that handles open/closed upper/lower bounds, etc.) here: https://docs.rs/datafusion-expr-common/42.2.0/src/datafusion_expr_common/interval_arithmetic.rs.html#160-163

https://docs.rs/datafusion/latest/datafusion/logical_expr/interval_arithmetic/struct.Interval.html

However, that is hard-coded to use ScalarValue, whereas Precision is generic (used for usize and ScalarValue).

But maybe we can just provide a conversion to/from Interval 🤔
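
A hedged sketch of one direction of that conversion, assuming this PR's AtLeast/AtMost variants; Interval::try_new is the existing constructor, where null endpoints mean unbounded:

use arrow_schema::DataType;
use datafusion_common::{Result, ScalarValue, stats::Precision};
use datafusion_expr_common::interval_arithmetic::Interval;

/// Build an `Interval` for a column from its min/max statistics.
fn bounds_to_interval(
    min: &Precision<ScalarValue>,
    max: &Precision<ScalarValue>,
    data_type: &DataType,
) -> Result<Interval> {
    // a typed null scalar represents an unbounded endpoint
    let unbounded = || ScalarValue::try_from(data_type);
    let lower = match min {
        // exact and at-least minimums are both valid lower bounds
        Precision::Exact(v) | Precision::AtLeast(v) => v.clone(),
        _ => unbounded()?,
    };
    let upper = match max {
        Precision::Exact(v) | Precision::AtMost(v) => v.clone(),
        _ => unbounded()?,
    };
    Interval::try_new(lower, upper)
}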

@findepi
Member

findepi commented Nov 8, 2024

With a single estimated value, the conceptual idea is that the random variable we're estimating has a distribution "condensed" around that value. One can imagine this being a normal distribution with the value being the mean.

Obviously this is an over-simplification. Not everything has a normal distribution (non-negative numbers, for example). But this mental model is easy to work with.

While exact ranges are super useful (for example for predicate derivation and pruning), inexact ranges as a statistics model pose the problem of how to interpret the value when, e.g., judging which side of the join is smaller, or when computing a filter on top of some other computation. It's tempting to capture uncertainty as ever-widening ranges and to finally interpret the range as its middle value.

This is definitely a more complex mental model and it will surely result in the code being more complex. Will it also result in better estimates? Maybe.

There are also other alternatives to consider:

  • Histograms. Some optimizers (e.g. MySQL 8) use them to capture ranges of values together with their cardinality, which allows deriving histograms after applying a range filter (see the sketch after this list)
  • Run-time adaptivity. There is only so much the optimizer can do a priori, before seeing the data. At some point of maturity, making the optimizer smarter doesn't result in queries returning faster. However, managing run-time-detected skew or being able to replan are other very powerful techniques.
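
To illustrate the histogram point from the first bullet, a toy sketch (all names and the within-bucket uniformity assumption are mine, not from the discussion):

/// One equi-width histogram bucket over a numeric column.
struct Bucket { lo: f64, hi: f64, count: u64 }

/// Derive the histogram of `col BETWEEN lo AND hi` from an existing
/// histogram, assuming values are uniform within each bucket.
fn filter_histogram(hist: &[Bucket], lo: f64, hi: f64) -> Vec<Bucket> {
    hist.iter()
        // keep overlapping, non-degenerate buckets only
        .filter(|b| b.hi > lo && b.lo < hi && b.hi > b.lo)
        .map(|b| {
            let (new_lo, new_hi) = (b.lo.max(lo), b.hi.min(hi));
            // scale the count by the covered fraction of the bucket
            let frac = (new_hi - new_lo) / (b.hi - b.lo);
            Bucket { lo: new_lo, hi: new_hi, count: (b.count as f64 * frac).round() as u64 }
        })
        .collect()
}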

@Dandandan
Contributor

Perhaps it makes sense to somehow separate estimated statistics vs bounds (e.g. compute / return both).

In some situations you would like to know a bound (avoiding merge), while in other situations (e.g. join order) you would like to have a point estimate.

@findepi
Member

findepi commented Nov 10, 2024

Absolutely, exact bounds must not be mistaken for estimates/statistics and need to be treated as two different things, even though they look similar.
Known data bounds (min/max or lower/upper bounds) allow pruning.
Statistics are estimates and don't allow pruning.
During planning we should be able to express both: known bounds and estimates (bounds or not).

@crepererum
Contributor

Based on the discussion, another idea:

/// ...
///
/// # Note
/// Unknown ranges are modeled as `Precision::Range { lower: None, upper: None }`.
enum Precision<T> {
  /// Estimated, but very close to the given point.
  ///
  /// This can be treated as an over-simplified normal distribution.
  PointEstimation(T),
  
  /// Value is for sure in the given open/half-open/closed range.
  ///
  /// If `lower`/`upper` are given, they are INCLUSIVE ends.
  Range {
    lower: Option<T>,
    upper: Option<T>,
  },
}

(this might need some more docs and helper methods, but I think you get the idea)
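
For example, one of those helper methods could distinguish guaranteed bounds from mere estimates (a sketch; the method name is hypothetical):

impl<T> Precision<T> {
    /// The tightest value the column is *guaranteed* not to exceed.
    /// A point estimation is not a guarantee, so it yields `None`.
    fn guaranteed_upper(&self) -> Option<&T> {
        match self {
            Precision::PointEstimation(_) => None,
            Precision::Range { upper, .. } => upper.as_ref(),
        }
    }
}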

@berkaysynnada
Contributor

I prefer going with

pub enum Estimate {
    Range { bounds: Interval, value: ScalarValue },
    /// Nothing is known about the value
    #[default]
    Absent,
}

as mentioned there.

We will both be making use of interval arithmetic and eliminating the need for separate bound and estimation statistics. It does not necessitate selecting which kind of stats (range or estimate) you keep.

The only challenge is that we need to guard the internal values and provide all the APIs someone might require. It should never be in an inconsistent state (like the estimated value being out of bounds).
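
A sketch of that guarding; contains_value here stands in for whatever membership test Interval offers:

use datafusion_common::{exec_err, Result, ScalarValue};
use datafusion_expr_common::interval_arithmetic::Interval;

impl Estimate {
    /// Fallible constructor that rejects inconsistent states up front.
    fn try_new(bounds: Interval, value: ScalarValue) -> Result<Self> {
        // assumed membership test; reject an estimate outside the bounds
        if !bounds.contains_value(&value)? {
            return exec_err!("estimate {value} is outside bounds {bounds}");
        }
        Ok(Self::Range { bounds, value })
    }
}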

@crepererum
Contributor

Why does Estimate::Range have both bounds and a value? What's the value meant to be, e.g., if a parquet data source tells you that the interval is 42..=1337?

@berkaysynnada
Contributor

Why does Estimate::Range have both bounds and a value? What's the value meant to be, e.g., if a parquet data source tells you that the interval is 42..=1337?

If only bounds are given, the value could be its mean value maybe? Assuming a uniform distribution should not harm.

@crepererum
Contributor

Why does Estimate::Range have both bounds and a value? What's the value meant to be, e.g., if a parquet data source tells you that the interval is 42..=1337?

If only bounds are given, the value could be its mean value maybe? Assuming a uniform distribution should not harm.

I don't think we get a mean value from parquet, for example. So that would be a rather opinionated assumption. Also note that this is somewhat hard or even impossible to calculate for some types (e.g. strings).

@berkaysynnada
Contributor

I don't think we get a mean value from parquet, for example. So that would be a rather opinionated assumption. Also note that this is somewhat hard or even impossible to calculate for some types (e.g. strings).

Then do we need 4 states? Both bounds and estimation, only bounds, only estimation, and neither one.

@crepererum
Contributor

I don't think we get a mean value from parquet, for example. So that would be a rather opinionated assumption. Also note that this is somewhat hard or even impossible to calculate for some types (e.g. strings).

Then do we need 4 states? Both bounds and estimation, only bounds, only estimation, and neither one.

yeah, if you want to have bounds AND a point estimator, then you need a larger state space, something like:

struct Precision<T> {
  /// Actual values are very close to the given point.
  ///
  /// This can be treated as an over-simplified normal distribution.
  point_estimation: Option<T>,

  /// Lower bound for an open/half-open/closed range.
  ///
  /// If given, the bound is INCLUSIVE. The bound may be
  /// underestimated (i.e. the actual lower value may be larger),
  /// but if provided, all values are included in this range.
  lower: Option<T>,

  /// Upper bound for an open/half-open/closed range.
  ///
  /// If given, the bound is INCLUSIVE. The bound may be
  /// overestimated (i.e. the actual upper value may be smaller),
  /// but if provided, all values are included in this range.
  upper: Option<T>,
}
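
One thing this shape makes mechanical is merging statistics across files. A hedged sketch (the method name and union semantics are my illustration):

impl<T: PartialOrd + Clone> Precision<T> {
    /// Bounds for the union of two row sets: a bound survives only if
    /// both inputs have one; point estimates are dropped conservatively.
    fn union(&self, other: &Self) -> Self {
        let min = |a: &T, b: &T| if a <= b { a.clone() } else { b.clone() };
        let max = |a: &T, b: &T| if a >= b { a.clone() } else { b.clone() };
        Precision {
            point_estimation: None,
            lower: match (&self.lower, &other.lower) {
                (Some(a), Some(b)) => Some(min(a, b)),
                _ => None,
            },
            upper: match (&self.upper, &other.upper) {
                (Some(a), Some(b)) => Some(max(a, b)),
                _ => None,
            },
        }
    }
}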

@suremarc
Contributor

It's been a month and I haven't seen any new proposals. IIUC the main use case for inexact statistics is to estimate num_rows and total_byte_size using estimated selectivity, which itself is "inexact". So basically we need point estimates for those attributes, and exact bounds for the column min/maxes.

Unless I'm misunderstanding, it seems like @crepererum's proposed API accommodates both of these use cases. Another open question is whether we should try to unify Interval with Precision, but I think if we guard the internal values we will at least have the option to make this change going forward without breaking anything.

I am interested in getting this change in so that I can resume work on #13296, so I am going to start pre-emptively working on a PR with the new Precision API.

@alamb
Contributor Author

alamb commented Dec 12, 2024

I am interested in getting this change in so that I can resume work on #13296, so I am going to start pre-emptively working on a PR with the new Precision API.

Thank you @suremarc -- I am sorry I am so behind

Note: here is another recent, potentially related PR from @gatesn

@alamb
Contributor Author

alamb commented Dec 12, 2024

@suremarc if you are going to work on Statistics, here are some properties I think would be most useful:

  1. Minimize the downstream API impact as much as possible (aka give downstream users a chance to adjust)
  2. Ensure that Statistics are cheaply cloneable (as it is, copying Statistics for tables with many strings shows up often in our profiles for short queries; see the sketch below)

It would be really great to consolidate the statistics aggregation code (e.g. that combines statistics across files) into a single struct / location (but that is a good follow-on, perhaps)
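
On point 2, one hypothetical direction is sharing large scalar values instead of deep-copying them; field names follow the existing ColumnStatistics, but this layout is only an illustration:

use std::sync::Arc;

// Hypothetical: storing min/max behind `Arc` makes cloning Statistics
// a reference-count bump, even for wide string columns.
struct ColumnStatistics {
    null_count: Precision<usize>,
    min_value: Precision<Arc<ScalarValue>>,
    max_value: Precision<Arc<ScalarValue>>,
    // ...remaining fields elided...
}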

Successfully merging this pull request may close these issues.

Introduce a way to represent constrained statistics / bounds on values in Statistics