Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement conversion from ColumnStatistics to NullableInterval #10510

Closed
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
140 changes: 127 additions & 13 deletions datafusion/expr/src/interval_arithmetic.rs
Original file line number Diff line number Diff line change
Expand Up @@ -28,7 +28,10 @@ use arrow::compute::{cast_with_options, CastOptions};
use arrow::datatypes::DataType;
use arrow::datatypes::{IntervalUnit, TimeUnit};
use datafusion_common::rounding::{alter_fp_rounding_mode, next_down, next_up};
use datafusion_common::{internal_err, Result, ScalarValue};
use datafusion_common::stats::Precision;
use datafusion_common::{
internal_err, not_impl_err, ColumnStatistics, Result, ScalarValue,
};

macro_rules! get_extreme_value {
($extreme:ident, $value:expr) => {
Expand Down Expand Up @@ -1469,6 +1472,8 @@ pub enum NullableInterval {
MaybeNull { values: Interval },
/// The value is definitely not null, and is within the specified range.
NotNull { values: Interval },
/// No information is known about this intervals values or nullness
Unknown,
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MaybeNull with an unbounded interval corresponds to this Unknown variant, doesn't it ?

Copy link
Contributor Author

@demetribu demetribu May 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's an interesting idea. Do you mean we should use ScalarValue::Null in Interval if the value is Absent?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For me, Interval::MaybeNull{Null, Null} looks strange, honestly.
@jayzhan211 wdyt?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using ScalarValue::Null for representing unbounded intervals is discouraged. Although it's possible to construct these, performing arithmetic on them is not feasible. Could you detect the datatype and instead create ScalarValue::Datatype(None)? This approach is consistent with how we handle unbounded cases throughout the codebase.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I saw is that sometimes ColumnStatistics { ..., min, max, ... } both are equal to Precision::Absent. In such cases, we can't determine the type for Interval.

Copy link
Contributor Author

@demetribu demetribu May 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I tried to play with the stats in CLI mode, I attempted to get them from the ExecutionPlan. Here, there was a default absent value:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I attempted to get them from the ExecutionPlan. Here, there was a default absent value:

I would expect that calling execution_plan.statistics() would match the schema returned from execution_plan.schema()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will re-check tomorrow then

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anywhere we have ColumnStatistics we should also have the schema (and thus the DataType of all columns) -- maybe we just have to pass the schema along with the statistics?

Yes, I think that's the right approach for this issue. Can we open a ticket for this? It should be an easy fix, and ColumnStatistics isn't used in many parts of the code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb
I have conducted some additional tests on various queries and observed the following results:

CREATE TABLE data_table (
    id INT,
    value INT
);
INSERT INTO data_table (id, value) VALUES
(1, 100),
(2, 200),
(3, 300),
(4, 400),
(5, 500),
(6, 600),
(7, 700),
(8, 800),
(9, 900),
(10, 1000);
SELECT id, value FROM data_table WHERE value > 500; 

Log:
Schema: id: Int32, value: Int32
Column 0: ColumnStatistics { null_count: Inexact(0), max_value: Exact(Int32(NULL)), min_value: Exact(Int32(NULL)), distinct_count: Absent }
Column 1: ColumnStatistics { null_count: Inexact(0), max_value: Inexact(Int32(NULL)), min_value: Inexact(Int32(501)), distinct_count: Absent }

SELECT AVG(value) AS average_value, SUM(value) AS total_value FROM data_table; 

Log:
Schema: average_value: Float64, total_value: Int64
Column 0: ColumnStatistics { null_count: Absent, max_value: Absent, min_value: Absent, distinct_count: Absent }
Column 1: ColumnStatistics { null_count: Absent, max_value: Absent, min_value: Absent, distinct_count: Absent }

SELECT a.id AS id_a, a.value AS value_a, b.id AS id_b, b.value AS value_b
FROM data_table a
CROSS JOIN data_table b;

Log:
Schema: id_a: Int32, value_a: Int32, id_b: Int32, value_b: Int32
Column 0: ColumnStatistics { null_count: Exact(0), max_value: Absent, min_value: Absent, distinct_count: Absent }
Column 1: ColumnStatistics { null_count: Exact(0), max_value: Absent, min_value: Absent, distinct_count: Absent }
Column 2: ColumnStatistics { null_count: Exact(0), max_value: Absent, min_value: Absent, distinct_count: Absent }
Column 3: ColumnStatistics { null_count: Exact(0), max_value: Absent, min_value: Absent, distinct_count: Absent }


impl Display for NullableInterval {
Expand All @@ -1479,6 +1484,7 @@ impl Display for NullableInterval {
write!(f, "NullableInterval: {} U {{NULL}}", values)
}
Self::NotNull { values } => write!(f, "NullableInterval: {}", values),
Self::Unknown => write!(f, "NullableInterval: Unknown"),
}
}
}
Expand All @@ -1501,36 +1507,82 @@ impl From<ScalarValue> for NullableInterval {
}
}

impl From<ColumnStatistics> for NullableInterval {
fn from(stats: ColumnStatistics) -> Self {
let null_count = match stats.null_count {
Precision::Exact(value) | Precision::Inexact(value) => value,
Precision::Absent => return NullableInterval::Unknown,
};

let distinct_count = match stats.distinct_count {
Precision::Exact(value) | Precision::Inexact(value) => value,
Precision::Absent => return NullableInterval::Unknown,
};

let lower_value = match stats.min_value {
Precision::Exact(ref value) | Precision::Inexact(ref value) => value.clone(),
Precision::Absent => return NullableInterval::Unknown,
};

let upper_value = match stats.max_value {
Precision::Exact(ref value) | Precision::Inexact(ref value) => value.clone(),
Precision::Absent => return NullableInterval::Unknown,
};

let datatype = lower_value.data_type();

if null_count == 0 {
NullableInterval::NotNull {
values: Interval {
lower: lower_value,
upper: upper_value,
},
}
} else if null_count == distinct_count {
NullableInterval::Null { datatype }
} else {
NullableInterval::MaybeNull {
values: Interval {
lower: lower_value,
upper: upper_value,
},
}
}
}
}

impl NullableInterval {
/// Get the values interval, or None if this interval is definitely null.
pub fn values(&self) -> Option<&Interval> {
match self {
Self::Null { .. } => None,
Self::Null { .. } | Self::Unknown => None,
Self::MaybeNull { values } | Self::NotNull { values } => Some(values),
}
}

/// Get the data type
pub fn data_type(&self) -> DataType {
pub fn data_type(&self) -> Option<DataType> {
match self {
Self::Null { datatype } => datatype.clone(),
Self::MaybeNull { values } | Self::NotNull { values } => values.data_type(),
Self::Null { datatype } => Some(datatype.clone()),
Self::MaybeNull { values } | Self::NotNull { values } => {
Some(values.data_type())
}
Self::Unknown => None,
}
}

/// Return true if the value is definitely true (and not null).
pub fn is_certainly_true(&self) -> bool {
match self {
Self::Null { .. } | Self::MaybeNull { .. } => false,
Self::Null { .. } | Self::MaybeNull { .. } | Self::Unknown => false,
Self::NotNull { values } => values == &Interval::CERTAINLY_TRUE,
}
}

/// Return true if the value is definitely false (and not null).
pub fn is_certainly_false(&self) -> bool {
match self {
Self::Null { .. } => false,
Self::MaybeNull { .. } => false,
Self::Null { .. } | Self::MaybeNull { .. } | Self::Unknown => false,
Self::NotNull { values } => values == &Interval::CERTAINLY_FALSE,
}
}
Expand All @@ -1547,6 +1599,7 @@ impl NullableInterval {
Self::NotNull { values } => Ok(Self::NotNull {
values: values.not()?,
}),
Self::Unknown => Ok(Self::Unknown),
}
}

Expand Down Expand Up @@ -1640,9 +1693,10 @@ impl NullableInterval {
datatype: DataType::Boolean,
})
} else {
Ok(Self::Null {
datatype: self.data_type(),
})
match self.data_type() {
Some(datatype) => Ok(Self::Null { datatype }),
None => not_impl_err!("Cannot determine data type for operation"),
}
}
}
}
Expand Down Expand Up @@ -1714,10 +1768,13 @@ impl NullableInterval {

#[cfg(test)]
mod tests {
use crate::interval_arithmetic::{next_value, prev_value, satisfy_greater, Interval};
use crate::interval_arithmetic::{
next_value, prev_value, satisfy_greater, Interval, NullableInterval,
};

use arrow::datatypes::DataType;
use datafusion_common::{Result, ScalarValue};
use datafusion_common::stats::Precision;
use datafusion_common::{ColumnStatistics, Result, ScalarValue};

#[test]
fn test_next_prev_value() -> Result<()> {
Expand Down Expand Up @@ -3212,7 +3269,64 @@ mod tests {

Ok(())
}
#[test]
fn test_interval_from_column_statistics() {
let stats_null = ColumnStatistics {
null_count: Precision::Exact(10),
max_value: Precision::Exact(ScalarValue::Int32(Some(100))),
min_value: Precision::Exact(ScalarValue::Int32(Some(1))),
distinct_count: Precision::Exact(10),
};
assert_eq!(
NullableInterval::from(stats_null),
NullableInterval::Null {
datatype: DataType::Int32
}
);

let stats_not_null = ColumnStatistics {
null_count: Precision::Exact(0),
max_value: Precision::Exact(ScalarValue::Int32(Some(100))),
min_value: Precision::Exact(ScalarValue::Int32(Some(1))),
distinct_count: Precision::Exact(20),
};
assert_eq!(
NullableInterval::from(stats_not_null),
NullableInterval::NotNull {
values: Interval {
lower: ScalarValue::Int32(Some(1)),
upper: ScalarValue::Int32(Some(100)),
}
}
);

let stats_maybe_null = ColumnStatistics {
null_count: Precision::Exact(5),
max_value: Precision::Exact(ScalarValue::Int32(Some(100))),
min_value: Precision::Exact(ScalarValue::Int32(Some(1))),
distinct_count: Precision::Exact(20),
};
assert_eq!(
NullableInterval::from(stats_maybe_null),
NullableInterval::MaybeNull {
values: Interval {
lower: ScalarValue::Int32(Some(1)),
upper: ScalarValue::Int32(Some(100)),
}
}
);

let stats_unknown = ColumnStatistics {
null_count: Precision::Absent,
max_value: Precision::Exact(ScalarValue::Int32(Some(100))),
min_value: Precision::Exact(ScalarValue::Int32(Some(1))),
distinct_count: Precision::Exact(20),
};
assert_eq!(
NullableInterval::from(stats_unknown),
NullableInterval::Unknown
);
}
#[test]
fn test_interval_display() {
let interval = Interval::make(Some(0.25_f32), Some(0.50_f32)).unwrap();
Expand Down