-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use upstream StatisticsConverter
from arrow-rs in DataFusion
#11479
Conversation
@@ -356,20 +356,24 @@ impl<'a> RowGroupPruningStatistics<'a> { | |||
&'a self, | |||
column: &'b Column, | |||
) -> Result<StatisticsConverter<'a>> { | |||
StatisticsConverter::try_new(&column.name, self.arrow_schema, self.parquet_schema) | |||
Ok(StatisticsConverter::try_new( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is required to get the errors to convert
@@ -521,6 +518,31 @@ macro_rules! get_min_max_values_for_page_index { | |||
}}; | |||
} | |||
|
|||
// Copy from arrow-rs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is needed because the PagesPruningStatistics is not in terms of the StatisticsConverter
code yet. I will try and do that in a separate PR and avoid the need for this
ab2b7ff
to
2756145
Compare
1eb29a4
to
00eef87
Compare
00eef87
to
409023e
Compare
use datafusion::datasource::TableProvider; | ||
use datafusion::execution::object_store::ObjectStoreUrl; | ||
use datafusion::parquet::arrow::arrow_reader::statistics::StatisticsConverter; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
code has been ported to arrow-rs (🙏 @efredine )
} | ||
} | ||
|
||
impl<'a> PruningStatistics for RowGroupPruningStatistics<'a> { | ||
fn min_values(&self, column: &Column) -> Option<ArrayRef> { | ||
self.statistics_converter(column) | ||
.and_then(|c| c.row_group_mins(self.metadata_iter())) | ||
.and_then(|c| Ok(c.row_group_mins(self.metadata_iter())?)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is necessary to convert from ArrowError
to DataFusionError
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we use map_err? or into?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the suggestion.
I played around with a few alternatives and I concluded they were not easier to understand, so I plan to leave it as is. If you feel strongly I will make a follow on PR to change.
I couldn't figure out a way to use into()
Option 1: using map_err
fn min_values(&self, column: &Column) -> Option<ArrayRef> {
self.statistics_converter(column)
.and_then(|c| {
c.row_group_mins(self.metadata_iter())
.map_err(DataFusionError::from)
})
.ok()
}
Option 2: discard error earlier with ok()
Now there are two nested ok()
s
fn min_values(&self, column: &Column) -> Option<ArrayRef> {
self.statistics_converter(column)
.ok()
.and_then(|c| c.row_group_mins(self.metadata_iter()).ok())
}
@@ -1,287 +0,0 @@ | |||
// Licensed to the Apache Software Foundation (ASF) under one |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved to arrow
@Ted-Jiang I wonder if you have a moment to review this PR? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sense to me.
Thank you for the reviews @Ted-Jiang and @comphead |
Which issue does this PR close?
Closes #10922
Closes #11000
Rationale for this change
Now that @efredine has ported this code to arrow in apache/arrow-rs#6046 we can remove the copy in DataFusion
What changes are included in this PR?
Remove code from DataFusion and use the upstream arrow-rs version
Are these changes tested?
Are there any user-facing changes?