
Feature/single source exec #56

Closed
wants to merge 47 commits

Conversation

mertak-synnada

Which issue does this PR close?

Closes apache#13838.

Rationale for this change

This PR merges all data sources into a single execution plan named DataSourceExec, backed by a single trait named DataSource, which is inspired by DataSink.

This version is not intended to be merged upstream as-is, since it removes ParquetExec, CsvExec, etc. and changes all the tests; I'm sharing it in this form so that we can see the full set of changes. Our main intention is to re-add the old execution plans as deprecated ones and implement this part by part to keep API stability.
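The unification described above can be sketched in a very reduced form. The names mirror the PR (DataSource, DataSourceExec), but everything else is a hypothetical stand-in — plain `Vec<i64>` instead of record-batch streams — so the sketch compiles standalone and should not be read as the real DataFusion API.

```rust
/// Stand-in for the `DataSource` trait: each concrete source (memory, file,
/// ...) knows how to open itself, while the single exec node below stays
/// source-agnostic.
trait DataSource {
    fn name(&self) -> &str;
    /// Stand-in for producing a record-batch stream for one partition.
    fn open(&self, partition: usize) -> Vec<i64>;
}

struct MemorySource {
    partitions: Vec<Vec<i64>>,
}

impl DataSource for MemorySource {
    fn name(&self) -> &str {
        "memory"
    }
    fn open(&self, partition: usize) -> Vec<i64> {
        self.partitions[partition].clone()
    }
}

/// Stand-in for `DataSourceExec`: one plan node delegating to any `DataSource`.
struct DataSourceExec {
    source: Box<dyn DataSource>,
}

impl DataSourceExec {
    fn new(source: Box<dyn DataSource>) -> Self {
        Self { source }
    }
    fn execute(&self, partition: usize) -> Vec<i64> {
        self.source.open(partition)
    }
}

fn main() {
    let exec = DataSourceExec::new(Box::new(MemorySource {
        partitions: vec![vec![1, 2], vec![3]],
    }));
    // The exec node never needs to know which concrete source it wraps.
    assert_eq!(exec.execute(0), vec![1, 2]);
    println!("read {} rows from {}", exec.execute(0).len(), exec.source.name());
}
```

The point of the shape is that adding a new source type means implementing one trait, not adding a new ExecutionPlan.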

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@mertak-synnada mertak-synnada changed the title Chore/single source exec Feature/single source exec Jan 7, 2025
@mertak-synnada mertak-synnada marked this pull request as draft January 7, 2025 14:27
fix csv_json example
# Conflicts:
#	datafusion/physical-plan/src/aggregates/mod.rs
# Conflicts:
#	datafusion/core/src/physical_optimizer/replace_with_order_preserving_variants.rs
#	datafusion/physical-plan/src/memory.rs
# Conflicts:
#	datafusion/core/src/datasource/file_format/arrow.rs
#	datafusion/core/src/datasource/file_format/csv.rs
#	datafusion/core/src/datasource/file_format/json.rs
#	datafusion/core/src/datasource/file_format/parquet.rs
#	datafusion/core/src/datasource/memory.rs
#	datafusion/core/src/test/mod.rs
#	datafusion/physical-plan/src/memory.rs
# Conflicts:
#	datafusion/core/src/datasource/memory.rs
#	datafusion/core/src/datasource/physical_plan/arrow_file.rs
#	datafusion/core/src/datasource/physical_plan/avro.rs
#	datafusion/core/src/datasource/physical_plan/csv.rs
#	datafusion/core/src/datasource/physical_plan/file_scan_config.rs
#	datafusion/core/src/datasource/physical_plan/json.rs
#	datafusion/core/src/datasource/physical_plan/parquet/mod.rs
#	datafusion/core/src/physical_optimizer/enforce_distribution.rs
#	datafusion/core/src/physical_optimizer/enforce_sorting.rs
#	datafusion/core/src/physical_optimizer/sanity_checker.rs
#	datafusion/core/src/physical_optimizer/test_utils.rs
#	datafusion/physical-plan/src/memory.rs
#	datafusion/sqllogictest/test_files/group_by.slt
fn with_fetch(&self, limit: Option<usize>) -> Option<Arc<dyn ExecutionPlan>> {
let mut source = Arc::clone(&self.source);
source = source.with_fetch(limit)?;
let cache = source.properties().clone();

Do we need to recompute some properties now that a limit is given?

base_config: FileScanConfig,
metrics: ExecutionPlanMetricsSet,
projected_statistics: Statistics,
cache: PlanProperties,

@jayzhan-synnada jayzhan-synnada Jan 20, 2025


metrics: ExecutionPlanMetricsSet,
projected_statistics: Statistics,
cache: PlanProperties,

Why do we have plan-related information in the config?

I don't think it belongs there.

Instead of new_exec, I think we should add

DataSourceExec::new(config: FileSourceConfig, source: Arc<dyn FileSource>)

If we change any property in the config, such as a file group, we should only change the config itself; the DataSourceExec that relies on the config should pick up the update by itself.

Author

@jayzhan-synnada It was because of some shared logic like repartitioning. But I understand your point; I've moved the statistics & metrics into the lower-level configurations and moved the cache up into DataSourceExec.

I created new_exec as syntactic sugar to avoid duplicating Arc::new(DataSourceExec::new(FileSourceConfig::new(base_config, file_source))), since the config can be either FileSourceConfig or MemorySourceConfig.

Can you please review again?

@jayzhan-synnada jayzhan-synnada Jan 21, 2025


The current dependency structure places the data source/file layer above the physical plan layer. However, the function FileSourceConfig::new() -> DataSourceExec violates this dependency hierarchy, potentially making the modules harder to decouple. In apache#10782 we might want separate modules for catalog, file & data source, table, and so on. The dependency chain is

Catalog -> Schema -> Table -> FileFormat -> QueryPlanner.

If we move FileSourceConfig into a FileFormat crate, we don't expect to import planner structs there, but FileSourceConfig::new_exec has to. In the other direction, it is fine for DataSourceExec::new() to import FileSourceConfig.

Also, I still think the plan properties should be computed when we call DataSourceExec::new(), but currently we compute them when the config is created. The File/DataSource vs. PhysicalPlan abstraction is not clear enough to me 🤔

@jayzhan-synnada jayzhan-synnada Jan 21, 2025


Is it possible to unify memory and file into a single concept?

Author

Is it possible to unify memory and file into a single concept?

I believe the DataSource trait is the best we can get, since their open implementations are totally different and some file-based shared logic lives in FileSourceConfig.

@@ -172,8 +173,7 @@ impl FileFormat for ArrowFormat {
conf: FileScanConfig,
_filters: Option<&Arc<dyn PhysicalExpr>>,
) -> Result<Arc<dyn ExecutionPlan>> {
let exec = ArrowExec::new(conf);
Ok(Arc::new(exec))
Ok(FileSourceConfig::new_exec(conf, Arc::new(ArrowConfig {})))

@jayzhan-synnada jayzhan-synnada Jan 20, 2025


Therefore, I expect we have

DataSourceExec::new(config: FileSourceConfig, source: Arc<dyn FileSource>)

or maybe we don't need FileSourceConfig at all, and can just pass Arc<dyn FileSource> and FileScanConfig around.

PlanProperties should be computed inside DataSourceExec::new. Any change to the source config (e.g. with_fetch or file groups) should trigger recomputation of the properties if required. compute_properties is common plan-related behaviour and should move to DataSourceExec.

Author

Moved compute_properties into DataSourceExec 👍

As we discussed, moving new_exec to DataSourceExec requires importing FileSourceConfig and FileScanConfig in datafusion/physical-plan, but they live in datafusion/core at the moment. Since, as you mentioned, this is a nice-to-have that would require much more code movement, I'm skipping it for now and just recording it here. I'll also mention it when we open the PR upstream, so maybe the community can come up with a better idea.

Here's what we want to achieve:

impl DataSourceExec {
    pub fn from_file_config(base_config: FileScanConfig, file_source: Arc<dyn FileSource>) -> Arc<DataSourceExec> {
        let source = Arc::new(FileSourceConfig::new(base_config, file_source));
        Arc::new(Self::new(source))
    }
}

@@ -261,64 +245,43 @@ impl DataSource for FileSourceConfig {
fn repartitioned(


Could this be a method in DataSourceExec? It is actually transforming one DataSourceExec into another with some configuration information like partitioning or file groups.


@jayzhan-synnada jayzhan-synnada Jan 21, 2025


I have a bold assumption: maybe the DataSource trait is not required, and such methods should belong to DataSourceExec instead.

In AggregateExec, we have StreamType to differentiate the exec stream we need to run:

For DataSource there are only MemoryStream and FileStream, I guess, so the trait is probably unnecessary:

enum StreamType {
    AggregateStream(AggregateStream),
    GroupedHash(GroupedHashAggregateStream),
    GroupedPriorityQueue(GroupedTopKAggregateStream),
}

We can choose the stream type based on some configuration or properties:

    enum DataSourceExecType {
        File,
        Memory,
    }

    fn execute_typed(&self, partition: usize, context: Arc<TaskContext>) -> SendableRecordBatchStream {
        if self.source.source_type() == File {
            FileStream
        } else {
            MemoryStream
        }
    }
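The enum-dispatch alternative sketched above can be made concrete in a standalone form. All names here are hypothetical stand-ins (streams are plain `Vec<i64>`), showing only the dispatch shape: the exec node matches on a source-type tag instead of calling through a `DataSource` trait object.

```rust
// Hypothetical sketch: source kinds as an enum, not a trait.
enum SourceType {
    File,
    Memory,
}

// Stand-ins for FileStream / MemoryStream.
enum SourceStream {
    File(Vec<i64>),
    Memory(Vec<i64>),
}

struct DataSourceExec {
    source_type: SourceType,
    data: Vec<i64>,
}

impl DataSourceExec {
    /// Mirrors the `execute_typed` idea above: pick the stream variant
    /// from the source type instead of dynamic dispatch.
    fn execute_typed(&self) -> SourceStream {
        match self.source_type {
            SourceType::File => SourceStream::File(self.data.clone()),
            SourceType::Memory => SourceStream::Memory(self.data.clone()),
        }
    }
}

fn main() {
    let exec = DataSourceExec {
        source_type: SourceType::Memory,
        data: vec![7, 8],
    };
    match exec.execute_typed() {
        SourceStream::Memory(v) => assert_eq!(v, vec![7, 8]),
        SourceStream::File(_) => unreachable!(),
    }
}
```

The trade-off is the usual one: an enum gives a closed, easily matched set of sources, while a trait keeps the set open for third-party sources.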


I think repartitioned can be restructured like this:

impl DataSource {
   fn repartitioned(
        &self,
        target_partitions: usize,
        repartition_file_min_size: usize
   ) -> Result<Option<Self>> {}

   fn output_partition(&self) -> Partitioning;
}

impl ExecutionPlan for DataSourceExec {
   fn repartitioned(
        &self,
        target_partitions: usize,
        config: &ConfigOptions,
    ) -> datafusion_common::Result<Option<Arc<dyn ExecutionPlan>>> {
       let source = self.source.repartitioned();
       let output_partitioning = source.output_partitioning();
       // Other metadata from DataSource
        self.clone().with_source().with_partitioning()
    }
}

I think a dependency split like this makes more sense:

the DataSource trait deals with the source-specific definitions that differ for each source type, like partitioning, and DataSourceExec computes the plan properties from what the DataSource exposes as function parameters.
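A compiling, simplified version of the split sketched above — names and types are hypothetical stand-ins (partitioning is a plain `usize`, files are byte sizes), not the real DataFusion signatures. The source only knows how to re-split its own data; the exec node owns the plan-level partitioning and recomputes it after repartitioning.

```rust
struct FileSource {
    file_sizes: Vec<usize>,
}

impl FileSource {
    /// Re-split the files into up to `target_partitions` groups; returns
    /// None when repartitioning is not possible or not worthwhile
    /// (stand-in for the trait method proposed above).
    fn repartitioned(&self, target_partitions: usize, min_size: usize) -> Option<FileSource> {
        let total: usize = self.file_sizes.iter().sum();
        if total < min_size || target_partitions <= self.file_sizes.len() {
            return None;
        }
        // Naive even split of the total byte range.
        let chunk = (total + target_partitions - 1) / target_partitions;
        Some(FileSource {
            file_sizes: vec![chunk; target_partitions],
        })
    }

    fn output_partitioning(&self) -> usize {
        self.file_sizes.len()
    }
}

struct DataSourceExec {
    source: FileSource,
    partitioning: usize, // stand-in for PlanProperties
}

impl DataSourceExec {
    fn new(source: FileSource) -> Self {
        // Plan properties are computed here, at plan-construction time.
        let partitioning = source.output_partitioning();
        Self { source, partitioning }
    }

    fn repartitioned(&self, target_partitions: usize) -> Option<DataSourceExec> {
        let source = self.source.repartitioned(target_partitions, 1)?;
        // Recomputing properties lives in the exec node, not the source.
        Some(DataSourceExec::new(source))
    }
}

fn main() {
    let exec = DataSourceExec::new(FileSource { file_sizes: vec![100, 50] });
    assert_eq!(exec.partitioning, 2);
    let repartitioned = exec.repartitioned(4).unwrap();
    assert_eq!(repartitioned.partitioning, 4);
}
```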

file_scan_config: &FileScanConfig,
) -> PlanProperties {
// Equivalence Properties
let eq_properties = EquivalenceProperties::new_with_orderings(schema, orderings);
let eq_properties = EquivalenceProperties::new_with_orderings(schema, orderings)


To compute the plan properties, what we need are the Partitioning and the EquivalenceProperties.

We can add DataSource trait methods that expose them to DataSourceExec.

@@ -89,7 +89,7 @@ This:
2. Constructs the individual output arrays (columns)
3. Returns a `MemoryStream` of a single `RecordBatch` with the arrays

I.e. returns the "physical" data. For other examples, refer to the [`CsvExec`][csv] and [`ParquetExec`][parquet] for more complex implementations.
I.e. returns the "physical" data. For other examples, refer to the [`CsvConfig`][csv] and [`ParquetConfig`][parquet] for more complex implementations.

@alihan-synnada alihan-synnada Jan 21, 2025


These link to the old implementations and might be confusing. There might be other places in the docs or comments where a blanket find-and-replace has broken logical consistency, because ParquetExec and ParquetConfig aren't exactly the same thing.

[ex]: https://github.com/apache/datafusion/blob/a5e86fae3baadbd99f8fd0df83f45fde22f7b0c6/datafusion-examples/examples/custom_datasource.rs#L214C1-L276
[csv]: https://github.com/apache/datafusion/blob/a5e86fae3baadbd99f8fd0df83f45fde22f7b0c6/datafusion/core/src/datasource/physical_plan/csv.rs#L57-L70
[parquet]: https://github.com/apache/datafusion/blob/a5e86fae3baadbd99f8fd0df83f45fde22f7b0c6/datafusion/core/src/datasource/physical_plan/parquet.rs#L77-L104

Author

I will note this in the upstream PR as well. Once merged, I think we can update these links in a follow-up PR.

Comment on lines 109 to 119
let data_type = [
("avro", file_source.downcast_ref::<AvroConfig>().is_some()),
("arrow", file_source.downcast_ref::<ArrowConfig>().is_some()),
("csv", file_source.downcast_ref::<CsvConfig>().is_some()),
("json", file_source.downcast_ref::<JsonConfig>().is_some()),
#[cfg(feature = "parquet")]
(
"parquet",
file_source.downcast_ref::<ParquetConfig>().is_some(),
),
]


It might make more sense to use something like self.source.data_type(), because ideally the implementation here should be unaware of the underlying FileSource implementation. That would also make adding new file formats easier.

Comment on lines 197 to 203
if let Some(csv_conf) = self.source.as_any().downcast_ref::<CsvConfig>() {
return write!(f, ", has_header={}", csv_conf.has_header);
}

#[cfg(feature = "parquet")]
if let Some(parquet_conf) = self.source.as_any().downcast_ref::<ParquetConfig>() {
return match t {


Same as above; maybe something like self.source.fmt_extra(f). These format-specific details would be better contained in their own implementations.
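Both downcast sites above can be replaced by trait methods. A standalone sketch of that suggestion — `file_type` and `fmt_extra` are the hypothetical method names floated in these comments, and the types are simplified stand-ins, not the real DataFusion traits:

```rust
use std::fmt;

// Stand-in for the FileSource trait with the two suggested methods.
trait FileSource {
    /// Short format name, replacing the chain of downcast_ref checks.
    fn file_type(&self) -> &str;
    /// Format-specific display additions; most formats add nothing.
    fn fmt_extra(&self, _f: &mut fmt::Formatter<'_>) -> fmt::Result {
        Ok(())
    }
}

struct CsvConfig {
    has_header: bool,
}

impl FileSource for CsvConfig {
    fn file_type(&self) -> &str {
        "csv"
    }
    fn fmt_extra(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, ", has_header={}", self.has_header)
    }
}

/// Shared display code: never inspects the concrete source type.
struct SourceDisplay<'a>(&'a dyn FileSource);

impl fmt::Display for SourceDisplay<'_> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "file_type={}", self.0.file_type())?;
        self.0.fmt_extra(f)
    }
}

fn main() {
    let s = format!("{}", SourceDisplay(&CsvConfig { has_header: true }));
    assert_eq!(s, "file_type=csv, has_header=true");
}
```

A new format then only implements the trait; the shared display path and any `#[cfg(feature = ...)]` gating stay in the format's own module.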


@jayzhan-synnada jayzhan-synnada left a comment


👍

@mertak-synnada
Author

Opened upstream, closing this one: apache#14224
