Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add FileScanConfig::new() API #10623

Merged
merged 6 commits into from
May 25, 2024
Merged

Add FileScanConfig::new() API #10623

merged 6 commits into from
May 25, 2024

Conversation

alamb
Copy link
Contributor

@alamb alamb commented May 22, 2024

Which issue does this PR close?

Part of #10546

Rationale for this change

While working on #10549 it was cumbersome to create a FileScanConfig -- there are many different fields, many of which need default values which require some thought to figure out.

The number of fields also makes it harder to see what is specifically set and what is the default

This has bothered me for a long time, so I made a PR to fix it.

You can see this in action here #10618

What changes are included in this PR?

  1. Add FileScanConfig::new() that takes required information
  2. Add an example
  3. Add builder style APIs for updating other fields
  4. Update the code to use the new API

Are these changes tested?

Covered by existing tests

Are there any user-facing changes?

Nicer and easier to use API for creating FileScanConfig

@github-actions github-actions bot added the core Core DataFusion crate label May 22, 2024
@alamb alamb force-pushed the alamb/file_scan_config branch from ec6fbb1 to d7e0462 Compare May 22, 2024 17:42
table_partition_cols: vec![],
output_ordering: vec![],
};
let scan_config =
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a pretty good example of what the new API allows for:

  1. Less code (defaults are set)
  2. Makes it clearer what is the default and what is not

/// PartitionedFile::new("file2.parquet", 56),
/// PartitionedFile::new("file3.parquet", 78),
/// ]);
/// ```
#[derive(Clone)]
pub struct FileScanConfig {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the new API and documentation. Note this is entirely backwards compatible (I did not change any fields to non pub, etc)

self
}

/// Add a file as a single group
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having to wrap a single file in a vec![vec![..]] makes sense from the implementation point of view, but most users shouldn't have to worry about this

@alamb alamb marked this pull request as ready for review May 22, 2024 19:21
Copy link
Contributor

@phillipleblanc phillipleblanc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very clean, thank you! This is the first time I've seen both the with_<x> and add_<x> pattern used on a builder-like struct.

Is that something you generally do for all builder-type APIs, or it depends?

@alamb
Copy link
Contributor Author

alamb commented May 23, 2024

Thank you @phillipleblanc

Is that something you generally do for all builder-type APIs, or it depends?

I would say it depends. In this case I hadd a use case for add style in the example I am working on for indexing (you can see this in action here #10618)

However, I can see how we could do something else in this case, so perhaps I should remove the set style APIs 🤔

/// # use arrow_schema::Schema;
/// use datafusion::datasource::listing::PartitionedFile;
/// # use datafusion::datasource::physical_plan::FileScanConfig;
/// # use datafusion_execution::object_store::ObjectStoreUrl;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 maybe I should change this example to use the with style 🤔

Copy link
Contributor Author

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@phillipleblanc upon more consideration I htink you are right that sticking to one pattern (e.g. builder) will be cleaner. I removed the add* apis in f228af32d

Copy link
Contributor

@metegenez metegenez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the effort, this was a necessary API to introduce.

@alamb
Copy link
Contributor Author

alamb commented May 24, 2024

Thank you @metegenez

@alamb alamb merged commit 4709fc6 into apache:main May 25, 2024
23 checks passed
@alamb
Copy link
Contributor Author

alamb commented May 25, 2024

Thanks everyone for the reviews!

@alamb alamb deleted the alamb/file_scan_config branch May 25, 2024 11:00
jayzhan211 pushed a commit to jayzhan211/datafusion that referenced this pull request May 26, 2024
* Add FileScanConfig::new() API, update code to use new API

* Remove add_* api
findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
* Add FileScanConfig::new() API, update code to use new API

* Remove add_* api
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate substrait
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants