Extract catalog API to separate crate, change TableProvider::scan to take a trait rather than SessionState #11516

Merged
alamb merged 1 commit into apache:main from findepi/catalog-api
Jul 26, 2024

Conversation

findepi
Member

@findepi findepi commented Jul 17, 2024

This moves CatalogProvider, TableProvider, and SchemaProvider to a new datafusion-catalog crate. The circular dependency between core's SessionState and the provider implementations is broken by introducing a CatalogSession trait, passed as a dyn trait object. Implementations of TableProvider that reside in core currently reach SessionState by downcasting the CatalogSession. This is supposed to be an intermediate step.
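
A minimal sketch of the indirection described above, assuming heavily simplified, synchronous stand-ins for the real traits. The two modules stand in for the two crates; `session_id`, `as_any`, `batch_size`, and the module names are illustrative assumptions, not the actual DataFusion API.

```rust
// Stand-in for the new `datafusion-catalog` crate: it knows nothing about core.
mod catalog_crate {
    use std::any::Any;

    /// Simplified stand-in for the `CatalogSession` trait (methods are hypothetical).
    pub trait CatalogSession {
        fn session_id(&self) -> &str;
        /// Escape hatch that lets a caller downcast to a concrete session type.
        fn as_any(&self) -> &dyn Any;
    }

    /// Simplified, synchronous stand-in for `TableProvider`.
    pub trait TableProvider {
        fn scan(&self, state: &dyn CatalogSession) -> Result<String, String>;
    }
}

// Stand-in for the core crate: it depends on the catalog crate, never the reverse.
mod core_crate {
    use super::catalog_crate::{CatalogSession, TableProvider};
    use std::any::Any;

    pub struct SessionState {
        pub session_id: String,
        pub batch_size: usize, // hypothetical core-only detail
    }

    impl CatalogSession for SessionState {
        fn session_id(&self) -> &str {
            &self.session_id
        }
        fn as_any(&self) -> &dyn Any {
            self
        }
    }

    /// A provider that lives in core and still needs the concrete state.
    pub struct ListingTable;

    impl TableProvider for ListingTable {
        fn scan(&self, state: &dyn CatalogSession) -> Result<String, String> {
            // Recover the concrete type; fail cleanly for any other session.
            let state = state
                .as_any()
                .downcast_ref::<SessionState>()
                .ok_or_else(|| "expected the core SessionState".to_string())?;
            Ok(format!("scan[{}] batch_size={}", state.session_id, state.batch_size))
        }
    }
}
```

The trait definitions never name `SessionState`, so the cycle is gone at the crate level; only the provider that downcasts still needs to see the concrete type.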

Part of #10782
Relates to #11420

@github-actions github-actions bot added the documentation, core, and sqllogictest labels Jul 17, 2024
@findepi
Member Author

findepi commented Jul 17, 2024

This tries to implement the idea expressed in #10782 (comment)

@findepi findepi marked this pull request as draft July 17, 2024 14:19
@findepi findepi force-pushed the findepi/catalog-api branch 4 times, most recently from d5f5a77 to 5bbf5d7 Compare July 17, 2024 14:26
@findepi findepi marked this pull request as ready for review July 17, 2024 14:28
@findepi findepi force-pushed the findepi/catalog-api branch 2 times, most recently from 8ad5111 to 0430691 Compare July 17, 2024 16:50
@findepi findepi force-pushed the findepi/catalog-api branch from 0430691 to 6c399c2 Compare July 17, 2024 17:18
@findepi
Member Author

findepi commented Jul 17, 2024

Eventually all green.

@jayzhan211 @alamb @comphead you may want to take a look

@jayzhan211
Contributor

I think CatalogSession is not the minimum dependency. What is the reason for replacing every SessionState with CatalogSession?

@findepi
Member Author

findepi commented Jul 18, 2024

@jayzhan211 admittedly I made some choices about what to put into CatalogSession (or whatever we want to call it) and what to keep outside, requiring a downcast to SessionState.
My line of thought was to let CatalogSession contain things that are separate modules and not optional dependencies, so e.g. logical and physical expressions and plans were allowed based on that.

Of course, these choices remain pretty arbitrary. There may be things that CatalogSession should have but doesn't have yet, and also things that it has but that should be left out.
I will need your help, and that of others, to improve those decisions (provided this is the way to go at all).

@jayzhan211
Contributor

jayzhan211 commented Jul 18, 2024

let CatalogSession contain things that are separate modules and not optional dependencies

I think this is similar to the minimum dependencies I have in mind. The only difference is that I propose a struct, TableProviderContext, whereas you have a trait, CatalogSession.

IMO a struct is more suitable for storing data, unlike a trait, which is mainly used for sharing behavior.

Given the dependencies of TableProvider that I found, we might need to find a smaller context for FileFormatFactory and QueryPlanner.

trait TableProvider {
    async fn scan(
        &self,
        context: &TableProviderContext,
        projection: Option<&Vec<usize>>,
        filters: &[Expr],
        limit: Option<usize>,
    ) -> Result<Arc<dyn ExecutionPlan>>;
}

struct TableProviderContext {
    FileFormatFactory
    SessionConfig
    ExecutionProps
    RuntimeEnv
    QueryPlanner
}

pub trait QueryPlanner {
    /// Given a `LogicalPlan`, create an [`ExecutionPlan`] suitable for execution
    async fn create_physical_plan(
        &self,
        logical_plan: &LogicalPlan,
        context: &QueryPlannerContext,
    ) -> Result<Arc<dyn ExecutionPlan>>;
}

pub trait FileFormatFactory: Sync + Send + GetExt {
    /// Initialize a [FileFormat] and configure based on session and command level options
    fn create(
        &self,
        context: &FileFormatContext,
        format_options: &HashMap<String, String>,
    ) -> Result<Arc<dyn FileFormat>>;

    /// Initialize a [FileFormat] with all options set to default values
    fn default(&self) -> Arc<dyn FileFormat>;
}

But before that, we need to ensure those contexts won't end up as the same thing in the long term.

@findepi
Member Author

findepi commented Jul 19, 2024

I think this is similar to the minimum dependencies I have in mind. The only difference is that I propose a struct, TableProviderContext, whereas you have a trait, CatalogSession.

IMO a struct is more suitable for storing data, unlike a trait, which is mainly used for sharing behavior.

A publicly constructible struct is indeed different from a trait. I'm not sure we want that though, as it would limit evolvability.
I thought about a trait because of its resemblance to interfaces in other languages. To me it seems suited for defining contracts, and here we're defining a contract between core and the table providers.
I don't feel strongly though, even less so given my limited experience.
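
To make the evolvability point concrete, here is a small sketch under stated assumptions: `Session`, `LegacySession`, and the two settings are hypothetical names chosen for illustration, not DataFusion items. It contrasts adding a defaulted trait method with adding a field to a publicly constructible struct.

```rust
/// Hypothetical context trait. `target_partitions` is imagined to have been
/// added in a later release, with a default body.
pub trait Session {
    fn batch_size(&self) -> usize;

    /// Because of the default body, implementors written before this method
    /// existed keep compiling without changes.
    fn target_partitions(&self) -> usize {
        1
    }
}

/// An implementor written before `target_partitions` was introduced.
pub struct LegacySession;

impl Session for LegacySession {
    fn batch_size(&self) -> usize {
        8192
    }
}

/// A publicly constructible struct: every field is public, so downstream
/// code can (and will) build it with a struct literal.
pub struct TableProviderContext {
    pub batch_size: usize,
    // Adding `pub target_partitions: usize` here would break every
    // `TableProviderContext { batch_size: 8192 }` literal downstream,
    // unless the struct is `#[non_exhaustive]` or only built via a constructor.
}
```

A required trait method without a default would be just as breaking for implementors, so the trait only helps as long as additions come with defaults.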

IMO a bigger question is what functionality this new entity provides.
If we go for composability, I would want to think about different table providers as add-ons or plugins.
Then, maybe FileFormatFactory could be moved out of core, as it is not inherently part of SQL execution?
The answer would determine whether FileFormatFactory is exposed in TableProviderContext/CatalogSession.

cc @alamb it would be good to hear your opinion, especially since you created issue #10782, which this PR is in response to.

@jayzhan211
Contributor

jayzhan211 commented Jul 19, 2024

Given the current FileFormatFactory implementation in DataFusion, we only need TableOptions. We can replace SessionState with a FileFormatContext that holds TableOptions, and FileFormatFactory can then move out of the crate that hosts SessionState.

QueryPlanner is much more complex: there is schema_for_ref, which returns a SchemaProvider, so there is a circular dependency on TableProvider.
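
A rough sketch of what such a narrow context could look like, assuming drastically reduced stand-in types. `FileFormatContext` follows the proposal above; the single-field `TableOptions`, `CsvFormat`, `CsvFormatFactory`, and the `delimiter` option key are hypothetical and only illustrate the shape.

```rust
use std::collections::HashMap;

/// Hypothetical, heavily reduced stand-in for `TableOptions`.
#[derive(Clone, Debug)]
pub struct TableOptions {
    pub csv_delimiter: char,
}

/// The narrow context: only what the factory actually needs.
pub struct FileFormatContext {
    pub table_options: TableOptions,
}

/// Hypothetical concrete format produced by the factory.
#[derive(Debug, PartialEq)]
pub struct CsvFormat {
    pub delimiter: char,
}

/// Simplified factory trait that no longer mentions `SessionState`.
pub trait FileFormatFactory {
    fn create(
        &self,
        context: &FileFormatContext,
        format_options: &HashMap<String, String>,
    ) -> Result<CsvFormat, String>;
}

pub struct CsvFormatFactory;

impl FileFormatFactory for CsvFormatFactory {
    fn create(
        &self,
        context: &FileFormatContext,
        format_options: &HashMap<String, String>,
    ) -> Result<CsvFormat, String> {
        // A command-level option overrides the session-level default.
        let delimiter = match format_options.get("delimiter") {
            Some(value) => {
                let mut chars = value.chars();
                match (chars.next(), chars.next()) {
                    (Some(c), None) => c, // exactly one character
                    _ => return Err(format!("invalid delimiter: {value:?}")),
                }
            }
            None => context.table_options.csv_delimiter,
        };
        Ok(CsvFormat { delimiter })
    }
}
```

With this shape the factory crate only needs the options type, not the crate that owns the session.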

TableProvider -> SessionState -> QueryPlanner -> SchemaProvider -> TableProvider

@findepi
Member Author

findepi commented Jul 19, 2024

At runtime there are circular dependencies, unless we further limit the amount of functionality available to the TableProvider.
However, I think we actually solve the circular dependencies with indirection. CatalogSession is supposed to be a POC for that indirection.

@jayzhan211
Contributor

jayzhan211 commented Jul 19, 2024

I doubt that CatalogSession solves all the dependency issues. CatalogSession is a higher-level trait that sits above core, while the actual implementations still live inside core. That is not a problem as long as the implementations stay in core. But if we want to move a TableProvider implementation out of core into datasource, we still need to downcast CatalogSession to SessionState, so SessionState has to live in the same crate as datasource, and we end up bringing them all into the same crate. The same issue applies to other implementations, such as Catalog. Even without CatalogSession, we are already able to move TableProvider + CatalogProvider + SessionState + other related things out of core into their own crate; what is left to decouple is the dependencies between TableProvider, CatalogProvider, and SessionState. Therefore, I don't think CatalogSession is enough 🤔

@jayzhan211
Contributor

jayzhan211 commented Jul 19, 2024

I think there is a clear dependency chain, something close to
Catalog -> Schema -> Table -> FileFormat -> QueryPlanner.
They all have a trait and a trait implementation, and we can easily downcast the trait to the actual struct in the lower crate.

// queryplanner crate
trait QueryPlanner {}
impl QueryPlanner for DefaultPlanner {}

// fileformat crate
trait FileFormat {}
impl FileFormat for DefaultFile {
    // FileFormatContext that has all the higher level concept things above FileFormat
    fn scan(&self, FileFormatContext) -> Arc<QueryPlanner> { /* able to downcast to planner */ }
}

// Table crate
trait TableProvider {}
impl TableProvider for DefaultTable {
    // arguments are all higher level trait without circular dependency
    fn get_file(&self, plan) -> Arc<QueryPlanner> { /* able to downcast to actual planner */ }
    fn get_planner(&self, fileformat) -> Arc<FileFormat> { /* able to downcast to actual format */ }

    // TableProviderContext that has all the higher level concept things above TableProvider
    fn scan(&self, TableProviderContext)
}

Here is one example that I think we should fix. We look up the specific table from SessionState and insert the query result into that table, which violates the dependency order I have in mind.

            LogicalPlan::Dml(DmlStatement {
                table_name,
                op: WriteOp::InsertInto,
                ..
            }) => {
                let name = table_name.table();
                let schema = session_state.schema_for_ref(table_name.clone())?;
                if let Some(provider) = schema.table(name).await? {
                    let input_exec = children.one()?;
                    provider
                        .insert_into(session_state, input_exec, false)
                        .await?
                } else {
                    return exec_err!("Table '{table_name}' does not exist");
                }
            }

Instead, I think we should provide insert_plan_to_table on SchemaProvider, and perform the insert in SchemaProvider rather than inside QueryPlanner. Split the logic into two steps:

  1. Create the insert plan in QueryPlanner.
  2. Call insert_plan_to_table on SchemaProvider.

Not sure if this idea works in all places 🤔
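
The two-step split could be sketched roughly as follows. This is a synchronous toy model, not the DataFusion API: only the name `insert_plan_to_table` comes from the proposal, while `InsertPlan`, `MemorySchema`, `plan_insert`, and `execute_insert` are hypothetical helpers introduced for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a physical plan that produces rows to insert.
#[derive(Debug, Clone, PartialEq)]
pub struct InsertPlan {
    pub rows: Vec<i64>,
}

/// Simplified, synchronous stand-in for `SchemaProvider`.
pub trait SchemaProvider {
    /// Method named after the proposal; returns the number of rows written.
    fn insert_plan_to_table(&mut self, table: &str, plan: InsertPlan) -> Result<usize, String>;
}

/// Toy schema that stores each table as a vector of integers.
#[derive(Default)]
pub struct MemorySchema {
    pub tables: HashMap<String, Vec<i64>>,
}

impl SchemaProvider for MemorySchema {
    fn insert_plan_to_table(&mut self, table: &str, plan: InsertPlan) -> Result<usize, String> {
        // Table lookup happens here, not in the planner.
        let rows = self
            .tables
            .get_mut(table)
            .ok_or_else(|| format!("Table '{table}' does not exist"))?;
        let written = plan.rows.len();
        rows.extend(plan.rows);
        Ok(written)
    }
}

/// Step 1: the planner only builds the plan; it never resolves a table.
pub fn plan_insert(values: &[i64]) -> InsertPlan {
    InsertPlan { rows: values.to_vec() }
}

/// Step 2: the caller hands the finished plan to the schema.
pub fn execute_insert(
    schema: &mut dyn SchemaProvider,
    table: &str,
    values: &[i64],
) -> Result<usize, String> {
    let plan = plan_insert(values);
    schema.insert_plan_to_table(table, plan)
}
```

In this shape the planner has no path back to the table provider, which is the dependency direction the comment argues for.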

@alamb
Contributor

alamb commented Jul 20, 2024

I plan to check this PR out carefully tomorrow

@alamb alamb changed the title from "Extract catalog API to separate crate" to "Extract catalog API to separate crate, change TableProvider::scan to take a trait rather than SessionState" Jul 20, 2024
Contributor

@alamb alamb left a comment

Thank you @findepi and @jayzhan211

I think this API is quite clever and solves a long-standing problem (the circular dependency between Catalog -> TableProvider -> SessionState). Like many elegant solutions, the design is obvious once you see it, but it was not before.

I also think this API makes the usecase described in #11193 from @cisaacson (providing the same SessionState information via more APIs) straightforward.

I think using a trait (rather than a struct) to break the dependency follows the same pattern used elsewhere in DataFusion (e.g. ScalarUDFImpl).

Next Steps

First I would like to hear what @jayzhan211 thinks

If he thinks we should proceed, I think some extra communication is warranted before we merge, since this PR proposes to change a very widely used API in DataFusion:

  • mark this PR as api-change and change the title to highlight the change to TableProvider
  • communicate this proposal over the mailing list / Slack / Discord to maximize the chance that anyone with feedback can share it
  • make sure it is left open for several days for feedback
  • test this change with some downstream users (e.g. I can test this with InfluxDB 3.0 and see if the migration to the new API goes smoothly)

datafusion/catalog/src/session.rs (outdated, resolved)
/// to return as a final result.
async fn scan(
&self,
state: &dyn CatalogSession,
Contributor

I think this one line is the biggest potential change in this PR, as it will require changing all existing TableProvider implementations, and TableProvider is one of the oldest and most widely used APIs in DataFusion.

If we are ok with this change, I think the rest of this PR is wonderful

Contributor

I checked in the Spice AI codebase, and this will be a relatively minor refactoring for us. I am ok with this change 👍
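
For a provider that only reads session settings, the migration may be as small as the following sketch suggests. The types are simplified stand-ins and the `batch_size` accessor is a hypothetical example of a trait method; the real `scan` is async and takes more parameters.

```rust
/// Simplified stand-in for core's concrete session type.
pub struct SessionState {
    pub batch_size: usize,
}

/// Simplified stand-in for the new trait; its single method is hypothetical.
pub trait CatalogSession {
    fn batch_size(&self) -> usize;
}

impl CatalogSession for SessionState {
    fn batch_size(&self) -> usize {
        self.batch_size
    }
}

/// Before: the provider is tied to the concrete `SessionState`.
pub fn scan_before(state: &SessionState) -> String {
    format!("scan with batch_size={}", state.batch_size)
}

/// After: only the parameter type and the field access change.
pub fn scan_after(state: &dyn CatalogSession) -> String {
    format!("scan with batch_size={}", state.batch_size())
}
```

Call sites that already hold a `&SessionState` can keep passing it, because `&SessionState` coerces to `&dyn CatalogSession`. Providers that need something outside the trait would still have to downcast.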

datafusion/catalog/src/session.rs (resolved)
datafusion/core/src/lib.rs (outdated, resolved)
@alamb
Contributor

alamb commented Jul 20, 2024

At runtime there are circular dependencies, unless we further limit the amount of functionality available to the TableProvider. However, I think we actually solve the circular dependencies with indirection. CatalogSession is supposed to be a POC for that indirection.

Yes I agree with this

Instead, I think we should provide insert_plan_to_table on SchemaProvider, and perform the insert in SchemaProvider rather than inside QueryPlanner. Split the logic into two steps

I also think this sounds like a reasonable plan, and I think it becomes easier as we split the logic out and define the API via traits.

@alamb
Contributor

alamb commented Jul 20, 2024

BTW I agree with most of @jayzhan211's ideas above, but in my mind they are follow-ons / orthogonal to moving the catalog traits out of core (as in we can do both!)

@alamb
Contributor

alamb commented Jul 24, 2024

I am feeling even better about this PR. I propose we plan to merge it in 2 days time (Friday) unless there are objections or other comments to address. Thank you @findepi @Omega359 @cisaacson @jayzhan211 and @phillipleblanc for your help so far

@alamb
Contributor

alamb commented Jul 24, 2024

@findepi also mentioned Session as an option which seems reasonable to me too

I slightly prefer SessionProvider as I think it is more consistent with the naming conventions of other traits in DataFusion such as TableProvider, CatalogProvider, SchemaProvider, etc. (however, there are plenty of counterexamples like ExprSchema).

However, I don't feel super strongly and would be interested in opinions from others if we want to have a Provider on the end of the (now named) Session trait

@findepi
Member Author

findepi commented Jul 24, 2024

re: SessionProvider
The trait describes something that doesn't provide a session; it describes something that is a session.
However, the same thing can be said about TableProvider or CatalogProvider. I don't have an answer to the consistency argument...

I can change it, this is not a big deal. @alamb @jayzhan211 @cisaacson @phillipleblanc @Omega359 let's just make sure we have agreement to avoid unnecessary back and forth.

@alamb
Contributor

alamb commented Jul 24, 2024

Oh no, I broke the doc test -- sorry @findepi

@alamb
Contributor

alamb commented Jul 25, 2024

Oh no, I broke the doc test -- sorry @findepi

Here is a proposed fix: findepi#5

@findepi findepi force-pushed the findepi/catalog-api branch from 64e6c20 to 9210b48 Compare July 25, 2024 12:29
@findepi
Member Author

findepi commented Jul 25, 2024

Rebased on current main, no other changes.
I couldn't locally reproduce the build problems observed on CI.

@alamb
Contributor

alamb commented Jul 25, 2024

Rebased on current main, no other changes. I couldn't locally reproduce the build problems observed on CI.

I think the failure is due to a new Rust version (if you run rustup update it would likely start failing for you too).

#11651

Not related to the changes in this PR

@findepi
Member Author

findepi commented Jul 25, 2024

I think the failure is due to a new Rust version (if you run rustup update it would likely start failing for you too).

#11651

It's now fixed and the build is green, but there is a new conflict. Will rebase.

@findepi findepi force-pushed the findepi/catalog-api branch from a4d282a to dbb8905 Compare July 25, 2024 21:33
@findepi
Member Author

findepi commented Jul 25, 2024

( squashed and rebased, no other changes )

@alamb
Contributor

alamb commented Jul 26, 2024

Let's do it. go go go 🚀

Thanks again everyone for your comments and help. I think this PR finally breaks open the path to separate out the last monolithic knot.

@alamb alamb merged commit 2eaf1ea into apache:main Jul 26, 2024
27 checks passed
@findepi findepi deleted the findepi/catalog-api branch July 26, 2024 14:59
@findepi
Member Author

findepi commented Jul 26, 2024

🎉 thanks for all the review feedback and the merge!

@berkaysynnada
Contributor

Are we tracking this TODO?

// TODO remove downcast_ref from here. Should file format factory be an extension to session state?

@findepi
Member Author

findepi commented Aug 2, 2024

@berkaysynnada yes, #11600

@alamb
Contributor

alamb commented Aug 2, 2024

@berkaysynnada yes, #11600

I suggest adding a comment in the code with a link to the ticket to help others find it.

@findepi
Member Author

findepi commented Aug 2, 2024

good idea! #11784

Michael-J-Ward added a commit to Michael-J-Ward/datafusion-python that referenced this pull request Aug 10, 2024
Catalog API was extracted to a separate crate.

Ref: apache/datafusion#11516
@phillipleblanc
Contributor

Looks like this PR didn't get tagged with api-change, so it got put under the "Documentation Updates" section in the v41 Changelog: https://github.com/apache/datafusion/blob/main/dev/changelog/41.0.0.md?plain=1#L94

@alamb
Contributor

alamb commented Aug 14, 2024

Sorry -- thanks for the heads up @phillipleblanc

Michael-J-Ward added a commit to Michael-J-Ward/datafusion-python that referenced this pull request Aug 20, 2024
Catalog API was extracted to a separate crate.

Ref: apache/datafusion#11516
andygrove pushed a commit to apache/datafusion-python that referenced this pull request Aug 23, 2024
* update datafusion deps to point to github.com/apache/datafusion

Datafusion 41 is not yet released on crates.io.

* update TableProvider::scan

Ref: apache/datafusion#11516

* use SessionStateBuilder

The old constructor is deprecated.

Ref: apache/datafusion#11403

* update AggregateFunction

Upstream Changes:
- The field name was switched from `func_name` to `func`.
- AggregateFunctionDefinition was removed

Ref: apache/datafusion#11803

* update imports in catalog

Catalog API was extracted to a separate crate.

Ref: apache/datafusion#11516

* use appropriate path for approx_distinct

Ref: apache/datafusion#11644

* migrate AggregateExt to ExprFunctionExt

Also removed `sqlparser` dependency since it's re-exported upstream.

Ref: apache/datafusion#11550

* update regr_count tests for new return type

Ref: apache/datafusion#11731

* migrate from function-array to functions-nested

The package was renamed upstream.

Ref: apache/datafusion#11602

* cargo fmt

* lock datafusion deps to 41

* remove todo from cargo.toml

All the datafusion dependencies are re-exported, but I still need to figure out *why*.
Labels: core, documentation, sqllogictest