-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Epic] Extract catalog functionality from the core to make it more modular #10782
Comments
If this seems like a reasonable idea to people I will file tickets to break down the work cc @andygrove @jayzhan211 @comphead @mustafasrepo for your thoughts |
I agree with this direction. But now this seems hard to achieve because |
Thanks @alamb for starting this discussion. Like @lewiszlw correctly mentioned, we got some coupling between providers and the core. I'm just trying to understand the usecase when providers needed without the core |
Is it due to the complexity of |
In my mind the real use usecase is to more easily use datafusion without having to bring in all the dependencies of LIstingTable (like parquet, avro, json, etc) So the real usecase is getting ListingTable out of the core. But since the catalog API is in the core now there is no way to get ListingTable out of the core without also first moving the catalog API
I think both the complexity of ListingTable but also because if its dependency tree (e.g. parquet-rs and avro and json and object_store and ...) For use cases like |
I agree @lewiszlw -- well put. I made a first PR to start detangling things here: #10794 (it just splits Longer term we would have to figure out where Maybe we could look into splitting out |
Hm... they probably thrive to have their own readers/writes perhaps other than arrow-rs implementation, that makes sense for me. And yes, if DF stands for extensibility we should make this happen. Not sure how difficult that can be though. We probably need to start with replacing core abstractions with traits instead of implementations to decouple it. |
Yes something like this -- I think most of the traits already exist (e.g. |
I'm interesting on how datafusion can support vector index like Duckdb's VSS extension Given the comment, it seems the issue here might be helpful 🤔 ? |
There is a dead lock that block us from extracting out catalog 😆
SessionState contains tons of concept that are not all necessary for pub struct SessionState {
/// A unique UUID that identifies the session
session_id: String,
/// Responsible for analyzing and rewrite a logical plan before optimization
analyzer: Analyzer,
/// Provides support for customising the SQL planner, e.g. to add support for custom operators like `->>` or `?`
expr_planners: Vec<Arc<dyn ExprPlanner>>,
/// Responsible for optimizing a logical plan
optimizer: Optimizer,
/// Responsible for optimizing a physical execution plan
physical_optimizers: PhysicalOptimizer,
/// Responsible for planning `LogicalPlan`s, and `ExecutionPlan`
query_planner: Arc<dyn QueryPlanner + Send + Sync>,
/// Collection of catalogs containing schemas and ultimately TableProviders
catalog_list: Arc<dyn CatalogProviderList>,
/// Table Functions
table_functions: HashMap<String, Arc<TableFunction>>,
/// Scalar functions that are registered with the context
scalar_functions: HashMap<String, Arc<ScalarUDF>>,
/// Aggregate functions registered in the context
aggregate_functions: HashMap<String, Arc<AggregateUDF>>,
/// Window functions registered in the context
window_functions: HashMap<String, Arc<WindowUDF>>,
/// Deserializer registry for extensions.
serializer_registry: Arc<dyn SerializerRegistry>,
/// Holds registered external FileFormat implementations
file_formats: HashMap<String, Arc<dyn FileFormatFactory>>,
/// Session configuration
config: SessionConfig,
/// Table options
table_options: TableOptions,
/// Execution properties
execution_props: ExecutionProps,
/// TableProviderFactories for different file formats.
///
/// Maps strings like "JSON" to an instance of [`TableProviderFactory`]
///
/// This is used to create [`TableProvider`] instances for the
/// `CREATE EXTERNAL TABLE ... STORED AS <FORMAT>` for custom file
/// formats other than those built into DataFusion
///
/// [`TableProvider`]: crate::datasource::provider::TableProvider
table_factories: HashMap<String, Arc<dyn TableProviderFactory>>,
/// Runtime environment
runtime_env: Arc<RuntimeEnv>,
/// [FunctionFactory] to support pluggable user defined function handler.
///
/// It will be invoked on `CREATE FUNCTION` statements.
/// thus, changing dialect o PostgreSql is required
function_factory: Option<Arc<dyn FunctionFactory>>,
} |
🤔 we could potentially move SessionState into the catalog crate In fact the more I think about it, it seems to make some sort of sense to have a crate with the catalog traits / interfaces such as Then we could make separate crates that actually had the implementations like
And we could put the individual datasources like
🤔 |
Let's try moving SessionState, Catalog, TableProvider together into |
Update: #11403 has been merged and I think building SessionState is much nicer now |
The dependencies in core is quite complex, take a note for it Draft the dependency graph, incomplete graph TD;
CatalogProvider --> TableProvider
TableProvider --> SessionState
TableProvider --> ExecutionPlan
SessionState --> PhysicalOptimizer
PhysicalOptimizer --> ExecutionPlan
SessionState --> QueryPlanner
QueryPlanner --> ExecutionPlan
QueryPlanner --> SessionState
SessionState --> CatalogProviderList
CatalogProviderList --> CatalogProvider
SessionState --> TableFunction
SessionState --> FileFormatFactory
FileFormatFactory --> FileFormat
FileFormat --> SessionState
FileFormat --> ExecutionPlan
SessionState --> SessionConfig
SessionState --> TableProviderFactory
TableProviderFactory --> TableProvider
SessionState --> RuntimeEnv
RuntimeEnv --> MemoryPool
RuntimeEnv --> DiskManager
RuntimeEnv --> CacheManager
RuntimeEnv --> ObjectStoreRegistry
SessionState --> FunctionFactory
FunctionFactory --> TableProvider
TableFunction --> TableProvider
Circular found. It means they should be in the same crate graph TD;
CatalogProvider --> TableProvider
TableProvider --> SessionState
SessionState --> TableFunction
TableFunction --> TableProvider
graph TD;
CatalogProvider --> TableProvider
TableProvider --> SessionState
SessionState --> CatalogProviderList
CatalogProviderList --> CatalogProvider
graph TD;
CatalogProvider --> TableProvider
TableProvider --> SessionState
SessionState --> FileFormatFactory
FileFormatFactory --> FileFormat
FileFormat --> SessionState
graph TD;
CatalogProvider --> TableProvider
TableProvider --> SessionState
SessionState --> QueryPlanner
QueryPlanner --> SessionState
CatalogProvider + TableProvider + SessionsState + FileFormat + QueryPlanner should be moved to same crate These 3 has no circular dependencies which mean we can/should move them out of core before moving
|
This graph is amazing btw, first it shows how tightly coupled we are, the second is it gives really nice understandable picture |
Current dependencies for TableProvider that are used in datafusion graph TD;
CatalogProvider --> TableProvider
TableProvider --> QueryPlanner
TableProvider --> SessionConfig
TableProvider --> ExecutionProps
TableProvider --> RuntimeEnv
TableProvider --> FileFormat
FileFormat --> SessionState
QueryPlanner --> SessionState
DataFrameTableProvider: create_physical_plan SessionState is mainly used in create_physical_plan, which I guess most of them are configs or functions 🤔 |
I agree -- I think we can pull the physical optimzier rules into their own crate. The challenge would be any back dependencies (e.g. if any depend directly on execution plans in the core like
I think this is already out of core (in
Likewise, this is already in |
Shall we file a ticket / epic to pull physical optimizer out? |
newcomer here, so maybe i say something stupid, please bear with me |
|
I had a proposal in #11420 to minimize dependency with small struct. I think replacing it with another smaller trait are similar idea, both looks good to me |
I don't know what i am doing, but here is what i had in mind: #11516 |
If we're planning to replace |
This will require moving file formats ( |
We could try to move them out to a crate |
I think this would be ideal The parquet one in particular is quite large. The others are not as large |
@findepi Is there any blocker of moving catalog / datasource out of core? |
@jayzhan211 not aware of any blockers, but not working on this right now, if this is what you're asking about. |
@jonathanc-n For example, if we want to move CsvExec out, we’ll also need to move FileScanConfig and FileCompressionType. To extract FileScanConfig, we’ll need to move PartitionedFile as well. This process will reveal that we must move the entire datasource::listing module first, and whenever we need |
Oh i see, should we be putting priority on this soon? or should this be visited back another time? |
I will help to review with this. |
I went through the thread and I couldn't figure out if we had identified/agreed on a set of issues that compose this epic @alamb |
In my opinion, this would qualify as "Things that make DataFusion easier to work with" (details here) and thus I will also try and find time to help this project along (both reviews as well as project planning) I have been annoyed recently at the amount of time it takes to compile datafusion
Here is what I suggest:
I took a quick look and I think there are two things we could move to
Perhaps we can file tickets for those items and I can link them to this epic? |
Is your feature request related to a problem or challenge?
As @goldmedal started trying to move the DynamicFileProvider so others could use it in #10745 I think it is clear that there is not a good way to add additional catalog support in the core without everything being intertwined.
Thus I think we should try and extract the different catalog providers out of datafusion core so it it easier
Describe the solution you'd like
I suggest the following final layout:
CatalogProvider
,SchemaProvider
, etc in a new cratedatafusion-catalog
(since these traits rely on table provider, etc I think this can't be indatafusion-common
ordatafusion-expr
)Memory*
providers are indatafusion-catalog
InformationSchema
providers are indatafusion-catalog
DynamicFileCatalog
indatafusion-catalog
datafusion-catalog-listing
Describe alternatives you've considered
No response
Additional context
No response
The text was updated successfully, but these errors were encountered: