Skip to content

Commit

Permalink
chore: fix catalog's usage docs error and add missing trait content
Browse files Browse the repository at this point in the history
  • Loading branch information
tanruixiang committed Sep 18, 2023
1 parent 93f78b2 commit 826d8b0
Showing 1 changed file with 42 additions and 7 deletions.
49 changes: 42 additions & 7 deletions docs/source/library-user-guide/catalogs.md
Original file line number Diff line number Diff line change
Expand Up @@ -23,17 +23,17 @@ This section describes how to create and manage catalogs, schemas, and tables in

## General Concepts

Catalogs, schemas, and tables are organized in a hierarchy. A catalog contains schemas, and a schema contains tables.
Cataloglist, Catalogs, schemas, and tables are organized in a hierarchy. A cataloglist contains catalogs, a catalog contains schemas and a schema contains tables.

DataFusion comes with a basic in memory catalog functionality in the [`catalog` module]. You can use these in memory implementations as is, or extend DataFusion with your own catalog implementations, for example based on local files or files on remote object storage.

[`catalog` module]: https://docs.rs/datafusion/latest/datafusion/catalog/index.html

Similarly to other concepts in DataFusion, you'll implement various traits to create your own catalogs, schemas, and tables. The following sections describe the traits you'll need to implement.

The `CatalogProvider` trait has methods to set a schema to a name, get a schema by name, and list all schemas. The `SchemaProvider`, which can be registered with a `CatalogProvider`, has methods to set a table to a name, get a table by name, list all tables, deregister a table, and check for a table's existence. The `TableProvider` trait has methods to scan underlying data and use it in DataFusion. The `TableProvider` trait is covered in more detail [here](./custom-table-providers.md).
The `CatalogList` trait has methods to register new catalogs, get a catalog by name and list all catalogs .The `CatalogProvider` trait has methods to set a schema to a name, get a schema by name, and list all schemas. The `SchemaProvider`, which can be registered with a `CatalogProvider`, has methods to set a table to a name, get a table by name, list all tables, deregister a table, and check for a table's existence. The `TableProvider` trait has methods to scan underlying data and use it in DataFusion. The `TableProvider` trait is covered in more detail [here](./custom-table-providers.md).

In the following example, we'll implement an in memory catalog, starting with the `SchemaProvider` trait as we need one to register with the `CatalogProvider`.
In the following example, we'll implement an in memory catalog, starting with the `SchemaProvider` trait as we need one to register with the `CatalogProvider`. Finally we will implement `CatalogList` to register the `CatalogProvider`.

## Implementing `MemorySchemaProvider`

Expand Down Expand Up @@ -116,14 +116,13 @@ impl SchemaProvider for Schema {
}
```

### Implementing `MemoryCatalogProvider`
## Implementing `MemoryCatalogProvider`

As mentioned, the `CatalogProvider` can manage the schemas in a catalog, and the `MemoryCatalogProvider` is a simple implementation of the `CatalogProvider` trait. It stores schemas in a `DashMap`.

```rust
pub struct MemoryCatalogList {
/// Collection of catalogs containing schemas and ultimately TableProviders
pub catalogs: DashMap<String, Arc<dyn CatalogProvider>>,
pub struct MemoryCatalogProvider {
schemas: DashMap<String, Arc<dyn SchemaProvider>>,
}
```

Expand Down Expand Up @@ -170,10 +169,46 @@ impl CatalogProvider for MemoryCatalogProvider {

Again, this is fairly straightforward, as there's an underlying data structure to store the state, via key-value pairs.

## Implementing `MemoryCatalogList`

```rust
pub struct MemoryCatalogList {
/// Collection of catalogs containing schemas and ultimately TableProviders
pub catalogs: DashMap<String, Arc<dyn CatalogProvider>>,
}
```

With that the `CataLogList` trait can be implemented.
```rust
impl CatalogList for MemoryCatalogList {
fn as_any(&self) -> &dyn Any {
self
}

fn register_catalog(
&self,
name: String,
catalog: Arc<dyn CatalogProvider>,
) -> Option<Arc<dyn CatalogProvider>> {
self.catalogs.insert(name, catalog)
}

fn catalog_names(&self) -> Vec<String> {
self.catalogs.iter().map(|c| c.key().clone()).collect()
}

fn catalog(&self, name: &str) -> Option<Arc<dyn CatalogProvider>> {
self.catalogs.get(name).map(|c| c.value().clone())
}
}
```
Like other traits, it also maintains the mapping of the Catalog's name to the CatalogProvider.

## Recap

To recap, you need to:

1. Implement the `TableProvider` trait to create a table provider, or use an existing one.
2. Implement the `SchemaProvider` trait to create a schema provider, or use an existing one.
3. Implement the `CatalogProvider` trait to create a catalog provider, or use an existing one.
4. Implement the `CataLogList` trait to create a cataloglist, or use an existing one.

0 comments on commit 826d8b0

Please sign in to comment.