Commit 257e2d4 by dzlab, Oct 10, 2024: added sections for ast and relation algebra steps (`_posts/2024-07-17-sql-processor-calcite.md`)

In a [previous article]({{ "database/2024/07/06/apache-calcite/" | absolute_url }}), we saw how to create an Adapter for Apache Calcite and then how to run SQL queries against a random data source. In this article we will see, [step by step](https://github.com/zabetak/slides/blob/master/2021/boss-workshop/apache-calcite-tutorial.pdf), how to use Apache Calcite to implement a SQL processor that parses an input query, validates it, and then executes it.

As an example, we will use the following simple `JOIN` query between two tables, `customer` and `orders`.

```sql
SELECT `C_NAME`, `O_ORDERKEY`, `O_ORDERDATE`
Expand All @@ -24,71 +24,98 @@ WHERE `CUSTOMER`.`c_custkey` < 3
ORDER BY `C_NAME`, `O_ORDERKEY`
```

## Catalog
We need to build the catalog of metadata for Calcite to be able to resolve the tables and columns referenced in the query.

First, we need to create the root schema and type factory:

```java
// Create the root schema and type factory
CalciteSchema schema = CalciteSchema.createRootSchema(false);
RelDataTypeFactory typeFactory = new JavaTypeFactoryImpl();
```

Then, create the metadata of the two tables (columns and data types) and register them with the root schema:

```java
RelDataTypeFactory.Builder builder1 = typeFactory.builder();
builder1.add("c_custkey", typeFactory.createJavaType(Integer.class).getSqlTypeName());
builder1.add("c_name", typeFactory.createJavaType(String.class).getSqlTypeName());
schema.add("customer", new MyTable(builder1.build(), ...){...});

RelDataTypeFactory.Builder builder2 = typeFactory.builder();
builder2.add("o_orderkey", typeFactory.createJavaType(Integer.class).getSqlTypeName());
builder2.add("o_custkey", typeFactory.createJavaType(Integer.class).getSqlTypeName());
builder2.add("o_orderdate", typeFactory.createJavaType(Date.class).getSqlTypeName());
schema.add("orders", new MyTable(builder2.build(), ...){...});
```

> Note: `MyTable` is a placeholder for the actual class that accesses the data and implements Calcite's `Table` / `ScannableTable` interfaces.
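
For illustration, here is one way such a table class could look. This minimal sketch (the class name, constructor, and in-memory backing are assumptions for this example, not part of Calcite) implements `ScannableTable` over a list of rows:

```java
import java.util.List;

import org.apache.calcite.DataContext;
import org.apache.calcite.linq4j.Enumerable;
import org.apache.calcite.linq4j.Linq4j;
import org.apache.calcite.rel.type.RelDataType;
import org.apache.calcite.rel.type.RelDataTypeFactory;
import org.apache.calcite.schema.ScannableTable;
import org.apache.calcite.schema.impl.AbstractTable;

// Hypothetical minimal table: row type comes from the builder,
// data comes from an in-memory list of rows.
public class MyTable extends AbstractTable implements ScannableTable {
  private final RelDataType rowType;
  private final List<Object[]> rows;

  public MyTable(RelDataType rowType, List<Object[]> rows) {
    this.rowType = rowType;
    this.rows = rows;
  }

  @Override public RelDataType getRowType(RelDataTypeFactory typeFactory) {
    return rowType;
  }

  @Override public Enumerable<Object[]> scan(DataContext root) {
    return Linq4j.asEnumerable(rows);
  }
}
```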

After that, configure and instantiate a catalog reader that Calcite can use to access the metadata:

```java
CalciteConnectionConfig readerConfig = CalciteConnectionConfig.DEFAULT
.set(CalciteConnectionProperty.CASE_SENSITIVE, "false");
CalciteCatalogReader catalogReader = new CalciteCatalogReader(schema, Collections.emptyList(), typeFactory, readerConfig);
```

> Note: we set case-sensitivity to false so that it is OK to use all-uppercase table and column names.

## Query to AST
To parse the text query into an Abstract Syntax Tree (AST), we first create a SQL parser:

```java
SqlParser parser = SqlParser.create(sqlQuery);
```

Then, we can use it to parse the query into an AST as follows:

```java
SqlNode parseAst = parser.parseQuery();
System.out.println("[Parsed query]");
System.out.println(parseAst.toString());
```

We can get back the original query from the AST with `parseAst.toString()`.


Once we have the AST, we can validate it against the catalog.
First, create a SQL validator using the standard operator table and default configuration.

```java
SqlValidator sqlValidator = SqlValidatorUtil.newValidator(SqlStdOperatorTable.instance(),
catalogReader, typeFactory, SqlValidator.Config.DEFAULT);
```

Now we can validate the initial AST:

```java
SqlNode validAst = sqlValidator.validate(parseAst);
System.out.println("[Validated query]");
System.out.println(validAst.toString());
```

Similarly, we can get back the original query from the validated AST with `validAst.toString()`.
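
As a side note, if the SQL text is needed in a specific dialect rather than Calcite's default, a `SqlNode` can unparse itself. A small sketch (the choice of `AnsiSqlDialect` here is an assumption for illustration):

```java
import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.sql.dialect.AnsiSqlDialect;
import org.apache.calcite.sql.parser.SqlParser;
import org.apache.calcite.sql.util.SqlString;

public class UnparseExample {
  public static void main(String[] args) throws Exception {
    SqlNode ast = SqlParser.create("SELECT 1").parseQuery();
    // Unparse the AST back to SQL text using an explicit dialect.
    SqlString sqlString = ast.toSqlString(AnsiSqlDialect.DEFAULT);
    System.out.println(sqlString.getSql());
  }
}
```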

## AST to Logical plan
Query optimization cannot be applied directly to an AST; the latter must first be converted into a relational algebra expression.


First, create the optimization cluster that maintains planning information:

```java
RelOptPlanner planner = new VolcanoPlanner();
planner.addRelTraitDef(ConventionTraitDef.INSTANCE);
RelOptCluster cluster = RelOptCluster.create(planner, new RexBuilder(typeFactory));
```

Then, configure and instantiate an AST-to-Logical-plan converter with the default configuration, standard expression normalization, and no view expansion (the `NOOP_EXPANDER`):

```java
RelOptTable.ViewExpander NOOP_EXPANDER = (type, query, schema, path) -> null;
SqlToRelConverter sqlToRelConverter = new SqlToRelConverter(NOOP_EXPANDER,
sqlValidator, catalogReader, cluster,
StandardConvertletTable.INSTANCE,
SqlToRelConverter.config());
```

Now, we can convert the validated AST into a logical plan and print it to standard output:
```java
RelNode logPlan = sqlToRelConverter.convertQuery(validAst, false, true).rel;
System.out.println(
  RelOptUtil.dumpPlan("[Logical plan]", logPlan, SqlExplainFormat.TEXT, SqlExplainLevel.EXPPLAN_ATTRIBUTES)
);
```

We should see a **Logical plan** that looks like this:

```
LogicalSort(sort0=[$0], sort1=[$1], dir0=[ASC], dir1=[ASC])
  LogicalProject(C_NAME=[$1], O_ORDERKEY=[$8], O_ORDERDATE=[$12])
    LogicalFilter(condition=[<($0, 3)])
      LogicalJoin(condition=[=($0, $9)], joinType=[inner])
        LogicalTableScan(table=[[CUSTOMER]])
        LogicalTableScan(table=[[ORDERS]])
```

## Logical to Physical plan
We need to optimize the Logical Plan and convert it to a plan that can be executed by the underlying storage system.

First, initialize optimizer/planner with the necessary rules that will be used to transform the Logical Plan:

```java
RelOptPlanner planner = cluster.getPlanner();
planner.addRule(CoreRules.FILTER_TO_CALC);
Expand All @@ -120,23 +149,30 @@ planner.addRule(EnumerableRules.ENUMERABLE_CALC_RULE);
planner.addRule(EnumerableRules.ENUMERABLE_JOIN_RULE);
planner.addRule(EnumerableRules.ENUMERABLE_TABLE_SCAN_RULE);
```

Next, define the type of the output plan; in this case we want a physical plan in the `EnumerableConvention`:

```java
logPlan = planner.changeTraits(logPlan, logPlan.getTraitSet().replace(EnumerableConvention.INSTANCE));
planner.setRoot(logPlan);
```

Start the optimization process to obtain the most efficient physical plan based on the provided rule set:

```java
EnumerableRel phyPlan = (EnumerableRel) planner.findBestExp();
```

We can then display the **Physical plan**:

```java
System.out.println(
RelOptUtil.dumpPlan("[Physical plan]", phyPlan, SqlExplainFormat.TEXT, SqlExplainLevel.EXPPLAN_ATTRIBUTES)
);
```

Which will give us something like this:

```
EnumerableSort(sort0=[$0], sort1=[$1], dir0=[ASC], dir1=[ASC])
EnumerableCalc(expr#0..16=[{inputs}], C_NAME=[$t1], O_ORDERKEY=[$t8], O_ORDERDATE=[$t12])
    EnumerableHashJoin(condition=[=($0, $9)], joinType=[inner])
      EnumerableCalc(expr#0..8=[{inputs}], expr#9=[3], expr#10=[<($t0, $t9)], proj#0..8=[{exprs}], $condition=[$t10])
        EnumerableTableScan(table=[[CUSTOMER]])
      EnumerableTableScan(table=[[ORDERS]])
```

Or generate a [Graphviz graph](https://graphviz.org/), which would look like this:

![Physical Plan](/assets/2024/07/20240717-physical_plan.svg)
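
A sketch of how such a graph can be produced, assuming the `phyPlan` variable from the previous step: `RelOptUtil.dumpPlan` also supports `SqlExplainFormat.DOT`, whose output can then be rendered with Graphviz's `dot` tool:

```java
// Dump the physical plan in Graphviz DOT format instead of plain text.
String dot = RelOptUtil.dumpPlan("", phyPlan, SqlExplainFormat.DOT, SqlExplainLevel.EXPPLAN_ATTRIBUTES);
java.nio.file.Files.writeString(java.nio.file.Paths.get("physical_plan.dot"), dot);
// Render with: dot -Tsvg physical_plan.dot -o physical_plan.svg
```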

## Physical to Executable plan
The physical plan can now be compiled and executed. At runtime, Calcite needs a `DataContext` to access the schema; a simple schema-only implementation is enough here:
```java
/**
* A simple data context only with schema information.
*/
private static final class SchemaOnlyDataContext implements DataContext {
private final SchemaPlus schema;

SchemaOnlyDataContext(CalciteSchema calciteSchema) {
this.schema = calciteSchema.plus();
}

@Override public SchemaPlus getRootSchema() {
return schema;
}

@Override public JavaTypeFactory getTypeFactory() {
return new JavaTypeFactoryImpl();
}

@Override public QueryProvider getQueryProvider() {
return null;
}

@Override public Object get(final String name) {
return null;
}
}
```

Finally, we compile the generated code and obtain the executable program using Calcite's `EnumerableInterpretable`, then run it with an instance of the `SchemaOnlyDataContext` defined above:

```java
Bindable executablePlan = EnumerableInterpretable.toBindable(
    new HashMap<>(), null, phyPlan, EnumerableRel.Prefer.ARRAY);
for (Object row : executablePlan.bind(new SchemaOnlyDataContext(schema))) {
  System.out.println(row);
}
```
