diff --git a/_posts/2024-07-17-sql-processor-calcite.md b/_posts/2024-07-17-sql-processor-calcite.md
index 088e6b4..a49b61d 100644
--- a/_posts/2024-07-17-sql-processor-calcite.md
+++ b/_posts/2024-07-17-sql-processor-calcite.md
@@ -14,7 +14,7 @@ img_excerpt: assets/logos/Apache_Calcite_Logo.svg
 In a [previous article]({{ "database/2024/07/06/apache-calcite/" | absolute_url }}), we saw how to create an Adapter for Apache Calcite and then how to run SQL queries against a random data source. In this article we will see, [step by step](https://github.com/zabetak/slides/blob/master/2021/boss-workshop/apache-calcite-tutorial.pdf), how to use Apache Calcite to implement a SQL processor that parses an input query, validates it, and then executes it.

-**Query**
+As an example, we will use the following simple `JOIN` query between the two tables `customer` and `orders`:

```sql
SELECT `C_NAME`, `O_ORDERKEY`, `O_ORDERDATE`
@@ -24,71 +24,98 @@ WHERE `CUSTOMER`.`c_custkey` < 3
ORDER BY `C_NAME`, `O_ORDERKEY`
```

+## Catalog
+We need to build the catalog of metadata that Calcite uses to resolve the query.
+
+First, we need to create the root schema and type factory:
+
```java
-// TODO 1. Create the root schema and type factory
CalciteSchema schema = CalciteSchema.createRootSchema(false);
RelDataTypeFactory typeFactory = new JavaTypeFactoryImpl();
```
-// TODO 2. Create the data type for each TPC-H table
-// TODO 3. Add the TPC-H table to the schema
+
+Then, create the metadata of the two tables (columns and data types) and register them with the root schema:
+
```java
-for(TpchTable table: TpchTable.values()) {
-    RelDataTypeFactory.Builder builder = typeFactory.builder();
-    for(TpchTable.Column c: table.columns) {
-        builder.add(c.name, typeFactory.createJavaType(c.type).getSqlTypeName());
-    }
-    String indexPath = Paths.get(DatasetIndexer.INDEX_LOCATION, "tpch", table.name()).toString();
-    schema.add(table.name(), new LuceneTable(indexPath, builder.build()));
-}
+RelDataTypeFactory.Builder builder1 = typeFactory.builder();
+builder1.add("c_custkey", typeFactory.createJavaType(Integer.class).getSqlTypeName());
+builder1.add("c_name", typeFactory.createJavaType(String.class).getSqlTypeName());
+schema.add("customer", new MyTable(builder1.build(), ...){...});
+
+RelDataTypeFactory.Builder builder2 = typeFactory.builder();
+builder2.add("o_orderkey", typeFactory.createJavaType(Integer.class).getSqlTypeName());
+builder2.add("o_custkey", typeFactory.createJavaType(Integer.class).getSqlTypeName());
+builder2.add("o_orderdate", typeFactory.createJavaType(Date.class).getSqlTypeName());
+schema.add("orders", new MyTable(builder2.build(), ...){...});
+```
+
+> Note: `MyTable` is a placeholder for the actual class used to access the data; it must implement Calcite's `Table` / `ScannableTable` interfaces. A minimal sketch of such a class is shown below.
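For illustration, here is one possible sketch of such a class, backed by an in-memory list of rows. The class name, constructor, and `rows` field are illustrative assumptions, not Calcite APIs:

```java
import java.util.List;

import org.apache.calcite.DataContext;
import org.apache.calcite.linq4j.Enumerable;
import org.apache.calcite.linq4j.Linq4j;
import org.apache.calcite.rel.type.RelDataType;
import org.apache.calcite.rel.type.RelDataTypeFactory;
import org.apache.calcite.schema.ScannableTable;
import org.apache.calcite.schema.impl.AbstractTable;

/** Hypothetical table implementation over an in-memory list of rows. */
public class MyTable extends AbstractTable implements ScannableTable {
  private final RelDataType rowType;
  private final List<Object[]> rows;

  public MyTable(RelDataType rowType, List<Object[]> rows) {
    this.rowType = rowType;
    this.rows = rows;
  }

  // Return the row type built with the RelDataTypeFactory.Builder above
  @Override public RelDataType getRowType(RelDataTypeFactory typeFactory) {
    return rowType;
  }

  // A full table scan simply enumerates the in-memory rows
  @Override public Enumerable<Object[]> scan(DataContext root) {
    return Linq4j.asEnumerable(rows);
  }
}
```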
+After that, configure and instantiate a catalog reader that Calcite will use to access the metadata:
+
+```java
+CalciteConnectionConfig readerConfig = CalciteConnectionConfig.DEFAULT
+    .set(CalciteConnectionProperty.CASE_SENSITIVE, "false");
+CalciteCatalogReader catalogReader = new CalciteCatalogReader(schema, Collections.emptyList(), typeFactory, readerConfig);
```

+> Note: we set case-sensitivity to false so that it is OK to use all-uppercase table and column names in the query even though they were registered in lowercase.
+
 ## Query to AST
+To parse the text query into an Abstract Syntax Tree (AST), we first create a SQL parser:

-// TODO 4. Create an SQL parser
```java
SqlParser parser = SqlParser.create(sqlQuery);
```
-// TODO 5. Parse the query into an AST
+
+Then, we can use it to parse the query into an AST as follows:
+
```java
SqlNode parseAst = parser.parseQuery();
-// TODO 6. Print and check the AST
-System.out.println("[Parsed query]");
-System.out.println(parseAst.toString());
```
-// TODO 7. Configure and instantiate the catalog reader
-```java
-CalciteConnectionConfig readerConfig = CalciteConnectionConfig.DEFAULT
-    .set(CalciteConnectionProperty.CASE_SENSITIVE, "false");
-CalciteCatalogReader catalogReader = new CalciteCatalogReader(schema, Collections.emptyList(), typeFactory,
-    readerConfig);
-```
-// TODO 8. Create the SQL validator using the standard operator table and default configuration
+
+We can get back the original query text from the AST with `parseAst.toString()`.
+
+Once we have the AST, we can validate it against the catalog.
+First, create a SQL validator using the standard operator table and the default configuration:
+
```java
SqlValidator sqlValidator = SqlValidatorUtil.newValidator(SqlStdOperatorTable.instance(), catalogReader, typeFactory, SqlValidator.Config.DEFAULT);
```
-// TODO 9. Validate the initial AST
+
+Now we can validate the initial AST:
+
```java
SqlNode validAst = sqlValidator.validate(parseAst);
-System.out.println("[Validated query");
-System.out.println(validAst.toString());
```
+Similarly, we can get back the validated query text with `validAst.toString()`.

 ## AST to Logical plan
+Query optimization cannot be applied to an AST; the latter must first be converted into a relational algebra expression.
+
+First, create the optimization cluster, which maintains planning information:

-// TODO 10. Create the optimization cluster to maintain planning information
-// TODO 11. Configure and instantiate the converter of the AST to Logical plan
-// - No view expansion (use NOOP_EXPANDER)
-// - Standard expression normalization (use StandardConvertletTable.INSTANCE)
-// - Default configuration (SqlToRelConverter.config())
```java
-RelOptCluster cluster = newCluster(typeFactory);
+RelOptPlanner planner = new VolcanoPlanner();
+planner.addRelTraitDef(ConventionTraitDef.INSTANCE);
+RelOptCluster cluster = RelOptCluster.create(planner, new RexBuilder(typeFactory));
+```
+
+Then, configure and instantiate an AST-to-logical-plan converter, with no view expansion, standard expression normalization, and the default configuration:
+
+```java
+RelOptTable.ViewExpander NOOP_EXPANDER = (type, query, schema, path) -> null;
SqlToRelConverter sqlToRelConverter = new SqlToRelConverter(NOOP_EXPANDER, sqlValidator, catalogReader, cluster, StandardConvertletTable.INSTANCE, SqlToRelConverter.config());
```
-// TODO 12. Convert the valid AST into a logical plan
+
+Now, we can convert the validated AST into a logical plan and print it to standard output:
```java
RelNode logPlan = sqlToRelConverter.convertQuery(validAst, false, true).rel;
-// TODO 13. Display the logical plan with explain attributes
@@ -97,8 +124,8 @@ System.out.println(
);
```

+We should see a **Logical plan** that looks like this:
-**Logical plan**
```
LogicalSort(sort0=[$0], sort1=[$1], dir0=[ASC], dir1=[ASC])
  LogicalProject(C_NAME=[$1], O_ORDERKEY=[$8], O_ORDERDATE=[$12])
@@ -109,8 +136,10 @@ LogicalSort(sort0=[$0], sort1=[$1], dir0=[ASC], dir1=[ASC])
```

 ## Logical to Physical plan
+We now need to optimize the logical plan and convert it into a plan that can be executed by the underlying storage system.
+
+First, initialize the optimizer/planner with the rules that will be used to transform the logical plan:

-// TODO 14. Initialize optimizer/planner with the necessary rules
```java
RelOptPlanner planner = cluster.getPlanner();
planner.addRule(CoreRules.FILTER_TO_CALC);
@@ -120,23 +149,30 @@ planner.addRule(EnumerableRules.ENUMERABLE_CALC_RULE);
planner.addRule(EnumerableRules.ENUMERABLE_JOIN_RULE);
planner.addRule(EnumerableRules.ENUMERABLE_TABLE_SCAN_RULE);
```
-// TODO 15. Define the type of the output plan (in this case we want a physical plan in
-// EnumerableContention)
+
+Next, define the type of the output plan; in this case we want a physical plan in the `EnumerableConvention`:
+
```java
logPlan = planner.changeTraits(logPlan, logPlan.getTraitSet().replace(EnumerableConvention.INSTANCE));
planner.setRoot(logPlan);
+```

-// TODO 16. Start the optimization process to obtain the most efficient physical plan based on
-// the provided rule set.
+Then, start the optimization process to obtain the most efficient physical plan based on the provided rule set:
+
+```java
EnumerableRel phyPlan = (EnumerableRel) planner.findBestExp();
+```

-// TODO 17. Display the physical plan
+We can display the **Physical plan**:
+
+```java
System.out.println(
  RelOptUtil.dumpPlan("[Physical plan]", phyPlan, SqlExplainFormat.TEXT, SqlExplainLevel.EXPPLAN_ATTRIBUTES)
);
```
-**Physical plan**
+Which will give us something like this:
+
```
EnumerableSort(sort0=[$0], sort1=[$1], dir0=[ASC], dir1=[ASC])
  EnumerableCalc(expr#0..16=[{inputs}], C_NAME=[$t1], O_ORDERKEY=[$t8], O_ORDERDATE=[$t12])
@@ -146,11 +182,38 @@ EnumerableSort(sort0=[$0], sort1=[$1], dir0=[ASC], dir1=[ASC])
      EnumerableTableScan(table=[[ORDERS]])
```

+Alternatively, we can generate a [Graphviz](https://graphviz.org) graph of the plan.
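One way to do this, as a small variation on the snippet above, is to pass `SqlExplainFormat.DOT` instead of `SqlExplainFormat.TEXT` to the same `dumpPlan` helper:

```java
// Print the physical plan in DOT format, which Graphviz can render
System.out.println(
  RelOptUtil.dumpPlan("[Physical plan]", phyPlan, SqlExplainFormat.DOT, SqlExplainLevel.EXPPLAN_ATTRIBUTES)
);
```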
+Rendering the DOT output with Graphviz's `dot` tool gives a graph like this:

 ![Physical Plan](/assets/2024/07/20240717-physical_plan.svg)

 ## Physical to Executable plan
+The last step is to compile the physical plan into an executable program. The generated code accesses the schema at runtime through a `DataContext`, so we first define a simple implementation that only exposes the schema:
+
```java
/**
 * A simple data context only with schema information.
 */
private static final class SchemaOnlyDataContext implements DataContext {
  private final SchemaPlus schema;

  SchemaOnlyDataContext(CalciteSchema calciteSchema) {
    this.schema = calciteSchema.plus();
  }

  @Override public SchemaPlus getRootSchema() {
    return schema;
  }

  @Override public JavaTypeFactory getTypeFactory() {
    return new JavaTypeFactoryImpl();
  }

  @Override public QueryProvider getQueryProvider() {
    return null;
  }

  @Override public Object get(final String name) {
    return null;
  }
}
```

+Finally, we compile the generated code and obtain the executable program.
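Following the step-by-step tutorial linked in the introduction, a sketch of this last step could look like the following; it uses Calcite's `EnumerableInterpretable` to compile the physical plan into a `Bindable` program and then binds it to the `SchemaOnlyDataContext` defined above (`phyPlan` and `schema` are the variables from the earlier snippets):

```java
// Compile the physical plan into an executable program
Bindable<Object[]> executablePlan = EnumerableInterpretable.toBindable(
    new HashMap<>(),               // no extra parameters
    null,                          // no Spark handler
    phyPlan,
    EnumerableRel.Prefer.ARRAY);   // produce rows as Object[] arrays

// Run the program against the schema and print each result row
for (Object[] row : executablePlan.bind(new SchemaOnlyDataContext(schema))) {
  System.out.println(Arrays.toString(row));
}
```

Running it should print the customer names, order keys, and order dates matching the example query.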