diff --git a/docs/_freeze/posts/ibis-bench/index/execute-results/html.json b/docs/_freeze/posts/ibis-bench/index/execute-results/html.json new file mode 100644 index 000000000000..898afa54a341 --- /dev/null +++ b/docs/_freeze/posts/ibis-bench/index/execute-results/html.json @@ -0,0 +1,16 @@ +{ + \"hash\": \"813ef6319e015f6967cb8a583aba5a9d\", + \"result\": { + \"engine\": \"jupyter\", + \"markdown\": \"---\ntitle: \"Ibis benchmarking: DuckDB, DataFusion, Polars\"\nauthor: \"Cody Peterson\"\ndate: \"2024-06-24\"\nimage: \"figure1.png\"\ncategories:\n  - benchmark\n  - duckdb\n  - datafusion\n  - polars\n---\n\n*The best benchmark is your own workload on your own data*.\n\n## Key considerations\n\nThe purpose of this post is to explore some benchmarking data with Ibis. We'll\ncompare three modern single-node query engines, explore the Ibis API as a great\nchoice for each of them, and discuss the results.\n\n### The benchmark\n\n:::{.callout-important title=\"Not an official TPC-H benchmark\"}\nThis is not an [official TPC-H benchmark](https://www.tpc.org/tpch). We ran a\nderivative of the TPC-H benchmark.\n:::\n\n[The TPC-H benchmark](https://www.tpc.org/tpch) is a benchmark for databases\nand, [increasingly](https://docs.coiled.io/blog/tpch),\n[dataframes](https://pola.rs/posts/benchmarks)! It consists of 22 queries\nacross 8 tables. The SQL (or dataframe) code representing the queries is\ndesigned to test the performance of a query engine on a variety of tasks\nincluding filtering, aggregation, and joins. SQL queries are defined by the\nTPC-H benchmark. We run the SQL queries and equivalent dataframe code via Ibis\nand Polars APIs.\n\nThe data for the benchmark can be generated at any scale factor, which roughly\ncorresponds to the size of the data in memory in gigabytes. For instance, a\nscale factor of 10 would be about 10GB of data in memory.\n\n### The engines, the API, the code\n\nWe'll use three modern single-node OLAP engines\n([DuckDB](https://github.com/duckdb/duckdb),\n[DataFusion](https://github.com/apache/datafusion),\n[Polars](https://github.com/pola-rs/polars)) with the Ibis API via two coding\nparadigms (dataframe and SQL). Ibis provides a consistent API across 20+\nbackends, including these three. We run [SQL\ncode](https://github.com/lostmygithubaccount/ibis-bench/blob/v2.0.0/src/ibis_bench/queries/sql.py)\nthrough Ibis in addition to [dataframe\ncode](https://github.com/lostmygithubaccount/ibis-bench/blob/v2.0.0/src/ibis_bench/queries/ibis.py)\nto get a sense of any overhead in Ibis dataframe code.\n\n:::{.callout-note}\nIbis dataframe code generates SQL for the DuckDB and DataFusion backends and\ngenerates Polars API dataframe code for the Polars backend.\n:::\n\n:::{.callout-note title=\"Honorable mention: chDB\" collapse=\"true\"}\n[chDB](https://github.com/chdb-io/chdb) would be another great single-node OLAP\nengine to benchmark. We don't because it's not currently a backend for Ibis,\nthough [there has been work done to make it\none](https://github.com/ibis-project/ibis/pull/8497).\n\nIf you're interested in contributing to Ibis, a new backend like chDB could be a\ngreat project for you!\n:::\n\n9/22 queries for Ibis with the Polars backend fail from [lack of scalar subquery\nsupport](#failing-polars-queries). Due to this and relatively experimental SQL\nsupport in Polars, we've opted to run on [the Polars API\ndirectly](https://github.com/lostmygithubaccount/ibis-bench/blob/v2.0.0/src/ibis_bench/queries/polars.py)\nin this iteration of the benchmark. 
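To make that concrete, here's a minimal sketch of what \"running the Polars API\ndirectly\" means -- a hand-written LazyFrame query in the spirit of TPC-H query 6,\nnot the exact code used in the benchmark, assuming the `tpch_data` Parquet layout\nshown later in this post:\n\n```python\nimport polars as pl\n\n# lazily scan the lineitem Parquet file(s) -- nothing is read yet\nlineitem = pl.scan_parquet(\"tpch_data/parquet/sf=1/n=1/lineitem/*.parquet\")\n\nres = (\n    lineitem.filter(\n        (pl.col(\"l_shipdate\") >= pl.date(1994, 1, 1))\n        & (pl.col(\"l_shipdate\") < pl.date(1995, 1, 1))\n        & (pl.col(\"l_discount\").is_between(0.05, 0.07))\n        & (pl.col(\"l_quantity\") < 24)\n    )\n    .select((pl.col(\"l_extendedprice\") * pl.col(\"l_discount\")).sum().alias(\"revenue\"))\n    .collect()  # default in-memory engine -- no streaming\n)\n```\n\n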
This is done with the LazyFrames API **and\nno streaming engine** ([per the Polars team's\nrecommendation](https://github.com/pola-rs/polars/issues/16694#issuecomment-2146668559)).\nThis also allows us to compare the performance of the Polars backend through\nIbis with the Polars API directly for the queries that do succeed.\n\n#### Failing queries\n\nQueries fail for one of two reasons:\n\n1. The query doesn't work in the given system\n2. The query otherwise failed on a given run\n\nWe'll note the cases of the first below. The second is usually due to memory\npressure and [will be seen at higher scale\nfactors](#failing-queries-due-to-memory-pressure) throughout the data.\n\n#### Failing DataFusion queries\n\nQueries 16, 21, and 22 fail for the DataFusion backend via Ibis dataframe code,\nand query 16 fails through SQL. Note that [all TPC-H SQL queries successfully\nrun through DataFusion\ndirectly](https://github.com/apache/datafusion-benchmarks) -- Ibis generates SQL\nthat [hits a bug with DataFusion that has already been\nfixed](https://github.com/apache/datafusion/issues/10830). We expect these\nqueries to work in the next iteration of this benchmark coming soon.\n\n#### Failing Polars queries\n\nQueries 11, 13-17, and 20-22 fail for the Polars backend via Ibis dataframe\ncode. These all fail due to lack of scalar subquery support in the backend. I've\n[opened an issue](https://github.com/ibis-project/ibis/issues/9422) for tracking\nand discussion.\n\n:::{.callout-tip title=\"Interested in contributing?\"}\nIncreasing coverage of operations for a backend is a great place to start!\n:::\n\n### How queries are written\n\nSee [the source\ncode](https://github.com/lostmygithubaccount/ibis-bench/tree/v2.0.0/src/ibis_bench/queries)\nfor the exact queries used in this iteration of the benchmark. Polars recently\nupdated their TPC-H queries, so the next iteration of this benchmark would use\nthose.\n\nQueries were adapted from [Ibis TPC-H\nqueries](https://github.com/ibis-project/ibis/tree/main/ibis/backends/tests/tpch)\nand [Polars TPC-H queries](https://github.com/pola-rs/tpch). The first 10 Ibis\ndataframe queries were translated from the Polars dataframe queries, while the\nrest were directly adapted from the Ibis repository. The SQL strings were\nadapted from the Ibis repository.\n\n### How queries are run\n\nSee [the source\ncode](https://github.com/lostmygithubaccount/ibis-bench/tree/v2.0.0) and\n[methodology](https://ibis-bench.streamlit.app/methodology) for more details. In\nshort:\n\n- data is generated as a Parquet file per table\n - standard DuckDB Parquet writer is used\n - data is always downloaded onto a compute instance (no cloud storage reads)\n- decimal types are converted to floats after reading\n - works around several issues\n - in the next iteration of this benchmark, we'll use the `decimal` type\n- each query is run three times per configuration (system, scale factor, instance type)\n- we measure the time to write the results of the query to a Parquet file\n - this includes reading the Parquet file(s) and executing the query\n\n### Biases\n\nMy name is Cody and I'm a Senior Technical Product Manager at [Voltron\nData](https://voltrondata.com). 
I am a contributor to the Ibis project and\nemployed to work on it -- I'm biased in favor of Ibis and the composable data\necosystem.\n\nIbis is [an independently governed open source\nproject](https://github.com/ibis-project/governance) that **is not owned by\nVoltron Data**, though several steering committee members are employed by\nVoltron Data. You can [read more about why Voltron Data supports\nIbis](../why-voda-supports-ibis/index.qmd), in addition to open source projects\nlike [Apache Arrow](https://github.com/apache/arrow) and\n[Substrait](https://github.com/substrait-io/substrait).\n\nVoltron Data is a [Gold Supporter of the DuckDB\nFoundation](https://duckdb.org/foundation) and [has a commercial relationship\nwith DuckDB Labs](https://duckdblabs.com) with regular syncs I tend to attend.\nI also use [MotherDuck](https://motherduck.com) to host our [Ibis analytics\ndashboard data](https://ibis-analytics.streamlit.app).\n\n## Results and analysis\n\nWe'll use Ibis to analyze some of the benchmarking data.\n\n:::{.callout-tip}\nWe'll only look at a small subset of the data in this post.\n\nAll the data is public, so you can follow along with the code and explore the\ndata yourself. You can also see the [Ibis benchmarking Streamlit\napp](https://ibis-bench.streamlit.app) for further analysis.\n:::\n\n\n\n### Reading the data\n\nTo follow along, install the required Python packages:\n\n```bash\npip install gcsfs 'ibis-framework[duckdb]' plotly\n```\n\nThe data is stored in a public Google Cloud Storage (GCS) bucket:\n\n::: {#26f3327a .cell execution_count=3}\n``` {.python .cell-code}\nimport os # <1>\nimport gcsfs # <1>\n\nBUCKET = \"ibis-bench\" # <2>\n\ndir_name = os.path.join(BUCKET, \"bench_logs_v2\", \"cache\") # <3>\n\nfs = gcsfs.GCSFileSystem() # <4>\nfs.ls(dir_name)[-5:] # <5>\n```\n\n::: {.cell-output .cell-output-display execution_count=57}\n```\n['ibis-bench/bench_logs_v2/cache/file_id=b6236086-7fff-4569-8731-b97a635243bd.parquet',\n 'ibis-bench/bench_logs_v2/cache/file_id=cbc0c7b1-e659-4adb-8c80-4077cd4d39ab.parquet',\n 'ibis-bench/bench_logs_v2/cache/file_id=d91454ad-2ddd-408a-bbfd-6b159dd2132b.parquet',\n 'ibis-bench/bench_logs_v2/cache/file_id=debc7203-f366-44d2-94f1-2518e6f7425f.parquet',\n 'ibis-bench/bench_logs_v2/cache/file_id=e875d852-f7e7-473c-9440-92b8f2445f3a.parquet']\n```\n:::\n:::\n\n\n1. Imports\n2. The public GCS bucket name\n3. The directory in the bucket where the data is stored\n4. Create a GCS filesystem object\n5. List the last 5 files in the directory\n\nTo start exploring the data, let's import Ibis and Plotly, set some options, and\nregister the GCS filesystem with the default (DuckDB) backend:\n\n::: {#0590f851 .cell execution_count=4}\n``` {.python .cell-code}\nimport ibis # <1>\nimport plotly.express as px # <2>\n\npx.defaults.template = \"plotly_dark\" # <3>\n\nibis.options.interactive = True # <4>\nibis.options.repr.interactive.max_rows = 22 # <5>\nibis.options.repr.interactive.max_length = 22 # <6>\nibis.options.repr.interactive.max_columns = None # <7>\n\ncon = ibis.get_backend() # <8>\ncon.register_filesystem(fs) # <9>\n```\n:::\n\n\n1. Import Ibis\n2. Import Plotly\n3. Set the Plotly template to dark\n4. Enable interactive mode for Ibis\n5. Set the maximum number of rows to display in interactive mode\n6. Set the maximum length of nested types to display in interactive mode\n7. Set the maximum number of columns to display in interactive mode\n8. Get the default (DuckDB) backend\n9. 
Register the GCS filesystem with the default backend\n\n\n\nNow read the data and take a look at the first few rows:\n\n::: {#c90a9cc3 .cell execution_count=6}\n``` {.python .cell-code}\nt = ( # <1>\n ibis.read_parquet(f\"gs://{dir_name}/file_id=*.parquet\") # <2>\n .mutate( # <3>\n timestamp=ibis._[\"timestamp\"].cast(\"timestamp\"),\n ) # <3>\n .relocate( # <4>\n \"instance_type\",\n \"system\",\n \"sf\",\n \"query_number\",\n \"execution_seconds\",\n \"timestamp\",\n ) # <4>\n .cache() # <5>\n)\nt.head() # <6>\n```\n\n::: {.cell-output .cell-output-display execution_count=60}\n```{=html}\n
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ instance_type  system           sf     query_number  execution_seconds  timestamp                   session_id                            n_partitions  file_type  file_id                                      ┃\n┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ stringstringint64int64float64timestamp(6)uuidint64stringstring                                       │\n├───────────────┼─────────────────┼───────┼──────────────┼───────────────────┼────────────────────────────┼──────────────────────────────────────┼──────────────┼───────────┼──────────────────────────────────────────────┤\n│ n2-standard-4polars-lazy    128169.5036132024-06-10 08:04:31.233704 │ 6708e5d3-2b8c-4ce0-adf8-65ce94e0bff1 │            1parquet  0949deaf-5f8f-4c29-b2ca-934e07173223.parquet │\n│ n2-standard-4ibis-duckdb    64735.8262952024-06-10 21:05:18.423375 │ 9a00385f-22b4-42df-ab3d-c63ed1a33a2e │            1parquet  0949deaf-5f8f-4c29-b2ca-934e07173223.parquet │\n│ n2-standard-4ibis-duckdb    128167.3761962024-06-11 03:44:22.901852 │ acb56c6b-b0d5-4bbc-8791-3542b62bd193 │            1parquet  0949deaf-5f8f-4c29-b2ca-934e07173223.parquet │\n│ n2-standard-4ibis-datafusion1678.6552902024-06-09 20:29:31.833510 │ a07fe07d-7a08-4802-b8ae-918e66e2d868 │            1parquet  0949deaf-5f8f-4c29-b2ca-934e07173223.parquet │\n│ n2-standard-4ibis-duckdb-sql1100.4473252024-06-10 08:11:31.244609 │ d523eec6-d2de-491d-b541-348c6b5bfc65 │            1parquet  0949deaf-5f8f-4c29-b2ca-934e07173223.parquet │\n└───────────────┴─────────────────┴───────┴──────────────┴───────────────────┴────────────────────────────┴──────────────────────────────────────┴──────────────┴───────────┴──────────────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\n1. Assign the table to a variable\n2. Read the Parquet files from GCS\n3. Cast the `timestamp` column to a timestamp type\n4. Reorder the columns\n5. Cache the table to avoid re-reading cloud data\n6. Display the first few rows\n\nWe'll also create a table with details on each instance type including the CPU\ntype, number of cores, and memory in gigabytes:\n\n::: {#c352dd06 .cell execution_count=7}\n``` {.python .cell-code code-fold=\"true\" code-summary=\"Show code to get instance details\"}\ncpu_type_cases = (\n ibis.case()\n .when(\n ibis._[\"instance_type\"].startswith(\"n2d\"),\n \"AMD EPYC\",\n )\n .when(\n ibis._[\"instance_type\"].startswith(\"n2\"),\n \"Intel Cascade and Ice Lake\",\n )\n .when(\n ibis._[\"instance_type\"].startswith(\"c3\"),\n \"Intel Sapphire Rapids\",\n )\n .when(\n ibis._[\"instance_type\"] == \"work laptop\",\n \"Apple M1 Max\",\n )\n .when(\n ibis._[\"instance_type\"] == \"personal laptop\",\n \"Apple M2 Max\",\n )\n .else_(\"unknown\")\n .end()\n)\ncpu_num_cases = (\n ibis.case()\n .when(\n ibis._[\"instance_type\"].contains(\"-\"),\n ibis._[\"instance_type\"].split(\"-\")[-1].cast(\"int\"),\n )\n .when(ibis._[\"instance_type\"].contains(\"laptop\"), 12)\n .else_(0)\n .end()\n)\nmemory_gb_cases = (\n ibis.case()\n .when(\n ibis._[\"instance_type\"].contains(\"-\"),\n ibis._[\"instance_type\"].split(\"-\")[-1].cast(\"int\") * 4,\n )\n .when(ibis._[\"instance_type\"] == \"work laptop\", 32)\n .when(ibis._[\"instance_type\"] == \"personal laptop\", 96)\n .else_(0)\n .end()\n)\n\ninstance_details = (\n t.group_by(\"instance_type\")\n .agg()\n .mutate(\n cpu_type=cpu_type_cases, cpu_cores=cpu_num_cases, memory_gbs=memory_gb_cases\n )\n).order_by(\"memory_gbs\", \"cpu_cores\", \"instance_type\")\n\ncpu_types = sorted(\n instance_details.distinct(on=\"cpu_type\")[\"cpu_type\"].to_pyarrow().to_pylist()\n)\n\ninstance_details\n```\n\n::: {.cell-output .cell-output-display execution_count=61}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━┓\n┃ instance_type    cpu_type                    cpu_cores  memory_gbs ┃\n┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━┩\n│ stringstringint64int64      │\n├─────────────────┼────────────────────────────┼───────────┼────────────┤\n│ n2-standard-2  Intel Cascade and Ice Lake28 │\n│ n2d-standard-2 AMD EPYC                  28 │\n│ c3-standard-4  Intel Sapphire Rapids     416 │\n│ n2-standard-4  Intel Cascade and Ice Lake416 │\n│ n2d-standard-4 AMD EPYC                  416 │\n│ c3-standard-8  Intel Sapphire Rapids     832 │\n│ n2-standard-8  Intel Cascade and Ice Lake832 │\n│ n2d-standard-8 AMD EPYC                  832 │\n│ work laptop    Apple M1 Max              1232 │\n│ n2-standard-16 Intel Cascade and Ice Lake1664 │\n│ n2d-standard-16AMD EPYC                  1664 │\n│ c3-standard-22 Intel Sapphire Rapids     2288 │\n│ personal laptopApple M2 Max              1296 │\n│ n2-standard-32 Intel Cascade and Ice Lake32128 │\n│ n2d-standard-32AMD EPYC                  32128 │\n│ c3-standard-44 Intel Sapphire Rapids     44176 │\n└─────────────────┴────────────────────────────┴───────────┴────────────┘\n
\n```\n:::\n:::\n\n\n### What's in the data?\n\nWith the data, we can see we ran the benchmark on scale factors:\n\n::: {#99a2bcab .cell execution_count=8}\n``` {.python .cell-code}\nsfs = sorted(t.distinct(on=\"sf\")[\"sf\"].to_pyarrow().to_pylist())\nsfs\n```\n\n::: {.cell-output .cell-output-display execution_count=62}\n```\n[1, 8, 16, 32, 64, 128]\n```\n:::\n:::\n\n\n:::{.callout-note title=\"What is a scale factor?\" collapse=\"true\"}\nA scale factor is roughly the size of the data in memory in gigabytes. For\nexample, a scale factor of 1 means the data is roughly 1GB in memory.\n\nStored on disk in (compressed) Parquet format, the data is smaller -- about\n0.38GB for scale factor 1 with the compression settings used in this benchmark.\n:::\n\nWe can look at the total execution time by scale factor:\n\n::: {#a48a964d .cell execution_count=9}\n``` {.python .cell-code code-fold=\"true\" code-summary=\"Show bar plot code\"}\nc = px.bar(\n t.group_by(\"sf\").agg(total_seconds=t[\"execution_seconds\"].sum()),\n x=\"sf\",\n y=\"total_seconds\",\n category_orders={\"sf\": sfs},\n title=\"total execution time by scale factor\",\n)\nc\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n
\n```\n:::\n:::\n\n\nYou can see this is roughly linear as expected.\n\nWe ran on the following queries:\n\n::: {#acda2b46 .cell execution_count=10}\n``` {.python .cell-code}\nquery_numbers = sorted(\n t.distinct(on=\"query_number\")[\"query_number\"].to_pyarrow().to_pylist()\n)\nquery_numbers\n```\n\n::: {.cell-output .cell-output-display execution_count=64}\n```\n[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]\n```\n:::\n:::\n\n\n:::{.callout-note title=\"What is a query number?\" collapse=\"true\"}\nThe TPC-H benchmark defines 22 queries. See the [TPC-H benchmark\nspecification](https://www.tpc.org/TPC_Documents_Current_Versions/pdf/TPC-H_v3.0.1.pdf)\nfor more information.\n:::\n\nWe can look at the total execution time by query number:\n\n::: {#dc3e81cb .cell execution_count=11}\n``` {.python .cell-code code-fold=\"true\" code-summary=\"Show bar plot code\"}\nc = px.bar(\n t.group_by(\"query_number\").agg(total_seconds=t[\"execution_seconds\"].sum()),\n x=\"query_number\",\n y=\"total_seconds\",\n category_orders={\"query_number\": query_numbers},\n title=\"total execution time by query number\",\n)\nc\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n
\n```\n:::\n:::\n\n\nThis gives us a sense of the relative complexity of the queries.\n\nWe ran on the following instance types:\n\n::: {#08a5373b .cell execution_count=12}\n``` {.python .cell-code}\ninstance_types = sorted(\n    t.distinct(on=\"instance_type\")[\"instance_type\"].to_pyarrow().to_pylist(),\n    key=lambda x: (x.split(\"-\")[0], int(x.split(\"-\")[-1]))  # <1>\n    if \"-\" in x  # <1>\n    else (\"z\" + x[3], 0),  # <2>\n)\ninstance_types\n```\n\n::: {.cell-output .cell-output-display execution_count=66}\n```\n['c3-standard-4',\n 'c3-standard-8',\n 'c3-standard-22',\n 'c3-standard-44',\n 'n2-standard-2',\n 'n2-standard-4',\n 'n2-standard-8',\n 'n2-standard-16',\n 'n2-standard-32',\n 'n2d-standard-2',\n 'n2d-standard-4',\n 'n2d-standard-8',\n 'n2d-standard-16',\n 'n2d-standard-32',\n 'work laptop',\n 'personal laptop']\n```\n:::\n:::\n\n\n1. This is to sort the instance types by CPU architecture and number of cores\n2. This is to sort \"personal laptop\" after \"work laptop\"\n\n:::{.callout-note title=\"What is an instance type?\" collapse=\"true\"}\nAn instance type is the compute the benchmark was run on. This consists of two\nMacBook Pro laptops (one work and one personal) and a number of Google Cloud\nCompute Engine instances.\n\nFor cloud VMs, the instance type is in the form of `<family>-<type>-<vCPUs>`,\nwhere:\n\n- `<family>` specifies the CPU architecture (Intel X, AMD Y)\n- `<type>` modifies the CPU to memory ratio (only `standard` is used, with a 1:4 ratio)\n- `<vCPUs>` is the number of vCPUs\n\nFor example, `n2d-standard-2` is a Google Cloud Compute Engine instance with an\nAMD EPYC processor, 2 vCPUs, and 8GB of memory.\n:::\n\nWe can look at the total execution time by instance type:\n\n::: {#00f7dd41 .cell execution_count=13}\n``` {.python .cell-code code-fold=\"true\" code-summary=\"Show bar plot code\"}\nc = px.bar(\n    t.group_by(\"instance_type\")\n    .agg(total_seconds=t[\"execution_seconds\"].sum())\n    .join(instance_details, \"instance_type\"),\n    x=\"instance_type\",\n    y=\"total_seconds\",\n    color=\"cpu_type\",\n    hover_data=[\"cpu_cores\", \"memory_gbs\"],\n    category_orders={\n        \"instance_type\": instance_types,\n        \"cpu_type\": cpu_types,\n    },\n    title=\"total execution time by instance type\",\n)\nc\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n
\n```\n:::\n:::\n\n\nUnsurprisingly, this is inversely correlated with the number of CPU cores and\n(crucially) memory:\n\n::: {#e5169667 .cell execution_count=14}\n``` {.python .cell-code code-fold=\"true\" code-summary=\"Show bar plot code\"}\nc = px.bar(\n instance_details,\n x=\"instance_type\",\n y=\"memory_gbs\",\n color=\"cpu_type\",\n hover_data=[\"cpu_cores\", \"memory_gbs\"],\n category_orders={\n \"instance_type\": instance_types,\n \"cpu_type\": cpu_types,\n },\n title=\"memory by instance type\",\n)\nc\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n
\n```\n:::\n:::\n\n\nWe ran on the following systems:\n\n::: {#adee8043 .cell execution_count=15}\n``` {.python .cell-code}\nsystems = sorted(t.distinct(on=\"system\")[\"system\"].to_pyarrow().to_pylist())\nsystems\n```\n\n::: {.cell-output .cell-output-display execution_count=69}\n```\n['ibis-datafusion',\n 'ibis-datafusion-sql',\n 'ibis-duckdb',\n 'ibis-duckdb-sql',\n 'ibis-polars',\n 'polars-lazy']\n```\n:::\n:::\n\n\n:::{.callout-note title=\"What is a system?\" collapse=\"true\"}\nFor convenience in this benchmark, a 'system' is defined as a hyphen-separated\nnaming convention where:\n\n- `ibis-*`: Ibis API was used\n    - `ibis-<backend>`: Ibis dataframe code was used with the given backend\n    - `ibis-<backend>-sql`: SQL code was used via Ibis on the given backend\n- `polars-*`: Polars API was used\n    - `polars-lazy`: Polars was used with the LazyFrames API\n:::\n\nWe can look at the total execution time by system:\n\n::: {#39ad91fb .cell execution_count=16}\n``` {.python .cell-code code-fold=\"true\" code-summary=\"Show bar plot code\"}\nc = px.bar(\n    t.group_by(\"system\").agg(\n        total_seconds=t[\"execution_seconds\"].sum(),\n        seconds_per_query=t[\"execution_seconds\"].mean(),\n        num_records=t.count(),\n    ),\n    x=\"system\",\n    y=\"num_records\",\n    color=\"system\",\n    category_orders={\"system\": systems},\n    title=\"total execution time by system\",\n)\nc\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n
\n```\n:::\n:::\n\n\n:::{.callout-warning title=\"This can be misleading!\"}\nAt this point, we have to dig deeper into the data to understand the takeaways.\nYou might look at the above and think `ibis-polars` is the fastest all-around,\nbut it's not! Recall [9/22 queries for the Polars backend are\nfailing](#failing-polars-queries), and at larger scale factors we start to see\nseveral systems fail queries due to memory pressure.\n:::\n\n### Execution time by system, scale factor, instance type, and query\n\nWe'll aggregate the data over the dimensions we care about:\n\n::: {#9a0c2610 .cell execution_count=17}\n``` {.python .cell-code}\nagg = (\n t.group_by(\"instance_type\", \"system\", \"sf\", \"n_partitions\", \"query_number\")\n .agg(\n mean_execution_seconds=t[\"execution_seconds\"].mean(),\n )\n .join(instance_details, \"instance_type\")\n)\nagg.head(3)\n```\n\n::: {.cell-output .cell-output-display execution_count=71}\n```{=html}\n
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━┓\n┃ instance_type  system               sf     n_partitions  query_number  mean_execution_seconds  cpu_type                    cpu_cores  memory_gbs ┃\n┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━┩\n│ stringstringint64int64int64float64stringint64int64      │\n├───────────────┼─────────────────────┼───────┼──────────────┼──────────────┼────────────────────────┼────────────────────────────┼───────────┼────────────┤\n│ n2-standard-4ibis-datafusion    16178.641310Intel Cascade and Ice Lake416 │\n│ n2-standard-4ibis-datafusion    641523.362426Intel Cascade and Ice Lake416 │\n│ n2-standard-4ibis-datafusion-sql161110.238970Intel Cascade and Ice Lake416 │\n└───────────────┴─────────────────────┴───────┴──────────────┴──────────────┴────────────────────────┴────────────────────────────┴───────────┴────────────┘\n
\n```\n:::\n:::\n\n\nThere's a lot of data and it's difficult to visualize all at once. We'll build\nup our understanding with a few plots.\n\n::: {#e6c188ce .cell execution_count=18}\n``` {.python .cell-code code-fold=\"true\" code-summary=\"Show code for timings_plot\"}\ndef timings_plot(\n agg,\n sf_filter=128,\n systems_filter=systems,\n instances_filter=[instance for instance in instance_types if \"laptop\" in instance],\n queries_filter=query_numbers,\n log_y=True,\n):\n data = (\n agg.filter(agg[\"sf\"] == sf_filter)\n .filter(agg[\"system\"].isin(systems_filter))\n .filter(agg[\"instance_type\"].isin(instances_filter))\n .filter(agg[\"query_number\"].isin(queries_filter))\n )\n\n c = px.bar(\n data,\n x=\"query_number\",\n y=\"mean_execution_seconds\",\n log_y=log_y,\n color=\"system\",\n barmode=\"group\",\n pattern_shape=\"instance_type\",\n category_orders={\n \"system\": systems,\n \"instance_type\": instance_types,\n },\n hover_data=[\"cpu_type\", \"cpu_cores\", \"memory_gbs\"],\n title=f\"sf: {sf_filter}\",\n )\n\n return c\n```\n:::\n\n\nFirst, let's visualize execution time for a given scale factor, system, query,\nand family of instance types:\n\n::: {#fb98b88c .cell execution_count=19}\n``` {.python .cell-code}\nsf_filter = 128\nsystems_filter = [\"ibis-duckdb\"]\ninstances_filter = [\n instance for instance in instance_types if instance.startswith(\"n2d\")\n]\nqueries_filter = [1]\nlog_y = False\n\ntimings_plot(agg, sf_filter, systems_filter, instances_filter, queries_filter, log_y)\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n
\n```\n:::\n:::\n\n\nFrom left to right, we have increasing instance resources (CPU cores and memory\n-- you can hover over the data to see the details). You can also zoom and select\nspecific labels to focus on. We notice that, as expected, queries execute faster\nwhen given more resources.\n\nNow let's add a second system:\n\n::: {#543f7d40 .cell execution_count=20}\n``` {.python .cell-code}\nsystems_filter = [\"ibis-duckdb\", \"ibis-duckdb-sql\"]\n\ntimings_plot(agg, sf_filter, systems_filter, instances_filter, queries_filter, log_y)\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n
\n```\n:::\n:::\n\n\n:::{.callout-note title=\"Ibis dataframe code vs Ibis SQL code\" collapse=\"true\"}\n`ibis-duckdb` is running the TPC-H queries written as Ibis dataframe code. The\n`ibis-duckdb-sql` system is running the same queries but written as SQL code\npassed into `.sql()` in Ibis as strings. The intent is to see if Ibis dataframe\ncode is introducing any significant overhead. While ideally we'd run on the\nbackend's Python client without Ibis in the mix, this keeps the benchmarking\nprocess simple and should serve as a decent proxy.\n:::\n\nIn this case, we do see that Ibis dataframe code is adding some overhead. But,\nthis is a single data point -- let's expand to the first 7 queries:\n\n:::{.callout-note title=\"Logging the y-axis\"}\nFrom here, we'll set `log_y=True` due to the wide range of execution times.\n\nWe also look at the first 7 queries due to limited horizontal space on this\nwebsite. Analyze and visualize the data yourself for all 22 queries! Or [see\nthe Ibis benchmarking Streamlit app](https://ibis-bench.streamlit.app).\n:::\n\n::: {#a6ea5967 .cell execution_count=21}\n``` {.python .cell-code}\nlog_y = True\nqueries_filter = range(1, 7+1)\n\ntimings_plot(agg, sf_filter, systems_filter, instances_filter, queries_filter, log_y)\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n
\n```\n:::\n:::\n\n\nThis tells a different story. Sometimes Ibis dataframe code is a bit faster,\nsometimes a bit slower. Let's compute the totals:\n\n::: {#712081ae .cell execution_count=22}\n``` {.python .cell-code}\n(\n agg.filter(agg[\"sf\"] == sf_filter)\n .filter(agg[\"system\"].isin(systems_filter))\n .filter(agg[\"instance_type\"].isin(instances_filter))\n .filter(agg[\"query_number\"].isin(queries_filter))\n .group_by(\"system\")\n .agg(\n total_execution_seconds=agg[\"mean_execution_seconds\"].sum(),\n total_queries=ibis._.count(),\n )\n .mutate(\n seconds_per_query=ibis._[\"total_execution_seconds\"] / ibis._[\"total_queries\"]\n )\n)\n```\n\n::: {.cell-output .cell-output-display execution_count=76}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓\n┃ system          ┃ total_execution_seconds ┃ total_queries ┃ seconds_per_query ┃\n┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩\n│ string          │ float64                  │ int64         │ float64           │\n├─────────────────┼──────────────────────────┼───────────────┼───────────────────┤\n│ ibis-duckdb     │ 1049.007159              │ 35            │ 29.971633         │\n│ ibis-duckdb-sql │ 1180.318346              │ 35            │ 33.723381         │\n└─────────────────┴──────────────────────────┴───────────────┴───────────────────┘\n
\n```\n:::\n:::\n\n\nIbis dataframe code is a little faster overall, but this is on a subset of\nqueries and scale factors and instance types. More analysis and profiling would\nbe needed to make a definitive statement, but in general we can be happy that\nDuckDB does a great job optimizing the SQL Ibis generates and that Ibis\ndataframe code isn't adding significant overhead.\n\nLet's repeat this for DataFusion:\n\n::: {#786d628d .cell execution_count=23}\n``` {.python .cell-code}\nsystems_filter = [\"ibis-datafusion\", \"ibis-datafusion-sql\"]\n\ntimings_plot(agg, sf_filter, systems_filter, instances_filter, queries_filter, log_y)\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n
\n```\n:::\n:::\n\n\nWe see a similar story. Let's confirm with a table:\n\n::: {#1b2de912 .cell execution_count=24}\n``` {.python .cell-code}\n(\n agg.filter(agg[\"sf\"] == sf_filter)\n .filter(agg[\"system\"].isin(systems_filter))\n .filter(agg[\"instance_type\"].isin(instances_filter))\n .filter(agg[\"query_number\"].isin(queries_filter))\n .group_by(\"system\")\n .agg(\n total_execution_seconds=agg[\"mean_execution_seconds\"].sum(),\n total_queries=ibis._.count(),\n )\n .mutate(\n seconds_per_query=ibis._[\"total_execution_seconds\"] / ibis._[\"total_queries\"]\n )\n)\n```\n\n::: {.cell-output .cell-output-display execution_count=78}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓\n┃ system              ┃ total_execution_seconds ┃ total_queries ┃ seconds_per_query ┃\n┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩\n│ string              │ float64                  │ int64         │ float64           │\n├─────────────────────┼──────────────────────────┼───────────────┼───────────────────┤\n│ ibis-datafusion-sql │ 1041.259330              │ 33            │ 31.553313         │\n│ ibis-datafusion     │ 1202.149386              │ 35            │ 34.347125         │\n└─────────────────────┴──────────────────────────┴───────────────┴───────────────────┘\n
\n```\n:::\n:::\n\n\nThis time Ibis dataframe code is a bit slower overall. **However, also notice\ntwo queries are missing from `ibis-datafusion-sql`**. These are query 7 on\n`n2d-standard-2` and `n2d-standard-4` (the two instances with the least memory).\nWe'll investigate failing queries more thoroughly in the next section.\n\nFirst, let's look at Polars:\n\n::: {#0a68f425 .cell execution_count=25}\n``` {.python .cell-code}\nsystems_filter = [\"ibis-polars\", \"polars-lazy\"]\n\ntimings_plot(agg, sf_filter, systems_filter, instances_filter, queries_filter, log_y)\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n
\n```\n:::\n:::\n\n\nA lot of queries are missing from `ibis-polars` and `polars-lazy`. These are\nfailing due to the high scale factor and limited memory on the instances.\n\nLet's look at a lower scale factor and my MacBooks (Polars tended to perform\nbetter on these):\n\n::: {#2b903b5b .cell execution_count=26}\n``` {.python .cell-code}\nsf_filter = 64\ninstances_filter = [\n instance for instance in instance_types if \"laptop\" in instance\n]\n\ntimings_plot(agg, sf_filter, systems_filter, instances_filter, queries_filter, log_y)\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n
\n```\n:::\n:::\n\n\nWe see a similar pattern as above -- some queries are a little faster on\n`ibis-polars`, though some are **much** slower. In particular queries 1 and 2\ntend to have a lot of overhead.\n\n::: {#cea46a0c .cell execution_count=27}\n``` {.python .cell-code}\n(\n agg.filter(agg[\"sf\"] == sf_filter)\n .filter(agg[\"system\"].isin(systems_filter))\n .filter(agg[\"instance_type\"].isin(instances_filter))\n .filter(agg[\"query_number\"].isin(queries_filter))\n .group_by(\"system\")\n .agg(\n total_execution_seconds=agg[\"mean_execution_seconds\"].sum(),\n total_queries=ibis._.count(),\n )\n .mutate(\n seconds_per_query=ibis._[\"total_execution_seconds\"] / ibis._[\"total_queries\"]\n )\n)\n```\n\n::: {.cell-output .cell-output-display execution_count=81}\n```{=html}\n
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓\n┃ system      ┃ total_execution_seconds ┃ total_queries ┃ seconds_per_query ┃\n┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩\n│ string      │ float64                  │ int64         │ float64           │\n├─────────────┼──────────────────────────┼───────────────┼───────────────────┤\n│ ibis-polars │ 185.547665               │ 14            │ 13.253405         │\n│ polars-lazy │ 115.157749               │ 14            │ 8.225554          │\n└─────────────┴──────────────────────────┴───────────────┴───────────────────┘\n
\n```\n:::\n:::\n\n\nLet's now compare all systems across a single instance type and query:\n\n::: {#fbc36677 .cell execution_count=28}\n``` {.python .cell-code}\nsf_filter = 128\ninstances_filter = [\"n2d-standard-32\"]\nsystems_filter = systems\nqueries_filter = [1]\n\ntimings_plot(agg, sf_filter, systems_filter, instances_filter, queries_filter, log_y)\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n
\n```\n:::\n:::\n\n\nAnd then the first 7 queries:\n\n::: {#adb4afa0 .cell execution_count=29}\n``` {.python .cell-code}\nqueries_filter = range(1, 7+1)\n\ntimings_plot(agg, sf_filter, systems_filter, instances_filter, queries_filter, log_y)\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n
\n```\n:::\n:::\n\n\n:::{.callout-warning title=\"Lots of data, lots of takeaways\"}\nThere is a lot of data and it's easy to summarize and visualize it in a way that\nfavors a given system. There's a lot of missing data that needs to be accounted\nfor, as it often indicates a query that failed due to memory pressure.\n\nEach system has strengths and weaknesses. See [the discussion section\nbelow](#which-system-is-best).\n\nSee the [Ibis benchmarking Streamlit app](https://ibis-bench.streamlit.app) for\nfurther analysis, or query the data yourself!\n:::\n\n### Failing queries due to memory pressure\n\nMany queries fail due to memory pressure at higher scale factors with\ninsufficient resources. Impressively, the exception here is DuckDB.\n\n::: {#ae9c0f82 .cell execution_count=30}\n``` {.python .cell-code code-fold=\"true\" code-summary=\"Show code to get table of failing queries\"}\ndef failing_queries(agg, sf, instance_type):\n    failing = (\n        t.filter(t[\"sf\"] == sf)\n        .filter(t[\"instance_type\"] == instance_type)\n        .group_by(\"system\")\n        .agg(present_queries=ibis._[\"query_number\"].collect().unique().sort())\n    )\n    failing = (\n        failing.mutate(\n            failing_queries=t.distinct(on=\"query_number\")[\"query_number\"]\n            .collect()\n            .filter(lambda x: ~failing[\"present_queries\"].contains(x))\n        )\n        .mutate(\n            num_failing_queries=ibis._[\"failing_queries\"].length(),\n            num_successful_queries=ibis._[\"present_queries\"].length(),\n        )\n        .drop(\"present_queries\")\n        .order_by(\"num_failing_queries\", \"system\")\n    )\n\n    return failing\n```\n:::\n\n\nLet's look at the failing queries on the largest `n2d` instance:\n\n::: {#1f5b8b26 .cell execution_count=31}\n``` {.python .cell-code}\nsf = 128\ninstance_type = \"n2d-standard-32\"\n\nfailing = failing_queries(agg, sf, instance_type)\nfailing\n```\n\n::: {.cell-output .cell-output-display execution_count=85}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ system               failing_queries                              num_failing_queries  num_successful_queries ┃\n┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ stringarray<int64>int64int64                  │\n├─────────────────────┼─────────────────────────────────────────────┼─────────────────────┼────────────────────────┤\n│ ibis-duckdb        []022 │\n│ ibis-duckdb-sql    []022 │\n│ ibis-datafusion-sql[16]121 │\n│ polars-lazy        [9]121 │\n│ ibis-datafusion    [16, 21, 22]319 │\n│ ibis-polars        [9, 11, 13, 14, 15, 16, 17, 19, 20, 21, 22]1111 │\n└─────────────────────┴─────────────────────────────────────────────┴─────────────────────┴────────────────────────┘\n
\n```\n:::\n:::\n\n\n::: {#53363979 .cell execution_count=32}\n``` {.python .cell-code code-fold=\"true\" code-summary=\"Show code to create a bar plot of the number of successful queries by system\"}\nc = px.bar(\n failing,\n x=\"system\",\n y=\"num_successful_queries\",\n category_orders={\n \"system\": systems,\n \"query_number\": query_numbers,\n },\n color=\"system\",\n title=\"completed queries\",\n)\nc\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n
\n```\n:::\n:::\n\n\nAnd the smallest:\n\n::: {#2ae60570 .cell execution_count=33}\n``` {.python .cell-code}\ninstance_type = \"n2d-standard-2\"\n\nfailing = failing_queries(agg, sf, instance_type)\nfailing\n```\n\n::: {.cell-output .cell-output-display execution_count=87}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ system               failing_queries                                                            num_failing_queries  num_successful_queries ┃\n┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ stringarray<int64>int64int64                  │\n├─────────────────────┼───────────────────────────────────────────────────────────────────────────┼─────────────────────┼────────────────────────┤\n│ ibis-duckdb        []022 │\n│ ibis-duckdb-sql    []022 │\n│ ibis-datafusion-sql[7, 9, 16, 18, 20]517 │\n│ ibis-datafusion    [9, 16, 18, 20, 21, 22]616 │\n│ polars-lazy        [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 17, 18, 19, 20, 21]184 │\n│ ibis-polars        [1, 3, 4, 5, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]202 │\n└─────────────────────┴───────────────────────────────────────────────────────────────────────────┴─────────────────────┴────────────────────────┘\n
\n```\n:::\n:::\n\n\n::: {#699cf3f1 .cell execution_count=34}\n``` {.python .cell-code code-fold=\"true\" code-summary=\"Show code to create a bar plot of the number of successful queries by system\"}\nc = px.bar(\n failing,\n x=\"system\",\n y=\"num_successful_queries\",\n category_orders={\n \"system\": systems,\n \"query_number\": query_numbers,\n },\n color=\"system\",\n title=\"completed queries\",\n)\nc\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n
\n```\n:::\n:::\n\n\nA lot of queries are failing on the smallest instance due to memory pressure.\n\nWe can create a single visualization across the `n2d` instances:\n\n::: {#a5a86abe .cell execution_count=35}\n``` {.python .cell-code code-fold=\"true\" code-summary=\"Show code to create a bar plot of the number of successful queries by system and instance type\"}\nfailing = t.group_by(\"instance_type\", \"system\", \"sf\").agg(\n total_time=t[\"execution_seconds\"].sum(),\n present_queries=ibis._[\"query_number\"].collect().unique().sort(),\n)\nfailing = (\n failing.mutate(\n failing_queries=t.distinct(on=\"query_number\")[\"query_number\"]\n .collect()\n .filter(lambda x: ~failing[\"present_queries\"].contains(x)),\n )\n .mutate(\n num_failing_queries=ibis._[\"failing_queries\"].length(),\n num_successful_queries=ibis._[\"present_queries\"].length(),\n )\n .drop(\"present_queries\")\n .relocate(\"instance_type\", \"system\", \"sf\", \"failing_queries\")\n .order_by(\"num_failing_queries\", \"instance_type\", \"system\", \"sf\")\n)\nfailing = failing.join(instance_details, \"instance_type\")\nfailing = (\n failing.filter(\n (failing[\"sf\"] == 128) & (failing[\"instance_type\"].startswith(\"n2d-\"))\n )\n).order_by(ibis.desc(\"memory_gbs\"))\n\nc = px.bar(\n failing,\n x=\"system\",\n y=\"num_successful_queries\",\n color=\"instance_type\",\n barmode=\"group\",\n hover_data=[\"cpu_cores\", \"memory_gbs\"],\n category_orders={\n \"system\": systems,\n \"instance_type\": reversed(\n [instance for instance in instance_types if instance.startswith(\"n2d\")]\n ),\n },\n title=\"completed queries\",\n)\nc\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n
\n```\n:::\n:::\n\n\nWithin each system, from left to right, we have decreasing resources (vCPUs and\nmemory). We can see how each system performs on the same queries with different\nresources.\n\n:::{.callout-warning title=\"Data is aggregated\"}\nKeep in mind data is aggregated over three runs of each query. For DuckDB, there\nwas actually a single failure on the smallest instance for query 9, out of six\nruns across the two systems, but this does not appear above because we are\nchecking for the success of the query in any of the three runs per\nconfiguration.\n:::\n\n## Discussion and reproducibility\n\nBenchmarking is fraught: it's easy to get wrong and ship your bias in the\nresults. We don't want to end up as [Figure 1 in \"Fair Benchmarking Considered\nDifficult: Common Pitfalls In Database Performance\nTesting\"](https://hannes.muehleisen.org/publications/DBTEST2018-performance-testing.pdf):\n\n![Figure 1](figure1.png)\n\nIf you have any questions or concerns, feel free to [open an\nissue](https://github.com/lostmygithubaccount/ibis-bench/issues/new) or comment\non this post below.\n\n### Which system is best?\n\nTrick question! It depends on your use case. DuckDB is a simple, performant\nin-process database with an on-disk file format (SQLite for OLAP). DataFusion is\nan extensible query engine and is often used for building databases or query\nengines. Polars is an OLAP query engine with a Python dataframe API that can be\nused as a more performant alternative to pandas.\n\nAll three make great Ibis backends and you can switch between them in a single\nline of code. This lets you write your code once and run it on the engine\nthat's best for your use case. If a better engine comes along you'll likely be\nable to use that too. And you can scale up and out across the 20+ backends Ibis\nsupports as needed.\n\nTPC-H is a decent benchmark *for what it benchmarks, which is limited*. We're\nnot running window functions, doing timeseries analysis, or feature engineering\nfor machine learning. We're not using nested data types. We're not performing\nregexes or using LLMs in UDFs...\n\nIt's easy to summarize and visualize benchmarking data in a way that favors a\ngiven system. You should favor the system that works best for your use case.\n\n### Performance converges over time\n\nLet's look at some quotes from [\"Perf is not\nenough\"](https://motherduck.com/blog/perf-is-not-enough) by Jordan Tigani of\nMotherDuck:\n\n> If you take a bunch of databases, all actively maintained, and iterate them\n> out a few years, **performance is going to converge**. If Clickhouse is applying\n> a technique that gives it an advantage for scan speed today, Snowflake will\n> likely have that within a year or two. If Snowflake adds incrementally\n> materialized views, BigQuery will soon follow. It is unlikely that important\n> performance differences will persist over time.\n>\n> As clever as the engineers working for any of these companies are, none of\n> them possess any magic incantations or things that cannot be replicated\n> elsewhere. Each database uses a different bag of tricks in order to get good\n> performance. One might compile queries to machine code, another might cache data\n> on local SSDs, and a third might use specialized network hardware to do\n> shuffles. **Given time, all of these techniques can be implemented by anyone. If\n> they work well, they likely will show up everywhere.**\n\nThis is extra true for open source databases (or query engines). 
If DuckDB adds\na feature that improves performance, it's likely that DataFusion and Polars will\nfollow suit -- they can go read the source code and specific commits to see how\nit was done.\n\n### Reproducing the benchmark\n\nThe source code [is available on\nGitHub](https://github.com/lostmygithubaccount/ibis-bench/tree/v2.0.0).\n\n#### A TPC-H benchmark on 6 systems in 3 commands\n\nFirst install `ibis-bench`:\n\n```bash\npip install ibis-bench\n```\n\nThen generate the TPC-H data:\n\n```bash\nbench gen-data -s 1\n```\n\nFinally run the benchmark:\n\n```bash\nbench run -s 1 ibis-duckdb ibis-duckdb-sql ibis-datafusion ibis-datafusion-sql ibis-polars polars-lazy\n```\n\nCongratulations! You've run a TPC-H benchmark on DuckDB (Ibis dataframe code and\nSQL), DataFusion (Ibis dataframe code and SQL), and Polars (dataframe code via\nIbis and native Polars).\n\n#### What just happened?\n\nThis will generate TPC-H data at scale factor 1 as Parquet files in the\n`tpch_data` directory:\n\n```bash\ntpch_data\n└── parquet\n    └── sf=1\n        └── n=1\n            ├── customer\n            │   └── 0000.parquet\n            ├── lineitem\n            │   └── 0000.parquet\n            ├── nation\n            │   └── 0000.parquet\n            ├── orders\n            │   └── 0000.parquet\n            ├── part\n            │   └── 0000.parquet\n            ├── partsupp\n            │   └── 0000.parquet\n            ├── region\n            │   └── 0000.parquet\n            └── supplier\n                └── 0000.parquet\n```\n\nThe scale factor is roughly the size of data **in memory** in gigabytes (GBs).\nThe size of data on disk, however, is smaller because Parquet is compressed. We\ncan take a look at the size of the data:\n\n```bash\n384M    tpch_data/parquet/sf=1/n=1\n262M    tpch_data/parquet/sf=1/n=1/lineitem\n 59M    tpch_data/parquet/sf=1/n=1/orders\n 12M    tpch_data/parquet/sf=1/n=1/customer\n 43M    tpch_data/parquet/sf=1/n=1/partsupp\n6.6M    tpch_data/parquet/sf=1/n=1/part\n788K    tpch_data/parquet/sf=1/n=1/supplier\n4.0K    tpch_data/parquet/sf=1/n=1/nation\n4.0K    tpch_data/parquet/sf=1/n=1/region\n```\n\nWe can see the total size is 0.38 GB and the relative sizes of the tables -- `lineitem` is\nby far the largest.\n\nUsing `bench run` results in a `results_data` directory with the results of the\nqueries and a `bench_logs_v2` directory with the logs of the benchmark run.\n\n#### Analyzing the results\n\nWe can use Ibis to load and analyze the log data:\n\n::: {#c06ce9c3 .cell execution_count=36}\n``` {.python .cell-code}\nimport ibis\n\nibis.options.interactive = True\nibis.options.repr.interactive.max_rows = 6\nibis.options.repr.interactive.max_columns = None\n\nt = ibis.read_json(\"bench_logs_v*/raw_json/file_id=*.json\").relocate(\n    \"system\", \"sf\", \"query_number\", \"execution_seconds\"\n)\nt\n```\n\n::: {.cell-output .cell-output-display execution_count=90}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ system               sf     query_number  execution_seconds  session_id                            instance_type  timestamp                   n_partitions  file_type  file_id                                   ┃\n┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ stringint64int64float64uuidjsonstringint64stringstring                                    │\n├─────────────────────┼───────┼──────────────┼───────────────────┼──────────────────────────────────────┼───────────────┼────────────────────────────┼──────────────┼───────────┼───────────────────────────────────────────┤\n│ ibis-datafusion-sql1120.076600 │ 0b931439-5670-4a77-89b0-d8b7c45e6eb7 │ NULL2024-06-13T16:09:34.4763971parquet  00218347-a4cd-4590-a502-8cf79f4e87c9.json │\n│ ibis-datafusion-sql1210.165074 │ 0b931439-5670-4a77-89b0-d8b7c45e6eb7 │ NULL2024-06-13T16:09:35.3767531parquet  01089668-608c-4551-ae65-6d98d69f959b.json │\n│ ibis-polars        1120.075944 │ 0b931439-5670-4a77-89b0-d8b7c45e6eb7 │ NULL2024-06-13T16:09:36.9560011parquet  02a991bd-797a-4c08-83de-c1b537f713fe.json │\n│ ibis-datafusion    1100.144007 │ 0b931439-5670-4a77-89b0-d8b7c45e6eb7 │ NULL2024-06-13T16:09:32.2976471parquet  02bd900a-3c0a-4871-b651-1690f11a81ab.json │\n│ ibis-datafusion-sql130.067048 │ 0b931439-5670-4a77-89b0-d8b7c45e6eb7 │ NULL2024-06-13T16:09:33.6993681parquet  08490a6b-e1ab-482c-83bc-85469c6b96a3.json │\n│ ibis-duckdb        1100.160302 │ 0b931439-5670-4a77-89b0-d8b7c45e6eb7 │ NULL2024-06-13T16:09:27.3163391parquet  08b92577-8150-4040-bceb-9316da7bfaf4.json │\n│                                          │\n└─────────────────────┴───────┴──────────────┴───────────────────┴──────────────────────────────────────┴───────────────┴────────────────────────────┴──────────────┴───────────┴───────────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\nWe can check the total execution time for each system:\n\n::: {#98d17563 .cell execution_count=37}\n``` {.python .cell-code}\nt.group_by(\"system\").agg(total_seconds=t[\"execution_seconds\"].sum()).order_by(\n \"total_seconds\"\n)\n```\n\n::: {.cell-output .cell-output-display execution_count=91}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓\n┃ system              ┃ total_seconds ┃\n┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩\n│ string              │ float64       │\n├─────────────────────┼───────────────┤\n│ ibis-datafusion-sql │ 2.006620      │\n│ ibis-duckdb-sql     │ 2.067606      │\n│ polars-lazy         │ 2.086350      │\n│ ibis-polars         │ 2.168417      │\n│ ibis-duckdb         │ 2.495270      │\n│ ibis-datafusion     │ 2.529014      │\n└─────────────────────┴───────────────┘\n
\n```\n:::\n:::\n\n\nWe can visualize the results:\n\n::: {#0f703cc5 .cell execution_count=38}\n``` {.python .cell-code}\nimport plotly.express as px\n\npx.defaults.template = \"plotly_dark\"\n\nagg = t.group_by(\"system\", \"query_number\").agg(\n mean_execution_seconds=t[\"execution_seconds\"].mean(),\n)\n\nchart = px.bar(\n agg,\n x=\"query_number\",\n y=\"mean_execution_seconds\",\n color=\"system\",\n barmode=\"group\",\n title=\"Mean execution time by query\",\n category_orders={\n \"system\": sorted(t.select(\"system\").distinct().to_pandas()[\"system\"].tolist())\n },\n)\nchart\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n
\n```\n:::\n:::\n\n\n#### What did we run and measure, exactly?\n\nWe can import `ibis_bench` as a library and read in the TPC-H tables:\n\n::: {#09aeb621 .cell execution_count=39}\n``` {.python .cell-code}\nimport ibis\nimport polars as pl\n\nfrom datetime import date\nfrom ibis_bench.utils.read_data import get_ibis_tables, get_polars_tables\n\nsf = 1\n```\n:::\n\n\n:::{.panel-tabset}\n\n## Ibis (DuckDB)\n\n::: {#e08eab7a .cell execution_count=40}\n``` {.python .cell-code}\ncon = ibis.connect(\"duckdb://\")\n\n(customer, lineitem, nation, orders, part, partsupp, region, supplier) = (\n get_ibis_tables(sf=sf, con=con)\n)\n```\n:::\n\n\n\n\n::: {#624b5453 .cell execution_count=42}\n``` {.python .cell-code}\nlineitem.order_by(ibis.desc(\"l_orderkey\"), ibis.asc(\"l_partkey\"))\n```\n\n::: {.cell-output .cell-output-display execution_count=96}\n```{=html}\n
┏━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ l_orderkey  l_partkey  l_suppkey  l_linenumber  l_quantity  l_extendedprice  l_discount  l_tax    l_returnflag  l_linestatus  l_shipdate  l_commitdate  l_receiptdate  l_shipinstruct     l_shipmode  l_comment                                ┃\n┡━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ int64int64int64int64float64float64float64float64stringstringdatedatedatestringstringstring                                   │\n├────────────┼───────────┼───────────┼──────────────┼────────────┼─────────────────┼────────────┼─────────┼──────────────┼──────────────┼────────────┼──────────────┼───────────────┼───────────────────┼────────────┼──────────────────────────────────────────┤\n│    600000032255225615.05936.250.040.03N           O           1996-11-021996-11-191996-12-01TAKE BACK RETURN MAIL      riously pe                               │\n│    6000000961276128228.031447.360.010.02N           O           1996-09-221996-10-011996-10-21NONE             AIR       pecial excuses nag evenly f              │\n│    59999756452145327.09509.150.040.00A           F           1993-11-021993-09-231993-11-19DELIVER IN PERSONSHIP      ffily along the sly                      │\n│    599997572722273132.037736.640.070.01R           F           1993-10-071993-09-301993-10-21COLLECT COD      REG AIR   ld deposits aga                          │\n│    5999975371312138318.019226.340.040.01A           F           1993-11-171993-08-281993-12-08DELIVER IN PERSONFOB       counts cajole evenly? sly orbits boost f │\n│    5999974104635466246.063179.160.080.06R           F           1993-09-161993-09-211993-10-02COLLECT COD      RAIL      se slyly alo                             │\n│                                                  │\n└────────────┴───────────┴───────────┴──────────────┴────────────┴─────────────────┴────────────┴─────────┴──────────────┴──────────────┴────────────┴──────────────┴───────────────┴───────────────────┴────────────┴──────────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\n::: {#ff8e56e8 .cell execution_count=43}\n``` {.python .cell-code}\nlineitem.count()\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n
\n```\n:::\n\n::: {.cell-output .cell-output-display execution_count=97}\n\n::: {.ansi-escaped-output}\n```{=html}\n
┌─────────┐\n│ 6001215 │\n└─────────┘
\n```\n:::\n\n:::\n:::\n\n\n## Ibis (DataFusion)\n\n::: {#a5254d3a .cell execution_count=44}\n``` {.python .cell-code}\ncon = ibis.connect(\"datafusion://\")\n\n(customer, lineitem, nation, orders, part, partsupp, region, supplier) = (\n get_ibis_tables(sf=sf, con=con)\n)\n```\n:::\n\n\n::: {#23425282 .cell execution_count=45}\n``` {.python .cell-code}\nlineitem.order_by(ibis.desc(\"l_orderkey\"), ibis.asc(\"l_partkey\"))\n```\n\n::: {.cell-output .cell-output-display execution_count=99}\n```{=html}\n
┏━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ l_orderkey  l_partkey  l_suppkey  l_linenumber  l_quantity  l_extendedprice  l_discount  l_tax    l_returnflag  l_linestatus  l_shipdate  l_commitdate  l_receiptdate  l_shipinstruct     l_shipmode  l_comment                                ┃\n┡━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ int64int64int64int64float64float64float64float64stringstringdatedatedatestringstringstring                                   │\n├────────────┼───────────┼───────────┼──────────────┼────────────┼─────────────────┼────────────┼─────────┼──────────────┼──────────────┼────────────┼──────────────┼───────────────┼───────────────────┼────────────┼──────────────────────────────────────────┤\n│    600000032255225615.05936.250.040.03N           O           1996-11-021996-11-191996-12-01TAKE BACK RETURN MAIL      riously pe                               │\n│    6000000961276128228.031447.360.010.02N           O           1996-09-221996-10-011996-10-21NONE             AIR       pecial excuses nag evenly f              │\n│    59999756452145327.09509.150.040.00A           F           1993-11-021993-09-231993-11-19DELIVER IN PERSONSHIP      ffily along the sly                      │\n│    599997572722273132.037736.640.070.01R           F           1993-10-071993-09-301993-10-21COLLECT COD      REG AIR   ld deposits aga                          │\n│    5999975371312138318.019226.340.040.01A           F           1993-11-171993-08-281993-12-08DELIVER IN PERSONFOB       counts cajole evenly? sly orbits boost f │\n│    5999974104635466246.063179.160.080.06R           F           1993-09-161993-09-211993-10-02COLLECT COD      RAIL      se slyly alo                             │\n│                                                  │\n└────────────┴───────────┴───────────┴──────────────┴────────────┴─────────────────┴────────────┴─────────┴──────────────┴──────────────┴────────────┴──────────────┴───────────────┴───────────────────┴────────────┴──────────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\n::: {#5ab7d426 .cell execution_count=46}\n``` {.python .cell-code}\nlineitem.count()\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n
\n```\n:::\n\n::: {.cell-output .cell-output-display execution_count=100}\n\n::: {.ansi-escaped-output}\n```{=html}\n
┌─────────┐\n│ 6001215 │\n└─────────┘
\n```\n:::\n\n:::\n:::\n\n\n## Ibis (Polars)\n\n::: {#844e701c .cell execution_count=47}\n``` {.python .cell-code}\ncon = ibis.connect(\"polars://\")\n\n(customer, lineitem, nation, orders, part, partsupp, region, supplier) = (\n get_ibis_tables(sf=sf, con=con)\n)\n```\n:::\n\n\n::: {#42c87d76 .cell execution_count=48}\n``` {.python .cell-code}\nlineitem.order_by(ibis.desc(\"l_orderkey\"), ibis.asc(\"l_partkey\"))\n```\n\n::: {.cell-output .cell-output-display execution_count=102}\n```{=html}\n
lineitem sample (top rows ordered by l_orderkey descending, l_partkey ascending) -- columns: l_orderkey, l_partkey, l_suppkey, l_linenumber, l_quantity, l_extendedprice, l_discount, l_tax, l_returnflag, l_linestatus, l_shipdate, l_commitdate, l_receiptdate, l_shipinstruct, l_shipmode, l_comment
\n```\n:::\n:::\n\n\n::: {#ec10f64b .cell execution_count=49}\n``` {.python .cell-code}\nlineitem.count()\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n
\n```\n:::\n\n::: {.cell-output .cell-output-display execution_count=103}\n\n::: {.ansi-escaped-output}\n```{=html}\n
┌─────────┐\n│ 6001215 │\n└─────────┘
\n```\n:::\n\n:::\n:::\n\n\n:::\n\n\n\nThe queries are also defined in `ibis_bench.queries`. Let's look at query 4 as\nan example for Ibis dataframe code, Polars dataframe code, and SQL code via\nIbis:\n\n:::{.panel-tabset}\n\n## Ibis (dataframe)\n\nDefine query 4:\n\n::: {#1b3ec7eb .cell execution_count=51}\n``` {.python .cell-code}\ndef q4(lineitem, orders, **kwargs):\n var1 = date(1993, 7, 1)\n var2 = date(1993, 10, 1)\n\n q_final = (\n lineitem.join(orders, lineitem[\"l_orderkey\"] == orders[\"o_orderkey\"])\n .filter((orders[\"o_orderdate\"] >= var1) & (orders[\"o_orderdate\"] < var2))\n .filter(lineitem[\"l_commitdate\"] < lineitem[\"l_receiptdate\"])\n .distinct(on=[\"o_orderpriority\", \"l_orderkey\"])\n .group_by(\"o_orderpriority\")\n .agg(order_count=ibis._.count())\n .order_by(\"o_orderpriority\")\n )\n\n return q_final\n```\n:::\n\n\nRun query 4:\n\n::: {#03a33620 .cell execution_count=52}\n``` {.python .cell-code}\nres = q4(lineitem, orders)\nres\n```\n\n::: {.cell-output .cell-output-display execution_count=106}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓\n┃ o_orderpriority  order_count ┃\n┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩\n│ stringint64       │\n├─────────────────┼─────────────┤\n│ 1-URGENT       10594 │\n│ 2-HIGH         10476 │\n│ 3-MEDIUM       10410 │\n│ 4-NOT SPECIFIED10556 │\n│ 5-LOW          10487 │\n└─────────────────┴─────────────┘\n
\n```\n:::\n:::\n\n\n## Polars (dataframe)\n\nDefine query 4:\n\n::: {#8ddc975b .cell execution_count=53}\n``` {.python .cell-code}\ndef q4(lineitem, orders, **kwargs):\n var1 = date(1993, 7, 1)\n var2 = date(1993, 10, 1)\n\n q_final = (\n lineitem.join(orders, left_on=\"l_orderkey\", right_on=\"o_orderkey\")\n .filter(pl.col(\"o_orderdate\").is_between(var1, var2, closed=\"left\"))\n .filter(pl.col(\"l_commitdate\") < pl.col(\"l_receiptdate\"))\n .unique(subset=[\"o_orderpriority\", \"l_orderkey\"])\n .group_by(\"o_orderpriority\")\n .agg(pl.len().alias(\"order_count\"))\n .sort(\"o_orderpriority\")\n )\n\n return q_final\n```\n:::\n\n\nRun query 4:\n\n::: {#4719fe1c .cell execution_count=54}\n``` {.python .cell-code}\nres = q4(lineitem.to_polars().lazy(), orders.to_polars().lazy()).collect()\nres\n```\n\n::: {.cell-output .cell-output-display execution_count=108}\n```{=html}\n
\nshape: (5, 2)
┌───────────────────┬─────────────┐
│ o_orderpriority   ┆ order_count │
│ str               ┆ u32         │
╞═══════════════════╪═════════════╡
│ "1-URGENT"        ┆ 10594       │
│ "2-HIGH"          ┆ 10476       │
│ "3-MEDIUM"        ┆ 10410       │
│ "4-NOT SPECIFIED" ┆ 10556       │
│ "5-LOW"           ┆ 10487       │
└───────────────────┴─────────────┘
\n```\n:::\n:::\n\n\n## Ibis (SQL)\n\nDefine query 4:\n\n::: {#55c131b2 .cell execution_count=55}\n``` {.python .cell-code}\nq4_sql = \"\"\"\nSELECT\n o_orderpriority,\n count(*) AS order_count\nFROM\n orders\nWHERE\n o_orderdate >= CAST('1993-07-01' AS date)\n AND o_orderdate < CAST('1993-10-01' AS date)\n AND EXISTS (\n SELECT\n *\n FROM\n lineitem\n WHERE\n l_orderkey = o_orderkey\n AND l_commitdate < l_receiptdate)\nGROUP BY\n o_orderpriority\nORDER BY\n o_orderpriority;\n\"\"\"\nq4_sql = q4_sql.strip().strip(\";\")\n\n\ndef q4(lineitem, orders, dialect=\"duckdb\", **kwargs):\n return orders.sql(q4_sql, dialect=dialect)\n```\n:::\n\n\nRun query 4:\n\n::: {#692726a8 .cell execution_count=56}\n``` {.python .cell-code}\nres = q4(lineitem, orders)\nres\n```\n\n::: {.cell-output .cell-output-display execution_count=110}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓\n┃ o_orderpriority  order_count ┃\n┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩\n│ stringint64       │\n├─────────────────┼─────────────┤\n│ 1-URGENT       10594 │\n│ 2-HIGH         10476 │\n│ 3-MEDIUM       10410 │\n│ 4-NOT SPECIFIED10556 │\n│ 5-LOW          10487 │\n└─────────────────┴─────────────┘\n
\n```\n:::\n:::\n\n\n:::\n\nFinally, we write the result to a Parquet file. We are measuring the\nexecution time in seconds of calling the query and writing the results to disk.\n\n## Next steps\n\nWe'll publish the next iteration of this benchmark soon with updated Polars\nTPC-H queries and using newer versions of all libraries. Polars v1.0.0 should\nrelease soon. A new DataFusion version that fixes the remaining failing queries\nis also expected soon.\n\nIf you spot anything wrong, have any questions, or want to share your own\nanalysis, feel free to share below!\n\n", + "supporting": [ + "index_files/figure-html" + ], + "filters": [], + "includes": { + "include-in-header": [ + "\n\n\n\n\n" + ] + } + } +} \ No newline at end of file diff --git a/docs/posts/ibis-bench/.gitignore b/docs/posts/ibis-bench/.gitignore new file mode 100644 index 000000000000..e89505a7fc15 --- /dev/null +++ b/docs/posts/ibis-bench/.gitignore @@ -0,0 +1,3 @@ +tpch_data +results_data +bench_logs_v* diff --git a/docs/posts/ibis-bench/figure1.png b/docs/posts/ibis-bench/figure1.png new file mode 100644 index 000000000000..b163cb6c8545 Binary files /dev/null and b/docs/posts/ibis-bench/figure1.png differ diff --git a/docs/posts/ibis-bench/index.qmd b/docs/posts/ibis-bench/index.qmd new file mode 100644 index 000000000000..397824c7220a --- /dev/null +++ b/docs/posts/ibis-bench/index.qmd @@ -0,0 +1,1315 @@ +--- +title: "Ibis benchmarking: DuckDB, DataFusion, Polars" +author: "Cody Peterson" +date: "2024-06-24" +image: "figure1.png" +categories: + - benchmark + - duckdb + - datafusion + - polars +--- + +*The best benchmark is your own workload on your own data*. + +## Key considerations + +The purpose of this post is to explore some benchmarking data with Ibis. We'll +compare three modern single-node query engines, explore the Ibis API as a great +choice for each of them, and discuss the results. + +### The benchmark + +:::{.callout-important title="Not an official TPC-H benchmark"} +This is not an [official TPC-H benchmark](https://www.tpc.org/tpch). We ran a +derivate of the TPC-H benchmark. +::: + +[The TPC-H benchmark](https://www.tpc.org/tpch) is a benchmark for databases +and, [increasingly](https://docs.coiled.io/blog/tpch), +[dataframes](https://pola.rs/posts/benchmarks)! It consists of 22 queries +across 8 tables. The SQL (or dataframe) code representing the queries is +designed to test the performance of a query engine on a variety of tasks +including filtering, aggregation, and joins. SQL queries are defined by the +TPC-H benchmark. We run the SQL queries and equivalent dataframe code via Ibis +and Polars APIs. + +The data for the benchmark can be generated at any scale factor, which roughly +corresponds to the size of the data in memory in gigabytes. For instance, a +scale factor of 10 would be about 10GB of data in memory. + +### The engines, the API, the code + +We'll use three modern single-node OLAP engines +([DuckDB](https://github.com/duckdb/duckdb), +[DataFusion](https://github.com/apache/datafusion), +[Polars](https://github.com/pola-rs/polars)) with the Ibis API via two coding +paradigms (dataframe and SQL). Ibis provides a consistent API across 20+ +backends, including these three. We run [SQL +code](https://github.com/lostmygithubaccount/ibis-bench/blob/v2.0.0/src/ibis_bench/queries/sql.py) +through Ibis in addition to [dataframe +code](https://github.com/lostmygithubaccount/ibis-bench/blob/v2.0.0/src/ibis_bench/queries/ibis.py) +to get a sense of any overhead in Ibis dataframe code. 
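+
+To make that concrete, here's a minimal sketch (not the benchmark harness
+itself) of the same tiny aggregation expressed both ways against a DuckDB
+backend -- the example table and column names here are made up for
+illustration:
+
+```python
+import ibis
+
+con = ibis.duckdb.connect()
+t = con.create_table("t", ibis.memtable({"x": [1, 2, 2, 3]}))
+
+# dataframe paradigm: Ibis compiles this expression to SQL for the backend
+expr_dataframe = t.group_by("x").agg(n=ibis._.count())
+
+# SQL paradigm: pass a SQL string through Ibis with `.sql()`
+expr_sql = t.sql("SELECT x, count(*) AS n FROM t GROUP BY x")
+
+print(expr_dataframe.to_pandas())
+print(expr_sql.to_pandas())
+```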
+ +:::{.callout-note} +Ibis dataframe code generates SQL for the DuckDB and DataFusion backends and +generates Polars API dataframe code for the Polars backend. +::: + +:::{.callout-note title="Honorable mention: chDB" collapse="true"} +[chDB](https://github.com/chdb-io/chdb) would be another great single-node OLAP +engine to benchmark. We don't because it's not currently a backend for Ibis, +though [there has been work done to make it +one](https://github.com/ibis-project/ibis/pull/8497). + +If you're interested in contributing to Ibis, a new backend like chDB could be a +great project for you! +::: + +9/22 queries for Ibis with the Polars backend fail from [lack of scalar subquery +support](#failing-polars-queries). Due to this and relatively experimental SQL +support in Polars, we've opted to run on [the Polars API +directly](https://github.com/lostmygithubaccount/ibis-bench/blob/v2.0.0/src/ibis_bench/queries/polars.py) +in this iteration of the benchmark. This is done with the LazyFrames API **and +no streaming engine** ([per the Polars team's +recommendation](https://github.com/pola-rs/polars/issues/16694#issuecomment-2146668559)). +This also allows us to compare the performance of the Polars backend through +Ibis with the Polars API directly for the queries that do succeed. + +#### Failing queries + +Queries fail for one of two reasons: + +1. The query doesn't work in the given system +2. The query otherwise failed on a given run + +We'll note the cases of the first below. The second is usually due to memory +pressure and [will be seen at higher scale +factors](#failing-queries-due-to-memory-pressure) throughout the data. + +#### Failing DataFusion queries + +Queries 16, 21, and 22 fail for the DataFusion backend via Ibis dataframe code, +and query 16 fails through SQL. Note that [all TPC-H SQL queries successfully +run through DataFusion +directly](https://github.com/apache/datafusion-benchmarks) -- Ibis generates SQL +that [hits a bug with DataFusion that has already been +fixed](https://github.com/apache/datafusion/issues/10830). We expect these +queries to work in the next iteration of this benchmark coming soon. + +#### Failing Polars queries + +Queries 11, 13-17, and 20-22 fail for the Polars backend via Ibis dataframe +code. These all fail due to lack of scalar subquery support in the backend. I've +[opened an issue](https://github.com/ibis-project/ibis/issues/9422) for tracking +and discussion. + +:::{.callout-tip title="Interested in contributing?"} +Increasing coverage of operations for a backend is a great place to start! +::: + +### How queries are written + +See [the source +code](https://github.com/lostmygithubaccount/ibis-bench/tree/v2.0.0/src/ibis_bench/queries) +for the exact queries used in this iteration of the benchmark. Polars recently +updated their TPC-H queries, so the next iteration of this benchmark would use +those. + +Queries were adapted from [Ibis TPC-H +queries](https://github.com/ibis-project/ibis/tree/main/ibis/backends/tests/tpch) +and [Polars TPC-H queries](https://github.com/pola-rs/tpch). The first 10 Ibis +dataframe queries were translated from the Polars dataframe queries, while the +rest were directly adapted from the Ibis repository. The SQL strings were +adapted from the Ibis repository. + +### How queries are run + +See [the source +code](https://github.com/lostmygithubaccount/ibis-bench/tree/v2.0.0) and +[methodology](https://ibis-bench.streamlit.app/methodology) for more details. 
In +short: + +- data is generated as a Parquet file per table + - standard DuckDB Parquet writer is used + - data is always downloaded onto a compute instance (no cloud storage reads) +- decimal types are converted to floats after reading + - works around several issues + - in the next iteration of this benchmark, we'll use the `decimal` type +- each query is run three times per configuration (system, scale factor, instance type) +- we measure the time to write the results of the query to a Parquet file + - this includes reading the Parquet file(s) and executing the query + +### Biases + +My name is Cody and I'm a Senior Technical Product Manager at [Voltron +Data](https://voltrondata.com). I am a contributor to the Ibis project and +employed to work on it -- I'm biased in favor of Ibis and the composable data +ecosystem. + +Ibis is [an independently governed open source +project](https://github.com/ibis-project/governance) that **is not owned by +Voltron Data**, though several steering committee members are employed by +Voltron Data. You can [read more about why Voltron Data supports +Ibis](../why-voda-supports-ibis/index.qmd), in addition to open source projects +like [Apache Arrow](https://github.com/apache/arrow) and +[Substrait](https://github.com/substrait-io/substrait). + +Voltron Data is a [Gold Supporter of the DuckDB +Foundation](https://duckdb.org/foundation) and [has a commercial relationship +with DuckDB Labs](https://duckdblabs.com) with regular syncs I tend to attend. +I also use [MotherDuck](https://motherduck.com) to host our [Ibis analytics +dashboard data](https://ibis-analytics.streamlit.app). + +## Results and analysis + +We'll use Ibis to analyze some of the benchmarking data. + +:::{.callout-tip} +We'll only look at a small subset of the data in this post. + +All the data is public, so you can follow along with the code and explore the +data yourself. You can also see the [Ibis benchmarking Streamlit +app](https://ibis-bench.streamlit.app) for further analysis. +::: + +```{python} +#| echo: false +#| code-fold: true +import warnings + +# this is to ignore a GCP warning +warnings.simplefilter("ignore") +``` + +### Reading the data + +To follow along, install the required Python packages: + +```bash +pip install gcsfs 'ibis-framework[duckdb]' plotly +``` + +The data is stored in a public Google Cloud Storage (GCS) bucket: + +```{python} +import os # <1> +import gcsfs # <1> + +BUCKET = "ibis-bench" # <2> + +dir_name = os.path.join(BUCKET, "bench_logs_v2", "cache") # <3> + +fs = gcsfs.GCSFileSystem() # <4> +fs.ls(dir_name)[-5:] # <5> +``` + +1. Imports +2. The public GCS bucket name +3. The directory in the bucket where the data is stored +4. Create a GCS filesystem object +5. List the last 5 files in the directory + +To start exploring the data, let's import Ibis and Plotly, set some options, and +register the GCS filesystem with the default (DuckDB) backend: + +```{python} +import ibis # <1> +import plotly.express as px # <2> + +px.defaults.template = "plotly_dark" # <3> + +ibis.options.interactive = True # <4> +ibis.options.repr.interactive.max_rows = 22 # <5> +ibis.options.repr.interactive.max_length = 22 # <6> +ibis.options.repr.interactive.max_columns = None # <7> + +con = ibis.get_backend() # <8> +con.register_filesystem(fs) # <9> +``` + +1. Import Ibis +2. Import Plotly +3. Set the Plotly template to dark +4. Enable interactive mode for Ibis +5. Set the maximum number of rows to display in interactive mode +6. 
Set the maximum length of nested types to display in interactive mode +7. Set the maximum number of columns to display in interactive mode +8. Get the default (DuckDB) backend +9. Register the GCS filesystem with the default backend + +```{python} +#| echo: false +#| code-fold: true +con.raw_sql("PRAGMA disable_progress_bar;"); +``` + +Now read the data and take a look at the first few rows: + +```{python} +t = ( # <1> + ibis.read_parquet(f"gs://{dir_name}/file_id=*.parquet") # <2> + .mutate( # <3> + timestamp=ibis._["timestamp"].cast("timestamp"), + ) # <3> + .relocate( # <4> + "instance_type", + "system", + "sf", + "query_number", + "execution_seconds", + "timestamp", + ) # <4> + .cache() # <5> +) +t.head() # <6> +``` + +1. Assign the table to a variable +2. Read the Parquet files from GCS +3. Cast the `timestamp` column to a timestamp type +4. Reorder the columns +5. Cache the table to avoid re-reading cloud data +6. Display the first few rows + +We'll also create a table with details on each instance type including the CPU +type, number of cores, and memory in gigabytes: + +```{python} +#| code-fold: true +#| code-summary: "Show code to get instance details" +cpu_type_cases = ( + ibis.case() + .when( + ibis._["instance_type"].startswith("n2d"), + "AMD EPYC", + ) + .when( + ibis._["instance_type"].startswith("n2"), + "Intel Cascade and Ice Lake", + ) + .when( + ibis._["instance_type"].startswith("c3"), + "Intel Sapphire Rapids", + ) + .when( + ibis._["instance_type"] == "work laptop", + "Apple M1 Max", + ) + .when( + ibis._["instance_type"] == "personal laptop", + "Apple M2 Max", + ) + .else_("unknown") + .end() +) +cpu_num_cases = ( + ibis.case() + .when( + ibis._["instance_type"].contains("-"), + ibis._["instance_type"].split("-")[-1].cast("int"), + ) + .when(ibis._["instance_type"].contains("laptop"), 12) + .else_(0) + .end() +) +memory_gb_cases = ( + ibis.case() + .when( + ibis._["instance_type"].contains("-"), + ibis._["instance_type"].split("-")[-1].cast("int") * 4, + ) + .when(ibis._["instance_type"] == "work laptop", 32) + .when(ibis._["instance_type"] == "personal laptop", 96) + .else_(0) + .end() +) + +instance_details = ( + t.group_by("instance_type") + .agg() + .mutate( + cpu_type=cpu_type_cases, cpu_cores=cpu_num_cases, memory_gbs=memory_gb_cases + ) +).order_by("memory_gbs", "cpu_cores", "instance_type") + +cpu_types = sorted( + instance_details.distinct(on="cpu_type")["cpu_type"].to_pyarrow().to_pylist() +) + +instance_details +``` + +### What's in the data? + +With the data, we can see we ran the benchmark on scale factors: + +```{python} +sfs = sorted(t.distinct(on="sf")["sf"].to_pyarrow().to_pylist()) +sfs +``` + +:::{.callout-note title="What is a scale factor?" collapse="true"} +A scale factor is roughly the size of the data in memory in gigabytes. For +example, a scale factor of 1 means the data is roughly 1GB in memory. + +Stored on disk in (compressed) Parquet format, the data is smaller -- about +0.38GB for scale factor 1 with the compression settings used in this benchmark. +::: + +We can look at the total execution time by scale factor: + +```{python} +#| code-fold: true +#| code-summary: "Show bar plot code" +c = px.bar( + t.group_by("sf").agg(total_seconds=t["execution_seconds"].sum()), + x="sf", + y="total_seconds", + category_orders={"sf": sfs}, + title="total execution time by scale factor", +) +c +``` + +You can see this is roughly linear as expected. 
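+
+As a rough check on that claim, we can normalize total execution time by the
+scale factor -- if runtime scaled perfectly linearly with data size, this ratio
+would be roughly constant. This is just a sketch using the `t` table loaded
+above:
+
+```python
+# total execution seconds per unit of scale factor
+(
+    t.group_by("sf")
+    .agg(total_seconds=t["execution_seconds"].sum())
+    .mutate(seconds_per_sf=ibis._["total_seconds"] / ibis._["sf"])
+    .order_by("sf")
+)
+```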
+
+We ran on the following queries:
+
+```{python}
+query_numbers = sorted(
+    t.distinct(on="query_number")["query_number"].to_pyarrow().to_pylist()
+)
+query_numbers
+```
+
+:::{.callout-note title="What is a query number?" collapse="true"}
+The TPC-H benchmark defines 22 queries. See the [TPC-H benchmark
+specification](https://www.tpc.org/TPC_Documents_Current_Versions/pdf/TPC-H_v3.0.1.pdf)
+for more information.
+:::
+
+We can look at the total execution time by query number:
+
+```{python}
+#| code-fold: true
+#| code-summary: "Show bar plot code"
+c = px.bar(
+    t.group_by("query_number").agg(total_seconds=t["execution_seconds"].sum()),
+    x="query_number",
+    y="total_seconds",
+    category_orders={"query_number": query_numbers},
+    title="total execution time by query number",
+)
+c
+```
+
+This gives us a sense of the relative complexity of the queries.
+
+We ran on the following instance types:
+
+```{python}
+instance_types = sorted(
+    t.distinct(on="instance_type")["instance_type"].to_pyarrow().to_pylist(),
+    key=lambda x: (x.split("-")[0], int(x.split("-")[-1]))  # <1>
+    if "-" in x  # <1>
+    else ("z" + x[3], 0),  # <2>
+)
+instance_types
+```
+
+1. This is to sort the instance types by CPU architecture and number of cores
+2. This is to sort "personal laptop" after "work laptop"
+
+:::{.callout-note title="What is an instance type?" collapse="true"}
+An instance type is the compute the benchmark was run on. This consists of two
+MacBook Pro laptops (one work and one personal) and a number of Google Cloud
+Compute Engine instances.
+
+For cloud VMs, the instance type is in the form of `<family>-<type>-<cores>`,
+where:
+
+- `<family>` specifies the CPU architecture (Intel X, AMD Y)
+- `<type>` modifies the CPU to memory ratio (only `standard` is used, with a 1:4 ratio)
+- `<cores>` is the number of vCPUs
+
+For example, `n2d-standard-2` is a Google Cloud Compute Engine instance with an
+AMD EPYC processor, 2 vCPUs, and 8GB of memory.
+:::
+
+We can look at the total execution time by instance type:
+
+```{python}
+#| code-fold: true
+#| code-summary: "Show bar plot code"
+c = px.bar(
+    t.group_by("instance_type")
+    .agg(total_seconds=t["execution_seconds"].sum())
+    .join(instance_details, "instance_type"),
+    x="instance_type",
+    y="total_seconds",
+    color="cpu_type",
+    hover_data=["cpu_cores", "memory_gbs"],
+    category_orders={
+        "instance_type": instance_types,
+        "cpu_type": cpu_types,
+    },
+    title="total execution time by instance type",
+)
+c
+```
+
+Unsurprisingly, this is inversely correlated with the number of CPU cores and
+(crucially) memory:
+
+```{python}
+#| code-fold: true
+#| code-summary: "Show bar plot code"
+c = px.bar(
+    instance_details,
+    x="instance_type",
+    y="memory_gbs",
+    color="cpu_type",
+    hover_data=["cpu_cores", "memory_gbs"],
+    category_orders={
+        "instance_type": instance_types,
+        "cpu_type": cpu_types,
+    },
+    title="memory by instance type",
+)
+c
+```
+
+We ran on the following systems:
+
+```{python}
+systems = sorted(t.distinct(on="system")["system"].to_pyarrow().to_pylist())
+systems
+```
+
+:::{.callout-note title="What is a system?" collapse="true"}
+For convenience in this benchmark, a 'system' is defined as a hyphen-separated
+naming convention where:
+
+- `ibis-*`: Ibis API was used
+  - `ibis-<backend>`: Ibis dataframe code was used with the given backend
+  - `ibis-<backend>-sql`: SQL code was used via Ibis on the given backend
+- `polars-*`: Polars API was used
+  - `polars-lazy`: Polars was used with the LazyFrames API
+:::
+
+We can look at the total execution time by system:
+
+```{python}
+#| code-fold: true
+#| code-summary: "Show bar plot code"
+c = px.bar(
+    t.group_by("system").agg(
+        total_seconds=t["execution_seconds"].sum(),
+        seconds_per_query=t["execution_seconds"].mean(),
+        num_records=t.count(),
+    ),
+    x="system",
+    y="total_seconds",
+    color="system",
+    category_orders={"system": systems},
+    title="total execution time by system",
+)
+c
+```
+
+:::{.callout-warning title="This can be misleading!"}
+At this point, we have to dig deeper into the data to understand the takeaways.
+You might look at the above and think `ibis-polars` is the fastest all-around,
+but it's not! Recall [9/22 queries for the Polars backend are
+failing](#failing-polars-queries), and at larger scale factors we start to see
+several systems fail queries due to memory pressure.
+:::
+
+### Execution time by system, scale factor, instance type, and query
+
+We'll aggregate the data over the dimensions we care about:
+
+```{python}
+agg = (
+    t.group_by("instance_type", "system", "sf", "n_partitions", "query_number")
+    .agg(
+        mean_execution_seconds=t["execution_seconds"].mean(),
+    )
+    .join(instance_details, "instance_type")
+)
+agg.head(3)
+```
+
+There's a lot of data and it's difficult to visualize all at once. We'll build
+up our understanding with a few plots.
+
+```{python}
+# | code-fold: true
+# | code-summary: "Show code for timings_plot"
+def timings_plot(
+    agg,
+    sf_filter=128,
+    systems_filter=systems,
+    instances_filter=[instance for instance in instance_types if "laptop" in instance],
+    queries_filter=query_numbers,
+    log_y=True,
+):
+    data = (
+        agg.filter(agg["sf"] == sf_filter)
+        .filter(agg["system"].isin(systems_filter))
+        .filter(agg["instance_type"].isin(instances_filter))
+        .filter(agg["query_number"].isin(queries_filter))
+    )
+
+    c = px.bar(
+        data,
+        x="query_number",
+        y="mean_execution_seconds",
+        log_y=log_y,
+        color="system",
+        barmode="group",
+        pattern_shape="instance_type",
+        category_orders={
+            "system": systems,
+            "instance_type": instance_types,
+        },
+        hover_data=["cpu_type", "cpu_cores", "memory_gbs"],
+        title=f"sf: {sf_filter}",
+    )
+
+    return c
+```
+
+First, let's visualize execution time for a given scale factor, system, query,
+and family of instance types:
+
+```{python}
+sf_filter = 128
+systems_filter = ["ibis-duckdb"]
+instances_filter = [
+    instance for instance in instance_types if instance.startswith("n2d")
+]
+queries_filter = [1]
+log_y = False
+
+timings_plot(agg, sf_filter, systems_filter, instances_filter, queries_filter, log_y)
+```
+
+From left to right, we have increasing instance resources (CPU cores and memory
+-- you can hover over the data to see the details). You can also zoom and select
+specific labels to focus on. We notice that, as expected, queries execute faster
+when given more resources.
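+
+To put numbers on that trend, here's a sketch that shows the same slice as a
+table, with the instance details alongside the mean execution time:
+
+```python
+# mean execution seconds per instance type for the slice plotted above
+(
+    agg.filter(agg["sf"] == sf_filter)
+    .filter(agg["system"].isin(systems_filter))
+    .filter(agg["instance_type"].isin(instances_filter))
+    .filter(agg["query_number"].isin(queries_filter))
+    .select("instance_type", "cpu_cores", "memory_gbs", "mean_execution_seconds")
+    .order_by("cpu_cores")
+)
+```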
+ +Now let's add a second system: + +```{python} +systems_filter = ["ibis-duckdb", "ibis-duckdb-sql"] + +timings_plot(agg, sf_filter, systems_filter, instances_filter, queries_filter, log_y) +``` + +:::{.callout-note title="Ibis dataframe code vs Ibis SQL code" collapse="true"} +`ibis-duckdb` is running the TPC-H queries written as Ibis dataframe code. The +`ibis-duckdb-sql` system is running the same queries but written as SQL code +passed into `.sql()` in Ibis as strings. The intent is to see if Ibis dataframe +code is introducing any significant overhead. While ideally we'd run on the +backend's Python client without Ibis in the mix, this keeps the benchmarking +process simple and should serve as a decent proxy. +::: + +In this case, we do see that Ibis dataframe code is adding some overhead. But, +this is a single data point -- let's expand to the first 7 queries: + +:::{.callout-note title="Logging the y-axis"} +From here, we'll set `log_y=True` due to the wide range of execution times. + +We also look at the first 7 queries due to limited horizontal space on this +website. Analyze and visualize the data yourself for all 22 queries! Or [see +the Ibis benchmarking Streamlit app](https://ibis-bench.streamlit.app). +::: + +```{python} +log_y = True +queries_filter = range(1, 7+1) + +timings_plot(agg, sf_filter, systems_filter, instances_filter, queries_filter, log_y) +``` + +This tells a different story. Sometimes Ibis dataframe code is a bit faster, +sometimes a bit slower. Let's compute the totals: + +```{python} +( + agg.filter(agg["sf"] == sf_filter) + .filter(agg["system"].isin(systems_filter)) + .filter(agg["instance_type"].isin(instances_filter)) + .filter(agg["query_number"].isin(queries_filter)) + .group_by("system") + .agg( + total_execution_seconds=agg["mean_execution_seconds"].sum(), + total_queries=ibis._.count(), + ) + .mutate( + seconds_per_query=ibis._["total_execution_seconds"] / ibis._["total_queries"] + ) +) +``` + +Ibis dataframe code is a little faster overall, but this is on a subset of +queries and scale factors and instance types. More analysis and profiling would +be needed to make a definitive statement, but in general we can be happy that +DuckDB does a great job optimizing the SQL Ibis generates and that Ibis +dataframe code isn't adding significant overhead. + +Let's repeat this for DataFusion: + +```{python} +systems_filter = ["ibis-datafusion", "ibis-datafusion-sql"] + +timings_plot(agg, sf_filter, systems_filter, instances_filter, queries_filter, log_y) +``` + +We see a similar story. Let's confirm with a table: + +```{python} +( + agg.filter(agg["sf"] == sf_filter) + .filter(agg["system"].isin(systems_filter)) + .filter(agg["instance_type"].isin(instances_filter)) + .filter(agg["query_number"].isin(queries_filter)) + .group_by("system") + .agg( + total_execution_seconds=agg["mean_execution_seconds"].sum(), + total_queries=ibis._.count(), + ) + .mutate( + seconds_per_query=ibis._["total_execution_seconds"] / ibis._["total_queries"] + ) +) +``` + +This time Ibis dataframe code is a bit slower overall. **However, also notice +two queries are missing from `ibis-datafusion-sql`**. These are query 7 on +`n2d-standard-2` and `n2d-standard-4` (the two instances with the least memory). +We'll investigate failing queries more thoroughly in the next section. 
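+
+The missing configurations can be spotted directly by counting the distinct
+queries present per system and instance type -- a quick sketch using the same
+filters as above:
+
+```python
+# count the distinct queries with results per (system, instance type)
+(
+    agg.filter(agg["sf"] == sf_filter)
+    .filter(agg["system"].isin(systems_filter))
+    .filter(agg["instance_type"].isin(instances_filter))
+    .filter(agg["query_number"].isin(queries_filter))
+    .group_by("system", "instance_type")
+    .agg(queries_present=ibis._["query_number"].nunique())
+    .order_by("system", "instance_type")
+)
+```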
+ +First, let's look at Polars: + +```{python} +systems_filter = ["ibis-polars", "polars-lazy"] + +timings_plot(agg, sf_filter, systems_filter, instances_filter, queries_filter, log_y) +``` + +A lot of queries are missing from `ibis-polars` and `polars-lazy`. These are +failing due to the high scale factor and limited memory on the instances. + +Let's look at a lower scale factor and my MacBooks (Polars tended to perform +better on these): + +```{python} +sf_filter = 64 +instances_filter = [ + instance for instance in instance_types if "laptop" in instance +] + +timings_plot(agg, sf_filter, systems_filter, instances_filter, queries_filter, log_y) +``` + +We see a similar pattern as above -- some queries are a little faster on +`ibis-polars`, though some are **much** slower. In particular queries 1 and 2 +tend to have a lot of overhead. + +```{python} +( + agg.filter(agg["sf"] == sf_filter) + .filter(agg["system"].isin(systems_filter)) + .filter(agg["instance_type"].isin(instances_filter)) + .filter(agg["query_number"].isin(queries_filter)) + .group_by("system") + .agg( + total_execution_seconds=agg["mean_execution_seconds"].sum(), + total_queries=ibis._.count(), + ) + .mutate( + seconds_per_query=ibis._["total_execution_seconds"] / ibis._["total_queries"] + ) +) +``` + +Let's now compare all systems across a single instance type and query: + +```{python} +sf_filter = 128 +instances_filter = ["n2d-standard-32"] +systems_filter = systems +queries_filter = [1] + +timings_plot(agg, sf_filter, systems_filter, instances_filter, queries_filter, log_y) +``` + +And then the first 7 queries: + +```{python} +queries_filter = range(1, 7+1) + +timings_plot(agg, sf_filter, systems_filter, instances_filter, queries_filter, log_y) +``` + +:::{.callout-warning title="Lots of data, lots of takeaways"} +There is a lot of data and it's easy to summarize and visualize it in a way that +favors a given system. There's a lot of missing data that needs to be accounted +for, as it often indicates a query that failed due to memory pressure. + +Each system has strengths and weaknesses. See [the discussion section +below](#which-system-is-best). + +See the [Ibis benchmarking Streamlit app](https://ibis-bench.streamlit.app) for +further analysis, or query the data yourself! +::: + +### Failing queries due to memory pressure + +Many queries fail due to memory pressure at higher scale factors with +insufficient resources. Impressively, the exception here is DuckDB. 
+ +```{python} +#| code-fold: true +#| code-summary: "Show code to get table of failing queries" +def failing_queries(agg, sf, instance_type): + failing = ( + t.filter(t["sf"] == sf) + .filter(t["instance_type"] == instance_type) + .group_by("system") + .agg(present_queries=ibis._["query_number"].collect().unique().sort()) + ) + failing = ( + failing.mutate( + failing_queries=t.distinct(on="query_number")["query_number"] + .collect() + .filter(lambda x: ~failing["present_queries"].contains(x)) + ) + .mutate( + num_failing_queries=ibis._["failing_queries"].length(), + num_successful_queries=ibis._["present_queries"].length(), + ) + .drop("present_queries") + .order_by("num_failing_queries", "system") + ) + + return failing +``` + +Let's look at the failing queries on the largest `n2d` instance:: + +```{python} +sf = 128 +instance_type = "n2d-standard-32" + +failing = failing_queries(agg, sf, instance_type) +failing +``` + +```{python} +# | code-fold: true +# | code-summary: "Show code to create a bar plot of the number of successful queries by system" +c = px.bar( + failing, + x="system", + y="num_successful_queries", + category_orders={ + "system": systems, + "query_number": query_numbers, + }, + color="system", + title="completed queries", +) +c +``` + +And the smallest: + +```{python} +instance_type = "n2d-standard-2" + +failing = failing_queries(agg, sf, instance_type) +failing +``` + +```{python} +# | code-fold: true +# | code-summary: "Show code to create a bar plot of the number of successful queries by system" +c = px.bar( + failing, + x="system", + y="num_successful_queries", + category_orders={ + "system": systems, + "query_number": query_numbers, + }, + color="system", + title="completed queries", +) +c +``` + +A lot of queries are failing on the smallest instance due to memory pressure. + +We can create a single visualization across the `n2d` instances: + +```{python} +#| code-fold: true +#| code-summary: "Show code to create a bar plot of the number of successful queries by system and instance type" +failing = t.group_by("instance_type", "system", "sf").agg( + total_time=t["execution_seconds"].sum(), + present_queries=ibis._["query_number"].collect().unique().sort(), +) +failing = ( + failing.mutate( + failing_queries=t.distinct(on="query_number")["query_number"] + .collect() + .filter(lambda x: ~failing["present_queries"].contains(x)), + ) + .mutate( + num_failing_queries=ibis._["failing_queries"].length(), + num_successful_queries=ibis._["present_queries"].length(), + ) + .drop("present_queries") + .relocate("instance_type", "system", "sf", "failing_queries") + .order_by("num_failing_queries", "instance_type", "system", "sf") +) +failing = failing.join(instance_details, "instance_type") +failing = ( + failing.filter( + (failing["sf"] == 128) & (failing["instance_type"].startswith("n2d-")) + ) +).order_by(ibis.desc("memory_gbs")) + +c = px.bar( + failing, + x="system", + y="num_successful_queries", + color="instance_type", + barmode="group", + hover_data=["cpu_cores", "memory_gbs"], + category_orders={ + "system": systems, + "instance_type": reversed( + [instance for instance in instance_types if instance.startswith("n2d")] + ), + }, + title="completed queries", +) +c +``` + +Within each system, from left to right, we have decreasing resources (vCPUs and +memory). We can see how each system performs on the same queries with different +resources. + +:::{.callout-warning title="Data is aggregated"} +Keep in mind data is aggregated over three runs of each query. 
For DuckDB, there
+was actually a single failure on the smallest instance for query 9, out of six
+runs across the two systems, but this does not appear above because we are
+checking for the success of the query in any of the three runs per
+configuration.
+:::
+
+## Discussion and reproducibility
+
+Benchmarking is fraught: it's easy to get wrong and ship your bias in the
+results. We don't want to end up as [Figure 1 in "Fair Benchmarking Considered
+Difficult: Common Pitfalls In Database Performance
+Testing"](https://hannes.muehleisen.org/publications/DBTEST2018-performance-testing.pdf):
+
+![Figure 1](figure1.png)
+
+If you have any questions or concerns, feel free to [open an
+issue](https://github.com/lostmygithubaccount/ibis-bench/issues/new) or comment
+on this post below.
+
+### Which system is best?
+
+Trick question! It depends on your use case. DuckDB is a simple, performant
+in-process database with an on-disk file format (SQLite for OLAP). DataFusion is
+an extensible query engine and is often used for building databases or query
+engines. Polars is an OLAP query engine with a Python dataframe API that can be
+used as a more performant alternative to pandas.
+
+All three make great Ibis backends and you can switch between them in a single
+line of code. This lets you write your code once and run it on the engine
+that's best for your use case. If a better engine comes along, you'll likely be
+able to use that too. And you can scale up and out across the 20+ backends Ibis
+supports as needed.
+
+TPC-H is a decent benchmark *for what it benchmarks, which is limited*. We're
+not running window functions, doing timeseries analysis, or feature engineering
+for machine learning. We're not using nested data types. We're not performing
+regexes or using LLMs in UDFs...
+
+It's easy to summarize and visualize benchmarking data in a way that favors a
+given system. You should favor the system that works best for your use case.
+
+### Performance converges over time
+
+Let's look at some quotes from ["Perf is not
+enough"](https://motherduck.com/blog/perf-is-not-enough) by Jordan Tigani of
+MotherDuck:
+
+> If you take a bunch of databases, all actively maintained, and iterate them
+> out a few years, **performance is going to converge**. If Clickhouse is applying
+> a technique that gives it an advantage for scan speed today, Snowflake will
+> likely have that within a year or two. If Snowflake adds incrementally
+> materialized views, BigQuery will soon follow. It is unlikely that important
+> performance differences will persist over time.
+>
+> As clever as the engineers working for any of these companies are, none of
+> them possess any magic incantations or things that cannot be replicated
+> elsewhere. Each database uses a different bag of tricks in order to get good
+> performance. One might compile queries to machine code, another might cache data
+> on local SSDs, and a third might use specialized network hardware to do
+> shuffles. **Given time, all of these techniques can be implemented by anyone. If
+> they work well, they likely will show up everywhere.**
+
+This is extra true for open source databases (or query engines). If DuckDB adds
+a feature that improves performance, it's likely that DataFusion and Polars will
+follow suit -- they can go read the source code and specific commits to see how
+it was done.
+
+### Reproducing the benchmark
+
+The source code [is available on
+GitHub](https://github.com/lostmygithubaccount/ibis-bench/tree/v2.0.0).
+ +#### A TPC-H benchmark on 6 systems in 3 commands + +First install `ibis-bench`: + +```bash +pip install ibis-bench +``` + +Then generate the TPC-H data: + +```bash +bench gen-data -s 1 +``` + +Finally run the benchmark: + +```bash +bench run -s 1 ibis-duckdb ibis-duckdb-sql ibis-datafusion ibis-datafusion-sql ibis-polars polars-lazy +``` + +Congratulations! You've run a TPC-H benchmark on DuckDB (Ibis dataframe code and +SQL), DataFusion (Ibis dataframe code and SQL), and Polars (dataframe code via +Ibis and native Polars). + +#### What just happened? + +This will generate TPC-H data at scale factor 1 as Parquet files in the +`tpch_data` directory: + +```bash +tpch_data +└── parquet + └── sf=1 + └── n=1 + ├── customer + │ └── 0000.parquet + ├── lineitem + │ └── 0000.parquet + ├── nation + │ └── 0000.parquet + ├── orders + │ └── 0000.parquet + ├── part + │ └── 0000.parquet + ├── partsupp + │ └── 0000.parquet + ├── region + │ └── 0000.parquet + └── supplier + └── 0000.parquet +``` + +The scale factor is roughly the size of data **in memory** in gigabytes (GBs). +The size of data on disk, however, is smaller because Parquet is compressed. We +can take a look at the size of the data: + +```bash +384M tpch_data/parquet/sf=1/n=1 +262M tpch_data/parquet/sf=1/n=1/lineitem + 59M tpch_data/parquet/sf=1/n=1/orders + 12M tpch_data/parquet/sf=1/n=1/customer + 43M tpch_data/parquet/sf=1/n=1/partsupp +6.6M tpch_data/parquet/sf=1/n=1/part +788K tpch_data/parquet/sf=1/n=1/supplier +4.0K tpch_data/parquet/sf=1/n=1/nation +4.0K tpch_data/parquet/sf=1/n=1/region +``` + +We can see the total size is 0.38 GB and the size of the tables -- `lineitem` is +by far the largest. + +Using `bench run` results in a `results_data` directory with the results of the +queries and a `bench_logs_v2` directory with the logs of the benchmark run. + +#### Analyzing the results + +We can use Ibis to load and analyze the log data: + +```{python} +import ibis + +ibis.options.interactive = True +ibis.options.repr.interactive.max_rows = 6 +ibis.options.repr.interactive.max_columns = None + +t = ibis.read_json("bench_logs_v*/raw_json/file_id=*.json").relocate( + "system", "sf", "query_number", "execution_seconds" +) +t +``` + +We can check the total execution time for each system: + +```{python} +t.group_by("system").agg(total_seconds=t["execution_seconds"].sum()).order_by( + "total_seconds" +) +``` + +We can visualize the results: + +```{python} +import plotly.express as px + +px.defaults.template = "plotly_dark" + +agg = t.group_by("system", "query_number").agg( + mean_execution_seconds=t["execution_seconds"].mean(), +) + +chart = px.bar( + agg, + x="query_number", + y="mean_execution_seconds", + color="system", + barmode="group", + title="Mean execution time by query", + category_orders={ + "system": sorted(t.select("system").distinct().to_pandas()["system"].tolist()) + }, +) +chart +``` + +#### What did we run and measure, exactly? 
+ +We can import `ibis_bench` as a library and read in the TPC-H tables: + +```{python} +import ibis +import polars as pl + +from datetime import date +from ibis_bench.utils.read_data import get_ibis_tables, get_polars_tables + +sf = 1 +``` + +:::{.panel-tabset} + +## Ibis (DuckDB) + +```{python} +con = ibis.connect("duckdb://") + +(customer, lineitem, nation, orders, part, partsupp, region, supplier) = ( + get_ibis_tables(sf=sf, con=con) +) +``` + +```{python} +#| echo: false +#| code-fold: true +con.raw_sql("PRAGMA disable_progress_bar;"); +``` + +```{python} +lineitem.order_by(ibis.desc("l_orderkey"), ibis.asc("l_partkey")) +``` + +```{python} +lineitem.count() +``` + +## Ibis (DataFusion) + +```{python} +con = ibis.connect("datafusion://") + +(customer, lineitem, nation, orders, part, partsupp, region, supplier) = ( + get_ibis_tables(sf=sf, con=con) +) +``` + +```{python} +lineitem.order_by(ibis.desc("l_orderkey"), ibis.asc("l_partkey")) +``` + +```{python} +lineitem.count() +``` + +## Ibis (Polars) + +```{python} +con = ibis.connect("polars://") + +(customer, lineitem, nation, orders, part, partsupp, region, supplier) = ( + get_ibis_tables(sf=sf, con=con) +) +``` + +```{python} +lineitem.order_by(ibis.desc("l_orderkey"), ibis.asc("l_partkey")) +``` + +```{python} +lineitem.count() +``` + +::: + +```{python} +#| echo: false +#| code-fold: true +con = ibis.connect("duckdb://") + +cusotmer, lineitem, nation, orders, part, partsupp, region, supplier = get_ibis_tables( + sf=sf, con=con +) +``` + +The queries are also defined in `ibis_bench.queries`. Let's look at query 4 as +an example for Ibis dataframe code, Polars dataframe code, and SQL code via +Ibis: + +:::{.panel-tabset} + +## Ibis (dataframe) + +Define query 4: + +```{python} +def q4(lineitem, orders, **kwargs): + var1 = date(1993, 7, 1) + var2 = date(1993, 10, 1) + + q_final = ( + lineitem.join(orders, lineitem["l_orderkey"] == orders["o_orderkey"]) + .filter((orders["o_orderdate"] >= var1) & (orders["o_orderdate"] < var2)) + .filter(lineitem["l_commitdate"] < lineitem["l_receiptdate"]) + .distinct(on=["o_orderpriority", "l_orderkey"]) + .group_by("o_orderpriority") + .agg(order_count=ibis._.count()) + .order_by("o_orderpriority") + ) + + return q_final +``` + +Run query 4: + +```{python} +res = q4(lineitem, orders) +res +``` + +## Polars (dataframe) + +Define query 4: + +```{python} +def q4(lineitem, orders, **kwargs): + var1 = date(1993, 7, 1) + var2 = date(1993, 10, 1) + + q_final = ( + lineitem.join(orders, left_on="l_orderkey", right_on="o_orderkey") + .filter(pl.col("o_orderdate").is_between(var1, var2, closed="left")) + .filter(pl.col("l_commitdate") < pl.col("l_receiptdate")) + .unique(subset=["o_orderpriority", "l_orderkey"]) + .group_by("o_orderpriority") + .agg(pl.len().alias("order_count")) + .sort("o_orderpriority") + ) + + return q_final +``` + +Run query 4: + +```{python} +res = q4(lineitem.to_polars().lazy(), orders.to_polars().lazy()).collect() +res +``` + +## Ibis (SQL) + +Define query 4: + +```{python} +q4_sql = """ +SELECT + o_orderpriority, + count(*) AS order_count +FROM + orders +WHERE + o_orderdate >= CAST('1993-07-01' AS date) + AND o_orderdate < CAST('1993-10-01' AS date) + AND EXISTS ( + SELECT + * + FROM + lineitem + WHERE + l_orderkey = o_orderkey + AND l_commitdate < l_receiptdate) +GROUP BY + o_orderpriority +ORDER BY + o_orderpriority; +""" +q4_sql = q4_sql.strip().strip(";") + + +def q4(lineitem, orders, dialect="duckdb", **kwargs): + return orders.sql(q4_sql, dialect=dialect) +``` + +Run query 4: 
+ +```{python} +res = q4(lineitem, orders) +res +``` + +::: + +Finally, we write the result to a Parquet file. We are measuring the +execution time in seconds of calling the query and writing the results to disk. + +## Next steps + +We'll publish the next iteration of this benchmark soon with updated Polars +TPC-H queries and using newer versions of all libraries. Polars v1.0.0 should +release soon. A new DataFusion version that fixes the remaining failing queries +is also expected soon. + +If you spot anything wrong, have any questions, or want to share your own +analysis, feel free to share below!