diff --git a/docs/_freeze/posts/ibis-bench/index/execute-results/html.json b/docs/_freeze/posts/ibis-bench/index/execute-results/html.json new file mode 100644 index 000000000000..898afa54a341 --- /dev/null +++ b/docs/_freeze/posts/ibis-bench/index/execute-results/html.json @@ -0,0 +1,16 @@ +{ + "hash": "813ef6319e015f6967cb8a583aba5a9d", + "result": { + "engine": "jupyter", + "markdown": "---\ntitle: \"Ibis benchmarking: DuckDB, DataFusion, Polars\"\nauthor: \"Cody Peterson\"\ndate: \"2024-06-24\"\nimage: \"figure1.png\"\ncategories:\n - benchmark\n - duckdb\n - datafusion\n - polars\n---\n\n*The best benchmark is your own workload on your own data*.\n\n## Key considerations\n\nThe purpose of this post is to explore some benchmarking data with Ibis. We'll\ncompare three modern single-node query engines, explore the Ibis API as a great\nchoice for each of them, and discuss the results.\n\n### The benchmark\n\n:::{.callout-important title=\"Not an official TPC-H benchmark\"}\nThis is not an [official TPC-H benchmark](https://www.tpc.org/tpch). We ran a\nderivate of the TPC-H benchmark.\n:::\n\n[The TPC-H benchmark](https://www.tpc.org/tpch) is a benchmark for databases\nand, [increasingly](https://docs.coiled.io/blog/tpch),\n[dataframes](https://pola.rs/posts/benchmarks)! It consists of 22 queries\nacross 8 tables. The SQL (or dataframe) code representing the queries is\ndesigned to test the performance of a query engine on a variety of tasks\nincluding filtering, aggregation, and joins. SQL queries are defined by the\nTPC-H benchmark. We run the SQL queries and equivalent dataframe code via Ibis\nand Polars APIs.\n\nThe data for the benchmark can be generated at any scale factor, which roughly\ncorresponds to the size of the data in memory in gigabytes. For instance, a\nscale factor of 10 would be about 10GB of data in memory.\n\n### The engines, the API, the code\n\nWe'll use three modern single-node OLAP engines\n([DuckDB](https://github.com/duckdb/duckdb),\n[DataFusion](https://github.com/apache/datafusion),\n[Polars](https://github.com/pola-rs/polars)) with the Ibis API via two coding\nparadigms (dataframe and SQL). Ibis provides a consistent API across 20+\nbackends, including these three. We run [SQL\ncode](https://github.com/lostmygithubaccount/ibis-bench/blob/v2.0.0/src/ibis_bench/queries/sql.py)\nthrough Ibis in addition to [dataframe\ncode](https://github.com/lostmygithubaccount/ibis-bench/blob/v2.0.0/src/ibis_bench/queries/ibis.py)\nto get a sense of any overhead in Ibis dataframe code.\n\n:::{.callout-note}\nIbis dataframe code generates SQL for the DuckDB and DataFusion backends and\ngenerates Polars API dataframe code for the Polars backend.\n:::\n\n:::{.callout-note title=\"Honorable mention: chDB\" collapse=\"true\"}\n[chDB](https://github.com/chdb-io/chdb) would be another great single-node OLAP\nengine to benchmark. We don't because it's not currently a backend for Ibis,\nthough [there has been work done to make it\none](https://github.com/ibis-project/ibis/pull/8497).\n\nIf you're interested in contributing to Ibis, a new backend like chDB could be a\ngreat project for you!\n:::\n\n9/22 queries for Ibis with the Polars backend fail from [lack of scalar subquery\nsupport](#failing-polars-queries). Due to this and relatively experimental SQL\nsupport in Polars, we've opted to run on [the Polars API\ndirectly](https://github.com/lostmygithubaccount/ibis-bench/blob/v2.0.0/src/ibis_bench/queries/polars.py)\nin this iteration of the benchmark. 
This is done with the LazyFrames API **and\nno streaming engine** ([per the Polars team's\nrecommendation](https://github.com/pola-rs/polars/issues/16694#issuecomment-2146668559)).\nThis also allows us to compare the performance of the Polars backend through\nIbis with the Polars API directly for the queries that do succeed.\n\n#### Failing queries\n\nQueries fail for one of two reasons:\n\n1. The query doesn't work in the given system\n2. The query otherwise failed on a given run\n\nWe'll note the cases of the first below. The second is usually due to memory\npressure and [will be seen at higher scale\nfactors](#failing-queries-due-to-memory-pressure) throughout the data.\n\n#### Failing DataFusion queries\n\nQueries 16, 21, and 22 fail for the DataFusion backend via Ibis dataframe code,\nand query 16 fails through SQL. Note that [all TPC-H SQL queries successfully\nrun through DataFusion\ndirectly](https://github.com/apache/datafusion-benchmarks) -- Ibis generates SQL\nthat [hits a bug with DataFusion that has already been\nfixed](https://github.com/apache/datafusion/issues/10830). We expect these\nqueries to work in the next iteration of this benchmark coming soon.\n\n#### Failing Polars queries\n\nQueries 11, 13-17, and 20-22 fail for the Polars backend via Ibis dataframe\ncode. These all fail due to lack of scalar subquery support in the backend. I've\n[opened an issue](https://github.com/ibis-project/ibis/issues/9422) for tracking\nand discussion.\n\n:::{.callout-tip title=\"Interested in contributing?\"}\nIncreasing coverage of operations for a backend is a great place to start!\n:::\n\n### How queries are written\n\nSee [the source\ncode](https://github.com/lostmygithubaccount/ibis-bench/tree/v2.0.0/src/ibis_bench/queries)\nfor the exact queries used in this iteration of the benchmark. Polars recently\nupdated their TPC-H queries, so the next iteration of this benchmark would use\nthose.\n\nQueries were adapted from [Ibis TPC-H\nqueries](https://github.com/ibis-project/ibis/tree/main/ibis/backends/tests/tpch)\nand [Polars TPC-H queries](https://github.com/pola-rs/tpch). The first 10 Ibis\ndataframe queries were translated from the Polars dataframe queries, while the\nrest were directly adapted from the Ibis repository. The SQL strings were\nadapted from the Ibis repository.\n\n### How queries are run\n\nSee [the source\ncode](https://github.com/lostmygithubaccount/ibis-bench/tree/v2.0.0) and\n[methodology](https://ibis-bench.streamlit.app/methodology) for more details. In\nshort:\n\n- data is generated as a Parquet file per table\n - standard DuckDB Parquet writer is used\n - data is always downloaded onto a compute instance (no cloud storage reads)\n- decimal types are converted to floats after reading\n - works around several issues\n - in the next iteration of this benchmark, we'll use the `decimal` type\n- each query is run three times per configuration (system, scale factor, instance type)\n- we measure the time to write the results of the query to a Parquet file\n - this includes reading the Parquet file(s) and executing the query\n\n### Biases\n\nMy name is Cody and I'm a Senior Technical Product Manager at [Voltron\nData](https://voltrondata.com). 
I am a contributor to the Ibis project and\nemployed to work on it -- I'm biased in favor of Ibis and the composable data\necosystem.\n\nIbis is [an independently governed open source\nproject](https://github.com/ibis-project/governance) that **is not owned by\nVoltron Data**, though several steering committee members are employed by\nVoltron Data. You can [read more about why Voltron Data supports\nIbis](../why-voda-supports-ibis/index.qmd), in addition to open source projects\nlike [Apache Arrow](https://github.com/apache/arrow) and\n[Substrait](https://github.com/substrait-io/substrait).\n\nVoltron Data is a [Gold Supporter of the DuckDB\nFoundation](https://duckdb.org/foundation) and [has a commercial relationship\nwith DuckDB Labs](https://duckdblabs.com) with regular syncs I tend to attend.\nI also use [MotherDuck](https://motherduck.com) to host our [Ibis analytics\ndashboard data](https://ibis-analytics.streamlit.app).\n\n## Results and analysis\n\nWe'll use Ibis to analyze some of the benchmarking data.\n\n:::{.callout-tip}\nWe'll only look at a small subset of the data in this post.\n\nAll the data is public, so you can follow along with the code and explore the\ndata yourself. You can also see the [Ibis benchmarking Streamlit\napp](https://ibis-bench.streamlit.app) for further analysis.\n:::\n\n\n\n### Reading the data\n\nTo follow along, install the required Python packages:\n\n```bash\npip install gcsfs 'ibis-framework[duckdb]' plotly\n```\n\nThe data is stored in a public Google Cloud Storage (GCS) bucket:\n\n::: {#26f3327a .cell execution_count=3}\n``` {.python .cell-code}\nimport os # <1>\nimport gcsfs # <1>\n\nBUCKET = \"ibis-bench\" # <2>\n\ndir_name = os.path.join(BUCKET, \"bench_logs_v2\", \"cache\") # <3>\n\nfs = gcsfs.GCSFileSystem() # <4>\nfs.ls(dir_name)[-5:] # <5>\n```\n\n::: {.cell-output .cell-output-display execution_count=57}\n```\n['ibis-bench/bench_logs_v2/cache/file_id=b6236086-7fff-4569-8731-b97a635243bd.parquet',\n 'ibis-bench/bench_logs_v2/cache/file_id=cbc0c7b1-e659-4adb-8c80-4077cd4d39ab.parquet',\n 'ibis-bench/bench_logs_v2/cache/file_id=d91454ad-2ddd-408a-bbfd-6b159dd2132b.parquet',\n 'ibis-bench/bench_logs_v2/cache/file_id=debc7203-f366-44d2-94f1-2518e6f7425f.parquet',\n 'ibis-bench/bench_logs_v2/cache/file_id=e875d852-f7e7-473c-9440-92b8f2445f3a.parquet']\n```\n:::\n:::\n\n\n1. Imports\n2. The public GCS bucket name\n3. The directory in the bucket where the data is stored\n4. Create a GCS filesystem object\n5. List the last 5 files in the directory\n\nTo start exploring the data, let's import Ibis and Plotly, set some options, and\nregister the GCS filesystem with the default (DuckDB) backend:\n\n::: {#0590f851 .cell execution_count=4}\n``` {.python .cell-code}\nimport ibis # <1>\nimport plotly.express as px # <2>\n\npx.defaults.template = \"plotly_dark\" # <3>\n\nibis.options.interactive = True # <4>\nibis.options.repr.interactive.max_rows = 22 # <5>\nibis.options.repr.interactive.max_length = 22 # <6>\nibis.options.repr.interactive.max_columns = None # <7>\n\ncon = ibis.get_backend() # <8>\ncon.register_filesystem(fs) # <9>\n```\n:::\n\n\n1. Import Ibis\n2. Import Plotly\n3. Set the Plotly template to dark\n4. Enable interactive mode for Ibis\n5. Set the maximum number of rows to display in interactive mode\n6. Set the maximum length of nested types to display in interactive mode\n7. Set the maximum number of columns to display in interactive mode\n8. Get the default (DuckDB) backend\n9. 
Register the GCS filesystem with the default backend\n\n\n\nNow read the data and take a look at the first few rows:\n\n::: {#c90a9cc3 .cell execution_count=6}\n``` {.python .cell-code}\nt = ( # <1>\n ibis.read_parquet(f\"gs://{dir_name}/file_id=*.parquet\") # <2>\n .mutate( # <3>\n timestamp=ibis._[\"timestamp\"].cast(\"timestamp\"),\n ) # <3>\n .relocate( # <4>\n \"instance_type\",\n \"system\",\n \"sf\",\n \"query_number\",\n \"execution_seconds\",\n \"timestamp\",\n ) # <4>\n .cache() # <5>\n)\nt.head() # <6>\n```\n\n::: {.cell-output .cell-output-display execution_count=60}\n```{=html}\n
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ instance_type ┃ system ┃ sf ┃ query_number ┃ execution_seconds ┃ timestamp ┃ session_id ┃ n_partitions ┃ file_type ┃ file_id ┃\n┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ string │ int64 │ int64 │ float64 │ timestamp(6) │ uuid │ int64 │ string │ string │\n├───────────────┼─────────────────┼───────┼──────────────┼───────────────────┼────────────────────────────┼──────────────────────────────────────┼──────────────┼───────────┼──────────────────────────────────────────────┤\n│ n2-standard-4 │ polars-lazy │ 128 │ 16 │ 9.503613 │ 2024-06-10 08:04:31.233704 │ 6708e5d3-2b8c-4ce0-adf8-65ce94e0bff1 │ 1 │ parquet │ 0949deaf-5f8f-4c29-b2ca-934e07173223.parquet │\n│ n2-standard-4 │ ibis-duckdb │ 64 │ 7 │ 35.826295 │ 2024-06-10 21:05:18.423375 │ 9a00385f-22b4-42df-ab3d-c63ed1a33a2e │ 1 │ parquet │ 0949deaf-5f8f-4c29-b2ca-934e07173223.parquet │\n│ n2-standard-4 │ ibis-duckdb │ 128 │ 16 │ 7.376196 │ 2024-06-11 03:44:22.901852 │ acb56c6b-b0d5-4bbc-8791-3542b62bd193 │ 1 │ parquet │ 0949deaf-5f8f-4c29-b2ca-934e07173223.parquet │\n│ n2-standard-4 │ ibis-datafusion │ 16 │ 7 │ 8.655290 │ 2024-06-09 20:29:31.833510 │ a07fe07d-7a08-4802-b8ae-918e66e2d868 │ 1 │ parquet │ 0949deaf-5f8f-4c29-b2ca-934e07173223.parquet │\n│ n2-standard-4 │ ibis-duckdb-sql │ 1 │ 10 │ 0.447325 │ 2024-06-10 08:11:31.244609 │ d523eec6-d2de-491d-b541-348c6b5bfc65 │ 1 │ parquet │ 0949deaf-5f8f-4c29-b2ca-934e07173223.parquet │\n└───────────────┴─────────────────┴───────┴──────────────┴───────────────────┴────────────────────────────┴──────────────────────────────────────┴──────────────┴───────────┴──────────────────────────────────────────────┘\n\n```\n:::\n:::\n\n\n1. Assign the table to a variable\n2. Read the Parquet files from GCS\n3. Cast the `timestamp` column to a timestamp type\n4. Reorder the columns\n5. Cache the table to avoid re-reading cloud data\n6. 
Display the first few rows\n\nWe'll also create a table with details on each instance type including the CPU\ntype, number of cores, and memory in gigabytes:\n\n::: {#c352dd06 .cell execution_count=7}\n``` {.python .cell-code code-fold=\"true\" code-summary=\"Show code to get instance details\"}\ncpu_type_cases = (\n ibis.case()\n .when(\n ibis._[\"instance_type\"].startswith(\"n2d\"),\n \"AMD EPYC\",\n )\n .when(\n ibis._[\"instance_type\"].startswith(\"n2\"),\n \"Intel Cascade and Ice Lake\",\n )\n .when(\n ibis._[\"instance_type\"].startswith(\"c3\"),\n \"Intel Sapphire Rapids\",\n )\n .when(\n ibis._[\"instance_type\"] == \"work laptop\",\n \"Apple M1 Max\",\n )\n .when(\n ibis._[\"instance_type\"] == \"personal laptop\",\n \"Apple M2 Max\",\n )\n .else_(\"unknown\")\n .end()\n)\ncpu_num_cases = (\n ibis.case()\n .when(\n ibis._[\"instance_type\"].contains(\"-\"),\n ibis._[\"instance_type\"].split(\"-\")[-1].cast(\"int\"),\n )\n .when(ibis._[\"instance_type\"].contains(\"laptop\"), 12)\n .else_(0)\n .end()\n)\nmemory_gb_cases = (\n ibis.case()\n .when(\n ibis._[\"instance_type\"].contains(\"-\"),\n ibis._[\"instance_type\"].split(\"-\")[-1].cast(\"int\") * 4,\n )\n .when(ibis._[\"instance_type\"] == \"work laptop\", 32)\n .when(ibis._[\"instance_type\"] == \"personal laptop\", 96)\n .else_(0)\n .end()\n)\n\ninstance_details = (\n t.group_by(\"instance_type\")\n .agg()\n .mutate(\n cpu_type=cpu_type_cases, cpu_cores=cpu_num_cases, memory_gbs=memory_gb_cases\n )\n).order_by(\"memory_gbs\", \"cpu_cores\", \"instance_type\")\n\ncpu_types = sorted(\n instance_details.distinct(on=\"cpu_type\")[\"cpu_type\"].to_pyarrow().to_pylist()\n)\n\ninstance_details\n```\n\n::: {.cell-output .cell-output-display execution_count=61}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━┓\n┃ instance_type ┃ cpu_type ┃ cpu_cores ┃ memory_gbs ┃\n┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━┩\n│ string │ string │ int64 │ int64 │\n├─────────────────┼────────────────────────────┼───────────┼────────────┤\n│ n2-standard-2 │ Intel Cascade and Ice Lake │ 2 │ 8 │\n│ n2d-standard-2 │ AMD EPYC │ 2 │ 8 │\n│ c3-standard-4 │ Intel Sapphire Rapids │ 4 │ 16 │\n│ n2-standard-4 │ Intel Cascade and Ice Lake │ 4 │ 16 │\n│ n2d-standard-4 │ AMD EPYC │ 4 │ 16 │\n│ c3-standard-8 │ Intel Sapphire Rapids │ 8 │ 32 │\n│ n2-standard-8 │ Intel Cascade and Ice Lake │ 8 │ 32 │\n│ n2d-standard-8 │ AMD EPYC │ 8 │ 32 │\n│ work laptop │ Apple M1 Max │ 12 │ 32 │\n│ n2-standard-16 │ Intel Cascade and Ice Lake │ 16 │ 64 │\n│ n2d-standard-16 │ AMD EPYC │ 16 │ 64 │\n│ c3-standard-22 │ Intel Sapphire Rapids │ 22 │ 88 │\n│ personal laptop │ Apple M2 Max │ 12 │ 96 │\n│ n2-standard-32 │ Intel Cascade and Ice Lake │ 32 │ 128 │\n│ n2d-standard-32 │ AMD EPYC │ 32 │ 128 │\n│ c3-standard-44 │ Intel Sapphire Rapids │ 44 │ 176 │\n└─────────────────┴────────────────────────────┴───────────┴────────────┘\n\n```\n:::\n:::\n\n\n### What's in the data?\n\nWith the data, we can see we ran the benchmark on scale factors:\n\n::: {#99a2bcab .cell execution_count=8}\n``` {.python .cell-code}\nsfs = sorted(t.distinct(on=\"sf\")[\"sf\"].to_pyarrow().to_pylist())\nsfs\n```\n\n::: {.cell-output .cell-output-display execution_count=62}\n```\n[1, 8, 16, 32, 64, 128]\n```\n:::\n:::\n\n\n:::{.callout-note title=\"What is a scale factor?\" collapse=\"true\"}\nA scale factor is roughly the size of the data in memory in gigabytes. For\nexample, a scale factor of 1 means the data is roughly 1GB in memory.\n\nStored on disk in (compressed) Parquet format, the data is smaller -- about\n0.38GB for scale factor 1 with the compression settings used in this benchmark.\n:::\n\nWe can look at the total execution time by scale factor:\n\n::: {#a48a964d .cell execution_count=9}\n``` {.python .cell-code code-fold=\"true\" code-summary=\"Show bar plot code\"}\nc = px.bar(\n t.group_by(\"sf\").agg(total_seconds=t[\"execution_seconds\"].sum()),\n x=\"sf\",\n y=\"total_seconds\",\n category_orders={\"sf\": sfs},\n title=\"total execution time by scale factor\",\n)\nc\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━┓\n┃ instance_type ┃ system ┃ sf ┃ n_partitions ┃ query_number ┃ mean_execution_seconds ┃ cpu_type ┃ cpu_cores ┃ memory_gbs ┃\n┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━┩\n│ string │ string │ int64 │ int64 │ int64 │ float64 │ string │ int64 │ int64 │\n├───────────────┼─────────────────────┼───────┼──────────────┼──────────────┼────────────────────────┼────────────────────────────┼───────────┼────────────┤\n│ n2-standard-4 │ ibis-datafusion │ 16 │ 1 │ 7 │ 8.641310 │ Intel Cascade and Ice Lake │ 4 │ 16 │\n│ n2-standard-4 │ ibis-datafusion │ 64 │ 1 │ 5 │ 23.362426 │ Intel Cascade and Ice Lake │ 4 │ 16 │\n│ n2-standard-4 │ ibis-datafusion-sql │ 16 │ 1 │ 1 │ 10.238970 │ Intel Cascade and Ice Lake │ 4 │ 16 │\n└───────────────┴─────────────────────┴───────┴──────────────┴──────────────┴────────────────────────┴────────────────────────────┴───────────┴────────────┘\n\n```\n:::\n:::\n\n\nThere's a lot of data and it's difficult to visualize all at once. We'll build\nup our understanding with a few plots.\n\n::: {#e6c188ce .cell execution_count=18}\n``` {.python .cell-code code-fold=\"true\" code-summary=\"Show code for timings_plot\"}\ndef timings_plot(\n agg,\n sf_filter=128,\n systems_filter=systems,\n instances_filter=[instance for instance in instance_types if \"laptop\" in instance],\n queries_filter=query_numbers,\n log_y=True,\n):\n data = (\n agg.filter(agg[\"sf\"] == sf_filter)\n .filter(agg[\"system\"].isin(systems_filter))\n .filter(agg[\"instance_type\"].isin(instances_filter))\n .filter(agg[\"query_number\"].isin(queries_filter))\n )\n\n c = px.bar(\n data,\n x=\"query_number\",\n y=\"mean_execution_seconds\",\n log_y=log_y,\n color=\"system\",\n barmode=\"group\",\n pattern_shape=\"instance_type\",\n category_orders={\n \"system\": systems,\n \"instance_type\": instance_types,\n },\n hover_data=[\"cpu_type\", \"cpu_cores\", \"memory_gbs\"],\n title=f\"sf: {sf_filter}\",\n )\n\n return c\n```\n:::\n\n\nFirst, let's visualize execution time for a given scale factor, system, query,\nand family of instance types:\n\n::: {#fb98b88c .cell execution_count=19}\n``` {.python .cell-code}\nsf_filter = 128\nsystems_filter = [\"ibis-duckdb\"]\ninstances_filter = [\n instance for instance in instance_types if instance.startswith(\"n2d\")\n]\nqueries_filter = [1]\nlog_y = False\n\ntimings_plot(agg, sf_filter, systems_filter, instances_filter, queries_filter, log_y)\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓\n┃ system ┃ total_execution_seconds ┃ total_queries ┃ seconds_per_query ┃\n┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩\n│ string │ float64 │ int64 │ float64 │\n├─────────────────┼─────────────────────────┼───────────────┼───────────────────┤\n│ ibis-duckdb │ 1049.007159 │ 35 │ 29.971633 │\n│ ibis-duckdb-sql │ 1180.318346 │ 35 │ 33.723381 │\n└─────────────────┴─────────────────────────┴───────────────┴───────────────────┘\n\n```\n:::\n:::\n\n\nIbis dataframe code is a little faster overall, but this is on a subset of\nqueries and scale factors and instance types. More analysis and profiling would\nbe needed to make a definitive statement, but in general we can be happy that\nDuckDB does a great job optimizing the SQL Ibis generates and that Ibis\ndataframe code isn't adding significant overhead.\n\nLet's repeat this for DataFusion:\n\n::: {#786d628d .cell execution_count=23}\n``` {.python .cell-code}\nsystems_filter = [\"ibis-datafusion\", \"ibis-datafusion-sql\"]\n\ntimings_plot(agg, sf_filter, systems_filter, instances_filter, queries_filter, log_y)\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓\n┃ system ┃ total_execution_seconds ┃ total_queries ┃ seconds_per_query ┃\n┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩\n│ string │ float64 │ int64 │ float64 │\n├─────────────────────┼─────────────────────────┼───────────────┼───────────────────┤\n│ ibis-datafusion-sql │ 1041.259330 │ 33 │ 31.553313 │\n│ ibis-datafusion │ 1202.149386 │ 35 │ 34.347125 │\n└─────────────────────┴─────────────────────────┴───────────────┴───────────────────┘\n\n```\n:::\n:::\n\n\nThis time Ibis dataframe code is a bit slower overall. **However, also notice\ntwo queries are missing from `ibis-datafusion-sql`**. These are query 7 on\n`n2d-standard-2` and `n2d-standard-4` (the two instances with the least memory).\nWe'll investigate failing queries more thoroughly in the next section.\n\nFirst, let's look at Polars:\n\n::: {#0a68f425 .cell execution_count=25}\n``` {.python .cell-code}\nsystems_filter = [\"ibis-polars\", \"polars-lazy\"]\n\ntimings_plot(agg, sf_filter, systems_filter, instances_filter, queries_filter, log_y)\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓\n┃ system ┃ total_execution_seconds ┃ total_queries ┃ seconds_per_query ┃\n┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩\n│ string │ float64 │ int64 │ float64 │\n├─────────────┼─────────────────────────┼───────────────┼───────────────────┤\n│ ibis-polars │ 185.547665 │ 14 │ 13.253405 │\n│ polars-lazy │ 115.157749 │ 14 │ 8.225554 │\n└─────────────┴─────────────────────────┴───────────────┴───────────────────┘\n\n```\n:::\n:::\n\n\nLet's now compare all systems across a single instance type and query:\n\n::: {#fbc36677 .cell execution_count=28}\n``` {.python .cell-code}\nsf_filter = 128\ninstances_filter = [\"n2d-standard-32\"]\nsystems_filter = systems\nqueries_filter = [1]\n\ntimings_plot(agg, sf_filter, systems_filter, instances_filter, queries_filter, log_y)\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ system ┃ failing_queries ┃ num_failing_queries ┃ num_successful_queries ┃\n┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ array<int64> │ int64 │ int64 │\n├─────────────────────┼─────────────────────────────────────────────┼─────────────────────┼────────────────────────┤\n│ ibis-duckdb │ [] │ 0 │ 22 │\n│ ibis-duckdb-sql │ [] │ 0 │ 22 │\n│ ibis-datafusion-sql │ [16] │ 1 │ 21 │\n│ polars-lazy │ [9] │ 1 │ 21 │\n│ ibis-datafusion │ [16, 21, 22] │ 3 │ 19 │\n│ ibis-polars │ [9, 11, 13, 14, 15, 16, 17, 19, 20, 21, 22] │ 11 │ 11 │\n└─────────────────────┴─────────────────────────────────────────────┴─────────────────────┴────────────────────────┘\n\n```\n:::\n:::\n\n\n::: {#53363979 .cell execution_count=32}\n``` {.python .cell-code code-fold=\"true\" code-summary=\"Show code to create a bar plot of the number of successful queries by system\"}\nc = px.bar(\n failing,\n x=\"system\",\n y=\"num_successful_queries\",\n category_orders={\n \"system\": systems,\n \"query_number\": query_numbers,\n },\n color=\"system\",\n title=\"completed queries\",\n)\nc\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ system ┃ failing_queries ┃ num_failing_queries ┃ num_successful_queries ┃\n┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ array<int64> │ int64 │ int64 │\n├─────────────────────┼───────────────────────────────────────────────────────────────────────────┼─────────────────────┼────────────────────────┤\n│ ibis-duckdb │ [] │ 0 │ 22 │\n│ ibis-duckdb-sql │ [] │ 0 │ 22 │\n│ ibis-datafusion-sql │ [7, 9, 16, 18, 20] │ 5 │ 17 │\n│ ibis-datafusion │ [9, 16, 18, 20, 21, 22] │ 6 │ 16 │\n│ polars-lazy │ [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 17, 18, 19, 20, 21] │ 18 │ 4 │\n│ ibis-polars │ [1, 3, 4, 5, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22] │ 20 │ 2 │\n└─────────────────────┴───────────────────────────────────────────────────────────────────────────┴─────────────────────┴────────────────────────┘\n\n```\n:::\n:::\n\n\n::: {#699cf3f1 .cell execution_count=34}\n``` {.python .cell-code code-fold=\"true\" code-summary=\"Show code to create a bar plot of the number of successful queries by system\"}\nc = px.bar(\n failing,\n x=\"system\",\n y=\"num_successful_queries\",\n category_orders={\n \"system\": systems,\n \"query_number\": query_numbers,\n },\n color=\"system\",\n title=\"completed queries\",\n)\nc\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ system ┃ sf ┃ query_number ┃ execution_seconds ┃ session_id ┃ instance_type ┃ timestamp ┃ n_partitions ┃ file_type ┃ file_id ┃\n┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ int64 │ int64 │ float64 │ uuid │ json │ string │ int64 │ string │ string │\n├─────────────────────┼───────┼──────────────┼───────────────────┼──────────────────────────────────────┼───────────────┼────────────────────────────┼──────────────┼───────────┼───────────────────────────────────────────┤\n│ ibis-datafusion-sql │ 1 │ 12 │ 0.076600 │ 0b931439-5670-4a77-89b0-d8b7c45e6eb7 │ NULL │ 2024-06-13T16:09:34.476397 │ 1 │ parquet │ 00218347-a4cd-4590-a502-8cf79f4e87c9.json │\n│ ibis-datafusion-sql │ 1 │ 21 │ 0.165074 │ 0b931439-5670-4a77-89b0-d8b7c45e6eb7 │ NULL │ 2024-06-13T16:09:35.376753 │ 1 │ parquet │ 01089668-608c-4551-ae65-6d98d69f959b.json │\n│ ibis-polars │ 1 │ 12 │ 0.075944 │ 0b931439-5670-4a77-89b0-d8b7c45e6eb7 │ NULL │ 2024-06-13T16:09:36.956001 │ 1 │ parquet │ 02a991bd-797a-4c08-83de-c1b537f713fe.json │\n│ ibis-datafusion │ 1 │ 10 │ 0.144007 │ 0b931439-5670-4a77-89b0-d8b7c45e6eb7 │ NULL │ 2024-06-13T16:09:32.297647 │ 1 │ parquet │ 02bd900a-3c0a-4871-b651-1690f11a81ab.json │\n│ ibis-datafusion-sql │ 1 │ 3 │ 0.067048 │ 0b931439-5670-4a77-89b0-d8b7c45e6eb7 │ NULL │ 2024-06-13T16:09:33.699368 │ 1 │ parquet │ 08490a6b-e1ab-482c-83bc-85469c6b96a3.json │\n│ ibis-duckdb │ 1 │ 10 │ 0.160302 │ 0b931439-5670-4a77-89b0-d8b7c45e6eb7 │ NULL │ 2024-06-13T16:09:27.316339 │ 1 │ parquet │ 08b92577-8150-4040-bceb-9316da7bfaf4.json │\n│ … │ … │ … │ … │ … │ … │ … │ … │ … │ … │\n└─────────────────────┴───────┴──────────────┴───────────────────┴──────────────────────────────────────┴───────────────┴────────────────────────────┴──────────────┴───────────┴───────────────────────────────────────────┘\n\n```\n:::\n:::\n\n\nWe can check the total execution time for each system:\n\n::: {#98d17563 .cell execution_count=37}\n``` {.python .cell-code}\nt.group_by(\"system\").agg(total_seconds=t[\"execution_seconds\"].sum()).order_by(\n \"total_seconds\"\n)\n```\n\n::: {.cell-output .cell-output-display execution_count=91}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓\n┃ system ┃ total_seconds ┃\n┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩\n│ string │ float64 │\n├─────────────────────┼───────────────┤\n│ ibis-datafusion-sql │ 2.006620 │\n│ ibis-duckdb-sql │ 2.067606 │\n│ polars-lazy │ 2.086350 │\n│ ibis-polars │ 2.168417 │\n│ ibis-duckdb │ 2.495270 │\n│ ibis-datafusion │ 2.529014 │\n└─────────────────────┴───────────────┘\n\n```\n:::\n:::\n\n\nWe can visualize the results:\n\n::: {#0f703cc5 .cell execution_count=38}\n``` {.python .cell-code}\nimport plotly.express as px\n\npx.defaults.template = \"plotly_dark\"\n\nagg = t.group_by(\"system\", \"query_number\").agg(\n mean_execution_seconds=t[\"execution_seconds\"].mean(),\n)\n\nchart = px.bar(\n agg,\n x=\"query_number\",\n y=\"mean_execution_seconds\",\n color=\"system\",\n barmode=\"group\",\n title=\"Mean execution time by query\",\n category_orders={\n \"system\": sorted(t.select(\"system\").distinct().to_pandas()[\"system\"].tolist())\n },\n)\nchart\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n
┏━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ l_orderkey ┃ l_partkey ┃ l_suppkey ┃ l_linenumber ┃ l_quantity ┃ l_extendedprice ┃ l_discount ┃ l_tax ┃ l_returnflag ┃ l_linestatus ┃ l_shipdate ┃ l_commitdate ┃ l_receiptdate ┃ l_shipinstruct ┃ l_shipmode ┃ l_comment ┃\n┡━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ int64 │ int64 │ int64 │ int64 │ float64 │ float64 │ float64 │ float64 │ string │ string │ date │ date │ date │ string │ string │ string │\n├────────────┼───────────┼───────────┼──────────────┼────────────┼─────────────────┼────────────┼─────────┼──────────────┼──────────────┼────────────┼──────────────┼───────────────┼───────────────────┼────────────┼──────────────────────────────────────────┤\n│ 6000000 │ 32255 │ 2256 │ 1 │ 5.0 │ 5936.25 │ 0.04 │ 0.03 │ N │ O │ 1996-11-02 │ 1996-11-19 │ 1996-12-01 │ TAKE BACK RETURN │ MAIL │ riously pe │\n│ 6000000 │ 96127 │ 6128 │ 2 │ 28.0 │ 31447.36 │ 0.01 │ 0.02 │ N │ O │ 1996-09-22 │ 1996-10-01 │ 1996-10-21 │ NONE │ AIR │ pecial excuses nag evenly f │\n│ 5999975 │ 6452 │ 1453 │ 2 │ 7.0 │ 9509.15 │ 0.04 │ 0.00 │ A │ F │ 1993-11-02 │ 1993-09-23 │ 1993-11-19 │ DELIVER IN PERSON │ SHIP │ ffily along the sly │\n│ 5999975 │ 7272 │ 2273 │ 1 │ 32.0 │ 37736.64 │ 0.07 │ 0.01 │ R │ F │ 1993-10-07 │ 1993-09-30 │ 1993-10-21 │ COLLECT COD │ REG AIR │ ld deposits aga │\n│ 5999975 │ 37131 │ 2138 │ 3 │ 18.0 │ 19226.34 │ 0.04 │ 0.01 │ A │ F │ 1993-11-17 │ 1993-08-28 │ 1993-12-08 │ DELIVER IN PERSON │ FOB │ counts cajole evenly? sly orbits boost f │\n│ 5999974 │ 10463 │ 5466 │ 2 │ 46.0 │ 63179.16 │ 0.08 │ 0.06 │ R │ F │ 1993-09-16 │ 1993-09-21 │ 1993-10-02 │ COLLECT COD │ RAIL │ se slyly alo │\n│ … │ … │ … │ … │ … │ … │ … │ … │ … │ … │ … │ … │ … │ … │ … │ … │\n└────────────┴───────────┴───────────┴──────────────┴────────────┴─────────────────┴────────────┴─────────┴──────────────┴──────────────┴────────────┴──────────────┴───────────────┴───────────────────┴────────────┴──────────────────────────────────────────┘\n\n```\n:::\n:::\n\n\n::: {#ff8e56e8 .cell execution_count=43}\n``` {.python .cell-code}\nlineitem.count()\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n\n```\n:::\n\n::: {.cell-output .cell-output-display execution_count=97}\n\n::: {.ansi-escaped-output}\n```{=html}\n
┌─────────┐\n│ 6001215 │\n└─────────┘
\n```\n:::\n\n:::\n:::\n\n\n## Ibis (DataFusion)\n\n::: {#a5254d3a .cell execution_count=44}\n``` {.python .cell-code}\ncon = ibis.connect(\"datafusion://\")\n\n(customer, lineitem, nation, orders, part, partsupp, region, supplier) = (\n get_ibis_tables(sf=sf, con=con)\n)\n```\n:::\n\n\n::: {#23425282 .cell execution_count=45}\n``` {.python .cell-code}\nlineitem.order_by(ibis.desc(\"l_orderkey\"), ibis.asc(\"l_partkey\"))\n```\n\n::: {.cell-output .cell-output-display execution_count=99}\n```{=html}\n┏━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ l_orderkey ┃ l_partkey ┃ l_suppkey ┃ l_linenumber ┃ l_quantity ┃ l_extendedprice ┃ l_discount ┃ l_tax ┃ l_returnflag ┃ l_linestatus ┃ l_shipdate ┃ l_commitdate ┃ l_receiptdate ┃ l_shipinstruct ┃ l_shipmode ┃ l_comment ┃\n┡━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ int64 │ int64 │ int64 │ int64 │ float64 │ float64 │ float64 │ float64 │ string │ string │ date │ date │ date │ string │ string │ string │\n├────────────┼───────────┼───────────┼──────────────┼────────────┼─────────────────┼────────────┼─────────┼──────────────┼──────────────┼────────────┼──────────────┼───────────────┼───────────────────┼────────────┼──────────────────────────────────────────┤\n│ 6000000 │ 32255 │ 2256 │ 1 │ 5.0 │ 5936.25 │ 0.04 │ 0.03 │ N │ O │ 1996-11-02 │ 1996-11-19 │ 1996-12-01 │ TAKE BACK RETURN │ MAIL │ riously pe │\n│ 6000000 │ 96127 │ 6128 │ 2 │ 28.0 │ 31447.36 │ 0.01 │ 0.02 │ N │ O │ 1996-09-22 │ 1996-10-01 │ 1996-10-21 │ NONE │ AIR │ pecial excuses nag evenly f │\n│ 5999975 │ 6452 │ 1453 │ 2 │ 7.0 │ 9509.15 │ 0.04 │ 0.00 │ A │ F │ 1993-11-02 │ 1993-09-23 │ 1993-11-19 │ DELIVER IN PERSON │ SHIP │ ffily along the sly │\n│ 5999975 │ 7272 │ 2273 │ 1 │ 32.0 │ 37736.64 │ 0.07 │ 0.01 │ R │ F │ 1993-10-07 │ 1993-09-30 │ 1993-10-21 │ COLLECT COD │ REG AIR │ ld deposits aga │\n│ 5999975 │ 37131 │ 2138 │ 3 │ 18.0 │ 19226.34 │ 0.04 │ 0.01 │ A │ F │ 1993-11-17 │ 1993-08-28 │ 1993-12-08 │ DELIVER IN PERSON │ FOB │ counts cajole evenly? sly orbits boost f │\n│ 5999974 │ 10463 │ 5466 │ 2 │ 46.0 │ 63179.16 │ 0.08 │ 0.06 │ R │ F │ 1993-09-16 │ 1993-09-21 │ 1993-10-02 │ COLLECT COD │ RAIL │ se slyly alo │\n│ … │ … │ … │ … │ … │ … │ … │ … │ … │ … │ … │ … │ … │ … │ … │ … │\n└────────────┴───────────┴───────────┴──────────────┴────────────┴─────────────────┴────────────┴─────────┴──────────────┴──────────────┴────────────┴──────────────┴───────────────┴───────────────────┴────────────┴──────────────────────────────────────────┘\n\n```\n:::\n:::\n\n\n::: {#5ab7d426 .cell execution_count=46}\n``` {.python .cell-code}\nlineitem.count()\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n\n```\n:::\n\n::: {.cell-output .cell-output-display execution_count=100}\n\n::: {.ansi-escaped-output}\n```{=html}\n
┌─────────┐\n│ 6001215 │\n└─────────┘
\n```\n:::\n\n:::\n:::\n\n\n## Ibis (Polars)\n\n::: {#844e701c .cell execution_count=47}\n``` {.python .cell-code}\ncon = ibis.connect(\"polars://\")\n\n(customer, lineitem, nation, orders, part, partsupp, region, supplier) = (\n get_ibis_tables(sf=sf, con=con)\n)\n```\n:::\n\n\n::: {#42c87d76 .cell execution_count=48}\n``` {.python .cell-code}\nlineitem.order_by(ibis.desc(\"l_orderkey\"), ibis.asc(\"l_partkey\"))\n```\n\n::: {.cell-output .cell-output-display execution_count=102}\n```{=html}\n┏━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ l_orderkey ┃ l_partkey ┃ l_suppkey ┃ l_linenumber ┃ l_quantity ┃ l_extendedprice ┃ l_discount ┃ l_tax ┃ l_returnflag ┃ l_linestatus ┃ l_shipdate ┃ l_commitdate ┃ l_receiptdate ┃ l_shipinstruct ┃ l_shipmode ┃ l_comment ┃\n┡━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ int64 │ int64 │ int64 │ int64 │ float64 │ float64 │ float64 │ float64 │ string │ string │ date │ date │ date │ string │ string │ string │\n├────────────┼───────────┼───────────┼──────────────┼────────────┼─────────────────┼────────────┼─────────┼──────────────┼──────────────┼────────────┼──────────────┼───────────────┼───────────────────┼────────────┼──────────────────────────────────────────┤\n│ 6000000 │ 32255 │ 2256 │ 1 │ 5.0 │ 5936.25 │ 0.04 │ 0.03 │ N │ O │ 1996-11-02 │ 1996-11-19 │ 1996-12-01 │ TAKE BACK RETURN │ MAIL │ riously pe │\n│ 6000000 │ 96127 │ 6128 │ 2 │ 28.0 │ 31447.36 │ 0.01 │ 0.02 │ N │ O │ 1996-09-22 │ 1996-10-01 │ 1996-10-21 │ NONE │ AIR │ pecial excuses nag evenly f │\n│ 5999975 │ 6452 │ 1453 │ 2 │ 7.0 │ 9509.15 │ 0.04 │ 0.00 │ A │ F │ 1993-11-02 │ 1993-09-23 │ 1993-11-19 │ DELIVER IN PERSON │ SHIP │ ffily along the sly │\n│ 5999975 │ 7272 │ 2273 │ 1 │ 32.0 │ 37736.64 │ 0.07 │ 0.01 │ R │ F │ 1993-10-07 │ 1993-09-30 │ 1993-10-21 │ COLLECT COD │ REG AIR │ ld deposits aga │\n│ 5999975 │ 37131 │ 2138 │ 3 │ 18.0 │ 19226.34 │ 0.04 │ 0.01 │ A │ F │ 1993-11-17 │ 1993-08-28 │ 1993-12-08 │ DELIVER IN PERSON │ FOB │ counts cajole evenly? sly orbits boost f │\n│ 5999974 │ 10463 │ 5466 │ 2 │ 46.0 │ 63179.16 │ 0.08 │ 0.06 │ R │ F │ 1993-09-16 │ 1993-09-21 │ 1993-10-02 │ COLLECT COD │ RAIL │ se slyly alo │\n│ … │ … │ … │ … │ … │ … │ … │ … │ … │ … │ … │ … │ … │ … │ … │ … │\n└────────────┴───────────┴───────────┴──────────────┴────────────┴─────────────────┴────────────┴─────────┴──────────────┴──────────────┴────────────┴──────────────┴───────────────┴───────────────────┴────────────┴──────────────────────────────────────────┘\n\n```\n:::\n:::\n\n\n::: {#ec10f64b .cell execution_count=49}\n``` {.python .cell-code}\nlineitem.count()\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n\n```\n:::\n\n::: {.cell-output .cell-output-display execution_count=103}\n\n::: {.ansi-escaped-output}\n```{=html}\n
┌─────────┐\n│ 6001215 │\n└─────────┘
\n```\n:::\n\n:::\n:::\n\n\n:::\n\n\n\nThe queries are also defined in `ibis_bench.queries`. Let's look at query 4 as\nan example for Ibis dataframe code, Polars dataframe code, and SQL code via\nIbis:\n\n:::{.panel-tabset}\n\n## Ibis (dataframe)\n\nDefine query 4:\n\n::: {#1b3ec7eb .cell execution_count=51}\n``` {.python .cell-code}\ndef q4(lineitem, orders, **kwargs):\n var1 = date(1993, 7, 1)\n var2 = date(1993, 10, 1)\n\n q_final = (\n lineitem.join(orders, lineitem[\"l_orderkey\"] == orders[\"o_orderkey\"])\n .filter((orders[\"o_orderdate\"] >= var1) & (orders[\"o_orderdate\"] < var2))\n .filter(lineitem[\"l_commitdate\"] < lineitem[\"l_receiptdate\"])\n .distinct(on=[\"o_orderpriority\", \"l_orderkey\"])\n .group_by(\"o_orderpriority\")\n .agg(order_count=ibis._.count())\n .order_by(\"o_orderpriority\")\n )\n\n return q_final\n```\n:::\n\n\nRun query 4:\n\n::: {#03a33620 .cell execution_count=52}\n``` {.python .cell-code}\nres = q4(lineitem, orders)\nres\n```\n\n::: {.cell-output .cell-output-display execution_count=106}\n```{=html}\n┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓\n┃ o_orderpriority ┃ order_count ┃\n┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩\n│ string │ int64 │\n├─────────────────┼─────────────┤\n│ 1-URGENT │ 10594 │\n│ 2-HIGH │ 10476 │\n│ 3-MEDIUM │ 10410 │\n│ 4-NOT SPECIFIED │ 10556 │\n│ 5-LOW │ 10487 │\n└─────────────────┴─────────────┘\n\n```\n:::\n:::\n\n\n## Polars (dataframe)\n\nDefine query 4:\n\n::: {#8ddc975b .cell execution_count=53}\n``` {.python .cell-code}\ndef q4(lineitem, orders, **kwargs):\n var1 = date(1993, 7, 1)\n var2 = date(1993, 10, 1)\n\n q_final = (\n lineitem.join(orders, left_on=\"l_orderkey\", right_on=\"o_orderkey\")\n .filter(pl.col(\"o_orderdate\").is_between(var1, var2, closed=\"left\"))\n .filter(pl.col(\"l_commitdate\") < pl.col(\"l_receiptdate\"))\n .unique(subset=[\"o_orderpriority\", \"l_orderkey\"])\n .group_by(\"o_orderpriority\")\n .agg(pl.len().alias(\"order_count\"))\n .sort(\"o_orderpriority\")\n )\n\n return q_final\n```\n:::\n\n\nRun query 4:\n\n::: {#4719fe1c .cell execution_count=54}\n``` {.python .cell-code}\nres = q4(lineitem.to_polars().lazy(), orders.to_polars().lazy()).collect()\nres\n```\n\n::: {.cell-output .cell-output-display execution_count=108}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓\n┃ o_orderpriority ┃ order_count ┃\n┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩\n│ string │ int64 │\n├─────────────────┼─────────────┤\n│ 1-URGENT │ 10594 │\n│ 2-HIGH │ 10476 │\n│ 3-MEDIUM │ 10410 │\n│ 4-NOT SPECIFIED │ 10556 │\n│ 5-LOW │ 10487 │\n└─────────────────┴─────────────┘\n\n```\n:::\n:::\n\n\n:::\n\nFinally, we write the result to a Parquet file. We are measuring the\nexecution time in seconds of calling the query and writing the results to disk.\n\n## Next steps\n\nWe'll publish the next iteration of this benchmark soon with updated Polars\nTPC-H queries and using newer versions of all libraries. Polars v1.0.0 should\nrelease soon. A new DataFusion version that fixes the remaining failing queries\nis also expected soon.\n\nIf you spot anything wrong, have any questions, or want to share your own\nanalysis, feel free to share below!\n\n", + "supporting": [ + "index_files/figure-html" + ], + "filters": [], + "includes": { + "include-in-header": [ + "\n\n\n\n\n" + ] + } + } +} \ No newline at end of file diff --git a/docs/posts/ibis-bench/.gitignore b/docs/posts/ibis-bench/.gitignore new file mode 100644 index 000000000000..e89505a7fc15 --- /dev/null +++ b/docs/posts/ibis-bench/.gitignore @@ -0,0 +1,3 @@ +tpch_data +results_data +bench_logs_v* diff --git a/docs/posts/ibis-bench/figure1.png b/docs/posts/ibis-bench/figure1.png new file mode 100644 index 000000000000..b163cb6c8545 Binary files /dev/null and b/docs/posts/ibis-bench/figure1.png differ diff --git a/docs/posts/ibis-bench/index.qmd b/docs/posts/ibis-bench/index.qmd new file mode 100644 index 000000000000..397824c7220a --- /dev/null +++ b/docs/posts/ibis-bench/index.qmd @@ -0,0 +1,1315 @@ +--- +title: "Ibis benchmarking: DuckDB, DataFusion, Polars" +author: "Cody Peterson" +date: "2024-06-24" +image: "figure1.png" +categories: + - benchmark + - duckdb + - datafusion + - polars +--- + +*The best benchmark is your own workload on your own data*. + +## Key considerations + +The purpose of this post is to explore some benchmarking data with Ibis. We'll +compare three modern single-node query engines, explore the Ibis API as a great +choice for each of them, and discuss the results. + +### The benchmark + +:::{.callout-important title="Not an official TPC-H benchmark"} +This is not an [official TPC-H benchmark](https://www.tpc.org/tpch). We ran a +derivate of the TPC-H benchmark. +::: + +[The TPC-H benchmark](https://www.tpc.org/tpch) is a benchmark for databases +and, [increasingly](https://docs.coiled.io/blog/tpch), +[dataframes](https://pola.rs/posts/benchmarks)! It consists of 22 queries +across 8 tables. The SQL (or dataframe) code representing the queries is +designed to test the performance of a query engine on a variety of tasks +including filtering, aggregation, and joins. SQL queries are defined by the +TPC-H benchmark. We run the SQL queries and equivalent dataframe code via Ibis +and Polars APIs. + +The data for the benchmark can be generated at any scale factor, which roughly +corresponds to the size of the data in memory in gigabytes. For instance, a +scale factor of 10 would be about 10GB of data in memory. + +### The engines, the API, the code + +We'll use three modern single-node OLAP engines +([DuckDB](https://github.com/duckdb/duckdb), +[DataFusion](https://github.com/apache/datafusion), +[Polars](https://github.com/pola-rs/polars)) with the Ibis API via two coding +paradigms (dataframe and SQL). Ibis provides a consistent API across 20+ +backends, including these three. 
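+
+For a concrete sense of what this means, here's a minimal sketch (illustrative
+only -- the `lineitem` table is a hypothetical stand-in, not benchmark code) of
+inspecting the SQL Ibis generates for different backends from a single
+expression:
+
+```python
+import ibis
+
+# hypothetical stand-in for the TPC-H lineitem table
+lineitem = ibis.table(
+    {"l_returnflag": "string", "l_quantity": "float64"}, name="lineitem"
+)
+expr = lineitem.group_by("l_returnflag").agg(
+    total_qty=lineitem["l_quantity"].sum()
+)
+
+print(ibis.to_sql(expr, dialect="duckdb"))      # SQL compiled for DuckDB
+print(ibis.to_sql(expr, dialect="datafusion"))  # SQL compiled for DataFusion
+```
+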
We run [SQL
+code](https://github.com/lostmygithubaccount/ibis-bench/blob/v2.0.0/src/ibis_bench/queries/sql.py)
+through Ibis in addition to [dataframe
+code](https://github.com/lostmygithubaccount/ibis-bench/blob/v2.0.0/src/ibis_bench/queries/ibis.py)
+to get a sense of any overhead in Ibis dataframe code.
+
+:::{.callout-note}
+Ibis dataframe code generates SQL for the DuckDB and DataFusion backends and
+generates Polars API dataframe code for the Polars backend.
+:::
+
+:::{.callout-note title="Honorable mention: chDB" collapse="true"}
+[chDB](https://github.com/chdb-io/chdb) would be another great single-node OLAP
+engine to benchmark. We don't because it's not currently a backend for Ibis,
+though [there has been work done to make it
+one](https://github.com/ibis-project/ibis/pull/8497).
+
+If you're interested in contributing to Ibis, a new backend like chDB could be a
+great project for you!
+:::
+
+9/22 queries for Ibis with the Polars backend fail due to a [lack of scalar
+subquery support](#failing-polars-queries). Because of this, and because SQL
+support in Polars is still relatively experimental, we've opted to use [the
+Polars API
+directly](https://github.com/lostmygithubaccount/ibis-bench/blob/v2.0.0/src/ibis_bench/queries/polars.py)
+in this iteration of the benchmark. This is done with the LazyFrames API **and
+no streaming engine** ([per the Polars team's
+recommendation](https://github.com/pola-rs/polars/issues/16694#issuecomment-2146668559)).
+This also lets us compare the performance of the Polars backend through Ibis
+with the Polars API directly for the queries that do succeed.
+
+#### Failing queries
+
+Queries fail for one of two reasons:
+
+1. The query doesn't work in the given system
+2. The query otherwise fails on a given run
+
+We'll note cases of the first kind below. The second is usually due to memory
+pressure and [will be seen at higher scale
+factors](#failing-queries-due-to-memory-pressure) throughout the data.
+
+#### Failing DataFusion queries
+
+Queries 16, 21, and 22 fail for the DataFusion backend via Ibis dataframe code,
+and query 16 fails through SQL. Note that [all TPC-H SQL queries successfully
+run through DataFusion
+directly](https://github.com/apache/datafusion-benchmarks) -- Ibis generates SQL
+that [hits a bug in DataFusion that has already been
+fixed](https://github.com/apache/datafusion/issues/10830). We expect these
+queries to work in the next iteration of this benchmark, coming soon.
+
+#### Failing Polars queries
+
+Queries 11, 13-17, and 20-22 fail for the Polars backend via Ibis dataframe
+code. These all fail due to a lack of scalar subquery support in the backend.
+I've [opened an issue](https://github.com/ibis-project/ibis/issues/9422) for
+tracking and discussion.
+
+:::{.callout-tip title="Interested in contributing?"}
+Increasing coverage of operations for a backend is a great place to start!
+:::
+
+### How queries are written
+
+See [the source
+code](https://github.com/lostmygithubaccount/ibis-bench/tree/v2.0.0/src/ibis_bench/queries)
+for the exact queries used in this iteration of the benchmark. Polars recently
+updated their TPC-H queries, so the next iteration of this benchmark will use
+those.
+
+Queries were adapted from [Ibis TPC-H
+queries](https://github.com/ibis-project/ibis/tree/main/ibis/backends/tests/tpch)
+and [Polars TPC-H queries](https://github.com/pola-rs/tpch). The first 10 Ibis
+dataframe queries were translated from the Polars dataframe queries, while the
+rest were directly adapted from the Ibis repository. 
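+
+To make the two paradigms concrete, here's a rough sketch of TPC-H query 6 in
+both forms, based on the published TPC-H specification (illustrative only, not
+the exact benchmark code):
+
+```python
+import ibis
+
+def q6_ibis(lineitem):
+    # 1994 shipments with a 5-7% discount and quantity under 24,
+    # summing the discounted revenue
+    return lineitem.filter(
+        (lineitem["l_shipdate"] >= ibis.date(1994, 1, 1))
+        & (lineitem["l_shipdate"] < ibis.date(1995, 1, 1))
+        & (lineitem["l_discount"].between(0.05, 0.07))
+        & (lineitem["l_quantity"] < 24)
+    ).agg(revenue=(ibis._["l_extendedprice"] * ibis._["l_discount"]).sum())
+
+q6_sql = """
+SELECT SUM(l_extendedprice * l_discount) AS revenue
+FROM lineitem
+WHERE l_shipdate >= DATE '1994-01-01'
+  AND l_shipdate < DATE '1995-01-01'
+  AND l_discount BETWEEN 0.05 AND 0.07
+  AND l_quantity < 24;
+"""
+```
+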
The SQL strings were +adapted from the Ibis repository. + +### How queries are run + +See [the source +code](https://github.com/lostmygithubaccount/ibis-bench/tree/v2.0.0) and +[methodology](https://ibis-bench.streamlit.app/methodology) for more details. In +short: + +- data is generated as a Parquet file per table + - standard DuckDB Parquet writer is used + - data is always downloaded onto a compute instance (no cloud storage reads) +- decimal types are converted to floats after reading + - works around several issues + - in the next iteration of this benchmark, we'll use the `decimal` type +- each query is run three times per configuration (system, scale factor, instance type) +- we measure the time to write the results of the query to a Parquet file + - this includes reading the Parquet file(s) and executing the query + +### Biases + +My name is Cody and I'm a Senior Technical Product Manager at [Voltron +Data](https://voltrondata.com). I am a contributor to the Ibis project and +employed to work on it -- I'm biased in favor of Ibis and the composable data +ecosystem. + +Ibis is [an independently governed open source +project](https://github.com/ibis-project/governance) that **is not owned by +Voltron Data**, though several steering committee members are employed by +Voltron Data. You can [read more about why Voltron Data supports +Ibis](../why-voda-supports-ibis/index.qmd), in addition to open source projects +like [Apache Arrow](https://github.com/apache/arrow) and +[Substrait](https://github.com/substrait-io/substrait). + +Voltron Data is a [Gold Supporter of the DuckDB +Foundation](https://duckdb.org/foundation) and [has a commercial relationship +with DuckDB Labs](https://duckdblabs.com) with regular syncs I tend to attend. +I also use [MotherDuck](https://motherduck.com) to host our [Ibis analytics +dashboard data](https://ibis-analytics.streamlit.app). + +## Results and analysis + +We'll use Ibis to analyze some of the benchmarking data. + +:::{.callout-tip} +We'll only look at a small subset of the data in this post. + +All the data is public, so you can follow along with the code and explore the +data yourself. You can also see the [Ibis benchmarking Streamlit +app](https://ibis-bench.streamlit.app) for further analysis. +::: + +```{python} +#| echo: false +#| code-fold: true +import warnings + +# this is to ignore a GCP warning +warnings.simplefilter("ignore") +``` + +### Reading the data + +To follow along, install the required Python packages: + +```bash +pip install gcsfs 'ibis-framework[duckdb]' plotly +``` + +The data is stored in a public Google Cloud Storage (GCS) bucket: + +```{python} +import os # <1> +import gcsfs # <1> + +BUCKET = "ibis-bench" # <2> + +dir_name = os.path.join(BUCKET, "bench_logs_v2", "cache") # <3> + +fs = gcsfs.GCSFileSystem() # <4> +fs.ls(dir_name)[-5:] # <5> +``` + +1. Imports +2. The public GCS bucket name +3. The directory in the bucket where the data is stored +4. Create a GCS filesystem object +5. 
List the last 5 files in the directory + +To start exploring the data, let's import Ibis and Plotly, set some options, and +register the GCS filesystem with the default (DuckDB) backend: + +```{python} +import ibis # <1> +import plotly.express as px # <2> + +px.defaults.template = "plotly_dark" # <3> + +ibis.options.interactive = True # <4> +ibis.options.repr.interactive.max_rows = 22 # <5> +ibis.options.repr.interactive.max_length = 22 # <6> +ibis.options.repr.interactive.max_columns = None # <7> + +con = ibis.get_backend() # <8> +con.register_filesystem(fs) # <9> +``` + +1. Import Ibis +2. Import Plotly +3. Set the Plotly template to dark +4. Enable interactive mode for Ibis +5. Set the maximum number of rows to display in interactive mode +6. Set the maximum length of nested types to display in interactive mode +7. Set the maximum number of columns to display in interactive mode +8. Get the default (DuckDB) backend +9. Register the GCS filesystem with the default backend + +```{python} +#| echo: false +#| code-fold: true +con.raw_sql("PRAGMA disable_progress_bar;"); +``` + +Now read the data and take a look at the first few rows: + +```{python} +t = ( # <1> + ibis.read_parquet(f"gs://{dir_name}/file_id=*.parquet") # <2> + .mutate( # <3> + timestamp=ibis._["timestamp"].cast("timestamp"), + ) # <3> + .relocate( # <4> + "instance_type", + "system", + "sf", + "query_number", + "execution_seconds", + "timestamp", + ) # <4> + .cache() # <5> +) +t.head() # <6> +``` + +1. Assign the table to a variable +2. Read the Parquet files from GCS +3. Cast the `timestamp` column to a timestamp type +4. Reorder the columns +5. Cache the table to avoid re-reading cloud data +6. Display the first few rows + +We'll also create a table with details on each instance type including the CPU +type, number of cores, and memory in gigabytes: + +```{python} +#| code-fold: true +#| code-summary: "Show code to get instance details" +cpu_type_cases = ( + ibis.case() + .when( + ibis._["instance_type"].startswith("n2d"), + "AMD EPYC", + ) + .when( + ibis._["instance_type"].startswith("n2"), + "Intel Cascade and Ice Lake", + ) + .when( + ibis._["instance_type"].startswith("c3"), + "Intel Sapphire Rapids", + ) + .when( + ibis._["instance_type"] == "work laptop", + "Apple M1 Max", + ) + .when( + ibis._["instance_type"] == "personal laptop", + "Apple M2 Max", + ) + .else_("unknown") + .end() +) +cpu_num_cases = ( + ibis.case() + .when( + ibis._["instance_type"].contains("-"), + ibis._["instance_type"].split("-")[-1].cast("int"), + ) + .when(ibis._["instance_type"].contains("laptop"), 12) + .else_(0) + .end() +) +memory_gb_cases = ( + ibis.case() + .when( + ibis._["instance_type"].contains("-"), + ibis._["instance_type"].split("-")[-1].cast("int") * 4, + ) + .when(ibis._["instance_type"] == "work laptop", 32) + .when(ibis._["instance_type"] == "personal laptop", 96) + .else_(0) + .end() +) + +instance_details = ( + t.group_by("instance_type") + .agg() + .mutate( + cpu_type=cpu_type_cases, cpu_cores=cpu_num_cases, memory_gbs=memory_gb_cases + ) +).order_by("memory_gbs", "cpu_cores", "instance_type") + +cpu_types = sorted( + instance_details.distinct(on="cpu_type")["cpu_type"].to_pyarrow().to_pylist() +) + +instance_details +``` + +### What's in the data? + +With the data, we can see we ran the benchmark on scale factors: + +```{python} +sfs = sorted(t.distinct(on="sf")["sf"].to_pyarrow().to_pylist()) +sfs +``` + +:::{.callout-note title="What is a scale factor?" 
collapse="true"} +A scale factor is roughly the size of the data in memory in gigabytes. For +example, a scale factor of 1 means the data is roughly 1GB in memory. + +Stored on disk in (compressed) Parquet format, the data is smaller -- about +0.38GB for scale factor 1 with the compression settings used in this benchmark. +::: + +We can look at the total execution time by scale factor: + +```{python} +#| code-fold: true +#| code-summary: "Show bar plot code" +c = px.bar( + t.group_by("sf").agg(total_seconds=t["execution_seconds"].sum()), + x="sf", + y="total_seconds", + category_orders={"sf": sfs}, + title="total execution time by scale factor", +) +c +``` + +You can see this is roughly linear as expected. + +We ran on the following queries: + +```{python} +query_numbers = sorted( + t.distinct(on="query_number")["query_number"].to_pyarrow().to_pylist() +) +query_numbers +``` + +:::{.callout-note title="What is a query number?" collapse="true"} +The TPC-H benchmark defines 22 queries. See the [TPC-H benchmark +specification](https://www.tpc.org/TPC_Documents_Current_Versions/pdf/TPC-H_v3.0.1.pdf) +for more information. +::: + +We can look at the total execution time by query number: + +```{python} +#| code-fold: true +#| code-summary: "Show bar plot code" +c = px.bar( + t.group_by("query_number").agg(total_seconds=t["execution_seconds"].sum()), + x="query_number", + y="total_seconds", + category_orders={"query_number": query_numbers}, + title="total execution time by query number", +) +c +``` + +This gives us a sense of the relative complexity of the queries. + +We ran on the following instance types: + +```{python} +instance_types = sorted( + t.distinct(on="instance_type")["instance_type"].to_pyarrow().to_pylist(), + key=lambda x: (x.split("-")[0], int(x.split("-")[-1])) # <1> + if "-" in x # <1> + else ("z" + x[3], 0), # <2> +) +instance_types +``` + +1. This is to sort the instance types by CPU architecture and number of cores +2. This is to sort "personal laptop" after "work laptop" + +:::{.callout-note title="What is an instance type?" collapse="true"} +An instance type is the compute the benchmark was run on. This consists of two +MacBook Pro laptops (one work and one personal) and a number of Google Cloud +Compute Engine instances. + +For cloud VMs, the instance type is in the form of `