test: Add query benchmarking suite for pg_analytics #1703

shamb0 · 2024-09-24T16:08:50Z

Ticket(s) Closed

Closes feat: Introduce Qdrant Sink #57

This PR is part of a pair; please consider both for review and merge.

What

This PR implements benchmarking functionality to analyze query performance under different caching conditions across various data sources supported by pg_analytics.

Why

To evaluate how different cache configurations impact query performance, ensuring that the system optimally handles various data sources and caching scenarios.

How

The test function follows a structured flow:

Parquet File Check: Verifies the existence of a Parquet file at a specified path; if absent, generates the file.
Data Loading: Loads the Parquet data into a DataFrame using a DataFusion session context.
S3 Setup: Configures an S3 bucket for storing partitioned data.
Data Partitioning: Partitions the data and uploads it to the S3 bucket.
Database Setup: Sets up PostgreSQL tables using the data from S3.
Cache Configuration: Configures caching options such as disk or memory cache.
Benchmark Execution: Executes benchmark iterations with different cache configurations using the criterion framework.
Benchmark Analysis: Analyzes the results using the default metrics from criterion.
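
The benchmark-execution step above can be pictured with a minimal sketch. The PR uses the criterion framework; this illustration swaps in plain `std::time::Instant` timing so it needs no dependencies. The `CacheMode` variants and the `mean_duration` and `run_all` names are assumptions made for this sketch, not identifiers from the PR, and the `query` closure stands in for whatever actually runs the SQL against Postgres.

```rust
use std::time::{Duration, Instant};

/// Cache scenarios to compare. Variant names are illustrative assumptions.
#[derive(Debug, Clone, Copy, PartialEq)]
enum CacheMode {
    Disabled,
    Memory,
    Disk,
}

/// Runs `query` `iterations` times under one cache mode and returns the
/// mean wall-clock time per iteration.
fn mean_duration<F: FnMut(CacheMode)>(mode: CacheMode, iterations: u32, mut query: F) -> Duration {
    assert!(iterations > 0, "need at least one iteration");
    let start = Instant::now();
    for _ in 0..iterations {
        query(mode);
    }
    start.elapsed() / iterations
}

/// Benchmarks every cache mode with the same query closure, in a fixed order.
fn run_all<F: FnMut(CacheMode)>(iterations: u32, mut query: F) -> Vec<(CacheMode, Duration)> {
    let mut results = Vec::new();
    for mode in [CacheMode::Disabled, CacheMode::Memory, CacheMode::Disk] {
        results.push((mode, mean_duration(mode, iterations, &mut query)));
    }
    results
}
```

Unlike criterion, this sketch does no warm-up, outlier detection, or statistical analysis; it only shows the shape of iterating the same query over several cache configurations.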

The SQL command below toggles in-memory Parquet metadata caching:

SELECT duckdb_execute($$SET enable_object_cache={cache_setting}$$)

Where cache_setting can be either "true" or "false", depending on the test scenario.
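
As a rough illustration of how a harness might build this statement for each scenario, here is a minimal Rust sketch. The helper name `cache_toggle_sql` is hypothetical and not taken from the PR; only the `duckdb_execute` call and the `enable_object_cache` setting come from the command above.

```rust
/// Builds the statement that turns the Parquet metadata (object) cache on or off.
/// `cache_toggle_sql` is a hypothetical helper name used only for this sketch.
fn cache_toggle_sql(enabled: bool) -> String {
    // A `bool` formats as "true" or "false", the two values allowed for cache_setting.
    format!(
        "SELECT duckdb_execute($$SET enable_object_cache={}$$)",
        enabled
    )
}
```

For example, `cache_toggle_sql(false)` yields `SELECT duckdb_execute($$SET enable_object_cache=false$$)`.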

Benchmarking

To run the benchmarks, use the following commands:

cd ./cargo-paradedb
RUST_LOG=info cargo run -- paradedb pga-bench parquet-run-all

Integration Notes

The diagram below outlines key components and their interactions, providing a high-level overview of the prototype design:

%% Top-to-Bottom Layout
flowchart TB
    subgraph paradedb["paradedb"]
        direction TB
        package11["cargo-paradedb<br>(rs)<br>Common Benchmarking<br>Orchestrator"]
    end
    subgraph pg_analytics["pg_analytics"]
        direction TB
        package21["tests<br>(rs)<br>Integration Test<br>"]
        package22["tests<br>(rs)<br>fixtures<br>& Tables"]
    end
    subgraph Postgres["Postgres"]
        direction TB
        package31["pg_search<br>(rs)<br>Extension<br>"]
        package32["pg_analytics<br>(rs)<br>Extension<br>"]
    end
    package21 -->|Uses| package22
    package11 -->|Uses<br>As git submodule| package22
    package11 -->|Query| package31
    package22 -->|Query| package32

philippemnoel · 2024-09-27T15:45:38Z

.gitmodules

@@ -0,0 +1,4 @@
+[submodule "pg_analytics/downloads"]


Can we do this without a submodule please? The code should just be in cargo-paradedb. We plan to eventually move cargo-paradedb to a separate repository.

philippemnoel · 2024-09-27T15:46:10Z

cargo-paradedb/Cargo.toml

@@ -4,14 +4,22 @@ version = "0.10.1"
 edition = "2021"

 # See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
+[features]


@neilyio should review here. I believe he intentionally made cargo-paradedb be Postgres-agnostic. We may want to preserve that if we can.

philippemnoel

There's this strange folder called /downloads in pg_analytics. Can we remove that?

shamb0 · 2024-09-28T16:39:19Z

Hi @philippemnoel,

Thank you for your feedback on the PR. I've made some changes to address the issues you raised. In the initial implementation, the pg_analytics benchmarking module had a direct dependency on the fixtures, which led to the introduction of the paradedb/pg_analytics/downloads submodule to build the fixtures locally. This approach was necessary because the fixtures were tightly coupled with the integration tests in pg_analytics/tests/fixtures.

To resolve this, I've implemented a solution that decouples the fixtures from the integration tests. I've created a new, shared test utility crate called pga_fixtures. This crate is now used across both the benchmarking module and integration tests, which should resolve the dependency issues we were facing.

%% Top-to-Bottom Layout
flowchart TB
    subgraph paradedb["paradedb"]
        direction TB
        package11["cargo-paradedb<br>(rs)<br>Common Benchmarking<br>Orchestrator"]
    end
    subgraph pg_analytics["pg_analytics"]
        direction TB
        package21["tests<br>(rs)<br>Integration Test<br>"]
        package22["pga_fixtures"]
    end
    subgraph Postgres["Postgres"]
        direction TB
        package31["pg_search<br>(rs)<br>Extension<br>"]
        package32["pg_analytics<br>(rs)<br>Extension<br>"]
    end
    package21 -->|Uses| package22
    package11 -->|Uses<br>| package22
    package11 -->|Query| package31
    package22 -->|Query| package32

I believe this updated patch meets the requirements we discussed. I'm eager to hear your thoughts on the refactored code and am open to any further suggestions or adjustments you might have. Please let me know if you need any additional information or clarification on the changes I've made.

philippemnoel · 2024-09-28T21:24:11Z

Cargo.toml

@@ -16,3 +16,6 @@ inherits = "release"
 debug = true
 lto = "thin"
 codegen-units = 32
+
+[patch.crates-io]
+testcontainers = { package = "testcontainers", git = "https://github.com/shamb0/testcontainers-rs.git", rev = "b05c13d" }


Can this be upstreamed?

philippemnoel · 2024-09-28T21:24:19Z

.gitmodules

Could we remove this file?

philippemnoel · 2024-09-28T21:24:53Z

cargo-paradedb/Cargo.toml

+tracing-subscriber = { version = "0.3.18", features = ["env-filter", "time"] }
+datafusion = { version = "37.1.0" }
+tokio = { version = "1.0", features = ["full"] }
+bytes = { version = "1.7.1" }
+prettytable = { version = "0.10.0" }
+time = { version = "0.3.34", features = ["serde", "macros", "local-offset"] }
+rand = { version = "0.8.5" }
+approx = { version = "0.5.1" }
+bigdecimal = { version = "0.3.1", features = ["serde"] }
+soa_derive = { version = "0.13.0" }
+aws-config = { version = "1.5.6" }
+aws-sdk-s3 = { version = "1.49.0" }
+serde_arrow = { version = "0.11.7", features = ["arrow-51"] }
+testcontainers = { version = "0.22.0" }
+testcontainers-modules = { version = "0.10.0", features = ["localstack"] }
+rstest = { version = "0.19.0" }
+duckdb = { git = "https://github.com/paradedb/duckdb-rs.git", features = [
+  "bundled",
+  "extensions-full",
+], rev = "e532dd6" }
+cargo_metadata = { version = "0.18.0" }
+camino = { version = "1.0.7", features = ["serde1"] }
+pga_fixtures = { package = "pga_fixtures", git = "https://github.com/shamb0/pg_analytics", rev = "4511eba" }


Do we really need all of these new dev dependencies?

philippemnoel · 2024-09-28T21:25:21Z

cargo-paradedb/src/benchmark/pgs_bench_sanity.rs

Why the rename?

philippemnoel · 2024-09-28T21:26:18Z

docs/ingest/configuration/multi-level-partitioned-tables.mdx

@@ -0,0 +1,59 @@
+---
+title: Multilevel Partition Tables


Multi-Level

philippemnoel · 2024-09-28T21:26:48Z

cargo-paradedb/src/subcommand.rs

@@ -15,7 +15,7 @@
 // You should have received a copy of the GNU Affero General Public License
 // along with this program. If not, see <http://www.gnu.org/licenses/>.

-use crate::benchmark::Benchmark;
+use crate::benchmark::{pga_bench_parquet, pgs_bench_sanity};


why sanity?

philippemnoel · 2024-09-28T21:30:18Z

This is still too complicated for my liking. The current state of cargo-paradedb in dev is good. It does not depend on Postgres, which makes it easier to test and run for everyone. Today, it is used to test pg_search. Could we do the exact same, but for pg_analytics support? Hopefully it shouldn't need to depend on any fixture, but could also just take a Postgres URL to execute the test against, whatever that may be?

In that world, we wouldn't need this complex setup. We will eventually move cargo-paradedb to its own repository as well. Or is this not possible at all to separate? I don't see why it wouldn't be for pg_analytics while it is possible for pg_search.

philippemnoel · 2024-10-02T18:06:10Z

Going to close this until a more scoped PR is raised.

shamb0 added 8 commits August 25, 2024 16:17

docs: demo writeups update for multi-level partition tables in pg-analytics

a6f10aa

Signed-off-by: shamb0 <[email protected]>

docs: Update MLP table strategy to follow Hive-style partitioning

ad028ac

Signed-off-by: shamb0 <[email protected]>

Merge remote-tracking branch 'upstream/dev' into shamb0/pr-pair-docs-update-demo-multi-level-partition-table-dset-auto-sales

bb8b426

docs: Update MLP table strategy to follow Hive-style partitioning

dbf034f

Signed-off-by: shamb0 <[email protected]>

Merge remote-tracking branch 'upstream/dev' into shamb0/pr-pair-docs-update-demo-multi-level-partition-table-dset-auto-sales

9abfc4f

Add benchmarking extensions

eb61a6c

- Introduced new benchmarking modules:
- Renamed to for clarity
- Integrated as a submodule
- Added fixtures module to support testing

Signed-off-by: shamb0 <[email protected]>

merge to upstream dev

10208c1

Signed-off-by: shamb0 <[email protected]>

merge to upstream dev

42b5670

Signed-off-by: shamb0 <[email protected]>

This was referenced Sep 24, 2024

test: Add query benchmarking suite for pg_analytics paradedb/pg_analytics#131

Closed

test: Benchmark suite for multi-level partitioned tables with DuckDB metadata cache options paradedb/pg_analytics#117

Closed

Refactor: Moved fixtures to 'pga_fixtures' crate to support benchmarking

5fa2f71

Signed-off-by: shamb0 <[email protected]>

philippemnoel closed this Oct 2, 2024

This was referenced Oct 8, 2024

test: Add query benchmarking suite for pg_analytics #1751

Closed

feat: Enable Disk Caching for Multiple Sources (Supersedes PR#30) paradedb/pg_analytics#148

Closed
