
test: Add query benchmarking suite for pg_analytics #131

Closed

Conversation

@shamb0 commented Sep 24, 2024

Ticket(s) Closed

This PR is part of a pair; please consider both for review and merge.

paradedb/paradedb#1703
#131

What

This PR implements benchmarking functionality to analyze query performance under different caching conditions across various data sources supported by pg_analytics.

Why

To evaluate how different cache configurations impact query performance, and to verify that the system handles the various supported data sources and caching scenarios well.

How

The test function follows a structured flow (a sketch of the benchmark loop follows the list):

  1. Parquet File Check: Verifies the existence of a Parquet file at a specified path; if absent, generates the file.
  2. Data Loading: Loads the Parquet data into a DataFrame using a DataFusion session context.
  3. S3 Setup: Configures an S3 bucket for storing partitioned data.
  4. Data Partitioning: Partitions the data and uploads it to the S3 bucket.
  5. Database Setup: Sets up PostgreSQL tables using the data from S3.
  6. Cache Configuration: Configures caching options such as disk or memory cache.
  7. Benchmark Execution: Executes benchmark iterations with different cache configurations using the criterion framework.
  8. Benchmark Analysis: Analyzes the results using the default metrics from criterion.
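
As a rough illustration of steps 6 and 7, a minimal criterion loop might look like the sketch below. It assumes `sqlx` with the Tokio runtime and criterion's `async_tokio` feature; the connection URL, table name, and `run_query` helper are hypothetical placeholders, not the PR's actual API:

```rust
use criterion::{criterion_group, criterion_main, Criterion};
use sqlx::PgPool;

// Hypothetical helper: toggle DuckDB's object cache, then run the query
// under measurement against the Parquet-backed foreign table.
async fn run_query(pool: &PgPool, cache_enabled: bool) -> sqlx::Result<()> {
    let toggle = format!("SELECT duckdb_execute($$SET enable_object_cache={cache_enabled}$$)");
    sqlx::query(&toggle).execute(pool).await?;
    sqlx::query("SELECT COUNT(*) FROM benchmark_parquet_table")
        .fetch_one(pool)
        .await?;
    Ok(())
}

fn bench_cache_configs(c: &mut Criterion) {
    let rt = tokio::runtime::Runtime::new().unwrap();
    // Placeholder URL; in practice this would come from configuration.
    let pool = rt
        .block_on(PgPool::connect("postgres://localhost:5432/paradedb"))
        .unwrap();

    // Benchmark the same query with the metadata cache disabled and enabled.
    for cache_enabled in [false, true] {
        let id = format!("parquet_query/object_cache_{cache_enabled}");
        c.bench_function(&id, |b| {
            b.to_async(&rt).iter(|| run_query(&pool, cache_enabled));
        });
    }
}

criterion_group!(benches, bench_cache_configs);
criterion_main!(benches);
```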

The following SQL command toggles in-memory caching of Parquet metadata:

```sql
SELECT duckdb_execute($$SET enable_object_cache={cache_setting}$$)
```

where `cache_setting` is either `true` or `false`, depending on the test scenario.
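
For instance, the two concrete settings exercised by the benchmarks are:

```sql
-- Enable in-memory caching of Parquet metadata
SELECT duckdb_execute($$SET enable_object_cache=true$$);

-- Disable caching, so metadata is re-read on each query
SELECT duckdb_execute($$SET enable_object_cache=false$$);
```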

Benchmarking

To run the benchmarks, use the following commands:

```sh
cd ./cargo-paradedb
RUST_LOG=info cargo run -- paradedb pga-bench parquet-run-all
```

Integration Notes

The diagram below outlines key components and their interactions, providing a high-level overview of the prototype design:

```mermaid
%% Top-to-Bottom Layout
flowchart TB
    subgraph paradedb["paradedb"]
        direction TB
        package11["cargo-paradedb<br>(rs)<br>Common Benchmarking<br>Orchestrator"]
    end
    subgraph pg_analytics["pg_analytics"]
        direction TB
        package21["tests<br>(rs)<br>Integration Test<br>"]
        package22["tests<br>(rs)<br>fixtures<br>& Tables"]
    end
    subgraph Postgres["Postgres"]
        direction TB
        package31["pg_search<br>(rs)<br>Extension<br>"]
        package32["pg_analytics<br>(rs)<br>Extension<br>"]
    end
    package21 -->|Uses| package22
    package11 -->|Uses<br>As git submodule| package22
    package11 -->|Query| package31
    package22 -->|Query| package32
```

- Profiled query performance on a foreign table with and without the DuckDB metadata cache enabled
- Tested on Hive-style partitioned data in S3 to simulate real-world scenarios
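
For context, "Hive-style" means the partition keys are encoded in the object path. An illustrative layout (bucket, prefix, and partition columns here are hypothetical):

```text
s3://demo-bucket/sales/year=2024/month=01/part-0.parquet
s3://demo-bucket/sales/year=2024/month=02/part-0.parquet
s3://demo-bucket/sales/year=2025/month=01/part-0.parquet
```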

Signed-off-by: shamb0 <[email protected]>
…d testcontainers version

- Merged changes from [PR#30](paradedb#30).
- Integrated benchmarking for a Hive-style partitioned Parquet file source.
- Applied a patched version of testcontainers to address an async container cleanup issue.

Signed-off-by: shamb0 <[email protected]>
- Verified:
  - Test harness: pass
  - Integration test: pass
  - Benchmarking: pass

Signed-off-by: shamb0 <[email protected]>
…ity and consistency in tests.

- Adjusted module imports accordingly.

Signed-off-by: shamb0 <[email protected]>
@philippemnoel (Collaborator) left a comment

Thank you for submitting the PR to paradedb/paradedb for the analytics work. We'll review that and get it merged there. If you want to adjust this PR to no longer contain that work, we can review this one too.

@philippemnoel (Collaborator) left a comment

Hi @shamb0, these PRs have grown extremely large and are basically impossible to review properly.

Could we scope this work better so it's easier to merge? They've been blocked for a long time because of how large they are.

  • cargo-paradedb should host all benchmarking code
  • it should not depend on Postgres and simply take a PG DB URL
  • The new test fixtures are nice, but perhaps we can PR them separately with a rationale for why we need new test fixtures?

This will help bring this in.

@shamb0 (Author) commented Sep 29, 2024

Hi @philippemnoel,

Haha, I definitely understand the challenge of reviewing large PRs! You're right: this PR combines changes from multiple sources, which has made it more complex; breaking these down will significantly improve the review process.

Based on your suggestions, here's my proposed approach to restructure the work:

| Component | Action Plan |
| --- | --- |
| Benchmarking code | Move to cargo-paradedb; use a PG DB URL instead of a direct Postgres dependency |
| New pga_fixtures crate | Create a separate PR with a rationale for the new test fixtures |
| Changes from PR #30 | Split into a separate PR for easier review |
| Remaining changes (PR #115) | Keep in the current PR, but streamline for focused review |

This breakdown should make each PR more manageable and easier to review. I'll start implementing these changes right away.

Do you think this approach addresses your concerns?

@philippemnoel (Collaborator) commented:

> [quotes @shamb0's proposed restructuring plan above]

Sounds promising!

@philippemnoel (Collaborator) commented:

Going to close this until a more scoped PR is raised.

Successfully merging this pull request may close: Cache Parquet metadata in pg_analytics