
test: Add query benchmarking suite for pg_analytics #131

Closed

Conversation

@shamb0 commented Sep 24, 2024

Ticket(s) Closed

This PR is part of a pair; please consider both for review and merge.

paradedb/paradedb#1703
#131

What

This PR implements benchmarking functionality to analyze query performance under different caching conditions across various data sources supported by pg_analytics.

Why

To evaluate how different cache configurations impact query performance, and to verify that the system handles the various supported data sources and caching scenarios well.

How

The test function follows a structured flow (a sketch of the benchmark loop follows the list):

  1. Parquet File Check: Verifies the existence of a Parquet file at a specified path; if absent, generates the file.
  2. Data Loading: Loads the Parquet data into a DataFrame using a DataFusion session context.
  3. S3 Setup: Configures an S3 bucket for storing partitioned data.
  4. Data Partitioning: Partitions the data and uploads it to the S3 bucket.
  5. Database Setup: Sets up PostgreSQL tables using the data from S3.
  6. Cache Configuration: Configures caching options such as disk or memory cache.
  7. Benchmark Execution: Executes benchmark iterations with different cache configurations using the criterion framework.
  8. Benchmark Analysis: Analyzes the results using the default metrics from criterion.
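
As a rough illustration of steps 6 and 7, a minimal criterion loop might look like the sketch below. It assumes `sqlx` with the Tokio runtime and criterion's `async_tokio` feature; the connection URL, table name, and `run_query` helper are hypothetical placeholders, not the PR's actual API:

```rust
use criterion::{criterion_group, criterion_main, Criterion};
use sqlx::PgPool;

// Hypothetical helper: toggle DuckDB's object cache, then run the query
// under measurement against the Parquet-backed foreign table.
async fn run_query(pool: &PgPool, cache_enabled: bool) -> sqlx::Result<()> {
    let toggle = format!("SELECT duckdb_execute($$SET enable_object_cache={cache_enabled}$$)");
    sqlx::query(&toggle).execute(pool).await?;
    sqlx::query("SELECT COUNT(*) FROM benchmark_parquet_table")
        .fetch_one(pool)
        .await?;
    Ok(())
}

fn bench_cache_configs(c: &mut Criterion) {
    let rt = tokio::runtime::Runtime::new().unwrap();
    // Placeholder URL; in practice this would come from configuration.
    let pool = rt
        .block_on(PgPool::connect("postgres://localhost:5432/paradedb"))
        .unwrap();

    // Benchmark the same query with the metadata cache disabled and enabled.
    for cache_enabled in [false, true] {
        let id = format!("parquet_query/object_cache_{cache_enabled}");
        c.bench_function(&id, |b| {
            b.to_async(&rt).iter(|| run_query(&pool, cache_enabled));
        });
    }
}

criterion_group!(benches, bench_cache_configs);
criterion_main!(benches);
```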

The following SQL command toggles in-memory caching of Parquet metadata:

```sql
SELECT duckdb_execute($$SET enable_object_cache={cache_setting}$$)
```

where `cache_setting` is either `true` or `false`, depending on the test scenario.
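
For instance, the two concrete settings exercised by the benchmarks are:

```sql
-- Enable in-memory caching of Parquet metadata
SELECT duckdb_execute($$SET enable_object_cache=true$$);

-- Disable caching, so metadata is re-read on each query
SELECT duckdb_execute($$SET enable_object_cache=false$$);
```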

Benchmarking

To run the benchmarks, use the following commands:

```sh
cd ./cargo-paradedb
RUST_LOG=info cargo run -- paradedb pga-bench parquet-run-all
```

Integration Notes

The diagram below outlines key components and their interactions, providing a high-level overview of the prototype design:

```mermaid
%% Top-to-Bottom Layout
flowchart TB
    subgraph paradedb["paradedb"]
        direction TB
        package11["cargo-paradedb<br>(rs)<br>Common Benchmarking<br>Orchestrator"]
    end
    subgraph pg_analytics["pg_analytics"]
        direction TB
        package21["tests<br>(rs)<br>Integration Test<br>"]
        package22["tests<br>(rs)<br>fixtures<br>& Tables"]
    end
    subgraph Postgres["Postgres"]
        direction TB
        package31["pg_search<br>(rs)<br>Extension<br>"]
        package32["pg_analytics<br>(rs)<br>Extension<br>"]
    end
    package21 -->|Uses| package22
    package11 -->|Uses<br>As git submodule| package22
    package11 -->|Query| package31
    package22 -->|Query| package32
```

- Profiled query performance on a foreign table with and without the DuckDB metadata cache enabled
- Tested on Hive-style partitioned data in S3 to simulate real-world scenarios
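
For context, "Hive-style" means the partition keys are encoded in the object path. An illustrative layout (bucket, prefix, and partition columns here are hypothetical):

```text
s3://demo-bucket/sales/year=2024/month=01/part-0.parquet
s3://demo-bucket/sales/year=2024/month=02/part-0.parquet
s3://demo-bucket/sales/year=2025/month=01/part-0.parquet
```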

Signed-off-by: shamb0 <[email protected]>
…d testcontainers version

- Merged changes from [PR#30](paradedb#30).
- Integrated benchmarking for a Hive-style partitioned Parquet file source.
- Applied a patched version of testcontainers to address an async container cleanup issue.

Signed-off-by: shamb0 <[email protected]>
- Verified:
  - Test harness: pass
  - Integration test: pass
  - Benchmarking: pass

Signed-off-by: shamb0 <[email protected]>
…ity and consistency in tests.

- Adjusted module imports accordingly.

Signed-off-by: shamb0 <[email protected]>
@philippemnoel (Collaborator) left a comment

Thank you for submitting the PR to paradedb/paradedb for the analytics work. We'll review that and get it merged there. If you want to adjust this PR to no longer contain that work, we can review this one too.

@philippemnoel (Collaborator) left a comment

Hi @shamb0, these PRs have grown extremely large and are basically impossible to review properly.

Could we scope this work better so it's easier to merge? They've been blocked for a long time because of how large they are.

  • cargo-paradedb should host all benchmarking code
  • it should not depend on Postgres and simply take a PG DB URL
  • The new test fixtures are nice, but perhaps we can PR them separately with a rationale for why we need new test fixtures?

This will help bring this in.

@shamb0 (Author) commented Sep 29, 2024

Hi @philippemnoel,

Haha, I definitely understand the challenge of reviewing large PRs! You're right: this PR combines changes from multiple sources, which has made it more complex; breaking these down will significantly improve the review process.

Based on your suggestions, here's my proposed approach to restructure the work:

| Component | Action Plan |
| --- | --- |
| Benchmarking code | Move to cargo-paradedb; use a PG DB URL instead of a direct Postgres dependency |
| New pga_fixtures crate | Create a separate PR with a rationale for the new test fixtures |
| Changes from PR #30 | Split into a separate PR for easier review |
| Remaining changes (PR #115) | Keep in the current PR, but streamline for focused review |

This breakdown should make each PR more manageable and easier to review. I'll start implementing these changes right away.

Do you think this approach addresses your concerns?

@philippemnoel (Collaborator) commented:

> [quotes @shamb0's proposed restructuring plan above]

Sounds promising!

@philippemnoel (Collaborator) commented:

Going to close this until a more scoped PR is raised.

Successfully merging this pull request may close: Cache Parquet metadata in pg_analytics