Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: Add query benchmarking suite for pg_analytics #1703

Closed

Conversation

shamb0
Copy link

@shamb0 shamb0 commented Sep 24, 2024

Ticket(s) Closed

This PR is part of a pair; please consider both for review and merge.

#1703
paradedb/pg_analytics#131

What

This PR implements benchmarking functionality to analyze query performance under different caching conditions across various data sources supported by pg_analytics.

Why

To evaluate how different cache configurations impact query performance, ensuring that the system optimally handles various data sources and caching scenarios.

How

The test function follows a structured flow:

  1. Parquet File Check: Verifies the existence of a Parquet file at a specified path; if absent, generates the file.
  2. Data Loading: Loads the Parquet data into a DataFrame using a DataFusion session context.
  3. S3 Setup: Configures an S3 bucket for storing partitioned data.
  4. Data Partitioning: Partitions the data and uploads it to the S3 bucket.
  5. Database Setup: Sets up PostgreSQL tables using the data from S3.
  6. Cache Configuration: Configures caching options such as disk or memory cache.
  7. Benchmark Execution: Executes benchmark iterations with different cache configurations using the criterion framework.
  8. Benchmark Analysis: Analyzes the results using the default metrics from criterion.

The SQL command below is used to toggle Parquet metadata caching (In-memory):

SELECT duckdb_execute($$SET enable_object_cache={cache_setting}$$)

Where cache_setting can be either "true" or "false", depending on the test scenario.

Benchmarking

To run the benchmarking, use the following command:

cd ./cargo-paradedb
RUST_LOG=info cargo run -- paradedb pga-bench parquet-run-all

Integration Notes

The diagram below outlines key components and their interactions, providing a high-level overview of the prototype design:

%% Top-to-Bottom Layout
flowchart TB
    subgraph paradedb["paradedb"]
        direction TB
        package11["cargo-paradedb<br>(rs)<br>Common Benchmarking<br>Orchestrator"]
    end
    subgraph pg_analytics["pg_analytics"]
        direction TB
        package21["tests<br>(rs)<br>Integration Test<br>"]
        package22["tests<br>(rs)<br>fixtures<br>& Tables"]
    end
    subgraph Postgres["Postgres"]
        direction TB
        package31["pg_search<br>(rs)<br>Extension<br>"]
        package32["pg_analytics<br>(rs)<br>Extension<br>"]
    end
    package21 -->|Uses| package22
    package11 -->|Uses<br>As git submodule| package22
    package11 -->|Query| package31
    package22 -->|Query| package32
Loading

…update-demo-multi-level-partition-table-dset-auto-sales
…update-demo-multi-level-partition-table-dset-auto-sales
- Introduced new benchmarking modules:
- Renamed  to  for clarity
- Integrated  as a submodule
- Added  fixtures module to support testing

Signed-off-by: shamb0 <[email protected]>
.gitmodules Outdated
@@ -0,0 +1,4 @@
[submodule "pg_analytics/downloads"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we do this without a submodule please? The code should just be in cargo-paradedb. We plan to eventually move cargo-paradedb to a separate repository

@@ -4,14 +4,22 @@ version = "0.10.1"
edition = "2021"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
[features]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@neilyio should review here. I believe he intentionally made cargo-paradedb be Postgres-agnostic. We may want to preserve that if we can.

Copy link
Collaborator

@philippemnoel philippemnoel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's this strange folder called /downloads in pg_analytics. Can we remove that?

@shamb0
Copy link
Author

shamb0 commented Sep 28, 2024

Hi @philippemnoel,

Thank you for your feedback on the PR. I've made some changes to address the issues you raised. In the initial implementation, the pg_analytics benchmarking module had a direct dependency on the fixtures, which led to the introduction of the paradedb/pg_analytics/downloads submodule to build the fixtures locally. This approach was necessary because the fixtures were tightly coupled with the integration tests in pg_analytics/tests/fixtures.

image

To resolve this, I've implemented a solution that decouples the fixtures from the integration tests. I've created a new, shared test utility crate called pga_fixtures. This crate is now used across both the benchmarking module and integration tests, which should resolve the dependency issues we were facing.

%% Top-to-Bottom Layout
flowchart TB
    subgraph paradedb["paradedb"]
        direction TB
        package11["cargo-paradedb<br>(rs)<br>Common Benchmarking<br>Orchestrator"]
    end
    subgraph pg_analytics["pg_analytics"]
        direction TB
        package21["tests<br>(rs)<br>Integration Test<br>"]
        package22["pga_fixtures"]
    end
    subgraph Postgres["Postgres"]
        direction TB
        package31["pg_search<br>(rs)<br>Extension<br>"]
        package32["pg_analytics<br>(rs)<br>Extension<br>"]
    end
    package21 -->|Uses| package22
    package11 -->|Uses<br>| package22
    package11 -->|Query| package31
    package22 -->|Query| package32
Loading

I believe this updated patch meets the requirements we discussed. I'm eager to hear your thoughts on the refactored code and am open to any further suggestions or adjustments you might have. Please let me know if you need any additional information or clarification on the changes I've made.

@@ -16,3 +16,6 @@ inherits = "release"
debug = true
lto = "thin"
codegen-units = 32

[patch.crates-io]
testcontainers = { package = "testcontainers", git = "https://github.com/shamb0/testcontainers-rs.git", rev = "b05c13d" }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be upstreamed?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we remove this file?

Comment on lines +32 to +54
tracing-subscriber = { version = "0.3.18", features = ["env-filter", "time"] }
datafusion = { version = "37.1.0" }
tokio = { version = "1.0", features = ["full"] }
bytes = { version = "1.7.1" }
prettytable = { version = "0.10.0" }
time = { version = "0.3.34", features = ["serde", "macros", "local-offset"] }
rand = { version = "0.8.5" }
approx = { version = "0.5.1" }
bigdecimal = { version = "0.3.1", features = ["serde"] }
soa_derive = { version = "0.13.0" }
aws-config = { version = "1.5.6" }
aws-sdk-s3 = { version = "1.49.0" }
serde_arrow = { version = "0.11.7", features = ["arrow-51"] }
testcontainers = { version = "0.22.0" }
testcontainers-modules = { version = "0.10.0", features = ["localstack"] }
rstest = { version = "0.19.0" }
duckdb = { git = "https://github.com/paradedb/duckdb-rs.git", features = [
"bundled",
"extensions-full",
], rev = "e532dd6" }
cargo_metadata = { version = "0.18.0" }
camino = { version = "1.0.7", features = ["serde1"] }
pga_fixtures = { package = "pga_fixtures", git = "https://github.com/shamb0/pg_analytics", rev = "4511eba" }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need all of these new dev dependencies?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the rename?

@@ -0,0 +1,59 @@
---
title: Multilevel Partition Tables
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multi-Level

@@ -15,7 +15,7 @@
// You should have received a copy of the GNU Affero General Public License
// along with this program. If not, see <http://www.gnu.org/licenses/>.

use crate::benchmark::Benchmark;
use crate::benchmark::{pga_bench_parquet, pgs_bench_sanity};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why sanity?

@philippemnoel
Copy link
Collaborator

Hi @philippemnoel,

Thank you for your feedback on the PR. I've made some changes to address the issues you raised. In the initial implementation, the pg_analytics benchmarking module had a direct dependency on the fixtures, which led to the introduction of the paradedb/pg_analytics/downloads submodule to build the fixtures locally. This approach was necessary because the fixtures were tightly coupled with the integration tests in pg_analytics/tests/fixtures.

image

To resolve this, I've implemented a solution that decouples the fixtures from the integration tests. I've created a new, shared test utility crate called pga_fixtures. This crate is now used across both the benchmarking module and integration tests, which should resolve the dependency issues we were facing.

%% Top-to-Bottom Layout
flowchart TB
    subgraph paradedb["paradedb"]
        direction TB
        package11["cargo-paradedb<br>(rs)<br>Common Benchmarking<br>Orchestrator"]
    end
    subgraph pg_analytics["pg_analytics"]
        direction TB
        package21["tests<br>(rs)<br>Integration Test<br>"]
        package22["pga_fixtures"]
    end
    subgraph Postgres["Postgres"]
        direction TB
        package31["pg_search<br>(rs)<br>Extension<br>"]
        package32["pg_analytics<br>(rs)<br>Extension<br>"]
    end
    package21 -->|Uses| package22
    package11 -->|Uses<br>| package22
    package11 -->|Query| package31
    package22 -->|Query| package32
Loading

I believe this updated patch meets the requirements we discussed. I'm eager to hear your thoughts on the refactored code and am open to any further suggestions or adjustments you might have. Please let me know if you need any additional information or clarification on the changes I've made.

This is still too complicated for my liking. The current state of cargo-paradedb in dev is good. It does not depend on Postgres, which makes it easier to test and run for everyone. Today, it is used to test pg_search. Could we do the exact same, but for pg_analytics support? Hopefully it shouldn't need to depend on any fixture, but could also just take a Postgres URL to execute the test against, whatever that may be?

In that world, we wouldn't need this complex setup. We will eventually move cargo-paradedb to its own repository as well. Or is this not possible at all to separate? I don't see why it wouldn't be for pg_analytics while it is possible for pg_search.

@philippemnoel
Copy link
Collaborator

Going to close this until a more scoped PR is raised.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants