The goal of this step is to produce a collection of JSON Lines data files for loading by subsequent steps.
A collection of such files (as a tar.gz ball) for a sample data slice is provided here; it contains data on 2000 blocks numbered [19005000, 19007000).
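If you just want the sample slice, download the tarball and unpack it into a working directory. The filename below is hypothetical; use whatever name the downloaded file has:

```sh
tar xzf crypq-sample.tar.gz   # hypothetical filename; unpacks the .jsonl data files
```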
To extract a different slice, follow the procedure below:
- If you'd like to interact with BigQuery from the command line or from Python, install the gcloud CLI and/or the BigQuery Python client library:
  - https://cloud.google.com/sdk/docs/install
  - https://cloud.google.com/bigquery/docs/reference/libraries#client-libraries-install-python

  Downloads can also be made through the BigQuery web interface.
- Create a dataset on BigQuery, named the same as the value of `@@dataset_id` that is set at the beginning of the script `bigquery_extract.sql`.
- Run `bigquery_extract.sql` in BigQuery to populate your dataset using the Google-maintained public dataset `bigquery-public-data.crypto_ethereum`.
  - Note: The structure of the Google-maintained dataset may change from time to time and is beyond our control. We will continuously monitor it and modify `bigquery_extract.sql` to maintain compatibility with the CrypQ schema and the remaining setup steps. (The last update happened in August 2024.)
- Export the data files from all tables in the extracted dataset in the JSON Lines format, and make them available under a directory for the next step. Name each file `tablename.jsonl`, where `tablename` is the name of the table (all lower-case). One possible way to do the export with the `bq` command-line tool is sketched right after this list.
  - Note: If you have downloaded the data files in JSON instead of JSON Lines format, you can use jq, a lightweight command-line JSON processor, to convert a JSON file into a JSONL file:

        jq -c '.[]' tablename.json > tablename.jsonl
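The sketch below shows one way to perform the export step with the `bq` and `gsutil` command-line tools; the project, dataset, and bucket names are placeholders (the dataset name must match `@@dataset_id`), and the BigQuery web interface works just as well:

```sh
# Placeholder names; substitute your own project, dataset, and Cloud Storage bucket.
PROJECT=my-gcp-project
DATASET=crypq
BUCKET=gs://my-crypq-bucket

# Export every table in the dataset as newline-delimited JSON to Cloud Storage,
# then download the files locally. Rename files to all-lowercase table names if needed.
for t in $(bq ls --format=json "$PROJECT:$DATASET" | jq -r '.[].tableReference.tableId'); do
  bq extract --destination_format NEWLINE_DELIMITED_JSON \
      "$PROJECT:$DATASET.$t" "$BUCKET/$t.jsonl"
done
gsutil -m cp "$BUCKET/*.jsonl" .
```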
The following instructions assume PostgreSQL; steps for other database systems may vary.
Pick a name for your database, say `crypq`.
- Create the database schema with empty tables:

      dropdb crypq
      createdb crypq
      psql crypq -f create.sql

  The file `create.sql` is extensively documented with explanations of how data is represented.
- From the directory containing the JSON Lines files:

      psql crypq -f load.sql
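As a quick sanity check, assuming your slice matches the sample (2000 blocks) and that the schema includes a `blocks` table (consult `create.sql` for the actual table names), you might verify the row count:

```sh
# Expect 2000 rows for the sample slice covering blocks [19005000, 19007000).
psql crypq -c "SELECT count(*) FROM blocks;"
```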
Congrats! Your slice of the Ethereum database is now ready to go!
The benchmark queries reside in the `queries/` subdirectory as `.sql` files.
Some of the queries are parameterized: they have concrete default settings but can be overridden.
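For example, assuming a parameterized query exposes its parameters as psql variables (the same mechanism `expire.sql` uses below), an override might look like the following; the query filename and variable name here are hypothetical, so consult the actual query files:

```sh
# Run a query with its built-in defaults, then again with an overridden parameter.
psql crypq -f queries/q01.sql
psql crypq -v BLK_START=19006500 -f queries/q01.sql
```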
To prepare an update workload, you first need to extract a larger slice of data: the database will start out containing some blocks at the beginning of the extracted slice, and then gradually evolve according to the subsequent blocks.
In the following, we assume that you have already prepared a slice with 2000 blocks numbered [19005000, 19007000) in a database named `crypq`.
Scripts related to updates reside in the `updates/` subdirectory, where the following should be run.
The following command would make `crypq` contain the first 1000 blocks initially, and output a sequence of `.sql` files under `outdir/` that you can use to "play forward" the updates 1 block at a time:

    ./gen_updates.sh crypq 19006000 1 outdir
The output `outdir/upserts-*.sql` files should be executed in sequence starting from the initial database state, as follows:
psql crypq -f outdir/upserts-19006000.sql
psql crypq -f outdir/upserts-19006001.sql
# ...
psql crypq -f outdir/upserts-19006999.sql
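Equivalently, a small shell loop can play all of the upserts forward; shell glob expansion sorts the filenames lexicographically, which matches block order here since every block number has the same number of digits:

```sh
# Apply all generated upserts in block order.
for f in outdir/upserts-*.sql; do
  psql crypq -f "$f"
done
```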
Optionally, if you prefer to keep the database size roughly constant over time, you can call the `expire.sql` script with appropriate parameters to remove data pertaining to old blocks.
For example, to keep the database at 1000 blocks, you would do:
psql crypq -v BLK_START=19005001 -f expire.sql
psql crypq -f outdir/upserts-19006000.sql
psql crypq -v BLK_START=19005002 -f expire.sql
psql crypq -f outdir/upserts-19006001.sql
# ...
psql crypq -v BLK_START=19006000 -f expire.sql
psql crypq -f outdir/upserts-19006999.sql
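The same interleaving can be written as a loop over the sample slice's block range, maintaining a 1000-block sliding window (this sketch assumes the block numbers above):

```sh
# Expire the oldest block, then apply the next block's upserts,
# keeping a sliding window of 1000 blocks.
for blk in $(seq 19006000 19006999); do
  psql crypq -v BLK_START=$((blk - 999)) -f expire.sql
  psql crypq -f "outdir/upserts-${blk}.sql"
done
```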
Finally, the `gen_updates.sh` script saves the full slice and the initial database state in `.sql.gz` files, which you can use to restore these states later (see the contents of `run_updates.sh` for usage examples).
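For instance, assuming the initial state was saved as a file named something like `initial.sql.gz` (the actual filename is chosen by `gen_updates.sh`; check its output or `run_updates.sh`), one way to restore it into a fresh database is:

```sh
# Recreate the database and restore the saved initial state.
# The .sql.gz filename is a placeholder; use the one gen_updates.sh produced.
dropdb --if-exists crypq
createdb crypq
gunzip -c initial.sql.gz | psql crypq
```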
We also provide an example script, `run_updates.sh`, that executes the generated update workload.
By default, it will start by loading the initial database state and play through the sequence of updates.
The final state can then be compared with the full slice; they should be the same if no expiration was done.
Type `./run_updates.sh` for help.
If you would like to cite this benchmark in your work, please use:
- Vincent Capol, Yuxi Liu, Haibo Xiu, and Jun Yang. "CrypQ: A Database Benchmark Based on Dynamic, Ever-Evolving Ethereum Data." In Proceedings of the Sixteenth TPC Technology Conference on Performance Evaluation & Benchmarking (TPCTC 2024), Guangzhou, China, August 2024.
The paper can be accessed as a PDF here. Figure 1 of the paper shows a schema diagram for CrypQ, which you may find helpful.
@inproceedings{tpctc24-CapolLXY-CrypQ,
author = {Vincent Capol and Yuxi Liu and Haibo Xiu and Jun Yang},
editor = {Raghunath Nambiar and Meikel Poess},
title = {{CrypQ}: A Database Benchmark Based on Dynamic, Ever-Evolving {Ethereum} Data},
booktitle = {Proceedings of the Sixteenth {TPC} Technology Conference on Performance Evaluation and Benchmarking ({TPCTC} 2024)},
address = {Guangzhou, China},
month = {August},
year = {2024},
note = {\url{https://github.com/dukedb-crypq}}
}