Signed-off-by: tison <[email protected]>
Co-authored-by: Yiran <[email protected]>
Showing 444 changed files with 27,674 additions and 4 deletions.
# NYC taxi benchmark

This benchmark is based on the data from [New York City Taxi & Limousine Commission](https://www.nyc.gov/site/tlc/index.page). From the official site, the data includes:

> taxi trip records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts. The data used in the attached datasets were collected and provided to the NYC Taxi and Limousine Commission (TLC) by technology providers authorized under the Taxicab & Livery Passenger Enhancement Programs (TPEP/LPEP). The trip data was not created by the TLC, and TLC makes no representations as to the accuracy of these data.

The commands in the documentation assume that your current working directory is the root directory of the [GreptimeDB](https://github.com/GreptimeTeam/greptimedb) source code.
## Get the data

First, create a directory to store test data.

```shell
mkdir -p ./benchmarks/data
```
You can find the data on [this page](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page). We support all the **Yellow Taxi Trip Records** since 2022-01. For example, to get the data for January 2022, you can run:

```shell
curl "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet" -o ./benchmarks/data/yellow_tripdata_2022-01.parquet
```
## Run benchmark

Before running the benchmark, please make sure you have started the GreptimeDB server. You can start GreptimeDB by using the following command:

```shell
cargo run --release standalone start
```
Our benchmark tool is included in the source code. You can run it with:

```shell
cargo run --release --bin nyc-taxi -- --path "./benchmarks/data/"
```
docs/nightly/en/contributor-guide/datanode/data-persistence-indexing.md
# Data Persistence and Indexing

Similar to all LSMT-like storage engines, data in MemTables is persisted to durable storage, for example, the local disk file system or an object storage service. GreptimeDB adopts [Apache Parquet][1] as its persistent file format.

## SST File Format

Parquet is an open source columnar format that provides fast data querying and has already been adopted by many projects, such as Delta Lake.

Parquet has a hierarchical structure of row groups, columns, and data pages. Data in a Parquet file is horizontally partitioned into row groups, in which all values of the same column are stored together to form data pages. The data page is the minimal storage unit. This structure greatly improves performance.
First, clustering data by column makes file scanning more efficient, especially when only a few columns are queried, which is very common in analytical systems.

Second, data in the same column tends to be homogeneous, which helps with compression when techniques like dictionary encoding and Run-Length Encoding (RLE) are applied.
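To illustrate why homogeneous column data compresses well, below is a minimal run-length encoding sketch in plain Rust. It is not Parquet's actual encoder, only the underlying idea of collapsing repeated values into `(value, run length)` pairs.

```rust
// Minimal run-length encoding sketch: repeated values in a column collapse
// into (value, run length) pairs. Illustrative only; Parquet's actual
// RLE/dictionary encodings are more elaborate.
fn rle_encode(values: &[&str]) -> Vec<(String, usize)> {
    let mut runs: Vec<(String, usize)> = Vec::new();
    for &v in values {
        if let Some(last) = runs.last_mut() {
            if last.0 == v {
                last.1 += 1;
                continue;
            }
        }
        runs.push((v.to_string(), 1));
    }
    runs
}

fn main() {
    // A homogeneous tag column, as is typical in time series data.
    let status = ["200", "200", "200", "200", "404", "200", "200"];
    // Prints: [("200", 4), ("404", 1), ("200", 2)]
    println!("{:?}", rle_encode(&status));
}
```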
![Parquet file format](/parquet-file-format.png)

## Data Persistence

GreptimeDB provides a configuration item `storage.flush.global_write_buffer_size`, which is the flush threshold for the total memory usage of all MemTables.

When the size of data buffered in MemTables reaches that threshold, GreptimeDB picks some MemTables and flushes them to SST files.
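Below is a minimal sketch of how such a global threshold can drive flushing. It is illustrative only: the types are made up for this example, and the policy of flushing the largest MemTables first is an assumption for demonstration, not necessarily what the engine actually does.

```rust
// Illustrative flush-picking logic driven by a global write buffer limit.
// This is a sketch of the idea only, not GreptimeDB's implementation; the
// "flush the largest MemTables first" policy is assumed for demonstration.
struct MemtableStat {
    region_id: u64,
    bytes: usize,
}

fn pick_memtables_to_flush(mut stats: Vec<MemtableStat>, global_limit: usize) -> Vec<u64> {
    let mut total: usize = stats.iter().map(|s| s.bytes).sum();
    let mut to_flush = Vec::new();
    if total < global_limit {
        return to_flush; // below `storage.flush.global_write_buffer_size`, nothing to do
    }
    // Flush the biggest buffers first until usage drops back under the limit.
    stats.sort_by(|a, b| b.bytes.cmp(&a.bytes));
    for stat in stats {
        if total < global_limit {
            break;
        }
        total -= stat.bytes;
        to_flush.push(stat.region_id);
    }
    to_flush
}

fn main() {
    let stats = vec![
        MemtableStat { region_id: 1, bytes: 600 << 20 },
        MemtableStat { region_id: 2, bytes: 300 << 20 },
        MemtableStat { region_id: 3, bytes: 200 << 20 },
    ];
    // With a 1 GiB global write buffer size, flushing region 1 (the largest)
    // brings usage back under the limit. Prints: [1]
    println!("{:?}", pick_memtables_to_flush(stats, 1 << 30));
}
```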
## Indexing Data in SST Files

The Apache Parquet file format provides inherent statistics in the headers of column chunks and data pages, which are used for pruning and skipping.

![Column chunk header](/column-chunk-header.png)

For example, in the above Parquet file, if you want to filter rows where `name` = `Emily`, you can easily skip row group 0 because the max value of the `name` field is `Charlie`. This statistical information reduces IO operations.
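Below is a sketch of that pruning idea in plain Rust. The structs are stand-ins for the min/max statistics kept in column chunk headers, not the Parquet or GreptimeDB API; a row group is skipped whenever the requested value falls outside its recorded `[min, max]` range.

```rust
// Illustrative min/max pruning: skip any row group whose column statistics
// cannot possibly contain the requested value.
struct ColumnStats {
    min: String,
    max: String,
}

struct RowGroup {
    id: usize,
    name_stats: ColumnStats,
}

/// Returns the IDs of the row groups that may contain `name` = `target`.
fn prune_row_groups(groups: &[RowGroup], target: &str) -> Vec<usize> {
    groups
        .iter()
        .filter(|g| g.name_stats.min.as_str() <= target && target <= g.name_stats.max.as_str())
        .map(|g| g.id)
        .collect()
}

fn main() {
    let groups = vec![
        RowGroup { id: 0, name_stats: ColumnStats { min: "Alice".into(), max: "Charlie".into() } },
        RowGroup { id: 1, name_stats: ColumnStats { min: "David".into(), max: "Frank".into() } },
    ];
    // "Emily" > "Charlie", so row group 0 is pruned and only row group 1 is read.
    println!("{:?}", prune_row_groups(&groups, "Emily")); // [1]
}
```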
## Index Files

For each SST file, GreptimeDB not only maintains an internal index but also generates a separate file to store the index structures specific to that SST file.

The index files utilize the [Puffin][3] format, which offers significant flexibility, allowing for the storage of additional metadata and supporting a broader range of index structures.

![Puffin](/puffin.png)

Currently, the inverted index is the first supported index structure, and it is stored within the index file as a Blob.
## Inverted Index

In version 0.7, GreptimeDB introduced the inverted index to accelerate queries.

The inverted index is a common index structure used for full-text search, mapping each word in a document to a list of documents containing that word. Greptime has adopted this technology, which originates from search engines, for use in time series databases.

Search engines and time series databases operate in separate domains, yet the principle behind the inverted index technology they apply is similar. This similarity requires some conceptual adjustments:
1. Term: in GreptimeDB, it refers to the column value of the time series.
2. Document: in GreptimeDB, it refers to the data segment containing multiple time series.

The inverted index enables GreptimeDB to skip data segments that do not meet query conditions, thus improving scanning efficiency.
![Inverted index searching](/inverted-index-searching.png)

For instance, the query above uses the inverted index to identify data segments where `job` equals `apiserver`, `handler` matches the regex `.*users`, and `status` matches the regex `4...`. It then scans these data segments to produce the final results that meet all conditions, significantly reducing the number of IO operations.

### Inverted Index Format

![Inverted index format](/inverted-index-format.png)

GreptimeDB builds inverted indexes by column, with each inverted index consisting of an FST and multiple Bitmaps.

The FST (Finite State Transducer) enables GreptimeDB to store mappings from column values to Bitmap positions in a compact format; it provides excellent search performance and supports complex search capabilities such as regular expression matching. The Bitmaps maintain a list of data segment IDs, with each bit representing a data segment.
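Below is a minimal sketch of that lookup structure, with a sorted map standing in for the FST and plain `Vec<bool>` bit vectors standing in for the Bitmaps. It is illustrative only; the real index uses a compact FST and compressed bitmaps.

```rust
use std::collections::BTreeMap;

// Illustrative inverted index for one column: a sorted map stands in for the
// FST (column value -> bitmap), and a Vec<bool> stands in for a compressed
// bitmap where bit i means "data segment i contains this value".
struct ColumnInvertedIndex {
    value_to_bitmap: BTreeMap<String, Vec<bool>>,
    segment_count: usize,
}

impl ColumnInvertedIndex {
    /// Segments that may contain `value`; an absent value matches no segment.
    fn segments_for(&self, value: &str) -> Vec<usize> {
        match self.value_to_bitmap.get(value) {
            Some(bitmap) => (0..self.segment_count).filter(|&i| bitmap[i]).collect(),
            None => Vec::new(),
        }
    }
}

fn main() {
    // Index for a `job` tag column over 4 data segments.
    let mut value_to_bitmap = BTreeMap::new();
    value_to_bitmap.insert("apiserver".to_string(), vec![true, false, true, false]);
    value_to_bitmap.insert("scheduler".to_string(), vec![false, true, false, false]);
    let index = ColumnInvertedIndex { value_to_bitmap, segment_count: 4 };

    // Only segments 0 and 2 need to be scanned for `job = "apiserver"`.
    println!("{:?}", index.segments_for("apiserver")); // [0, 2]
}
```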
### Index Data Segments

GreptimeDB divides an SST file into multiple indexed data segments, with each segment housing an equal number of rows. This segmentation is designed to optimize query performance by scanning only the data segments that match the query conditions.

For example, if a data segment contains 1024 rows and the list of data segments identified through the inverted index for the query conditions is `[0, 2]`, then only the 0th and 2nd data segments in the SST file (rows 0 to 1023 and rows 2048 to 3071, respectively) need to be scanned.

The number of rows in a data segment is controlled by the engine option `index.inverted_index.segment_row_count`, which defaults to `1024`. A smaller value means more precise indexing and often results in better query performance, but increases the cost of index storage. By adjusting this option, a balance can be struck between storage costs and query performance.
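The row ranges to scan follow directly from the matched segment IDs and the segment size; a tiny sketch of that arithmetic:

```rust
// Map the data segment IDs returned by the inverted index to the row ranges
// that actually need to be scanned in the SST file. Illustrative only.
fn segment_row_ranges(segment_ids: &[usize], segment_row_count: usize) -> Vec<(usize, usize)> {
    segment_ids
        .iter()
        .map(|&id| {
            let start = id * segment_row_count;
            // Inclusive end row of this segment.
            (start, start + segment_row_count - 1)
        })
        .collect()
}

fn main() {
    // Segments [0, 2] with the default `index.inverted_index.segment_row_count`
    // of 1024 translate to rows 0..=1023 and 2048..=3071, as in the example above.
    println!("{:?}", segment_row_ranges(&[0, 2], 1024)); // [(0, 1023), (2048, 3071)]
}
```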
## Unified Data Access Layer: OpenDAL

GreptimeDB uses [OpenDAL][2] to provide a unified data access layer. Thus, the storage engine does not need to interact with different storage APIs, and data can be migrated seamlessly to cloud-based storage such as AWS S3.

[1]: https://parquet.apache.org
[2]: https://github.com/datafuselabs/opendal
[3]: https://iceberg.apache.org/puffin-spec
docs/nightly/en/contributor-guide/datanode/metric-engine.md
# Metric Engine

## Overview

The `Metric` engine is a component of GreptimeDB, and it is an implementation of the storage engine. It mainly targets scenarios with a large number of small tables for observable metrics.

Its main feature is using synthetic physical wide tables to store the data of a large number of small tables, achieving effects such as reuse of the same columns and metadata. This reduces the storage overhead of small tables and improves columnar compression efficiency. The concept of a table becomes even more lightweight under the `Metric` engine.
## Concepts

The `Metric` engine introduces two new concepts: the "logical table" and the "physical table". From the user's perspective, logical tables are exactly like ordinary ones. From a storage point of view, physical Regions are just regular Regions.

### Logical Table

A logical table is a user-defined table. Just like any ordinary table, its definition includes the table name, column definitions, index definitions, etc. All user operations, such as queries or writes, are based on logical tables. Users don't need to worry about the differences between logical and ordinary tables during usage.

From an implementation standpoint, a logical table is virtual; it doesn't directly read or write physical data but maps read/write requests into corresponding requests for physical tables in order to implement data storage and querying.

### Physical Table

A physical table is a table that actually stores data, possessing several physical Regions defined by partition rules.
## Architecture and Design

The main design architecture of the `Metric` engine is as follows:

![Arch](/metric-engine-arch.png)

In the current implementation, the `Metric` engine reuses the `Mito` engine for the storage and query capabilities of physical data, and it provides access to both physical tables and logical tables.

Regarding partitioning, a logical table has the same partition rules and Region distribution as its physical table. This makes sense because the data of the logical table is stored directly in the physical table, so their partition rules are consistent.

Regarding routing metadata, the routing address of a logical table is a logical address, namely its corresponding physical table; a second routing step through that physical table then yields the real physical address. This indirect routing significantly reduces the number of metadata modifications required when Region migration scheduling occurs in the `Metric` engine.
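Below is a sketch of that two-step lookup. The maps and names are made up for illustration and do not reflect the actual routing metadata schema; the point is that only the physical table carries Region routes, so a Region migration rewrites a single physical route no matter how many logical tables share it.

```rust
use std::collections::HashMap;

// Illustrative two-level routing: logical table -> physical table -> Region routes.
// These maps are stand-ins for routing metadata, not the actual Metasrv schema.
struct RouteTable {
    logical_to_physical: HashMap<String, String>,
    physical_routes: HashMap<String, Vec<String>>, // physical table -> datanode addresses
}

impl RouteTable {
    fn resolve(&self, logical_table: &str) -> Option<&Vec<String>> {
        // Step 1: logical table resolves to its physical table.
        let physical = self.logical_to_physical.get(logical_table)?;
        // Step 2: only the physical table carries the real Region routes.
        self.physical_routes.get(physical)
    }
}

fn main() {
    let mut logical_to_physical = HashMap::new();
    // Many lightweight logical tables share one physical wide table (names are hypothetical).
    logical_to_physical.insert("http_requests_total".to_string(), "greptime_physical_table".to_string());
    logical_to_physical.insert("cpu_seconds_total".to_string(), "greptime_physical_table".to_string());

    let mut physical_routes = HashMap::new();
    physical_routes.insert("greptime_physical_table".to_string(), vec!["datanode-1:4001".to_string()]);

    let routes = RouteTable { logical_to_physical, physical_routes };
    println!("{:?}", routes.resolve("http_requests_total"));

    // A Region migration only rewrites the physical route; the entries for the
    // (possibly thousands of) logical tables above it stay untouched.
}
```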
In terms of operations, the `Metric` engine supports only limited operations on physical tables to prevent misoperations: for example, writing into or deleting from a physical table is prohibited, since that could affect the data of users' logical tables. Generally speaking, users can consider physical tables to be read-only.

To improve performance when DDL (Data Definition Language) operations are performed on many tables at once, the `Metric` engine introduces batch DDL operations. These merge a large number of DDL actions into a single request, reducing the number of metadata queries and modifications and thus improving performance. This is particularly beneficial in scenarios such as the automatic table-creation requests triggered by a large number of metrics during a Prometheus Remote Write cold start, as well as the route-table modification requests mentioned earlier when many physical Regions are migrated.

Apart from the physical data Regions belonging to physical tables, the `Metric` engine creates an additional physical metadata Region for each physical data Region, which stores the metadata the engine needs for maintaining mappings and other states. This metadata includes the mapping between logical tables and physical tables, the mapping between logical columns and physical columns, and so on.
# Overview

## Introduction

`Datanode` is mainly responsible for storing the actual data of GreptimeDB. As we know, in GreptimeDB a `table` can have one or more `Region`s, and `Datanode` is responsible for managing the reading and writing of these `Region`s. `Datanode` is not aware of `table`s and can be considered a `region server`. Therefore, `Frontend` and `Metasrv` operate on `Datanode` at the granularity of `Region`.

![Datanode](/datanode.png)
## Components

A `Datanode` contains all the components needed for a `region server`. Here we list some of the vital parts:

- A gRPC service is provided for reading and writing Region data; `Frontend` uses this service to read and write data from `Datanode`s.
- An HTTP service, through which you can obtain metrics, configuration information, etc., of the current node.
- A `Heartbeat Task` is used to send heartbeats to the `Metasrv`. Heartbeats play a crucial role in the distributed architecture of GreptimeDB and serve as a basic communication channel for distributed coordination. The upstream heartbeat messages contain important information such as the workload of a `Region`. If the `Metasrv` has made scheduling decisions (such as a `Region` migration), it sends instructions to the `Datanode` via downstream heartbeat messages.
- The `Datanode` does not include components like the `Physical Planner`, `Optimizer`, etc. (these are placed in the `Frontend`). A user's query requests against one or more `Table`s are transformed into `Region` query requests in the `Frontend`; the `Datanode` is responsible for handling these `Region` query requests.
- A `Region Manager` is used to manage all `Region`s on a `Datanode`.
- GreptimeDB supports a pluggable multi-engine architecture, with existing engines including `File Engine` and `Mito Engine`.
docs/nightly/en/contributor-guide/datanode/python-scripts.md
# Python Scripts

## Introduction

Python scripts are a way to analyze data in GreptimeDB by running the analysis directly inside the database, instead of fetching all the data from the database and running it locally. This approach saves a lot of data transfer cost. The image below depicts how a script works. The `RecordBatch` (which is basically a column in a table, with type and nullability metadata) can come from anywhere in the database, and the returned `RecordBatch` can be annotated in Python grammar to indicate its metadata, such as type or nullability. The script will do its best to convert the returned object to a `RecordBatch`, whether it is a Python list, a `RecordBatch` computed from parameters, or a constant (which is extended to the same length as the input arguments).

![Python Coprocessor](/python-coprocessor.png)
## Two optional backends

### CPython Backend powered by PyO3

This backend is powered by [PyO3](https://pyo3.rs/v0.18.1/), enabling the use of your favourite Python libraries (such as NumPy, Pandas, etc.) and allowing Conda to manage your Python environment.

But using it also involves some complications. You must set up the correct Python shared library, which can be a bit challenging. In general, you just need to install the `python-dev` package. However, if you are using Homebrew to install Python on macOS, you must create a proper soft link to `Library/Frameworks/Python.framework`. Detailed instructions on using the PyO3 crate with different Python versions can be found [here](https://pyo3.rs/v0.18.1/building_and_distribution#configuring-the-python-version).

### Embedded RustPython Interpreter

An experimental [Python interpreter](https://github.com/RustPython/RustPython) that runs the coprocessor script; it supports Python 3.10 grammar and lets you use all of the Python syntax. See [User Guide/Python Coprocessor](/user-guide/python-scripts/overview.md) for more.
docs/nightly/en/contributor-guide/datanode/query-engine.md
# Query Engine

## Introduction

GreptimeDB's query engine is built on [Apache DataFusion][1] (a subproject of [Apache Arrow][2]), a brilliant query engine written in Rust. It provides a full set of components, from the logical plan and physical plan to the execution runtime. Below we explain how each component is orchestrated and where it sits during execution.

![Execution Procedure](/execution-procedure.png)

The entry point is the logical plan, which is used as the general intermediate representation of a query or of execution logic. The two most noticeable sources of logical plans are: 1. the user query, such as SQL, which goes through the SQL parser and planner; 2. the Frontend's distributed query, which is explained in detail in the following section.
Next is the physical plan, or execution plan. Unlike the logical plan, which is a big enumeration containing all the logical plan variants (except the special extension plan node), the physical plan is in fact a trait that defines a group of methods invoked during execution. All data processing logic is packed into corresponding structures that implement the trait. They are the actual operations performed on the data, like the aggregators `MIN` or `AVG` and the table scan `SELECT ... FROM`.
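To make the idea of "a trait plus implementing structures" concrete, below is a deliberately simplified sketch. It is not DataFusion's actual `ExecutionPlan` trait (the real one is schema-aware and returns async record batch streams); it only illustrates operators as trait implementations.

```rust
// A deliberately simplified picture of "the physical plan is a trait":
// NOT DataFusion's real ExecutionPlan trait, just an illustration of
// operators as trait implementations that produce batches of values.
trait PhysicalPlan {
    /// Children of this operator in the plan tree.
    fn children(&self) -> Vec<&dyn PhysicalPlan>;
    /// Execute the operator and return its output batches
    /// (a plain Vec here; the real engine returns an async stream).
    fn execute(&self) -> Vec<Vec<i64>>;
}

struct TableScan {
    batches: Vec<Vec<i64>>,
}

impl PhysicalPlan for TableScan {
    fn children(&self) -> Vec<&dyn PhysicalPlan> {
        Vec::new()
    }
    fn execute(&self) -> Vec<Vec<i64>> {
        self.batches.clone()
    }
}

/// A MIN aggregator over its single input column.
struct MinAggregate {
    input: Box<dyn PhysicalPlan>,
}

impl PhysicalPlan for MinAggregate {
    fn children(&self) -> Vec<&dyn PhysicalPlan> {
        vec![self.input.as_ref()]
    }
    fn execute(&self) -> Vec<Vec<i64>> {
        let min = self
            .input
            .execute()
            .into_iter()
            .flatten()
            .min()
            .expect("empty input");
        vec![vec![min]]
    }
}

fn main() {
    // Roughly `SELECT MIN(x) FROM t`: a scan feeding an aggregation operator.
    let plan = MinAggregate {
        input: Box::new(TableScan { batches: vec![vec![7, 3, 9], vec![4, 8]] }),
    };
    println!("{:?}", plan.execute()); // [[3]]
}
```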
The optimization phase, which improves execution performance by transforming both the logical and physical plans, is currently entirely rule based, also known as "Rule-Based Optimization". Some of the rules are DataFusion native, and others are customized in GreptimeDB. In the future, we plan to add more rules and leverage data statistics for Cost-Based Optimization (CBO).

The last phase, "execute", is a verb standing for the procedure that reads data from storage, performs calculations, and generates the expected results. Although it's more abstract than the previously mentioned concepts, you can simply imagine it as executing a Rust async function, and it is indeed a future (stream).

`EXPLAIN [VERBOSE] <SQL>` is very useful if you want to see how your SQL is represented in the logical or physical plan.
## Data Representation

GreptimeDB uses [Apache Arrow][2] as its in-memory data representation. It is column-oriented, in a cross-platform format, and also contains many high-performance data operators. These features make it easy to share data across many different environments and to implement calculation logic.

## Indexing

In time series data, there are two important dimensions: the timestamp and the tag columns (similar to the primary key in a general relational database). GreptimeDB groups data into time buckets, so it's efficient to locate and extract data within the expected time range at a very low cost. [Apache Parquet][3], the main persistent file format in GreptimeDB, helps a lot here: it provides multi-level indices and filters that make it easy to prune data during querying. In the future, we will make more use of this feature and develop our own separate index to handle more complex use cases.
## Extensibility

Extending operations in GreptimeDB is extremely simple. There are two ways to do it: 1. via the [Python Coprocessor][4] interface; 2. implementing your operator like [this][5].

## Distributed Execution

Covered in [Distributed Querying][6].

[1]: https://github.com/apache/arrow-datafusion
[2]: https://arrow.apache.org/
[3]: https://parquet.apache.org
[4]: python-scripts.md
[5]: https://github.com/GreptimeTeam/greptimedb/blob/main/docs/how-to/how-to-write-aggregate-function.md
[6]: ../frontend/distributed-querying.md