More README.md improvements #1084

Merged · 3 commits · Oct 18, 2024
15 changes: 11 additions & 4 deletions README.md
@@ -58,7 +58,12 @@ One of the unique attributes of the (in-progress) Vortex file format is that it
file's footer. This allows the file format to be effectively self-describing and to evolve without breaking changes to
the file format specification.

-In fact, the format is designed to support forward compatibility by optionally embedding WASM decoders directly into the files
+For example, the Compressor implementation can choose to chunk data into a Parquet-like layout with
+row groups and aligned pages (ChunkedArray of StructArray of ChunkedArrays with equal chunk sizes). Alternatively, it can choose
+to chunk different columns differently based on their compressed size and data distributions (e.g., a column that is constant
+across all rows can be a single chunk, whereas a large string column may be split arbitrarily many times).
+
+In the same vein, the format is designed to support forward compatibility by optionally embedding WASM decoders directly into the files
themselves. This should help avoid the rapid calcification that has plagued other columnar file formats.
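
To illustrate the two chunking strategies described in the added paragraph, here is a minimal sketch in plain Python. It does not use the Vortex API: the function names `row_group_layout` and `per_column_layout`, the `max_chunk_bytes` parameter, and the use of string length as a stand-in for compressed size are all assumptions made for this example.

```python
from typing import Dict, List, Sequence


def row_group_layout(columns: Dict[str, Sequence], rows_per_group: int) -> List[Dict[str, list]]:
    """Parquet-like layout: every column is cut at the same row boundaries."""
    n_rows = len(next(iter(columns.values())))
    return [
        {name: list(col[start:start + rows_per_group]) for name, col in columns.items()}
        for start in range(0, n_rows, rows_per_group)
    ]


def per_column_layout(columns: Dict[str, Sequence], max_chunk_bytes: int) -> Dict[str, List[list]]:
    """Independent layout: each column chooses its own chunk boundaries."""
    layout: Dict[str, List[list]] = {}
    for name, col in columns.items():
        values = list(col)
        if len(set(values)) <= 1:
            # A column that is constant across all rows fits in a single chunk.
            layout[name] = [values]
            continue
        chunks: List[list] = []
        current: list = []
        size = 0
        for value in values:
            value_size = len(str(value))  # crude stand-in for compressed size (assumption)
            if current and size + value_size > max_chunk_bytes:
                chunks.append(current)
                current, size = [], 0
            current.append(value)
            size += value_size
        chunks.append(current)
        layout[name] = chunks
    return layout


# Hypothetical two-column table: one constant column, one large string column.
table = {
    "country": ["US"] * 6,
    "comment": ["a" * 40, "b" * 50, "c" * 10, "d" * 60, "e" * 5, "f" * 45],
}

aligned = row_group_layout(table, rows_per_group=2)
independent = per_column_layout(table, max_chunk_bytes=64)
```

In the aligned layout both columns share the same three row groups, while in the independent layout `country` stays in one chunk and `comment` is split wherever its size budget is exceeded. Which layout a real compressor picks would depend on measured compressed sizes and data distributions, which this sketch only approximates.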

## Components
@@ -224,7 +229,7 @@ Expect more details on this in Q4 2024.
This project is inspired by and--in some cases--directly based upon the existing, excellent work of many researchers
and OSS developers.

-In particular, the following academic papers greatly influenced the development:
+In particular, the following academic papers have strongly influenced development:

* Maximilian Kuschewski, David Sauerwein, Adnan Alhomssi, and Viktor Leis.
[BtrBlocks: Efficient Columnar Compression for Data Lakes](https://www.cs.cit.tum.de/fileadmin/w00cfj/dis/papers/btrblocks.pdf).
@@ -240,12 +245,14 @@ In particular, the following academic papers greatly influenced the development:
* Biswapesh Chattopadhyay, Priyam Dutta, Weiran Liu, Ott Tinn, Andrew Mccormick, Aniket Mokashi, Paul Harvey,
Hector Gonzalez, David Lomax, Sagar Mittal, et al. [Procella: Unifying serving and analytical
data at YouTube](https://dl.acm.org/citation.cfm?id=3360438). PVLDB, 12(12): 2022-2034, 2019.
+* Dominik Durner, Viktor Leis, and Thomas Neumann. [Exploiting Cloud Object Storage for High-Performance
+Analytics](https://www.durner.dev/app/media/papers/anyblob-vldb23.pdf). PVLDB, 16(11): 2769-2782, 2023.


Additionally, we benefited greatly from:

-* the existence, ideas, & implementation of [Apache Arrow](https://arrow.apache.org).
-* likewise for the excellent [Apache DataFusion](https://github.com/apache/datafusion) project.
+* the existence, ideas, & implementations of both [Apache Arrow](https://arrow.apache.org) and
+[Apache DataFusion](https://github.com/apache/datafusion).
* the [parquet2](https://github.com/jorgecarleitao/parquet2) project by [Jorge Leitao](https://github.com/jorgecarleitao).
* the public discussions around choices of compression codecs, as well as the C++ implementations thereof,
from [duckdb](https://github.com/duckdb/duckdb).