More README.md improvements (#1084)
lwwmanning authored and danking committed Oct 18, 2024
1 parent fb87dd5 commit cd6db84
Showing 1 changed file with 11 additions and 4 deletions.
15 changes: 11 additions & 4 deletions README.md
@@ -58,7 +58,12 @@ One of the unique attributes of the (in-progress) Vortex file format is that it
file's footer. This allows the file format to be effectively self-describing and to evolve without breaking changes to
the file format specification.

-In fact, the format is designed to support forward compatibility by optionally embedding WASM decoders directly into the files
+For example, the Compressor implementation can choose to chunk data into a Parquet-like layout with
+row groups and aligned pages (ChunkedArray of StructArray of ChunkedArrays with equal chunk sizes). Alternatively, it can choose
+to chunk different columns differently based on their compressed size and data distributions (e.g., a column that is constant
+across all rows can be a single chunk, whereas a large string column may be split arbitrarily many times).
+
+In the same vein, the format is designed to support forward compatibility by optionally embedding WASM decoders directly into the files
themselves. This should help avoid the rapid calcification that has plagued other columnar file formats.
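
The two chunking strategies named in the added paragraph can be illustrated with a minimal sketch. This is not Vortex's Compressor or any of its APIs: `aligned_bounds`, `size_based_bounds`, and the byte budget are hypothetical names, and uncompressed byte length stands in for the compressed size that the README mentions.

```python
from typing import Callable, List, Sequence, Tuple

Bounds = List[Tuple[int, int]]  # half-open (start, end) row ranges


def aligned_bounds(num_rows: int, rows_per_group: int) -> Bounds:
    """Parquet-like layout: every column shares the same equal-sized row ranges."""
    return [
        (start, min(start + rows_per_group, num_rows))
        for start in range(0, num_rows, rows_per_group)
    ]


def size_based_bounds(
    values: Sequence,
    target_chunk_bytes: int,
    value_size: Callable[[object], int],
) -> Bounds:
    """Per-column layout: split one column by a byte budget (hypothetical policy)."""
    if len(values) == 0:
        return []
    # A column that is constant across all rows becomes a single chunk.
    if all(v == values[0] for v in values):
        return [(0, len(values))]
    bounds: Bounds = []
    start, used = 0, 0
    for i, v in enumerate(values):
        size = value_size(v)
        # Close the current chunk before it would exceed the budget.
        if used > 0 and used + size > target_chunk_bytes:
            bounds.append((start, i))
            start, used = i, 0
        used += size
    bounds.append((start, len(values)))
    return bounds


def utf8_len(value: object) -> int:
    # Assumption: uncompressed UTF-8 length approximates the column's cost.
    return len(str(value).encode("utf-8"))


constant_column = [7] * 8
string_column = ["aaaaaaa", "b", "c", "d", "eeeeee", "ff", "g", "h"]

shared_layout = aligned_bounds(num_rows=8, rows_per_group=4)
constant_layout = size_based_bounds(constant_column, 8, utf8_len)
string_layout = size_based_bounds(string_column, 8, utf8_len)
```

In the aligned layout every column is cut at the same row offsets, so one row group can be read as a unit. In the size-based layout the constant column stays whole while the string column is cut wherever its byte budget runs out, so chunk boundaries no longer line up across columns.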

## Components
@@ -224,7 +229,7 @@ Expect more details on this in Q4 2024.
This project is inspired by and--in some cases--directly based upon the existing, excellent work of many researchers
and OSS developers.

-In particular, the following academic papers greatly influenced the development:
+In particular, the following academic papers have strongly influenced development:

* Maximilian Kuschewski, David Sauerwein, Adnan Alhomssi, and Viktor Leis.
[BtrBlocks: Efficient Columnar Compression for Data Lakes](https://www.cs.cit.tum.de/fileadmin/w00cfj/dis/papers/btrblocks.pdf).
@@ -240,12 +245,14 @@ In particular, the following academic papers greatly influenced the development:
* Biswapesh Chattopadhyay, Priyam Dutta, Weiran Liu, Ott Tinn, Andrew Mccormick, Aniket Mokashi, Paul Harvey,
Hector Gonzalez, David Lomax, Sagar Mittal, et al. [Procella: Unifying serving and analytical
data at YouTube](https://dl.acm.org/citation.cfm?id=3360438). PVLDB, 12(12): 2022-2034, 2019.
+* Dominik Durner, Viktor Leis, and Thomas Neumann. [Exploiting Cloud Object Storage for High-Performance
+Analytics](https://www.durner.dev/app/media/papers/anyblob-vldb23.pdf). PVLDB, 16(11): 2769-2782, 2023.


Additionally, we benefited greatly from:

-* the existence, ideas, & implementation of [Apache Arrow](https://arrow.apache.org).
-* likewise for the excellent [Apache DataFusion](https://github.com/apache/datafusion) project.
+* the existence, ideas, & implementations of both [Apache Arrow](https://arrow.apache.org) and
+[Apache DataFusion](https://github.com/apache/datafusion).
* the [parquet2](https://github.com/jorgecarleitao/parquet2) project by [Jorge Leitao](https://github.com/jorgecarleitao).
* the public discussions around choices of compression codecs, as well as the C++ implementations thereof,
from [duckdb](https://github.com/duckdb/duckdb).