diff --git a/README.md b/README.md
index b3ee59eb39..2737b9d6ca 100644
--- a/README.md
+++ b/README.md
@@ -58,7 +58,12 @@ One of the unique attributes of the (in-progress) Vortex file format is that it
 file's footer. This allows the file format to be effectively self-describing and to evolve without breaking changes to
 the file format specification.
 
-In fact, the format is designed to support forward compatibility by optionally embedding WASM decoders directly into the files
+For example, the Compressor implementation can choose to chunk data into a Parquet-like layout with
+row groups and aligned pages (ChunkedArray of StructArray of ChunkedArrays with equal chunk sizes). Alternatively, it can choose
+to chunk different columns differently based on their compressed size and data distributions (e.g., a column that is constant
+across all rows can be a single chunk, whereas a large string column may be split arbitrarily many times).
+
+In the same vein, the format is designed to support forward compatibility by optionally embedding WASM decoders directly into the files
 themselves. This should help avoid the rapid calcification that has plagued other columnar file formats.
 
 ## Components
@@ -224,7 +229,7 @@ Expect more details on this in Q4 2024.
 This project is inspired by and--in some cases--directly based upon the existing, excellent work of many researchers and
 OSS developers.
 
-In particular, the following academic papers greatly influenced the development:
+In particular, the following academic papers have strongly influenced development:
 
 * Maximilian Kuschewski, David Sauerwein, Adnan Alhomssi, and Viktor Leis.
   [BtrBlocks: Efficient Columnar Compression for Data Lakes](https://www.cs.cit.tum.de/fileadmin/w00cfj/dis/papers/btrblocks.pdf).
@@ -240,12 +245,14 @@ In particular, the following academic papers greatly influenced the development:
 * Biswapesh Chattopadhyay, Priyam Dutta, Weiran Liu, Ott Tinn, Andrew Mccormick, Aniket Mokashi, Paul Harvey,
   Hector Gonzalez, David Lomax, Sagar Mittal, et al. [Procella: Unifying serving and analytical data at YouTube](https://dl.acm.org/citation.cfm?id=3360438).
   PVLDB, 12(12): 2022-2034, 2019.
+* Dominik Durner, Viktor Leis, and Thomas Neumann. [Exploiting Cloud Object Storage for High-Performance
+  Analytics](https://www.durner.dev/app/media/papers/anyblob-vldb23.pdf). PVLDB, 16(11): 2769-2782, 2023.
 
 Additionally, we benefited greatly from:
 
-* the existence, ideas, & implementation of [Apache Arrow](https://arrow.apache.org).
-* likewise for the excellent [Apache DataFusion](https://github.com/apache/datafusion) project.
+* the existence, ideas, & implementations of both [Apache Arrow](https://arrow.apache.org) and
+  [Apache DataFusion](https://github.com/apache/datafusion).
 * the [parquet2](https://github.com/jorgecarleitao/parquet2) project by [Jorge Leitao](https://github.com/jorgecarleitao).
 * the public discussions around choices of compression codecs, as well as the C++ implementations thereof,
   from [duckdb](https://github.com/duckdb/duckdb).