Commit

Merge remote-tracking branch 'origin/develop' into rk/lessspecialflat
lwwmanning committed Oct 19, 2024
2 parents 40a5298 + 85be90f commit 468cc51
Showing 9 changed files with 337 additions and 240 deletions.
12 changes: 6 additions & 6 deletions Cargo.lock

33 changes: 20 additions & 13 deletions README.md
@@ -8,12 +8,12 @@
Vortex is an extensible, state-of-the-art columnar file format, with associated tools for working with compressed Apache Arrow arrays
in-memory, on-disk, and over-the-wire.

-Vortex is an aspiring successor to Apache Parquet, with dramatically faster random access reads (100-200x faster)
-and scans (2-10x faster), while preserving approximately the same compression ratio and write throughput as Parquet with zstd.
-It will also support very wide tables (at least 10s of thousands of columns) and (eventually) on-device decompression on GPUs.
+Vortex is an aspiring successor to Apache Parquet, with dramatically faster random access reads (100-200x faster) and scans (2-10x faster),
+while preserving approximately the same compression ratio and write throughput as Parquet with zstd.
+It is designed to support very wide tables (at least 10s of thousands of columns) and (eventually) on-device decompression on GPUs.

-Vortex is designed to be to columnar file formats what Apache DataFusion is to query engines: highly extensible,
-extremely fast, batteries-included.
+Vortex is intended to be to columnar file formats what Apache DataFusion is to query engines: highly extensible,
+extremely fast, & batteries-included.

> [!CAUTION]
> This library is still under rapid development and is a work in progress!
@@ -58,7 +58,12 @@ One of the unique attributes of the (in-progress) Vortex file format is that it
file's footer. This allows the file format to be effectively self-describing and to evolve without breaking changes to
the file format specification.

-In fact, the format is designed to support forward compatibility by optionally embedding WASM decoders directly into the files
+For example, the Compressor implementation can choose to chunk data into a Parquet-like layout with
+row groups and aligned pages (ChunkedArray of StructArray of ChunkedArrays with equal chunk sizes). Alternatively, it can choose
+to chunk different columns differently based on their compressed size and data distributions (e.g., a column that is constant
+across all rows can be a single chunk, whereas a large string column may be split arbitrarily many times).
+
+In the same vein, the format is designed to support forward compatibility by optionally embedding WASM decoders directly into the files
themselves. This should help avoid the rapid calcification that has plagued other columnar file formats.
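To make the layout flexibility described above concrete, here is a minimal, purely illustrative sketch of the two chunking strategies, using plain Python lists and dicts as stand-ins for Vortex's ChunkedArray and StructArray; nothing below is the actual Vortex API.

```python
# Illustrative stand-ins only: a list plays the role of a ChunkedArray and a
# dict the role of a StructArray; this is not the Vortex API.

# Parquet-like layout: a ChunkedArray of StructArrays of ChunkedArrays with
# equal chunk sizes, i.e. row groups whose pages line up across columns.
parquet_like = [
    {"ts": [[1, 2], [3, 4]], "reading": [[9.1, 9.2], [9.3, 9.4]]},  # row group 0
    {"ts": [[5, 6], [7, 8]], "reading": [[9.5, 9.6], [9.7, 9.8]]},  # row group 1
]

# Per-column layout: each column is chunked independently, driven by its
# compressed size and data distribution.
per_column = {
    "country": [["US"] * 8],       # constant column: one chunk covering all rows
    "comment": [["..."] * 2] * 4,  # large string column: many small chunks
}
```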

## Components
@@ -120,10 +125,10 @@ in-memory array implementation, allowing us to defer decompression. Currently, t
Vortex's default compression strategy is based on the
[BtrBlocks](https://www.cs.cit.tum.de/fileadmin/w00cfj/dis/papers/btrblocks.pdf) paper.

-Roughly, for each chunk of data, a sample of at least ~1% of the data is taken. Compression is then attempted (
-recursively) with a set of lightweight encodings. The best-performing combination of encodings is then chosen to encode
-the entire chunk. This sounds like it would be very expensive, but given basic statistics about a chunk, it is
-possible to cheaply prune many encodings and ensure the search space does not explode in size.
+Roughly, for each chunk of data, a sample of at least ~1% of the data is taken. Compression is then attempted
+(recursively) with a set of lightweight encodings. The best-performing combination of encodings is then chosen to encode
+the entire chunk. This sounds like it would be very expensive, but given the logical types and basic statistics about a
+chunk, it is possible to cheaply prune many encodings and ensure the search space does not explode in size.
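As a rough illustration of that sampling loop, here is a hedged Python sketch. The encoding objects and their `is_applicable`, `compressed_size`, and `encode` methods are assumptions made for this example and do not reflect the actual Vortex compressor interface.

```python
import random

def compress_chunk(chunk, encodings, sample_fraction=0.01):
    """Pick the best encoding on a ~1% sample, then encode the whole chunk."""
    sample_len = max(1, int(len(chunk) * sample_fraction))
    sample = random.sample(chunk, sample_len)

    # Cheap statistics (logical type, min/max, run count, null count, ...) are
    # used to prune encodings that cannot apply, keeping the search tractable.
    candidates = [e for e in encodings if e.is_applicable(sample)]

    # Try each remaining encoding (each may recursively compress its children)
    # and keep whichever compresses the sample best.
    best = min(candidates, key=lambda e: e.compressed_size(sample))

    # The winner chosen on the sample is used to encode the entire chunk.
    return best.encode(chunk)
```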

### Compute

@@ -224,7 +229,7 @@ Expect more details on this in Q4 2024.
This project is inspired by and--in some cases--directly based upon the existing, excellent work of many researchers
and OSS developers.

-In particular, the following academic papers greatly influenced the development:
+In particular, the following academic papers have strongly influenced development:

* Maximilian Kuschewski, David Sauerwein, Adnan Alhomssi, and Viktor Leis.
[BtrBlocks: Efficient Columnar Compression for Data Lakes](https://www.cs.cit.tum.de/fileadmin/w00cfj/dis/papers/btrblocks.pdf).
@@ -240,12 +245,14 @@ In particular, the following academic papers greatly influenced the development:
* Biswapesh Chattopadhyay, Priyam Dutta, Weiran Liu, Ott Tinn, Andrew Mccormick, Aniket Mokashi, Paul Harvey,
Hector Gonzalez, David Lomax, Sagar Mittal, et al. [Procella: Unifying serving and analytical
data at YouTube](https://dl.acm.org/citation.cfm?id=3360438). PVLDB, 12(12): 2022-2034, 2019.
+* Dominik Durner, Viktor Leis, and Thomas Neumann. [Exploiting Cloud Object Storage for High-Performance
+Analytics](https://www.durner.dev/app/media/papers/anyblob-vldb23.pdf). PVLDB, 16(11): 2769-2782, 2023.


Additionally, we benefited greatly from:

-* the existence, ideas, & implementation of [Apache Arrow](https://arrow.apache.org).
-* likewise for the excellent [Apache DataFusion](https://github.com/apache/datafusion) project.
+* the existence, ideas, & implementations of both [Apache Arrow](https://arrow.apache.org) and
+[Apache DataFusion](https://github.com/apache/datafusion).
* the [parquet2](https://github.com/jorgecarleitao/parquet2) project by [Jorge Leitao](https://github.com/jorgecarleitao).
* the public discussions around choices of compression codecs, as well as the C++ implementations thereof,
from [duckdb](https://github.com/duckdb/duckdb).
116 changes: 115 additions & 1 deletion pyvortex/src/array.rs
@@ -4,7 +4,8 @@ use pyo3::exceptions::PyValueError;
use pyo3::prelude::*;
use pyo3::types::{IntoPyDict, PyList};
use vortex::array::ChunkedArray;
-use vortex::compute::take;
+use vortex::compute::unary::fill_forward;
+use vortex::compute::{slice, take};
use vortex::{Array, ArrayDType, IntoCanonical};

use crate::dtype::PyDType;
@@ -138,8 +139,64 @@ impl PyArray {
PyDType::wrap(self_.py(), self_.inner.dtype().clone())
}

/// Fill forward non-null values over runs of nulls.
///
/// Leading nulls are replaced with the "zero" for that type. For integral and floating-point
/// types, this is zero. For the Boolean type, this is :obj:`False`.
///
/// Returns
/// -------
/// :class:`vortex.encoding.Array`
///
/// Examples
/// --------
///
/// Fill forward sensor values over intermediate missing values. Note that leading nulls are
/// replaced with 0.0:
///
/// >>> a = vortex.encoding.array([
/// ... None, None, 30.29, 30.30, 30.30, None, None, 30.27, 30.25,
/// ... 30.22, None, None, None, None, 30.12, 30.11, 30.11, 30.11,
/// ... 30.10, 30.08, None, 30.21, 30.03, 30.03, 30.05, 30.07, 30.07,
/// ... ])
/// >>> a.fill_forward().to_arrow_array()
/// <pyarrow.lib.DoubleArray object at ...>
/// [
/// 0,
/// 0,
/// 30.29,
/// 30.3,
/// 30.3,
/// 30.3,
/// 30.3,
/// 30.27,
/// 30.25,
/// 30.22,
/// ...
/// 30.11,
/// 30.1,
/// 30.08,
/// 30.08,
/// 30.21,
/// 30.03,
/// 30.03,
/// 30.05,
/// 30.07,
/// 30.07
/// ]
fn fill_forward(&self) -> PyResult<PyArray> {
fill_forward(&self.inner)
.map_err(PyVortexError::map_err)
.map(|arr| PyArray { inner: arr })
}

/// Filter, permute, and/or repeat elements by their index.
///
/// Parameters
/// ----------
/// indices : :class:`vortex.encoding.Array`
/// An array of indices to keep.
///
/// Returns
/// -------
/// :class:`vortex.encoding.Array`
@@ -186,6 +243,63 @@ impl PyArray {
.and_then(|arr| Bound::new(py, PyArray { inner: arr }))
}

/// Keep only a contiguous subset of elements.
///
/// Parameters
/// ----------
/// start : :class:`int`
/// The start index of the range to keep, inclusive.
///
/// end : :class:`int`
/// The end index, exclusive.
///
/// Returns
/// -------
/// :class:`vortex.encoding.Array`
///
/// Examples
/// --------
///
/// Keep only the second through third elements:
///
/// >>> a = vortex.encoding.array(['a', 'b', 'c', 'd'])
/// >>> a.slice(1, 3).to_arrow_array()
/// <pyarrow.lib.StringArray object at ...>
/// [
/// "b",
/// "c"
/// ]
///
/// Keep none of the elements:
///
/// >>> a = vortex.encoding.array(['a', 'b', 'c', 'd'])
/// >>> a.slice(3, 3).to_arrow_array()
/// <pyarrow.lib.StringArray object at ...>
/// []
///
/// Unlike Python, it is an error to slice outside the bounds of the array:
///
/// >>> a = vortex.encoding.array(['a', 'b', 'c', 'd'])
/// >>> a.slice(2, 10).to_arrow_array()
/// Traceback (most recent call last):
/// ...
/// ValueError: index 10 out of bounds from 0 to 4
///
/// Or to slice with a negative value:
///
/// >>> a = vortex.encoding.array(['a', 'b', 'c', 'd'])
/// >>> a.slice(-2, -1).to_arrow_array()
/// Traceback (most recent call last):
/// ...
/// OverflowError: can't convert negative int to unsigned
///
#[pyo3(signature = (start, end, *))]
fn slice(&self, start: usize, end: usize) -> PyResult<PyArray> {
slice(&self.inner, start, end)
.map(PyArray::new)
.map_err(PyVortexError::map_err)
}

/// Internal technical details about the encoding of this Array.
///
/// Warnings
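Taken together, the new bindings above can be exercised from Python roughly as follows. This is a hedged sketch based on the doctests in this file: it assumes the `vortex` package exposes `vortex.encoding.array` as shown there, and that `take` accepts an index array as its docstring describes.

```python
import vortex

a = vortex.encoding.array([None, 30.29, None, 30.27, 30.25, None])

# Fill nulls forward; leading nulls become the type's "zero" (0.0 here).
filled = a.fill_forward()

# Keep a contiguous range: start inclusive, end exclusive, no negative indices.
window = filled.slice(1, 5)

# Filter, permute, and/or repeat elements by index.
picked = window.take(vortex.encoding.array([0, 2, 2, 3]))

print(picked.to_arrow_array())
```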