Commit

Merge remote-tracking branch 'origin/develop' into rk/lessspecialflat
lwwmanning committed Oct 19, 2024
2 parents 40a5298 + 85be90f commit 468cc51
Showing 9 changed files with 337 additions and 240 deletions.
12 changes: 6 additions & 6 deletions Cargo.lock

33 changes: 20 additions & 13 deletions README.md
@@ -8,12 +8,12 @@
Vortex is an extensible, state-of-the-art columnar file format, with associated tools for working with compressed Apache Arrow arrays
in-memory, on-disk, and over-the-wire.

-Vortex is an aspiring successor to Apache Parquet, with dramatically faster random access reads (100-200x faster)
-and scans (2-10x faster), while preserving approximately the same compression ratio and write throughput as Parquet with zstd.
-It will also support very wide tables (at least 10s of thousands of columns) and (eventually) on-device decompression on GPUs.
+Vortex is an aspiring successor to Apache Parquet, with dramatically faster random access reads (100-200x faster) and scans (2-10x faster),
+while preserving approximately the same compression ratio and write throughput as Parquet with zstd.
+It is designed to support very wide tables (at least 10s of thousands of columns) and (eventually) on-device decompression on GPUs.

-Vortex is designed to be to columnar file formats what Apache DataFusion is to query engines: highly extensible,
-extremely fast, batteries-included.
+Vortex is intended to be to columnar file formats what Apache DataFusion is to query engines: highly extensible,
+extremely fast, & batteries-included.

> [!CAUTION]
> This library is still under rapid development and is a work in progress!
@@ -58,7 +58,12 @@ One of the unique attributes of the (in-progress) Vortex file format is that it
file's footer. This allows the file format to be effectively self-describing and to evolve without breaking changes to
the file format specification.

-In fact, the format is designed to support forward compatibility by optionally embedding WASM decoders directly into the files
+For example, the Compressor implementation can choose to chunk data into a Parquet-like layout with
+row groups and aligned pages (ChunkedArray of StructArray of ChunkedArrays with equal chunk sizes). Alternatively, it can choose
+to chunk different columns differently based on their compressed size and data distributions (e.g., a column that is constant
+across all rows can be a single chunk, whereas a large string column may be split arbitrarily many times).
+
+In the same vein, the format is designed to support forward compatibility by optionally embedding WASM decoders directly into the files
themselves. This should help avoid the rapid calcification that has plagued other columnar file formats.
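To make the layout flexibility described above concrete, here is a minimal, purely illustrative sketch of the two chunking strategies, using plain Python lists and dicts as stand-ins for Vortex's ChunkedArray and StructArray; nothing below is the actual Vortex API.

```python
# Illustrative stand-ins only: a list plays the role of a ChunkedArray and a
# dict the role of a StructArray; this is not the Vortex API.

# Parquet-like layout: a ChunkedArray of StructArrays of ChunkedArrays with
# equal chunk sizes, i.e. row groups whose pages line up across columns.
parquet_like = [
    {"ts": [[1, 2], [3, 4]], "reading": [[9.1, 9.2], [9.3, 9.4]]},  # row group 0
    {"ts": [[5, 6], [7, 8]], "reading": [[9.5, 9.6], [9.7, 9.8]]},  # row group 1
]

# Per-column layout: each column is chunked independently, driven by its
# compressed size and data distribution.
per_column = {
    "country": [["US"] * 8],       # constant column: one chunk covering all rows
    "comment": [["..."] * 2] * 4,  # large string column: many small chunks
}
```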

## Components
@@ -120,10 +125,10 @@ in-memory array implementation, allowing us to defer decompression. Currently, t
Vortex's default compression strategy is based on the
[BtrBlocks](https://www.cs.cit.tum.de/fileadmin/w00cfj/dis/papers/btrblocks.pdf) paper.

-Roughly, for each chunk of data, a sample of at least ~1% of the data is taken. Compression is then attempted (
-recursively) with a set of lightweight encodings. The best-performing combination of encodings is then chosen to encode
-the entire chunk. This sounds like it would be very expensive, but given basic statistics about a chunk, it is
-possible to cheaply prune many encodings and ensure the search space does not explode in size.
+Roughly, for each chunk of data, a sample of at least ~1% of the data is taken. Compression is then attempted
+(recursively) with a set of lightweight encodings. The best-performing combination of encodings is then chosen to encode
+the entire chunk. This sounds like it would be very expensive, but given the logical types and basic statistics about a
+chunk, it is possible to cheaply prune many encodings and ensure the search space does not explode in size.
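As a rough illustration of that sampling loop, here is a hedged Python sketch. The encoding objects and their `is_applicable`, `compressed_size`, and `encode` methods are assumptions made for this example and do not reflect the actual Vortex compressor interface.

```python
import random

def compress_chunk(chunk, encodings, sample_fraction=0.01):
    """Pick the best encoding on a ~1% sample, then encode the whole chunk."""
    sample_len = max(1, int(len(chunk) * sample_fraction))
    sample = random.sample(chunk, sample_len)

    # Cheap statistics (logical type, min/max, run count, null count, ...) are
    # used to prune encodings that cannot apply, keeping the search tractable.
    candidates = [e for e in encodings if e.is_applicable(sample)]

    # Try each remaining encoding (each may recursively compress its children)
    # and keep whichever compresses the sample best.
    best = min(candidates, key=lambda e: e.compressed_size(sample))

    # The winner chosen on the sample is used to encode the entire chunk.
    return best.encode(chunk)
```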

### Compute

@@ -224,7 +229,7 @@ Expect more details on this in Q4 2024.
This project is inspired by and--in some cases--directly based upon the existing, excellent work of many researchers
and OSS developers.

-In particular, the following academic papers greatly influenced the development:
+In particular, the following academic papers have strongly influenced development:

* Maximilian Kuschewski, David Sauerwein, Adnan Alhomssi, and Viktor Leis.
[BtrBlocks: Efficient Columnar Compression for Data Lakes](https://www.cs.cit.tum.de/fileadmin/w00cfj/dis/papers/btrblocks.pdf).
@@ -240,12 +245,14 @@ In particular, the following academic papers greatly influenced the development:
* Biswapesh Chattopadhyay, Priyam Dutta, Weiran Liu, Ott Tinn, Andrew Mccormick, Aniket Mokashi, Paul Harvey,
Hector Gonzalez, David Lomax, Sagar Mittal, et al. [Procella: Unifying serving and analytical
data at YouTube](https://dl.acm.org/citation.cfm?id=3360438). PVLDB, 12(12): 2022-2034, 2019.
+* Dominik Durner, Viktor Leis, and Thomas Neumann. [Exploiting Cloud Object Storage for High-Performance
+Analytics](https://www.durner.dev/app/media/papers/anyblob-vldb23.pdf). PVLDB, 16(11): 2769-2782, 2023.


Additionally, we benefited greatly from:

-* the existence, ideas, & implementation of [Apache Arrow](https://arrow.apache.org).
-* likewise for the excellent [Apache DataFusion](https://github.com/apache/datafusion) project.
+* the existence, ideas, & implementations of both [Apache Arrow](https://arrow.apache.org) and
+[Apache DataFusion](https://github.com/apache/datafusion).
* the [parquet2](https://github.com/jorgecarleitao/parquet2) project by [Jorge Leitao](https://github.com/jorgecarleitao).
* the public discussions around choices of compression codecs, as well as the C++ implementations thereof,
from [duckdb](https://github.com/duckdb/duckdb).
116 changes: 115 additions & 1 deletion pyvortex/src/array.rs
@@ -4,7 +4,8 @@ use pyo3::exceptions::PyValueError;
use pyo3::prelude::*;
use pyo3::types::{IntoPyDict, PyList};
use vortex::array::ChunkedArray;
-use vortex::compute::take;
+use vortex::compute::unary::fill_forward;
+use vortex::compute::{slice, take};
use vortex::{Array, ArrayDType, IntoCanonical};

use crate::dtype::PyDType;
@@ -138,8 +139,64 @@ impl PyArray {
PyDType::wrap(self_.py(), self_.inner.dtype().clone())
}

/// Fill forward non-null values over runs of nulls.
///
/// Leading nulls are replaced with the "zero" for that type. For integral and floating-point
/// types, this is zero. For the Boolean type, this is :obj:`False`.
///
/// Returns
/// -------
/// :class:`vortex.encoding.Array`
///
/// Examples
/// --------
///
/// Fill forward sensor values over intermediate missing values. Note that leading nulls are
/// replaced with 0.0:
///
/// >>> a = vortex.encoding.array([
/// ... None, None, 30.29, 30.30, 30.30, None, None, 30.27, 30.25,
/// ... 30.22, None, None, None, None, 30.12, 30.11, 30.11, 30.11,
/// ... 30.10, 30.08, None, 30.21, 30.03, 30.03, 30.05, 30.07, 30.07,
/// ... ])
/// >>> a.fill_forward().to_arrow_array()
/// <pyarrow.lib.DoubleArray object at ...>
/// [
/// 0,
/// 0,
/// 30.29,
/// 30.3,
/// 30.3,
/// 30.3,
/// 30.3,
/// 30.27,
/// 30.25,
/// 30.22,
/// ...
/// 30.11,
/// 30.1,
/// 30.08,
/// 30.08,
/// 30.21,
/// 30.03,
/// 30.03,
/// 30.05,
/// 30.07,
/// 30.07
/// ]
fn fill_forward(&self) -> PyResult<PyArray> {
fill_forward(&self.inner)
.map_err(PyVortexError::map_err)
.map(|arr| PyArray { inner: arr })
}

/// Filter, permute, and/or repeat elements by their index.
///
/// Parameters
/// ----------
/// indices : :class:`vortex.encoding.Array`
/// An array of indices to keep.
///
/// Returns
/// -------
/// :class:`vortex.encoding.Array`
@@ -186,6 +243,63 @@ impl PyArray {
.and_then(|arr| Bound::new(py, PyArray { inner: arr }))
}

/// Keep only a contiguous subset of elements.
///
/// Parameters
/// ----------
/// start : :class:`int`
/// The start index of the range to keep, inclusive.
///
/// end : :class:`int`
/// The end index, exclusive.
///
/// Returns
/// -------
/// :class:`vortex.encoding.Array`
///
/// Examples
/// --------
///
/// Keep only the second through third elements:
///
/// >>> a = vortex.encoding.array(['a', 'b', 'c', 'd'])
/// >>> a.slice(1, 3).to_arrow_array()
/// <pyarrow.lib.StringArray object at ...>
/// [
/// "b",
/// "c"
/// ]
///
/// Keep none of the elements:
///
/// >>> a = vortex.encoding.array(['a', 'b', 'c', 'd'])
/// >>> a.slice(3, 3).to_arrow_array()
/// <pyarrow.lib.StringArray object at ...>
/// []
///
/// Unlike Python, it is an error to slice outside the bounds of the array:
///
/// >>> a = vortex.encoding.array(['a', 'b', 'c', 'd'])
/// >>> a.slice(2, 10).to_arrow_array()
/// Traceback (most recent call last):
/// ...
/// ValueError: index 10 out of bounds from 0 to 4
///
/// Or to slice with a negative value:
///
/// >>> a = vortex.encoding.array(['a', 'b', 'c', 'd'])
/// >>> a.slice(-2, -1).to_arrow_array()
/// Traceback (most recent call last):
/// ...
/// OverflowError: can't convert negative int to unsigned
///
#[pyo3(signature = (start, end, *))]
fn slice(&self, start: usize, end: usize) -> PyResult<PyArray> {
slice(&self.inner, start, end)
.map(PyArray::new)
.map_err(PyVortexError::map_err)
}

/// Internal technical details about the encoding of this Array.
///
/// Warnings
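Taken together, the new bindings above can be exercised from Python roughly as follows. This is a hedged sketch based on the doctests in this file: it assumes the `vortex` package exposes `vortex.encoding.array` as shown there, and that `take` accepts an index array as its docstring describes.

```python
import vortex

a = vortex.encoding.array([None, 30.29, None, 30.27, 30.25, None])

# Fill nulls forward; leading nulls become the type's "zero" (0.0 here).
filled = a.fill_forward()

# Keep a contiguous range: start inclusive, end exclusive, no negative indices.
window = filled.slice(1, 5)

# Filter, permute, and/or repeat elements by index.
picked = window.take(vortex.encoding.array([0, 2, 2, 3]))

print(picked.to_arrow_array())
```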