PyVortex (#729)
PyVortex
--------

The generated documentation for this branch is available at
https://spiraldb.github.io/vortex/docs/

The Python package is now structured like this:

- `vortex`
  - `array()`: converts a list or an Arrow array into a Vortex array.
  - `encoding`
    - `Array`: In Rust this is called a PyArray and it is just a PyO3 wrapper
      around a Vortex Rust Array.
      - `to_pandas`
      - `to_numpy`
    - `compress()`: compresses an Array.
  - `dtype`: A module containing dtype constructors, e.g. `uint(32,
    nullable=False)`
  - `io`: Readers and writers which currently only work for Struct arrays
    without top-level nulls.
    - `read()`
    - `write()`
  - `expr`
    - `Expr`: a class, implemented in Rust, which constructs
      vortex-exprs using the obvious Python operators.
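The `Expr` bullet describes building expressions with ordinary Python operators. As a rough illustration of that pattern only, here is a minimal pure-Python sketch. The `Node`, `Column`, `Literal`, `BinaryExpr`, and `wrap` names are hypothetical stand-ins and are not the Rust-backed `vortex.expr` API.

```python
from dataclasses import dataclass


class Node:
    """Base class: operators build tree nodes instead of evaluating anything."""

    def __gt__(self, other):
        return BinaryExpr(">", self, wrap(other))

    def __lt__(self, other):
        return BinaryExpr("<", self, wrap(other))

    def __and__(self, other):
        return BinaryExpr("and", self, wrap(other))

    def __or__(self, other):
        return BinaryExpr("or", self, wrap(other))


@dataclass(frozen=True)
class Column(Node):
    name: str

    def __str__(self):
        return self.name


@dataclass(frozen=True)
class Literal(Node):
    value: object

    def __str__(self):
        return repr(self.value)


@dataclass(frozen=True)
class BinaryExpr(Node):
    op: str
    lhs: Node
    rhs: Node

    def __str__(self):
        return f"({self.lhs} {self.op} {self.rhs})"


def wrap(value):
    """Promote plain Python values to Literal nodes; leave nodes untouched."""
    return value if isinstance(value, Node) else Literal(value)


age = Column("age")
# Parentheses are required: `&` binds tighter than `>` and `<` in Python.
expr = (age > 30) & (age < 60)
print(expr)  # ((age > 30) and (age < 60))
```

Nothing is evaluated when the operators run. The result is a tree that a reader or scanner could later interpret as a row filter.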

I also added `python_repr` which returns a Display-able struct that
renders itself in the Python `repr` style. In particular, the dtypes
look like `uint(32, False)` rather than `u32`.
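To make the two renderings concrete, the sketch below models a dtype with a compact form and a Python `repr`-style form. It is a pure-Python analogy, not the Rust `python_repr` implementation: the `UInt` class is hypothetical, and the compact form here simply ignores nullability.

```python
from dataclasses import dataclass


@dataclass(frozen=True, repr=False)
class UInt:
    """Hypothetical stand-in for an unsigned integer dtype."""

    bits: int
    nullable: bool = False

    def __str__(self) -> str:
        # Compact form, like the `u32` mentioned above (nullability omitted in this sketch).
        return f"u{self.bits}"

    def __repr__(self) -> str:
        # Python style: reads like the constructor call that would rebuild the value.
        return f"uint({self.bits}, {self.nullable})"


def uint(bits: int, nullable: bool = False) -> UInt:
    """Constructor mirroring the `uint(32, nullable=False)` spelling."""
    return UInt(bits, nullable)


dtype = uint(32, nullable=False)
print(str(dtype))   # u32
print(repr(dtype))  # uint(32, False)
```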

I think the only bugfixes in this PR are:

1. pyvortex/src/encode.rs: propagate the nullability from Arrow to
`Array::from_arrow`.
2. arrow/recordbatch.rs and arrow/dtype.rs need to return compatible
nullability and validity.

Future Work
-----------

1. Automatically generate and deploy the documentation to github.io.
2. Run `cd pyvortex/docs && make doctest` on every commit.
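Several docstrings in this change show outputs such as `<pyarrow.lib.Int64Array object at ...>`, which can only pass a doctest run if `...` is treated as a wildcard. The sketch below uses only the standard-library `doctest` module to show the difference. It assumes nothing about how the Sphinx `make doctest` target is configured in this repository.

```python
import doctest

# A docstring-style example whose output contains a varying memory address.
DOCSTRING = """
>>> object()
<object object at ...>
"""


def run_example(optionflags):
    """Run the example above and return doctest's (failed, attempted) result."""
    test = doctest.DocTestParser().get_doctest(DOCSTRING, {}, "example", None, 0)
    runner = doctest.DocTestRunner(verbose=False, optionflags=optionflags)
    # Discard the failure report; only the counts matter here.
    return runner.run(test, out=lambda text: None)


strict = run_example(0)                 # `...` must match literally
relaxed = run_example(doctest.ELLIPSIS)  # `...` matches any substring

print(strict.failed, relaxed.failed)  # 1 0
```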
danking authored Sep 5, 2024
1 parent 28717ad commit e3a6c5a
Showing 26 changed files with 1,493 additions and 191 deletions.
8 changes: 8 additions & 0 deletions Cargo.lock


8 changes: 8 additions & 0 deletions pyvortex/Cargo.toml
Original file line number Diff line number Diff line change
@@ -22,19 +22,27 @@ doctest = false

[dependencies]
arrow = { workspace = true, features = ["pyarrow"] }
flexbuffers = { workspace = true }
futures = { workspace = true }
log = { workspace = true }
paste = { workspace = true }
pyo3 = { workspace = true }
pyo3-log = { workspace = true }
tokio = { workspace = true, features = ["fs"] }
vortex-alp = { workspace = true }
vortex-array = { workspace = true }
vortex-dict = { workspace = true }
vortex-dtype = { workspace = true }
vortex-error = { workspace = true }
vortex-expr = { workspace = true }
vortex-fastlanes = { workspace = true }
vortex-roaring = { workspace = true }
vortex-runend = { workspace = true }
vortex-sampling-compressor = { workspace = true }
vortex-serde = { workspace = true, features = ["tokio"] }
vortex-scalar = { workspace = true }
vortex-zigzag = { workspace = true }
itertools = { workspace = true }

# We may need this workaround?
# https://pyo3.rs/v0.20.2/faq.html#i-cant-run-cargo-test-or-i-cant-build-in-a-cargo-workspace-im-having-linker-issues-like-symbol-not-found-or-undefined-reference-to-_pyexc_systemerror
7 changes: 7 additions & 0 deletions pyvortex/docs/dtype.rst
@@ -0,0 +1,7 @@
Array Data Types
================

.. automodule:: vortex.dtype
   :members:
   :imported-members:

7 changes: 7 additions & 0 deletions pyvortex/docs/encoding.rst
@@ -0,0 +1,7 @@
Arrays
======

.. automodule:: vortex.encoding
   :members:
   :imported-members:
   :special-members: __len__
6 changes: 6 additions & 0 deletions pyvortex/docs/expr.rst
@@ -0,0 +1,6 @@
Row Filter Expressions
======================

.. automodule:: vortex.expr
   :members:
   :imported-members:
8 changes: 6 additions & 2 deletions pyvortex/docs/index.rst
@@ -6,9 +6,13 @@
Vortex documentation
====================

.. automodule:: vortex
   :members:
Vortex is an Apache Arrow-compatible toolkit for working with compressed array data.

.. toctree::
   :maxdepth: 2
   :caption: Contents:

   encoding
   dtype
   io
   expr
6 changes: 6 additions & 0 deletions pyvortex/docs/io.rst
@@ -0,0 +1,6 @@
Input and Output
================

.. automodule:: vortex.io
   :members:
   :imported-members:
9 changes: 7 additions & 2 deletions pyvortex/pyproject.toml
@@ -5,7 +5,9 @@ description = "Add your description here"
authors = [
{ name = "Nicholas Gates", email = "[email protected]" }
]
dependencies = []
dependencies = [
"pydata-sphinx-theme>=0.15.4",
]
requires-python = ">= 3.11"
classifiers = ["Private :: Do Not Upload"]

@@ -17,7 +19,10 @@ build-backend = "maturin"
managed = true
dev-dependencies = [
"pyarrow>=15.0.0",
"pip"
"pip",
"sphinx>=8.0.2",
"ipython>=8.26.0",
"pandas>=2.2.2",
]

[tool.maturin]
7 changes: 6 additions & 1 deletion pyvortex/python/vortex/__init__.py
@@ -1,4 +1,9 @@
from ._lib import * # noqa: F403
from . import encoding
from ._lib import __doc__ as module_docs
from ._lib import dtype, expr, io

__doc__ = module_docs
del module_docs
array = encoding.array

# Entries in __all__ must be names given as strings, not the module objects themselves.
__all__ = ["array", "dtype", "expr", "io", "encoding"]
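`__all__` lists the names that a star-import such as `from vortex import *` should export, and Python expects every entry to be a string. The sketch below demonstrates that rule on a throwaway module. The `fake_pkg` module and its contents are hypothetical and exist only for this illustration.

```python
import sys
import types

# Build a throwaway module (hypothetical name) with two public attributes.
fake = types.ModuleType("fake_pkg")
fake.array = lambda obj: list(obj)
fake.dtype = "dtype-placeholder"
fake.__all__ = ["array", "dtype"]  # names as strings
sys.modules["fake_pkg"] = fake

namespace = {}
exec("from fake_pkg import *", namespace)
exported = sorted(name for name in namespace if not name.startswith("__"))
print(exported)  # ['array', 'dtype']

# A non-string entry makes the star-import raise TypeError.
fake.__all__ = ["array", 42]
star_import_failed = False
try:
    exec("from fake_pkg import *", {})
except TypeError as err:
    star_import_failed = True
    print("star-import failed:", err)

# Remove the throwaway module again.
del sys.modules["fake_pkg"]
```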
166 changes: 166 additions & 0 deletions pyvortex/python/vortex/encoding.py
@@ -0,0 +1,166 @@
import pyarrow

from ._lib import encoding as _encoding

__doc__ = _encoding.__doc__

Array = _encoding.Array
compress = _encoding.compress


def _Array_to_pandas(self: _encoding.Array, *, name: str | None = None, flatten: bool = False):
    """Construct a Pandas dataframe from this Vortex array.

    Parameters
    ----------
    name : :class:`str`, optional
        The name of the column in the newly created dataframe. If unspecified, use `x`.
    flatten : :class:`bool`
        If :obj:`True`, Struct columns are flattened in the dataframe. See the examples.

    Returns
    -------
    :class:`pandas.DataFrame`

    Examples
    --------
    Construct a :class:`.pandas.DataFrame` with one column named `animals` from the contents of a Vortex
    array:

    >>> array = vortex.encoding.array(['dog', 'cat', 'mouse', 'rat'])
    >>> array.to_pandas(name='animals')
       animals
    0      dog
    1      cat
    2    mouse
    3      rat

    Construct a :class:`.pandas.DataFrame` with the default column name:

    >>> array = vortex.encoding.array(['dog', 'cat', 'mouse', 'rat'])
    >>> array.to_pandas()
           x
    0    dog
    1    cat
    2  mouse
    3    rat

    Construct a dataframe with a Struct-typed column:

    >>> array = vortex.encoding.array([
    ...     {'name': 'Joseph', 'age': 25},
    ...     {'name': 'Narendra', 'age': 31},
    ...     {'name': 'Angela', 'age': 33},
    ...     {'name': 'Mikhail', 'age': 57},
    ... ])
    >>> array.to_pandas()
                                     x
    0    {'age': 25, 'name': 'Joseph'}
    1  {'age': 31, 'name': 'Narendra'}
    2    {'age': 33, 'name': 'Angela'}
    3   {'age': 57, 'name': 'Mikhail'}

    Lift the struct fields to the top-level in the dataframe:

    >>> array.to_pandas(flatten=True)
       x.age    x.name
    0     25    Joseph
    1     31  Narendra
    2     33    Angela
    3     57   Mikhail
    """
    # Fall back to the default column name when none is given.
    name = name or "x"
    table = pyarrow.Table.from_arrays([self.to_arrow()], [name])
    if flatten:
        # Lift struct fields into top-level columns named `<name>.<field>`.
        table = table.flatten()
    return table.to_pandas()


Array.to_pandas = _Array_to_pandas
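The `flatten=True` path turns one Struct column into one column per field, named `<column>.<field>`. The pure-Python sketch below mimics that reshaping on plain dictionaries so the naming is easy to see. `flatten_struct_column` is a hypothetical helper, it assumes every row has the same fields, and it is not how `pyarrow.Table.flatten` is implemented.

```python
def flatten_struct_column(name, rows):
    """Turn a list of dict rows into {"<name>.<field>": [values, ...]} columns."""
    columns = {}
    for row in rows:
        # Sort field names so the column order matches the docstring output above.
        for field in sorted(row):
            columns.setdefault(f"{name}.{field}", []).append(row[field])
    return columns


rows = [
    {"name": "Joseph", "age": 25},
    {"name": "Narendra", "age": 31},
    {"name": "Angela", "age": 33},
    {"name": "Mikhail", "age": 57},
]
flat = flatten_struct_column("x", rows)
print(list(flat))     # ['x.age', 'x.name']
print(flat["x.age"])  # [25, 31, 33, 57]
```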


def _Array_to_numpy(self: _encoding.Array, *, zero_copy_only: bool = True):
    """Construct a NumPy array from this Vortex array.

    This is an alias for :code:`self.to_arrow().to_numpy(zero_copy_only=zero_copy_only)`.

    Parameters
    ----------
    zero_copy_only : :class:`bool`
        If :obj:`True`, raise an error if the conversion would require copying the underlying data.

    Returns
    -------
    :class:`numpy.ndarray`

    Examples
    --------
    Construct an ndarray from a Vortex array:

    >>> array = vortex.encoding.array([1, 0, 0, 1])
    >>> array.to_numpy()
    array([1, 0, 0, 1])
    """
    return self.to_arrow().to_numpy(zero_copy_only=zero_copy_only)


Array.to_numpy = _Array_to_numpy


def array(obj: pyarrow.Array | list) -> Array:
    """The main entry point for creating Vortex arrays from other Python objects.

    This function is also available as ``vortex.array``.

    Parameters
    ----------
    obj : :class:`pyarrow.Array` or :class:`list`
        The elements of this array or list become the elements of the Vortex array.

    Returns
    -------
    :class:`vortex.encoding.Array`

    Examples
    --------
    A Vortex array containing the first three integers.

    >>> vortex.encoding.array([1, 2, 3]).to_arrow()
    <pyarrow.lib.Int64Array object at ...>
    [
      1,
      2,
      3
    ]

    The same Vortex array with a null value in the third position.

    >>> vortex.encoding.array([1, 2, None, 3]).to_arrow()
    <pyarrow.lib.Int64Array object at ...>
    [
      1,
      2,
      null,
      3
    ]

    Initialize a Vortex array from an Arrow array:

    >>> arrow = pyarrow.array(['Hello', 'it', 'is', 'me'])
    >>> vortex.encoding.array(arrow).to_arrow()
    <pyarrow.lib.StringArray object at ...>
    [
      "Hello",
      "it",
      "is",
      "me"
    ]
    """
    # Python lists are converted to Arrow first; Arrow arrays are encoded directly.
    if isinstance(obj, list):
        return _encoding._encode(pyarrow.array(obj))
    return _encoding._encode(obj)