From a95727f073fdc8c4795569bba2fa2a88321bc1e4 Mon Sep 17 00:00:00 2001 From: Brian McFee Date: Fri, 25 Aug 2017 10:32:04 -0400 Subject: [PATCH 1/4] added release notes --- docs/changes.rst | 20 ++++++++++++++++++++ 1 file changed, 20 insertions(+) diff --git a/docs/changes.rst b/docs/changes.rst index a086418..f22b3fe 100644 --- a/docs/changes.rst +++ b/docs/changes.rst @@ -1,6 +1,26 @@ Changes ======= +v1.1.0 +------ +This is primarily a maintenance release, and will be the last in the 1.x series. + +- `#97`_ Fixed an infinite loop in `Mux` +- `#91`_ Changed the default timeout for `ZMQStreamer` to 5 seconds +- `#90`_ Fixed conda-forge package distribution +- `#89`_ Refactored internals of the `Mux` class toward the 2.x series +- `#88`_, `#100`_ Improved unit tests +- `#73`_, `#95`_ Updated documentation + +.. _#73: https://github.com/pescadores/pescador/pull/73 +.. _#88: https://github.com/pescadores/pescador/pull/88 +.. _#89: https://github.com/pescadores/pescador/pull/89 +.. _#90: https://github.com/pescadores/pescador/pull/90 +.. _#91: https://github.com/pescadores/pescador/pull/91 +.. _#95: https://github.com/pescadores/pescador/pull/95 +.. _#97: https://github.com/pescadores/pescador/pull/97 +.. _#100: https://github.com/pescadores/pescador/pull/100 + v1.0.0 ------ This release constitutes a major revision over the 0.x series, and the new interface From ac89968f633cfa19cc36afcbc6e973ec9f270cac Mon Sep 17 00:00:00 2001 From: Brian McFee Date: Fri, 25 Aug 2017 10:36:11 -0400 Subject: [PATCH 2/4] updated version number to 1.1 --- pescador/version.py | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/pescador/version.py b/pescador/version.py index d9a60c8..c866d6f 100644 --- a/pescador/version.py +++ b/pescador/version.py @@ -2,5 +2,5 @@ # -*- coding: utf-8 -*- """Version info""" -short_version = '1.0' -version = '1.0.0' +short_version = '1.1' +version = '1.1.0' From d1fba77c7e375ac2c01d13255ce25f10cd978c15 Mon Sep 17 00:00:00 2001 From: Brian McFee Date: Fri, 25 Aug 2017 10:42:57 -0400 Subject: [PATCH 3/4] revising docs restructuring toc restructured documentation restructured documentation restructured documentation rewriting main doc page updated readme and docs intro --- README.md | 25 ++++++++- docs/example1.rst | 31 ++++++----- docs/example2.rst | 8 ++- docs/example3.rst | 7 ++- docs/index.rst | 136 +++++++++++++++++++--------------------------- docs/intro.rst | 81 +++++++++++++++++++++++++++ pescador/core.py | 2 - 7 files changed, 189 insertions(+), 101 deletions(-) create mode 100644 docs/intro.rst diff --git a/README.md b/README.md index 7f998de..8cc06ad 100644 --- a/README.md +++ b/README.md @@ -7,9 +7,30 @@ pescador [![Documentation Status](https://readthedocs.org/projects/pescador/badge/?version=latest)](https://readthedocs.org/projects/pescador/?badge=latest) [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.400700.svg)](https://doi.org/10.5281/zenodo.400700) -A sampling and buffering module for iterative learning. +Pescador is a library for streaming (numerical) data, primarily for use in machine learning applications. -Read the [documentation](http://pescador.readthedocs.org) +Pescador addresses the following use cases: + + - **Hierarchical sampling** + - **Out-of-core learning** + - **Parallel streaming** + +These use cases arise in the following common scenarios: + + - Say you have three data sources `(A, B, C)` that you want to sample. + Pescador can dynamically interleave these sources to provide a randomized stream `D <- (A, B, C)`.
+ The distribution over `(A, B, C)` need not be uniform: you can specify any distribution you like! + + - Now, say you have 3000 data sources that you want to sample, and they're too large to all fit in RAM at + once. + Pescador makes it easy to interleave these sources while maintaining a small `working set`. + Not all sources are simultaneously active, but Pescador manages the working set so you don't have to. + + - If loading data incurs substantial latency (e.g., due to storage access or pre-processing), this can slow down processing. + Pescador makes it easy to do this seamlessly in a background process, so that your main thread can continue working. + + +Want to learn more? [Read the docs!](http://pescador.readthedocs.org) Installation diff --git a/docs/example1.rst b/docs/example1.rst index d04cd76..12ec202 100644 --- a/docs/example1.rst +++ b/docs/example1.rst @@ -1,18 +1,20 @@ .. _example1: -Basic example -============= +Streaming data +============== -This document will walk through the basics of using pescador to stream samples from a generator. +This example will walk through the basics of using pescador to stream samples from a generator. Our running example will be learning from an infinite stream of stochastically perturbed samples from the Iris dataset. Sample generators ----------------- -Streamers are intended to transparently pass data without modifying them. However, Pescador assumes that Streamers produce output in -a particular format. Specifically, a data is expected to be a python dictionary where each value contains a `np.ndarray`. For an unsupervised learning (e.g., SKLearn/`MiniBatchKMeans`), the data might contain only one -key: `X`. For supervised learning (e.g., SGDClassifier), valid data would contain both `X` and `Y` keys, both of equal length. +Streamers are intended to transparently pass data without modifying them. +However, Pescador assumes that Streamers produce output in a particular format. +Specifically, each sample is expected to be a Python dictionary in which each value is a `np.ndarray`. +For unsupervised learning (e.g., SKLearn's `MiniBatchKMeans`), a sample might contain only one key: `X`. +For supervised learning (e.g., `SGDClassifier`), a valid sample would contain both `X` and `Y` keys of equal length. Here's a simple example generator that draws random samples of data from the Iris dataset, and adds gaussian noise to the features. @@ -43,7 +45,6 @@ Here's a simple example generator that draws random samples of data from the Iri sample['Y'] is a scalar `np.ndarray` of shape `(,)` ''' - n, d = X.shape while True: @@ -53,16 +54,20 @@ Here's a simple example generator that draws random samples of data from the Iri yield dict(X=X[i] + noise, Y=Y[i]) -In the code above, `noisy_samples` is a generator that can be sampled indefinitely because `noisy_samples` contains an infinite loop. Each iterate of `noisy_samples` will be a dictionary containing the sample's features and labels. +In the code above, `noisy_samples` is a generator that can be sampled indefinitely because `noisy_samples` contains an infinite loop. +Each iterate of `noisy_samples` will be a dictionary containing the sample's features and labels. Streamers --------- -Generators in python have a couple of limitations for common stream learning pipelines. First, once instantiated, a generator cannot be "restarted". Second, an instantiated generator cannot be serialized -directly, so they are difficult to use in distributed computation environments.
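To make these limitations concrete, here is a minimal sketch of both failure modes (standard library only; the toy `counter` generator is illustrative and not part of pescador):

.. code-block:: python

    import pickle

    def counter(n):
        '''A toy generator that yields n values, then stops.'''
        for i in range(n):
            yield i

    gen = counter(3)
    print(list(gen))   # [0, 1, 2]
    print(list(gen))   # [] -- exhausted; the generator cannot be restarted

    pickle.dumps(gen)  # raises TypeError: generators cannot be serialized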
- -Pescador provides the `Streamer` class to circumvent these issues. `Streamer` simply provides an object container for an uninstantiated generator (and its parameters), and an access method `generate()`. Calling `generate()` multiple times on a `Streamer` object is equivalent to restarting the generator, and can therefore be used to simply implement multiple pass streams. Similarly, because `Streamer` can be serialized, it is simple to pass a streamer object to a separate process for parallel computation. +Generators in Python have a couple of limitations for common stream learning pipelines. +First, once instantiated, a generator cannot be "restarted". +Second, an instantiated generator cannot be serialized directly, so generators are difficult to use in distributed computation environments. + +Pescador provides the `Streamer` class to circumvent these issues. +`Streamer` simply provides an object container for an uninstantiated generator (and its parameters), and an access method `generate()`. +Calling `generate()` multiple times on a `Streamer` object is equivalent to restarting the generator, which makes it simple to implement multiple-pass streams. +Similarly, because `Streamer` can be serialized, it is simple to pass a streamer object to a separate process for parallel computation. Here's a simple example, using the generator from the previous section. diff --git a/docs/example2.rst b/docs/example2.rst index 43ede1b..d29c855 100644 --- a/docs/example2.rst +++ b/docs/example2.rst @@ -1,13 +1,14 @@ .. _example2: -This document will walk through some advanced usage of pescador. +This example demonstrates how to re-use and multiplex streamers. We will assume a working understanding of the simple example in the previous section. Stream re-use and multiplexing ============================== -The `Mux` streamer provides a powerful interface for randomly interleaving samples from multiple input streams. `Mux` can also dynamically activate and deactivate individual `Streamers`, which allows it to operate on a bounded subset of streams at any given time. +The `Mux` streamer provides a powerful interface for randomly interleaving samples from multiple input streams. +`Mux` can also dynamically activate and deactivate individual `Streamers`, which allows it to operate on a bounded subset of streams at any given time. As a concrete example, we can simulate a mixture of noisy streams with differing variances. @@ -66,7 +67,8 @@ As a concrete example, we can simulate a mixture of noisy streams with differing print('Test accuracy: {:.3f}'.format(accuracy_score(Y[test], Ypred))) -In the above example, each `Streamer` in `streams` can make infinitely many samples. The `rate=64` argument to `Mux` says that each stream should produce some `n` samples, where `n` is sampled from a Poisson distribution of rate `rate`. When a stream exceeds its bound, it is deactivated, and a new streamer is activated to fill its place. +In the above example, each `Streamer` in `streams` can produce infinitely many samples. The `rate=64` argument to `Mux` says that each stream should produce some `n` samples, where `n` is sampled from a Poisson distribution of rate `rate`. +When a stream exceeds its bound, it is deactivated, and a new streamer is activated to fill its place. Setting `rate=None` disables the random stream bounding, and `mux()` simply runs each active stream until exhaustion.
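To make the two modes concrete, here is a minimal sketch; the `finite_stream` generator is a toy illustration (not part of pescador), and the working-set size is passed as the second positional argument to `Mux`, as in the examples above:

.. code-block:: python

    import numpy as np
    import pescador

    def finite_stream(label, n):
        '''A toy generator that yields exactly n labeled samples.'''
        for i in range(n):
            yield dict(X=np.array([[i]]), Y=np.array([label]))

    streams = [pescador.Streamer(finite_stream, c, 5) for c in range(3)]

    # rate=2: each activated stream is bounded by some n ~ Poisson(2) samples
    bounded = pescador.Mux(streams, 2, rate=2)

    # rate=None: each active stream runs until it is exhausted
    single_pass = pescador.Mux(streams, 2, rate=None)

    for item in bounded(max_iter=10):
        print(item['X'], item['Y'])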
diff --git a/docs/example3.rst b/docs/example3.rst index 42fda12..8a7abac 100644 --- a/docs/example3.rst +++ b/docs/example3.rst @@ -3,7 +3,11 @@ Sampling from disk ================== -A common use case for `pescador` is to sample data from a large collection of existing archives. As a concrete example, consider the problem of fitting a statistical model to a large corpus of musical recordings. When the corpus is sufficiently large, it is impossible to fit the entire set in memory while estimating the model parameters. Instead, one can pre-process each song to store pre-computed features (and, optionally, target labels) in a *numpy zip* `NPZ` archive. The problem then becomes sampling data from a collection of `NPZ` archives. +A common use case for `pescador` is to sample data from a large collection of existing archives. +As a concrete example, consider the problem of fitting a statistical model to a large corpus of musical recordings. +When the corpus is sufficiently large, it is impossible to fit the entire set in memory while estimating the model parameters. +Instead, one can pre-process each song to store pre-computed features (and, optionally, target labels) in a *numpy zip* `NPZ` archive. +The problem then becomes sampling data from a collection of `NPZ` archives. Here, we will assume that the pre-processing has already been done so that each `NPZ` file contains a numpy array of features `X` and labels `Y`. We will define infinite samplers that pull `n` examples per iterate. @@ -86,7 +90,6 @@ Alternatively, *memory-mapping* can be used to only load data as needed, but req yield dict(X=X[idx:idx + n], Y=Y[idx:idx + n]) - # Using this streamer is similar to the first example, but now you need a separate # NPY file for each X and Y npy_x_files = #LIST OF PRE-COMPUTED NPY FILES (X) diff --git a/docs/index.rst b/docs/index.rst index 802e81c..de8d70c 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -3,96 +3,72 @@ You can adapt this file completely to your liking, but it should at least contain the root `toctree` directive. -Introduction ------------- - -Pescador is a library for streaming (numerical) data for use in iterative machine learning applications. - -The core concept is the :ref:`Streamer` object, which encapsulates a Python `generator` to allow for re-use and -inter-process communication. - -The basic use case is as follows: - - 1. Define a generator function `g` which yields a dictionary of numpy arrays at each step - 2. Construct a :ref:`Streamer` object `stream = Streamer(g, args...)` - 3. Iterate over examples generated by `stream()`. - -On top of this basic functionality, pescador provides the following tools: - - - A :ref:`Streamer` allows you to turn a finite-lifecycle generator into an infinte stream with `cycle()`, by automatically restarting the generator if it completes. - - Multiplexing multiple data streams (see :ref:`Mux`) - - Transform or modify streams with Maps (see :ref:`processing-data-streams`) - - Parallel processing (see :ref:`ZMQStreamer`) - - Buffering of sampled data into fixed-size batches (see :ref:`pescador.maps.buffer_stream`) - -For examples of each of these use-cases, refer to the :ref:`Examples` section. - - -Definitions ------------ - -Pescador is designed with the following core principles: - -1. An "iterator" is an object that produces a sequence of data, i.e. via `__next__` / `next()`. (`iterator definition `_, `Iterator Types `_) +.. _pescador: -2. An "iterable" is an object that can produce iterators, i.e. via `__iter__` / `iter()`. 
(`iterable definition `_) +######## +Pescador +######## -3. A "stream" is the sequence of objects produced by an iterator. +Pescador is a library for streaming (numerical) data, primarily for use in machine learning applications. -4. A "generator" (or more precisely "generator function") is a callable object that returns a single generator iterator. (`generator definition `_) +Pescador addresses the following use cases: -For example: - - `range` is an iterable function - - `range(8)` is an iterable, and its iterator produces the stream (consecutively) `0, 1, 2, 3, ...` + - **Hierarchical sampling** + - **Out-of-core learning** + - **Parallel streaming** +These use cases arise in the following common scenarios: -.. _streaming-data: + - Say you have three data sources `(A, B, C)` that you want to sample. + Pescador can dynamically interleave these sources to provide a randomized stream `D <- (A, B, C)`. + The distribution over `(A, B, C)` need not be uniform: you can specify any distribution you like! -Streaming Data --------------- -1. Pescador defines an object called a `Streamer` for the purposes of (re)creating iterators indefinitely and (optionally) interrupting them prematurely. + - Now, say you have 3000 data sources that you want to sample, and they're too large to all fit in RAM at + once. + Pescador makes it easy to interleave these sources while maintaining a small `working set`. + Not all sources are simultaneously active, but Pescador manages the working set so you don't have to. -2. `Streamer` inherits from `iterable` and can be iterated directly. + - If loading data incurs substantial latency (e.g., due to storage access or pre-processing), this can slow down processing. + Pescador makes it easy to do this seamlessly in a background process, so that your main thread can continue working. -3. A `Streamer` can be initialized with one of two types: - - Any iterable type, e.g. `range(7)`, `['foo', 'bar']`, `"abcdef"`, or another `Streamer()` - - A generator function and its arguments + keyword arguments. -4. A `Streamer` transparently yields the data stream flowing through it +To make this all possible, Pescador provides the following utilities: - - A `Streamer` should not modify objects in its stream. - - - In the spirit of encapsulation, the modification of data streams is achieved through separate functionality (see :ref:`processing-data-streams`) - - -Multiplexing Data Streams -------------------------- -1. Pescador defines an object called a `Mux` for the purposes of multiplexing streams of data. - -2. `Mux` inherits from `Streamer`, which makes it both iterable and recomposable, e.g. one can construct arbitrary trees of data streams. + - :ref:`Streamer` objects encapsulate data generators for re-use, infinite sampling, and inter-process + communication. + - :ref:`Mux` objects allow flexible sampling from multiple streams + - :ref:`ZMQStreamer` provides parallel processing with low communication overhead + - Transform or modify streams with Maps (see :ref:`processing-data-streams`) + - Buffering of sampled data into fixed-size batches (see :ref:`pescador.maps.buffer_stream`) -3. A `Mux` is initialized with a container of one or more iterables, and parameters to control the stochastic behavior of the object. +************ +Installation +************ -4. As a subclass of `Streamer`, a `Mux` also transparently yields the stream flowing through it, i.e. :ref:`streaming-data`. +Pescador can be installed from PyPI through `pip`: +.. code-block:: bash -.. 
_processing-data-streams: + pip install pescador -Processing Data Streams ------------------------ -Pescador adopts the concept of "transformers" for processing data streams. +or via `conda` using the `conda-forge` channel: -1. A transformer takes as input a single object in the stream. +.. code-block:: bash -2. A transformer yields an object. + conda install -c conda-forge pescador -3. Transformers are iterators, i.e. implement a `__next__` method, to preserve iteration. -4. An example of a built-in transformer is `enumerate` [`ref `_] +************ +Introduction +************ +.. toctree:: + :maxdepth: 2 + intro -Basic Usage --------------- +************** +Basic examples +************** .. toctree:: :maxdepth: 2 @@ -101,39 +77,41 @@ Basic Usage example3 bufferedstreaming -Examples --------- +***************** +Advanced examples +***************** .. toctree:: :maxdepth: 2 auto_examples/index +************* API Reference -------------- +************* .. toctree:: :maxdepth: 2 api - -Changes -------- +************* +Release notes +************* .. toctree:: :maxdepth: 2 changes - +********** Contribute ----------- +********** - `Issue Tracker `_ - `Source Code `_ +- `Contributing guidelines `_ - +****************** Indices and tables -================== +****************** * :ref:`genindex` * :ref:`modindex` * :ref:`search` - diff --git a/docs/intro.rst b/docs/intro.rst new file mode 100644 index 0000000..f0db798 --- /dev/null +++ b/docs/intro.rst @@ -0,0 +1,81 @@ +.. _intro: + +************ +Introduction +************ + +Pescador's primary goal is to provide fine-grained control over data streaming and sampling. +These problems can get complex quickly, so this section provides an overview of the concepts underlying +Pescador's design, and a summary of the provided functionality. + + +Definitions +----------- + +To understand what pescador does, it will help to establish some common terminology. +If you're not already familiar with Python's `iterator` and `generator` concepts, here's a quick synopsis: + +1. An `iterator` is an object that produces a sequence of data, i.e. via ``__next__`` / ``next()``. + + - See: `iterator definition `_, `Iterator Types `_ + +2. An `iterable` is an object that can produce iterators, i.e. via ``__iter__`` / ``iter()``. + + - See: `iterable definition `_ + +3. A `generator` (or more precisely `generator function`) is a callable object that returns a single iterator. + + - See: `generator definition `_ + +4. Pescador defines a `stream` as the sequence of objects produced by an iterator. + + +For example: + - ``range`` is an iterable function + - ``range(8)`` is an iterable, and its iterator produces the stream: ``0, 1, 2, 3, ...`` + + +.. _streaming-data: + +Streaming Data +-------------- +1. Pescador defines an object called a `Streamer` for the purposes of (re)creating iterators indefinitely and (optionally) interrupting them prematurely. + +2. `Streamer` implements the `iterable` interface, and can be iterated directly. + +3. A `Streamer` can be initialized with one of two types: + - Any iterable type, e.g. ``range(7)``, ``['foo', 'bar']``, ``"abcdef"``, or another ``Streamer`` + - A generator function and its arguments + keyword arguments. + +4. A `Streamer` transparently yields the data stream flowing through it + + - A `Streamer` should not modify objects in its stream. 
+ + - In the spirit of encapsulation, the modification of data streams is achieved through separate functionality (see :ref:`processing-data-streams`) + + +Multiplexing Data Streams +------------------------- +1. Pescador defines an object called a `Mux` for the purposes of multiplexing streams of data. + +2. `Mux` inherits from `Streamer`, which makes it both iterable and recomposable. Muxes allow you to + construct arbitrary trees of data streams. This is useful for hierarchical sampling. + +3. A `Mux` is initialized with a container of one or more iterables, and parameters to control the stochastic behavior of the object. + +4. As a subclass of `Streamer`, a `Mux` also transparently yields the stream flowing through it, i.e. :ref:`streaming-data`. + + +.. _processing-data-streams: + +Processing Data Streams +----------------------- +Pescador adopts the concept of "transformers" for processing data streams. + +1. A transformer takes as input a single object in the stream. + +2. A transformer yields an object. + +3. Transformers are iterators, i.e. implement a `__next__` method, to preserve iteration. + +4. An example of a built-in transformer is `enumerate` [`ref `_] diff --git a/pescador/core.py b/pescador/core.py index 54439af..833ae8c 100644 --- a/pescador/core.py +++ b/pescador/core.py @@ -237,8 +237,6 @@ def tuples(self, *items, **kwargs): def __call__(self, max_iter=None, cycle=False, max_batches=Deprecated()): '''Convenience interface for interacting with the Streamer. - TODO: max_iter > len(self.stream_) is incompatible with cycle. - Parameters ---------- max_iter : None or int > 0 From d3454009633a6c1a447cc38e4ec26bcc6ecc17ae Mon Sep 17 00:00:00 2001 From: Brian McFee Date: Fri, 25 Aug 2017 12:19:41 -0400 Subject: [PATCH 4/4] Update README.md minor doc updates minor doc updates minor doc updates minor doc updates factored out examples index [ci skip] rewording intro doc rewording intro doc [ci skip] why.rst [ci skip] why.rst [ci skip] --- README.md | 28 +++++++------ docs/conf.py | 1 + docs/examples.rst | 13 ++++++ docs/index.rst | 29 ++++++++----- docs/intro.rst | 3 +- docs/why.rst | 101 ++++++++++++++++++++++++++++++++++++++++++++++ 6 files changed, 152 insertions(+), 23 deletions(-) create mode 100644 docs/examples.rst create mode 100644 docs/why.rst diff --git a/README.md b/README.md index 8cc06ad..e6fa9ac 100644 --- a/README.md +++ b/README.md @@ -11,23 +11,27 @@ Pescador is a library for streaming (numerical) data, primarily for use in machi Pescador addresses the following use cases: - - **Hierarchical sampling** - - **Out-of-core learning** - - **Parallel streaming** +- **Hierarchical sampling** +- **Out-of-core learning** +- **Parallel streaming** These use cases arise in the following common scenarios: - - Say you have three data sources `(A, B, C)` that you want to sample. - Pescador can dynamically interleave these sources to provide a randomized stream `D <- (A, B, C)`. - The distribution over `(A, B, C)` need not be uniform: you can specify any distribution you like! +- Say you have three data sources `(A, B, C)` that you want to sample. + For example, each data source could contain all the examples of a particular category. - - Now, say you have 3000 data sources that you want to sample, and they're too large to all fit in RAM at - once. - Pescador makes it easy to interleave these sources while maintaining a small `working set`. - Not all sources are simultaneously active, but Pescador manages the working set so you don't have to. 
+ Pescador can dynamically interleave these sources to provide a randomized stream `D <- (A, B, C)`. + The distribution over `(A, B, C)` need not be uniform: you can specify any distribution you like! - - If loading data incurs substantial latency (e.g., due to storage access or pre-processing), this can slow down processing. - Pescador makes it easy to do this seamlessly in a background process, so that your main thread can continue working. +- Now, say you have 3000 data sources, each of which may contain a large number of samples. Maybe that's too much data to fit in RAM at once. + + Pescador makes it easy to interleave these sources while maintaining a small `working set`. + Not all sources are simultaneously active, but Pescador manages the working set so you don't have to. + +- If loading data incurs substantial latency (e.g., due to accessing storage access + or pre-processing), this can be a problem. + + Pescador can seamlessly move data generation into a background process, so that your main thread can continue working. Want to learn more? [Read the docs!](http://pescador.readthedocs.org) diff --git a/docs/conf.py b/docs/conf.py index d6b1b71..422ea93 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -32,6 +32,7 @@ 'sphinx.ext.autodoc', 'sphinx.ext.autosummary', 'sphinx.ext.intersphinx', + 'sphinx.ext.mathjax', # 'sphinx.ext.coverage', # 'sphinx.ext.viewcode', # 'sphinx.ext.doctest', diff --git a/docs/examples.rst b/docs/examples.rst new file mode 100644 index 0000000..60e85ed --- /dev/null +++ b/docs/examples.rst @@ -0,0 +1,13 @@ +.. _examples: + +************** +Basic examples +************** + +.. toctree:: + :maxdepth: 2 + + example1 + example2 + example3 + bufferedstreaming diff --git a/docs/index.rst b/docs/index.rst index de8d70c..ca958d1 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -19,17 +19,23 @@ Pescador addresses the following use cases: These use cases arise in the following common scenarios: - - Say you have three data sources `(A, B, C)` that you want to sample. + - Say you have three data sources `(A, B, C)` that you want to sample. + For example, each data source could contain all the examples of a particular category. + Pescador can dynamically interleave these sources to provide a randomized stream `D <- (A, B, C)`. The distribution over `(A, B, C)` need not be uniform: you can specify any distribution you like! - - Now, say you have 3000 data sources that you want to sample, and they're too large to all fit in RAM at - once. + - Now, say you have 3000 data sources, each of which may contain a large number of samples. Maybe that's too much data to fit in RAM at once. + Pescador makes it easy to interleave these sources while maintaining a small `working set`. Not all sources are simultaneously active, but Pescador manages the working set so you don't have to. + This way, you can process the full data set *out of core*, but using a bounded + amount of memory. + + - If loading data incurs substantial latency (e.g., due to accessing storage access + or pre-processing), this can be a problem. - - If loading data incurs substantial latency (e.g., due to storage access or pre-processing), this can slow down processing. - Pescador makes it easy to do this seamlessly in a background process, so that your main thread can continue working. + Pescador can seamlessly move data generation into a background process, so that your main thread can continue working. 
To make this all possible, Pescador provides the following utilities: @@ -66,16 +72,21 @@ Introduction intro +************* +Why Pescador? +************* +.. toctree:: + :maxdepth: 2 + + why + ************** Basic examples ************** .. toctree:: :maxdepth: 2 - example1 - example2 - example3 - bufferedstreaming + examples ***************** Advanced examples diff --git a/docs/intro.rst b/docs/intro.rst index f0db798..0ab03ea 100644 --- a/docs/intro.rst +++ b/docs/intro.rst @@ -1,8 +1,7 @@ .. _intro: -************ Introduction -************ +============ Pescador's primary goal is to provide fine-grained control over data streaming and sampling. These problems can get complex quickly, so this section provides an overview of the concepts underlying Pescador's design, and a summary of the provided functionality. diff --git a/docs/why.rst b/docs/why.rst new file mode 100644 index 0000000..9e0bc18 --- /dev/null +++ b/docs/why.rst @@ -0,0 +1,101 @@ +.. _why: + +Why Pescador? +============= + +Pescador was developed in response to a variety of recurring problems related to data streaming for training machine learning models. +After implementing custom solutions each time these problems occurred, we converged on a set of common solutions that can be applied more broadly. +The solutions provided by Pescador may or may not fit your problem. +This section of the documentation will attempt to help you figure out if Pescador is useful for your application. + + +Hierarchical sampling +--------------------- + +`Hierarchical sampling` refers to any process where you want to sample data from a distribution by conditioning on one or more variables. +For example, say you have a distribution over real-valued observations `X` and categorical labels `Y`, and you want to sample labeled observations `(X, Y)`. +A hierarchical sampler might first select a value for `Y`, and then randomly draw an example `X` that has that label. +This is equivalent to exploiting the laws of conditional probability: :math:`P[X, Y] = +P[X|Y] \times P[Y]`. + +Hierarchical sampling can be useful when dealing with highly imbalanced data, where it may sometimes be better to learn from a balanced sample and then explicitly correct for imbalance within the model. + +It can also be useful when dealing with data that has natural grouping substructure beyond categories. +For example, when modeling a large collection of audio files, each file may generate multiple observations, which will all be mutually correlated. +Hierarchical sampling can be useful in neutralizing this bias during the training process. + +Pescador implements hierarchical sampling via the :ref:`Mux` abstraction. +In its simplest form, `Mux` takes as input a set of :ref:`Streamer` objects from which samples are drawn randomly. +This effectively generates data by a process similar to the following pseudo-code: + +.. code-block:: python + :linenos: + + while True: + stream_id = random_choice(streamers) + yield next(streamers[stream_id]) + +The `Mux` object also lets you specify an arbitrary distribution over the set of streamers, giving you fine-grained control over the resulting distribution of samples. + + +The `Mux` object is also a `Streamer`, so sampling hierarchies can be nested arbitrarily deep. + +Out-of-core sampling +-------------------- + +Another common problem occurs when the dataset is too large to fit in RAM all at once.
+Going back to the audio example above, consider a problem where there are 30,000 source files, each of which generates 1GB of observation data, and the machine can only fit 100 source files in memory at any given time. + +To facilitate this use case, the `Mux` object allows you to specify a maximum number of simultaneously active streams (i.e., the *working set*). +In this case, you would most likely implement a `generator` for each file as follows: + +.. code-block:: python + :linenos: + + def sample_file(filename): + # Load observation data + X = np.load(filename) + + while True: + # Generate a random row as a dictionary + yield dict(X=X[np.random.choice(len(X))]) + + streamers = [pescador.Streamer(sample_file, fname) for fname in ALL_30K_FILES] + + for item in pescador.Mux(streamers, 100): + model.partial_fit(item['X']) + +Note that data is not loaded until the generator is instantiated. +If you specify a working set of size `k=100`, then `Mux` will select 100 streamers at random to form the working set, and only sample data from within that set. +`Mux` will then randomly evict streamers from the working set and replace them with new streamers, according to its `rate` parameter. +This results in a simple interface for drawing data from all input sources while using limited memory. + +`Mux` provides a great deal of flexibility over how streamers are replaced, what to do when streamers are exhausted, etc. + + +Parallel processing +------------------- + +In the above example, all of the data I/O was handled within the `generator` function. +If the generator requires high-latency operations such as disk access, this can become a computational bottleneck. + +Pescador makes it easy to migrate data generation into a background process, so that high-latency operations do not stall the main thread. +This is facilitated by the :ref:`ZMQStreamer` object, which acts as a simple wrapper around any streamer that produces samples in the form of dictionaries of numpy arrays. +Continuing the above example: + +.. code-block:: python + :linenos: + + mux_stream = pescador.Mux(streamers, 100) + + for item in pescador.ZMQStreamer(mux_stream): + model.partial_fit(item['X']) + + +Simple interface +---------------- +Finally, Pescador is intended to work with a variety of machine learning frameworks, such as `scikit-learn` and `Keras`. +While many frameworks provide custom tools for handling data pipelines, each one is different, and many require using specific data structures and formats. + +Pescador is meant to be framework-agnostic, allowing you to write your own data generation logic using standard Python data structures (dictionaries and numpy arrays). +We also provide helper utilities to integrate with `Keras`'s tuple generator interface.
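As a sketch of how that integration might look (assuming the `Streamer.tuples` method shown in `pescador/core.py` above, the `pescador.maps.buffer_stream` map referenced earlier, a hypothetical compiled Keras `model`, and the `noisy_samples` generator from the basic examples):

.. code-block:: python

    import pescador

    samples = pescador.Streamer(noisy_samples, X, Y)

    # Group individual samples into batch dictionaries of 32 items each
    batches = pescador.Streamer(pescador.maps.buffer_stream, samples, 32)

    # tuples('X', 'Y') re-keys each batch dictionary as an (X, Y) tuple,
    # the format expected by Keras's fit_generator interface
    model.fit_generator(batches.tuples('X', 'Y'),
                        steps_per_epoch=100, epochs=5)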