From d3454009633a6c1a447cc38e4ec26bcc6ecc17ae Mon Sep 17 00:00:00 2001 From: Brian McFee Date: Fri, 25 Aug 2017 12:19:41 -0400 Subject: [PATCH] Update README.md minor doc updates minor doc updates minor doc updates minor doc updates factored out examples index [ci skip] rewording intro doc rewording intro doc [ci skip] why.rst [ci skip] why.rst [ci skip] --- README.md | 28 +++++++------ docs/conf.py | 1 + docs/examples.rst | 13 ++++++ docs/index.rst | 29 ++++++++----- docs/intro.rst | 3 +- docs/why.rst | 101 ++++++++++++++++++++++++++++++++++++++++++++++ 6 files changed, 152 insertions(+), 23 deletions(-) create mode 100644 docs/examples.rst create mode 100644 docs/why.rst diff --git a/README.md b/README.md index 8cc06ad..e6fa9ac 100644 --- a/README.md +++ b/README.md @@ -11,23 +11,27 @@ Pescador is a library for streaming (numerical) data, primarily for use in machi Pescador addresses the following use cases: - - **Hierarchical sampling** - - **Out-of-core learning** - - **Parallel streaming** +- **Hierarchical sampling** +- **Out-of-core learning** +- **Parallel streaming** These use cases arise in the following common scenarios: - - Say you have three data sources `(A, B, C)` that you want to sample. - Pescador can dynamically interleave these sources to provide a randomized stream `D <- (A, B, C)`. - The distribution over `(A, B, C)` need not be uniform: you can specify any distribution you like! +- Say you have three data sources `(A, B, C)` that you want to sample. + For example, each data source could contain all the examples of a particular category. - - Now, say you have 3000 data sources that you want to sample, and they're too large to all fit in RAM at - once. - Pescador makes it easy to interleave these sources while maintaining a small `working set`. - Not all sources are simultaneously active, but Pescador manages the working set so you don't have to. + Pescador can dynamically interleave these sources to provide a randomized stream `D <- (A, B, C)`. 
+ The distribution over `(A, B, C)` need not be uniform: you can specify any distribution you like! - - Now, say you have 3000 data sources that you want to sample, and they're too large to all fit in RAM at - once. - Pescador makes it easy to interleave these sources while maintaining a small `working set`. - Not all sources are simultaneously active, but Pescador manages the working set so you don't have to. + +- Now, say you have 3000 data sources, each of which may contain a large number of samples. Maybe that's too much data to fit in RAM at once. + + Pescador makes it easy to interleave these sources while maintaining a small `working set`. + Not all sources are simultaneously active, but Pescador manages the working set so you don't have to. + +- If loading data incurs substantial latency (e.g., due to accessing storage + or pre-processing), this can be a problem. + + Pescador can seamlessly move data generation into a background process, so that your main thread can continue working. Want to learn more? [Read the docs!](http://pescador.readthedocs.org) diff --git a/docs/conf.py b/docs/conf.py index d6b1b71..422ea93 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -32,6 +32,7 @@ 'sphinx.ext.autodoc', 'sphinx.ext.autosummary', 'sphinx.ext.intersphinx', + 'sphinx.ext.mathjax', # 'sphinx.ext.coverage', # 'sphinx.ext.viewcode', # 'sphinx.ext.doctest', diff --git a/docs/examples.rst b/docs/examples.rst new file mode 100644 index 0000000..60e85ed --- /dev/null +++ b/docs/examples.rst @@ -0,0 +1,13 @@ +.. _examples: + +************** +Basic examples +************** + +.. toctree:: + :maxdepth: 2 + + example1 + example2 + example3 + bufferedstreaming diff --git a/docs/index.rst b/docs/index.rst index de8d70c..ca958d1 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -19,17 +19,23 @@ Pescador addresses the following use cases: These use cases arise in the following common scenarios: - - Say you have three data sources `(A, B, C)` that you want to sample.
+ For example, each data source could contain all the examples of a particular category. + Pescador can dynamically interleave these sources to provide a randomized stream `D <- (A, B, C)`. The distribution over `(A, B, C)` need not be uniform: you can specify any distribution you like! - - Now, say you have 3000 data sources that you want to sample, and they're too large to all fit in RAM at - once. + - Now, say you have 3000 data sources, each of which may contain a large number of samples. Maybe that's too much data to fit in RAM at once. + Pescador makes it easy to interleave these sources while maintaining a small `working set`. Not all sources are simultaneously active, but Pescador manages the working set so you don't have to. + This way, you can process the full data set *out of core*, but using a bounded + amount of memory. + + - If loading data incurs substantial latency (e.g., due to accessing storage + or pre-processing), this can be a problem. - - If loading data incurs substantial latency (e.g., due to storage access or pre-processing), this can slow down processing. - Pescador makes it easy to do this seamlessly in a background process, so that your main thread can continue working. + Pescador can seamlessly move data generation into a background process, so that your main thread can continue working. To make this all possible, Pescador provides the following utilities: @@ -66,16 +72,21 @@ Introduction intro +************* +Why Pescador? +************* +.. toctree:: + :maxdepth: 2 + + why + ************** Basic examples ************** .. toctree:: :maxdepth: 2 - example1 - example2 - example3 - bufferedstreaming + examples ***************** Advanced examples ***************** diff --git a/docs/intro.rst b/docs/intro.rst index f0db798..0ab03ea 100644 --- a/docs/intro.rst +++ b/docs/intro.rst @@ -1,8 +1,7 @@ ..
_intro: -************ Introduction -************ +============ Pescador's primary goal is to provide fine-grained control over data streaming and sampling. These problems can get complex quickly, so this section provides an overview of the concepts underlying diff --git a/docs/why.rst b/docs/why.rst new file mode 100644 index 0000000..9e0bc18 --- /dev/null +++ b/docs/why.rst @@ -0,0 +1,101 @@ +.. _why: + +Why Pescador? +============= + +Pescador was developed in response to a variety of recurring problems related to data streaming for training machine learning models. +After implementing custom solutions each time these problems occurred, we converged on a set of common solutions that can be applied more broadly. +The solutions provided by Pescador may or may not fit your problem. +This section of the documentation will attempt to help you figure out if Pescador is useful for your application. + + +Hierarchical sampling +--------------------- + +`Hierarchical sampling` refers to any process where you want to sample data from a distribution by conditioning on one or more variables. +For example, say you have a distribution over real-valued observations `X` and categorical labels `Y`, and you want to sample labeled observations `(X, Y)`. +A hierarchical sampler might first select a value for `Y`, and then randomly draw an example `X` that has that label. +This is equivalent to exploiting the laws of conditional probability: :math:`P[X, Y] = +P[X|Y] \times P[Y]`. + +Hierarchical sampling can be useful when dealing with highly imbalanced data, where it may sometimes be better to learn from a balanced sample and then explicitly correct for imbalance within the model. + +It can also be useful when dealing with data that has natural grouping substructure beyond categories. +For example, when modeling a large collection of audio files, each file may generate multiple observations, which will all be mutually correlated. 
+Hierarchical sampling can be useful in neutralizing this bias during the training process. + +Pescador implements hierarchical sampling via the :ref:`Mux` abstraction. +In its simplest form, `Mux` takes as input a set of :ref:`Streamer` objects from which samples are drawn randomly. +This effectively generates data by a process similar to the following pseudo-code: + +.. code-block:: python + :linenos: + + while True: + stream_id = random_choice(streamers) + yield next(streamers[stream_id]) + +The `Mux` object also lets you specify an arbitrary distribution over the set of streamers, giving you fine-grained control over the resulting distribution of samples. + + +The `Mux` object is also a `Streamer`, so sampling hierarchies can be nested arbitrarily deep. + +Out-of-core sampling +-------------------- + +Another common problem occurs when the size of the dataset is too large for the machine to fit in RAM simultaneously. +Going back to the audio example above, consider a problem where there are 30,000 source files, each of which generates 1GB of observation data, and the machine can only fit 100 source files in memory at any given time. + +To facilitate this use case, the `Mux` object allows you to specify a maximum number of simultaneously active streams (i.e., the *working set*). +In this case, you would most likely implement a `generator` for each file as follows: + +.. code-block:: python + :linenos: + + def sample_file(filename): + # Load observation data + X = np.load(filename) + + while True: + # Generate a random row as a dictionary + yield dict(X=X[np.random.choice(len(X))]) + + streamers = [pescador.Streamer(sample_file, fname) for fname in ALL_30K_FILES] + + for item in pescador.Mux(streamers, 100): + model.partial_fit(item['X']) + +Note that data is not loaded until the generator is instantiated. 
+If you specify a working set of size `k=100`, then `Mux` will select 100 streamers at random to form the working set, and only sample data from within that set. +`Mux` will then randomly evict streamers from the working set and replace them with new streamers, according to its `rate` parameter. +This results in a simple interface to draw data from all input sources while using only a limited amount of memory. + +`Mux` provides a great deal of flexibility over how streamers are replaced, what to do when streamers are exhausted, etc. + + +Parallel processing +------------------- + +In the above example, all of the data I/O was handled within the `generator` function. +If the generator requires high-latency operations such as disk access, this can become a computational bottleneck. + +Pescador makes it easy to migrate data generation into a background process, so that high-latency operations do not stall the main thread. +This is facilitated by the :ref:`ZMQStreamer` object, which acts as a simple wrapper around any streamer that produces samples in the form of dictionaries of numpy arrays. +Continuing the above example: + +.. code-block:: python + :linenos: + + mux_stream = pescador.Mux(streamers, 100) + + for item in pescador.ZMQStreamer(mux_stream): + model.partial_fit(item['X']) + + +Simple interface +---------------- +Finally, Pescador is intended to work with a variety of machine learning frameworks, such as `scikit-learn` and `Keras`. +While many frameworks provide custom tools for handling data pipelines, each one is different, and many require using specific data structures and formats. + +Pescador is meant to be framework-agnostic, allowing you to write your own data generation logic using standard Python data structures (dictionaries and numpy arrays). +We also provide helper utilities to integrate with `Keras`'s tuple generator interface.
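The weighted working-set sampling that `docs/why.rst` describes for `Mux` can be sketched in plain Python. This is an illustrative toy, not pescador's actual implementation: `stream_from` and `toy_mux` are hypothetical names, and real `Mux` additionally handles stream exhaustion, replacement via the `rate` parameter, and nesting.

```python
import itertools
import random

def stream_from(source):
    # Stand-in for a pescador.Streamer: an endless stream over one source.
    return itertools.cycle(source)

def toy_mux(sources, k, weights=None, n_samples=10, seed=0):
    """Draw n_samples by weighted sampling over a working set of k
    active streams (a simplified sketch of Mux's behavior)."""
    rng = random.Random(seed)
    # Activate a working set of k streams chosen at random.
    active_ids = rng.sample(range(len(sources)), k)
    active = [stream_from(sources[i]) for i in active_ids]
    w = [weights[i] for i in active_ids] if weights is not None else None
    for _ in range(n_samples):
        # Pick an active stream according to the weights, then draw from it.
        stream = rng.choices(active, weights=w)[0]
        yield next(stream)

# Three sources, e.g. one per category, sampled with a non-uniform distribution.
A, B, C = ["a1", "a2"], ["b1", "b2"], ["c1"]
samples = list(toy_mux([A, B, C], k=3, weights=[0.5, 0.25, 0.25]))
```

Because every draw goes through the working set, only `k` streams ever need to be active at once, which is the mechanism that bounds memory in the out-of-core scenario above.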