Commit
minor doc updates; factored out examples index; rewording intro doc; why.rst [ci skip]
Showing 6 changed files with 152 additions and 23 deletions.
@@ -0,0 +1,13 @@
.. _examples:

**************
Basic examples
**************

.. toctree::
    :maxdepth: 2

    example1
    example2
    example3
    bufferedstreaming
@@ -0,0 +1,101 @@
.. _why:

Why Pescador?
=============

Pescador was developed in response to a variety of recurring problems related to data streaming for training machine learning models.
After implementing custom solutions each time these problems occurred, we converged on a set of common solutions that can be applied more broadly.
The solutions provided by Pescador may or may not fit your problem.
This section of the documentation should help you decide whether Pescador is useful for your application.

Hierarchical sampling
---------------------

`Hierarchical sampling` refers to any process where you want to sample data from a distribution by conditioning on one or more variables.
For example, say you have a distribution over real-valued observations `X` and categorical labels `Y`, and you want to sample labeled observations `(X, Y)`.
A hierarchical sampler might first select a value for `Y`, and then randomly draw an example `X` that has that label.
This is equivalent to exploiting the laws of conditional probability: :math:`P[X, Y] = P[X|Y] \times P[Y]`.

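As a concrete sketch of this decomposition (plain numpy, independent of Pescador; the data and probabilities here are made up for illustration):

.. code-block:: python

    import numpy as np

    # Hypothetical data: observations grouped by label, with heavy imbalance
    data = {0: np.random.randn(1000, 5),      # label 0 is abundant
            1: np.random.randn(10, 5) + 3.0}  # label 1 is rare

    def hierarchical_sample(data, p_y):
        '''Draw (x, y) by first sampling y ~ P[Y], then x ~ P[X|Y].'''
        y = np.random.choice(list(data.keys()), p=p_y)
        x = data[y][np.random.choice(len(data[y]))]
        return x, y

    # Sample labels uniformly, regardless of how imbalanced the raw data is
    x, y = hierarchical_sample(data, p_y=[0.5, 0.5])

Here, changing `p_y` changes the marginal label distribution of the samples without touching the underlying data.
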
Hierarchical sampling can be useful when dealing with highly imbalanced data, where it may sometimes be better to learn from a balanced sample and then explicitly correct for imbalance within the model.

It can also be useful when dealing with data that has natural grouping substructure beyond categories.
For example, when modeling a large collection of audio files, each file may generate multiple observations, which will all be mutually correlated.
Hierarchical sampling can be useful in neutralizing this bias during the training process.

Pescador implements hierarchical sampling via the :ref:`Mux` abstraction.
In its simplest form, `Mux` takes as input a set of :ref:`Streamer` objects from which samples are drawn randomly.
This effectively generates data by a process similar to the following pseudo-code:

.. code-block:: python
    :linenos:

    while True:
        stream_id = random_choice(streamers)
        yield next(streamers[stream_id])

The `Mux` object also lets you specify an arbitrary distribution over the set of streamers, giving you fine-grained control over the resulting distribution of samples.

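As a minimal sketch of what this can look like, assuming a `weights` keyword argument on `Mux` (check the signature in your installed release):

.. code-block:: python

    from itertools import islice

    import numpy as np
    import pescador

    def noise(scale):
        '''Endlessly generate samples from a scaled normal distribution.'''
        while True:
            yield dict(X=scale * np.random.randn(1, 5))

    streamers = [pescador.Streamer(noise, scale) for scale in (1, 2, 3)]

    # Draw from the first streamer twice as often as the others
    mux = pescador.Mux(streamers, 3, weights=[0.5, 0.25, 0.25])

    for item in islice(mux, 10):
        print(item['X'].shape)
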
The `Mux` object is also a `Streamer`, so sampling hierarchies can be nested arbitrarily deep.

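Continuing the sketch above, nesting is just a matter of wrapping muxes inside another mux (again illustrative, using the same two-argument `Mux(streamers, k)` form used throughout this page):

.. code-block:: python

    # Re-using the `noise` generator from the previous sketch:
    group_a = pescador.Mux([pescador.Streamer(noise, s) for s in (1, 2)], 2)
    group_b = pescador.Mux([pescador.Streamer(noise, s) for s in (10, 20)], 2)

    # The outer Mux treats each inner Mux as a single stream
    outer = pescador.Mux([group_a, group_b], 2)
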
Out-of-core sampling
--------------------

Another common problem occurs when the dataset is too large to fit in the machine's RAM all at once.
Going back to the audio example above, consider a problem where there are 30,000 source files, each of which generates 1GB of observation data, and the machine can only fit 100 source files in memory at any given time.

To facilitate this use case, the `Mux` object allows you to specify a maximum number of simultaneously active streams (i.e., the *working set*).
In this case, you would most likely implement a `generator` for each file as follows:

.. code-block:: python
    :linenos:

    import numpy as np
    import pescador

    def sample_file(filename):
        # Load observation data
        X = np.load(filename)

        while True:
            # Generate a random row as a dictionary
            yield dict(X=X[np.random.choice(len(X))])

    # ALL_30K_FILES is the list of source file paths, defined elsewhere
    streamers = [pescador.Streamer(sample_file, fname) for fname in ALL_30K_FILES]

    # Keep a working set of (at most) 100 active streamers
    for item in pescador.Mux(streamers, 100):
        model.partial_fit(item['X'])

Note that no data is loaded until the generator is instantiated.
If you specify a working set of size `k=100`, then `Mux` will select 100 streamers at random to form the working set, and only sample data from within that set.
`Mux` will then randomly evict streamers from the working set and replace them with new streamers, according to its `rate` parameter.
This provides a simple interface for drawing data from all input sources while using limited memory.

`Mux` provides a great deal of flexibility over how streamers are replaced, what to do when streamers are exhausted, etc.

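For instance, a hedged sketch of tuning the replacement behavior, re-using `streamers` and `model` from above and assuming that the `rate` parameter controls roughly how many samples are drawn from each active streamer before it is evicted:

.. code-block:: python

    # Lower `rate` values cause active streamers to be swapped out sooner,
    # so the working set cycles through the full collection more quickly.
    mux = pescador.Mux(streamers, 100, rate=16)

    for item in mux:
        model.partial_fit(item['X'])
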
Parallel processing
-------------------

In the above example, all of the data I/O was handled within the `generator` function.
If the generator requires high-latency operations such as disk access, this can become a computational bottleneck.

Pescador makes it easy to migrate data generation into a background process, so that high-latency operations do not stall the main thread.
This is facilitated by the :ref:`ZMQStreamer` object, which acts as a simple wrapper around any streamer that produces samples in the form of dictionaries of numpy arrays.
Continuing the above example:

.. code-block:: python
    :linenos:

    # `streamers` and `model` are defined as in the previous example
    mux_stream = pescador.Mux(streamers, 100)

    # ZMQStreamer runs the stream in a background process and relays its samples
    for item in pescador.ZMQStreamer(mux_stream):
        model.partial_fit(item['X'])

Simple interface
----------------

Finally, Pescador is intended to work with a variety of machine learning frameworks, such as `scikit-learn` and `Keras`.
While many frameworks provide custom tools for handling data pipelines, each one is different, and many require using specific data structures and formats.

Pescador is meant to be framework-agnostic, allowing you to write your own data generation logic using standard Python data structures (dictionaries and numpy arrays).
We also provide helper utilities to integrate with `Keras`'s tuple generator interface.
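The helper utilities themselves are not shown here; as a hedged, self-contained sketch (with made-up file names and labels), a dictionary stream can be adapted to the `(inputs, targets)` tuples that Keras generators expect with a few lines of plain Python:

.. code-block:: python

    import numpy as np
    import pescador

    def labeled_samples(filename, label):
        '''Endlessly yield feature/label dictionaries from one file.'''
        X = np.load(filename)
        while True:
            i = np.random.choice(len(X))
            yield dict(X=X[i:i + 1], Y=np.array([label]))

    streamers = [pescador.Streamer(labeled_samples, fname, y)
                 for fname, y in [('a.npy', 0), ('b.npy', 1)]]
    mux = pescador.Mux(streamers, 2)

    def as_tuples(stream):
        '''Adapt Pescador's dictionaries to Keras-style (inputs, targets) tuples.'''
        for item in stream:
            yield item['X'], item['Y']

    # model.fit_generator(as_tuples(mux), steps_per_epoch=512, epochs=10)

Because the stream is just an iterable of standard Python objects, the same pattern adapts to any framework's preferred input format.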