Update README.md
minor doc updates

factored out examples index [ci skip]

rewording intro doc

rewording intro doc [ci skip]

why.rst [ci skip]

bmcfee committed Aug 25, 2017
1 parent d1fba77 commit d345400
Showing 6 changed files with 152 additions and 23 deletions.
28 changes: 16 additions & 12 deletions README.md
@@ -11,23 +11,27 @@ Pescador is a library for streaming (numerical) data, primarily for use in machi

Pescador addresses the following use cases:

- **Hierarchical sampling**
- **Out-of-core learning**
- **Parallel streaming**
- **Hierarchical sampling**
- **Out-of-core learning**
- **Parallel streaming**

These use cases arise in the following common scenarios:

- Say you have three data sources `(A, B, C)` that you want to sample.
Pescador can dynamically interleave these sources to provide a randomized stream `D <- (A, B, C)`.
The distribution over `(A, B, C)` need not be uniform: you can specify any distribution you like!
- Say you have three data sources `(A, B, C)` that you want to sample.
For example, each data source could contain all the examples of a particular category.

- Now, say you have 3000 data sources that you want to sample, and they're too large to all fit in RAM at
once.
Pescador makes it easy to interleave these sources while maintaining a small `working set`.
Not all sources are simultaneously active, but Pescador manages the working set so you don't have to.
Pescador can dynamically interleave these sources to provide a randomized stream `D <- (A, B, C)`.
The distribution over `(A, B, C)` need not be uniform: you can specify any distribution you like!

- If loading data incurs substantial latency (e.g., due to storage access or pre-processing), this can slow down processing.
Pescador makes it easy to do this seamlessly in a background process, so that your main thread can continue working.
- Now, say you have 3000 data sources, each of which may contain a large number of samples. Maybe that's too much data to fit in RAM at once.

Pescador makes it easy to interleave these sources while maintaining a small `working set`.
Not all sources are simultaneously active, but Pescador manages the working set so you don't have to.

- If loading data incurs substantial latency (e.g., due to storage access
or pre-processing), this can be a problem.

Pescador can seamlessly move data generation into a background process, so that your main thread can continue working.


Want to learn more? [Read the docs!](http://pescador.readthedocs.org)
1 change: 1 addition & 0 deletions docs/conf.py
@@ -32,6 +32,7 @@
'sphinx.ext.autodoc',
'sphinx.ext.autosummary',
'sphinx.ext.intersphinx',
'sphinx.ext.mathjax',
# 'sphinx.ext.coverage',
# 'sphinx.ext.viewcode',
# 'sphinx.ext.doctest',
13 changes: 13 additions & 0 deletions docs/examples.rst
@@ -0,0 +1,13 @@
.. _examples:

**************
Basic examples
**************

.. toctree::
:maxdepth: 2

example1
example2
example3
bufferedstreaming
29 changes: 20 additions & 9 deletions docs/index.rst
@@ -19,17 +19,23 @@ Pescador addresses the following use cases:

These use cases arise in the following common scenarios:

- Say you have three data sources `(A, B, C)` that you want to sample.
- Say you have three data sources `(A, B, C)` that you want to sample.
For example, each data source could contain all the examples of a particular category.

Pescador can dynamically interleave these sources to provide a randomized stream `D <- (A, B, C)`.
The distribution over `(A, B, C)` need not be uniform: you can specify any distribution you like!

- Now, say you have 3000 data sources that you want to sample, and they're too large to all fit in RAM at
once.
- Now, say you have 3000 data sources, each of which may contain a large number of samples. Maybe that's too much data to fit in RAM at once.

Pescador makes it easy to interleave these sources while maintaining a small `working set`.
Not all sources are simultaneously active, but Pescador manages the working set so you don't have to.
This way, you can process the full data set *out of core*, but using a bounded
amount of memory.

- If loading data incurs substantial latency (e.g., due to storage access
or pre-processing), this can be a problem.

- If loading data incurs substantial latency (e.g., due to storage access or pre-processing), this can slow down processing.
Pescador makes it easy to do this seamlessly in a background process, so that your main thread can continue working.
Pescador can seamlessly move data generation into a background process, so that your main thread can continue working.


To make this all possible, Pescador provides the following utilities:
@@ -66,16 +72,21 @@ Introduction

intro

*************
Why Pescador?
*************
.. toctree::
:maxdepth: 2

why

**************
Basic examples
**************
.. toctree::
:maxdepth: 2

example1
example2
example3
bufferedstreaming
examples

*****************
Advanced examples
3 changes: 1 addition & 2 deletions docs/intro.rst
@@ -1,8 +1,7 @@
.. _intro:

************
Introduction
************
============

Pescador's primary goal is to provide fine-grained control over data streaming and sampling.
These problems can get complex quickly, so this section provides an overview of the concepts underlying
101 changes: 101 additions & 0 deletions docs/why.rst
@@ -0,0 +1,101 @@
.. _why:

Why Pescador?
=============

Pescador was developed in response to a variety of recurring problems related to data streaming for training machine learning models.
After implementing custom solutions each time these problems occurred, we converged on a set of common solutions that can be applied more broadly.
The solutions provided by Pescador may or may not fit your problem.
This section of the documentation will attempt to help you figure out if Pescador is useful for your application.


Hierarchical sampling
---------------------

`Hierarchical sampling` refers to any process where you want to sample data from a distribution by conditioning on one or more variables.
For example, say you have a distribution over real-valued observations `X` and categorical labels `Y`, and you want to sample labeled observations `(X, Y)`.
A hierarchical sampler might first select a value for `Y`, and then randomly draw an example `X` that has that label.
This follows from the definition of conditional probability: :math:`P[X, Y] = P[X|Y] \times P[Y]`.
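The two-step factorization can be sketched in plain Python. This is only an illustration of the idea, not Pescador code: the function `hierarchical_sample`, the toy data, and the label weights are all hypothetical names and values chosen for the example.

```python
import random
from collections import Counter


def hierarchical_sample(examples_by_label, label_weights, rng):
    """Draw one (x, y) pair: first y ~ P[Y], then x ~ P[X | Y = y]."""
    labels = list(examples_by_label)
    weights = [label_weights[label] for label in labels]
    # Step 1: choose the label according to the chosen P[Y].
    y = rng.choices(labels, weights=weights, k=1)[0]
    # Step 2: choose an example uniformly among those carrying that label.
    x = rng.choice(examples_by_label[y])
    return x, y


# Toy, highly imbalanced data: 1000 "common" examples and 10 "rare" ones.
examples_by_label = {
    "common": [float(i) for i in range(1000)],
    "rare": [float(-i) for i in range(1, 11)],
}
# Choosing labels with equal probability gives a balanced stream.
label_weights = {"common": 0.5, "rare": 0.5}

rng = random.Random(0)
draws = [hierarchical_sample(examples_by_label, label_weights, rng)
         for _ in range(2000)]
label_counts = Counter(y for _, y in draws)
```

Here `label_counts` should come out roughly balanced between the two labels, even though the underlying data is imbalanced by a factor of 100.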

Hierarchical sampling can be useful when dealing with highly imbalanced data, where it may sometimes be better to learn from a balanced sample and then explicitly correct for imbalance within the model.

It can also be useful when dealing with data that has natural grouping substructure beyond categories.
For example, when modeling a large collection of audio files, each file may generate multiple observations, which will all be mutually correlated.
Hierarchical sampling can be useful in neutralizing this bias during the training process.

Pescador implements hierarchical sampling via the :ref:`Mux` abstraction.
In its simplest form, `Mux` takes as input a set of :ref:`Streamer` objects from which samples are drawn randomly.
This effectively generates data by a process similar to the following pseudo-code:

.. code-block:: python
    :linenos:

    while True:
        stream_id = random_choice(streamers)
        yield next(streamers[stream_id])

The `Mux` object also lets you specify an arbitrary distribution over the set of streamers, giving you fine-grained control over the resulting distribution of samples.


The `Mux` object is also a `Streamer`, so sampling hierarchies can be nested arbitrarily deep.
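To make the weighting and nesting ideas concrete, here is a minimal sketch in plain Python. It does not use or reproduce Pescador's `Mux`; `weighted_mux` and `constant` are hypothetical helpers, and the weights are arbitrary values chosen for illustration.

```python
import itertools
import random


def weighted_mux(streams, weights, rng):
    """Interleave iterators, picking each source with probability
    proportional to its weight. Stops when a chosen source is exhausted."""
    iterators = [iter(s) for s in streams]
    indices = list(range(len(iterators)))
    while True:
        idx = rng.choices(indices, weights=weights, k=1)[0]
        try:
            yield next(iterators[idx])
        except StopIteration:
            return


def constant(value):
    """An infinite stream that always yields the same value."""
    return itertools.repeat(value)


rng = random.Random(0)
# Inner mux: A and B are equally likely.
inner = weighted_mux([constant("A"), constant("B")], [0.5, 0.5], rng)
# Outer mux: the inner mux is itself a stream, so it can be nested.
outer = weighted_mux([inner, constant("C")], [0.9, 0.1], rng)
sample = list(itertools.islice(outer, 1000))
```

Because the inner multiplexer is just another iterator, the outer one treats it like any other source. In this sketch, `C` should make up roughly a tenth of `sample`, with `A` and `B` splitting the rest.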

Out-of-core sampling
--------------------

Another common problem occurs when the dataset is too large to fit in the machine's RAM all at once.
Going back to the audio example above, consider a problem where there are 30,000 source files, each of which generates 1GB of observation data, and the machine can only fit 100 source files in memory at any given time.

To facilitate this use case, the `Mux` object allows you to specify a maximum number of simultaneously active streams (i.e., the *working set*).
In this case, you would most likely implement a `generator` for each file as follows:

.. code-block:: python
    :linenos:

    import numpy as np
    import pescador

    def sample_file(filename):
        # Load the observation data for this file
        X = np.load(filename)
        while True:
            # Yield one randomly chosen row, wrapped in a dictionary
            yield dict(X=X[np.random.choice(len(X))])

    # ALL_30K_FILES (a list of file paths) and model are assumed to be defined elsewhere
    streamers = [pescador.Streamer(sample_file, fname) for fname in ALL_30K_FILES]

    # Keep at most 100 streamers active at any time
    for item in pescador.Mux(streamers, 100):
        model.partial_fit(item['X'])

Note that data is not loaded until the generator is instantiated.
If you specify a working set of size `k=100`, then `Mux` will select 100 streamers at random to form the working set, and only sample data from within that set.
`Mux` will then randomly evict streamers from the working set and replace them with new streamers, according to its `rate` parameter.
The result is a simple interface for drawing data from all input sources while using a bounded amount of memory.

`Mux` provides a great deal of flexibility over how streamers are replaced, what to do when streamers are exhausted, etc.
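The working-set idea can be sketched without Pescador as follows. This is a simplified illustration, not Pescador's algorithm: `working_set_mux` is a hypothetical function, each active stream gets a uniformly random sample budget (how the `rate` parameter governs replacement in `Mux` is not reproduced here), sources are assumed to be infinite, and a source may be activated more than once.

```python
import random


def working_set_mux(make_stream_fns, k, max_samples_per_stream, rng):
    """Sample from at most k active streams, replacing each one
    after a random number of draws."""

    def activate():
        # Instantiate a randomly chosen source lazily, with a sample budget.
        make_stream = rng.choice(make_stream_fns)
        return [make_stream(), rng.randint(1, max_samples_per_stream)]

    active = [activate() for _ in range(k)]
    while True:
        slot = rng.randrange(k)
        stream, budget = active[slot]
        yield next(stream)
        budget -= 1
        # Evict the stream once its budget is spent and activate a new one.
        active[slot] = activate() if budget == 0 else [stream, budget]


def make_source(source_id):
    """Return a function that creates an infinite stream of `source_id`."""
    def stream():
        while True:
            yield source_id
    return stream


sources = [make_source(i) for i in range(50)]
rng = random.Random(0)
mux = working_set_mux(sources, k=5, max_samples_per_stream=20, rng=rng)
seen = [next(mux) for _ in range(5000)]
```

Only five generators exist at any moment, yet over a long run the output draws from many more than five of the 50 sources. Creating each generator lazily inside `activate` is what keeps memory bounded: a source costs nothing until it enters the working set.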


Parallel processing
-------------------

In the above example, all of the data I/O was handled within the `generator` function.
If the generator requires high-latency operations such as disk access, this can become a bottleneck.

Pescador makes it easy to migrate data generation into a background process, so that high-latency operations do not stall the main thread.
This is facilitated by the :ref:`ZMQStreamer` object, which acts as a simple wrapper around any streamer that produces samples in the form of dictionaries of numpy arrays.
Continuing the above example:

.. code-block:: python
    :linenos:

    mux_stream = pescador.Mux(streamers, 100)

    for item in pescador.ZMQStreamer(mux_stream):
        model.partial_fit(item['X'])

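The underlying producer/consumer pattern can be illustrated with the standard library alone. Note the substitution: this sketch uses a background thread and an in-process queue, not a separate process or ZeroMQ, so it shows the general idea only and says nothing about how `ZMQStreamer` is implemented. The names `background`, `max_buffered`, and `slow_stream` are hypothetical.

```python
import itertools
import queue
import threading

_DONE = object()  # sentinel marking the end of the wrapped stream


def background(iterable, max_buffered=8):
    """Yield items from `iterable`, which is consumed in a background thread."""
    buffer = queue.Queue(maxsize=max_buffered)

    def produce():
        for item in iterable:
            buffer.put(item)  # blocks while the buffer is full
        buffer.put(_DONE)

    threading.Thread(target=produce, daemon=True).start()
    while True:
        item = buffer.get()  # blocks until the producer has an item ready
        if item is _DONE:
            return
        yield item


def slow_stream():
    """Stand-in for a generator that performs high-latency loading."""
    for i in itertools.count():
        yield {"X": i}


items = list(itertools.islice(background(slow_stream()), 5))
finite = list(background(iter(range(3))))
```

The bounded queue lets the producer run ahead of the consumer by at most `max_buffered` items, so the consumer rarely waits on loading while memory use stays limited.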
Simple interface
----------------
Finally, Pescador is intended to work with a variety of machine learning frameworks, such as `scikit-learn` and `Keras`.
While many frameworks provide custom tools for handling data pipelines, each one is different, and many require using specific data structures and formats.

Pescador is meant to be framework-agnostic, and allow you to write your own data generation logic using standard Python data structures (dictionaries and numpy arrays).
We also provide helper utilities to integrate with `Keras`'s tuple generator interface.
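As an illustration of what adapting a dictionary stream to a tuple-based interface involves, here is a minimal sketch. `dict_to_tuples` is a hypothetical helper written for this example and is not one of Pescador's utilities; the keys `"X"` and `"Y"` are likewise assumptions.

```python
def dict_to_tuples(stream, input_key, output_key):
    """Convert a stream of dictionaries into (input, output) tuples."""
    for item in stream:
        yield item[input_key], item[output_key]


# Toy stream of samples in dictionary form.
batches = [
    {"X": [0.0, 1.0], "Y": 0},
    {"X": [2.0, 3.0], "Y": 1},
]
pairs = list(dict_to_tuples(batches, "X", "Y"))
```

Keeping the data generation logic in dictionary form and converting only at the boundary means the same generator can feed different frameworks.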
