Preprocessing script, and Checkpoint DataSource #116

Open · 1 of 4 tasks
nicholas-leonard opened this issue Feb 7, 2015 · 4 comments

@nicholas-leonard (Owner) commented Feb 7, 2015

Quoting from a recent discussion about pylearn2, Pascal Lamblin (@lamblin) offered some nice solutions to a problem both of our libraries share:

  • Consider Datasets immutable, and only allow read access to the data through an iterator. Current-style preprocessing could be done either by a separate script beforehand, or by a function that returns a different Dataset object. That would help make experiments checkpoint and restart.
  • Have an explicit pipeline of on-the-fly processing on minibatches between the Dataset and the Model. These transformations would not be part of the Theano graph, but would happen on the numeric data. They could be iterators not unlike TransformerIterator, but would not be limited to batch-by-batch processing, and could do things like caching, data augmentation, and data reordering.
While these solutions are offered for pylearn2, they also concern dp. The Preprocess objects currently modify DataSets in place, so preprocessing has to be redone every time you run an experiment. You could instead do it once and reuse the result across experiments. All you would need is a script to create the checkpoint and a means of referring to the resulting files from your experiment.

  • Change Preprocesses so that they work with Batches and produce an output (no in-place modification).
  • preprocess.lua: a script that applies a Preprocess to a DataSource and saves the resulting data to disk (sketched below).
  • Checkpoint: a generic DataSource that works with the output of the preprocess.lua script.
  • Common data format: hdf5 + view spec, or th7 + view spec (for now), with the view spec in JSON format.
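
A rough sketch of how preprocess.lua and the view spec might fit together. dp.Mnist and dp.Standardize exist today, but the script, the save layout, and the view-spec fields are all assumptions about the proposal, not existing dp behavior:

require 'dp'

-- hypothetical preprocess.lua: apply a Preprocess once, persist the result
local opt = {savePath = 'mnist_std'}

-- build the source with the Preprocess applied (this part works today)
local ds = dp.Mnist{input_preprocess = dp.Standardize()}

-- serialize the preprocessed datasource to th7 ...
torch.save(opt.savePath .. '.th7', ds)

-- ... plus a JSON view spec so a generic Checkpoint DataSource could
-- reload the tensors without knowing they came from MNIST
local f = assert(io.open(opt.savePath .. '.json', 'w'))
f:write('{"view": "bchw", "inputShape": [1, 28, 28], "classes": 10}')
f:close()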
@lamblin commented Feb 8, 2015

Such a solution (credit mostly goes to @dwf, actually) has also been implemented recently in Blocks by @bartvm, if you want to have a look.

@bartvm commented Feb 8, 2015

We call the on-the-fly preprocessors data streams, while the datasets themselves are immutable. For a lengthier discussion on how we do checkpointing you can look here.

@nicholas-leonard (Owner)

@bartvm Very nice package, this Blocks. I am definitely using it as a reference point. Love the docs. Thanks.

@nicholas-leonard (Owner)

Found an intermediate quick-fix solution for checkpoints: bbeeeab

-- build the preprocessed datasource, or load it from an existing checkpoint
local datasource = torch.checkpoint(checkpointPath, function()
   return dp.Mnist{input_preprocess = input_preprocess}
end)
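
If torch.checkpoint works the way its name suggests, the first run executes the closure, builds and preprocesses MNIST, and serializes the result to checkpointPath; later runs just deserialize that file, so the preprocessing cost is paid only once.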
