Preprocessing script, and Checkpoint DataSource #116

Open · 1 of 4 tasks
nicholas-leonard opened this issue Feb 7, 2015 · 4 comments

@nicholas-leonard (Owner) commented Feb 7, 2015

Quoting from a recent discussion about pylearn2, Pascal Lamblin (@lamblin) offered some nice solutions to a problem both of our libraries share:

  • Consider Datasets immutable, and only allow read access to the data through an iterator. Current-style preprocessing could be done either by a separate script beforehand, or by a function that returns a different Dataset object. That would help make experiments checkpoint and restart.
  • Have an explicit pipeline of on-the-fly processing on minibatches between the Dataset and the Model. These transformations would not be part of the Theano graph, but would happen on the numeric data. They could be iterators not unlike TransformerIterator, but would not be limited to batch-by-batch processing, and could do things like caching, data augmentation, and data reordering.
While these solutions are offered for pylearn2, they also concern dp. The Preprocess objects currently modify DataSets in place, so preprocessing has to be redone every time you run an experiment. You could instead do it once and reuse the result across experiments. All you would need is a script to create the checkpoint and a means of referring to the resulting files from your experiment.

  • Change Preprocesses so that they work with Batches and produce an output (no in-place modification).
  • preprocess.lua: a script that applies a Preprocess to a DataSource and saves the resulting data to disk (sketched below).
  • Checkpoint: a generic DataSource that works with the output of the preprocess.lua script.
  • Common data format: hdf5 + view spec, or th7 + view spec (for now), with the view spec in JSON format.
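
A rough sketch of how preprocess.lua and the view spec might fit together. dp.Mnist and dp.Standardize exist today, but the script, the save layout, and the view-spec fields are all assumptions about the proposal, not existing dp behavior:

require 'dp'

-- hypothetical preprocess.lua: apply a Preprocess once, persist the result
local opt = {savePath = 'mnist_std'}

-- build the source with the Preprocess applied (this part works today)
local ds = dp.Mnist{input_preprocess = dp.Standardize()}

-- serialize the preprocessed datasource to th7 ...
torch.save(opt.savePath .. '.th7', ds)

-- ... plus a JSON view spec so a generic Checkpoint DataSource could
-- reload the tensors without knowing they came from MNIST
local f = assert(io.open(opt.savePath .. '.json', 'w'))
f:write('{"view": "bchw", "inputShape": [1, 28, 28], "classes": 10}')
f:close()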
@lamblin commented Feb 8, 2015

Such a solution (credit mostly goes to @dwf, actually) has also been implemented recently in Blocks by @bartvm, if you want to have a look.

@bartvm commented Feb 8, 2015

We call the on-the-fly preprocessors data streams, while the datasets themselves are immutable. For a lengthier discussion on how we do checkpointing you can look here.

@nicholas-leonard (Owner)

@bartvm Very nice package, this Blocks. I am definitely using it as a reference point. Love the docs. Thanks.

@nicholas-leonard (Owner)

Found an intermediate quick-fix solution for checkpoints: bbeeeab

-- build the preprocessed datasource, or load it from an existing checkpoint
local datasource = torch.checkpoint(checkpointPath, function()
   return dp.Mnist{input_preprocess = input_preprocess}
end)
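
If torch.checkpoint works the way its name suggests, the first run executes the closure, builds and preprocesses MNIST, and serializes the result to checkpointPath; later runs just deserialize that file, so the preprocessing cost is paid only once.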
