Now that we have data in online sharing sites, we can start processing it. When processing, we want to keep the following best practices in mind:
- Avoid duplication of code: maximize re-use
- Keep track of the full state with version identifiers
- Make sure all analysis is tested
- Data, code, and documentation are coupled
The latter three topics are covered in further detail in the upcoming sessions on revision control, regression testing, and literate programming.
This is one of the best existing resources for reproducible research practices.
- Learn it !
- Love it !
When it becomes desireable to re-use code outside of the notebook, it is helpful to incrementally refactor by creating classes or functions. These classes or functions can be stored in python modules, and imported into the Notebook.
There are few helpful functions for this type of rapid development. In Python 2,
the reload built-in, and in Python 3,
imp.reload which
forces the interpreter to reload modules whose cached version is used by import
statements. Instead of manually calling these functions, there is also an
autoreload
IPython magic,
%load_ext autoreload
%autoreload 2
Run through the 03-DataProcessing.ipynb notebook in the repository. At the end of the notebook, there is a simple exercise on code re-use. Explore analysis on your images or images of nearby groups. Save the updated notebook to disk.