Data Processing

Now that we have data in online sharing sites, we can start processing it. When processing, we want to keep the following best practices in mind:

  • Avoid duplication of code: maximize re-use
  • Keep track of the full state with version identifiers
  • Make sure all analysis is tested
  • Data, code, and documentation are coupled

The latter three topics are covered in further detail in the upcoming sessions on revision control, regression testing, and literate programming.

IPython Notebook

The IPython Notebook is one of the best existing tools for reproducible research practices.

  • Learn it!
  • Love it!

When it becomes desirable to re-use code outside of the notebook, it is helpful to refactor incrementally by creating classes or functions. These classes or functions can be stored in Python modules and imported into the Notebook.
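As a minimal sketch of this kind of refactoring, assume a notebook cell that rescales a list of numbers has been moved into a function. The module name `processing.py` and the function `rescale` are hypothetical and only serve as an illustration; they are not part of the repository.

```python
# Contents of a hypothetical module file, e.g. processing.py,
# saved in the same directory as the notebook.

def rescale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale a sequence of numbers to [new_min, new_max]."""
    values = list(values)
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical: map everything to new_min
        # instead of dividing by zero.
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]

# In a notebook cell, the function would then be used as:
#   from processing import rescale
#   rescale([2, 4, 6])
```

Once the logic lives in a module, every notebook that needs it can import the same function instead of carrying its own copy.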

No Copy Paste

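There are a few helpful functions for this type of rapid development. In Python 2, the reload built-in, and in Python 3, importlib.reload (imp.reload in older Python 3 releases; the imp module is deprecated and was removed in Python 3.12), force the interpreter to re-execute a module that has already been imported, rather than reusing the cached version that import statements would otherwise return. Instead of calling these functions manually, you can use the autoreload IPython magic: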


%load_ext autoreload
%autoreload 2
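For comparison, the manual approach outside of IPython can be sketched as follows, assuming Python 3.4 or later. The module name `scratch_analysis` and its `scale` function are hypothetical; the script writes the module to a temporary directory, imports it, overwrites it to simulate an edit, and then reloads it.

```python
import importlib
import pathlib
import sys
import tempfile

# Skip writing .pyc files so a quick rewrite of the source
# cannot be masked by a stale bytecode cache.
sys.dont_write_bytecode = True

# Create a throwaway directory and make it importable.
workdir = pathlib.Path(tempfile.mkdtemp())
sys.path.insert(0, str(workdir))

# Hypothetical module, written to disk for the demonstration.
module_file = workdir / "scratch_analysis.py"
module_file.write_text("def scale():\n    return 1\n")

import scratch_analysis
first = scratch_analysis.scale()

# Simulate editing the module in an editor.
module_file.write_text("def scale():\n    return 10\n")

# A plain `import scratch_analysis` would return the cached module;
# reload re-executes the source in the existing module object.
importlib.invalidate_caches()
importlib.reload(scratch_analysis)
second = scratch_analysis.scale()

print(first, second)
```

Note that names bound earlier with `from module import name` keep pointing at the old objects after a manual reload, which is one reason the autoreload magic is often more convenient in a notebook.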

Hands On

Run through the 03-DataProcessing.ipynb notebook in the repository. At the end of the notebook, there is a simple exercise on code re-use. Explore analysis on your images or images of nearby groups. Save the updated notebook to disk.