Data Science Project Template

This is a personal way of structuring projects, built through observation and personal experience, that has helped me plan and scale up projects without getting lost.

.
├── config.yml
├── data
├── envs
├── LICENSE
├── README.md
├── reports
├── results
├── src
├── start_project.sh
└── workflows

Rationale

config.yml: hand-curated list of external files and parameters required for the project.
data: keep raw and preprocessed data organized.
envs: conda environment definitions to run the main project; add more if necessary.
LICENSE
reports: discuss your insights with a project webpage created with jupyter-book.
results: store files and final plots for every experiment.
src: project modules.
workflows: to download, preprocess, and analyze your data.
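
As an illustration of how the layout above can be referenced from code, here is a minimal sketch of a paths helper that could live in a module such as src/python/your_project_name/config.py. The function name `project_paths` and the dictionary keys are hypothetical; the actual contents of the template's config.py are not shown here.

```python
from pathlib import Path


def project_paths(root):
    """Return the template's main directories, resolved under `root`.

    Hypothetical helper: the names and keys are illustrative assumptions.
    """
    root = Path(root)
    data = root / "data"
    return {
        "root": root,
        "raw": data / "raw",                # untouched downloaded data
        "prep": data / "prep",              # preprocessed data
        "references": data / "references",  # reference files
        "results": root / "results",        # one subdirectory per experiment
        "reports": root / "reports",        # jupyter-book sources
        "workflows": root / "workflows",    # download, preprocess, analyses
    }


if __name__ == "__main__":
    # Print every directory relative to the current working directory.
    for name, path in project_paths(".").items():
        print(f"{name}: {path}")
```

Keeping the paths in one place means workflow scripts can import them instead of hard-coding directory names.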

Typical workflow

Modify config.yml to your taste, adding variables that could be useful project-wide.
Create the workflows that download and preprocess your project's data in workflows/download/ and workflows/preprocess/, respectively. Distinguish between code that can be used project-wide (place it in the project's modules in src/ and call its functions from your workflow) and code that is specific to one part of the project (place it in that workflow's scripts/ subdirectory).
Now you can analyze your data by creating experiments as subdirectories of workflows/analyses/, which read their inputs from data/ and write their outputs to results/your_experiment_name/.
Commit your work, and consider adding README files.
Inspect and explore your results by creating Jupyter notebooks in reports/notebooks/, which can be rendered into static webpages with jupyter-book. Structure your project's book by modifying reports/_toc.yml.
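
The steps above can be sketched with a small workflow-specific script, of the kind that would sit in a workflow's scripts/ subdirectory (for example as workflow_step.py). This is only an assumed shape, not the template's actual script: the function `summarize`, the `--input` and `--output` flags, and the placeholder "analysis" (counting rows and columns of a TSV file) are all hypothetical.

```python
import argparse
import csv
from pathlib import Path


def summarize(input_path, output_path):
    """Read a TSV file and write its row and column counts to another TSV.

    Placeholder analysis, used only to show inputs coming from data/
    and outputs going to results/your_experiment_name/.
    """
    input_path, output_path = Path(input_path), Path(output_path)
    with input_path.open(newline="") as handle:
        rows = list(csv.reader(handle, delimiter="\t"))
    header = rows[0] if rows else []
    records = rows[1:]

    # Create results/<experiment>/files/ (or any parent) if it is missing.
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with output_path.open("w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["n_rows", "n_columns"])
        writer.writerow([len(records), len(header)])
    return len(records), len(header)


def main(argv=None):
    """Command-line entry point; a workflow rule would call the script this way."""
    parser = argparse.ArgumentParser(description="Summarize a TSV file.")
    parser.add_argument("--input", required=True, help="TSV file under data/")
    parser.add_argument("--output", required=True, help="TSV file under results/")
    args = parser.parse_args(argv)
    summarize(args.input, args.output)
```

In a real script, `main()` would be called under an `if __name__ == "__main__":` guard, and the workflow manager would pass the input and output paths. If `summarize` turned out to be useful in several workflows, it would move to the project's modules in src/.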

Requirements (for this use case)

an environment manager: e.g. conda
a workflow manager: e.g. snakemake
(optional) a webpage builder: e.g. jupyter-book

Installation

# clone repository
git clone https://github.com/MiqG/project_template
cd project_template

# removes git remote
bash start_project.sh

# remove start_project.sh
rm start_project.sh

Structure

.
├── config.yml
├── data
│   ├── prep
│   ├── raw
│   └── references
├── envs
│   └── main.yml
├── LICENSE
├── README.md
├── reports
│   ├── _config.yml
│   ├── images
│   │   └── logo.png
│   ├── notebooks
│   │   ├── example_notebook.md
│   │   ├── intro.md
│   │   └── README.md
│   ├── README.md
│   └── _toc.yml
├── results
│   ├── new_experiment
│   │   ├── files
│   │   │   └── output_example.tsv
│   │   └── plots
│   │       └── output_example.pdf
│   └── README.md
├── src
│   └── python
│       ├── setup.py
│       └── your_project_name
│           └── config.py
├── start_project.sh
└── workflows
    ├── analyses
    │   └── new_experiment
    │       ├── README.md
    │       ├── run_all.sh
    │       ├── scripts
    │       │   └── workflow_step.py
    │       └── snakefile
    ├── download
    │   ├── README.md
    │   ├── run_all.sh
    │   ├── scripts
    │   │   └── workflow_step.py
    │   └── snakefile
    ├── preprocess
    │   ├── README.md
    │   ├── run_all.sh
    │   ├── scripts
    │   │   └── workflow_step.py
    │   └── snakefile
    └── README.md
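
To show how a new experiment fits this structure, the following sketch creates the directories that an experiment uses in the tree above: a scripts/ folder under workflows/analyses/ and files/ and plots/ folders under results/. The helper `scaffold_experiment` is hypothetical and not part of the template; copying the existing new_experiment directories by hand works just as well.

```python
from pathlib import Path


def scaffold_experiment(root, name):
    """Create the per-experiment directories shown in the structure tree.

    Hypothetical helper; it only creates directories and never overwrites files.
    """
    root = Path(root)
    directories = [
        root / "workflows" / "analyses" / name / "scripts",
        root / "results" / name / "files",
        root / "results" / name / "plots",
    ]
    for directory in directories:
        directory.mkdir(parents=True, exist_ok=True)
    return directories
```

Calling `scaffold_experiment(".", "my_experiment")` from the project root would prepare the folders; the snakefile, run_all.sh and README.md still need to be written for each experiment.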

References

Buffalo, V. (2015). Bioinformatics Data Skills: Reproducible and Robust Research with Open Source Tools. O'Reilly Media, Inc.
Noble WS (2009) A Quick Guide to Organizing Computational Biology Projects. PLoS Comput Biol 5(7): e1000424. https://doi.org/10.1371/journal.pcbi.1000424
Eric Ma - Principled Data Science Workflows
Pat Schloss - Riffomonas Project

Have fun!
