
DataRepo

Stephen Fisher edited this page Apr 30, 2015 · 6 revisions

The INIT and RSYNC modules pull data into the pipeline and move data out of it. The pipeline can be used without these modules, and how useful they are depends on how the user stores NGS data. One scheme for storing data in a repository is as follows:

  • Each NGS run is considered an experiment (i.e., all lanes in an Illumina flow cell)
  • Each experiment is uniquely labeled and put in its own directory
  • Experiment directories contain the following subdirectories:
    • raw: the fastq files as provided by the sequencing center, compressed (gzip'd) and potentially split into multiple smaller files. Files containing the first read are expected to have the label "R1" in their file names, and files containing the second read are expected to have "R2".
    • analyzed: the output files from this pipeline.
    • src: any 'source' data supplied by the sequencing center. This might include BCL files or Casava configuration files. This directory is not used by the pipeline.
    • info: any sequencing files that don't fit into the other directories; for example, bioanalyzer traces. This directory is not used by the pipeline.
  • The raw and analyzed directories each contain one subdirectory per sample, named with the sample ID. The pipeline creates the subdirectories in the analyzed directory as needed.
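
The layout above can be sketched in a few lines of Python. This is only an illustration of the scheme, not part of the pipeline: the function names `create_experiment` and `find_reads`, the experiment and sample IDs, and the fastq file names are all hypothetical. Matching "R1" and "R2" anywhere in the file name is a simplifying assumption and would misfire if a sample ID itself contained either label.

```python
from pathlib import Path
import tempfile

# Subdirectories of every experiment directory, as listed in the scheme above.
SUBDIRS = ("raw", "analyzed", "src", "info")


def create_experiment(repo_root, experiment_id, sample_ids):
    """Create an experiment directory with its four subdirectories.

    Only raw/ gets per-sample subdirectories here, since the pipeline
    creates the ones under analyzed/ as needed.
    """
    exp_dir = Path(repo_root) / experiment_id
    for name in SUBDIRS:
        (exp_dir / name).mkdir(parents=True, exist_ok=True)
    for sample in sample_ids:
        (exp_dir / "raw" / sample).mkdir(exist_ok=True)
    return exp_dir


def find_reads(exp_dir, sample_id):
    """Return (first_read_files, second_read_files) for one sample.

    Assumption: a file belongs to the first or second read if its name
    contains "R1" or "R2" respectively.
    """
    sample_dir = Path(exp_dir) / "raw" / sample_id
    files = sorted(p for p in sample_dir.iterdir() if p.is_file())
    first = [p.name for p in files if "R1" in p.name]
    second = [p.name for p in files if "R2" in p.name]
    return first, second


if __name__ == "__main__":
    # Build a throwaway repository with one experiment and two samples.
    with tempfile.TemporaryDirectory() as repo:
        exp = create_experiment(repo, "E.100", ["Sample_1", "Sample_2"])
        (exp / "raw" / "Sample_1" / "Sample_1_R1_001.fastq.gz").touch()
        (exp / "raw" / "Sample_1" / "Sample_1_R2_001.fastq.gz").touch()
        print(find_reads(exp, "Sample_1"))
```

Keeping the four subdirectory names in one tuple means the sketch stays in step with the scheme if a directory is renamed.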

Repository Structure Image
