Utility to download and, when appropriate, index data from GDC. This utility takes as input Catalog files, which describe data available at GDC, and produces BamMap files, which provide the path and other metadata for downstream processing.
GDC Import was developed for use with the CPTAC3 project but can be used for any GDC data.
Each download has its own clone of the importGDC.CPTAC3 GitHub repository, in a directory
named after the batch, for instance 20220217.TestDownload.
Clone the repository with,
git clone --recurse-submodules https://github.com/ding-lab/importGDC.CPTAC3.git 20220217.TestDownload
cd 20220217.TestDownload
Importing here relies on the data file CPTAC3.Catalog.dat, generated by CPTAC3 Case Discover and available here.
Default locations for Catalog are provided for MGI, katmai, and compute1.
For LSF systems (compute1 and MGI), the number of simultaneous downloads is controlled by Job Groups.
The group name is typically /USER/gdc-download, where USER is the login name. Substitute your own login name
for USER in the commands below.
First-time users will need to create a job group with,
bgadd -L 5 /USER/gdc-download
This will create the named job group with a limit of 5 simultaneous downloads. To see the number of jobs queued, running, and completed, use the command,
bjgroup -s /USER/gdc-download
and to change the limit to 8 simultaneous jobs, use,
bgmod -L 8 /USER/gdc-download
Don't make the number of jobs too high: this will saturate the network and reduce system performance. Suggest 5 jobs to start, and consult system administrators if you have questions.
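As a minimal sketch of the job group setup above, the snippet below derives the group name from a login name and prints the LSF commands instead of running them, so it can be tried on a machine without LSF. The function names job_group_name and print_job_group_setup are hypothetical and not part of importGDC.CPTAC3; only bgadd, bjgroup, and the /USER/gdc-download naming come from the text above.

```shell
# Hypothetical helper: build the job group name for a given login name.
job_group_name() {
    printf '/%s/gdc-download\n' "$1"
}

# Hypothetical helper: print (do not run) the commands that create and
# inspect the job group. The limit defaults to 5, as suggested above.
print_job_group_setup() {
    jg_name=$(job_group_name "$1")
    jg_limit=${2:-5}
    printf 'bgadd -L %s %s\n' "$jg_limit" "$jg_name"
    printf 'bjgroup -s %s\n' "$jg_name"
}

# Print the commands for the current login name.
print_job_group_setup "${USER:-USER}"
```

Review the printed commands, then paste them into a shell on compute1 or MGI.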
Note that the Katmai system does not use LSF groups. Instead, GNU parallel
controls the number of simultaneous jobs, which is set at start
time as described below.
Genomics files are large, with WGS often larger than 100 GB, so the choice of storage location is important.
This is determined by DATA_ROOT, defined below. Make sure this allocation has adequate free space (df -h DATA_ROOT)
and that you can write to it.
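The two checks above (free space and write access) can be combined in a small pre-flight function. This is only a sketch: check_data_root is a hypothetical name, not a script shipped with this repository, and /tmp stands in for a real allocation.

```shell
# Hypothetical pre-flight check for a candidate DATA_ROOT directory.
# Returns non-zero if the directory is missing or not writable;
# otherwise prints the free-space report for the file system holding it.
check_data_root() {
    if [ ! -d "$1" ]; then
        echo "ERROR: $1 is not a directory" >&2
        return 1
    fi
    if [ ! -w "$1" ]; then
        echo "ERROR: $1 is not writable" >&2
        return 1
    fi
    df -h "$1"
}

# Example call; replace /tmp with the intended DATA_ROOT.
check_data_root /tmp
```

Whether the reported free space is adequate still has to be judged against the size of the planned download.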
The individual files will be written to,
$DATA_ROOT/GDC_import/data/<UUID>/<FILENAME>
where UUID is associated with the data file and provided by GDC. BAM files
will have an index file <FILENAME>.bai and a summary file <FILENAME>.flagstat
generated as well.
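To make the layout concrete, the sketch below composes the expected paths for one file. The helper names are hypothetical, and the DATA_ROOT, UUID, and filename in the example are made-up placeholders rather than real GDC identifiers.

```shell
# Hypothetical helper: compose the path of a downloaded file.
# Arguments: DATA_ROOT, UUID, FILENAME.
import_data_path() {
    printf '%s/GDC_import/data/%s/%s\n' "$1" "$2" "$3"
}

# For a BAM file, the index and flagstat summary sit next to the data file.
bam_companion_paths() {
    bcp_bam=$(import_data_path "$1" "$2" "$3")
    printf '%s.bai\n%s.flagstat\n' "$bcp_bam" "$bcp_bam"
}

# Example with placeholder values.
import_data_path /data 00000000-0000-0000-0000-000000000000 sample.bam
bam_companion_paths /data 00000000-0000-0000-0000-000000000000 sample.bam
```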
The GDC user token is obtained from the GDC Cancer Portal and has a filename which looks like
gdc-user-token.2022-01-05T22_45_39.319Z.txt. Note that this token is valid
for one month. If a new one is downloaded, old tokens are invalidated.
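Because an expired token only shows up as a failed download, a quick age check before starting can save time. The sketch below is an assumption-laden approximation: token_looks_fresh is a hypothetical helper, it treats the file modification time as the download date, and it takes "one month" to mean 30 days.

```shell
# Hypothetical check: succeed only if the token file exists and was
# modified within the last 30 days. Assumes the file modification time
# approximates the download date.
token_looks_fresh() {
    [ -f "$1" ] || return 1
    # find prints the path only when the file is older than 30 days
    [ -z "$(find "$1" -mtime +30)" ]
}

# Example: GDC_TOKEN would normally hold the path to the token file.
if token_looks_fresh "${GDC_TOKEN:-}"; then
    echo "token looks fresh"
else
    echo "token missing or possibly expired"
fi
```

This cannot detect a token that was invalidated by downloading a newer one; only GDC can confirm that.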
It is suggested that you track all batch-specific details in the README.project.md file for future reference. Also, all imports associated with CPTAC3 Y3 should be tracked here.
A number of locale-specific variables are defined in gdc-import.config.sh:
- SYSTEM - the name of this system: MGI, katmai, or compute1. This provides settings of a lot of other variables below.
- GDC_TOKEN - path to GDC token
- LSF_GROUP - LSF Group name (e.g. /USER/gdc-download) as described above. Ignored on Katmai.
- The following are system-specific. Default values are provided based on SYSTEM, and these may not have to be modified
  - CATALOGD - path to location of catalog file
  - DATA_ROOT - location where download data will be stored.
  - START_DOCKERD - path to directory of start_docker.sh described above
  - FILE_SYSTEM - Currently one of MGI, compute1, katmai
    - Used in BamMap to identify system where data resides
  - DOCKER_SYSTEM - One of MGI, compute1, or docker
    - docker is any generic docker system
  - LSF - 1 for MGI and compute1, 0 otherwise
  - DL_ARGS - optional compute group arguments
  - LSF_ARGS - optional LSF arguments
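The way SYSTEM drives the other defaults can be illustrated with the LSF flag, whose rule is stated above (1 for MGI and compute1, 0 otherwise). The function below is a hypothetical sketch of that rule, not an excerpt from gdc-import.config.sh, whose actual contents may differ.

```shell
# Hypothetical helper: map a SYSTEM name to the LSF flag described above
# (1 for MGI and compute1, 0 otherwise).
lsf_flag_for_system() {
    case "$1" in
        MGI|compute1) echo 1 ;;
        *)            echo 0 ;;
    esac
}

# Example: derive LSF from SYSTEM on katmai.
SYSTEM=katmai
LSF=$(lsf_flag_for_system "$SYSTEM")
echo "SYSTEM=$SYSTEM LSF=$LSF"
```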
- Place UUIDs to be downloaded in file dat/download_UUID.dat
- 10_summarize_download.sh calculates the disk space required for this download. It generates an ad hoc catalog file which can be used to examine the planned download
  - Suggest placing output of this in README.project.md
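Before summarizing, it can help to confirm that the UUID list is well formed. The sketch below assumes dat/download_UUID.dat holds one bare UUID per line in the canonical 8-4-4-4-12 hexadecimal form; invalid_uuids is a hypothetical helper, and the scripts in this repository may accept other layouts.

```shell
# Hypothetical check: print every non-empty line that is not a canonical
# UUID (8-4-4-4-12 hexadecimal digits). Prints nothing if all lines pass.
invalid_uuids() {
    grep -Ev '^([0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12})?$' "$1" || true
}

# Example: report problems in the UUID list, if the file exists.
if [ -f dat/download_UUID.dat ]; then
    invalid_uuids dat/download_UUID.dat
fi
```

Stray whitespace and Windows line endings are the usual causes of a line being reported here.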
First, dry run of one UUID:
cat dat/download_UUID.dat | bash 20_start_download.sh -1d -
If that looks good, run one UUID:
cat dat/download_UUID.dat | bash 20_start_download.sh -1 -
If the download starts OK (check the logs directory), download the remainder, skipping the first UUID.
On katmai, for example with 5 downloads in parallel:
tail -n +2 dat/download_UUID.dat | bash 20_start_download.sh -J 5 -
On MGI or compute1, where LSF governs the number of simultaneous downloads:
tail -n +2 dat/download_UUID.dat | bash 20_start_download.sh -
In general,
cat dat/download_UUID.dat | bash 20_start_download.sh -
will start download of all UUIDs. There are a number of flags to review and modify this download:
- -d will perform a dry run, to examine commands without running them
- -1 stops execution after one UUID is processed; can be combined with -d
- -J N will perform N downloads in parallel on katmai, and can significantly speed up downloads
  - Note, do not use -J on MGI or compute1. Rather, the number of downloads will be governed by the LSF system
- -h will list the complete set of options
A number of other options exist. Run with -h to view them.
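The staged rollout above (one test UUID first, then the remainder) works because head -n 1 and tail -n +2 split the list without overlap. The sketch below shows which IDs each stage covers, using a throwaway list of placeholder IDs; first_uuid and remaining_uuids are hypothetical helper names.

```shell
# Hypothetical helpers showing which UUIDs each stage covers.
first_uuid() {
    head -n 1 "$1"      # the single test download
}
remaining_uuids() {
    tail -n +2 "$1"     # everything from the second line onward
}

# Example with a temporary list of placeholder IDs.
uuid_list=$(mktemp)
printf '%s\n' uuid-A uuid-B uuid-C > "$uuid_list"
echo "test download: $(first_uuid "$uuid_list")"
echo "remainder:"
remaining_uuids "$uuid_list"
rm -f "$uuid_list"
```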
30_evaluate_download_status.sh
will list download status of all UUIDs.
40_make_BamMap.sh
will create a BamMap file which lists the path and other metadata associated with
a given download. BamMap files are described in more detail in the CPTAC3.Catalog project,
and examples are here.
Downloading is performed by the GDC Data Transfer Tool. BAM files
are indexed, and the output of samtools flagstat is written to provide an overview
of read statistics. The tool is wrapped in a docker image, mwyczalkowski/importgdc,
and a wrapper shell script iterates over all UUIDs
to invoke the dockerized tool in a system-dependent way. Parallelization for
katmai is implemented in the wrapper script using GNU parallel.
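For orientation, the post-download steps for a BAM file amount to something like the two samtools commands printed below. This is a dry-run sketch written for illustration, not code taken from the mwyczalkowski/importgdc image; bam_postprocess_cmds is a hypothetical name and the exact invocation inside the image may differ.

```shell
# Hypothetical dry-run helper: print the samtools commands that would
# index a BAM file and write its flagstat summary next to it.
bam_postprocess_cmds() {
    printf 'samtools index %s\n' "$1"
    printf 'samtools flagstat %s > %s.flagstat\n' "$1" "$1"
}

# Example with a placeholder filename.
bam_postprocess_cmds sample.bam
```

Run on its own, samtools index sample.bam writes sample.bam.bai, which matches the <FILENAME>.bai naming used for imported data.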
Matthew Wyczalkowski, Ding Lab, Washington University School of Medicine