sra_metadata_parser

This project aims to give researchers a simple way to extract the data they need from the SRA metadata dumps found here:

ftp://ftp.ncbi.nlm.nih.gov/sra/reports/Metadata/

This gives researchers a tool for making sure they are working from the most recent dataset.

The metadata dumps (.tar.gz files) contain millions of directories, each with a set of XML files that may include:

*.experiment.xml
*.run.xml
*.sample.xml
*.study.xml
*.submission.xml
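
As a rough illustration of how the file name suffixes listed above could be mapped to a document type, here is a minimal sketch. The `DocType` enum and the `doc_type` function are hypothetical names invented for this example and are not taken from the project's source.

```rust
/// Hypothetical enum naming the five document types listed above.
#[derive(Debug, PartialEq, Clone, Copy)]
enum DocType {
    Experiment,
    Run,
    Sample,
    Study,
    Submission,
}

/// Classifies an archive entry by its `<accession>.<kind>.xml` file name.
/// Returns `None` for any entry that does not follow that pattern.
fn doc_type(file_name: &str) -> Option<DocType> {
    // Drop the extension, then take whatever follows the last remaining dot.
    let stem = file_name.strip_suffix(".xml")?;
    let (_, kind) = stem.rsplit_once('.')?;
    match kind {
        "experiment" => Some(DocType::Experiment),
        "run" => Some(DocType::Run),
        "sample" => Some(DocType::Sample),
        "study" => Some(DocType::Study),
        "submission" => Some(DocType::Submission),
        _ => None,
    }
}
```

For example, `doc_type("SRA000001/SRA000001.run.xml")` would yield `Some(DocType::Run)`, where the accession `SRA000001` is only a placeholder.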

The idea is to specify which data is of interest within each XML file (using an XPath-like strategy); the program then generates a .csv file (for now) for each corresponding XML document type (experiment.xml, run.xml ...). Each row in the csv file holds one group of associated data, as specified by the user (within the compiled program).
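
The selection idea can be sketched as follows. This is not the project's implementation: the `Event` enum, `select`, and `csv_row` are hypothetical names, the events stand in for whatever a real XML parser would emit, and the element names used in the usage note are illustrative only.

```rust
/// Hypothetical, simplified XML events (attributes are ignored in this sketch).
enum Event<'a> {
    Start(&'a str),
    Text(&'a str),
    End,
}

/// Collects the text of every element whose path from the root equals `wanted`.
fn select<'a>(events: &[Event<'a>], wanted: &[&str]) -> Vec<&'a str> {
    // The stack holds the names of the currently open elements, root first.
    let mut stack: Vec<&str> = Vec::new();
    let mut found = Vec::new();
    for event in events {
        match event {
            Event::Start(name) => stack.push(*name),
            Event::Text(text) => {
                if stack.as_slice() == wanted {
                    found.push(*text);
                }
            }
            Event::End => {
                stack.pop();
            }
        }
    }
    found
}

/// Joins fields into one CSV row, quoting fields that contain a comma, quote, or newline.
fn csv_row(fields: &[&str]) -> String {
    fields
        .iter()
        .map(|field| {
            if field.contains(',') || field.contains('"') || field.contains('\n') {
                format!("\"{}\"", field.replace('"', "\"\""))
            } else {
                field.to_string()
            }
        })
        .collect::<Vec<_>>()
        .join(",")
}
```

Calling `select(&events, &["EXPERIMENT", "TITLE"])` returns only the text of `TITLE` elements that sit directly under the root `EXPERIMENT` element, and `csv_row` turns the selected values into one output row. Because only the stack of open element names is kept, memory use does not grow with the size of the document.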

At the time of writing (21/12/2020), this example program parses the entire SRA metadata dump in under 15 minutes on a 2.50 GHz i7 processor, using ~3 MB of RAM.

Aims continued

I intend to develop this project into an SRA metadata dump -> SQL database program. After that, I will split off the lower-level functionality into another package that exposes an API allowing much more customisation and targeted data extraction. This will benefit those who only want to extract very specific information, and should greatly reduce extraction time for them.

Usage

sra_metadata_parser --file <file> --destination <destination>

Example

sra_metadata_parser -f NCBI_SRA_Metadata_Full_20201006.tar.gz -d ./

Building

Install Rust: https://www.rust-lang.org/tools/install

Clone this repo

git clone https://github.com/jamespeterschinner/sra_metadata_parser

Build

cargo build --release

Executable will now be in

~/sra_metadata_parser/target/release/sra_metadata_pipeline

Disclaimer

This project is currently 'good enough' code, which in part is a proof of concept aimed at solving a particular problem (extracting relevant SRA metadata) in a timely manner. To achieve this goal, it uses unsafe code that keeps pointers with a 'static lifetime into a Vec<u8>. Currently the implementation checks at runtime that the capacity of this Vec does not change, since a change in capacity would mean the buffer had been reallocated and the unsafe pointers were no longer valid.
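
To make the trade-off concrete, here is a minimal sketch of the general pattern: a byte buffer that hands out 'static slices into its own storage and checks at runtime that its capacity never changes. It is not the project's code. `FixedBuf` and `store` are hypothetical names, and the sketch assumes that the buffer outlives every slice it returns and that already-written bytes are never modified.

```rust
/// Hypothetical buffer that must never reallocate after construction.
struct FixedBuf {
    bytes: Vec<u8>,
    expected_capacity: usize,
}

impl FixedBuf {
    fn with_capacity(requested: usize) -> Self {
        let bytes = Vec::with_capacity(requested);
        // Record the capacity actually allocated, which may exceed the request.
        let expected_capacity = bytes.capacity();
        FixedBuf { bytes, expected_capacity }
    }

    /// Copies `data` into the buffer and returns a slice pointing at the copy.
    /// Returns `None` when the data would not fit, because growing the Vec
    /// would reallocate it and invalidate slices handed out earlier.
    fn store(&mut self, data: &[u8]) -> Option<&'static [u8]> {
        if self.bytes.len() + data.len() > self.expected_capacity {
            return None;
        }
        let start = self.bytes.len();
        self.bytes.extend_from_slice(data);
        // Runtime check: an unchanged capacity means the Vec did not grow.
        assert_eq!(
            self.bytes.capacity(),
            self.expected_capacity,
            "buffer reallocated; earlier pointers are invalid"
        );
        // SAFETY (assumptions of this sketch, not enforced by the compiler):
        // the range start..start + data.len() was just written, the allocation
        // has not moved, and the caller keeps the FixedBuf alive for as long
        // as the returned slice is used.
        let slice = unsafe {
            std::slice::from_raw_parts(self.bytes.as_ptr().add(start), data.len())
        };
        Some(slice)
    }
}
```

The 'static lifetime here is a promise the compiler cannot verify, which is why the capacity check exists. A safe alternative would be to return `(offset, length)` pairs and index into the buffer on each access, at the cost of a bounds check per lookup.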

I would like to change this in the future, provided there are no significant performance costs.
