Time to include some extra species #68

markziemann · 2020-02-23T11:32:37Z

6th FEB 2020	no. SRX
Arabidopsis thaliana	30890
Zea mays	19753
Oryza sativa	9737
Triticum aestivum	6924
Solanum lycopersicum	6444
Sorghum bicolor	4646
Glycine max	3889
Populus trichocarpa	3485
Vitis vinifera	3258
Panicum virgatum	2338
Hordeum vulgare	2284
Solanum tuberosum	1851
Brachypodium distachyon	1814

18th FEB 2020	no. SRX
Schizosaccharomyces pombe	4718
Plasmodium falciparum	4298

markziemann · 2020-11-30T01:54:57Z

Macaca mulatta
Bos taurus
Sus scrofa
Gallus gallus
Ovis aries

wdlingit · 2024-06-28T02:13:25Z

Thank you for providing this DEE2 database. I have been using it for quite a while. Some short questions:

Is the plan to add more species still ongoing?
If so, which genome assembly/annotation versions are being used?

My colleagues are interested in rice and maize. We just tested the Singularity solution and it seems that we can run the pipeline ourselves. If DEE2 has already adopted particular genome assembly/annotation versions for rice and maize, we would like to follow them and perhaps share the computation results.

markziemann · 2024-07-04T03:00:14Z

Hi @wdlingit, we have been seeking funding to support the expansion of DEE2, so far without success, in particular to deal with the backlog of mouse and human studies and possibly to update to the latest reference genome build.
That said, I think we can work together to get rice and maize included. I will do the necessary work to modify the pipeline to include rice and maize data and then update the web server side of things. If you could do the data processing at your institution, it would help expedite things. I'm not sure about an exact timeline, but I might have things ready to start data processing by the end of August.

wdlingit · 2024-07-04T07:11:48Z

Thank you for the reply. We collected SRR accessions with NCBI Taxonomy ID 39947, plus some minor restrictions. Our current list to be processed is about 7K SRRs, which is smaller than what you listed a few years ago. I think this is reasonable because Tax ID 39947 is Oryza sativa Japonica Group, a subspecies(?) of rice, which is also available in Ensembl Plants ( https://plants.ensembl.org/Oryza_sativa/Info/Index ).

We just started (2 hours ago) a test run of 1000 SRRs and things seem OK to me. To make sure things are coordinated, here are the settings we applied in the volunteer_pipeline.sh script:

elif [ $ORG == "osativa" ] ; then
  GTFURL="ftp://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/gtf/oryza_sativa/Oryza_sativa.IRGSP-1.0.59.gtf.gz"
  GDNAURL="ftp://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/fasta/oryza_sativa/dna/Oryza_sativa.IRGSP-1.0.dna_sm.toplevel.fa.gz"
  CDNAURL="ftp://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/fasta/oryza_sativa/cdna/Oryza_sativa.IRGSP-1.0.cdna.all.fa.gz"
  BT2_MD5="05eb69ae1d8b8b0d2cc06e890bf55dc6"
  KAL_MD5="6f618eda89e9b057c99d4d7580c5858d"
  STAR_MD5="b374bef1756a1ea105c968d68c71127e"

wdlingit · 2024-10-09T08:19:29Z

Hi @markziemann, sorry for the long-delayed reply. So far we have finished processing ~7468 rice RNAseq samples and ~38200 maize RNAseq samples. We are not sure when and how to send these results to you.

Also, we made one minor fix that repairs the sequence files after a specific trimming step, which in some rare cases leaves an empty read at one end. Do you have a repository for the DEE2 container so that we can make a pull request for this, or should we just describe our fix in an issue?
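To illustrate the kind of problem described above, here is a minimal sketch of one way to drop read pairs in which either mate came out of trimming with an empty sequence. It is not the fix referred to in this comment, whose details are not given here. The function name `drop_empty_pairs` and the file names are hypothetical, and the sketch assumes uncompressed, standard 4-line FASTQ records with both files in the same read order.

```shell
#!/bin/sh
# Hypothetical helper, NOT the actual DEE2 fix: keep only read pairs in which
# both mates still have a non-empty sequence line after trimming.
# Assumes 4-line FASTQ records and identical read order in both files.
# Usage: drop_empty_pairs in_R1.fq in_R2.fq out_R1.fq out_R2.fq
drop_empty_pairs() {
  # Create/truncate the outputs so they exist even if every pair is dropped.
  : > "$3"
  : > "$4"
  awk -v r2="$2" -v o1="$3" -v o2="$4" '
    {
      # Read the matching line from the R2 file; stop if R2 is shorter.
      if ((getline m < r2) <= 0) exit
      k = (NR - 1) % 4          # position within the 4-line record (0..3)
      a[k] = $0                 # R1 line
      b[k] = m                  # R2 line
      # At the end of a record, write the pair only if both sequences
      # (line index 1 of each record) are non-empty.
      if (k == 3 && a[1] != "" && b[1] != "") {
        for (i = 0; i < 4; i++) {
          print a[i] > o1
          print b[i] > o2
        }
      }
    }' "$1"
}
```

Filtering both files in a single pass keeps the mates in sync, which matters because paired-end aligners generally expect the two files to list reads in the same order.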

BTW, these are our settings for the maize part of the volunteer_pipeline.sh script:

elif [ $ORG == "zmays" ] ; then
  GTFURL="https://ftp.ebi.ac.uk/ensemblgenomes/pub/release-59/plants/gtf/zea_mays/Zea_mays.Zm-B73-REFERENCE-NAM-5.0.59.gtf.gz"
  GDNAURL="https://ftp.ebi.ac.uk/ensemblgenomes/pub/release-59/plants/fasta/zea_mays/dna/Zea_mays.Zm-B73-REFERENCE-NAM-5.0.dna_sm.toplevel.fa.gz"
  CDNAURL="https://ftp.ebi.ac.uk/ensemblgenomes/pub/release-59/plants/fasta/zea_mays/cdna/Zea_mays.Zm-B73-REFERENCE-NAM-5.0.cdna.all.fa.gz"
  BT2_MD5="7dc6bbdf600fd4305af72600b4c417f9"
  KAL_MD5="00ecbba2360b5ffdd24a3be6b0aa0acd"
  STAR_MD5="4a44ab4db80dcc1e887f6861dc48eae9"
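The three maize URLs above follow a regular pattern (release number, species directory, file prefix, assembly name), so they could be generated rather than typed by hand when more species are added. The following is only a sketch: the function name `ensembl_plants_urls` is hypothetical and not part of volunteer_pipeline.sh, and it assumes that other species follow the same directory layout as the maize entries, which should be checked for each species.

```shell
#!/bin/sh
# Hypothetical helper, not part of volunteer_pipeline.sh: print the three
# reference URLs for an Ensembl Plants species, following the layout of the
# maize URLs quoted above. Whether every species uses this layout is an
# assumption.
# Usage: ensembl_plants_urls RELEASE FILE_PREFIX ASSEMBLY
ensembl_plants_urls() {
  rel="$1"      # e.g. 59
  prefix="$2"   # e.g. Zea_mays
  asm="$3"      # e.g. Zm-B73-REFERENCE-NAM-5.0
  # The species directory is the lower-case form of the file prefix.
  dir=$(printf '%s' "$prefix" | tr '[:upper:]' '[:lower:]')
  base="https://ftp.ebi.ac.uk/ensemblgenomes/pub/release-${rel}/plants"
  printf 'GTFURL="%s/gtf/%s/%s.%s.%s.gtf.gz"\n' "$base" "$dir" "$prefix" "$asm" "$rel"
  printf 'GDNAURL="%s/fasta/%s/dna/%s.%s.dna_sm.toplevel.fa.gz"\n' "$base" "$dir" "$prefix" "$asm"
  printf 'CDNAURL="%s/fasta/%s/cdna/%s.%s.cdna.all.fa.gz"\n' "$base" "$dir" "$prefix" "$asm"
}

# Example: reproduces the three maize URLs listed above.
ensembl_plants_urls 59 Zea_mays Zm-B73-REFERENCE-NAM-5.0
```

The MD5 values are deliberately left out of the sketch, since they have to be taken from the actual index builds and cannot be derived from the URLs.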

markziemann · 2024-10-09T22:50:07Z

Great work!

In terms of data transfers, we can use something like magic-wormhole, which can be installed using apt if you are on a Debian-based system. Can I ask how big the dataset is in GB?
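As a rough sketch of what that could look like: the helper below reports the size of a results directory, and the commented lines show a typical magic-wormhole exchange. The function name `size_gb`, the directory `dee2_results` and the archive name are placeholders, not names used by DEE2.

```shell
#!/bin/sh
# Hypothetical helper: report the size of a directory in GiB (two decimals).
# du -sk prints the size in 1024-byte blocks, so dividing twice by 1024
# gives GiB.
size_gb() {
  du -sk "$1" | awk '{ printf "%.2f\n", $1 / 1024 / 1024 }'
}

# Typical transfer, left commented out so the snippet has no side effects.
# "dee2_results" and "dee2_results.tar.gz" are placeholder names.
#
#   sudo apt install magic-wormhole           # Debian/Ubuntu package
#   size_gb dee2_results                      # answer the "how big?" question
#   tar -czf dee2_results.tar.gz dee2_results
#   wormhole send dee2_results.tar.gz         # prints a one-time code
#
# On the receiving side, using the code printed by the sender:
#
#   wormhole receive <code>
```

For a dataset spanning tens of thousands of samples, it may be more practical to split the archive into several parts and send them one at a time, so that an interrupted transfer does not have to restart from the beginning.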

I'm also happy that you have a minor improvement to the pipeline. If you could email that script to me, I can update the code repo, the Docker image and the ingestion script so that your processed datasets can be whitelisted and included.

Thank you for providing the URLs, this will simplify my work somewhat.

markziemann added the feature label Oct 9, 2020

markziemann self-assigned this Jul 4, 2024
