Time to include some extra species #68

Open
markziemann opened this issue Feb 23, 2020 · 6 comments

@markziemann
Owner

| 6th FEB 2020 | no. SRX |
|---|---|
| Arabidopsis thaliana | 30890 |
| Zea mays | 19753 |
| Oryza sativa | 9737 |
| Triticum aestivum | 6924 |
| Solanum lycopersicum | 6444 |
| Sorghum bicolor | 4646 |
| Glycine max | 3889 |
| Populus trichocarpa | 3485 |
| Vitis vinifera | 3258 |
| Panicum virgatum | 2338 |
| Hordeum vulgare | 2284 |
| Solanum tuberosum | 1851 |
| Brachypodium distachyon | 1814 |

| 18th FEB 2020 | no. SRX |
|---|---|
| Schizosaccharomyces pombe | 4718 |
| Plasmodium falciparum | 4298 |

@markziemann
Owner Author

Macaca mulatta
Bos taurus
Sus scrofa
Gallus gallus
Ovis aries

@wdlingit

Thank you for providing this DEE2 database. I have been using it for quite a while. Some short questions:

  1. Is the plan to add species still ongoing?
  2. If so, which genome assembly/annotation versions are being used?

My colleagues are interested in rice and maize. We just tested the Singularity solution and it seems that we can run the pipeline ourselves. If DEE2 has already adopted particular genome assembly/annotation versions for rice and maize, we would like to follow them and perhaps share the computation results.

@markziemann
Owner Author

Hi @wdlingit, we have been seeking funding, so far unsuccessfully, to support the expansion of DEE2, in particular to address the backlog of mouse and human studies and the possibility of updating to the latest reference genome build.
That said, I think we can work together to get rice and maize included. I will do the necessary work to modify the pipeline to include rice and maize data and then update the web server side of things. If you could do the data processing at your institution, it would help expedite things. I'm not sure about an exact timeline, but I might have things ready to start data processing by the end of August.

markziemann self-assigned this Jul 4, 2024
@wdlingit

wdlingit commented Jul 4, 2024

Thank you for the reply. We collected SRR accessions with NCBI Taxonomy ID 39947, plus some minor restrictions. Our current list to be processed is about 7K SRRs, which is smaller than what you listed a few years ago. I think this is reasonable because Tax ID 39947 is for Oryza sativa Japonica Group, a subspecies(?) of rice. Oryza sativa Japonica Group is also available in Ensembl Plants ( https://plants.ensembl.org/Oryza_sativa/Info/Index ). We started a test run of 1000 SRRs two hours ago and things seem OK so far. To make sure things are coordinated, here is the information we applied in the volunteer_pipeline.sh script:

elif [ $ORG == "osativa" ] ; then
  GTFURL="ftp://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/gtf/oryza_sativa/Oryza_sativa.IRGSP-1.0.59.gtf.gz"
  GDNAURL="ftp://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/fasta/oryza_sativa/dna/Oryza_sativa.IRGSP-1.0.dna_sm.toplevel.fa.gz"
  CDNAURL="ftp://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/fasta/oryza_sativa/cdna/Oryza_sativa.IRGSP-1.0.cdna.all.fa.gz"
  BT2_MD5="05eb69ae1d8b8b0d2cc06e890bf55dc6"
  KAL_MD5="6f618eda89e9b057c99d4d7580c5858d"
  STAR_MD5="b374bef1756a1ea105c968d68c71127e"
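
For anyone reproducing these settings, a checksum comparison along the following lines may be useful. This is only a sketch: it assumes the three `*_MD5` values are digests of the corresponding Bowtie2, Kallisto and STAR index files, which is not stated in this thread. The function name `verify_md5` and the file path in the usage comment are hypothetical, and `md5sum` from GNU coreutils is assumed to be available.

```shell
# Sketch: compare a file's MD5 digest with an expected value.
# verify_md5 is a hypothetical helper, not part of volunteer_pipeline.sh.
verify_md5() {
  # $1: path to the file, $2: expected hex digest
  actual=$(md5sum "$1" | awk '{print $1}')
  if [ "$actual" = "$2" ]; then
    return 0
  fi
  echo "MD5 mismatch for $1: expected $2, got $actual" >&2
  return 1
}

# Example usage (the archive name is a placeholder):
#   verify_md5 osativa_bowtie2_index.tar.gz "$BT2_MD5" || exit 1
```

Returning a non-zero status rather than exiting lets the caller decide whether a mismatch should trigger a rebuild or stop the run.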

@wdlingit

wdlingit commented Oct 9, 2024

Hi @markziemann, sorry for the long-delayed reply. So far we have finished processing ~7468 rice RNA-seq samples and ~38200 maize RNA-seq samples. We are not sure when and how to send these results to you.

Also, we made one minor fix that repairs the sequence files after a specific trimming step, which in some rare cases produces an empty read at one end. Do you have a repository for the DEE2 container so that we can make a pull request for this? Or should we just describe our fix in an issue?
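
To illustrate the kind of problem described above, here is a minimal sketch that drops read pairs in which either mate has an empty sequence after trimming. It is not the actual fix mentioned in this comment, whose details are not given here. It assumes uncompressed paired FASTQ files with exactly four lines per record and mates in the same order in both files. The function name `filter_empty_pairs` and all file names are hypothetical.

```shell
# Sketch: keep only read pairs where both mates have a non-empty sequence.
# filter_empty_pairs is a hypothetical helper; the real fix may differ.
filter_empty_pairs() {
  # $1: R1 FASTQ, $2: R2 FASTQ, $3: filtered R1 output, $4: filtered R2 output
  : > "$3"   # create the outputs even if no pair survives
  : > "$4"
  awk -v r2="$2" -v o1="$3" -v o2="$4" '
    {
      # The current line is the R1 header; read the rest of the R1 record.
      h1 = $0
      getline s1; getline p1; getline q1
      # Read the matching four-line record from the R2 file.
      getline h2 < r2; getline s2 < r2; getline p2 < r2; getline q2 < r2
      if (length(s1) > 0 && length(s2) > 0) {
        printf "%s\n%s\n%s\n%s\n", h1, s1, p1, q1 > o1
        printf "%s\n%s\n%s\n%s\n", h2, s2, p2, q2 > o2
      }
    }' "$1"
}

# Example usage (file names are placeholders):
#   filter_empty_pairs trimmed_1.fastq trimmed_2.fastq clean_1.fastq clean_2.fastq
```

Dropping the whole pair keeps the two files in sync, which paired-end aligners generally require.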

BTW, here are our settings for the maize part of the volunteer_pipeline.sh script:

elif [ $ORG == "zmays" ] ; then
  GTFURL="https://ftp.ebi.ac.uk/ensemblgenomes/pub/release-59/plants/gtf/zea_mays/Zea_mays.Zm-B73-REFERENCE-NAM-5.0.59.gtf.gz"
  GDNAURL="https://ftp.ebi.ac.uk/ensemblgenomes/pub/release-59/plants/fasta/zea_mays/dna/Zea_mays.Zm-B73-REFERENCE-NAM-5.0.dna_sm.toplevel.fa.gz"
  CDNAURL="https://ftp.ebi.ac.uk/ensemblgenomes/pub/release-59/plants/fasta/zea_mays/cdna/Zea_mays.Zm-B73-REFERENCE-NAM-5.0.cdna.all.fa.gz"
  BT2_MD5="7dc6bbdf600fd4305af72600b4c417f9"
  KAL_MD5="00ecbba2360b5ffdd24a3be6b0aa0acd"
  STAR_MD5="4a44ab4db80dcc1e887f6861dc48eae9"

@markziemann
Owner Author

Great work!

In terms of data transfers, we can use something like magic-wormhole, which can be installed using apt if you are on a Debian-based system. Can I ask how big the dataset is in GB?
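
As a rough sketch of how the size question could be answered and the transfer started: the helper below reports a directory's size in GiB (1024^3 bytes) using `du -sk`. The function name `dataset_size_gib` and all paths are hypothetical. The `wormhole send` and `wormhole receive` commands in the comments are the basic magic-wormhole invocations; whether a single transfer is practical for a dataset of this size is an open question.

```shell
# Sketch: report the total size of a results directory in GiB.
# dataset_size_gib is a hypothetical helper name.
dataset_size_gib() {
  # du -sk prints the total in 1024-byte units; convert KiB to GiB.
  du -sk "$1" | awk '{ printf "%.2f\n", $1 / 1024 / 1024 }'
}

# Example usage (paths are placeholders):
#   dataset_size_gib /path/to/results
#   wormhole send /path/to/results.tar   # sender: prints a one-time code
#   wormhole receive                     # receiver: enter the code when prompted
```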

I'm also happy that you have a minor improvement to the pipeline. If you could email that script to me, I can update the code repo, the Docker image and the ingestion script so that your processed datasets can be whitelisted and included.

Thank you for providing the URLs, this will simplify my work somewhat.
