Rfam Taxonomy

Based on Rfam 14.8 (May 2022). See releases for previous versions.

This repository contains the code and data for analysing the taxonomic distribution of the Rfam families. The goal is to identify domain-specific subsets of Rfam covariance models for annotating bacterial, eukaryotic, and other genomes with the Infernal software.

The code uses the Rfam public MySQL database to compare the taxonomic domains of sequences from the manually curated seed alignments and the automatically identified full region hits.

📂 The results are organised in several files in the domains folder. Each file contains seven columns:

  1. Family = Rfam accession (e.g. RF00001)
  2. Domain = Taxonomic domain where the family is found (❕ this is the most important column)
  3. Seed domains = All taxonomic domains from the seed alignment
  4. Full region domains = All taxonomic domains from full region hits
  5. Rfam ID = Rfam identifier (e.g. 5S_rRNA)
  6. Description = Family description
  7. RNA type = One of the Rfam RNA types.

Domain can be:

  • a single domain (for example, Bacteria or Eukaryota) if at least 90% of the hits come from the same domain in both the seed alignment and the full region hits;
  • <seed domain>/<full region domain> if the seed and full region domains differ, in which case both are listed. For example, Viruses/Eukaryota means that the seed alignment contains mostly Viruses and the full region hits contain mostly Eukaryota;
  • Mixed if there is no single domain where the family occurs. For example, 5S rRNA RF00001 is expected to be found in Bacteria, Archaea, and Eukaryota;
  • <seed domain>/Mixed or Mixed/<full region domain> if only one of the two sets has a dominant domain. For example, Bacterial SSU RF00177 has only Bacteria in the seed alignment, but the full region hits also contain Eukaryota because the mitochondrial and plastid SSU is similar to the bacterial SSU and is expected to match the bacterial model.
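
The rules above can be sketched in a few lines of Python. This is only an illustration of the labelling logic as described in this README, not the code used in rfam-taxonomy.py. The function names and the input format (a dictionary mapping each domain to its percentage of hits) are assumptions made for the example.

```python
THRESHOLD = 90.0  # percent; the ">=90%" majority cut-off described above


def dominant_domain(percentages, threshold=THRESHOLD):
    """Return the domain holding at least `threshold` percent of hits, else 'Mixed'.

    `percentages` is a hypothetical input format: {domain name: percent of hits}.
    """
    for domain, percent in percentages.items():
        if percent >= threshold:
            return domain
    return "Mixed"


def classify(seed, full, threshold=THRESHOLD):
    """Combine the seed and full region labels into a Domain column value."""
    seed_label = dominant_domain(seed, threshold)
    full_label = dominant_domain(full, threshold)
    if seed_label == full_label:
        # Covers both the single-domain case and plain "Mixed".
        return seed_label
    # Covers "<seed>/<full>", "<seed>/Mixed" and "Mixed/<full>".
    return f"{seed_label}/{full_label}"


# Made-up percentages, for illustration only.
print(classify({"Bacteria": 100.0}, {"Bacteria": 60.0, "Eukaryota": 40.0}))
```

With the made-up numbers in the last line, the seed alignment is entirely bacterial while no domain reaches 90% of the full region hits, so the family would be labelled Bacteria/Mixed.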

✅ View summary with the number of families observed in each domain.

Retrieving the data

The latest version of the files can be retrieved directly from GitHub using raw file URLs such as https://raw.githubusercontent.com/Rfam/rfam-taxonomy/master/domains/bacteria.csv.

It is also possible to download the data and use it locally or regenerate the files (see the Installation section below).

Example use cases

  • If you are interested in a subset of Rfam families that match Bacteria, you can use the bacteria.csv file. For example, the following command generates a bacteria.cm file with a subset of Rfam covariance models that can be used with the Infernal cmscan program:

    curl https://raw.githubusercontent.com/Rfam/rfam-taxonomy/master/domains/bacteria.csv | \
    cut -f 1 -d ',' | \
    tail -n +2 | \
    cmfetch -o bacteria.cm -f Rfam.cm.gz -
    

    where cmfetch is part of the Infernal suite and Rfam.cm.gz can be downloaded from ftp://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.cm.gz.

  • You can also further process the all-domains.csv file. For example, to eliminate any families that find hits outside Bacteria, keep only the rows where the second column is Bacteria and both the third and the fourth columns contain Bacteria (100.0%). Note that such a strict subset would exclude many important RNA families merely because they detect some contamination in eukaryotic sequences.
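
The stricter filter from the second use case could look roughly like the sketch below. It assumes that the file has a header row, that the columns appear in the order listed above, and that a cell for a fully bacterial family contains the text Bacteria (100.0%). The function name strictly_bacterial is hypothetical.

```python
import csv
import io


def strictly_bacterial(csv_text, domain="Bacteria"):
    """Return accessions whose Domain is `domain` and whose seed and
    full region columns both contain '<domain> (100.0%)'."""
    exact = f"{domain} (100.0%)"  # assumed cell format, taken from the text above
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row
    accessions = []
    for row in reader:
        if len(row) < 4:
            continue  # ignore blank or truncated lines
        family, dom, seed, full = (cell.strip() for cell in row[:4])
        if dom == domain and exact in seed and exact in full:
            accessions.append(family)
    return accessions
```

Called on the text of all-domains.csv, this returns a list of accessions that can be written one per line and passed to cmfetch -f in the same way as in the first use case.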


Installation

Requirements

Clone or download this repository and run the following commands:

virtualenv ENV
source ENV/bin/activate
pip install -r requirements.txt

Updating the data

After each Rfam release, the data in this repo need to be updated locally and pushed to GitHub.

  1. Generate new data

    # when running for the first time (needs to run in this order):
    python rfam-taxonomy.py --precompute-full
    python rfam-taxonomy.py --precompute-seed
    
    # after precompute is done, run:
    python rfam-taxonomy.py
    
    # to see additional options:
    python rfam-taxonomy.py --help
    
  2. Review the changes

    Before committing the new files, manually review the results by using git to check the differences between the old and the new versions.

    It is normal for the values in the 3rd and 4th columns to change, but Domain (the 2nd column) should stay stable unless the affected family has been significantly updated.

  3. Update release info in Readme

  4. Create new GitHub release
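
For the review in step 2, a small helper can list only the families whose Domain value changed between two versions of a file, so that expected changes in the 3rd and 4th columns do not drown them out. This is a sketch and not part of the repository. It assumes a header row and the column order described above, and the function names are hypothetical.

```python
import csv
import io


def domain_column(csv_text):
    """Map each family accession (column 1) to its Domain value (column 2)."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row
    return {row[0]: row[1] for row in reader if len(row) >= 2}


def changed_domains(old_csv_text, new_csv_text):
    """Return {accession: (old_domain, new_domain)} for families present in
    both versions whose Domain value differs."""
    old = domain_column(old_csv_text)
    new = domain_column(new_csv_text)
    return {
        family: (old[family], new[family])
        for family in sorted(old.keys() & new.keys())
        if old[family] != new[family]
    }
```

The two arguments are the contents of the old and the new version of the same file, for example the committed copy and the freshly generated one. Families that were added or removed are deliberately left out, because a plain git diff already shows them.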

Feedback

Feel free to create GitHub issues to ask questions or provide feedback. Pull requests are also welcome.