Author: Matthias Bieg
pyanno is a python package that - given a list of genomic intervals - annotates these intervals to arbitrary genomic intervals of interest, as well as genes. The list of genomic intervals to be annotated, as well as the database interval regions of interest have to be encoded in a bed-like format.
The base file of intervals, that will be annotated against a database of bed like files, that contain genomic intervals of interest has to be itself bed-like. It suffuces that the first three lines follow bed conventions, i.e.
- 1st column: Chromosome
- 2nd column: Start position (0-based)
- 3rd column: End position (1-based)
The base interval entries can in addition contain an arbitrary number of additional columns.
# base.bed
#chrom start end
4 128887787 128887839
4 188862197 188862251
4 185746125 185746231
The datbase file is a tab separated file, containing information about the genomic regions of interest, that shall be annotated to the base file. The database file contains the following information:
- FILENAME: Absolute path to the file
- REGION.TYPE: E.g. protein.coding.genes, Enhancers, ...
- SOURCE: E.g., Cell type from which regions are derived
- ANNOTATION.BY: SOURCE | NAME
- MAX.DISTANCE: Maximal distance between base and database intervall, such that database intervall is anotated to base intervall.
- DISTANCE.TO: The location to which the distance shall be computed. Can be START | END | MID | REGION
- N.HITS: Integer value defining the number elements from the database that shall be annotated to the base
- NAME.COL: If ANNOTATION.BY == NAME, then you can define the column (0-based) in which the name is stored. If NAME.COL == NA, then it is assumed, that the 4th column contains the name.
The first line of the tsv file must contain the above bold column identifiers!
# database.tsv
FILENAME REGION.TYPE SOURCE ANNOTATION.BY MAX.DISTANCE DISTANCE.TO N.HITS NAME.COL
E045_15_coreMarks_dense_7_Enh.bed Enhancer.Roadmap E045 SOURCE 0 REGION 1 NA
E036_15_coreMarks_dense_7_Enh.bed Enhancer.Roadmap E036 SOURCE 0 REGION 1 NA
protein_coding_genes.bed gencode19.protein.coding.TSS gencode.v19 NAME 200000 START 1 6
enhancer_promoter_links_neutrophils.bed PCHiC.neutrophils neutrophils NAME 10000 REGION 1 7
# E045_15_coreMarks_dense_7_Enh.bed
10 179800 180000 7_Enh 0 . 179800 180000 255,255,0
10 182800 183200 7_Enh 0 . 182800 183200 255,255,0
10 265600 266000 7_Enh 0 . 265600 266000 255,255,0
.
.
.
# E036_15_coreMarks_dense_7_Enh.bed
10 132200 132400 7_Enh 0 . 132200 132400 255,255,0
10 133400 133600 7_Enh 0 . 133400 133600 255,255,0
10 152600 153000 7_Enh 0 . 152600 153000 255,255,0
.
.
.
# protein_coding_genes.bed
#chrom start end ensembl.id score strand hugo.name
1 69091 70008 ENSG00000186092.4 NA + OR4F5
1 134901 139379 ENSG00000237683.5 NA - AL627309.1
1 367640 368634 ENSG00000235249.1 NA + OR4F29
.
.
.
# enhancer_promoter_links_neutrophils.bed
#oeChr oeStart oeEnd oeName baitChr baitStart baitEnd baitName
1 1150970 1156235 . 1 850619 874081 AL645608.1;RP11-54O7.3;SAMD11
1 1000704 1005126 . 1 903641 927394 C1orf170;PLEKHN1
1 1150970 1156235 . 1 903641 927394 C1orf170;PLEKHN1
.
.
.
import pyanno
import pandas as pnd
database_filename = "database.tsv"
base_filename = "base.bed"
results_filename = "results.bed"
# Create a new GenomicRegionAnnotator instance
gra = pyanno.Annotator.GenomicRegionAnnotator()
# load base
gra.load_base_from_file(base_filename)
# load database
gra.load_database_from_file(database_filename)
# Annotate base against all database genomic region files
gra.annotate()
# Retrieve annotated base intervals as pandas.DataFrame instance
annotated_base_df = gra.get_base()
# Write annotated base intervals to disk
annotated_base_df.to_csv(results_filename, sep="\t", index=False)
#results.bed
#chrom start end Enhancer.Roadmap gencode19.protein.coding.TSS PCHiC.neutrophils
4 128887787 128887839 E036(0) MFSD8(-637) PGRMC2(0);PGRMC2(1001);PGRMC2(2881);PGRMC2(7442)
4 188862197 188862251 NA ZFP42(54674) NA
4 185746125 185746231 NA ACSL1(-1740) ACSL1(3864)
This package was implemented during my time at Charite, Universitaetsmedizin Berlin, Berlin Institute of Health (BIH) in the Department of Digital Health headed by Prof. Roland Eils.
Copyright (c) 2021, Matthias Bieg
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
-
Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
-
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
-
Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.