Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ps merging annotating genecounts #15

Open
wants to merge 84 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
84 commits
Select commit Hold shift + click to select a range
68e96d6
add intiial code for making annotations object
preetisi Jun 16, 2022
c5df8a7
get intron to splice site annotations from gene model
preetisi Jun 28, 2022
d6d45e7
find unique exon-exon junctions from all sample files
preetisi Jun 30, 2022
45551b7
full exon matches
preetisi Jun 30, 2022
a8f752c
sparse matrix initial code
preetisi Jul 6, 2022
c523742
updated gitignore, get the unique junctions from all the exon junctio…
preetisi Jul 6, 2022
5c84d59
get all unique junctions of all samples and add them to a single data…
Jul 6, 2022
95ec410
calculate total junction counts by parsing sample files
Jul 7, 2022
d814dd2
divide the input sample files in chunk and multiprocess them in chunk…
Jul 8, 2022
a497275
Merge branch 'master' of https://github.com/SalomonisLab/altanalyze3 …
Jul 13, 2022
f7b2d13
intermittent code for sparse matrix
Jul 14, 2022
679ede6
working sparse matrix from two samples
Jul 14, 2022
723cdd4
sparse matrix function
Jul 14, 2022
b38af08
sparse matrix using dictionary of keys
Jul 15, 2022
eb934b2
dok_matrix to csr matrix
Jul 19, 2022
c2e77d7
read multiple files from a directory
Jul 21, 2022
05fde08
working version of multiprocessing
Jul 21, 2022
cf70761
new main file
Jul 21, 2022
8ad3a23
parse each file and make a sparse matrix
Jul 26, 2022
9652ef4
remove the chromosome guard
Jul 26, 2022
2ed9347
ignore the test data folder
Jul 26, 2022
221e469
read files using pandaframes
Jul 28, 2022
1ca86e2
created a giant dictionary where key is junction coordinate and it's …
Jul 28, 2022
e76c477
check if the junction already exists in the dictionary
Jul 28, 2022
bf33f53
get junction annotations from gene model
Jul 28, 2022
a8ce4b5
working version - got splice site annotations from gene model
Jul 28, 2022
77d4b6f
imported the junctions from bam derived text file
Jul 28, 2022
40face6
get junction annotations
Jul 28, 2022
d89f06a
getting splice annotation from junction file and gene model file
Aug 1, 2022
f2ea773
add log statement
Aug 2, 2022
4ede6a7
sparse matrix from giant dictionary
Aug 2, 2022
b3c8f8c
working version of making dok_matrix
Aug 2, 2022
ae7e743
indexing of dok_matrix fixed
Aug 2, 2022
2edcd72
sparse matrix printed as dense matrix for testing
Aug 4, 2022
467b3ea
old main
Aug 4, 2022
57485ef
fix the way we are storing exon annotations, fix the if statements in…
Aug 5, 2022
96e2e8a
fix - junction coordinates should be added for all junction files
Aug 5, 2022
383096e
check if coordinate is in range for both positive and negative strand
Aug 5, 2022
f434217
ignore the tsv files
Aug 5, 2022
6a0087b
make a separate class for collecting the data
Aug 5, 2022
0bb5f7f
remove the dead code
Aug 5, 2022
cc600dd
added classes
Aug 9, 2022
50f8978
memory efficent way for reading the junctions from gene model and cre…
Aug 10, 2022
bef70bc
remove dead code
Aug 10, 2022
8ccacc7
add print statements
Aug 11, 2022
4e710a7
initial commits for splice annotation
Aug 12, 2022
ce75956
check if start and stop annotations are not None and then find the sp…
Aug 14, 2022
b490881
get all the exons for a given gene id
Aug 14, 2022
a09a494
got the exons for candidate gene
Aug 15, 2022
41c3623
working version of getting the start and stop annotations
Aug 15, 2022
e26ab14
splice annotations working
Aug 16, 2022
76f8030
working version of splice site annotations
Aug 17, 2022
c20e286
annotations writing to a txt file
Aug 17, 2022
8b1e22b
correct annotations when both start and stop coordinates of a junctio…
Aug 18, 2022
87198fc
working if else statements for checking if junction exists in the gen…
Aug 18, 2022
e90af2a
fix if statement
Aug 18, 2022
9b60784
short circuiting in if else fixed
Aug 18, 2022
8a57f45
start annotation getting annotated correctly when stop target tuple d…
Aug 18, 2022
f325a40
working annotations
Aug 18, 2022
fad419d
negative and positive both strand working
Aug 18, 2022
194d7a5
exon id of stop coordinate should be first when strand is nrgative
Aug 18, 2022
439d19f
U_ annotations coorected
Aug 22, 2022
ad5757d
getting coorect annotation whene splice sites are not in same gene
Aug 23, 2022
7dce938
all annotations working correctly
Aug 31, 2022
da3feee
remove unnecessary print statements
Aug 31, 2022
1758f19
refactoring #1 - removing variables which are not needed
Aug 31, 2022
7a04082
bug -fix - fixed the infinite loop while finding the annotation, refa…
Aug 31, 2022
f8ce038
benchmark with timeit without indexing junction files"
Sep 1, 2022
71b568f
refactoring #3 - add logging statements, cleaner way of string concat…
Sep 1, 2022
34e10ee
benchmarking and adding multiprocessing
Sep 2, 2022
0eb7d1c
optimised gene model file reading - removed pandas, instead using pyt…
Sep 8, 2022
fa1ccf4
normalize data types of junction file and gene model
Sep 9, 2022
5583806
fix data type issues
Sep 9, 2022
1b8bcad
optimised when both start and stop are present - refactoring #4
Sep 9, 2022
28ebec5
refactoring #5 - reduced the number of if else statements
Sep 9, 2022
756d5ad
refatored splice site annotation function
Sep 14, 2022
91e47cc
reading junctions using pandas - for some reason it was faster than p…
Sep 15, 2022
7549dd9
refactored the annotation function
Oct 28, 2022
4b39404
sorting function
Oct 28, 2022
6752b37
removed the old main.py, added output annotations file to docs folder
Oct 28, 2022
00af13b
resolve conflict
Nov 2, 2022
2227664
preliminary attempt to sort junctions
Nov 2, 2022
7df7ce2
remove dead code
Nov 2, 2022
909c0b4
close the pool processes
Nov 2, 2022
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 6 additions & 1 deletion .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@
__pycache__/
*.py[cod]
*$py.class
.vscode/

# C extensions
*.so
Expand Down Expand Up @@ -43,4 +44,8 @@ cov.xml
nosetests.xml
coverage.xml
*,cover
.hypothesis/
.hypothesis/
../tests/
../../../tests/hg19_ref.tsv
../../../tests/hg19_ref.tsv.gz
../../../tests/hg19_ref_introns.bed.gz
Empty file.
816,035 changes: 816,035 additions & 0 deletions altanalyze3/components/junction_annotations/annotations_all.txt

Large diffs are not rendered by default.

234 changes: 234 additions & 0 deletions altanalyze3/components/junction_annotations/main.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,234 @@
from xml.dom.minidom import TypeInfo
import pandas as pd
import os
import timeit
import logging
import multiprocessing as mp
import gzip
# Parallel Computing

from joblib import Parallel, delayed
from tqdm.notebook import tqdm

class JunctionAnnotation:

def __init__(self):
self.junction_coordinate_BAM_dict = {}
self.gene_model_dict = {}
self.gene_model_exon_dict = {}

def each_junction_annotation(self,junction_dir, gene_model_all):
'''
Input args: path to all BAM derived sample txt files, gene_model_all txt file(hg38)
Output: map of annotations for a given junction
'''
junction_files = os.listdir(junction_dir)
self.generate_gene_model_dict(gene_model_all)
annotation_keys = []
annotations = []

for junction_file in junction_files:
if not junction_file.startswith('.'):
logging.info("Reading junction file " + ' ' + junction_file)
with open(junction_dir + junction_file) as f:
junction_df = pd.read_csv(f,sep='\t',header=None,
names=["chr", "start", "stop", "annotation", "splice_count"])

for idx,row in junction_df.iterrows():
chr = row.chr
start = row.start
stop = int(row.stop)

start_tar_tup = (str(row.chr), int(row.start))
stop_tar_tup = (str(chr), int(stop) + 1)
annotation_key = ''.join(['chr', str(chr), ':', str(start),'-',str(stop + 1)])

if start_tar_tup in self.gene_model_dict and stop_tar_tup in self.gene_model_dict == None:
continue
elif self.gene_model_dict.get(start_tar_tup) != None and self.gene_model_dict.get(stop_tar_tup) != None:
# both start and stop annotation are present in gene model
gene_id = ''
if self.gene_model_dict[start_tar_tup]['gene_id'] == self.gene_model_dict[stop_tar_tup]['gene_id']:
gene_id = self.gene_model_dict[(start_tar_tup)]['gene_id']

if(self.gene_model_dict[start_tar_tup]['strand'] == '-'):
annotation = ''.join([gene_id, ':', self.gene_model_dict[stop_tar_tup]['exon_region_id'], '-', self.gene_model_dict[start_tar_tup]['exon_region_id']])
else:
annotation = ''.join([gene_id,':',self.gene_model_dict[start_tar_tup]['exon_region_id'],'-', self.gene_model_dict[start_tar_tup]['exon_region_id']])
else:
start_gene_id = self.gene_model_dict[(start_tar_tup)]['gene_id']
stop_gene_id = self.gene_model_dict[(stop_tar_tup)]['gene_id']
annotation = ''.join([start_gene_id, ':', self.gene_model_dict[(start_tar_tup)]['exon_region_id'], '-', stop_gene_id, ":",self.gene_model_dict[(stop_tar_tup)]['exon_region_id']])

annotations.append(annotation)
annotation_keys.append(annotation_key)

elif self.gene_model_dict.get(start_tar_tup) != None and self.gene_model_dict.get(stop_tar_tup) == None:
# if start is present in gene model but stop is not
annotated_splice_site = self.annotate_splice_site(chr = chr, junction_coordinate = stop_tar_tup[1],
candidate_gene = self.gene_model_dict[start_tar_tup]['gene_id'], exon_region_id = self.gene_model_dict[start_tar_tup]['exon_region_id'], strand = self.gene_model_dict[start_tar_tup]['strand'])
annotation = ''.join([self.gene_model_dict[start_tar_tup]['gene_id'],':', self.gene_model_dict[start_tar_tup]['exon_region_id'],'-', annotated_splice_site])

annotations.append(annotation)
annotation_keys.append(annotation_key)

elif self.gene_model_dict.get(start_tar_tup) == None and self.gene_model_dict.get(stop_tar_tup) != None:
# if start is not found but stop exists in gene model
annotated_splice_site = self.annotate_splice_site(chr = chr, junction_coordinate = int(start),
candidate_gene = self.gene_model_dict[stop_tar_tup]['gene_id'], exon_region_id = self.gene_model_dict[stop_tar_tup]['exon_region_id'], strand = self.gene_model_dict[stop_tar_tup]['strand'])

annotations.append(''.join([self.gene_model_dict[stop_tar_tup]['gene_id'],':', self.gene_model_dict[stop_tar_tup]['exon_region_id'], '-', annotated_splice_site]))
annotation_keys.append(annotation_key)

self.collect_results(annotation_keys, annotations)


def collect_results(self, annotation_keys, annotations):
data = { 'annotation_key':annotation_keys,'annotations':annotations}
df1 = pd.DataFrame(data)
df1.to_csv('annotations_all.txt',sep='\t')


def annotate_splice_site(self,chr,junction_coordinate,candidate_gene, exon_region_id, strand):
annotated_splice_site = ''
buffer = 0
ref_gene_start = self.gene_model_exon_dict[candidate_gene][0]['start']
ref_gene_stop = self.gene_model_exon_dict[candidate_gene][-1]['stop']
status = self.coordinate_in_range(junction_coordinate, ref_gene_start,ref_gene_stop, buffer = 0, strand=strand)

if status == True:
# preset when initially looking (always false to start with)
candidate_found = False

for ea in self.gene_model_exon_dict[candidate_gene]:
exon_start = ea['start']
exon_stop = ea['stop']
exon_id = ea['exon_region_id']
if 'I' in exon_id:
intron_status = True
buffer = 0
else:
intron_status = False
buffer = 50
status = self.coordinate_in_range(junction_coordinate, exon_start, exon_stop, buffer, strand)
if(status):
annotated_splice_site = ''.join([str(exon_id), "_", str(junction_coordinate)])

if intron_status == False:
return annotated_splice_site
else:
candidate_found = True
else:
if candidate_found:
return annotated_splice_site
else:
status = self.coordinate_in_range(junction_coordinate, ref_gene_start,ref_gene_stop, buffer = 20000, strand = strand)
region_numbers = []
for ea in self.gene_model_exon_dict[candidate_gene]:
region_numbers.append(int(str.split(ea['exon_region_id'][1:],'.')[0]))

if status == True:
# applicable to 3'UTR
if(strand == '+'):
annotated_splice_site = ''.join(['U', str(region_numbers[-1]),'.1_', str(junction_coordinate)])
# applicable to 5'UTR
else:
annotated_splice_site = ''.join(['U0.1', '_', str(junction_coordinate)])
else:
# iterate over all the genes on same chromosome and find which junction in gene model
# is nearest to this coordinate
for (chrom, junction), value in self.gene_model_dict.items():
if chr == chrom:
if int(junction_coordinate) >= int(self.gene_model_dict[(chrom,junction)]['start']) and int(junction_coordinate) <= int(self.gene_model_dict[(chrom,junction)]['stop']):
annotated_splice_site = ''.join([self.gene_model_dict[(chrom,junction)]['gene_id'], ':', self.gene_model_dict[(chrom,junction)]['exon_region_id'], '_', str(junction_coordinate)])
else:
#print("none of the conditions met, second junction not even on same chromosome or smaller/bigger - may be its negative strand??")
continue
else:
continue

return annotated_splice_site

def coordinate_in_range(self,junction_coordinate, ref_gene_start,ref_gene_stop, buffer, strand):
#needs refactoring
if ref_gene_start <= ref_gene_stop or strand == '+':
start = int(ref_gene_start) - buffer
stop = int(ref_gene_stop) + buffer
else:
start = int(ref_gene_start) + buffer
stop = int(ref_gene_stop) - buffer

if int(start) <= int(junction_coordinate) <= int(stop):
return True
else:
return False


def generate_gene_model_dict(self,gene_model_all):
'''
This helper function takes gene reference file (generated from Frank's gene_model)
and creates a map of each junction coordinate which will be later used to find splice
site annotations for junctions in sample files.

Args: gene_model.txt file
Output:
1. gene_model dictionary where key is (chr, coordinate) and value has gene_id, exon_region_id, strand
2. gene_model_exon dictionary where key is (gene_id)
'''

logging.info("Generating reference junction dictionary from gene model(hg38).....")
starttime = timeit.default_timer()
print("The start time is :",starttime)

with open(gene_model_all) as f:
for line in f:
(gene_id,chr,strand,exon_region_id, start,stop,annotation) = line.split('\t')

self.gene_model_dict[(chr,int(start))] = {'gene_id':gene_id, 'exon_region_id':exon_region_id, 'start':start, 'stop':stop,'strand':strand}
self.gene_model_dict[(chr,int(stop))] = {'gene_id':gene_id, 'exon_region_id':exon_region_id, 'start':start, 'stop':stop,'strand':strand}
ea = {'exon_region_id':exon_region_id, 'start':start, 'stop':stop,'strand':strand, 'chr':chr}
#ea store as map - more efficient than list.
if(gene_id in self.gene_model_exon_dict):
self.gene_model_exon_dict[gene_id].append(ea)
else:
self.gene_model_exon_dict[gene_id] = [ea]
# print(self.gene_model_dict[('X', 103707210)])
logging.info("Gene Model dictionary generated")
print("The time difference is :", timeit.default_timer() - starttime)


# each_junction_annotation() needs to be modified to take only chr,start,stop.
# This code is assuming that junction files are indexed by chr,start,stop


def parallelize_annotations_process(self,junction_dir, gene_model_all,cpu_pool_count):
pool = mp.Pool(cpu_pool_count)
junction_files = os.listdir(junction_dir)
self.generate_gene_model_dict(gene_model_all)
annotation_keys = []
annotations = []
for junction_file in junction_files:
if not junction_file.startswith('.'):
logging.info("Reading junction file " + ' ' + junction_file)
with open(junction_dir + junction_file) as f:
for i, row in enumerate(f):
pool.apply_async(self.each_junction_annotation, args = (i,row,start,stop), callback=self.collect_results(annotation_keys,annotations))
pool.close()
pool.join()


if __name__ == '__main__':
cpu_pool_count = mp.cpu_count()
logging.basicConfig(level=logging.INFO)
starttime = timeit.default_timer()
print("The start time is :",starttime)
gene_model_all = '/Users/sin9gp/altanalyze3/tests/data/gene_model_all.txt'
junction_dir = '/Users/sin9gp/altanalyze3/tests/data/junction_dir/'

junction_annot = JunctionAnnotation()

junction_annot.each_junction_annotation(junction_dir=junction_dir,gene_model_all=gene_model_all)
#To-do Parallelize code
#junction_annot.parallelize_annotations_process(junction_dir, gene_model_all,cpu_pool_count)
print("The time difference is :", timeit.default_timer() - starttime)

Loading