generated from Leslie-C/Deduper
-
Notifications
You must be signed in to change notification settings - Fork 0
/
pseudocode_deduper
74 lines (54 loc) · 2.4 KB
/
pseudocode_deduper
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
DEDUPER PSEUDOCODE
LOCATION
/home/ano/bioinfo/Bi624/Deduper-Dremtz
INPUT
SAM file
OUTPUT: all SAM files
if computational time
1. deduplicated
2. duplicates that were removed from output1
3. umatched/error UMIs
GOAL
Remove all PCR duplicates from SAM file leaving only copy remaining in the SAM file
SAM headers
RNAME (col 3) -> chromosome
POS (col 4) -> position
FLAG (col 2) -> strandedness
CIGAR (col 6) -> soft clipping
QNAME (col 1) -> UMI
PIPELINE OVERVIEW
After reading in files, UMI list and opening a loop
BASH
sort by position start. ###I think this will correctly sort reverse strands also but check (use samtools for this?)
PYTHON
###Dre, is this the most efficient order?
1. IF Sequence matches or is a close match to an UMI (QNAME field) then don't return,
DO return WANTED sequences
2 IF clipping happened (CIGAR str in bitwise flag #)
THEN ignore clipped sections and compare Position and Sequence from after clipped section ###Dre: How do I do this?
3. IF Chromosome Name (SAM field 3, CHROMNAME)
AND IF forward strand (SAM field 2)
AND left most Position matches (SAM field 4, POS)
AND sequence MATCHES
THEN write a single sequence to new file
OR reverse stranded (SAM field 4, POS)
AND right most Position matches ###DRE how do I check for this?
AND sequence MATCHES
THEN write a single sequence to new file
4. RETURN list that does not have any of the excluded sequences from above
FUNCTIONS:
Function 1: maybe write a function to grab just the fields we want from SAM
put these in to a dictionary with key as UMI
Function 2: adjust position based off of clipping number
Function 3: if reverse compliment, then account for indels
maybe make this an addon to function 2
UNIT TESTS:
/home/ano/bioinfo/Bi624/Deduper-Dremtz/unit_tests
#delete in unit test descriptions before running UT's
EXPECTED OUTPUT = 7 Lines
2 unmatched chromname same everything else
1 clipped file that matches another file but is clipped 2 spaces
1 matching fwd strand w/ same starting Positions
2 fwd compliment with different starting positions
1 rev compliments w/ same ending position and sequence
= 7 total lines