Collapsed repeats

kseniakh edited this page Mar 10, 2017 · 1 revision

Collapsed repeat - a deletion, in a query sequence, of one copy of an interspersed repeat that is present in the reference sequence.



Figure 1: Collapsed repeat example. a) The query sequence has the same direction as the reference sequence. b) The query sequence is reverse complemented. Collapsed_repeat, Repeat_q and Repeat_r are similar or near-similar repeats.



Collapsed repeat differences are reported in the query_struct.gff and ref_struct.gff files. The locations of the repeated regions involved in each difference are listed in the query_additional.gff and ref_additional.gff files.

An example of collapsed repeat entries in query_struct.gff, shown alongside the deletion entries reported at the same query positions:

##gff-version 3
##sequence-region	query_1	1	57855
query_1	NucDiff_v2.0	SO:0000159	2515	2515	.	.	.	ID=SV_1;Name=deletion;del_len=80;query_dir=1;ref_sequence=ref_1;ref_coord=2561-2640;color=#0000EE
query_1	NucDiff_v2.0	SO:0000159	2515	2515	.	.	.	ID=SV_2;Name=collapsed_repeat;del_len=5;query_dir=1;ref_sequence=ref_1;ref_coord=2641-2645;query_repeated_region=2511-2515;color=#0000EE
query_1	NucDiff_v2.0	SO:0000159	3523	3523	.	.	.	ID=SV_3;Name=deletion;del_len=147;query_dir=1;ref_sequence=ref_1;ref_coord=3654-3800;color=#0000EE
query_1	NucDiff_v2.0	SO:0000159	3523	3523	.	.	.	ID=SV_4;Name=collapsed_repeat;del_len=8;query_dir=1;ref_sequence=ref_1;ref_coord=3801-3808;query_repeated_region=3516-3523;color=#0000EE
query_1	NucDiff_v2.0	SO:0000159	4525	4525	.	.	.	ID=SV_5;Name=deletion;del_len=250;query_dir=1;ref_sequence=ref_1;ref_coord=4811-5060;color=#0000EE
query_1	NucDiff_v2.0	SO:0000159	4525	4525	.	.	.	ID=SV_6;Name=collapsed_repeat;del_len=5;query_dir=1;ref_sequence=ref_1;ref_coord=5061-5065;query_repeated_region=4521-4525;color=#0000EE



The query_struct.gff file contains the following information (see Figure 1 for notations):

| GFF3 fields | Content | Notes |
| --- | --- | --- |
| col 1 | Query_seq | |
| col 2 | NucDiff_v2.0 | name and current version of the tool |
| col 3 | SO:0000159 | Sequence Ontology accession number corresponding to the "deletion" SO term |
| col 4 | Q_pos | |
| col 5 | Q_pos | |
| col 6 / col 7 / col 8 | . | score/strand/phase fields are not used |
| col 9, ID | "SV_1" | a difference has the same ID in query_struct.gff and ref_struct.gff |
| col 9, Name | "collapsed_repeat" | |
| col 9, del_len | Length(Collapsed_repeat) | |
| col 9, query_dir | "1" or "-1" | "-1" if the duplicated reference fragment should be reverse complemented before its insertion into Query_seq |
| col 9, ref_sequence | Ref_seq | |
| col 9, ref_coord | Del_st - Del_end | |
| col 9, query_repeated_region | St_q - Q_pos | |
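
The field layout above can be read programmatically. The following is a minimal sketch, not part of NucDiff: the helper names `parse_attributes`, `span_length` and `collapsed_repeats` are hypothetical, and the sample text is copied from the query_struct.gff example. It keeps only the collapsed_repeat records and checks that, in the example data, del_len equals the length of ref_coord.

```python
# Sample lines copied from the query_struct.gff example (tab-separated).
SAMPLE = (
    "##gff-version 3\n"
    "query_1\tNucDiff_v2.0\tSO:0000159\t2515\t2515\t.\t.\t.\t"
    "ID=SV_1;Name=deletion;del_len=80;query_dir=1;ref_sequence=ref_1;"
    "ref_coord=2561-2640;color=#0000EE\n"
    "query_1\tNucDiff_v2.0\tSO:0000159\t2515\t2515\t.\t.\t.\t"
    "ID=SV_2;Name=collapsed_repeat;del_len=5;query_dir=1;ref_sequence=ref_1;"
    "ref_coord=2641-2645;query_repeated_region=2511-2515;color=#0000EE\n"
)


def parse_attributes(column9):
    """Split a GFF3 attribute column ("key=value;key=value") into a dict."""
    return dict(item.split("=", 1) for item in column9.split(";") if item)


def span_length(coord):
    """Length of an inclusive "start-end" range such as "2641-2645"."""
    start, end = (int(x) for x in coord.split("-"))
    return end - start + 1


def collapsed_repeats(gff_text):
    """Yield one dict per collapsed_repeat record found in the GFF text."""
    for line in gff_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and GFF directives
        cols = line.split("\t")
        attrs = parse_attributes(cols[8])
        if attrs.get("Name") != "collapsed_repeat":
            continue  # e.g. the plain deletion records
        yield {
            "id": attrs["ID"],
            "query_seq": cols[0],
            "q_pos": int(cols[3]),
            "del_len": int(attrs["del_len"]),
            "query_dir": int(attrs["query_dir"]),
            "ref_seq": attrs["ref_sequence"],
            "ref_coord": attrs["ref_coord"],
            "query_repeated_region": attrs["query_repeated_region"],
        }


records = list(collapsed_repeats(SAMPLE))
for rec in records:
    # Holds for the example data: del_len is the length of ref_coord.
    assert rec["del_len"] == span_length(rec["ref_coord"])
    print(rec["id"], rec["ref_seq"], rec["ref_coord"], rec["del_len"])
```

To process a real file, the text passed to `collapsed_repeats` could be read from disk with `open(path).read()`; the path is left to the reader.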



An example of collapsed repeat entries in ref_struct.gff, shown alongside the adjacent deletion entries:

##gff-version 3
##sequence-region	ref_1	1	75565
ref_1	NucDiff_v2.0	SO:0000159	2561	2640	.	.	.	ID=SV_1;Name=deletion;del_len=80;query_dir=1;query_sequence=query_1;query_coord=2515;color=#0000EE
ref_1	NucDiff_v2.0	SO:0000159	2641	2645	.	.	.	ID=SV_2;Name=collapsed_repeat;del_len=5;query_dir=1;query_sequence=query_1;query_coord=2515;ref_repeated_region=2556-2560;color=#0000EE
ref_1	NucDiff_v2.0	SO:0000159	3654	3800	.	.	.	ID=SV_3;Name=deletion;del_len=147;query_dir=1;query_sequence=query_1;query_coord=3523;color=#0000EE
ref_1	NucDiff_v2.0	SO:0000159	3801	3808	.	.	.	ID=SV_4;Name=collapsed_repeat;del_len=8;query_dir=1;query_sequence=query_1;query_coord=3523;ref_repeated_region=3646-3653;color=#0000EE
ref_1	NucDiff_v2.0	SO:0000159	4811	5060	.	.	.	ID=SV_5;Name=deletion;del_len=250;query_dir=1;query_sequence=query_1;query_coord=4525;color=#0000EE
ref_1	NucDiff_v2.0	SO:0000159	5061	5065	.	.	.	ID=SV_6;Name=collapsed_repeat;del_len=5;query_dir=1;query_sequence=query_1;query_coord=4525;ref_repeated_region=4806-4810;color=#0000EE



The ref_struct.gff file contains the following information (see Figure 1 for notations):

| GFF3 fields | Content | Notes |
| --- | --- | --- |
| col 1 | Ref_seq | |
| col 2 | NucDiff_v2.0 | name and current version of the tool |
| col 3 | SO:0000159 | Sequence Ontology accession number corresponding to the "deletion" SO term |
| col 4 | Del_st | |
| col 5 | Del_end | |
| col 6 / col 7 / col 8 | . | score/strand/phase fields are not used |
| col 9, ID | "SV_1" | a difference has the same ID in ref_struct.gff and query_struct.gff |
| col 9, Name | "collapsed_repeat" | |
| col 9, del_len | Length(Collapsed_repeat) | |
| col 9, query_dir | "1" or "-1" | "-1" if the duplicated reference fragment should be reverse complemented before its insertion into Ref_seq |
| col 9, query_sequence | Query_seq | |
| col 9, query_coord | Q_pos | |
| col 9, ref_repeated_region | St_r - End_r | |
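
Because a difference carries the same ID in both struct files, records can be paired by ID and cross-checked. The sketch below is an illustration only: `parse_record` and `describes_same_difference` are hypothetical helpers, and the sample lines are copied from the two examples. It tests whether the mirrored fields (ref_coord against col 4 and col 5 of ref_struct.gff, query_coord against col 4 of query_struct.gff, and del_len) agree.

```python
# One deletion and one collapsed repeat, copied from the examples.
QUERY_STRUCT = [
    "query_1\tNucDiff_v2.0\tSO:0000159\t2515\t2515\t.\t.\t.\t"
    "ID=SV_1;Name=deletion;del_len=80;query_dir=1;ref_sequence=ref_1;"
    "ref_coord=2561-2640;color=#0000EE",
    "query_1\tNucDiff_v2.0\tSO:0000159\t2515\t2515\t.\t.\t.\t"
    "ID=SV_2;Name=collapsed_repeat;del_len=5;query_dir=1;ref_sequence=ref_1;"
    "ref_coord=2641-2645;query_repeated_region=2511-2515;color=#0000EE",
]
REF_STRUCT = [
    "ref_1\tNucDiff_v2.0\tSO:0000159\t2561\t2640\t.\t.\t.\t"
    "ID=SV_1;Name=deletion;del_len=80;query_dir=1;query_sequence=query_1;"
    "query_coord=2515;color=#0000EE",
    "ref_1\tNucDiff_v2.0\tSO:0000159\t2641\t2645\t.\t.\t.\t"
    "ID=SV_2;Name=collapsed_repeat;del_len=5;query_dir=1;query_sequence=query_1;"
    "query_coord=2515;ref_repeated_region=2556-2560;color=#0000EE",
]


def parse_record(line):
    """Parse one tab-separated GFF3 record into a small dict."""
    cols = line.rstrip("\n").split("\t")
    attrs = dict(item.split("=", 1) for item in cols[8].split(";") if item)
    return {"seq": cols[0], "start": int(cols[3]), "end": int(cols[4]), "attrs": attrs}


def describes_same_difference(q, r):
    """True if a query_struct record and a ref_struct record mirror each other."""
    qa, ra = q["attrs"], r["attrs"]
    return (
        qa["Name"] == ra["Name"]
        and qa["ref_sequence"] == r["seq"]
        and ra["query_sequence"] == q["seq"]
        and qa["ref_coord"] == f'{r["start"]}-{r["end"]}'
        and int(ra["query_coord"]) == q["start"]
        and qa["del_len"] == ra["del_len"]
    )


query_by_id = {rec["attrs"]["ID"]: rec for rec in map(parse_record, QUERY_STRUCT)}
ref_by_id = {rec["attrs"]["ID"]: rec for rec in map(parse_record, REF_STRUCT)}

# IDs present in both files, in a stable order.
shared_ids = sorted(query_by_id.keys() & ref_by_id.keys())
for sv_id in shared_ids:
    print(sv_id, describes_same_difference(query_by_id[sv_id], ref_by_id[sv_id]))
```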



An example with the additional information in query_additional.gff:

##gff-version 3
##sequence-region	query_1	1	57855
query_1	NucDiff_v2.0	SO:0000657	2511	2515	.	.	.	ID=Region_1;Name=Repeated_region;query_repeat_len=5;difference_type=collapsed_repeat;difference_coord_query=2515-2515;difference_len=5
query_1	NucDiff_v2.0	SO:0000657	3516	3523	.	.	.	ID=Region_2;Name=Repeated_region;query_repeat_len=8;difference_type=collapsed_repeat;difference_coord_query=3523-3523;difference_len=8
query_1	NucDiff_v2.0	SO:0000657	4521	4525	.	.	.	ID=Region_3;Name=Repeated_region;query_repeat_len=5;difference_type=collapsed_repeat;difference_coord_query=4525-4525;difference_len=5



The query_additional.gff file contains the following information (see Figure 1 for notations):

| GFF3 fields | Content | Notes |
| --- | --- | --- |
| col 1 | Query_seq | |
| col 2 | NucDiff_v2.0 | name and current version of the tool |
| col 3 | SO:0000657 | Sequence Ontology accession number corresponding to the "repeat_region" SO term |
| col 4 | St_q | |
| col 5 | Q_pos | |
| col 6 / col 7 / col 8 | . | score/strand/phase fields are not used |
| col 9, ID | "Region_1" | IDs in query_additional.gff and ref_additional.gff are independent |
| col 9, Name | "Repeated_region" | |
| col 9, query_repeat_len | Length(Repeat_q) | |
| col 9, difference_type | "collapsed_repeat" | |
| col 9, difference_coord_query | Q_pos - Q_pos | |
| col 9, difference_len | Length(Collapsed_repeat) | |
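
The relations between these fields can be checked on the example entries. This is a small sketch under stated assumptions: `RepeatedRegion` and `parse_region` are hypothetical names, not NucDiff code, and the sample lines are copied from the query_additional.gff example. It confirms that query_repeat_len spans col 4 to col 5 inclusively and that the region ends at the position given in difference_coord_query.

```python
from typing import NamedTuple

# Two records copied from the query_additional.gff example.
QUERY_ADDITIONAL = [
    "query_1\tNucDiff_v2.0\tSO:0000657\t2511\t2515\t.\t.\t.\t"
    "ID=Region_1;Name=Repeated_region;query_repeat_len=5;"
    "difference_type=collapsed_repeat;difference_coord_query=2515-2515;difference_len=5",
    "query_1\tNucDiff_v2.0\tSO:0000657\t3516\t3523\t.\t.\t.\t"
    "ID=Region_2;Name=Repeated_region;query_repeat_len=8;"
    "difference_type=collapsed_repeat;difference_coord_query=3523-3523;difference_len=8",
]


class RepeatedRegion(NamedTuple):
    region_id: str
    start: int           # St_q (col 4)
    end: int             # Q_pos (col 5)
    repeat_len: int      # Length(Repeat_q)
    difference_pos: int  # Q_pos taken from difference_coord_query
    difference_len: int  # Length(Collapsed_repeat)


def parse_region(line):
    """Parse one query_additional.gff record of a collapsed repeat."""
    cols = line.split("\t")
    attrs = dict(item.split("=", 1) for item in cols[8].split(";") if item)
    diff_start, diff_end = (int(x) for x in attrs["difference_coord_query"].split("-"))
    if diff_start != diff_end:
        # For collapsed repeats the table gives "Q_pos - Q_pos".
        raise ValueError("expected a single query position")
    return RepeatedRegion(
        attrs["ID"],
        int(cols[3]),
        int(cols[4]),
        int(attrs["query_repeat_len"]),
        diff_start,
        int(attrs["difference_len"]),
    )


regions = [parse_region(line) for line in QUERY_ADDITIONAL]
for region in regions:
    # Inclusive coordinates: length = end - start + 1.
    assert region.end - region.start + 1 == region.repeat_len
    # The repeated region ends at the position of the difference.
    assert region.end == region.difference_pos
    print(region)
```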



An example with the additional information in ref_additional.gff:

##gff-version 3
##sequence-region	ref_1	1	75565
ref_1	NucDiff_v2.0	SO:0000657	2556	2560	.	.	.	ID=Region_1;Name=Repeated_region;ref_repeat_len=5;difference_type=collapsed_repeat;difference_coord_ref=2641-2645;difference_len=5;color=#DB0101
ref_1	NucDiff_v2.0	SO:0000657	3646	3653	.	.	.	ID=Region_2;Name=Repeated_region;ref_repeat_len=8;difference_type=collapsed_repeat;difference_coord_ref=3801-3808;difference_len=8;color=#DB0101
ref_1	NucDiff_v2.0	SO:0000657	4806	4810	.	.	.	ID=Region_3;Name=Repeated_region;ref_repeat_len=5;difference_type=collapsed_repeat;difference_coord_ref=5061-5065;difference_len=5;color=#DB0101



The ref_additional.gff file contains the following information (see Figure 1 for notations):

| GFF3 fields | Content | Notes |
| --- | --- | --- |
| col 1 | Ref_seq | |
| col 2 | NucDiff_v2.0 | name and current version of the tool |
| col 3 | SO:0000657 | Sequence Ontology accession number corresponding to the "repeat_region" SO term |
| col 4 | St_r | |
| col 5 | End_r | |
| col 6 / col 7 / col 8 | . | score/strand/phase fields are not used |
| col 9, ID | "Region_1" | IDs in query_additional.gff and ref_additional.gff are independent |
| col 9, Name | "Repeated_region" | |
| col 9, ref_repeat_len | Length(Repeat_r) | |
| col 9, difference_type | "collapsed_repeat" | |
| col 9, difference_coord_ref | Del_st - Del_end | |
| col 9, difference_len | Length(Collapsed_repeat) | |
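
The Region IDs in the additional files are not the SV IDs of the struct files, so a repeated region cannot be matched to its difference by ID. One possible way to relate them, shown here as an assumption rather than a documented NucDiff feature, is to match difference_type and difference_coord_ref against the Name and the col 4 to col 5 range of ref_struct.gff. The helper names `split_record`, `index_differences` and `link_regions` are hypothetical; the sample lines are copied from the examples.

```python
# Records copied from the ref_struct.gff and ref_additional.gff examples.
REF_STRUCT = [
    "ref_1\tNucDiff_v2.0\tSO:0000159\t2561\t2640\t.\t.\t.\t"
    "ID=SV_1;Name=deletion;del_len=80;query_dir=1;query_sequence=query_1;"
    "query_coord=2515;color=#0000EE",
    "ref_1\tNucDiff_v2.0\tSO:0000159\t2641\t2645\t.\t.\t.\t"
    "ID=SV_2;Name=collapsed_repeat;del_len=5;query_dir=1;query_sequence=query_1;"
    "query_coord=2515;ref_repeated_region=2556-2560;color=#0000EE",
]
REF_ADDITIONAL = [
    "ref_1\tNucDiff_v2.0\tSO:0000657\t2556\t2560\t.\t.\t.\t"
    "ID=Region_1;Name=Repeated_region;ref_repeat_len=5;"
    "difference_type=collapsed_repeat;difference_coord_ref=2641-2645;"
    "difference_len=5;color=#DB0101",
]


def split_record(line):
    """Return (columns, attribute dict) for one tab-separated GFF3 record."""
    cols = line.split("\t")
    attrs = dict(item.split("=", 1) for item in cols[8].split(";") if item)
    return cols, attrs


def index_differences(struct_lines):
    """Map (sequence, difference name, "start-end") to the SV ID."""
    index = {}
    for line in struct_lines:
        cols, attrs = split_record(line)
        index[(cols[0], attrs["Name"], f"{cols[3]}-{cols[4]}")] = attrs["ID"]
    return index


def link_regions(additional_lines, struct_index):
    """Map each Region ID to the SV ID it refers to (None if not found)."""
    links = {}
    for line in additional_lines:
        cols, attrs = split_record(line)
        key = (cols[0], attrs["difference_type"], attrs["difference_coord_ref"])
        links[attrs["ID"]] = struct_index.get(key)
    return links


links = link_regions(REF_ADDITIONAL, index_differences(REF_STRUCT))
print(links)
```

Keying on the sequence name, difference type and coordinates together keeps the lookup unambiguous when, as in the examples, a deletion and a collapsed repeat sit next to each other.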