Tandem duplications (case 2)

kseniakh edited this page Mar 10, 2017 · 1 revision

A tandem duplication is an insertion of an extra copy of a reference sequence region directly adjacent to that region in the query sequence.

Figure 1: Tandem duplication example (case 2). C and C*, K and K*, Repeat_r_1 and Repeat_q_1, and Repeat_r_2 and Repeat_q_2 are pairs of similar or nearly similar repeats.

Tandem duplication differences are reported in the query_struct.gff and ref_struct.gff files. The locations of the repeated regions involved in each difference are reported in the query_additional.gff and ref_additional.gff files.

Example tandem duplication entries in query_struct.gff:

##gff-version 3
##sequence-region	query_1	1	93275
query_1	NucDiff_v2.0	SO:1000173	2691	2775	.	.	.	ID=SV_1;Name=tandem_duplication;ins_len=85;query_dir=1;ref_sequence=ref_1;ref_coord=2645;query_repeated_region_1=2601-2690;query_repeated_region_2=2686-2690;color=#EE0000
query_1	NucDiff_v2.0	SO:1000173	3936	4090	.	.	.	ID=SV_2;Name=tandem_duplication;ins_len=155;query_dir=1;ref_sequence=ref_1;ref_coord=3805;query_repeated_region_1=3776-3935;query_repeated_region_2=3931-3935;color=#EE0000
query_1	NucDiff_v2.0	SO:1000173	5351	5605	.	.	.	ID=SV_3;Name=tandem_duplication;ins_len=255;query_dir=1;ref_sequence=ref_1;ref_coord=5065;query_repeated_region_1=5090-5350;query_repeated_region_2=5345-5350;color=#EE0000

The query_struct.gff file contains the following information (see Figure 1 for the notation):

| GFF3 fields | Content | Notes |
|---|---|---|
| col 1 | Query_seq | |
| col 2 | NucDiff_v2.0 | name and current version of the tool |
| col 3 | SO:1000173 | Sequence Ontology accession number corresponding to the "tandem_duplication" SO term |
| col 4 | Ins_st | |
| col 5 | Ins_end | |
| col 6/col 7/col 8 | . | score/strand/phase fields are not used |
| col 9, ID | "SV_1" | ID in query_struct.gff is equal to ID in ref_struct.gff |
| col 9, Name | "tandem_duplication" | |
| col 9, ins_len | Length(Tandem_duplication) | |
| col 9, query_dir | "1" or "-1" | -1 if the duplicated tandem unit should be reverse complemented before its insertion into Ref_seq |
| col 9, ref_sequence | Ref_seq | |
| col 9, ref_coord | Ref_pos | |
| col 9, query_repeated_region_1 | St_q_1 - End_q | |
| col 9, query_repeated_region_2 | St_q_2 - End_q | |
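
To make the column layout concrete, here is a minimal Python sketch that splits one query_struct.gff entry into its fixed columns and its col 9 attributes. The helper name `parse_gff3_line` and the returned dictionary keys are illustrative assumptions, not part of NucDiff. The sketch also checks that `ins_len` equals `Ins_end - Ins_st + 1`, which holds for the example entries above.

```python
def parse_gff3_line(line):
    """Split one GFF3 feature line into fixed columns plus a dict of col 9 attributes.

    Hypothetical helper for illustration; NucDiff does not ship this function.
    """
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    # col 9 is a ';'-separated list of key=value pairs
    attrs = dict(pair.split("=", 1) for pair in cols[8].split(";") if pair)
    return {
        "seqid": cols[0],       # col 1: Query_seq
        "source": cols[1],      # col 2: tool name and version
        "type": cols[2],        # col 3: SO accession
        "start": int(cols[3]),  # col 4: Ins_st
        "end": int(cols[4]),    # col 5: Ins_end
        "attributes": attrs,
    }


# First entry of the query_struct.gff example (tabs written as \t).
line = (
    "query_1\tNucDiff_v2.0\tSO:1000173\t2691\t2775\t.\t.\t.\t"
    "ID=SV_1;Name=tandem_duplication;ins_len=85;query_dir=1;"
    "ref_sequence=ref_1;ref_coord=2645;"
    "query_repeated_region_1=2601-2690;query_repeated_region_2=2686-2690;"
    "color=#EE0000"
)

rec = parse_gff3_line(line)
# Length of the inserted copy, computed from col 4 and col 5 (inclusive coordinates).
span = rec["end"] - rec["start"] + 1
print(rec["attributes"]["Name"], span, rec["attributes"]["ins_len"])
```

For this entry the computed span and the reported `ins_len` are both 85.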

Example tandem duplication entries in ref_struct.gff:

##gff-version 3
##sequence-region	ref_1	1	75565
ref_1	NucDiff_v2.0	SO:1000173	2645	2645	.	.	.	ID=SV_1;Name=tandem_duplication;ins_len=85;query_dir=1;query_sequence=query_1;query_coord=2691-2775;ref_repeated_region_1=2556-2645;ref_repeated_region_2=2641-2645;color=#EE0000
ref_1	NucDiff_v2.0	SO:1000173	3805	3805	.	.	.	ID=SV_2;Name=tandem_duplication;ins_len=155;query_dir=1;query_sequence=query_1;query_coord=3936-4090;ref_repeated_region_1=3646-3805;ref_repeated_region_2=3801-3805;color=#EE0000
ref_1	NucDiff_v2.0	SO:1000173	5065	5065	.	.	.	ID=SV_3;Name=tandem_duplication;ins_len=255;query_dir=1;query_sequence=query_1;query_coord=5351-5605;ref_repeated_region_1=4805-5065;ref_repeated_region_2=5060-5065;color=#EE0000

The ref_struct.gff file contains the following information (see Figure 1 for the notation):

| GFF3 fields | Content | Notes |
|---|---|---|
| col 1 | Ref_seq | |
| col 2 | NucDiff_v2.0 | name and current version of the tool |
| col 3 | SO:1000173 | Sequence Ontology accession number corresponding to the "tandem_duplication" SO term |
| col 4 | Ref_pos | |
| col 5 | Ref_pos | |
| col 6/col 7/col 8 | . | score/strand/phase fields are not used |
| col 9, ID | "SV_1" | ID in ref_struct.gff is equal to ID in query_struct.gff |
| col 9, Name | "tandem_duplication" | |
| col 9, ins_len | Length(Tandem_duplication) | |
| col 9, query_dir | "1" or "-1" | -1 if the duplicated fragment should be reverse complemented before its insertion into Ref_seq |
| col 9, query_sequence | Query_seq | |
| col 9, query_coord | Ins_st - Ins_end | |
| col 9, ref_repeated_region_1 | St_r_1 - Ref_pos | |
| col 9, ref_repeated_region_2 | St_r_2 - Ref_pos | |
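
Because the same difference carries the same ID in both struct files, the two entries can be cross-checked against each other. The sketch below, with the hypothetical helpers `parse` and `is_consistent_pair`, compares the SV_2 entries from the two examples: the IDs match, `ref_coord` in the query entry equals col 4 and col 5 of the reference entry, and `query_coord` in the reference entry equals col 4 and col 5 of the query entry. These helpers are an illustration under those assumptions, not NucDiff code.

```python
def parse(line):
    """Return (seqid, start, end, attributes) for one GFF3 feature line."""
    cols = line.rstrip("\n").split("\t")
    attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if kv)
    return cols[0], int(cols[3]), int(cols[4]), attrs


def is_consistent_pair(query_line, ref_line):
    """Check that a query_struct.gff entry and a ref_struct.gff entry describe the same difference."""
    q_seq, q_st, q_end, q = parse(query_line)
    r_seq, r_st, r_end, r = parse(ref_line)
    return (
        q["ID"] == r["ID"]                          # shared ID across the two files
        and q["ref_sequence"] == r_seq              # query entry points at this reference
        and r["query_sequence"] == q_seq            # reference entry points at this query
        and int(q["ref_coord"]) == r_st == r_end    # Ref_pos in both files
        and r["query_coord"] == f"{q_st}-{q_end}"   # Ins_st - Ins_end in both files
        and q["ins_len"] == r["ins_len"]
    )


# SV_2 entries from the query_struct.gff and ref_struct.gff examples (tabs written as \t).
query_line = (
    "query_1\tNucDiff_v2.0\tSO:1000173\t3936\t4090\t.\t.\t.\t"
    "ID=SV_2;Name=tandem_duplication;ins_len=155;query_dir=1;"
    "ref_sequence=ref_1;ref_coord=3805;"
    "query_repeated_region_1=3776-3935;query_repeated_region_2=3931-3935;color=#EE0000"
)
ref_line = (
    "ref_1\tNucDiff_v2.0\tSO:1000173\t3805\t3805\t.\t.\t.\t"
    "ID=SV_2;Name=tandem_duplication;ins_len=155;query_dir=1;"
    "query_sequence=query_1;query_coord=3936-4090;"
    "ref_repeated_region_1=3646-3805;ref_repeated_region_2=3801-3805;color=#EE0000"
)

print(is_consistent_pair(query_line, ref_line))
```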

Example repeated-region entries in query_additional.gff:

##gff-version 3
##sequence-region	query_1	1	93275
query_1	NucDiff_v2.0	SO:0000657	2601	2690	.	.	.	ID=Region_1;Name=Repeated_region;query_repeat_len=90;difference_type=tandem_duplication;difference_coord_query=2691-2775;difference_len=85
query_1	NucDiff_v2.0	SO:0000657	2686	2690	.	.	.	ID=Region_2;Name=Repeated_region;query_repeat_len=5;difference_type=tandem_duplication;difference_coord_query=2691-2775;difference_len=85
query_1	NucDiff_v2.0	SO:0000657	3776	3935	.	.	.	ID=Region_3;Name=Repeated_region;query_repeat_len=160;difference_type=tandem_duplication;difference_coord_query=3936-4090;difference_len=155
query_1	NucDiff_v2.0	SO:0000657	3931	3935	.	.	.	ID=Region_4;Name=Repeated_region;query_repeat_len=5;difference_type=tandem_duplication;difference_coord_query=3936-4090;difference_len=155
query_1	NucDiff_v2.0	SO:0000657	5090	5350	.	.	.	ID=Region_5;Name=Repeated_region;query_repeat_len=261;difference_type=tandem_duplication;difference_coord_query=5351-5605;difference_len=255
query_1	NucDiff_v2.0	SO:0000657	5345	5350	.	.	.	ID=Region_6;Name=Repeated_region;query_repeat_len=6;difference_type=tandem_duplication;difference_coord_query=5351-5605;difference_len=255

The query_additional.gff file contains the following information (see Figure 1 for the notation):

| GFF3 fields | Content for Repeat_q_1 | Content for Repeat_q_2 | Notes |
|---|---|---|---|
| col 1 | Query_seq | Query_seq | |
| col 2 | NucDiff_v2.0 | NucDiff_v2.0 | name and current version of the tool |
| col 3 | SO:0000657 | SO:0000657 | Sequence Ontology accession number corresponding to the "repeat_region" SO term |
| col 4 | St_q_1 | St_q_2 | |
| col 5 | End_q | End_q | |
| col 6/col 7/col 8 | . | . | score/strand/phase fields are not used |
| col 9, ID | "Region_1" | "Region_2" | IDs in query_additional.gff and ref_additional.gff are independent |
| col 9, Name | "Repeated_region" | "Repeated_region" | |
| col 9, query_repeat_len | Length(Repeat_q_1) | Length(Repeat_q_2) | |
| col 9, difference_type | "tandem_duplication" | "tandem_duplication" | |
| col 9, difference_coord_query | Ins_st - Ins_end | Ins_st - Ins_end | |
| col 9, difference_len | Length(Tandem_duplication) | Length(Tandem_duplication) | |
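
Since the region IDs in query_additional.gff are independent of the SV IDs, one way to tie repeated regions back to their difference is to group them by sequence name and `difference_coord_query`. The following Python sketch does this for three entries from the example; the function name `group_repeats_by_difference` is a hypothetical choice, and the length check assumes inclusive coordinates, as in the example data.

```python
from collections import defaultdict

# Three entries copied from the query_additional.gff example (tabs written as \t).
LINES = [
    "query_1\tNucDiff_v2.0\tSO:0000657\t2601\t2690\t.\t.\t.\t"
    "ID=Region_1;Name=Repeated_region;query_repeat_len=90;"
    "difference_type=tandem_duplication;difference_coord_query=2691-2775;difference_len=85",
    "query_1\tNucDiff_v2.0\tSO:0000657\t2686\t2690\t.\t.\t.\t"
    "ID=Region_2;Name=Repeated_region;query_repeat_len=5;"
    "difference_type=tandem_duplication;difference_coord_query=2691-2775;difference_len=85",
    "query_1\tNucDiff_v2.0\tSO:0000657\t3776\t3935\t.\t.\t.\t"
    "ID=Region_3;Name=Repeated_region;query_repeat_len=160;"
    "difference_type=tandem_duplication;difference_coord_query=3936-4090;difference_len=155",
]


def group_repeats_by_difference(lines):
    """Map (sequence name, difference_coord_query) to the repeated regions reported for it."""
    groups = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and GFF3 directives such as ##gff-version
        cols = line.rstrip("\n").split("\t")
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if kv)
        start, end = int(cols[3]), int(cols[4])
        # query_repeat_len should match the inclusive span of col 4 and col 5
        if end - start + 1 != int(attrs["query_repeat_len"]):
            raise ValueError(f"length mismatch for {attrs['ID']}")
        key = (cols[0], attrs["difference_coord_query"])
        groups[key].append((attrs["ID"], start, end))
    return dict(groups)


groups = group_repeats_by_difference(LINES)
for key, regions in groups.items():
    print(key, regions)
```

With the full example file, each difference would be expected to collect two regions, Repeat_q_1 and Repeat_q_2.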

Example repeated-region entries in ref_additional.gff:

##gff-version 3
##sequence-region	ref_1	1	75565
ref_1	NucDiff_v2.0	SO:0000657	2556	2645	.	.	.	ID=Region_1;Name=Repeated_region;ref_repeat_len=90;difference_type=tandem_duplication;difference_coord_ref=2645-2645;difference_len=85;color=#DB0101
ref_1	NucDiff_v2.0	SO:0000657	2641	2645	.	.	.	ID=Region_2;Name=Repeated_region;ref_repeat_len=5;difference_type=tandem_duplication;difference_coord_ref=2645-2645;difference_len=85;color=#DB0101
ref_1	NucDiff_v2.0	SO:0000657	3646	3805	.	.	.	ID=Region_3;Name=Repeated_region;ref_repeat_len=160;difference_type=tandem_duplication;difference_coord_ref=3805-3805;difference_len=155;color=#DB0101
ref_1	NucDiff_v2.0	SO:0000657	3801	3805	.	.	.	ID=Region_4;Name=Repeated_region;ref_repeat_len=5;difference_type=tandem_duplication;difference_coord_ref=3805-3805;difference_len=155;color=#DB0101
ref_1	NucDiff_v2.0	SO:0000657	4805	5065	.	.	.	ID=Region_5;Name=Repeated_region;ref_repeat_len=261;difference_type=tandem_duplication;difference_coord_ref=5065-5065;difference_len=255;color=#DB0101
ref_1	NucDiff_v2.0	SO:0000657	5060	5065	.	.	.	ID=Region_6;Name=Repeated_region;ref_repeat_len=6;difference_type=tandem_duplication;difference_coord_ref=5065-5065;difference_len=255;color=#DB0101

The ref_additional.gff file contains the following information (see Figure 1 for the notation):

| GFF3 fields | Content for Repeat_r_1 | Content for Repeat_r_2 | Notes |
|---|---|---|---|
| col 1 | Ref_seq | Ref_seq | |
| col 2 | NucDiff_v2.0 | NucDiff_v2.0 | name and current version of the tool |
| col 3 | SO:0000657 | SO:0000657 | Sequence Ontology accession number corresponding to the "repeat_region" SO term |
| col 4 | St_r_1 | St_r_2 | |
| col 5 | Ref_pos | Ref_pos | |
| col 6/col 7/col 8 | . | . | score/strand/phase fields are not used |
| col 9, ID | "Region_1" | "Region_2" | IDs in query_additional.gff and ref_additional.gff are independent |
| col 9, Name | "Repeated_region" | "Repeated_region" | |
| col 9, ref_repeat_len | Length(Repeat_r_1) | Length(Repeat_r_2) | |
| col 9, difference_type | "tandem_duplication" | "tandem_duplication" | |
| col 9, difference_coord_ref | Ref_pos - Ref_pos | Ref_pos - Ref_pos | |
| col 9, difference_len | Length(Tandem_duplication) | Length(Tandem_duplication) | |
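
In the reference, both repeated regions end at Ref_pos, the single position reported in `difference_coord_ref`. The Python sketch below checks this for the Region_5 and Region_6 entries of the example, together with the `ref_repeat_len` value. The function name `check_ref_repeat` and the keys of the returned dictionary are illustrative assumptions rather than anything NucDiff provides.

```python
def check_ref_repeat(line):
    """Check one ref_additional.gff entry against the layout described in the table above."""
    cols = line.rstrip("\n").split("\t")
    attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if kv)
    start, end = int(cols[3]), int(cols[4])
    # difference_coord_ref is written as "Ref_pos-Ref_pos"
    pos_st, pos_end = (int(x) for x in attrs["difference_coord_ref"].split("-"))
    return {
        "id": attrs["ID"],
        # ref_repeat_len should equal the inclusive span of col 4 and col 5
        "length_ok": end - start + 1 == int(attrs["ref_repeat_len"]),
        # the repeat should end exactly at the insertion point Ref_pos
        "ends_at_ref_pos": pos_st == pos_end == end,
    }


# Region_5 and Region_6 from the ref_additional.gff example (tabs written as \t).
region_5 = (
    "ref_1\tNucDiff_v2.0\tSO:0000657\t4805\t5065\t.\t.\t.\t"
    "ID=Region_5;Name=Repeated_region;ref_repeat_len=261;"
    "difference_type=tandem_duplication;difference_coord_ref=5065-5065;"
    "difference_len=255;color=#DB0101"
)
region_6 = (
    "ref_1\tNucDiff_v2.0\tSO:0000657\t5060\t5065\t.\t.\t.\t"
    "ID=Region_6;Name=Repeated_region;ref_repeat_len=6;"
    "difference_type=tandem_duplication;difference_coord_ref=5065-5065;"
    "difference_len=255;color=#DB0101"
)

results = [check_ref_repeat(region_5), check_ref_repeat(region_6)]
for result in results:
    print(result)
```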