Tandem duplications (case 2)
A tandem duplication is an insertion of an extra copy of a reference sequence region directly adjacent to that region in the query sequence.
Figure 1: Tandem duplication example (case 2). C and C*, K and K*, Repeat_r_1 and Repeat_q_1, Repeat_r_2 and Repeat_q_2 are similar or nearly similar repeats.
A tandem duplication difference is reported in the query_struct.gff and ref_struct.gff files. The locations of the repeated regions involved in the difference are given in the query_additional.gff and ref_additional.gff files.
An example with the tandem duplication entries in query_struct.gff:
##gff-version 3
##sequence-region query_1 1 93275
query_1 NucDiff_v2.0 SO:1000173 2691 2775 . . . ID=SV_1;Name=tandem_duplication;ins_len=85;query_dir=1;ref_sequence=ref_1;ref_coord=2645;query_repeated_region_1=2601-2690;query_repeated_region_2=2686-2690;color=#EE0000
query_1 NucDiff_v2.0 SO:1000173 3936 4090 . . . ID=SV_2;Name=tandem_duplication;ins_len=155;query_dir=1;ref_sequence=ref_1;ref_coord=3805;query_repeated_region_1=3776-3935;query_repeated_region_2=3931-3935;color=#EE0000
query_1 NucDiff_v2.0 SO:1000173 5351 5605 . . . ID=SV_3;Name=tandem_duplication;ins_len=255;query_dir=1;ref_sequence=ref_1;ref_coord=5065;query_repeated_region_1=5090-5350;query_repeated_region_2=5345-5350;color=#EE0000
The query_struct.gff file contains the following information (see Figure 1 for notations):
GFF3 fields | Content | Notes |
---|---|---|
col 1 | Query_seq | |
col 2 | NucDiff_v2.0 | name and current version of the tool |
col 3 | SO:1000173 | Sequence Ontology accession number corresponding to the "tandem_duplication" SO term |
col 4 | Ins_st | |
col 5 | Ins_end | |
col 6/col 7/col 8 | . | score/strand/phase fields are not used |
col 9, ID | "SV_1" | ID in query_struct.gff is equal to ID in ref_struct.gff |
col 9, Name | "tandem_duplication" | |
col 9, ins_len | Length(Tandem_duplication) | |
col 9, query_dir | "1" or "-1" | -1 if the duplicated tandem unit must be reverse complemented before it is inserted into Ref_seq |
col 9, ref_sequence | Ref_seq | |
col 9, ref_coord | Ref_pos | |
col 9, query_repeated_region_1 | St_q_1 - End_q | |
col 9, query_repeated_region_2 | St_q_2 - End_q | |
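
As a rough illustration of how the fields in this table fit together, the following Python sketch parses the first example entry and checks that ins_len matches the span given in col 4 and col 5. The helper names `parse_gff_line` and `parse_range` are hypothetical and not part of NucDiff. The sketch splits on any whitespace, which works for the entries as displayed here because their attributes contain no spaces; GFF3 files themselves are tab-delimited.

```python
def parse_gff_line(line):
    """Split one GFF3 feature line into selected columns and an attribute dict."""
    cols = line.split(None, 8)  # any whitespace; real GFF3 files are tab-delimited
    if len(cols) != 9:
        raise ValueError("expected 9 columns, got %d" % len(cols))
    attrs = dict(field.split("=", 1) for field in cols[8].strip().split(";") if field)
    return {
        "seqid": cols[0],
        "source": cols[1],
        "type": cols[2],
        "start": int(cols[3]),
        "end": int(cols[4]),
        "attrs": attrs,
    }


def parse_range(text):
    """Turn a 'start-end' attribute value such as '2601-2690' into two ints."""
    start, end = text.split("-")
    return int(start), int(end)


# First entry of the query_struct.gff example above.
line = ("query_1 NucDiff_v2.0 SO:1000173 2691 2775 . . . "
        "ID=SV_1;Name=tandem_duplication;ins_len=85;query_dir=1;"
        "ref_sequence=ref_1;ref_coord=2645;"
        "query_repeated_region_1=2601-2690;query_repeated_region_2=2686-2690;"
        "color=#EE0000")

rec = parse_gff_line(line)
ins_len = int(rec["attrs"]["ins_len"])
rep1 = parse_range(rec["attrs"]["query_repeated_region_1"])
rep2 = parse_range(rec["attrs"]["query_repeated_region_2"])

# Coordinates are 1-based and inclusive, so length = end - start + 1.
span_matches = (rec["end"] - rec["start"] + 1 == ins_len)

# In this example entry, both repeated regions end at the base just before Ins_st.
repeats_adjacent = (rep1[1] == rec["start"] - 1 and rep2[1] == rec["start"] - 1)

print(span_matches, repeats_adjacent)
```

The adjacency check reflects what the example entries show and is not meant as a general guarantee about NucDiff output.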
An example with the tandem duplication entries in ref_struct.gff:
##gff-version 3
##sequence-region ref_1 1 75565
ref_1 NucDiff_v2.0 SO:1000173 2645 2645 . . . ID=SV_1;Name=tandem_duplication;ins_len=85;query_dir=1;query_sequence=query_1;query_coord=2691-2775;ref_repeated_region_1=2556-2645;ref_repeated_region_2=2641-2645;color=#EE0000
ref_1 NucDiff_v2.0 SO:1000173 3805 3805 . . . ID=SV_2;Name=tandem_duplication;ins_len=155;query_dir=1;query_sequence=query_1;query_coord=3936-4090;ref_repeated_region_1=3646-3805;ref_repeated_region_2=3801-3805;color=#EE0000
ref_1 NucDiff_v2.0 SO:1000173 5065 5065 . . . ID=SV_3;Name=tandem_duplication;ins_len=255;query_dir=1;query_sequence=query_1;query_coord=5351-5605;ref_repeated_region_1=4805-5065;ref_repeated_region_2=5060-5065;color=#EE0000
The ref_struct.gff file contains the following information (see Figure 1 for notations):
GFF3 fields | Content | Notes |
---|---|---|
col 1 | Ref_seq | |
col 2 | NucDiff_v2.0 | name and current version of the tool |
col 3 | SO:1000173 | Sequence Ontology accession number corresponding to the "tandem_duplication" SO term |
col 4 | Ref_pos | |
col 5 | Ref_pos | |
col 6/col 7/col 8 | . | score/strand/phase fields are not used |
col 9, ID | "SV_1" | ID in ref_struct.gff is equal to ID in query_struct.gff |
col 9, Name | "tandem_duplication" | |
col 9, ins_len | Length(Tandem_duplication) | |
col 9, query_dir | "1" or "-1" | -1 if the duplicated fragment must be reverse complemented before it is inserted into Ref_seq |
col 9, query_sequence | Query_seq | |
col 9, query_coord | Ins_st - Ins_end | |
col 9, ref_repeated_region_1 | St_r_1 - Ref_pos | |
col 9, ref_repeated_region_2 | St_r_2 - Ref_pos | |
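
Because the same ID is used for a difference in query_struct.gff and ref_struct.gff, the two files can be joined on that attribute. The sketch below is one possible way to do this in Python, using the example entries from this page as inline strings. The function names `read_features` and `is_consistent` are hypothetical, and whitespace splitting is an assumption that fits the entries as displayed here.

```python
QUERY_STRUCT = """\
##gff-version 3
##sequence-region query_1 1 93275
query_1 NucDiff_v2.0 SO:1000173 2691 2775 . . . ID=SV_1;Name=tandem_duplication;ins_len=85;query_dir=1;ref_sequence=ref_1;ref_coord=2645;query_repeated_region_1=2601-2690;query_repeated_region_2=2686-2690;color=#EE0000
query_1 NucDiff_v2.0 SO:1000173 3936 4090 . . . ID=SV_2;Name=tandem_duplication;ins_len=155;query_dir=1;ref_sequence=ref_1;ref_coord=3805;query_repeated_region_1=3776-3935;query_repeated_region_2=3931-3935;color=#EE0000
"""

REF_STRUCT = """\
##gff-version 3
##sequence-region ref_1 1 75565
ref_1 NucDiff_v2.0 SO:1000173 2645 2645 . . . ID=SV_1;Name=tandem_duplication;ins_len=85;query_dir=1;query_sequence=query_1;query_coord=2691-2775;ref_repeated_region_1=2556-2645;ref_repeated_region_2=2641-2645;color=#EE0000
ref_1 NucDiff_v2.0 SO:1000173 3805 3805 . . . ID=SV_2;Name=tandem_duplication;ins_len=155;query_dir=1;query_sequence=query_1;query_coord=3936-4090;ref_repeated_region_1=3646-3805;ref_repeated_region_2=3801-3805;color=#EE0000
"""


def read_features(text):
    """Return {ID: (start, end, attrs)} for the feature lines of a GFF3 text."""
    features = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and '##' directives
        cols = line.split(None, 8)
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if f)
        features[attrs["ID"]] = (int(cols[3]), int(cols[4]), attrs)
    return features


query = read_features(QUERY_STRUCT)
ref = read_features(REF_STRUCT)


def is_consistent(sv_id):
    """Check that the query-side and ref-side entries describe the same difference."""
    q_start, q_end, q_attrs = query[sv_id]
    r_start, r_end, r_attrs = ref[sv_id]
    return (
        r_start == r_end == int(q_attrs["ref_coord"])            # Ref_pos
        and r_attrs["query_coord"] == "%d-%d" % (q_start, q_end)  # Ins_st - Ins_end
        and q_attrs["ins_len"] == r_attrs["ins_len"]
        and q_attrs["query_dir"] == r_attrs["query_dir"]
    )


results = {sv_id: is_consistent(sv_id) for sv_id in sorted(query)}
print(results)
```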
An example with the additional information in query_additional.gff:
##gff-version 3
##sequence-region query_1 1 93275
query_1 NucDiff_v2.0 SO:0000657 2601 2690 . . . ID=Region_1;Name=Repeated_region;query_repeat_len=90;difference_type=tandem_duplication;difference_coord_query=2691-2775;difference_len=85
query_1 NucDiff_v2.0 SO:0000657 2686 2690 . . . ID=Region_2;Name=Repeated_region;query_repeat_len=5;difference_type=tandem_duplication;difference_coord_query=2691-2775;difference_len=85
query_1 NucDiff_v2.0 SO:0000657 3776 3935 . . . ID=Region_3;Name=Repeated_region;query_repeat_len=160;difference_type=tandem_duplication;difference_coord_query=3936-4090;difference_len=155
query_1 NucDiff_v2.0 SO:0000657 3931 3935 . . . ID=Region_4;Name=Repeated_region;query_repeat_len=5;difference_type=tandem_duplication;difference_coord_query=3936-4090;difference_len=155
query_1 NucDiff_v2.0 SO:0000657 5090 5350 . . . ID=Region_5;Name=Repeated_region;query_repeat_len=261;difference_type=tandem_duplication;difference_coord_query=5351-5605;difference_len=255
query_1 NucDiff_v2.0 SO:0000657 5345 5350 . . . ID=Region_6;Name=Repeated_region;query_repeat_len=6;difference_type=tandem_duplication;difference_coord_query=5351-5605;difference_len=255
The query_additional.gff file contains the following information (see Figure 1 for notations):
GFF3 fields | Content for Repeat_q_1 | Content for Repeat_q_2 | Notes |
---|---|---|---|
col 1 | Query_seq | Query_seq | |
col 2 | NucDiff_v2.0 | NucDiff_v2.0 | name and current version of the tool |
col 3 | SO:0000657 | SO:0000657 | Sequence Ontology accession number corresponding to the "repeat_region" SO term |
col 4 | St_q_1 | St_q_2 | |
col 5 | End_q | End_q | |
col 6/col 7/col 8 | . | . | score/strand/phase fields are not used |
col 9, ID | "Region_1" | "Region_2" | IDs in query_additional.gff and ref_additional.gff are independent |
col 9, Name | "Repeated_region" | "Repeated_region" | |
col 9, query_repeat_len | Length(Repeat_q_1) | Length(Repeat_q_2) | |
col 9, difference_type | "tandem_duplication" | "tandem_duplication" | |
col 9, difference_coord_query | Ins_st - Ins_end | Ins_st - Ins_end | |
col 9, difference_len | Length(Tandem_duplication) | Length(Tandem_duplication) | |
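
Each repeated region carries the coordinates of the difference it belongs to in difference_coord_query, so the regions can be grouped per difference. The following Python sketch shows one way to do that for the first four example entries. The variable names are illustrative only, and splitting on whitespace is an assumption that fits the entries as displayed here.

```python
from collections import defaultdict

QUERY_ADDITIONAL = """\
##gff-version 3
##sequence-region query_1 1 93275
query_1 NucDiff_v2.0 SO:0000657 2601 2690 . . . ID=Region_1;Name=Repeated_region;query_repeat_len=90;difference_type=tandem_duplication;difference_coord_query=2691-2775;difference_len=85
query_1 NucDiff_v2.0 SO:0000657 2686 2690 . . . ID=Region_2;Name=Repeated_region;query_repeat_len=5;difference_type=tandem_duplication;difference_coord_query=2691-2775;difference_len=85
query_1 NucDiff_v2.0 SO:0000657 3776 3935 . . . ID=Region_3;Name=Repeated_region;query_repeat_len=160;difference_type=tandem_duplication;difference_coord_query=3936-4090;difference_len=155
query_1 NucDiff_v2.0 SO:0000657 3931 3935 . . . ID=Region_4;Name=Repeated_region;query_repeat_len=5;difference_type=tandem_duplication;difference_coord_query=3936-4090;difference_len=155
"""

# Maps "Ins_st-Ins_end" of a difference to the repeated regions reported for it.
regions_by_difference = defaultdict(list)
length_ok = []

for line in QUERY_ADDITIONAL.splitlines():
    if not line.strip() or line.startswith("#"):
        continue  # skip blank lines and '##' directives
    cols = line.split(None, 8)
    start, end = int(cols[3]), int(cols[4])
    attrs = dict(f.split("=", 1) for f in cols[8].split(";") if f)
    # query_repeat_len should equal the inclusive span of col 4..col 5.
    length_ok.append(int(attrs["query_repeat_len"]) == end - start + 1)
    regions_by_difference[attrs["difference_coord_query"]].append(
        (attrs["ID"], start, end)
    )

for coord, regions in regions_by_difference.items():
    print(coord, regions)
print(all(length_ok))
```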
An example with the additional information in ref_additional.gff:
##gff-version 3
##sequence-region ref_1 1 75565
ref_1 NucDiff_v2.0 SO:0000657 2556 2645 . . . ID=Region_1;Name=Repeated_region;ref_repeat_len=90;difference_type=tandem_duplication;difference_coord_ref=2645-2645;difference_len=85;color=#DB0101
ref_1 NucDiff_v2.0 SO:0000657 2641 2645 . . . ID=Region_2;Name=Repeated_region;ref_repeat_len=5;difference_type=tandem_duplication;difference_coord_ref=2645-2645;difference_len=85;color=#DB0101
ref_1 NucDiff_v2.0 SO:0000657 3646 3805 . . . ID=Region_3;Name=Repeated_region;ref_repeat_len=160;difference_type=tandem_duplication;difference_coord_ref=3805-3805;difference_len=155;color=#DB0101
ref_1 NucDiff_v2.0 SO:0000657 3801 3805 . . . ID=Region_4;Name=Repeated_region;ref_repeat_len=5;difference_type=tandem_duplication;difference_coord_ref=3805-3805;difference_len=155;color=#DB0101
ref_1 NucDiff_v2.0 SO:0000657 4805 5065 . . . ID=Region_5;Name=Repeated_region;ref_repeat_len=261;difference_type=tandem_duplication;difference_coord_ref=5065-5065;difference_len=255;color=#DB0101
ref_1 NucDiff_v2.0 SO:0000657 5060 5065 . . . ID=Region_6;Name=Repeated_region;ref_repeat_len=6;difference_type=tandem_duplication;difference_coord_ref=5065-5065;difference_len=255;color=#DB0101
The ref_additional.gff file contains the following information (see Figure 1 for notations):
GFF3 fields | Content for Repeat_r_1 | Content for Repeat_r_2 | Notes |
---|---|---|---|
col 1 | Ref_seq | Ref_seq | |
col 2 | NucDiff_v2.0 | NucDiff_v2.0 | name and current version of the tool |
col 3 | SO:0000657 | SO:0000657 | Sequence Ontology accession number corresponding to the "repeat_region" SO term |
col 4 | St_r_1 | St_r_2 | |
col 5 | Ref_pos | Ref_pos | |
col 6/col 7/col 8 | . | . | score/strand/phase fields are not used |
col 9, ID | "Region_1" | "Region_2" | IDs in query_additional.gff and ref_additional.gff are independent |
col 9, Name | "Repeated_region" | "Repeated_region" | |
col 9, ref_repeat_len | Length(Repeat_r_1) | Length(Repeat_r_2) | |
col 9, difference_type | "tandem_duplication" | "tandem_duplication" | |
col 9, difference_coord_ref | Ref_pos - Ref_pos | Ref_pos - Ref_pos | |
col 9, difference_len | Length(Tandem_duplication) | Length(Tandem_duplication) | |
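
According to the table, both reference repeated regions end at Ref_pos, and difference_coord_ref is written as Ref_pos - Ref_pos. A small Python sketch of that check on the first two example entries is given below. The function name `check_region` is hypothetical, and whitespace splitting is an assumption that fits the entries as displayed here.

```python
REF_ADDITIONAL = """\
##gff-version 3
##sequence-region ref_1 1 75565
ref_1 NucDiff_v2.0 SO:0000657 2556 2645 . . . ID=Region_1;Name=Repeated_region;ref_repeat_len=90;difference_type=tandem_duplication;difference_coord_ref=2645-2645;difference_len=85;color=#DB0101
ref_1 NucDiff_v2.0 SO:0000657 2641 2645 . . . ID=Region_2;Name=Repeated_region;ref_repeat_len=5;difference_type=tandem_duplication;difference_coord_ref=2645-2645;difference_len=85;color=#DB0101
"""


def check_region(line):
    """Return True if the region ends at Ref_pos and its length attribute matches."""
    cols = line.split(None, 8)
    start, end = int(cols[3]), int(cols[4])
    attrs = dict(f.split("=", 1) for f in cols[8].split(";") if f)
    pos_start, pos_end = (int(x) for x in attrs["difference_coord_ref"].split("-"))
    return (
        pos_start == pos_end == end                          # col 5 is Ref_pos
        and int(attrs["ref_repeat_len"]) == end - start + 1  # inclusive length
    )


checks = [
    check_region(line)
    for line in REF_ADDITIONAL.splitlines()
    if line.strip() and not line.startswith("#")
]
print(checks)
```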