MSA anchoring #70

rmhubley · 2024-05-24T17:45:02Z

I have a naive question about your MSA methodology. Why would the aligner choose to align short segments (<20 bp) at the start of the alignment and then allow a large deletion (>1000bp) before aligning again? Perhaps I am misunderstanding the parameters to abPOA.

Example: I have a 100 homologous sequences -- a few full length (~2200 bp) and many fragments (~300 bp in length). I used local alignment as there is always a chance in these datasets that small portions of the edges of these sequences could be unrelated sequence, although in this dataset it doesn't look to be the case. I ran abPOA as:

../bin/abpoa family-seqs.fa -t comparison.matrix -O 25,0 -E 5,0  -m 1 -r 1

Where the comparison matrix is simply:

    A   G   C   T   N      
A   9  -6 -15 -17  -1     
G  -6  10 -15 -15  -1     
C -15 -15  10  -6  -1     
T -17 -15  -6   9  -1     
N  -1  -1  -1  -1  -1

Here is the start of the MSA with the first 11 sequences anchoring with minor or no overlap to other sequences before entering a large deletion:

gi|526567    1 AAGC------------------------------------------------------------------------------------------------    4 [1]
gi|527347    1   GC------------------------------------------------------------------------------------------------    2 [2]
gi|526255    1     ATGACA---------------------T--------------------------------------------------------------------    7 [3]
gi|526360    1         CA---------------------T--------------------------------------------------------------------    3 [4]
gi|527067    1           TG----------------------------------------------------------------------------------------    2 [5]
gi|527527    1             ATA-------------------------------------------------------------------------------------    3 [6]
gi|527416    1                C------------------------------------------------------------------------------------    1 [7]
gi|526343    1                 CCACCTGGAAG-------------------------------------------------------------------------    11 [8]
gi|527241    1                            TGCTT--------------------------------------------------------------------    5 [9]
gi|530654    1                                 CTA-----------------------------------------------------------------    3 [10]
gi|525985    1                                    TACAT------------------------------------------------------------    5 [11]
gi|525696    1                                         CCTGTCTTCTACCATCTCCGAACTCCAGATACTCCAACAGCGAAAGGGATCTGGGCCCAA    60 [12]
gi|525698    1                                         CCTGTCTTCCACCATCTCCGAACTCCAGATACTCCTACAGCGAAAGGGATCTGGGCCCAA    60 [13]
gi|525697    1                                         CCTGTCTTCCACCATCTCTGAACTCCAGACACTCCAACAGTGAAAGGGATCTAGGGCCAC    60 [14]
gi|525699    1                                         CCTGTCTTCCACCATCTCTGAACTCCAGACACTCCAACAGTGAAAGGGATCTAGGGCCAC    60 [15]

The gap in the first 11 sequences goes on for about 1000 bp in the alignment before starting backup up again:

gi|526567    5 --------------------------------------------------------------------------------ATAT------AGAG-T--T-    14 [1]
gi|527347    3 --------------------------------------------------------------------------------ATGT------CAA-----T-    10 [2]
gi|526255    8 -----------------------------------------------------------------------------------TAT---GAG------T-    14 [3]
gi|526360    4 -----------------------------------------------------------------------------------TGT---GGGG-----C-    11 [4]
gi|527067    3 -------------------------------------------------------------------------------TGTATAT-----AAT-G--T-    14 [5]
gi|527527    4 ------------------------------------------------------------------------AA------ATGT------AAA-------    12 [6]
gi|527416    2 --------------------------------------------------------------------------------GTGA------AAGTA-----    10 [7]
gi|526343   12 ---------------------------------------------------------------------------------------CCAAAGT-TCATG    23 [8]
gi|527241    6 -----------------------------------------------------------------------------------T------AAAT-T--T-    12 [9]
gi|530654    3 ----------------------------------------------------------------------------------------------------    3 [10]
gi|525985    6 ------------------GGCAGAGGAAAAAGTG-CACAAATTGGGACCAGAAAAT---ATAA---A--GGGGAG---TTGTGT------AGAT-T--T-    65 [11]
gi|525696 1040 CAGATGGTTAAATGGGAGGACAGAAAATGAAT-G-CACAAG--TGGACCAATAAAT-GAATGATCCATTGGGAAGCATCTGTGCAT---GAAAT-C--T-    1127 [12]
gi|525698 1040 CAGATGGTTAAATGGGAGGGCAGAAAATGAAT-G-CACAAG--TGGATCTATAAAT-GAATGATCCATTGGGAAGCATCTGTGCAT---GAAAT-C--T-    1127 [13]
gi|525697 1042 CAGATGGTTAAATGGGAGGGCAGAAAATGAAT-G-CACAAG--TGGACCAATAAAT-GAATGATCCATTGGGAAGCATCTGTGTAT---GAAAT-C--T-    1129 [14]
gi|525699 1040 CAGATGGTTAAATGGGAGGGCAGAAAATGTAG-G-CACAAG--GGGACCAATAAAT-GAATGATCTATTGAGAAGCATCTGTGCAT---GAAAT-C--T-    1127 [15]
gi|525700  749 CAGATGGGTAAATGCGAGGGCAGAGAATGAAT-G-CACAAG--GGTACCAATAAAT-GAATGATCCATTGGGAAGCATCTGTGCAC---CAAAT-C--TG    837 [16]

Certainly there was enough opportunity to align the first two bases of gi|527347 ('GC') somewhere much closer. Perhaps my affine gap parameters (values I use typically in pairwise alignment) are not stringent enough in this context? Any other ideas?

The text was updated successfully, but these errors were encountered:

yangao07 · 2024-05-24T20:06:27Z

The above MSA result may look weird, but it is indeed correct given your input.
The local alignment in abPOA aligns part of the sequence to the current graph with the best local alignment score, the other part will be left unaligned (like soft clipping in cigars of SAM file).

In your case, the first several bases of the first 11 sequences are considered unaligned when the longer sequence is aligned to the graph. So the gaps in the first 11 sequences have no gap penalty.

My initial impression is that the order of the input sequences is crucial.
Have you tried placing shorter sequences at the end of the input file?
Or, if you want all the bases to be included in the alignment, you can try to use global alignment mode.

rmhubley · 2024-05-24T22:46:32Z

I would understand placing the unaligned bases in the MSA if this was a global alignment, however with a local alignment wouldn't those simply be omitted from the output? What about the column with the T's in it, those are aligned (at least among three sequences), wouldn't the gap parameters be applied between those T's and the bulk of the aligned sequence ~1000bp later? I guess I am confused as to what is considered aligned and what isn't in this output.

gi|526567    1 AAGC------------------------------------------------------------------------------------------------    4 [1]
gi|527347    1   GC------------------------------------------------------------------------------------------------    2 [2]
gi|526255    1     ATGACA---------------------T--------------------------------------------------------------------    7 [3]
gi|526360    1         CA---------------------T--------------------------------------------------------------------    3 [4]
gi|527067    1           TG----------------------------------------------------------------------------------------    2 [5]
gi|527527    1             ATA-------------------------------------------------------------------------------------    3 [6]
gi|527416    1                C------------------------------------------------------------------------------------    1 [7]
gi|526343    1                 CCACCTGGAAG-------------------------------------------------------------------------    11 [8]
gi|527241    1                            TGCTT--------------------------------------------------------------------    5 [9]
gi|530654    1                                 CTA-----------------------------------------------------------------    3 [10]
gi|525985    1                                    TACAT------------------------------------------------------------    5 [11]
gi|525696    1                                         CCTGTCTTCTACCATCTCCGAACTCCAGATACTCCAACAGCGAAAGGGATCTGGGCCCAA    60 [12]
gi|525698    1                                         CCTGTCTTCCACCATCTCCGAACTCCAGATACTCCTACAGCGAAAGGGATCTGGGCCCAA    60 [13]
gi|525697    1                                         CCTGTCTTCCACCATCTCTGAACTCCAGACACTCCAACAGTGAAAGGGATCTAGGGCCAC    60 [14]
gi|525699    1                                         CCTGTCTTCCACCATCTCTGAACTCCAGACACTCCAACAGTGAAAGGGATCTAGGGCCAC    60 [15]

yangao07 · 2024-05-25T13:49:06Z

I could not reproduce your above MSA, so I can only guess what is going on there:
The Ts were considered aligned for the top sequences, but not the bottom ones (longer ones).
So the gaps are not penalized when the longer sequences are added to the MSA.

yangao07 · 2024-05-25T13:55:50Z

One more thing you may want to know is that the local mode is good for consensus calling purposes, as the unaligned part is unlikely to be part of the consensus, but not for the MSA purpose.
This is why the MSA always looks weird for local mode.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MSA anchoring #70

MSA anchoring #70

rmhubley commented May 24, 2024

yangao07 commented May 24, 2024

rmhubley commented May 24, 2024

yangao07 commented May 25, 2024

yangao07 commented May 25, 2024

MSA anchoring #70

MSA anchoring #70

Comments

rmhubley commented May 24, 2024

yangao07 commented May 24, 2024

rmhubley commented May 24, 2024

yangao07 commented May 25, 2024

yangao07 commented May 25, 2024