Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MSA anchoring #70

Open
rmhubley opened this issue May 24, 2024 · 4 comments
Open

MSA anchoring #70

rmhubley opened this issue May 24, 2024 · 4 comments

Comments

@rmhubley
Copy link

I have a naive question about your MSA methodology. Why would the aligner choose to align short segments (<20 bp) at the start of the alignment and then allow a large deletion (>1000bp) before aligning again? Perhaps I am misunderstanding the parameters to abPOA.

Example: I have a 100 homologous sequences -- a few full length (~2200 bp) and many fragments (~300 bp in length). I used local alignment as there is always a chance in these datasets that small portions of the edges of these sequences could be unrelated sequence, although in this dataset it doesn't look to be the case. I ran abPOA as:

../bin/abpoa family-seqs.fa -t comparison.matrix -O 25,0 -E 5,0  -m 1 -r 1

Where the comparison matrix is simply:

    A   G   C   T   N      
A   9  -6 -15 -17  -1     
G  -6  10 -15 -15  -1     
C -15 -15  10  -6  -1     
T -17 -15  -6   9  -1     
N  -1  -1  -1  -1  -1     

Here is the start of the MSA with the first 11 sequences anchoring with minor or no overlap to other sequences before entering a large deletion:

gi|526567    1 AAGC------------------------------------------------------------------------------------------------    4 [1]
gi|527347    1   GC------------------------------------------------------------------------------------------------    2 [2]
gi|526255    1     ATGACA---------------------T--------------------------------------------------------------------    7 [3]
gi|526360    1         CA---------------------T--------------------------------------------------------------------    3 [4]
gi|527067    1           TG----------------------------------------------------------------------------------------    2 [5]
gi|527527    1             ATA-------------------------------------------------------------------------------------    3 [6]
gi|527416    1                C------------------------------------------------------------------------------------    1 [7]
gi|526343    1                 CCACCTGGAAG-------------------------------------------------------------------------    11 [8]
gi|527241    1                            TGCTT--------------------------------------------------------------------    5 [9]
gi|530654    1                                 CTA-----------------------------------------------------------------    3 [10]
gi|525985    1                                    TACAT------------------------------------------------------------    5 [11]
gi|525696    1                                         CCTGTCTTCTACCATCTCCGAACTCCAGATACTCCAACAGCGAAAGGGATCTGGGCCCAA    60 [12]
gi|525698    1                                         CCTGTCTTCCACCATCTCCGAACTCCAGATACTCCTACAGCGAAAGGGATCTGGGCCCAA    60 [13]
gi|525697    1                                         CCTGTCTTCCACCATCTCTGAACTCCAGACACTCCAACAGTGAAAGGGATCTAGGGCCAC    60 [14]
gi|525699    1                                         CCTGTCTTCCACCATCTCTGAACTCCAGACACTCCAACAGTGAAAGGGATCTAGGGCCAC    60 [15]

The gap in the first 11 sequences goes on for about 1000 bp in the alignment before starting backup up again:

gi|526567    5 --------------------------------------------------------------------------------ATAT------AGAG-T--T-    14 [1]
gi|527347    3 --------------------------------------------------------------------------------ATGT------CAA-----T-    10 [2]
gi|526255    8 -----------------------------------------------------------------------------------TAT---GAG------T-    14 [3]
gi|526360    4 -----------------------------------------------------------------------------------TGT---GGGG-----C-    11 [4]
gi|527067    3 -------------------------------------------------------------------------------TGTATAT-----AAT-G--T-    14 [5]
gi|527527    4 ------------------------------------------------------------------------AA------ATGT------AAA-------    12 [6]
gi|527416    2 --------------------------------------------------------------------------------GTGA------AAGTA-----    10 [7]
gi|526343   12 ---------------------------------------------------------------------------------------CCAAAGT-TCATG    23 [8]
gi|527241    6 -----------------------------------------------------------------------------------T------AAAT-T--T-    12 [9]
gi|530654    3 ----------------------------------------------------------------------------------------------------    3 [10]
gi|525985    6 ------------------GGCAGAGGAAAAAGTG-CACAAATTGGGACCAGAAAAT---ATAA---A--GGGGAG---TTGTGT------AGAT-T--T-    65 [11]
gi|525696 1040 CAGATGGTTAAATGGGAGGACAGAAAATGAAT-G-CACAAG--TGGACCAATAAAT-GAATGATCCATTGGGAAGCATCTGTGCAT---GAAAT-C--T-    1127 [12]
gi|525698 1040 CAGATGGTTAAATGGGAGGGCAGAAAATGAAT-G-CACAAG--TGGATCTATAAAT-GAATGATCCATTGGGAAGCATCTGTGCAT---GAAAT-C--T-    1127 [13]
gi|525697 1042 CAGATGGTTAAATGGGAGGGCAGAAAATGAAT-G-CACAAG--TGGACCAATAAAT-GAATGATCCATTGGGAAGCATCTGTGTAT---GAAAT-C--T-    1129 [14]
gi|525699 1040 CAGATGGTTAAATGGGAGGGCAGAAAATGTAG-G-CACAAG--GGGACCAATAAAT-GAATGATCTATTGAGAAGCATCTGTGCAT---GAAAT-C--T-    1127 [15]
gi|525700  749 CAGATGGGTAAATGCGAGGGCAGAGAATGAAT-G-CACAAG--GGTACCAATAAAT-GAATGATCCATTGGGAAGCATCTGTGCAC---CAAAT-C--TG    837 [16]

Certainly there was enough opportunity to align the first two bases of gi|527347 ('GC') somewhere much closer. Perhaps my affine gap parameters (values I use typically in pairwise alignment) are not stringent enough in this context? Any other ideas?

@yangao07
Copy link
Owner

The above MSA result may look weird, but it is indeed correct given your input.
The local alignment in abPOA aligns part of the sequence to the current graph with the best local alignment score, the other part will be left unaligned (like soft clipping in cigars of SAM file).

In your case, the first several bases of the first 11 sequences are considered unaligned when the longer sequence is aligned to the graph. So the gaps in the first 11 sequences have no gap penalty.

My initial impression is that the order of the input sequences is crucial.
Have you tried placing shorter sequences at the end of the input file?
Or, if you want all the bases to be included in the alignment, you can try to use global alignment mode.

@rmhubley
Copy link
Author

I would understand placing the unaligned bases in the MSA if this was a global alignment, however with a local alignment wouldn't those simply be omitted from the output? What about the column with the T's in it, those are aligned (at least among three sequences), wouldn't the gap parameters be applied between those T's and the bulk of the aligned sequence ~1000bp later? I guess I am confused as to what is considered aligned and what isn't in this output.

gi|526567    1 AAGC------------------------------------------------------------------------------------------------    4 [1]
gi|527347    1   GC------------------------------------------------------------------------------------------------    2 [2]
gi|526255    1     ATGACA---------------------T--------------------------------------------------------------------    7 [3]
gi|526360    1         CA---------------------T--------------------------------------------------------------------    3 [4]
gi|527067    1           TG----------------------------------------------------------------------------------------    2 [5]
gi|527527    1             ATA-------------------------------------------------------------------------------------    3 [6]
gi|527416    1                C------------------------------------------------------------------------------------    1 [7]
gi|526343    1                 CCACCTGGAAG-------------------------------------------------------------------------    11 [8]
gi|527241    1                            TGCTT--------------------------------------------------------------------    5 [9]
gi|530654    1                                 CTA-----------------------------------------------------------------    3 [10]
gi|525985    1                                    TACAT------------------------------------------------------------    5 [11]
gi|525696    1                                         CCTGTCTTCTACCATCTCCGAACTCCAGATACTCCAACAGCGAAAGGGATCTGGGCCCAA    60 [12]
gi|525698    1                                         CCTGTCTTCCACCATCTCCGAACTCCAGATACTCCTACAGCGAAAGGGATCTGGGCCCAA    60 [13]
gi|525697    1                                         CCTGTCTTCCACCATCTCTGAACTCCAGACACTCCAACAGTGAAAGGGATCTAGGGCCAC    60 [14]
gi|525699    1                                         CCTGTCTTCCACCATCTCTGAACTCCAGACACTCCAACAGTGAAAGGGATCTAGGGCCAC    60 [15]

@yangao07
Copy link
Owner

I could not reproduce your above MSA, so I can only guess what is going on there:
The Ts were considered aligned for the top sequences, but not the bottom ones (longer ones).
So the gaps are not penalized when the longer sequences are added to the MSA.

@yangao07
Copy link
Owner

One more thing you may want to know is that the local mode is good for consensus calling purposes, as the unaligned part is unlikely to be part of the consensus, but not for the MSA purpose.
This is why the MSA always looks weird for local mode.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants