MSA anchoring #70
Comments
The MSA above may look odd, but it is correct given your input. In your case, the first several bases of the first 11 sequences are treated as unaligned when the longer sequence is aligned to the graph, so the gaps in those 11 sequences incur no gap penalty. My initial impression is that the order of the input sequences is crucial.
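To make "unaligned" concrete, here is a minimal sketch of a plain pairwise Smith-Waterman local alignment. It is not abPOA's partial order graph alignment, and the scores (match 2, mismatch -4, linear gap -4) and the toy sequences are illustrative assumptions only. It shows that a local alignment reports just the best-scoring span, so bases outside that span are never charged a gap penalty.

```python
def local_align(query, target, match=2, mismatch=-4, gap=-4):
    """Toy Smith-Waterman with a linear gap penalty (illustrative scores).

    Returns (best_score, (q_start, q_end), (t_start, t_end)) using
    half-open coordinates for the aligned span in each sequence.
    """
    rows, cols = len(query) + 1, len(target) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if query[i - 1] == target[j - 1] else mismatch
            # The 0 option lets an alignment start anywhere at no cost.
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until the score drops to 0.
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if query[i - 1] == target[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, (i, best_pos[0]), (j, best_pos[1])


# Hypothetical toy input: two leading bases ("GG") that match nothing nearby.
score, q_span, t_span = local_align("GGCGTCGTCA", "AAAAAAAAAACGTCGTCA")
print(score, q_span, t_span)  # 16 (2, 10) (10, 18)
```

The query span starts at position 2, so the two leading bases sit outside the alignment. When such bases still have to be placed somewhere in an MSA row, the space between them and the aligned span is not a scored gap.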
I would understand placing the unaligned bases in the MSA if this were a global alignment, but with a local alignment, wouldn't those simply be omitted from the output? What about the column with the T's in it? Those are aligned (at least among three sequences), so wouldn't the gap parameters be applied between those T's and the bulk of the aligned sequence ~1000 bp later? I guess I am confused as to what is considered aligned and what isn't in this output.
I could not reproduce your above MSA, so I can only guess what is going on there:
One more thing you may want to know: local mode is good for consensus calling, since the unaligned part is unlikely to be part of the consensus, but it is not well suited to producing an MSA.
I have a naive question about your MSA methodology. Why would the aligner choose to align short segments (<20 bp) at the start of the alignment and then allow a large deletion (>1000bp) before aligning again? Perhaps I am misunderstanding the parameters to abPOA.
Example: I have 100 homologous sequences: a few full length (~2200 bp) and many fragments (~300 bp in length). I used local alignment because in these datasets there is always a chance that small portions at the edges of the sequences are unrelated sequence, although that doesn't appear to be the case here. I ran abPOA as:
Where the comparison matrix is simply:
Here is the start of the MSA with the first 11 sequences anchoring with minor or no overlap to other sequences before entering a large deletion:
The gap in the first 11 sequences goes on for about 1000 bp in the alignment before starting back up again:
Certainly there was enough opportunity to align the first two bases of gi|527347 ('GC') somewhere much closer. Perhaps my affine gap parameters (values I use typically in pairwise alignment) are not stringent enough in this context? Any other ideas?
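As a rough check on the gap-parameter question, the arithmetic can be sketched as below. The match score and gap values are hypothetical placeholders, not the parameters used in the run above, and the cost convention (open + length * extend) is an assumption, since some tools count the first gap base differently.

```python
def affine_gap_cost(length, gap_open, gap_extend):
    """Cost of a single gap under the open + length * extend convention."""
    if length <= 0:
        return 0
    return gap_open + length * gap_extend


# Hypothetical scoring values, chosen only for illustration.
MATCH, GAP_OPEN, GAP_EXTEND = 2, 4, 2

anchor_gain = 2 * MATCH                                 # two matched bases, e.g. 'GC'
gap_cost = affine_gap_cost(1000, GAP_OPEN, GAP_EXTEND)  # one 1000 bp gap
net = anchor_gain - gap_cost

print(anchor_gain, gap_cost, net)  # 4 2004 -2000
```

Under any penalties of this order, a scored 1000 bp gap would cost far more than two anchored bases could gain. That suggests the gap in the output is not being scored as an ordinary affine gap at all, and making the affine parameters more stringent would not by itself change the placement.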