Investigate supporting fixed-length MNP variants #51

timothymillar · 2020-09-21T23:20:09Z

Freebayes likes to output MNPs e.g. TGC TAC,TGG.
Currently the best approach with these is to 'atomize' them into SNPs before assembly (removing redundant bases).

There are two options to support these without any user pre-processing:

automatically atomize them
generalize the encoding of read distributions to include MNPs

The second option would technically be the most computationally efficient because it only requires fewer variants.
The first option assesses all possible combinations of the atomized SNPs, not just the alts of the MNP.

The main complexity of encoding an MNP is how to handle the case where a read does not extend the entire length of the MNP.

The text was updated successfully, but these errors were encountered:

timothymillar · 2020-09-22T03:06:39Z

This may require allowing read allele "distributions" in which the allele probabilities at a position sum to > 1.
This is because these probabilities are interpreted as "the probability of drawing this read from a haplotype with these alleles assuming the read can only cover the bases that it does".
Currently this means that P(T,A,- | T,A,A) + P(T,A,- | T,A,T) > 1 because a gap - is ignored when calculating the probability.
Encoded as an MNP we should get the same result P(TA- | TAA) + P(TA- | TAT) > 1.

For example two SNPs that could be encoded as a single MNP with all possible combinations of SNP alleles:

>>> haps = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
>>> onehot = (haps[..., None] == np.arange(0,2)).astype(np.int)  # onehot encoding of haps

The probabilities for matching a full read (i.e. covers both SNPs) sum to 1:

>>> read_full = np.array([[0.9, 0.1], [0.9, 0.1]])
>>> np.nanprod((read_full * onehot).sum(axis=-1), axis=-1)  # calculate prob of IIS with each haplotype
array([0.81, 0.09, 0.09, 0.01])

But not in the case of a partial read:

>>> read_partial = np.array([[0.9, 0.1], [np.nan, np.nan]])
>>> np.nanprod((read_partial * onehot).sum(axis=-1), axis=-1)
array([0.9, 0.9, 0.1, 0.1])

These should be the probabilities used to in the equivalent MNP encoding as a single variant e.g.:

>>> read_full =np.array([[0.81, 0.09, 0.09, 0.01]])
>>> read_partial = np.array([[0.9, 0.9, 0.1, 0.1]])

However, if not all possible SNP encodings where present (i.e. the MNP is restricted to a subset of the possible haplotypes) then these probabilities would need to be normalized.
In which case the probabilities of the partial read should be 'normalized' to sum to the total found if all SNP combinations are present?

timothymillar added the enhancement New feature or request label Sep 21, 2020

timothymillar self-assigned this Sep 21, 2020

timothymillar added the low priority label Sep 22, 2020

timothymillar mentioned this issue Sep 24, 2020

Check probability encoding of SNPs #56

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Investigate supporting fixed-length MNP variants #51

Investigate supporting fixed-length MNP variants #51

timothymillar commented Sep 21, 2020

timothymillar commented Sep 22, 2020

Investigate supporting fixed-length MNP variants #51

Investigate supporting fixed-length MNP variants #51

Comments

timothymillar commented Sep 21, 2020

timothymillar commented Sep 22, 2020