
Check probability encoding of SNPs #56

Closed
timothymillar opened this issue Sep 24, 2020 · 8 comments

@timothymillar
Collaborator

Currently the error probability is being treated as the "probability that the true allele is not the called allele".
The alternative interpretation is the "probability that the call is meaningless", in which case some portion of that probability should be assigned to the called allele, i.e. the calling went wrong but happened to result in a correct call.

This difference has a negligible effect on haplotype calls but has technical implications when incorporating MNPs (#51).

@timothymillar timothymillar self-assigned this Sep 24, 2020
@timothymillar
Collaborator Author

There appear to be two ways to interpret an error rate e for a SNP call:

  1. The probability that the called allele is correct is 1 - e and the probability that any other allele is the true allele is e / (a - 1), where a is the number of possible alleles, i.e. 4
  2. The error rate is the probability that the call is a random choice among all alleles, including the correct one, e.g. the probability that the called allele is correct is (1 - e) + e / a and the probability of any other allele being correct is e / a

In the case of an enforced bi-allelic SNP (i.e. a constraint that the allele call must be one of two possibilities), there are two ways to generalize the above calculations such that the probabilities still sum to 1:

a. Use a = 2
b. Calculate as though all 4 alleles are possible, then remove the non-allowed alleles and normalize the result

For example, in an unconstrained SNP with a reference call and e = 0.3, the possible encodings are:

  • 1 = [0.7, 0.1, 0.1, 0.1]
  • 2 = [0.775, 0.075, 0.075, 0.075]

If we add the constraint that the SNP is bi-allelic then the possible (normalized) encodings are:

  • 1a = [0.7, 0.3]
  • 1b = [0.875, 0.125] (= [0.7, 0.1] / 0.8)
  • 2a = [0.85, 0.15] (= [0.7, 0.0] + 0.3/2)
  • 2b = [0.91176..., 0.0882...] (= [0.775, 0.075] / 0.85)

If a non-allowed call has been made then the allele is encoded as a gap (-).
In practice the error rates are much smaller than 0.3, so the differences between these options will be much smaller.
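
A quick numeric check of these encodings, as a minimal sketch using numpy (the values match the bullets above):

import numpy as np

e = 0.3  # error rate from the example
a = 4    # number of possible nucleotides (A, C, G, T)

# interpretation 1: called allele gets 1 - e, others share e equally
enc_1 = np.array([1 - e] + [e / (a - 1)] * (a - 1))      # [0.7, 0.1, 0.1, 0.1]

# interpretation 2: the call is random with probability e, so the
# called allele also receives a share of the error mass
enc_2 = np.array([(1 - e) + e / a] + [e / a] * (a - 1))  # [0.775, 0.075, 0.075, 0.075]

# bi-allelic constraint
enc_1a = np.array([1 - e, e])                # a = 2      -> [0.7, 0.3]
enc_1b = enc_1[:2] / enc_1[:2].sum()         # normalize  -> [0.875, 0.125]
enc_2a = np.array([(1 - e) + e / 2, e / 2])  # a = 2      -> [0.85, 0.15]
enc_2b = enc_2[:2] / enc_2[:2].sum()         # normalize  -> [0.9117..., 0.0882...]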

Currently we are using option 1a.
Interpretation 1 of the error rate appears to be more common and is probably the most intuitive.
Option 1a for constrained situations seems incorrect as it only increases the probability of the alt allele.
Option 1b seems more appropriate, but raises the question of whether normalization is necessary here.

If encoding MNPs (#51) then non-normalized distributions arise due to reads with only partial coverage of the MNP.
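
As a hypothetical illustration (not from the codebase): consider a two-base MNP with allowed alleles AT and GC, and a read that covers only the first base and calls A with e = 0.3 under interpretation 1. The covered base is encoded as [0.7, 0.1, 0.1, 0.1] and the uncovered base carries no information, so the read's probability of identity with AT is 0.7 and with GC is 0.1. These sum to 0.8 rather than 1 because the remaining mass lies on MNP alleles (e.g. AC or GT) outside the allowed set.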

@timothymillar
Collaborator Author

Function for 2a, if needed:

import numpy as np


def as_probabilistic(array, n_alleles, p=1.0, dtype=np.float64):
    # check inputs
    array = np.array(array, copy=False)
    n_alleles = np.array(n_alleles, copy=False)
    p = np.array(p, copy=False)

    # special case for zero-length reads
    if array.shape[-1] == 0:
        return np.empty(array.shape + (0,), dtype=dtype)

    max_allele = np.max(n_alleles)
    n_alleles = n_alleles[..., None]

    # onehot encoding of the called alleles
    new = (array[..., None] == np.arange(0, max_allele)).astype(dtype)

    # probability that the call is correct
    new *= p[..., None]

    # error mass spread evenly across all allowed alleles, including the called one
    new += (1 - np.expand_dims(p, -1)) / n_alleles

    # fill gaps with nans
    new[array < 0] = np.nan

    # zero out non-alleles
    new[..., n_alleles <= np.arange(max_allele)] = 0

    return new
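
For example, a quick usage sketch (expected outputs follow the 2a encodings from the earlier example, with p = 1 - e = 0.7 under a bi-allelic constraint):

calls = np.array([0, 1, -1])  # ref call, alt call, gap
as_probabilistic(calls, n_alleles=2, p=0.7)
# array([[0.85, 0.15],
#        [0.15, 0.85],
#        [ nan,  nan]])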

@timothymillar
Collaborator Author

timothymillar commented Sep 29, 2020

Function for 1b:

import numpy as np


def as_probabilistic(array, n_alleles, p=1.0, error_factor=3, dtype=np.float64):
    # check inputs
    array = np.array(array, copy=False)
    n_alleles = np.array(n_alleles, copy=False)
    error_factor = np.array(error_factor, copy=False)
    p = np.array(p, copy=False)

    # special case for zero-length reads
    if array.shape[-1] == 0:
        return np.empty(array.shape + (0,), dtype=dtype)

    vector_length = np.max(n_alleles)

    # onehot encoding of the called alleles
    onehot = array[..., None] == np.arange(vector_length)

    # basic probs: the error mass is split between the error_factor
    # alternative nucleotides (3 by default)
    new = ((1 - p) / error_factor)[..., None] * ~onehot
    calls = p[..., None] * onehot
    new[onehot] = calls[onehot]

    # zero out non-alleles
    new[..., n_alleles[..., None] <= np.arange(vector_length)] = 0

    # normalize so that the allowed alleles sum to 1
    new /= np.nansum(new, axis=-1, keepdims=True)

    # nan-fill gaps and re-zero the non-alleles
    new[array < 0] = np.nan
    new[..., n_alleles[..., None] <= np.arange(vector_length)] = 0
    return new
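
For example, a quick usage sketch (expected outputs follow the 1b encodings from the earlier example, with p = 1 - e = 0.7):

calls = np.array([0, 1])  # ref call, alt call
as_probabilistic(calls, n_alleles=2, p=0.7)
# array([[0.875, 0.125],
#        [0.125, 0.875]])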

@timothymillar
Collaborator Author

Decided on option 1b as this seems to be the best combination of being intuitive and correctly interpreting the error probability of a base call in a read.

  • The error rate is specified without knowledge of the potential bi/tri-allelic constraint
  • Therefore the error is the probability that the real allele is not the called allele
  • There are three other nucleotides that could be the true call, so the probability of each is error rate / 3
  • If we remove some of the options (bi/tri-allelic constraint) then the probabilities of the remaining options should be scaled proportionally such that they sum to 1

This is not perfect because the bi/tri-allelic constraint implies additional prior knowledge that not all other alleles are equally likely, but it seems to be the most sensible/conservative solution that doesn't require per-SNP input from the user.

@timothymillar
Collaborator Author

Done in #57

@timothymillar
Collaborator Author

"If we remove some of the options (bi/tri-allelic constraint) then the probabilities of remaining options should be scaled proportionally such that they sum to 1"

On second thought, this is true in the general case of encoding a generic probability distribution, but here we are technically just encoding the probability of identity between reads and haplotypes.
This is only used in the context of the likelihood function P(reads | haplotypes), and hence the haplotypes are treated as constants.
Any bi/tri-allelic constraints are applied to the haplotypes via the proposal distribution.

The likelihood function is asking "what is the probability of this read coming from this genotype?", in which case the read can encode all possible alleles, not just those that the haplotypes are limited to.
This means that a non-allowed variant call in a read could be encoded as e / a for all allowed alleles.
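
For example (using the e = 0.3, a = 4 values from earlier), a read calling a nucleotide outside a bi-allelic constraint would contribute e / a = 0.075 to each of the two allowed alleles, i.e. an (intentionally non-normalized) encoding of [0.075, 0.075] rather than a gap.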

@timothymillar timothymillar reopened this Oct 2, 2020
@timothymillar timothymillar mentioned this issue Oct 6, 2020
@timothymillar
Collaborator Author

Done in #59

@timothymillar
Collaborator Author

It also looks like this is a generalized equivalent to ANGSD and GATK.
