-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Check probability encoding of SNPs #56
Comments
There appears to be 2 ways to interpret an error rate e for a SNP call
In the case of an enforced bi-allelic SNP (i.e. a constraint that the allele call must be one of two possibilities) there are two ways to generalize the above calculations such that the probabilities still sum to 1: a. Use a = 2 For example in in a unconstrained SNP with a reference call and e = 0.3 then the possible encodings are:
If we add the constrain that the allele is bi-allelic then the possible (normalized) encodings are:
If a non-allowed call has been made then the allele is encoded as a gap Currently we are using option 1a. If encoding MNPs #51 then non-normalized distributions arise due to reads with only partial coverage of the MNP. |
Function for 2a if needed def as_probabilistic(array, n_alleles, p=1.0, dtype=np.float):
# check inputs
array = np.array(array, copy=False)
n_alleles = np.array(n_alleles, copy=False)
p = np.array(p, copy=False)
# special case for zero-length reads
if array.shape[-1] == 0:
return np.empty(array.shape + (0,), dtype=dtype)
max_allele = np.max(n_alleles)
n_alleles = n_alleles[..., None]
# onehot encoding of alleles
new = (array[..., None] == np.arange(0, max_allele)).astype(np.float)
# probability that call is correct
new *= p[..., None]
# even probability if incorrect call
new += (1 - np.expand_dims(p, -1)) / n_alleles
# fill gaps with nans
new[array < 0] = np.nan
# fill non-alleles
new[..., n_alleles <= np.arange(max_allele)] = 0
return new |
Function for 1b def as_probabilistic(array, n_alleles, p=1.0, error_factor=3, dtype=np.float):
array = np.array(array, copy=False)
n_alleles = np.array(n_alleles, copy=False)
error_factor = np.array(error_factor, copy=False)
p = np.array(p, copy=False)
# special case for zero-length reads
if array.shape[-1] == 0:
return np.empty(array.shape + (0,), dtype=dtype)
vector_length = np.max(n_alleles)
# onehot encoding of alleles
onehot = array[..., None] == np.arange(vector_length)
# basic probs
new = ((1 - p) / error_factor)[..., None] * ~onehot
calls = p[..., None] * onehot
new[onehot] = calls[onehot]
# zero out non-alleles
new[..., n_alleles[..., None] <= np.arange(vector_length)] = 0
# normalize
new /= np.nansum(new, axis=-1, keepdims=True)
# nan fill gaps and re-zero
new[array < 0] = np.nan
new[..., n_alleles[..., None] <= np.arange(vector_length)] = 0
return new |
Decided on option 1b as this seems to be the best combination of intuitive and correctly interpreting error probability in a base call of a read.
This is not perfect because the bi/tri-allelic constraint implies additional prior knowledge that not all other alleles are equally likely but it seems to be the most sensible/conservative solution that doesn't require per-SNP input from the user. |
Done in #57 |
On second thought this is true in the general case for encoding a generic probability distribution but here we are technically just encoding the probability of identity between reads and haplotypes. The likelihood function is asking "what is is the probability of this read coming from this genotype?" in which case the read can encode all possible alleles, not just those that the haplotypes are limited to. |
Done in #59 |
It also looks like this is a generalized equivalent to ANDSG and GATK |
Currently the error prob is being treated as "probability that the true allele it is not the called allele".
The alternate interpretation is "Probability that call is meaningless" in which case some portion of that probability
should be assigned to the called allele i.e. the calling went wrong but happened to result in a correct call.
This difference has a negligible affect on haplotype calls but has technical implementations when incorporating MNPs #51.
The text was updated successfully, but these errors were encountered: