Consider refactoring `genotype` column #45
Comments
Hi @willgdjones, what are your thoughts on this?
This seems like a sensible approach 👍. In fact, we have just encountered the issue you describe in #43 ourselves, with Sano genotype data files where the X chromosome is duplicated for male samples.
Hi @apriha, were you in the midst of tackling this? This functionality will soon be important for us, so I'm considering jumping in on this.
In fact, it is #43 that is more directly relevant for us, so I will look to fix that first.
Hey @willgdjones, I started looking at this as part of #44 and plan to push some parsing updates related to this soon. I was just thinking - perhaps for backwards compatibility, we can also optionally generate the current data frame structure on the fly (i.e., with the
I like that suggestion about keeping the
handle unusual 23andme files with missing values
Currently, `snps` normalizes data into a dataframe with four columns named `rsid`, `chrom`, `pos`, and `genotype`. `genotype` can either be `np.nan` or a string of length 1 or 2. For autosomal SNPs and the X chromosome, the genotype is always a length 2 string. See #43.

The Y chromosome and mtDNA alleles are often strings of length 1; however, SNPs in the pseudoautosomal region on the X and Y chromosomes often have two alleles reported.
So, to better handle the varying numbers of alleles reported for the X and Y chromosomes and mtDNA, consider refactoring the `genotype` column into `allele1` and `allele2`.

Note that this would also more naturally support phased genotypes (vs. indexing a length 2 string genotype), wherein `allele1` could be alleles on one chromosome, and `allele2` could be alleles on the other. See #44.
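
To make the proposal concrete, here is a rough sketch of what the split might look like with pandas. The helpers `split_genotype` and `join_alleles` are hypothetical names used only for illustration; they are not part of `snps`, and the real parsing code may need to handle more cases. The second helper shows one way the current single-column structure could be regenerated on demand for backwards compatibility.

```python
import numpy as np
import pandas as pd


def split_genotype(df):
    """Hypothetical helper (not part of snps): split `genotype` into two allele columns."""
    out = df.drop(columns="genotype")
    genotype = df["genotype"]
    # .str indexing propagates NaN, so a missing genotype stays missing in both
    # columns, and a length 1 genotype (e.g., Y or mtDNA) leaves allele2 missing.
    out["allele1"] = genotype.str[0]
    out["allele2"] = genotype.str[1]
    return out


def join_alleles(df):
    """Hypothetical inverse: rebuild the current `genotype` column from the alleles."""
    out = df.drop(columns=["allele1", "allele2"])
    # A missing allele2 contributes nothing; a missing allele1 yields a missing genotype.
    out["genotype"] = df["allele1"] + df["allele2"].fillna("")
    return out


# Illustrative data only (made-up rsids and positions)
example = pd.DataFrame(
    {
        "rsid": ["rs1", "rs2", "rs3"],
        "chrom": ["1", "Y", "MT"],
        "pos": [100, 200, 300],
        "genotype": ["AG", "C", np.nan],
    }
)
split = split_genotype(example)
restored = join_alleles(split)
```

Keeping `allele2` as a missing value for single-allele calls, rather than duplicating `allele1`, avoids implying a second observed allele where only one was reported; whether that is the right convention for `snps` is an open question.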