Consider refactoring `genotype` column #45

apriha · 2019-11-28T06:13:48Z

Currently, snps normalizes data into a dataframe with four columns named rsid, chrom, pos, and genotype.

genotype can either be np.nan or a string of length 1 or 2. For autosomal SNPs and the X chromosome, the genotype is always a length 2 string. See #43.

The Y chromosome and mtDNA alleles are often strings of length 1; however, SNPs in the pseudoautosomal region on the X and Y chromosomes often have two alleles reported.

So, to better handle the varying number of alleles reported for the X and Y chromosomes and mtDNA, consider refactoring the genotype column into allele1 and allele2.

Note that this would also more naturally support phased genotypes (vs. indexing a length 2 string genotype), wherein allele1 could be alleles on one chromosome, and allele2 could be alleles on the other. See #44.
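A minimal sketch of what the split could look like, assuming a pandas dataframe with the four columns described above. The helper name `split_genotype` and the sample rows are hypothetical and only for illustration; this is not existing snps code.

```python
import numpy as np
import pandas as pd


def split_genotype(df):
    """Return a copy of df with genotype replaced by allele1 and allele2.

    Hypothetical helper for illustration; not part of the snps API.
    """
    out = df.drop(columns="genotype")
    # The .str accessor propagates missing values, so np.nan stays missing.
    out["allele1"] = df["genotype"].str[0]
    # Indexing past the end of a length-1 string yields a missing value,
    # so single-allele calls (e.g., Y or MT) get a missing allele2.
    out["allele2"] = df["genotype"].str[1]
    return out


# Made-up example rows covering length-2, length-1, and missing genotypes.
snps_df = pd.DataFrame(
    {
        "rsid": ["rs1", "rs2", "rs3", "rs4"],
        "chrom": ["1", "X", "Y", "MT"],
        "pos": [101, 202, 303, 404],
        "genotype": ["AG", "CC", "T", np.nan],
    }
)
alleles_df = split_genotype(snps_df)
```

With this layout, a single reported allele is represented by a missing allele2 rather than by a shorter string, so downstream code would not need to check string lengths.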

apriha · 2019-11-28T06:16:03Z

Hi @willgdjones, what are your thoughts on this?

willgdjones · 2019-12-01T11:08:23Z

This seems like a sensible approach 👍. In fact, we have just encountered the issue you describe in #43 ourselves with Sano genotype data files, where the X chromosome is duplicated for male samples.

willgdjones · 2019-12-06T15:33:07Z

Hi @apriha - were you in the midst of tackling this? This is functionality that will soon be important for us, so I'm considering jumping in on this.

willgdjones · 2019-12-06T15:34:43Z

In fact - it is more #43 that is directly relevant for us so I will look to fix that first.

apriha · 2019-12-06T19:48:01Z

Hey @willgdjones, I started looking at this as part of #44 and plan to push some parsing updates related to this soon.

I was just thinking - perhaps for backwards compatibility, we could optionally generate the current dataframe structure on the fly when requested (i.e., with the genotype column as described above), while using allele1 and allele2 in the underlying structure.
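A rough sketch of how that on-the-fly conversion might work, assuming the underlying dataframe holds allele1 and allele2 columns with missing values stored as np.nan. The names `join_alleles` and `with_genotype` and the sample rows are hypothetical, not existing snps functions or data.

```python
import numpy as np
import pandas as pd


def join_alleles(allele1, allele2):
    """Combine two alleles into a legacy genotype value."""
    if pd.isna(allele1):
        # No call at all: keep the genotype missing.
        return np.nan
    # A missing second allele gives a length-1 genotype.
    return allele1 if pd.isna(allele2) else allele1 + allele2


def with_genotype(df):
    """Return the four-column layout (rsid, chrom, pos, genotype).

    Hypothetical helper for illustration; not part of the snps API.
    """
    out = df[["rsid", "chrom", "pos"]].copy()
    out["genotype"] = [
        join_alleles(a1, a2) for a1, a2 in zip(df["allele1"], df["allele2"])
    ]
    return out


# Made-up example rows in the proposed allele1/allele2 layout.
alleles_df = pd.DataFrame(
    {
        "rsid": ["rs1", "rs2", "rs3", "rs4"],
        "chrom": ["1", "X", "Y", "MT"],
        "pos": [101, 202, 303, 404],
        "allele1": ["A", "C", "T", np.nan],
        "allele2": ["G", "C", np.nan, np.nan],
    }
)
legacy_df = with_genotype(alleles_df)
```

Because the genotype column is derived only when requested, the allele columns would remain the single source of truth and the two representations could not drift apart.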

willgdjones · 2019-12-06T19:58:18Z

I like that suggestion about keeping the genotype column for backward compatibility.

apriha added the enhancement label (New feature or request) on Dec 20, 2019

apriha pushed a commit that referenced this issue Aug 24, 2022

Merge pull request #45 from sanogenetics/feature/23andme-missings

6ce1f8f

handle unusual 23andme files with missing values
