Consolidate code to resolve SNP issues #42
Comments
Central to resolving SNP issues (e.g., assigning SNPs on chrom 0 (#13), populating missing RSIDs (#19), and assigning PAR SNPs) is having a resource that maintains a list of RSIDs and, at a minimum, each RSID's chromosome and position (for at least GRCh37 - remapping to other builds could be performed dynamically).

Initially I was thinking that a local resource of RSIDs could be built up by querying an API (e.g., as demonstrated in #19). However, in the absence of a database, that could result in concurrency issues if multiple threads are trying to build the resource. Additionally, a large amount of data could be transferred for these requests. For more than 100K SNPs to process, NCBI Variation Services recommends downloading data from the FTP site.

Therefore, I think the best solution here is to download dbSNP as a resource. Specifically, ftp://ftp.ncbi.nih.gov/snp/latest_release/VCF/GCF_000001405.25.gz would provide a list of all SNPs for GRCh37, with the RSID, reference allele, alternate alleles, and a variety of flags indicating properties of the SNP (e.g., clinical significance). Having this reference would also make saving SNPs as a VCF more robust, providing the reference alleles.

The only problem is that the compressed VCF file is ~14GB (~100GB uncompressed) - too large to load into memory with pandas. A tabix file is provided that indexes the VCF, but a pure Python solution for loading this resource would be preferred. Perhaps it'd be possible to parse the tabix index and load only the relevant portions of the gzip, similar to how large VCFs are de-compressed and loaded dynamically by existing tools.
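For illustration, a minimal sketch of the tabix-based partial load using `pysam` (not a pure Python solution, but it shows the idea; it assumes the VCF and its `.tbi` index have been downloaded locally, and that contigs use dbSNP's RefSeq accessions, e.g. `NC_000001.10` for chromosome 1 on GRCh37):

```python
import pysam  # C extension, so only a stand-in for an eventual pure Python approach

# Open the local copy of the dbSNP VCF; pysam reads the accompanying .tbi index
# (GCF_000001405.25.gz.tbi) so only the gzip blocks covering the requested
# region are decompressed.
dbsnp = pysam.TabixFile("GCF_000001405.25.gz")

def lookup_position(contig, pos):
    """Return (rsid, ref, alts) for a GRCh37 position, or None if not in dbSNP."""
    for line in dbsnp.fetch(contig, pos - 1, pos):  # fetch() uses 0-based coordinates
        chrom, position, rsid, ref, alt = line.split("\t")[:5]
        if int(position) == pos:
            return rsid, ref, alt.split(",")
    return None

# e.g., lookup_position("NC_000001.10", 10019)
```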
Alternatively, I think it could be possible to read the gzip and post-process it into compressed pickles, one for each chromosome (perhaps more chunks if these are still too large). Then only the pickle(s) for the relevant chromosome(s) would need to be loaded to resolve a given SNP.
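A rough sketch of that pre-processing step - streaming the sorted VCF once and writing one compressed pickle per contig. The output filenames and the `(pos, ref, alt)` payload are only assumptions about how the resource might be structured:

```python
import gzip
import pickle

def split_dbsnp_by_contig(vcf_path="GCF_000001405.25.gz"):
    """Stream the compressed dbSNP VCF once and write one compressed pickle of
    {rsid: (pos, ref, alt)} per contig. Relies on the VCF being sorted by
    contig, so only one contig's records are held in memory at a time (a
    whole chromosome may still be large, hence the note above about possibly
    needing smaller chunks)."""
    current_contig, table = None, {}

    def flush():
        # Write out the table accumulated for the contig just finished.
        if current_contig is not None:
            with gzip.open(f"dbsnp_{current_contig}.pkl.gz", "wb") as out:
                pickle.dump(table, out, protocol=pickle.HIGHEST_PROTOCOL)

    with gzip.open(vcf_path, "rt") as vcf:
        for line in vcf:
            if line.startswith("#"):
                continue  # skip VCF header lines
            contig, pos, rsid, ref, alt = line.split("\t", 5)[:5]
            if contig != current_contig:
                flush()
                current_contig, table = contig, {}
            table[rsid] = (int(pos), ref, alt)
    flush()
```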
So this is an interesting proposition - my one concern is that the 14GB would need to be downloaded for this functionality to work. In our use case, for example, we actually run this on AWS Lambda, where downloading 14GB isn't really feasible. If we are able to query 100k SNPs via the NCBI Variation Services, then we could sequentially call this endpoint with 100k queries each time until all the SNPs have been retrieved. We could also have both of these pieces of functionality and let the package user decide which method suits them best.
So with the NCBI Variation Services API, requests are limited to one per second. I think a couple of its endpoints could be used to resolve these issues - one to look up an RSID and return its chromosome, position, and alleles, and one to look up a chromosome / position and return the corresponding RSID.
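As a sketch of how such queries might be rate-limited and batched (the base URL and `/refsnp/{id}` path are assumptions to verify against the Variation Services documentation; the one-request-per-second pause comes from the limit mentioned above):

```python
import time
import requests

# Assumed base URL for NCBI Variation Services - to be checked against the docs.
API_BASE = "https://api.ncbi.nlm.nih.gov/variation/v0"

def lookup_rsids(rsids, pause=1.0):
    """Yield (rsid, response JSON) pairs, issuing at most one request per second.

    Callers could feed this batches of RSIDs (e.g., 100k at a time) until all
    SNPs have been retrieved, as suggested above."""
    with requests.Session() as session:
        for rsid in rsids:
            response = session.get(f"{API_BASE}/refsnp/{rsid.lstrip('rs')}")
            response.raise_for_status()
            yield rsid, response.json()
            time.sleep(pause)  # stay under the one request / second limit
```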
I agree that both types of processing should be provided, especially since the dbSNP VCF is 14GB, and users may only need to assign PAR SNPs. But for bulk processing of files, I think the dbSNP VCF would be much more efficient. Inspecting the dbSNP VCF, it has all of the information required to resolve these issues - and more, e.g., replacing the need for downloading the FASTA sequences to write VCF output, replacing the "ID" genotype with the actual insertions / deletions, and populating missing RSIDs.
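To illustrate the indel point: once the REF / ALT alleles for a record are known (e.g., from the `(pos, ref, alt)` tuples in the earlier sketch), translating an "I" / "D" genotype could look roughly like this. The shortest-allele / longest-allele mapping is an assumption that only holds for simple biallelic indels:

```python
def resolve_indel_genotype(genotype, ref, alts):
    """Translate 'I'/'D' genotype codes into actual alleles using the REF/ALT
    alleles from a dbSNP record. 'D' (deletion) is taken to be the shortest
    allele and 'I' (insertion) the longest one."""
    alleles = [ref] + alts
    shortest, longest = min(alleles, key=len), max(alleles, key=len)
    mapping = {"D": shortest, "I": longest}
    return "".join(mapping.get(allele, allele) for allele in genotype)

# e.g., resolve_indel_genotype("DI", ref="AC", alts=["A", "ACT"]) -> "AACT"
```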
Would there be any way to build the resources into the container / image filesystem where the code runs?
On Lambda, this functionality is not available unfortunately.
Create a new `Resolver` class to resolve SNP issues. Refactor `SNPs._assign_par_snps` to this class. Also, this class could be used to implement solutions to #13 and #19.
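A possible skeleton for the refactor - method names other than `_assign_par_snps`, and the `use_local_dbsnp` switch, are hypothetical:

```python
class Resolver:
    """Resolves SNP issues (chrom 0 assignment, missing RSIDs, PAR SNPs) using
    either a local dbSNP resource or the NCBI Variation Services API."""

    def __init__(self, snps, use_local_dbsnp=False):
        # `snps` is the SNPs object whose data needs fixing; `use_local_dbsnp`
        # is a hypothetical switch between the dbSNP resource and the API,
        # letting the package user decide which method suits them best.
        self._snps = snps
        self._use_local_dbsnp = use_local_dbsnp

    def assign_par_snps(self):
        """Refactored from SNPs._assign_par_snps."""
        ...

    def assign_chrom_0_snps(self):
        """Possible home for a solution to #13."""
        ...

    def populate_missing_rsids(self):
        """Possible home for a solution to #19."""
        ...
```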