Consolidate code to resolve SNP issues #42
Comments
Central to resolving SNP issues (e.g., assigning SNPs on chrom 0 (#13), populating missing RSIDs (#19), and assigning PAR SNPs) is having a resource that maintains a list of RSIDs and, at a minimum, each RSID's chromosome and position (for at least GRCh37 - remapping to other builds could be performed dynamically).

Initially I was thinking that a local resource of RSIDs could be built up by querying an API (e.g., as demonstrated in #19). However, in the absence of a database, that could result in concurrency issues if multiple threads are trying to build the resource. Additionally, a large amount of data could be transferred for these requests. For more than 100K SNPs to process, NCBI Variation Services recommends downloading data from the FTP site.

Therefore, I think the best solution here is to download dbSNP as a resource. Specifically, ftp://ftp.ncbi.nih.gov/snp/latest_release/VCF/GCF_000001405.25.gz would provide a list of all SNPs for GRCh37, with the RSID, reference allele, alternate alleles, and a variety of flags indicating properties of the SNP (e.g., clinical significance). Having this reference would also make saving SNPs as a VCF more robust, providing the reference alleles.

The only problem is that the compressed VCF file is ~14GB (~100GB uncompressed) - too large to load into memory with pandas. A tabix file is provided that indexes the VCF, but a pure Python solution for loading this resource would be preferred. Perhaps it'd be possible to parse the tabix index and load only the relevant portions of the gzip, similar to how large VCFs are de-compressed and loaded dynamically by existing tools.
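For illustration, a minimal sketch of the tabix-based partial load using `pysam` (not a pure Python solution, but it shows the idea; it assumes the VCF and its `.tbi` index have been downloaded locally, and that contigs use dbSNP's RefSeq accessions, e.g. `NC_000001.10` for chromosome 1 on GRCh37):

```python
import pysam  # C extension, so only a stand-in for an eventual pure Python approach

# Open the local copy of the dbSNP VCF; pysam reads the accompanying .tbi index
# (GCF_000001405.25.gz.tbi) so only the gzip blocks covering the requested
# region are decompressed.
dbsnp = pysam.TabixFile("GCF_000001405.25.gz")

def lookup_position(contig, pos):
    """Return (rsid, ref, alts) for a GRCh37 position, or None if not in dbSNP."""
    for line in dbsnp.fetch(contig, pos - 1, pos):  # fetch() uses 0-based coordinates
        chrom, position, rsid, ref, alt = line.split("\t")[:5]
        if int(position) == pos:
            return rsid, ref, alt.split(",")
    return None

# e.g., lookup_position("NC_000001.10", 10019)
```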
Alternatively, I think it could be possible to read the gzip and post-process it into compressed pickles, one for each chromosome (perhaps more chunks if these are still too large). Then only the pickle(s) for the relevant chromosome(s) would need to be loaded to resolve a given SNP.
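A rough sketch of that pre-processing step - streaming the sorted VCF once and writing one compressed pickle per contig. The output filenames and the `(pos, ref, alt)` payload are only assumptions about how the resource might be structured:

```python
import gzip
import pickle

def split_dbsnp_by_contig(vcf_path="GCF_000001405.25.gz"):
    """Stream the compressed dbSNP VCF once and write one compressed pickle of
    {rsid: (pos, ref, alt)} per contig. Relies on the VCF being sorted by
    contig, so only one contig's records are held in memory at a time (a
    whole chromosome may still be large, hence the note above about possibly
    needing smaller chunks)."""
    current_contig, table = None, {}

    def flush():
        # Write out the table accumulated for the contig just finished.
        if current_contig is not None:
            with gzip.open(f"dbsnp_{current_contig}.pkl.gz", "wb") as out:
                pickle.dump(table, out, protocol=pickle.HIGHEST_PROTOCOL)

    with gzip.open(vcf_path, "rt") as vcf:
        for line in vcf:
            if line.startswith("#"):
                continue  # skip VCF header lines
            contig, pos, rsid, ref, alt = line.split("\t", 5)[:5]
            if contig != current_contig:
                flush()
                current_contig, table = contig, {}
            table[rsid] = (int(pos), ref, alt)
    flush()
```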
So this is an interesting proposition - my one concern is that the 14GB would need to be downloaded for this functionality to work. In our use case, for example, we actually run this on AWS Lambda, where downloading 14GB isn't really feasible. If we are able to query 100k SNPs via the NCBI Variation Services, then we could sequentially call this endpoint with 100k queries each time until all the SNPs have been retrieved. We could also have both of these pieces of functionality and let the package user decide which method suits them best.
So with the NCBI Variation Services API, requests are limited to one per second. I think a couple of its endpoints could be used to resolve these issues - one to look up an RSID and return its chromosome, position, and alleles, and one to look up a chromosome / position and return the corresponding RSID.
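As a sketch of how such queries might be rate-limited and batched (the base URL and `/refsnp/{id}` path are assumptions to verify against the Variation Services documentation; the one-request-per-second pause comes from the limit mentioned above):

```python
import time
import requests

# Assumed base URL for NCBI Variation Services - to be checked against the docs.
API_BASE = "https://api.ncbi.nlm.nih.gov/variation/v0"

def lookup_rsids(rsids, pause=1.0):
    """Yield (rsid, response JSON) pairs, issuing at most one request per second.

    Callers could feed this batches of RSIDs (e.g., 100k at a time) until all
    SNPs have been retrieved, as suggested above."""
    with requests.Session() as session:
        for rsid in rsids:
            response = session.get(f"{API_BASE}/refsnp/{rsid.lstrip('rs')}")
            response.raise_for_status()
            yield rsid, response.json()
            time.sleep(pause)  # stay under the one request / second limit
```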
I agree that both types of processing should be provided, especially since the dbSNP VCF is 14GB, and users may only need to assign PAR SNPs. But for bulk processing of files, I think the dbSNP VCF would be much more efficient. Inspecting the dbSNP VCF, it has all of the information required to resolve these issues - and more, e.g., replacing the need for downloading the FASTA sequences to write VCF output, replacing the "ID" genotype with the actual insertions / deletions, and populating missing RSIDs.
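To illustrate the indel point: once the REF / ALT alleles for a record are known (e.g., from the `(pos, ref, alt)` tuples in the earlier sketch), translating an "I" / "D" genotype could look roughly like this. The shortest-allele / longest-allele mapping is an assumption that only holds for simple biallelic indels:

```python
def resolve_indel_genotype(genotype, ref, alts):
    """Translate 'I'/'D' genotype codes into actual alleles using the REF/ALT
    alleles from a dbSNP record. 'D' (deletion) is taken to be the shortest
    allele and 'I' (insertion) the longest one."""
    alleles = [ref] + alts
    shortest, longest = min(alleles, key=len), max(alleles, key=len)
    mapping = {"D": shortest, "I": longest}
    return "".join(mapping.get(allele, allele) for allele in genotype)

# e.g., resolve_indel_genotype("DI", ref="AC", alts=["A", "ACT"]) -> "AACT"
```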
Would there be any way to build the resources into the container / image filesystem where the code runs?
On Lambda, this functionality is not available unfortunately.
Create a new `Resolver` class to resolve SNP issues. Refactor `SNPs._assign_par_snps` to this class. Also, this class could be used to implement solutions to #13 and #19.
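A possible skeleton for the refactor - method names other than `_assign_par_snps`, and the `use_local_dbsnp` switch, are hypothetical:

```python
class Resolver:
    """Resolves SNP issues (chrom 0 assignment, missing RSIDs, PAR SNPs) using
    either a local dbSNP resource or the NCBI Variation Services API."""

    def __init__(self, snps, use_local_dbsnp=False):
        # `snps` is the SNPs object whose data needs fixing; `use_local_dbsnp`
        # is a hypothetical switch between the dbSNP resource and the API,
        # letting the package user decide which method suits them best.
        self._snps = snps
        self._use_local_dbsnp = use_local_dbsnp

    def assign_par_snps(self):
        """Refactored from SNPs._assign_par_snps."""
        ...

    def assign_chrom_0_snps(self):
        """Possible home for a solution to #13."""
        ...

    def populate_missing_rsids(self):
        """Possible home for a solution to #19."""
        ...
```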