Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Preprocessing warning message #10

Open
gspirito opened this issue Apr 6, 2022 · 4 comments
Open

Preprocessing warning message #10

gspirito opened this issue Apr 6, 2022 · 4 comments
Assignees
Labels
documentation Improvements or additions to documentation

Comments

@gspirito
Copy link

gspirito commented Apr 6, 2022

Hi, I ran superSTR preprocessing on a cohort of WGS samples (in bam format). The output seems ok for all samples, however, for some of them I get this message in the log:


[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/90a8a5267a890ccd78717e88de59232d": Protocol not supported
[E::fai_build3_core] Failed to open the file /gpfs/internal/sweng/production/Resources/GRCh38_1000genomes/GRCh38_full_analysis_set_plus_decoy_hla.fa
[E::refs_load_fai] Failed to open reference file '/gpfs/internal/sweng/production/Resources/GRCh38_1000genomes/GRCh38_full_analysis_set_plus_decoy_hla.fa'
[E::cram_get_ref] Failed to populate reference for id 3241
[E::cram_next_slice] Failure to decode slice
Successful run; processed 793068206 reads.

Is it safe to proceed with the analysis anyway?

Thanks.

@lfearnley
Copy link
Contributor

That's odd for a BAM - HTSlib seems to be treating this file as though it's a CRAM, then throwing an error when it tries to download reference sequences from ENA.

Is there any chance I could get you to please provide one of the commands that's generating the error?

@lfearnley lfearnley self-assigned this Apr 6, 2022
@lfearnley lfearnley added the question Further information is requested label Apr 6, 2022
@gspirito
Copy link
Author

gspirito commented Apr 7, 2022

Hi, thanks for the answer, actually some of the files in my cohort are cram and others are bam, however, in the manual I saw that I should use "--mode=bam" for both cram and bam; furthermore the error occurs only for some cram files and not other cram.

Here's the command I use:

superstr --mode=bam -o /homenfs/user/HG01813/ -t 0.49 /homenfs/user/HG01813.final.cram

Thanks

@lfearnley
Copy link
Contributor

Yep! You've got the correct command.

The issue is in htslib and how it reads CRAMs (which superSTR relies on, as well as samtools and many other pieces of software). htslib is failing to download the reference sequence that you've aligned to for some reason from ENA; this happens sometimes transiently I've found, but can also happen in server/workflow contexts.

If you're aligned to GRCh38, running the following should work (from here):

wget https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta
wget https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.fai 
wget https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.ref_cache.tar.gz

tar xzf Homo_sapiens_assembly38.ref_cache.tar.gz
export REF_PATH="$(pwd)/ref/cache/%2s/%2s/%s:http://www.ebi.ac.uk/ena/cram/md5/%s"
export REF_CACHE="$(pwd)/ref/cache/%2s/%2s/%s"

Full documentation is available here. If you're running in an HPC, you'll need to make sure that REF_CACHE and REF_PATH are correctly set in the job environment.

Unfortunately, you'll need to re-run those CRAMs for which the error was generated :(

@lfearnley lfearnley added documentation Improvements or additions to documentation and removed question Further information is requested labels Apr 7, 2022
@lfearnley
Copy link
Contributor

Oh, as a note to myself - I'll add some comments on this in the documentation before closing this out.

If there's any other issues you hit with superSTR let me know!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation
Projects
None yet
Development

No branches or pull requests

2 participants