Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Strange truncation of fastq files when demultiplexing short Illumina inserts libraries #90

Open
j0n0curry opened this issue Aug 27, 2024 · 2 comments

Comments

@j0n0curry
Copy link

Hi there,

Have been using your library to demultiplex libraries - 384 well illumina libraries per ONT run - which have a similar construction to Tru-seq paired indexes.

The first sets tested worked very well - library insert size range 200 - 600 bp - targeted PCR library with intentional bias to see if the input ratio would be picked up and was.

However, we are also interested in the performance of the indexing plates - both for cross contamination and total library read counts per index - then compare back to the original Illumina library run. Using a 90 bp insert (x 4 - one per quadrant of the plate) and final amplicon 165 bp long. The combined demultiplexing report of 800 plus files provides a relatively realistic number of index pairs found - however the demux fastq.gz files - one per 384 well, provides only short reads of mean 26 bases +/- 40 bp. Strangely. The filtered fastq files were filtered for 180 to 400 bp ranges and pre-mapped to verify that number quantity of mappable reads. Post Anglerfish demux results in low or no alignment (bowtie2 or Minknow medeka) - looking through with fastqc showed they were too short - BLAST did no find many that aligned with insert or the i5 or i7 region.

I have attached a zipped file with: output of Anglerfish / text with settings used (have tried with and without -lenient) / fastq filtered file / the resulting demux reads from this file / a copy of the samples sheet - 384 indexes. I am using v0.6.1 - Ubuntu on WSL - Python 3.10 - minimap2 is installed. Unsure if very short amplicons causes problems with the pipeline. The fragment range includes singles and duplicated reads - expected.

anglerfish_test.zip

Thanks very much for this pipeline! We are hoping to pipe to this for real-time demultiplexing.

@remiolsen
Copy link
Member

Hi!

Thanks for sharing the result files, and for testing out anglerfish. Sorry if I'm misunderstanding something about your library or sequencing setup.

the combined demultiplexing report of 800 plus files provides a relatively realistic number of index pairs found - however the demux fastq.gz files - one per 384 well, provides only short reads of mean 26 bases +/- 40 bp. [...] Post Anglerfish demux results in low or no alignment (bowtie2 or Minknow medeka) - looking through with fastqc showed they were too short - BLAST did no find many that aligned with insert or the i5 or i7 region.

Anglerfish will only output in the demuxed.fastq.gz files the library inserts or more accurately, the sequence between the two matching i7 and i5 templates. I'll give an example using paftools view for the minimap2 alignment file you included, for one read (the query sequence)

(base) remi.olsen@remiombp anglerfish_test % paftools view truseq_dual.paf | grep "002774fa-5542-4b67-a3da-42f289e20508" -A 3
>002774fa-5542-4b67-a3da-42f289e20508	182	128	181	+	truseq_dual_i7	65	0	53	45	45	42	NM:i:8	ms:i:262	AS:i:262	nn:i:8	tp:A:P	cm:i:10	s1:i:36	s2:i:0	de:f:0.1509	rl:i:0
Ref+:          1 =================================nnnnnnnn============ 53        
                 |||||||||||||||||||||||||||||||||        ||||||||||||
Qry+:        129 =================================tatgcacg============ 181       
--
>002774fa-5542-4b67-a3da-42f289e20508	182	33	102	+	truseq_dual_i5	70	0	70	61	62	60	NM:i:9	ms:i:352	AS:i:352	nn:i:8	tp:A:P	cm:i:13	s1:i:46	s2:i:0	de:f:0.1286	rl:i:0
Ref+:          1 =============================nnnnnnnn======t========================== 70        
                 |||||||||||||||||||||||||||||        |||||| ||||||||||||||||||||||||||
Qry+:         34 =============================gcagacga======-========================== 102 

We see that the first 33 bases of this 182 bp long read does not match the template (but probably still is adaptor), the second not mathcing area is between bp 102 and 129, which anglerfish counts as the insert. And indeed the majority of the 4000 reads - it's very fortunate that all except 2 reads map to either i5 or i7, because then I know the length distribution - seem to be this short (190 +- 31 bp):

cut -f1,2 truseq_dual.paf | sort | uniq | datamash mean 2 sstdev 2 count 2
190.48724362181	31.057328130074	3998

This is all to say that, if you expected longer inserts (90 as you say, in all these wells or only 4 of them?) I am not able to explain why these are so short from the anglerfish output alone. Even the few "long" ones, say over 300 bp

cut -f1,2 truseq_dual.paf | sort | uniq | awk '{if($2 > 300){print $0}}' | wc -l
      61

Seem to be mostly be concatenated shorter fragments, e.g. i5-insert-i7-i5-insert-i7 constructs. Which is not uncommon to see.

We are hoping to pipe to this for real-time demultiplexing.

That's really cool. It's something I've had in the back of my mind, but never implemented because it doesn't really fit with the workflow in our lab.

@j0n0curry
Copy link
Author

Yes.. figured it out - but thank you for getting back to me. About 3% of reads are conatamers. It is finding the indexes - and trimming the seqence 'mappable' read) between the flanking i7 i5 region (the intended behaviour). A reference for bowtie or medeka with just central insert which is 25 bp long works. My bad should have looked at the sequences more closely - more than 99% of the reads map to the trimmed reafernece insert alone.
Pile up inserts
image

contig alignment to central (unique) region.
image

thanks again for getting back to me. Yes... these are all 384 plates beinig tested. Example 4000 read output - with conditions is about 1 minute to process - 32 Gb ram /intel i7 laptop or about 12 hours for a full run or work well as the reads come off the nanopore and merge all fastq per well at the end.

Thanks again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants