Replies: 13 comments
-
Hi @wagnerde, Thanks for reaching out and for your interest in alevin. Yes, we are aware of the issue, and we are working on a stable API to support regex-based CB & UMI extraction. However, in case you are in urgent need, here's a fast solution.
-
Hi @k3yavi, Thanks for the great tool! :) To follow up on @wagnerde's request, do you have any news on this regex-based CB and UMI feature? Also, I guess that in the meantime you would suggest using UMI-tools for this kind of data? Cheers,
-
Hi @mbahin, Thanks again for reaching out. Unfortunately, we are still working on this. For now, one might have to preprocess the sequences in the R1 file to remove any non-CB, non-UMI sequence before running the alevin quantification pipeline. We will post an update here once we have a stable API for regex-based processing.
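The preprocessing step suggested here could look roughly like the following. This is only a minimal sketch, not something from the alevin codebase: the helper names (`read_fastq`, `keep_segments`, `trim_r1`) and the example coordinates are hypothetical, and it assumes the CB and UMI sit at fixed, known 0-based positions in every R1 read. Reads that are too short raise an error rather than being dropped, so that R1 stays in sync with R2.

```python
def read_fastq(handle):
    """Yield (header, sequence, quality) tuples from a FASTQ text stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def keep_segments(seq, qual, segments):
    """Concatenate the 0-based, end-exclusive (start, stop) slices to keep."""
    new_seq = "".join(seq[start:stop] for start, stop in segments)
    new_qual = "".join(qual[start:stop] for start, stop in segments)
    return new_seq, new_qual


def trim_r1(src, dst, segments):
    """Rewrite every R1 record so that only the CB/UMI segments remain."""
    needed = max(stop for _, stop in segments)
    for header, seq, qual in read_fastq(src):
        if len(seq) < needed:
            # Dropping the read would desynchronize R1 and R2, so fail loudly.
            raise ValueError(f"read {header} is shorter than {needed} bp")
        new_seq, new_qual = keep_segments(seq, qual, segments)
        dst.write(f"{header}\n{new_seq}\n+\n{new_qual}\n")
```

For a hypothetical layout with CB part 1 at 0-8, a 4bp linker, CB part 2 at 12-20, and a UMI at 20-26, one would call `trim_r1(src, dst, [(0, 8), (12, 20), (20, 26)])` with open text handles; the trimmed R1 then carries a contiguous 16bp CB followed by a 6bp UMI.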
-
Hi all, Just to make sure we size up the design space correctly, what type of specifications do folks think would be "expressive enough"?
-
Thanks for your quick answer. In an ideal world, BCs could be split into several pieces (the concatenated BC could be quite long, longer than the 20bp currently allowed) and fixed-length sequences could be found between any pieces (BC or UMI). I would add that, one way or another, we should be able to specify whether each fragment is on R1 or R2. I don't know if this answers the question, but it's just to show a particular use case! :) Cheers,
-
Hi everyone, Thanks once again to the Salmon/Alevin team for an incredible tool! I second Mathieu's thoughts. The ability to flexibly specify the start/stop positions of the cDNA, UMI, and cell barcode would be helpful. In addition, the ability to split the cell barcode or UMI over multiple reads (R1, R2, R3) would likely provide the flexibility needed to accommodate any scRNA-seq technology (past or future). Thanks!
-
Hi @wagnerde, Thanks for your support and input on this. The ability to arbitrarily split the CB/UMI over multiple reads is certainly possible, but would require some non-trivial changes to the current codebase (with the requisite testing, of course). What I propose is that we introduce a flexible syntax for specifying the CB and UMI geometry and implement it. Initially, it will be a more versatile version of our current "Custom Rule", in that it will allow discontiguous CB or UMI substrings, but may restrict them to be on the same read. The syntax will support a split CB/UMI, but the initial implementation may not (i.e. it would issue an error when parsing such a specification). However, with this ability in tow, we could then expand the underlying code to fully support the more generic barcode and UMI geometry. Currently, we're thinking of something along the lines of
-
Hi @rob-p, Thanks for these exciting details. This step-by-step approach indeed seems wise. I have two remarks about what you propose:
Anyway, we are very interested in the first step you propose and would be glad to give you feedback on its usage. Cheers,
-
Hi @mbahin, These are great points. This level of flexibility is similar to what the regex-based extraction in umi-tools enables. That is super flexible ... but also quite computationally costly. Specifically, having to allow inexact matches (particularly with indels) or approximate positions changes the asymptotic complexity of barcode / UMI extraction. I've been speaking a bit with @k3yavi and others on our team to think about the right trade-off between what complexity should exist within alevin itself and what should be pushed off to another (e.g. pre-processing) program.
I think it's not unreasonable for alevin itself to handle arbitrary but precise barcode geometry (that is, the positions to be spliced together are known, and perhaps span reads 1 and 2). In fact, apart from the spanning-reads part, that is already in our initial implementation (this commit). However, the complexity necessary for approximate matching, or for allowing > 2 reads, starts to get pretty large; that is, the number of places in the code that have to be touched becomes pretty high.
One thought we had on how to manage the complexity is as follows. For arbitrary but precise barcode geometry that is within a single read or spans reads 1 & 2, we can use a syntax like the above. For approximate or more complex barcode geometry, we can provide a "streaming" transformer program. This would be a program whose only purpose is to consume an input stream of 1, 2, or 3 synchronized FASTQ files and convert them into an output stream of 2 synchronized FASTQ files whose barcode geometry can be described in the arbitrary but precise language. Crucially, this transformer program would be able to read from and write to streams (named pipes), to avoid having to actually write intermediate files.
While this would be a little bit of extra work to use compared to having the capability built in, that extra effort would go away once wrapped in a script or a workflow. Moreover, it would be much easier to maintain, improve, and develop, as the program would not be intertwined with the (quite large) salmon/alevin codebase, and it could be changed without worrying about unintended side-effects. As I mentioned above, we already have a first draft of the arbitrary but precise specification language, and we'll definitely keep everyone on this thread up to date as we expand and polish this capability. --Rob edit: updated commit link.
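To make the "streaming" transformer idea concrete, here is a rough sketch of what such a program might do. It is an illustration only and not the tool the alevin team has in mind: the function names and the `barcode_slices` / `bio_index` parameters are hypothetical, positions are 0-based and end-exclusive, and it assumes the first whitespace-delimited token of each read name is identical across the synchronized inputs.

```python
from itertools import islice, zip_longest


def fastq_records(handle):
    """Yield (name, seq, qual) from a FASTQ stream, four lines per record."""
    while True:
        lines = [line.rstrip("\n") for line in islice(handle, 4)]
        if not lines:
            return
        if len(lines) < 4:
            raise ValueError("truncated FASTQ record")
        yield lines[0], lines[1], lines[3]


def read_id(name):
    """Read identifier without the leading '@' or any trailing comment."""
    return name[1:].split()[0]


def transform(inputs, out_r1, out_r2, barcode_slices, bio_index):
    """Convert N synchronized FASTQ streams into 2 synchronized streams.

    barcode_slices: (input_index, start, stop) pieces joined into output read 1.
    bio_index: index of the input that holds the biological read (output read 2).
    """
    streams = [fastq_records(handle) for handle in inputs]
    for records in zip_longest(*streams):
        if any(rec is None for rec in records):
            raise ValueError("input FASTQ streams have different lengths")
        ids = {read_id(rec[0]) for rec in records}
        if len(ids) != 1:
            raise ValueError(f"reads out of sync: {sorted(ids)}")
        seq = "".join(records[i][1][a:b] for i, a, b in barcode_slices)
        qual = "".join(records[i][2][a:b] for i, a, b in barcode_slices)
        name, bio_seq, bio_qual = records[bio_index]
        out_r1.write(f"{name}\n{seq}\n+\n{qual}\n")
        out_r2.write(f"{name}\n{bio_seq}\n+\n{bio_qual}\n")
```

Because the function only touches file-like objects and processes one record at a time, the handles can just as well be opened on named pipes (created with `mkfifo` on Unix-like systems), so no intermediate FASTQ files need to be written. For a three-file library with half of the barcode in the first file, the other half plus a 6bp UMI in the second, and the biological read in the third, the call would be `transform([f1, f2, f3], out1, out2, [(0, 0, 8), (1, 0, 8), (1, 8, 14)], 2)`.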
-
Hi again, Thanks for the detailed answer and for taking users' constraints into consideration. I understand and agree with the idea that you should keep alevin as fast as it is (one of its strengths!). (I think that you swapped "issues" with "commits" in the URL that you pointed to.) Cheers,
-
Indeed; just fixed the link. Thanks for the feedback. We'll keep you updated! --Rob
-
Dear Rob, This is a fascinating discussion and possibility! For the multi-read use cases, I completely understand the appeal of addressing these issues with a modular pre-processing step that would not require drastic changes to Alevin itself. The streaming/piping option would be able to convert arbitrary input FASTQs into the format that Alevin expects, and likely provide limitless flexibility. Thank you for taking the time to discuss this issue!! I look forward to hearing if this develops further! Best,
-
Hi all, Much of this capability is currently available in the 1.4.0 release. If you are using alevin with alevin-fry (the alevin-fry pipeline), then the UMI and barcode can both span reads 1 and 2 (we don't yet have a system for supporting > 2 reads per fragment). If you are using just alevin, then the barcode is currently constrained to reside within read 1, but the new syntax is available and more flexible (e.g. the barcode can be discontiguous). Since I think this issue now makes more sense as a discussion, I'm moving it to GitHub Discussions in this repo.
-
Hello! Thank you for developing these incredible tools. My laboratory uses inDrops (primarily V3) for scRNAseq, and we would like to use Salmon/Alevin to process our datasets. At present, inDrops V3 does not appear to be supported. Our libraries are distributed over three FASTQ files:
First FASTQ: first half of cell barcode (8bp)
Second FASTQ: second half of cell barcode (8bp) + UMI (6bp), 14bp total
Third FASTQ: biological read (~60bp)
We also have a whitelist of our cell barcodes.
Is there a command that would allow the processing of such libraries with alevin?
I have been attempting to use kallisto/bus to process our data, but that pipeline does not appear to handle short (i.e. 6bp) UMIs correctly. However, kallisto/bus has an interesting feature that allows users to input settings for any arbitrary technology (see below). I was wondering if the alevin team is considering adding a feature like this? It would allow alevin to process an unlimited number of different technology types (and future technologies), without having to hard code each one individually.
On a side note, my lab primarily studies zebrafish, so we like the idea of preserving our aquatic theme by using your tool! Thanks again!
From: http://pachterlab.github.io/kallisto/manual.html :
Additionally kallisto bus will accept a string specifying a new technology in the format of bc:umi:seq where each of bc,umi and seq are a triplet of integers separated by a comma, denoting the file index, start and stop of the sequence used. For example to specify the 10xV2 technology we would use 0,0,16:0,16,26:1,0,0. The first part bc is 0,0,16 indicating it is in the 0-th file (also known as the first file in plain english), the barcode starts at the 0-th bp and ends at the 16-th bp in the sequence (i.e. 16bp barcode), the UMI is similarly in the same file, right after the barcode in position 16-26 (a 10bp UMI), finally the sequence is in a separate file, starts at 0 and ends at 0 (in this case stopping at 0 means there is no limit, we use the entire sequence).
This scheme also allows the cell barcode to be split over multiple files. I would use the following command to specify our library design:
-x 0,0,8,1,0,8:1,8,14:2,0,0
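As an illustration of how such a specification string could be interpreted, here is a small sketch. It is not kallisto's parser, only one reading of the manual text quoted above: `Segment`, `parse_field`, `parse_technology`, and `extract` are hypothetical names, and accepting several triplets within one field follows the split-barcode usage in the command above.

```python
from typing import List, NamedTuple, Optional


class Segment(NamedTuple):
    file_index: int       # 0-based index of the FASTQ file
    start: int            # 0-based start position
    stop: Optional[int]   # end-exclusive stop; None means "to the end of the read"


def parse_field(field: str) -> List[Segment]:
    """Parse one of bc, umi, or seq: one or more file,start,stop triplets."""
    values = [int(v) for v in field.split(",")]
    if len(values) % 3 != 0:
        raise ValueError(f"expected triplets of integers, got {field!r}")
    triplets = zip(values[0::3], values[1::3], values[2::3])
    # A stop of 0 means "no limit", per the quoted manual text.
    return [Segment(f, a, b if b != 0 else None) for f, a, b in triplets]


def parse_technology(spec: str) -> dict:
    """Split a bc:umi:seq string into its three lists of segments."""
    parts = spec.split(":")
    if len(parts) != 3:
        raise ValueError("expected three colon-separated fields: bc:umi:seq")
    bc, umi, seq = (parse_field(part) for part in parts)
    return {"bc": bc, "umi": umi, "seq": seq}


def extract(segments: List[Segment], reads: List[str]) -> str:
    """Join the pieces described by `segments`; reads[i] comes from file i."""
    return "".join(reads[s.file_index][s.start:s.stop] for s in segments)
```

With the string above, `parse_technology("0,0,8,1,0,8:1,8,14:2,0,0")` describes a barcode made of two 8bp pieces from files 0 and 1, a UMI at positions 8-14 of file 1, and the full read from file 2 as the biological sequence.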