Provide more details about custom barcodes with unique Illumina barcodes for the start and end of the sequence #559
Hi @gearhq, I created a new issue for your question. Sorry I missed this earlier! The barcode sequences need to go into a separate FASTA file, and your arrangement file then references the barcode names from that file.
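For illustration, a custom barcode FASTA might look like the sketch below (record names and sequences here are hypothetical, not from this thread; the names must match the barcode patterns referenced by the arrangement file):

```
>i5_01
ACGTACGTAC
>i5_02
TTGCATTGCA
>i7_01
GGATCGGATC
>i7_02
CCTAGCCTAG
```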
Hi @tijyojwad, thank you! I appreciate your time. I tried demultiplexing my FASTQ file with these barcodes (I am using 10 pairs, but I put 3 here for brevity):
I am using this arrangement file:
I used these parameters, and then I got this error:
For this test I used the pre-compiled package dorado-0.5.1-linux-x64. Could you help me understand what is wrong?
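For context, a demux invocation with custom barcodes passes the arrangement and sequence files as separate options. A sketch with placeholder paths and kit name (the exact parameters used above were not preserved in this thread):

```sh
dorado demux \
    --kit-name MY-CUSTOM-KIT \
    --barcode-arrangement arrangement.toml \
    --barcode-sequences barcodes.fasta \
    --output-dir demux_output \
    reads.fastq
```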
Hi @gearhq - you found a bug in our documentation! It needs to be:
Hi @tijyojwad, that's great! I changed the toml file and now I get a different error:
So I suppose I need to provide the indices with numbers (i5_01, i5_02, i7_01, i7_02, and so on) rather than the sample names (as I did with i5_LIGH103656, i7_LIGH103656, and so on) - is that correct? And to demultiplex the FASTQ files using the sample names as file names, do I need to rename them manually, or can I provide the "--sample-sheet" parameter? I ask because I did not find any sample sheet template in the barcoding section of the documentation.
That's right - you need to name them with the numbered indices. Here's some documentation on sample sheet aliasing: https://github.com/nanoporetech/dorado/blob/master/documentation/SampleSheets.md
Alright, I tried adding this sample sheet with minimal information:
I updated the barcode FASTA headers:
I also updated the dorado parameters with the tsv file:
The software worked! Unfortunately all the reads were unclassified, so I still need to find out what is going on. Is it a serious problem that I don't have the mask1_front/rear and mask2_front/rear sequences?
Yeah, it won't really work when you provide no masks. The current algorithm expects at least one of the masks to be provided - we need to add a check for that in the config parsing so this mode isn't allowed. That's because the heuristic we use to find barcodes depends on having a flank sequence to identify the approximate location of the barcode. Alternatively, we could look into adding a flank-free search, but that would potentially have lower accuracy. Barcodes generally do have flanks, though - would you be able to find that info?
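As a purely illustrative sketch of where the masks live (every name, sequence, and index here is a placeholder, not a value from this thread), an arrangement file with at least one mask per barcode might be shaped like:

```toml
[arrangement]
name = "custom_arrangement"
kit = "MY-CUSTOM-KIT"
# Flank (mask) sequences that anchor the barcode search - placeholders only
mask1_front = "AGGTCAGGTTAA"
mask1_rear = "CAGCACCT"
mask2_front = "GATTCGAGG"
mask2_rear = "TTAACCTGACCT"
# Record names in the barcode FASTA must follow these patterns
barcode1_pattern = "i5_%02i"
barcode2_pattern = "i7_%02i"
first_index = 1
last_index = 10
```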
Yes - after some alignments and grep searches I found that most of the flanking regions are similar to the TruSeq universal adapter sequence "AATGATACGGCGACCACCGAGATCTACAC" and the TruSeq index/reverse adapter sequence "TCTGAACTCCAGTCAC". I tried various combinations of these sequences in mask1_front/rear and mask2_front/rear, but the result was still all-unclassified sequences. I also tried shorter versions of these sequences (e.g. just "ACAC" or "CAC" at the 5' end) and again got nothing classified. I am not sure what could be wrong now.

I think the flank-free implementation would be great; it would also reduce the number of input files. Having only the table would be easier to configure, and the table could have less mandatory information (perhaps just the sample name and the 5' and 3' barcodes), with everything else optional.

In my custom demultiplex solution, I implemented a flank-free search using Python and Biopython. I used the Biopython alignment algorithm with a configurable number of mismatches, then looked at the Hamming distances between my barcodes and configured the allowed mismatches accordingly. It worked very well, despite being very slow and not yet having options for trimming the barcode, adapter, and primer regions. The reason I wrote this implementation is that, in the tests I performed on my data, other solutions did not work correctly: some did not work on high-quality base calls (failing to classify any barcode, as is happening now with demux), and others found barcodes and flanking regions that did not match my sequences.
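The flank-free approach described above can be sketched in a few lines. This is a pure-stdlib stand-in for the Biopython-based script (all names, sequences, and thresholds are illustrative, not the user's actual code): slide each barcode over a window at the read start, score by Hamming distance, and classify only when exactly one barcode matches within the mismatch budget.

```python
# Sketch of a flank-free barcode search by sliding Hamming distance.
# Illustrative only - not the implementation discussed in this thread.

def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def best_hit(read: str, barcode: str, window: int) -> int:
    """Lowest Hamming distance of `barcode` over the first `window` offsets."""
    k = len(barcode)
    end = min(window, len(read) - k + 1)
    if end <= 0:          # read shorter than the barcode: no usable hit
        return k
    return min(hamming(read[i:i + k], barcode) for i in range(end))

def classify(read: str, barcodes: dict, max_mismatches: int = 1,
             window: int = 50) -> str:
    """Return the unique barcode name within `max_mismatches`, else 'unclassified'."""
    hits = [name for name, seq in barcodes.items()
            if best_hit(read, seq, window) <= max_mismatches]
    return hits[0] if len(hits) == 1 else "unclassified"

barcodes = {"i5_01": "ACGTACGT", "i5_02": "TTGGCCAA"}
read = "GGG" + "ACGTACGT" + "T" * 40   # barcode i5_01 starts 3 bases in
print(classify(read, barcodes))        # -> i5_01
print(classify("T" * 60, barcodes))    # -> unclassified
```

Pre-computing the pairwise Hamming distances between all barcodes, as the comment above describes, tells you the largest safe `max_mismatches`: it must stay below half the minimum inter-barcode distance to keep classifications unambiguous.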
Hi @gearhq - that's unfortunate to hear. Would it be possible for you to share a few reads that you're confident are correctly barcoded by your custom script? I can take a look to see what's not working with the barcode sequences and flanks you've provided. I've made a note to add a flank-free implementation to dorado, and I will let you know when we prioritize it.
Resolve nanoporetech#559 Bug in documentation
Hello @tijyojwad, I was only able to get back to the demultiplexing problem now. Unfortunately, my recent attempts did not give better results than those I described previously. If you could give me some guidance on how to use the dorado demultiplexer with this data, I would be very grateful. Here is a Dropbox link with our pilot data (high-quality base calls) for demultiplexing, along with a file with the barcodes we used: If you have any access problems, just let me know.
I was able to get classifications with these settings:
arrangement.toml
sequences.fasta
Hi @tijyojwad, with the configuration you provided, I was able to demultiplex by barcode. Now I have some questions. For dorado to identify the barcodes, am I required to use this naming pattern for the barcodes in my FASTA file? And how am I supposed to map barcodes to sample names in the sample sheet? Could you please provide an example? Thank you!
Hi, if we have paired indices, will it only consider matching combinations of barcodes (the first barcode_1 with the first barcode_2, and so on), or all combinations?
Hi @tijyojwad, I am using this sample sheet: But in my results folder, the file names are still "test_github_kit_barcode01.fastq", "test_github_kit_barcode02.fastq", (...), instead of "test_github_kit_sample3656.fastq", "test_github_kit_sample3665.fastq", (...). Could you help me figure out what I am doing wrong?
Hi @gearhq - this is a subtlety that I don't think we've mentioned in our docs (or have proper checks for), so I made a mistake. If you update your toml to:
and your sample sheet to:
it'll work.
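For illustration, a minimal sample sheet mapping dorado's barcodeNN classifications to sample aliases might look like the sketch below (all IDs are hypothetical; the kit column must match the kit name used for demultiplexing, and the alias column supplies the sample name used in output file names):

```csv
flow_cell_id,experiment_id,kit,barcode,alias
FAB12345,exp001,MY-CUSTOM-KIT,barcode01,sample3656
FAB12345,exp001,MY-CUSTOM-KIT,barcode02,sample3665
```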
Hi @andreott - it will check pairwise (i.e. the first barcode_1 with the first barcode_2).
Thanks! So if we have non-unique adapters (the same p5 index combined with multiple p7 indices), it should be no problem to have repeated barcode sequences with different IDs, right? However, it seems that it does not work in that case: if I provide a second occurrence of the same p5 barcode with a different p7, the first combination is no longer detected/reported.
The way the algorithm works currently, repeated sequences will cause the demux algorithm to think that two different barcodes have a confident hit for the read (since we check for at least barcode_1 or barcode_2 to be found for each barcode pair). That then causes the read to be unclassified. This particular use case has come up a couple of times now, so I think I will add support for it.
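The ambiguity described above can be shown with a toy model (illustrative only, not dorado's implementation) of the "at least barcode_1 or barcode_2 found per pair" rule: a repeated i5 sequence makes two pairs claim a confident hit, so the read ends up unclassified.

```python
# Toy model of the pairwise check described above - not dorado's code.

def classify_pairs(found, pairs):
    """found: set of barcode names detected in a read.
    pairs: list of (arrangement_name, barcode_1, barcode_2) tuples.
    A pair 'hits' if either of its barcodes was found; the read is
    classified only when exactly one pair hits."""
    hits = [name for name, b1, b2 in pairs if b1 in found or b2 in found]
    return hits[0] if len(hits) == 1 else "unclassified"

pairs = [
    ("pair01", "i5_A", "i7_01"),
    ("pair02", "i5_A", "i7_02"),   # same i5 reused with a different i7
]

# Only the shared i5 side was detected: both pairs claim a hit.
print(classify_pairs({"i5_A"}, pairs))            # -> unclassified
# Detecting the distinguishing i7 as well doesn't help under this rule,
# because the shared i5 still makes both pairs match.
print(classify_pairs({"i5_A", "i7_01"}, pairs))   # -> unclassified
# Only a detection unique to one pair yields a classification.
print(classify_pairs({"i7_02"}, pairs))           # -> pair02
```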
@tijyojwad Could you provide more details about custom barcodes? I have a custom demultiplex solution that is very inefficient. I'm using a plate with unique Illumina barcodes for the start and end of the sequence (i5 and i7 from the IDT UDI plate; they are unique, so detecting either one would put the read into the correct demultiplexed file). But I don't understand whether the barcode sequences should go into the toml file or a separate file. I also have multiple i5 and multiple i7 barcodes - should I put them in the same FASTA file? In what order?
Originally posted by @gearhq in #495 (comment)