-
Notifications
You must be signed in to change notification settings - Fork 66
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Confusion/Documentation with IndexedFasta.getSequence #316
Comments
Hi @IanSudbery , sorry about this. The reason for the default conversion is our comparative genomics past and follows exonerate output and UCSC practice for some formats such as the chain format, which uses reverse strand coordinates for the reverse strand. Changing the default should be fine, but indeed will need some thought. I can put it on my todo list. |
No worries. It hasn’t been a major bug this time, but I did wonder how many
other people have made this mistake without realizing it. I was only
investigating it to work out why 5 of my 10 primer pairs didn’t work. One
supposes that this sort of thing could get buried deep inside a pipeline
and no one might find the error until after something had been published
(if ever!).
|
Okay, there are still some weird things going on here. GTF.py assumes that you have not set the converter and does some gymnastics with interval co-ordinates to get around that (It does these wrong BTW, the co-ordinates are off-by-one, so if you do GTF.toSequence on a transcript, the first base of the transcript is missing). This means you can't open a fasta and use it both to convert your own co-ordinates to sequence and also use the same file in GTF.toSequence. |
Hi @IanSudbery , sorry about this. The reason for the default conversion is our comparative genomics past and follows exonerate output and UCSC practice for some formats such as the chain format, which uses reverse strand coordinates for the reverse strand. Changing the default should be fine, but indeed will need some thought. I can put it on my todo list. |
1 similar comment
Hi @IanSudbery , sorry about this. The reason for the default conversion is our comparative genomics past and follows exonerate output and UCSC practice for some formats such as the chain format, which uses reverse strand coordinates for the reverse strand. Changing the default should be fine, but indeed will need some thought. I can put it on my todo list. |
I will add some tests. |
I have naively assuming that IndexedFasta.getSequence(contig, "-", start, end) would return the reverse complement of the sequence between contig:start..end. But this is only the case if the obscure
mConverter
attribute of the IndexedFasta object is set to a coordinate system that includes "both" in its name (e.g.fasta.setConverter(IndexedFasta.getConverter("zero-both-open"))
).If the converter is not set, fetching fasta on the -ve strand actually returns the seqeuence
contig:contig_length-end..contig_length-start
.This is so unobvious that the author of
gff2fasta
jumps though all sorts of hoops to get the correct sequence out of the fasta rather than use the coordinates converter.This is a whole world of bugs just waiting to happen (I just ordered a whole load of primers that amplify absolutely nothing because they target completely the wrong sequences).
I recommend that the "zero-both-open" converter should be seq as default on the creation of a IndexedFasta object. But this will require making sure that all uses of IndexedFasta in the collection are compatible with this.
The text was updated successfully, but these errors were encountered: