This repository has been archived by the owner on Feb 16, 2019. It is now read-only.

Need option to add filtering (SEG or DUST) to automatically-generated BLAST databases #21

Open
mattb112885 opened this issue Nov 7, 2012 · 2 comments

Comments

@mattb112885
Owner

Low-complexity region filtering will be useful particularly for TBLASTN... repeats in the genome that can give a false result. If I understand the NCBI docs correctly SEG is only added to the query sequences automatically, not the target sequences...

@mattb112885
Owner Author

James - have we decided whether or not we want to do this?

@JamesRH
Contributor

JamesRH commented Apr 24, 2013

Our rule should be that (except for RPSblast, which is a special case),
whenever there is any masking of query or database, we always pass the
softmasking option (for query or database).

Since we use the query softmasking option for tblastn (where query
masking is on by default) in the wrapper, and since you got rid of the
scripts that did not do this, we should be fine. We just need to make
sure that it is mentioned in the tblastn wrapper -h text and that we are
not masking the databases we build in the main scripts. If you make
those changes to documentation, then this can be closed.

I talked to Rachel about this some time ago, but I tried to email her
about it and what I've learned about the new blast+, and she didn't
remember our earlier conversation (she also didn't know about the new
blast+, which is like professors being out of the lab long enough they
don't know how to use the new pipettor, I suppose). I will double-check
again, but I think we have the right approach.

James H

On 04/23/2013 09:22 PM, mattb112885 wrote:

James - have we decided whether or not we want to do this?



On 03/21/2013 05:10 PM, James Henriksen wrote:

Should I submit a bug?
James H

On 03/21/2013 04:37 PM, Matthew Benedict wrote:

James:

Responses below... thanks for doing all the legwork on this. We still
do need to decide whether or not to mask the query DB though... this
is probably an argument someone has had with Rachel already and that
we don't care to repeat, so next time you talk to Rachel could you
ask her about it?

Thanks and Best

Matthew Benedict
Chemical Engineering Graduate Student
University of Illinois
Email: [email protected]

On Thu, Mar 21, 2013 at 3:03 PM, James Henriksen <[email protected]> wrote:

Matt,

I bit the bullet and read the NCBI documentation for the new blast+.
I had to skim the entire online book, as it is horribly organized.
Here are my take-home messages for iTEP.  The main one is that when we
use masking, or when it is the default (as with the query masking
default in tblastn), we should use soft masking (for masking of
queries, this is the -soft_masking option; for database masking it is
-db_soft_mask).  In searches without database or query masking, we
don't need to use that setting.
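
The rule above can be sketched as a small helper that adds the soft-masking flags whenever masking is in play. This is only an illustration: `blast_masking_args` and the file names are hypothetical and not part of ITEP. The flags are the two named above, and the database filter ID is assumed to be whatever ID the database was built with.

```python
def blast_masking_args(query_masked, db_mask_id=None):
    """Return extra BLAST+ arguments following the 'always soft-mask' rule.

    query_masked: True if the query is masked, explicitly or by default.
    db_mask_id:   filtering algorithm ID stored in the database, or None
                  if the database carries no masking information.
    """
    args = []
    if query_masked:
        # Query masking is on, so request soft masking of the query.
        args += ["-soft_masking", "true"]
    if db_mask_id is not None:
        # Database masking is on, so apply it as a soft mask.
        args += ["-db_soft_mask", str(db_mask_id)]
    return args


# tblastn masks the query by default, so soft masking is requested explicitly.
# The query and database names are placeholders.
tblastn_cmd = ["tblastn", "-query", "proteins.faa", "-db", "genome"]
tblastn_cmd += blast_masking_args(query_masked=True)
print(" ".join(tblastn_cmd))
```

A search with neither query nor database masking gets no extra arguments, which matches the last sentence of the paragraph above.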

So by search:

tblastn and tblastx
Default is to mask the query without softmasking!  We must use
-soft_masking when we run these or alignments will break into pieces.
The default is NOT to use database masking, but if we do use it, we
should also use -db_soft_mask (see below).

ITEP's wrapper currently uses the -soft_masking flag so nothing to
worry about here unless we want to use -db_soft_mask as well.

blastn and blastx
Default IS to use masking of query with dust and to soft-mask the
query, which is what we probably want.  To be explicit, we COULD use
-soft_masking in our code, but I don't think this is necessary.  If
you do want to, blast should then use the default filter, which right
now is -dust 'yes' for the '20 64 1' setting, I think.  The default is
NOT to use database masking, but if we do use it, we should also use
-db_soft_mask (see below).

blastp
Default is NOT to use softmasking of query, but also NOT to mask at
all.  The default is NOT to use database masking, but if we do use it,
we should also use -db_soft_mask (see below).  We could use
-soft_masking and -seg 'yes' to get faster searches, but I don't think
it is necessary.  Low-information queries will just have a ton of
hits, and that is OK.  It would probably speed up our searches if we
were masking queries, but if that is the goal we should probably also
softmask the databases.  I don't think we should do either.

It's OK as long as the entire query isn't low-information... small
low-information hits may be filtered out because they only hit over
small parts of the protein... and excluding things in the middle of
proteins may cause the HSPs to break up and fubar our scoring metric.
(I'm not sure if this is really true, but I'd be concerned about it.
Could you answer this with your knowledge of how the masking works?)

rpsblast
Default is to mask the query without softmasking, but given what it
does, I think this is OK.  The default is NOT to use database masking,
but if we do use it, we should also use -db_soft_mask (see below).

*I could add softmasking the query to this as well if you think it
would make the results better. *

For all of the above:
Default is NOT to mask the query DB.  If we wanted to we could use
-db_soft_mask  ##, where ## is the filter ID applied to the blast db,
which we would have to generate for each database.  If we wanted to
speed up our searches (on the order of 10 seconds vs. 30 min for
megablast), we COULD also mask the database and/or the query, but we
would miss low information regions, even if we did use softmasking.

If there's low information we can't necessarily be confident in its
correctness anyway right? Doing the masking on the DB isn't hard now
that we have the right programs installed - just need to talk to
Rachel to find out what the lab consensus is (and maybe change around
the code a bit to make this an option to the user)
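
A rough sketch of what "doing the masking on the DB" could look like, in case it helps the discussion. It only builds the command lines and prints them; nothing is executed. The helper name `masked_db_commands` and the file names are made up for illustration. The sketch assumes segmasker (protein) or dustmasker (nucleotide) from BLAST+ is used to produce the mask file, and that the sequence IDs survive -parse_seqids, which may not hold for IDs containing "|".

```python
import shlex


def masked_db_commands(fasta, db_name, dbtype="prot"):
    """Build (but do not run) the commands for a masked BLAST database.

    Returns three argument lists: compute the mask, build the database
    with the mask attached, and print database info, which is where the
    filtering algorithm ID needed for -db_soft_mask is reported.
    """
    if dbtype not in ("prot", "nucl"):
        raise ValueError("dbtype must be 'prot' or 'nucl'")
    # SEG-style masking for proteins, DUST-style masking for nucleotides.
    masker = "segmasker" if dbtype == "prot" else "dustmasker"
    mask_file = db_name + "_mask.asnb"
    mask_cmd = [masker, "-in", fasta, "-infmt", "fasta", "-parse_seqids",
                "-outfmt", "maskinfo_asn1_bin", "-out", mask_file]
    build_cmd = ["makeblastdb", "-in", fasta, "-dbtype", dbtype,
                 "-parse_seqids", "-mask_data", mask_file, "-out", db_name]
    info_cmd = ["blastdbcmd", "-db", db_name, "-info"]
    return [mask_cmd, build_cmd, info_cmd]


# Example for a genome used as a TBLASTN target (placeholder file names).
for cmd in masked_db_commands("genome.fasta", "genome", dbtype="nucl"):
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Each list could be passed to `subprocess.run(cmd, check=True)` once the lab agrees on whether to mask; the ID shown by the info command would then be the value given to -db_soft_mask at search time.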

   James H.


PS Other information of use (you may know all this, but it was news to
me since I tended to still use the old blast).

You can now search directly against a fasta file, without making a
database (it is slower, however):
like: blastn -subject bigfile.fasta -query sequences.fasta

I've done this by mistake, calling -subject on a database, and
wondered why it segfaulted. But yes, this would be a useful feature,
particularly for deciding what to do with pulled-in TBLASTN sequences.
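
One way to avoid passing a formatted database to -subject by accident is to pick the flag from what is on disk. A minimal sketch, assuming a formatted database can be recognized by its index files; `target_args` and the suffix list are illustrative guesses, not an exhaustive check and not ITEP code.

```python
import os

# Index-file suffixes assumed to mark a formatted BLAST database
# (.nin/.pin for single volumes, .nal/.pal for alias or multi-volume ones).
DB_SUFFIXES = (".nin", ".pin", ".nal", ".pal")


def target_args(path):
    """Return ['-db', path] for a formatted database, else ['-subject', path]."""
    if any(os.path.exists(path + suffix) for suffix in DB_SUFFIXES):
        return ["-db", path]
    # No index files found: treat the path as a plain FASTA file.
    return ["-subject", path]


# Placeholder file names; the command is only assembled, not run.
cmd = ["blastn", "-query", "sequences.fasta"] + target_args("bigfile.fasta")
print(" ".join(cmd))
```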

You can create virtual blast databases by aliasing together multiple
blast databases, or creating a subset based on gi_list (if it was made
with the -parse_seqids option)
http://www.ncbi.nlm.nih.gov/books/NBK1763/#CmdLineAppsManual.I57_Use_blastdb_aliast
I think that if the database is generated with -taxid_map and
-parse_seqids then you can subset on those as well.  However, trying
to implement this may mean we would have to make NCBI-style deflines
in our original files, which may not like our IDs due to the "|"
character.

*Yes that would be a pain.
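
For reference, aliasing several existing databases into one virtual database could be assembled roughly like this. The helper name `alias_command` and the database names are hypothetical; the sketch assumes the blastdb_aliastool program from BLAST+ and only builds the command without running it.

```python
def alias_command(db_names, out_name, dbtype="prot", title=None):
    """Build a blastdb_aliastool command that aliases several databases.

    db_names: names of existing BLAST databases, all of the same type.
    out_name: name of the virtual (alias) database to create.
    """
    if not db_names:
        raise ValueError("at least one database name is required")
    cmd = ["blastdb_aliastool",
           "-dblist", " ".join(db_names),  # one space-separated argument
           "-dbtype", dbtype,
           "-out", out_name]
    if title is not None:
        cmd += ["-title", title]
    return cmd


# Placeholder organism database names.
print(alias_command(["organism_a", "organism_b"], "all_organisms"))
```

This route avoids -parse_seqids entirely, so it sidesteps the defline problem for whole-database combinations; subsetting within a database would still need parsed IDs.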

  • There is an option that actually finds the best hit, instead of
    trying to find the best hit in a list of outputs (the
    first-is-often-not-the-best-hit problem):
    http://www.ncbi.nlm.nih.gov/books/NBK1763/#CmdLineAppsManual.I4212_BestHits_filteri
    (using -best_hit_score_edge 0.05 -best_hit_overhang 0.25)
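
A small sketch of how the best-hit options above might be added to a wrapper's argument list. `best_hit_args` is a hypothetical helper, and the range check reflects my reading of the BLAST+ help text (both values strictly between 0 and 0.5), so treat it as an assumption.

```python
def best_hit_args(score_edge=0.05, overhang=0.25):
    """Return the best-hit filtering arguments for a BLAST+ search.

    Defaults are the values quoted above. Both values are assumed to be
    valid only when strictly between 0 and 0.5.
    """
    for name, value in (("score_edge", score_edge), ("overhang", overhang)):
        if not 0 < value < 0.5:
            raise ValueError(name + " must be between 0 and 0.5 (exclusive)")
    return ["-best_hit_score_edge", str(score_edge),
            "-best_hit_overhang", str(overhang)]


# Placeholder query and database names; the command is not executed.
cmd = ["blastp", "-query", "query.faa", "-db", "proteins"] + best_hit_args()
print(" ".join(cmd))
```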

    In addition to searching using a Conserved Domain Database query, you
    can use delta-blast to search using a PSSM or even using an alignment
    (if you give it an alignment that includes your query and the name of
    the protein you want to use as a query, it builds the PSSM on the
    fly).  This may be very useful for finding distant homologs to an
    ortholog/cluster (prevents having to search using every sequence in
    the cluster), but of course only works for proteins.

Isn't this how PSI-blast works as well? I messed with PSI-blast,
quickly got confused / crappy results and didn't go back to it
although if we want to be serious about looking at larger genetic
distances we will need to do something like this...
