title | layout |
---|---|
FAQ |
howto |
It is the list of Frequently Asked Questions about BioPerl.
Please contact the bioperl-l mailing list.
We have a universal version number for a release. This is set in the module Bio::Root::Version which is universally applied to every module which inherits from Bio::Root::RootI. To check the version, just use the following one liner:
perl -MBio::Root::Version -e 'print $Bio::Root::Version::VERSION'
Now when we use version tuples that are not just decimal numbers, Perl converts these silently to Unicode representations. What that means is, to actually print the version number you have to use formatted printing like this
perl -MBio::Root::Version -e
'printf "%vd\n", $Bio::Root::Version::VERSION'
Printing the version number can be done on any module in BioPerl (and should be consistent) so for example, printing out the version number of Bio::SeqIO, which is different from the overall Bioperl version number.
perl -MBio::SeqIO -e 'printf "%vd ", $Bio::SeqIO::VERSION'
BioPerl is a toolkit of perl modules useful in building bioinformatics solutions in Perl. It is built in an object-oriented manner so that many modules depend on each other to achieve a task. The collection of modules in the bioperl-live repository consist of the core of the functionality of bioperl. Additionally auxiliary modules for creating persistent storage in RDMBS (bioperl-db) and running and parsing the results from hundreds of bioinformatics applications (bioperl-run) are all available in our Git repository.
Some early articles about BioPerl:
Please see the INSTALL file.
Developer releases are odd numbered releases (e.g. 1.3, 1.5) not intended to be completely stable (although all tests should pass). Stable releases are even numbered (1.0, 1.2, 1.6) and intended to provide a stable API so that modules will continue to respect the API throughout a stable release series. We cannot guarantee that APIs are stable between releases (i.e. 1.6 may not be completely compatible with scripts written for 1.4), but we endeavor to keep the API stable so that upgrading is easy.
The 0.7 series (0.7.0, 0.7.2) were all released in 2001 and were stable releases on 0.7 branch. This means they had a set of functionality that is maintained throughout (no experimental modules) and were guaranteed to have all tests and subsequent bug fix releases with the 0.7 designation would not have any API changes.
The 0.9 series was our first attempt at releasing so called developer releases. These are snapshots of the actively developed code that at a minimum pass all our tests.
% perldoc MODULE
Careful - spelling and case count! If you are not sure about case you can use the -i
switch with perldoc. You may also find useful documentation in the form of a HOWTOs. There are also many scripts in the examples/
and scripts/
directories that may be useful - see the BioPerl scripts page for brief descriptions.
Additionally we have written many tests, these are a great source of example code, look in the test dir, called t/
.
Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA, Dagdigian C, Fuellen G, Gilbert JG, Korf I, Lapp H, Lehväslaiho H, Matsalla C, Mungall CJ, Osborne BI, Pocock MR, Schattner P, Senger M, Stein LD, Stupka E, Wilkinson MD, and Birney E. The Bioperl toolkit: Perl modules for the life sciences. Genome Res. 2002 Oct;12(10):1611-8.
- http://dx.doi.org/10.1101/gr.361602
- http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=12368254
BioPerl is licensed under the same terms as Perl itself which is dually-licensed under the terms of the Perl Artistic License or http://www.opensource.org/licenses/artistic-license.html or the GNU GPL.
BioPerl is a pretty diverse collection of modules which has grown from the direct needs of the developers participating in the project. So if you don't have a need for a specific module in the toolkit it becomes hard to just describe ways it needs to be expanded or adapted. One area, however is the development of stand-alone scripts which use BioPerl components for common tasks. Some starting points for scripts: find out what people in your institution do routinely that a shortcut can be developed for. Identify modules in BioPerl that need easy interfaces and write that wrapper - you'll learn how to use the module inside and out.
We always need people to help fix bugs - check the GitHub bug tracking system. Submitting bugs in the documentation and code is very helpful as has been said about open source software "Given enough eyeballs, all bugs are shallow".
There's plenty of developer documentation as well, see the Using Git HOWTO, the Advanced BioPerl HOWTO, and Best Practices HOWTO.
Post your idea on the bioperl-l mailing list. If you have written it already, or if you have been thinking about the API already, post the API, ideally with usage documentation, e.g., the POD that would normally go with each method, and some usage examples, e.g., what would otherwise go into the synopsis section of the module's POD.
Once you completed gathering feedback and incorporating into your module as appropriate, you either post it on GitHub, or, if you have a developer account already, you should just commit it once you have convinced yourself that the (yours and the pre-existing) tests pass.
Post your idea to the bioperl-l mailing list. If it is a really new idea consider taking us through your thought process. We'll help you tease out the necessary information such as what methods you'll want and how it can interact with other BioPerl modules. If it is a port of something you've already worked on, give us a summary of the current methods. Make sure there is an interface to the module, not just an implementation and make sure there will be a set of tests that will be in the t/
directory to insure that your module is tested.
If you have a suggested patch and/or code enhancement you can submit it to the GitHub tracking system. See the Using Git HOWTO for more information.
This a problem with Perl, not only with Bioperl. To list all the methods, you have to walk the inheritance tree and standard perl is not able to do it. As usual, help can be found in the CPAN. Install Class::Inspector and put the following script into your path and run it.
#!/usr/bin/perl -w
use Class::Inspector;
$class = shift || die "Usage: methods perl_class_name";
eval "require $class";
print join (" ", sort @{Class::Inspector->methods($class,'full','public')});
There is also a project called Deobfuscator developed during the 2005 Bioinformatics course at Cold Spring Harbor Labs. The Deobfuscator displays available methods for an object type and provide links to the return types of the methods. An older version can also be found here.
There is no simple answer to this question. Simply put, this is a toolkit which has grown organically. The goals and user audience has evolved. Some decisions have been made and we have been forced to live by them rather than destroy backward compatibility. In addition there are different philosophies of software development. The major developers on the project have tried to impose a set of standards on the code so that the project can be coordinated without every commit being cleared by a few key individuals (see Eric S. Raymond's essay "The Cathedral and the Bazaar" for different styles of running an open source project - we are clearly on the Bazaar end).
Advanced BioPerl talks more about specific design goals. The clear consensus of the project developers is that BioPerl should be consistent. This may cause us to pay the price of some copy-and-paste of code, with the Get/Set accessor methods being a sore spot for some, and the lack of using AUTOLOAD. By being consistent we hope that someone can grok the gist of a module from the basic documentation, see example code, and get a set of methods from the API documentation. We aim to make the core object design easy to understand. This has not been realized by any stretch of the imagination as the toolkit has well over 1000 modules in bioperl-live and bioperl-run alone. That said we do want to improve things. We want to experiment with newer modules which make Perl more object-oriented. We have high hopes for some of the promises of Perl6. To try and realize this goal we are encouraging developers to play with new object models in a bioperl-experimental project.
Some useful discussion on the mailing list can be found at this node http://bioperl.org/pipermail/bioperl-l/2003-December/014406.html. We encourage you to participate in the discussion and to join in the development process either on existing BioPerl code or the bioperl-experimental code if you have a particular interest in making the toolkit more object-oriented.
Use the Bio::SeqIO
system, this will create objects for you from file input. For more information see the SeqIO HOWTO, or type perldoc Bio::SeqIO
.
If you are running an old BioPerl version, NCBI changed the web CGI script that provided this access. You must use a modern version, 1.4 or greater.
How can I get NT_ or NM_ or NP_ accessions from NCBI (Reference sequences)?
To retrieve GenBank reference sequences, or RefSeqs, use Bio::DB::RefSeq. This is still an area of active development because the data providers have not provided the best interface for us to query. EBI has provided a mirror with their dbfetch
system which is accessible through the object however, there are cases where NT_
accession numbers will not be retrievable.
Use this code to parse sequence records from a string:
use IO::String;
use Bio::SeqIO;
my $stringfh = new IO::String($string);
my $seqio = new Bio::SeqIO(-fh => $stringfh,
-format => 'fasta');
while( my $seq = $seqio->next_seq ) {
# process each seq
}
And here is how to write to a string:
use IO::String;
use Bio::SeqIO;
my $s;
my $io = IO::String->new($s);
my $seqOut = new Bio::SeqIO(-format => 'swiss', -fh => $io);
$seqOut->write_seq($seq1);
print $s;
# $s contains the record in swissprot format and is stored in the string
I'm using in order to retrieve sequences from my indexed fasta file but I keep seeing MSG: Did not provide a valid Bio::PrimarySeqI object
when I call fetch
followed by write_seq()
on a handle. Why?
It's likely that fetch
didn't retrieve a object. There are few possible explanations but the most common cause is that the id you're passing to fetch
is not the key to that sequence in the index. For example, if the FASTA header is >gi|12366
and your id is 12366
then fetch
won't find the sequence, it expects to see gi|12366
. You need to use the get_id
method to specify the key used in indexing, like this:
$inx = Bio::Index::Fasta->new(-filename =>$indexname);
$inx->id_parser(&get_id);
$inx->make_index($fastaname);
sub get_id {
my $header = shift;
$header =~ /^>gi|(d+)/;
$1;
}
The same issue arises when you use Bio::DB::Fasta, and in that case the code might look like this:
$inx = Bio::DB::Fasta->new($fastaname, -makeid => &get_id);
If you parse a FASTA sequence format file with the sequences won't have the accession number. What to do?
All the data is in the $seq->display_id
it just needs to be parsed out. Here is some code to set the accession number.
my ($gi,$acc,$locus);
(undef,$gi,undef,$acc,$locus) = split(/|/, $seq->display_id);
$seq->accession_number($acc);
Why don't we just go ahead and do this? For one, we don't make any assumptions about the format of the ID part of the sequence. Perhaps the parser code could try and detect if it is a GenBank formatted ID and go ahead and set the accession number field.
Also see http://bioperl.org/pipermail/bioperl-l/2005-August/019579.html
This question has a few different answers, see the Getting Genomic Sequences HOWTO.
You want to use the method preferred_id_type()
. Here's some example code:
use Bio::SeqIO;
my $seqin = Bio::SeqIO->new(-file => $file, -format => 'genbank');
my $seqout = Bio::SeqIO->new(-fh => $out, -format => 'fasta');
# From Bio::SeqIO::fasta
$seqout->preferred_id_type('display');
my $count = 1;
while (my $seq = $seqin->next_seq) {
# override the regular display_id with your own
$seq->display_id('foo'.$count);
$seqout->write_seq($seq);
$count++;
}
You can pass one of the following values to preferred_id_type
: "accession", "accession.version", "display", "primary". The description line is automatically appended to the preferred id type but this can also be set, like so:
$seq->desc($some_string);
Read the SearchIO HOWTO for more information.
Bio::Tools::Blast* is no longer supported, as of BioPerl version 1.1. It has just been replaced by a more generic approach to reports. This generic approach allows us to just write pluggable modules for FASTA and BLAST parsing while using the same framework. This is completely analogous to the system of parsing sequence files. However, the objects produced are of the rather than variety. See the SearchIO HOWTO.
It is as simple as parsing text BLAST results - you simply need to specify the format as fasta
or blastxml
and the parser will load the appropriate module for you. You can use the exact logic and code for all of these formats as we have generalized the modules for sequence database searching. The page describing provides a table showing how the formats match up to particular modules. Note that, for parsing BLAST XML output, you will need and that is recommended to speed up parsing.
Look at to see how to use the water
and needle
alignment programs that are part of the EMBOSS suite. is part of the bioperl-run package.
Or you can use the pSW module for DNA alignments or the dpAlign module for protein alignments. These are part of the bioperl-ext package.
You can also use prss34 (part of FASTA package) to assess the significance of a pairwise alignment by shuffling the sequences.
I'm using Bio::Search and its frame()
to parse BLAST but I'm seeing 0, 1, or 2 instead of the expected -3, -2, -1, +1, +2, +3. Why am I seeing these different numbers and how do I get the frame according to BLAST?
These are GFF frames - so +1 is 0 in GFF, -3 will be encoded with a frame of 2 with the strand being set to -1.
Frames are relative to the hit or query sequence so you need to query it based on sequence you are interested in:
$hsp->hit->strand; $hsp->hit->frame;
or
$hsp->query->strand; $hsp->query->frame;
So the value according to a blast report of -3 can be constructed as:
my $blastframe = ($hsp->query->frame + 1) * $hsp->query->strand;
For example:
SH2_5: domain 2 of 2, from 349 to 432: score 104.4, E = 1.9e-26
Not directly but you can compute it since the domains are numbered by their order on the protein:
my @domains = $hit->domains;
my $domainnum = 1;
my $total = scalar @domains;
for my $domain ( sort { $a->start <=> $b->start } $hit->domains ) {
print "domain $domainnum of $total,\n";
$domainnum++;
}
How about all the features which are exons or have a /note
field that contains a certain gene name?
To get all the features:
my @features = $seq->all_SeqFeatures();
To get all the features filtering on only those which have the primary tag (ie. feature type) exon
.
my @genes = grep { $_->primary_tag eq 'exon'} $seq->all_SeqFeatures();
To get all the features filtering on this which have the /note
tag and within the note field contain the requested string $noteval
.
my @f_with_note = grep { my @a = $_->has_tag('note') ?
$_->each_tag_value('note') : ();
grep { $noteval } @a; } $seq->all_SeqFeatures();
How do I parse the CDS join
or complement
statements in GenBank or EMBL files to get the sub-locations?
For example, how can I get the the coordinates 45
and 122
in join(45..122,233..267)
?
You could use primary_tag
to find the CDS
features and the object to get the coordinates:
for my $feature ($seqobj->top_SeqFeatures){
if ( $feature->location->isa('Bio::Location::SplitLocationI')
and $feature->primary_tag eq 'CDS' ) {
for my $location ( $feature->location->sub_Location ) {
print $location->start , ".." , $location->end, "\n";
}
}
}
You could go through the protein's feature table and find the coded_by
value. The trick is to associate the coded_by
nucleotide coordinates to the nucleotide entry, which you'll retrieve using the accession number from the same feature.
use Bio::Factory::FTLocationFactory;
use Bio::DB::GenPept;
use Bio::DB::GenBank;
my $gp = Bio::DB::GenPept->new;
my $gb = Bio::DB::GenBank->new;
# factory to turn strings into Bio::Location objects
my $loc_factory = Bio::Factory::FTLocationFactory->new;
my $prot_obj = $gp->get_Seq_by_id($protein_gi);
for my $feat ( $prot_obj->top_SeqFeatures ) {
if ( $feat->primary_tag eq 'CDS' ) {
# example: 'coded_by="U05729.1:1..122"'
my @coded_by = $feat->each_tag_value('coded_by');
my ($nuc_acc,$loc_str) = split /:/, $coded_by[0];
my $nuc_obj = $gb->get_Seq_by_acc($nuc_acc);
# create Bio::Location object from a string
my $loc_object = $loc_factory->from_string($loc_str);
# create a Feature object by using a Location
my $feat_obj = Bio::SeqFeature::Generic->new(-location =>$loc_object);
# associate the Feature object with the nucleotide Seq object
$nuc_obj->add_SeqFeature($feat_obj);
my $cds_obj = $feat_obj->spliced_seq;
print "CDS sequence is ",$cds_obj->seq,"\n";
}
}
You can use the spliced_seq
method. For example:
my $seq_obj = $db->get_Seq_by_id($gi);
for my $feat ( $seq_obj->top_SeqFeatures ) {
if ( $feat->primary_tag eq 'CDS' ) {
my $cds_obj = $feat->spliced_seq;
print "CDS sequence is ",$cds_obj->seq,"\n";
}
}
The problematic features have coordinates like this:
join(complement(AY421753.1:1..6),complement(3813..5699))
To retrieve this, you need to pass a Genbank database handle to the spliced_seq
method. For example:
my $db = Bio::DB::GenBank->new();
my $io = Bio::SeqIO->new(-file=>'funnyfile.gb', -format=>'genbank');
while ( my $seq = $seq_in->next_seq ) {
for my $feat ( $seq->get_SeqFeatures ) {
if ( $feat->primary_tag eq 'CDS' ) {
my $cds = $feat->spliced_seq(-db => $db, -nosort => 0);
print $cds->translate->seq, "\n";
}
}
}
One way is to pass the location to subseq
in the form of a object. This object holds strand information as well as coordinates.
use Bio::Location::Simple;
my $location = Bio::Location::Simple->new(-start => $start,
-end => $end,
-strand => "-1");
# assume we already have a sequence object
my $rev_comp_substr = $seq_obj->subseq($location);
Wow, you're using an old version! You'll see this error because the modules and interface has changed starting with BioPerl 1.0. Before v1.0 there was a Bio::Annotation module with add_Comment
, add_Reference
, each_Comment
, and each_Reference
methods.
After v1.0 there is a module with add_Annotation('comment', $ann)
and get_Annotations('comment')
.
Please update your BioPerl in order to avoid seeing these warning messages.
How do I find all the ORFs in a nucleotide sequence? Antigenic sites in a protein? Calculate nucleotide melting temperature? Find repeats?
In fact, none of these functions are built into BioPerl but they are all available in the EMBOSS package, as well as many others. The BioPerl developers created a simple interface to EMBOSS such that any and all EMBOSS programs can be run from within BioPerl. See for more information, it's in the bioperl-run package.
If you can't find the functionality you want in BioPerl then make sure to look for it in EMBOSS, these packages integrate quite gracefully with BioPerl. Of course, you will have to install EMBOSS to get this functionality.
In addition, BioPerl after version 1.0.1 contains the Pise/Bioperl modules. The Pise package was designed to provide a uniform interface to bioinformatics applications, and currently provides wrappers to greater than 250 such applications! Included amongst these wrapped apps are HMMER, PHYLIP, BLAST, GENSCAN, and the EMBOSS suite. Use of the Pise/BioPerl modules does not require installation of Pise locally as it runs through the HTTP protocol of the web. Also, see the BioMOBY project for information on running applications remotely.
How do I do motif searches with BioPerl? Can I do "find all sequences that are 75% identical" to a given motif?
There are a number of approaches. Within BioPerl take a look at . Or, take a look at the TFBS package. This BioPerl-compliant package specializes in pattern searching of nucleotide sequence using matrices.
It's also conceivable that the combination of BioPerl and Perl's regular expressions could do the trick. You might also consider the CPAN module (this module addresses the percent match query), but experienced users question whether its distance estimates are correct, the Unix agrep command is thought to be faster and more accurate. Finally, you could use EMBOSS, as discussed in the previous question (or you could use Pise to run EMBOSS applications). The relevant programs would be fuzzpro
or fuzznuc
. Complex RNA sequence secondary structural 'motifs' can be searched with Tom Macke's RNAMotif available from the Case group at Scripps. See Bio::Tools::RNAMotif.
How do I merge a set of sequences along with their features and annotations?
Try the cat()
method in Bio::SeqUtils:
$merged_seq = Bio::SeqUtils->cat(@seqs)
This method uses the first sequence in the array as a foundation and adds the subsequent sequences to it, along with their features and annotations.
Use BlastPlus.
bioperl-run is the package with application wrappers.
This file is missing in version 1.2. Install the latest BioPerl.
bioperl-ext is a package of code for C-extensions (hence the 'ext') to BioPerl. These include interfacing to the staden IO library (the io_lib
library) for reading in chromatogram files and which is a Smith-Waterman implementation.
Make sure you read the README about copying files over. Some more items to check off before asking.
- Are you sure io_lib is installed where you think it is, and that the install path is seen by Perl (did you answer the questions during
perl Makefile.PL
?) - Did you copy the various missing *.h files (os.h config.h if I remember right) from your io_lib source directory into the install include directory when installing io_lib?
- When you ran make for io_lib did you see any errors or messages about how you should probably run "ranlib" on the library object?
- Did you run "ranlib" on the installed libread file(s)?
Note that newer versions of io_lib no longer support ABI sequences unless the Staden Package is also installed.
The BioPerl db package contains interfaces and adaptors that work with a BioSQL database to serialize and de-serialize Bioperl objects. We strongly recommend you use the Git version with the latest biosql-schema.
The BioPerl network package is code to read protein-protein interaction data and convert it to networks of BioPerl objects for analysis.
The Microarray package provides some basic tools for microarray functionality. It was started by Allen Day and may need some more work before it is a mature product.
The GUI package provides a Graphical User Interface for interacting with sequence and feature objects. It is used as part of the Genquire project.
The Pedigree package was started by Jason Stajich and provides basic tools for interacting with pedigree data and rendering pedigree plots.
I am using Ensembl. How do I do XYZ?
Though BioPerl is used in Ensembl, the version used is rather old and most of the sequence parsing infrastructure has evolved beyond using Bioperl directly (see below for an explanation). The best place to look for answers to any Ensembl-related matter is the Ensembl mail list and web site:
- http://www.ensembl.org/info/about/contact.html
- http://www.ensembl.org/info/software/core/core_tutorial.html
- http://www.ensembl.org/info/software/Pdoc/ensembl/index.html
Why is the version of BioPerl (v.1.2.3) used in Ensembl so old? Haven't there been bug fixes?
In short (in this thread):
Ensembl doesn't make heavy use of Bioperl anymore - most of the critical things we re-wrote, mainly due to speed/memory issues. I think the short answer is that it probably works with 1.5, but we don't have a strong desire to move up as certainly there are no problems with the 1.2.3 release we are using.
In another thread:
Ensembl has slowly migrated away from Bioperl, mainly due to speed issues in the bioperl framework. One project kicked around is to make a thin "facade" across Ensembl objects to make them Bioperl compliant, handling for example what the "get_Annotations()" call will actually do (if you into the look into the Annotation objects in bioperl you will get a sense of why we can't have these objects in the main Ensembl API- way too heavyweight).