- Fixed a bug in the documentation in which the `source` button would link to decorator code instead of the relevant function (#2184).
- Updated documentation to include a description of how to stream data through stdin with scikit-bio's `read` function (#2185).
- Python 3.13+ is now supported (#2146).
- Added Balanced Minimum Evolution (BME) function for phylogenetic reconstruction and `balanced` option for NNI (#2105 and #2169).
- Added functions `rf_dists`, `wrf_dists` and `path_dists` under `skbio.tree` to calculate multiple pairwise distance metrics among an arbitrary number of trees. They correspond to `TreeNode` methods `compare_rfd`, `compare_wrfd` and `compare_cophenet` for two trees (#2166).
- Added `height` and `depth` methods under `TreeNode` to calculate the height and depth of a given node.
- Added `TreeNode.compare_wrfd` to calculate the weighted Robinson-Foulds distance or its variants between two trees (#2144).
- Wrapped UPGMA and WPGMA from SciPy's linkage method (#2094).
- Added `TreeNode` methods: `bipart`, `biparts` and `compare_biparts` to encode and compare bipartitions in a tree (#2144).
- Added `TreeNode.has_caches` to check if a tree has caches (#2103).
- Added `TreeNode.is_bifurcating` to check if a tree is bifurcating (i.e., binary) (#2117).
- Added support for Python's `pathlib` module in the IO system (#2119).
- Added `TreeNode.path` to return a list of nodes representing the path from one node to another (#2131).
- Exposed `vectorize_counts_and_tree` function from the `diversity` module to allow use for improving ML accuracy in downstream pipelines (#2173).
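The entries in this changelog describe the Robinson-Foulds distance on rooted trees as a comparison of subsets (the tip sets below each internal node). The following is a minimal pure-Python sketch of that idea, not scikit-bio's implementation: trees are nested tuples of tip names, and the helper names `clade_subsets` and `rf_rooted`, as well as the way `proportion` normalizes the result, are assumptions made for illustration.

```python
def clade_subsets(tree):
    """Collect the tip-name set below every internal node except the root.

    `tree` is a nested tuple of tip names, e.g. (("a", "b"), ("c", "d")).
    """
    subsets = set()

    def tips(node, is_root):
        if isinstance(node, str):  # a tip contributes only its own name
            return frozenset([node])
        below = frozenset().union(*(tips(child, False) for child in node))
        if not is_root:  # the full tip set at the root is not a proper subset
            subsets.add(below)
        return below

    tips(tree, True)
    return subsets


def rf_rooted(tree1, tree2, proportion=False):
    """Size of the symmetric difference between the two trees' clade sets."""
    s1, s2 = clade_subsets(tree1), clade_subsets(tree2)
    diff = len(s1 ^ s2)
    if proportion:
        # Assumed normalization: divide by the total number of subsets.
        total = len(s1) + len(s2)
        return diff / total if total else 0.0
    return diff


t1 = ((("a", "b"), "c"), ("d", "e"))
t2 = ((("a", "c"), "b"), ("d", "e"))
print(rf_rooted(t1, t2))  # 2: only {a, b} and {a, c} are unshared
```

For unrooted trees the same comparison would be made on bipartitions rather than subsets, which this sketch does not cover.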
- Significantly improved the performance of the neighbor joining (NJ) algorithm (`nj`) (#2147) and the greedy minimum evolution (GME) algorithm (`gme`) for phylogenetic reconstruction, and the NNI algorithm for tree rearrangement (#2169).
- Significantly improved the performance of `TreeNode.cophenet` (renamed from `tip_tip_distances`) for computing a patristic distance matrix among all or selected tips of a tree (#2152).
- Supported Robinson-Foulds distance calculation (`TreeNode.compare_rfd`) based on bipartitions (equivalent to `compare_biparts`). This is automatically enabled when the input tree is unrooted. Otherwise the calculation is still based on subsets (equivalent to `compare_subsets`). The user can override this behavior using the `rooted` parameter (#2144).
- Re-wrote the underlying algorithm of `TreeNode.compare_subsets` because it is equivalent to the Robinson-Foulds distance on rooted trees. Added parameter `proportion`. Renamed parameter `exclude_absent_taxa` as `shared_only` (#2144).
- Added parameter `include_self` to `TreeNode.subset`. Added parameters `within`, `include_full` and `include_tips` to `TreeNode.subsets` (#2144).
- Improved the performance and customizability of `TreeNode.total_length` (renamed from `descending_branch_length`). Added parameters `include_stem` and `include_self`.
- Improved the performance of `TreeNode.lca` (#2132).
- Improved the performance of `TreeNode` methods: `ancestors`, `siblings`, and `neighbors` (#2133, #2135).
- Improved the performance of tree traversal algorithms (#2093).
- Improved the performance of tree copying (#2103).
- Further improved the caching mechanism of `TreeNode`. Specifically:
  1. Node attribute caches are only registered at the root node, which improves memory efficiency.
  2. Method `clear_caches` can be customized to clear node attribute and/or lookup caches, or specified attribute caches (#2099).
  3. Added parameter `uncache` to multiple methods that involve tree manipulation. Default is True. When one knows that caches are not present or relevant, one may set this parameter as False to skip cache clearing to significantly improve performance (#2103).
- Expanded the functionality of `TreeNode.cache_attr`. It can now take a custom function to combine children and self attributes. This makes it possible to cache multiple useful clade properties such as node count and total branch length. Also enriched the method's docstring to provide multiple examples of caching clade properties (#2099).
- Added parameter `inplace` to methods `shear`, `root_at`, `root_at_midpoint` and `root_by_outgroup` of `TreeNode` to enable manipulating the tree in place (True), which is more efficient than making a manipulated copy of the tree (False, default) (#2103).
- `TreeNode.extend` can accept any iterable type of nodes as input (#2103).
- Added parameter `strict` to `TreeNode.shear` (#2103).
- Added parameter `exclude_attrs` to `TreeNode.unrooted_copy` (#2103).
- Added support for legacy random generator to `get_rng`, such that outputs of scikit-bio functions become reproducible with code that starts with `np.random.seed` or uses `RandomState` (#2130).
- Allowed `shuffle` and `compare_cophenet` (renamed from `compare_tip_distances`) of `TreeNode` to accept a random seed or random generator to generate the shuffling function, which ensures output reproducibility (#2118).
- Replaced `accumulate_to_ancestor` with `depth` under `TreeNode`. The latter has expanded functionality which covers the default behavior of the former.
- Added beta diversity metric `jensenshannon`, which calculates Jensen-Shannon distance. Thanks to @quliping for suggesting this in #2125.
- Added parameter `include_self` to `TreeNode.ancestors` to optionally include the initial node in the path (default: False) (#2135).
- Added parameter `seed` to functions `pcoa`, `anosim`, `permanova`, `permdisp`, `randdm`, `lladser_pe`, `lladser_ci`, `isubsample`, `subsample_power`, `subsample_paired_power`, `paired_subsamples` and `hommola_cospeciation` to accept a random seed or random generator to ensure output reproducibility (#2120 and #2129).
- Made the `IORegistry` sniffer only attempt file formats which are logical given a specific object, thus improving reading efficiency.
- Allowed the `number_of_dimensions` parameter in the function `pcoa` to accept float values between 0 and 1 to capture fractional cumulative variance.
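To illustrate what "fractional cumulative variance" could mean for `number_of_dimensions`, here is a small pure-Python sketch that picks the smallest number of leading axes whose eigenvalues account for at least a given fraction of the total. It is an assumption about the general technique, not a copy of how `pcoa` handles the parameter; the function name `dims_for_fraction` and the decision to ignore non-positive eigenvalues are hypothetical.

```python
def dims_for_fraction(eigvals, fraction):
    """Smallest number of axes whose cumulative share reaches `fraction`."""
    if not 0 < fraction < 1:
        raise ValueError("fraction must be strictly between 0 and 1")
    # Assumption: only positive eigenvalues count toward explained variance.
    positive = sorted((v for v in eigvals if v > 0), reverse=True)
    total = sum(positive)
    cumulative = 0.0
    for k, value in enumerate(positive, start=1):
        cumulative += value
        if cumulative / total >= fraction:
            return k
    return len(positive)


eigvals = [5.0, 3.0, 1.5, 0.5]
print(dims_for_fraction(eigvals, 0.75))  # 2: the first two axes explain 80%
```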
- Fixed a bug in `TreeNode.find` which returns the input node object even if it's not in the current tree (#2153).
- Fixed a bug in `TreeNode.get_max_distance` which returns tip names instead of tip instances when there are single-child nodes in the tree (#2144).
- Fixed an issue in `subsets` and `cophenet` (renamed from `tip_tip_distances`) of `TreeNode` which leaves remnant attributes at each node after execution (#2144).
- Fixed a bug in `TreeNode.compare_rfd` which raises an error if taxa of the two trees are not subsets of each other (#2144).
- Fixed a bug in `TreeNode.compare_subsets` which includes the full set (not a subset) of shared taxa between two trees if a basal clade of either tree consists of entirely unshared taxa (#2144).
- Fixed a bug in `TreeNode.lca` which returns the parent of input node X instead of X itself if X is ancestral to other input nodes (#2132).
- Fixed a bug in `TreeNode.find_all` which does not look for other nodes with the same name if a `TreeNode` instance is provided, in contrast to what the documentation claims (#2099).
- Fixed a bug in `skbio.io.format.embed` which was not correctly updating the idptr sizing (#2100).
- Fixed a bug in `TreeNode.unrooted_move` which does not respect specified branch attributes (#2103).
- Fixed a bug in `skbio.diversity.get_beta_diversity_metrics` which does not display metrics other than UniFrac (#2126).
- Raises an error when beta diversity metric `mahalanobis` is called but sample number is smaller than or equal to feature number in the data. Thanks to @quliping for noting this in #2125.
- Fixed a bug in `io.format.fasta` that improperly handled sequences containing spaces (#2156).
- Added a parameter `warn_neg_eigval` to `pcoa` and `permdisp` to control when to raise a warning when negative eigenvalues are encountered. The default setting is more relaxed than the previous behavior, therefore warnings will not be raised when the negative eigenvalues are small in magnitude, which is the case in many real-world scenarios (#2154).
- Refactored `dirmult_ttest` to use a separate function for fitting data to Dirichlet-multinomial distribution (#2113).
- Remodeled documentation. Special methods (previously referred to as built-in methods) and inherited methods of a class no longer have separate stub pages. This significantly reduced the total number of webpages in the documentation (#2110).
- Renamed `invalidate_caches` as `clear_caches` under `TreeNode`, because the caches are indeed deleted rather than marked as obsolete. The old name is preserved as an alias (#2099).
- Renamed `remove_deleted` as `remove_by_func` under `TreeNode`. The old name is preserved as an alias (#2103).
- Renamed `descending_branch_length` as `total_length` under `TreeNode`. The old name is preserved as an alias.
- Under `TreeNode`, renamed `get_max_distance` as `maxdist`. Renamed `tip_tip_distances` as `cophenet`. Renamed `compare_tip_distances` as `compare_cophenet`. The new names are consistent with SciPy's relevant functions and the main body of the literature. The old names are preserved as aliases.
- Method `TreeNode.subtree` is deprecated. It will become a private member in version 0.7.0 (#2103).
- Dropped support for Python 3.8 as it has reached end-of-life (EOL). scikit-bio may still be installed under Python 3.8 and will likely work, but the development team no longer guarantees that all functionality will work as intended.
- Removed `skbio.util.SkbioWarning`. There are now no warnings specific to scikit-bio.
- Removed `skbio.util.EfficiencyWarning`. Previously it was only used in the Python implementations of pairwise sequence alignment algorithms. The new code replaced it with `PendingDeprecationWarning`.
- Removed `skbio.util.RepresentationWarning`. Previously it was only used in `TreeNode.tip_tip_distances` when a node has no branch length. The new code removed this behavior (#2152).
- Added Greedy Minimum Evolution (GME) function for phylogenetic reconstruction (#2087).
- Added support for Microsoft Windows operating system. (#2071, #2068, #2067, #2061, #2046, #2040, #2036, #2034, #2032, #2005)
- Added alpha diversity metrics: Hill number (`hill`), Renyi entropy (`renyi`) and Tsallis entropy (`tsallis`) (#2074).
- Added `rename` method for `OrdinationResults` and `DissimilarityMatrix` classes (#2027, #2085).
- Added `nni` function for phylogenetic tree rearrangement using nearest neighbor interchange (NNI) (#2050).
- Added method `TreeNode.unrooted_move`, which resembles `TreeNode.unrooted_copy` but rearranges the tree in place, thus avoiding making copies of the nodes (#2073).
- Added method `TreeNode.root_by_outgroup`, which reroots a tree according to a given outgroup (#2073).
- Added method `TreeNode.unroot`, which converts a rooted tree into unrooted by trifurcating its root (#2073).
- Added method `TreeNode.insert`, which inserts a node into the branch connecting self and its parent (#2073).
- The time and memory efficiency of `TreeNode` has been significantly improved by making its caching mechanism lazy (#2082).
- `TreeNode.copy` and `TreeNode.unrooted_copy` can now perform shallow copy of a tree in addition to deep copy.
- `TreeNode.unrooted_copy` can now copy all attributes of the nodes, in addition to name and length (#2073).
- Parameter `above` was added to `TreeNode.root_at`, such that the user can root the tree within the branch connecting the given node and its parent, thereby creating a rooted tree (#2073).
- Parameter `branch_attrs` was added to the `unrooted_copy`, `root_at`, and `root_at_midpoint` methods of `TreeNode`, such that the user can customize which node attributes should be considered as branch attributes and treated accordingly during the rerooting operation. The default behavior is preserved but is subject to change in version 0.7.0 (#2073).
- Parameter `root_name` was added to the `unrooted_copy`, `root_at`, and `root_at_midpoint` methods of `TreeNode`, such that the user can customize (or omit) the name to be given to the root node. The default behavior is preserved but is subject to change in version 0.7.0 (#2073).
- Cleared the internal node references after performing midpoint rooting (`TreeNode.root_at_midpoint`), such that a deep copy of the resulting tree will not result in infinite recursion (#2073).
- Fixed the Zenodo link in the README to always point to the most recent version (#2078).
- Added statsmodels as a dependency of scikit-bio. It replaces some of the from-scratch statistical analyses in scikit-bio, including Welch's t-test (with confidence intervals), Benjamini-Hochberg FDR correction, and Holm-Bonferroni FDR correction (#2049, #2063).
- Methods `deepcopy` and `unrooted_deepcopy` of `TreeNode` are deprecated. Use `copy` and `unrooted_copy` instead.
- NumPy 2.0 is now supported (#2051). We thank @rgommers for the advice on this (#1964).
- Added module `skbio.embedding` to provide support for storing and manipulating embeddings for biological objects, such as protein embeddings outputted from protein language models (#2008).
- Added an efficient sequence alignment path data structure `AlignPath` and its derivative `PairAlignPath` to provide a uniform interface for various multiple and pairwise alignment formats (#2011).
- Added `simpson_d` as an alias for `dominance` (Simpson's dominance index, a.k.a. Simpson's D) (#2024).
- Added `inv_simpson` (inverse Simpson index), which is equivalent to `enspie` (#2024).
- Added parameter `exp` to `shannon` to calculate the exponential of Shannon index (i.e., perplexity, or effective number of species) (#2024).
- Added parameter `finite` to Simpson's D (`dominance`) and derived metrics (`simpson`, `simpson_e` and `inv_simpson`) to correct for finite samples (#2024).
- Added support for dictionary and pandas DataFrame as input for `TreeNode.from_taxonomy` (#2042).
- `subsample_counts` now uses an optimized method from `biom-format` (#2016).
- Improved efficiency of counts matrix and vector validation prior to calculating community diversity metrics (#2024).
- Default logarithm base of Shannon index (`shannon`) was changed from 2 to e. This is to ensure consistency with other Shannon-based metrics (`pielou_e`), and with literature and implementations in the field. Meanwhile, parameter `base` was added to `pielou_e` such that the user can control this behavior (#2024). See discussions in #1884 and #2014.
- Improved treatment of empty communities (i.e., all taxa have zero counts, or there is no taxon) when calculating alpha diversity metrics. Most metrics will return `np.nan` and do not raise a warning due to zero division. Exceptions are metrics that describe observed counts, including `sobs`, `singles`, `doubles` and `osd`, which return zero (#2024). See discussions in #2014.
- Return values of `pielou_e` and `heip_e` were set to 1.0 for one-taxon communities, such that NaN is avoided, while honoring the definition (evenness of taxon abundance(s)) and the rationale (ratio between observed and maximum) (#2024).
- Removed hdmedians as a dependency by porting its `geomedian` function (geometric median) into scikit-bio (#2003).
- Removed 98% of the warnings issued during the test process (#2045 and #2037).
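The Shannon-related changes above (natural logarithm as the default base, the `exp` option, NaN for empty communities, and an evenness of 1.0 for one-taxon communities) can be sketched in plain Python as follows. This is an illustration of the arithmetic only, written under the assumption that the functions behave as the entries describe; it is not scikit-bio's code, and the signatures are simplified.

```python
import math


def shannon(counts, base=math.e, exp=False):
    """Shannon index of a vector of counts; NaN for an empty community."""
    total = sum(counts)
    if total == 0:
        return math.nan
    # Entropy in natural-log units over taxa with non-zero counts.
    h = -sum((c / total) * math.log(c / total) for c in counts if c > 0)
    if exp:
        return math.exp(h)  # effective number of species
    return h / math.log(base)  # convert to the requested base


def pielou_e(counts, base=math.e):
    """Evenness: observed Shannon index over its maximum, log(S)."""
    observed = sum(1 for c in counts if c > 0)
    if observed == 0:
        return math.nan
    if observed == 1:
        return 1.0  # a single taxon is treated as perfectly even
    return shannon(counts, base) / (math.log(observed) / math.log(base))


even = [10, 10, 10, 10]
print(shannon(even))          # ln(4), about 1.386
print(shannon(even, base=2))  # 2 bits, the pre-change default base
```

Because evenness is a ratio of two logarithms in the same base, `pielou_e` gives the same value for any `base` in this sketch.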
- Launched the new scikit-bio website: https://scikit.bio. The previous domain names scikit-bio.org and skbio.org continue to work and redirect to the new website.
- Migrated the scikit-bio website repo from the `gh-pages` branch of the `scikit-bio` repo to a standalone repo: `scikit-bio.github.io`.
- Replaced the Bootstrap theme with the PyData theme for building documentation using Sphinx. Extended this theme to the website. Customized design elements (#1934).
- Improved the calculation of Fisher's alpha diversity index (`fisher_alpha`). It is now compatible with optimizers in SciPy 1.11+. Edge cases such as all singletons can be handled correctly. Handling of errors and warnings was improved. Documentation was enriched (#1890).
- Allowed `delimiter=None` which represents whitespace of arbitrary length in reading lsmat format matrices (#1912).
- Added biom-format Table import and updated corresponding requirement files (#1907).
- Added biom-format 2.1.0 IO support (#1984).
- Added `Table` support to `alpha_diversity` and `beta_diversity` drivers (#1984).
- Implemented a mechanism to automatically build documentation and/or homepage and deploy them to the website (#1934).
- Added the Benjamini-Hochberg method as an option for FDR correction (in addition to the existing Holm-Bonferroni method) for `ancom` and `dirmult_ttest` (#1988).
- Added function `dirmult_ttest`, which performs differential abundance test using a Dirichlet multinomial distribution. This function mirrors the method provided by ALDEx2 (#1956).
- Added method `Sequence.to_indices` to convert a sequence into a vector of indices of characters in an alphabet (can be from a substitution matrix) or unique characters observed in the sequence. Supports gap masking and wildcard substitution (#1917).
- Added class `SubstitutionMatrix` to support substitution matrices for nucleotides, amino acids and more general cases (#1913).
- Added alpha diversity metric `sobs`, which is the observed species richness (S_{obs}) of a sample. `sobs` will replace `observed_otus`, which uses the historical term "OTU". Also added metric `observed_features` to be compatible with the QIIME 2 terminology. All three metrics are equivalent (#1902).
- `beta_diversity` now supports use of a pandas `DataFrame` index, issue #1808.
- Added alpha diversity metric `phydiv`, which is a generalized phylogenetic diversity (PD) framework permitting unrooted or rooted tree, unweighted or weighted by abundance, and an exponent parameter of the weight term (#1893).
- Adopted NumPy's new random generator `np.random.Generator` (see NEP 19) (#1889).
- SciPy 1.11+ is now supported (#1887).
- Removed IPython as a dependency. Scikit-bio continues to support displaying plots in IPython, but it no longer requires importing IPython functionality (#1901).
- Made Matplotlib an optional dependency. Scikit-bio no longer requires Matplotlib except for plotting, during which it attempts to import Matplotlib if it is present in the system, and raises an error if not (#1901).
- Ported the QIIME 2 metadata object into skbio. (#1929)
- Python 3.12+ is now supported, thank you @actapia (#1930)
- Introduced native character conversion (#1971)
- Beta diversity metric `kulsinski` was removed. This was motivated by the fact that SciPy replaced this distance metric with `kulczynski1` in version 1.11 (see SciPy issue #2009), and that neither metric returns 0 on two identical vectors (#1887).
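The remark that `kulsinski` does not return 0 on identical vectors can be checked by hand. The sketch below assumes the definition that SciPy formerly documented for boolean vectors, `(c_TF + c_FT - c_TT + n) / (c_FT + c_TF + n)`; treat that formula as an assumption rather than a quotation, and the function as an illustration rather than the removed implementation.

```python
def kulsinski(u, v):
    """Kulsinski dissimilarity between two boolean vectors (assumed formula)."""
    n = len(u)
    c_tt = sum(1 for a, b in zip(u, v) if a and b)      # present in both
    c_tf = sum(1 for a, b in zip(u, v) if a and not b)  # present in u only
    c_ft = sum(1 for a, b in zip(u, v) if not a and b)  # present in v only
    return (c_tf + c_ft - c_tt + n) / (c_ft + c_tf + n)


x = [True, False, True, False]
print(kulsinski(x, x))  # 0.5, even though the two vectors are identical
```

Under this formula, two identical vectors give `(n - c_TT) / n`, which is zero only when every element is true.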
- Fixed documentation interface of `vlr` and relevant functions (#1934).
- Fixed broken link in documentation of Simpson's evenness index. See issue #1923.
- Safely handle `Sequence.iter_kmers` where `k` is greater than the sequence length (#1723).
- Re-enabled OpenMP support, which had been mistakenly disabled in 0.5.8 (#1874).
- `permanova` and `permdisp` operate on a `DistanceMatrix` and a grouping object. Element IDs must be synchronized to compare correct sets of pairwise distances. This failed in case the grouping was provided as a `pandas.Series`, because it was interpreted as an ordered `list` and indices were ignored (see issue #1877 for an example). Note: `pandas.DataFrame` was handled correctly. This behavior has been fixed with PR #1879.
- Fixed slicing for `TabularMSALoc` on Python 3.12. See issue #1926.
- Replaced the historical term "OTU" with the more generic term "taxon" (plural: "taxa"). As a consequence, the parameter "otu_ids" in phylogenetic alpha and beta diversity metrics was replaced by "taxa". Meanwhile, the old parameter "otu_ids" is still kept as an alias of "taxa" for backward compatibility. However it will be removed in a future release.
- Revised contributor's guidelines.
- Renamed function `multiplicative_replacement` as `multi_replace` for conciseness (#1988).
- Renamed parameter `multiple_comparisons_correction` as `p_adjust` of function `ancom` for conciseness (#1988).
- Enabled code coverage reporting via Codecov. See #1954.
- Renamed the default branch from "master" to "main". See #1953.
- Enabled subclassing of DNA, RNA and Protein classes to allow secondary development.
- Dropped support for NumPy < 1.17.0 in order to utilize the new random generator.
- Use CYTHON by default during build (#1874)
- Implemented augmented assignments proposed in issue #1789
- Incorporated Ruff's formatting and linting via pre-commit hooks and GitHub Actions. See PR #1924.
- Improved docstrings for functions across the entire codebase. See #1933 and #1940
- Removed API lifecycle decorators in favor of deprecation warnings. See #1916
- Added variance log ratio estimators in `skbio.stats.composition.vlr` and `skbio.stats.composition.pairwise_vlr` (#1803)
- Added `skbio.stats.composition.tree_basis` to construct ILR bases from `TreeNode` objects. (#1862)
- `IntervalMetadata.query` now defaults to obtaining all results, see #1817.
- With the introduction of the `tree_basis` object, the ILR bases are now represented in log-odds coordinates rather than in probabilities to minimize issues with numerical stability. Furthermore, the `ilr` and `ilr_inv` functions now take the `basis` input parameter in terms of log-odds coordinates. This affects the `skbio.stats.composition.sbp_basis` as well. (#1862)
- Complex multiple axis indexing operations with `TabularMSA` have been removed from testing due to incompatibilities with modern versions of Pandas. (#1851)
- Pinning `scipy <= 1.10.1` (#1851)
- Fixed a bug that caused build failure on the ARM64 microarchitecture due to floating-point number handling. (#1859)
- Never let the Gini index go below 0.0, see #1844.
- Fixed bug #1847 in which the edge from the root was inadvertently included in the calculation for `descending_branch_length`
- Replaced dependencies `CacheControl` and `lockfile` with `requests` to avoid a dependency inconsistency issue of the former. (See #1863, merged in #1859)
- Updated installation instructions for developers in `CONTRIBUTING.md` (#1860)
- Added NCBI taxonomy database dump format (`taxdump`) (#1810).
- Added `TreeNode.from_taxdump` for converting taxdump into a tree (#1810).
- scikit-learn has been removed as a dependency. This was a fairly heavy-weight dependency that was providing minor functionality to scikit-bio. The critical components have been implemented in scikit-bio directly, and the non-critical components are listed under "Backward-incompatible changes [experimental]".
- Python 3.11 is now supported.
- With the removal of the scikit-learn dependency, three beta diversity metric names can no longer be specified. These are `wminkowski`, `nan_euclidean`, and `haversine`. On testing, `wminkowski` and `haversine` did not work through `skbio.diversity.beta_diversity` (or `sklearn.metrics.pairwise_distances`). The former was deprecated in favor of calling `minkowski` with a vector of weights provided as kwarg `w` (example below), and the latter does not work with data of this shape. `nan_euclidean` can still be accessed from scikit-learn directly if needed, if a user installs scikit-learn in their environment (example below).

  ```python
  counts = [[23, 64, 14, 0, 0, 3, 1],
            [0, 3, 35, 42, 0, 12, 1],
            [0, 5, 5, 0, 40, 40, 0],
            [44, 35, 9, 0, 1, 0, 0],
            [0, 2, 8, 0, 35, 45, 1],
            [0, 0, 25, 35, 0, 19, 0],
            [88, 31, 0, 5, 5, 5, 5],
            [44, 39, 0, 0, 0, 0, 0]]

  # new mechanism of accessing wminkowski
  from skbio.diversity import beta_diversity
  beta_diversity("minkowski", counts, w=[1,1,1,1,1,1,2])

  # accessing nan_euclidean through scikit-learn directly
  import skbio
  from sklearn.metrics import pairwise_distances
  sklearn_dm = pairwise_distances(counts, metric="nan_euclidean")
  skbio_dm = skbio.DistanceMatrix(sklearn_dm)
  ```

- `skbio.alignment.local_pairwise_align_ssw` has been deprecated (#1814) and will be removed or replaced in scikit-bio 0.6.0.
- Use `oldest-supported-numpy` as build dependency. This fixes problems with environments that use an older version of numpy than the one used to build scikit-bio (#1813).
- Introduce support for Python 3.10 (#1801).
- Tentative support for Apple M1 (#1709).
- Added support for reading and writing a binary distance matrix object format. (#1716)
- Added support for `np.float32` with `DissimilarityMatrix` objects.
- Added support for `method` and `number_of_dimensions` to `permdisp`, reducing the runtime by 100x at 4000 samples, issue #1769.
- OrdinationResults object is now accepted as input for permdisp.
- Avoid an implicit data copy on construction of `DissimilarityMatrix` objects.
- Avoid validation on copy of `DissimilarityMatrix` and `DistanceMatrix` objects, see PR #1747
- Use an optimized version of symmetry check in DistanceMatrix, see PR #1747
- Avoid performing filtering when ids are identical, see PR #1752
- center_distance_matrix has been re-implemented in cython for both speed and memory use. Indirectly speeds up pcoa PR #1749
- Use a memory-optimized version of permute in DistanceMatrix, see PR #1756.
- Refactor pearson and spearman skbio.stats.distance.mantel implementations to drastically improve memory locality. Also cache intermediate results that are invariant across permutations, see PR #1756.
- Refactor permanova to remove intermediate buffers and cythonize the internals, see PR #1768.
- Fix windows and 32bit incompatibility in `unweighted_unifrac`.
- Python 3.6 has been removed from our testing matrix.
- Specify build dependencies in pyproject.toml. This allows the package to be installed without having to first manually install numpy.
- Update hdmedians package to a version which doesn't require an initial manual numpy install.
- Now buildable on non-x86 platforms due to use of the SIMD Everywhere library.
- Regenerate Cython wrapper by default to avoid incompatibilities with installed CPython.
- Update documentation for the `skbio.stats.composition.ancom` function. (#1741)
- Added option to return a capture group compiled regex pattern to any class inheriting `GrammaredSequence` through the `to_regex` method. (#1431)
- Added `Dissimilarity.within` and `.between` to obtain the respective distances and express them as a `DataFrame`. (#1662)
- Added Kendall Tau as possible correlation method in the `skbio.stats.distance.mantel` function (#1675).
- Added support for IUPAC amino acid codes U (selenocysteine), O (pyrrolysine), and J (leucine or isoleucine). (#1576)
- Changed `skbio.tree.TreeNode.support` from a method to a property.
- Added `assign_supports` method to `skbio.tree.TreeNode` to extract branch support values from node labels.
- Modified the way a node's label is printed: `support:name` if both exist, or `support` or `name` if either exists.
- Require `Sphinx <= 3.0`. Newer Sphinx versions caused build errors. #1719
- `skbio.stats.ordination` tests have been relaxed. (#1713)
- Fixes build errors for newer versions of NumPy, Pandas, and SciPy.
- Corrected a critical bug in `skbio.alignment.StripedSmithWaterman`/`skbio.alignment.local_pairwise_align_ssw` which would cause the formatting of the aligned sequences to misplace gap characters by the number of gap characters present in the opposing aligned sequence up to that point. This was caused by a faulty implementation of CIGAR string parsing, see #1679 for full details.
- `skbio.diversity.beta_diversity` now accepts a pandas DataFrame as input.
- Avoid pandas 1.0.0 import warning (#1688)
- Added support for Python 3.8 and dropped support for Python 3.5.
- This version now depends on `scipy >= 1.3` and `pandas >= 1.0`.
- `skbio.stats.composition` now has methods to compute additive log-ratio transformation and inverse additive log-ratio transformation (`alr`, `alr_inv`) as well as a method to build a basis from a sequential binary partition (`sbp_basis`).
- Python 3.6 and 3.7 compatibility is now supported
- A pytest runner is shipped with every installation (#1633)
- The nosetest framework has been replaced in favor of pytest (#1624)
- This version is now compatible with numpy >= 1.17.0 and Pandas >= 0.23. (#1627)
- Added `FSVD`, an alternative fast heuristic method to perform Principal Coordinates Analysis, to `skbio.stats.ordination.pcoa`.
- Added optimized utility methods `f_matrix_inplace` and `e_matrix_inplace` which perform `f_matrix` and `e_matrix` computations in-place and are used by the new `center_distance_matrix` method in `skbio.stats.ordination`.
- Added `unpack` and `unpack_by_func` methods to `skbio.tree.TreeNode` to unpack one or multiple internal nodes. The `unpack` operation removes an internal node and regrafts its children to its parent while retaining the overall length. (#1572)
- Added `support` to `skbio.tree.TreeNode` to return the support value of a node.
- Added `permdisp` to `skbio.stats.distance` to test for the homogeneity of groups. (#1228)
- Added `pcoa_biplot` to `skbio.stats.ordination` to project descriptors into a PCoA plot.
- Fixed pandas to 0.22.0 due to this: pandas-dev/pandas#20527
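The `unpack` operation described in this list (remove an internal node, regraft its children to its parent, keep the overall length) can be sketched with a toy node class. The `Node` class and the `unpack` function below are hypothetical stand-ins written for this illustration; they are not `TreeNode`, and the real method may treat edge cases differently.

```python
class Node:
    """Toy tree node with a name, a branch length and children."""

    def __init__(self, name, length=0.0, children=()):
        self.name = name
        self.length = length
        self.parent = None
        self.children = []
        for child in children:
            child.parent = self
            self.children.append(child)


def unpack(node):
    """Remove `node` and attach its children to its parent in its place."""
    parent = node.parent
    if parent is None or not node.children:
        raise ValueError("only non-root internal nodes can be unpacked")
    index = parent.children.index(node)
    for child in node.children:
        child.length += node.length  # keep each root-to-tip distance intact
        child.parent = parent
    parent.children[index:index + 1] = node.children
    node.parent = None
    node.children = []


a, b, c = Node("a", 1.0), Node("b", 2.0), Node("c", 3.0)
inner = Node("inner", 0.5, [a, b])
root = Node("root", children=[inner, c])
unpack(inner)
print([(n.name, n.length) for n in root.children])
# [('a', 1.5), ('b', 2.5), ('c', 3.0)]
```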
- Relaxing type checking in diversity calculations. (#1583).
- Added `skbio.io.format.embl` for reading and writing EMBL files for `DNA`, `RNA` and `Sequence` classes.
- Removing ValueError check in `skbio.stats._subsample.subsample_counts` when `replace=True` and `n` is greater than the number of items in counts. #1527
- Added `skbio.io.format.gff3` for reading and writing GFF3 files for `DNA`, `Sequence`, and `IntervalMetadata` classes. (#1450)
- `skbio.metadata.IntervalMetadata` constructor has a new keyword argument, `copy_from`, for creating an `IntervalMetadata` object from an existing `IntervalMetadata` object with specified `upper_bound`.
- `skbio.metadata.IntervalMetadata` constructor allows `None` as a valid value for `upper_bound`. An `upper_bound` of `None` means that the `IntervalMetadata` object has no upper bound.
- `skbio.metadata.IntervalMetadata.drop` has a new boolean parameter `negate` to indicate whether to drop or keep the specified `Interval` objects.
- `skbio.tree.nj` wall-clock runtime was decreased by 99% for a 500x500 distance matrix and 93% for a 100x100 distance matrix. (#1512, #1513)
- The `include_self` parameter was not being honored in `skbio.TreeNode.tips`. The scope of this bug was that if `TreeNode.tips` was called on a tip, it would always result in an empty `list` when unrolled.
- In `skbio.stats.ordination.ca`, `proportion_explained` was missing in the returned `OrdinationResults` object. (#1345)
- `skbio.diversity.beta_diversity` now handles qualitative metrics as expected such that `beta_diversity('jaccard', mat) == beta_diversity('jaccard', mat > 0)`. Please see #1549 for further detail.
- Fixed the occasional column mismatch in the `biplot_scores` output of `skbio.stats.ordination.rda` (#1519).
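The identity quoted above for qualitative metrics says that only presence or absence should matter, not the magnitude of the counts. A minimal pure-Python sketch of a presence/absence Jaccard distance shows why the two sides agree; the function below is an illustration and an assumption about the intended behavior, not the code path used by `beta_diversity`.

```python
def jaccard(u, v):
    """Jaccard distance between two samples using presence/absence only."""
    present_u = [x > 0 for x in u]
    present_v = [x > 0 for x in v]
    either = sum(1 for a, b in zip(present_u, present_v) if a or b)
    differ = sum(1 for a, b in zip(present_u, present_v) if a != b)
    # Convention assumed here: two empty samples are at distance 0.
    return differ / either if either else 0.0


counts_a = [23, 0, 14, 3]
counts_b = [1, 5, 0, 3]
binary_a = [c > 0 for c in counts_a]
binary_b = [c > 0 for c in counts_b]

print(jaccard(counts_a, counts_b))  # 0.5
print(jaccard(binary_a, binary_b))  # 0.5, same as with raw counts
```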
- scikit-bio now depends on pandas >= 0.19.2, and is compatible with newer pandas versions (e.g. 0.20.3) that were previously incompatible.
- scikit-bio now depends on `numpy >= 1.9.2, < 1.14.0` for compatibility with Python 3.4, 3.5, and 3.6 and the available numpy conda packages in `defaults` and `conda-forge` channels.
- Added support for running tests from `setup.py`. Both `python setup.py nosetests` and `python setup.py test` are now supported, however `python setup.py test` will only run a subset of the full test suite. (#1341)
- Added `IntervalMetadata` and `Interval` classes in `skbio.metadata` to store, query, and manipulate information of a sub-region of a sequence. (#1414)
- `Sequence` and its child classes (including `GrammaredSequence`, `RNA`, `DNA`, `Protein`) now accept `IntervalMetadata` in their constructor API. Some of their relevant methods are also updated accordingly. (#1430)
- GenBank parser now reads and writes `Sequence` or its subclass objects with `IntervalMetadata`. (#1440)
- `DissimilarityMatrix` now has a new constructor method called `from_iterable`. (#1343)
- `DissimilarityMatrix` now allows non-hollow matrices. (#1343)
- `DistanceMatrix.from_iterable` now accepts a `validate=True` parameter. (#1343)
- `DistanceMatrix` now has a new method called `to_series` to create a `pandas.Series` from a `DistanceMatrix` (#1397).
- Added parallel beta diversity calculation support via `skbio.diversity.block_beta_diversity`. The issue and idea are discussed in #1181, while the actual code changes are in #1352.
-
The constructor API for
Sequence
and its child classes (includingGrammaredSequence
,RNA
,DNA
,Protein
) are changed from(sequence, metadata=None, positional_metadata=None, lowercase=False)
to(sequence, metadata=None, positional_metadata=None, interval_metadata=None, lowercase=False)
The changes are made to allow these classes to adopt
IntervalMetadata
object for interval features on the sequence. Theinterval_metadata
parameter is added immediately after positional_metadata
instead of being appended to the end, because it is more natural and logical and, more importantly, because it is unlikely in practice to break user code. A user's code would break only if they had supplied metadata
, positional_metadata
, andlowercase
parameters positionally. In the unlikely event that this happens, users will get an error telling them a bool isn't a validIntervalMetadata
type, so it won't silently produce buggy behavior.
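The failure mode described above can be sketched with stand-in classes. `IntervalMetadata` and `Sequence` below are simplified placeholders that mirror only the new parameter order; the type check and its error message are assumptions, not scikit-bio's actual code.

```python
class IntervalMetadata:
    """Stand-in for skbio.metadata.IntervalMetadata (illustration only)."""


class Sequence:
    """Stand-in mirroring only the new constructor parameter order."""

    def __init__(self, sequence, metadata=None, positional_metadata=None,
                 interval_metadata=None, lowercase=False):
        # Assumed validation: reject anything that is not IntervalMetadata.
        if interval_metadata is not None and not isinstance(
                interval_metadata, IntervalMetadata):
            raise TypeError(
                "interval_metadata must be IntervalMetadata, not %s"
                % type(interval_metadata).__name__)
        self.sequence = sequence
        self.lowercase = lowercase


# Old-style call passing `lowercase` as the fourth positional argument:
# the bool now lands in the `interval_metadata` slot and fails loudly.
try:
    Sequence("acgt", None, None, True)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Passing `lowercase` by keyword works under both the old and new signatures.
seq = Sequence("acgt", lowercase=True)
```

Passing optional arguments by keyword keeps calling code independent of where new parameters are inserted.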
- Modified basis handling in
skbio.stats.composition.ilr_inv
prior to checking for orthogonality. Now the basis is strictly assumed to be in the Aitchison simplex. DistanceMatrix.from_iterable
default behavior is now to validate matrix by computing all pairwise distances. Passvalidate=False
to get the previous behavior (no validation, but faster execution). (#1343)
- GenBank I/O now parses sequence features into the interval_metadata attribute instead of positional_metadata, and the FEATURES key is removed from the metadata attribute.
TreeNode.shear
was rewritten for approximately a 25% performance increase. (#1399)- The
IntervalMetadata
allows dramatic decrease in memory usage in reading GenBank files of feature rich sequences. (#1159)
skbio.tree.TreeNode.prune
and implicitlyskbio.tree.TreeNode.shear
were not handling a situation in which a parent was validly removed during pruning operations, as may happen if the resulting subtree does not include the root. Previously, an AttributeError was raised because parent would be None in this situation.
- numpy linking was fixed for installation under El Capitan.
- A bug was introduced in #1398 into
TreeNode.prune
and fixed in #1416 in which, under the special case of a single descendent existing from the root, the resulting children parent references were not updated. The cause of the bug was a call made toself.children.extend
as opposed toself.extend
where the former is alist.extend
without knowledge of the tree, while the latter isTreeNode.extend
which is able to adjust references toself.parent
.
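The difference between the two calls can be reproduced with a toy tree. `Node` below is a minimal stand-in, not skbio.tree.TreeNode; it only shows why a plain list.extend leaves parent references stale while a tree-aware extend does not.

```python
class Node:
    """Minimal tree node (a stand-in, not skbio.tree.TreeNode)."""

    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = []

    def extend(self, nodes):
        # Tree-aware extend: detach each node from its old parent,
        # then attach it here and fix its parent reference.
        for node in list(nodes):
            if node.parent is not None:
                node.parent.children.remove(node)
            node.parent = self
            self.children.append(node)


old_parent = Node("old")
a, b = Node("a"), Node("b")
old_parent.extend([a, b])

# Buggy pattern: list.extend knows nothing about the tree.
buggy_root = Node("buggy_root")
buggy_root.children.extend(old_parent.children)
stale_parent = a.parent.name  # still "old"

# Fixed pattern: the tree-aware method updates parent references.
fixed_root = Node("fixed_root")
fixed_root.extend(old_parent.children)
updated_parent = a.parent.name  # now "fixed_root"
```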
- Removed deprecated functions from
skbio.util
:is_casava_v180_or_later
,remove_files
, andcreate_dir
. - Removed deprecated
skbio.Sequence.copy
method.
IMPORTANT: scikit-bio is no longer compatible with Python 2. scikit-bio is compatible with Python 3.4 and later.
- Added more descriptive error message to
skbio.io.registry
when attempting to read without specifyinginto
and when there is no generator reader. (#1326) - Added support for reference tags to
skbio.io.format.stockholm
reader and writer. (#1348) - Expanded error message in
skbio.io.format.stockholm
reader whenconstructor
is not passed, in order to provide better explanation to user. (#1327) - Added
skbio.sequence.distance.kmer_distance
for computing the kmer distance between two sequences. (#913) - Added
skbio.sequence.Sequence.replace
for assigning a character to positions in aSequence
. (#1222) - Added support for
pandas.RangeIndex
, lowering the memory footprint of default integer index objects.Sequence.positional_metadata
andTabularMSA.positional_metadata
now usepd.RangeIndex
as the positional metadata index.TabularMSA
now usespd.RangeIndex
as the default index. Usage ofpd.RangeIndex
over the previouspd.Int64Index
should be transparent, so these changes should be non-breaking to users. scikit-bio now depends on pandas >= 0.18.0 (#1308) - Added
reset_index=False
parameter toTabularMSA.append
andTabularMSA.extend
for resetting the MSA's index to the default index after appending/extending. - Added support for partial pairwise calculations via
skbio.diversity.partial_beta_diversity
. (#1221, #1337). This function is immediately deprecated as its return type will change in the future and should be used with caution in its present form (see the function's documentation for details). TemporaryFile
andNamedTemporaryFile
are now supported IO sources forskbio.io
and related functionality. (#1291)- Added
tree_node_class=TreeNode
parameter toskbio.tree.majority_rule
to support returning consensus trees of typeTreeNode
(the default) or a type that has the same interface asTreeNode
(e.g.TreeNode
subclasses) (#1193) TreeNode.from_linkage_matrix
andTreeNode.from_taxonomy
now support constructingTreeNode
subclasses.TreeNode.bifurcate
now supportsTreeNode
subclasses (#1193)- The
ignore_metadata
keyword has been added toTabularMSA.iter_positions
to improve performance when metadata is not necessary. - Pairwise aligners in
skbio.alignment
now propagate per-sequencemetadata
objects (this does not includepositional_metadata
).
TabularMSA.append
andTabularMSA.extend
now require one ofminter
,index
, orreset_index
to be provided when incorporating new sequences into an MSA. Previous behavior was to auto-increment the index labels ifminter
andindex
weren't provided and the MSA had a default integer index, otherwise error. Usereset_index=True
to obtain the previous behavior in a more explicit way.skbio.stats.composition.ancom
now returns twopd.DataFrame
objects, where it previously returned one. The first contains the ANCOM test results, as before, and the second contains percentile abundances of each feature in each group. The specific percentiles that are computed and returned are controlled by the new percentiles
parameter toskbio.stats.composition.ancom
. In the future, this secondpd.DataFrame
will not be returned by this function, but will be available through the contingency table API. (#1293)skbio.stats.composition.ancom
now performs multiple comparisons correction by default. The previous behavior of not performing multiple comparisons correction can be achieved by passingmultiple_comparisons_correction=None
.- The
reject
column in the firstpd.DataFrame
returned fromskbio.stats.composition.ancom
has been renamedReject null hypothesis
for clarity. (#1375)
- Fixed row and column names to
biplot_scores
in theOrdinationResults
object fromskbio.stats.ordination
. This fix affects the cca
andrda
methods. (#1322) - Fixed bug when using
skbio.io.format.stockholm
reader on file with multi-line tree with no id. Previously this raised anAttributeError
, now it correctly handles this type of tree. (#1334) - Fixed bug when reading Stockholm files with GF or GS features split over multiple lines. Previously, the feature text was simply concatenated because it was assumed to have trailing whitespace. There are examples of Stockholm files with and without trailing whitespace for multi-line features, so the
skbio.io.format.stockholm
reader now adds a single space when concatenating feature text without trailing whitespace to avoid joining words together. Multi-line trees stored as GF metadata are concatenated as they appear in the file; a space is not added when concatenating. (#1328) - Fixed bug when using
Sequence.iter_kmers
on emptySequence
object. Previously this raised aValueError
, now it returns an empty generator. - Fixed minor bug where adding sequences to an empty
TabularMSA
with MSA-widepositional_metadata
would result in aTabularMSA
object in an inconsistent state. This could happen usingTabularMSA.append
orTabularMSA.extend
. This bug only affects aTabularMSA
object without sequences that has MSA-widepositional_metadata
(for example,TabularMSA([], positional_metadata={'column': []})
). TreeNode.distance
now handles the situation in whichself
orother
are ancestors. Previously, a node further up the tree was used, resulting in inflated distances. (#807)
- TreeNode.prune
can now handle a root with a single descendent. Previously, the root was ignored from possibly having a single descendent. (#1247)- Providing the
format
keyword toskbio.io.read
when creating a generator with an empty file will now return an empty generator instead of raisingStopIteration
. (#1313) OrdinationResults
is now importable fromskbio
andskbio.stats.ordination
and correctly linked from the documentation (#1205)- Fixed performance bug in pairwise aligners resulting in 100x worse performance than in 0.2.4.
- Deprecated use of the term "non-degenerate", in favor of "definite".
GrammaredSequence.nondegenerate_chars
,GrammaredSequence.nondegenerates
, andGrammaredSequence.has_nondegenerates
have been renamed toGrammaredSequence.definite_chars
,GrammaredSequence.definites
, andGrammaredSequence.has_definites
, respectively. The old names will be removed in scikit-bio 0.5.2. Relevant affected public classes includeGrammaredSequence
,DNA
,RNA
, andProtein
.
- Deprecated function
skbio.util.create_dir
. This function will be removed in scikit-bio 0.5.1. Please use the Python standard library functionality described here. (#833) - Deprecated function
skbio.util.remove_files
. This function will be removed in scikit-bio 0.5.1. Please use the Python standard library functionality described here. (#833) - Deprecated function
skbio.util.is_casava_v180_or_later
. This function will be removed in 0.5.1. Functionality moved to FASTQ sniffer. (#833)
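The pages linked from the two entries above are not reproduced in this changelog. As an assumption about what standard library replacements could look like (not the project's stated recommendation), os.makedirs and os.remove cover the common cases:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    # Stand-in for create_dir: create nested directories, tolerating reruns.
    target = os.path.join(workdir, "results", "run1")
    os.makedirs(target, exist_ok=True)
    os.makedirs(target, exist_ok=True)  # no error if it already exists
    created = os.path.isdir(target)

    # Stand-in for remove_files: delete a list of paths, skipping missing ones.
    paths = [os.path.join(target, name) for name in ("a.txt", "b.txt")]
    with open(paths[0], "w") as handle:
        handle.write("temporary\n")
    for path in paths:
        try:
            os.remove(path)
        except FileNotFoundError:
            pass  # b.txt was never created
    remaining = os.listdir(target)
```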
- When installing scikit-bio via
pip
, numpy must now be installed first (#1296)
Minor maintenance release. This is the last Python 2.7 compatible release. Future scikit-bio releases will only support Python 3.
- Added
skbio.tree.TreeNode.bifurcate
for converting multifurcating trees into bifurcating trees. (#896) - Added
skbio.io.format.stockholm
for reading Stockholm files into aTabularMSA
and writing from aTabularMSA
. (#967) - scikit-bio
Sequence
objects have better compatibility with numpy. For example, callingnp.asarray(sequence)
now converts the sequence to a numpy array of characters (the same as callingsequence.values
). - Added
skbio.sequence.distance
subpackage for computing distances between scikit-bioSequence
objects (#913) - Added
skbio.sequence.GrammaredSequence
, which can be inherited from to create grammared sequences with custom alphabets (e.g., for use with TabularMSA) (#1175) - Added
skbio.util.classproperty
decorator
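For readers unfamiliar with the pattern, a class-level property can be built from a small descriptor. This is a sketch of the general technique, not necessarily how skbio.util.classproperty is implemented; the `Nucleotide` class is hypothetical.

```python
class classproperty:
    """Read-only property evaluated on the class rather than an instance."""

    def __init__(self, getter):
        self.getter = getter

    def __get__(self, instance, owner):
        # `owner` is the class, whether accessed via the class or an instance.
        return self.getter(owner)


class Nucleotide:
    _alphabet = frozenset("ACGT")

    @classproperty
    def alphabet(cls):
        return cls._alphabet


from_class = Nucleotide.alphabet
from_instance = Nucleotide().alphabet
```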
- When sniffing or reading a file (
skbio.io.sniff
,skbio.io.read
, or the object-oriented.read()
interface), passingnewline
as a keyword argument toskbio.io.open
now raises aTypeError
. This backward-incompatible change to a stable API is necessary because it fixes a bug (more details in bug fix section below). - When reading a FASTQ or QSEQ file and passing
variant='solexa'
,ValueError
is now raised instead ofNotImplementedError
. This backward-incompatible change to a stable API is necessary to avoid creating a spin-locked process due to a bug in Python. See #1256 for details. This change is temporary and will be reverted toNotImplementedError
when the bug is fixed in Python.
skbio.io.format.genbank
: When reading GenBank files, the date field of the LOCUS line is no longer parsed into adatetime.datetime
object and is left as a string. When writing GenBank files, the locus date metadata is expected to be a string instead of adatetime.datetime
object (#1153)Sequence.distance
now converts the input sequence (other
) to its type before passing both sequences tometric
. Previous behavior was to always convert toSequence
.
- Fixed bug when using
Sequence.distance
orDistanceMatrix.from_iterable
to compute distances betweenSequence
objects with differingmetadata
/positional_metadata
and passingmetric=scipy.spatial.distance.hamming
(#1254) - Fixed performance bug when computing Hamming distances between
Sequence
objects inDistanceMatrix.from_iterable
(#1250) - Changed
skbio.stats.composition.multiplicative_replacement
to raise an error whenever a large value ofdelta
is chosen (#1241) - When sniffing or reading a file (
skbio.io.sniff
,skbio.io.read
, or the object-oriented.read()
interface), passingnewline
as a keyword argument toskbio.io.open
now raises aTypeError
. The file format'snewline
character will be used when opening the file. Previous behavior allowed overriding the format'snewline
character but this could cause issues with readers that assume newline characters are those defined by the file format (which is an entirely reasonable assumption). This bug is very unlikely to have surfaced in practice as the defaultnewline
behavior is universal newlines mode. - DNA, RNA, and Protein are no longer inheritable because they assume an IUPAC alphabet.
DistanceMatrix
constructor provides more informative error message when data contains NaNs (#1276)
- Warnings raised by scikit-bio now share a common base class
skbio.util.SkbioWarning
.
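A shared warning class lets callers filter every library warning with a single rule. The classes below are stand-ins defined locally so the sketch runs without scikit-bio; the `compute` function and its message are hypothetical.

```python
import warnings


class SkbioWarning(Warning):
    """Stand-in for skbio.util.SkbioWarning."""


class RepresentationWarning(SkbioWarning):
    """Stand-in for a more specific library warning."""


def compute():
    warnings.warn("branch length missing; assuming 0.0", RepresentationWarning)
    return 42


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("ignore")  # drop everything by default ...
    warnings.simplefilter("always", SkbioWarning)  # ... except library warnings
    warnings.warn("unrelated", UserWarning)
    result = compute()

caught_categories = [w.category for w in caught]
```

The later filter is checked first, so only warnings deriving from the shared class are recorded.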
- The
TabularMSA
object was added to represent and operate on tabular multiple sequence alignments. This satisfies RFC 1. See theTabularMSA
docs for full details. - Added phylogenetic diversity metrics, including weighted UniFrac, unweighted UniFrac, and Faith's Phylogenetic Diversity. These are accessible as
skbio.diversity.beta.unweighted_unifrac
,skbio.diversity.beta.weighted_unifrac
, andskbio.diversity.alpha.faith_pd
, respectively. - Addition of the function
skbio.diversity.alpha_diversity
to support applying an alpha diversity metric to multiple samples in one call. - Addition of the functions
skbio.diversity.get_alpha_diversity_metrics
andskbio.diversity.get_beta_diversity_metrics
to support discovery of the alpha and beta diversity metrics implemented in scikit-bio. - Added
skbio.stats.composition.ancom
function, a test for OTU differential abundance across sample categories. (#1054) - Added
skbio.io.format.blast7
for reading BLAST+ output format 7 or BLAST output format 9 files into apd.DataFrame
. (#1110) - Added
skbio.DissimilarityMatrix.to_data_frame
method for creating apandas.DataFrame
from aDissimilarityMatrix
orDistanceMatrix
. (#757) - Added support for one-dimensional vector of dissimilarities in
skbio.stats.distance.DissimilarityMatrix
constructor. (#6240) - Added
skbio.io.format.blast6
for reading BLAST+ output format 6 or BLAST output format 8 files into apd.DataFrame
. (#1110) - Added
inner
,ilr
,ilr_inv
andclr_inv
,skbio.stats.composition
, which enables linear transformations on compositions (#892) - Added
skbio.diversity.alpha.pielou_e
function as an evenness metric of alpha diversity. (#1068) - Added
to_regex
method toskbio.sequence._iupac_sequence
ABC - it returns a regex object that matches all non-degenerate versions of the sequence. - Added
skbio.util.assert_ordination_results_equal
function for comparingOrdinationResults
objects in unit tests. - Added
skbio.io.format.genbank
for reading and writing GenBank/GenPept forDNA
,RNA
,Protein
andSequence
classes. - Added
skbio.util.RepresentationWarning
for warning about substitutions, assumptions, or particular alterations that were made for the successful completion of a process. TreeNode.tip_tip_distances
now supports nodes without an associated length. In this case, a length of 0.0 is assumed and anskbio.util.RepresentationWarning
is raised. Previous behavior was to raise aNoLengthError
. (#791)DistanceMatrix
now has a new constructor method calledfrom_iterable
.Sequence
now acceptslowercase
keyword likeDNA
and others. Updatedfasta
,fastq
, andqseq
readers/writers forSequence
to reflect this.- The
lowercase
method has been moved up toSequence
meaning all sequence objects now have alowercase
method. - Added
reverse_transcribe
class method toRNA
. - Added
Sequence.observed_chars
property for obtaining the set of observed characters in a sequence. (#1075) - Added
Sequence.frequencies
method for computing character frequencies in a sequence. (#1074) - Added experimental class-method
Sequence.concat
which will produce a new sequence from an iterable of existing sequences. Parameters control how positional metadata is propagated during a concatenation. TreeNode.to_array
now supports replacingnan
branch lengths in the resulting branch length vector with the value provided asnan_length_value
.skbio.io.format.phylip
now supports sniffing and reading strict, sequential PHYLIP-formatted files intoskbio.Alignment
objects. (#1006)- Added
default_gap_char
class property toDNA
,RNA
, andProtein
for representing gap characters in a new sequence.
-
Sequence.kmer_frequencies
now returns adict
. Previous behavior was to return acollections.Counter
ifrelative=False
was passed, and acollections.defaultdict
ifrelative=True
was passed. In the case of a missing key, theCounter
would return 0 and thedefaultdict
would return 0.0. Because the return type is now always adict
, attempting to access a missing key will raise aKeyError
. This change may break backwards-compatibility depending on how theCounter
/defaultdict
is being used. We hope that in most cases this change will not break backwards-compatibility because bothCounter
anddefaultdict
aredict
subclasses. If the previous behavior is desired, convert the
dict
into aCounter
/defaultdict
:

    import collections
    from skbio import Sequence

    seq = Sequence('ACCGAGTTTAACCGAATA')

    # Counter
    freqs_dict = seq.kmer_frequencies(k=8)
    freqs_counter = collections.Counter(freqs_dict)

    # defaultdict
    freqs_dict = seq.kmer_frequencies(k=8, relative=True)
    freqs_default_dict = collections.defaultdict(float, freqs_dict)

Rationale: We believe it is safer to return
dict
instead ofCounter
/defaultdict
as this may prevent error-prone usage of the return value. Previous behavior allowed accessing missing kmers, returning 0 or 0.0 depending on therelative
parameter. This is convenient in many cases but also potentially misleading. For example, consider the following code:

    from skbio import Sequence

    seq = Sequence('ACCGAGTTTAACCGAATA')
    freqs = seq.kmer_frequencies(k=8)
    freqs['ACCGA']

Previous behavior would return 0 because the kmer
'ACCGA'
is not present in theCounter
. In one respect this is the correct answer because we asked for kmers of length 8;'ACCGA'
is a different length so it is not included in the results. However, we believe it is safer to avoid this implicit behavior in case the user assumes there are no'ACCGA'
kmers in the sequence (which there are!). AKeyError
in this case is more explicit and forces the user to consider their query. Returning adict
will also be consistent withSequence.frequencies
.
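The rationale above can be checked without scikit-bio. The sketch below uses a hypothetical `kmer_counts` helper on a plain string in place of Sequence.kmer_frequencies, and contrasts the silent zero from a Counter with the KeyError from a dict.

```python
import collections


def kmer_counts(sequence, k):
    """Count overlapping kmers in a plain dict (hypothetical helper)."""
    counts = {}
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        counts[kmer] = counts.get(kmer, 0) + 1
    return counts


seq = "ACCGAGTTTAACCGAATA"
freqs = kmer_counts(seq, k=8)

# Counter-style lookup silently reports 0 for a kmer of the wrong length.
silent_zero = collections.Counter(freqs)["ACCGA"]

# Plain dict lookup raises, prompting a second look at the query.
try:
    freqs["ACCGA"]
    raised = False
except KeyError:
    raised = True

# Asking for the right length shows the kmer really is present.
actual_count = kmer_counts(seq, k=5)["ACCGA"]
```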
- Replaced
PCoA
,CCA
,CA
andRDA
inskbio.stats.ordination
with equivalent functionspcoa
,cca
,ca
andrda
. These functions now takepd.DataFrame
objects. - Change
OrdinationResults
to have its attributes based onpd.DataFrame
andpd.Series
objects, instead of pairs of identifiers and values. The changes are as follows:
  - species and species_ids have been replaced by a pd.DataFrame named features.
  - site and site_ids have been replaced by a pd.DataFrame named samples.
  - eigvals is now a pd.Series object.
  - proportion_explained is now a pd.Series object.
  - biplot is now a pd.DataFrame object named biplot_scores.
  - site_constraints is now a pd.DataFrame object named sample_constraints.
short_method_name
andlong_method_name
are now required arguments of theOrdinationResults
object.- Removed
skbio.diversity.alpha.equitability
. Please useskbio.diversity.alpha.pielou_e
, which is more accurately named and better documented. Note thatequitability
by default used logarithm base 2 whilepielou_e
uses logarithm basee
as described in Heip 1974. skbio.diversity.beta.pw_distances
is now calledskbio.diversity.beta_diversity
. This function no longer defines a default metric, andmetric
is now the first argument to this function. This function can also now take a pairwise distances function aspairwise_func
.- Deprecated function
skbio.diversity.beta.pw_distances_from_table
has been removed from scikit-bio as scheduled. Code that used this should be adapted to useskbio.diversity.beta_diversity
. TreeNode.index_tree
now returns a 2-D numpy array as its second return value (the child node index) instead of a 1-D numpy array.- Deprecated functions
skbio.draw.boxplots
andskbio.draw.grouped_distributions
have been removed from scikit-bio as scheduled. These functions generated plots that were not specific to bioinformatics. These types of plots can be generated with seaborn or another general-purpose plotting package. - Deprecated function
skbio.stats.power.bootstrap_power_curve
has been removed from scikit-bio as scheduled. Useskbio.stats.power.subsample_power
orskbio.stats.power.subsample_paired_power
followed byskbio.stats.power.confidence_bound
. - Deprecated function
skbio.stats.spatial.procrustes
has been removed from scikit-bio as scheduled in favor ofscipy.spatial.procrustes
. - Deprecated class
skbio.tree.CompressedTrie
and functionskbio.tree.fasta_to_pairlist
have been removed from scikit-bio as scheduled in favor of existing general-purpose Python trie packages. - Deprecated function
skbio.util.flatten
has been removed from scikit-bio as scheduled in favor of solutions available in the Python standard library (see here and here for examples). - Pairwise alignment functions in
skbio.alignment
now return a tuple containing theTabularMSA
alignment, alignment score, and start/end positions. The returnedTabularMSA
'sindex
is always the default integer index; sequence IDs are no longer propagated to the MSA. Additionally, the pairwise alignment functions now accept the following input types to align:
  - local_pairwise_align_nucleotide: DNA or RNA
  - local_pairwise_align_protein: Protein
  - local_pairwise_align: IUPACSequence
  - global_pairwise_align_nucleotide: DNA, RNA, or TabularMSA[DNA|RNA]
  - global_pairwise_align_protein: Protein or TabularMSA[Protein]
  - global_pairwise_align: IUPACSequence or TabularMSA
  - local_pairwise_align_ssw: DNA, RNA, or Protein. Additionally, this function now overrides the protein kwarg based on input type. The constructor parameter was removed because the function now determines the return type based on input type.
- Removed
skbio.alignment.SequenceCollection
in favor of using a list or other standard library containers to store scikit-bio sequence objects (mostSequenceCollection
operations were simple list comprehensions). UseDistanceMatrix.from_iterable
instead ofSequenceCollection.distances
(passkey="id"
to exactly match original behavior). - Removed
skbio.alignment.Alignment
in favor ofskbio.alignment.TabularMSA
. - Removed
skbio.alignment.SequenceCollectionError
andskbio.alignment.AlignmentError
exceptions as their corresponding classes no longer exist.
Sequence
objects now handle slicing of empty positional metadata correctly. Any metadata that is empty will no longer be propagated by the internal_to
constructor. (#1133)DissimilarityMatrix.plot()
no longer leaves a white border around the heatmap it plots (PR #1070).- TreeNode.root_at_midpoint`` no longer fails when a node with two equal length child branches exists in the tree. (#1077)
TreeNode._set_max_distance
, as called throughTreeNode.get_max_distance
orTreeNode.root_at_midpoint
would store distance information aslist
s in the attributeMaxDistTips
on each node in the tree, however, these distances were only valid for the node in which the call to_set_max_distance
was made. The values contained inMaxDistTips
are now correct across the tree following a call toget_max_distance
. The scope of impact of this bug is limited to users that were interacting directly withMaxDistTips
on descendant nodes; this bug does not impact any known method within scikit-bio. (#1223)- Added missing
nose
dependency to setup.py'sinstall_requires
. (#1214) - Fixed issue that resulted in legends of
OrdinationResult
plots sometimes being truncated. (#1210)
skbio.Sequence.copy
has been deprecated in favor ofcopy.copy(seq)
andcopy.deepcopy(seq)
.
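The two standard library calls differ in how nested mutable state is handled. The sketch below uses a plain stand-in class, `Record`; scikit-bio sequence objects may define their own copy behavior, so this only illustrates the general semantics of copy.copy and copy.deepcopy.

```python
import copy


class Record:
    """Stand-in for an object carrying mutable metadata."""

    def __init__(self, values, metadata):
        self.values = values
        self.metadata = metadata


original = Record("ACGT", {"tags": ["draft"]})
shallow = copy.copy(original)
deep = copy.deepcopy(original)

# Mutate nested state on the original after copying.
original.metadata["tags"].append("reviewed")

shallow_tags = shallow.metadata["tags"]  # shares the list with `original`
deep_tags = deep.metadata["tags"]        # independent copy
```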
- Doctests are now written in Python 3.
make test
now validates MANIFEST.in using check-manifest. (#461)- Many new alpha diversity equations added to
skbio.diversity.alpha
documentation. (#321) - Order of
lowercase
andvalidate
keywords swapped inDNA
,RNA
, andProtein
.
Initial beta release. In addition to the changes detailed below, the following subpackages have been mostly or entirely rewritten and most of their APIs are substantially different (and improved!):
skbio.sequence
skbio.io
The APIs of these subpackages are now stable, and all others are experimental. See the API stability docs for more details, including what we mean by stable and experimental in this context. We recognize that this is a lot of backward-incompatible changes. To avoid these types of changes being a surprise to our users, our public APIs are now decorated to make it clear to developers when an API can be relied upon (stable) and when it may be subject to change (experimental).
- Added
skbio.stats.composition
for analyzing data made up of proportions - Added new
skbio.stats.evolve
subpackage for evolutionary statistics. Currently contains a single function,hommola_cospeciation
, which implements a permutation-based test of correlation between two distance matrices. - Added support for
skbio.io.util.open_file
andskbio.io.util.open_files
to pull files from HTTP and HTTPS URLs. This behavior propagates to the I/O registry. - FASTA/QUAL (
skbio.io.format.fasta
) and FASTQ (skbio.io.format.fastq
) readers now allow blank or whitespace-only lines at the beginning of the file, between records, or at the end of the file. A blank or whitespace-only line in any other location will continue to raise an error (#781). - scikit-bio now ignores leading and trailing whitespace characters on each line while reading FASTA/QUAL and FASTQ files.
- Added
ratio
parameter toskbio.stats.power.subsample_power
. This allows the user to calculate power for groups of uneven size (for example, drawing twice as many samples from Group B as from Group A). If ratio
is not set, group sizes will remain equal across all groups. - Power calculations (
skbio.stats.power.subsample_power
andskbio.stats.power.subsample_paired_power
) can use test functions that return multiple p values, like some multivariate linear regression models. Previously, the power calculations required the test to return a single p value. - Added
skbio.util.assert_data_frame_almost_equal
function for comparingpd.DataFrame
objects in unit tests.
- The speed of quality score decoding has been significantly improved (~2x) when reading
fastq
files. - The speed of
NucleotideSequence.reverse_complement
has been improved (~6x).
- Changed
Sequence.distance
to raise an error any time two sequences are passed of different lengths regardless of thedistance_fn
being passed. (#514) - Fixed issue with
TreeNode.extend
where if given the children of anotherTreeNode
object (tree.children
), both trees would be left in an incorrect and unpredictable state. (#889) - Changed the way power was calculated in
subsample_paired_power
to move the subsample selection before the test is performed. This increases the number of Monte Carlo simulations performed during power estimation, and improves the accuracy of the returned estimate. Previous power estimates fromsubsample_paired_power
should be disregarded and re-calculated. (#910) - Fixed issue where
randdm
was attempting to create asymmetric distance matrices. This was causing an error to be raised by the DistanceMatrix
constructor inside of theranddm
function, so thatranddm
would fail when attempting to create large distance matrices. (#943)
- Deprecated
skbio.util.flatten
. This function will be removed in scikit-bio 0.3.1. Please use standard Python library functionality, as described in "Making a flat list out of lists of lists" and "Flattening a shallow list". (#833) - Deprecated
skbio.stats.power.bootstrap_power_curve
will be removed in scikit-bio 0.4.1. It is deprecated in favor of usingsubsample_power
or subsample_paired_power
to calculate a power matrix, and then the use of confidence_bound
to calculate the average and confidence intervals.
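For the skbio.util.flatten deprecation above, one standard library approach to shallow flattening is itertools.chain.from_iterable. This is a generic sketch; it flattens a single level of nesting only.

```python
from itertools import chain

nested = [[1, 2], [3], [], [4, 5]]

# Flatten one level of nesting lazily, then materialize as a list.
flat = list(chain.from_iterable(nested))

# Equivalent list comprehension for the same shallow flattening.
flat_comprehension = [item for sublist in nested for item in sublist]
```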
- Removed the following deprecated functionality:
  - skbio.parse subpackage, including SequenceIterator, FastaIterator, FastqIterator, load, parse_fasta, parse_fastq, parse_qual, write_clustal, parse_clustal, and FastqParseError; please use skbio.io instead.
  - skbio.format subpackage, including fasta_from_sequence, fasta_from_alignment, and format_fastq_record; please use skbio.io instead.
  - skbio.alignment.SequenceCollection.int_map; please use SequenceCollection.update_ids instead.
  - skbio.alignment.SequenceCollection methods to_fasta and toFasta; please use SequenceCollection.write instead.
  - constructor parameter in skbio.alignment.Alignment.majority_consensus; please convert returned biological sequence object manually as desired (e.g., str(seq)).
  - skbio.alignment.Alignment.to_phylip; please use Alignment.write instead.
  - skbio.sequence.BiologicalSequence.to_fasta; please use BiologicalSequence.write instead.
  - skbio.tree.TreeNode methods from_newick, from_file, and to_newick; please use TreeNode.read and TreeNode.write instead.
  - skbio.stats.distance.DissimilarityMatrix methods from_file and to_file; please use DissimilarityMatrix.read and DissimilarityMatrix.write instead.
  - skbio.stats.ordination.OrdinationResults methods from_file and to_file; please use OrdinationResults.read and OrdinationResults.write instead.
  - skbio.stats.p_value_to_str; there is no replacement.
  - skbio.stats.subsample; please use skbio.stats.subsample_counts instead.
  - skbio.stats.distance.ANOSIM; please use skbio.stats.distance.anosim instead.
  - skbio.stats.distance.PERMANOVA; please use skbio.stats.distance.permanova instead.
  - skbio.stats.distance.CategoricalStatsResults; there is no replacement, please use skbio.stats.distance.anosim or skbio.stats.distance.permanova, which will return a pandas.Series object.
skbio.alignment.Alignment.majority_consensus
now returnsBiologicalSequence('')
if the alignment is empty. Previously,''
was returned.min_observations
was removed fromskbio.stats.power.subsample_power
andskbio.stats.power.subsample_paired_power
. The minimum number of samples for subsampling depends on the data set and statistical tests. Having a default parameter set unnecessary limitations on the technique.
- Changed testing procedures
- Developers should now use
make test
- Users can use
python -m skbio.test
- Added
skbio.util._testing.TestRunner
(available throughskbio.util.TestRunner
). Used to provide atest
method for each module init file. This class represents a unified testing path which wraps allskbio
testing functionality. - Autodetect Python version and disable doctests for Python 3.
-
numpy
is no longer required to be installed before installing scikit-bio!- Upgraded checklist.py to check source files non-conforming to new header style. (#855)
- Updated to use
natsort
>= 4.0.0. - The method of subsampling was changed for
skbio.stats.power.subsample_paired_power
. Rather than drawing a paired sample for the run and then subsampling for each count, the subsample is now drawn for each sample and each run. In test data, this did not significantly alter the power results. - checklist.py now enforces
__future__
imports in .py files.
- Modified
skbio.stats.distance.pwmantel
to accept a list of filepaths. This is useful as it allows for a smaller amount of memory consumption as it only loads two matrices at a time as opposed to requiring that all distance matrices are loaded into memory. - Added
skbio.util.find_duplicates
for finding duplicate elements in an iterable.
- Fixed floating point precision bugs in
Alignment.position_frequencies
,Alignment.position_entropies
,Alignment.omit_gap_positions
,Alignment.omit_gap_sequences
,BiologicalSequence.k_word_frequencies
, andSequenceCollection.k_word_frequencies
(#801).
- Removed `feature_types` attribute from `BiologicalSequence` and all subclasses (#797).
- Removed `find_features` method from `BiologicalSequence` and `ProteinSequence` (#797).
- `BiologicalSequence.k_word_frequencies` now returns a `collections.defaultdict` of type `float` instead of type `int`. This only affects the "default" case, when a key isn't present in the dictionary. Previous behavior would return `0` as an `int`, while the new behavior is to return `0.0` as a `float`. This change also affects the `defaultdict`s that are returned by `SequenceCollection.k_word_frequencies`.
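
The difference in the "default" case can be seen with plain `collections.defaultdict` objects. The two dictionaries below are hypothetical stand-ins for the old and new return values, not output from scikit-bio itself.

```python
from collections import defaultdict

# Hypothetical stand-ins for the mappings returned before and after the change.
old_style = defaultdict(int, {"AC": 1})
new_style = defaultdict(float, {"AC": 0.5})

# Looking up a k-word that is absent triggers the default factory.
missing_old = old_style["GG"]  # produced by int()
missing_new = new_style["GG"]  # produced by float()

print(type(missing_old).__name__, type(missing_new).__name__)
```

Code that compares the default with `== 0` behaves the same either way; only code that depends on the exact type of the default value is affected.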
- `DissimilarityMatrix` and `DistanceMatrix` now report duplicate IDs in the `DissimilarityMatrixError` message that can be raised during validation.
- Added `plot` method to `skbio.stats.distance.DissimilarityMatrix` for creating basic heatmaps of a dissimilarity/distance matrix (see #684). Also added `_repr_png_` and `_repr_svg_` methods for automatic display in the IPython Notebook, with `png` and `svg` properties for direct access.
- Added `__str__` method to `skbio.stats.ordination.OrdinationResults`.
- Added `skbio.stats.distance.anosim` and `skbio.stats.distance.permanova` functions, which replace the `skbio.stats.distance.ANOSIM` and `skbio.stats.distance.PERMANOVA` classes. These new functions provide simpler procedural interfaces to running these statistical methods. They also provide more convenient access to results by returning a `pandas.Series` instead of a `CategoricalStatsResults` object. These functions have more extensive documentation than their previous versions. If significance tests are suppressed, p-values are returned as `np.nan` instead of `None` for consistency with other statistical methods in scikit-bio (#754).
- Added `skbio.stats.power` for performing empirical power analysis. The module uses existing datasets and iteratively draws samples to estimate the number of samples needed to see a significant difference for a given critical value.
- Added `skbio.stats.isubsample` for subsampling from an unknown number of values. This method supports subsampling from multiple partitions and does not require that all items be stored in memory, requiring approximately `O(N*M)` space where `N` is the number of partitions and `M` is the maximum subsample size.
- Added `skbio.stats.subsample_counts`, which replaces `skbio.stats.subsample`. See deprecation section below for more details (#770).
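
To make the `O(N*M)` space bound for streaming subsampling concrete, the sketch below uses per-partition reservoir sampling, a standard technique for sampling from a stream of unknown length. Whether `isubsample` works this way internally is an assumption; the function name, the `partition_of` parameter, and the `(partition, value)` item layout are all hypothetical.

```python
import random
from collections import defaultdict


def reservoir_subsample(items, maximum, partition_of=lambda item: item[0], seed=None):
    """Keep at most `maximum` uniformly sampled items per partition."""
    rng = random.Random(seed)
    reservoirs = defaultdict(list)  # partition -> kept items (never more than `maximum`)
    seen = defaultdict(int)         # partition -> number of items observed so far
    for item in items:
        key = partition_of(item)
        seen[key] += 1
        bucket = reservoirs[key]
        if len(bucket) < maximum:
            bucket.append(item)
        else:
            # Keep the new item with probability maximum / seen[key],
            # evicting a uniformly chosen resident of the reservoir.
            j = rng.randrange(seen[key])
            if j < maximum:
                bucket[j] = item
    return dict(reservoirs)


# A stream with two partitions: "a" is large, "b" is smaller than the maximum.
stream = [("a", i) for i in range(1000)] + [("b", i) for i in range(3)]
result = reservoir_subsample(iter(stream), maximum=5, seed=42)
print({key: len(values) for key, values in result.items()})
```

Only the reservoirs are held in memory, so storage grows with the number of partitions times the maximum subsample size rather than with the length of the stream.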
- Fixed issue where SSW wouldn't compile on i686 architectures (#409).
- Deprecated `skbio.stats.p_value_to_str`. This function will be removed in scikit-bio 0.3.0. Permutation-based p-values in scikit-bio are calculated as `(num_extreme + 1) / (num_permutations + 1)`, so it is impossible to obtain a p-value of zero. This function historically existed for correcting the number of digits displayed when obtaining a p-value of zero. Since this is no longer possible, this functionality will be removed.
- Deprecated `skbio.stats.distance.ANOSIM` and `skbio.stats.distance.PERMANOVA` in favor of `skbio.stats.distance.anosim` and `skbio.stats.distance.permanova`, respectively.
- Deprecated `skbio.stats.distance.CategoricalStatsResults` in favor of using `pandas.Series` to store statistical method results. `anosim` and `permanova` return `pandas.Series` instead of `CategoricalStatsResults`.
- Deprecated `skbio.stats.subsample` in favor of `skbio.stats.subsample_counts`, which provides an identical interface; only the function name has changed. `skbio.stats.subsample` will be removed in scikit-bio 0.3.0.
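
The p-value formula quoted in the `p_value_to_str` note can be illustrated with a small, self-contained permutation test. The test statistic (a one-sided difference in means) and the function name are assumptions chosen for brevity; only the final `(num_extreme + 1) / (num_permutations + 1)` step comes from the note above.

```python
import random


def permutation_p_value(x, y, num_permutations=999, seed=0):
    """One-sided permutation test on the difference in means (illustrative)."""
    rng = random.Random(seed)
    observed = sum(x) / len(x) - sum(y) / len(y)
    pooled = list(x) + list(y)
    num_extreme = 0
    for _ in range(num_permutations):
        rng.shuffle(pooled)
        perm_x, perm_y = pooled[:len(x)], pooled[len(x):]
        statistic = sum(perm_x) / len(perm_x) - sum(perm_y) / len(perm_y)
        if statistic >= observed:
            num_extreme += 1
    # The observed arrangement counts as one of the permutations, so the
    # smallest possible p-value is 1 / (num_permutations + 1), never zero.
    return (num_extreme + 1) / (num_permutations + 1)


p = permutation_p_value([10, 11, 12, 13], [1, 2, 3, 4], num_permutations=99)
print(p)
```

Because the numerator is at least 1, reporting more permutations only lowers the floor on the p-value; it never reaches zero.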
- Deprecation warnings are now raised using `DeprecationWarning` instead of `UserWarning` (#774).
- The `pandas.DataFrame` returned by `skbio.stats.distance.pwmantel` now stores p-values as floats and does not convert them to strings with a specific number of digits. p-values that were previously stored as "N/A" are now stored as `np.nan` for consistency with other statistical methods in scikit-bio. See note in "Deprecated functionality" above regarding `p_value_to_str` for details.
- scikit-bio now supports versions of IPython < 2.0.0 (#767).
This is an alpha release of scikit-bio. At this stage, major backwards-incompatible API changes can and will happen. Unified I/O with the scikit-bio I/O registry was the focus of this release.
- Added `strict` and `lookup` optional parameters to `skbio.stats.distance.mantel` for handling reordering and matching of IDs when provided `DistanceMatrix` instances as input (these parameters were previously only available in `skbio.stats.distance.pwmantel`).
- `skbio.stats.distance.pwmantel` now accepts an iterable of `array_like` objects. Previously, only `DistanceMatrix` instances were allowed.
- Added `plot` method to `skbio.stats.ordination.OrdinationResults` for creating basic 3-D matplotlib scatterplots of ordination results, optionally colored by metadata in a `pandas.DataFrame` (see #518). Also added `_repr_png_` and `_repr_svg_` methods for automatic display in the IPython Notebook, with `png` and `svg` properties for direct access.
- Added `skbio.stats.ordination.assert_ordination_results_equal` for comparing `OrdinationResults` objects for equality in unit tests.
- `BiologicalSequence` (and its subclasses) now optionally store Phred quality scores. A biological sequence's quality scores are stored as a 1-D `numpy.ndarray` of nonnegative integers that is the same length as the biological sequence. Quality scores can be provided upon object instantiation via the keyword argument `quality`, and can be retrieved via the `BiologicalSequence.quality` property. `BiologicalSequence.has_quality` is also provided for determining whether a biological sequence has quality scores or not. See #616 for more details.
- Added `BiologicalSequence.sequence` property for retrieving the underlying string representing the sequence characters. This was previously (and still is) accessible via `BiologicalSequence.__str__`. It is provided via a property for convenience and explicitness.
- Added `BiologicalSequence.equals` for full control over equality testing of biological sequences. By default, biological sequences must have the same type, underlying sequence of characters, identifier, description, and quality scores to compare equal. These properties can be ignored via the keyword argument `ignore`. The behavior of `BiologicalSequence.__eq__`/`__ne__` remains unchanged (only type and underlying sequence of characters are compared).
- Added `BiologicalSequence.copy` for creating a copy of a biological sequence, optionally with one or more attributes updated.
- `BiologicalSequence.__getitem__` now supports specifying a sequence of indices to take from the biological sequence.
- Methods to read and write taxonomies are now available under `skbio.tree.TreeNode.from_taxonomy` and `skbio.tree.TreeNode.to_taxonomy` respectively.
- Added `SequenceCollection.update_ids`, which provides a flexible way of updating sequence IDs on a `SequenceCollection` or `Alignment` (note that a new object is returned, since instances of these classes are immutable). Deprecated `SequenceCollection.int_map` in favor of this new method; it will be removed in scikit-bio 0.3.0.
- Added `skbio.util.cardinal_to_ordinal` for converting a cardinal number to ordinal string (e.g., useful for error messages).
- New I/O Registry: supports multiple file formats, automatic file format detection when reading, unified procedural `skbio.io.read` and `skbio.io.write` in addition to OOP interfaces (`read/write` methods) on the below objects. See `skbio.io` for more details.
  - Added "clustal" format support:
    - Has sniffer
    - Readers: `Alignment`
    - Writers: `Alignment`
  - Added "lsmat" format support:
    - Has sniffer
    - Readers: `DissimilarityMatrix`, `DistanceMatrix`
    - Writers: `DissimilarityMatrix`, `DistanceMatrix`
  - Added "ordination" format support:
    - Has sniffer
    - Readers: `OrdinationResults`
    - Writers: `OrdinationResults`
  - Added "newick" format support:
    - Has sniffer
    - Readers: `TreeNode`
    - Writers: `TreeNode`
  - Added "phylip" format support:
    - No sniffer
    - Readers: None
    - Writers: `Alignment`
  - Added "qseq" format support:
    - Has sniffer
    - Readers: generator of `BiologicalSequence` or its subclasses, `SequenceCollection`, `BiologicalSequence`, `NucleotideSequence`, `DNASequence`, `RNASequence`, `ProteinSequence`
    - Writers: None
  - Added "fasta"/QUAL format support:
    - Has sniffer
    - Readers: generator of `BiologicalSequence` or its subclasses, `SequenceCollection`, `Alignment`, `BiologicalSequence`, `NucleotideSequence`, `DNASequence`, `RNASequence`, `ProteinSequence`
    - Writers: same as readers
  - Added "fastq" format support:
    - Has sniffer
    - Readers: generator of `BiologicalSequence` or its subclasses, `SequenceCollection`, `Alignment`, `BiologicalSequence`, `NucleotideSequence`, `DNASequence`, `RNASequence`, `ProteinSequence`
    - Writers: same as readers
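
The idea behind a registry with sniffers and automatic format detection can be sketched in a few lines. Everything below (`REGISTRY`, `register`, `sniff`, `read`, the `fmt` parameter, and the two toy formats) is hypothetical and far simpler than the real `skbio.io` machinery; it only shows how a reader can be chosen when the caller does not name a format.

```python
# Hypothetical toy registry: format name -> (sniffer, reader).
REGISTRY = {}


def register(fmt, sniffer, reader):
    """Associate a format name with a sniffer and a reader function."""
    REGISTRY[fmt] = (sniffer, reader)


def sniff(text):
    """Return the single registered format whose sniffer accepts `text`."""
    matches = [fmt for fmt, (sniffer, _) in REGISTRY.items() if sniffer(text)]
    if len(matches) != 1:
        raise ValueError("could not identify a unique format: %r" % matches)
    return matches[0]


def read(text, fmt=None):
    """Read `text`, sniffing the format when none is given."""
    if fmt is None:
        fmt = sniff(text)
    _, reader = REGISTRY[fmt]
    return reader(text)


def read_fasta(text):
    """Parse FASTA-like text into (id, sequence) tuples."""
    records = []
    for line in text.splitlines():
        if line.startswith(">"):
            records.append([line[1:].strip(), ""])
        elif records:
            records[-1][1] += line.strip()
    return [tuple(record) for record in records]


register("fasta", lambda text: text.startswith(">"), read_fasta)
register("newick", lambda text: text.strip().endswith(";"), lambda text: text.strip())

sequences = read(">s1\nACGT\nAC\n>s2\nGG\n")  # format detected by sniffing
tree = read("((a,b),c);")                     # format detected by sniffing
print(sequences, tree)
```

A format without a sniffer (such as "phylip" in the list above) would have to be requested explicitly, which is what the optional `fmt` argument models here.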
- Removed `constructor` parameter from `Alignment.k_word_frequencies`, `BiologicalSequence.k_words`, `BiologicalSequence.k_word_counts`, and `BiologicalSequence.k_word_frequencies` as it had no effect (it was never hooked up in the underlying code). `BiologicalSequence.k_words` now returns a generator of `BiologicalSequence` objects instead of strings.
- Modified the `Alignment` constructor to verify that all sequences have the same length; if they do not, an `AlignmentError` exception is raised. Updated the method `Alignment.subalignment` to calculate the indices only once now that identical sequence length is guaranteed.
- Deprecated `constructor` parameter in `Alignment.majority_consensus` in favor of having users call `str` on the returned `BiologicalSequence`. This parameter will be removed in scikit-bio 0.3.0.
- Existing I/O functionality deprecated in favor of I/O registry; old functionality will be removed in scikit-bio 0.3.0. All functionality can be found at `skbio.io.read`, `skbio.io.write`, and the methods listed below:
  - Deprecated the following "clustal" readers/writers:
    - `write_clustal` -> `Alignment.write`
    - `parse_clustal` -> `Alignment.read`
  - Deprecated the following distance matrix format ("lsmat") readers/writers:
    - `DissimilarityMatrix.from_file` -> `DissimilarityMatrix.read`
    - `DissimilarityMatrix.to_file` -> `DissimilarityMatrix.write`
    - `DistanceMatrix.from_file` -> `DistanceMatrix.read`
    - `DistanceMatrix.to_file` -> `DistanceMatrix.write`
  - Deprecated the following ordination format ("ordination") readers/writers:
    - `OrdinationResults.from_file` -> `OrdinationResults.read`
    - `OrdinationResults.to_file` -> `OrdinationResults.write`
  - Deprecated the following "newick" readers/writers:
    - `TreeNode.from_file` -> `TreeNode.read`
    - `TreeNode.from_newick` -> `TreeNode.read`
    - `TreeNode.to_newick` -> `TreeNode.write`
  - Deprecated the following "phylip" writers:
    - `Alignment.to_phylip` -> `Alignment.write`
  - Deprecated the following "fasta"/QUAL readers/writers:
    - `SequenceCollection.from_fasta_records` -> `SequenceCollection.read`
    - `SequenceCollection.to_fasta` -> `SequenceCollection.write`
    - `fasta_from_sequences` -> `skbio.io.write(obj, into=<file>, format='fasta')`
    - `fasta_from_alignment` -> `Alignment.write`
    - `parse_fasta` -> `skbio.io.read(<fasta>, format='fasta')`
    - `parse_qual` -> `skbio.io.read(<fasta>, format='fasta', qual=<file>)`
    - `BiologicalSequence.to_fasta` -> `BiologicalSequence.write`
  - Deprecated the following "fastq" readers/writers:
    - `parse_fastq` -> `skbio.io.read(<fastq>, format='fastq')`
    - `format_fastq_record` -> `skbio.io.write(<fastq>, format='fastq')`
- `skbio.stats.distance.mantel` now returns a 3-element tuple containing correlation coefficient, p-value, and the number of matching rows/cols in the distance matrices (`n`). The return value was previously a 2-element tuple containing only the correlation coefficient and p-value.
- `skbio.stats.distance.mantel` reorders input `DistanceMatrix` instances based on matching IDs (see optional parameters `strict` and `lookup` for controlling this behavior). In the past, `DistanceMatrix` instances were treated the same as `array_like` input and no reordering took place, regardless of ID (mis)matches. `array_like` input behavior remains the same.
- If mismatched types are provided to `skbio.stats.distance.mantel` (e.g., a `DistanceMatrix` and `array_like`), a `TypeError` will be raised.
- Added git timestamp checking to checklist.py, ensuring that when changes are made to Cython (.pyx) files, their corresponding generated C files are also updated.
- Fixed performance bug when instantiating `BiologicalSequence` objects. The previous runtime scaled linearly with sequence length; it is now constant time when the sequence is already a string. See #623 for details.
- IPython and six are now required dependencies.
This is an initial alpha release of scikit-bio. At this stage, major backwards-incompatible API changes can and will happen. Many backwards-incompatible API changes were made since the previous release.
- Added ability to compute distances between sequences in a `SequenceCollection` object (#509), and expanded `Alignment.distance` to allow the user to pass a function for computing distances (the default distance metric is still `scipy.spatial.distance.hamming`) (#194).
- Added functionality to not penalize terminal gaps in global alignment. This functionality results in more biologically relevant global alignments (see #537 for discussion of the issue) and is now the default behavior for global alignment.
- The Python global aligners (`global_pairwise_align`, `global_pairwise_align_nucleotide`, and `global_pairwise_align_protein`) now support aligning pairs of sequences, pairs of alignments, and a sequence and an alignment (see #550). This functionality supports progressive multiple sequence alignment, among other things such as adding a sequence to an existing alignment.
- Added `StockholmAlignment.to_file` for writing Stockholm-formatted files.
- Added `strict=True` optional parameter to `DissimilarityMatrix.filter`.
- Added `TreeNode.find_all` for finding all tree nodes that match a given name.
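
The effect of not penalizing terminal gaps can be seen in a score-only Needleman-Wunsch sketch. This is the textbook "free end gaps" variant with a linear gap penalty, written under the assumption that it captures the spirit of the change; the function name, parameters, and scoring values are hypothetical and do not mirror the scikit-bio aligners.

```python
def global_align_score(a, b, match=1, mismatch=-1, gap=-2,
                       penalize_terminal_gaps=False):
    """Score of the best global alignment of `a` and `b` (linear gap cost)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if penalize_terminal_gaps:
        # Classic initialization: leading gaps cost `gap` each.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    # Otherwise the first row and column stay at zero: leading gaps are free.
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + substitution,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    if penalize_terminal_gaps:
        return score[-1][-1]
    # Trailing gaps are free: take the best cell in the last row or column.
    return max(max(score[-1]), max(row[-1] for row in score))


free = global_align_score("ACGT", "TTACGTTT")
penalized = global_align_score("ACGT", "TTACGTTT", penalize_terminal_gaps=True)
print(free, penalized)
```

With terminal gaps penalized, the short sequence pays for every overhanging position of the longer one even though the shared region matches perfectly, which is the behavior the change above moves away from.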
- Fixed bug that resulted in a `ValueError` from `local_align_pairwise_nucleotide` (see #504) under many circumstances. This would not generate incorrect results, but would cause the code to fail.
- Removed `skbio.math`, leaving `stats` and `diversity` to become top level packages. For example, instead of `from skbio.math.stats.ordination import PCoA` you would now import `from skbio.stats.ordination import PCoA`.
- The module `skbio.math.gradient` as well as the contents of `skbio.math.subsample` and `skbio.math.stats.misc` are now found in `skbio.stats`. As an example, to import subsample: `from skbio.stats import subsample`; to import everything from gradient: `from skbio.stats.gradient import *`.
- The contents of `skbio.math.stats.ordination.utils` are now in `skbio.stats.ordination`.
- Removed `skbio.app` subpackage (i.e., the application controller framework) as this code has been ported to the standalone burrito Python package. This code was not specific to bioinformatics and is useful for wrapping command-line applications in general.
- Removed `skbio.core`, leaving `alignment`, `genetic_code`, `sequence`, `tree`, and `workflow` to become top level packages. For example, instead of `from skbio.core.sequence import DNA` you would now import `from skbio.sequence import DNA`.
- Removed `skbio.util.exception` and `skbio.util.warning` (see #577 for the reasoning behind this change). The exceptions/warnings were moved to the following locations:
  - `FileFormatError`, `RecordError`, `FieldError`, and `EfficiencyWarning` have been moved to `skbio.util`
  - `BiologicalSequenceError` has been moved to `skbio.sequence`
  - `SequenceCollectionError` and `StockholmParseError` have been moved to `skbio.alignment`
  - `DissimilarityMatrixError`, `DistanceMatrixError`, `DissimilarityMatrixFormatError`, and `MissingIDError` have been moved to `skbio.stats.distance`
  - `TreeError`, `NoLengthError`, `DuplicateNodeError`, `MissingNodeError`, and `NoParentError` have been moved to `skbio.tree`
  - `FastqParseError` has been moved to `skbio.parse.sequences`
  - `GeneticCodeError`, `GeneticCodeInitError`, and `InvalidCodonError` have been moved to `skbio.genetic_code`
- The contents of `skbio.genetic_code` (formerly `skbio.core.genetic_code`) are now in `skbio.sequence`. The `GeneticCodes` dictionary is now a function `genetic_code`. The functionality is the same, except that because this is now a function rather than a dict, retrieving a genetic code is done using a function call rather than a lookup (so, for example, `GeneticCodes[2]` becomes `genetic_code(2)`).
- Many submodules have been made private with the intention of simplifying imports for users. See #562 for discussion of this change. The following list contains the previous module name and where imports from that module should now come from.
  - `skbio.alignment.ssw` to `skbio.alignment`
  - `skbio.alignment.alignment` to `skbio.alignment`
  - `skbio.alignment.pairwise` to `skbio.alignment`
  - `skbio.diversity.alpha.base` to `skbio.diversity.alpha`
  - `skbio.diversity.alpha.gini` to `skbio.diversity.alpha`
  - `skbio.diversity.alpha.lladser` to `skbio.diversity.alpha`
  - `skbio.diversity.beta.base` to `skbio.diversity.beta`
  - `skbio.draw.distributions` to `skbio.draw`
  - `skbio.stats.distance.anosim` to `skbio.stats.distance`
  - `skbio.stats.distance.base` to `skbio.stats.distance`
  - `skbio.stats.distance.permanova` to `skbio.stats.distance`
  - `skbio.distance` to `skbio.stats.distance`
  - `skbio.stats.ordination.base` to `skbio.stats.ordination`
  - `skbio.stats.ordination.canonical_correspondence_analysis` to `skbio.stats.ordination`
  - `skbio.stats.ordination.correspondence_analysis` to `skbio.stats.ordination`
  - `skbio.stats.ordination.principal_coordinate_analysis` to `skbio.stats.ordination`
  - `skbio.stats.ordination.redundancy_analysis` to `skbio.stats.ordination`
  - `skbio.tree.tree` to `skbio.tree`
  - `skbio.tree.trie` to `skbio.tree`
  - `skbio.util.misc` to `skbio.util`
  - `skbio.util.testing` to `skbio.util`
  - `skbio.util.exception` to `skbio.util`
  - `skbio.util.warning` to `skbio.util`
- Moved `skbio.distance` contents into `skbio.stats.distance`.
- Relaxed requirement in `BiologicalSequence.distance` that sequences being compared are of equal length. This is relevant for Hamming distance, so the check is still performed in that case, but other distance metrics may not have that requirement (see #504).
- Renamed `powertrip.py` repo-checking script to `checklist.py` for clarity.
- `checklist.py` now ensures that all unit tests import from a minimally deep API. For example, it will produce an error if `skbio.core.distance.DistanceMatrix` is used over `skbio.DistanceMatrix`.
- Extra dimension is no longer calculated in `skbio.stats.spatial.procrustes`.
- Expanded documentation in various subpackages.
- Added new scikit-bio logo. Thanks Alina Prassas!
This is a pre-alpha release. At this stage, major backwards-incompatible API changes can and will happen.
- Added Python implementations of Smith-Waterman and Needleman-Wunsch alignment as `skbio.core.alignment.pairwise.local_pairwise_align` and `skbio.core.alignment.pairwise.global_pairwise_align`. These are much slower than native C implementations (e.g., `skbio.core.alignment.local_pairwise_align_ssw`) and as a result raise an `EfficiencyWarning` when called, but they are included because they are simple to experiment with and serve as useful educational examples.
- Added `skbio.core.diversity.beta.pw_distances` and `skbio.core.diversity.beta.pw_distances_from_table`. These provide convenient access to the `scipy.spatial.distance.pdist` beta diversity metrics from within scikit-bio. The `skbio.core.diversity.beta.pw_distances_from_table` function will only be available temporarily, until the `biom.table.Table` object is merged into scikit-bio (see #489), at which point `skbio.core.diversity.beta.pw_distances` will be updated to use that.
- Added `skbio.core.alignment.StockholmAlignment`, which provides support for parsing Stockholm-formatted alignment files and working with those alignments in the context of RNA secondary structural information.
- Added `skbio.core.tree.majority_rule` function for computing consensus trees from a list of trees.
- Function `skbio.core.alignment.align_striped_smith_waterman` renamed to `local_pairwise_align_ssw` and now returns an `Alignment` object instead of an `AlignmentStructure`.
- The following keyword-arguments for `StripedSmithWaterman` and `local_pairwise_align_ssw` have been renamed:
  - `gap_open` -> `gap_open_penalty`
  - `gap_extend` -> `gap_extend_penalty`
  - `match` -> `match_score`
  - `mismatch` -> `mismatch_score`
- Removed `skbio.util.sort` module in favor of natsort package.
- Added powertrip.py script to perform basic sanity-checking of the repo based on recurring issues that weren't being caught until release time; added to Travis build.
- Added RELEASE.md with release instructions.
- Added intersphinx mappings to docs so that "See Also" references to numpy, scipy, matplotlib, and pandas are hyperlinks.
- The following classes are no longer `namedtuple` subclasses (see #359 for the rationale):
  - `skbio.math.stats.ordination.OrdinationResults`
  - `skbio.math.gradient.GroupResults`
  - `skbio.math.gradient.CategoryResults`
  - `skbio.math.gradient.GradientANOVAResults`
- Added coding guidelines draft.
- Added new alpha diversity formulas to the `skbio.math.diversity.alpha` documentation.
This is a pre-alpha release. At this stage, major backwards-incompatible API changes can and will happen.
- Added `enforce_qual_range` parameter to `parse_fastq` (on by default, maintaining backward compatibility). This allows disabling of the quality score range-checking.
- Added `skbio.core.tree.nj`, which applies neighbor-joining for phylogenetic reconstruction.
- Added `bioenv`, `mantel`, and `pwmantel` distance-based statistics to `skbio.math.stats.distance` subpackage.
- Added `skbio.math.stats.misc` module for miscellaneous stats utility functions.
- IDs are now optional when constructing a `DissimilarityMatrix` or `DistanceMatrix` (monotonically-increasing integers cast as strings are automatically used).
- Added `DistanceMatrix.permute` method for randomly permuting rows and columns of a distance matrix.
- Added the following methods to `DissimilarityMatrix`: `filter`, `index`, and `__contains__` for ID-based filtering, index lookup, and membership testing, respectively.
- Added `ignore_comment` parameter to `parse_fasta` (off by default, maintaining backward compatibility). This handles stripping the comment field from the header line (i.e., all characters beginning with the first space) before returning the label.
- Added imports of `BiologicalSequence`, `NucleotideSequence`, `DNA`, `DNASequence`, `RNA`, `RNASequence`, `Protein`, `ProteinSequence`, `DistanceMatrix`, `align_striped_smith_waterman`, `SequenceCollection`, `Alignment`, `TreeNode`, `nj`, `parse_fasta`, `parse_fastq`, `parse_qual`, `FastaIterator`, `FastqIterator`, `SequenceIterator` in `skbio/__init__.py` for convenient importing. For example, it's now possible to `from skbio import Alignment`, rather than `from skbio.core.alignment import Alignment`.
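
The comment-stripping behavior described for `ignore_comment` in the list above amounts to cutting the header at its first space. The helper below is a hypothetical stand-alone sketch of that rule, not the `parse_fasta` code itself.

```python
def fasta_label(header, ignore_comment=False):
    """Extract the label from a FASTA header line (illustrative only)."""
    # Drop the leading '>' marker if present.
    label = header[1:] if header.startswith(">") else header
    label = label.strip()
    if ignore_comment:
        # Discard everything from the first space onward (the comment field).
        label = label.partition(" ")[0]
    return label


full = fasta_label(">seq1 Homo sapiens mitochondrion")
short = fasta_label(">seq1 Homo sapiens mitochondrion", ignore_comment=True)
print(repr(full), repr(short))
```

Leaving the option off by default keeps the full header as the label, which matches the backward-compatible behavior noted in the list.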
- Fixed a couple of unit tests that could fail stochastically.
- Added missing `__init__.py` files to a couple of test directories so that these tests won't be skipped.
- `parse_fastq` now raises an error on dangling records.
- Fixed several warnings that were raised while running the test suite with Python 3.4.
- Functionality imported from `skbio.core.ssw` must now be imported from `skbio.core.alignment` instead.
- Code is now flake8-compliant; added flake8 checking to Travis build.
- Various additions and improvements to documentation (API, installation instructions, developer instructions, etc.).
- `__future__` imports are now standardized across the codebase.
- New website front page and styling changes throughout. Moved docs site to its own versioned subdirectories.
- Reorganized alignment data structures and algorithms (e.g., SSW code, `Alignment` class, etc.) into an `skbio.core.alignment` subpackage.
Fixes to setup.py. This is a pre-alpha release. At this stage, major backwards-incompatible API changes can and will happen.
Initial pre-alpha release. At this stage, major backwards-incompatible API changes can and will happen.