All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog and this project adheres to Semantic Versioning.
1.12.1 - 2025-01-23
- Fix bug when cloning FeatureToken.
1.12.0 - 2025-01-11
- Changed CombinerFeatureDocumentParser.include_detached_features to default to using FeatureToken.{get,set}_feature semantics.
- Dropped support for Python 3.10.
- A text indexing and search class that finds feature spans in text with mangled white space.
- Feature ID mapping in the aggregating parser CombinerFeatureDocumentParser class.
- Replaced FeatureToken.{get,set}_value with a more robust {get,set}_feature.
- Upgraded to zensols.util version 1.15.
1.11.1 - 2024-05-11
- A method FeatureToken.set_value that sets a value by attribute.
- A token container decorator that copies features.
1.11.0 - 2024-04-14
Feature release with significant modification to feature merging document parsers.
- A composite parser that combines several parsers, each with their own rules of copying (or clobbering).
- The combiner parser CombinerFeatureDocumentParser, and subclasses, are now optimized to avoid re-parsing for shared parsers. This is the case with the zensols.mednlp parsers that migrate features down to the same parser.
- Fixed some features not copied in combiner parsers after a token clone.
- The spaCy and combiner parsers are removed from the default zensols.nlp package import.
- Add TokenContainer class to decorator hierarchy.
- Rename classes:
  - StripSentenceDecorator to StripTokenContainerDecorator
  - UpdateDocumentDecorator to UpdateTokenContainerDecorator
- Rename resource library configuration:
  - strip_sentence_decorator to strip_token_container_decorator
  - update_document_decorator to update_token_container_decorator
- CombinerFeatureDocumentParser now extends from DecoratedFeatureDocumentParser with target_parser becoming delegate. Token features now come from the delegate, or are stored in the DecoratedFeatureDocumentParser when they don't exist in the delegate.
1.10.0 - 2024-02-27
A class name typo is the impetus for this being a new minor release (even if the release is mostly for bug fixes).
- Add token level annotations to TokenAnnotatedFeatureDocument.
- Yielded feature defaults in CombinerFeatureDocumentParser.
- Class name typo for TokenAnnotatedFeatureDocument.
- Fixed bug on CombinerFeatureDocumentParser where Nones were not replaced by a source parser.
- Added token_feature_ids to CombinerFeatureDocumentParser to facilitate token feature passing.
- Lexical span gaps end boundary edge case bug fix.
- Minor bug fixes.
1.9.2 - 2024-01-11
- The CachingFeatureDocumentParser is now configurable with decorators.
1.9.1 - 2024-01-04
- Added an API, parser components, and unit tests to split tokens.
- Added missing text column on the feature document Pandas dataframe.
- Bug fixes to FeatureDocument sentence combining.
- White space tokenization parser no longer inherits from the spaCy parser, and needs no configuration.
1.9.0 - 2023-12-05
Upgrade and Python deprecation release.
- Upgrade to spaCy version 3.6.
- Upgrade to zensols.util version 1.14.
- Support for Python 3.11.
- Optional dependencies for scoring methods.
- Dropped support for Python 3.9.
1.8.1 - 2023-11-29
- A simple FeatureSentenceFactory that creates sentence instances from tokens.
- FeatureToken bug fixes.
- Reduce pickle data footprint.
- Span normalization.
- Reduce flake8 warnings; improve type hints and documentation.
1.8.0 - 2023-08-16
Functional and downstream moderate risk update release.
- TokenContainer.norm removes newlines of the normalized text.
- FeatureToken hash function.
- Fix text mangling in sub-document FeatureDocument.get_overlapping method.
- Refactor hash and equal compare methods in TokenContainer.
- Terse writing for TokenContainer and FeatureToken.
- Rule based paragraph and list item chunkers.
- FeatureDocument.reindex and method to clear cached state with unit tests.
1.7.3 - 2023-06-29
- FeatureToken detached features are transmitted by the CombinerFeatureDocumentParser.
1.7.2 - 2023-06-27
- Move spaCy parser and supporting classes to a separate module.
- Feature to auto load any missing spaCy models at runtime. This feature (doc_parser.auto_install_model) must be turned on to be used.
1.7.1 - 2023-06-20
- Feature to add None values to missing overwritten features in CombinerFeatureDocumentParser.
1.7.0 - 2023-06-07
- Fixed type exception bug on Feature.to_sentence.
- Fix raised exception for overlapped methods on 0-length documents.
- Remove spaCy artifacts from parser decorators (i.e. SpacyFeatureDocumentDecorator -> FeatureDocumentDecorator) to generalize to non-spaCy document parsers and other components (deepnlp transformer embedding populators).
- Right lexical span inclusive parameter for all TokenContainer.get_overlapping* methods.
- Empty versions of TokenContainer subclasses.
- Added a default instance of a FeatureDocumentParser that does not require a resource library configuration.
- A TokenContainer.canonical that provides a canonical representation of the token container.
- A right inclusive flag on TokenContainer overlapping methods.
- Container methods to update token spans for split entities and a decorator.
- Levenshtein edit distance based scoring module.
- Exact match scoring module.
- SemEval-2013 Task 9.1 scoring module.
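
For readers unfamiliar with the metric behind the Levenshtein based scoring module added in 1.7.0, the following is a minimal standalone sketch. It does not use or reproduce the library's own implementation; the function names `levenshtein` and `similarity`, and the normalization by the longer string's length, are assumptions made for illustration only.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, or substitutions needed to turn ``a`` into ``b``."""
    # Keep the shorter string in the inner loop so the rows stay small.
    if len(a) < len(b):
        a, b = b, a
    # prev[j] is the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Map the edit distance to [0, 1], where 1.0 means identical strings.
    This normalization is an illustrative choice, not the library's."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest
```

With this sketch, `levenshtein("kitten", "sitting")` evaluates to 3, and `similarity` returns 1.0 for two identical strings.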
1.6.0 - 2023-04-05
- Backwards compatible scoring: error handling and correlation IDs.
- More unit tests.
- Handle errors during scoring and robustly provide scores when reporting.
- Make token containers hashable.
- Fixed token overlap on left side of lexical spans.
1.5.0 - 2023-01-23
- Fix TokenContainer indexing bug with edge case on split on space.
- Updated zensols.util to 1.12.1.
- Scoring framework. This includes Bleu via NLTK by default, and optionally ROUGE via optional package support.
- Contiguous sentence index (i_sent) in FeatureDocument.to_sentence.
- Default feature ID set to FeatureToken.
- Removed unused Levenshtein dependency.
1.4.1 - 2022-10-02
- Fixed token indexing bug
1.4.0 - 2022-09-30
- A document stash caching parser CachingFeatureDocumentParser.
- The InterLap library to speed up overlapping token queries.
- Sentence decorator and sentence split space decorator.
- FeatureDocument.sents changed from a list to a tuple.
- Add checks for FeatureDocument.sents and FeatureSentence.sent_tokens as tuples.
- Better (English) normalization of text by adding more apostrophe/contraction syntax.
- The FeatureToken.NONE constant changed from <none> to -<N>-.
- Speed up FeatureToken equals.
- Removed stemmer module from default imports. Use import zensols.nlp.stemmer.
1.3.0 - 2022-08-06
- Token indexing mappings accounting for (named entity) multi-word tokens.
- IOB (iob_, iob) features.
- Re-loadable components and component initializers.
- Upgraded to spaCy 3.2
- Add spaCy tokens to spaCy feature tokens.
- Bug fixes in combining and overlapping sentences.
- Switched to shallow copy of document in overlapping sentence doc methods.
1.2.0 - 2022-06-16
- Remove resource library regular_expression_escape:dollar configuration. Use zensols.util conf_esc:dollar as a replacement.
1.1.2 - 2022-06-14
- Dependency bump.
1.1.1 - 2022-05-15
- Dependency bump.
1.1.0 - 2022-05-04
- Fix resource leaks and other bugs.
- Persist original text along with FeatureDocument rather than reconstruct it from sentence and/or token text.
- A lexical overlapping utility module (overlap).
- A token normalizer that merges tokens into spans (JoinTokenMapper).
- Regular expression matching for entity and merge components (similar to JoinTokenMapper).
- Add back TokenAnnotatedFeatureSentence for downstream packages.
- Add token decorator to spaCy parser to allow adding or modifying features on creation, separate from the parser class hierarchy.
1.0.1 - 2022-01-25
- Sentences and tokens accessible by index.
- More robust regular expression for token splitting.
- Mapping combiner is persistable with spaCy tokens and handles split named entities.
1.0.0 - 2021-10-22
First major development release.
- A FeatureDocumentCombiner that merges features from different document parsers.
- Top level library NLPError.
- A pipeline component and resource configuration library entry to remove sentence boundaries in a spaCy document.
- Split out optional resource library content into mappers.conf.
- The spaCy model has attribute langres set on LanguageResource to enable creation of factory instances from registered pipe components.
- Fix issue with component creation with no pipeline arguments.
- Removed the DocStash instance, as it was too simple for any practical application.
0.1.3 - 2021-09-21
- Dependency.
zensols.nlp.lang.DocStash
0.1.2 - 2021-09-21
- Make FeatureDocumentParser callable.
- Fix memory leak in LanguageResource.
- Configuration Resource library.
- Configuration for keyword arguments to the add_pipe_comp and example.
0.1.1 - 2021-09-07
- Fixed bug with creating a dict from a FeatureToken.
- Fixed/improved how Feature{Token,Sentence,Document} are dictified (with asdict) and how they are written as text with write.
- Creates a Pandas dataframe from token feature attributes.
- Add back FeatureToken feature ID -> type for write dumping.
- Add lexical location SpacyTokenFeatures.loc in the document as a (starting, ending) range.
0.1.0 - 2021-08-16
This release simplifies the token attributes level classes in the features module by:
- Using feature IDs instead of trying to make sense of the class property/attribute member data.
- Using the FeatureDocumentParser and FeatureToken to copy spaCy resources to simple picklable Python classes.
Not only does this greatly reduce complexity in class hierarchy and data copy/move functionality, it also speeds things up.
- Attributes set on detached token features are no longer robust. Before, if a token feature ID was specified, but didn't exist on the source token feature set, it would copy over a None. This now raises an AttributeError instead.
- For TokenAttributes, creation of dicts (either by asdict or get_features) is now consistent with the set attributes and properties of the class. Only the features specified to these methods are included, which default to FIELD_IDS of the class (which can be overridden at a class level).
- Removed the dictionary creation of attribute/property individual features methods TokenAttributes.{string}features. These methods are obviated by get_features, which returns all features in FIELD_IDS.
- Removed FeatureDocumentParser.additional_token_feature_ids to simplify token feature IDs passed to feature tokens.
- Removed the TokenAttributes class, as it was just a metadata member holder.
- A spaCy implementation of the TokenFeatures class, that somewhat resembles the old TokenFeatures of the old class hierarchy.
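
The 0.1.0 change in how missing feature IDs are handled on detached tokens can be pictured with the sketch below. It is a standalone illustration and not the library's code: `SourceToken`, `detach_strict`, and `detach_lenient` are hypothetical names invented here, and only the behavior described above (a None copied before, an AttributeError raised now) comes from this changelog.

```python
from typing import Any, Dict, Iterable


class SourceToken:
    """Hypothetical stand-in for a parsed token (not a zensols.nlp class)."""

    def __init__(self, **features: Any) -> None:
        # Store each keyword argument as an instance attribute.
        self.__dict__.update(features)


def detach_strict(token: SourceToken,
                  feature_ids: Iterable[str]) -> Dict[str, Any]:
    """Copy the requested features; a missing ID raises AttributeError,
    mirroring the behavior described for 0.1.0 and later."""
    return {fid: getattr(token, fid) for fid in feature_ids}


def detach_lenient(token: SourceToken,
                   feature_ids: Iterable[str]) -> Dict[str, Any]:
    """Copy the requested features, filling missing IDs with None,
    mirroring the behavior described for releases before 0.1.0."""
    return {fid: getattr(token, fid, None) for fid in feature_ids}
```

Code that relied on the older, lenient behavior would need to either request only feature IDs that exist on the source token or catch the AttributeError explicitly.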
0.0.15 - 2021-08-07
- Upgrade from spaCy 2.x to 3.x.
- POS feature inclusion by default to support is_pronoun, which is needed after spaCy 3 changed how lemmatization works.
- Move feature containers and parser from zensols.deepnlp, including test cases.
- A sentence index feature (i_sent).
- An index of sentence feature (sent_i).
- Advanced spaCy configuration by adding component classes. This gives more control over configuring the spaCy pipeline.
- Add feature containers (FeatureDocument) and parser (FeatureDocumentParser), which were moved over from zensols.deepnlp.
0.0.14 - 2021-04-29
- Upgrade to zensols.util version 1.4.1.
- Upgrade documentation API generation.
- Nail dependencies to spacy 2.3.5 until pip deps are fixed.
- Added sentence index features to reconstruct sentences from documents.
0.0.13 - 2021-01-14
- Fix component adds for spacy > 2.0.
- Add langres model to API documentation.
0.0.12 - 2020-12-29
- Upgraded zenbuild.
- Switched from Travis to GitHub workflows.
- Tested with Python 3.9.1.
0.0.11 - 2020-12-09
- Add basic token features for non-spacy parse use cases.
- Rename feature type to feature id.
- TokenFeatures is now a dictable with to_dict -> asdict.
0.0.10 - 2020-12-09
- Sphinx documentation, which includes API docs.
- Settable detached TokenAttributes instances.
- Use dataclasses, which requires Python >= 3.7.
0.0.9 - 2020-05-10
- Home/master move lemmatizing out of default token normalizer.
- Update super method calls to modern (at least) Python 3.7.
- Fix annoying can't find smart_open.gcs bogus warning.
- Remove language resource factory.
- Upgrade to zensols.util 1.2.0 and get rid of custom factories.
- Feature to parse whole special tokens.
- Added porter stemmer from nltk.
- Moved word2vec embedding (word2vec.py) to zensols.deepnlp library.
- Moved feature normalization (fnorm.py) to zensols.deepnlp library.
0.0.8 - 2020-04-14
- Upgrade to spaCy 2.2.4 and textacy 0.10.0.
0.0.7 - 2020-01-24
- Added the Porter stemmer from NLTK.
- Better class naming for token mapper.
- Features debugging bug fix.
0.0.6 - 2019-12-14
- Fix Travis.
0.0.5 - 2019-12-14
Data classes are now used so Python 3.7 is now a requirement.
- Feature normalizers were added for neural networks.
- Implemented a better strategy for using language resources with token normalization.
0.0.4 - 2019-11-21
- Adding detachable and picklable token feature set.
0.0.3 - 2019-07-31
- DocStash that parses documents as a factory stash.
0.0.2 - 2019-07-25
- Feature to disable spaCy pipeline components.
- Add configuration for removing punctuation and determiners.
- Skip textacy for document creation since it wasn't used. This is more efficient.
0.0.1 - 2019-07-06
- Initial version.