Change Log

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog and this project adheres to Semantic Versioning.

Unreleased

1.12.1 - 2025-01-23

Changed

Fix bug when cloning FeatureToken.

1.12.0 - 2025-01-11

Removed

CombinerFeatureDocumentParser.include_detached_features to default using FeatureToken.{get,set}_feature semantics.
Dropped support for Python 3.10.

Added

A text indexing and search class that finds feature spans in text with mangled white space.
Feature ID mapping in the aggregating parser CombinerFeatureDocumentParser class.

Changed

Replaced FeatureToken.{get,set}_value with a more robust {get,set}_feature.
Upgraded to zensols.util version 1.15.

1.11.1 - 2024-05-11

Added

A method FeatureToken.set_value that sets a value by attribute.
A token container decorator that copies features.

1.11.0 - 2024-04-14

Feature release with significant modification to feature merging document parsers.

Added

A composite parser that combines several parsers, each with their own rules of copying (or clobbering).

Changed

The combiner parser CombinerFeatureDocumentParser, and subclasses, are now optimized to avoid re-parsing for shared parsers. This is the case with the zensols.mednlp parsers that migrate features down to the same parser.
Fixed some features not copied in combiner parsers after a token clone.

Removed

The spaCy and combiner parsers are removed from the default zensols.nlp package import.

Changed

Add TokenContainer class to decorator hierarchy.
Rename classes:
- StripSentenceDecorator to StripTokenContainerDecorator
- UpdateDocumentDecorator to UpdateTokenContainerDecorator
Rename resource library configuration:
- strip_sentence_decorator to strip_token_container_decorator
- update_document_decorator to update_token_container_decorator
CombinerFeatureDocumentParser now extends from DecoratedFeatureDocumentParser with target_parser becoming delegate. Token features now come from the delegate, or are stored in the DecoratedFeatureDocumentParser when they don't exist in the delegate.

1.10.0 - 2024-02-27

A class name typo is the impetus for this being a new minor release (even though the release is mostly for bug fixes).

Added

Add token level annotations to TokenAnnotatedFeatureDocument.
Yielded feature defaults in CombinerFeatureDocumentParser.

Changed

Fixed class name typo for TokenAnnotatedFeatureDocument.
Fixed bug on CombinerFeatureDocumentParser where Nones were not replaced by a source parser.
Added token_feature_ids to CombinerFeatureDocumentParser to facilitate token feature passing.
Lexical span gaps end boundary edge case bug fix.
Minor bug fixes.

1.9.2 - 2024-01-11

Changed

The CachingFeatureDocumentParser is now configurable with decorators.

1.9.1 - 2024-01-04

Added

Added an API, parser components, and unit tests to split tokens.
Added the missing text column to the feature document Pandas dataframe.

Changed

Bug fixes to FeatureDocument sentence combining.
White space tokenization parser no longer inherits the spaCy parser, and needs no configuration.

1.9.0 - 2023-12-05

Upgrade and Python deprecation release.

Changed

Upgrade to spaCy version 3.6.
Upgrade to zensols.util version 1.14.

Added

Support for Python 3.11.
Optional dependencies for scoring methods.

Removed

Support for Python 3.9.

1.8.1 - 2023-11-29

Added

A simple FeatureSentenceFactory that creates sentence instances from tokens.

Changed

FeatureToken bug fixes.
Reduce pickle data footprint.
Span normalization.
Reduce flake8 warnings; improve type hints and documentation.

1.8.0 - 2023-08-16

Functional and downstream moderate risk update release.

Changed

TokenContainer.norm removes newlines of the normalized text.
FeatureToken hash function.
Fix text mangling in sub-document FeatureDocument.get_overlapping method.
Refactor hash and equal compare methods in TokenContainer.
Terse writing for TokenContainer and FeatureToken.

Added

Rule based paragraph and list item chunkers.
FeatureDocument.reindex and method to clear cached state with unit tests.

1.7.3 - 2023-06-29

Changed

FeatureToken detached features are transmitted by the CombinerFeatureDocumentParser.

1.7.2 - 2023-06-27

Changed

Move spaCy parser and supporting classes to a separate module.
Feature to automatically load any missing spaCy models at runtime. It is used only when doc_parser.auto_install_model is turned on.

1.7.1 - 2023-06-20

Added

Feature to add None values to missing overwritten features in CombinerFeatureDocumentParser.

1.7.0 - 2023-06-07

Changed

Fixed type exception bug on Feature.to_sentence.
Fix raised exception for overlapped methods on 0-length documents.
Remove spaCy artifacts from parser decorators (i.e. SpacyFeatureDocumentDecorator -> FeatureDocumentDecorator) to generalize to non-spaCy document parsers and other components (deepnlp transformer embedding populators).

Added

Right lexical span inclusive parameter for all TokenContainer.get_overlapping* methods.
Empty versions of TokenContainer subclasses.
Added a default instance of a FeatureDocumentParser that does not require a resource library configuration.
A TokenContainer.canonical that provides a canonical representation of the token container.
A right inclusive flag on TokenContainer overlapping methods.
Container methods to update token spans for split entities and a decorator.
Levenshtein edit distance based scoring module.
Exact match scoring module.
SemEval-2013 Task 9.1 scoring module.
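
As a rough illustration of the idea behind the Levenshtein edit distance based scoring listed above, here is a minimal sketch. It is not the library's implementation: the `edit_distance` and `edit_score` names, and the choice to normalize by the longer sequence length, are assumptions made only for this example.

```python
from typing import Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two sequences (characters or tokens)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1]


def edit_score(a: Sequence[str], b: Sequence[str]) -> float:
    """Hypothetical normalization: 1.0 for identical, 0.0 for fully different."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


char_distance = edit_distance('kitten', 'sitting')
token_score = edit_score(['the', 'cat', 'sat'], ['the', 'dog', 'sat'])
```

The same function works on strings and on token lists because it only compares elements for equality.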

1.6.0 - 2023-04-05

Added

Backwards compatible scoring: error handling and correlation IDs.
More unit tests.
Handle errors during scoring and robustly provide scores when reporting.
Make token containers hashable.

Changed

Fixed token overlap on left side of lexical spans.

1.5.0 - 2023-01-23

Changed

Fix TokenContainer indexing bug with edge case on split on space.
Updated zensols.util to 1.12.1.

Added

Scoring framework. This includes Bleu via NLTK by default, and optionally ROUGE via optional package support.
Contiguous sentence index (i_sent) in FeatureDocument.to_sentence.
Default feature ID set to FeatureToken.

Removed

Unused Levenshtein dependency.

1.4.1 - 2022-10-02

Changed

Fixed token indexing bug.

1.4.0 - 2022-09-30

Added

A document stash caching parser CachingFeatureDocumentParser.
The InterLap library to speed up overlapping token queries.
Sentence decorator and sentence split space decorator.

Changed

FeatureDocument.sents changed from a list to a tuple.
Add checks for FeatureDocument.sents and FeatureSentence.sent_tokens as tuples.
Better (English) normalization of text by adding more apostrophe/contraction syntax.
The FeatureToken.NONE constant changed from <none> to -<N>-.
Speed up FeatureToken equals.

Removed

Removed stemmer module from default imports. Use import zensols.nlp.stemmer.

1.3.0 - 2022-08-06

Added

Token indexing mappings accounting for (named entity) multi-word tokens.
IOB (iob_, iob) features.
Re-loadable components and component initializers.

Changed

Upgraded to spaCy 3.2.
Add spaCy tokens to spaCy feature tokens.
Bug fixes in combining and overlapping sentences.
Switched to shallow copy of document in overlapping sentence doc methods.

1.2.0 - 2022-06-16

Removed

Remove resource library regular_expression_escape:dollar configuration. Use zensols.util conf_esc:dollar as a replacement.

1.1.2 - 2022-06-14

Changed

Dependency bump.

1.1.1 - 2022-05-15

Changed

Dependency bump.

1.1.0 - 2022-05-04

Changed

Fix resource leaks and other bugs.
Persist original text along with FeatureDocument rather than reconstruct it from sentence and/or token text.

Added

A lexical overlapping utility module (overlap).
A token normalizer that merges tokens into spans (JoinTokenMapper).
Regular expression matching for entity and merge components (similar to JoinTokenMapper).
Add back TokenAnnotatedFeatureSentence for downstream packages.
Add token decorator to the spaCy parser to allow adding or modifying features on creation, separate from the parser class hierarchy.

1.0.1 - 2022-01-25

Added

Sentences and tokens accessible by index.

Changed

More robust regular expression for token splitting.
Mapping combiner is persistable with spaCy tokens and handles split named entities.

1.0.0 - 2021-10-22

First major development release.

Added

A FeatureDocumentCombiner that merges features from different document parsers.
Top level library NLPError.
A pipeline component and resource configuration library entry to remove sentence boundaries in a spaCy document.

Changed

Split out optional resource library content into mappers.conf.
The spaCy model has attribute langres set on LanguageResource to enable creation of factory instances from registered pipe components.
Fix issue with component creation with no pipeline arguments.

Removed

The DocStash instance as it was too simple for any practical application.

0.1.3 - 2021-09-21

Changed

Dependency.

Removed

zensols.nlp.lang.DocStash

0.1.2 - 2021-09-21

Changed

Make FeatureDocumentParser callable.
Fix memory leak in LanguageResource.

Added

Configuration Resource library.
Configuration for keyword arguments to the add_pipe_comp and example.

0.1.1 - 2021-09-07

Changed

Fixed bug with creating a dict from a FeatureToken.
Fixed/improved how Feature{Token,Sentence,Document} are dictified with (asdict) and how they are written as text with write.

Added

Creates a Pandas dataframe from token feature attributes.
Add back FeatureToken feature ID -> type for write dumping.
Add lexical location SpacyTokenFeatures.loc, the location in the document as a (starting, ending) range.

0.1.0 - 2021-08-16

This release simplifies the token attributes level classes in the features module by:

Using feature IDs instead of trying to make sense of the class property/attribute member data.
Using the FeatureDocumentParser and FeatureToken to copy spaCy resources to simple picklable Python classes.

Not only does this greatly reduce complexity in class hierarchy and data copy/move functionality, but speeds things up.

Changed

Attributes set on detached token features are no longer robust. Before, if a token feature ID was specified, but didn't exist on the source token feature set, it would copy over a None. This now raises an AttributeError instead.
For TokenAttributes, creation of dicts (either by asdict or get_features) is now consistent with the set attributes and properties of the class. Only the features passed to the methods are included; these default to FIELD_IDS of the class (which can be overridden at a class level).
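
The stricter handling of missing feature IDs described in the first item above can be illustrated with a minimal sketch. The `SourceToken` class and the `detach_features` helper are hypothetical names that are not part of the library's API; they only show the difference between failing with an AttributeError and silently copying a None.

```python
from typing import Any, Dict, Iterable


class SourceToken:
    """Hypothetical stand-in for a parsed token with a few features."""
    def __init__(self, norm: str, pos_: str):
        self.norm = norm
        self.pos_ = pos_


def detach_features(token: Any, feature_ids: Iterable[str]) -> Dict[str, Any]:
    """Copy the requested features, failing on any missing feature ID."""
    # getattr without a default raises AttributeError for a missing ID,
    # rather than silently copying None as the older behavior did
    return {fid: getattr(token, fid) for fid in feature_ids}


tok = SourceToken('dog', 'NOUN')
features = detach_features(tok, ['norm', 'pos_'])

try:
    # 'ent_' was never set on the hypothetical source token
    detach_features(tok, ['norm', 'ent_'])
    missing_raised = False
except AttributeError:
    missing_raised = True
```

Failing early makes a misspelled or unsupported feature ID visible at detach time instead of surfacing later as an unexpected None.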

Removed

The dictionary creation of attribute/property individual features methods TokenAttributes.{string}features. These methods are obviated by the get_features, which returns all features in FIELD_IDS.
FeatureDocumentParser.additional_token_feature_ids to simplify token feature IDs passed to feature tokens.
The TokenAttributes class, as it was just a metadata member holder.

Added

A spaCy implementation of the TokenFeatures class that somewhat resembles the TokenFeatures of the old class hierarchy.

0.0.15 - 2021-08-07

Changed

Upgrade from spaCy 2.x to 3.x.

Added

POS feature inclusion by default to support is_pronoun, which is needed after spaCy 3 changed how lemmatization works.
Move feature containers and parser from zensols.deepnlp, including test cases.
A sentence index feature (i_sent).
An index of sentence feature (sent_i).
Advanced spaCy configuration by adding component classes. This gives more control over configuring the spaCy pipeline.
Add feature containers (FeatureDocument) and parser (FeatureDocumentParser), which were moved over from zensols.deepnlp.

0.0.14 - 2021-04-29

Changed

Upgrade to zensols.util version 1.4.1.
Upgrade documentation API generation.
Pin dependencies to spaCy 2.3.5 until pip deps are fixed.
Added sentence index features to reconstruct sentences from documents.

0.0.13 - 2021-01-14

Changed

Fix component adds for spaCy > 2.0.
Add langres model to API documentation.

0.0.12 - 2020-12-29

Changed

Upgraded zenbuild.
Switched from Travis to GitHub workflows.
Tested with Python 3.9.1.

0.0.11 - 2020-12-09

Changed

Add basic token features for non-spaCy parse use cases.
Rename feature type to feature id.
TokenFeatures is now a dictable with to_dict -> asdict.

0.0.10 - 2020-12-09

Added

Sphinx documentation, which includes API docs.

Changed

Settable detached TokenAttributes instances.
Use dataclasses, which requires Python >= 3.7.

0.0.9 - 2020-05-10

Changed

Home/master move lemmatizing out of default token normalizer.
Update super method calls to modern (at least) Python 3.7.
Fix annoying can't find smart_open.gcs bogus warning.
Remove language resource factory.
Upgrade to zensols.util 1.2.0 and get rid of custom factories.

Added

Feature to parse whole special tokens.
Added Porter stemmer from NLTK.

Removed

Moved word2vec embedding (word2vec.py) to zensols.deepnlp library.
Moved feature normalization (fnorm.py) to zensols.deepnlp library.

0.0.8 - 2020-04-14

Changed

Upgrade to spaCy 2.2.4 and textacy 0.10.0.

0.0.7 - 2020-01-24

Added

Added the Porter stemmer from NLTK.

Changed

Better class naming for token mapper.
Features debugging bug fix.

0.0.6 - 2019-12-14

Changed

Fix Travis.

0.0.5 - 2019-12-14

Data classes are now used so Python 3.7 is now a requirement.

Added

Feature normalizers were added for neural networks.
Implemented a better strategy for using language resources with token normalization.

0.0.4 - 2019-11-21

Added

Adding detachable and picklable token feature set.

0.0.3 - 2019-07-31

Added

DocStash that parses documents as a factory stash.

0.0.2 - 2019-07-25

Added

Feature to disable spaCy pipeline components.
Add configuration for removing punctuation and determiners.

Changed

Skip textacy for document creation since it wasn't used. This is more efficient.

0.0.1 - 2019-07-06

Added

Initial version.
