From 65da537267b47281a4d4442cb9176098b7ae78df Mon Sep 17 00:00:00 2001 From: Pascal Essiembre Date: Mon, 18 Oct 2021 00:22:23 -0400 Subject: [PATCH] Preparing for release. --- norconex-importer/CHANGES.xml | 8 +- norconex-importer/TODO.txt | 133 ---------------------------------- norconex-importer/pom.xml | 4 +- 3 files changed, 6 insertions(+), 139 deletions(-) delete mode 100644 norconex-importer/TODO.txt diff --git a/norconex-importer/CHANGES.xml b/norconex-importer/CHANGES.xml index 92fdc2b9..abeac067 100644 --- a/norconex-importer/CHANGES.xml +++ b/norconex-importer/CHANGES.xml @@ -7,14 +7,14 @@ - + + New NoContentTransformer ported from version 3.0.0. + + TitleGeneratorTagger now defaults to the first 10,000 characters for analysis to improve performance on large files. - - New NoContentTransformer ported from version 3.0.0. - Transformers are now invoked at least once, even when a document has no content. diff --git a/norconex-importer/TODO.txt b/norconex-importer/TODO.txt deleted file mode 100644 index bc38d95c..00000000 --- a/norconex-importer/TODO.txt +++ /dev/null @@ -1,133 +0,0 @@ -TODO: -============== - -- Have a .misc package for handlers for those not falling into any of the 4 - types (like DebugTagger and FieldReportTagger). - -- Add Tagger for creating document summary. - -- Have a Prefix tagger to prefix all metadata with something. - Also modify the RenameTagger to do bulk renaming - https://github.com/Norconex/collector-http/issues/553 - -- Have a generic construct for set/replace/append/prepend/insertat, for multivalue - maps (Properties) but also for attributes on taggers and such. - -- Consider a generic Matcher/Replacer class that supports either Regex, Normal, or - WildCard matches/replacements. - -- Have a transformer that eliminates the content (and/or store into a field). - https://github.com/Norconex/collector-filesystem/issues/30#issuecomment-384499927 - -- Boost memory for handlers loading docs in memory for processing to 1GB or - x% of free memory and throw warning when having to split. - -- Consider having a flag for text handlers that detect if text or binary - and by default will handle only text unless forced otherwise. - -- Have a DOMTRansformer and a tagger that splits multivalue in individual fields - giving a each index position a specific name. - -- See if we can return an empty output stream in IDocumentSplitter to - eliminate parent document. - -- Have option to parse content as XML instead of plain text. Should - it be a parser hint? - -- Have a copy of importer launch scripts with collectors. - -- Have the option for the importer to ignore suplied content-type/charset and - always perform detection (with option to fall back to supplied ones if could - not detect). - -- When content type is provided to importer, but is wrong, catch any exception - and try again after auto-detecting if the detected type is different. - -- Add support for SentimentParser and other Tika recent features. - -- Add onConflict to CopyTagger (add,set,ignore) and wherever appropriate - (where there is "overwrite"?) - -- Switch to Commons CLI 2.x - -- Once Norconex Commons Lang upgrades to Velocity 2.0 add Velocity as a - scripting language option where applicable (e.g. ScriptTagger). - -- Consider adding a "mergeElements" to DOMTagger for the number of elements to - merge, to accomodate for senarios where key/values are repeated, without a - parent wrapping tag, as in: https://github.com/Norconex/importer/issues/54 - -- Add ability to pass a class resolver when loading an XML, which would - for example allows to try loading the class with a predefined set of - package paths. This would allow users to supply only the class name, - making configs easier to read/maintain. - -- Maybe have default "text-only" flag for each handlers?? - -- Have a tagger that looks up metadata in a relational database? - -- Add ability to do batch rename of field names (e.g. replacing dots with ...) - -- Add support for tika SentimentParser. - -- Have new taggers: - - ExtensionTagger, given a URL, tries to get extension from content type - if not found in reference. - - Add overwrite=true|false to ReplaceTagger? - -- To remove/adjust when released in Apache Tika: - - - Remove XFDL from - GenericDocumentParserFactory as well as from custom-mimetypes.xml. - https://issues.apache.org/jira/browse/TIKA-1946 - https://issues.apache.org/jira/browse/TIKA-2228 - https://issues.apache.org/jira/browse/TIKA-2222 - -- Consider adding LIRE support (image info extraction for image search). - http://www.lire-project.net/ - -- Consider creating an ExternalTagger which expects metadata extraction patterns - from STDOUT/STDERR or from output file. - -- Allow to specify data unit for DocumentLengthTagger (with locale and decimal - precision). - -- Find out if we can reduce metadata extraction on images to avoid - OOMException on some images with massive amount of metadata. - -- Investigate Tika Named Entity Parser: - https://wiki.apache.org/tika/TikaAndNER - -- Investigate Tika Natural Language Toolkit: - https://wiki.apache.org/tika/TikaAndNLTK - -- Maybe ship with a default tika-config on a given path so it can easily be - modified: https://tika.apache.org/1.12/configuring.html - -- Add better defined Geospatial Data Abstraction Library (GDAL) support, - leveraging Tika GDAL support (requires external app install, like - Tesserac OCR feature). - -- Create an ImageConverterTransformer that would convert images from/to - format of choice. This could allow for instance to convert some - formats non-supported by Tesseract OCR into some that are. - -- Have a maximum recursivity setting somewhere in GenericDocumentParserFactory? - Alternatively, consider moving to using RecursiveParserWrapper which - already supports that. - -- MAYBE: Consider interactive shell script invoking the importer. - -- MAYBE: Have being optional wrapping tag that can group - multiple other handlers, so the condition does not have to be repeated in - each. E.g.: - - - - - - - - - Have this in addition or as a replacement to current approach? - \ No newline at end of file diff --git a/norconex-importer/pom.xml b/norconex-importer/pom.xml index fed54720..c4cb8679 100644 --- a/norconex-importer/pom.xml +++ b/norconex-importer/pom.xml @@ -19,7 +19,7 @@ 4.0.0 com.norconex.collectors norconex-importer - 2.11.0-SNAPSHOT + 2.11.0 Norconex Importer @@ -28,7 +28,7 @@ 1.18 2.0.11 - 1.15.1 + 1.15.2 2009