diff --git a/CHANGES.xml b/CHANGES.xml
index 39fcce46..6899638d 100644
--- a/CHANGES.xml
+++ b/CHANGES.xml
@@ -7,8 +7,8 @@
-
+
Maven dependency updates: Apache Tika 1.27 (and its many transitive
dependencies), UCAR jj2000 5.4, Opencsv 5.5.2,
diff --git a/README.md b/README.md
index a88fee74..c3b7881a 100644
--- a/README.md
+++ b/README.md
@@ -7,4 +7,4 @@ its format (HTML, PDF, Word, etc). In addition, it allows you to perform any
manipulation on the extracted text before importing/using it in your own
service or application.
-Website: http://www.norconex.com/collectors/importer/
+Website: https://opensource.norconex.com/importer/
diff --git a/TODO.txt b/TODO.txt
index 539e6bff..4aff5d81 100644
--- a/TODO.txt
+++ b/TODO.txt
@@ -1,40 +1,29 @@
TODO:
==============
-- Deprecate "restrictTo" in favor of XML Flow conditions.
-- Deprecate "filters" in favor of XML Flow conditions.
+- Deprecate "restrictTo" in favor of XML Flow conditions?
+- Deprecate "filters" in favor of XML Flow conditions?
-- Instead of naming handlers wit "Metadata" "Content", made/recommended for
+- Instead of naming handlers with "Metadata" "Content", made/recommended for
Pre or Post, or what type does it support out of the box vs recommended...
Use custom annotations that would generate appropriate javadoc.
-- Maybe have the following two tags a or content tag:
- rejects the document
- more risky: abort the execution flow but consider the doc valid
-
-
+- For XMLFlow, in addition to , maybe have (more risky):
+ abort the execution flow but consider the doc valid
+
- Replace DOMDeleteTransformer with DOMTransformer that gives the option
to only keep what is matching, deleting the rest, or delete what is matching.
+- Consider merging tagger and transformer and detecting if content has changed,
+ and offer in most cases to operate on either field or content (or both).
+
- Add a TrimTagger/TrimTransformer
- Add a more convenient way to collapse on white spaces.
- Modify ImporterEvent so that Importer is the source, as it should (as opposed to the doc).
-- Consider a generic Matcher/Replacer class that supports either Regex, Normal, or
- WildCard matches/replacements.
- - Write tests for it
- - Do it also for "restrictTo".
-
- Remove all the @since x.x.x referencing versions before 3.0.0
-- In the JavaDoc, point to the Summary page with anchor or in appropriate class
- description method for documentation instead of repreating it.
- E.g. EncryptKey, restrictTo, "storing values in an existing field", etc.
- Use -tag and -taglet? Have tag(let) for thinks such as :
- if it can be used as pre and/or post handlers, xml configuration usage,
- sample usage, main doc, etc.
-
- Have a .misc package for handlers for those not falling into any of the 4
types (like DebugTagger and FieldReportTagger).
@@ -44,16 +33,6 @@ TODO:
CSVSplitter, etc).
- Have a Prefix tagger to prefix all metadata with something.
- Also modify the RenameTagger to do bulk renaming
- https://github.com/Norconex/collector-http/issues/553
-
-- Rename all "[xxx]Field" attributes to either sourceField or targetField.
-
-- Add to website generic assumptions such as:
- - how white spaces are handled in XML.
- - how all booleans default to false
- - how all PropertySetter default to APPEND
- - etc.
- Add to scripts "-Dnashorn.args=--no-deprecation-warning" to silence
deprecation warning on some JVM.
@@ -65,45 +44,28 @@ TODO:
- Add ReduceConsecutiveTagger.
-- Move GenericDocumentParserFactory to .impl (for consistency).
+- Move GenericDocumentParserFactory to .impl (for consistency) or do not make
+ it a factory?
- Maybe: have a @taglet for if it can be used as pre-post or both?
- CountValueTagger (one to count matching patterns, one to count number of multi-value entries)
-- EmptyFilter
- DOMTransformer, JSON*(handlers)
- EmptyTagger or CompactTagger (eliminate empty list values and/or duplicate values).
-- Move GenericParserFactory to .impl, or do not make it a factory?
-
-- Add "onSet" to parser so implementors can decide what to do with extracted
- metadata.
-
-- Package importer with a log4j file that exclude useless errors (e.g. jbig2)
-
- Have a StripAccentsTagger and Transformer. See: StringUtils.stripAccents(str)
-- Have a ContentToFieldTagger to easily take the content and store it in a field
-
- Add ability to convert binary content into hex/base64 into a text field, or
to replace body.
- Consider making the MS .docs memory fix permanent:
https://github.com/Norconex/collector-filesystem/issues/39#issuecomment-419327401
-- Make OnMatch an IXMLConfigurable object instead of an abstract one.
-
-- Remove "cachedPattern" instances and replace with Model equivalent
- (transient with string being serialisable).
-
- Maybe: rename references to "metadata" to be references to "fields" ?
- In load/save XML reference local fields instead of getters/setters.
- Convert all arrays to final List for consistency (with unmodifiable getters).
-- Consider merging tagger and transformer and detecting if content has changed,
- and offer in most case to do operation on either field or content (or both).
-
- Consider using updated Tika RecursiveParser instead of custom one.
- Have a handler that stores the file in its current state in a location
@@ -115,13 +77,6 @@ TODO:
- Fix external links in Javadoc (all projects).
-- Have a transformer that eliminates the content (and/or store into a field).
- And mark as resolved this (closed) ticket:
- https://github.com/Norconex/collector-filesystem/issues/30#issuecomment-384499927
-
-- Boost memory for handlers loading docs in memory for processing to 1GB or
- x% of free memory and throw warning when having to split.
-
- Consider having a flag for text handlers that detect if text or binary
and by default will handle only text unless forced otherwise.
@@ -145,11 +100,6 @@ TODO:
- Add support for SentimentParser and other Tika recent features.
-- Add onConflict to CopyTagger (add,set,ignore) and wherever appropriate
- (where there is "overwrite"?)
-
-- Switch to Commons CLI 2.x
-
- Once Norconex Commons Lang upgrades to Velocity 2.0 add Velocity as a
scripting language option where applicable (e.g. ScriptTagger).
@@ -157,19 +107,10 @@ TODO:
merge, to accommodate scenarios where key/values are repeated, without a
parent wrapping tag, as in: https://github.com/Norconex/importer/issues/54
-- Add ability to pass a class resolver when loading an XML, which would
- for example allows to try loading the class with a predefined set of
- package paths. This would allow users to supply only the class name,
- making configs easier to read/maintain.
-
- Maybe have default "text-only" flag for each handlers??
- Have a tagger that looks up metadata in a relational database?
-- Add ability to do batch rename of field names (e.g. replacing dots with ...)
-
-- Add support for tika SentimentParser.
-
- Have new taggers:
- ExtensionTagger, given a URL, tries to get extension from content type
if not found in reference.
@@ -186,9 +127,6 @@ TODO:
- Consider adding LIRE support (image info extraction for image search).
http://www.lire-project.net/
-- Consider creating an ExternalTagger which expects metadata extraction patterns
- from STDOUT/STDERR or from output file.
-
- Allow to specify data unit for DocumentLengthTagger (with locale and decimal
precision).
@@ -208,30 +146,11 @@ TODO:
leveraging Tika GDAL support (requires external app install, like
Tesseract OCR feature).
-- Create an ImageConverterTransformer that would convert images from/to
- format of choice. This could allow for instance to convert some
- formats non-supported by Tesseract OCR into some that are.
-
- Have a maximum recursivity setting somewhere in GenericDocumentParserFactory?
Alternatively, consider moving to using RecursiveParserWrapper which
already supports that.
- MAYBE: Consider interactive shell script invoking the importer.
-- MABYE Have a base handler class that takes a functional interface for the
+- MAYBE: Have a base handler class that takes a functional interface for the
different types?
-
-
-- MAYBE: Have being optional wrapping tag that can group
- multiple other handlers, so the condition does not have to be repeated in
- each. E.g.:
-
-
-
-
-
-
-
-
- Have this in addition or as a replacement to current approach?
-
\ No newline at end of file