DOMTagger only for html but also index PDF and other Documents #441

tschechniker · 2017-12-18T08:16:49Z

Hi,

Currently I want only the "main" content of an HTML document to be parsed, like this:

<tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
    <dom selector="main" toField="content" overwrite="true"/>
    <restrictTo field="document.contentType">text/html</restrictTo>
</tagger>

But this does not overwrite the content. It only sets a content MetaField.

To upload this, I have to configure my CloudSearch committer to use this MetaField as the content.

<committer class="com.norconex.committer.cloudsearch.CloudSearchCommitter">
    <serviceEndpoint>XYZ</serviceEndpoint>
    <fixBadIds>true</fixBadIds>
    <sourceContentField>content</sourceContentField>
</committer>

So for HTML files I got this running. But what about PDF files and other documents? They still have their content in the content field and don't have a content MetaField. I was unable to find a "CopyContentToMetaField" config. Or is there a possibility to overwrite the content with the DOMTagger (or any other config)? The current behavior is that the content committed to CloudSearch for PDF files and other documents is empty.

essiembre · 2017-12-19T02:33:21Z

There is already a feature request for a DOMTransformer that would do what you are after: Norconex/importer#62.

In the meantime, you have a few options. If you want to modify the content rather than store in a new field, you can use the ReplaceTransformer to perform search & replace using regular expressions. You can also look at using a combination of StripBeforeTransformer and StripAfterTransformer. All would need to be set up as pre-parse handlers.
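
As a rough illustration of the StripBeforeTransformer / StripAfterTransformer combination, the sketch below keeps only what sits between the opening and closing main tags of an HTML page. It is untested. The element and attribute names (stripBeforeRegex, stripAfterRegex, inclusive) are recalled from the Importer 2.x documentation and should be checked against the version in use, and the two regular expressions are assumptions about what the pages look like.

```xml
<importer>
  <preParseHandlers>
    <!-- Drop everything before the opening main tag.
         inclusive="false" is assumed to keep the matched tag itself. -->
    <transformer class="com.norconex.importer.handler.transformer.impl.StripBeforeTransformer"
        inclusive="false">
      <stripBeforeRegex>&lt;main[^&gt;]*&gt;</stripBeforeRegex>
      <restrictTo field="document.contentType">text/html</restrictTo>
    </transformer>
    <!-- Drop everything after the closing main tag. -->
    <transformer class="com.norconex.importer.handler.transformer.impl.StripAfterTransformer"
        inclusive="false">
      <stripAfterRegex>&lt;/main&gt;</stripAfterRegex>
      <restrictTo field="document.contentType">text/html</restrictTo>
    </transformer>
  </preParseHandlers>
</importer>
```

Because both transformers are restricted to text/html, PDF files and other documents would pass through unchanged and keep their regular content, so the committer would not need a sourceContentField in this variant.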

To copy the content to a metadata field, you can use TextPatternTagger, in a similar way (untested):

  <tagger class="com.norconex.importer.handler.tagger.impl.TextPatternTagger" >
      <pattern field="content">.*</pattern>
      <restrictTo field="document.contentType">^(?!text/html$).*$</restrictTo>
  </tagger>
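
Putting the two taggers together could look roughly like the sketch below, so that every document ends up with a content metadata field for the committer's sourceContentField to pick up. This is also untested, and the handler placement is an assumption: DOMTagger needs the raw HTML markup and therefore goes before parsing, while the TextPatternTagger for PDF files and other documents needs the extracted text and therefore goes after parsing. In Java regular expressions "." does not match line breaks by default, so the .* pattern may need the (?s) inline flag to capture multi-line content as a single value. Whether the tagger already handles this is not confirmed here.

```xml
<importer>
  <preParseHandlers>
    <!-- HTML only: works on the raw markup, before parsing strips the tags. -->
    <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
      <dom selector="main" toField="content" overwrite="true"/>
      <restrictTo field="document.contentType">text/html</restrictTo>
    </tagger>
  </preParseHandlers>
  <postParseHandlers>
    <!-- Everything except HTML: works on the text extracted by the parser. -->
    <tagger class="com.norconex.importer.handler.tagger.impl.TextPatternTagger">
      <pattern field="content">.*</pattern>
      <restrictTo field="document.contentType">^(?!text/html$).*$</restrictTo>
    </tagger>
  </postParseHandlers>
</importer>
```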

Can one of the above work for you?

essiembre closed this as completed Dec 23, 2019
