Generalize rules for skipping content #286

Open
tokee opened this issue Apr 6, 2022 · 0 comments
Collaborator

tokee commented Apr 6, 2022

warc-indexer has several properties that decide whether a WARC record is indexed or skipped: record_type_include, response_include, protocol_include, exclusions and url_exclude.

With some rewriting, this could be fully generalized to work on the content of any field in the generated SolrDocument (with optimizations for the cases where "no index" can be determined before analyzing), making it possible to use white/black-lists for MIME types, domains etc. The rules could be folded into the fields in the config or be a separate section.
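
To make the idea concrete, here is a rough sketch of what a generalized per-field rule could look like. Nothing in it exists in warc-indexer: the class FieldSkipRules, the nested Rule class, and the helpers shouldIndex and patterns are hypothetical names, the document is modelled as a plain Map instead of a SolrDocument to keep the example self-contained, and the field names content_type and domain are only illustrative.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/** Hypothetical sketch of generalized skip rules; not part of warc-indexer. */
class FieldSkipRules {

    /** One rule: a field name plus include (whitelist) and exclude (blacklist) patterns. */
    static final class Rule {
        final String field;
        final List<Pattern> include;
        final List<Pattern> exclude;

        Rule(String field, List<Pattern> include, List<Pattern> exclude) {
            this.field = field;
            this.include = include;
            this.exclude = exclude;
        }

        /** Returns true if the document passes this rule and may still be indexed. */
        boolean accepts(Map<String, List<String>> doc) {
            List<String> values = doc.getOrDefault(field, List.of());
            // Any value that fully matches an exclude pattern rejects the document.
            for (String value : values) {
                for (Pattern p : exclude) {
                    if (p.matcher(value).matches()) {
                        return false;
                    }
                }
            }
            // An empty include list means "no whitelist": anything not excluded passes.
            if (include.isEmpty()) {
                return true;
            }
            // Otherwise at least one value must fully match an include pattern.
            for (String value : values) {
                for (Pattern p : include) {
                    if (p.matcher(value).matches()) {
                        return true;
                    }
                }
            }
            return false;
        }
    }

    /** A document is indexed only if every rule accepts it. */
    static boolean shouldIndex(Map<String, List<String>> doc, List<Rule> rules) {
        return rules.stream().allMatch(rule -> rule.accepts(doc));
    }

    /** Convenience helper: compiles regex strings into patterns. */
    static List<Pattern> patterns(String... regexes) {
        return Arrays.stream(regexes).map(Pattern::compile).collect(Collectors.toList());
    }
}
```

A rule such as new FieldSkipRules.Rule("content_type", FieldSkipRules.patterns("text/.*"), FieldSkipRules.patterns()) would then act as a MIME type whitelist. The sketch does not cover the optimization mentioned above: rules on fields that are known from the WARC header would have to be evaluated separately, before the payload is analyzed.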
