Warc-Indexer remove port :80 from url/links when normalising. #284

Open
thomasegense opened this issue Mar 30, 2022 · 0 comments

thomasegense commented Mar 30, 2022

This is an example of a url_norm value in Solr that still contains port 80:
url_norm:"http://train-aarhus.dk:80/visbillede.asp?fp=brandnewheavies.jpg"

In this case the URL comes from the ARC (not WARC) header:

ARC header

http://train-aarhus.dk:80/visbillede.asp?fp=brandnewheavies.jpg 194.239.250.54 20001021042018 text/html 1699

HTTP/1.1 200 OK

Server: Microsoft-IIS/4.0


Port 80 should also be removed when parsing links (a href) on a page. If the same URL is indexed both with and without :80, playback breaks because the two forms cannot be matched.

The same goes for HTTPS and port 443.
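
To make the requested behaviour concrete, here is a minimal sketch in Python of dropping the scheme's default port during normalisation. It is not the warc-indexer code (which is Java); the function name `strip_default_port` and the `DEFAULT_PORTS` table are hypothetical, and the sketch covers only the port, not the other normalisation steps behind url_norm.

```python
from urllib.parse import urlsplit

# Assumption: only these two scheme/port pairs are treated as defaults,
# matching the cases raised in this issue.
DEFAULT_PORTS = {"http": 80, "https": 443}


def strip_default_port(url: str) -> str:
    """Return url without its port when the port is the scheme's default."""
    parts = urlsplit(url)
    try:
        port = parts.port  # raises ValueError on a malformed port
    except ValueError:
        return url
    if port is None or DEFAULT_PORTS.get(parts.scheme.lower()) != port:
        return url
    # A port is present, so the netloc ends in ":<port>". Cutting at the
    # last colon keeps any userinfo and IPv6 brackets intact.
    netloc = parts.netloc.rpartition(":")[0]
    return parts._replace(netloc=netloc).geturl()


if __name__ == "__main__":
    print(strip_default_port(
        "http://train-aarhus.dk:80/visbillede.asp?fp=brandnewheavies.jpg"))
```

Applying the same function both to the record URL from the ARC header and to every extracted link would give the two sides the same form, which is what playback matching needs. A non-default port such as :8080, or :80 on an https URL, is left untouched.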
