Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Problems with internal link cache re-checking logic (4.0.0.rc3) #695

Closed
riccardoporreca opened this issue Jan 9, 2022 · 8 comments · Fixed by #696 or #699
Closed

Problems with internal link cache re-checking logic (4.0.0.rc3) #695

riccardoporreca opened this issue Jan 9, 2022 · 8 comments · Fixed by #696 or #699

Comments

@riccardoporreca
Copy link
Collaborator

Cached internal broken links are not re-checked

rm -rf example_site
mkdir -p example_site
echo '
<!doctype html>
<html>
<head><title>Example</title></head>
<body>
<a href="/missing.html">Missing internal link</a>
</body>
</html>
' > example_site/index.html

Running first with empty cache

rm -rf tmp/.htmlproofer
bundle exec htmlproofer example_site --log-level debug --cache '{"timeframe": "1d"}'
# [...]
# Adding /missing.html to internal cache
# Adding 1 new internal link to the cache
# Removing 0 outdated internal links from the cache
# Checking 1 internal link
# Ran on 1 file!
#
#
# For the Links > Internal check, the following failures were found:
#
# * In example_site/index.html (line 5):
#
#   internally linking to /missing.html, which does not exist
#
#
# HTML-Proofer found 1 failure!

Run again: the broken link is not re-checked

rm -rf tmp/.htmlproofer
bundle exec htmlproofer example_site --log-level debug --cache '{"timeframe": "1d"}'
# Found 1 internal link in the cache
# Adding 0 new internal links to the cache
# Removing 0 outdated internal links from the cache
# Checking 0 internal links
# Ran on 1 file!
#
#
# HTML-Proofer finished successfully.
@riccardoporreca
Copy link
Collaborator Author

I have been investigating this, which is primarily caused by overriding the metadata, which loses the found information (confirmed by found being null in cache.json after the second round above)

additions = urls_detected.reject do |url, metadata|
if @cache_log[type].include?(url)
@cache_log[type][url][:metadata] = metadata
# if this is false, we're trying again
if type == :external
@cache_log[type][url][:found]
else
@cache_log[type][url][:metadata].none? { |m| m[:found] }
end
else
@logger.log :debug, "Adding #{url} to #{type} cache"
false
end
end

I am trying to propose a fix following my investigations

riccardoporreca added a commit to riccardoporreca/html-proofer that referenced this issue Jan 9, 2022
* Reveals the issue gjtorikian#695 with checking failed links
* Also fix test cache for within-date checking
riccardoporreca added a commit to riccardoporreca/html-proofer that referenced this issue Jan 9, 2022
…torikian#695)

* We cannot overwrite the cache metadata before checking if the internal link was found.
* Don't recheck only if all cached metadata references to an internal link passed.
* Log the re-addition of not found urls (debug level)
@riccardoporreca
Copy link
Collaborator Author

See PR #696

Note that there is still one main flaw even after the proposed fix, which re-checks an internal url if not all cached checks were OK:

@cache_log[type][url][:metadata].all? { |m| m[:found] }

This is not enough, since a relative URL that was OK for all the files previously checked and cached does not provide any guarantee against e.g. the same URL being OK for a new file (for instance, this could happen if a file is moved and the relative urls in it would not be updated). This is very related to #671 (comment)

riccardoporreca added a commit to riccardoporreca/html-proofer that referenced this issue Jan 9, 2022
…ache

* See gjtorikian#695 (comment)
* The link should be re-checked as it was not checked for the given file, regardless of whether it was found OK in the cache for other locations.
* This is currently failing, as we miss the logic for determining this should be re-checked.
@riccardoporreca
Copy link
Collaborator Author

riccardoporreca commented Jan 9, 2022

I have created a new branch off the one of PR #696, where I could add a test for the problem mentioned in #695 (comment) above

riccardoporreca added a commit to riccardoporreca/html-proofer that referenced this issue Jan 9, 2022
…ache

* See gjtorikian#695 (comment)
* The link should be re-checked as it was not checked for the given file, regardless of whether it was found OK in the cache for other locations.
* This is currently failing, as we miss the logic for determining this should be re-checked.
riccardoporreca added a commit to riccardoporreca/html-proofer that referenced this issue Jan 9, 2022
…data

* See gjtorikian#695 (comment)
* The new combined key prevents using the metadata to store the information, and uniquely identifies what is relevant for checking internal links.
* This aligns internal and external collection and cache.
* The combined key needs to be serialized to a string when writing the cache to JSON and parsed when reading.
riccardoporreca added a commit to riccardoporreca/html-proofer that referenced this issue Jan 9, 2022
* To show the issue mentioned in gjtorikian#695 (comment) is addressed with the new approach to internal cache keys.
@riccardoporreca
Copy link
Collaborator Author

@gjtorikian, I took a bit of liberty (and fun) to investigate a possible rethinking of internal links cache keys, by using a combined key including the relevant "context" used for checking them. This was also largely inspired by your considerations in #671 (comment) and #671 (comment)

You can find this drafted in a branch: riccardoporreca/html-proofer@feature/695-fix-internal-cache-metadata-check...riccardoporreca:feature/695-investigate-internal-cache-key-metadata

The approach seems to work, although it has downsides in the JSON cache, where keys become less readable.
The failing tests are only for the cache, since the test expectations and data are not aligned. I only aligned the newly-introduced test revealing the issue.

@gjtorikian gjtorikian reopened this Jan 9, 2022
gjtorikian pushed a commit that referenced this issue Jan 9, 2022
* To show the issue mentioned in #695 (comment) is addressed with the new approach to internal cache keys.
@gjtorikian
Copy link
Owner

The approach seems to work, although it has downsides in the JSON cache, where keys become less readable.
The failing tests are only for the cache, since the test expectations and data are not aligned. I only aligned the newly-introduced test revealing the issue.

Gotcha. Thanks for the explanation. I'm far less concerned about a human-readable version of the cache.

@gjtorikian
Copy link
Owner

I need to look at this one a bit more carefully, because I bungled up your branch with a force push. There are a couple of tests still failing but in general I think what you've done makes sense.

@riccardoporreca
Copy link
Collaborator Author

There are a couple of tests still failing but in general I think what you've done makes sense.

Indeed, I did not want to fix tests before knowing if the approach would be something workable at your end. I think it is important to align all test cache files to the new structure: tests checking if a link is added because it was failed in the cache or the cache expired might pass for the wrong reason of the key not being in the cache

Let me know if I can help with this...

@gjtorikian
Copy link
Owner

I just remembered I dropped this—will try to make time for it this weekend.

gjtorikian pushed a commit that referenced this issue Feb 16, 2022
* Reveals the issue #695 with checking failed links
* Also fix test cache for within-date checking
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants