Prevent duplicates by cleaning up HTML tags with timestamps or tokens #1223

Open
grossir opened this issue Oct 22, 2024 · 0 comments

grossir commented Oct 22, 2024

We have seen this in coloctapp, which uses a vlex backend (#1215). There, some <img> tags carried AWS tokens that changed every time the page was fetched.

I found a new example in the older scctapp_u files (from before July 2012), which contain a timestamped <script> tag from "Incapsula".
Open this example twice, a few seconds apart, and the tag will have changed:
<script type="text/javascript" src="/_Incapsula_Resource?SWJIYLWA=719d34d31c8e3a6e6fffd425f7e032f3&ns=1&cb=1672922138" async></script>

So we could define a default cleaning step for HTML files that tries to remove elements that may hold tokens or timestamps.
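A minimal sketch of what such a default cleaning step might look like. The function name `clean_volatile_html` and both patterns are hypothetical, not existing Juriscraper code. It uses only the standard library to stay self-contained; a real implementation would probably walk the tree with an HTML parser instead of regular expressions. It assumes that dropping whole <script> elements and the query string of <img> sources is acceptable for the stored document.

```python
import re

# Hypothetical patterns: the issue does not specify which elements to strip.
# Matches a whole <script ...>...</script> element, including its body.
SCRIPT_RE = re.compile(r"<script\b[^>]*>.*?</script\s*>", re.IGNORECASE | re.DOTALL)
# Captures an <img> tag up to the "?" in its src, then matches the query string.
IMG_QUERY_RE = re.compile(
    r"""(<img\b[^>]*?\bsrc=["'][^"'?]*)\?[^"']*""", re.IGNORECASE
)


def clean_volatile_html(html: str) -> str:
    """Remove <script> elements and strip query strings from <img> src URLs."""
    html = SCRIPT_RE.sub("", html)
    # Keep everything before the "?" and discard the token-bearing query string.
    return IMG_QUERY_RE.sub(r"\1", html)
```

Two downloads of the same page that differ only in the `cb=` timestamp or in a signed image URL should then produce identical cleaned text.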

Additionally, we could compute the hash in Juriscraper, so that one can check during development whether it changes between downloads.
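The development-time check could look roughly like this. It is a sketch only: `content_hash` and `hash_is_stable` are hypothetical helpers, and SHA-1 is an assumption made for illustration, since the issue does not say which hash function is used.

```python
import hashlib
from typing import Callable, Optional

Cleaner = Callable[[bytes], bytes]


def content_hash(content: bytes, cleaner: Optional[Cleaner] = None) -> str:
    """Hash the downloaded bytes, optionally after a cleaning step."""
    if cleaner is not None:
        content = cleaner(content)
    # SHA-1 is an assumption here; swap in whatever the pipeline really uses.
    return hashlib.sha1(content).hexdigest()


def hash_is_stable(
    first: bytes, second: bytes, cleaner: Optional[Cleaner] = None
) -> bool:
    """Return True if two downloads of the same document hash identically."""
    return content_hash(first, cleaner) == content_hash(second, cleaner)
```

Calling `hash_is_stable` on two downloads taken a few seconds apart, once without and once with a cleaner, would show whether a volatile tag is what makes the hash differ.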

@flooie flooie moved this to General Backlog in Case Law Sprint Nov 19, 2024
@flooie flooie moved this from General Backlog to Feb 10 to Feb 21 in Case Law Sprint Jan 14, 2025