WARC File Comparison #85
Comments
WARCreate: This web archiver places all pertinent information at the top of the file, in plain text, for quick access. It records each link found on the page and places them in the outlink section, then gives you the HTTP request information and the source code. From there you can use the WAIL software to open the .warc file.

webrecorder.io: This web archiver is very simple to use. You input the URL you want to archive and it opens it in its own browser. You have to scroll through the web page (you can also use the auto-pilot tool) and it will gather any information it can. From there your archive is saved and you can go back to it any time. From what I can tell, it's very in-depth and grabs just about everything. The web page looks exactly like its real counterpart. The images and media content are saved to their server, and the WARC file isn't legible in a text reader, but it contains everything essential. The source code is the same except everything is linked through the webrecorder.io server.
Are you suggesting that the embedded resources like images and media are not packaged in the WARC file, but hosted on their server separately and referenced from the WARC file?
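This perhaps is because they use the .warc extension for files that should actually be .warc.gz. They do it to avoid the automatic extraction done by macOS (I like neither Apple's behavior here nor Webrecorder's misleading workaround). If you append .gz to the name of the downloaded .warc file and then unzip it, you will find it just as legible in a text editor as other WARC files. Please explore it and report your findings back.
Please elaborate on this. What do you mean by "everything is linked through the webrecorder.io server"?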
I couldn't surmise whether the information was stored in the file or simply linked to their server. The source code would change the URL as such: Source: WARC: which could indicate the content is being pulled into the local server, but upon further review it is much more likely stored in the .warc file.

> This perhaps is because they use the .warc extension for files that should actually be .warc.gz. They do it to avoid the automatic extraction done by macOS (I like neither Apple's behavior here nor Webrecorder's misleading workaround). If you append .gz to the name of the downloaded .warc file and then unzip it, you will find it just as legible in a text editor as other WARC files. Please explore it and report your findings back.

This did make the file partially legible. There are portions that contain image data and other kinds of data that aren't legible in a text reader, though. With this working, I can say with some assurance that the content is actually stored within the .warc file. The data that can be understood is HTTP response information and some WARC metadata.

> Please elaborate on this. What do you mean by "everything is linked through the webrecorder.io server"?

For example:
is changed to:
So it appears the information is hosted on their server, but I cannot confirm this is how it works. [USERNAME] is my redacted username.
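The "append .gz and unzip" step discussed above can also be done without renaming anything. The following is only a minimal sketch, assuming the downloaded file really is gzip data under a .warc name; the function name `gunzip_warc` and the file paths are hypothetical. It checks the two gzip magic bytes first, so an already-plain WARC is simply copied.

```python
import gzip
import shutil


def gunzip_warc(src, dest):
    """Write a plain-text-readable copy of `src` to `dest`.

    Returns True if `src` was gzip-compressed (whatever its extension),
    False if it was already uncompressed and was copied unchanged.
    """
    with open(src, "rb") as probe:
        is_gzip = probe.read(2) == b"\x1f\x8b"  # gzip magic number

    if not is_gzip:
        shutil.copyfile(src, dest)
        return False

    # gzip.open reads through concatenated gzip members, which is how
    # compressed WARC files are commonly laid out (one member per record).
    with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return True
```

For example, `gunzip_warc("capture.warc", "capture-plain.warc")` would leave the original download untouched and produce a copy whose WARC and HTTP headers can be read in a text editor; binary payloads such as images will still look like noise.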
This is called URL rewriting, and it is necessary for proper replay of archived resources to avoid live leakage (we call such leaks zombies). The rewriting is done on the fly at replay time, not in the WARC itself (if you hunt through the WARC file and find otherwise, that would be something to report as a bug). If you were to replay the same WARC locally, those references would change accordingly.
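To make the idea of replay-time rewriting concrete, here is a toy sketch. It is not how webrecorder.io or any particular replay system implements it; the function name `rewrite_links` and the replay prefix are assumptions made for illustration. The stored HTML is left alone, and only the copy served to the browser has its `href` and `src` values pointed back into the archive.

```python
import re
from urllib.parse import urljoin


def rewrite_links(html, page_url, replay_prefix):
    """Return a copy of `html` whose href/src values go through the replay server.

    `page_url` is the original URL of the archived page (used to resolve
    relative links); `replay_prefix` is a hypothetical replay endpoint.
    """
    def repl(match):
        attr, quote, target = match.group(1), match.group(2), match.group(3)
        absolute = urljoin(page_url, target)  # resolve relative references
        return f"{attr}={quote}{replay_prefix}{absolute}{quote}"

    # Deliberately simplistic: real rewriters also handle CSS, JavaScript,
    # srcset, and more. This only covers quoted href/src attributes.
    return re.sub(r'\b(href|src)=(["\'])(.*?)\2', repl, html)


archived = '<img src="/img/logo.png">'
print(rewrite_links(archived, "https://odu.edu/compsci",
                    "http://localhost:8080/replay/"))
```

Because the prefix is supplied at replay time, serving the same stored HTML from a different host only changes the prefix, which is why the references differ between a hosted replay and a local one.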
This should be no different for WARC files created by other tools. If you attended the relevant lecture, I did mention that the payload could be binary, but both HTTP and WARC headers are text-based. Did you not find binary data in the WARC created by the other tool you are comparing it against? Also, can you summarize the number of WARC records of each type (such as request, response, and metadata) produced by the two tools? This would help estimate which tool is more effective at discovering the most resources. You can use some WARC processing tools (such as
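As a rough illustration of the per-type summary requested above, the following sketch tallies the `WARC-Type` header of each record using only the Python standard library. It is not one of the WARC processing tools referred to in the comment; the function name `count_record_types` is hypothetical, and the parser assumes well-formed records with an accurate `Content-Length`.

```python
import gzip
from collections import Counter


def count_record_types(path):
    """Count WARC records by their WARC-Type header (plain or gzip input)."""
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == b"\x1f\x8b"  # gzip magic number
    opener = gzip.open if is_gzip else open

    counts = Counter()
    with opener(path, "rb") as stream:
        while True:
            line = stream.readline()
            if not line:
                break  # end of file
            if not line.startswith(b"WARC/"):
                continue  # blank separator lines between records

            # Read the text-based WARC headers up to the empty line.
            headers = {}
            while True:
                header = stream.readline()
                if header in (b"\r\n", b"\n", b""):
                    break
                name, _, value = header.partition(b":")
                headers[name.strip().lower()] = value.strip()

            rec_type = headers.get(b"warc-type", b"unknown")
            counts[rec_type.decode("ascii", "replace")] += 1

            # Skip the (possibly binary) payload so its bytes are never
            # mistaken for the start of a new record.
            stream.read(int(headers.get(b"content-length", b"0")))
    return counts
```

Running this on the output of each tool for the same page and comparing the two counters (for instance, the number of `response` records) gives a first estimate of how many resources each capture discovered.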
Compare and contrast the resulting WARC files on the
https://odu.edu/compsci
URI generated by any two of the following tools: