In this zimit run, the seed URL is not HTML. The response is a 200 with content-type: application/x-directory and no payload. warc2zim detected this and continued processing, eventually failing on a missing attribute.
We should either harden that processing or exit immediately if the seed is not HTML.
[zimit::2025-02-14 22:25:15,583] INFO:Calling warc2zim with these args: ['--favicon=https://pixijs.com/images/logo.svg', '--name=pixijs.download_2b58a03e', '--zim-file=pixijs.download_2b58a03e.zim', '--publisher=openZIM', '--scraper-suffix', 'zimit 2.1.8', '--output', '/output', '--url', 'https://pixijs.download/release/docs/', '--description', 'PixiJS API Documentation', '-v', '--progress-file', '/output/warc2zim.json', '/output/.tmp3qz2ow61/collections/crawl-20250214222511797/archive']
[warc2zim::2025-02-14 22:25:15,589] DEBUG:Attempting to confirm output is writable in directory /output
[warc2zim::2025-02-14 22:25:15,590] DEBUG:Output is writable. Temporary file used for test: /output/tmpn43ov84l
[warc2zim::2025-02-14 22:25:15,591] DEBUG:Confirming ZIM file can be created using name: pixijs.download_2b58a03e.zim
[warc2zim::2025-02-14 22:25:15,592] DEBUG:1 WARC files found
[warc2zim::2025-02-14 22:25:15,598] WARNING:Main page is not an HTML Page, mime type is: application/x-directory - Skipping Favicon and Language detection
[warc2zim::2025-02-14 22:25:15,599] INFO:Expecting 1 ZIM entries to files
[warc2zim::2025-02-14 22:25:15,599] DEBUG:Preparing 0 redirections
[warc2zim::2025-02-14 22:25:15,599] DEBUG:0 redirections will be ignored
[warc2zim::2025-02-14 22:25:15,599] INFO:Expecting 1 ZIM entries including redirects
[warc2zim::2025-02-14 22:25:15,599] WARNING:No valid ZIM language, fallbacking to `eng`.
[zimit::2025-02-14 22:25:15,621] INFO:
[zimit::2025-02-14 22:25:15,621] INFO:
[zimit::2025-02-14 22:25:15,621] INFO:SIGINT/SIGTERM received, stopping zimit
[zimit::2025-02-14 22:25:15,621] INFO:
[zimit::2025-02-14 22:25:15,621] INFO:
Traceback (most recent call last):
File "/usr/bin/zimit", line 8, in <module>
sys.exit(zimit.zimit())
~~~~~~~~~~~^^
File "/app/zimit/lib/python3.13/site-packages/zimit/zimit.py", line 688, in zimit
sys.exit(run(sys.argv[1:]))
~~~^^^^^^^^^^^^^^
File "/app/zimit/lib/python3.13/site-packages/zimit/zimit.py", line 609, in run
return warc2zim(warc2zim_args)
File "/app/zimit/lib/python3.13/site-packages/warc2zim/main.py", line 168, in main
return converter.run()
~~~~~~~~~~~~~^^
File "/app/zimit/lib/python3.13/site-packages/warc2zim/converter.py", line 307, in run
self.retrieve_illustration()
~~~~~~~~~~~~~~~~~~~~~~~~~~^^
File "/app/zimit/lib/python3.13/site-packages/warc2zim/converter.py", line 838, in retrieve_illustration
favicon_url.value, self.favicon_contents[favicon_url]
^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'Converter' object has no attribute 'favicon_contents'
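The log and traceback above suggest that `favicon_contents` was never assigned because favicon detection was skipped for the non-HTML main page. Below is a minimal sketch of the "harden" option, assuming a simplified stand-in class. `ConverterSketch` and the signatures of its methods are hypothetical and are not warc2zim's actual code; the point is only that the attribute is always defined and the lookup tolerates a missing entry.

```python
from typing import Dict, Optional


class ConverterSketch:
    """Hypothetical stand-in for a converter; not warc2zim's real class."""

    def __init__(self) -> None:
        # Always define the attribute, so later lookups cannot raise
        # AttributeError even when favicon detection is skipped.
        self.favicon_contents: Dict[str, bytes] = {}

    def detect_favicon(self, main_page_mime: str, found: Dict[str, bytes]) -> None:
        # Assumption: detection is only attempted for an HTML main page.
        mime = main_page_mime.split(";", 1)[0].strip().lower()
        if mime != "text/html":
            return
        self.favicon_contents.update(found)

    def retrieve_illustration(self, favicon_url: str) -> Optional[bytes]:
        # Return None instead of crashing when nothing was collected.
        return self.favicon_contents.get(favicon_url)
```

With this shape, a caller can fall back to a default illustration when `retrieve_illustration` returns `None`.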
I didn't know that. Looking at warcit, I thought it was implied that a WARC may contain non-HTML content too, but it's understandable if the scope of warc2zim is limited to websites.
I looked at libzim and it isn't limiting the mimetypes contained in the WARC.
Do not misunderstand my statement; your quote is too short. The correct one is below:
warc2zim is not meant for non-HTML website
In other words, a "website / thing" based on another kind of primary document (e.g. application/x-directory, though I don't really know what it means) will most probably not work in warc2zim (we will not properly rewrite links, etc.).
But just like warcit, warc2zim of course processes all kinds of resources a website uses (images, CSS, ... and HTML obviously).
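To make the distinction concrete: only the primary (seed) document needs to be HTML, while the resources it uses can be of any type. Here is a rough sketch of the "exit directly" option, assuming a hypothetical helper `ensure_html_seed` and a hypothetical `NonHtmlSeedError`; neither exists in warc2zim as far as this thread shows, and accepting `application/xhtml+xml` as HTML is also an assumption.

```python
# Assumption: these MIME types count as an HTML primary document.
HTML_MIME_TYPES = {"text/html", "application/xhtml+xml"}


class NonHtmlSeedError(Exception):
    """Hypothetical error raised when the seed document is not HTML."""


def ensure_html_seed(content_type: str) -> None:
    # Drop parameters such as "; charset=utf-8" before comparing.
    mime = content_type.split(";", 1)[0].strip().lower()
    if mime not in HTML_MIME_TYPES:
        raise NonHtmlSeedError(f"seed is not HTML (got {mime!r})")
```

Calling such a check once on the seed record, before any other processing, would stop the run with a clear message instead of a later AttributeError.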