Do not create ZIM if seed URL is not HTML #442

Open
rgaudin opened this issue Feb 17, 2025 · 3 comments
Labels
bug Something isn't working

@rgaudin
Member

rgaudin commented Feb 17, 2025

In this zimit run the seed URL is not HTML. The response is a 200 with content-type: application/x-directory and no payload. warc2zim detected this and continued processing, eventually failing on a missing attribute.

We should either harden this processing or exit immediately if the seed is not HTML.

[zimit::2025-02-14 22:25:15,583] INFO:Calling warc2zim with these args: ['--favicon=https://pixijs.com/images/logo.svg', '--name=pixijs.download_2b58a03e', '--zim-file=pixijs.download_2b58a03e.zim', '--publisher=openZIM', '--scraper-suffix', 'zimit 2.1.8', '--output', '/output', '--url', 'https://pixijs.download/release/docs/', '--description', 'PixiJS API Documentation', '-v', '--progress-file', '/output/warc2zim.json', '/output/.tmp3qz2ow61/collections/crawl-20250214222511797/archive']
[warc2zim::2025-02-14 22:25:15,589] DEBUG:Attempting to confirm output is writable in directory /output
[warc2zim::2025-02-14 22:25:15,590] DEBUG:Output is writable. Temporary file used for test: /output/tmpn43ov84l
[warc2zim::2025-02-14 22:25:15,591] DEBUG:Confirming ZIM file can be created using name: pixijs.download_2b58a03e.zim
[warc2zim::2025-02-14 22:25:15,592] DEBUG:1 WARC files found
[warc2zim::2025-02-14 22:25:15,598] WARNING:Main page is not an HTML Page, mime type is: application/x-directory - Skipping Favicon and Language detection
[warc2zim::2025-02-14 22:25:15,599] INFO:Expecting 1 ZIM entries to files
[warc2zim::2025-02-14 22:25:15,599] DEBUG:Preparing 0 redirections
[warc2zim::2025-02-14 22:25:15,599] DEBUG:0 redirections will be ignored
[warc2zim::2025-02-14 22:25:15,599] INFO:Expecting 1 ZIM entries including redirects
[warc2zim::2025-02-14 22:25:15,599] WARNING:No valid ZIM language, fallbacking to `eng`.
[zimit::2025-02-14 22:25:15,621] INFO:
[zimit::2025-02-14 22:25:15,621] INFO:
[zimit::2025-02-14 22:25:15,621] INFO:SIGINT/SIGTERM received, stopping zimit
[zimit::2025-02-14 22:25:15,621] INFO:
[zimit::2025-02-14 22:25:15,621] INFO:

Traceback (most recent call last):
  File "/usr/bin/zimit", line 8, in <module>
    sys.exit(zimit.zimit())
             ~~~~~~~~~~~^^
  File "/app/zimit/lib/python3.13/site-packages/zimit/zimit.py", line 688, in zimit
    sys.exit(run(sys.argv[1:]))
             ~~~^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.13/site-packages/zimit/zimit.py", line 609, in run
    return warc2zim(warc2zim_args)
  File "/app/zimit/lib/python3.13/site-packages/warc2zim/main.py", line 168, in main
    return converter.run()
           ~~~~~~~~~~~~~^^
  File "/app/zimit/lib/python3.13/site-packages/warc2zim/converter.py", line 307, in run
    self.retrieve_illustration()
    ~~~~~~~~~~~~~~~~~~~~~~~~~~^^
  File "/app/zimit/lib/python3.13/site-packages/warc2zim/converter.py", line 838, in retrieve_illustration
    favicon_url.value, self.favicon_contents[favicon_url]
                       ^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'Converter' object has no attribute 'favicon_contents'
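
The failure mode in the traceback above can be illustrated with a minimal sketch. This is not warc2zim's actual code: the class `SketchConverter`, its methods, and the example favicon URL are hypothetical, and the sketch assumes the attribute is only populated when the main page is HTML. It shows the "harden" option: define the attribute unconditionally so that later steps see an empty mapping instead of raising `AttributeError`.

```python
from typing import Dict, Optional


class SketchConverter:
    """Hypothetical stand-in for a converter; not warc2zim's real class."""

    def __init__(self, main_page_mime: str) -> None:
        self.main_page_mime = main_page_mime
        # Hardening: always define the attribute, even if detection is skipped.
        self.favicon_contents: Dict[str, bytes] = {}

    def detect_favicon(self) -> None:
        # Assumption: favicon detection only runs for an HTML main page.
        if self.main_page_mime != "text/html":
            return
        # Placeholder content standing in for a downloaded favicon.
        self.favicon_contents["https://example.com/favicon.ico"] = b"\x00"

    def retrieve_illustration(self, favicon_url: str) -> Optional[bytes]:
        # .get() returns None for an unknown URL instead of raising.
        return self.favicon_contents.get(favicon_url)
```

With this shape, a non-HTML main page leads to a missing illustration that the caller can handle explicitly, rather than a crash late in the run.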

rgaudin added the bug (Something isn't working) and question (Further information is requested) labels Feb 17, 2025
benoit74 removed the question (Further information is requested) label Feb 17, 2025
benoit74 added this to the 2.3.0 milestone Feb 17, 2025
@benoit74
Collaborator

I don't think there is much room for discussion: warc2zim is not meant for non-HTML websites, and neither are the ZIM format or its readers.

benoit74 changed the title from "Should we create ZIM if seed URL is not HTML?" to "Do not create ZIM if seed URL is not HTML" Feb 17, 2025
@wsdookadr
Contributor

wsdookadr commented Feb 21, 2025

warc2zim is not meant for non-HTML

I didn't know that. Looking at warcit, I thought it was implied that a WARC may contain non-HTML too, but it's understandable if the scope of warc2zim is limited to websites.

I looked at libzim and it isn't limiting the mimetypes contained in the WARC.

Btw do PDFs fall under non-HTML too?

@benoit74
Collaborator

Do not misunderstand my statement; your quote is too short. The correct one is below:

warc2zim is not meant for non-HTML website

In other words, a "website / thing" based on another kind of primary document (e.g. application/x-directory, even though I don't really know what that means) will most probably not work in warc2zim (we will not properly rewrite links, etc.).

But just like warcit, warc2zim of course processes all kinds of resources a website uses (images, CSS, ... and HTML obviously).

PDFs as well, of course; see https://library.kiwix.org/#lang=eng&q=survivor. But that website is the perfect example: it works because the primary page is HTML.
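
The distinction drawn in this thread (only the primary document must be HTML, while other resources may have any MIME type) could be checked up front with something like the sketch below. It is only an illustration of the "exit directly" option, not how warc2zim implements it: the names `HTML_MIME_TYPES`, `NonHtmlSeedError`, and `ensure_html_seed` are hypothetical, and treating only `text/html` as HTML is an assumption.

```python
# Assumption: only text/html counts as an HTML seed; extend the set if needed.
HTML_MIME_TYPES = {"text/html"}


class NonHtmlSeedError(Exception):
    """Hypothetical error raised when the seed URL is not an HTML document."""


def mime_type_of(content_type_header: str) -> str:
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    return content_type_header.split(";", 1)[0].strip().lower()


def ensure_html_seed(seed_url: str, content_type_header: str) -> str:
    """Return the seed's MIME type, or raise if the seed is not HTML."""
    mime = mime_type_of(content_type_header)
    if mime not in HTML_MIME_TYPES:
        raise NonHtmlSeedError(
            f"Seed {seed_url} has MIME type {mime or 'unknown'}, not HTML; "
            "refusing to create a ZIM"
        )
    return mime
```

The check applies to the seed only; images, CSS, PDFs, and other resources referenced by the site would not go through it.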
