Enhance ZIM progress computation #440

Open
wsdookadr opened this issue Feb 15, 2025 · 1 comment


@wsdookadr
Contributor

wsdookadr commented Feb 15, 2025

So I was looking at the update_stats method, because I use --progress-file to estimate how long it will take to build a ZIM:

def update_stats(self):
    """write progress as JSON to self.stats_filename if requested"""
    if not self.stats_filename:
        return
    self.written_records += 1
    with open(self.stats_filename, "w") as fh:
        json.dump(
            {"written": self.written_records, "total": self.total_records}, fh
        )

But when I look at the actual progress file, I see:

user@dcrawl-1:~$ tail -f progress.txt
{"written": 911841, "total": 911841}tail: progress.txt: file truncated
{"written": 911842, "total": 911842}tail: progress.txt: file truncated
{"written": 911843, "total": 911843}tail: progress.txt: file truncated
{"written": 911844, "total": 911844}tail: progress.txt: file truncated
[..]

The "written" value is always equal to the "total" value. I was expecting "total" to be a grand total of the number of WARC records (across all input WARC files) and "written" to be how many of them have been written to the ZIM so far.
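To illustrate why the distinction matters for time estimates, here is a minimal sketch of how a consumer of the progress file could extrapolate the remaining time. It assumes the expected semantics described above (a fixed "total" and a growing "written"), which is not what warc2zim 2.2.1 produces. The function name `estimate_remaining` and the `elapsed_seconds` parameter are hypothetical and not part of warc2zim.

```python
import json
from typing import Optional


def estimate_remaining(progress_json: str, elapsed_seconds: float) -> Optional[float]:
    """Estimate the seconds left from a single progress snapshot.

    Assumes "total" is a fixed grand total and "written" counts the records
    handled so far (the expected meaning, not the 2.2.1 behaviour).
    """
    progress = json.loads(progress_json)
    written, total = progress["written"], progress["total"]
    if written <= 0 or total <= 0 or elapsed_seconds <= 0:
        return None  # nothing to extrapolate from yet
    # Linear extrapolation: remaining records at the average pace so far.
    return elapsed_seconds * (total - written) / written
```

With the current output, where "written" always equals "total", this estimate is always zero, so the progress file cannot be used this way.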

When I looked at the code, it seems that total_records is updated every time a new WARC record is written, and so is written_records:

self.total_records += 1

self.written_records += 1

Shouldn't total_records be computed in gather_information_from_warc (the first pass) and stay constant throughout add_items_for_warc_record? I'm looking for feedback on the above. Thanks!
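A minimal sketch of the two-pass idea, to make the proposal concrete. The class `ProgressCounter` and its methods `first_pass` and `second_pass` are hypothetical stand-ins, not the actual warc2zim code, and a plain list of strings stands in for the WARC records.

```python
class ProgressCounter:
    """Hypothetical stand-in for the converter's two counters."""

    def __init__(self):
        self.total_records = 0
        self.written_records = 0

    def first_pass(self, records):
        # Count every record once, before anything is written.
        self.total_records = sum(1 for _ in records)

    def second_pass(self, records):
        # Only the written counter moves here; the total stays fixed.
        for _ in records:
            self.written_records += 1
            yield {"written": self.written_records, "total": self.total_records}


records = ["rec-a", "rec-b", "rec-c"]  # placeholder for real WARC records
counter = ProgressCounter()
counter.first_pass(records)
snapshots = list(counter.second_pass(records))
```

Each snapshot then reports a growing "written" against a constant "total", instead of two values that move in lockstep.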

Versions used:

  • warc2zim 2.2.1
@wsdookadr changed the title from "Enhance ZIM creation progress tracking" to "Enhance ZIM progress computation" on Feb 15, 2025
@benoit74
Collaborator

It is indeed pretty dumb. I don't know how we managed to do this; we must have missed something, or this is the consequence of a code move.

total_records should indeed directly be the total number of items in the WARC. And written_records should become processed_records, which is incremented systematically after each WARC item is processed, even if the item is ignored for whatever reason. This is probably the most reliable approach, since it avoids having to deal with all the corner cases that lead us to not add a WARC record to the ZIM.
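One possible reading of this suggestion, as a sketch rather than the actual fix: the counter is bumped after every record, whether or not the record ends up in the ZIM. The function `convert`, the predicate `should_add`, and the choice to keep the "written" key in the JSON are all assumptions made for illustration.

```python
import json
import os
import tempfile


def convert(records, should_add, stats_filename):
    """Process every record and report progress, even for skipped ones.

    `should_add` is a hypothetical predicate standing in for whatever
    reasons lead a record to be left out of the ZIM.
    """
    total_records = len(records)  # known before processing starts
    processed_records = 0
    added = []
    for record in records:
        if should_add(record):
            added.append(record)
        # Count the record as processed whether or not it was added.
        processed_records += 1
        with open(stats_filename, "w") as fh:
            json.dump({"written": processed_records, "total": total_records}, fh)
    return added


with tempfile.TemporaryDirectory() as tmp:
    stats_path = os.path.join(tmp, "progress.json")
    added = convert(
        ["a", "skip-me", "b"], lambda r: not r.startswith("skip"), stats_path
    )
    with open(stats_path) as fh:
        final = json.load(fh)
```

Because skipped records still advance the counter, the reported value reaches the total at the end of the run even when some records never make it into the ZIM.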

PR welcome!
