It is indeed pretty dumb. I don't know how we managed to do this; we must have missed something, or it is a consequence of a code move.
total_records should indeed be the total number of items in the WARC. And written_records should become processed_records, which is incremented systematically after each WARC item is processed, even if the item is ignored for whatever reason. This is probably the most reliable approach, since it avoids having to deal with all the corner cases that lead us not to add a WARC record to the ZIM.
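To make the proposal concrete, here is a minimal sketch of the counting scheme described above. It is not the actual warc2zim code: ProgressStats, convert and should_add are hypothetical names, the records are plain strings standing in for WARC records, and the snapshot shape only assumes the "written" and "total" keys mentioned in this issue.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple


@dataclass
class ProgressStats:
    """Hypothetical counters, not the real warc2zim attributes."""

    total_records: int = 0
    processed_records: int = 0

    def snapshot(self) -> Dict[str, int]:
        # Assumed progress-file shape, reusing the "written"/"total" keys.
        return {"written": self.processed_records, "total": self.total_records}


def convert(
    records: Sequence[str], should_add: Callable[[str], bool]
) -> Tuple[ProgressStats, List[Dict[str, int]]]:
    stats = ProgressStats()
    snapshots: List[Dict[str, int]] = []

    # First pass: count every record once. The total stays constant afterwards.
    stats.total_records = sum(1 for _ in records)

    # Second pass: bump the processed counter for every record,
    # whether or not it ends up in the output.
    for record in records:
        if should_add(record):
            pass  # a real converter would add the item to the ZIM here
        stats.processed_records += 1
        snapshots.append(stats.snapshot())

    return stats, snapshots
```

With this scheme the two counters only become equal once the last record has been handled, so the ratio between them can be used as a progress estimate.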
So I was looking at the update_stats method because I use --progress-file so I can know how much time it will take to build a ZIM:
warc2zim/src/warc2zim/converter.py
Lines 240 to 248 in 62d3fe5
but when I look in the actual progress file I see:
The "written" key is always equal to the "total" key. I was thinking that "total" should be a grand total counting the number of WARC records (across all input warc files) and "written" would be how many were written to the ZIM so far.
When I looked at the code it seems like total_records is updated every time a new WARC record is written, and so is written_records:
warc2zim/src/warc2zim/converter.py
Line 991 in 62d3fe5
warc2zim/src/warc2zim/converter.py
Line 244 in 62d3fe5
Shouldn't total_records be computed in gather_information_from_warc (the first pass) and stay constant all throughout add_items_for_warc_record? I'm looking for feedback on the above. Thanks!
Versions used: