Automatically redirect to articles with same checksum #86
Conversation
Codecov Report
@@            Coverage Diff            @@
##            master       #86   +/-   ##
=========================================
  Coverage   100.00%   100.00%
=========================================
  Files           31        31
  Lines         1159      1186   +27
=========================================
+ Hits          1159      1186   +27
Continue to review full report at Codecov.
@rgaudin Could you please review this code?
Thank you @benoit74, and sorry for the delay in responding to this PR.
I'll write my comments here as they are mostly general ones:
- After giving it some thought, I think this belongs in a separate module and not within the Creator:
  - The Creator is a central piece of every scraper and we should be conservative about adding features to it.
  - Creator inherits from pylibzim, which is mostly a C++ wrapper with its own challenges. Keeping it somewhat isolated is crucial for debugging/maintenance.
  - By nature, this code is meant to consume a growing amount of memory. Putting it inside the Creator makes debugging memory leaks harder.
  - All the ZIM content passes through the Creator, using moving parts such as ContentProviders.
  - The Creator is expected to live for the full lifespan of the process, and Python doesn't release memory back to the system while running.
  - Your approach is probably good for most scrapers, but some super large ones may require an out-of-Python version as well.
- I'd suggest we use md5 instead of sha256:
  - md5 is faster. This might have a significant impact if numerous large files are passed through it.
  - we don't need the security.
  - the digest is half the size, so half the memory consumption.
  - note that Codefactor will complain, so you'll probably need to use `# nosec` (see the sketch after this list).
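
For illustration, a minimal sketch of an md5-based digest helper with the `# nosec` annotation (the helper name `digest_for` is made up for this example and is not part of scraperlib):

```python
import hashlib


def digest_for(content: bytes) -> bytes:
    """Return a short digest used only for deduplication, not for security."""
    # md5 is fast and its 16-byte digest halves the memory cost compared to sha256;
    # the "# nosec" comment tells Bandit/Codefactor this is not a security use.
    return hashlib.md5(content).digest()  # nosec
```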
Could you move your code to a new `cache` module? You may want to offer the feature as an independent tool, a Creator subclass or a Mixin, depending on whether you want to keep the single-step addition or not.
Let me know if this makes sense or not. A docstring on those two methods might help as well.
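
As an illustration of the "separate module" idea, here is a minimal, hypothetical sketch (the class and method names are invented for this example and are not part of scraperlib):

```python
import hashlib
from typing import Dict, Optional


class DedupCache:
    """Map content digests to the path of the first item seen with that content."""

    def __init__(self) -> None:
        self._paths_by_digest: Dict[bytes, str] = {}

    def existing_path_or_record(self, path: str, content: bytes) -> Optional[str]:
        """Return the path of an identical item already recorded, or record this one."""
        digest = hashlib.md5(content).digest()  # nosec - dedup only, not security
        existing = self._paths_by_digest.get(digest)
        if existing is None:
            self._paths_by_digest[digest] = path
        return existing
```

A scraper (or a Creator subclass/Mixin) could then call `existing_path_or_record()` before adding an item, and add a redirect whenever it returns a path.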
@benoit74 if you have reasons to believe that md5 will generate many exceptions (@kelson42 tells me you mentioned it somewhere), you may take a look at http://cyan4973.github.io/xxHash/ (from openzim/libzim#614)
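
For reference, a minimal example of the xxHash approach mentioned above, assuming the third-party `xxhash` Python package is available (`pip install xxhash`):

```python
import xxhash

# xxh64 is a very fast non-cryptographic hash with an 8-byte digest;
# suitable for deduplication, not for security.
digest = xxhash.xxh64(b"some article content").digest()
print(len(digest))  # 8 bytes, i.e. half the size of an md5 digest
```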
Hi @rgaudin, thank you for the review. To be honest, I'm embarrassed about it. First because I'm quite sure I don't have the competencies / time to make the requested modifications, and second because I'm not sure I really agree with your arguments / PoV. Given the combination of these two issues, I won't start a detailed discussion about it here. In addition, since you will be the one(s) supporting the long-term maintenance of this library, I would be more than embarrassed to "force" you to do something that could prove to be wrong / costly without bearing the consequences myself. I will hence close this PR and live with my own non-standard implementation in my scraper. No worries anyway, I'm not upset at all; I just have my own agenda / priorities and you have yours, which is quite normal.
@benoit74, I'm glad you're not upset; that was definitely not my intent, but I am a bit disappointed myself. Not being willing to make a change, or not having time to, is 100% OK. You're volunteering time on this project and we're thankful for it. Yet, not sharing your disagreements is meh (can't find a better word). We're all doing this in the open so we can all benefit from others' points of view and experience. We may find time to implement this at some point, or another volunteer could, or maybe my assumptions are incorrect and the PR should be merged… but we'd need to know why we're wrong, or why you think we are. I do know how time- and resource-consuming it can be to argue or demonstrate a point. You think it's not worth it, and that's probably true from your POV, but that's what collaboration is about. We understand that time is precious, which is why we're OK with pointers, approximations or hunches; they're all over our repositories. Let us know what you think is wrong and we'll have a chance to think about / research it. I'm not saying we'll do it either way, but criticism is a chance to improve. Thanks again for your contribution. 🙏
Fix #33
This is clearly a first draft / proposal, open to review / fixes.
This change adds a method `add_autodedup_filter` to specify which paths should be checked for duplicates (checksums are only computed for these paths, and duplicates are only looked up among them, in order to limit the memory footprint).

This change modifies the method `add_item_for` to first compute the SHA256 checksum (to be sure to avoid collisions) of `content`, or of the content at `fpath`; if a duplicate is found, an `add_redirect` is done instead; if no duplicate is found, the checksum is added to a dictionary of checksum => item_path.
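
A rough sketch of the flow described above (assuming a scraperlib-like Creator exposing `add_item_for` and `add_redirect`; actual signatures may differ, and filter handling is omitted for brevity):

```python
import hashlib
from typing import Dict

# checksum => path of the first item added with that content
checksums: Dict[bytes, str] = {}


def add_item_or_redirect(creator, path: str, title: str, content: bytes) -> None:
    """Add content to the ZIM, or add a redirect if identical content was already added."""
    digest = hashlib.sha256(content).digest()
    existing_path = checksums.get(digest)
    if existing_path is not None:
        # duplicate content: point to the first occurrence instead of storing it twice
        creator.add_redirect(path, existing_path)
        return
    checksums[digest] = path
    creator.add_item_for(path=path, title=title, content=content)
```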