Fix for T274511 #22
Conversation
Hello @pamputt, you are requesting a code review, right? I read on Phabricator:
So (1) the code of this PR is already tested live on Kurdish, right? That is the thing I am looking forward to with such a tool. Did you anticipate side effects? For French, when there are 20+ recordings for a word, doesn't this risk force-injecting the "missing audios"? Or does the language-specific Python script set limits for those? (Note: my understanding of the bot is low, so I need to ask basic questions; I am doing my best to still provide a review and think about everything.) I would recommend making a short test run with French, our largest language, to check if something goes wrong. A few dozen or a few hundred words? Is that possible?
Yes, it is tested only on kuwiktionary for now.
Actually this is a script, so the idea is just to run it manually. It is not part of Lingua Libre Bot itself; it is a tool that makes using Lingua Libre Bot easier.
The maximum number of recordings is managed wiki by wiki, depending on each wiki's rules (frwiktionary may have a different policy than kuwiktionary). So no side effects are expected.
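(For readers unfamiliar with the bot, such a per-wiki limit could look roughly like the sketch below. All names and values are hypothetical illustrations; the real rules live in each wiki-specific module of Lingua Libre Bot.)

```python
# Hypothetical sketch of per-wiki recording limits; not the bot's actual config.
MAX_RECORDINGS_PER_WORD = {
    "frwiktionary": 3,     # assumed example value, not the real frwiktionary rule
    "kuwiktionary": None,  # None = no limit
}

def can_add_recording(wiki: str, existing_count: int) -> bool:
    """Return True if the wiki's policy still allows adding one more recording."""
    limit = MAX_RECORDINGS_PER_WORD.get(wiki)
    return limit is None or existing_count < limit
```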
Once I have finished on kuwiktionary and checked that everything looks fine, I will run the script on all the wikis together to add all the missing pronunciations that were not added during the bot outage.
More or less. First, I wanted to host the code somewhere visible to everyone, i.e. here. Thus, we can discuss it and report any issue that is detected. In the end, we may decide to merge it if we think it is useful.
So it's a convenience script built around Lingua Libre Bot, a part of the toolkit, put in the same repository to occasionally serve a specific purpose. Seems pretty cool! You lead the way. I can't review the Python code much, but here is my bit of "reviewer" advice. On code:
Around it:
The final run may involve 100,000 edits. I guess you plan to closely monitor the first edits of each language?
Also, if it is run by hand and has no expected side effect, we can merge into master when you feel ready, within this week. ✌🏼
Hello @pamputt, I corrected a minor typo per the commit above. Second, I'm not sure… Third, the conditional switch between large languages requiring time-fragmented SPARQL calls and other languages requiring a single SPARQL call seems superfluous. Let's time-fragment them all; it only makes our code more resilient. Fourth, IMHO and optionally, this time-fragmenting code could be moved to llbot.py. Fifth, since we now have avenues to provide language + number of records (see hugolpz/Sparql2Data#4), we could add a parameter and conditional switch to llbot.py.
The 5-second sleep is there because otherwise the Lingua Libre SPARQL endpoint sends a 429 error. It happens when I query the list of all recordings; I did not get this error elsewhere. The sleep could probably be smaller (2-3 seconds), but this is not a big deal. About your third point, why not; it is just a bit slower because of this 5-second sleep between every call. About the fourth point, I do not think it is a good idea to put that in llbot.py (more precisely in lili.py) because we cannot know what will happen in the future. For example, if the Lingua Libre SPARQL server becomes more robust, it might no longer be required to split the request. And how do we know the right granularity for the splitting? Is 6 months correct, too small, too large? It may not fit if 100k recordings are made within one 6-month period.
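(To make the discussion concrete, here is a minimal sketch of what time-fragmented querying with a pause between calls could look like. The endpoint URL, the helper name, and the `{lower}`/`{upper}` placeholders are assumptions for illustration, not the script's actual code.)

```python
import time
from datetime import date, timedelta

import requests  # assumed HTTP layer; the bot's own wrapper would work too

# Endpoint URL is an assumption for illustration.
ENDPOINT = "https://lingualibre.org/bigdata/namespace/wdq/sparql"

def query_by_chunks(query_template: str, start: date, end: date,
                    chunk_days: int = 180, pause: float = 5.0):
    """Run one SPARQL query per date window, sleeping between calls
    so the endpoint does not answer with HTTP 429 (too many requests)."""
    results = []
    lower = start
    while lower < end:
        upper = min(lower + timedelta(days=chunk_days), end)
        # query_template is assumed to contain {lower}/{upper} date placeholders.
        query = query_template.format(lower=lower.isoformat(),
                                      upper=upper.isoformat())
        r = requests.get(ENDPOINT, params={"query": query, "format": "json"})
        r.raise_for_status()
        results.extend(r.json()["results"]["bindings"])
        lower = upper
        time.sleep(pause)  # the "sleep 5" discussed above
    return results
```

A call such as `query_by_chunks(template, date(2021, 1, 1), date.today())` would then replace one single large query.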
OK for me on most points. Your lead. 👍🏼
Your observation encourages splitting more, not less. Current trends suggest our Blazegraph is getting slower as the number of records per language grows. Python is doing the math, so there is no cost to splitting all languages into 3-month chunks, whereas not splitting will result in a crash when a new language suddenly becomes large and we have no developer to hard-code it into the list of exceptions.
Yes, we could split more, but what is the right granularity (every year, every month, every week, every day, ...)? We cannot guess it, and that is why it is not a good idea to hard-code it in lili.py, as that would not let the user adapt the splitting to their needs. About your last sentence: it means that currently the "sleep 5" slows down the script if we do too many sleeps. But as I already said, this is not a big deal; we can implement the splitting for all languages even if I do not see any advantage to doing it.
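(A possible compromise between the two positions would be to expose the granularity as an option with a sensible default, so nothing is hard-coded per language. The flags below are hypothetical, not existing llbot.py options.)

```python
import argparse

# Sketch: expose the chunk size and the pause instead of hard-coding them.
parser = argparse.ArgumentParser()
parser.add_argument("--chunk-days", type=int, default=90,
                    help="size of each SPARQL date window, in days")
parser.add_argument("--pause", type=float, default=5.0,
                    help="seconds to sleep between SPARQL calls")
args = parser.parse_args()

# The hypothetical helper sketched earlier would then be called as:
# query_by_chunks(template, start, end,
#                 chunk_days=args.chunk_days, pause=args.pause)
```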
I currently do my best to explore and learn how this Python bot distributes tasks and works. This allows me to raise those previous points to you, but I'm still blind and not confident enough in my understanding to do more. You considered those points, so we are good to go. We follow your lead, Pamputt 🌻 👍🏼
Agree with the code. I'll let Pamputt do the merge.