Timeout: The file lock 'some/path/to/8738.tldextract.json.lock' could not be acquired #254
It's fetching and saving the latest version of the top-level domain list. The lock prevents multiple threads and processes from needlessly requesting the data and then contending as they write it to the same location. The timeout is currently set to 20 seconds (see `tldextract/cache.py`, line 78, at commit 40205f6).
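The pattern the maintainer describes — take an exclusive file lock, re-check the cache, and only then fetch and write — can be sketched with the standard library alone. This is an illustrative simplification, not tldextract's actual `DiskCache` code; the function and parameter names here are made up for the sketch.

```python
import json
import os
import time


def locked_fetch(cache_path, fetch, timeout=20.0):
    """Illustrative sketch: serialize fetch+write behind a lock file so
    concurrent workers don't all re-download and then clobber the same
    cache file. Not tldextract's real API."""
    lock_path = cache_path + ".lock"
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_CREAT | O_EXCL makes creation atomic: exactly one
            # process succeeds; the rest retry until the timeout.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() > deadline:
                raise TimeoutError(f"could not acquire {lock_path}")
            time.sleep(0.05)
    try:
        # Double-check inside the lock: another worker may have already
        # written the cache while we were waiting.
        if os.path.exists(cache_path):
            with open(cache_path) as f:
                return json.load(f)
        data = fetch()
        with open(cache_path, "w") as f:
            json.dump(data, f)
        return data
    finally:
        os.close(fd)
        os.remove(lock_path)
```

With this shape, only the first caller pays for the fetch; everyone else reads the cached file. The reported `Timeout` corresponds to a worker that waited out the full 20 seconds without the lock file ever going away.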
I'd suggest either disabling the list update, or doing it beforehand and then disabling it. See the readme for details.
I did look at it, but it's not too clear to me whether…

I don't feel a need for any custom path, so in which order would you run them? If I understand correctly, I need to fetch some random domain using the first instance (like google.com) first, and then use the second instance for my parallel task. Is that it? (Or would it be enough to simply create the object to perform the fetch?)
I have trouble cobbling together when this project fetches and when it caches too. And I wrote much of the project. 😅 The docs could be improved.
Close. You only need the first instance. After you call the instance once (with, say, google.com), the list is fetched and cached, and future calls to that same instance should have no contention.
Fetching updated TLD lists is already disabled, so the default TLD list that is included with the library will always be used and there is nothing to cache. Explicitly disabling the cache prevents possible occurrences of john-kurkowski/tldextract#254.
In a distributed setting, different workers still need to re-initialize their own instances, so I think this won't work in that case? Should we just put the dry run in the source code?
The discussion on #339 got me thinking about this issue. Does your distributed setting pickle Python objects to send to workers? If so, after a dry-run extraction, a large…
Can you say more about how this is an issue in your setting? Did you find a workaround? To officially support the dry run, I'd consider a new…
I start getting this error when I increase the number of processes/threads past a certain point.
Is there a way to increase the timeout value?
More importantly, why is a lock needed here if tldextract isn't writing anything, only reading?