
Timeout: The file lock 'some/path/to/8738.tldextract.json.lock' could not be acquired #254

Open
rudolfovic opened this issue Mar 7, 2022 · 5 comments

Comments

@rudolfovic

I start getting this error once I increase the number of processes/threads beyond a certain point.

Is there a way to increase the timeout value?

More importantly, why is a lock needed here if tldextract isn't writing anything, only reading?

@brycedrennan
Collaborator

It's fetching and saving the latest version of the top-level domains list.

The lock prevents multiple threads and processes from needlessly requesting the data and then contending as they each write it to the same location.
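To illustrate, here is a minimal stdlib sketch of that pattern (not tldextract's actual implementation, which uses the filelock library): acquire an exclusive lock, re-check the cache, and fetch/write only if the file is still missing, so concurrent processes do the download at most once.

```python
import os

def get_cached(cache_path, fetch):
    """Return the cached file's contents, fetching and writing at most once."""
    lock_path = cache_path + ".lock"
    while True:
        try:
            # O_CREAT | O_EXCL makes lock-file creation atomic: whichever
            # process creates the file holds the lock. (tldextract's filelock
            # dependency adds the retry/timeout seen in the traceback above.)
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            pass  # lock held elsewhere; a real implementation sleeps/retries
    try:
        if not os.path.exists(cache_path):   # re-check under the lock
            with open(cache_path, "w") as f:
                f.write(fetch())             # only one process fetches/writes
        with open(cache_path) as f:
            return f.read()
    finally:
        os.close(fd)
        os.remove(lock_path)
```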

The timeout is currently set to 20 seconds:

def __init__(self, cache_dir, lock_timeout=20):

I'd suggest either disabling the list update, or doing it once beforehand and then disabling it. See the README for details.

@rudolfovic
Author

I did look at it, but it's not clear to me whether cache_dir=False disables writing to the cache (downloading new info) or reading from the cache (i.e. always fetching directly from the internet) in these examples:

# extract callable that reads/writes the updated TLD set to a different path
custom_cache_extract = tldextract.TLDExtract(cache_dir='/path/to/your/cache/')
custom_cache_extract('http://www.google.com')

# extract callable that doesn't use caching
no_cache_extract = tldextract.TLDExtract(cache_dir=False)
no_cache_extract('http://www.google.com')

I don't need a custom path, so in which order would you run tldextract.TLDExtract() and tldextract.TLDExtract(cache_dir=False)?

If I understand correctly, I need to fetch some random domain using the first instance (like google.com) first and then use the second instance for my parallel task. Is that it? (or would it be enough to simply create the object to perform the fetch?)

@john-kurkowski
Owner

I have trouble piecing together when this project fetches and when it caches, too. And I wrote much of the project. 😅 The docs could be improved.

If I understand correctly, I need to fetch some random domain using the first instance (like google.com) first and then use the second instance for my parallel task. Is that it? (or would it be enough to simply create the object to perform the fetch?)

Close. You only need the first instance. After you call the instance once (with, say, google.com), the list is fetched and cached, and future calls to that same instance should have no contention.

mpkuth added a commit to mpkuth/pypac that referenced this issue Nov 9, 2022
Fetching updated TLD lists is already disabled, so the default TLD list
that is included with the library will always be used and there is
nothing to cache. Explicitly disabling the cache prevents possible
occurrences of john-kurkowski/tldextract#254.
beikejinmiao added a commit to beikejinmiao/HawkSec that referenced this issue Apr 8, 2023
@jordane95

jordane95 commented Apr 3, 2024

I have trouble piecing together when this project fetches and when it caches, too. And I wrote much of the project. 😅 The docs could be improved.

If I understand correctly, I need to fetch some random domain using the first instance (like google.com) first and then use the second instance for my parallel task. Is that it? (or would it be enough to simply create the object to perform the fetch?)

Close. You only need the first instance. After you call the instance once (with, say, google.com), the list is fetched and cached, and future calls to that same instance should have no contention.

In a distributed setting, different workers still need to initialize their own instances, so I think this won't work there? Should we just put the dry run in the source code?

@john-kurkowski
Owner

In a distributed setting, different workers still need to initialize their own instances, so I think this won't work there?

The discussion on #339 got me thinking about this issue. Does your distributed setting pickle Python objects to send to workers? If so, after a dry-run extraction, a large TLDExtract instance, with all its supporting data, would be sent over the wire with its suffix list already loaded. Workers would then unpickle it: no reinitialization, no call to __init__, no blocking on file locks.

Should we just put the dry run in the source code?

Can you say more how this is an issue in your setting? Did you find a workaround?

To officially support the dry-run, I'd consider a new TLDExtract constructor argument, to avoid lazy loading the cache, something like eager=True.
