Timeout: The file lock 'some/path/to/8738.tldextract.json.lock' could not be acquired #254
It's fetching and saving the latest version of the top-level domain list. The lock prevents multiple threads and processes from needlessly requesting the data and then contending as they write it to the same location. The timeout is currently set to 20 seconds (see `tldextract/cache.py`, line 78, at commit 40205f6).
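The pattern the maintainer describes — take an exclusive file lock, re-check the cache, and only then fetch and write — can be sketched with the standard library alone. This is an illustrative simplification, not tldextract's actual `DiskCache` code; the function and parameter names here are made up for the sketch.

```python
import json
import os
import time


def locked_fetch(cache_path, fetch, timeout=20.0):
    """Illustrative sketch: serialize fetch+write behind a lock file so
    concurrent workers don't all re-download and then clobber the same
    cache file. Not tldextract's real API."""
    lock_path = cache_path + ".lock"
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_CREAT | O_EXCL makes creation atomic: exactly one
            # process succeeds; the rest retry until the timeout.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() > deadline:
                raise TimeoutError(f"could not acquire {lock_path}")
            time.sleep(0.05)
    try:
        # Double-check inside the lock: another worker may have already
        # written the cache while we were waiting.
        if os.path.exists(cache_path):
            with open(cache_path) as f:
                return json.load(f)
        data = fetch()
        with open(cache_path, "w") as f:
            json.dump(data, f)
        return data
    finally:
        os.close(fd)
        os.remove(lock_path)
```

With this shape, only the first caller pays for the fetch; everyone else reads the cached file. The reported `Timeout` corresponds to a worker that waited out the full 20 seconds without the lock file ever going away.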
I'd suggest either disabling the list update, or doing it beforehand and then disabling it. See the readme for details.
I did look at it, but it's not too clear to me whether…

I don't feel a need for any custom path, so in which order would you run them? If I understand correctly, I need to fetch some random domain using the first instance (like google.com) first, and then use the second instance for my parallel task. Is that it? (Or would it be enough to simply create the object to perform the fetch?)
I have trouble cobbling together when this project fetches and when it caches too. And I wrote much of the project. 😅 The docs could be improved.
Close. You only need the first instance. After you call the instance once (with, say, google.com), the list is fetched and cached, and future calls to that same instance should have no contention.
Fetching updated TLD lists is already disabled, so the default TLD list that is included with the library will always be used and there is nothing to cache. Explicitly disabling the cache prevents possible occurrences of john-kurkowski/tldextract#254.
In a distributed setting, different workers still need to re-initialize their own instances, so I think this won't work in that case? Should we just put the dry run in the source code?
The discussion on #339 got me thinking about this issue. Does your distributed setting pickle Python objects to send to workers? If so, after a dry-run extraction, a large…
Can you say more about how this is an issue in your setting? Did you find a workaround? To officially support the dry run, I'd consider a new…
I start getting this error when I increase the number of processes/threads past a certain point.
Is there a way to increase the timeout value?
More importantly, why is a lock needed here if tldextract isn't writing anything, only reading?