
7280 alternative fix: wrap existing rmtree to retry longer only for windows access errors. #7363

Closed
wants to merge 11 commits

Conversation


@choyrim choyrim commented Nov 16, 2019

Another approach to fixing #7280: this variation wraps the existing rmtree with retries. However, the wrapper only retries if the error raised by shutil.rmtree is a Windows access error.

Member

@chrahunt chrahunt left a comment


I think that something targeted like this is a better approach. Thanks for trying it out! 👍

We can clean things up a bit with the retry wrapper's retry_on_exception argument. We can also define this version of rmtree (with that argument) on Windows only, which should alleviate some concerns about changing behavior on all platforms. e.g.

def rmtree_internal(...):
    ...

if sys.platform == "win32":
    @retry(stop_max_attempt_number=3, retry_on_exception=is_windows_permission_error)
    @retry(...)  # same as before
    def rmtree(...):
        rmtree_internal(...)
else:
    @retry(...)  # same as before
    def rmtree(...):
        rmtree_internal(...)

For tests, I would add a test directly of is_win_access_error (with a real error generated on Windows using a sub-process to hold a file open), and then Windows- and Unix-specific tests that take a similar approach as before, checking the duration. For right now I think we can do a straightforward update to the tests, even though it increases the duration slightly (since the overall increase as a percentage is relatively small), and then make a note in #4497 to take a better approach in a followup (e.g. what I described in #7361 (comment)).

@choyrim
Author

choyrim commented Nov 16, 2019

wow thanks for the advice. I'll try those approaches out later.

@choyrim
Author

choyrim commented Nov 17, 2019

@chrahunt I'm trying out the changes you suggested and they work. However, the code feels a bit redundant. Also, there isn't a good place to log a warning that rmtree was unable to delete files.

if sys.platform == "win32":
    def is_win_access_error(e):
        """returns True if the OSError is a Windows access error which indicates
        that another process is preventing the object from being deleted."""
        return (
            isinstance(e, OSError) and
            e.errno == errno.EACCES and
            e.winerror == 5
            )

    @retry(stop_max_attempt_number=5, retry_on_exception=is_win_access_error)
    @retry(stop_max_delay=3000, wait_fixed=500)
    def rmtree(dir, ignore_errors=False):
        rmtree_internal(dir, ignore_errors)
else:
    @retry(stop_max_delay=3000, wait_fixed=500)
    def rmtree(dir, ignore_errors=False):
        rmtree_internal(dir, ignore_errors)

@chrahunt
Member

For the code duplication, you can do something like

@retry(stop_max_delay=3000, wait_fixed=500)
def rmtree_internal(dir, ignore_errors=False):
    ...

Then

if sys.platform == "win32":
    def is_win_access_error(e):
        """returns True if the OSError is a Windows access error which indicates
        that another process is preventing the object from being deleted."""
        return (
            isinstance(e, OSError) and
            e.errno == errno.EACCES and
            e.winerror == 5
            )

    rmtree = retry(
        stop_max_attempt_number=5, retry_on_exception=is_win_access_error
    )(rmtree_internal)
else:
    rmtree = rmtree_internal

but I would not be surprised if that causes mypy to complain. Better to have the very minor duplication like you showed than a bunch of # type: ignore comments, in my opinion. Also, we may want to change the retry interval on one platform or the other in the future and if we de-duplicate then it will need a bigger change.

For logging the failure, it isn't pretty, but you could wrap is_win_access_error in another function that logs a warning when it returns True.
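That wrapper idea might look like the following sketch. The `logger` name and the log message are assumptions, not code from this PR; the predicate is copied from the snippet earlier in the thread, with `getattr` used so it also runs on platforms whose OSError has no `winerror` attribute.

```python
import errno
import logging

logger = logging.getLogger(__name__)  # hypothetical module logger

def is_win_access_error(e):
    # Same predicate as in the snippet above; getattr keeps it safe
    # on platforms where OSError has no `winerror` attribute.
    return (
        isinstance(e, OSError) and
        e.errno == errno.EACCES and
        getattr(e, "winerror", None) == 5
    )

def warn_and_check_win_access_error(e):
    # The wrapper suggested above: log a warning whenever the
    # predicate says a retry is about to happen.
    result = is_win_access_error(e)
    if result:
        logger.warning("Retrying rmtree after Windows access error: %s", e)
    return result

# It would then be passed to the decorator as, e.g.:
# @retry(stop_max_attempt_number=5,
#        retry_on_exception=warn_and_check_win_access_error)
```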

@choyrim
Author

choyrim commented Nov 18, 2019

@chrahunt based on the Python 2.7 error logs in the CI checks, it looks like double-decorating a function doesn't work in Python 2.7, or is being ignored.

@choyrim
Author

choyrim commented Nov 19, 2019

I installed Python 2.7 and was able to reproduce the error with tox.

@chrahunt
Member

I don't see anything obvious that would cause that. I would make sure the function is getting called and validate the argument and return value. I didn't see any open issues in the retrying project to explain it.

@pradyunsg
Member

speculation: retry might not support cascading decorators or we’re hitting an edge case?

@chrahunt
Member

Not to jump ahead but if it does turn out to be a retrying bug then we may want to switch to tenacity per rholder/retrying#79.

update the code to handle the different errors for locked files
on windows between python 2.7 and python 3.
@choyrim
Author

choyrim commented Nov 19, 2019

@chrahunt @pradyunsg the problem was that on Python 2.7 the error raised is different, and also more difficult to fake in tests. No problems with the retry decoration.

@choyrim
Author

choyrim commented Nov 19, 2019

@chrahunt feel free to ask for other changes, like naming and whatnot.

The problem I encountered would have been more obvious if I had followed your advice and created a test that only checks whether we can detect the error caused by file-handle locking. But I'm not quite sure how I'd coordinate with the other process. I guess I can try just forking and sleeping.

@chrahunt
Member

Maybe something like:

  1. Invoke multiprocessing.Pipe() to get parent and child ends
  2. Create a file and note its path
  3. Create a multiprocessing.Process (setting daemon to True) which invokes a function (open_and_hold_file(pipe, path)), where pipe is child from above and path is a string path.
  4. In the parent process, wait for True from the child
  5. In open_and_hold_file, open the file at path and send True over the pipe, then wait for True from the parent
  6. On receipt of True, the parent should try to delete the file, record the exception if any, and then send True and join the child process
  7. Finally, assert things about the raised error
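The steps above can be sketched roughly as follows. Only `open_and_hold_file` is a name from the suggestion; `try_delete_while_held` and the rest are assumptions about one possible shape, not the PR's test code:

```python
import multiprocessing
import os
import tempfile

def open_and_hold_file(pipe, path):
    # Child process (step 5): open the file, tell the parent, then
    # hold the handle until the parent signals that it is done.
    with open(path):
        pipe.send(True)
        pipe.recv()

def try_delete_while_held(path):
    # Parent side of the steps above; returns the exception raised by
    # the delete attempt, or None if it succeeded.
    parent, child = multiprocessing.Pipe()          # step 1
    proc = multiprocessing.Process(                 # step 3
        target=open_and_hold_file, args=(child, path), daemon=True
    )
    proc.start()
    parent.recv()          # step 4: child now holds the file open
    error = None
    try:
        os.remove(path)    # step 6: attempt the delete
    except OSError as e:
        error = e
    parent.send(True)      # release the child
    proc.join()
    return error

if __name__ == "__main__":
    fd, path = tempfile.mkstemp()                   # step 2
    os.close(fd)
    err = try_delete_while_held(path)
    # Step 7: on Windows err should be an OSError; on POSIX an open
    # handle does not block deletion, so err will typically be None.
    print(type(err).__name__ if err else "deleted")
```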

Member

@chrahunt chrahunt left a comment


I added specific comments to the code, and have 2 general comments:

  1. (minor) for the docstrings I would capitalize the first letter for consistency with the rest of the module
  2. we should have a test like described above to make sure we're catching the actual error that would be raised

Also, do we know whether this is the only circumstance under which this error is raised, or is it also raised if we simply don't have permission? I don't think it's a deal-breaker but it would be good to know so we could potentially refine it in a followup.

@choyrim
Author

choyrim commented Nov 20, 2019

Maybe something like:

  1. Invoke multiprocessing.Pipe() to get parent and child ends
  2. Create a file and note its path
  3. Create a multiprocessing.Process (setting daemon to True) which invokes a function (open_and_hold_file(pipe, path)), where pipe is child from above and path is a string path.
  4. In the parent process, wait for True from the child
  5. In open_and_hold_file, open the file at path and send True over the pipe, then wait for True from the parent
  6. On receipt of True, the parent should try to delete the file, record the exception if any, and then send True and join the child process
  7. Finally, assert things about the raised error

@chrahunt I attempted to implement something like this, but it didn't produce the same errors that the virus scanners did. I think the errors are different if the competing process is privileged. When I have a child process hold a handle to the file, I get a different error which specifically points out that another process is using the file. When a virus scanner is holding it, I get an "Access denied" error. So I wasn't able to get the test to simulate a virus scanner.

I didn't notice your other comments until just now. So I'll have to handle them later.

@techalchemy
Member

Hey @choyrim thanks for looking at this super frustrating issue, from experience it is not fun to tackle.

I know I have asked this question before, but I can't remember the answer or whether I just never got around to implementing the fix -- was there a reason you were aware of why the TempDir implementation in pip doesn't take advantage of weakref.finalize to help clean it up? I've run into this issue a number of times and seen a number of cases of it coming up -- @pradyunsg @chrahunt

/cc @pfmoore @zooba

@choyrim
Author

choyrim commented Nov 23, 2019 via email

@chrahunt
Member

@techalchemy is that related to this fix? Sorry if I'm missing something. It would definitely be useful in #6982.

@choyrim
Author

choyrim commented Nov 26, 2019 via email

@pfmoore
Member

pfmoore commented Nov 26, 2019

I agree, I'm not clear how weakref.finalize is relevant here. The lifetime of the temporary directory as far as Python is concerned, is completely clear and deterministic. The problem is that a external agent (in this case, an anti-virus program) is blocking the code from removing the directory when it should be removed.

I guess what we could do is, after we do the normal deletion-with-retry (without the longer delay this PR introduces) if we still haven't managed to delete the directory after the retries, we could register a final attempt to delete the directory when pip exits (weakref.finalize would be a perfectly OK way to do this, just register the finalizer against a "permanent" object that we create - although I'd have thought that atexit.register would be a simpler approach).

The problem is that pip isn't that long-running, so the atexit function could potentially run while the virus checker still has the directory locked, at which point we still have to decide how long to wait before giving up. So I don't see any practical benefit here.

@danizen

danizen commented Dec 30, 2019

I would suggest that you also retry on EnvironmentError with error number 32 (WinError 32), because I can easily reproduce that one in the real world.

@danizen

danizen commented Dec 30, 2019

@chrahunt, in the spirit of keeping pull requests small, I volunteer to implement an integration test along the lines you indicated above. My experience suggests that will cause WinError 32, and I'd add that error only if that's what I get.

@danizen

danizen commented Dec 31, 2019

@choyrim I'm going to continue to pursue a more localized change to pip/_internal/utils/temp_dir.py.

In the meantime, this gist is intended to do the multiprocessing as suggested by @chrahunt, with a small difference. Instead of immediately opening the file, the FileOpener class creates the other process, which will wait for the file path and an action to take (beyond opening the file). Because the actual sending of the file path can be deferred, this class may be suitable for use as a pytest fixture (which was my goal). It is certainly usable as a context manager.

I've also attempted to handle the following errors that could occur:

  • parent process never sends the file - this is not an error, but a needed capability for this to work as a fixture
  • file doesn't exist and the child cannot open the file - the child process has to catch and print the exception, and still wait for the parent to send True, because otherwise it would break the protocol. Not doing this, in any case, caused odd hang-ups in IPython.

FileOpener supports two additional actions, only one of which is implemented:

  1. "lock" the file (not implemented)
  2. "noaccess" make the file unreadable by the owning user (implementation taken from tests/lib/filesystem.py

So far, I still get WinError 32 no matter what I do.
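The FileOpener helper described above might look roughly like this. This is a simplified sketch, not the gist itself: it omits the error handling and the "lock"/"noaccess" actions the gist supports, and it assumes send_file is always called before the context exits:

```python
import multiprocessing

def _opener_main(pipe):
    # Child: wait for a file path to arrive, hold the file open, and
    # release it only when the parent says so. (The real gist also
    # catches open errors and supports extra actions; this does not.)
    path = pipe.recv()
    with open(path):
        pipe.send(True)   # confirm the handle is held
        pipe.recv()       # wait for permission to release

class FileOpener(object):
    """Hold a file open from another process. The child is started
    first and the path is sent later, which is what makes the helper
    usable as a deferred pytest fixture or a context manager."""

    def __init__(self):
        self._parent, child = multiprocessing.Pipe()
        self._proc = multiprocessing.Process(
            target=_opener_main, args=(child,), daemon=True
        )
        self._proc.start()

    def send_file(self, path):
        self._parent.send(path)
        self._parent.recv()  # block until the child has the file open

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        # Assumes send_file was called; the gist also handles the
        # never-sent-a-path case.
        self._parent.send(True)
        self._proc.join()
```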

@choyrim
Author

choyrim commented May 12, 2020

Since #7479 will also make this issue no longer a problem, perhaps I should close this now?

@uranusjr
Member

@choyrim Wouldn’t the problem still occur if the anti-virus is still scanning when pip exits (and tries to delete the folder)? Although indeed it’d reduce the probability that this happens.

@choyrim
Author

choyrim commented May 12, 2020

@choyrim Wouldn’t the problem still occur if the anti-virus is still scanning when pip exits (and tries to delete the folder)? Although indeed it’d reduce the probability that this happens.

Well, I can't find the code that takes care of deleting the temp files. If pip fails the install when it can't delete the folder, then the answer is yes. However, since the Cylance scan seems to occur soon after the download starts, the probability of a conflict will be greatly reduced with the deletion running after the install.

@uranusjr
Member

@choyrim Makes sense. pip will fail if it fails to delete the temp directory after installation, but indeed the possibility would be reduced. I guess we can always fix it when someone complains.

@uranusjr uranusjr closed this May 12, 2020
@pradyunsg pradyunsg removed the S: needs triage Issues/PRs that need to be triaged label Feb 12, 2021
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 2, 2021