
7280 alternative fix: wrap existing rmtree to retry longer only for windows access errors. #7363

Closed
wants to merge 11 commits

Conversation


@choyrim choyrim commented Nov 16, 2019

Another approach to fixing #7280: this variation wraps the existing rmtree with retries. However, the wrapper only retries if the error raised by shutil.rmtree is a Windows access error.

Member

@chrahunt chrahunt left a comment


I think that something targeted like this is a better approach. Thanks for trying it out! 👍

We can clean things up a bit with the retry wrapper's retry_on_exception argument. We can also define this version of rmtree (with that argument) on Windows only, which should alleviate some concerns about changing behavior on all platforms. e.g.

def rmtree_internal(...):
    ...

if sys.platform == "win32":
    @retry(stop_max_attempt_number=3, retry_on_exception=is_windows_permission_error)
    @retry(...)  # same as before
    def rmtree(...):
        rmtree_internal(...)
else:
    @retry(...)  # same as before
    def rmtree(...):
        rmtree_internal(...)

For tests, I would add a test directly of is_win_access_error (with a real error generated on Windows using a sub-process to hold a file open), and then Windows- and Unix-specific tests that take a similar approach as before, checking the duration. For right now I think we can do a straightforward update to the tests, even though it increases the duration slightly (since the overall increase as a percentage is relatively small), and then make a note in #4497 to take a better approach in a followup (e.g. what I described in #7361 (comment)).

@choyrim
Author

choyrim commented Nov 16, 2019

wow thanks for the advice. I'll try those approaches out later.

@choyrim
Author

choyrim commented Nov 17, 2019

@chrahunt I'm trying out the changes you suggested and they work. However, the code feels a bit redundant. Also, there isn't a good place to log a warning that rmtree was unable to delete files.

if sys.platform == "win32":
    def is_win_access_error(e):
        """returns True if the OSError is a Windows access error which indicates
        that another process is preventing the object from being deleted."""
        return (
            isinstance(e, OSError) and
            e.errno == errno.EACCES and
            e.winerror == 5
            )

    @retry(stop_max_attempt_number=5, retry_on_exception=is_win_access_error)
    @retry(stop_max_delay=3000, wait_fixed=500)
    def rmtree(dir, ignore_errors=False):
        rmtree_internal(dir, ignore_errors)
else:
    @retry(stop_max_delay=3000, wait_fixed=500)
    def rmtree(dir, ignore_errors=False):
        rmtree_internal(dir, ignore_errors)

@chrahunt
Member

For the code duplication, you can do something like

@retry(stop_max_delay=3000, wait_fixed=500)
def rmtree_internal(dir, ignore_errors=False):
    ...

Then

if sys.platform == "win32":
    def is_win_access_error(e):
        """returns True if the OSError is a Windows access error which indicates
        that another process is preventing the object from being deleted."""
        return (
            isinstance(e, OSError) and
            e.errno == errno.EACCES and
            e.winerror == 5
            )

    rmtree = retry(
        stop_max_attempt_number=5, retry_on_exception=is_win_access_error
    )(rmtree_internal)
else:
    rmtree = rmtree_internal

but I would not be surprised if that causes mypy to complain. Better to have the very minor duplication like you showed than a bunch of # type: ignore comments, in my opinion. Also, we may want to change the retry interval on one platform or the other in the future and if we de-duplicate then it will need a bigger change.

For logging the failure, it isn't pretty, but you could wrap is_win_access_error in another function that logs a warning when it returns True.
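That wrapper idea might look like the following sketch. The `logger` name and the log message are assumptions, not code from this PR; the predicate is copied from the snippet earlier in the thread, with `getattr` used so it also runs on platforms whose OSError has no `winerror` attribute.

```python
import errno
import logging

logger = logging.getLogger(__name__)  # hypothetical module logger

def is_win_access_error(e):
    # Same predicate as in the snippet above; getattr keeps it safe
    # on platforms where OSError has no `winerror` attribute.
    return (
        isinstance(e, OSError) and
        e.errno == errno.EACCES and
        getattr(e, "winerror", None) == 5
    )

def warn_and_check_win_access_error(e):
    # The wrapper suggested above: log a warning whenever the
    # predicate says a retry is about to happen.
    result = is_win_access_error(e)
    if result:
        logger.warning("Retrying rmtree after Windows access error: %s", e)
    return result

# It would then be passed to the decorator as, e.g.:
# @retry(stop_max_attempt_number=5,
#        retry_on_exception=warn_and_check_win_access_error)
```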

@choyrim
Author

choyrim commented Nov 18, 2019

@chrahunt based on the Python 2.7 error logs in the CI checks, it looks like double-decorating a function doesn't work in Python 2.7, or is being ignored.

@choyrim
Author

choyrim commented Nov 19, 2019

I installed Python 2.7 and was able to reproduce the error with tox.

@chrahunt
Member

I don't see anything obvious that would cause that. I would make sure the function is getting called and validate the argument and return value. I didn't see any open issues in the retrying project to explain it.

@pradyunsg
Member

speculation: retry might not support cascading decorators or we’re hitting an edge case?

@chrahunt
Member

Not to jump ahead but if it does turn out to be a retrying bug then we may want to switch to tenacity per rholder/retrying#79.

update the code to handle the different errors for locked files
on windows between python 2.7 and python 3.
@choyrim
Author

choyrim commented Nov 19, 2019

@chrahunt @pradyunsg the problem was that on Python 2.7 the error raised is different, and also more difficult to fake in tests. No problems with the retry decoration.

@choyrim
Author

choyrim commented Nov 19, 2019

@chrahunt feel free to ask for other changes, like naming and whatnot.

The problem I encountered would have been more obvious if I had followed your advice and created a test that only checks whether we can detect the error caused by file-handle locking. But I'm not quite sure how I'd coordinate with the other process. I guess I can try just forking and sleeping.

@chrahunt
Member

Maybe something like:

  1. Invoke multiprocessing.Pipe() to get parent and child ends
  2. Create a file and note its path
  3. Create a multiprocessing.Process (setting daemon to True) which invokes a function (open_and_hold_file(pipe, path)), where pipe is child from above and path is a string path.
  4. In the parent process, wait for True from the child
  5. In open_and_hold_file, open the file at path and send True over the pipe, then wait for True from the parent
  6. On receipt of True, the parent should try to delete the file, record the exception if any, and then send True and join the child process
  7. Finally, assert things about the raised error
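The steps above can be sketched roughly as follows. Only `open_and_hold_file` is a name from the suggestion; `try_delete_while_held` and the rest are assumptions about one possible shape, not the PR's test code:

```python
import multiprocessing
import os
import tempfile

def open_and_hold_file(pipe, path):
    # Child process (step 5): open the file, tell the parent, then
    # hold the handle until the parent signals that it is done.
    with open(path):
        pipe.send(True)
        pipe.recv()

def try_delete_while_held(path):
    # Parent side of the steps above; returns the exception raised by
    # the delete attempt, or None if it succeeded.
    parent, child = multiprocessing.Pipe()          # step 1
    proc = multiprocessing.Process(                 # step 3
        target=open_and_hold_file, args=(child, path), daemon=True
    )
    proc.start()
    parent.recv()          # step 4: child now holds the file open
    error = None
    try:
        os.remove(path)    # step 6: attempt the delete
    except OSError as e:
        error = e
    parent.send(True)      # release the child
    proc.join()
    return error

if __name__ == "__main__":
    fd, path = tempfile.mkstemp()                   # step 2
    os.close(fd)
    err = try_delete_while_held(path)
    # Step 7: on Windows err should be an OSError; on POSIX an open
    # handle does not block deletion, so err will typically be None.
    print(type(err).__name__ if err else "deleted")
```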

Member

@chrahunt chrahunt left a comment


I added specific comments to the code, and have 2 general comments:

  1. (minor) for the docstrings I would capitalize the first letter for consistency with the rest of the module
  2. we should have a test like described above to make sure we're catching the actual error that would be raised

Also, do we know whether this is the only circumstance under which this error is raised, or is it also raised if we simply don't have permission? I don't think it's a deal-breaker but it would be good to know so we could potentially refine it in a followup.

@choyrim
Author

choyrim commented Nov 20, 2019

Maybe something like:

  1. Invoke multiprocessing.Pipe() to get parent and child ends
  2. Create a file and note its path
  3. Create a multiprocessing.Process (setting daemon to True) which invokes a function (open_and_hold_file(pipe, path)), where pipe is child from above and path is a string path.
  4. In the parent process, wait for True from the child
  5. In open_and_hold_file, open the file at path and send True over the pipe, then wait for True from the parent
  6. On receipt of True, the parent should try to delete the file, record the exception if any, and then send True and join the child process
  7. Finally, assert things about the raised error

@chrahunt I attempted to implement something like this, but it didn't produce the same errors that the virus scanners did. I think the errors are different if the competing process is privileged. When I have a child process hold a handle to the file, I get a different error which specifically points out that another process is using the file. When a virus scanner is holding it, I get an "Access denied" error. So I wasn't able to get the test to simulate a virus scanner.

I didn't notice your other comments until just now. So I'll have to handle them later.

@techalchemy
Member

Hey @choyrim thanks for looking at this super frustrating issue, from experience it is not fun to tackle.

I know I have asked this question before, but I can't remember the answer or whether I just never got around to implementing the fix -- was there a reason you were aware of why the TempDir implementation in pip doesn't take advantage of weakref.finalize to help clean it up? I've run into this issue a number of times and seen a number of cases of it coming up -- @pradyunsg @chrahunt

/cc @pfmoore @zooba

@choyrim
Author

choyrim commented Nov 23, 2019 via email

@chrahunt
Member

@techalchemy is that related to this fix? Sorry if I'm missing something. It would definitely be useful in #6982.

@choyrim
Author

choyrim commented Nov 26, 2019 via email

@pfmoore
Member

pfmoore commented Nov 26, 2019

I agree, I'm not clear how weakref.finalize is relevant here. The lifetime of the temporary directory as far as Python is concerned, is completely clear and deterministic. The problem is that a external agent (in this case, an anti-virus program) is blocking the code from removing the directory when it should be removed.

I guess what we could do is, after we do the normal deletion-with-retry (without the longer delay this PR introduces) if we still haven't managed to delete the directory after the retries, we could register a final attempt to delete the directory when pip exits (weakref.finalize would be a perfectly OK way to do this, just register the finalizer against a "permanent" object that we create - although I'd have thought that atexit.register would be a simpler approach).

The problem is that pip isn't that long-running, so the atexit function could potentially run while the virus checker still has the directory locked, at which point we still have to decide how long to wait before giving up. So I don't see any practical benefit here.

@danizen

danizen commented Dec 30, 2019

I would suggest that you also retry on EnvironmentError with error number 32 (WinError 32), because I can easily reproduce that one in the real world.

@danizen

danizen commented Dec 30, 2019

@chrahunt, in the spirit of keeping pull requests small, I volunteer to implement an integration test along the lines you indicated above. My experience suggests that will cause WinError 32, and I'd add that error only if that's what I get.

@danizen

danizen commented Dec 31, 2019

@choyrim I'm going to continue to pursue a more localized change to pip/_internal/utils/temp_dir.py.

In the meantime, this gist is intended to do the multiprocessing as suggested by @chrahunt, with a small difference. Instead of immediately opening the file, the FileOpener class creates the other process, which will wait for the file path and an action to take (beyond opening the file). Because the actual sending of the file path can be deferred, this class may be suitable for use as a pytest fixture (which was my goal). It is certainly usable as a context manager.

I've also attempted to handle the following errors that could occur:

  • parent process never sends the file - this is not an error, but a needed capability for this to work as a fixture
  • file doesn't exist and the child cannot open the file - the child process has to catch and print the exception, and still wait for the parent to send True, because otherwise it would break the protocol. Not doing this, in any case, caused odd hang-ups in IPython.

FileOpener supports two additional actions, only one of which is implemented:

  1. "lock" the file (not implemented)
  2. "noaccess" make the file unreadable by the owning user (implementation taken from tests/lib/filesystem.py

So far, I still get WinError 32 no matter what I do.
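The FileOpener helper described above might look roughly like this. This is a simplified sketch, not the gist itself: it omits the error handling and the "lock"/"noaccess" actions the gist supports, and it assumes send_file is always called before the context exits:

```python
import multiprocessing

def _opener_main(pipe):
    # Child: wait for a file path to arrive, hold the file open, and
    # release it only when the parent says so. (The real gist also
    # catches open errors and supports extra actions; this does not.)
    path = pipe.recv()
    with open(path):
        pipe.send(True)   # confirm the handle is held
        pipe.recv()       # wait for permission to release

class FileOpener(object):
    """Hold a file open from another process. The child is started
    first and the path is sent later, which is what makes the helper
    usable as a deferred pytest fixture or a context manager."""

    def __init__(self):
        self._parent, child = multiprocessing.Pipe()
        self._proc = multiprocessing.Process(
            target=_opener_main, args=(child,), daemon=True
        )
        self._proc.start()

    def send_file(self, path):
        self._parent.send(path)
        self._parent.recv()  # block until the child has the file open

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        # Assumes send_file was called; the gist also handles the
        # never-sent-a-path case.
        self._parent.send(True)
        self._proc.join()
```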

@choyrim
Author

choyrim commented May 12, 2020

Since #7479 will also make this issue no longer a problem, perhaps I should close this now?

@uranusjr
Member

@choyrim Wouldn’t the problem still occur if the anti-virus is still scanning when pip exits (and tries to delete the folder)? Although indeed it’d reduce the probability that this happens.

@choyrim
Author

choyrim commented May 12, 2020

@choyrim Wouldn’t the problem still occur if the anti-virus is still scanning when pip exits (and tries to delete the folder)? Although indeed it’d reduce the probability that this happens.

Well, I can't find the code that takes care of deleting the temp files. If pip fails the install when it can't delete the folder, then the answer is yes. However, since the Cylance scan seems to occur soon after the download starts, the probability of a conflict will be greatly reduced with the deletion running after the install.

@uranusjr
Member

@choyrim Makes sense. pip will fail if it fails to delete the temp directory after installation, but indeed the possibility would be reduced. I guess we can always fix it when someone complains.

@uranusjr uranusjr closed this May 12, 2020
@pradyunsg pradyunsg removed the S: needs triage Issues/PRs that need to be triaged label Feb 12, 2021
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 2, 2021