This repository has been archived by the owner on Aug 27, 2023. It is now read-only.

Investigation: Add support for instance profiles #322

Closed

Conversation

nivintw

@nivintw nivintw commented Oct 2, 2022

Depends on / includes changes from #321

Related to issue #319

This is an initial pass at the implementation; it is functionally tested (by me) and is running in a deployed environment.
I've been trying to track down exactly how/where/when/why, but here's what I believe I've observed:

Caveat: while I have a number of years of experience as a developer/programmer/engineer, I've traditionally stayed away from web programming specifically, so I might be missing some web-server-specific details.

The current implementation of pypicloud (the one on master) appears to create the storage backend once, when the cache is created (ref).
In the case of the S3 backend, this in turn creates a reference to the bucket resource (boto3.resource("s3").Bucket).

It appears that this bucket is kept alive for an indeterminate amount of time. From what I'm observing, it honestly seems to be kept alive for the lifetime of the server, but the "request" argument to the constructor is really making me doubt myself here.

I do know that without these patches (or equivalent), we reproducibly observe 500 Internal Server Error responses (the pod logs show these are specifically expired-credential errors) about 15 minutes after server startup, which matches the length of time the instance profile credentials are valid for.

Generally, boto3/botocore will rotate instance profile credentials automatically in almost all cases, transparently to the user. It appears, however, that long-lived resource objects do not automatically refresh their credentials. This observation is what led me to suspect that the resource is being kept alive.

The implementation here makes the self.bucket member a property: the first time it is retrieved, it creates the bucket if it doesn't exist; on every call, it returns a new boto3 Bucket resource pointing at that same bucket.

This should hopefully minimize any disruption to anyone else.
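As a sketch of that idea (class and helper names here are hypothetical, not the actual pypicloud code; `resource_factory` stands in for something like `lambda: boto3.resource("s3")` so the example stays self-contained):

```python
class S3Storage:
    """Sketch of the bucket-as-property idea; not the actual pypicloud class."""

    def __init__(self, bucket_name, resource_factory):
        self.bucket_name = bucket_name
        self._resource_factory = resource_factory
        self._bucket_checked = False

    @property
    def bucket(self):
        # Build a fresh Bucket resource on every access, so the underlying
        # session (and the credentials it resolved) is never older than the
        # most recent call.
        bucket = self._resource_factory().Bucket(self.bucket_name)
        if not self._bucket_checked:
            # First access: this is where the bucket would be created if it
            # doesn't exist yet (elided in this sketch).
            self._bucket_checked = True
        return bucket


# Tiny demo with a stub factory standing in for boto3:
class _StubResource:
    def Bucket(self, name):
        return object()

storage = S3Storage("my-packages", _StubResource)
fresh_1 = storage.bucket
fresh_2 = storage.bucket
# Two accesses -> two distinct Bucket objects, each from a fresh resource.
```

Because each access goes back through the factory, a worker that lives past the ~15-minute credential window still ends up with a resource backed by current credentials.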

I ended up removing the get_bucket method since it was only used in one location, but I'm happy to add it back; it ended up being only the two lines that are now in the bucket getter, so keeping it as a separate function seemed redundant.

I also made a few (what I believe to be minor) updates and left comments explaining my thinking where I thought it would be useful.

I welcome all feedback and discussion here; this is now the last remaining patch I manually maintain for our deployments, so I'd be very happy to see it merged upstream.

TYVM!
-Tyler

@stevearc
Owner

stevearc commented Oct 4, 2022

I've started reviewing this, but it'll take more time than I have today. Will hopefully be able to get through it this week

@nivintw
Author

nivintw commented Oct 4, 2022

Hey @stevearc I appreciate you taking the time.
I did a lot more digging within the last day or so, and TBH I'm kind of at a loss as to why the observed behavior is happening, and why the solution fixes it.

Digging deep into how botocore rotates credentials for the instance-profile provider and how boto3 creates and uses sessions, everything should work? My only remaining dark corner is specifically the collections on the service resource, e.g. Bucket.all().filter, etc.
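For concreteness, the "collections" in question are the lazy query interfaces on resources. A sketch using a MagicMock stand-in (with real boto3 this would be a resource from boto3.resource("s3").Bucket(...); the keys and prefix here are hypothetical):

```python
from unittest.mock import MagicMock

# Stand-in for a long-lived boto3 Bucket resource.
bucket = MagicMock()
bucket.objects.filter.return_value = ["packages/pkg-a", "packages/pkg-b"]

# Collections such as bucket.objects.filter(Prefix=...) build and issue the
# underlying ListObjects requests lazily, at iteration time -- which is
# exactly where stale credentials on a long-lived resource would surface.
keys = list(bucket.objects.filter(Prefix="packages/"))
print(keys)  # ['packages/pkg-a', 'packages/pkg-b']
```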

When I get some time, I plan to create a couple of minimal tests to try to isolate the specific failure.
I just checked our deployment: it has continued to run with no restarts and no errors, so I believe the fix does address it.

@nivintw
Author

nivintw commented Oct 4, 2022

The currently failing test is related to the health-check changes and is in line with the change made.

This was something I was thinking through and proposing, but I'm happy to revert that piece if you disagree with the points made.

Failing test:

```python
def test_check_health_fail(self):
    """check_health returns False for bad connection"""
    dbmock = self.storage.bucket.meta.client = MagicMock()

    def throw(*_, **__):
        """Throw an exception"""
        raise ClientError({"Error": {}}, "OP")

    dbmock.head_bucket.side_effect = throw
    ok, msg = self.storage.check_health()
    self.assertFalse(ok)
```

@nivintw
Author

nivintw commented Oct 4, 2022

Hm, actually, after doing the math I'm just going to revert that part...

My thought was: why pay for something if we'd never expect it to change? But doing the math:

3.154e+7 seconds in a year × $0.0004 per 1,000 requests (ref) ≈ $12.62

if queried once per second continuously for the whole year. Given all the other parameters of running the mirror, it seems worth the $12.
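Spelled out, as a quick sanity check of the arithmetic above:

```python
SECONDS_PER_YEAR = 3.154e7
DOLLARS_PER_1000_REQUESTS = 0.0004  # from the linked pricing reference

# One request per second, every second, for a whole year:
annual_cost = SECONDS_PER_YEAR / 1000 * DOLLARS_PER_1000_REQUESTS
print(f"${annual_cost:.2f}")  # $12.62
```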

@nivintw
Author

nivintw commented Oct 4, 2022

OK, so the latest test failures are due to how the tests mock out the bucket instance.
Since with these changes bucket is a property that returns a new Bucket instance on each access, it makes sense that the current mocking approach fails.

I'm unsure of an immediate solution, but I'll think on it a bit.
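One possible direction (a sketch using unittest.mock.PropertyMock, not the project's actual test code): patch the property on the class so every access returns the same mock.

```python
from unittest.mock import MagicMock, PropertyMock, patch

class Storage:
    """Hypothetical stand-in for the storage class under test."""
    @property
    def bucket(self):
        raise RuntimeError("would build a real boto3 Bucket here")

storage = Storage()

# Properties live on the class, so patch the *type*, not the instance.
with patch.object(type(storage), "bucket", new_callable=PropertyMock) as prop:
    prop.return_value = MagicMock()
    b1 = storage.bucket
    b2 = storage.bucket
    # Every access now yields the same mock, restoring the old test setup
    # where the bucket (and its .meta.client) could be configured once.
```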

@nivintw
Author

nivintw commented Oct 4, 2022

Please hold on this PR; I am doing some additional investigation to identify the root cause of the issue.

I've been investigating further by reading through the botocore and boto3 source code, and I want to make sure that the problem is resolved and that the fix specifically addresses it.

TY for your time so far

@nivintw
Author

nivintw commented Oct 4, 2022

I'm trying not to write a miniature novel of findings and progress on the investigation, because honestly there is just a ton there. However, key takeaways:

  1. Based on exploring the boto3 and botocore source code and how the authentication works, this should not happen. At all.
  2. It's unclear how or why this is happening.
  3. I continue to not see the problem with the above changes.
  4. I also couldn't recreate the problem in a slightly different environment (using EC2 / instance profiles) when I expected that I'd be able to.
  5. The issue might not be with the code; it might be with our k8s cluster setup. If that's the case, it would be pretty surprising news and would impact a lot of things, so I don't want to rush to assume that's what's going on.

Basically, all signs are pointing to "this should not be possible", so need to dig in a bit more.

As an aside, since upgrading from 1.3.7 to 1.3.11 we're seeing much lower memory usage, which is awesome :)

@nivintw
Author

nivintw commented Oct 4, 2022

Main threads of investigation (just sharing in case someone is interested):

  1. Potentially our setup (although probably not)
  2. Potentially related to threading? boto3 resources and sessions aren't thread-safe. Not saying this is the case; it's just on the short list of things to look into.

@stevearc
Owner

stevearc commented Oct 6, 2022

> Potentially related to threading?

When configured to run with multiple processes, Pyramid will fork after setting up some resources. For most Python objects that works transparently, but it has caused problems in the past with SQL session handles. I'm not sure if anything like that is happening here, but it's worth noting if you're going to investigate threading.

> I also couldn't recreate the problem in a slightly different environment (using EC2 / instance profiles) when I expected that I'd be able to.

This to me is the most obvious place to start. If it doesn't reproduce in a simpler environment when you would expect it to, then there's something about the problem that we don't understand.

@nivintw nivintw changed the title Add support for instance profiles Investigation: Add support for instance profiles Oct 8, 2022
@nivintw
Author

nivintw commented Oct 8, 2022

Just wanted to drop an update:
today I started seeing the issue again, with the changes from this PR applied.

I got lucky this time and found boto/botocore#1617, which seems very relevant and related.
So this situation may not actually be related to the implementation of pypicloud at all.

I'm going to leave this open for now while I continue to investigate.

@nivintw
Author

nivintw commented Oct 12, 2022

Hey @stevearc!

So after talking more with our infrastructure team, we're starting to strongly suspect that the issue is actually with kiam.
Our deep dive into botocore/boto3, looking for proof that the issue was resolved, has left us relatively sure that this shouldn't be happening at all from the boto3/botocore/pypicloud side of things.

I'm going to close this PR for now, but I'll keep my fork up with the changes in case we find that there's more than one thing happening.
Appreciate the time and attention!

@nivintw nivintw closed this Oct 12, 2022