-
Notifications
You must be signed in to change notification settings - Fork 170
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SNOW-1503887: Snowflake Driver Stuck opening Session #1804
Comments
Hey @tpoll, Thanks for providing such a detailed explanation. A quick question for you, when you say you tested this locally, did you create a unit test that easily reproduces the issue, or was this just merely running a Java application using Hikari that requests several connections at a time? I'm just trying to figure out the easiest way to reproduce this if possible. |
I was merely running an application locally to connect to a snowflake account (that seemed itself to maybe not exist anymore) and stepping through the code. I think to reproduce, you just need to attempt to sign in to an account where the login will always fail with a 500 style error and force a lot of retries. |
Thanks for clarifying. I'm unlikely going to hit a 50x error if I leave it up to chance. I'll see if I can work out something. There should be a way to mock this. |
@tpoll so as a starting point, I wanted to see if I can reproduce this with the latest version 3.17.0 because there was a new retry strategy that was implemented recently. I ran a test where I sent requests through a proxy server and I intercepted the HTTP requests so that every single response returned to the JDBC driver had a status code of 503. The JDBC driver kept retrying until it eventually gave up with the error:
While the issue you pointed out is worth looking into, I believe the impact is that you're going to have very short backoff times, or none at all (negative value). However, the code should be keeping track of the retry attempts and will eventually stop. It's unclear why in your case it's hanging indefinitely, there could be something else going on, but I did want to ask if it was possible for you to try your tests again locally with version 3.17.0 and let me know if you were able to reproduce the problem or not. Update: I just tried with v3.16.1 and that was a lot more liberal with the retries, so it retried the login request 39 times but it eventually failed with the same error. @tpoll I'll be interested to know what the outcome is on your end when testing this with v3.17.0. |
Hey @sfc-gh-wfateem here is a gist of me trying locally https://gist.github.com/tpoll/953ecdb5dd141aa844675473924c0768, you can still see there are many very fast retries. I believe the problem still exists in 3.17. It's non deterministic, so I haven't been able to repro the hanging locally with debug logging enabled, but my clients would always get stuck on this thread sleep when I took thread dumps of my production instances.
I think the issue is specifically the rare set of conditions outlined in my issue description that causes this line to be hit
I think the easiest solution would be to just update |
@sfc-gh-wfateem I was actually able to repro locally by chance with debug logging and confirm the bug is what I suspected. See https://gist.github.com/tpoll/6e1459978e99d28f46f423d75f4f3106 The final log lines show:
Sleeping for |
@tpoll Thanks for that. I'll discuss this with the team. |
I was also able to reproduce the issue, just FYI. |
@tpoll the PR is merged and will be included in the next release. |
Thanks! |
@tpoll JDBC driver version 3.18.0 has been released with the fix. |
Please answer these questions before submitting your issue.
In order to accurately debug the issue this information is required. Thanks!
What version of JDBC driver are you using?
3.16.1
What operating system and processor architecture are you using?
ubuntu jammy arm64
What version of Java are you using?
21.0.1
I can't provide the exact code, but I'm simply trying to open a jdbc driver connection to snowflake. This code usually works 99% of the time, but in the last ~week we have been seeing intermittent issues with the snowflake driver hanging indefinitely (e.g. 10+ hours) attempting to open a new session. The same code for the same customer will work if you try it in another process 5 minutes later, so we haven't been able to reproduce locally, as there's some sort of non determinism happening here. It primarily seems to be happing on accounts where the login code produces a
HTTP status=513
error. Our code usually hits the default login timeout window, but sometimes it just hangs forever on the following stacktrace.The stack trace of the hanging code is:
I've been looking at and stepping through the underlying code, and believe I see a few potential issues and reason for this:
take this block below from
RestRequest
that has the sleep that is guarding the hang.When I test locally, I see that we pretty quickly reach a state where our backoff is negative, since we compute the backoff relative to the backoff itself and not a static base value. This means if you get unlucky from
backoff.chooseRandom
, we could get a bunch of negative jitters, and then quickly end up with a negative or very low backOff value which causes the loop to spin forever until the top level timeout login is reached.The other issue is
long elapsedMilliForLastCall = System.currentTimeMillis() - startTimePerRequest;
. SinceSystem.currentTimeMillis()
is not monotonic (see this thread) it's possible that this call could actually jump back and causeelapsedMilliForLastCall
to be negative or lower than it actually should be.So given these issue above, I believe what is happening is that sometimes the backOffInMilli become a very low value like 2 millis, which is usually above
elapsedMilliForLastCall
. We skip the backoff part of the loop and quickly jack up the number of retries since we end up just making requests in a loop.So if we have backoffInMilli = 1
and retryCount=60,
then and then get an unlucky tick on the last request and
elapsedMilliForLastCall
drops below 1,we re-enter the loop.
When we hit
Math.pow(2, retryCount) + jitteredBackoffInMilli)
this line, we do 2^60, get an absolutely massive number, and then end up having our next request essentially sleep infinitely.Proposed fixes:
I think there's a few approaches to fix.
backoffInMilli
e.g. so that we never sleep more than say 3 minutes.elapsedMilliForLastCall
is calculated to useSystem.nanoTime
which is guaranteed to never jump negative.I'm happy to PR fixes if this doesn't make sense, or discuss further.
The text was updated successfully, but these errors were encountered: