Conversation
This code is overly complex and I loathe changing it. 😭 I'm probably missing something obvious, but where are we demonstrating there is a deadlock, and where is the deadlock?
peer_client.go (Outdated)

```go
// Take the write lock since we're going to modify the closing state
c.mutex.Lock()
```

```go
// Shutdown waits until outstanding requests have finished
func (c *PeerClient) Shutdown() {
```
Shutdown/Close methods should always accept a context.Context, as there may be some unknown or unforeseen reason the close cannot complete, and it then hangs forever. This has happened in production on other services, and it's always good to put a limit on how long you will wait for a shutdown to complete. This protects the service from having to be killed by the orchestration system without successfully saving its internal state. Shutting down a connection is a low-priority thing; if for some reason the wait group's Wait() hangs forever, we can cancel the shutdown wait and not interfere with the normal shutdown of any service that uses gubernator as a library.
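For illustration, a minimal sketch of what a context-aware Shutdown could look like. The wg field is an assumption (the sketch presumes the client tracks in-flight requests with a sync.WaitGroup); this is not the actual gubernator code.

```go
import (
	"context"
	"sync"
)

// PeerClient is a stand-in with only the field the sketch needs.
type PeerClient struct {
	wg sync.WaitGroup // assumed: tracks in-flight requests
}

// Shutdown waits for outstanding requests, but gives up when ctx expires.
func (c *PeerClient) Shutdown(ctx context.Context) error {
	done := make(chan struct{})
	go func() {
		c.wg.Wait() // returns once all in-flight requests have finished
		close(done)
	}()
	select {
	case <-done:
		return nil // clean shutdown
	case <-ctx.Done():
		return ctx.Err() // deadline hit: stop waiting instead of hanging forever
	}
}
```

A caller could then bound the wait with, for example, `ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)` before calling `Shutdown(ctx)`.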
I've never seen a shutdown function accept a context, e.g. sql.Close or connection.Close.
The context object is meant to be used when there is a chance the request is cancelled by the user. That is not the case here: Shutdown is invoked internally, not by a client. We should not use the context as a crutch to fix deadlock issues.
I can refactor this to avoid the mutex altogether, but I would like to know whether the work would be useful (given the v3 release).
There are a few things in the works. We are moving development to a new repo which will not be associated with the mailgun GitHub organization. After that, the plan is to bump to a v3 release, drop GRPC support and continue development there.
But first we want to merge all these open PRs.
I had hoped we could finish merging last week, as I'm traveling this week, but hopefully we can get it all done!
And to add, Mailgun plans to abandon this repo in favor of @thrawn01's new repo. I suppose you can continue using the /v2 flavor with the new repo until ready to migrate to /v3.
Either way, it looks like /v2 will be deprecated very soon, and gRPC support along with it. I suggest not investing too much time into it.
Thanks for the update! I can lend a hand in the new repo 😄
Out of interest, what was the biggest factor for the decision to drop gRPC?
> Out of interest, what was the biggest factor for the decision to drop gRPC?
@thrawn01 and I had some big conversations about this. Reduced complexity and improved performance. HTTP can perform more efficiently at scale than gRPC if done right.
This would be a good time for @thrawn01 to mention his work on Duh-RPC.
> HTTP can perform more efficiently at scale than gRPC if done right.

What's the source for this? My understanding was that it's in fact the opposite 😄 https://aws.amazon.com/compare/the-difference-between-grpc-and-rest/

> A gRPC API is better for these use cases:
> - High-performance systems
> - High data loads
> - Real-time or streaming applications
Another source: https://github.com/grpc-ecosystem/grpc-gateway#background

> gRPC is great -- it generates API clients and server stubs in many programming languages, it is fast, easy-to-use, bandwidth-efficient and its design is combat-proven by Google. However, you might still want to provide a traditional RESTful JSON API as well. Reasons can range from maintaining backward-compatibility, supporting languages or clients that are not well supported by gRPC, to simply maintaining the aesthetics and tooling involved with a RESTful JSON architecture.
I intend to publish some more formal information about our GRPC journey at Mailgun. But as a quick overview, we are doing some very high performance and large scale stuff with GRPC and it hasn't lived up to all the hype.
In fact, we have discovered that GRPC is slightly slower than using plain old HTTP/1!
As a result, I started working on a project called DUH-RPC, which is still in the works, but you can see the benchmarks comparing HTTP/1 with GRPC here https://github.com/duh-rpc/duh-go-benchmarks
The spec I'm working on is still in progress, but it is here; feedback is welcome.
https://github.com/duh-rpc/duh-go
I'll be traveling for the next 2 weeks, and I'm not always near my laptop, so I don't expect much movement on these projects until after. I still get notifications for these conversations, so I'll try to reply as I can.
The deadlock is proven by the flaky test. If you run it with go test, sometimes I get "panic: send on closed channel". It happens because when you call Shutdown we close the channel, but we are not preventing any further requests, which will then try to write to that channel. And sometimes the test times out: that's because one of the lines of code never finishes acquiring the mutex. I forget which line it was.
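As an aside, one common Go pattern that avoids this particular panic, sketched here with hypothetical names (peer, queue, done, request) rather than taken from this PR: shutdown never closes the channel that request goroutines send on; it closes a separate done channel, and senders select on both.

```go
import "errors"

type request struct{} // placeholder payload for the sketch

type peer struct {
	queue chan request
	done  chan struct{} // closed exactly once, when shutdown starts
}

func (p *peer) send(r request) error {
	select {
	case p.queue <- r:
		return nil
	case <-p.done:
		// Shutdown already started: refuse the request instead of
		// risking a send on a closed channel.
		return errors.New("peer is shutting down")
	}
}

func (p *peer) shutdown() {
	close(p.done) // p.queue itself is never closed, so senders cannot panic
}
```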
@miparnisari Is this PR ready for merge? Not concerned about the go-bench error. But the PR is still in draft mode.
```go
assert.True(t, true, isExpectedErr)
```
This is asserting that true == true 😅 It's also bad practice to do assertions inside goroutines (https://github.com/Antonboom/testifylint#go-require).
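For illustration, a sketch (hypothetical test body and helper, not the test in this PR) of how a goroutine can report its outcome over a channel so that assertions run in the test's own goroutine, which is what testifylint's go-require rule asks for. Presumably the intent of the original line was simply assert.True(t, isExpectedErr).

```go
import (
	"testing"

	"github.com/stretchr/testify/assert"
)

func TestPeerClientShutdownSketch(t *testing.T) {
	errCh := make(chan error, 1)
	go func() {
		// doRequest is a hypothetical helper; only its error leaves the goroutine.
		errCh <- doRequest()
	}()
	err := <-errCh
	// Assert here, in the test goroutine, not inside the worker goroutine.
	assert.Error(t, err)
}
```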
```go
switch err.(type) {
case *gubernator.PeerErr:
```
This is not unwrapping errors, so it will never enter the first case.
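A small sketch of the unwrapping-aware version, assuming the error may be wrapped with fmt.Errorf("...: %w", err) further up the stack (the helper name is made up):

```go
import (
	"errors"

	"github.com/mailgun/gubernator/v2"
)

// isPeerErr reports whether err is, or wraps, a *gubernator.PeerErr.
func isPeerErr(err error) bool {
	var peerErr *gubernator.PeerErr
	// errors.As walks the wrapped chain, unlike a bare switch on err.(type).
	return errors.As(err, &peerErr)
}
```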
@Baliedge I don't know what to do about the newly failing test; it passes when I run it locally:

```console
[26/02/24 12:36:35] ~/GitHub/gubernator (fix-peer-client) $ go test ./... -run=TestLeakyBucketDivBug -race -count=1 -p=1
ok      github.com/mailgun/gubernator/v2            2.411s
ok      github.com/mailgun/gubernator/v2/cluster    2.014s [no tests to run]
?       github.com/mailgun/gubernator/v2/cmd/gubernator          [no test files]
?       github.com/mailgun/gubernator/v2/cmd/gubernator-cli      [no test files]
?       github.com/mailgun/gubernator/v2/cmd/gubernator-cluster  [no test files]
?       github.com/mailgun/gubernator/v2/cmd/healthcheck         [no test files]
[26/02/24 12:36:47] ~/GitHub/gubernator (fix-peer-client) $
```

I have the suspicion that …
Nice
@Baliedge
@miparnisari I noticed after merging. It does work on retry. If you have an idea how to make it better, let's try it.
Description

Close #222.

The mutex deadlock is proven by the flaky TestPeerClientShutdown. Sometimes this test also gives "panic: send on closed channel". It happens because when you call Shutdown, we close the channel, but we are not preventing any further requests, which will then try to write to that channel.

Testing

With this PR: