Directory nodes constantly disconnecting #1435
Comments
Fixes #1435. Prior to this commit, if connections were immediately closed, no backoff or jitter was applied to delay the next connection attempt, resulting in too many connection attempts to directory nodes that the client could not reach, and spam in the logs. After this commit, we enforce that we always shut down the ClientService on any disconnection trigger or connection failure. For directory nodes we only attempt to reconnect after a manual backoff delay (4 seconds, increasing by 50% for 20 attempts, plus a few seconds of jitter); for non-directory nodes we always give up on any failure.
We got a disconnect event
See #1435. Prior to this commit, if connections were immediately closed, no backoff or jitter was applied to delay the next connection attempt, resulting in too many connection attempts to directory nodes that the client could not reach, and spam in the logs. After this commit, we enforce that we always shut down the ClientService on any disconnection trigger or connection failure. For directory nodes we only attempt to reconnect after a manual backoff delay (4 seconds, increasing by 50% for 20 attempts, plus a few seconds of jitter); for non-directory nodes we always give up on any failure. Max delay is a bit above 3 hours.
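For illustration, here is a minimal Python sketch of the delay schedule described in that commit message. It is not the actual JoinMarket implementation: the 4-second base, the 50% growth over 20 attempts and the "few seconds jitter" come from the description above, while the jitter bound and the exact counting convention are assumptions, so the final delay here only roughly matches the quoted maximum of a bit above 3 hours.

```python
import random

BASE_DELAY = 4.0      # seconds; first reconnection delay, per the description above
GROWTH = 1.5          # "increasing by 50%" per attempt
MAX_ATTEMPTS = 20     # directory-node reconnections stop after this many tries
MAX_JITTER = 5.0      # "a few seconds jitter"; the exact bound is an assumption

def reconnect_delay(attempt: int) -> float:
    """Return the delay in seconds before reconnection attempt `attempt` (0-based)."""
    return BASE_DELAY * GROWTH ** attempt + random.uniform(0, MAX_JITTER)

if __name__ == "__main__":
    # Print the whole schedule; the last delays are on the order of hours.
    for attempt in range(MAX_ATTEMPTS):
        print(f"attempt {attempt + 1:2d}: ~{reconnect_delay(attempt):9.1f} s")
```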
Just dropping a note on what I've been seeing for the last few days since the #1436 merge: though we do back off (with that code, of course; most users are on 0.9.8), we still seem to be basically permanently shut off from two of the three directory nodes. More work is required, even though that patch did have the desired effect, and bots can actually run normally without log spam. Re: "more work", I think a good place to focus is how the directory node onion services are configured. Perhaps somewhere buried there's a setting that can allow more connections? Or at least, figure out what the connection limits are (both in timing of requests and total number, say)?
Just focusing on Tor, this could be a start: https://community.torproject.org/onion-services/advanced/dos/ For lower-level details we might have to dig into the Tor source code and spec at some point.
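For reference, the page linked above covers onion-service DoS defenses that are configured in torrc. The snippet below is only an illustration of what those options look like for a torrc-configured onion service: the option names come from the tor man page, the numeric values are arbitrary examples rather than recommendations, and whether (and how) they can be applied to the tor instance JoinMarket spawns for a directory node is exactly the open question in this thread.

```
# Illustrative torrc excerpt (example values only).
HiddenServiceDir /var/lib/tor/jm_directory/
HiddenServicePort 5222 127.0.0.1:5222
# Rate-limit introduction requests at the introduction points:
HiddenServiceEnableIntroDoSDefense 1
HiddenServiceEnableIntroDoSRatePerSec 25
HiddenServiceEnableIntroDoSBurstPerSec 200
# Cap simultaneous streams per rendezvous circuit:
HiddenServiceMaxStreams 500
HiddenServiceMaxStreamsCloseCircuit 0
```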
Was reading docs for a related project, and noticed this: https://tunnelsats.github.io/tunnelsats/FAQ.html#tuning-tor ... perhaps worth investigating? I believe someone mentioned LongLivedPorts on this repo a while ago. Edit: yes, see fbcb9fd, hence 5222.
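On LongLivedPorts specifically: it is a client-side torrc option listing destination ports whose streams Tor carries over circuits built from "stable" relays, and 5222 is already in Tor's default list, which (per fbcb9fd) is why the directory nodes listen on that port. The line below is only an illustration of the syntax; the port list shown is what I believe the current default to be, and setting the option replaces the default rather than extending it.

```
# Client-side option: streams to these ports use circuits made of stable relays.
LongLivedPorts 21, 22, 706, 1863, 5050, 5190, 5222, 5223, 6523, 6667, 6697, 8300
```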
Interesting point: though we see only perhaps a couple of hundred makers (at best), we could very easily be supporting 1000+ takers, makers and obwatchers. I'll look into changing this. (Though I still don't actually understand why it would be 'connect then disconnect' and not just 'refuse', if indeed this is in any way related.) Edit: hmm, on a second read, unfortunately that appears to be a minimum, not a maximum. So probably irrelevant, but I may as well leave the note here.
In case anyone else is investigating this like me, here's a point that might be very important: how to see detailed logs of the tor instance in which your directory node is running. Note that we spawn a new tor process for it, and that process can be set to log to a particular file by editing its logging configuration to something like the sketch below. Log levels here are debug, info, warn, notice or similar, as per the docs/man page. I think info is already pretty verbose; debug looks overwhelming. I'm going to be tail -f-ing this to help me see what transpires with different setups. (You can also control the output of …) Please let me know if you have other info about logging tor here.
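As a concrete illustration of the kind of logging directive meant above: the lines below use standard torrc Log syntax from the tor man page. The file paths are arbitrary examples, and exactly how the directive gets passed to the tor process that JoinMarket spawns is not shown here, so treat this as a sketch rather than a recipe.

```
# Write info-level (and above) messages to a dedicated file:
Log info file /var/log/tor/jm-directory-info.log
# Optionally keep a quieter notice-level log alongside it:
Log notice file /var/log/tor/jm-directory-notices.log
```

You can then tail -f whichever file you chose, as described in the comment above.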
An addendum: when …
Seeing this pretty often still when launching bots. Curious why sometimes they will reconnect for very short periods of time.
So, I decided I should start my own directory node to dig into why. Tor is quite a bit more sophisticated than I am, so it was a lot of trial and error to even get going. Now that it's working, I guess it's time to try to break it. If anyone wants to connect, here is the list of directory_nodes I am currently using for reference. This includes the newest from the stopgap list of directory nodes (#1445) as of the date of this comment, as well as the new node from #1456.
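For anyone unfamiliar with where such a list lives: directory nodes are configured via the directory_nodes option in joinmarket.cfg. The excerpt below is only a sketch of the format: the section name and the type key are assumptions about the config layout, and the two onion addresses are simply the ones quoted elsewhere in this issue, used for illustration; they are not the commenter's actual list (which includes the newer nodes from #1445 and #1456).

```
# Sketch of the relevant joinmarket.cfg section (layout assumed; addresses
# copied from this issue purely for illustration):
[MESSAGING:onion]
type = onion
directory_nodes = 3kxw6lf5vf6y26emzwgibzhrzhmhqiw6ekrek3nqfjjmhwznb2moonad.onion:5222,bqlpq6ak24mwvuixixitift4yu42nxchlilrcqwk2ugn45tdclg42qid.onion:5222
```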
From what I can see, it really seems like there's a magic threshold that we reach, after which things stop working.
Seeing the same thing here still. I've been monitoring the tor logs for the directory node I set up, but nothing really unusual has presented itself yet, and it still hasn't been hit by the "magic threshold" as far as I can tell. Please report if you have issues with the "wkd3" onion. Edit: the "wkd3" onion referenced above has been replaced by plq5jw5hqke6werrc5duvgwbuczrg4mphlqsqbzmdfwxwkm2ncdzxeqd.onion due to unrelated server downtime.
EDIT: as per #1436 (comment), the spam issue should be mostly solved. Renamed this issue to focus on the "disconnect" part. Hopefully the changes in #1436 will already improve this.
EDIT: g3hv4uynnmynqqq2mchf3fcm3yd46kfzmcdogejuckgwknwyq5ya6iad is constantly disconnecting too now.

As reported by many different users and contributors, the following two directory nodes are disconnecting in a repetitive/constant pattern:
3kxw6lf5vf6y26emzwgibzhrzhmhqiw6ekrek3nqfjjmhwznb2moonad.onion:5222
bqlpq6ak24mwvuixixitift4yu42nxchlilrcqwk2ugn45tdclg42qid.onion:5222
The log is simply a constant repetition of:
It started last week, from what I can gather from users' reports, but I can't confirm it wasn't happening earlier too.
At this point, it seems plausible to me that this is a bug in our code logic.
The simplest explanation that comes to mind is that we are not correctly handling some connectivity edge case.
Curiously, the newly added directory node
g3hv4uynnmynqqq2mchf3fcm3yd46kfzmcdogejuckgwknwyq5ya6iad.onion:5222
is unaffected. This node is not yet in a release, so it might simply be because of less network traffic (EDIT: this directory is now experiencing exactly the same problem).

FWIW, signet directories also appear to be working.
This issue might resolve itself, but IMHO it's good to look into it to prevent it from happening again.