
Directory nodes constantly disconnecting #1435

Open
PulpCattel opened this issue Jan 18, 2023 · 9 comments · Fixed by #1436

Comments

@PulpCattel
Member

PulpCattel commented Jan 18, 2023

EDIT: as per #1436 (comment), the spam issue should be mostly solved. Renamed this issue to focus on the "disconnect" part. Hopefully the changes in #1436 will improve this already.

EDIT: g3hv4uynnmynqqq2mchf3fcm3yd46kfzmcdogejuckgwknwyq5ya6iad is constantly disconnecting too now.

As reported by many different users and contributors, the following two directory nodes are disconnecting in a repetitive/constant pattern:

  • 3kxw6lf5vf6y26emzwgibzhrzhmhqiw6ekrek3nqfjjmhwznb2moonad.onion:5222
  • bqlpq6ak24mwvuixixitift4yu42nxchlilrcqwk2ugn45tdclg42qid.onion:5222

The log is simply a constant repetition of:

[DEBUG]  We got a disconnect event: 3kxw6lf5vf6y26emzwgibzhrzhmhqiw6ekrek3nqfjjmhwznb2moonad.onion:5222
[INFO ]  Updating status to connected for peer 3kxw6lf5vf6y26emzwgibzhrzhmhqiw6ekrek3nqfjjmhwznb2moonad.onion:5222.

It started last week, from what I can gather from users' reports, but I can't confirm it wasn't happening earlier too.
At this point, it seems plausible to me that this is a bug in our code logic.

The simplest explanation that comes to my mind is that we are not correctly handling some connectivity edge case.

Curiously, the newly added directory node g3hv4uynnmynqqq2mchf3fcm3yd46kfzmcdogejuckgwknwyq5ya6iad.onion:5222 is unaffected. This node is not yet in a release, so it might simply be because of less network traffic (EDIT: this directory is now experiencing exactly the same problem).

FWIW, signet directories also appear to be working.

This issue might resolve itself, but IMHO it's good to look into it to prevent it from happening again.

AdamISZ added a commit that referenced this issue Jan 21, 2023
Fixes #1435.
Prior to this commit, if connections were immediately closed,
no backoff or jitter were applied to delay the next connection attempt,
resulting in too many connection attempts to directory nodes that the
client could not reach, and spam in the logs.
After this commit, we enforce that we always shut down the ClientService
on any disconnection trigger or connection failure, and only attempt to
reconnect with a manual backoff delay (4 seconds increasing by 50% for
20 attempts, plus a few seconds jitter), for directory nodes, while for
non-directory nodes we always give up on any failure.
@PulpCattel PulpCattel changed the title from "Log spammed by We got a disconnect event" to "Directory nodes constantly disconnecting" Jan 23, 2023
AdamISZ added a commit that referenced this issue Jan 23, 2023
See #1435.
Prior to this commit, if connections were immediately closed,
no backoff or jitter were applied to delay the next connection attempt,
resulting in too many connection attempts to directory nodes that the
client could not reach, and spam in the logs.
After this commit, we enforce that we always shut down the ClientService
on any disconnection trigger or connection failure, and only attempt to
reconnect with a manual backoff delay (4 seconds increasing by 50% for
20 attempts, plus a few seconds jitter), for directory nodes, while for
non-directory nodes we always give up on any failure. Max delay is a bit
above 3 hours.
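
As a rough illustration of the backoff schedule described in these commit messages (a sketch, not the actual #1436 code; the exact base, multiplier, jitter and cap there may differ):

import random

# Sketch of the reconnect schedule described above: start at 4 seconds,
# grow by 50% per attempt for 20 attempts, and add a few seconds of random
# jitter so many bots don't retry a directory node in lockstep.
def reconnect_delays(base=4.0, factor=1.5, attempts=20, max_jitter=3.0):
    delay = base
    for _ in range(attempts):
        yield delay + random.uniform(0.0, max_jitter)
        delay *= factor

for i, d in enumerate(reconnect_delays(), start=1):
    print(f"attempt {i:2d}: wait ~{d:8.1f}s")

Depending on exactly where the 50% increase is applied, the final wait lands somewhere in the 2.5 to 3.7 hour range, roughly matching the "a bit above 3 hours" note.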
@PulpCattel PulpCattel reopened this Jan 24, 2023
@AdamISZ
Member

AdamISZ commented Jan 27, 2023

Just dropping a note on what I've been seeing for the last few days since the #1436 merge: though we back off (with that code, of course; most users are on 0.9.8), we still seem to be basically permanently shut off from two of the three directory nodes. More work is required, even though that patch did have the desired effect, and bots can actually run normally without log spam.

Re: "more work", I think a good place to focus on is how the directory node onion services are configured. Perhaps somewhere buried there's a setting that can allow more connections? Or at least, figure out what the connection limits are (both in timing of requests, and total number, say)?

@PulpCattel
Member Author

Just focusing on Tor, this could be a start: https://community.torproject.org/onion-services/advanced/dos/

For lower-level details we might have to dig into the Tor source code and spec at some point.
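
For example, the kinds of per-service rate-limiting knobs that page discusses look roughly like this in a torrc (the option names are from the Tor documentation; the values are just the documented defaults, not tested recommendations for our directories, and since we create the onion service through txtorcon they would have to be applied to whatever tor instance actually hosts it):

# Illustrative torrc sketch for an onion service under heavy load.
HiddenServiceDir /var/lib/tor/jm_directory
HiddenServicePort 5222 127.0.0.1:5222
# Rate-limit introduction requests at the introduction points.
HiddenServiceEnableIntroDoSDefense 1
HiddenServiceEnableIntroDoSRatePerSec 25
HiddenServiceEnableIntroDoSBurstPerSec 200
# Optionally cap streams per rendezvous circuit (0 = unlimited).
HiddenServiceMaxStreams 0
HiddenServiceMaxStreamsCloseCircuit 0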

@AdamISZ
Member

AdamISZ commented Jan 29, 2023

Was reading docs for a related project, and noticed this: https://tunnelsats.github.io/tunnelsats/FAQ.html#tuning-tor ... perhaps worth investigating? I believe someone mentioned LongLivedPorts on this repo a while ago. Edit: yes, see fbcb9fd - hence 5222.

@AdamISZ
Member

AdamISZ commented Feb 5, 2023

Interesting point from man tor: ConnLimit defaults to 1000. Second, I observed earlier today that about 1000 !orderbook messages were propagated by my dir node. Third, the way we use tor is that a separate instance of the tor process is run for each directory node.

Though we see only perhaps a couple of hundred makers (at best), we could very easily be supporting 1000 takers + makers + obwatchers. I'll look into changing this. (Though I still don't actually understand why it would be 'connect then disconnect' and not just 'refuse', if indeed this is in any way related.)

Edit: hmm, on a second read, unfortunately that appears to be a minimum not a maximum. So probably irrelevant but may as well leave the note here.

@AdamISZ
Member

AdamISZ commented Feb 7, 2023

In case anyone else is investigating this like me, here's a point that might be very important: how to see detailed logs of the tor instance in which your directory node is running:

Note that we spawn a new tor process for it, and that process can be set to log to a particular file by editing path-to-jmcs/jmvenv/lib/python3.xx/site-packages/txtorcon/controller.py at line 314, extending this list:

config_args = ['-f', '/dev/null/non-existant-on-purpose', '--ignore-missing-torrc']

to something like:

config_args = ['-f', '/dev/null/non-existant-on-purpose', '--ignore-missing-torrc', 'Log', 'info file yourfilename.txt']

(Read man tor, but notice that while on the command line you need double quotes around the multi-word value for the Log option, here you don't need the double quotes.) Another thing: the log file is created in /tmp/tortmpxxxxx, where the random xxxxx can be seen from ps aux | grep tor or similar.

Log levels here are debug, info, warn, notice or similar as per docs/man page. I think info is already pretty verbose, debug looks overwhelming. I'm going to be tail -f-ing this to help me see what transpires with different setups.
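
To locate that log file without hunting around manually, something like this works (a small sketch; it just assumes the spawned tor's DataDirectory is the newest /tmp/tortmp* directory and that the Log line above used the relative filename yourfilename.txt):

import glob
import os

# Find the newest /tmp/tortmp* directory (the DataDirectory of the most
# recently spawned tor) and the log file created by the Log option above.
dirs = sorted(glob.glob("/tmp/tortmp*"), key=os.path.getmtime)
if dirs:
    datadir = dirs[-1]
    print("tor data directory:", datadir)
    print("log file:", os.path.join(datadir, "yourfilename.txt"))
else:
    print("no /tmp/tortmp* directory found; is the spawned tor running?")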

(You can also control the output of journalctl with options in the .service file, like StandardOutput=journal+console, but it doesn't help you with the tor log, hence the above).

Please let me know if you have other info about logging tor, here.

@AdamISZ
Member

AdamISZ commented Feb 7, 2023

An addendum: when tail-ing that output, it was still overwhelming (which itself may be interesting!) for my mainnet dnode, so you may find notice more realistic than info (as, in fact, the man page suggests). Let's see.

@CohibAA

CohibAA commented Mar 3, 2023

Seeing this pretty often still when launching bots. Curious why sometimes they will reconnect for very short periods of time.

2023-03-xxxxxxxxxxxxxxx [WARNING]  Failure to send message to directory: g3hv4uynnmynqqq2mchf3fcm3yd46kfzmcdogejuckgwknwyq5ya6iad.onion:5222.
2023-03-xxxxxxxxxxxxxxx [WARNING]  Failure to send message to directory: 3kxw6lf5vf6y26emzwgibzhrzhmhqiw6ekrek3nqfjjmhwznb2moonad.onion:5222.
2023-03-xxxxxxxxxxxxxxx [WARNING]  Failure to send message to directory: bqlpq6ak24mwvuixixitift4yu42nxchlilrcqwk2ugn45tdclg42qid.onion:5222.

So, I decided I should start my own directory node to dig into why. Tor is quite a bit more sophisticated than I am, so it was a lot of trial and error to even get going. Now that it's working, I guess it's time to try to break it.

If anyone wants to connect, here is the list of directory_nodes I am currently using, for reference. This includes the newest from the stopgap list of directory nodes (#1445) as of the date of this comment, as well as the new node from #1456.

directory_nodes = plq5jw5hqke6werrc5duvgwbuczrg4mphlqsqbzmdfwxwkm2ncdzxeqd.onion:5222,g3hv4uynnmynqqq2mchf3fcm3yd46kfzmcdogejuckgwknwyq5ya6iad.onion:5222,3kxw6lf5vf6y26emzwgibzhrzhmhqiw6ekrek3nqfjjmhwznb2moonad.onion:5222,bqlpq6ak24mwvuixixitift4yu42nxchlilrcqwk2ugn45tdclg42qid.onion:5222,ylegp63psfqh3zk2huckf2xth6dxvh2z364ykjfmvsoze6tkfjceq7qd.onion:5222

@PulpCattel
Member Author

From what I can see, g3hv4uynnmynqqq2mchf3fcm3yd46kfzmcdogejuckgwknwyq5ya6iad is "down" too now, same pattern as the other directories. RIP.

It really seems like there's a magic threshold that we reach and things stop working.

@CohibAA

CohibAA commented Mar 27, 2023

2023-03-27 xxxxxxxxxxxx [WARNING]  Failure to send message to directory: g3hv4uynnmynqqq2mchf3fcm3yd46kfzmcdogejuckgwknwyq5ya6iad.onion:5222.
2023-03-27 xxxxxxxxxxxx [WARNING]  Failure to send message to directory: 3kxw6lf5vf6y26emzwgibzhrzhmhqiw6ekrek3nqfjjmhwznb2moonad.onion:5222.
2023-03-27 xxxxxxxxxxxx [WARNING]  Failure to send message to directory: bqlpq6ak24mwvuixixitift4yu42nxchlilrcqwk2ugn45tdclg42qid.onion:5222.

Seeing the same thing here still. I've been monitoring the tor logs for the directory node I set up, but nothing really unusual has presented itself yet, and it still hasn't been hit by the "magic threshold" as far as I can tell. Please report if you have issues with "wkd3kd73ze62sqhlj2ebwe6fxqvshw5sya6nkvrgcweegt7ljhuspaid.onion"* and I can possibly cross-reference the logs more directly.

edit: The "wkd3" onion referenced above has been replaced by plq5jw5hqke6werrc5duvgwbuczrg4mphlqsqbzmdfwxwkm2ncdzxeqd.onion due to unrelated server downtime.

kristapsk pushed a commit to kristapsk/joinmarket-clientserver that referenced this issue Jan 5, 2024
See JoinMarket-Org#1435.
Prior to this commit, if connections were immediately closed,
no backoff or jitter were applied to delay the next connection attempt,
resulting in too many connection attempts to directory nodes that the
client could not reach, and spam in the logs.
After this commit, we enforce that we always shut down the ClientService
on any disconnection trigger or connection failure, and only attempt to
reconnect with a manual backoff delay (4 seconds increasing by 50% for
20 attempts, plus a few seconds jitter), for directory nodes, while for
non-directory nodes we always give up on any failure. Max delay is a bit
above 3 hours.