
Peer is flooded with new ingress streams #734

Open
geigerzaehler opened this issue Jul 26, 2021 · 28 comments

@geigerzaehler
Contributor

geigerzaehler commented Jul 26, 2021

When connecting to either setzling or sprout, we see the following log line every 200ms:

Jul 26 12:15:18.816  INFO accept{peer_id=hyb79bxwfz8mbmqarrok53fj9cnmmey6oujg1yfoysjjo1rmzojuhw}:
  disco:discovered{remote_id=hynkyndc6w3p8urucakobzna7sxwgcqny7xxtw88dtx3pkf7m3nrzc}:
  incoming{remote_id=hynkyndc6w3p8urucakobzna7sxwgcqny7xxtw88dtx3pkf7m3nrzc remote_addr=34.141.175.140:12345}:
  librad::net::protocol::io::streams: new ingress stream

In addition, we see the following log line for a while until it goes away (maybe after replicating a first project):

Jul 26 12:07:56.240  WARN radicle_daemon::peer::run_state::running_waiting_room:
  waiting room error error=MissingUrn(Urn { id: Oid(1183cf230ec31058fa955e69b4172086161641de), path: Some(RefLike("rad/id")) })

This happens even if I start Upstream with a clean slate and I don’t replicate any projects.

I’ve noticed that once my peer is connected to multiple peers, I’m seeing this log line for every peer. This puts significant pressure on the CPU, resulting in 50% utilization. This is a high-severity issue for Upstream users, since the app and their machines become unresponsive.

Steps to reproduce

  1. Start the most recent version (0.2.9 or master) of Upstream from the command line with a fresh profile using RAD_HOME
  2. Create a profile
  3. Look at the command-line output
@geigerzaehler changed the title from "Seed is flooding peers with new ingress streams" to "Peer is flooded with new ingress streams" on Jul 26, 2021
@kim
Contributor

kim commented Jul 27, 2021

Is it the same remote_addr always?

@geigerzaehler
Contributor Author

Is it the same remote_addr always?

Yes it is. Once I’m connected to more peers this happens with every peer. Also the Oid for the waiting room warning is always the same.

@kim
Contributor

kim commented Jul 27, 2021

I don’t know of anything which would trigger packets at a 200ms interval, except perhaps an amplification effect of something malfunctioning with the “waiting room”.

@xla
Contributor

xla commented Jul 27, 2021

@geigerzaehler If you could conjure up debug logs from the seed that correspond to the upstream peer, that would help with narrowing down the interaction that is causing this.

@cloudhead
Contributor

I’ve noticed that once my peer is connected to multiple peers, I’m seeing this log line for every peer. This puts significant pressure on the CPU, resulting in 50% utilization. This is a high-severity issue for Upstream users, since the app and their machines become unresponsive.

Does this then rule out a potential issue in the seed node/library?

@geigerzaehler
Contributor Author

@geigerzaehler If you could conjure up debug logs from the seed that correspond to the upstream peer, that would help with narrowing down the interaction that is causing this.

Here’s the seed log just before the issue starts. Afterwards the log basically repeats itself.

seed.log

@geigerzaehler
Contributor Author

A couple of things I noticed while investigating this that may contribute to the issue:

  • We persist the waiting room whenever its state changes. This may happen quite frequently, for example on every Have message received by the protocol, even the ones that are uninteresting. Persistence is probably not cheap.
  • Are the calls to kv for persistence blocking on I/O? If so, this might be a problem because we don't spawn them on a separate thread (see the sketch after this list).
  • We're using tokio::task::spawn_blocking instead of unblocking throughout the daemon codebase.
  • daemon::peer::announcement::run reads from and writes to the store, but the calls are not synchronized. It is possible that two calls of run() execute concurrently and interfere with each other.
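
For illustration, here is a rough sketch of keeping a blocking kv write off the async executor with tokio::task::spawn_blocking. WaitingRoomState and save_waiting_room are hypothetical stand-ins, not the daemon's actual types:

    use tokio::task;

    // Hypothetical stand-in for the daemon's waiting-room state.
    struct WaitingRoomState {
        serialized: Vec<u8>,
    }

    // A blocking persistence call, e.g. a synchronous kv write to disk.
    fn save_waiting_room(state: &WaitingRoomState) -> std::io::Result<()> {
        std::fs::write("waiting_room.bin", &state.serialized)
    }

    // Move the blocking I/O onto tokio's blocking thread pool so it
    // cannot stall the async protocol event loop.
    async fn persist(state: WaitingRoomState) -> std::io::Result<()> {
        task::spawn_blocking(move || save_waiting_room(&state))
            .await
            .expect("persistence task panicked")
    }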

@cloudhead
Contributor

Seems like radicle-dev/radicle-client-services#5 is related, and it doesn't have a waiting room or complex behavior. It looks like peers are establishing large numbers of connections to other peers, something we've been seeing on the seed node for a while.

It could either be malicious behavior, or some kind of feedback loop.

In any case it might help to limit the number of ingress streams per PeerId, as a mitigation.
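
As a rough illustration of that mitigation (not librad's actual accept path), something like a per-peer counter could refuse streams past a threshold; the PeerId alias and the limit value below are placeholders:

    use std::collections::HashMap;

    // Placeholder for librad's PeerId.
    type PeerId = String;

    const MAX_STREAMS_PER_PEER: usize = 32;

    #[derive(Default)]
    struct IngressLimiter {
        open: HashMap<PeerId, usize>,
    }

    impl IngressLimiter {
        /// Returns true if a new ingress stream from `peer` should be accepted.
        fn try_accept(&mut self, peer: &PeerId) -> bool {
            let count = self.open.entry(peer.clone()).or_insert(0);
            if *count >= MAX_STREAMS_PER_PEER {
                return false;
            }
            *count += 1;
            true
        }

        /// Call when a stream from `peer` closes.
        fn release(&mut self, peer: &PeerId) {
            if let Some(count) = self.open.get_mut(peer) {
                *count = count.saturating_sub(1);
            }
        }
    }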

@FintanH
Contributor

FintanH commented Jul 29, 2021

Something I find weird about this log is that a peer keeps advertising the same Have repeatedly:

    24 Jul 28 10:00:52.008 DEBUG accept{peer_id=hybz9gfgtd9d4pd14a6r66j5hz6f77fed4jdu7pana4fxaxbt369kg}:incoming:incoming{remote_id=hyd7cb39wws6jz5wmifzwtjn8u57siw6knipqttk6nhc657r1fozaw remote_addr=188.120.235.3:12345}:tock{tock=AttemptSend { to: GenericPeerInfo { peer_id: hyd7cb39wws6jz5wmifzwtjn8u57siw6knipqttk6nhc657r1fozaw, advertised_info: PeerAdvertisement { listen_addrs: Within { inner: [188.120.235.3:12345], min: 0, max: 16 }, capabilities: {} }, seen_addrs: Within { inner: [188.120.235.3:12345], min: 0, max: 16 } }, message: Membership(Neighbour { info: PeerAdvertisement { listen_addrs: Within { inner: [10.156.15.198:12345, 127.0.0.1:12345], min: 0, max: 16 }, capabilities: {} }, need_friends: None }) }}:incoming{remote_id=hyd7cb39wws6jz5wmifzwtjn8u57siw6knipqttk6nhc657r1fozaw remote_addr=188.120.235.3:12345}:tock{tock=SendConnected { to: hybdj8d47ew35471nu4gs7pco8p9xib1iksdmtygy43ghyu5ypaumg, message: Gossip(Have { origin: GenericPeerInfo { peer_id: hydd5idud5sbtkdmb8cmuprpf3rsj4ajper3j4738f9efw7nunpgxe, advertised_info: PeerAdvertisement { listen_addrs: Within { inner: [127.0.0.1:37109, 192.168.4.41:37109], min: 0, max: 16 }, capabilities: {} }, seen_addrs: Within { inner: [], min: 0, max: 16 } }, val: Payload { urn: Urn { id: Oid(1183cf230ec31058fa955e69b4172086161641de), path: Some(RefLike("rad/id")) }, rev: Some(Git(b804589da7f790d471c8c1870d880f7ac6b471d9)), origin: None } }) }}: librad::net::protocol::tick: tock

Doing `rg b804589da7f790d471c8c1870d880f7ac6b471d9 seed.log | wc` will bring up 64 occurrences. They also happen very close to each other.

@FintanH
Contributor

FintanH commented Jul 29, 2021

In any case it might help to limit the number of ingress streams per PeerId, as a mitigation.

But radicle-link recently added rate limiting, so this is even more curious 🤔

@kim
Contributor

kim commented Jul 29, 2021

Rate limiting is limited due to the stream-per-message model. That was a bad idea; I'm changing it. At the end of the day, a peer flooding the network cannot really be defended against, unless we auto-ban them permanently.

@geigerzaehler
Contributor Author

Any update on this? This issue often makes Upstream unusable :(

@FintanH
Contributor

FintanH commented Aug 16, 2021

Any update on this? This issue often makes Upstream unusable :(

I thought the last thing we talked about was Upstream updating librad to this commit cb91ccd and seeing how that might improve usage?

@kim
Contributor

kim commented Aug 16, 2021 via email

@geigerzaehler
Contributor Author

I thought the last thing we talked about was Upstream updating librad to this commit cb91ccd and seeing how that might improve usage?

That didn’t help unfortunately. We’re still seeing this issue.

Also heard from @cloudhead on matrix that it was sprout misbehaving.

I was also seeing this issue with setzling. And this was after I restarted it and was the only peer connected to it.

@FintanH
Contributor

FintanH commented Aug 17, 2021

I was also seeing this issue with setzling. And this was after I restarted it and was the only peer connected to it.

And there was no sign of some peer repeatedly making requests?

@kim
Contributor

kim commented Aug 17, 2021 via email

@geigerzaehler
Contributor Author

I’m not able to reproduce this issue anymore if I’m the only one connected to a seed node. But I still see this issue if other peers are connected. After adding some logging statements, I see the following when I connect to sprout:

Aug 18 16:54:25.173  INFO librad::net::peer::storage: gossip received, has: Payload { urn: Urn { id: Oid(8cc143b5b061d9882726564702017b5f94bb0f1a), path: Some(RefLike("rad/id")) }, rev: Some(Git(f2949476d30de0ee650598ba16aee4a4601157da)), origin: None }, provider: hyde43uaaqz5bnh8kpqfwiwx7inxzdmmtiy1ndtf6zazfrcny3ossr
    at /home/thomas/code/radicle-link/librad/src/net/peer/storage.rs:189
    in librad::net::peer::storage::put with has: Payload { urn: Urn { id: Oid(8cc143b5b061d9882726564702017b5f94bb0f1a), path: Some(RefLike("rad/id")) }, rev: Some(Git(f2949476d30de0ee650598ba16aee4a4601157da)), origin: None }
    in librad::net::protocol::broadcast::apply with remote_id: hynkyndc6w3p8urucakobzna7sxwgcqny7xxtw88dtx3pkf7m3nrzc, Have { origin: GenericPeerInfo { peer_id: hyde43uaaqz5bnh8kpqfwiwx7inxzdmmtiy1ndtf6zazfrcny3ossr, advertised_info: PeerAdvertisement { listen_addrs: Within { inner: [127.0.0.1:57843, 192.168.1.108:57843], min: 0, max: 16 }, capabilities: {} }, seen_addrs: Within { inner: [], min: 0, max: 16 } }, val: Payload { urn: Urn { id: Oid(8cc143b5b061d9882726564702017b5f94bb0f1a), path: Some(RefLike("rad/id")) }, rev: Some(Git(f2949476d30de0ee650598ba16aee4a4601157da)), origin: None } }
    in librad::net::protocol::io::streams::incoming with remote_id: hynkyndc6w3p8urucakobzna7sxwgcqny7xxtw88dtx3pkf7m3nrzc, remote_addr: 35.187.60.160:12345
    in librad::net::protocol::io::discovered with remote_id: hynkyndc6w3p8urucakobzna7sxwgcqny7xxtw88dtx3pkf7m3nrzc
    in librad::net::protocol::accept::disco
    in librad::net::protocol::accept with peer_id: hyb9fda6y8tuwqn73wfnh1qf4m9cau714o5xf14t7wgwbkdhs138ay

This log line is repeated every couple of milliseconds. As you can see, the message arrives from the seed but originates from another node.

Could one way to limit the impact of this be to have the waiting room not react to uninteresting messages?

@FintanH
Contributor

FintanH commented Aug 18, 2021

Could one way to limit the impact of this be to have the waiting room not react to uninteresting messages?

The thing is that if you've requested an Urn, you're not guaranteed to know which PeerId it'll come from, so that counts as "uninteresting" in the gossip world. However, it's interesting to the waiting room because it found a (possibly) valid peer to replicate from.

Why do you think this could be the waiting room anyway? 🤔 Maybe I'm missing something you see in the logs there? 👀

@geigerzaehler
Contributor Author

Why do you think this could be the waiting room anyway?

I’m not saying the waiting room causes the issue. But it seems odd to me that the messages put so much pressure on the CPU. I was just thinking about limiting the impact of messages that don’t have an effect.
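
A rough sketch of that idea, for illustration: only hit the persistence path when applying a message actually changed the waiting-room state. The WaitingRoom, Applied, and GossipHave types below are made up, not the daemon's real interface:

    // Illustrative types only.
    struct WaitingRoom { /* ... */ }
    struct GossipHave { /* urn, rev, provider, ... */ }

    enum Applied {
        Changed,
        Unchanged,
    }

    impl WaitingRoom {
        fn apply(&mut self, _have: &GossipHave) -> Applied {
            // ... update internal state, report whether anything changed ...
            Applied::Unchanged
        }
    }

    fn on_gossip(room: &mut WaitingRoom, have: &GossipHave, persist: impl Fn(&WaitingRoom)) {
        // Only pay for (potentially expensive) persistence when the
        // message had an effect on the waiting room.
        if let Applied::Changed = room.apply(have) {
            persist(room);
        }
    }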

@cloudhead
Contributor

So, what are we seeing exactly? That seed nodes are flooding, or that upstream nodes are flooding as soon as they are connected to a seed? Is it perhaps that the seed should check if the project is already tracked here [0], because the call to track is idempotent?

If anything, this is probably the source of some unwanted traffic and we should definitely fix it.
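
A minimal sketch of that check, assuming a storage handle with an is_tracked-style query; all names below are illustrative rather than the actual seed code:

    // Illustrative types; the real seed goes through librad's storage
    // and tracking API, which this sketch does not reproduce.
    struct Storage;
    struct Urn(String);
    struct PeerId(String);

    impl Storage {
        fn is_tracked(&self, _urn: &Urn, _peer: &PeerId) -> bool {
            // ... query the tracking config ...
            false
        }

        fn track(&self, _urn: &Urn, _peer: &PeerId) {
            // ... idempotent tracking write ...
        }
    }

    /// Track a project, but only trigger follow-up work (replication,
    /// gossip) when the relationship is actually new.
    fn track_project(storage: &Storage, urn: &Urn, peer: &PeerId) -> bool {
        if storage.is_tracked(urn, peer) {
            // Already tracked: calling `track` again is harmless, but the
            // broadcast traffic it would trigger downstream is pure noise.
            return false;
        }
        storage.track(urn, peer);
        true
    }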

kim added a commit that referenced this issue Aug 19, 2021
When calling `Node::track_project`, it might very well be that the
project/peer is already tracked, which would result in unnecessary
broadcast traffic. Thus, don't emit any events in this case.

Also, change the order of `track` and `replicate` to ensure we
do actually fetch the newly-tracked peer's view. Additionally, we can
skip `replicate` if we already tracked the peer, as subsequent broadcast
messages won't be considered uninteresting.

Ref #734
Signed-off-by: Kim Altintop <[email protected]>
@kim
Contributor

kim commented Aug 19, 2021

I must say that this waiting room logic escapes me mostly. It is probably the case that the bandaid of screaming "have" into the forest should just be removed -- unfortunately, #653 is still stalled, because replication v3 is more important right now.

In the spirit of graceful centralisation, I could imagine the following workaround: since most clients (upstream clients for sure) bootstrap through a "central" seed, we could have some kind of no-forward flag on wants/haves, which would simply cause the seed to respond if it has the wanted data, but not forward the message.
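
Sketched at the message level only (these are not librad's real gossip types), such a flag could look roughly like this, with the seed consulting it before rebroadcasting:

    // Illustrative message shape; `no_forward` is the hypothetical flag.
    struct Payload {
        urn: String,
        rev: Option<String>,
    }

    struct Have {
        val: Payload,
        // "Answer me if you have it, but don't rebroadcast."
        no_forward: bool,
    }

    enum Action {
        Reply,
        Forward,
    }

    // What a seed does with an incoming Have.
    fn handle_have(have: &Have, we_have_it: bool) -> Vec<Action> {
        let mut actions = Vec::new();
        if we_have_it {
            actions.push(Action::Reply);
        }
        if !have.no_forward {
            actions.push(Action::Forward);
        }
        actions
    }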

@cloudhead
Contributor

That's kind of interesting - is this a soft-fork?

@kim
Contributor

kim commented Aug 19, 2021 via email

@cloudhead
Contributor

Hehe, I mean is it a backwards-compatible change?

@kim
Contributor

kim commented Aug 19, 2021 via email

@geigerzaehler
Contributor Author

Don’t we have some rate limiting in place that should prevent a peer being flooded by messages from another peer?

@kim
Contributor

kim commented Aug 19, 2021 via email
