feat: tracer: upgrade elastic search transport for pubsub traces #10405

@cortze cortze commented Mar 7, 2023

Related Issues

The current lotus trace wrapper in master pushes only a limited set of traces to the remote Elastic Search (ES) instance. The daemon only pushes traces related to:

  • Join/leave GossipSub topics
  • Graft/Prune peers per topic
  • GossipSub PeerScores

Furthermore, the current code also forces the user to set up a RemoteTrace in the config-file to track the published messages of the desired topics.

For the proposed gossipsub measurement study protocol/network-measurements#17, which tracks the broadcasting latencies of gossipsub messages, the RPC_Recv events are needed. Thus, this PR enables the remote submission of RPC-related traces. As RPC calls already include all the mesh-control messages, the separate mesh-control traces are disabled to reduce the overall overhead.
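The selection described above can be pictured as a small allowlist check. This is only a minimal sketch: the event names are placeholders modelled on the labels used in this description, the exact set kept or dropped is an assumption, and `shouldSubmit` is a hypothetical helper rather than a function from this PR.

```go
package main

import "fmt"

// remoteEvents lists which trace event types are forwarded to the remote
// ES instance. Names and the exact selection are assumptions made for
// illustration; they are not copied from the PR.
var remoteEvents = map[string]bool{
	"RECV_RPC": true,  // needed to measure broadcast latency
	"SEND_RPC": true,  // RPC-related trace
	"DROP_RPC": true,  // RPC-related trace
	"JOIN":     true,  // topic membership
	"LEAVE":    true,  // topic membership
	"GRAFT":    false, // mesh control, already carried inside RPC traces
	"PRUNE":    false, // mesh control, already carried inside RPC traces
}

// shouldSubmit reports whether an event of the given type is pushed remotely.
// Unknown types are dropped, since a missing map key yields false.
func shouldSubmit(evType string) bool {
	return remoteEvents[evType]
}

func main() {
	for _, ev := range []string{"RECV_RPC", "GRAFT", "JOIN"} {
		fmt.Printf("%s submitted: %v\n", ev, shouldSubmit(ev))
	}
}
```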

Proposed Changes

Submitting all the RPC traces to the remote ES instance through individual HTTP requests adds significant overhead to the lotus daemon, making it lose synchronization with the head of the chain. Thus, this PR upgrades the Elastic Search transport to support trace batching and a faster HTTP transport in the ES client.

The parameters applied in this PR appear to work well and are still being tested. Feel free to suggest better approaches or parameter values, such as the flushing time or the buffer limit of the batching system.
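To illustrate the batching idea, here is a minimal sketch of a buffer that flushes either when it reaches a size limit or when a timer fires. It is not the code in this PR: `batcher`, `newBatcher`, `flushFn`, and the limit of 3 used in `main` are hypothetical, and the flush callback stands in for whatever bulk request the ES client would send.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// batcher buffers trace events and hands them to flushFn in groups.
// All names here are hypothetical and not taken from the PR.
type batcher struct {
	mu      sync.Mutex
	buf     []string
	limit   int
	flushFn func(batch []string)
}

func newBatcher(limit int, flushFn func([]string)) *batcher {
	return &batcher{limit: limit, flushFn: flushFn}
}

// Add appends one event and flushes once the buffer reaches the limit.
func (b *batcher) Add(ev string) {
	b.mu.Lock()
	b.buf = append(b.buf, ev)
	var out []string
	if len(b.buf) >= b.limit {
		out, b.buf = b.buf, nil
	}
	b.mu.Unlock()
	if out != nil {
		b.flushFn(out) // called outside the lock so slow I/O does not block Add
	}
}

// Flush sends whatever is currently buffered, if anything.
func (b *batcher) Flush() {
	b.mu.Lock()
	out := b.buf
	b.buf = nil
	b.mu.Unlock()
	if len(out) > 0 {
		b.flushFn(out)
	}
}

// Run flushes every interval until stop is closed, then flushes once more.
func (b *batcher) Run(interval time.Duration, stop <-chan struct{}) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-t.C:
			b.Flush()
		case <-stop:
			b.Flush()
			return
		}
	}
}

func main() {
	var sizes []int
	b := newBatcher(3, func(batch []string) { sizes = append(sizes, len(batch)) })
	for i := 0; i < 7; i++ {
		b.Add(fmt.Sprintf("trace-%d", i))
	}
	b.Flush()
	fmt.Println(sizes) // prints [3 3 1]
}
```

In a daemon, `Run` would typically be started in its own goroutine (for example `go b.Run(5*time.Second, stop)`, where the interval is again only an example value), so that a quiet period still results in buffered traces being sent.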

Checklist

Before you mark the PR ready for review, please make sure that:

  • Commits have a clear commit message.
  • PR title is in the form of <PR type>: <area>: <change being made>
  • New features have usage guidelines and / or documentation updates in
  • Tests exist for new functionality or change in behavior
  • CI is green


cortze commented Mar 14, 2023

Updates

The suggested changes have been running for 5 consecutive days on a locally monitored node. To give more information about how remote trace submission impacts the overall performance of the Lotus daemon, here are some plots comparing the resources used by lotus v1.19.0 (after #7398) against those used with the proposed changes.

NOTE:

  • All the graphs show two distributions, separated by the red vertical line on 09/03/2023 at 16:00.
  • The distribution on the left corresponds to data gathered by the Prometheus node exporter running on the same machine as the Lotus daemon at v1.19.0, after "Lotus extended pubsub tracer" #7398, which was used to sync the Filecoin chain from the latest available snapshot.
  • The distribution on the right corresponds to the lotus daemon compiled with the proposed changes.

Highlights

  • The lotus daemon maintains the same level of resource utilization, with the major exception of network traffic (both incoming and outgoing).
  • The extra measured traffic is generated by the remote trace submission to the ES instance.
  • In both cases, the lotus daemon follows the head of the chain without any apparent problem (no WARN or ERROR logs, and lotus sync always reports 0 blocks missing).

Resource profiling

  • Disk usage: the aggregated disk usage of the lotus client remains constant at around 4GB/h.
    [Plots: Disk space increment | Increment per hour]
  • Memory usage: the allocated memory by the process follows the exact same increment over time.

  • CPU usage: there are no significant differences in CPU usage when tracking the PubSub traces (~2.5% extra at some point on 12/03). Note that the first spike in CPU usage is due to the synchronization of the chain.

  • Net-In: x1.5-x2 traffic increase (~150MB/h) when enabling the remote trace submission to the ES instance. Note that the initial spike originates from the lotus node syncing the chain, and that the same network increase is measured as Net-Out on the ES machine.
    [Plots: Lotus Net-In | ES Net-Out]
  • Net-Out: x2 increase in sent bytes (350MB/h) after enabling the remote gossip trace submission to ES, probably caused by formatting the GossipSub traces as JSON and submitting them to a remote Elastic Search instance.
    [Plots: Lotus Net-Out | ES Net-In]

@cortze cortze marked this pull request as ready for review March 14, 2023 10:13
@cortze cortze requested a review from a team as a code owner March 14, 2023 10:13
@cortze cortze force-pushed the feat/upgrade-elastic-search-traces-transport branch 2 times, most recently from a028d9e to 67d419e Compare April 19, 2023 10:47

cortze commented Apr 19, 2023

Rebased the PR onto v1.20.3, as it is the latest official release.

@cortze cortze force-pushed the feat/upgrade-elastic-search-traces-transport branch 3 times, most recently from b76f4f5 to 87235d2 Compare April 25, 2023 07:56

cortze commented Apr 25, 2023

Rebased the fork onto the latest v1.23.0.

@cortze cortze force-pushed the feat/upgrade-elastic-search-traces-transport branch from dff25de to c2e2725 Compare May 17, 2023 09:55

cortze commented May 22, 2023

Hey, @snadrus! Thank you so much for the feedback! I just addressed your comments by bringing back the net/http transport.
Let me know if anything is missing on my side :)

@snadrus snadrus merged commit aef2ab6 into filecoin-project:master May 24, 2023