Tcp long connections metrics #1249
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
@nlgwcy @LemmyHuang @xiangxinyong can you review the PR? Have I written correct eBPF code? I have made the changes requested in the previous comments on the proposal.
- Please present the results of the long connection metrics.
- The original destination address problem is still unhandled. If you plan to optimize it later, you can create a sub-issue under the existing LFX issue to track the task.
pkg/controller/telemetry/metric.go
Outdated
@@ -55,6 +55,7 @@ type MetricController struct {
EnableAccesslog atomic.Bool
EnableMonitoring atomic.Bool
EnableWorkloadMetric atomic.Bool
EnableLongTCPMetric atomic.Bool
I don't think it's necessary to distinguish between metrics for long connections and short connections.
As long as metrics are turned on, we should handle all types of connections.
bpf/kmesh/workload/include/config.h
Outdated
#define kmesh_perf_map km_perf_map
#define kmesh_perf_info km_perf_info
#define map_of_long_tcp_conns km_longtcpconns_map
#define long_tcp_conns_events km_longtcpconns_events
BPF_OBJ_NAME_LEN is 16, so I think this map name is too long.
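To make the length concern concrete, here is a minimal Go sketch that checks candidate map names against the limit. It assumes the 16-byte name buffer includes the trailing NUL, so at most 15 characters fit. The helper `fitsBPFObjName` is hypothetical and not part of Kmesh; `km_longconn_ev` is taken from a later commit message in this PR.

```go
package main

import "fmt"

// bpfObjNameLen mirrors BPF_OBJ_NAME_LEN from the kernel header linux/bpf.h.
// The buffer holds the trailing NUL as well, so 15 characters are usable.
const bpfObjNameLen = 16

// fitsBPFObjName is a hypothetical helper: it reports whether name can be
// stored in full as a BPF object name.
func fitsBPFObjName(name string) bool {
	return len(name) < bpfObjNameLen
}

func main() {
	names := []string{
		"km_perf_map",
		"km_longtcpconns_map",
		"km_longtcpconns_events",
		"km_longconn_ev",
	}
	for _, name := range names {
		fmt.Printf("%-24s len=%2d fits=%v\n", name, len(name), fitsBPFObjName(name))
	}
}
```

With this check, both names from the hunk above are rejected, while the shorter `km_longconn_ev` passes.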
Signed-off-by: Yash Patel <[email protected]>
…userspace Signed-off-by: Yash Patel <[email protected]>
…km_longconn_ev ringbuff Signed-off-by: Yash Patel <[email protected]>
b79a168 to a698cac
@nlgwcy @LiZhenCheng9527 I have updated the accesslogs for real-time TCP connection monitoring.
I have added real-time monitoring for workload and service metrics (meaning metrics are reported periodically, not only after close). Do we need separate metrics for long connections?
`# Enable/Disable Kmesh's accesslog: |
This is probably because the current eBPF program exceeds the number of instructions allowed in the kernel.
@@ -407,21 +440,34 @@ func buildV4Metric(buf *bytes.Buffer) (requestMetric, error) {
data.origDstPort = connectData.OriginalPort
}

data.sentBytes = connectData.SentBytes
data.receivedBytes = connectData.ReceivedBytes
data.sentBytes = connectData.SentBytes - tcp_conns[connectData.ConnId].sentBytes
When are you going to clean up tcp_conns?
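One possible answer, sketched in Go: keep the last reported totals per connection, compute the delta on each report, and delete the entry when the close report arrives. The types `connCounters` and `connTracker` and the `closed` flag are assumptions for illustration, not the PR's actual code, and the sketch is not goroutine-safe.

```go
package main

import "fmt"

// connCounters holds cumulative byte counts for one connection (hypothetical type).
type connCounters struct {
	sentBytes     uint64
	receivedBytes uint64
}

// connTracker remembers the totals seen in the previous report of each connection.
type connTracker struct {
	last map[uint64]connCounters
}

func newConnTracker() *connTracker {
	return &connTracker{last: make(map[uint64]connCounters)}
}

// delta returns the bytes transferred since the previous report for connID.
// It assumes the totals never decrease. When closed is true the entry is
// removed, so the map does not grow with connections that no longer exist.
func (t *connTracker) delta(connID uint64, total connCounters, closed bool) connCounters {
	prev := t.last[connID] // zero value on the first report
	d := connCounters{
		sentBytes:     total.sentBytes - prev.sentBytes,
		receivedBytes: total.receivedBytes - prev.receivedBytes,
	}
	if closed {
		delete(t.last, connID)
	} else {
		t.last[connID] = total
	}
	return d
}

func main() {
	t := newConnTracker()
	fmt.Println(t.delta(1, connCounters{100, 50}, false))
	fmt.Println(t.delta(1, connCounters{250, 80}, false))
	fmt.Println(t.delta(1, connCounters{300, 80}, true))
	fmt.Println("tracked connections:", len(t.last))
}
```

Deleting on the close report ties the map's size to the number of open connections; a periodic sweep of stale entries would still be needed if close events can be lost.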
__u64 now = bpf_ktime_get_ns();
// Check if connection duration exceeds threshold
if ((now - conn->start_ns) > LONG_CONN_THRESHOLD_TIME) {
This is staged reporting.
So what happens if the connection is closed before it reaches the threshold?
bpf/kmesh/workload/tc.c
Outdated
BPF_LOG(ERR, TIMER, "Failed to lookup tcp timer\n");
} else {
// Initialize and start timer
bpf_timer_init(timer, &tcp_conn_flush_timer, 1);
These APIs are only available since kernel 5.15, but Kmesh has supported 5.10 from the beginning.
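If the timer-based path were kept, the loader would need to gate it on the running kernel. A minimal Go sketch of such a gate follows; it parses a release string such as the one `uname -r` prints. The helpers `parseKernelRelease` and `supportsBPFTimer` are hypothetical, and a plain version comparison ignores distribution backports, so treat it as an approximation.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseKernelRelease extracts major and minor from a release string such as
// "5.10.0-60.18.0.50.oe2203.x86_64" or "5.15-generic".
func parseKernelRelease(release string) (major, minor int, err error) {
	parts := strings.SplitN(release, ".", 3)
	if len(parts) < 2 {
		return 0, 0, fmt.Errorf("unexpected kernel release %q", release)
	}
	if major, err = strconv.Atoi(parts[0]); err != nil {
		return 0, 0, err
	}
	// The minor field may carry a suffix; keep only its leading digits.
	digits := 0
	for digits < len(parts[1]) && parts[1][digits] >= '0' && parts[1][digits] <= '9' {
		digits++
	}
	if minor, err = strconv.Atoi(parts[1][:digits]); err != nil {
		return 0, 0, err
	}
	return major, minor, nil
}

// supportsBPFTimer reports whether the release is 5.15 or newer, the version
// in which the bpf_timer helpers were introduced upstream.
func supportsBPFTimer(release string) bool {
	major, minor, err := parseKernelRelease(release)
	if err != nil {
		return false
	}
	return major > 5 || (major == 5 && minor >= 15)
}

func main() {
	for _, r := range []string{"5.10.0-60.18.0.50.oe2203.x86_64", "5.15.0-generic", "6.1.0"} {
		fmt.Printf("%-36s bpf_timer=%v\n", r, supportsBPFTimer(r))
	}
}
```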
Do we need separate metrics for long connections?
#1249 (comment), do I have to change the Makefile so that the new C file I have added is also compiled with the correct header file?
Do you mean handling the metrics for long and short connections separately? Can you tell us your reasons?
I am not sure what you mean. Did you have any trouble compiling it?
bpf/kmesh/probes/tcp_probe.h
Outdated
}
__u64 now = bpf_ktime_get_ns();
info_vals->duration = now - info_vals->start_ns;
get_tcp_probe_info(tcp_sock, info_vals);
Nit: this is very confusing now, maybe rename to set_xxx
tcp_report(sk, tcp_sock, storage, BPF_TCP_ESTABLISHED);
update_tcp_conn_info_on_state_change(tcp_sock, storage, state);
if (state == BPF_TCP_CLOSE) {
bpf_sk_storage_delete(&map_of_sock_storage, sk);
Entries in this kind of map do not need to be deleted explicitly. @nlgwcy, please confirm.
static inline void
update_tcp_conn_info_on_state_change(struct bpf_tcp_sock *tcp_sock, struct sock_storage_data *storage, __u32 state)
{
struct tcp_probe_info *info = NULL;
new_info?
// Also trigger's on icmp packets (hence can be used for monitor packet loss)
SEC("tc")
int tc_prog(struct __sk_buff *skb)
Can we do this in the existing hook?
Metrics we currently have:
What I am proposing is connection-specific metrics (we can show them for connections whose duration exceeds 1 min), not metrics for short connections: Prometheus scrapes metrics at 5-15 s intervals, so connections that last less than that are missed by Prometheus.
Have I made my point clear? Sorry for my poor English 😂 @LiZhenCheng9527
rfac: removed callback funcs in tc.c Signed-off-by: Yash Patel <[email protected]>
dd779fd to e80fc80
rfac: bpf2go.go Signed-off-by: Yash Patel <[email protected]>
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR introduces a new feature: tcp_long_conn metrics.
Which issue(s) this PR fixes:
Fixes #1211
Special notes for your reviewer:
Does this PR introduce a user-facing change?:
Yes