Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tcp long connections metrics #1249

Draft
wants to merge 43 commits into
base: main
Choose a base branch
from

Conversation

yp969803
Copy link
Contributor

@yp969803 yp969803 commented Feb 24, 2025

What type of PR is this?

/kind feature

What this PR does / why we need it:
The pr introduces new feature of tcp_long_conn metrics
Which issue(s) this PR fixes:

Fixes #1211

Special notes for your reviewer:

Does this PR introduce a user-facing change?:
Yes

Tcp long connections metrics

@kmesh-bot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign nlgwcy for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@yp969803
Copy link
Contributor Author

yp969803 commented Feb 24, 2025

@nlgwcy @LemmyHuang @xiangxinyong can you review the pr, have i writen correct ebpf code, i have done the changes which has been told in previous comments in the proposal.

Copy link
Collaborator

@LiZhenCheng9527 LiZhenCheng9527 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Please present the results of the long connection metrics
  2. The original destination address problem is still unhandle. If you plan to optimize later, you can create a sub-issue under the existing lfx issue to track the task.

@@ -55,6 +55,7 @@ type MetricController struct {
EnableAccesslog atomic.Bool
EnableMonitoring atomic.Bool
EnableWorkloadMetric atomic.Bool
EnableLongTCPMetric atomic.Bool
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's necessary to distinguish between metrics for long connections and short connections.
As long as metrics is turned on, we should handle all types of connections.

#define kmesh_perf_map km_perf_map
#define kmesh_perf_info km_perf_info
#define map_of_long_tcp_conns km_longtcpconns_map
#define long_tcp_conns_events km_longtcpconns_events
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BPF_OBJ_NAME_LEN = 16
So I think, this name of map is too long.

Signed-off-by: Yash Patel <[email protected]>
@yp969803 yp969803 force-pushed the tcp_long_connection_code branch from b79a168 to a698cac Compare February 27, 2025 08:53
@yp969803
Copy link
Contributor Author

yp969803 commented Mar 2, 2025

@nlgwcy @LiZhenCheng9527 i have updated the accesslogs for realtime tcp conn monitoring

@yp969803
Copy link
Contributor Author

yp969803 commented Mar 2, 2025

I have added realtime monitoring for workload and service metrics (means metrics are reported periodically not after close), do we need serperate metrics for long connections ?
@nlgwcy @LiZhenCheng9527

Signed-off-by: Yash Patel <[email protected]>
@kmesh-bot kmesh-bot added size/XL and removed size/L labels Mar 2, 2025
@LiZhenCheng9527
Copy link
Collaborator

how can i see access-logs https://kmesh.net/en/docs/userguide/use_kmeshctl/ kmeshctl accesslog is not a command @hzxuzhonghu

`# Enable/Disable Kmesh's accesslog:
kmeshctl monitoring --accesslog enable/disable

@LiZhenCheng9527
Copy link
Collaborator

The BPF Compatability Test failing with error

    bpf_map.go:60: bpf init failed: bpf Load failed: load program: permission denied:
        	; int sockops_prog(struct bpf_sock_ops *skops)
        	0: (bf) r6 = r1

@nlgwcy @LiZhenCheng9527 what u think about it?

This one is probably the current ebpf program, which exceeds the number of instruction sets allowed in the kernel

@LiZhenCheng9527
Copy link
Collaborator

LiZhenCheng9527 commented Mar 3, 2025

which Makefile is compiling ebpf code, i have added a new file ebpf/kmesh/workload/tc.c, how to compile it @nlgwcy @LiZhenCheng9527

make build can compile ebpf code

@@ -407,21 +440,34 @@ func buildV4Metric(buf *bytes.Buffer) (requestMetric, error) {
data.origDstPort = connectData.OriginalPort
}

data.sentBytes = connectData.SentBytes
data.receivedBytes = connectData.ReceivedBytes
data.sentBytes = connectData.SentBytes - tcp_conns[connectData.ConnId].sentBytes
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When are you going to clean up tcp_conns?


__u64 now = bpf_ktime_get_ns();
// Check if connection duration exceeds threshold
if ((now - conn->start_ns) > LONG_CONN_THRESHOLD_TIME) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is stage reporting.
So what happens if the conn is closed at less than threshold?

BPF_LOG(ERR, TIMER, "Failed to lookup tcp timer\n");
} else {
// Initialize and start timer
bpf_timer_init(timer, &tcp_conn_flush_timer, 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These API are only available since 5.15, but kmesh support 5.10 from beginning

@yp969803
Copy link
Contributor Author

yp969803 commented Mar 3, 2025

do we need serperate metrics for long connections ?
@hzxuzhonghu @nlgwcy @LiZhenCheng9527

@yp969803
Copy link
Contributor Author

yp969803 commented Mar 3, 2025

#1249 (comment), do i have to change the Makefile so that the new c file i have added is also compiled with correct header file
@LiZhenCheng9527

@LiZhenCheng9527
Copy link
Collaborator

do we need serperate metrics for long connections ? @hzxuzhonghu @nlgwcy @LiZhenCheng9527

Do you mean handle the metrics for long and short connections separately? Can you tell us your reasons?

@LiZhenCheng9527
Copy link
Collaborator

#1249 (comment), do i have to change the Makefile so that the new c file i have added is also compiled with correct header file @LiZhenCheng9527

I am not sure what's your means. Did you have any trouble compiling it?

}
__u64 now = bpf_ktime_get_ns();
info_vals->duration = now - info_vals->start_ns;
get_tcp_probe_info(tcp_sock, info_vals);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: this is very confusing now, maybe rename to set_xxx

tcp_report(sk, tcp_sock, storage, BPF_TCP_ESTABLISHED);
update_tcp_conn_info_on_state_change(tcp_sock, storage, state);
if (state == BPF_TCP_CLOSE) {
bpf_sk_storage_delete(&map_of_sock_storage, sk);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This kind of map does not need to delete @nlgwcy to confirm

static inline void
update_tcp_conn_info_on_state_change(struct bpf_tcp_sock *tcp_sock, struct sock_storage_data *storage, __u32 state)
{
struct tcp_probe_info *info = NULL;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new_info?


// Also trigger's on icmp packets (hence can be used for monitor packet loss)
SEC("tc")
int tc_prog(struct __sk_buff *skb)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we do this in the existing hook

@yp969803
Copy link
Contributor Author

yp969803 commented Mar 3, 2025

Metrics we currently have:

Workload Metrics:  
kmesh_tcp_workload_received_bytes	The size of the  number of bytes received in to a workload 
kmesh_tcp_workload_sent_bytes	The size of the number of bytes sent  to a workload 
etc..
these metrics are workload specific

same way we have serviceMetrics , which are service specififc

What i am proposing was for connection specific metrics (we can show it for the conn whose duration exceeded 1 min), not for short conns because prometheus scrape metrics in 5-15 sec interval conns having less duration is missed by prometheus
ex:

kmesh_long_tcp_conn_received_bytes   The size of recieved bytes by a specific conn,

etc...

these metrics contains label including conn_id which helps us to differentiate conns having src ip, port and destination ip port

Have i made my point clear, sorry for poor english 😂 @LiZhenCheng9527

Signed-off-by: Yash Patel <[email protected]>

rfac: removed callback funcs in tc.c

Signed-off-by: Yash Patel <[email protected]>
@yp969803 yp969803 force-pushed the tcp_long_connection_code branch from dd779fd to e80fc80 Compare March 12, 2025 04:10
Signed-off-by: Yash Patel <[email protected]>
Signed-off-by: Yash Patel <[email protected]>

rfac: bpf2go.go

Signed-off-by: Yash Patel <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[lfx-mentorship-2025-Mar-May] Metrics for TCP Long Connection
4 participants