Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Agentx thread #16986

Draft
wants to merge 2 commits into
base: master
Choose a base branch
from
Draft

Conversation

fdumontet6WIND
Copy link
Contributor

  • Introduce a dedicated AgentX thread using frr_pthread.

  • Add mutex locks (ax_mtx, ax_io_mtx) to manage thread
    synchronization for trap transfer and I/O operations.

  • Implemented ring buffers (ibuf_ax) for handling
    "master -> AgentX" communication, improving data handling
    between threads.

  • Update the SNMP read operations to use mutex locks to
    ensure thread-safe execution.

  • Integrated a new dedicated thread to send SNMP traps,
    ensuring separation of responsibilities between the main
    and AgentX threads.

  • Enhanced trap handling to support multi-index traps, with
    excess traps being discarded if the buffer is full,
    preventing overflow.

  • Enhanced trap handling to support multi-index traps.
    When more than "RINGBUF_NB_TRAP" traps are pending for
    transmission, subsequent traps are discarded to prevent
    overflow.

This update significantly improves concurrency,
synchronization, and trap management within the AgentX
module, with added protection against socket's buffer overflow
from excessive traps. The socket's buffer overflow is leading
to process deadlock.

Copy link
Contributor

@vjardin vjardin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, avoid pulling some dependencies specific for bgpd.

lib/agentx.c Show resolved Hide resolved
lib/agentx.c Outdated
@@ -23,25 +23,62 @@
#include "libfrr.h"
#include "xref.h"
#include "lib/libagentx.h"
#include "ringbuf.h" /* for ringbuf_remain, ringbuf_peek, ringbuf_.. */
#include "frr_pthread.h" /* for struct frr_pthread */
#include "bgpd/bgpd.h" /* for bgp_pth_ax */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why a a specific dependency with bgpd ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

taken into account

@fdumontet6WIND fdumontet6WIND force-pushed the agentx_thread branch 2 times, most recently from fddcf30 to f6efd02 Compare October 2, 2024 19:53
lib/agentx.c Outdated Show resolved Hide resolved
@ton31337
Copy link
Member

ton31337 commented Oct 3, 2024

Also, could you fix Verify Source, and frrbot issues?

@fdumontet6WIND fdumontet6WIND force-pushed the agentx_thread branch 2 times, most recently from 7557c63 to 8535a1b Compare October 3, 2024 12:45
Copy link
Contributor

@mjstapp mjstapp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm having a hard time following what problem is being solved here.

It sounds like there can be a problem if there are many traps to be sent - that the existing code is not detecting that, is not discarding events when it begins to overrun a receiver, and/or is not handling EWOULDBLOCK. Is that correct? wouldn't it be simpler to make the code detect and handle those conditions directly -rather than injecting a new pthread and then teaching it to handle the error conditions?

Is there any other problem present? I'm a bit terrified by this line from the description:

"Update the SNMP read operations to use mutex locks to ensure thread-safe execution."

so ... it is not possible to make reads of daemon data "thread-safe" - right? only the main pthread can access daemon internals. what change are you proposing to make there?

lib/agentx.c Outdated
DEFINE_HOOK(agentx_enabled, (), ());

//bool agentx_enabled = false;

static struct event_loop *agentx_tm;
static struct event_loop *main_tm;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we name this something better? and at least put some documentation into why agentx code has too event_loop data structure pointers. Let's help our future selves out here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

taken ento account

@donaldsharp
Copy link
Member

lib/agentx.c:438: error: log message contains tab or newline: "%s not enougth space in ibuf_ax needed : %lu free :\t %lu" (in smux_trap_multi_index())

needs to be fixed

@fdumontet6WIND
Copy link
Contributor Author

tomjstapp
The problem is that, in certain high-load situations, AgentX blocks the entire bgpd process.

We can indeed address some cases by implementing flow control on the traps. Unfortunately, other cases related to the internal mechanisms of AgentX SNMP emerge. It's to provide a global solution, rather than a series of incomplete fixes, that we propose handling the sending of traps in a specific pthread.

@fdumontet6WIND fdumontet6WIND force-pushed the agentx_thread branch 2 times, most recently from 97b75e2 to f03720f Compare October 3, 2024 15:06
@mjstapp
Copy link
Contributor

mjstapp commented Oct 3, 2024

So I'd like to understand that situation better. If the new pthread is wedged somewhere, and a daemon's main pthread is shutting down and waiting for it to stop, how will that work?

tomjstapp The problem is that, in certain high-load situations, AgentX blocks the entire bgpd process.

We can indeed address some cases by implementing flow control on the traps. Unfortunately, other cases related to the internal mechanisms of AgentX SNMP emerge. It's to provide a global solution, rather than a series of incomplete fixes, that we propose handling the sending of traps in a specific pthread.

@fdumontet6WIND
Copy link
Contributor Author

ci:rerun

lib/agentx.c Outdated
/* mutex dedicated to trap transfert between threads */
pthread_mutex_t ax_mtx;
/* mutex dedicated to send/read exclusion */
pthread_mutex_t ax_io_mtx;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't they be static ? both ax_mtx and ax_io_mtx.

Copy link
Contributor

@vjardin vjardin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just a minor update (static) that would be welcomed.

@vjardin
Copy link
Contributor

vjardin commented Oct 7, 2024

Thanks for the static that are being added. ;)

@fdumontet6WIND
Copy link
Contributor Author

sorry for not having notified the change.

@fdumontet6WIND fdumontet6WIND force-pushed the agentx_thread branch 2 times, most recently from 6519edd to 48685e3 Compare October 8, 2024 17:13
@vjardin
Copy link
Contributor

vjardin commented Oct 8, 2024

Please @fdumontet6WIND , do you have some data on the improvements you are getting ?

LGTM

@fdumontet6WIND
Copy link
Contributor Author

added some comments following Donald sharp advice

@vjardin
Copy link
Contributor

vjardin commented Oct 10, 2024

why closed ?

@fdumontet6WIND
Copy link
Contributor Author

need some improvments

@fdumontet6WIND fdumontet6WIND force-pushed the agentx_thread branch 2 times, most recently from 7f94aa3 to 256fc73 Compare October 11, 2024 15:15
@fdumontet6WIND
Copy link
Contributor Author

ci:rerun

@fdumontet6WIND fdumontet6WIND force-pushed the agentx_thread branch 3 times, most recently from 845a60d to 1780159 Compare October 14, 2024 14:26
@fdumontet6WIND
Copy link
Contributor Author

ci:rerun

@@ -83,7 +88,10 @@ static void agentx_timeout(struct event *t)
snmp_timeout();
run_alarms();
} else {
zlog_err("%s mutex already locked", __func__);
if (nb_locktry_timeout_fail % 100 == 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you bundle this fixup with e464755? Also, what should an operator do if I see such an error?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

taken into account.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if an operator see a great number of such an error, restarting processes using agentx is good option

@fdumontet6WIND
Copy link
Contributor Author

ci:rerun

@eqvinox
Copy link
Contributor

eqvinox commented Oct 16, 2024

tomjstapp The problem is that, in certain high-load situations, AgentX blocks the entire bgpd process.

We can indeed address some cases by implementing flow control on the traps. Unfortunately, other cases related to the internal mechanisms of AgentX SNMP emerge. It's to provide a global solution, rather than a series of incomplete fixes, that we propose handling the sending of traps in a specific pthread.

AgentX is blocking the process in write/send for traps, if I understand correctly?

If that's the problem, I'd rather have a buffer/queue without threads. Use event_add_write and make a call when the fd is ready to be written… in the same thread.

@fdumontet6WIND
Copy link
Contributor Author

Hi David,
fd may be ready to be written, but has not enough buffer free memory for sending all what we want.
the send will be blocked.
On another hand there is many reason for blocking , for instance when a timeout occurs.

add "struct event *" parameter to agentx_event_update function.
the function may be called via event_add*** functions.

Signed-off-by: Francois Dumontet <[email protected]>
- Introduce a dedicated AgentX thread using `frr_pthread`.

- Add mutex locks (`ax_mtx`, `ax_io_mtx`) to manage thread
synchronization for trap transfer and I/O operations.

- Implemented ring buffers (`ibuf_ax`) for handling
"master -> AgentX" communication, improving data handling
between threads.

- Update the SNMP read operations to use mutex locks to
ensure thread-safe execution.

- Integrated a new dedicated thread to send SNMP traps,
ensuring separation of responsibilities between the main
and AgentX threads.

- Enhanced trap handling to support multi-index traps, with
excess traps being discarded if the buffer is full,
preventing overflow.

- Enhanced trap handling to support multi-index traps.
When more than "RINGBUF_NB_TRAP" traps are pending for
transmission, subsequent traps are discarded to prevent
overflow.

This update significantly improves concurrency,
synchronization, and trap management within the AgentX
module, with added protection against socket's buffer overflow
from excessive traps. The socket's buffer overflow is leading
to process deadlock.

Signed-off-by: Francois Dumontet <[email protected]>
@fdumontet6WIND
Copy link
Contributor Author

ci:rerun

1 similar comment
@fdumontet6WIND
Copy link
Contributor Author

ci:rerun

@fdumontet6WIND fdumontet6WIND marked this pull request as draft November 5, 2024 14:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants