From e663b9d7942c003072f7d421be61ccd8fad6afca Mon Sep 17 00:00:00 2001 From: Erik Boasson Date: Tue, 21 Nov 2023 15:34:56 +0100 Subject: [PATCH 1/2] Fix timed-event lockup on pre-emptive ACKNACKs There was a case in which the ACKNACK generation event generating pre-emptive ACKNACKs could continually get rescheduled for the same time, causing the timed-event mechanism to spin generating those pre-emptive ACKNACKS but without otherwise making progress. When certain conditions change externally, it continues normal operation: * reception of a heartbeat from the writer to which it is sending pre-emptive ACKNACKs (by virtue of switching to regular ACKNACKs); * deletion of that proxy writer corresponding to that writer; * deletion of the reader; The issue is that discovery matches the two and creates a struct ddsi_rd_pwr_match, it sets the time stamp of creation. This time may lay in the future of the time at which the ACKNACK event is initially scheduled, because the creation time is taken from the clock directly, whereas the initial schedule time is taken from the approximate time stamp passed into ddsi_proxy_writer_add_connection. So on executing the ACKNACK event handler, the current time passed into the handler may lay before the creation time. The "time of last ack" (t_last_ack) gets set to this time and the event gets rescheduled for t_last_ack + new_intv, where new_intv = 0 if it precedes the creation time. The timed event handler caches the current time, updating it only when it is known more than a trivial amount of time may have passed. Rightly or wrongly, executing a timed-event doesn't cause the clock to be read again, and so the first event in the queue is the event it just executed, and it immediately executes it again, in exactly the same state. Among the consequences are the obvious one of not processing any events scheduled for a later time, but also not sending out any retransmit queued by the receive path as well as discovery messages queued for transmission. There are multiple contributing factors here: * It is daft to set the creation time to a later time than the initial schedule. * It is, arguably, unwise to cache the time so aggressively. This originates in the cost of reading the clock in older systems. This is presumably still the case in various embedded systems, so at least there is a decent argument for it. But the root cause in my view is the fact that it can reschedule the event for the same time. All timestamps are in [0, INT64_MAX ns), so deleting the special case of a current time that precedes the creation time does not introduce signed integer overflow. For times in [0,1s) it can potentially delay the transmission of the initial pre-emptive ACKNACK by 1s. It is exceedingly unlikely that it will ever need to send one in that interval, and even more unlikely that it affects anything, because the writer will be sending heartbeats as well. Signed-off-by: Erik Boasson --- src/core/ddsi/src/ddsi_acknack.c | 21 ++++++++------------- 1 file changed, 8 insertions(+), 13 deletions(-) diff --git a/src/core/ddsi/src/ddsi_acknack.c b/src/core/ddsi/src/ddsi_acknack.c index 8de59e038c..1b0a4c8dce 100644 --- a/src/core/ddsi/src/ddsi_acknack.c +++ b/src/core/ddsi/src/ddsi_acknack.c @@ -505,20 +505,15 @@ static struct ddsi_xmsg *make_and_resched_acknack (struct ddsi_xevent *ev, struc static dds_duration_t preemptive_acknack_interval (const struct ddsi_pwr_rd_match *rwn) { - if (rwn->t_last_ack.v < rwn->tcreate.v) - return 0; + const dds_duration_t age = rwn->t_last_ack.v - rwn->tcreate.v; + if (age <= DDS_SECS (10)) + return DDS_SECS (1); + else if (age <= DDS_SECS (60)) + return DDS_SECS (2); + else if (age <= DDS_SECS (120)) + return DDS_SECS (5); else - { - const dds_duration_t age = rwn->t_last_ack.v - rwn->tcreate.v; - if (age <= DDS_SECS (10)) - return DDS_SECS (1); - else if (age <= DDS_SECS (60)) - return DDS_SECS (2); - else if (age <= DDS_SECS (120)) - return DDS_SECS (5); - else - return DDS_SECS (10); - } + return DDS_SECS (10); } static struct ddsi_xmsg *make_preemptive_acknack (struct ddsi_xevent *ev, struct ddsi_proxy_writer *pwr, struct ddsi_pwr_rd_match *rwn, ddsrt_mtime_t tnow) From ef03042d2450e15e473b23c6c097b64891282c8c Mon Sep 17 00:00:00 2001 From: Erik Boasson Date: Thu, 23 Nov 2023 11:00:46 +0100 Subject: [PATCH 2/2] Set match creation time to cached time This avoids the strange situation where ACKNACK events can be scheduled at times earlier than the purported creation time of the match object. That strange situation is harmless if progress is guaranteed in the scheduling of events, which is addressed in a separate commit. Signed-off-by: Erik Boasson --- src/core/ddsi/src/ddsi_endpoint_match.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/core/ddsi/src/ddsi_endpoint_match.c b/src/core/ddsi/src/ddsi_endpoint_match.c index bbcf695c6d..f40a8ba84a 100644 --- a/src/core/ddsi/src/ddsi_endpoint_match.c +++ b/src/core/ddsi/src/ddsi_endpoint_match.c @@ -1026,7 +1026,7 @@ void ddsi_proxy_writer_add_connection (struct ddsi_proxy_writer *pwr, struct dds ELOGDISC (pwr, " ddsi_proxy_writer_add_connection(pwr "PGUIDFMT" rd "PGUIDFMT")", PGUID (pwr->e.guid), PGUID (rd->e.guid)); m->rd_guid = rd->e.guid; - m->tcreate = ddsrt_time_monotonic (); + m->tcreate = tnow; /* We track the last heartbeat count value per reader--proxy-writer pair, so that we can correctly handle directed heartbeats. The