Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OF-2966: Fix race condition in BOSH (noticable during logout) #2681

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

guusdk
Copy link
Member

@guusdk guusdk commented Feb 17, 2025

The first three commits in this PR build towards a fix for a race condition, in which terminated BOSH clients can leave Openfire in an inconsistent state (and, in certain circumstances, lead to memory starvation). See https://igniterealtime.atlassian.net/browse/OF-2966 for details.

BOSH related events (mostly: processing of received stanzas) are executed by a worker pool of threads.

Events for each individual session must be processed in the same order in which they are received. Prior to this change, every event was potentially executed by a different worker thread. This can cause some events to be processed (or finish processing) before an earlier event finishes execution in another worker thread.

This commit adds FIFO-based queuing of events, that ensures that events that belong to the same session are processed:
- by only one thread from the worker pool;
- in the same order in which they're received.

The technique applied here is inspired (well, just bluntly copied) from Ben Manes' work in https://github.com/ben-manes/caffeine/blob/master/jcache/src/main/java/com/github/benmanes/caffeine/jcache/event/EventDispatcher.java
In the previous commit, a mechanism was introduced that ensures per-session FIFO processing of events, using a queuing mechanism based on Futures.

This commit adds a periodic check to see if any events are queued for sessions that no longer exist, discarding those events if they're found.

The cleanup is intended as a safeguard measure against slow accumulation of non-executable events, which over time could lead to memory starvation.
@guusdk guusdk added the backport 4.9 on merge, GHA will generate a PR with these changes against 4.9 branch label Feb 17, 2025
@guusdk guusdk requested a review from Fishbowler February 17, 2025 11:01
By having the 'close' routine be executed by the same worker threads that process regular BOSH data, the processing order of all of these will make use of the FIFO mechanism introduced in the last two commits.

This helps ensure that a 'close' (terminate) event is processed only after stanzas that were received earlier have been processed - assuming that this processing is synchronous.
Removed redundant 'if is-log-level-is-enabled' checks.
Used parameterized logging statements when composing a log message.
@guusdk guusdk force-pushed the OF-2966_BOSH-logout-race branch from 2686f99 to 0952fc3 Compare February 17, 2025 11:15
Copy link
Member

@Fishbowler Fishbowler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of this makes sense to me, and all looks like it should work.
How can we prove it though?
Looks like you've taken care of performance in terms of memory usage - any concerns about speed?

@guusdk
Copy link
Member Author

guusdk commented Feb 18, 2025

The JIRA ticket lists a manual modification to Openfire that I used to reproduce a problem that's fixed by this PR. I can reproduce the issue 100% without this change, and not at all with this change.

I don't have a specific concern about speed. This surely adds some overhead, but I expect that to be negligible.

@Fishbowler
Copy link
Member

👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 4.9 on merge, GHA will generate a PR with these changes against 4.9 branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants