Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix a race condition in mk_list_add/del when connections are created/deleted #8013

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

obs-gh-arvindsrinivasan
Copy link

Fix a race condition in mk_list_add/del when connections are created/deleted.
This change wraps the connection creation and deletion in a mutex. Tested this on Windows Server 2022.

Fixes #7963

Testing
Before we can approve your change; please submit the following in a comment:
[2023/10/06 14:38:57] [trace] [io coro=000001BE423CB870] [net_write] ret=298 total=298/298
[2023/10/06 14:38:57] [trace] [io coro=000001BE423CB870] [net_write] trying 1818 bytes
[2023/10/06 14:38:57] [trace] [io coro=000001BE423CB870] [net_write] ret=1818 total=1818/1818
[2023/10/06 14:38:57] [trace] [io coro=000001BE423CB870] [net_read] try up to 4095 bytes
[2023/10/06 14:38:57] [trace] [engine] resuming coroutine=000001BE423CB870
[2023/10/06 14:38:57] [trace] [io coro=000001BE423CB870] [net_read] ret=390
[2023/10/06 14:38:57] [ info] [output:http:http.0] 179969258044.collect.observe-staging.com:443, HTTP status=200
{"ok":true}
[2023/10/06 14:38:57] [debug] [upstream] KA connection #1960 to 179969258044.collect.observe-staging.com:443 is now available
[2023/10/06 14:38:57] [debug] [out flush] cb_destroy coro_id=3700
[2023/10/06 14:38:57] [trace] [coro] destroy coroutine=000001BE423CB870 data=000001BE423CB888
[2023/10/06 14:39:02] [trace] [task 000001BE42399060] created (id=0)
[2023/10/06 14:39:02] [debug] [task] created task=000001BE42399060 id=0 OK
[2023/10/06 14:39:02] [trace] [upstream] get new connection for 179969258044.collect.observe-staging.com:443, net setup:
net.connect_timeout = 10 seconds
net.source_address = any
net.keepalive = enabled
net.keepalive_idle_timeout = 15 seconds
net.max_worker_connections = 0
[2023/10/06 14:39:02] [debug] [output:http:http.0] task_id=0 assigned to thread #0
[2023/10/06 14:39:02] [trace] [net] connection #1888 in process to 179969258044.collect.observe-staging.com:443
[2023/10/06 14:39:02] [trace] [engine] resuming coroutine=000001BE423CB930
[2023/10/06 14:39:02] [trace] [engine] resuming coroutine=000001BE423CB930
[2023/10/06 14:39:02] [trace] [tls] connection and handshake OK
[2023/10/06 14:39:02] [trace] [io] connection OK
[2023/10/06 14:39:02] [debug] [upstream] KA connection #1888 to 179969258044.collect.observe-staging.com:443 is connected
[2023/10/06 14:39:02] [debug] [http_client] not using http_proxy for header
[2023/10/06 14:39:02] [trace] [io coro=000001BE423CB930] [net_write] trying 298 bytes
[2023/10/06 14:39:02] [trace] [io coro=000001BE423CB930] [net_write] ret=298 total=298/298
[2023/10/06 14:39:02] [trace] [io coro=000001BE423CB930] [net_write] trying 2357 bytes
[2023/10/06 14:39:02] [trace] [io coro=000001BE423CB930] [net_write] ret=2357 total=2357/2357
[2023/10/06 14:39:02] [trace] [io coro=000001BE423CB930] [net_read] try up to 4095 bytes
[2023/10/06 14:39:02] [trace] [engine] resuming coroutine=000001BE423CB930
[2023/10/06 14:39:02] [trace] [engine] resuming coroutine=000001BE423CB930
[2023/10/06 14:39:02] [trace] [engine] [task event] task_id=0 out_id=0 return=OK
[2023/10/06 14:39:02] [trace] [io coro=000001BE423CB930] [net_read] ret=390
[2023/10/06 14:39:02] [debug] [task] destroy task=000001BE42399060 (task_id=0)
[2023/10/06 14:39:02] [ info] [output:http:http.0] 179969258044.collect.observe-staging.com:443, HTTP status=200
{"ok":true}
[2023/10/06 14:39:02] [debug] [upstream] KA connection #1888 to 179969258044.collect.observe-staging.com:443 is now available
[2023/10/06 14:39:02] [debug] [out flush] cb_destroy coro_id=3701
[2023/10/06 14:39:02] [trace] [coro] destroy coroutine=000001BE423CB930 data=000001BE423CB948
[2023/10/06 14:39:05] [engine] caught signal (SIGINT)
[2023/10/06 14:39:05] [trace] [engine] flush enqueued data
[2023/10/06 14:39:05] [ warn] [engine] service will shutdown in max 5 seconds

  • Attached Valgrind output that shows no leaks or memory corruption was found

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

  • Run local packaging test showing all targets (including any new ones) build.
  • Set ok-package-test label to test for all targets (requires maintainer to do).

Documentation

N/A

Backporting

  • Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

@NathanNam
Copy link

Any updates on this pr?

@edsiper
Copy link
Member

edsiper commented Oct 17, 2023

@obs-gh-arvindsrinivasan since this is a dependency, we will need the PR to be send to the upstream project monkey/monkey .

@leonardo-albertovich in the meanwhile, would you please validate this contribution ?

@obs-gh-arvindsrinivasan
Copy link
Author

I found another similar bug unfortunately in the listen loop. Will wait for the review of this one and will submit another fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Another race condition in fluent-bit on Windows
4 participants