WIP: Prevent race condition around strerror which is not thread safe #3

ushachar · 2020-04-01T12:07:23Z

Allocate an internal error array at start time and reference that when trying to
print out a system error.
The alternative of using strerror_r / _l is messy since it behaves differently
between *NIX / GNU implementations AND requires dynamic memory allocation at
(resulting in a much bigger change)

update from master

Allocate an internal error array at start time and reference that when trying to print out a system error. The alternative of using strerror_r / _l is messy since it behaves differently between *NIX / GNU implementations AND requires dynamic memory allocation at (resulting in a much bigger change)

oranagra

are we sure that are no systems or some extensions that somehow extend the errors beyond 256 errors?

oranagra · 2020-04-01T12:25:42Z

src/redis_err.c

+
+const char *redisError(unsigned int err) {
+    if (err > MAX_ERR) {
+        perror("Requested out of bounds error string");


i don't think you should print anything. no one usually monitors stderr, and not sure there's a point in printing anyway since the returned string probably goes to the log.
what can be helpful is stack trace, in case the one that prints to the log doesn't have a distinct message.

oranagra · 2020-04-01T12:26:33Z

src/redis_err.c

+}
+
+
+void freeErrorStrings() {


redis doesn't bother to free memory.. you can keep this method, but don't bother to call it.

oranagra · 2020-04-01T12:27:53Z

src/server.c

@@ -5073,6 +5075,7 @@ int main(int argc, char **argv) {
    aeSetAfterSleepProc(server.el,afterSleep);
    aeMain(server.el);
    aeDeleteEventLoop(server.el);
+    freeErrorStrings();


this is actually dead code.. the normal place where redis exists, it with a call to exit after calling prepareForShutdown. but anyway, redis doesn't bother to release memory or shutdown.. that's a waste of time. i suggest to remove this line.

oranagra · 2020-04-01T12:29:30Z

src/server.c

@@ -4902,6 +4903,7 @@ int main(int argc, char **argv) {
    setlocale(LC_COLLATE,"");
    tzset(); /* Populates 'timezone' global. */
    zmalloc_set_oom_handler(redisOutOfMemoryHandler);
+    initErrorStrings();


the unit test code above may some day use this mechanism too (by calling a method that calls this). unlikely, but maybe you should consider moving this to the top.

Now both master and replicas keep track of the last replication offset that contains meaningful data (ignoring the tailing pings), and both trim that tail from the replication backlog, and the offset with which they try to use for psync. the implication is that if someone missed some pings, or even have excessive pings that the promoted replica has, it'll still be able to psync (avoid full sync). the downside (which was already committed) is that replicas running old code may fail to psync, since the promoted replica trims pings form it's backlog. This commit adds a test that reproduces several cases of promotions and demotions with stale and non-stale pings Background: The mearningful offset on the master was added recently to solve a problem were the master is left all alone, injecting PINGs into it's backlog when no one is listening and then gets demoted and tries to replicate from a replica that didn't have any of the PINGs (or at least not the last ones). however, consider this case: master A has two replicas (B and C) replicating directly from it. there's no traffic at all, and also no network issues, just many pings in the tail of the backlog. now B gets promoted, A becomes a replica of B, and C remains a replica of A. when A gets demoted, it trims the pings from its backlog, and successfully replicate from B. however, C is still aware of these PINGs, when it'll disconnect and re-connect to A, it'll ask for something that's not in the backlog anymore (since A trimmed the tail of it's backlog), and be forced to do a full sync (something it didn't have to do before the meaningful offset fix). Besides that, the psync2 test was always failing randomly here and there, it turns out the reason were PINGs. Investigating it shows the following scenario: cycle 1: redis #1 is master, and all the rest are direct replicas of #1 cycle 2: redis #2 is promoted to master, #1 is a replica of #2 and #3 is replica of #1 now we see that when #1 is demoted it prints: 17339:S 21 Apr 2020 11:16:38.523 * Using the meaningful offset 3929963 instead of 3929977 to exclude the final PINGs (14 bytes difference) 17339:S 21 Apr 2020 11:16:39.391 * Trying a partial resynchronization (request e2b3f8817735fdfe5fa4626766daa938b61419e5:3929964). 17339:S 21 Apr 2020 11:16:39.392 * Successful partial resynchronization with master. and when #3 connects to the demoted #2, #2 says: 17339:S 21 Apr 2020 11:16:40.084 * Partial resynchronization not accepted: Requested offset for secondary ID was 3929978, but I can reply up to 3929964 so the issue here is that the meaningful offset feature saved the day for the demoted master (since it needs to sync from a replica that didn't get the last ping), but it didn't help one of the other replicas which did get the last ping.

Merge pull request #2 from antirez/unstable

f9018b9

update from master

ushachar force-pushed the uri_fix_strerror branch from b7898f9 to 42f51b7 Compare April 1, 2020 12:09

ushachar force-pushed the uri_fix_strerror branch from 42f51b7 to 0fe6267 Compare April 1, 2020 12:12

oranagra approved these changes Apr 1, 2020

View reviewed changes

ushachar mentioned this pull request Jul 19, 2020

WIP: Prevent race condition around strerror which is not thread safe #4

Open

ushachar force-pushed the unstable branch from f9018b9 to eabe3ea Compare September 19, 2020 12:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WIP: Prevent race condition around strerror which is not thread safe #3

WIP: Prevent race condition around strerror which is not thread safe #3

ushachar commented Apr 1, 2020

oranagra left a comment

oranagra Apr 1, 2020

oranagra Apr 1, 2020

oranagra Apr 1, 2020

oranagra Apr 1, 2020

		}


		void freeErrorStrings() {

WIP: Prevent race condition around strerror which is not thread safe #3

Are you sure you want to change the base?

WIP: Prevent race condition around strerror which is not thread safe #3

Conversation

ushachar commented Apr 1, 2020

oranagra left a comment

Choose a reason for hiding this comment

oranagra Apr 1, 2020

Choose a reason for hiding this comment

oranagra Apr 1, 2020

Choose a reason for hiding this comment

oranagra Apr 1, 2020

Choose a reason for hiding this comment

oranagra Apr 1, 2020

Choose a reason for hiding this comment