
WIP: Prevent race condition around strerror, which is not thread safe #3

Open
wants to merge 2 commits into base: unstable

Conversation

@ushachar ushachar (Owner) commented Apr 1, 2020

Allocate an internal error array at start time and reference that when trying to
print out a system error.
The alternative of using strerror_r / strerror_l is messy, since it behaves differently
between *NIX / GNU implementations AND requires dynamic memory allocation
(resulting in a much bigger change)

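For context, a minimal self-contained sketch of the approach the PR describes: copy every strerror() string once at startup, before any threads exist, then serve all later lookups from a private table. redisError, initErrorStrings, and freeErrorStrings are the names used in the diff below; the table name, the MAX_ERR value, and the fallback string are illustrative assumptions, not the PR's exact code.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

#define MAX_ERR 255   /* assumes errno values stay below 256, as the PR does */

static char *error_strings[MAX_ERR + 1];

/* Run once from main() before any threads are spawned, so the
 * non-thread-safe strerror() is never called concurrently. */
void initErrorStrings(void) {
    for (int i = 0; i <= MAX_ERR; i++)
        error_strings[i] = strdup(strerror(i));
}

/* Thread-safe replacement for strerror(): a plain array lookup. */
const char *redisError(unsigned int err) {
    if (err > MAX_ERR) {
        perror("Requested out of bounds error string");
        return "Unknown error";
    }
    return error_strings[err];
}

void freeErrorStrings(void) {
    for (int i = 0; i <= MAX_ERR; i++)
        free(error_strings[i]);
}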

@oranagra oranagra left a comment


are we sure that there are no systems or some extensions that somehow extend the errors beyond 256 errors?


const char *redisError(unsigned int err) {
    if (err > MAX_ERR) {
        perror("Requested out of bounds error string");

i don't think you should print anything. no one usually monitors stderr, and there's not much point in printing anyway since the returned string probably goes to the log.
what can be helpful is a stack trace, in case the caller that prints to the log doesn't have a distinct message.

}
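A sketch of what that suggestion might look like, revising the earlier sketch (the exact fallback message is an assumption):

/* Per the review: no printing; the distinct return value makes the
 * caller's log message self-describing. */
const char *redisError(unsigned int err) {
    if (err > MAX_ERR)
        return "(requested out of bounds error string)";
    return error_strings[err];
}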


void freeErrorStrings() {

redis doesn't bother to free memory.. you can keep this function, but don't bother to call it.

@@ -5073,6 +5075,7 @@ int main(int argc, char **argv) {
     aeSetAfterSleepProc(server.el,afterSleep);
     aeMain(server.el);
     aeDeleteEventLoop(server.el);
+    freeErrorStrings();

this is actually dead code.. the normal place where redis exits is with a call to exit() after calling prepareForShutdown. but anyway, redis doesn't bother to release memory on shutdown.. that's a waste of time. i suggest removing this line.

@@ -4902,6 +4903,7 @@ int main(int argc, char **argv) {
     setlocale(LC_COLLATE,"");
     tzset(); /* Populates 'timezone' global. */
     zmalloc_set_oom_handler(redisOutOfMemoryHandler);
+    initErrorStrings();

the unit test code above may some day use this mechanism too (by calling a method that calls this). unlikely, but maybe you should consider moving this to the top.

ushachar pushed a commit that referenced this pull request Sep 19, 2020
Now both master and replicas keep track of the last replication offset
that contains meaningful data (ignoring the trailing pings), and both
trim that tail from the replication backlog, and from the offset they
use for psync.

the implication is that if someone missed some pings, or even has
excessive pings that the promoted replica doesn't have, it'll still be
able to psync (avoid full sync).

the downside (which was already committed) is that replicas running old
code may fail to psync, since the promoted replica trims pings from its
backlog.

This commit adds a test that reproduces several cases of promotions and
demotions with stale and non-stale pings.

Background:
The meaningful offset on the master was added recently to solve a problem where
the master is left all alone, injecting PINGs into its backlog when no one is
listening, and then gets demoted and tries to replicate from a replica that didn't
have any of the PINGs (or at least not the last ones).

however, consider this case:
master A has two replicas (B and C) replicating directly from it.
there's no traffic at all, and also no network issues, just many pings in the
tail of the backlog. now B gets promoted, A becomes a replica of B, and C
remains a replica of A. when A gets demoted, it trims the pings from its
backlog and successfully replicates from B. however, C is still aware of
these PINGs; when it disconnects and re-connects to A, it'll ask for something
that's not in the backlog anymore (since A trimmed the tail of its backlog),
and be forced to do a full sync (something it didn't have to do before the
meaningful offset fix).

Besides that, the psync2 test was always failing randomly here and there; it
turns out the reason was PINGs. Investigating it shows the following scenario:

cycle 1: redis #1 is master, and all the rest are direct replicas of #1
cycle 2: redis #2 is promoted to master, #1 is a replica of #2 and #3 is a replica of #1
now we see that when #1 is demoted it prints:
17339:S 21 Apr 2020 11:16:38.523 * Using the meaningful offset 3929963 instead of 3929977 to exclude the final PINGs (14 bytes difference)
17339:S 21 Apr 2020 11:16:39.391 * Trying a partial resynchronization (request e2b3f8817735fdfe5fa4626766daa938b61419e5:3929964).
17339:S 21 Apr 2020 11:16:39.392 * Successful partial resynchronization with master.
and when #3 connects to the demoted #2, #2 says:
17339:S 21 Apr 2020 11:16:40.084 * Partial resynchronization not accepted: Requested offset for secondary ID was 3929978, but I can reply up to 3929964

so the issue here is that the meaningful offset feature saved the day for the
demoted master (since it needs to sync from a replica that didn't get the last
ping), but it didn't help one of the other replicas, which did get the last ping.
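To illustrate the mechanism the commit message describes (hypothetical names, not the actual redis code): each side advances a "meaningful" offset only when it writes non-PING traffic, and that offset is what gets used when trimming the backlog and requesting a psync after a demotion.

#include <stddef.h>

/* Illustrative sketch only. */
typedef struct {
    long long master_repl_offset;  /* offset of the last byte written */
    long long meaningful_offset;   /* same, but excluding trailing PINGs */
} replState;

/* Called for every chunk fed to the replication backlog. */
void feedBacklog(replState *st, size_t len, int is_ping) {
    st->master_repl_offset += (long long)len;
    if (!is_ping)
        st->meaningful_offset = st->master_repl_offset;
}

/* On demotion, psync from the meaningful offset, so trailing PINGs the
 * new master never saw (or trimmed) can't force a full sync. */
long long psyncOffset(const replState *st) {
    return st->meaningful_offset;
}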