Sentinel cluster with one node's network interface down: intermittently unable to query the master node, MasterAddr error: context deadline exceeded #3172

kwenzh · 2024-10-28T12:54:40Z

Issue tracker is used for reporting bugs and discussing new features. Please use
stackoverflow for supporting issues.

In a 3-node cluster (3 Sentinels + 3 redis-server instances) with nodes named A, B, and C, take node C's network interface offline, e.g. ifconfig eth0 down. The client then reconnects to Redis Sentinel through NewFailoverClient to look up the master address.

Expected Behavior

redis-server fails over, and the client connects to the new master successfully.

Current Behavior

Intermittently, the lookup fails with context deadline exceeded. When the client tries the Sentinel on node C, the error is returned at https://github.com/redis/go-redis/blob/master/sentinel.go#L559. Although A and B are working normally, the context has already expired by then: when the random shuffle of Sentinel addresses places the faulty node C first, the attempt against C uses up the whole context deadline, so the queries to A and B time out immediately.

Possible Solution

When looking up the master address, instead of querying each Sentinel address sequentially, consider querying them concurrently in goroutines, or use a separate context for each round of queries.
Alternatively, make the context of each iteration independent, for example by deriving a child context with its own deadline (context.WithDeadline or context.WithTimeout).

for i, sentinelAddr := range c.sentinelAddrs {
	sentinel := NewSentinelClient(c.opt.sentinelOptions(sentinelAddr))

	masterAddr, err := sentinel.GetMasterAddrByName(ctx, c.opt.MasterName).Result()
	if err != nil {
		_ = sentinel.Close()
		if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
			return "", err
		}
		internal.Logger.Printf(ctx, "sentinel: GetMasterAddrByName master=%q failed: %s",
			c.opt.MasterName, err)
		continue
	}

	// Push working sentinel to the top.
	c.sentinelAddrs[0], c.sentinelAddrs[i] = c.sentinelAddrs[i], c.sentinelAddrs[0]
	c.setSentinel(ctx, sentinel)

	addr := net.JoinHostPort(masterAddr[0], masterAddr[1])
	return addr, nil
}

Steps to Reproduce

1. Deploy a cluster with 3 Sentinels + 3 Redis servers.
2. Take the NIC of one node offline so the node is unreachable, e.g. ifconfig eth0 down.
3. Have the client connect to the Redis cluster repeatedly with NewFailoverClient.
4. Check whether the master Redis address can be obtained.
5. Intermittently, the error context deadline exceeded is returned.

Context (Environment)

CentOS 8, kernel 4.18
go-redis: v9.6.0
ctx timeout: 3s
dialTimeout: default 5s

Detailed Description

I think there are two points:

First, why is each node queried sequentially when looking up the master address? A failed node at the front of the list can then affect the healthy nodes behind it.
Second, on repeated initialization the shuffle is pseudo-random with a fixed seed of 1, so repeated runs can produce the same sequence of orderings. The failure therefore always shows up at the same point: whenever the faulty node is shuffled into first position.

kwenzh · 2024-10-29T01:18:28Z

Simulating the shuffle of the Sentinel addresses several times shows that node C lands in first position on the second iteration. Moreover, every run produces the same results, because the generator is pseudo-random with a seed of 1.

for cnt := 0; cnt < 10; cnt++ {
	arrs := []string{"A", "B", "C"}
	Shuffle(3, func(i, j int) {
		fmt.Println(">>>>>>>>", i, j)
		arrs[i], arrs[j] = arrs[j], arrs[i]
	})
	fmt.Println(">>>>>>>>", arrs)
}

output:

>>>>>>>> 2 1
>>>>>>>> 1 1     
>>>>>>>> [A C B] 
>>>>>>>> 2 1     
>>>>>>>> 1 0     
>>>>>>>> [C A B] 
>>>>>>>> 2 1     
>>>>>>>> 1 1     
>>>>>>>> [A C B] 
>>>>>>>> 2 0     
>>>>>>>> 1 0     
>>>>>>>> [B C A] 
>>>>>>>> 2 0     
>>>>>>>> 1 0     
>>>>>>>> [B C A] 
>>>>>>>> 2 1     
>>>>>>>> 1 1     
>>>>>>>> [A C B] 
>>>>>>>> 2 0     
>>>>>>>> 1 0     
>>>>>>>> [B C A] 
>>>>>>>> 2 0     
>>>>>>>> 1 0     
>>>>>>>> [B C A] 
>>>>>>>> 2 0     
>>>>>>>> 1 0                                                                    
>>>>>>>> [B C A]                                                                
>>>>>>>> 2 2                                                                    
>>>>>>>> 1 0                                                                    
>>>>>>>> [B A C]

Simulating repeated initialization of the failover client: with node C down, the second iteration of the loop fails and exits with a context timeout.


import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

func mock_sentinel() {
	for i := 0; i < 10; i++ {
		addr := []string{
			"A", "B", "C",
		}
		sent := redis.NewFailoverClient(&redis.FailoverOptions{
			SentinelAddrs: addr,
			MasterName:    "mymaster",
		})
		ctx, cancel := context.WithTimeout(context.Background(), time.Second*3)
		_, err := sent.Ping(ctx).Result()
		// Release the context right away; a defer inside the loop would
		// not run until mock_sentinel returns.
		cancel()
		if err != nil {
			panic(err)
		}
		_ = sent.Close()
		fmt.Println("connect failover client ok", i)
	}
}

kwenzh · 2025-04-17T07:30:55Z

@ndyakov

In #3334, for the failure scenario where a faulty node in the Sentinel cluster causes a context timeout, one case may have been missed. When the final select handles the err, a valid master address may already be available, but it is not checked.

select {
case <-done:
	if masterAddr != "" {
		return masterAddr, nil
	}
	return "", errors.New("redis: all sentinels specified in configuration are unreachable")
case err := <-errCh:
	if masterAddr != "" {
		return masterAddr, nil
	}
	return "", err
}

or


var (
	masterAddr string
	wg         sync.WaitGroup
	once       sync.Once
	errCh      = make(chan error, len(c.sentinelAddrs))
	done       = make(chan struct{})
)

ctx, cancel := context.WithCancel(ctx)
defer cancel()

for i, sentinelAddr := range c.sentinelAddrs {
	wg.Add(1)
	go func(i int, addr string) {
		defer wg.Done()
		select {
		case <-done:
			// Another sentinel has already answered.
			return
		default:
			sentinelCli := NewSentinelClient(c.opt.sentinelOptions(addr))
			addrVal, err := sentinelCli.GetMasterAddrByName(ctx, c.opt.MasterName).Result()
			if err != nil {
				// Close the client on every error path.
				_ = sentinelCli.Close()
				if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
					// Report immediately and return
					errCh <- err
					return
				}
				internal.Logger.Printf(ctx, "sentinel: GetMasterAddrByName addr=%s, master=%q failed: %s",
					addr, c.opt.MasterName, err)
				return
			}

			once.Do(func() {
				masterAddr = net.JoinHostPort(addrVal[0], addrVal[1])
				// Push working sentinel to the top
				c.sentinelAddrs[0], c.sentinelAddrs[i] = c.sentinelAddrs[i], c.sentinelAddrs[0]
				c.setSentinel(ctx, sentinelCli)
				internal.Logger.Printf(ctx, "sentinel: selected addr=%s masterAddr=%s", addr, masterAddr)
				close(done)
			})
		}
	}(i, sentinelAddr)
}
wg.Wait()
close(errCh)
// Prefer a valid master address over any error reported by a faulty sentinel.
if masterAddr != "" {
	return masterAddr, nil
}
// Ranging over a channel yields one value per iteration.
for serr := range errCh {
	if serr != nil {
		return "", serr
	}
}
return "", errors.New("redis: all sentinels specified in configuration are unreachable")

ndyakov · 2025-04-17T07:37:52Z

@kwenzh I will review this once again today. The last time I reviewed it, it looked like the done channel is closed once all nodes are done, and only then is the wait group done. Anyway, the first approach makes sense and will be more robust. Feel free to open a PR; just include a comment explaining why we have this check in the error case.

kwenzh · 2025-04-17T11:14:31Z

Yes, that's correct. The done channel is closed only after waiting for all nodes to complete, which includes waiting for the goroutine of the faulty Sentinel node to exit. In that case, the select receives from errCh first.

ndyakov · 2025-04-17T11:20:27Z

@kwenzh feel free to review #3349

kwenzh · 2025-04-17T11:45:51Z

Got it.

ndyakov · 2025-04-17T13:32:27Z

#3349 is merged.

kwenzh changed the title ~~Sentinel cluster set 1 node network iface down, unable to elect a master, context deadline exceeded~~ Sentinel cluster settings 1 node network iface down, Probability unable to query the master node, MasterAddr error： context deadline exceeded Oct 28, 2024

kwenzh mentioned this issue Apr 11, 2025

Ensure context isn't exhausted via concurrent query as opposed to sentinel query #3334

Merged

kwenzh mentioned this issue Apr 17, 2025

Check for handling errors before returning when querying the sentinel to get the master address. #3350

Closed

ndyakov mentioned this issue Apr 17, 2025

fix: better error handling when fetching the master node from the sentinels #3349

Merged

ndyakov closed this as completed Apr 17, 2025
