LD Relay doesn't repopulate Redis when Redis is restarted #141

khepin · 2021-06-02T17:49:11Z

Describe the bug
We use LD and LD relay for PHP apps where the recommended approach is to have LD Relay persist the data in Redis and PHP connect to Redis.

When the Redis instance is restarted, it loses data and the PHP apps will then return false for every evaluation.

After Redis has been restarted, LDRelay does not repopulate the data in Redis and we had our flags all return false for 1 hour.

A restart of LD relay fixed the issue.

To reproduce

Start LD Relay with persistence to Redis
Restart Redis
The data is lost in Redis and doesn't get repopulated

Expected behavior

Start LD Relay with persistence to Redis
Restart Redis
LD Relay detects the connection reset to Redis / restart and repopulates the data in Redis

Logs
No useful logs were found

SDK version

Language version, developer tools
PHP 7.4

OS/platform

Additional context
Our cache TTL for redis was set at 30s but we were missing flag definitions for almost an hour until we rebooted LD relay which fixed the issue.

The text was updated successfully, but these errors were encountered:

eli-darkly · 2021-06-02T18:34:48Z

In the current implementation, there is no way for the Relay Proxy to detect reliably that Redis has restarted.

It can detect if Redis is unavailable at a time when it's actually trying to do something with Redis— that is, if it receives a flag update from LaunchDarkly and tries to save it to Redis, or if it gets a request for flags from an SDK at a time when the cache has expired so it has to read Redis, and Redis is unavailable at that moment, it will detect that as a database error. In that case, it enters a "database is unavailable" state where it keeps checking until the database is available again, and when it is, then it will re-save all the flags.

But if the Redis connection goes away briefly during a time when it wasn't active anyway, the Relay Proxy can't detect that— the Redis client doesn't (as far as I know) make such interruptions visible to the application, it handles them transparently. There is no signal to tell it "the database was available last time you checked, and it is also available now, but all the data has been lost in between then and now."

In any case, the current situation is that if you know you are going to restart Redis and lose your data, you need to also restart the Relay Proxy.

khepin · 2021-06-02T19:12:20Z

Are there any plans (or avenues to be explored) to move away from the current situation towards something more reliable?

eli-darkly · 2021-06-02T19:41:15Z

This is as far as I know the first time the issue has been raised, so there haven't been any plans up till now and it's too soon for me to comment on how we might handle this. Again, at present it's not clear to what extent the Redis client library allows us to even detect this condition, so I'm not sure if this is an issue that's specific to the Relay Proxy or a more general issue inherent in using Redis without a persistent backing store.

khepin · 2021-06-02T19:49:18Z

From an initial look at the code I believe you're using the connection pool for redis which indeed would hide connection failures. That package also offers the ability to make a direct connection where a periodic healthcheck could be sent to check if the connection is still alive.

The following code for example detects connection errors by sending a periodic PING. If I then restart redis it'll detect EOF on the connection on the next attempt.

package main

import (
	"fmt"
	"time"

	"github.com/gomodule/redigo/redis"
)

func main() {
	err := <-signalOnConnectionFailure()
	fmt.Println(err)
}

func signalOnConnectionFailure() <-chan (error) {
	healthchecks := make(chan (error))
	go func() {
		c, err := redis.Dial("tcp", "redis:6379")
		if err != nil {
			healthchecks <- err
			close(healthchecks)
			return
		}

		for true {
			<-time.After(30 * time.Second)
			_, err := c.Do("PING")
			if err != nil {
				healthchecks <- err
				close(healthchecks)
				return
			}

			// connection healthy
		}
	}()
	return healthchecks
}

bezmax · 2021-06-03T15:40:58Z

ld-relay could just maintain/verify a marker entry in Redis. If the entry is gone - that means Redis needs to be repopulated.

khepin · 2021-06-03T15:51:23Z

Yeah that seems like a very valid approach as well.

eli-darkly · 2021-06-03T17:56:36Z

Well, there already is a marker entry, it's just that up till now the data store implementation did not do any kind of constant polling of the database (except in the case I mentioned after it's already run into a server error) but only accessed it when there was an immediate need to read or write data. We were trying to avoid creating more server traffic than was strictly necessary for applications to use LD.

We may need to re-evaluate that but we'll need to do it carefully - there are some design decisions to be made. This whole thing is not just part of ld-relay - the entire data store implementation comes from the LD Go SDK which Relay uses. In applications other than Relay there can be different concerns, for instance this could be a Go application that is only reading flags from the database that were put there by Relay and should not ever be populating the database itself. So we would need to build in some extra context-awareness that there isn't a place for in the current internal API. There's also the question of how much of this should be specific to Redis vs. other data store integrations - Redis is somewhat unusual in terms of not necessarily being a persistent store (although many of our customers do use it with persistent storage enabled), and also with other databases like DynamoDB where even a small query is subject to throttling it'd be generally undesirable to poll in this way.

So, for now I can only say that this is a thing we're considering.

eli-darkly · 2021-06-03T17:58:49Z

@christianladam Yes, that's the marker entry I mentioned, but due to the current internal design of the Go SDK we can't quite do what you're suggesting - Relay does not currently have direct access to the object that's implemented in redis_impl.go, it sees only a higher layer that adds caching around those queries, and caching would defeat the purpose of polling in this case. So, again, whatever we do would require some internal API rethinking.

christianladam · 2021-06-03T18:02:45Z

@christianladam Yes, that's the marker entry I mentioned, but due to the current internal design of the Go SDK we can't quite do what you're suggesting - Relay does not currently have direct access to the object that's implemented in redis_impl.go, it sees only a higher layer that adds caching around those queries, and caching would defeat the purpose of polling in this case. So, again, whatever we do would require some internal API rethinking.

I deleted my comment before your response but yeah there's additional work on the relay side itself required to expose this as you definitely don't want the LDClients outside the relay messing with the data population, as you already mentioned.

eli-darkly · 2021-06-03T19:53:53Z

you definitely don't want the LDClients outside the relay messing with the data population, as you already mentioned

It's not quite as simple as that either. An SDK instance that's configured to use Redis could either be using it in a read-only way, where it expects that Relay (or some equivalent process) is responsible for populating the database, or as a backing store that belongs to the SDK (i.e. much the same way Relay would use it: the app puts flags in the database, so if the app has to restart and can't reach LD right away, the last known data is in there) in which case it's not read-only and we would probably want the SDK to have the same behavior that's been suggested here. Currently the SDK does have some ability to know which of those situations it's in, but that knowledge isn't quite available at the level where it would need to be for implementing this.

(v6 - #9) create RelayCore and move most of the core logic into it

louis-launchdarkly · 2023-08-23T17:47:38Z

An update - Recently the SDK team has been looking at the persistent store design, and we will keep this use case in mind.

LaunchDarklyCI pushed a commit that referenced this issue Jun 3, 2021

Merge pull request #141 from launchdarkly/eb/ch83491/relay-base

dd1fa96

(v6 - #9) create RelayCore and move most of the core logic into it

louis-launchdarkly added the enhancement label Aug 23, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

LD Relay doesn't repopulate Redis when Redis is restarted #141

LD Relay doesn't repopulate Redis when Redis is restarted #141

khepin commented Jun 2, 2021

eli-darkly commented Jun 2, 2021

khepin commented Jun 2, 2021

eli-darkly commented Jun 2, 2021 •

edited

Loading

khepin commented Jun 2, 2021 •

edited

Loading

bezmax commented Jun 3, 2021

khepin commented Jun 3, 2021

eli-darkly commented Jun 3, 2021

eli-darkly commented Jun 3, 2021 •

edited

Loading

christianladam commented Jun 3, 2021 •

edited

Loading

eli-darkly commented Jun 3, 2021

louis-launchdarkly commented Aug 23, 2023

LD Relay doesn't repopulate Redis when Redis is restarted #141

LD Relay doesn't repopulate Redis when Redis is restarted #141

Comments

khepin commented Jun 2, 2021

eli-darkly commented Jun 2, 2021

khepin commented Jun 2, 2021

eli-darkly commented Jun 2, 2021 • edited Loading

khepin commented Jun 2, 2021 • edited Loading

bezmax commented Jun 3, 2021

khepin commented Jun 3, 2021

eli-darkly commented Jun 3, 2021

eli-darkly commented Jun 3, 2021 • edited Loading

christianladam commented Jun 3, 2021 • edited Loading

eli-darkly commented Jun 3, 2021

louis-launchdarkly commented Aug 23, 2023

eli-darkly commented Jun 2, 2021 •

edited

Loading

khepin commented Jun 2, 2021 •

edited

Loading

eli-darkly commented Jun 3, 2021 •

edited

Loading

christianladam commented Jun 3, 2021 •

edited

Loading