
Dragonfly crashes in minimal memory configuration #2739

Open
applike-ss opened this issue Mar 18, 2024 · 10 comments

Labels
bug Something isn't working · minor nice to have enhancement

Comments

@applike-ss
Contributor

Describe the bug
I discovered crashes of Dragonfly while testing cache mode to ensure it is stable before using it in production workloads.
It seems to crash at a random point while ingesting large amounts of data.

To Reproduce
Steps to reproduce the behavior:

  1. create a data.txt file for later data ingestion: dd if=/dev/urandom bs=1024 count=100 | base64 > data.txt
  2. now create a script to ingest data and call it redis-pump.sh (put it in the same folder as data.txt):
#!/bin/zsh

CLUSTER_HOSTNAME=$1
MAX_ITERATIONS=$2

i=0

if [ -z "${MAX_ITERATIONS}" ]; then
  echo "pushing as much data as possible to ${CLUSTER_HOSTNAME}"
else
  echo "pushing ${MAX_ITERATIONS} chunks of data to ${CLUSTER_HOSTNAME}"
fi

while true;
do
  i=$(expr $i + 1)

  # -x reads the value for SET from stdin (data.txt); grep -v hides the OK replies
  redis-cli -h "${CLUSTER_HOSTNAME}" -x set x-data-$i < data.txt | grep -v "OK"

  if [ $((i % 5)) = 0 ]; then
    echo "pushed $i data chunks to ${CLUSTER_HOSTNAME}"
  fi

  if [ "$i" = "${MAX_ITERATIONS}" ]; then
    echo "pushed all $i data chunks to ${CLUSTER_HOSTNAME}"
    exit
  fi
done
  3. in one terminal execute: docker run -it --rm --cpus 0.1 -m 320M -p 6379:6379 -p 9999:9999 --mount type=tmpfs,destination=/dragonfly/snapshots,tmpfs-size=524288000 ghcr.io/dragonflydb/dragonfly-weekly:e8650ed2b4ebd550c966751dd33ebb1ac4f82b1f-ubuntu --alsologtostderr --primary_port_http_enabled=false --admin_port=9999 --admin_nopass --cache_mode --primary_port_http_enabled=true --cluster_mode=emulated --dir=/dragonfly/snapshots --snapshot_cron="*/5 * * * *"
  4. wait for Dragonfly to start
  5. run ./redis-pump.sh 127.0.0.1 1000000 and wait ~7-10 minutes
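
While the pump runs, the container's memory usage can be watched from the host (a sketch; the container name/ID is whatever docker ps shows):

docker stats --format 'table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}' <container-id>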

Expected behavior
There should not be any crashes. If a crash does occur, there should be a clear error/critical log message explaining why it happened.

Logs
dragonfly:

I20240318 14:37:46.707206     1 init.cc:70] dragonfly running in opt mode.
* Logs will be written to the first available of the following paths:
/tmp/dragonfly.*
./dragonfly.*
* For the available flags type dragonfly [--help | --helpfull]
* Documentation can be found at: https://www.dragonflydb.io/docs
I20240318 14:37:46.707629     1 dfly_main.cc:627] Starting dragonfly df-e8650ed-e8650ed2b4ebd550c966751dd33ebb1ac4f82b1f
W20240318 14:37:46.708315     1 dfly_main.cc:666] SWAP is enabled. Consider disabling it when running Dragonfly.
I20240318 14:37:46.708344     1 dfly_main.cc:671] maxmemory has not been specified. Deciding myself....
I20240318 14:37:46.708350     1 dfly_main.cc:680] Found 320.00MiB available memory. Setting maxmemory to 256.00MiB
I20240318 14:37:46.709347     8 uring_proactor.cc:172] IORing with 1024 entries, allocated 102464 bytes, cq_entries is 2048
I20240318 14:37:46.709468     1 proactor_pool.cc:146] Running 1 io threads
I20240318 14:37:46.718241     1 dfly_main.cc:258] Listening on admin socket any:9999
I20240318 14:37:46.719838     1 server_family.cc:665] Host OS: Linux 6.6.16-linuxkit aarch64 with 1 threads
I20240318 14:37:46.720253     1 snapshot_storage.cc:108] Load snapshot: Searching for snapshot in directory: "/dragonfly/snapshots"
W20240318 14:37:46.720288     1 server_family.cc:758] Load snapshot: No snapshot found
I20240318 14:37:46.720400     8 listener_interface.cc:122] sock[5] AcceptServer - listening on port 9999
I20240318 14:37:46.720415     8 listener_interface.cc:122] sock[6] AcceptServer - listening on port 6379
I20240318 14:40:09.778756     8 save_stages_controller.cc:321] Saving "/dragonfly/snapshots/dump-2024-03-18T14:40:00-summary.dfs" finished after 9 s
(shell prompt after the crash: the docker run process exited with KILL status after 7m 21s)

redis-pump.sh

pushed 5140 data chunks to 127.0.0.1
pushed 5145 data chunks to 127.0.0.1
pushed 5150 data chunks to 127.0.0.1
pushed 5155 data chunks to 127.0.0.1
pushed 5160 data chunks to 127.0.0.1
pushed 5165 data chunks to 127.0.0.1
Error: Server closed the connection
Error: Connection reset by peer
Could not connect to Redis at 127.0.0.1:6379: Connection refused
Could not connect to Redis at 127.0.0.1:6379: Connection refused
Could not connect to Redis at 127.0.0.1:6379: Connection refused
pushed 5170 data chunks to 127.0.0.1

Environment (please complete the following information):

Reproducible Code Snippet

See above

Is there any additional information I can provide, or an option to run Dragonfly with debug logging to gather more useful information?

@applike-ss applike-ss added the bug Something isn't working label Mar 18, 2024
@applike-ss
Contributor Author

applike-ss commented Mar 19, 2024

Removing the dir and snapshot_cron parameters, by the way, does not show this behaviour.
I have had it running without these parameters, pumping data into it, for more than 6 hours already.
I could imagine a memory-related issue that makes containerd kill the container, e.g. creating the snapshot plus the data held in memory for the KV data runs into the limit I set for the container (320Mi).
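
A quick way to check that theory (a sketch, assuming the container is run with plain Docker as in the reproduction steps; the container ID is whatever docker ps -a shows):

# prints "true <exit code>" if the kernel OOM killer terminated the container
docker inspect --format '{{.State.OOMKilled}} {{.State.ExitCode}}' <container-id>

# the kernel log should also contain an oom-killer entry for the dragonfly process
dmesg | grep -i -e oom -e 'killed process'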

@romange
Collaborator

romange commented Mar 19, 2024

you can easily verify it @applike-ss by raising the limit and checking used_memory_peak_rss with "INFO MEMORY" command

@romange romange added the minor nice to have enhancement label Mar 25, 2024
@romange
Collaborator

romange commented Mar 25, 2024

Is there anything we can help with here @applike-ss?

@chakaz probably a good use-case to focus on when switching to work on RSS instrumentation. Not urgent though.

@applike-ss
Contributor Author

> you can easily verify it @applike-ss by raising the limit and checking used_memory_peak_rss with "INFO MEMORY" command

I'm doing that right now. I assume you want me to watch "redis-cli info memory | grep used_memory_peak_rss", correct? Otherwise let me know how I'm supposed to test here.
I will post the results once I have them.
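
Concretely, a minimal polling loop I would run against the instance (a sketch; host and interval are my own choices):

while true; do
  redis-cli -h 127.0.0.1 info memory | grep -E 'used_memory_rss|used_memory_peak_rss'
  sleep 10
done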

> Is there anything we can help with here @applike-ss?
>
> @chakaz probably a good use-case to focus on when switching to work on RSS instrumentation. Not urgent though.

Sorry, I was on vacation and couldn't check back here.

@applike-ss
Contributor Author

Observed memory usages:

Limited to 320Mi (256Mi maxmemory, crashing):
Output:

I20240402 06:52:46.653187     1 dfly_main.cc:627] Starting dragonfly df-e8650ed-e8650ed2b4ebd550c966751dd33ebb1ac4f82b1f
W20240402 06:52:46.653295     1 dfly_main.cc:666] SWAP is enabled. Consider disabling it when running Dragonfly.
I20240402 06:52:46.653316     1 dfly_main.cc:671] maxmemory has not been specified. Deciding myself....
I20240402 06:52:46.653319     1 dfly_main.cc:680] Found 320.00MiB available memory. Setting maxmemory to 256.00MiB
I20240402 06:52:46.654233     8 uring_proactor.cc:172] IORing with 1024 entries, allocated 102464 bytes, cq_entries is 2048
I20240402 06:52:46.654599     1 proactor_pool.cc:146] Running 1 io threads
I20240402 06:52:46.720907     1 dfly_main.cc:258] Listening on admin socket any:9999
I20240402 06:52:46.721390     1 server_family.cc:665] Host OS: Linux 6.6.16-linuxkit aarch64 with 1 threads
I20240402 06:52:46.721801     1 snapshot_storage.cc:108] Load snapshot: Searching for snapshot in directory: "/dragonfly/snapshots"
W20240402 06:52:46.721869     1 server_family.cc:758] Load snapshot: No snapshot found
I20240402 06:52:46.722051     8 listener_interface.cc:122] sock[5] AcceptServer - listening on port 9999
I20240402 06:52:46.722074     8 listener_interface.cc:122] sock[6] AcceptServer - listening on port 6379
I20240402 06:53:00.228514     8 save_stages_controller.cc:321] Saving "/dragonfly/snapshots/dump-2024-04-02T06:53:00-summary.dfs" finished after 0 us
I20240402 06:54:02.930020     8 save_stages_controller.cc:321] Saving "/dragonfly/snapshots/dump-2024-04-02T06:54:00-summary.dfs" finished after 2 s

used_memory_peak_rss:320741376

Limited to 640Mi (512Mi maxmemory, not crashing):
Output:

I20240402 06:56:57.617758     1 init.cc:70] dragonfly running in opt mode.
* Logs will be written to the first available of the following paths:
/tmp/dragonfly.*
./dragonfly.*
* For the available flags type dragonfly [--help | --helpfull]
* Documentation can be found at: https://www.dragonflydb.io/docs
I20240402 06:56:57.617822     1 dfly_main.cc:627] Starting dragonfly df-e8650ed-e8650ed2b4ebd550c966751dd33ebb1ac4f82b1f
W20240402 06:56:57.617899     1 dfly_main.cc:666] SWAP is enabled. Consider disabling it when running Dragonfly.
I20240402 06:56:57.617913     1 dfly_main.cc:671] maxmemory has not been specified. Deciding myself....
I20240402 06:56:57.617915     1 dfly_main.cc:680] Found 640.00MiB available memory. Setting maxmemory to 512.00MiB
I20240402 06:56:57.618276     8 uring_proactor.cc:172] IORing with 1024 entries, allocated 102464 bytes, cq_entries is 2048
I20240402 06:56:57.618389     1 proactor_pool.cc:146] Running 1 io threads
I20240402 06:56:57.619359     1 dfly_main.cc:258] Listening on admin socket any:9999
I20240402 06:56:57.619737     1 server_family.cc:665] Host OS: Linux 6.6.16-linuxkit aarch64 with 1 threads
I20240402 06:56:57.619998     1 snapshot_storage.cc:108] Load snapshot: Searching for snapshot in directory: "/dragonfly/snapshots"
W20240402 06:56:57.620031     1 server_family.cc:758] Load snapshot: No snapshot found
I20240402 06:56:57.620177     8 listener_interface.cc:122] sock[5] AcceptServer - listening on port 9999
I20240402 06:56:57.620189     8 listener_interface.cc:122] sock[6] AcceptServer - listening on port 6379
I20240402 06:57:00.002776     8 save_stages_controller.cc:321] Saving "/dragonfly/snapshots/dump-2024-04-02T06:57:00-summary.dfs" finished after 0 us
I20240402 06:58:01.918434     8 save_stages_controller.cc:321] Saving "/dragonfly/snapshots/dump-2024-04-02T06:58:00-summary.dfs" finished after 1 s
I20240402 06:59:04.121477     8 save_stages_controller.cc:321] Saving "/dragonfly/snapshots/dump-2024-04-02T06:59:00-summary.dfs" finished after 4 s
E20240402 07:00:09.115413     8 rdb_save.cc:1406] io error system:28
W20240402 07:00:09.325433     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:01:09.622743     8 rdb_save.cc:1406] io error system:28
W20240402 07:01:09.825254     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:02:13.933493     8 rdb_save.cc:1406] io error system:28
W20240402 07:02:14.120755     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:03:17.126403     8 rdb_save.cc:1406] io error system:28
W20240402 07:03:17.223644     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:04:19.825512     8 rdb_save.cc:1406] io error system:28
W20240402 07:04:20.023314     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:05:16.028622     8 rdb_save.cc:1406] io error system:28
W20240402 07:05:16.226780     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:06:20.922616     8 rdb_save.cc:1406] io error system:28
W20240402 07:06:21.125222     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:07:19.232414     8 rdb_save.cc:1406] io error system:28
W20240402 07:07:19.429133     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:08:19.621541     8 rdb_save.cc:1406] io error system:28
W20240402 07:08:19.833389     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:09:15.229292     8 rdb_save.cc:1406] io error system:28
W20240402 07:09:15.426709     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:10:18.745149     8 rdb_save.cc:1406] io error system:28
W20240402 07:10:19.034804     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:11:20.337744     8 rdb_save.cc:1406] io error system:28
W20240402 07:11:20.622592     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:12:20.823269     8 rdb_save.cc:1406] io error system:28
W20240402 07:12:21.012964     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:13:20.706354     8 rdb_save.cc:1406] io error system:28
W20240402 07:13:20.903396     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:14:19.709885     8 rdb_save.cc:1406] io error system:28
W20240402 07:14:20.008428     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:15:19.915210     8 rdb_save.cc:1406] io error system:28
W20240402 07:15:20.305893     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:16:21.213469     8 rdb_save.cc:1406] io error system:28
W20240402 07:16:21.406510     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:17:17.004458     8 rdb_save.cc:1406] io error system:28
W20240402 07:17:17.214521     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:18:19.917520     8 rdb_save.cc:1406] io error system:28
W20240402 07:18:20.219050     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:19:20.011533     8 rdb_save.cc:1406] io error system:28
W20240402 07:19:20.309616     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:20:20.009765     8 rdb_save.cc:1406] io error system:28
W20240402 07:20:20.310956     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:21:18.818560     8 rdb_save.cc:1406] io error system:28
W20240402 07:21:19.203550     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:22:20.719043     8 rdb_save.cc:1406] io error system:28
W20240402 07:22:21.209784     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:23:21.320616     8 rdb_save.cc:1406] io error system:28
W20240402 07:23:21.611627     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:24:21.321049     8 rdb_save.cc:1406] io error system:28
W20240402 07:24:21.622252     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:25:20.007305     8 rdb_save.cc:1406] io error system:28
W20240402 07:25:20.309065     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:26:19.608237     8 rdb_save.cc:1406] io error system:28
W20240402 07:26:19.921449     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:27:20.921967     8 rdb_save.cc:1406] io error system:28
W20240402 07:27:21.305016     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:28:21.204382     8 rdb_save.cc:1406] io error system:28
W20240402 07:28:21.512039     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:29:20.717458     8 rdb_save.cc:1406] io error system:28
W20240402 07:29:21.213011     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:30:21.419281     8 rdb_save.cc:1406] io error system:28
W20240402 07:30:21.610126     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:31:19.925829     8 rdb_save.cc:1406] io error system:28
W20240402 07:31:20.208326     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:32:20.812709     8 rdb_save.cc:1406] io error system:28
W20240402 07:32:21.014976     8 server_family.cc:918] Failed to perform snapshot No space left on device
^CI20240402 07:33:16.215328     8 accept_server.cc:24] Exiting on signal Interrupt
E20240402 07:33:16.218310     8 rdb_save.cc:1406] io error system:28
I20240402 07:33:16.221161     8 listener_interface.cc:222] Listener stopped for port 9999
W20240402 07:33:16.413079     8 server_family.cc:918] Failed to perform snapshot No space left on device
I20240402 07:33:16.413216     8 listener_interface.cc:222] Listener stopped for port 6379
E20240402 07:33:35.016315     8 rdb_save.cc:1406] io error system:28
W20240402 07:33:35.311447     8 server_family.cc:784] Failed to perform snapshot No space left on device

used_memory_peak_rss:668233728

My assumption is that because Dragonfly is running in cache mode, it should never go OOM - so no crash should happen.
If it was not running in cache mode, I would expect it to return OOM error responses (rather than crash) when I write data to it and current data + incoming data > maxmemory.
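
To see whether cache-mode eviction actually kicks in as memory fills up, I would watch the eviction counter (a sketch; as far as I know Dragonfly exposes the standard evicted_keys field in INFO STATS):

redis-cli -h 127.0.0.1 info stats | grep -i evicted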

@chakaz
Collaborator

chakaz commented Apr 4, 2024

@applike-ss we plan to take into account the additional memory used when serializing to a file (and also for a new replica connecting in the full-sync stage), and perhaps fail / slow down these actions when memory is tight.
Also, we plan to migrate from using used-memory (the memory used only by the data) to using RSS (actual memory usage).
Both actions will take at the very least a few weeks, likely more, so in the meantime you could add some memory buffer for these operations specifically.
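
As a stopgap, that buffer could look roughly like this for the Docker reproduction above (a sketch; the 448M container limit and the image tag placeholder are my own assumptions - the point is only that maxmemory stays well below the container limit so snapshotting has headroom):

docker run -it --rm --cpus 0.1 -m 448M -p 6379:6379 -p 9999:9999 \
  --mount type=tmpfs,destination=/dragonfly/snapshots,tmpfs-size=524288000 \
  ghcr.io/dragonflydb/dragonfly-weekly:<tag> \
  --maxmemory=256M --alsologtostderr --cache_mode --cluster_mode=emulated \
  --dir=/dragonfly/snapshots --snapshot_cron="*/5 * * * *"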

@applike-ss
Contributor Author

With the current version (docker.dragonflydb.io/dragonflydb/dragonfly:v1.20.1), we have to use 430Mi when using a maxmemory of 320M, doing snapshots every 5 minutes, and constantly pushing 1MB values in cache mode.

I am still waiting for Dragonfly to exit gracefully when this kind of issue happens. From the logs I don't see it shut down when crashing due to the OOM situation, and there is no error log about a failed memory allocation or similar. At the very least a message would be nice that lets us see more easily from the logs why it crashed.

@romange
Collaborator

romange commented Jul 16, 2024

Unfortunately, based on the information we have here, we cannot derive any constructive action items. We need a way to reproduce it before we can try fixing it 🤷🏼

@applike-ss
Contributor Author

I did provide a reproduction already in the first message of this issue, but here is another one for k8s.

Here are the steps for reproduction:

  • Have k3d installed (e.g. via asdf) and Docker up and running, and ensure no other k3d cluster is currently running
  • ensure you created the redis-pump.sh script as in the first message
  • ensure you created the data.txt file as in the first message
  • create the k3d cluster with port forwarding
  • install the dragonfly operator
  • apply the provided CR
  • Run ./redis-pump.sh 127.0.0.1 1000000

Create K3D Cluster with port forwarding:

k3d cluster create -p 127.0.0.1:6379:6379

Install the dragonfly operator:

kubectl apply -f https://raw.githubusercontent.com/dragonflydb/dragonfly-operator/main/manifests/dragonfly-operator.yaml

CR:

apiVersion: dragonflydb.io/v1alpha1
kind: Dragonfly
metadata:
  labels:
    app.kubernetes.io/name: dragonfly
    app.kubernetes.io/instance: dragonfly-sample
    app.kubernetes.io/part-of: dragonfly-operator
    app.kubernetes.io/managed-by: kustomize
    app.kubernetes.io/created-by: dragonfly-operator
  name: dragonfly
spec:
  replicas: 3
  args:
    - --maxmemory=256M
    - --alsologtostderr
    - --primary_port_http_enabled=false
    - --admin_port=9999
    - --admin_nopass
    - --cache_mode
    - --primary_port_http_enabled=true
    - --cluster_mode=emulated
  snapshot:
    cron: "*/5 * * * *"
    persistentVolumeClaimSpec:
      resources:
        requests:
          storage: 1Gi
      accessModes:
        - ReadWriteOnce
  resources:
    requests:
      cpu: 500m
      memory: 320Mi
    limits:
      cpu: 500m
      memory: 320Mi
  serviceSpec:
    type: LoadBalancer
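
After a crash, the reason can be confirmed from the pod status (a sketch; the pod name dragonfly-0 is an assumption based on the operator's StatefulSet naming, adjust to the actual pod name):

kubectl get pod dragonfly-0 \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
# expected to print OOMKilled when the 320Mi limit was hit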

@chakaz chakaz removed their assignment Sep 9, 2024
@applike-ss
Contributor Author

applike-ss commented Sep 16, 2024

I just re-checked it with v1.22.1 and the problem still exists.
