Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvs commit: no such file or directory #6463

Open
vsoch opened this issue Nov 29, 2024 · 28 comments
Open

kvs commit: no such file or directory #6463

vsoch opened this issue Nov 29, 2024 · 28 comments

Comments

@vsoch
Copy link
Member

vsoch commented Nov 29, 2024

Continuing from discussion in #6461. When

When we get to larger sizes of a kary:3 tree (and it's not clear to me if the topology matters here or if it was transient) we start to see that the archive create is failing. Here is an example output from an experiment from last night - the purple line is erroneous. The archive wasn't created or distributed, it issued that error and the time ended too early.

distribute-all-nodes-nodes-6

Specifically, I think I'm hitting some limit or other error with kary-3 (and I'm not sure what at the moment, just finished the runs and would need to manually inspect):

image

Here is what that topology looks like:

nodes-6-topo-kary:3
0 faa621ae1899: full
├─ 1 faa621ae1899: full
│  ├─ 4 faa621ae1899: full
│  └─ 5 faa621ae1899: full
├─ 2 faa621ae1899: full
└─ 3 faa621ae1899: full

I am trying to reproduce it now and look at flux dmesg and other stats for possibly other / more information.

@garlick
Copy link
Member

garlick commented Nov 29, 2024

If archive creation is failing, I would think that would be localized to rank 0 and the overlay config shouldn't matter?

It's odd that it's not reproducible for any topology then, unless the tests are run repeatedly on the same instance and the error occurs after some build-up of state. A first thing to check would be if the rank 0 backing store file system, usually in /tmp, is running out of space.

Whatever it turns out to be, this error message is decidedly unhelpful and perhaps actively misleading so there's a bug to fix here.

@vsoch
Copy link
Member Author

vsoch commented Nov 29, 2024

It's odd that it's not reproducible for any topology then, unless the tests are run repeatedly on the same instance and the error occurs after some build-up of state. A first thing to check would be if the rank 0 backing store file system, usually in /tmp, is running out of space.

I'm running these in sequence on the same cluster - and (strangely) I haven't reproduced yet for kary-3. These are being run in pods -> containers in Kubernetes, and all environments are consistent (the same exact container being deployed on the same nodes) and the only difference was the topology being used. The pod setup means that the entire thing is nuked and re-created with each run (there is no persisting temporary file system).

I suspect it's something more ephemeral that might be hard to pin down. What I'm doing is running the same experiment again, and I'm carefully monitoring each log as it is generated. As soon as I see this reproduced (and I hope I do) I have another interactive shell that is ready to look at flux dmesg and the command to show state.

@vsoch
Copy link
Member Author

vsoch commented Nov 29, 2024

If you want to see how the experiments work, all of the files to reproduce are here. Let me know if you have questions! The basic idea is that we build a container with those blobs, deploy it to 6 nodes in a GKE (Google Kubernetes Engine) cluster, and then the Flux Operator (which creates a flux MiniCluster) is deployed with an entrypoint to generate a custom experiment script (which comes down to a bash script, you can see one printed at the end of the experiment run, here.

For the experiment, we loop through the sizes, each time creating the archive, using some mechanism to distribute, and then cleaning up. I'm trying a few simple distribution mechanisms on each topology - across all the nodes, and then a two step distribute that sends from the root to the middle nodes, and then middle nodes to the leaves, the idea being that I can compare that to distributing to all nodes (and maybe this isn't necessary to test because it's unlikely to be an improvement)? I am also do a rm -rf of the actual file that is extracted to each node so the node (hypothetically) doesn't fill up. Let me know if you see any issue with that experiment setup (user error or otherwise).

@vsoch
Copy link
Member Author

vsoch commented Nov 29, 2024

Got it!

2024-11-29T20:17:55.469647Z sched-simple.debug[0]: resource update: {"resources":{"version":1,"execution":{"R_lite":[{"rank":"0-5","children":{"core":"0-15"}}],"starttime":0.0,"expiration":0.0,"nodelist":["flux-sample-[0-5]"]}},"up":""}
2024-11-29T20:17:55.469705Z job-manager.debug[0]: scheduler: hello
2024-11-29T20:17:55.469802Z job-manager.debug[0]: scheduler: ready limited
2024-11-29T20:17:55.469863Z sched-simple.debug[0]: ready: 0 of 96 cores: 
2024-11-29T20:17:55.559711Z broker.info[0]: rc1.0: /etc/flux/rc1 Exited (rc=0) 0.3s
2024-11-29T20:17:55.559786Z broker.info[0]: rc1-success: init->quorum 0.297064s
2024-11-29T20:17:55.660192Z broker.info[0]: online: flux-sample-0 (ranks 0)
2024-11-29T20:17:55.660239Z sched-simple.debug[0]: resource update: {"up":"0"}
2024-11-29T20:17:56.048426Z broker.debug[0]: accepting connection from flux-sample-1 (rank 1) status partial
2024-11-29T20:17:56.309434Z sched-simple.debug[0]: resource update: {"up":"1"}
2024-11-29T20:17:56.699960Z sched-simple.debug[0]: resource update: {"up":"2"}
2024-11-29T20:17:56.889295Z sched-simple.debug[0]: resource update: {"up":"3"}
2024-11-29T20:17:57.078709Z sched-simple.debug[0]: resource update: {"up":"4"}
2024-11-29T20:17:57.267006Z broker.info[0]: online: flux-sample-[0-5] (ranks 0-5)
2024-11-29T20:17:57.267021Z broker.info[0]: quorum-full: quorum->run 1.70723s
2024-11-29T20:17:57.267043Z sched-simple.debug[0]: resource update: {"up":"5"}
2024-11-29T20:17:57.281339Z broker.debug[0]: rmmod content
2024-11-29T20:17:57.281545Z broker.debug[0]: module content exited
2024-11-29T20:17:57.282119Z broker.debug[0]: insmod content
2024-11-29T20:19:20.392899Z kvs.err[0]: content_load_completion: content_load_get: No such file or directory
2024-11-29T20:19:20.456611Z kvs.err[0]: content_load_completion: content_load_get: No such file or directory
2024-11-29T20:19:20.476759Z kvs.err[0]: content_load_completion: content_load_get: No such file or directory
2024-11-29T20:19:32.509589Z kvs.err[0]: content_load_completion: content_load_get: No such file or directory
2024-11-29T20:19:32.564534Z kvs.err[0]: content_load_completion: content_load_get: No such file or directory
2024-11-29T20:19:32.583773Z kvs.err[0]: content_load_completion: content_load_get: No such file or directory
2024-11-29T20:19:44.271318Z kvs.err[0]: content_load_completion: content_load_get: No such file or directory
2024-11-29T20:19:44.339905Z kvs.err[0]: content_load_completion: content_load_get: No such file or directory
2024-11-29T20:19:44.360034Z kvs.err[0]: content_load_completion: content_load_get: No such file or directory
2024-11-29T20:20:00.163049Z kvs.err[0]: content_load_completion: content_load_get: No such file or directory
2024-11-29T20:20:00.227368Z kvs.err[0]: content_load_completion: content_load_get: No such file or directory
2024-11-29T20:20:00.247890Z kvs.err[0]: content_load_completion: content_load_get: No such file or directory
root@flux-sample-0:/chonks# flux module stats content | jq
{
  "count": 20,
  "valid": 20,
  "dirty": 20,
  "size": 3074987,
  "flush-batch-count": 0,
  "mmap": {
    "tags": {},
    "blobs": 0
  }
}

image

image

I do think the broker might have exited? At least in my interactive console when there was the Exit:

root@flux-sample-0:/chonks# flux resource list
     STATE NNODES NCORES NGPUS NODELIST
      free      6     96     0 flux-sample-[0-5]
 allocated      0      0     0 
      down      0      0     0 
root@flux-sample-0:/chonks# flux dmesgflux-proxy: Lost connection to Flux
flux-proxy: Sending SIGHUP to child processes

root@flux-sample-0:/chonks# command terminated with exit code 137

Let me know if there is something I can look at - the cluster will self terminate in about 15 mins.

@vsoch
Copy link
Member Author

vsoch commented Nov 29, 2024

It seems strange that the size remains big - I wonder if there is some initial error with cleaning (removing) an archive, and then the rest result from it spilling over (if that is possible)? I would expect the size to go to zero as I create (and remove) archives.

root@flux-sample-0:/chonks# flux module stats content | jq
{
  "count": 20,
  "valid": 20,
  "dirty": 20,
  "size": 3074987,
  "flush-batch-count": 0,
  "mmap": {
    "tags": {},
    "blobs": 0
  }
}
root@flux-sample-0:/chonks# 
root@flux-sample-0:/chonks# flux module stats content | jq
{
  "count": 22,
  "valid": 22,
  "dirty": 22,
  "size": 3512747,
  "flush-batch-count": 0,
  "mmap": {
    "tags": {},
    "blobs": 0
  }
}
root@flux-sample-0:/chonks# flux module stats content | jq
{
  "count": 23,
  "valid": 23,
  "dirty": 23,
  "size": 4267947,
  "flush-batch-count": 0,
  "mmap": {
    "tags": {},
    "blobs": 0
  }
}

Ahh this is telling! The ranks are 💀

# flux exec -r all flux module stats content | jq
flux-exec: Error: rank 1: flux: No route to host
flux-exec: Error: rank 2: flux: No route to host
flux-exec: Error: rank 3: flux: No route to host
flux-exec: Error: rank 4: flux: No route to host
flux-exec: Error: rank 5: flux: No route to host
{
  "count": 24,
  "valid": 24,
  "dirty": 24,
  "size": 4681643,
  "flush-batch-count": 0,
  "mmap": {
    "tags": {},
    "blobs": 0
  }
}
root@flux-sample-0:/chonks# flux resource list
     STATE NNODES NCORES NGPUS NODELIST
      free      1     16     0 flux-sample-0
 allocated      0      0     0 
      down      5     80     0 flux-sample-[1-5]

So whatever happens, it kills the workers (but the lead broker persists), likely because it restarts.

@vsoch
Copy link
Member Author

vsoch commented Nov 29, 2024

It looks like they are coming up and down?

image

@vsoch
Copy link
Member Author

vsoch commented Nov 29, 2024

Oh interesting! So the workers are segfaulting. Here is rank 1.

🌀 flux start  -o --config /mnt/flux/view/etc/flux/config -Scron.directory=/etc/flux/system/cron.d   -Stbon.fanout=256   -Srundir=/mnt/flux/view/run/flux    -Sstatedir=/mnt/flux/view/var/lib/flux -Slocal-uri=local:///mnt/flux/view/run/flux/local -Stbon.connect_timeout=5s -Stbon.topo=kary:1    -Slog-stderr-level=0    -Slog-stderr-mode=local
/flux_operator/wait-0.sh: line 199:    32 Segmentation fault      (core dumped) flux start -o --config ${viewroot}/etc/flux/config ${brokerOptions}
Return value for follower worker is 139
😪 Sleeping 15s to try again...
/flux_operator/wait-0.sh: line 199:   113 Segmentation fault      (core dumped) flux start -o --config ${viewroot}/etc/flux/config ${brokerOptions}
Return value for follower worker is 139
😪 Sleeping 15s to try again...
/flux_operator/wait-0.sh: line 199:   143 Segmentation fault      (core dumped) flux start -o --config ${viewroot}/etc/flux/config ${brokerOptions}
Return value for follower worker is 139
😪 Sleeping 15s to try again...
/flux_operator/wait-0.sh: line 199:   180 Segmentation fault      (core dumped) flux start -o --config ${viewroot}/etc/flux/config ${brokerOptions}
Return value for follower worker is 139
😪 Sleeping 15s to try again...
/flux_operator/wait-0.sh: line 199:   210 Segmentation fault      (core dumped) flux start -o --config ${viewroot}/etc/flux/config ${brokerOptions}
Return value for follower worker is 139
😪 Sleeping 15s to try again...

@vsoch
Copy link
Member Author

vsoch commented Nov 29, 2024

okay - I'm going to build a throwaway variant of the flux operator that (on fail) adds a valgrind prefix to the broker start. That should minimally give us the terminal output, and if I'm fast enough I can copy the core dump from the node too.

@vsoch
Copy link
Member Author

vsoch commented Nov 29, 2024

okay! New loop for a working is running:

vg=""
while true
    do
        ${vg} flux start -o --config ${viewroot}/etc/flux/config ${brokerOptions}
        retval=$?
        if [[ "${retval}" -eq 0 ]] || [[ "false" == "true" ]]; then
             echo "The follower worker exited cleanly. Goodbye!"
             break
        fi
        echo "Return value for follower worker is ${retval}"
        vg="valgrind --leak-check=full"
        echo "😪 Sleeping 15s to try again..."
        sleep 15
    done

Watching! 👁️

@vsoch
Copy link
Member Author

vsoch commented Nov 29, 2024

okay so this is interesting - the initial error does not kill the other brokers. It's actually the check of the state that is doing it:

# flux exec -r all flux module stats content | jq
flux-exec: Error: rank 2: flux: No route to host
flux-exec: Error: rank 1: flux: No route to host
flux-exec: Error: rank 4: flux: No route to host
flux-exec: Error: rank 3: flux: No route to host
flux-exec: Error: rank 5: flux: No route to host
{
  "count": 25,
  "valid": 25,
  "dirty": 25,
  "size": 3709718,
  "flush-batch-count": 0,
  "mmap": {
    "tags": {},
    "blobs": 0
  }
}
root@flux-sample-0:/chonks# flux dmesg

I didn't see the broker restart until I issued that. I saw the error happen before that, but it didn't restart the brokers. So I don't think I can get a core dump for this - it won't naturally trigger on its own, and when I trigger it, the valgrind cuts (and I can't find a dump).

🌀 flux start  -o --config /mnt/flux/view/etc/flux/config -Scron.directory=/etc/flux/system/cron.d   -Stbon.fanout=256   -Srundir=/mnt/flux/view/run/flux    -Sstatedir=/mnt/flux/view/var/lib/flux -Slocal-uri=local:///mnt/flux/view/run/flux/local -Stbon.connect_timeout=5s -Stbon.topo=kary:1    -Slog-stderr-level=0    -Slog-stderr-mode=local
flux-broker: Warning: unable to resolve upstream peer flux-sample-3.flux-service.default.svc.cluster.local: Name or service not known
/flux_operator/wait-0.sh: line 203:    31 Segmentation fault      (core dumped) ${vg} flux start -o --config ${viewroot}/etc/flux/config ${brokerOptions}
Return value for follower worker is 139
😪 Sleeping 15s to try again...
==105== Memcheck, a memory error detector
==105== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==105== Using Valgrind-3.22.0 and LibVEX; rerun with -h for copyright info
==105== Command: flux start -o --config /mnt/flux/view/etc/flux/config -Scron.directory=/etc/flux/system/cron.d -Stbon.fanout=256 -Srundir=/mnt/flux/view/run/flux -Sstatedir=/mnt/flux/view/var/lib/flux -Slocal-uri=local:///mnt/flux/view/run/flux/local -Stbon.connect_timeout=5s -Stbon.topo=kary:1 -Slog-stderr-level=0 -Slog-stderr-mode=local
==105== 
/flux_operator/wait-0.sh: line 203:   105 Segmentation fault      (core dumped) ${vg} flux start -o --config ${viewroot}/etc/flux/config ${brokerOptions}
Return value for follower worker is 139
😪 Sleeping 15s to try again...
==142== Memcheck, a memory error detector
==142== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==142== Using Valgrind-3.22.0 and LibVEX; rerun with -h for copyright info
==142== Command: flux start -o --config /mnt/flux/view/etc/flux/config -Scron.directory=/etc/flux/system/cron.d -Stbon.fanout=256 -Srundir=/mnt/flux/view/run/flux -Sstatedir=/mnt/flux/view/var/lib/flux -Slocal-uri=local:///mnt/flux/view/run/flux/local -Stbon.connect_timeout=5s -Stbon.topo=kary:1 -Slog-stderr-level=0 -Slog-stderr-mode=local
==142== 

What I tried doing instead is shelling into the worker node directly to run the content command there, and the output is slightly different:

root@flux-sample-4:/chonks# flux dmesg
2024-11-29T20:54:56.913935Z broker.debug[4]: insmod connector-local
2024-11-29T20:54:56.914001Z broker.info[4]: start: none->join 0.48083ms
2024-11-29T20:54:56.914192Z connector-local.debug[4]: allow-guest-user=true
2024-11-29T20:54:56.914204Z connector-local.debug[4]: allow-root-owner=true
2024-11-29T20:54:56.915879Z broker.debug[4]: hello parent 3 0b453ad3-795a-4fa8-a76f-f55e2eaf9cee
2024-11-29T20:54:56.918234Z broker.debug[4]: accepting connection from flux-sample-5 (rank 5) status full
2024-11-29T20:54:57.208757Z broker.info[4]: parent-ready: join->init 0.294751s
2024-11-29T20:54:57.222022Z connector-local.debug[4]: allow-guest-user=true
2024-11-29T20:54:57.222031Z connector-local.debug[4]: allow-root-owner=true
2024-11-29T20:54:57.222103Z broker.info[4]: configuration updated
2024-11-29T20:54:57.228404Z broker.debug[4]: insmod content
2024-11-29T20:54:57.234354Z broker.debug[4]: insmod barrier
2024-11-29T20:54:57.245247Z broker.debug[4]: insmod kvs
2024-11-29T20:54:57.251352Z broker.debug[4]: insmod kvs-watch
2024-11-29T20:54:57.257414Z broker.debug[4]: insmod resource
2024-11-29T20:54:57.279246Z broker.debug[4]: insmod job-info
2024-11-29T20:54:57.285393Z broker.debug[4]: insmod job-ingest
2024-11-29T20:54:57.285630Z job-ingest.debug[4]: configuring validator with plugins=(null), args=(null) (enabled)
2024-11-29T20:54:57.286703Z job-ingest.debug[4]: fluid ts=402122ms
2024-11-29T20:54:57.287548Z broker.info[4]: rc1.0: running /etc/flux/rc1.d/02-cron
2024-11-29T20:54:57.300668Z broker.info[4]: rc1.0: /etc/flux/rc1 Exited (rc=0) 0.1s
2024-11-29T20:54:57.300750Z broker.info[4]: rc1-success: init->quorum 91.9843ms
2024-11-29T20:54:57.300782Z broker.info[4]: quorum-full: quorum->run 0.02641ms
2024-11-29T20:55:03.680639Z kvs.err[4]: content_load_completion: content_load_get: No such file or directory
2024-11-29T20:55:03.729080Z kvs.err[4]: content_load_completion: content_load_get: No such file or directory
2024-11-29T20:55:35.522720Z kvs.err[4]: content_load_completion: content_load_get: No such file or directory
2024-11-29T20:56:07.482510Z kvs.err[4]: content_load_completion: content_load_get: No such file or directory
root@flux-sample-4:/chonks# flux module stats content | jq
flux-proxy: flux_reactor_run: Success
flux-proxy: Sending SIGHUP to child processes
flux-module: content.stats-get: No such file or directory

And in color for readability:

image

Could that be a hint?

@vsoch
Copy link
Member Author

vsoch commented Nov 29, 2024

In case any of this is helpful (I'm running out of ideas):

flux module stats
root@flux-sample-3:/chonks# flux module stats -R content 
{
 "utime": 0.0,
 "stime": 0.00073499999999999998,
 "maxrss": 60784,
 "ixrss": 0,
 "idrss": 0,
 "isrss": 0,
 "minflt": 17,
 "majflt": 0,
 "nswap": 0,
 "inblock": 0,
 "oublock": 0,
 "msgsnd": 0,
 "msgrcv": 0,
 "nsignals": 0,
 "nvcsw": 15,
 "nivcsw": 0
}
root@flux-sample-3:/chonks# flux module list
Module                   Idle  S Sendq Recvq Service
kvs-watch                  20  R     0     0 
job-info                   20  R     0     0 
connector-local             0  R     0     0 
job-ingest                 20  R     0     0 
content                     7  R     0     0 
resource                   20  R     0     0 
kvs                        11  R     0     0 
barrier                    20  R     0     0 
root@flux-sample-3:/chonks# flux module stats kvs-watch
{
 "watchers": 0,
 "namespace-count": 0,
 "namespaces": {}
}
root@flux-sample-3:/chonks# flux module stats job-info
{
 "lookups": 0,
 "watchers": 0,
 "guest_watchers": 0,
 "update_lookups": 0,
 "update_watchers": 0
}
root@flux-sample-3:/chonks# flux module stats connector-local
{
 "tx": {
  "request": 49,
  "response": 1,
  "event": 0,
  "control": 0
 },
 "rx": {
  "request": 2,
  "response": 23,
  "event": 0,
  "control": 0
 }
}
root@flux-sample-3:/chonks# flux module stats job-ingest
{
 "pipeline": {
  "frobnicator": {
   "running": 0,
   "requests": 0,
   "errors": 0,
   "trash": 0,
   "backlog": 0,
   "pids": [
    0,
    0,
    0,
    0
   ]
  },
  "validator": {
   "running": 0,
   "requests": 0,
   "errors": 0,
   "trash": 0,
   "backlog": 0,
   "pids": [
    0,
    0,
    0,
    0
   ]
  }
 }
}
root@flux-sample-3:/chonks# flux module stats resource
{
 "tx": {
  "request": 4,
  "response": 0,
  "event": 0,
  "control": 0
 },
 "rx": {
  "request": 1,
  "response": 3,
  "event": 0,
  "control": 0
 }
}
root@flux-sample-3:/chonks# flux module stats kvs
{
 "cache": {
  "obj size total (MiB)": 0.00029468536376953125,
  "obj size (KiB)": {
   "count": 1,
   "min": 0.309,
   "mean": 0.309,
   "stddev": 0.0,
   "max": 0.309
  },
  "#obj dirty": 0,
  "#obj incomplete": 0,
  "#faults": 7
 },
 "namespace": {
  "primary": {
   "#versionwaiters": 0,
   "#no-op stores": 0,
   "#transactions": 0,
   "#readytransactions": 0,
   "store revision": 18
  }
 },
 "pending_requests": 0
}
root@flux-sample-3:/chonks# flux module stats barrier
{
 "tx": {
  "request": 3,
  "response": 0,
  "event": 0,
  "control": 0
 },
 "rx": {
  "request": 1,
  "response": 2,
  "event": 0,
  "control": 0
 }
}

@garlick
Copy link
Member

garlick commented Nov 29, 2024

Are there multiple candidates for a root cause here? I see

  • flux archive create failed a kvs commit with "No such file or directory"
  • a broker segfault

As always, it'd be great to get a backtrace from the segfault to see what happend there (and which broker is segfaulting - I'm guessing rank 0?)

The archive creation failure should be reproducible in isolation since, if done on rank 0, it doesn't involve any communication among brokers.

@vsoch
Copy link
Member Author

vsoch commented Nov 30, 2024

I'll see what I can do.

@vsoch
Copy link
Member Author

vsoch commented Nov 30, 2024

Some valgrind output - this first one (create) didn't seem to error:

root@flux-sample-0:/chonks#     echo "EVENT create-archive-${size}"
    time valgrind --leak-check=full flux archive create --name create-archive-${size} --dir /chonks ${size}gb.txt
EVENT create-archive-4
==101== Memcheck, a memory error detector
==101== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==101== Using Valgrind-3.22.0 and LibVEX; rerun with -h for copyright info
==101== Command: flux archive create --name create-archive-4 --dir /chonks 4gb.txt
==101== 
==101== Warning: set address range perms: large range [0x59c93000, 0x148346000) (defined)

==101== Warning: set address range perms: large range [0x59c93000, 0x148346000) (noaccess)
==101== 
==101== HEAP SUMMARY:
==101==     in use at exit: 911,171 bytes in 22,911 blocks
==101==   total heap usage: 170,022 allocs, 147,111 frees, 8,021,501,792 bytes allocated
==101== 
==101== 911,171 (40 direct, 911,131 indirect) bytes in 1 blocks are definitely lost in loss record 16 of 16
==101==    at 0x4846828: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==101==    by 0x4AD0287: json_array (in /usr/lib/x86_64-linux-gnu/libjansson.so.4.14.0)
==101==    by 0x121FBF: subcmd_create (archive.c:310)
==101==    by 0x1209D8: cmd_archive (archive.c:665)
==101==    by 0x1166DD: main (flux.c:235)
==101== 
==101== LEAK SUMMARY:
==101==    definitely lost: 40 bytes in 1 blocks
==101==    indirectly lost: 911,131 bytes in 22,910 blocks
==101==      possibly lost: 0 bytes in 0 blocks
==101==    still reachable: 0 bytes in 0 blocks
==101==         suppressed: 0 bytes in 0 blocks
==101== 
==101== For lists of detected and suppressed errors, rerun with: -s
==101== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)

real	3m11.981s
user	2m58.870s
sys	0m3.613s

That says there is one error, but I don't see it printed (about the kvs). I'm running these manually so I'll keep retrying - I'll likely hit a core dump at some point.

@garlick
Copy link
Member

garlick commented Nov 30, 2024

Don't bother running valgrind on flux-archive - the segfault is in the broker. At least that's what I thought? If you can find the conditions needed to recreate it then maybe we can do it under gdb or in circumstances where it'll be easier to retrieve the core.

@vsoch
Copy link
Member Author

vsoch commented Nov 30, 2024

Oh - so should valgrind wrap flux start?

@vsoch
Copy link
Member Author

vsoch commented Nov 30, 2024

okay - re-running with a wrapped to flux start - will report back if I see an issue!

@vsoch
Copy link
Member Author

vsoch commented Nov 30, 2024

okay - I can't get this information for you. When I wrap flux-start, I can see that the error occurs, but the lead broker doesn't seem to exit then because the next command continues on.

real	0m0.028s
user	0m0.002s
sys	0m0.004s
EVENT create-archive-10
flux-archive: kvs commit: No such file or directory

real	0m38.941s
user	0m14.778s
sys	0m11.250s
EVENT distribute-all-nodes-10
flux-archive: KVS lookup archive.create-archive-10: No such file or directory
flux-archive: KVS lookup archive.create-archive-10: No such file or directory
flux-archive: KVS lookup archive.create-archive-10: No such file or directory
flux-archive: KVS lookup archive.create-archive-10: No such file or directory
flux-archive: KVS lookup archive.create-archive-10: No such file or directory
[1-5]: Exit 1

That also hints that it was the workers (1-5) that had the error, but I don't see that they restarted either. This is why I was originally using valgrind to wrap the flux-archive command, because I thought I might see something there, but all that I see is what I posted above. I didn't see it paired with the KVS lookup error. It could just be that I haven't reproduced it yet with the right settings - I think I probably need valgrind to wrap the command in the script (the flux archive one) and then retrieve that output and core. For the broker exiting, I think that only happened when I explicitly logged in and did a flux exec to all the nodes to print the content module stats. That could be another error, but it's not the one I'm interested in.

Let me know what you'd like to do.

@vsoch
Copy link
Member Author

vsoch commented Nov 30, 2024

To summarize - this is what I think is going on:

  • The segfault happens for a broker, specifically a worker broker, when the command is issued that looks at the content state (what I was manually doing) and this triggers the restart. I could collect the core for this, potentially, but it's not the issue that I'm interested in.
  • The issue that I am interested in is the "kvs commit: no such file or directory" for which I don't see that the main broker exits, and when that flux-archive command emits, the lead broker seems to keep going - so wrapping flux start wouldn't help us here. It's an error that doesn't trigger a segfault.

At most, I think what I could try is wrapping the worker flux start with valgrind, and then having it loop so I can shell in and look for a core. I could also go back to the original method and put valgrind as a prefix in front of flux archive, but I did that earlier today and didn't see any core dumps.

@garlick
Copy link
Member

garlick commented Nov 30, 2024

The issue that I am interested in is the "kvs commit: no such file or directory" for which I don't see that the main broker exits, and when that flux-archive command emits, the lead broker seems to keep going - so wrapping flux start wouldn't help us here. It's an error that doesn't trigger a segfault.

Sounds like we should try to recreate the failing flux archive then and see if that creates the conditions for the broker segfault. AFAICS, you're just creating a flux archive with one big blob in it on rank 0. I've tried this up to 10g and not been able to recreate the failure.

O/T: although --mmap adds more constraints and failure modes (and it would be good to understand what is going on with the other failures before adding more!) it was designed as an optimization for archiving large files. You may want to include it in your study. See the WARNING box in flux-archive(1) though.

@vsoch
Copy link
Member Author

vsoch commented Nov 30, 2024

okay - I've been running the experiment with valgrind wrapping the flux archive commands for almost 5 hours, and I didn't reproduce. I also tested this once more:

At most, I think what I could try is wrapping the worker flux start with valgrind

And I can confirm that the workers do not segfault or restart, I'm not sure why the lead broker shows an Exit, but they don't report any exit / restart in their logs. The fact that it didn't happen with valgrind makes me wonder if it's something racy - notably valgrind slows everything down immensely, and maybe that slowdown is preventing the kvs commit issues because we wait a bit longer between operations?

@garlick
Copy link
Member

garlick commented Nov 30, 2024

Is there a way we could create a reproducer for just the flux archive failure? It seems like that ought to be reproducible with a size=1 instance since it's just flux archive create writing to the KVS, correct? Is it maybe something to do with content of the files you are putting in the archive? In my unsuccessful attempts to repro, I was just getting random bytes from /dev/urandom and trying to create an archive with one big "file" in it (up to 10G). There were hundreds of G available in /tmp.

@vsoch
Copy link
Member Author

vsoch commented Nov 30, 2024

Yeah - let me work on something. I just tried again (a simplified version in Kubernetes) and reproduced the error, always starting at size 4.

# Go up to 10 sizes for now (matches container size)
for size in $(seq 1 10)
  do
    # Create the archive and time it
    echo "EVENT create-archive-${size}"
    time  flux archive create --name create-archive-${size} --dir /chonks ${size}gb.txt

    # Distribute to all nodes, but to /tmp because file exists in /chonks
    echo "EVENT distribute-all-nodes-${size}"
    time flux exec -r all -x 0  flux archive extract --name create-archive-${size} -C /tmp
    sleep 2

    echo "EVENT delete-archive-all-nodes-${size}"
    time flux exec -r all  flux archive remove --name create-archive-${size} 2>/dev/null
    sleep 2

    # Cleanup all archive on all nodes
    echo "EVENT clean-all-nodes-${size}"
    time flux exec -r all rm -rf /tmp/${size}gb.txt
    sleep 2
done

# Show self!
echo "SCRIPT entrypoint"

I'll try to reproduce this in my docker image that is running that, give me a few minutes!

@vsoch
Copy link
Member Author

vsoch commented Nov 30, 2024

Running with a single container in docker doesn't reproduce it.

docker run -it ghcr.io/converged-computing/container-chonks:topology-flux-0.66.0

# The chonks are here
ls
# 10gb.txt  1gb.txt  2gb.txt  3gb.txt  4gb.txt  5gb.txt  6gb.txt  7gb.txt  8gb.txt  9gb.txt

# Start the broker with a different topology (but actually it doesn't matter these are faux nodes)
flux start  -Stbon.fanout=256  -Stbon.connect_timeout=5s -Stbon.topo=binomial  -Slog-stderr-level=0  -Slog-stderr-mode=local 

Run the script above:

for size in $(seq 1 10)
  do
    # Create the archive and time it
    echo "EVENT create-archive-${size}"
    time  flux archive create --name create-archive-${size} --dir /chonks ${size}gb.txt

    # Distribute to all nodes, but to /tmp because file exists in /chonks
    echo "EVENT distribute-all-nodes-${size}"
    time flux exec -r all   flux archive extract --name create-archive-${size} -C /tmp
    sleep 2

    echo "EVENT delete-archive-all-nodes-${size}"
    time flux exec -r all  flux archive remove --name create-archive-${size} 2>/dev/null
    sleep 2

    # Cleanup all archive on all nodes
    echo "EVENT clean-all-nodes-${size}"
    time flux exec -r all rm -rf /tmp/${size}gb.txt
    sleep 2
done

(everything works). I will try to make an example in kind, which (don't worry) is very easy to use. We need to have separate nodes or at least containers I think.

@vsoch
Copy link
Member Author

vsoch commented Nov 30, 2024

This didn't reproduce either 😞 , but I'll include for comprehensiveness. Here is where to install kind - "Kubernetes in Docker" - I'm guessing you already have docker. Then:

# Create your cluster  (4 nodes, one container each, be wary of disk space)
wget https://raw.githubusercontent.com/converged-computing/flux-distribute/refs/heads/main/topology/kind-experiment/kind-config.yaml
kind create cluster --config kind-config.yaml

# Run this so kubectl does autocompletion with tab
source <(kubectl completion bash)

# Install the flux operator
kubectl apply -f https://raw.githubusercontent.com/flux-framework/flux-operator/refs/heads/main/examples/dist/flux-operator.yaml

# This is how to see logs - you should not see anything that looks erreous
pod_name=$(kubectl get pods -n operator-system -o json | jq -r .items[0].metadata.name)
kubectl logs -n operator-system ${pod_name}
What a correct / working Flux Operator log looks like
1.7330086743293536e+09	INFO	controller-runtime.metrics	Metrics server is starting to listen	{"addr": "127.0.0.1:8080"}
1.733008674329579e+09	INFO	setup	starting manager
1.7330086743296938e+09	INFO	Starting server	{"path": "/metrics", "kind": "metrics", "addr": "127.0.0.1:8080"}
1.7330086743296936e+09	INFO	Starting server	{"kind": "health probe", "addr": "[::]:8081"}
I1130 23:17:54.329739       1 leaderelection.go:248] attempting to acquire leader lease operator-system/14dde902.flux-framework.org...
I1130 23:17:54.332808       1 leaderelection.go:258] successfully acquired lease operator-system/14dde902.flux-framework.org
1.7330086743328419e+09	DEBUG	events	Normal	{"object": {"kind":"Lease","namespace":"operator-system","name":"14dde902.flux-framework.org","uid":"60570642-3c34-463b-ad4d-d8dc24244a92","apiVersion":"coordination.k8s.io/v1","resourceVersion":"1084"}, "reason": "LeaderElection", "message": "operator-controller-manager-b69fd5b7d-tlpkz_79a6000a-643e-400a-9fa7-288a377261ef became leader"}
1.7330086743329332e+09	INFO	Starting EventSource	{"controller": "minicluster", "controllerGroup": "flux-framework.org", "controllerKind": "MiniCluster", "source": "kind source: *v1alpha2.MiniCluster"}
1.7330086743329582e+09	INFO	Starting EventSource	{"controller": "minicluster", "controllerGroup": "flux-framework.org", "controllerKind": "MiniCluster", "source": "kind source: *v1.Ingress"}
1.7330086743329635e+09	INFO	Starting EventSource	{"controller": "minicluster", "controllerGroup": "flux-framework.org", "controllerKind": "MiniCluster", "source": "kind source: *v1.Job"}
1.7330086743329725e+09	INFO	Starting EventSource	{"controller": "minicluster", "controllerGroup": "flux-framework.org", "controllerKind": "MiniCluster", "source": "kind source: *v1.Secret"}
1.733008674333002e+09	INFO	Starting EventSource	{"controller": "minicluster", "controllerGroup": "flux-framework.org", "controllerKind": "MiniCluster", "source": "kind source: *v1.Service"}
1.7330086743330076e+09	INFO	Starting EventSource	{"controller": "minicluster", "controllerGroup": "flux-framework.org", "controllerKind": "MiniCluster", "source": "kind source: *v1.Pod"}
1.7330086743330116e+09	INFO	Starting EventSource	{"controller": "minicluster", "controllerGroup": "flux-framework.org", "controllerKind": "MiniCluster", "source": "kind source: *v1.ConfigMap"}
1.733008674333016e+09	INFO	Starting EventSource	{"controller": "minicluster", "controllerGroup": "flux-framework.org", "controllerKind": "MiniCluster", "source": "kind source: *v1.Job"}
1.733008674333019e+09	INFO	Starting Controller	{"controller": "minicluster", "controllerGroup": "flux-framework.org", "controllerKind": "MiniCluster"}
1.7330086744340837e+09	INFO	Starting workers	{"controller": "minicluster", "controllerGroup": "flux-framework.org", "controllerKind": "MiniCluster", "worker count": 1}

Here is how to inspect your nodes / pods (you don't have any pods yet)

kubectl get nodes
kubectl get pods
# see the nodes pods are assigned to
kubectl get pods -o wide

Now we have our operator, let's create the cluster. This is in "interactive" mode so it won't run anything - it will sit for you to shell in and do whatever you like until you delete it.

kubectl apply -f https://gist.githubusercontent.com/vsoch/b4af4f455b96398b348a631f55fdf771/raw/483012120ef22f43de73663d4ee3b7e3ef0ed680/minicluster.yaml

That's going to take a bit to pull because the container is big. You can use --watch to wait for it to from Init to PodInitializing to Running.

$ kubectl get pods --watch
NAME                  READY   STATUS     RESTARTS   AGE
flux-sample-0-62nzm   0/1     Init:0/1   0          5s
flux-sample-1-qb7gj   0/1     Init:0/1   0          5s
flux-sample-2-g8clx   0/1     Init:0/1   0          5s
flux-sample-3-w54sd   0/1     Init:0/1   0          5s

Init is usually when flux is installed on the fly, which isn't done here because we don't have a view. But it's still configured. When it's running (takes 15 minutes on my machine) you can shell in to the lead broker.

lead_broker=$(kubectl get pods -o json | jq -r .items[0].metadata.name)
kubectl exec -it ${lead_broker} bash

Once you are in the container, you need to connect to the socket. This is a container where the view (automated install of flux) is disabled in favor of the updated version with your Segfault fix, so we can't use the easy interfaces / environment to shell in. Instead:

flux proxy local:///mnt/flux/view/run/flux/local bash
root@flux-sample-0:/chonks# flux resource list
     STATE NNODES NCORES NGPUS NODELIST
      free      4     32     0 flux-sample-[0-3]
 allocated      0      0     0 
      down      0      0     0 

Should have a binomial overlay topology

0 flux-sample-0: full
├─ 1 flux-sample-1: full
└─ 2 flux-sample-2: full
   └─ 3 flux-sample-3: full

Run the experiment:

for size in $(seq 1 10)
  do
    # Create the archive and time it
    echo "EVENT create-archive-${size}"
    time  flux archive create --name create-archive-${size} --dir /chonks ${size}gb.txt

    # Distribute to all nodes, but to /tmp because file exists in /chonks
    echo "EVENT distribute-all-nodes-${size}"
    time flux exec -r all   flux archive extract --name create-archive-${size} -C /tmp
    sleep 2

    echo "EVENT delete-archive-all-nodes-${size}"
    time flux exec -r all  flux archive remove --name create-archive-${size} 2>/dev/null
    sleep 2

    # Cleanup all archive on all nodes
    echo "EVENT clean-all-nodes-${size}"
    time flux exec -r all rm -rf /tmp/${size}gb.txt
    sleep 2
done

It could be the number of nodes - I am only doing 4 now, I'll try updating to 6. I hope it's not something specific to Kubernetes in the cloud on actual nodes - I could probably debug that a bit by trying on AWS (where we have hpc instances with single tenancy). Ug. I guess I'll keep trying and update you here.

@vsoch
Copy link
Member Author

vsoch commented Dec 1, 2024

I did 3x on kind, did not reproduce.

@vsoch
Copy link
Member Author

vsoch commented Dec 1, 2024

Wow, so I'm running on AWS, and... no bug! 🤯 Just to add more to the mystery. So this looks to be something that happens exclusively on GKE.

create-archive-nodes-6
distribute-all-nodes-nodes-6
clean-all-nodes-nodes-6

Is it worth testing --mmap as well? Will the performance be better, despite the caveats?

@garlick
Copy link
Member

garlick commented Dec 1, 2024

Up to you. --mmap should reduce the time needed to create the archive and reduce the backing store space usage.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants