kvs commit: no such file or directory #6463
If archive creation is failing, I would think that would be localized to rank 0 and the overlay config shouldn't matter? It's odd that it's not reproducible for any topology then, unless the tests are run repeatedly on the same instance and the error occurs after some build-up of state. A first thing to check would be if the rank 0 backing store file system, usually in /tmp, is running out of space. Whatever it turns out to be, this error message is decidedly unhelpful and perhaps actively misleading so there's a bug to fix here. |
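A minimal sketch of the free-space check suggested above. It assumes the backing store lives under /tmp, as mentioned; the 1 GiB threshold and the function names are illustrative only, and the directory should be swapped for the broker's statedir if one is configured.

```shell
#!/usr/bin/env bash
# Report available space (KiB) on the file system that holds a directory.
free_kib() {
    # -P forces one line per file system; column 4 is "Available"
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Warn when the available space drops below a threshold (KiB).
check_free_space() {
    local dir=$1 min_kib=$2 avail
    avail=$(free_kib "$dir")
    if [ -z "$avail" ]; then
        echo "could not read free space for ${dir}" >&2
        return 2
    fi
    if [ "$avail" -lt "$min_kib" ]; then
        echo "LOW: ${dir} has ${avail} KiB free (wanted at least ${min_kib})"
        return 1
    fi
    echo "OK: ${dir} has ${avail} KiB free"
}

# Example threshold of roughly 1 GiB (an arbitrary choice for illustration)
check_free_space /tmp 1048576 || true
```

Running this on rank 0 before and after each archive create would show whether the failures line up with the file system filling.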
I'm running these in sequence on the same cluster - and (strangely) I haven't reproduced yet for kary-3. These are being run in pods -> containers in Kubernetes, and all environments are consistent (the same exact container being deployed on the same nodes) and the only difference was the topology being used. The pod setup means that the entire thing is nuked and re-created with each run (there is no persisting temporary file system). I suspect it's something more ephemeral that might be hard to pin down. What I'm doing is running the same experiment again, and I'm carefully monitoring each log as it is generated. As soon as I see this reproduced (and I hope I do) I have another interactive shell that is ready to look at flux dmesg and the command to show state. |
If you want to see how the experiments work, all of the files to reproduce are here. Let me know if you have questions! The basic idea is that we build a container with those blobs, deploy it to 6 nodes in a GKE (Google Kubernetes Engine) cluster, and then the Flux Operator (which creates a Flux MiniCluster) is deployed with an entrypoint that generates a custom experiment script (which comes down to a bash script; you can see one printed at the end of the experiment run, here). For the experiment, we loop through the sizes, each time creating the archive, using some mechanism to distribute it, and then cleaning up. I'm trying a few simple distribution mechanisms on each topology: distributing across all the nodes at once, and a two-step distribute that sends from the root to the middle nodes and then from the middle nodes to the leaves. The idea is that I can compare the two-step approach to distributing to all nodes (and maybe this isn't necessary to test because it's unlikely to be an improvement). I am also doing a |
Got it!
2024-11-29T20:17:55.469647Z sched-simple.debug[0]: resource update: {"resources":{"version":1,"execution":{"R_lite":[{"rank":"0-5","children":{"core":"0-15"}}],"starttime":0.0,"expiration":0.0,"nodelist":["flux-sample-[0-5]"]}},"up":""}
2024-11-29T20:17:55.469705Z job-manager.debug[0]: scheduler: hello
2024-11-29T20:17:55.469802Z job-manager.debug[0]: scheduler: ready limited
2024-11-29T20:17:55.469863Z sched-simple.debug[0]: ready: 0 of 96 cores:
2024-11-29T20:17:55.559711Z broker.info[0]: rc1.0: /etc/flux/rc1 Exited (rc=0) 0.3s
2024-11-29T20:17:55.559786Z broker.info[0]: rc1-success: init->quorum 0.297064s
2024-11-29T20:17:55.660192Z broker.info[0]: online: flux-sample-0 (ranks 0)
2024-11-29T20:17:55.660239Z sched-simple.debug[0]: resource update: {"up":"0"}
2024-11-29T20:17:56.048426Z broker.debug[0]: accepting connection from flux-sample-1 (rank 1) status partial
2024-11-29T20:17:56.309434Z sched-simple.debug[0]: resource update: {"up":"1"}
2024-11-29T20:17:56.699960Z sched-simple.debug[0]: resource update: {"up":"2"}
2024-11-29T20:17:56.889295Z sched-simple.debug[0]: resource update: {"up":"3"}
2024-11-29T20:17:57.078709Z sched-simple.debug[0]: resource update: {"up":"4"}
2024-11-29T20:17:57.267006Z broker.info[0]: online: flux-sample-[0-5] (ranks 0-5)
2024-11-29T20:17:57.267021Z broker.info[0]: quorum-full: quorum->run 1.70723s
2024-11-29T20:17:57.267043Z sched-simple.debug[0]: resource update: {"up":"5"}
2024-11-29T20:17:57.281339Z broker.debug[0]: rmmod content
2024-11-29T20:17:57.281545Z broker.debug[0]: module content exited
2024-11-29T20:17:57.282119Z broker.debug[0]: insmod content
2024-11-29T20:19:20.392899Z kvs.err[0]: content_load_completion: content_load_get: No such file or directory
2024-11-29T20:19:20.456611Z kvs.err[0]: content_load_completion: content_load_get: No such file or directory
2024-11-29T20:19:20.476759Z kvs.err[0]: content_load_completion: content_load_get: No such file or directory
2024-11-29T20:19:32.509589Z kvs.err[0]: content_load_completion: content_load_get: No such file or directory
2024-11-29T20:19:32.564534Z kvs.err[0]: content_load_completion: content_load_get: No such file or directory
2024-11-29T20:19:32.583773Z kvs.err[0]: content_load_completion: content_load_get: No such file or directory
2024-11-29T20:19:44.271318Z kvs.err[0]: content_load_completion: content_load_get: No such file or directory
2024-11-29T20:19:44.339905Z kvs.err[0]: content_load_completion: content_load_get: No such file or directory
2024-11-29T20:19:44.360034Z kvs.err[0]: content_load_completion: content_load_get: No such file or directory
2024-11-29T20:20:00.163049Z kvs.err[0]: content_load_completion: content_load_get: No such file or directory
2024-11-29T20:20:00.227368Z kvs.err[0]: content_load_completion: content_load_get: No such file or directory
2024-11-29T20:20:00.247890Z kvs.err[0]: content_load_completion: content_load_get: No such file or directory
root@flux-sample-0:/chonks# flux module stats content | jq
{
"count": 20,
"valid": 20,
"dirty": 20,
"size": 3074987,
"flush-batch-count": 0,
"mmap": {
"tags": {},
"blobs": 0
}
}
I do think the broker might have exited? At least in my interactive console there was the Exit:
root@flux-sample-0:/chonks# flux resource list
STATE NNODES NCORES NGPUS NODELIST
free 6 96 0 flux-sample-[0-5]
allocated 0 0 0
down 0 0 0
root@flux-sample-0:/chonks# flux dmesg
flux-proxy: Lost connection to Flux
flux-proxy: Sending SIGHUP to child processes
root@flux-sample-0:/chonks# command terminated with exit code 137
Let me know if there is something I can look at - the cluster will self terminate in about 15 mins. |
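The kvs.err lines above arrive in groups of three, a few seconds apart. As a rough sketch for lining those bursts up with the experiment's steps, the helper below counts kvs.err lines per second. It assumes the `flux dmesg` line format shown in this thread (ISO timestamp, then `module.level[rank]:`); the function name is made up for illustration.

```shell
#!/usr/bin/env bash
# Count kvs.err log lines per second of wall-clock time.
# Reads dmesg-style lines on stdin, prints "<timestamp> <count>" sorted by time.
kvs_err_bursts() {
    awk '$2 ~ /^kvs\.err\[/ { count[substr($1, 1, 19)]++ }
         END { for (ts in count) print ts, count[ts] }' | sort
}

# Example with lines copied from the log above
kvs_err_bursts <<'EOF'
2024-11-29T20:17:57.282119Z broker.debug[0]: insmod content
2024-11-29T20:19:20.392899Z kvs.err[0]: content_load_completion: content_load_get: No such file or directory
2024-11-29T20:19:20.456611Z kvs.err[0]: content_load_completion: content_load_get: No such file or directory
2024-11-29T20:19:20.476759Z kvs.err[0]: content_load_completion: content_load_get: No such file or directory
2024-11-29T20:19:32.509589Z kvs.err[0]: content_load_completion: content_load_get: No such file or directory
EOF
```

Inside a live instance the same function could be fed with `flux dmesg | kvs_err_bursts`.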
It seems strange that the size remains big - I wonder if there is some initial error with cleaning (removing) an archive, and then the rest result from it spilling over (if that is possible)? I would expect the size to go to zero as I create (and remove) archives.
Ahh this is telling! The ranks are 💀
So whatever happens, it kills the workers (but the lead broker persists), likely because it restarts. |
Oh interesting! So the workers are segfaulting. Here is rank 1.
🌀 flux start -o --config /mnt/flux/view/etc/flux/config -Scron.directory=/etc/flux/system/cron.d -Stbon.fanout=256 -Srundir=/mnt/flux/view/run/flux -Sstatedir=/mnt/flux/view/var/lib/flux -Slocal-uri=local:///mnt/flux/view/run/flux/local -Stbon.connect_timeout=5s -Stbon.topo=kary:1 -Slog-stderr-level=0 -Slog-stderr-mode=local
/flux_operator/wait-0.sh: line 199: 32 Segmentation fault (core dumped) flux start -o --config ${viewroot}/etc/flux/config ${brokerOptions}
Return value for follower worker is 139
😪 Sleeping 15s to try again...
/flux_operator/wait-0.sh: line 199: 113 Segmentation fault (core dumped) flux start -o --config ${viewroot}/etc/flux/config ${brokerOptions}
Return value for follower worker is 139
😪 Sleeping 15s to try again...
/flux_operator/wait-0.sh: line 199: 143 Segmentation fault (core dumped) flux start -o --config ${viewroot}/etc/flux/config ${brokerOptions}
Return value for follower worker is 139
😪 Sleeping 15s to try again...
/flux_operator/wait-0.sh: line 199: 180 Segmentation fault (core dumped) flux start -o --config ${viewroot}/etc/flux/config ${brokerOptions}
Return value for follower worker is 139
😪 Sleeping 15s to try again...
/flux_operator/wait-0.sh: line 199: 210 Segmentation fault (core dumped) flux start -o --config ${viewroot}/etc/flux/config ${brokerOptions}
Return value for follower worker is 139
😪 Sleeping 15s to try again... |
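For reading the return values in this thread: a shell reports a process killed by a signal as 128 plus the signal number, so 139 is signal 11 (SIGSEGV) and the 137 seen earlier is signal 9 (SIGKILL). A small sketch of that arithmetic follows; the function name is illustrative.

```shell
#!/usr/bin/env bash
# Translate a shell exit status into a readable description.
# Statuses above 128 conventionally mean "terminated by signal (status - 128)".
decode_exit() {
    local rc=$1
    if [ "$rc" -gt 128 ]; then
        local sig=$((rc - 128))
        echo "killed by signal ${sig} ($(kill -l "$sig"))"
    else
        echo "exited with status ${rc}"
    fi
}

decode_exit 139   # killed by signal 11 (SEGV)
decode_exit 137   # killed by signal 9 (KILL)
```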
okay - I'm going to build a throwaway variant of the flux operator that (on fail) adds a valgrind prefix to the broker start. That should minimally give us the terminal output, and if I'm fast enough I can copy the core dump from the node too. |
okay! The new loop for a worker is running:
vg=""
while true
do
${vg} flux start -o --config ${viewroot}/etc/flux/config ${brokerOptions}
retval=$?
if [[ "${retval}" -eq 0 ]] || [[ "false" == "true" ]]; then
echo "The follower worker exited cleanly. Goodbye!"
break
fi
echo "Return value for follower worker is ${retval}"
vg="valgrind --leak-check=full"
echo "😪 Sleeping 15s to try again..."
sleep 15
done
Watching! 👁️ |
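Since the plan depends on a core file actually being written, it may help to check the core dump settings in the worker container before starting the loop. This is a sketch under the assumption of a Linux container; note that `core_pattern` is a kernel setting of the host node, so a pattern beginning with `|` means the dump is piped to a handler on the node rather than written inside the pod.

```shell
#!/usr/bin/env bash
# Print the settings that decide whether and where a core file is written.
show_core_settings() {
    echo "core size limit: $(ulimit -c)"
    if [ -r /proc/sys/kernel/core_pattern ]; then
        echo "core pattern: $(cat /proc/sys/kernel/core_pattern)"
    else
        echo "core pattern: unavailable"
    fi
}

# Try to lift the soft limit for this shell and its children (may be refused
# if the hard limit is lower), then show the result.
ulimit -c unlimited 2>/dev/null || echo "could not raise core size limit"
show_core_settings
```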
okay so this is interesting - the initial error does not kill the other brokers. It's actually the check of the state that is doing it:
I didn't see the broker restart until I issued that. I saw the error happen before that, but it didn't restart the brokers. So I don't think I can get a core dump for this - it won't trigger on its own, and when I trigger it, the valgrind output cuts off (and I can't find a dump).
🌀 flux start -o --config /mnt/flux/view/etc/flux/config -Scron.directory=/etc/flux/system/cron.d -Stbon.fanout=256 -Srundir=/mnt/flux/view/run/flux -Sstatedir=/mnt/flux/view/var/lib/flux -Slocal-uri=local:///mnt/flux/view/run/flux/local -Stbon.connect_timeout=5s -Stbon.topo=kary:1 -Slog-stderr-level=0 -Slog-stderr-mode=local
flux-broker: Warning: unable to resolve upstream peer flux-sample-3.flux-service.default.svc.cluster.local: Name or service not known
/flux_operator/wait-0.sh: line 203: 31 Segmentation fault (core dumped) ${vg} flux start -o --config ${viewroot}/etc/flux/config ${brokerOptions}
Return value for follower worker is 139
😪 Sleeping 15s to try again...
==105== Memcheck, a memory error detector
==105== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==105== Using Valgrind-3.22.0 and LibVEX; rerun with -h for copyright info
==105== Command: flux start -o --config /mnt/flux/view/etc/flux/config -Scron.directory=/etc/flux/system/cron.d -Stbon.fanout=256 -Srundir=/mnt/flux/view/run/flux -Sstatedir=/mnt/flux/view/var/lib/flux -Slocal-uri=local:///mnt/flux/view/run/flux/local -Stbon.connect_timeout=5s -Stbon.topo=kary:1 -Slog-stderr-level=0 -Slog-stderr-mode=local
==105==
/flux_operator/wait-0.sh: line 203: 105 Segmentation fault (core dumped) ${vg} flux start -o --config ${viewroot}/etc/flux/config ${brokerOptions}
Return value for follower worker is 139
😪 Sleeping 15s to try again...
==142== Memcheck, a memory error detector
==142== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==142== Using Valgrind-3.22.0 and LibVEX; rerun with -h for copyright info
==142== Command: flux start -o --config /mnt/flux/view/etc/flux/config -Scron.directory=/etc/flux/system/cron.d -Stbon.fanout=256 -Srundir=/mnt/flux/view/run/flux -Sstatedir=/mnt/flux/view/var/lib/flux -Slocal-uri=local:///mnt/flux/view/run/flux/local -Stbon.connect_timeout=5s -Stbon.topo=kary:1 -Slog-stderr-level=0 -Slog-stderr-mode=local
==142==
What I tried doing instead is shelling into the worker node directly to run the content command there, and the output is slightly different:
And in color for readability: Could that be a hint? |
In case any of this is helpful (I'm running out of ideas):
flux module stats
root@flux-sample-3:/chonks# flux module stats -R content
{
"utime": 0.0,
"stime": 0.00073499999999999998,
"maxrss": 60784,
"ixrss": 0,
"idrss": 0,
"isrss": 0,
"minflt": 17,
"majflt": 0,
"nswap": 0,
"inblock": 0,
"oublock": 0,
"msgsnd": 0,
"msgrcv": 0,
"nsignals": 0,
"nvcsw": 15,
"nivcsw": 0
}
root@flux-sample-3:/chonks# flux module list
Module Idle S Sendq Recvq Service
kvs-watch 20 R 0 0
job-info 20 R 0 0
connector-local 0 R 0 0
job-ingest 20 R 0 0
content 7 R 0 0
resource 20 R 0 0
kvs 11 R 0 0
barrier 20 R 0 0
root@flux-sample-3:/chonks# flux module stats kvs-watch
{
"watchers": 0,
"namespace-count": 0,
"namespaces": {}
}
root@flux-sample-3:/chonks# flux module stats job-info
{
"lookups": 0,
"watchers": 0,
"guest_watchers": 0,
"update_lookups": 0,
"update_watchers": 0
}
root@flux-sample-3:/chonks# flux module stats connector-local
{
"tx": {
"request": 49,
"response": 1,
"event": 0,
"control": 0
},
"rx": {
"request": 2,
"response": 23,
"event": 0,
"control": 0
}
}
root@flux-sample-3:/chonks# flux module stats job-ingest
{
"pipeline": {
"frobnicator": {
"running": 0,
"requests": 0,
"errors": 0,
"trash": 0,
"backlog": 0,
"pids": [
0,
0,
0,
0
]
},
"validator": {
"running": 0,
"requests": 0,
"errors": 0,
"trash": 0,
"backlog": 0,
"pids": [
0,
0,
0,
0
]
}
}
}
root@flux-sample-3:/chonks# flux module stats resource
{
"tx": {
"request": 4,
"response": 0,
"event": 0,
"control": 0
},
"rx": {
"request": 1,
"response": 3,
"event": 0,
"control": 0
}
}
root@flux-sample-3:/chonks# flux module stats kvs
{
"cache": {
"obj size total (MiB)": 0.00029468536376953125,
"obj size (KiB)": {
"count": 1,
"min": 0.309,
"mean": 0.309,
"stddev": 0.0,
"max": 0.309
},
"#obj dirty": 0,
"#obj incomplete": 0,
"#faults": 7
},
"namespace": {
"primary": {
"#versionwaiters": 0,
"#no-op stores": 0,
"#transactions": 0,
"#readytransactions": 0,
"store revision": 18
}
},
"pending_requests": 0
}
root@flux-sample-3:/chonks# flux module stats barrier
{
"tx": {
"request": 3,
"response": 0,
"event": 0,
"control": 0
},
"rx": {
"request": 1,
"response": 2,
"event": 0,
"control": 0
}
}
|
Are there multiple candidates for a root cause here? I see
As always, it'd be great to get a backtrace from the segfault to see what happened there (and which broker is segfaulting - I'm guessing rank 0?) The archive creation failure should be reproducible in isolation since, if done on rank 0, it doesn't involve any communication among brokers. |
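If a core file does turn up, one way to pull the requested backtrace out of it non-interactively is gdb's batch mode. This is only a sketch: the function name is illustrative, and the binary and core paths in the usage line are placeholders, since the location of the broker executable and of any core file in this deployment is not stated in the thread.

```shell
#!/usr/bin/env bash
# Print backtraces for all threads recorded in a core file.
# Arguments: path to the executable that crashed, path to the core file.
print_backtrace() {
    local binary=$1 core=$2
    if ! command -v gdb >/dev/null 2>&1; then
        echo "gdb is not installed" >&2
        return 127
    fi
    if [ ! -r "$core" ]; then
        echo "no readable core file at ${core}" >&2
        return 1
    fi
    # -batch exits after running the -ex commands
    gdb -batch -ex "thread apply all bt" "$binary" "$core"
}

# Placeholder paths, to be replaced with the real broker binary and core file:
# print_backtrace /path/to/flux-broker /path/to/core
```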
I'll see what I can do. |
Some valgrind output - this first one (create) didn't seem to error:
root@flux-sample-0:/chonks# echo "EVENT create-archive-${size}"
time valgrind --leak-check=full flux archive create --name create-archive-${size} --dir /chonks ${size}gb.txt
EVENT create-archive-4
==101== Memcheck, a memory error detector
==101== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==101== Using Valgrind-3.22.0 and LibVEX; rerun with -h for copyright info
==101== Command: flux archive create --name create-archive-4 --dir /chonks 4gb.txt
==101==
==101== Warning: set address range perms: large range [0x59c93000, 0x148346000) (defined)
==101== Warning: set address range perms: large range [0x59c93000, 0x148346000) (noaccess)
==101==
==101== HEAP SUMMARY:
==101== in use at exit: 911,171 bytes in 22,911 blocks
==101== total heap usage: 170,022 allocs, 147,111 frees, 8,021,501,792 bytes allocated
==101==
==101== 911,171 (40 direct, 911,131 indirect) bytes in 1 blocks are definitely lost in loss record 16 of 16
==101== at 0x4846828: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==101== by 0x4AD0287: json_array (in /usr/lib/x86_64-linux-gnu/libjansson.so.4.14.0)
==101== by 0x121FBF: subcmd_create (archive.c:310)
==101== by 0x1209D8: cmd_archive (archive.c:665)
==101== by 0x1166DD: main (flux.c:235)
==101==
==101== LEAK SUMMARY:
==101== definitely lost: 40 bytes in 1 blocks
==101== indirectly lost: 911,131 bytes in 22,910 blocks
==101== possibly lost: 0 bytes in 0 blocks
==101== still reachable: 0 bytes in 0 blocks
==101== suppressed: 0 bytes in 0 blocks
==101==
==101== For lists of detected and suppressed errors, rerun with: -s
==101== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
real 3m11.981s
user 2m58.870s
sys 0m3.613s
That says there is one error, but I don't see it printed (about the kvs). I'm running these manually so I'll keep retrying - I'll likely hit a core dump at some point. |
Don't bother running valgrind on |
Oh - so should valgrind wrap flux start? |
okay - re-running with valgrind wrapped around flux start - will report back if I see an issue! |
okay - I can't get this information for you. When I wrap flux-start, I can see that the error occurs, but the lead broker doesn't seem to exit then because the next command continues on.
That also hints that it was the workers (1-5) that had the error, but I don't see that they restarted either. This is why I was originally using valgrind to wrap the flux-archive command, because I thought I might see something there, but all that I see is what I posted above. I didn't see it paired with the KVS lookup error. It could just be that I haven't reproduced it yet with the right settings - I think I probably need valgrind to wrap the command in the script (the Let me know what you'd like to do. |
To summarize - this is what I think is going on:
At most, I think what I could try is wrapping the worker |
Sounds like we should try to recreate the failing O/T: although |
okay - I've been running the experiment with valgrind wrapping the flux archive commands for almost 5 hours, and I didn't reproduce. I also tested this once more:
And I can confirm that the workers do not segfault or restart, I'm not sure why the lead broker shows an Exit, but they don't report any exit / restart in their logs. The fact that it didn't happen with valgrind makes me wonder if it's something racy - notably valgrind slows everything down immensely, and maybe that slowdown is preventing the kvs commit issues because we wait a bit longer between operations? |
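One way to probe the timing hypothesis above would be to repeat a single create/remove cycle while varying the pause between the two steps. The sketch below reuses the `flux archive create` and `flux archive remove` invocations that appear elsewhere in this thread; the `probe-` archive name, the `FLUX` variable, and the delay values are assumptions made for illustration. By default it only echoes the commands, so it can be dry-run outside a Flux instance.

```shell
#!/usr/bin/env bash
# FLUX defaults to "echo flux" (dry run). Set FLUX=flux inside a real instance.
FLUX=${FLUX:-echo flux}

# Run one create/remove cycle per delay value.
# Arguments: chonk size in GB, then one or more delays in seconds.
probe_delays() {
    local size=$1
    shift
    local delay
    for delay in "$@"; do
        echo "EVENT delay-${delay}"
        # ${FLUX} is left unquoted on purpose so "echo flux" splits into two words
        ${FLUX} archive create --name "probe-${size}" --dir /chonks "${size}gb.txt" || return 1
        sleep "${delay}"
        ${FLUX} archive remove --name "probe-${size}" || return 1
    done
}

# Example: size 4 (where the error was first seen), with no pause and a 1s pause
probe_delays 4 0 1
```

If the error shows up only for the shortest delays, that would support the idea of a race rather than a size limit.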
Is there a way we could create a reproducer for just the |
Yeah - let me work on something. I just tried again (a simplified version in Kubernetes) and reproduced the error, always starting at size 4.
I'll try to reproduce this in my docker image that is running that, give me a few minutes! |
Running with a single container in docker doesn't reproduce it.
docker run -it ghcr.io/converged-computing/container-chonks:topology-flux-0.66.0
# The chonks are here
ls
# 10gb.txt 1gb.txt 2gb.txt 3gb.txt 4gb.txt 5gb.txt 6gb.txt 7gb.txt 8gb.txt 9gb.txt
# Start the broker with a different topology (but actually it doesn't matter these are faux nodes)
flux start -Stbon.fanout=256 -Stbon.connect_timeout=5s -Stbon.topo=binomial -Slog-stderr-level=0 -Slog-stderr-mode=local
Run the script above:
for size in $(seq 1 10)
do
# Create the archive and time it
echo "EVENT create-archive-${size}"
time flux archive create --name create-archive-${size} --dir /chonks ${size}gb.txt
# Distribute to all nodes, but to /tmp because file exists in /chonks
echo "EVENT distribute-all-nodes-${size}"
time flux exec -r all flux archive extract --name create-archive-${size} -C /tmp
sleep 2
echo "EVENT delete-archive-all-nodes-${size}"
time flux exec -r all flux archive remove --name create-archive-${size} 2>/dev/null
sleep 2
# Clean up the extracted file on all nodes
echo "EVENT clean-all-nodes-${size}"
time flux exec -r all rm -rf /tmp/${size}gb.txt
sleep 2
done
Everything works. I will try to make an example in kind, which (don't worry) is very easy to use. I think we need to have separate nodes, or at least separate containers. |
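The script above prints an `EVENT <name>` line before each timed step, so a failed step shows up as a suspiciously short `real` time. A small sketch for pairing each event with its wall-clock time is below. It assumes the output was captured with stderr included (bash's `time` keyword writes to stderr), and the function name is illustrative.

```shell
#!/usr/bin/env bash
# Pair each "EVENT <name>" line with the next "real <time>" line.
# Reads the captured experiment output on stdin.
event_times() {
    awk '
        $1 == "EVENT" { event = $2 }
        $1 == "real" && event != "" { print event, $2; event = "" }
    '
}

# Example using values from the valgrind run earlier in this thread
printf 'EVENT create-archive-4\nreal\t3m11.981s\nuser\t2m58.870s\nsys\t0m3.613s\n' | event_times
```

A full run could be captured with something like `bash experiment.sh 2>&1 | tee run.log` (the script name is a placeholder) and then summarized with `event_times < run.log`.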
This didn't reproduce either 😞, but I'll include it for completeness. Here is where to install kind - "Kubernetes in Docker" - I'm guessing you already have docker. Then:
# Create your cluster (4 nodes, one container each, be wary of disk space)
wget https://raw.githubusercontent.com/converged-computing/flux-distribute/refs/heads/main/topology/kind-experiment/kind-config.yaml
kind create cluster --config kind-config.yaml
# Run this so kubectl does autocompletion with tab
source <(kubectl completion bash)
# Install the flux operator
kubectl apply -f https://raw.githubusercontent.com/flux-framework/flux-operator/refs/heads/main/examples/dist/flux-operator.yaml
# This is how to see logs - you should not see anything that looks erroneous
pod_name=$(kubectl get pods -n operator-system -o json | jq -r '.items[0].metadata.name')
kubectl logs -n operator-system ${pod_name}
What a correct / working Flux Operator log looks like
1.7330086743293536e+09 INFO controller-runtime.metrics Metrics server is starting to listen {"addr": "127.0.0.1:8080"}
1.733008674329579e+09 INFO setup starting manager
1.7330086743296938e+09 INFO Starting server {"path": "/metrics", "kind": "metrics", "addr": "127.0.0.1:8080"}
1.7330086743296936e+09 INFO Starting server {"kind": "health probe", "addr": "[::]:8081"}
I1130 23:17:54.329739 1 leaderelection.go:248] attempting to acquire leader lease operator-system/14dde902.flux-framework.org...
I1130 23:17:54.332808 1 leaderelection.go:258] successfully acquired lease operator-system/14dde902.flux-framework.org
1.7330086743328419e+09 DEBUG events Normal {"object": {"kind":"Lease","namespace":"operator-system","name":"14dde902.flux-framework.org","uid":"60570642-3c34-463b-ad4d-d8dc24244a92","apiVersion":"coordination.k8s.io/v1","resourceVersion":"1084"}, "reason": "LeaderElection", "message": "operator-controller-manager-b69fd5b7d-tlpkz_79a6000a-643e-400a-9fa7-288a377261ef became leader"}
1.7330086743329332e+09 INFO Starting EventSource {"controller": "minicluster", "controllerGroup": "flux-framework.org", "controllerKind": "MiniCluster", "source": "kind source: *v1alpha2.MiniCluster"}
1.7330086743329582e+09 INFO Starting EventSource {"controller": "minicluster", "controllerGroup": "flux-framework.org", "controllerKind": "MiniCluster", "source": "kind source: *v1.Ingress"}
1.7330086743329635e+09 INFO Starting EventSource {"controller": "minicluster", "controllerGroup": "flux-framework.org", "controllerKind": "MiniCluster", "source": "kind source: *v1.Job"}
1.7330086743329725e+09 INFO Starting EventSource {"controller": "minicluster", "controllerGroup": "flux-framework.org", "controllerKind": "MiniCluster", "source": "kind source: *v1.Secret"}
1.733008674333002e+09 INFO Starting EventSource {"controller": "minicluster", "controllerGroup": "flux-framework.org", "controllerKind": "MiniCluster", "source": "kind source: *v1.Service"}
1.7330086743330076e+09 INFO Starting EventSource {"controller": "minicluster", "controllerGroup": "flux-framework.org", "controllerKind": "MiniCluster", "source": "kind source: *v1.Pod"}
1.7330086743330116e+09 INFO Starting EventSource {"controller": "minicluster", "controllerGroup": "flux-framework.org", "controllerKind": "MiniCluster", "source": "kind source: *v1.ConfigMap"}
1.733008674333016e+09 INFO Starting EventSource {"controller": "minicluster", "controllerGroup": "flux-framework.org", "controllerKind": "MiniCluster", "source": "kind source: *v1.Job"}
1.733008674333019e+09 INFO Starting Controller {"controller": "minicluster", "controllerGroup": "flux-framework.org", "controllerKind": "MiniCluster"}
1.7330086744340837e+09 INFO Starting workers {"controller": "minicluster", "controllerGroup": "flux-framework.org", "controllerKind": "MiniCluster", "worker count": 1}
Here is how to inspect your nodes / pods (you don't have any pods yet)
kubectl get nodes
kubectl get pods
# see the nodes pods are assigned to
kubectl get pods -o wide
Now that we have our operator, let's create the cluster. This is in "interactive" mode so it won't run anything - it will sit and wait for you to shell in and do whatever you like until you delete it.
kubectl apply -f https://gist.githubusercontent.com/vsoch/b4af4f455b96398b348a631f55fdf771/raw/483012120ef22f43de73663d4ee3b7e3ef0ed680/minicluster.yaml
That's going to take a bit to pull because the container is big. You can use
$ kubectl get pods --watch
NAME READY STATUS RESTARTS AGE
flux-sample-0-62nzm 0/1 Init:0/1 0 5s
flux-sample-1-qb7gj 0/1 Init:0/1 0 5s
flux-sample-2-g8clx 0/1 Init:0/1 0 5s
flux-sample-3-w54sd 0/1 Init:0/1 0 5s
Init is usually when flux is installed on the fly, which isn't done here because we don't have a view. But it's still configured. When it's running (takes 15 minutes on my machine) you can shell in to the lead broker.
lead_broker=$(kubectl get pods -o json | jq -r '.items[0].metadata.name')
kubectl exec -it ${lead_broker} -- bash
Once you are in the container, you need to connect to the socket. This is a container where the view (automated install of flux) is disabled in favor of the updated version with your segfault fix, so we can't use the easy interfaces / environment to shell in. Instead:
flux proxy local:///mnt/flux/view/run/flux/local bash
root@flux-sample-0:/chonks# flux resource list
STATE NNODES NCORES NGPUS NODELIST
free 4 32 0 flux-sample-[0-3]
allocated 0 0 0
down 0 0 0
Should have a binomial overlay topology
Run the experiment:
for size in $(seq 1 10)
do
# Create the archive and time it
echo "EVENT create-archive-${size}"
time flux archive create --name create-archive-${size} --dir /chonks ${size}gb.txt
# Distribute to all nodes, but to /tmp because file exists in /chonks
echo "EVENT distribute-all-nodes-${size}"
time flux exec -r all flux archive extract --name create-archive-${size} -C /tmp
sleep 2
echo "EVENT delete-archive-all-nodes-${size}"
time flux exec -r all flux archive remove --name create-archive-${size} 2>/dev/null
sleep 2
# Clean up the extracted file on all nodes
echo "EVENT clean-all-nodes-${size}"
time flux exec -r all rm -rf /tmp/${size}gb.txt
sleep 2
done
It could be the number of nodes - I am only doing 4 now, I'll try updating to 6. I hope it's not something specific to Kubernetes in the cloud on actual nodes - I could probably debug that a bit by trying on AWS (where we have hpc instances with single tenancy). Ugh. I guess I'll keep trying and update you here. |
I did 3x on kind, did not reproduce. |
Up to you. |
Continuing from discussion in #6461. When
When we get to larger sizes of a kary:3 tree (and it's not clear to me if the topology matters here or if it was transient) we start to see that the archive create is failing. Here is an example output from an experiment from last night - the purple line is erroneous. The archive wasn't created or distributed, it issued that error and the time ended too early.
Specifically, I think I'm hitting some limit or other error with kary-3 (and I'm not sure what at the moment, just finished the runs and would need to manually inspect):
Here is what that topology looks like:
I am trying to reproduce it now and look at flux dmesg and other stats for possibly other / more information.