tmp space fills up #912
-
Apparently not all temp files get removed, slowly filling a disk until no new plots can be made and 0 bytes of space are left, even though only 1 active plotman job for that disk is still running and all log files seem OK too. No problems were detected; I've been plotting for 8 days straight on my HP server (2x NVMe 1.9TB, 2x SSD 1.9TB, 4x HDD volume).

Sorry for the long title, let me start with some background. I've been trying to tune this HP for a few weeks now. It's a sloooow process :) During earlier testing I ran into situations where one or more disks would get too full, messing up the plotting process (I'd often get corrupted plots out of it). After a while of testing I started to notice that when I manually killed a PID and manually removed its temp files by their plot ID, I still got full disks EVEN when they shouldn't be full. If I tried to use plotman to stop a plot on a disk at risk of running out of space, plotman wouldn't actually "see" the temp files and would report them as 0 bytes in size. I ignored that and just did the manual clean-up.

After a system reboot and more careful handling of plotman, I started to play with some phase limits and global stagger values. This went great for 7 days, up to when I started to discover a bottleneck: of the two PCIe NVMe drives, the 2nd one consistently underperformed by a good 20% (same file systems, and the OS is on the first NVMe, but still: it's substantially slower). When I made my latest edits (see below) I noticed for the first time, with a global stagger of 65 minutes, that the nvme2 disk would get slower to finish (despite running fewer tasks), up to the point where a later started plot would finish faster. Normally a plot finishes in about 24 hours, but I've now seen the 2nd NVMe suddenly take close to 28 hours per plot. This was last evening and I kept it running overnight.

This morning I noticed that my FIRST NVMe disk wasn't doing that many plots. In fact, it was only doing ONE plot. This was weird: a faster global stagger value, but NOW all of a sudden new plots aren't created fast enough any longer? I remember from my previous testing last week that I woke up to plotman constantly kicking off new plots, where it SHOULD've been impossible for the scheduler to even find new tasks to create, or so I thought. It looked weird to me, but again I put the blame on myself. NOW, this morning, my first NVMe disk reported as being completely FULL, 0 bytes of free space left!! And only ONE plotting task running on it :(
When I go to its temp folder, I find temp files for 5 different plots, 811 files in total. Yet as you can see, plotman insists there's only 1 plot running at the moment. It should be running 3, with an optional 4th in phase 3.6 or 4.0. Snippets from the config as it was when I first realized this bug wasn't something I was messing up:
To upload a zip of all my logs, I need to free up some space on my 1st NVMe disk. I do that by removing the following temp files:

`plot-k33-2021-08-27-03-49-39e46420990d57d2ee5815573bf4905cf518e2d56ff6eff58fba43342b64b665.plot.sort`
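For reference, that manual cleanup can be expressed as a small Python sketch (a hypothetical helper, not part of plotman; the directory and plot ID below are placeholders):

```python
# Hypothetical helper, NOT a plotman feature: delete every temp file in one
# tmp directory whose name contains a given plot ID. Kill the matching
# process first, and run with dry_run=True to review the list before
# actually deleting anything.
import pathlib

def remove_tmp_files_for_plot(tmp_dir: str, plot_id: str, dry_run: bool = True) -> None:
    for path in pathlib.Path(tmp_dir).glob(f"plot-*{plot_id}*"):
        print("would remove" if dry_run else "removing", path)
        if not dry_run:
            path.unlink()

# remove_tmp_files_for_plot("/mnt/nvme1/plotman-tmp", "39e46420990d57d2", dry_run=True)
```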
-
PS: I ctrl-c'd plotman and I'll leave it plotting for the next 24 hours to finish all plots. Until then, I'm here for any experiments and questions needed :)

PS2: looking at chiaplotgraph (Ubuntu), my harvester stopped harvesting a good 3.5 hours ago; I think that's when the first NVMe ran out of space.
-
I guess there are three scenarios here:

- The one where you manually kill jobs and you are expected to clean them up.
- The one where you ask plotman to kill jobs and it is expected to clean up the tmp files.
- The one where you leave plotman running without killing anything and the plotter itself is expected to clean up the tmp files.

The one specifically relevant to existing plotman features is where you asked plotman to kill a job and it failed to find any tmp files. I guess for that we would want to see the log files.

Side note: please don't snippet the configuration file. Just share the entire thing.
-
This is for the newly started task "491f72b7". If I do the same for the nvme1 plot still running, I get the same result:
And while it pains me, I'll kill it next and upload the log files you requested.
Both processes indeed disappear from `plotman status`. I'll check on the status of the temp files next.
-
The original output was too long, so I put it here: https://pastebin.com/Nevi8QTD

I'll redo it all, but this time with all temp files removed first (from the nvme1 tmp dir). Then I'll start `plotman interactive`, wait a minute, kill the newly started plot and give you the ls output of THAT. I'm pretty sure that'll be more readable anyway :)
-
Short answer: the test shows temp files are not removed after `plotman kill`. Output:
-
current plotman.yaml:
-
See a proposed fix in #913.
-
For now, one question remains though... How come that after so many days (8) of successful plotting, only now temp files suddenly appear to have been forgotten? I'll repost in 10 days if the problem occurs once again :)
-
Note to self: SSD0 had temp files left after all plotting was done - not too many, but still.
-
Just to add to the original question - which appears to be out of the scope of plotman itself - I had the bug reappear once more. With a global stagger of 68 minutes I was running smoothly for a few days, so I decided to lower my global stagger to 66 minutes to see if I could speed things up. I noticed that all my disks were often 100% busy, indicating that the limits of my hardware had been reached. This is also when I suddenly got the bug again where plotman does not know the plot IDs, NOR do the known plots ever finish (despite hard disk activity). After noticing the bug, I closed plotman for half a day. This is the output I got when I started `plotman interactive` again:
Existing and identified plots-in-the-making do have a recent last-modified time. But as you can see, no progress is being made. I suspect it's due to the NVMe & SSD drives maxing out their bus bandwidth. Global stagger of 67 minutes, here I come :)

Edit: I had disk backlog warnings from netstat of 6000 ms and more! ;-(
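A rough way to put numbers on how busy the drives actually are (a sketch assuming Linux and the psutil package; not something plotman does) is to sample the per-disk busy time over an interval:

```python
# Rough utilization check, NOT a plotman feature: sample psutil's per-disk
# busy_time counters (available on Linux) over a short interval and report
# the busy fraction per disk.
import time
import psutil

def disk_utilization(interval_s: float = 5.0) -> dict:
    before = psutil.disk_io_counters(perdisk=True)
    time.sleep(interval_s)
    after = psutil.disk_io_counters(perdisk=True)
    util = {}
    for name, stats in after.items():
        if name in before:
            busy_ms = stats.busy_time - before[name].busy_time
            util[name] = busy_ms / (interval_s * 1000.0)  # 1.0 means ~100% busy
    return util

# print(disk_utilization())  # e.g. {'nvme0n1': 0.97, 'nvme1n1': 0.99, 'sda': 0.10, ...}
```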
-
Alright, that went fast this time. SSD03 is already full, 1 MB left. This time there might be a good reason for it though (see #928). But it's one thing that the disk is full because of the aforementioned problem; it's another that plotman still reports only 3 active plots being made through ssd03...
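A quick way to count how many distinct plots still have temp files on a drive (plain Python, not a plotman command; the path below is a placeholder) is to group the temp file names by their embedded plot ID:

```python
# Quick count, NOT a plotman command: how many distinct plot IDs still have
# temp files in a directory, and how many files each. Assumes the standard
# chia temp file naming with a 64-character hex plot ID in the file name.
import pathlib
import re
from collections import Counter

def plots_with_tmp_files(tmp_dir: str) -> Counter:
    counts: Counter = Counter()
    for path in pathlib.Path(tmp_dir).iterdir():
        match = re.search(r"[0-9a-f]{64}", path.name)
        if match:
            counts[match.group(0)[:8]] += 1  # short plot ID -> number of temp files
    return counts

# counts = plots_with_tmp_files("/mnt/ssd03/plotman-tmp")  # placeholder path
# len(counts) is the number of distinct plots that still have temp files.
```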
But when I go to ssd03 and count how many plots still have temp files, I find at least 7 different plots in the making. SEVEN! Not 3... Is this still a chia or hardware problem, or can we start looking at plotman problems now?

`ls -lias` output for ssd03: https://pastebin.com/0tXUjvB8

Config file:
-
Files on a disk are not representative of a plot process running. So, having tmp files on disk for 7 different plots does not indicate that plotman is incorrectly reporting only 3 plotting processes. You have to look at processes.

At some point, if you want to debug this, you will probably need to start isolating things. In normal operation plotman does not clean up tmp files nor kill processes without you asking. So, if tmp files are being left around when you have not killed processes using plotman, then it still doesn't seem likely to be a plotman issue. Maybe look at the logs for those plots.

Maybe just switch to madmax with a phase 1 stagger on your fastest NVMe, or RAID the NVMe drives if they are the same size/make/model etc.
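To make the "look at processes" suggestion concrete, here is a minimal sketch (assuming the psutil package and enough permissions to read the plotting processes' open files; this is not plotman code) that flags plot IDs with temp files on disk but no process using them:

```python
# Sketch of the "look at processes" check, NOT plotman code: list plot IDs
# that have temp files in tmp_dir but no running process holding any of
# those files open, i.e. likely leftovers. Assumes the psutil package and
# permission to inspect the plotting processes (same user or root).
import pathlib
import re
import psutil

ID_RE = re.compile(r"[0-9a-f]{64}")

def orphaned_plot_ids(tmp_dir: str) -> set:
    ids_on_disk = set()
    for path in pathlib.Path(tmp_dir).iterdir():
        match = ID_RE.search(path.name)
        if match:
            ids_on_disk.add(match.group(0))

    ids_in_use = set()
    for proc in psutil.process_iter():
        try:
            for open_file in proc.open_files():
                match = ID_RE.search(open_file.path)
                if match:
                    ids_in_use.add(match.group(0))
        except (psutil.AccessDenied, psutil.NoSuchProcess):
            continue

    return ids_on_disk - ids_in_use

# IDs in the returned set have temp files on disk but no process using them,
# so those files should be safe to clean up by hand.
```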