Better visibility of stored files and their shard status #133
While I agree that this is something that should be improved, I think exposing the repair/rebuild queue would be enough.
I can do without the list command, though I do think it would be handy for both human and machine consumption under different circumstances. Exposing info on the repair queue would be fine. One thing I think is important, though, is that there's a way to get at the info both from the CLI and via Prometheus.
Yes, fully agree with this.
Curious, how do you currently do this, @scottyeager? If you give more elaboration about the two points below, I think I'll have better context about the situation.
FYI, |
So these are some basic assumptions, to start:
This is all fine, until some backend is lost and needs to be replaced. When some backend is swapped for a fresh one, any files with a shard on that backend are now in a degraded state. As an operator, I want to be able to understand when (and also if) the system has returned to the desired state of all files having their expected shards available.

The only way I'm aware of so far to get any visibility into this is to check the zstor logs and wait for evidence that all needed repair operations have succeeded.
Yes, currently this is the only evidence that "all files' shards are healthy".
With the current architecture, there are several things that could become signs of healthiness:
(Maybe there are other signs, but the above are the only things I could think of right now.) Perhaps we could group all the signs into one Prometheus metric with one tag for each sign (maybe there is a better Prometheus strategy for this, but the point is to group them into one data structure). Or we could simply put all of the above in one Grafana widget.
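For illustration, here is a minimal sketch of that grouping idea, assuming the `prometheus` Rust crate; the metric name and the sign labels are placeholders, not anything zstor exposes today:

```rust
use prometheus::{IntGaugeVec, Opts, Registry};

// One metric family, one labelled time series per health "sign" (1 = healthy, 0 = not),
// so everything can land in a single Grafana widget.
fn register_health_signs(registry: &Registry) -> prometheus::Result<IntGaugeVec> {
    let signs = IntGaugeVec::new(
        Opts::new("zstor_health_sign", "Individual health signs of the system"),
        &["sign"],
    )?;
    registry.register(Box::new(signs.clone()))?;

    // Example sign labels (hypothetical); each shows up as its own series.
    signs.with_label_values(&["repair_queue_empty"]).set(1);
    signs.with_label_values(&["last_repair_succeeded"]).set(1);
    Ok(signs)
}
```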
I think exposing the time of the last completed repair process as a Prometheus metric is a great start. Backend health is monitored separately, so I think that part is already okay.
If the file is being retrieved as part of a repair operation, then it should shortly be restored to the desired state.

@iwanbk, you mentioned before "exposing the repair queue". That seems like the piece that gives a sufficient view:

As for how exactly to structure this in Prometheus, I think it's okay for the metrics to remain relatively separate and be combined as needed within Grafana (or whatever tool).
Yes, that is correct. But what I meant was:
"Lack of sufficient healthy backends" also means that
Yes, I mentioned it before because this is what we did previously:
We can get the full list of keys. Because the number of keys could be huge, we changed the first step to SCAN the keys page by page, instead of fetching them in a single shot.
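As a rough sketch of that page-by-page approach, assuming a Redis-compatible metadata backend and the `redis` crate (the key pattern and page size are placeholders):

```rust
// Walk the metadata keys with SCAN, one page at a time, instead of
// pulling the whole keyspace in a single shot.
fn scan_all_keys(
    conn: &mut redis::Connection,
    pattern: &str,
) -> redis::RedisResult<Vec<String>> {
    let mut keys = Vec::new();
    let mut cursor: u64 = 0;
    loop {
        // Each SCAN call returns the next cursor plus one page of keys.
        let (next, page): (u64, Vec<String>) = redis::cmd("SCAN")
            .arg(cursor)
            .arg("MATCH")
            .arg(pattern)
            .arg("COUNT")
            .arg(100)
            .query(conn)?;
        keys.extend(page);
        if next == 0 {
            break; // cursor 0 means the keyspace has been fully covered
        }
        cursor = next;
    }
    Ok(keys)
}
```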
Gotcha. The scenario I'm mostly thinking of is when a backend is replaced. In that case, it's not necessarily the case that a
Backend status is exposed as a metric, and this allows users to deduce whether the system can accept writes, based on their config and the number of healthy backends. Certainly, if no writes can happen, the first priority would be to restore the system to a state where they can. If that means adding new empty backends, then we've returned to the core question inspiring this issue: how can I know when the system has managed to return to the desired state of all files having their expected shards available?

Based on the discussion here, I think the practical approach is to add a metric that conveys the success of repair operations. Actually, I think the typical way to do this would be to show the number of successful operations since the program started. The timestamp of the last increase can be deduced from this if needed.
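A minimal sketch of that counter idea, assuming the `prometheus` and `lazy_static` crates (the metric name is made up, not an existing zstor metric):

```rust
use lazy_static::lazy_static;
use prometheus::{register_int_counter, IntCounter};

lazy_static! {
    // Counts successful repair operations since the process started; the time of
    // the last increase can then be derived on the dashboard side if needed.
    static ref REPAIRS_SUCCESSFUL: IntCounter = register_int_counter!(
        "zstor_repairs_successful_total",
        "Repair operations that completed successfully since start"
    )
    .unwrap();
}

fn on_repair_finished(success: bool) {
    if success {
        REPAIRS_SUCCESSFUL.inc();
    }
}
```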
Going through this, I think the best way forward here is to create a small struct which holds meta info about a repair process. Info would include:
Ideally the amount of failed repairs should be 0, and the amount of successful repairs should be the number of degraded objects minus the amount of failed repairs. This info can be exposed over the CLI and, in some way, for Prometheus; there it can reasonably also allow for some derived metrics (scan time taken, repairs per second, known broken objects in the system, to name a few).
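Something like the following could serve as that struct; this is a rough sketch only, with illustrative names rather than zstor's actual types:

```rust
use std::time::{Duration, SystemTime};

// Meta info about one scan/repair pass, to be surfaced over the CLI and Prometheus.
#[derive(Debug, Clone)]
pub struct RepairRunInfo {
    pub started_at: SystemTime,
    pub scan_duration: Duration,
    pub repair_duration: Duration,
    /// Objects found with fewer than the expected number of healthy shards.
    pub degraded_objects: u64,
    pub repairs_succeeded: u64,
    pub repairs_failed: u64,
}

impl RepairRunInfo {
    /// In a healthy run, failures are 0 and successes match the degraded objects found.
    pub fn fully_repaired(&self) -> bool {
        self.repairs_failed == 0 && self.repairs_succeeded == self.degraded_objects
    }

    /// Example derived metric: repairs per second of repair time.
    pub fn repairs_per_second(&self) -> f64 {
        let secs = self.repair_duration.as_secs_f64();
        if secs > 0.0 {
            (self.repairs_succeeded + self.repairs_failed) as f64 / secs
        } else {
            0.0
        }
    }
}
```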
Currently there is no way to query the list of stored files and also no way to see the health of individual files in terms of how many shards they have stored in healthy backends. This makes it difficult to assess whether the system is in a degraded state. It's especially relevant when recovering from some backend failure to be able to check if all files have been rebuilt onto newly supplied backends. Being able to see a list of stored files is also helpful for general inspection of the system, without needing to run lots of `check` commands and also keep a separate list of files that have been removed from local storage.

So I'm thinking of something like this:

- A `list` command that lists the stored files
- Shard health info per file (in `list` or `check` or both): do they have `expected shards` available, and if not, do they at least have `minimal shards` available? (A rough sketch of such a classification follows below.)
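For the shard-status part, here is a hedged sketch of how each file's health could be classified from its healthy-shard count and the configured thresholds; the names are illustrative, not zstor's actual API:

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum FileHealth {
    /// All expected shards are on healthy backends.
    Healthy,
    /// Fewer than expected, but still recoverable (at least minimal shards).
    Degraded,
    /// Below the minimal shard count: the data can no longer be rebuilt.
    Unrecoverable,
}

pub fn classify(
    healthy_shards: usize,
    expected_shards: usize,
    minimal_shards: usize,
) -> FileHealth {
    if healthy_shards >= expected_shards {
        FileHealth::Healthy
    } else if healthy_shards >= minimal_shards {
        FileHealth::Degraded
    } else {
        FileHealth::Unrecoverable
    }
}
```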