Better visibility of stored files and their shard status #133

Open
scottyeager opened this issue Nov 23, 2024 · 11 comments

@scottyeager

Currently there is no way to query the list of stored files, and also no way to see the health of individual files in terms of how many shards they have stored in healthy backends. This makes it difficult to assess whether the system is in a degraded state. It's especially relevant when recovering from some backend failure to be able to check if all files have been rebuilt onto newly supplied backends. Being able to see a list of stored files is also helpful for general inspection of the system, without needing to run lots of check commands or keep a separate list of files that have been removed from local storage.

So I'm thinking of something like this:

  1. A list command that lists the stored files
  2. Some way of outputting the number of shards present for a given file in live backends (this could be part of list or check or both)
  3. At least one Prometheus metric that helps to understand whether the files are, overall, in a degraded state (do they have expected_shards available, and if not, do they at least have minimal_shards available); a rough sketch follows below
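
For illustration, point 3 could look something like the sketch below, assuming the Rust `prometheus` crate; the metric names and exact semantics here are made up, not taken from the zstor codebase:

```rust
use prometheus::{IntGauge, Registry};

/// Sketch of the gauges proposed in point 3. Metric names and semantics
/// are hypothetical, not taken from the zstor codebase.
fn register_file_health_metrics(registry: &Registry) -> prometheus::Result<(IntGauge, IntGauge)> {
    // Files that have fewer than expected_shards on healthy backends.
    let files_degraded = IntGauge::new(
        "zstor_files_degraded",
        "Files with fewer than expected_shards stored on healthy backends",
    )?;
    // Files that dropped below minimal_shards and can no longer be rebuilt.
    let files_below_minimal = IntGauge::new(
        "zstor_files_below_minimal_shards",
        "Files with fewer than minimal_shards available on healthy backends",
    )?;
    registry.register(Box::new(files_degraded.clone()))?;
    registry.register(Box::new(files_below_minimal.clone()))?;
    Ok((files_degraded, files_below_minimal))
}
```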
@iwanbk
Member

iwanbk commented Nov 25, 2024

This makes it difficult to assess whether the system is in a degraded state.

While I agree that this is something that should be improved, I don't think that assessing it by listing all stored files is a good idea, for these reasons:

  • The list could be very long
  • Human eyes are not a reliable tool for checking such a long list of stored files

I think exposing the repair/rebuild queue would be enough.

@scottyeager
Author

I can do without the list command, though I do think it would be handy for both human and machine consumption under different circumstances.

Exposing info on the repair queue would be fine. One thing I think is important, though, is that there's a way to get at the info both from the CLI and via Prometheus.

@iwanbk
Member

iwanbk commented Nov 25, 2024

One thing I think is important, though, is that there's a way to get at the info both from the CLI and via Prometheus.

Yes, fully agree with this.

@iwanbk iwanbk self-assigned this Jan 2, 2025
@iwanbk
Member

iwanbk commented Jan 2, 2025

Better visibility of stored files and their shard status

Curious, how do you currently do this, @scottyeager?

If you elaborate a bit more on these two points below, I think I'll have better context about the situation.

Being able to see a list of stored files is also helpful for general inspection of the system, without needing to run lots of check commands or keep a separate list of files that have been removed from local storage.

I can do without the list command, though I do think it would be handy for both human and machine consumption under different circumstances.

@iwanbk
Member

iwanbk commented Jan 2, 2025

FYI, the check command returns the checksum from the metadata only, without loading the actual data or examining the state of the data backends.

@scottyeager
Author

So these are some basic assumptions, to start:

  1. When a store command completes without error, the file has been stored with expected_shards fulfilled
  2. Getting a checksum back from check means the same thing as 1, which is that the file had expected_shards at the time it was stored. Nothing more is implied about the current health of those backends.

This is all fine, until some backend is lost and needs to be replaced. When some backend is swapped for a fresh one, any files with a shard on that backend are now in a degraded state.

As an operator, I want to be able to understand when (and also if) the system has returned to the desired state of all files having expected_shards present in healthy backends.

The only way I'm aware of so far to get any visibility into this is to check the zstor logs and wait for evidence that all needed repair operations have succeeded.
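
For illustration, the per-file check implied above (does a file still have expected_shards on healthy backends, and does it at least have minimal_shards?) could look roughly like this; `ShardMeta`, `FileHealth` and `backend_is_healthy` are hypothetical stand-ins, not zstor's actual types:

```rust
/// Hypothetical stand-ins for zstor's metadata and backend-health types,
/// only to illustrate the per-file check described above.
struct ShardMeta {
    backend_addr: String,
}

enum FileHealth {
    Healthy,       // expected_shards present on healthy backends
    Degraded,      // recoverable, but a repair is needed
    Unrecoverable, // fewer than minimal_shards remain
}

fn file_health(
    shards: &[ShardMeta],
    expected_shards: usize,
    minimal_shards: usize,
    backend_is_healthy: impl Fn(&str) -> bool,
) -> FileHealth {
    // Count only the shards that live on backends currently reported healthy.
    let live = shards
        .iter()
        .filter(|s| backend_is_healthy(&s.backend_addr))
        .count();
    if live >= expected_shards {
        FileHealth::Healthy
    } else if live >= minimal_shards {
        FileHealth::Degraded
    } else {
        FileHealth::Unrecoverable
    }
}
```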

@iwanbk
Member

iwanbk commented Jan 4, 2025

wait for evidence that all needed repair operations have succeeded.

Yes, currently this is the only evidence that "all files' shards are healthy", because doing a repair means:

  • iterate over all files from the metadata
  • check the data backends and rebuild if needed

In other words, this is the only process where ALL files are checked.

With the current architecture, there are several things that could serve as signs of healthiness:

  1. When doing a retrieve, is there any file with a broken shard?
  2. Are all the backends in the config healthy?
  3. How long has it been since the last repair process?

(Maybe there are other signs, but the above are the only things I could think of right now.)

Perhaps we could group all the signs into one Prometheus variable with one tag for each sign (maybe there is a better Prometheus strategy for this, but the point is to group it into one data structure).

Or we could simply put all of the above in one Grafana widget.
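
If the "one variable with one tag per sign" route were taken, a minimal sketch with the Rust `prometheus` crate might look like this (metric and label names are purely illustrative):

```rust
use prometheus::{IntGaugeVec, Opts, Registry};

/// One gauge with a "sign" label per health indicator; names are illustrative only.
fn register_health_signs(registry: &Registry) -> prometheus::Result<IntGaugeVec> {
    let signs = IntGaugeVec::new(
        Opts::new("zstor_health_sign", "Health indicators, one time series per sign"),
        &["sign"],
    )?;
    registry.register(Box::new(signs.clone()))?;

    // 1 = ok, 0 = not ok; the repair age is a duration and uses its own scale.
    signs.with_label_values(&["retrieve_without_broken_shards"]).set(1);
    signs.with_label_values(&["all_backends_healthy"]).set(1);
    signs.with_label_values(&["seconds_since_last_repair"]).set(3600);
    Ok(signs)
}
```

One downside of this grouping is that it mixes booleans and durations in the same metric, which is part of why the discussion below leans towards keeping the metrics separate and combining them in Grafana.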

@scottyeager
Author

I think exposing the time of the last completed repair process as a Prometheus metric is a great start. Backend health is monitored separately, so I think that part is already okay.

When doing a retrieve, is there any file with a broken shard?

If the file is being retrieved as part of a repair operation, then it should shortly be restored to the desired expected_shards. So I think what we're really interested in is cases where repair fails or is not possible due to lack of sufficient healthy backends.

@iwanbk, you mentioned before "exposing the repair queue". That seems like the piece that gives a sufficient view:

  1. How long since the last repair process
  2. Progress in completing the repair process

As for how exactly to structure this as Prometheus, I think it's okay for the metrics to remain relatively separate and be combined as needed within Grafana (or whatever tool).

@iwanbk
Member

iwanbk commented Jan 10, 2025

If the file is being retrieved as part of a repair operation, then it should shortly be restored to the desired expected_shards.

Yes, that is correct.

But what I meant was a retrieve via the regular retrieve command.

So I think what we're really interested in is cases where repair fails or is not possible due to lack of sufficient healthy backends.

"Lack of sufficient healthy backends" also means that write operation will be failed.
Don't we have this metrics already? I haven't checked the metrics/code, i assume we have because it is quite important.

you mentioned before "exposing the repair queue". That seems like the piece that gives a sufficient view:

Yes, I mentioned it before because this is what we did previously:

  1. get ALL keys from the metadata
  2. check and repair them one by one

We can get the repair queue from the first step.

Because the number of keys could be huge, we changed the first step to SCAN the keys page by page, instead of in a single shot.
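
As a rough sketch of that paged flow (the `MetaStore` trait below is hypothetical, just to show where a repair queue and progress counters would naturally come from):

```rust
/// Hypothetical metadata interface, just to illustrate the paged scan flow
/// and where a repair queue / progress counters would come from.
trait MetaStore {
    /// Returns one page of keys plus a cursor for the next page (None when done).
    fn scan_keys(&self, cursor: Option<u64>, page_size: usize) -> (Vec<String>, Option<u64>);
    fn needs_repair(&self, key: &str) -> bool;
    fn repair(&self, key: &str) -> Result<(), String>;
}

/// Returns (scanned, queued_for_repair, failed_repairs) for one full pass.
fn repair_pass(store: &dyn MetaStore) -> (u64, u64, u64) {
    let (mut scanned, mut queued, mut failed) = (0u64, 0u64, 0u64);
    let mut cursor = None;
    loop {
        // SCAN the keys page by page instead of loading them all at once.
        let (keys, next) = store.scan_keys(cursor, 1000);
        for key in &keys {
            scanned += 1;
            if store.needs_repair(key) {
                queued += 1; // effectively the size of the repair queue
                if store.repair(key).is_err() {
                    failed += 1;
                }
            }
        }
        match next {
            Some(c) => cursor = Some(c),
            None => break,
        }
    }
    (scanned, queued, failed)
}
```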

@scottyeager
Author

But what I meant was a retrieve via the regular retrieve command.

Gotcha. The scenario I'm mostly thinking of is when a backend is replaced. In that case, there's no guarantee that a retrieve will be triggered anytime soon. There's also the fact that the user doesn't have visibility into which files were stored on the backend that was removed. In the case where there are more backends than expected_shards, not every file will have a shard on every backend.

"Lack of sufficient healthy backends" also means that write operation will be failed.
Don't we have this metrics already? I haven't checked the metrics/code, i assume we have because it is quite important.

Backend status is exposed as a metric, and this allows users to deduce whether the system can accept writes, based on their config and the number of healthy backends.

Certainly if it's the case that no writes can happen, the first priority would be to restore the system to a state where it can. If that means adding new empty backends, then we've returned to the core question inspiring this issue:

How can I know when the system has returned to the desired state of having expected_shards present on backends in the active configuration, after some backend was replaced?

Based on the discussion here, I think the practical approach is to add a metric that conveys the success of repair operations. Actually, I think the typical way to do this would be to expose the number of successful operations since the program started; the timestamp of the last increase can be deduced from this if needed.
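
A minimal sketch of such a counter, again assuming the Rust `prometheus` crate (the metric name here is made up):

```rust
use prometheus::{IntCounter, Registry};

/// Hypothetical counter, incremented once per successful repair operation.
fn register_repair_counter(registry: &Registry) -> prometheus::Result<IntCounter> {
    let repairs_ok = IntCounter::new(
        "zstor_repairs_successful_total",
        "Successful repair operations since the process started",
    )?;
    registry.register(Box::new(repairs_ok.clone()))?;
    Ok(repairs_ok)
}
```

On the Prometheus side, an expression like `increase(zstor_repairs_successful_total[1h]) == 0` could then flag that no repairs have completed recently (again, with the caveat that the metric name is only illustrative).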

@LeeSmet
Contributor

LeeSmet commented Jan 21, 2025

Going through this, I think the best way forward here is to create a small struct which holds meta info about a repair process. Info would include:

  • start time
  • scanned objects
  • degraded objects identified (which is also the number of objects we try to repair)
  • number of failed repairs (which is the number of objects left in a bad state after the repair attempt)
  • end timestamp, if the scan/repair finished

Ideally the number of failed repairs should be 0, and the number of successful repairs is degraded objects - failed repairs. This info can be exposed over the CLI in some way. For Prometheus, it can reasonably be exposed to also allow for some derived metrics (scan time taken, repairs per second, known broken objects in the system, to name a few).
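
A minimal sketch of such a struct, with field names that are only an interpretation of the list above:

```rust
use std::time::SystemTime;

/// Meta info about a single scan/repair run, as outlined above;
/// field names are only an interpretation of that list.
#[derive(Debug, Clone)]
pub struct RepairRunInfo {
    pub started_at: SystemTime,
    pub scanned_objects: u64,
    pub degraded_objects: u64,           // objects we will try to repair
    pub failed_repairs: u64,             // objects still in a bad state afterwards
    pub finished_at: Option<SystemTime>, // None while the scan/repair is still running
}

impl RepairRunInfo {
    /// Successful repairs = degraded objects minus failed repairs.
    pub fn successful_repairs(&self) -> u64 {
        self.degraded_objects.saturating_sub(self.failed_repairs)
    }
}
```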
