
Validate set schema for subsequent calls to SOS store #1532

Closed · wants to merge 1 commit

Conversation

@tom95858 (Collaborator) commented Nov 21, 2024

It is possible for a set to be passed to the SOS store() function with a different schema than the one originally provided. This can crash ldmsd and/or store incorrect data in the container.

This change caches the set digest of the first set in the storage instance and compares this cached value to the set digest provided in the store() function.

If these digests do not match, the set is ignored and not stored.
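
In outline, the new guard behaves like the following sketch (the type and function names here are illustrative stand-ins, not the actual store_sos internals):

```c
#include <errno.h>
#include <string.h>

/* Illustrative stand-ins for the real ldms.h / store_sos.c definitions. */
typedef unsigned char set_digest_t[32];   /* e.g. a hash over the set schema */

struct store_inst {
	int digest_valid;        /* has a digest been cached yet? */
	set_digest_t digest;     /* digest of the first set stored */
};

/* Cache the digest of the first set seen by this storage instance and
 * reject any later set whose digest differs. Returns 0 if the set may
 * be stored, EINVAL if it must be dropped. */
static int check_set_digest(struct store_inst *si, const set_digest_t incoming)
{
	if (!si->digest_valid) {
		memcpy(si->digest, incoming, sizeof(set_digest_t));
		si->digest_valid = 1;
		return 0;
	}
	if (memcmp(si->digest, incoming, sizeof(set_digest_t)))
		return EINVAL;   /* schema mismatch: ignore this set */
	return 0;
}
```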

This change also removes "dead code" that was no longer used after the decomposition changes.

@bschwal (Collaborator) commented Nov 22, 2024

Hi @tom95858, how would this affect attempting to store a changed schema to an older schema, even after a restart? Additionally, is dropping the data really the best approach, rather than storing it to a new schema that could then be used for recovery later?

@tom95858 (Collaborator, Author)

> Hi @tom95858, how would this affect attempting to store a changed schema to an older schema, even after a restart?

BTW: this change only applies to non-decomposition-enabled storage.

In the case you're referring to, it wouldn't work because the schema already exists in the container and has a different set of attributes. We would need to create a new schema that matches the new/removed/renamed/re-typed attributes.

> Additionally, is dropping the data really the best approach, rather than storing it to a new schema that could then be used for recovery later?

No, it's not the best, but first we would have to design a set of policies:

  • Is the behavior configurable?
  • How do you name the new schema?
  • How do you notify the administrator that this has occurred?
  • What does the notification say?
  • How do we migrate data from one schema to another so that we have one uber-schema?

It would be great to address the issues above and then write the necessary logic. But absent all that work, the bare-minimum behavior is to not crash, and at this point that requires dropping the data.

@nichamon (Collaborator)

@tom95858 The patch doesn't handle the case where store_sos stores a set to an existing schema in a SOS container, but the set schema and the SOS schema in the container have different definitions. In that case, the assert() addressed in PR #1528 will be hit, so we still need PR #1528 to prevent the crash.

For your information, storage policies cache set digests. A storage policy won't store the sets that have different digest values from the cached value. Please see ldmsd_strgp.c:662.

@nichamon (Collaborator)

@tom95858 Is it possible for applications to store some SOS schema properties in a SOS database (not the C object)? In this case, I'm talking about storing the set digest in the database.

@tom95858 (Collaborator, Author)

@nichamon it doesn't hurt to add your patch, so let's just do it and then we will have our bases covered. Oh, and yes, we could store the set digest in the SOS schema. However, SOS is not intended only for LDMS; we also use it for baler. I guess we could add a user-context field of some sort to the schema.

@tom95858 (Collaborator, Author)

> For your information, storage policies cache set digests. A storage policy won't store the sets that have different digest values from the cached value. Please see ldmsd_strgp.c:662.

@nichamon, this code path runs during lookup. The set has already been looked up, or it wouldn't be getting updated. The set returned, however, is different, or it wouldn't hit your asserts. Whether or not the returned set's digest is different is another question.

I don't think I understand exactly how this is all happening, but it is. I suppose it's possible the problem has been misdiagnosed, but we should at least apply your patch to get more information.

@nichamon (Collaborator)

@tom95858 I can reproduce the assert(), and here is the scenario.

I had a sampler daemon (node-1) and an aggregator running. The aggregator stored my_set to a SOS container named test_sampler.

> ldms_ls -x sock -p 10001 -h node-1 -l

[screenshot: ldms_ls output]
Please note that the type of metric b is UINT64.

On the aggregator node,

> sos_cmd -C test_sampler -l

Here is a snippet of the sos_cmd output. Please note that metric b has type UINT64.
[screenshot: sos_cmd output]

I killed both the sampler daemon and the aggregator. Then I started the sampler daemon with a modified configuration that changed the type of metric b to UINT32.

> ldms_ls -x sock -p 10001 -h node-1 -l

[screenshot: ldms_ls output]

Then I restarted the aggregator with the same configuration that stores my_set to the existing schema my_schema in the test_sampler container above.

With my patch in PR #1528, ldmsd reported error messages about the mismatched value types instead of hitting the assertion.
[screenshot: ldmsd error log]

@nichamon (Collaborator)

> @nichamon it doesn't hurt to add your patch, so let's just do it and then we will have our bases covered. Oh, and yes, we could store the set digest in the SOS schema. However, SOS is not intended only for LDMS; we also use it for baler. I guess we could add a user-context field of some sort to the schema.

Adding a user-context field to the SOS schema and using that field to compare against the digest cached in the store_sos handle would cover the case I showed in my previous comment.
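
A rough sketch of what that restart-safe comparison could look like; note that sos_schema_user_data_get() is hypothetical (the user-context field does not exist in SOS today), and the digest type is illustrative:

```c
#include <string.h>

typedef unsigned char set_digest_t[32];      /* illustrative digest type */
typedef struct sos_schema_s *sos_schema_t;   /* opaque handle, as in sos.h */

/* Hypothetical accessor for the proposed user-context field; SOS would
 * need to grow an API along these lines for the scheme to work. */
extern const unsigned char *sos_schema_user_data_get(sos_schema_t schema);

/* When store_sos opens an existing container, compare the digest
 * persisted in the container's schema against the digest of the
 * incoming set. A mismatch means the set definition changed, possibly
 * across a daemon restart. */
static int digest_matches_container(sos_schema_t schema,
                                    const set_digest_t set_digest)
{
	const unsigned char *stored = sos_schema_user_data_get(schema);
	return stored && memcmp(stored, set_digest, sizeof(set_digest_t)) == 0;
}
```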

@bschwal (Collaborator) commented Nov 25, 2024

@nichamon this only covers a type mismatch. If a new field gets added (which is the usual case for us), you won't get this message. We need behavior that catches not only a schema change while the daemon is running but also a mismatch after the daemon restarts.

@bschwal (Collaborator) commented Nov 25, 2024

I think we can decide in the future what to do about new schema creation, but we need something now so that the daemon won't crash and will provide appropriate error messages.

@nichamon (Collaborator)

@bschwal Yes, I'm aware of the case of adding new metrics. There is logic in store_sos.c that checks whether the number of metrics in the set is larger than the number of attributes in the SOS schema (sketched below). It is not perfect, but store_sos shouldn't crash ldmsd.
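
For reference, that guard amounts to something like the sketch below, assuming the public accessors ldms_set_card_get() (LDMS) and sos_schema_attr_count() (SOS); the header paths and error handling are illustrative:

```c
#include <errno.h>
#include <ldms/ldms.h>   /* ldms_set_card_get() */
#include <sos/sos.h>     /* sos_schema_attr_count() */

/* Sketch of the metric-count guard; error logging elided. */
static int count_guard(ldms_set_t set, sos_schema_t schema)
{
	if (ldms_set_card_get(set) > sos_schema_attr_count(schema))
		return EINVAL;  /* more metrics than attributes: don't store */
	return 0;
}
```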

Tom and I also discussed a path forward for comparing the set schema and the SOS schema in the database by storing the set digest or set SHA hash in the SOS database (#1532 (comment)). This will catch any differences between the SOS schema definition in the database and the set schema definition without going through the whole set.

@bschwal (Collaborator) commented Nov 25, 2024

> @bschwal Yes, I'm aware of the case of adding new metrics. There is logic in store_sos.c that checks whether the number of metrics in the set is larger than the number of attributes in the SOS schema. It is not perfect, but store_sos shouldn't crash ldmsd.
>
> Tom and I also discussed a path forward for comparing the set schema and the SOS schema in the database by storing the set digest or set SHA hash in the SOS database (#1532 (comment)). This will catch any differences between the SOS schema definition in the database and the set schema definition without going through the whole set.

Hi @nichamon, are you saying this logic already exists? If so, it does not work: store_sos does crash ldmsd when the situation I described occurs (and I've seen this behavior for at least a year now).

I like the idea of storing the set digest in the SOS database and doing a check on that.

@nichamon (Collaborator)

> Hi @nichamon, are you saying this logic already exists? If so, it does not work: store_sos does crash ldmsd when the situation I described occurs (and I've seen this behavior for at least a year now).

@bschwal Let me clarify the current and planned behavior:

Current behavior:

  • store_sos does check for schema mismatches in two ways:
    • It verifies that each metric type matches its corresponding SOS attribute type (this crashes via an assertion).
    • It checks whether there are more metrics than SOS attributes (this reports an error but does not crash).

However, when new metrics are inserted between existing ones, the type-mismatch check can trigger first, leading to a crash before the count difference is detected and safely reported.
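
To see why the ordering matters, consider this schematic (not the actual store_sos code) of a per-metric type check running ahead of the count check:

```c
#include <assert.h>
#include <stddef.h>

/* Schematic of the ordering problem. The arrays stand in for the set's
 * metric types and the SOS schema's attribute types. */
static void store_loop(const int *metric_types, size_t metric_count,
                       const int *attr_types, size_t attr_count)
{
	for (size_t i = 0; i < metric_count; i++) {
		/* If a metric was inserted mid-set, every metric after it
		 * lines up against the wrong attribute and this fires
		 * (and for i >= attr_count the read is out of bounds): */
		assert(metric_types[i] == attr_types[i]);
	}
	/* Only here would a count comparison notice the extra metric,
	 * too late to prevent the assertion above. */
	assert(metric_count <= attr_count);
}
```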

Proposed solution:

Once #1528 and this PR #1532 are merged, you won't see these crashes anymore.

@bschwal (Collaborator) commented Nov 25, 2024

Sounds great to me, thanks all!

@nichamon added this to the v4.4.5 milestone on Nov 26, 2024
@tom95858 (Collaborator, Author)

@nichamon, @bschwal I don't think this patch actually does anything useful. @nichamon recommends we save the LDMS set digest somewhere in the SOS schema. I will look at doing this.

@tom95858 closed this on Nov 27, 2024