Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix ConfigManagingActor raising unhandled exceptions when file doesn't exist #1116

Conversation

ela-kotulska-frequenz
Copy link
Contributor

@ela-kotulska-frequenz ela-kotulska-frequenz commented Nov 25, 2024

This was detected when I edited config file with vim and config.toml.swp file was created and deleted.

@ela-kotulska-frequenz ela-kotulska-frequenz added type:bug Something isn't working part:config Affects the configuration management labels Nov 25, 2024
@ela-kotulska-frequenz ela-kotulska-frequenz self-assigned this Nov 25, 2024
@ela-kotulska-frequenz ela-kotulska-frequenz requested a review from a team as a code owner November 25, 2024 11:07
@ela-kotulska-frequenz ela-kotulska-frequenz requested review from daniel-zullo-frequenz and removed request for a team November 25, 2024 11:07
@github-actions github-actions bot added the part:docs Affects the documentation label Nov 25, 2024
@ela-kotulska-frequenz ela-kotulska-frequenz added the priority:high Address this as soon as possible label Nov 25, 2024
Copy link
Contributor

@llucax llucax left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, good catch!

I think it would be good to add a test case for this, as it is not trivial to see why DELETE is handled separately,, but maybe we need to introduce a integration test, not sure if it would be easy to do with mocking.

src/frequenz/sdk/config/_config_managing.py Show resolved Hide resolved
@llucax llucax added this to the v1.0.0-rc1302 milestone Nov 25, 2024
@ela-kotulska-frequenz
Copy link
Contributor Author

ela-kotulska-frequenz commented Nov 26, 2024

Just found few more problems and fixed.

I am still thinking how to fix @llucax comments and will continue working on it today.

@llucax
Copy link
Contributor

llucax commented Nov 26, 2024

I am still thinking how to fix @llucax comments and will continue working on it today.

@ela-kotulska-frequenz let me know what do you think about completely ignoring DELETE for now. I think that should be super simple, it won't complicate the code logic, should fix the issue with minimal changes and only with a small downside (not logging deletes). If we go for it I would open a new issue about this and add a comment in the code explaining why we are ignoring DELETE and a link to the issue.

@ela-kotulska-frequenz
Copy link
Contributor Author

let me know what do you think about completely ignoring DELETE for now.

I agree! :) Will do proper changes.

@github-actions github-actions bot added the part:tests Affects the unit, integration and performance (benchmarks) tests label Nov 26, 2024
@ela-kotulska-frequenz
Copy link
Contributor Author

ela-kotulska-frequenz commented Nov 26, 2024

  • updated all commits and changed order to make PR more readable.
  • wrote unit tests. FileWatcher is well tested (in FileWatcher's repo), so I mocked it to tests scenario when files are deleted.
  • @llucax in new unit tests I found another potential bug: Deletion of config file is ignored and config are not updated. It worked like this before and I left it like this. Should it work like this?

@ela-kotulska-frequenz ela-kotulska-frequenz force-pushed the config_managing_actor branch 2 times, most recently from 3ce7400 to 26ce319 Compare November 26, 2024 16:27
@llucax
Copy link
Contributor

llucax commented Nov 27, 2024

@llucax in new unit tests I found another potential bug: Deletion of config file is ignored and config are not updated. It worked like this before and I left it like this. Should it work like this?

I think so, I think a complete and permanent deletion of the config file should be an error, and when facing that error I think the best option is to continue with the old config, like we do with other errors. I don't see any valid reason to remove a config file completely and permanently. If the file is created again we should be notified and pick up the changes. If we are not notified about the re-creation, then we have a problem, but maybe we can test that (if it is not tested already).

Copy link
Contributor

@llucax llucax left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Damn, this got really tricky. Sorry for not being more clear about removing DELETE from the events we watch. I think doing that could simplify the code even further.

src/frequenz/sdk/config/_config_managing.py Outdated Show resolved Hide resolved
src/frequenz/sdk/config/_config_managing.py Outdated Show resolved Hide resolved
Comment on lines 166 to 135
except FileNotFoundError:
# It is ok for config file to don't exist.
_logger.info(
"%s: Config file %s not found. Ignoring it.", self, config_path
)
error_count += 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mmm, interesting. Why not an error? I also wonder if we shouldn't do this actually for any exception raised while reading, parsing and validating the config files. The goal of this is to be as resilient as possible in case of a partial failure (we can only read part of the config. And it is really tricky, sometimes I think if one config file read failed, we should probably skip the update completely for all files, as we might get into an inconsistent config.

I'm starting to feel maybe it would be better to restart the actor if there is an issue reading any of the requested files (i.e. requiring users to create at least and empty config for the files that are requested in the constructor).

But maybe this is a topic for a different PR. But I would make this a _logger.error() in this PR. And maybe catch all exceptions, not just ValueError and FileNotFound (at least all OSError, not just FileNotFound as well as any errors that could be railed by tomlib.load(), maybe it only raises ValueError).

Copy link
Contributor Author

@ela-kotulska-frequenz ela-kotulska-frequenz Nov 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently it works like this:
Every time any config file change - we read all config files. But not all config files might exists and this cover this case.

Luca, this PR was to fix existing bugs, not redesign ConfigManger. I guess it needs redesign and rethinking but maybe in separate PR and after we agree how it should work.

Copy link
Contributor Author

@ela-kotulska-frequenz ela-kotulska-frequenz Nov 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So here I log info, if any config file is doesn't exist.
And (code 3 lines below) raises ValueError if no config file exist. (at least one must exist)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Luca, this PR was to fix existing bugs, not redesign ConfigManger. I guess it needs redesign and rethinking but maybe in separate PR and after we agree how it should work.

Yes, 100% agreed, this is also why I said re-evaluating what to do on a failure to read a config file is for another PR.

But catching other types of errors is still a bug that IMHO is in scope for this PR. Since you are catching a new FileNotFound why not just catching all OSErrors?

So here I log info, if any config file is doesn't exist.
And (code 3 lines below) raises ValueError if no config file exist. (at least one must exist)

Yeah, but that's only if we can't read any config file, if only one failed, no ValueError will be raised, and the missing file won't be noticed if we are only monitoring WARNING and up. Also it is more consistent to how we handle the TOML parsing ValueError. We log an error if TOML parsing fails. I don't understand why the inconsistency of making one issue emit an error log and the new one just a info.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

@ela-kotulska-frequenz ela-kotulska-frequenz Nov 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok the whole discussion was because I had impression that having not all config files is not an error.
But it looks like it is unexpected so as you suggested I changed severity of this logging to error. :)
Sorry for my misunderstanding!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No it is OK, originally it was thought like that, only one config file should be enough, but I think that's assuming too much, and is hard to know if missing files are errors or expected, so I opened the new issue to change that in the future.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see in the last commit you are changing this. I would not address the new issue (#1120) in this PR, I think it needs a bit more thought and it is not strictly related to the bug fixes in this PR.

src/frequenz/sdk/config/_config_managing.py Outdated Show resolved Hide resolved
src/frequenz/sdk/config/_config_managing.py Outdated Show resolved Hide resolved
Copy link
Contributor

@llucax llucax left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made some completely new comments to make it more clear what should be addressed in this PR, the rest if for a future PR, I already created an issue for that.

Besides the new comment, for me it is still not clear why it is better to create the file watcher in the constructor and not the _run().

src/frequenz/sdk/config/_config_managing.py Outdated Show resolved Hide resolved
src/frequenz/sdk/config/_config_managing.py Outdated Show resolved Hide resolved
src/frequenz/sdk/config/_config_managing.py Outdated Show resolved Hide resolved
Becasuse:
* not all files in self._config_paths might exists
* catching the watchfiles signal and file reading happen
in separate async thread and the file could be removed
between signal detection and file reading.

This prevents unexpected actor crash.

Signed-off-by: Elzbieta Kotulska <[email protected]>
`event.path.samefile` raises error if:
 * `event.pathq` doesn't exists
 * path `p` doesn't exists
In this case it raised exception:
 * if any file in parent directory was deleted
 * any path in self._config_paths didn't exist

Signed-off-by: Elzbieta Kotulska <[email protected]>
Signed-off-by: Elzbieta Kotulska <[email protected]>
Signed-off-by: Elzbieta Kotulska <[email protected]>
@ela-kotulska-frequenz ela-kotulska-frequenz force-pushed the config_managing_actor branch 2 times, most recently from 791e369 to 5acf147 Compare November 28, 2024 12:21
Eliminate recursive actor crashes when config files are missing.
The previous implementation suffered from a critical error pattern:
- The `_read_config` method is called at the start of the `_run` method
  and on FileWatcher event.
- When no config files existed, an exception would be raised
- This exception caused the actor to crash and immediately restart
- Restarting triggered the same `_read_config` method
- The cycle repeated, creating a persistent crash loop

This fix introduces a more robust approach:
- Detect missing config files without throwing exceptions
- Set up a FileWatcher to monitor for future config file creation
- call `_read_config` method as soon as any config file is crated.

Signed-off-by: Elzbieta Kotulska <[email protected]>
@llucax llucax modified the milestones: v1.0.0-rc1302, v1.0.0-rc1400 Nov 28, 2024
@llucax llucax added the scope:breaking-change Breaking change, users will need to update their code label Nov 28, 2024
@llucax
Copy link
Contributor

llucax commented Nov 28, 2024

Adding label for breaking changes and assigning to rc1400 as this now breaks the API, but we can do a quick release after it is merged anyway.

Copy link
Contributor

@llucax llucax left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Approving.

I'm still undecided if it wouldn't be better to remove the last commit and leave that to be fixed separate after some more thought. But I think it is OK to move forward as it is too and reconsider in the future when #1120 is addressed.

@ela-kotulska-frequenz ela-kotulska-frequenz added this pull request to the merge queue Nov 28, 2024
Merged via the queue into frequenz-floss:v1.x.x with commit 29e4d67 Nov 28, 2024
19 checks passed
@ela-kotulska-frequenz ela-kotulska-frequenz deleted the config_managing_actor branch November 28, 2024 21:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
part:config Affects the configuration management part:docs Affects the documentation part:tests Affects the unit, integration and performance (benchmarks) tests priority:high Address this as soon as possible scope:breaking-change Breaking change, users will need to update their code type:bug Something isn't working
Projects
Development

Successfully merging this pull request may close these issues.

2 participants