Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DPE-6341] Upgrade to 24.04 base and etcd v3.5.18 #32

Merged
merged 17 commits into from
Feb 5, 2025

Conversation

reneradoi
Copy link
Collaborator

@reneradoi reneradoi commented Jan 30, 2025

This PR upgrades the etcd operator to run on Ubuntu 24.04 base image and deploy etcd in version 3.5.18.

Besides those changes, it includes the following adjustments / fixes:

  • ClusterManager: cluster membership operation cannot be executed against any cluster member anymore in etcd 3.5.18, use all cluster endpoints instead
  • Integration tests: ops_test.build_charm is deprecated for 24.04 in data-platform-workflows (see deprecation notice here). Therefore we now build the charm outside of the tests when running the tests locally (see here), and assume the charm file exists in the tests.
  • Integration test for scaling: Avoid instability in CI run by setting raise_on_error=False in wait_for_idle when scaling down. The member remove operation might result in errors on the storage_detaching hook if the member cannot be removed at the specific moment (desired behaviour). The operation might succeed on the next run, but the test would fail then.

carlcsaposs-canonical and others added 16 commits January 22, 2025 11:08
Fixes issue where files in charm directory conflict with files created
by poetry plugin during staging
The continuous writes function in the integration test suite needs to be
able to run independently. Currently it is executed via the `etcdctl`
command being run in one of the deployed juju units. This is problematic
as we might want to remove this specific unit as part of the scaling/HA
tests.

This PR adjusts integration test setup and installs the deb-package
`etcd-client` on the host machine. The continuous writes function is
then run with this client's `etcdctl` and thus not depending on
execution in specific units anymore.

The function is still kept very simple:
- It needs to be started with `start_continuous_writes(endpoints=...)`
and stopped with `stop_continuous_writes()`.
- It can be started with multiple endpoints though and persists through
changes of Raft or Juju leadership.
- Use the `assert_continuous_writes_increasing` and
`assert_continuous_writes_consistent` helpers to ensure that the writes
continued throughout any cluster/HA related operation.

**Note:**
There is no currently maintained Python client lib for etcd (see
[issue](etcd-io/etcd#13285)), that's why we
keep using `etcdctl` to connect to etcd on each write. Testing with
several of the available, unsupported clients failed because they either
don't support authentication or cannot be installed to due
incompatibilities with protobuf.
This pull request added CA rotation to the charm.

### TLS Certificate Management Enhancements:

* **New TLS CA Rotation States:**
- Added `TLSCARotationState` enum to represent different stages of CA
rotation (`src/literals.py`).

* **CA Rotation and Cleanup Methods:**
- Introduced methods in `TLSManager` for setting CA rotation state,
checking CA status across servers, and cleaning up old CAs
(`src/managers/tls.py`).
[[1]](diffhunk://#diff-a2c036f44ae9bbab595aad84b23bc3106a342c5ab12f7b0cf3d26268b51cdc83R136-R204)
[[2]](diffhunk://#diff-a2c036f44ae9bbab595aad84b23bc3106a342c5ab12f7b0cf3d26268b51cdc83L57-R102)

* **Event Handling Updates:**
- Added `CleanCAEvent` to handle CA cleanup events and integrated it
into the `TLSEvents` class (`src/events/tls.py`).
[[1]](diffhunk://#diff-3dfb4a104b407f7d4b2e07795a63745a09cde228a8dbe86602e7b72ea600fbe4L15-R57)
[[2]](diffhunk://#diff-3dfb4a104b407f7d4b2e07795a63745a09cde228a8dbe86602e7b72ea600fbe4R88-R89)
[[3]](diffhunk://#diff-3dfb4a104b407f7d4b2e07795a63745a09cde228a8dbe86602e7b72ea600fbe4R229-R243)

### Codebase Improvements:

* **Logging Enhancements:**
- Improved logging messages to provide more context during rolling
restarts and CA rotations (`src/charm.py`).
[[1]](diffhunk://#diff-b9ed39bbc9c0387bd3e07da31d13373745534a1cd723d3e292c73496b12e307cL68-R78)
[[2]](diffhunk://#diff-b9ed39bbc9c0387bd3e07da31d13373745534a1cd723d3e292c73496b12e307cR178-R202)
This PR adds scale-down functionality to the etcd operator. It includes
the following changes:

**EtcdEvents**
- new methods for handling the `RelationDepartedEvent` and
`StorageDetachingEvent`

**ClusterManager**
- refactoring of the `get_leader()` method (now `leader` property) to
return the `leader id` instead of the `endpoint` of the leader
- a new method for selecting a new Raft leader from the remaining
cluster members before removing a cluster member
- a new method `remove_member()` which makes sure a) the cluster is
always available even if removing the Raft leader and b) the unit
removal is blocked until the cluster could be reconfigured or errors out
in case it can't

**EtcdClient**
- new method `remove_member()` to reconfigure the etcd cluster without
the departing unit
- new method `move_leader()` to transfer Raft leadership from in the
etcd cluster to a different unit
- remove duplicate arguments for the `_run_etcdctl()` method
(accidentally introduced in
[PR#13](#13))

**TLSEvents**
- an update to the `_on_certificates_broken` handler to only initiate a
rolling restart if the respective unit is started and not departing soon

**Workload**
- new method `stop()` to stop a snap service

**Test Coverage**
Integration tests in `test_scaling.py` for
- removing one unit
- removing the Raft leader
- removing multiple units
- scaling to zero and back up
- remove the Juju leader
- remove the application

Unit tests in `test_charm.py` for the `storage_detaching` handler

**Integration Test Helpers**
- a new method `get_raft_leader()` returning the member-name of the Raft
leader in an etcd cluster
- a new method `wait_for_cluster_formation()` that waits until all
cluster members have been promoted to full-voting member
- `get_application_relation_data()` taken over from other integration
test suits to check for peer relation data in integration tests
- a fix to the `continuous_writes` for cleanup of previous runs in the
same etcd cluster
…ember anymore, use all cluster endpoints instead
@reneradoi reneradoi changed the title [DPE-6341] Upgrade to 24.04 base [DPE-6341] Upgrade to 24.04 base and etcd v3.5.18 Jan 31, 2025
@reneradoi reneradoi marked this pull request as ready for review January 31, 2025 15:51
@reneradoi reneradoi changed the base branch from main to 3.5/edge February 3, 2025 07:14
# Conflicts:
#	src/managers/cluster.py
#	tests/integration/ha/test_scaling.py
#	tests/integration/test_charm.py
#	tests/integration/tls/test_ca_rotation.py
#	tests/integration/tls/test_tls.py
#	tox.ini
Copy link
Contributor

@Mehdi-Bendriss Mehdi-Bendriss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks René, I have a question / concern on raise_on_error

Copy link
Contributor

@skourta skourta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thank you @reneradoi

@reneradoi reneradoi merged commit cd9df9d into 3.5/edge Feb 5, 2025
11 of 12 checks passed
@reneradoi reneradoi deleted the upgrade-to-2404-base branch February 5, 2025 07:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants