[LocalNet] Add sequencer block persistence #277

okdas · 2023-12-15T01:07:32Z

Summary

Human Summary

We have a need for preserving blockchain history between restarts during the development process. This PR enables this. Additionally, it partially resolves a persistent issue where the sequencer experiences prolonged restart times, leading to scenarios where a Kubernetes Persistent Volume Claim (PVC) becomes stuck.

AI Summary

Summary generated by Reviewpad on 11 Jan 24 20:00 UTC

This pull request includes the following changes:

In the .github/label-actions.yml file:
- Added push-image label to the devnet-test-e2e label actions.
- Updated the comment for the devnet-test-e2e label action to include instructions for pushing another commit to produce a container image for DevNet.
In the Tiltfile file:
- Added a new key-value pair in the localnet_config_defaults variable for the sequencer property.
- Added a new YAML file sequencer-volume.yaml for creating a persistent volume claim for the sequencer.
In the cmd/signals/on_exit.go file:
- Updated the signal to listen for SIGTERM instead of os.Kill.
In the docs/static/openapi.yml file:
- Modified several titles to improve readability.
- Added a new enum value for the rpc.options.configs property, REST.
- Added a new values-sequencer.yaml file for the sequencer's values.
Added a new YAML file sequencer-volume.yaml for creating a persistent volume claim for the sequencer.

Note: The diff contains both additions and modifications to different files.

Type of change

Select one or more:

Testing

Run all unit tests: make go_develop_and_test
Run E2E tests locally: make test_e2e
Run E2E tests on DevNet: Add the devnet-test-e2e label to the PR. This is VERY expensive, only do it after all the reviews are complete.

Sanity Checklist

I have tested my changes using the available tooling
I have performed a self-review of my own code
I have commented my code, updated documentation and left TODOs throughout the codebase

okdas · 2023-12-15T01:09:20Z

cmd/signals/on_exit.go

@@ -12,7 +13,7 @@ func GoOnExitSignal(onInterrupt func()) {
 		// Set up sigCh to receive when this process receives an interrupt or
 		// kill signal.
 		sigCh := make(chan os.Signal, 1)
-		signal.Notify(sigCh, os.Interrupt, os.Kill)
+		signal.Notify(sigCh, os.Interrupt, syscall.SIGTERM)


@bryanchriswhite I was troubleshooting sequencer hot reloading issues, and ran into that: https://staticcheck.dev/docs/checks/#SA1016.

I only later realized this code is not used by the sequencer, so I can roll this change back.

okdas · 2023-12-15T01:47:44Z

With that PR we can get persistence across restarts. One can change localnet_config.yaml setting:

sequencer:
  cleanupBeforeEachStart: false

Unfortunately, this setting causes issues. The old process fails to close the database, leading to a situation where a new process starts before the old one has properly terminated. Consequently, the following error is logged:

Error: failed to initialize database: resource temporarily unavailable

A primary reason for this issue is the lack of shutdown or graceful shutdown logic in the sequencer. When it receives a SIGTERM signal (signal 15), the sequencer does not attempt to properly terminate the blockchain, Tendermint, or the database processes. Kubernetes, after waiting for a grace period (default 30 seconds), forcibly terminates the process with a SIGKILL (signal 9).

This inability to gracefully shut down also impedes the functionality of hot-reloading for the sequencer.

@bryanchriswhite, considering your experience in implementing graceful shutdowns for appgateserver and relayminer, could you provide any insights on achieving this for the sequencer? If not, we can consult the Rollkit team. I've attempted to resolve this issue but have been unsuccessful.

Olshansk · 2023-12-15T23:17:05Z

Error: failed to initialize database: resource temporarily unavailable

I'm pretty sure this is a Tendermint/CometBFT error.

Other projects (example here, example2) hit the exact same error.

Any solution we come up with will look like (IMO) a hack in one way or another.

@okdas Can you think of a place in the infrastructure where (maybe a docker dev container?) where we do a ps aux | grep <...> | xargs kill -9 to force kill it in our env environments?

If not, we can consult the Rollkit team. I've attempted to resolve this issue but have been unsuccessful.

Let's do this IN PARALLEL to trying to figure out a solution ourselves. Feel free to even just post a link/screenshot to this PR.

Olshansk · 2024-01-05T16:00:38Z

@okdas Following up with feedback from the rollkit team on this.

okdas · 2024-01-10T00:58:24Z

@Olshansk, thanks, but this is a solution to a different problem. The current concern is: Sequencer is unable to shut down gracefully. It can only be shut down with kill -9.

We already provision new storage for a new sequencer when we don't want to persist state between restarts (cleanupBeforeEachStart is true, default behavior).

However, when we want the data to persist between restarts, we don't want to cleanup the datadir, randomize the datadir name, nor do we want to kill -9 the process (because that could corrupt the database). The error "Error: failed to initialize database: resource temporarily unavailable" ONLY appears when we manually restart in Tilt UI (and we have to do that manually due to lack of graceful shutdown).

I say we merge this as is, but let's create another ticket to fix graceful shutdown. Wdyt?

UPDATE: I was just able to start process running on my mac with 15 signal just fine. I am looking into why the process does not stop when running inside Kubernetes Pod.

okdas · 2024-01-11T01:55:57Z

OK, the issue with the DB being locked is addressed by pokt-network/helm-charts#23.

It was a surprise to me, but I've discovered that sequencer hot reloading does not work due to tilt-dev/tilt-extensions#391.

Which, I think, is actually not a big deal because with changes to the helm chart, plus changes included in that PR, it becomes painless to restart the sequencer via the UI. No hang-ups anymore. I can think of a workaround - we used a different mechanism in the old Pocket v1 repo, which worked great, but let's do that if we actually need hot reloading of the sequencer.

This PR is good to go!

Olshansk · 2024-01-11T16:12:45Z

OK, the issue with the DB being locked is addressed by pokt-network/helm-charts#23.

Q1. I saw some work you did around the volume refactoring and cleanup-datadir, but does this mean that we're not going with the "random datadir" and rou figured out a way to make it deterministic? I simply want to understand and potentially #PUC if there's an opportunity for it.

It was a surprise to me, but I've discovered that sequencer hot reloading does not work due to tilt-dev/tilt-extensions#391.

Q2. This is something @red-0ne and I found an inconsistency in. Sometimes it works and sometimes it doesn't but I haven't done a formal deep dive. Can you confirm the following:

Hot reloading DOES WORK if we click restart in the UI
Hot reloading DOES NOT WORK if we change code, see the image rebuild, but do not click restart in the UI

If this is the case, let's create a ticket to fix it. It is very confusing (especially if you don't know how the infrastructure works) and create a discrepancy between expectation and what one is seeing as a new developer.

Olshansk

@okdas Let's merge with main and run e2e tests before merging this in. Also, please see the comment I left, I would say it's quite important for developer productivity.

github-actions · 2024-01-11T19:58:01Z

The CI will now also run the e2e tests on devnet, which increases the time it takes to complete all CI checks.

okdas · 2024-01-11T20:51:13Z

Q1. I saw some work you did around the volume refactoring and cleanup-datadir, but does this mean that we're not going with the "random datadir" and you figured out a way to make it deterministic? I simply want to understand and potentially #PUC if there's an opportunity for it.

The problem appeared only when LocalNet was configured to preserve the blocks across restarts. The root of the problem was that the sequencer was not stopping in time before the new process started. It appeared this was happening due to the sequencer not receiving the signal to terminate the process. After moving some scripts around (which I wanted to do anyway), that issue disappeared. The sequencer now gets PID 1 inside the container and shuts down instantly when the Pod terminates.

Q2. This is something @red-0ne and I found an inconsistency in. Sometimes it works and sometimes it doesn't, but I haven't done a formal deep dive. Can you confirm the following:
* Hot reloading DOES WORK if we click `restart` in the UI

* Hot reloading DOES NOT WORK if we change code, see the image rebuild, but do not click `restart` in the UI
If this is the case, let's create a ticket to fix it. It is very confusing (especially if you don't know how the infrastructure works) and creates a discrepancy between expectation and what one is seeing as a new developer.

Hot reloading is an automatic restart after a code change. It does not work currently, due to how Tilt utilizes Helm (there are a couple of ways to use Helm charts in Tilt - and the old repo and new repo use different Tilt extensions).

Clicking the "restart" button is not hot reloading. But it does work. We've had issues with sequencer restarts in the UI - they were slow, and sometimes required multiple clicks. The root of the problem is described above, and this PR (along with helm chart changes) solves the issue.

Btw, I also changed a message bot posts after the devnet-test-e2e label is added to the PR to clarify the other issue we've run into today.

Olshansk

Let's merge this in, but I still want to understand if we have an open ticket to resolve hot reloading or not?

okdas · 2024-01-17T01:20:13Z

Let's merge this in, but I still want to understand if we have an open ticket to resolve hot reloading or not?

There is now: #325

* add localnet persistance * bump local-celestia-devnet * rollback celestia container * change messages

add localnet persistance

887ee28

okdas added the infra Infra or tooling related improvements, additions or fixes label Dec 15, 2023

okdas added this to the Shannon TestNet milestone Dec 15, 2023

okdas self-assigned this Dec 15, 2023

okdas commented Dec 15, 2023

View reviewed changes

okdas changed the title ~~Add localnet persistance~~ Add LocalNet persistence Dec 15, 2023

okdas changed the title ~~Add LocalNet persistence~~ [LocalNet] Add sequencer persistence Dec 15, 2023

okdas changed the title ~~[LocalNet] Add sequencer persistence~~ [LocalNet] Add sequencer block persistence Dec 15, 2023

okdas and others added 3 commits January 9, 2024 16:58

Merge branch 'main' into dk-localnet-persistance

4fee782

bump local-celestia-devnet

bc11efe

rollback celestia container

2c84c20

okdas marked this pull request as ready for review January 10, 2024 01:42

Olshansk requested changes Jan 11, 2024

View reviewed changes

Merge branch 'main' into dk-localnet-persistance

78acdda

okdas added the devnet-test-e2e label Jan 11, 2024

github-actions bot added devnet push-image CI related - pushes images to ghcr.io labels Jan 11, 2024

change messages

1b33513

Olshansk approved these changes Jan 12, 2024

View reviewed changes

okdas mentioned this pull request Jan 17, 2024

[LocalNet] Fix sequencer hot reloading #325

Closed

5 tasks

okdas merged commit af93def into main Jan 17, 2024
10 checks passed

okdas deleted the dk-localnet-persistance branch January 17, 2024 01:20

Olshansk pushed a commit that referenced this pull request Jan 18, 2024

[LocalNet] Add sequencer block persistence (#277)

d95e1f7

* add localnet persistance * bump local-celestia-devnet * rollback celestia container * change messages

bryanchriswhite removed push-image CI related - pushes images to ghcr.io devnet-test-e2e labels May 16, 2024

github-actions bot removed the devnet label May 16, 2024

okdas added a commit that referenced this pull request Nov 14, 2024

[LocalNet] Add sequencer block persistence (#277)

624d67f

* add localnet persistance * bump local-celestia-devnet * rollback celestia container * change messages

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[LocalNet] Add sequencer block persistence #277

[LocalNet] Add sequencer block persistence #277

okdas commented Dec 15, 2023 •

edited by reviewpad bot

Loading

okdas Dec 15, 2023 •

edited

Loading

okdas commented Dec 15, 2023

Olshansk commented Dec 15, 2023

Olshansk commented Jan 5, 2024

okdas commented Jan 10, 2024 •

edited

Loading

okdas commented Jan 11, 2024

Olshansk commented Jan 11, 2024

Olshansk left a comment

github-actions bot commented Jan 11, 2024

okdas commented Jan 11, 2024

Olshansk left a comment

okdas commented Jan 17, 2024

[LocalNet] Add sequencer block persistence #277

[LocalNet] Add sequencer block persistence #277

Conversation

okdas commented Dec 15, 2023 • edited by reviewpad bot Loading

Summary

Human Summary

AI Summary

Summary generated by Reviewpad on 11 Jan 24 20:00 UTC

Type of change

Testing

Sanity Checklist

okdas Dec 15, 2023 • edited Loading

Choose a reason for hiding this comment

okdas commented Dec 15, 2023

Olshansk commented Dec 15, 2023

Olshansk commented Jan 5, 2024

okdas commented Jan 10, 2024 • edited Loading

okdas commented Jan 11, 2024

Olshansk commented Jan 11, 2024

Olshansk left a comment

Choose a reason for hiding this comment

github-actions bot commented Jan 11, 2024

okdas commented Jan 11, 2024

Olshansk left a comment

Choose a reason for hiding this comment

okdas commented Jan 17, 2024

okdas commented Dec 15, 2023 •

edited by reviewpad bot

Loading

okdas Dec 15, 2023 •

edited

Loading

okdas commented Jan 10, 2024 •

edited

Loading