-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[LocalNet] Add sequencer block persistence #277
Conversation
@@ -12,7 +13,7 @@ func GoOnExitSignal(onInterrupt func()) { | |||
// Set up sigCh to receive when this process receives an interrupt or | |||
// kill signal. | |||
sigCh := make(chan os.Signal, 1) | |||
signal.Notify(sigCh, os.Interrupt, os.Kill) | |||
signal.Notify(sigCh, os.Interrupt, syscall.SIGTERM) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bryanchriswhite I was troubleshooting sequencer hot reloading issues, and ran into that: https://staticcheck.dev/docs/checks/#SA1016.
I only later realized this code is not used by the sequencer, so I can roll this change back.
With that PR we can get persistence across restarts. One can change sequencer:
cleanupBeforeEachStart: false Unfortunately, this setting causes issues. The old process fails to close the database, leading to a situation where a new process starts before the old one has properly terminated. Consequently, the following error is logged:
A primary reason for this issue is the lack of shutdown or graceful shutdown logic in the sequencer. When it receives a SIGTERM signal (signal 15), the sequencer does not attempt to properly terminate the blockchain, Tendermint, or the database processes. Kubernetes, after waiting for a grace period (default 30 seconds), forcibly terminates the process with a SIGKILL (signal 9). This inability to gracefully shut down also impedes the functionality of hot-reloading for the sequencer. @bryanchriswhite, considering your experience in implementing graceful shutdowns for appgateserver and relayminer, could you provide any insights on achieving this for the sequencer? If not, we can consult the Rollkit team. I've attempted to resolve this issue but have been unsuccessful. |
I'm pretty sure this is a Tendermint/CometBFT error. Other projects (example here, example2) hit the exact same error. Any solution we come up with will look like (IMO) a hack in one way or another. @okdas Can you think of a place in the infrastructure where (maybe a docker dev container?) where we do a
Let's do this IN PARALLEL to trying to figure out a solution ourselves. Feel free to even just post a link/screenshot to this PR. |
@okdas Following up with feedback from the rollkit team on this. |
@Olshansk, thanks, but this is a solution to a different problem. The current concern is: Sequencer is unable to shut down gracefully. It can only be shut down with We already provision new storage for a new sequencer when we don't want to persist state between restarts ( However, when we want the data to persist between restarts, we don't want to cleanup the datadir, randomize the datadir name, nor do we want to I say we merge this as is, but let's create another ticket to fix graceful shutdown. Wdyt? UPDATE: I was just able to start process running on my mac with 15 signal just fine. I am looking into why the process does not stop when running inside Kubernetes Pod. |
OK, the issue with the DB being locked is addressed by pokt-network/helm-charts#23. It was a surprise to me, but I've discovered that sequencer hot reloading does not work due to tilt-dev/tilt-extensions#391. Which, I think, is actually not a big deal because with changes to the helm chart, plus changes included in that PR, it becomes painless to restart the sequencer via the UI. No hang-ups anymore. I can think of a workaround - we used a different mechanism in the old Pocket v1 repo, which worked great, but let's do that if we actually need hot reloading of the sequencer. This PR is good to go! |
Q1. I saw some work you did around the volume refactoring and
Q2. This is something @red-0ne and I found an inconsistency in. Sometimes it works and sometimes it doesn't but I haven't done a formal deep dive. Can you confirm the following:
If this is the case, let's create a ticket to fix it. It is very confusing (especially if you don't know how the infrastructure works) and create a discrepancy between expectation and what one is seeing as a new developer. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@okdas Let's merge with main and run e2e tests before merging this in. Also, please see the comment I left, I would say it's quite important for developer productivity.
The CI will now also run the e2e tests on devnet, which increases the time it takes to complete all CI checks. |
The problem appeared only when LocalNet was configured to preserve the blocks across restarts. The root of the problem was that the sequencer was not stopping in time before the new process started. It appeared this was happening due to the sequencer not receiving the signal to terminate the process. After moving some scripts around (which I wanted to do anyway), that issue disappeared. The sequencer now gets PID 1 inside the container and shuts down instantly when the Pod terminates.
Hot reloading is an automatic restart after a code change. It does not work currently, due to how Tilt utilizes Helm (there are a couple of ways to use Helm charts in Tilt - and the old repo and new repo use different Tilt extensions). Clicking the "restart" button is not hot reloading. But it does work. We've had issues with sequencer restarts in the UI - they were slow, and sometimes required multiple clicks. The root of the problem is described above, and this PR (along with helm chart changes) solves the issue. Btw, I also changed a message bot posts after the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's merge this in, but I still want to understand if we have an open ticket to resolve hot reloading
or not?
There is now: #325 |
* add localnet persistance * bump local-celestia-devnet * rollback celestia container * change messages
* add localnet persistance * bump local-celestia-devnet * rollback celestia container * change messages
Summary
Human Summary
We have a need for preserving blockchain history between restarts during the development process. This PR enables this. Additionally, it partially resolves a persistent issue where the sequencer experiences prolonged restart times, leading to scenarios where a Kubernetes Persistent Volume Claim (PVC) becomes stuck.
AI Summary
Summary generated by Reviewpad on 11 Jan 24 20:00 UTC
This pull request includes the following changes:
In the
.github/label-actions.yml
file:push-image
label to thedevnet-test-e2e
label actions.devnet-test-e2e
label action to include instructions for pushing another commit to produce a container image for DevNet.In the
Tiltfile
file:localnet_config_defaults
variable for thesequencer
property.sequencer-volume.yaml
for creating a persistent volume claim for the sequencer.In the
cmd/signals/on_exit.go
file:SIGTERM
instead ofos.Kill
.In the
docs/static/openapi.yml
file:rpc.options.configs
property,REST
.values-sequencer.yaml
file for the sequencer's values.Added a new YAML file
sequencer-volume.yaml
for creating a persistent volume claim for the sequencer.Note: The diff contains both additions and modifications to different files.
Type of change
Select one or more:
Testing
make go_develop_and_test
make test_e2e
devnet-test-e2e
label to the PR. This is VERY expensive, only do it after all the reviews are complete.Sanity Checklist