Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[LocalNet] Add sequencer block persistence #277

Merged
merged 6 commits into from
Jan 17, 2024
Merged

Conversation

okdas
Copy link
Member

@okdas okdas commented Dec 15, 2023

Summary

Human Summary

We have a need for preserving blockchain history between restarts during the development process. This PR enables this. Additionally, it partially resolves a persistent issue where the sequencer experiences prolonged restart times, leading to scenarios where a Kubernetes Persistent Volume Claim (PVC) becomes stuck.

AI Summary

Summary generated by Reviewpad on 11 Jan 24 20:00 UTC

This pull request includes the following changes:

  1. In the .github/label-actions.yml file:

    • Added push-image label to the devnet-test-e2e label actions.
    • Updated the comment for the devnet-test-e2e label action to include instructions for pushing another commit to produce a container image for DevNet.
  2. In the Tiltfile file:

    • Added a new key-value pair in the localnet_config_defaults variable for the sequencer property.
    • Added a new YAML file sequencer-volume.yaml for creating a persistent volume claim for the sequencer.
  3. In the cmd/signals/on_exit.go file:

    • Updated the signal to listen for SIGTERM instead of os.Kill.
  4. In the docs/static/openapi.yml file:

    • Modified several titles to improve readability.
    • Added a new enum value for the rpc.options.configs property, REST.
    • Added a new values-sequencer.yaml file for the sequencer's values.
  5. Added a new YAML file sequencer-volume.yaml for creating a persistent volume claim for the sequencer.

Note: The diff contains both additions and modifications to different files.

Type of change

Select one or more:

  • New feature, functionality or library
  • Bug fix
  • Code health or cleanup
  • Documentation
  • Other (specify)

Testing

  • Run all unit tests: make go_develop_and_test
  • Run E2E tests locally: make test_e2e
  • Run E2E tests on DevNet: Add the devnet-test-e2e label to the PR. This is VERY expensive, only do it after all the reviews are complete.

Sanity Checklist

  • I have tested my changes using the available tooling
  • I have performed a self-review of my own code
  • I have commented my code, updated documentation and left TODOs throughout the codebase

@okdas okdas added the infra Infra or tooling related improvements, additions or fixes label Dec 15, 2023
@okdas okdas added this to the Shannon TestNet milestone Dec 15, 2023
@okdas okdas self-assigned this Dec 15, 2023
@@ -12,7 +13,7 @@ func GoOnExitSignal(onInterrupt func()) {
// Set up sigCh to receive when this process receives an interrupt or
// kill signal.
sigCh := make(chan os.Signal, 1)
signal.Notify(sigCh, os.Interrupt, os.Kill)
signal.Notify(sigCh, os.Interrupt, syscall.SIGTERM)
Copy link
Member Author

@okdas okdas Dec 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bryanchriswhite I was troubleshooting sequencer hot reloading issues, and ran into that: https://staticcheck.dev/docs/checks/#SA1016.

I only later realized this code is not used by the sequencer, so I can roll this change back.

@okdas okdas changed the title Add localnet persistance Add LocalNet persistence Dec 15, 2023
@okdas okdas changed the title Add LocalNet persistence [LocalNet] Add sequencer persistence Dec 15, 2023
@okdas okdas changed the title [LocalNet] Add sequencer persistence [LocalNet] Add sequencer block persistence Dec 15, 2023
@okdas
Copy link
Member Author

okdas commented Dec 15, 2023

With that PR we can get persistence across restarts. One can change localnet_config.yaml setting:

sequencer:
  cleanupBeforeEachStart: false

Unfortunately, this setting causes issues. The old process fails to close the database, leading to a situation where a new process starts before the old one has properly terminated. Consequently, the following error is logged:

Error: failed to initialize database: resource temporarily unavailable

A primary reason for this issue is the lack of shutdown or graceful shutdown logic in the sequencer. When it receives a SIGTERM signal (signal 15), the sequencer does not attempt to properly terminate the blockchain, Tendermint, or the database processes. Kubernetes, after waiting for a grace period (default 30 seconds), forcibly terminates the process with a SIGKILL (signal 9).

This inability to gracefully shut down also impedes the functionality of hot-reloading for the sequencer.

@bryanchriswhite, considering your experience in implementing graceful shutdowns for appgateserver and relayminer, could you provide any insights on achieving this for the sequencer? If not, we can consult the Rollkit team. I've attempted to resolve this issue but have been unsuccessful.

@Olshansk
Copy link
Member

Error: failed to initialize database: resource temporarily unavailable

I'm pretty sure this is a Tendermint/CometBFT error.

Other projects (example here, example2) hit the exact same error.


Any solution we come up with will look like (IMO) a hack in one way or another.

@okdas Can you think of a place in the infrastructure where (maybe a docker dev container?) where we do a ps aux | grep <...> | xargs kill -9 to force kill it in our env environments?


If not, we can consult the Rollkit team. I've attempted to resolve this issue but have been unsuccessful.

Let's do this IN PARALLEL to trying to figure out a solution ourselves. Feel free to even just post a link/screenshot to this PR.

@Olshansk
Copy link
Member

Olshansk commented Jan 5, 2024

@okdas Following up with feedback from the rollkit team on this.

Screenshot 2024-01-05 at 10 50 56 AM

@okdas
Copy link
Member Author

okdas commented Jan 10, 2024

@Olshansk, thanks, but this is a solution to a different problem. The current concern is: Sequencer is unable to shut down gracefully. It can only be shut down with kill -9.

We already provision new storage for a new sequencer when we don't want to persist state between restarts (cleanupBeforeEachStart is true, default behavior).

However, when we want the data to persist between restarts, we don't want to cleanup the datadir, randomize the datadir name, nor do we want to kill -9 the process (because that could corrupt the database). The error "Error: failed to initialize database: resource temporarily unavailable" ONLY appears when we manually restart in Tilt UI (and we have to do that manually due to lack of graceful shutdown).

I say we merge this as is, but let's create another ticket to fix graceful shutdown. Wdyt?

UPDATE: I was just able to start process running on my mac with 15 signal just fine. I am looking into why the process does not stop when running inside Kubernetes Pod.

@okdas okdas marked this pull request as ready for review January 10, 2024 01:42
@okdas
Copy link
Member Author

okdas commented Jan 11, 2024

OK, the issue with the DB being locked is addressed by pokt-network/helm-charts#23.

It was a surprise to me, but I've discovered that sequencer hot reloading does not work due to tilt-dev/tilt-extensions#391.

Which, I think, is actually not a big deal because with changes to the helm chart, plus changes included in that PR, it becomes painless to restart the sequencer via the UI. No hang-ups anymore. I can think of a workaround - we used a different mechanism in the old Pocket v1 repo, which worked great, but let's do that if we actually need hot reloading of the sequencer.

This PR is good to go!

@Olshansk
Copy link
Member

OK, the issue with the DB being locked is addressed by pokt-network/helm-charts#23.

Q1. I saw some work you did around the volume refactoring and cleanup-datadir, but does this mean that we're not going with the "random datadir" and rou figured out a way to make it deterministic? I simply want to understand and potentially #PUC if there's an opportunity for it.

It was a surprise to me, but I've discovered that sequencer hot reloading does not work due to tilt-dev/tilt-extensions#391.

Q2. This is something @red-0ne and I found an inconsistency in. Sometimes it works and sometimes it doesn't but I haven't done a formal deep dive. Can you confirm the following:

  • Hot reloading DOES WORK if we click restart in the UI
  • Hot reloading DOES NOT WORK if we change code, see the image rebuild, but do not click restart in the UI

If this is the case, let's create a ticket to fix it. It is very confusing (especially if you don't know how the infrastructure works) and create a discrepancy between expectation and what one is seeing as a new developer.

Copy link
Member

@Olshansk Olshansk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@okdas Let's merge with main and run e2e tests before merging this in. Also, please see the comment I left, I would say it's quite important for developer productivity.

Copy link

The CI will now also run the e2e tests on devnet, which increases the time it takes to complete all CI checks.

@github-actions github-actions bot added devnet push-image CI related - pushes images to ghcr.io labels Jan 11, 2024
@okdas
Copy link
Member Author

okdas commented Jan 11, 2024

Q1. I saw some work you did around the volume refactoring and cleanup-datadir, but does this mean that we're not going with the "random datadir" and you figured out a way to make it deterministic? I simply want to understand and potentially #PUC if there's an opportunity for it.

The problem appeared only when LocalNet was configured to preserve the blocks across restarts. The root of the problem was that the sequencer was not stopping in time before the new process started. It appeared this was happening due to the sequencer not receiving the signal to terminate the process. After moving some scripts around (which I wanted to do anyway), that issue disappeared. The sequencer now gets PID 1 inside the container and shuts down instantly when the Pod terminates.

Q2. This is something @red-0ne and I found an inconsistency in. Sometimes it works and sometimes it doesn't, but I haven't done a formal deep dive. Can you confirm the following:

* Hot reloading DOES WORK if we click `restart` in the UI

* Hot reloading DOES NOT WORK if we change code, see the image rebuild, but do not click `restart` in the UI

If this is the case, let's create a ticket to fix it. It is very confusing (especially if you don't know how the infrastructure works) and creates a discrepancy between expectation and what one is seeing as a new developer.

Hot reloading is an automatic restart after a code change. It does not work currently, due to how Tilt utilizes Helm (there are a couple of ways to use Helm charts in Tilt - and the old repo and new repo use different Tilt extensions).

Clicking the "restart" button is not hot reloading. But it does work. We've had issues with sequencer restarts in the UI - they were slow, and sometimes required multiple clicks. The root of the problem is described above, and this PR (along with helm chart changes) solves the issue.

Btw, I also changed a message bot posts after the devnet-test-e2e label is added to the PR to clarify the other issue we've run into today.

Copy link
Member

@Olshansk Olshansk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's merge this in, but I still want to understand if we have an open ticket to resolve hot reloading or not?

@okdas okdas mentioned this pull request Jan 17, 2024
5 tasks
@okdas
Copy link
Member Author

okdas commented Jan 17, 2024

Let's merge this in, but I still want to understand if we have an open ticket to resolve hot reloading or not?

There is now: #325

@okdas okdas merged commit af93def into main Jan 17, 2024
10 checks passed
@okdas okdas deleted the dk-localnet-persistance branch January 17, 2024 01:20
Olshansk pushed a commit that referenced this pull request Jan 18, 2024
* add localnet persistance

* bump local-celestia-devnet

* rollback celestia container

* change messages
@bryanchriswhite bryanchriswhite removed push-image CI related - pushes images to ghcr.io devnet-test-e2e labels May 16, 2024
@github-actions github-actions bot removed the devnet label May 16, 2024
okdas added a commit that referenced this pull request Nov 14, 2024
* add localnet persistance

* bump local-celestia-devnet

* rollback celestia container

* change messages
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
infra Infra or tooling related improvements, additions or fixes
Projects
Status: ✅ Done
Development

Successfully merging this pull request may close these issues.

3 participants