Scylla jmx takes longer than normal (over a minute) to start #195

ShlomiBalalis opened this issue Mar 21, 2023 · 7 comments

@ShlomiBalalis

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.

In this run, we ran the rebuild_streaming_err nemesis several times. The nemesis does the following:

  1. stop scylla
  2. delete a few sstable files (of non-system keyspaces)
  3. start scylla
  4. execute a nodetool rebuild
  5. interrupt it by rebooting the node instance
  6. execute a nodetool rebuild again, this time to completion

The issue is that the jmx server takes an abnormal amount of time to start, such that it's only ready to receive requests over a minute after scylla starts, while previously it was a matter of ~10 seconds.
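
One way to put a number on "ready to receive requests" is to poll the JMX port until it accepts a TCP connection. The helper below is a hypothetical sketch, not part of the test framework: `wait_for_port` is an invented name, and port 7199 is taken from the log line in this report. A successful TCP connect only shows that the port is listening, which is an approximation of JMX readiness.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 180.0,
                  interval: float = 1.0) -> float:
    """Poll until a TCP connection to host:port succeeds.

    Returns the number of seconds waited. Raises TimeoutError if the port
    does not accept a connection within `timeout` seconds.
    """
    start = time.monotonic()
    while True:
        try:
            # create_connection raises OSError while nothing is listening.
            with socket.create_connection((host, port), timeout=interval):
                return time.monotonic() - start
        except OSError:
            if time.monotonic() - start >= timeout:
                raise TimeoutError(
                    f"{host}:{port} not reachable after {timeout:.0f}s")
            time.sleep(interval)


# Example (assumes it is run on the node right after the service starts):
#   waited = wait_for_port("127.0.0.1", 7199)
#   print(f"JMX port accepted a connection after {waited:.1f}s")
```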

2023-03-09T14:38:04+00:00 longevity-5gb-1h-RebuildStreamingEr-db-node-1011ef9b-2     !INFO | scylla[589]: Scylla version 5.3.0~dev-0.20230305.95ce2e898016 with build-id 82d6475d1db2cda61d9de9fd3a060e821b356d36 starting ...
2023-03-09T14:38:04+00:00 longevity-5gb-1h-RebuildStreamingEr-db-node-1011ef9b-2     !INFO | scylla[589]:  [shard 0] init - Scylla version 5.3.0~dev-0.20230305.95ce2e898016 with build-id 82d6475d1db2cda61d9de9fd3a060e821b356d36 starting ...
...
2023-03-09T14:38:05+00:00 longevity-5gb-1h-RebuildStreamingEr-db-node-1011ef9b-2     !INFO | systemd[1]: Started Scylla JMX.
...
2023-03-09T14:38:30+00:00 longevity-5gb-1h-RebuildStreamingEr-db-node-1011ef9b-2     !INFO | scylla[589]:  [shard 0] init - Scylla version 5.3.0~dev-0.20230305.95ce2e898016 initialization completed.
...
2023-03-09T14:39:00+00:00 longevity-5gb-1h-RebuildStreamingEr-db-node-1011ef9b-2     !INFO | scylla-jmx[601]: Connecting to http://127.0.0.1:10000
2023-03-09T14:39:00+00:00 longevity-5gb-1h-RebuildStreamingEr-db-node-1011ef9b-2     !INFO | scylla-jmx[601]: Starting the JMX server
...
2023-03-09T14:39:47+00:00 longevity-5gb-1h-RebuildStreamingEr-db-node-1011ef9b-2     !INFO | scylla-jmx[601]: JMX is enabled to receive remote connections on port: 7199
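
The gaps in the log above can be computed directly from the timestamps. The following is a minimal sketch; the helper names `first_timestamp` and `seconds_between` are made up for illustration, and the log lines are copied from this report.

```python
from datetime import datetime

# Log lines copied from the excerpt in this report.
LOG = """\
2023-03-09T14:38:05+00:00 longevity-5gb-1h-RebuildStreamingEr-db-node-1011ef9b-2     !INFO | systemd[1]: Started Scylla JMX.
2023-03-09T14:38:30+00:00 longevity-5gb-1h-RebuildStreamingEr-db-node-1011ef9b-2     !INFO | scylla[589]:  [shard 0] init - Scylla version 5.3.0~dev-0.20230305.95ce2e898016 initialization completed.
2023-03-09T14:39:00+00:00 longevity-5gb-1h-RebuildStreamingEr-db-node-1011ef9b-2     !INFO | scylla-jmx[601]: Starting the JMX server
2023-03-09T14:39:47+00:00 longevity-5gb-1h-RebuildStreamingEr-db-node-1011ef9b-2     !INFO | scylla-jmx[601]: JMX is enabled to receive remote connections on port: 7199
"""


def first_timestamp(log: str, marker: str) -> datetime:
    """Return the timestamp of the first log line containing `marker`."""
    for line in log.splitlines():
        if marker in line:
            # The timestamp is the first whitespace-separated field.
            return datetime.fromisoformat(line.split()[0])
    raise ValueError(f"marker not found: {marker!r}")


def seconds_between(log: str, start_marker: str, end_marker: str) -> float:
    """Seconds elapsed between the first lines matching each marker."""
    delta = first_timestamp(log, end_marker) - first_timestamp(log, start_marker)
    return delta.total_seconds()


READY = "JMX is enabled to receive remote connections"
print(seconds_between(LOG, "Started Scylla JMX.", READY))       # 102.0
print(seconds_between(LOG, "initialization completed", READY))  # 77.0
print(seconds_between(LOG, "Starting the JMX server", READY))   # 47.0
```

In this excerpt, JMX becomes ready 102 seconds after systemd starts the service, and 47 of those seconds pass after the "Starting the JMX server" message.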

Impact

Describe the impact this issue causes to the user.

How frequently does it reproduce?

The issue reproduced 100% of the time and caused all 6 nemesis runs in this build to fail.

Installation details

Kernel Version: 5.15.0-1031-aws
Scylla version (or git commit hash): 5.3.0~dev-20230305.95ce2e898016 with build-id 82d6475d1db2cda61d9de9fd3a060e821b356d36

Cluster size: 3 nodes (i4i.large)

Scylla Nodes used in this run:

  • longevity-5gb-1h-RebuildStreamingEr-db-node-1011ef9b-3 (34.241.61.144 | 10.4.1.52) (shards: 2)
  • longevity-5gb-1h-RebuildStreamingEr-db-node-1011ef9b-2 (18.202.223.38 | 10.4.3.3) (shards: 2)
  • longevity-5gb-1h-RebuildStreamingEr-db-node-1011ef9b-1 (3.252.167.78 | 10.4.2.125) (shards: 2)

OS / Image: ami-0122c861a035f0d11 (aws: eu-west-1)

Test: longevity-5gb-1h-RebuildStreamingErrMonkey-aws-test
Test id: 1011ef9b-bc10-44d9-9c6a-f08ff1e5d293
Test name: scylla-master/nemesis/longevity-5gb-1h-RebuildStreamingErrMonkey-aws-test
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 1011ef9b-bc10-44d9-9c6a-f08ff1e5d293
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 1011ef9b-bc10-44d9-9c6a-f08ff1e5d293

Logs:

Jenkins job URL

@mykaul
Contributor

mykaul commented Mar 21, 2023

DNS... (resolving localhost, etc.) ?
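
A rough way to check this hypothesis on an affected node is to time name resolution for `localhost` and for the machine's own hostname. This is only a sketch under the assumption that a slow lookup would also show up through the system resolver used by Python; `time_resolution` is a hypothetical helper, and a fast result here does not rule out a delay inside the JVM.

```python
import socket
import time


def time_resolution(name: str) -> tuple[float, bool]:
    """Resolve `name` once and return (elapsed_seconds, succeeded)."""
    start = time.perf_counter()
    try:
        socket.getaddrinfo(name, None)
        ok = True
    except socket.gaierror:
        # Resolution failed; the elapsed time is still informative.
        ok = False
    return time.perf_counter() - start, ok


if __name__ == "__main__":
    for name in ("localhost", socket.gethostname()):
        elapsed, ok = time_resolution(name)
        status = "ok" if ok else "FAILED"
        print(f"{name}: {elapsed:.3f}s ({status})")
```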

@ShlomiBalalis
Author

> DNS... (resolving localhost, etc.) ?

If it is, it's weirdly consistent

@mykaul
Contributor

mykaul commented Mar 23, 2023

> > DNS... (resolving localhost, etc.) ?
>
> If it is, it's weirdly consistent

On different runs, different clusters, etc.?

@avikivity
Member

> The issue is that the jmx server takes an abnormal amount of time to start, such that it's only ready to receive requests over a minute after scylla starts, while previously it was a matter of ~10 seconds.

What version is "previously"?

@ShlomiBalalis
Author

ShlomiBalalis commented Mar 29, 2023

> > > DNS... (resolving localhost, etc.) ?
> >
> > If it is, it's weirdly consistent
>
> On different runs, different clusters, etc.?

In this run, every time the node was restarted.

> > The issue is that the jmx server takes an abnormal amount of time to start, such that it's only ready to receive requests over a minute after scylla starts, while previously it was a matter of ~10 seconds.
>
> What version is "previously"?

Here's a run that uses 5.2.0~rc1-0.20230207.8ff4717fd010 with build-id 78fbb2c25e9244a62f57988313388a0260084528 for example:

2023-02-11T05:30:29+00:00 longevity-tls-50gb-3d-5-2-db-node-0530dbf7-4     !INFO | scylla[758]:  [shard  0] init - Scylla version 5.2.0~rc1-0.20230207.8ff4717fd010 with build-id 78fbb2c25e9244a62f57988313388a0260084528 starting ...
...
2023-02-11T05:30:33+00:00 longevity-tls-50gb-3d-5-2-db-node-0530dbf7-4     !INFO | systemd[1]: Started Scylla JMX.
...
2023-02-11T05:30:34+00:00 longevity-tls-50gb-3d-5-2-db-node-0530dbf7-4     !INFO | scylla-jmx[869]: Starting the JMX server
...
2023-02-11T05:30:35+00:00 longevity-tls-50gb-3d-5-2-db-node-0530dbf7-4     !INFO | scylla-jmx[869]: JMX is enabled to receive remote connections on port: 7199
...
2023-02-11T05:31:01+00:00 longevity-tls-50gb-3d-5-2-db-node-0530dbf7-4     !INFO | scylla[758]:  [shard  0] init - Scylla version 5.2.0~rc1-0.20230207.8ff4717fd010 initialization completed.

Here the JMX server was up only 2 seconds after its service was started (05:30:33 to 05:30:35), well before scylla finished initializing at 05:31:01. This is the standard behaviour.

@avikivity
Member

But on other nodes, it starts quickly?

@DoronArazii DoronArazii added this to the Backlog milestone May 15, 2023
@fgelcer

fgelcer commented May 23, 2023

> But on other nodes, it starts quickly?

@ShlomiBalalis , can you please confirm it?
