
Schema restore failed with: "Unrecognized strategy option {us-east} passed to org.apache.cassandra.locator.NetworkTopologyStrategy" #4041

Open
yarongilor opened this issue Sep 23, 2024 · 27 comments · May be fixed by scylladb/scylla-cluster-tests#9492

@yarongilor

Packages

Scylla version: 2024.2.0~rc2-20240904.4c26004e5311 with build-id a8549197de3c826053f88ddfd045b365b9cd8692

Kernel Version: 5.15.0-1068-aws

Issue description

The backup restore failed with the error:

restore data: create "100gb_sizetiered_6_0" ("100gb_sizetiered_6_0") with CREATE KEYSPACE "100gb_sizetiered_6_0" WITH replication = {'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy', 'us-east': '3'} AND durable_writes = true: Unrecognized strategy option {us-east} passed to org.apache.cassandra.locator.NetworkTopologyStrategy for keyspace 100gb_sizetiered_6_0

The restore task was started like this:

< t:2024-09-19 15:13:31,266 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.1.202>: restore/08395d3e-1492-4af3-86dc-d9b0b03039fc
< t:2024-09-19 15:13:31,564 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.1.202>: Sep 19 15:13:31 alternator-ttl-4-loaders-no-lwt-sis-monitor-node-4afc0c3a-1 scylla-manager[11147]: {"L":"INFO","T":"2024-09-19T15:13:31.255Z","N":"restore","M":"Initialized views","views":null,"_trace_id":"jlyaoqGiR0ab_NHSOrxl0g"}
< t:2024-09-19 15:13:31,564 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.1.202>: Sep 19 15:13:31 alternator-ttl-4-loaders-no-lwt-sis-monitor-node-4afc0c3a-1 scylla-manager[11147]: {"L":"INFO","T":"2024-09-19T15:13:31.257Z","N":"scheduler","M":"PutTask","task":"restore/08395d3e-1492-4af3-86dc-d9b0b03039fc","schedule":{"cron":"{\"spec\":\"\",\"start_date\":\"0001-01-01T00:00:00Z\"}","window":null,"timezone":"Etc/UTC","start_date":"0001-01-01T00:00:00Z","interval":"","num_retries":3,"retry_wait":"10m"},"properties":{"location":["s3:manager-backup-tests-permanent-snapshots-us-east-1"],"restore_schema":true,"snapshot_tag":"sm_20240812164539UTC"},"create":true,"_trace_id":"jlyaoqGiR0ab_NHSOrxl0g"}
< t:2024-09-19 15:13:31,565 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.1.202>: Sep 19 15:13:31 alternator-ttl-4-loaders-no-lwt-sis-monitor-node-4afc0c3a-1 scylla-manager[11147]: {"L":"INFO","T":"2024-09-19T15:13:31.264Z","N":"scheduler.4253a65e","M":"Schedule","task":"restore/08395d3e-1492-4af3-86dc-d9b0b03039fc","in":"0s","begin":"2024-09-19T15:13:31.264Z","retry":0,"_trace_id":"jlyaoqGiR0ab_NHSOrxl0g"}
< t:2024-09-19 15:13:31,565 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.1.202>: Sep 19 15:13:31 alternator-ttl-4-loaders-no-lwt-sis-monitor-node-4afc0c3a-1 scylla-manager[11147]: {"L":"INFO","T":"2024-09-19T15:13:31.264Z","N":"http","M":"POST /api/v1/cluster/4253a65e-2c97-48dc-a939-7c7590741a75/tasks","from":"127.0.0.1:34234","status":201,"bytes":0,"duration":"3766ms","_trace_id":"jlyaoqGiR0ab_NHSOrxl0g"}

Then it failed:

< t:2024-09-19 15:13:35,364 f:base.py         l:143  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.1.202>: Command "sudo sctool  -c 4253a65e-2c97-48dc-a939-7c7590741a75 progress restore/08395d3e-1492-4af3-86dc-d9b0b03039fc" finished with status 0
< t:2024-09-19 15:13:35,364 f:cli.py          l:1132 c:sdcm.mgmt.cli        p:DEBUG > sctool output: Restore progress
< t:2024-09-19 15:13:35,364 f:cli.py          l:1132 c:sdcm.mgmt.cli        p:DEBUG > Run:              bb1652f4-7699-11ef-bc2a-0a833fefb519
< t:2024-09-19 15:13:35,364 f:cli.py          l:1132 c:sdcm.mgmt.cli        p:DEBUG > Status:           ERROR (restoring backed-up data)
< t:2024-09-19 15:13:35,364 f:cli.py          l:1132 c:sdcm.mgmt.cli        p:DEBUG > Cause:            restore data: create "100gb_sizetiered_6_0" ("100gb_sizetiered_6_0") with CREATE KEYSPACE "100gb_sizetiered_6_0" WITH replication = {'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy', 'us-east': '3'} AND durable_writes = true: Unrecognized strategy option {us-east} passed to org.apache.cassandra.locator.NetworkTopologyStrategy for keyspace 100gb_sizetiered_6_0
< t:2024-09-19 15:13:35,364 f:cli.py          l:1132 c:sdcm.mgmt.cli        p:DEBUG > Start time:       19 Sep 24 15:13:31 UTC
< t:2024-09-19 15:13:35,364 f:cli.py          l:1132 c:sdcm.mgmt.cli        p:DEBUG > End time: 19 Sep 24 15:13:33 UTC
< t:2024-09-19 15:13:35,364 f:cli.py          l:1132 c:sdcm.mgmt.cli        p:DEBUG > Duration: 2s
< t:2024-09-19 15:13:35,364 f:cli.py          l:1132 c:sdcm.mgmt.cli        p:DEBUG > Progress: 0% | 0%
< t:2024-09-19 15:13:35,364 f:cli.py          l:1132 c:sdcm.mgmt.cli        p:DEBUG > Snapshot Tag:     sm_20240812164539UTC
< t:2024-09-19 15:13:35,364 f:cli.py          l:1132 c:sdcm.mgmt.cli        p:DEBUG > 
< t:2024-09-19 15:13:35,364 f:cli.py          l:1132 c:sdcm.mgmt.cli        p:DEBUG > ╭───────────────┬──────────┬──────────┬─────────┬────────────┬────────╮
< t:2024-09-19 15:13:35,364 f:cli.py          l:1132 c:sdcm.mgmt.cli        p:DEBUG > │ Keyspace      │ Progress │     Size │ Success │ Downloaded │ Failed │
< t:2024-09-19 15:13:35,364 f:cli.py          l:1132 c:sdcm.mgmt.cli        p:DEBUG > ├───────────────┼──────────┼──────────┼─────────┼────────────┼────────┤
< t:2024-09-19 15:13:35,364 f:cli.py          l:1132 c:sdcm.mgmt.cli        p:DEBUG > │ system_schema │  0% | 0% │ 352.731k │       0 │          0 │      0 │
< t:2024-09-19 15:13:35,364 f:cli.py          l:1132 c:sdcm.mgmt.cli        p:DEBUG > ╰───────────────┴──────────┴──────────┴─────────┴────────────┴────────╯
< t:2024-09-19 15:13:35,364 f:cli.py          l:1148 c:sdcm.mgmt.cli        p:DEBUG > sctool res after parsing: [['Restore progress'], ['Run: bb1652f4-7699-11ef-bc2a-0a833fefb519'], ['Status: ERROR (restoring backed-up data)'], ['Cause: restore data: create "100gb_sizetiered_6_0" ("100gb_sizetiered_6_0") with CREATE KEYSPACE "100gb_sizetiered_6_0" WITH replication = {\'class\': \'org.apache.cassandra.locator.NetworkTopologyStrategy\', \'us-east\': \'3\'} AND durable_writes = true: Unrecognized strategy option {us-east} passed to org.apache.cassandra.locator.NetworkTopologyStrategy for keyspace 100gb_sizetiered_6_0'], ['Start time: 19 Sep 24 15:13:31 UTC'], ['End time: 19 Sep 24 15:13:33 UTC'], ['Duration: 2s'], ['Progress: 0%', '0%'], ['Snapshot Tag: sm_20240812164539UTC'], ['Keyspace', 'Progress', 'Size', 'Success', 'Downloaded', 'Failed'], ['system_schema', '0%', '0%', '352.731k', '0', '0', '0']]
2024-09-19 15:13:39.530: (DisruptionEvent Severity.ERROR) period_type=end event_id=74265a87-4830-422e-a42f-7081a9ec6230 duration=58s: nemesis_name=MgmtRestore target_node=Node alternator-ttl-4-loaders-no-lwt-sis-db-node-4afc0c3a-3 [34.242.246.113 | 10.4.3.150] errors=Schema restoration of sm_20240812164539UTC has failed!
Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5207, in wrapper
    result = method(*args[1:], **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 2966, in disrupt_mgmt_restore
    assert restore_task.status == TaskStatus.DONE, \
AssertionError: Schema restoration of sm_20240812164539UTC has failed!

Installation details

Cluster size: 4 nodes (i4i.4xlarge)

Scylla Nodes used in this run:

  • alternator-ttl-4-loaders-no-lwt-sis-db-node-4afc0c3a-6 (18.202.235.208 | 10.4.3.36) (shards: -1)
  • alternator-ttl-4-loaders-no-lwt-sis-db-node-4afc0c3a-5 (54.75.40.118 | 10.4.3.65) (shards: 14)
  • alternator-ttl-4-loaders-no-lwt-sis-db-node-4afc0c3a-4 (34.241.184.210 | 10.4.0.247) (shards: 14)
  • alternator-ttl-4-loaders-no-lwt-sis-db-node-4afc0c3a-3 (34.242.246.113 | 10.4.3.150) (shards: 14)
  • alternator-ttl-4-loaders-no-lwt-sis-db-node-4afc0c3a-2 (108.129.126.116 | 10.4.1.130) (shards: 14)
  • alternator-ttl-4-loaders-no-lwt-sis-db-node-4afc0c3a-1 (34.245.137.137 | 10.4.1.50) (shards: 14)

OS / Image: ami-0555cb82c50d0d5f1 (aws: undefined_region)

Test: longevity-alternator-1h-scan-12h-ttl-no-lwt-2h-grace-4loaders-sisyphus-test
Test id: 4afc0c3a-7457-4d8b-a69a-8ee387d26369
Test name: enterprise-2024.2/alternator_tablets/longevity-alternator-1h-scan-12h-ttl-no-lwt-2h-grace-4loaders-sisyphus-test
Test method: longevity_test.LongevityTest.test_custom_time
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 4afc0c3a-7457-4d8b-a69a-8ee387d26369
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 4afc0c3a-7457-4d8b-a69a-8ee387d26369

Logs:

Jenkins job URL
Argus

@Michal-Leszczynski
Collaborator

Well, this is expected.
SM restores the schema by applying the output of DESC SCHEMA WITH INTERNALS queried during the backup.
The problem is that the schema contains topology-related information: the DCs in which the keyspace is replicated.
So in order to use the SM restore schema task, the restore destination cluster needs to consist of the same DCs as the backed-up cluster.

A workaround is to take the schema file from the backup location, modify it to fit your needs, and apply it manually.

@karol-kokoszka karol-kokoszka changed the title restore data failed with: "Unrecognized strategy option {us-east} passed to org.apache.cassandra.locator.NetworkTopologyStrategy" Schema restore data failed with: "Unrecognized strategy option {us-east} passed to org.apache.cassandra.locator.NetworkTopologyStrategy" Sep 30, 2024
@karol-kokoszka
Collaborator

karol-kokoszka commented Sep 30, 2024

@yarongilor Is there anything you suggest changing in Scylla Manager? As per #4041 (comment), this is the expected behavior of the manager.

It looks like there is no datacenter named "us-east" in the destination cluster.

# cassandra-rackdc.properties
#
# The lines may include white spaces at the beginning and the end.
# The rack and data center names may also include white spaces.
# All trailing and leading white spaces will be trimmed.
#
dc=thedatacentername
rack=therackname
# prefer_local=<false | true>
# dc_suffix=<Data Center name suffix, used by EC2SnitchXXX snitches>

@yarongilor
Author

yarongilor commented Sep 30, 2024

@roydahan, @fruch, is there any known resolution for this issue?
The test ran in the eu-west-1 region (with Datacenter: eu-west) and failed restoring a backup to the us-east datacenter. Is it a matter of a wrongly selected region for the test, or does it require an SCT fix?

@roydahan

It's not a new issue, mostly a usability issue.
@mikliapko I think the original issue is assigned to you; are you planning to change SCT so that it changes the DC name while trying to restore?

@Michal-Leszczynski
Collaborator

Issue about restoring schema into a different DC setting: #4049.

@fruch
Contributor

fruch commented Sep 30, 2024

Issue about restoring schema into a different DC setting: #4049.

So currently the user is supposed to do the schema restore manually.

@mikliapko so I'd say we should at least skip the nemesis if the region of the snapshots doesn't match, at least until it is implemented on the test end or the manager end.

@karol-kokoszka karol-kokoszka changed the title Schema restore data failed with: "Unrecognized strategy option {us-east} passed to org.apache.cassandra.locator.NetworkTopologyStrategy" Schema restore failed with: "Unrecognized strategy option {us-east} passed to org.apache.cassandra.locator.NetworkTopologyStrategy" Sep 30, 2024
@mikliapko

It's not a new issue, mostly a usability issue. @mikliapko I think the original issue is assigned to you; are you planning to change SCT so that it changes the DC name while trying to restore?

I don't remember us having an issue for that.
I created a new one for myself to correctly handle the case when the region of the snapshots doesn't match: #4052

@rayakurl

rayakurl commented Oct 1, 2024

@mikliapko - IMO we can plan for a workaround, depending on when this issue will be fixed on the Manager side.
@karol-kokoszka, @Michal-Leszczynski - please discuss this in the next Manager refinement meeting. If it's not going to be handled soon, @mikliapko will create a workaround in the test for it.

@fruch
Contributor

fruch commented Oct 1, 2024

It's not a new issue, mostly a usability issue. @mikliapko I think the original issue is assigned to you; are you planning to change SCT so that it changes the DC name while trying to restore?

I don't remember us having an issue for that. I created a new one for myself to correctly handle the case when the region of the snapshots doesn't match: #4052

There was an issue about this, long ago:
https://github.com/scylladb/qa-tasks/issues/1477

I don't know if anything was done to try to apply any workaround.

@timtimb0t

timtimb0t commented Oct 22, 2024

@roydahan

@mikliapko we want to have at least a workaround for this issue until it is fixed in Manager.

@mikliapko

@mikliapko we want to have at least a workaround for this issue until it is fixed in Manager.

Sorry, lost track of this issue for a while. I’ll try to come up with a solution no later than next week.

@roydahan

If there is no fix planned in Manager or an easy workaround on the Manager side, you can work around it in SCT by uploading backups to several regions (each with the correct region in its schema).
Then change the nemesis to pull the backup from the matching region.

@mikliapko

mikliapko commented Nov 25, 2024

A workaround is to take the schema file from the backup location, modify it to fit your needs, and apply it manually.

@Michal-Leszczynski
I'd like to work around this issue.
Where can I find the list of CQL statements I need to apply after manually modifying the schema file?

@Michal-Leszczynski
Collaborator

Where can I find a list of CQL statements I need to apply after manual modification of schema file?

It's in the backup location under /backup/schema/cluster/<clusterID>/task_<taskID>_tag_<snapshotTag>_schema_with_internals.json.gz.
This file contains a JSON array of the schema statements returned from DESCRIBE SCHEMA WITH INTERNALS.

The uncompressed file can look like this:

[
  {
    "keyspace": "restoretest_full",
    "type": "keyspace",
    "name": "restoretest_full",
    "cql_stmt": "CREATE KEYSPACE restoretest_full WITH replication = {'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy', 'dc1': '2'} AND durable_writes = true;"
  },
  {
    "keyspace": "restoretest_full",
    "type": "table",
    "name": "big_table",
    "cql_stmt": "CREATE TABLE restoretest_full.big_table (\n    id int,\n    data blob,\n    PRIMARY KEY (id)\n) WITH ID = d3607a70-a8bf-11ef-851d-38d002d034f1\nAND bloom_filter_fp_chance = 0.01\n    AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}\n    AND comment = ''\n    AND compaction = {'class': 'NullCompactionStrategy', 'enabled': 'false'}\n    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}\n    AND crc_check_chance = 1\n    AND default_time_to_live = 0\n    AND gc_grace_seconds = 864000\n    AND max_index_interval = 2048\n    AND memtable_flush_period_in_ms = 0\n    AND min_index_interval = 128\n    AND speculative_retry = '99.0PERCENTILE'\n    AND paxos_grace_seconds = 864000\n    AND tombstone_gc = {'mode': 'repair', 'propagation_delay_in_seconds': '3600'};\n"
  },
  {
    "keyspace": "restoretest_full",
    "type": "index",
    "name": "bydata_index",
    "cql_stmt": "CREATE INDEX bydata ON restoretest_full.big_table(data);\n"
  },
  {
    "keyspace": "restoretest_full",
    "type": "view",
    "name": "testmv",
    "cql_stmt": "CREATE MATERIALIZED VIEW restoretest_full.testmv AS\n    SELECT id, data\n    FROM restoretest_full.big_table\n    WHERE data IS NOT null\n    PRIMARY KEY (id, data)\n    WITH ID = d38ae5d0-a8bf-11ef-8cb1-45ed76572674\nAND CLUSTERING ORDER BY (data ASC)\n    AND bloom_filter_fp_chance = 0.01\n    AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}\n    AND comment = ''\n    AND compaction = {'class': 'SizeTieredCompactionStrategy'}\n    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}\n    AND crc_check_chance = 1\n    AND default_time_to_live = 0\n    AND gc_grace_seconds = 864000\n    AND max_index_interval = 2048\n    AND memtable_flush_period_in_ms = 0\n    AND min_index_interval = 128\n    AND speculative_retry = '99.0PERCENTILE'\n    AND paxos_grace_seconds = 864000\n    AND tombstone_gc = {'mode': 'repair', 'propagation_delay_in_seconds': '3600'};\n"
  }
]

@timtimb0t

Reproduced here:

Packages

Scylla version: 6.3.0~dev-20241122.e2e6f4f441be with build-id 2493a7aae1f855d3df502197f757822b6afc1033

Kernel Version: 6.8.0-1019-aws

Installation details

Cluster size: 5 nodes (i4i.8xlarge)

Scylla Nodes used in this run:

  • longevity-mv-si-4d-master-db-node-299884c7-8 (3.250.175.37 | 10.4.11.76) (shards: 30)
  • longevity-mv-si-4d-master-db-node-299884c7-7 (3.255.213.87 | 10.4.11.152) (shards: 30)
  • longevity-mv-si-4d-master-db-node-299884c7-6 (54.75.36.72 | 10.4.10.102) (shards: 30)
  • longevity-mv-si-4d-master-db-node-299884c7-5 (34.255.195.233 | 10.4.9.117) (shards: 30)
  • longevity-mv-si-4d-master-db-node-299884c7-4 (18.202.252.170 | 10.4.9.20) (shards: 30)
  • longevity-mv-si-4d-master-db-node-299884c7-3 (3.255.99.166 | 10.4.10.136) (shards: 30)
  • longevity-mv-si-4d-master-db-node-299884c7-2 (52.50.139.63 | 10.4.11.189) (shards: 30)
  • longevity-mv-si-4d-master-db-node-299884c7-1 (46.137.67.238 | 10.4.8.195) (shards: 30)

OS / Image: ami-001a2091244fdbdf3 (aws: undefined_region)

Test: longevity-mv-si-4days-streaming-test
Test id: 299884c7-f5ee-4e0d-8e21-a27a3509b0a6
Test name: scylla-master/tier1/longevity-mv-si-4days-streaming-test
Test method: longevity_test.LongevityTest.test_custom_time
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 299884c7-f5ee-4e0d-8e21-a27a3509b0a6
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 299884c7-f5ee-4e0d-8e21-a27a3509b0a6

Logs:

Jenkins job URL
Argus

@mikliapko

I was trying to work around it via a manual restore of the schema (putting the right region into the CQL statements for keyspace creation). It failed because of a kms:decryption issue (see details here): the backup was encrypted with the key of the region it belongs to and can't be decrypted with the key from the replaced region.

Looks like uploading the backup to several regions is the only way left so far.

@fruch
Contributor

fruch commented Nov 26, 2024

I was trying to work around it via a manual restore of the schema (putting the right region into the CQL statements for keyspace creation). It failed because of a kms:decryption issue (see details here): the backup was encrypted with the key of the region it belongs to and can't be decrypted with the key from the replaced region.

Looks like uploading the backup to several regions is the only way left so far.

Those backups were created with KMS keys which are long gone, regardless of region.
Or am I missing something about how the restore flow for KMS EaR-encrypted sstables should work?

@mikliapko

Those backups were created with KMS keys which are long gone, regardless of region.

I recreated those backups a few months ago.
They must be encrypted with the relevant key.

@fruch
Contributor

fruch commented Nov 26, 2024

Those backups were created with KMS keys which are long gone, regardless of region.

I recreated those backups a few months ago. They must be encrypted with the relevant key.

I take it back; we clear the aliases, not the keys.

@mikliapko

mikliapko commented Dec 3, 2024

Hm, we have an issue related to the backup_bucket_region parameter.

In test_defaults.yaml this parameter is an empty string:

backup_bucket_region: ''  # use the same region as a cluster

Then it gets rewritten by aws_config.yaml:

backup_bucket_region: 'us-east-1'

This parameter is used to configure the manager agent:

        node.update_manager_agent_backup_config(
            region=self.params.get("backup_bucket_region"),
            general_config=agent_backup_general_config,
        )

As a result, if the region_name in the test differs from us-east-1, we have a misconfiguration between the actual region and the region configured in scylla-agent.yaml.

I'm thinking about adding a validation rule for the backup_bucket_region parameter in sdcm/sct_config.py, something like this, to forbid such situations:

if self.get("backup_bucket_region") != self.get("region_name"):
    self["backup_bucket_region"] = self.get("region_name")

@fruch Could this change lead to any unexpected consequences? What do you think?
Or perhaps you have better ideas on how to fix it?
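The suggested coercion can be expressed as a small helper. This is a hypothetical sketch (the function name and dict-style params access are assumptions; the real change would live in sdcm/sct_config.py):

```python
def normalize_backup_bucket_region(params):
    """If backup_bucket_region diverges from region_name (e.g. the
    aws_config.yaml default of 'us-east-1' leaking into a eu-west-1 run),
    fall back to region_name so the manager agent config and the actual
    cluster region agree."""
    if params.get("backup_bucket_region") != params.get("region_name"):
        params["backup_bucket_region"] = params.get("region_name")
    return params
```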

@fruch
Contributor

fruch commented Dec 3, 2024

Hm, we have an issue related to the backup_bucket_region parameter. [...] @fruch Could this change lead to any unexpected consequences? Or perhaps you have better ideas on how to fix it?

As long as you have buckets in all regions for running the backup nemesis (including in GCE / Azure), I don't think it would be a problem.

I would recommend removing backup_bucket_region if it's not going to be really usable; that can be a follow-up.

@mikliapko

I would recommend removing backup_bucket_region if it's not going to be really usable; that can be a follow-up.

Yes, since for now we can't set different region_name and backup_bucket_region, I suppose it makes sense to get rid of it and just use region_name instead. Okay, I'll remove it in the scope of this ticket.

@mikliapko

After duplicating all snapshots and fixing backup location issues, disrupt_mgmt_restore is still failing with:

Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5426, in wrapper
    result = method(*args[1:], **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 187, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 3092, in disrupt_mgmt_restore
    self.tester.set_ks_strategy_to_network_and_rf_according_to_cluster(
  File "/home/ubuntu/scylla-cluster-tests/sdcm/tester.py", line 1111, in set_ks_strategy_to_network_and_rf_according_to_cluster
    NetworkTopologyReplicationStrategy(**datacenters).apply(node, keyspace)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/replication_strategy_utils.py", line 47, in apply
    session.execute(cql)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 1318, in execute_verbose
    return execute_orig(*args, **kwargs)
  File "cassandra/cluster.py", line 2729, in cassandra.cluster.Session.execute
  File "cassandra/cluster.py", line 5120, in cassandra.cluster.ResponseFuture.result
cassandra.InvalidRequest: Error from server: code=2200 [Invalid query] message="Only one DC's RF can be changed at a time and not by more than 1"

The test alters a keyspace from RF=3 to RF=5, which is prohibited.
https://argus.scylladb.com/tests/scylla-cluster-tests/0089ac32-5cc0-4168-852a-73718ab10242

@mikliapko

I suppose that to fix it, we need to implement the procedure described here:
https://opensource.docs.scylladb.com/stable/kb/rf-increase.html#example
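The documented procedure boils down to changing the RF one step at a time, with a repair after each step. A sketch of how the nemesis could do this, assuming a cassandra-driver-style session object and a caller-supplied repair callback (both assumptions; this is not existing SCT code):

```python
def alter_rf_stepwise(session, keyspace, dc, current_rf, target_rf, repair):
    """Scylla only allows changing one DC's RF by 1 at a time, so step
    from current_rf to target_rf one increment at a time, running a
    repair after each step as the RF-increase procedure recommends."""
    step = 1 if target_rf >= current_rf else -1
    for rf in range(current_rf + step, target_rf + step, step):
        session.execute(
            f"ALTER KEYSPACE {keyspace} WITH replication = "
            f"{{'class': 'NetworkTopologyStrategy', '{dc}': {rf}}}"
        )
        repair()  # e.g. trigger a full repair on every node before the next step
```

For the failing case above (RF=3 to RF=5), this issues two ALTER KEYSPACE statements instead of one.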

@mikliapko

mikliapko commented Dec 6, 2024

I wonder how this test was introduced in the first place?
With this limitation, the disrupt_mgmt_restore test couldn't pass.

@timtimb0t

Reproduced here:
https://argus.scylladb.com/tests/scylla-cluster-tests/1b3e80d1-e6cb-46c0-a07b-0ca1c1b8974d
Backend: aws
Region: eu-west-1, eu-west-2, eu-north-1
Image id: ami-0c7b4b0835c9342f7 ami-039f35b0f1e04947e ami-03a78f37d7eaf9c88
SCT commit sha: 1d4cbaa1ed74fd3d4748a4636c3f6d57805bc24b
SCT repository: [email protected]:scylladb/scylla-cluster-tests.git
SCT branch name: origin/master
Kernel version: 6.8.0-1019-aws
Scylla version: 6.3.0~dev-20241206.7e2875d6489d
Build id: 5227dd2a3fce4d2beb83ec6c17d47ad2e8ba6f5c
Instance type: i4i.4xlarge
Node amount: 8
