Init shard primary flow both in vitess-operator and vtorc lead to primary-replica replication abnormal #634

Open
lujiashun opened this issue Nov 1, 2024 · 0 comments

Overview of the issue

I used vitess-operator to create a Vitess cluster with a keyspace named customer that has two shards (Figure 1).
For shard customer:-80, the primary is zone1-2289928654 and the replica is zone1-0120139806. The replica cannot replicate
because it encounters duplicate-key errors, as shown in Figure 2.

The replica's logs were lost, but the primary's logs are shown in Figure 3. They show that the primary-replica relationship was established before 16:11:58,
and that the primary then received a ResetReplication command at 16:12:02. This command probably came from vitess-operator; the relevant code is shown in Figure 4.

Reproduction Steps

The scenario may be as follows:

  1. The operator finds no primary.
  2. VTOrc finds no primary.
  3. VTOrc locks the shard, selects a primary, constructs the primary-replica relationship, and then unlocks the shard.
  4. The operator locks the shard, selects a primary, resets replication, and re-constructs the primary-replica relationship. Unfortunately,
    the reset master on the primary failed (Figure 5), so the primary's binlog files were not purged, while the replica's relay log, gtid_executed, and gtid_purged were cleared.
    The replica therefore already has the row with primary key 172984391830416639, and when it receives the binlog from GTID b3767c83-92a8-11ef-add6-066fd3332dcc:23, it reports a duplicate primary key. If the reset master on the primary had succeeded and the primary's binlog files had been cleared, the primary-replica relationship would probably have worked.
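
To make step 4 concrete, here is a toy model of the state mismatch. It is only a sketch and not Vitess or MySQL code: the Replica class, the (sequence, primary key) binlog tuples, and the simulate function are all hypothetical names invented for illustration. It assumes the replica re-requests every transaction that is missing from its executed-GTID set.

```python
class DuplicateKeyError(Exception):
    """Raised when a replayed insert hits a row that already exists."""


class Replica:
    """Hypothetical, heavily simplified replica: rows plus executed GTID sequences."""

    def __init__(self):
        self.rows = set()           # primary keys already stored in the table
        self.gtid_executed = set()  # transaction sequence numbers already applied

    def apply(self, binlog):
        # Apply every binlog event whose sequence number is not yet executed.
        for seq, pk in binlog:
            if seq in self.gtid_executed:
                continue
            if pk in self.rows:
                raise DuplicateKeyError(f"duplicate primary key {pk} at sequence {seq}")
            self.rows.add(pk)
            self.gtid_executed.add(seq)

    def reset_replication_state(self):
        # Models the replica-side reset: GTID bookkeeping is cleared,
        # but the table data stays in place.
        self.gtid_executed.clear()


def simulate(primary_reset_succeeds):
    binlog = [(1, 101), (2, 102)]   # events written by the primary
    replica = Replica()
    replica.apply(binlog)            # first init: replication works
    replica.reset_replication_state()
    if primary_reset_succeeds:
        binlog = []                  # a successful reset purges the primary's binlog
    try:
        replica.apply(binlog)        # second init: replica asks for what it "lacks"
    except DuplicateKeyError as err:
        return str(err)
    return "ok"
```

In this model, simulate(True) finishes cleanly, while simulate(False) replays the retained binlog against rows that already exist and fails on the first one, which mirrors the duplicate-key error described above.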

The problem is that init shard primary is invoked twice: first by VTOrc, and then by vitess-operator. Normally both work well, but
in a scenario like the one above, the second invocation breaks primary-replica replication.
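
The double invocation can be sketched as a check-then-act race. The snippet below is an illustration only, not how vitess-operator or VTOrc is implemented: Shard, init_primary_unguarded, init_primary_guarded, and run are hypothetical names. It assumes both components observe "no primary" before either takes the shard lock, and it shows one possible mitigation, which is to re-check for a primary after acquiring the lock.

```python
import threading


class Shard:
    """Hypothetical shard record guarded by a lock."""

    def __init__(self):
        self.lock = threading.Lock()
        self.primary = None
        self.init_count = 0  # how many times init shard primary actually ran


def init_primary_unguarded(shard, candidate):
    # Acts on a stale observation: initializes even if a primary now exists.
    with shard.lock:
        shard.primary = candidate
        shard.init_count += 1
        return True


def init_primary_guarded(shard, candidate):
    # Re-checks under the lock and backs off if someone else already initialized.
    with shard.lock:
        if shard.primary is not None:
            return False
        shard.primary = candidate
        shard.init_count += 1
        return True


def run(init_fn):
    shard = Shard()
    # Steps 1 and 2: both actors see no primary before either one locks the shard.
    observations = [("vtorc", shard.primary is None),
                    ("operator", shard.primary is None)]
    # Steps 3 and 4: each actor then locks the shard and acts on what it saw earlier.
    for actor, saw_no_primary in observations:
        if saw_no_primary:
            init_fn(shard, f"tablet-chosen-by-{actor}")
    return shard.init_count, shard.primary
```

With the unguarded version the initialization runs twice and the second actor overwrites the first; with the guarded version the second actor notices the existing primary and does nothing.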

Figure 1 (topology)
Image

Figure 2 (replica replication abnormal)
Image

Figure 3 (primary log)
Image

Figure 4 (vitess-operator code)
Image

Figure 5 (replica duplicate primary key)
Image

Binary Version

Vitess v18.0.1
vitess-operator: latest tag

Operating System and Environment details

K8S cluster

Log Fragments
