Bug Report: Failure in first call to PRS can lead to the cluster having no primary #17710

Closed
@GuptaManan100

Overview of the Issue

If PRS (PlannedReparentShard) fails during the initialisation of a shard, the cluster can end up with no primary. This happens when the failure occurs while the tablet being promoted is writing to the topo server, and the write succeeds but the call returns a timeout error. In that case the tablet does not change its internal display state to PRIMARY. VTOrc sees the failure and tries to repair it by calling UndoDemotePrimary, but that call only fixes the MySQL-level settings and does not change the tablet type to PRIMARY, so the cluster is left without a primary.
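
The sequence described above can be illustrated with a small self-contained Go model. This is only a sketch of the failure mode, not Vitess code: every type and function here (`topoServer`, `tablet`, `promote`, `undoDemotePrimary`, `simulate`, the alias `zone1-100`) is a hypothetical stand-in, and the real promotion and repair paths involve considerably more state.

```go
package main

import (
	"errors"
	"fmt"
)

// NOTE: all names in this file are hypothetical and exist only to model
// the failure described in this issue. None of them are Vitess APIs.

type tabletType string

const (
	typeReplica tabletType = "REPLICA"
	typePrimary tabletType = "PRIMARY"
)

var errTopoTimeout = errors.New("topo write timed out")

// topoServer models a topo write that is persisted but still reports
// a timeout to the caller.
type topoServer struct {
	primaryAlias   string
	failAfterWrite bool
}

func (ts *topoServer) recordPrimary(alias string) error {
	ts.primaryAlias = alias // the write lands
	if ts.failAfterWrite {
		return errTopoTimeout // but the caller only sees a timeout
	}
	return nil
}

// tablet tracks the tablet's own view of its type separately from the
// MySQL-level read-only setting.
type tablet struct {
	alias         string
	displayType   tabletType
	mysqlReadOnly bool
}

// promote bails out on the topo error, so the in-memory type is never
// switched to PRIMARY even though the topo write was persisted.
func (tb *tablet) promote(ts *topoServer) error {
	if err := ts.recordPrimary(tb.alias); err != nil {
		return err
	}
	tb.displayType = typePrimary
	tb.mysqlReadOnly = false
	return nil
}

// undoDemotePrimary models a repair that only restores MySQL-level
// settings and leaves the tablet type untouched.
func (tb *tablet) undoDemotePrimary() {
	tb.mysqlReadOnly = false
}

// simulate runs the failed promotion followed by the repair attempt.
func simulate() (*topoServer, *tablet, error) {
	ts := &topoServer{failAfterWrite: true}
	tb := &tablet{alias: "zone1-100", displayType: typeReplica, mysqlReadOnly: true}
	err := tb.promote(ts)
	tb.undoDemotePrimary()
	return ts, tb, err
}

func main() {
	ts, tb, err := simulate()
	fmt.Printf("promote error: %v\n", err)
	fmt.Printf("topo primary: %s, tablet type: %s, mysql read-only: %v\n",
		ts.primaryAlias, tb.displayType, tb.mysqlReadOnly)
}
```

In this model the topo record names the tablet as primary and MySQL is writable after the repair, yet the tablet still reports itself as REPLICA, which mirrors the "no primary" state this issue describes.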

Reproduction Steps

  1. Run PRS and simulate a failure that happens before the new primary tablet has promoted itself.

Binary Version

main

Operating System and Environment details

-

Log Fragments
