  1. Debezium
  2. DBZ-923

MySQL active-passive: brief data loss on failover when Debezium encounters new GTID channel

    Details

      Description

      Let's say we have two MySQL servers in a standard active-passive high-availability setup. If the current master node fails, automation promotes the passive instance to be the new master, and it continues to serve live traffic. Debezium connects to the master node as well.

      Starting point:
      Server A (current master)
      uuid: abc
      gtids: abc:1-100

      Server B (slave)
      uuid: dfg
      gtids: abc:1-100 (replicating from master)

      Debezium also connects to the master, so it has
      gtids: abc:1-100

      Now assume the master node fails and failover is triggered:

      Server B (automation promotes it to new master)
      uuid: dfg
      gtids: abc:1-100, dfg:1-20

      Server A (becomes slave, starts replication from B)
      uuid: abc
      gtids: abc:1-100, dfg:1-20

      Debezium after job restart:
      gtids: abc:1-100, dfg:1-20

      Debezium gets a connection reset error; then, on job restart, it successfully connects to the new master (Server B), finds the new GTID channel (dfg), merges it into its existing offsets, and resumes streaming.

      This works, BUT there is a timing issue.

      When Debezium encounters a new GTID channel, it starts reading that channel from the MySQL server's latest gtid_executed position. So when the MySQL failover completes faster than Debezium detects the failure and restarts the job, the live data that arrived at the new master on the new GTID channel (dfg in our example) in the meantime is never processed by Debezium. In our infrastructure this can mean several minutes of lost data, because Debezium startup takes some time with large schemas.
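
To make the gap concrete, here is a small Python model of the behaviour described above. It is only an illustration, not Debezium's code: the function names are hypothetical, and GTID sets are expanded into plain integer sets, which is only practical for tiny examples like this one.

```python
def parse_gtid_set(text):
    """Parse 'uuid:1-100, uuid2:1-20' into {uuid: set of transaction ids}."""
    result = {}
    for part in text.replace(" ", "").split(","):
        if not part:
            continue
        uuid, *intervals = part.split(":")
        ids = result.setdefault(uuid, set())
        for interval in intervals:
            start, _, end = interval.partition("-")
            ids.update(range(int(start), int(end or start) + 1))
    return result


def resume_set_latest(stored_offsets, server_executed):
    """Model of the reported behaviour: a channel the connector has never seen
    is copied whole from the server's gtid_executed, i.e. treated as read."""
    merged = {uuid: set(ids) for uuid, ids in stored_offsets.items()}
    for uuid, ids in server_executed.items():
        if uuid not in merged:
            merged[uuid] = set(ids)
    return merged


def skipped_transactions(stored_offsets, server_executed):
    """Transactions that were never streamed but will not be requested either."""
    resume = resume_set_latest(stored_offsets, server_executed)
    skipped = {}
    for uuid, ids in resume.items():
        missing = ids - stored_offsets.get(uuid, set())
        if missing:
            skipped[uuid] = sorted(missing)
    return skipped


# Values from the example in this issue.
stored = parse_gtid_set("abc:1-100")             # Debezium offsets before failover
executed = parse_gtid_set("abc:1-100, dfg:1-20")  # Server B at job restart
skipped = skipped_transactions(stored, executed)
for uuid, ids in skipped.items():
    print(f"{uuid}: {len(ids)} transactions skipped ({ids[0]}-{ids[-1]})")
```

In this model every transaction written to the dfg channel between the failover and the job restart ends up in the skipped set.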

      What do you think about an option to specify what Debezium should do when it encounters a new GTID channel: take the latest executed position and continue from there, or take the earliest available value on the server? The default could remain "latest", but in our case "earliest" would solve our problem with lost data changes on failover. "Earliest" could be the channel's gtid_purged value or, if nothing has been purged, position 1.
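
The proposed choice could be sketched roughly as follows. This is a hypothetical helper, not an existing Debezium option or API: the function and parameter names are made up for illustration, and it assumes that the new channel's entries in gtid_executed and gtid_purged are each a single contiguous range starting at 1.

```python
def first_transaction_to_read(mode, executed_upto, purged_upto=0):
    """Return the first transaction id to stream from a newly seen GTID channel.

    mode          -- "latest" (behaviour described in this issue) or "earliest" (proposed)
    executed_upto -- highest id of the channel in the server's gtid_executed
    purged_upto   -- highest id of the channel in gtid_purged, 0 if nothing was purged
    """
    if mode == "latest":
        # Everything already executed on the server is treated as seen.
        return executed_upto + 1
    if mode == "earliest":
        # Start at the oldest transaction still available in the binlogs.
        return purged_upto + 1
    raise ValueError(f"unknown mode: {mode!r}")


# Example from this issue: the new channel dfg has executed 1-20, nothing purged.
print(first_transaction_to_read("latest", executed_upto=20))
print(first_transaction_to_read("earliest", executed_upto=20))
# Same channel, but dfg:1-5 has already been purged from the binlogs.
print(first_transaction_to_read("earliest", executed_upto=20, purged_upto=5))
```

With "earliest", the connector would re-read the whole channel that is still available, so the transactions written during the failover window would be picked up.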

              People

              • Assignee:
                Unassigned
                Reporter:
                pimpelsang Eero Koplimets