Debezium / DBZ-7816

Postgres: Potential data loss on connector restart

      In order to make your issue reports as actionable as possible, please provide the following information, depending on the issue type.

      Bug report

      For bug reports, provide this information, please:

      What Debezium connector do you use and what version?

      2.0.0.Final

      What is the connector configuration?

      N/A

      What is the captured database version and mode of deployment?

      Postgres 11, 12, 13.

      What behaviour do you expect?

      No data loss on connector restart at any point.

      What behaviour do you see?

      Data loss.

      Do you see the same behaviour using the latest released Debezium version?

      (Ideally, also verify with latest Alpha/Beta/CR version)

      The same behaviour was seen with 2.4.2.Final, and it is expected to occur in the latest version as well.

      Do you have the connector logs, ideally from start till finish?

      (You might be asked later to provide DEBUG/TRACE level log)

      N/A

      How to reproduce the issue using our tutorial deployment?

      The data loss is reproducible when execution follows this order:
      1. poll() returns SourceRecords for LSN1, LSN2 and LSN3 to AbstractWorkerSourceTask#execute() (https://github.com/apache/kafka/blob/b254e787cbdefb38344c6a0da2b965e6d7707d27/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L353)
      2. All 3 records are sent to Kafka using KafkaProducer#send with a callback; assume all 3 belong to different broker partitions
      3. The LSN2 record receives its ack
      4. commitRecord updates the latest offset to the LSN2 record
      5. The LSN3 record receives its ack
      6. commitRecord updates the latest offset to the LSN3 record
      7. commit() is executed asynchronously, based on config, and updates the flushed LSN to LSN3
         (if a Kafka offset update happens at this point, the offset value is LSN1)
      8. The connector is restarted
         a. The Kafka offset returns LSN1 as lsn_commit
         b. The replication slot returns LSN3 as the DB flushed LSN
      9. The DB streams data from LSN3
       
      LSN1 record is lost.
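
The sequence above can be replayed in a small toy model. This is only a sketch of the scenario as described in this report, not Debezium or Kafka Connect code: the class and method names are hypothetical, LSNs are modelled as plain longs, and 0 stands for "nothing acked yet". It compares "latest acked record wins" tracking with a conservative alternative that only advances past records whose predecessors have all been acked.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical toy model of the reported sequence; not Debezium code.
class AckOrderSketch {

    // Naive tracking: the offset is whichever record was acked most recently.
    static long lastAckedLsn(List<Long> ackOrder) {
        long latest = 0L; // 0 = nothing acked yet
        for (long lsn : ackOrder) {
            latest = lsn;
        }
        return latest;
    }

    // Conservative tracking: the highest LSN for which that record and every
    // record dispatched before it have been acked.
    static long highestContiguouslyAckedLsn(List<Long> dispatchOrder, List<Long> ackOrder) {
        Set<Long> acked = new HashSet<>(ackOrder);
        long safe = 0L;
        for (long lsn : dispatchOrder) {
            if (!acked.contains(lsn)) {
                break; // an earlier record is still in flight
            }
            safe = lsn;
        }
        return safe;
    }

    public static void main(String[] args) {
        List<Long> dispatched = Arrays.asList(1L, 2L, 3L); // poll() order
        List<Long> acks = Arrays.asList(2L, 3L);           // LSN1 not acked yet
        System.out.println("naive flushed LSN = " + lastAckedLsn(acks));
        System.out.println("safe flushed LSN  = " + highestContiguouslyAckedLsn(dispatched, acks));
    }
}
```

With acks arriving for LSN2 and then LSN3, the naive value is 3 while the conservative value stays at 0, which is the gap that loses LSN1 if the replication slot is advanced to the naive value before a restart.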
       
      The data loss occurs because the order in which acks (success or failure) arrive is not guaranteed.
       
      According to the KafkaProducer documentation: https://github.com/apache/kafka/blob/864744ffd4ddc3b0d216a3049ee0c61e9c0d3ad1/clients/src/main/java/org/apache/kafka/clients/producer/KafkaProducer.java#L901
       
      Kafka Connect uses asynchronous callbacks here: https://github.com/apache/kafka/blob/81c222e9779c3339aa139ab930a74aba2c7c8685/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L411
      (there is no blocking call, i.e. no get())
       
      Ack order is guaranteed only for records that belong to the same partition, so acks can arrive out of order for records that belong to different partitions.
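
As a rough illustration of that ordering property, the sketch below models each partition as a FIFO queue of in-flight records and lets partitions respond in an arbitrary order. It uses no Kafka client; the class name, the method name and the one-record-per-response simplification are all assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model: per-partition FIFO acks, arbitrary order across partitions.
class PartitionAckSketch {

    // lsns[i] is sent to partitions[i]; each entry of partitionResponseOrder
    // acks the oldest in-flight record of that partition (a simplification).
    static List<Long> ackOrder(long[] lsns, int[] partitions, int[] partitionResponseOrder) {
        Map<Integer, Deque<Long>> inFlight = new HashMap<>();
        for (int i = 0; i < lsns.length; i++) {
            inFlight.computeIfAbsent(partitions[i], p -> new ArrayDeque<>()).addLast(lsns[i]);
        }
        List<Long> acks = new ArrayList<>();
        for (int partition : partitionResponseOrder) {
            Deque<Long> queue = inFlight.get(partition);
            if (queue != null && !queue.isEmpty()) {
                acks.add(queue.pollFirst()); // FIFO within one partition
            }
        }
        return acks;
    }

    public static void main(String[] args) {
        long[] lsns = {1L, 2L, 3L, 4L};
        int[] partitions = {0, 1, 0, 1};
        int[] responses = {1, 0, 1, 0}; // partition 1 answers first
        System.out.println(ackOrder(lsns, partitions, responses));
    }
}
```

In this run partition 0 still acks LSN1 before LSN3 and partition 1 acks LSN2 before LSN4, yet the combined ack sequence is 2, 1, 4, 3.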
       
      Out-of-order acks can be observed by logging the records in the BaseSourceTask#commitRecord() and PostgresConnectorTask#doPoll() methods.
       
      If acks are out of order, the latest offset in BaseSourceTask is not safe to persist to the DB WAL as the flushed LSN.

      Feature request or enhancement

      For feature requests or enhancements, provide this information, please:

      Which use case/requirement will be addressed by the proposed feature?

      <Your answer>

      Implementation ideas (optional)

      commit() must fetch the latest offset from the Kafka offset topic and use it as the flushed DB LSN.

      This ensures that the DB flushed LSN is always <= the Kafka offset LSN.
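
A minimal sketch of that invariant, assuming the LSN stored in the Kafka offset topic has already been read (how it is fetched is out of scope here). The class and method names are hypothetical and not part of Debezium. Taking the minimum with the connector's own latest acked LSN is an extra safeguard assumed for this example; the proposal itself only says to use the Kafka offset value.

```java
import java.util.OptionalLong;

// Hypothetical helper; not part of Debezium.
class FlushLsnSketch {

    // Returns the LSN that may be reported to the replication slot as flushed,
    // or empty if no offset has been committed to Kafka yet.
    static OptionalLong safeFlushLsn(long latestAckedLsn, OptionalLong committedOffsetLsn) {
        if (!committedOffsetLsn.isPresent()) {
            return OptionalLong.empty(); // nothing durable in Kafka: do not advance the slot
        }
        // Never report more than what the Kafka offset topic already records.
        return OptionalLong.of(Math.min(latestAckedLsn, committedOffsetLsn.getAsLong()));
    }

    public static void main(String[] args) {
        // Scenario from this report: latest acked is LSN3, Kafka offset is LSN1.
        System.out.println(safeFlushLsn(3L, OptionalLong.of(1L)));
    }
}
```

In the scenario from this report the helper yields LSN1 rather than LSN3, so after a restart the replication slot would still be able to stream the LSN1 record.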

            Unassigned Unassigned
            dasarianil1505 Anil Dasari