After a crash in our custom value converter and a restart of Debezium, it skipped a lot of messages.

      A minimal repro is available here: https://github.com/tzachshabtay/debezium-bug-repro

            [DBZ-1824] Debezium skips messages after restart


            Jiri Pechanec added a comment - Released

            Jiri Pechanec added a comment -

            tshabtay Thanks, it is visible for me now too.

            My findings - the issue is only present for wal2json; the pgoutput and protobuf decoders are immune to it.

            The problem probably lies in io.debezium.connector.postgresql.connection.AbstractMessageDecoder.shouldMessageBeSkipped(ByteBuffer, Long, Long, boolean).

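            To illustrate the kind of logic being pointed at: a decoder-side skip check of roughly the shape below compares each replayed message's LSN against the last offset stored before the restart. This is only a sketch under assumed names (the class, parameters and exact comparisons are illustrative, not the actual Debezium implementation); the point is that if the comparison is too broad, events written shortly before or during the crash can be dropped instead of re-emitted.

            import java.nio.ByteBuffer;

            // Illustrative sketch only - not the real Debezium decoder. It shows how an
            // LSN-based "skip what was already processed" check works and where it can
            // go wrong: if everything at or below the stored LSN keeps being skipped
            // after the first replayed message, legitimate events are lost on restart.
            public class SkipCheckSketch {

                private boolean firstMessageSkipped = false;

                public boolean shouldMessageBeSkipped(ByteBuffer buffer, Long lastReceivedLsn, Long messageLsn) {
                    if (lastReceivedLsn == null) {
                        // No stored offset (fresh start) - nothing to skip.
                        return false;
                    }
                    if (messageLsn < lastReceivedLsn) {
                        // Replayed message from before the stored position.
                        return true;
                    }
                    if (messageLsn.equals(lastReceivedLsn) && !firstMessageSkipped) {
                        // Skip only the single message we know was already delivered;
                        // skipping everything at this LSN would drop new events too.
                        firstMessageSkipped = true;
                        return true;
                    }
                    return false;
                }
            }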

            Tzach Shabtay (Inactive) added a comment - edited

            jpechane Ok, I modified the repro and added a 30-second wait before inserting events.
            It reproduces now (even with your snapshot override changes) and there is no "snapshot" flag in the connect_offsets message.


            Jiri Pechanec added a comment -

            Well, I think the problem is the timing - registering the connector does not mean it is started. So, as I see it:

            • connector is registered
            • events start to be added
            • the snapshot starts and waits for the table to be unlocked
            • events are added and the table is unlocked
            • the snapshot continues

            Try adding 20 s of sleep before adding events

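            One way to rule out that race in the reproducer (a sketch, not part of the repro repo; the Connect URL and connector name are placeholders) is to poll the Kafka Connect REST status endpoint until the connector reports RUNNING before inserting events, rather than relying only on a fixed sleep. Note that RUNNING still does not mean the snapshot has finished, so the extra delay suggested above may still be needed.

            import java.net.URI;
            import java.net.http.HttpClient;
            import java.net.http.HttpRequest;
            import java.net.http.HttpResponse;

            // Polls GET /connectors/<name>/status on the Kafka Connect REST API until the
            // connector reports RUNNING, instead of sleeping for a fixed amount of time.
            public class WaitForConnector {
                public static void main(String[] args) throws Exception {
                    HttpClient client = HttpClient.newHttpClient();
                    HttpRequest request = HttpRequest.newBuilder()
                            .uri(URI.create("http://localhost:8083/connectors/repro-connector/status"))
                            .GET()
                            .build();

                    for (int attempt = 0; attempt < 60; attempt++) {
                        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
                        // Crude check: a proper script would parse the JSON and verify both
                        // connector.state and every tasks[].state are RUNNING.
                        if (response.statusCode() == 200 && response.body().contains("\"RUNNING\"")) {
                            System.out.println("Connector reported RUNNING, safe to start inserting events");
                            return;
                        }
                        Thread.sleep(1000);
                    }
                    throw new IllegalStateException("Connector did not reach RUNNING state in time");
                }
            }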

            Tzach Shabtay (Inactive) added a comment -

            jpechane I'm guessing yes (because we're running for a long time without restarts in production). I looked at connect_offsets in our production environment and I don't see the "snapshot" flag at all in the entire topic (we run 0.9.5, is this a newly introduced flag?).

            Anyway, I'm still trying to understand why my repro even queries the snapshot after restart. I don't fill it in advance; I first start the connector, put in the config and wait for the ack, and only then do I add the records. So at first startup the table is empty (and it's the only table in the database), so shouldn't the snapshot be marked as completed?

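            For checking what is actually stored, a small consumer along these lines (a sketch; the bootstrap address and topic name come from a typical docker-compose Connect setup and may differ in a given environment) can dump the connect_offsets topic so the offset JSON, and whether it carries any snapshot-related fields, can be inspected directly.

            import java.time.Duration;
            import java.util.Collections;
            import java.util.Properties;
            import org.apache.kafka.clients.consumer.ConsumerRecord;
            import org.apache.kafka.clients.consumer.ConsumerRecords;
            import org.apache.kafka.clients.consumer.KafkaConsumer;
            import org.apache.kafka.common.serialization.StringDeserializer;

            // Reads the Kafka Connect offsets topic from the beginning and prints each
            // stored offset, so the connector's committed position (and any snapshot
            // markers) can be inspected.
            public class DumpConnectOffsets {
                public static void main(String[] args) {
                    Properties props = new Properties();
                    props.put("bootstrap.servers", "localhost:9092");
                    props.put("group.id", "offsets-inspector");
                    props.put("auto.offset.reset", "earliest");
                    props.put("key.deserializer", StringDeserializer.class.getName());
                    props.put("value.deserializer", StringDeserializer.class.getName());

                    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                        consumer.subscribe(Collections.singletonList("connect_offsets"));
                        ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(10));
                        for (ConsumerRecord<String, String> record : records) {
                            // Offset values are JSON blobs written by the connector; for the
                            // Postgres connector they include fields such as lsn and txId.
                            System.out.println(record.key() + " -> " + record.value());
                        }
                    }
                }
            }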

            Jiri Pechanec added a comment -

            The connector gathers the list of tables, then queries all records in each table, and only after the last record in the last table is processed is the snapshot considered complete. Of course, whitelist/blacklist settings apply. And don't forget we are talking about the content of the table at the time the plugin starts.
            So if you prime the table in advance, it is processed fully in the snapshot phase and nothing in the streaming phase. But do I understand it correctly that your issue was during the streaming phase?


            Tzach Shabtay (Inactive) added a comment -

            jpechane I'm completely lost at this point. What do you mean "ALL records"? How does the connector decide that "ALL records" were processed?


            Jiri Pechanec added a comment -

            tshabtay But you have 4999 out of how many? Until ALL records are processed, the snapshot is not completed and will be restarted.
            Your reproducer should not use snapshots at all in this case.

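            If the reproducer is meant to exercise only the streaming phase, disabling the initial snapshot is one way to take it out of the picture. A registration payload of roughly this shape (the name, connection values and plugin are placeholders for the repro's own settings) would do that via the snapshot.mode property:

            {
              "name": "repro-connector",
              "config": {
                "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
                "plugin.name": "wal2json",
                "database.hostname": "postgres",
                "database.port": "5432",
                "database.user": "postgres",
                "database.password": "postgres",
                "database.dbname": "postgres",
                "database.server.name": "repro",
                "snapshot.mode": "never"
              }
            }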

            Tzach Shabtay (Inactive) added a comment -

            jpechane But in my repro, I have 4999 records recorded before the crash.
            Or is this a timing issue? Do I need to add a "sleep" statement in my repro between putting the config in the container and writing the events, so that it has enough time to mark the snapshot as completed and won't query the snapshot after restart?
            I mean, we're trying to repro an actual event in production that happened after more than a year of running with Debezium, so surely it wasn't a snapshot issue, right?


            Jiri Pechanec added a comment -

            What happens if you write a record after the snapshot is completed? The snapshot can be marked as completed only if at least one record - either from snapshot or from streaming - is recorded.

