Debezium / DBZ-1085

Task rebalancing repeats indefinitely

Details

    • Bug
    • Resolution: Obsolete
    • Major
    • None
    • 0.9.0.Beta1
    • sqlserver-connector
    • None

    Description

      I have 6 Debezium SQL Server connectors running. I decided to add a second Kafka Connect (KConnect) instance for HA. When I add the second instance (or, more generally, whenever rebalancing is triggered for any reason), the rebalancing often repeats indefinitely.
      More precisely, when a new instance joins, a rebalance starts. The first instance, which runs all connectors, begins stopping them. This takes so long that the second instance takes over all connectors. When the first instance finally rejoins, a rebalance starts again. This time the second instance takes so long to stop its connectors that the first one takes over all of them. And so on.
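
      The loop described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not a description of Kafka Connect internals: it assumes there is some deadline by which a stopping instance must rejoin the group (Kafka Connect's `rebalance.timeout.ms` defaults to 60000 ms and may play this role, but that is a hypothesis), and that an instance which misses the deadline loses all of its connectors to the other instance. The function name and the numbers are illustrative.

```python
def simulate_rebalances(stop_seconds, rejoin_deadline_seconds, max_rounds=5):
    """Return who owns all connectors after each rebalance round (toy model)."""
    # kafka-01 starts out running every connector; kafka-02 is the newcomer.
    owner, joiner = "kafka-01", "kafka-02"
    owners = []
    for _ in range(max_rounds):
        if stop_seconds <= rejoin_deadline_seconds:
            # Assumption: the stopping instance rejoins in time, so the
            # connectors can be shared and no further rebalance is forced.
            owners.append("shared")
            break
        # Assumption: the stopping instance misses the deadline, the other
        # instance takes over everything, and the late rejoin triggers the
        # next rebalance with the roles swapped.
        owner, joiner = joiner, owner
        owners.append(owner)
    return owners


if __name__ == "__main__":
    # Stopping takes longer than the assumed deadline: ownership flips forever.
    print(simulate_rebalances(stop_seconds=128, rejoin_deadline_seconds=60, max_rounds=4))
    # Stopping is fast enough: the group settles after one round.
    print(simulate_rebalances(stop_seconds=30, rejoin_deadline_seconds=60))
```

      In this model the only way out of the loop is for the stop time to drop below the deadline, or for the deadline to be raised above the stop time.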

      Please have a look at the attached logs and screenshot. For the sake of readability, I extracted only the most important lines from the logs. Let me know if you need more logs.

      Timeline

      [2019-01-15 14:59:08,158] - kafka-01 - instance up and running with 6 connectors
      [2019-01-15 14:59:08,679] - kafka-02 - instance started
      [2019-01-15 14:59:10,514] - kafka-01 - rebalancing has started, all 6 connectors are being stopped
      [2019-01-15 15:00:17,645] - kafka-02 - (Re-)joining group
      [2019-01-15 15:00:18,472] - kafka-02 - Successfully joined group with generation 3
      [2019-01-15 15:00:18,624] - kafka-02 - Finished creating connector A, B, C, D, E, F
      [2019-01-15 15:00:40,516] - kafka-01 - Coordinator didn't stop in the expected time, shutting down executor now
      [2019-01-15 15:01:18,775] - kafka-01 - Finished stopping tasks in preparation for rebalance
      [2019-01-15 15:01:18,775] - kafka-01 - (Re-)joining group
      [2019-01-15 15:01:21,500] - kafka-02 - Rebalance started
      [2019-01-15 15:02:28,780] - kafka-01 - Starting connector A, B, C, D, E, F
      [2019-01-15 15:02:51,506] - kafka-02 - Coordinator didn't stop in the expected time, shutting down executor now
      [2019-01-15 15:03:25,391] - kafka-02 - Finished stopping tasks in preparation for rebalance
      [2019-01-15 15:03:25,782] - kafka-01 - Rebalance started
      ...
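
      To put numbers on the timeline above, the following Python sketch computes how long each instance took to stop its connectors. The timestamps are copied from the timeline; the dictionary keys and function names are labels chosen for this example, and it assumes that "Rebalance started" on kafka-02 marks the moment that instance began stopping its connectors.

```python
from datetime import datetime

# Log timestamp format used in the timeline, e.g. "2019-01-15 14:59:10,514".
FMT = "%Y-%m-%d %H:%M:%S,%f"

# Timestamps copied from the timeline; the key names are illustrative.
EVENTS = {
    "kafka-01": {
        "rebalance_started": "2019-01-15 14:59:10,514",
        "coordinator_warning": "2019-01-15 15:00:40,516",
        "tasks_stopped": "2019-01-15 15:01:18,775",
    },
    "kafka-02": {
        "rebalance_started": "2019-01-15 15:01:21,500",
        "coordinator_warning": "2019-01-15 15:02:51,506",
        "tasks_stopped": "2019-01-15 15:03:25,391",
    },
}


def seconds_between(start, end):
    """Elapsed seconds between two log timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


def stop_durations(events):
    """Per instance: seconds until the coordinator warning and until tasks stopped."""
    result = {}
    for instance, ts in events.items():
        result[instance] = {
            "until_warning": seconds_between(ts["rebalance_started"], ts["coordinator_warning"]),
            "until_stopped": seconds_between(ts["rebalance_started"], ts["tasks_stopped"]),
        }
    return result


if __name__ == "__main__":
    for instance, d in stop_durations(EVENTS).items():
        print(f"{instance}: warning after {d['until_warning']:.1f} s, "
              f"stopped after {d['until_stopped']:.1f} s")
```

      On both instances the "Coordinator didn't stop in the expected time" message appears almost exactly 90 seconds after the rebalance starts, and stopping the tasks takes more than two minutes in total.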

      I can only fix it manually: 1) stop both KConnect instances, 2) start both KConnect instances simultaneously.

      I think it is worth mentioning that the connectors I am running are quite busy; some of them process hundreds of events per second.

      Attachments

        1. connectors-per-instance.png
          64 kB
        2. kc01.log
          41 kB
        3. kc02.log
          30 kB

          People

            Assignee: Unassigned
            Reporter: Grzegorz Kołakowski (Inactive)
            Votes: 0
            Watchers: 3
