Debezium / DBZ-944

PostgreSQL connector task got stuck in "RUNNING" state for 30 minutes after unhandled exception

Details

    • Bug
    • Resolution: Unresolved
    • Major
    • None
    • 0.8.3.Final
    • postgresql-connector
    • None
    • Steps to Reproduce:
      1. Create test table test_1 and insert one record into it.
      2. Check that the topic for table test_1 was created in Kafka and contains one message - OK
      3. Stop schema-registry.
      4. Create test table test_2 and insert one record into it.
      5. Check the Connect logs: the exception "Failed to send HTTP request..." should be there - OK
      6. Check the Connect logs: the exception "Task is being killed and will not recover until manually restarted" should be there immediately, but it is actually written only after 30 minutes - NOT OK

    Description

      If an exception is thrown during task execution, the task gets stuck in the RUNNING state. Only after 30 minutes does it change to the FAILED state. During that period we cannot detect the failure and restart the task.

      Expected result:
      The task changes its state to FAILED immediately after the exception is thrown.

      Actual result:
      The task changes its state to FAILED 30 minutes after the exception is thrown.
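
      For reference, the task state mentioned above is what the Kafka Connect REST API reports under /connectors/{name}/tasks/{id}/status. Below is a minimal sketch of a probe for it. The class name, base URL and connector name are hypothetical, and the regex-based parsing is a shortcut for illustration (a real client should use a JSON parser). With this bug, such a probe would keep returning RUNNING for the whole 30-minute window.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Debezium or Kafka Connect.
final class TaskStateProbe {
    // Picks the first "state" field of the status payload (illustrative shortcut).
    private static final Pattern STATE =
            Pattern.compile("\"state\"\\s*:\\s*\"([A-Z_]+)\"");

    // Builds the task status URL of the Kafka Connect REST API.
    static URI statusUri(String baseUrl, String connector, int taskId) {
        return URI.create(baseUrl + "/connectors/" + connector
                + "/tasks/" + taskId + "/status");
    }

    // Extracts the task state from a status payload, or "UNKNOWN" if absent.
    static String extractState(String json) {
        Matcher m = STATE.matcher(json);
        return m.find() ? m.group(1) : "UNKNOWN";
    }

    // Performs the HTTP call; requires a reachable Connect worker.
    static String fetchState(String baseUrl, String connector, int taskId)
            throws Exception {
        HttpRequest request =
                HttpRequest.newBuilder(statusUri(baseUrl, connector, taskId)).GET().build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        return extractState(response.body());
    }

    public static void main(String[] args) {
        // Sample payload shaped like a task status response (values are made up).
        String sample = "{\"state\":\"RUNNING\",\"id\":0,\"worker_id\":\"10.0.0.1:8083\"}";
        System.out.println(extractState(sample));
    }
}
```

      Once the state finally becomes FAILED, the task can be restarted with a POST to /connectors/{name}/tasks/{id}/restart.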

      Some investigation on this issue:
      1. PostgresConnectorTask.commit() is called from the Kafka Connect code right after the exception is thrown.
      2. PostgresConnectorTask.commit() is blocked in the RecordsStreamProducer.commit() call.
      3. RecordsStreamProducer.commit() is waiting for a lock held by RecordsStreamProducer.streamChanges(): the actual contention is in org.postgresql.core.v3.CopyDualImpl, writeToCopy vs. readFromCopy.
      4. Connect thread stacks taken right after the exception are in the "blocked.txt" attachment.
      5. Connect thread stacks taken 30 minutes after the exception are in the "unblocked.txt" attachment.
      6. Connect logs are in the "connect_log.txt" attachment (the exception was thrown at 12:50, and the task failed only at 13:20).
      7. The problem is reproduced in 100% of test runs (see Steps to Reproduce).
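
      The blocking pattern from points 2 and 3 can be modeled roughly as follows. This is only a sketch under the assumption that the read and write paths share a single monitor. FakeStream and its methods are hypothetical stand-ins, not the pgjdbc or Debezium code. One thread parks inside a blocking read while holding the monitor, and the committing thread then cannot enter the write path until the read returns.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

final class CopyLockContention {

    // Hypothetical stand-in for a replication stream guarded by one monitor.
    static final class FakeStream {
        private final CountDownLatch data = new CountDownLatch(1);

        // The reader keeps the monitor while it waits for data to arrive.
        synchronized void readFromCopy(CountDownLatch entered) throws InterruptedException {
            entered.countDown();
            data.await();
        }

        // The writer needs the same monitor, so it waits while a read is pending.
        synchronized void writeToCopy() {
            // A real implementation would send the flushed position here.
        }

        void deliverData() {
            data.countDown();
        }
    }

    // Returns the state of the committing thread while the reader is parked.
    static Thread.State observeCommitState() {
        FakeStream stream = new FakeStream();
        CountDownLatch entered = new CountDownLatch(1);
        Thread reader = new Thread(() -> {
            try {
                stream.readFromCopy(entered);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "streamChanges");
        Thread committer = new Thread(stream::writeToCopy, "commit");
        try {
            reader.start();
            entered.await();           // the reader now owns the monitor
            committer.start();
            long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(5);
            while (committer.getState() != Thread.State.BLOCKED
                    && System.nanoTime() < deadline) {
                Thread.sleep(10);      // wait for the committer to hit the monitor
            }
            Thread.State observed = committer.getState();
            stream.deliverData();      // unblock the reader so both threads finish
            reader.join();
            committer.join();
            return observed;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while observing threads", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("commit thread state: " + observeCommitState());
    }
}
```

      In this model the commit thread stays blocked for as long as the read does not return, which would match a commit that only proceeds once the pending read ends.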

      Need help guys!

      Attachments

        1. blocked.txt
          66 kB
        2. connect_log.txt
          15 kB
        3. unblocked.txt
          64 kB

            People

              Assignee: Unassigned
              Reporter: Boris Nadezhdin (Inactive)
              Votes: 0
              Watchers: 3