Debezium / DBZ-911

RecordsStreamProducer#columnValues() does not take into account unchanged TOASTed columns, refreshing table schemas unnecessarily


    Details

    • Steps to Reproduce:
      1. Create a table (in a PostgreSQL database) with at least one extended-storage column. A text column that stores HTML pages is a good example.
      2. Set the table's REPLICA IDENTITY to FULL.
      3. Populate the table, ensuring the extended-storage column's data triggers TOAST processing (use 2+ KB of text, for example). Create at least a few thousand rows.
      4. Update all (or a large number) of the table's rows, where the update does not change the extended-storage column's data.
      5. Observe that the PostgreSQL Connector takes several minutes to process the changes.
      6. Update all (or a large number) of the table's rows, this time changing the extended-storage column's data.
      7. Observe that the PostgreSQL Connector takes only several seconds to process the changes.

      Description

      If a row contains columns with TOASTed data and an update to that row does not alter the TOASTed data, Debezium's PostgreSQL Connector unnecessarily refreshes its local cache of table schemas while processing the update. In the worst case, this results in a query to the database for each update event on that table. The overhead is tens to low hundreds of milliseconds per event; multiplied across several thousand or more change events, the performance hit is severe. Where we would expect an update of, say, 3000 rows to be processed in several seconds, processing instead takes almost 10 minutes.
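The numbers above can be sanity-checked with a quick back-of-the-envelope calculation. The 200 ms per-event figure below is an assumption chosen from within the stated "tens to low hundreds of milliseconds" range:

```python
# Rough cost model for the unnecessary per-event schema refresh.
# refresh_seconds = 0.2 is an assumed value within the reported range.
events = 3000          # update events in the batch
refresh_seconds = 0.2  # assumed overhead of one schema refresh

total_seconds = events * refresh_seconds
total_minutes = total_seconds / 60
print(total_minutes)  # 10.0, matching the "almost 10 minutes" observation
```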

      I have confirmed this issue when using `REPLICA IDENTITY FULL`. This setting guarantees that the `before` record will provide all column values, since the WAL record contains the entire old row as the replica identity. The setting is rarely used because it is inefficient, which may explain why this bug has gone undetected; however, there are valid use cases, so the issue cannot be ignored. I suspect the issue also exists for other `REPLICA IDENTITY` settings, but I have not tested this. If so (and I am fairly confident it does), this is a serious performance issue that will affect all PostgreSQL users.
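The suspected mechanism can be illustrated with a small sketch. This is a hypothetical Python model, not the actual Java logic in RecordsStreamProducer#columnValues(): when a TOASTed column is unchanged, the logical decoding message omits its value, so a naive comparison of the received columns against the cached schema looks like a schema change and triggers a refresh against the database.

```python
# Hypothetical illustration of the suspected bug (names are invented;
# the real code lives in Debezium's RecordsStreamProducer, in Java).

cached_schema = {"id", "title", "body"}  # columns in Debezium's cached schema

def needs_refresh_naive(message_columns):
    # Buggy behaviour: any mismatch with the cached schema forces a
    # (slow) schema refresh, even for unchanged TOASTed columns.
    return set(message_columns) != cached_schema

def needs_refresh_fixed(message_columns, unchanged_toast_columns):
    # Fixed behaviour: columns missing from the message are tolerated
    # when they are known to be unchanged TOASTed columns.
    missing = cached_schema - set(message_columns)
    return bool(missing - unchanged_toast_columns)

# An UPDATE that leaves the TOASTed "body" column untouched: the
# decoding message carries only the columns that were transmitted.
received = ["id", "title"]
toasted_unchanged = {"body"}

print(needs_refresh_naive(received))                     # True: refreshes
print(needs_refresh_fixed(received, toasted_unchanged))  # False: no refresh
```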


                People

                • Assignee: ian-axelrod (Ian Axelrod)
                • Reporter: ian-axelrod (Ian Axelrod)
                • Votes: 0
                • Watchers: 3
