• Type: Enhancement
    • Resolution: Done
    • Priority: Major
    • 1.4.0.Alpha2
    • mysql-connector

      One of our legacy databases stores binary data in a MySQL CHAR(32) latin1 column. Converting this column to a binary type would be the ideal solution, but unfortunately that isn't easily feasible.

      The problem is that the MySQL JDBC driver maps MySQL's latin1 charset to Windows-1252, whereas our legacy system stores these bytes using the IANA Latin1 (ISO-8859-1) charset. The two encodings do not have a 1:1 mapping.
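To illustrate the mismatch described above, here is a minimal standalone sketch using only the JDK. It assumes the JVM ships a `windows-1252` charset (common, but not one of the charsets every JVM is required to provide). The class and method names are hypothetical and not part of Debezium or the MySQL driver.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Latin1Mismatch {

    // Decodes the bytes to a String with the given charset, then encodes them back.
    public static byte[] roundTrip(byte[] raw, Charset charset) {
        return new String(raw, charset).getBytes(charset);
    }

    // True when every possible byte value survives a decode/encode round trip.
    public static boolean isLossless(Charset charset) {
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }
        return Arrays.equals(all, roundTrip(all, charset));
    }

    public static void main(String[] args) {
        // Assumption: windows-1252 is available in this JVM.
        Charset cp1252 = Charset.forName("windows-1252");
        byte[] one = { (byte) 0x80 };

        // 0x80 is a C1 control character in ISO-8859-1 but the euro sign in windows-1252.
        System.out.printf("ISO-8859-1:   U+%04X%n",
                (int) new String(one, StandardCharsets.ISO_8859_1).charAt(0));
        System.out.printf("windows-1252: U+%04X%n",
                (int) new String(one, cp1252).charAt(0));

        System.out.println("ISO-8859-1 lossless:   " + isLossless(StandardCharsets.ISO_8859_1));
        System.out.println("windows-1252 lossless: " + isLossless(cp1252));
    }
}
```

Binary data that has passed through a Windows-1252 String cannot always be recovered, because a few byte values have no Windows-1252 character assigned and are replaced during decoding.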

      Additionally, it seems to me that SnapshotReader ignores charset mapping altogether when consuming the ResultSet.

      My proposal is to submit a PR that exposes a new configuration parameter:

      charset.overrides, a comma-delimited list of MySQL_CHARSET:JAVA_CHARSET pairs that can be used to override the default JDBC driver mapping.
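As a rough sketch of how a value in the proposed format could be parsed, the following standalone class turns a string such as `latin1:ISO-8859-1` into a map from MySQL charset name to Java charset. The class name, the lower-casing of keys, and the error handling are assumptions made for illustration. The ticket only proposes the parameter and does not describe how it would be implemented.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class CharsetOverrides {

    // Parses "MySQL_CHARSET:JAVA_CHARSET,..." into a map keyed by the
    // lower-cased MySQL charset name (lower-casing is an assumption).
    public static Map<String, Charset> parse(String value) {
        Map<String, Charset> overrides = new LinkedHashMap<>();
        if (value == null || value.trim().isEmpty()) {
            return overrides;
        }
        for (String entry : value.split(",")) {
            String[] parts = entry.trim().split(":", 2);
            if (parts.length != 2 || parts[0].trim().isEmpty() || parts[1].trim().isEmpty()) {
                throw new IllegalArgumentException(
                        "Expected MySQL_CHARSET:JAVA_CHARSET but got '" + entry.trim() + "'");
            }
            // Charset.forName throws if the Java charset name is illegal or unsupported.
            overrides.put(parts[0].trim().toLowerCase(Locale.ROOT),
                    Charset.forName(parts[1].trim()));
        }
        return overrides;
    }

    public static void main(String[] args) {
        Map<String, Charset> overrides = parse("latin1:ISO-8859-1, utf8mb4:UTF-8");
        System.out.println(overrides.get("latin1"));
    }
}
```

Failing fast on a malformed entry or an unknown Java charset keeps a typo in the configuration from silently falling back to the driver's default mapping.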

            [DBZ-2673] Overriding Character Set Mapping

            Debezium Builder added a comment - Released

            Arik Cohen (Inactive) added a comment:

            I was able to find a solution:

            1) I created a new CustomConverter implementation that maps each field's raw bytes to the desired string using a configurable encoding. This works out of the box with the BinlogReader, since the BinlogReader returns the field's raw bytes.

            2) I configured my connector with "database.characterSetResults": "NULL" to override Debezium's default UTF-8 character_set_results, because I want the raw bytes from the field.

            3) I patched SnapshotReader's readField (https://github.com/debezium/debezium/pull/1909) to return the raw bytes for char-type fields.
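The decoding step of such a converter might look roughly like the sketch below. It is a standalone illustration of the logic, not the converter from this comment. The class name and the `encoding` property are hypothetical, and the wiring into Debezium's CustomConverter SPI is deliberately left out so the example runs with the JDK alone.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class RawCharDecoder {

    // Default is an assumption; it matches the ISO-8859-1 case discussed in this ticket.
    private Charset charset = StandardCharsets.ISO_8859_1;

    // Reads the hypothetical "encoding" property, falling back to ISO-8859-1.
    public void configure(Properties props) {
        charset = Charset.forName(props.getProperty("encoding", "ISO-8859-1"));
    }

    // Decodes raw column bytes with the configured charset.
    // Values that are not byte arrays are passed through as strings.
    public Object convert(Object input) {
        if (input == null) {
            return null;
        }
        if (input instanceof byte[]) {
            return new String((byte[]) input, charset);
        }
        return input.toString();
    }

    public static void main(String[] args) {
        RawCharDecoder decoder = new RawCharDecoder();
        Properties props = new Properties();
        props.setProperty("encoding", "ISO-8859-1");
        decoder.configure(props);

        // 0xE9 is U+00E9 (e with acute accent) in ISO-8859-1.
        Object value = decoder.convert(new byte[] { (byte) 0xE9 });
        System.out.printf("U+%04X%n", (int) ((String) value).charAt(0));
    }
}
```

This only helps when the reader hands over raw bytes, which is why the comment above also needs steps 2 and 3: once a value has been decoded to a String with the wrong charset, the original bytes may already be lost.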

            Arik Cohen (Inactive) added a comment:

            1) Actual data is lost, because the CHAR field value is encoded using the wrong charset, presumably the default charset setting of the JDBC URL. I'm still investigating that.

            2) That sounds like a good idea. I can take a look at that.

            Jiri Pechanec added a comment:

            Hi,

            1) What kind of information is lost? Data or metadata?
            2) RelationalColumn could be extended to provide data from io.debezium.relational.Column.charsetName()

            J.

            Arik Cohen (Inactive) added a comment (edited):

            jpechane, I like the idea and I think that's generally the way to go.

            A couple of problems I encountered while testing it out:

            1) The SnapshotReader passes the field's value to the converter as a String (ignoring the column's charset altogether, from what I can tell), with some of the information lost.

            2) The BinlogReader passes the raw bytes of the field to the converter, which is great, but I don't see a way to get the column's original character set. I tried RelationalColumn#typeExpression, which according to the JavaDocs should give me the character set as well, but I'm only getting the type name (CHAR).

            Any pointers?

            Jiri Pechanec added a comment - Hi, this sounds like a perfect fit for https://debezium.io/documentation/reference/1.3/development/converters.html

              Assignee: Unassigned
              Reporter: Arik Cohen (Inactive) (creactiviti)