Changed the MySQL connector to use comma-separated lists of regular expressions for the database
and table whitelist/blacklists. Literals are still accepted and will match fully-qualified table names,
although the '.' character used as a delimiter is also a special character in regular expressions and
therefore may need to be escaped with a double backslash ('\\') to more carefully match fully-qualified
table names.
Added several new configuration properties for the MySQL connector that instruct it to hide,
truncate, and/or mask certain columns. The properties' values are all lists of regular expressions
or literal fully-qualified column names. For example, the following configuration property:
column.blacklist=server.users.picture,server.users.other
will cause the connector to leave out of change event messages for the `server.users` table those
fields that correspond to the `picture` and `others` columns. This capability can be used to
This capability can be used to prevent dissemination of sensitive information in the change event
stream.
An alternative to blacklisting is masking. The following configuration property:
column.mask.with.10.chars=server\\.users\\.(\\w*email)
will cause the connector to mask in the change event messages for the `server.users` table
all values for columns whose name ends in `email`. The values will be replaced in this case with
a constant string of 10 asterisk ('*') characters, even when the email value is null.
This capability can also be used to prevent dissemination of sensitive information in the change event
stream.
Another option is to truncate string values for specific columns. The following configuration
property:
column.truncate.to.120.chars=server[.]users[.](description|biography)
will cause the connector to truncate to at most 120 characters the values of the `description` and
`biography` columns in the change event messages for the `server.users` table. Although this example
used a limit of 120 characters, any positive length can be specified; separate properties should
be used when different lengths are required. Note how the '.' delimiter in the fully-qualified names
is escaped since that same character is a special character in regular expressions. This capability
can be used to reduce the size of change event messages.
Drop table/view statements that involve more than one table generate one event for each table/view. Previously, each of those statements had the original multi-table/view statement. Now, each event has a statement that applies to only that table (generated from the original with all the same clauses).
The previous change did not correctly capture the statements for a `RENAME TO` that renamed multiple tables, so fixed the code so that it generates a single `RENAME TO` for each table rename.
Refactored the mechanism by which components can listen to the activities of a DDL parser. The new approach
should be significantly more flexible for additional types of DDL events while making it easier to maintain
backward compatibility. It also will enable passing event-specific information on each DDL event.
This utility is only suitable for unit tests and therefore is defined in the test JAR of the `debezium-core` module. It certainly should never be used for production purposes.
Changed the EmbeddedConnector framework to initialize all major components via configuration properties rather than through the public builder. This increases the size of the configurations, but it simplifies what embedding applications must do to obtain an EmbeddedConnector instance.
The DatabaseHistory framework was also changed to be configurable in similar ways to the OffsetBackingStore. Essentially, connectors that want to use it (like the MySqlConnector) will describe it as part of the connector's configuration, allowing more flexibility in which DatabaseHistory implementation is used and how it is configured whether in Kafka Connector or as part of the EmbeddedConnector.
Added a README.md to `debezium-embedded` to provide documentation and sample code showing how to use the EmbeddedConnector.
Adds a small framework for recording the DDL operations on the schema state (e.g., Tables) as they are read and applied from the log, and when restarting the connector task to recover the accumulated schema state. Where and how the DDL operations are recorded is an abstraction called `DatabaseHistory`, with three options: in-memory (primarily for testing purposes), file-based (for embedded cases and perhaps standalone Kafka Connect uses), and Kafka (for normal Kafka Connect deployments).
The `DatabaseHistory` interface methods take several parameters that are used to construct a `SourceRecord`. The `SourceRecord` type was not used, however, since that would result in this interface (and potential extension mechanism) having a dependency on and exposing the Kafka API. Instead, the more general parameters are used to keep the API simple.
The `FileDatabaseHistory` and `MemoryDatabaseHistory` implementations are both fairly simple, but the `FileDatabaseHistory` relies upon representing each recorded change as a JSON document. This is simple, is easily written to files, allows for recovery of data from the raw file, etc. Although this was done initially using Jackson, the code to read and write the JSON documents required a lot of boilerplate. Instead, the `Document` framework developed during Debezium's very early prototype stages was brought back. It provides a very usable API for working with documents, including the ability to compare documents semantically (e.g., numeric values are converted to be able to compare their numeric values rather than just compare representations) and with or without field order.
The `KafkaDatabaseHistory` is a bit more complicated, since it uses a Kafka broker to record all database schema changes on a single topic with single partition, and then upon restart uses it to recover the history from the dedicated topics. This implementation also records the changes as JSON documents, keeping it simple and independent of the Kafka Connect converters.
The connector is in a basic working state, although it is not well tested yet and upon restart does not recover the schema state from the previous run.