That way, any generated sources won't be parsed to begin with, instead of only discarding their violations. This saves ~30 sec. for the DDL parser module.
The driver upgrade mitigates some issues with using this connector with
Postgres on Azure. It comes with some behavioural changes, though:
* column metadata for DECIMAL without scale is returned differently by
the (see https://github.com/pgjdbc/pgjdbc/issues/1767): while it used
to be returned as 0, it's now returned as null. This should be
transparent to DBZ consumers
* snapshots for partitioned tables only export change events on the
partition topics now due to https://github.com/pgjdbc/pgjdbc/pull/1708;
this has an impact on consumers, but I think it's more reasonable than
exporting all change events twice, one partition table and main table
topics
* The filter config is stored in offsets now, allowing to detect changes
to the config after a connector restart
* In that case, roughly the following steps are done:
- set up a binlog reader for the "old" tables
- set up a snapshot reader for the "new" tables
- if the snapshot is done, set up a binlog reader for the new tables
- if both binlog readers are in proximity of the logs head, stop them
and set up a single binlog reader for all tables
* That behavior is disabled by default for the purposes of testing and
can be enabled via new connector option "snapshot.new.tables"
* To facilitate connector restarts while whitelist/blacklist changes are
processed, separate restart offsets are added to the offsets
* Using database encoding for string conversion
* Not making hstore schemas optional by default
* Using Jackson instead of GSon for JSON serialization
* Removing superfluous method and log messages
* Adjusting to naming and style conventions
It would have been needed for the Postres module only anyways. But it seems the generator plug-in is adding that source path automatically to the compilation already, so it's not needed at all.
Changed how the mongo-init process waits to begin by now looking for the second MongoDB server log message
saying it is ready, since the MongoDB image now has different startup behavior.
Upgraded Shyiko’s MySQL binary log client library from 0.8.0 to 0.9.0 to get new timeout behavior when it opens sockets and fix for JSON array processing.
Some of our test cases verify (de)serialization using the Avro Converter, which is included in the Confluent Platform. This commit upgrades the Confluent Platform to version 3.1.2, which matches Kafka 0.10.1.1.
The KafkaDatabaseHistory class was not behaving well in tests using my local development environment. When restoring from the persisted Kafka topic, the class would set up a Kafka consumer and see repeated messages. It is unclear whether the repeats were due to our test environment and very short poll timeouts. Regardless, the restore logic was refactored to track offsets so as to only process messages once.
The project requires that all JavaDoc for public methods exist and are valid (e.g., have all @param, @return and @throws to match the signature). However, the generated Java source for Protobuf contain numerous JavaDoc errors relative to these settings. This causes lots of errors inside Eclipse (and probably other IDEs), but ignoring/disabling the JavaDoc errors leads to improper JavaDoc (fixed in next commit). By moving the generated Protobuf source code to a separate directory (e.g., `generated-sources`), the IDEs will automatically discover the directory and the user can ignore any compiler and JavaDoc errors/warnings for those files while keeping the more strict JavaDoc checking enabled for the rest of the code.
The PostgreSQL connetor was not able to build locally, since the Maven build would wait forever trying to talk to the TCP port for PostgreSQL before starting the integration tests. Even when I corrected the `wait` specification to use the localhost (rather than the direct container address), the build successfully connected to Postgres when it started the first time but before it shutdown to adjust the configuration, and thus the tests failed as the server was shutdown. The build now looks for a specific log message which is unique and output by the container after the second startup, and this seems to work great (at least locally).
The version of the DB server required for this to work is at least 9.4. To be able to stream logical changes, the code relies on enhancements to the JDBC driver which are not yet public. Therefore, the current codebase includes the sources for the JDBC driver.
The commit also updates the general DBZ build system for:
* custom checkstyle package exclusions - required by the Postgres driver the protobuf code for now
* adds support for debugging Surefire and Failsafe
The version of the DB server required for this to work is at least 9.4
The commit also updates the general DBZ build system for:
* custom checkstyle package exclusions - required by the Postgres driver the protobuf code for now
* adds support for debugging Surefire and Failsafe
These new modules run during the '-Passembly' profile and use the new integration test framework that compares all
output produced by a connector to expected results that were previously recorded and verified. These integration test modules
can be run manually with a simple build of those modules or their parent; only the top-level 'integration-tests' module is run
during the assembly profile during builds of the entire codebase.
Adds support for MySQL 5.7's `JSON` type, which is capable of holding JSON objects, JSON arrays, and scalar values. The Debezium MySQL connector represents `JSON` values as string with a `io.debezium.data.Json` semantic type (which is basically a string schema that has a special name to denote the semantics), and the _contents_ of that string will be the JSON representation of the object, array, or scalar value.
Changed the MySQL connector to have several new configuration properties for setting up the SSL key store and trust store (which can be used in place of System or JDK properties) used for MySQL secure connections, and another property to specify what kind of SSL connection be used.
Modified several integration tests to ensure all MySQL connections are made with `useSSL=false`.
Rewrote how the MySQL connector converts temporal values to use schemas with names that identify the semantic
type of temporal value, and customized how the MySQL binlog client library creates Java object values from the
raw binlog events.
Several new "semantic" schema types were defined:
* `io.debezium.time.Year` represents a year number as an INT32 value (e.g., 2016, -345, etc.).
* `io.debezium.time.Date` represents a date by storing the epoch seconds (that is, the number of seconds past the epoch) as an INT64 value.
* `io.debezium.time.Time` represents a time by storing the milliseconds past midnight as an INT32 value.
* `io.debezium.time.MicroTime` represents a time by storing the microsconds past midnight as an INT32 value.
* `io.debezium.time.NanoTime` represents a time by storing the nanoseconds past midnight as an INT32 value.
* `io.debezium.time.Timestamp` represents a date and time (without timezone information) by storing the milliseconds past epoch as an INT64 value.
* `io.debezium.time.MicroTimestamp` represents a date and time (without timezone information) by storing the microseconds past epoch as an INT64 value.
* `io.debezium.time.NanoTimestamp` represents a date and time (without timezone information) by storing the nanoseconds past epoch as an INT64 value.
* `io.debezium.time.ZonedTime` represents a time with timezone and optional fractions of a second (but no date) by storing the ISO8601 form as a STRING value (e.g., `10:15:30+01:00`)
* `io.debezium.time.ZonedTimestamp` represents a date and time with timezone and optional fractions of a second by storing the ISO8601 form as a STRING value (e.g., `2011-12-03T10:15:30.030431+01:00`)
This range of semantic types allows for a far more accurate representation in the events of the temporal values stored within the database. The MySQL connector chooses the semantic type based upon the precision of the MySQL type (e.g., `TIMESTAMP(6)` will be represented with `io.debezium.time.MicroTimestamp`, whereas `TIMESTAMP(3)` will be represented with `io.debezium.time.Timestamp`). This ensures that the events do not lose precision and that the semantics of the database column values are retained in the events even though the values are represented with primitive values.
Obviously these Kafka Connect schema representations are different and more precise than the built-in `org.apache.kafka.connect.data.Date`, `org.apache.kafka.connect.data.Time`, and `org.apache.kafka.connect.data.Timestamp` logical types provided by Kafka Connect and used by the MySQL connector in all 0.2.x and 0.1.x versions. Migration to the new MySQL connector should be possible, although consumers may still need to know about these types to properly handle temporal values and the correct precision (i.e., consumers can just assume all date INT64 values represent milliseconds).
The MySQL binlog client library converted the raw binary event information to JDBC types using a local Calendar instance, which obviously incorporates the local timezone and cannot retain more than millisecond precision. This change extends the library's deserializers to instead use the Java 8 `javax.time` classes and to retain the exact semantics of the database values and to not lose any precisions (since the `javax.time` classes have nanosecond precision).
The same logic is also used to convert the JDBC values obtained during a snapshot from the MySQL Connect/J JDBC driver. The latter has a few quirks, such as not returning any fractional seconds for `TIME` columns, even though `java.sql.Time` can store up to milliseconds.
Most of the logic of the conversions of values and mapping to Kafka Connect schemas is handled in the new `JdbcValueConverters`, which was extracted from the existing `TableSchemaBuilder`. The MySQL connector reuses and actually extends the `JdbcValueConverters` class with its own `MySqlValueConverters` class that also adds support for MySQL-specific types such as `YEAR`. Other connectors whose values are based on JDBC types should be able to reuse and/or extend the `JdbcValueConverters` class.
Integration tests that deal with temporal types were modified to use proper expected values and comparisons.
Upgraded from Kafka 0.9.0.1 to Kafka 0.10.0. The only required change was to override the `Connector.config()` method, which returns `null` or a `ConfigDef` instance that contains detailed metadata for each of the configuration fields, including supporting recommended values and marking fields as not visible (e.g., if they don't make sense given other configuration field values). This can be used by user interfaces to data-drive the configuration of a connector. Also, the default validation logic of the Connector implementations uses a `Validator` that is pretty restrictive in its functionality.
Debezium already had a fairly decent and simple `Configuration` framework. After several attempts to try and merge these concepts, reconciling the two validation mechanisms was very complicated and involved a lot of changes. It was easier to simply continue Debezium-specific validation and to override the `Connector.validate(...)` method to use Debezium's `Configuration`-based validation. Connector-based validation logic includes determining recommended values, so Debezium's `Field` class (used to define each configuration property) was enhanced with a new `Recommender` class that is similar to Kafka's.
Additional integration tests were added to verify that the `ConfigDef` result is acceptable and that the new connector validation logic works as expected, including getting recommended values for some fields (e.g., database names, table/collection names) from MySQL and MongoDB by connecting and dynamically reading the values. This was done in a way that remains backward compatible with the regular expression formats of these fields, but in a user interface that uses the `ConfigDef` mechanism the user can simply select the databases and table/collection identifiers.
Added an integration test case to diagnose the loss of the fractional seconds from MySQL temporal values. The problem appears to be a bug in the MySQL Binary Log Connector library that we used, and this bug was reported as https://github.com/shyiko/mysql-binlog-connector-java/issues/103. That was fixed in version 0.3.2 of the library, which Stanley was kind enough to release for us.
During testing, though, several issues were discovered in how temporal values are handled and converted from the MySQL events, through the MySQL Binary Log client library, and through the Debezium MySQL connector to conform with Kafka Connect's various temporal logical schema types. Most of the issues involved converting most of the temporal values from local time zone (which is how they are created by the MySQL Binary Log client) into UTC (which is how Kafka Connect expects them). Really, java.util.Date doesn't have time zone information and instead tracks the number of milliseconds past epoch, but the conversion of normal timestamp information to the milliseconds past epoch in UTC depends on the time zone in which that conversion happens.
Added a new `debezium-connector-mongodb` module that defines a MongoDB connector. The MongoDB connector can capture and record the changes within a MongoDB replica set, or when seeded with addresses of the configuration server of a MongoDB sharded cluster, the connector captures the changes from the each replica set used as a shard. In the latter case, the connector even discovers the addition of or removal of shards.
The connector monitors each replica set using multiple tasks and, if needed, separate threads within each task. When a replica set is being monitored for the first time, the connector will perform an "initial sync" of that replica set's databases and collections. Once the initial sync has completed, the connector will then begin tailing the oplog of the replica set, starting at the exact point in time at which it started the initial sync. This equivalent to how MongoDB replication works.
The connector always uses the replica set's primary node to tail the oplog. If the replica set undergoes an election and different node becomes primary, the connector will immediately stop tailing the oplog, connect to the new primary, and start tailing the oplog using the new primary node. Likewise, if connector experiences any problems communicating with the replica set members, it will try to reconnect (using exponential backoff so as to not overwhelm the replica set) and continue tailing the oplog from where it last left off. In this way the connector is able to dynamically adjust to changes in replica set membership and to automatically handle communication failures.
The MongoDB oplog contains limited information, and in particular the events describing updates and deletes do not actually have the before or after state of the documents. Instead, the oplog events are all idempotent, so updates contain the effective changes that were made during an update, and deletes merely contain the deleted document identifier. Consequently, the connector is limited in the information it includes in its output events. Create and read events do contain the initial state, but the update contain only the changes (rather than the before and/or after states of the document) and delete events do not have the before state of the deleted document. All connector events, however, do contain the local system timestamp at which the event was processed and _source_ information detailing the origins of the event, including the replica set name, the MongoDB transaction timestamp of the event, and the transactions identifier among other things.
It is possible for MongoDB to lose commits in specific failure situations. For exmaple, if the primary applies a change and records it in its oplog before it then crashes unexpectedly, the secondary nodes may not have had a chance to read those changes from the primary's oplog before the primary crashed. If one such secondary is then elected as primary, it's oplog is missing the last changes that the old primary had recorded and no longer has those changes. In these cases where MongoDB loses changes recorded in a primary's oplog, it is possible that the MongoDB connector may or may not capture these lost changes.