It appears there is an intent to restore the interruption flag
when catching InterruptedException. However the old code was just
checking the current state of the flag instead of raising it.
* Putting collection name after replica set name for MongoDB
* Removing redundant method for converting schema to string
* Removing redundant keys from PG SourceInfo
The "documents.fetch.size" configuration property is an positive integer value that specifies the maximum number of documents that should be read in one go from each collection while taking a snapshot. The connector will read the collection contents in multiple batches of this size. Default to "0", which indicates that the server chooses an appropriate fetch size.
The previous statement was never achievable since `ReplicaSets.parse(hosts)` can never return null
Also removed some code duplication for better readability
Now it's easier to follow what makes an insert, update or delete within
the transformer, also reduced the amount of npath complexity by doing
early returns and certain extractions.
Mongo internally transforms everything into $set and $unset operations
when they are in the oplog, this guarantees that you can have operations
which are $set, $unset or combined.
This will allow consumers to recognize the Debezium connector used for creating a given message, helping them to adjust their behavior for a variety of connectors.
* Changing field.renames delimiter from equal to colon character
* Adding examples of source document structure to tests
* Adding tests for case: two renames have the same target field
* Adding missing return
* Changing Path interface to abstract class
The "field.renames" configuration property is an optional comma-separated list of the fully-qualified replacements of fields that should be used to rename fields in change event message values. Fully-qualified replacements for fields are of the form "databaseName.collectionName.fieldName.nestedFieldName=newNestedFieldName", where "databaseName" and "collectionName" may contain the wildcard (*) which matches any characters, the equal character (=) is used to determine rename mapping of field.
The "field.blacklist" configuration property is an optional comma-separated list of the fully-qualified names of fields that should be excluded from change event message values. Fully-qualified names for fields are of the form "databaseName.collectionName.fieldName.nestedFieldName", where "databaseName" and "collectionName" may contain the wildcard (*) which matches any characters.
Although the "field.blacklist" configuration property allows you to remove fields from the event values, the "_id" field is always included in the event’s key.
This simplifies interaction with execute(), as we don't need to alter collections, references as a side effect but instead can work with the return value.
This protects against authorization failures when listing collections
from DBs the connector user isn't authorized for. It also simplifies
usage of MongoPrimary#databaseNames() and collections() as consumers
don't need to apply filtering themselves.
* Renaming ConnectorTaskContext to CdcSourceTaskContext
* Renaming ReplicationContext to MongoDbTaskContext
* Making relationship from MongoDbTaskContext to ConnectionContext has-a instead of is-a
also fix some checkstyle violations which are not yet reported during build process
see full PR discussion about the rationale behind the taken approach here https://github.com/debezium/debezium/pull/258
It’s not clear how valuable these recommenders actually are. First, it’s not clear about the expected semantics: can the user use values that don’t appear in the recommended values? Second, the recommenders that return large numbers of values can be slow and can result in very large REST API responses.
Debezium was using recommenders to return the database and table/collection names, but these lists can be very large for large databases. Rather than cap the number of recommended values and have the recommender return a subset of all potential values, we will instead remove the recommenders altogether.
Changed how the mongo-init process waits to begin by now looking for the second MongoDB server log message
saying it is ready, since the MongoDB image now has different startup behavior.
Corrected the MongoDB connector upon startup to restart an initial sync if the previously recorded offset signals that an initial sync was not completed in the prior run.
Also change the connector’s replicator to buffer the last record during an initial sync so that, upon completion of the initial sync, the last record can be updated with an offset that reflects that the initial sync was completed. This way, if the initial sync is completed but there are no other events in the oplog, the connector will still consider the initial sync as completed.
This change alters the way the MongoDB connects to the various servers in a cluster. Previously, the ConnectionContext constructor currently set up the MongoDB client with credentials for the `admin` and `config` databases, and apparently the client eagerly performs authentication against all databases passed in, rather than doing this lazily as DBs are use.
Instead, the code no longer sets up the credentials for the `config` database and instead only sets up credentials for the `admin` database for authentication and authorization. This works as long as the user specified in the connector configuration can read the `config` database.
Several other changes were made to improve the error handling and reporting when the replica set information cannot be read from the `config` database.
The MongoDB connector now outputs an INFO log message whenever its task's `poll()` method returns a non-empty list of `SourceRecord` objects, where the message includes the number of records and the offset of the last record.
Upgraded from Kafka 0.9.0.1 to Kafka 0.10.0. The only required change was to override the `Connector.config()` method, which returns `null` or a `ConfigDef` instance that contains detailed metadata for each of the configuration fields, including supporting recommended values and marking fields as not visible (e.g., if they don't make sense given other configuration field values). This can be used by user interfaces to data-drive the configuration of a connector. Also, the default validation logic of the Connector implementations uses a `Validator` that is pretty restrictive in its functionality.
Debezium already had a fairly decent and simple `Configuration` framework. After several attempts to try and merge these concepts, reconciling the two validation mechanisms was very complicated and involved a lot of changes. It was easier to simply continue Debezium-specific validation and to override the `Connector.validate(...)` method to use Debezium's `Configuration`-based validation. Connector-based validation logic includes determining recommended values, so Debezium's `Field` class (used to define each configuration property) was enhanced with a new `Recommender` class that is similar to Kafka's.
Additional integration tests were added to verify that the `ConfigDef` result is acceptable and that the new connector validation logic works as expected, including getting recommended values for some fields (e.g., database names, table/collection names) from MySQL and MongoDB by connecting and dynamically reading the values. This was done in a way that remains backward compatible with the regular expression formats of these fields, but in a user interface that uses the `ConfigDef` mechanism the user can simply select the databases and table/collection identifiers.
Added a new `debezium-connector-mongodb` module that defines a MongoDB connector. The MongoDB connector can capture and record the changes within a MongoDB replica set, or when seeded with addresses of the configuration server of a MongoDB sharded cluster, the connector captures the changes from the each replica set used as a shard. In the latter case, the connector even discovers the addition of or removal of shards.
The connector monitors each replica set using multiple tasks and, if needed, separate threads within each task. When a replica set is being monitored for the first time, the connector will perform an "initial sync" of that replica set's databases and collections. Once the initial sync has completed, the connector will then begin tailing the oplog of the replica set, starting at the exact point in time at which it started the initial sync. This equivalent to how MongoDB replication works.
The connector always uses the replica set's primary node to tail the oplog. If the replica set undergoes an election and different node becomes primary, the connector will immediately stop tailing the oplog, connect to the new primary, and start tailing the oplog using the new primary node. Likewise, if connector experiences any problems communicating with the replica set members, it will try to reconnect (using exponential backoff so as to not overwhelm the replica set) and continue tailing the oplog from where it last left off. In this way the connector is able to dynamically adjust to changes in replica set membership and to automatically handle communication failures.
The MongoDB oplog contains limited information, and in particular the events describing updates and deletes do not actually have the before or after state of the documents. Instead, the oplog events are all idempotent, so updates contain the effective changes that were made during an update, and deletes merely contain the deleted document identifier. Consequently, the connector is limited in the information it includes in its output events. Create and read events do contain the initial state, but the update contain only the changes (rather than the before and/or after states of the document) and delete events do not have the before state of the deleted document. All connector events, however, do contain the local system timestamp at which the event was processed and _source_ information detailing the origins of the event, including the replica set name, the MongoDB transaction timestamp of the event, and the transactions identifier among other things.
It is possible for MongoDB to lose commits in specific failure situations. For exmaple, if the primary applies a change and records it in its oplog before it then crashes unexpectedly, the secondary nodes may not have had a chance to read those changes from the primary's oplog before the primary crashed. If one such secondary is then elected as primary, it's oplog is missing the last changes that the old primary had recorded and no longer has those changes. In these cases where MongoDB loses changes recorded in a primary's oplog, it is possible that the MongoDB connector may or may not capture these lost changes.