Go to file
Randall Hauch 2da5b37f76 DBZ-1 Added support for recording and recovering database schema
Adds a small framework for recording the DDL operations on the schema state (e.g., Tables) as they are read and applied from the log, and when restarting the connector task to recover the accumulated schema state. Where and how the DDL operations are recorded is an abstraction called `DatabaseHistory`, with three options: in-memory (primarily for testing purposes), file-based (for embedded cases and perhaps standalone Kafka Connect uses), and Kafka (for normal Kafka Connect deployments).

The `DatabaseHistory` interface methods take several parameters that are used to construct a `SourceRecord`. The `SourceRecord` type was not used, however, since that would result in this interface (and potential extension mechanism) having a dependency on and exposing the Kafka API. Instead, the more general parameters are used to keep the API simple.

The `FileDatabaseHistory` and `MemoryDatabaseHistory` implementations are both fairly simple, but the `FileDatabaseHistory` relies upon representing each recorded change as a JSON document. This is simple, is easily written to files, allows for recovery of data from the raw file, etc. Although this was done initially using Jackson, the code to read and write the JSON documents required a lot of boilerplate. Instead, the `Document` framework developed during Debezium's very early prototype stages was brought back. It provides a very usable API for working with documents, including the ability to compare documents semantically (e.g., numeric values are converted to be able to compare their numeric values rather than just compare representations) and with or without field order.

The `KafkaDatabaseHistory` is a bit more complicated, since it uses a Kafka broker to record all database schema changes on a single topic with single partition, and then upon restart uses it to recover the history from the dedicated topics. This implementation also records the changes as JSON documents, keeping it simple and independent of the Kafka Connect converters.
2016-02-02 14:27:14 -06:00
debezium-core DBZ-1 Added support for recording and recovering database schema 2016-02-02 14:27:14 -06:00
debezium-embedded DBZ-1 Added support for recording and recovering database schema 2016-02-02 14:27:14 -06:00
debezium-ingest-jdbc DBZ-4 Changed copyright statement in source code headers and adjusted checkstyle rules. 2016-01-27 08:12:01 -06:00
debezium-ingest-mysql DBZ-1 Added support for recording and recovering database schema 2016-02-02 14:27:14 -06:00
debezium-ingest-postgres DBZ-1 Added the initial stages of a MySQL source connector 2016-01-29 10:12:28 -06:00
debezium-kafka-connect Updated the copyright dates per new approach. 2016-01-25 18:33:08 -06:00
support DBZ-1 Added support for recording and recovering database schema 2016-02-02 14:27:14 -06:00
.gitattributes DBZ-6 Enforce line ending style for most file types. 2016-01-27 08:55:09 -06:00
.gitignore Initial project skeleton 2015-11-18 14:23:29 -06:00
.travis.yml Changed Docker usage on Travis-CI 2016-01-25 16:12:07 -06:00
CHANGELOG.md DBZ-1 Added the initial stages of a MySQL source connector 2016-01-29 10:12:28 -06:00
CONTRIBUTE.md Updated the README and added a CONTRIBUTE.md file with details for developers. 2016-01-25 18:33:07 -06:00
COPYRIGHT.txt Readded copyright file with correct case. 2016-01-27 09:09:30 -06:00
LICENSE.txt Renamed license file to mirror form used in other top-level filenames. 2016-01-27 09:06:23 -06:00
pom.xml DBZ-1 Added support for recording and recovering database schema 2016-02-02 14:27:14 -06:00
README.md DBZ-4 Changed copyright statement in source code headers and adjusted checkstyle rules. 2016-01-27 08:12:01 -06:00

Build Status License

Copyright Debezium Authors. Licensed under the Apache License, Version 2.0.

Debezium

Debezium is an open source project that provides a near real-time data streaming platform for change data capture (CDC). You setup and configure Debezium to monitor your databases, and then your application uses Debezium libraries to receive detailed events for each row-level change made to the database. Only committed changes are visible, so your application doesn't have to worry about transactions or changes that are rolled back. Debezium provides a single model of all change events, so your application does not have to worry about the intricacies of each kind of database management system. Additionally, since Debezium records the history of data changes in its logs, your application can be stopped and restarted at any time, and it will be able to consume all of the events it missed while it was not running, ensuring that all events are processed correctly and completely.

Monitoring databases and being notified when data changes has always been complicated. Relational database triggers can be useful, but are specific to each database and often limited to updating state within the same database (not communicating with external processes). Some databases offer APIs or frameworks for monitoring changes, but there is no standard so each database's approach is different and requires a lot of knowledged and specialized code. It still is very challenging to ensure that all changes are seen and processed in the same order while minimally impacting the database.

Debezium provides modules that do this work for you. Some modules are generic and work with multiple database management systems, but are also a bit more limited in functionality and performance. Other modules are tailored for specific database management systems, so they are often far more capable and they leverage the specific features of the system.

Common use cases

There are a number of scenarios in which Debezium can be extremely valuable, but here we outline just a few of them that are more common.

Cache invalidation

Automatically invalidate entries in a cache as soon as the record(s) for entries change or are removed. If the cache is running in a separate process (e.g., Redis, Memcache, Infinispan, and others), then the simple cache invalidation logic can be placed into a separate process or service, simplifying the main application. In some situations, the logic can be made a little more sophisticated and can use the updated data in the change events to update the affected cache entries.

Simplifying monolithic applications

Many applications update a database and then do additional work after the changes are committed: update search indexes, update a cache, send notifications, run business logic, etc. This is often called "dual-writes" since the application is writing to multiple systems outside of a single transaction. Not only is the application logic complex and more difficult to maintain, dual writes also risk losing data or making the various systems inconsistent if the application were to crash after a commit but before some/all of the other updates were performed. Using change data capture, these other activities can be performed in separate threads or separate processes/services when the data is committed in the original database. This approach is more tolerant of failures, does not miss events, scales better, and more easily supports upgrading and operations.

Sharing databases

When multiple applications share a single database, it is often non-trivial for one application to become aware of the changes committed by another application. One approach is to use a message bus, although non-transactional message busses suffer from the "dual-writes" problems mentioned above. However, this becomes very straightforward with Debezium: each application can monitor the database and react to the changes.

Data integration

Data is often stored in multiple places, especially when it is used for different purposes and has slightly different forms. Keeping the multiple systems synchronized can be challenging, but simple ETL-type solutions can be implemented quickly with Debezium and simple event processing logic.

CQRS

The Command Query Responsibility Separation (CQRS) architectural pattern uses a one data model for updating and one or more other data models for reading. As changes are recorded on the update-side, those changes are then processed and used to update the various read representations. As a result CQRS applications are usually more complicated, especially when they need to ensure reliable and totally-ordered processing. Debezium and CDC can make this more approachable: writes are recorded as normal, but Debezium captures those changes in durable, totally ordered streams that are consumed by the services that asynchronously update the read-only views. The write-side tables can represent domain-oriented entities, or when CQRS is paired with Event Sourcing the write-side tables are the append-only event log of commands.

Building Debezium

The following software is required to work with the Debezium codebase and build it locally:

See the links above for installation instructions on your platform. You can verify the versions are installed and running:

$ git --version
$ javac -version
$ mvn -version
$ docker --version

Why Docker?

Many open source software projects use Git, Java, and Maven, but requiring Docker is less common. Debezium is designed to talk to a number of external systems, such as various databases and services, and our integration tests verify Debezium does this correctly. But rather than expect you have all of these software systems installed locally, Debezium's build system uses Docker to automatically download or create the necessary images and start containers for each of the systems. The integration tests can then use these services and verify Debezium behaves as expected, and when the integration tests finish, Debezum's build will automatically stop any containers that it started.

Debezium also has a few modules that are not written in Java, and so they have to be required on the target operating system. Docker lets our build do this using images with the target operating system(s) and all necessary development tools.

Using Docker has several advantages:

  1. You don't have to install, configure, and run specific versions of each external services on your local machine, or have access to them on your local network. Even if you do, Debezium's build won't use them.
  2. We can test multiple versions of an external service. Each module can start whatever containers it needs, so different modules can easily use different versions of the services.
  3. Everyone can run complete builds locally. You don't have to rely upon a remote continuous integration server running the build in an environment set up with all the required services.
  4. All builds are consistent. When multiple developers each build the same codebase, they should see exactly the same results -- as long as they're using the same or equivalent JDK, Maven, and Docker versions. That's because the containers will be running the same versions of the services on the same operating systems. Plus, all of the tests are designed to connect to the systems running in the containers, so nobody has to fiddle with connection properties or custom configurations specific to their local environments.
  5. No need to clean up the services, even if those services modify and store data locally. Docker images are cached, so building them reusing them to start containers is fast and consistent. However, Docker containers are never reused: they always start in their pristine initial state, and are discarded when they are shutdown. Integration tests rely upon containers, and so cleanup is handled automatically.

Building the code

First obtain the code by cloning the Git repository:

$ git clone https://github.com/debezium/debezium.git
$ cd keycloak

Then build the code using Maven:

$ mvn clean install

The build starts and uses several Docker containers for different DBMSes. Note that if Docker is not running, you'll likely get an arcane error -- if this is the case, always verify that Docker is running, perhaps by using docker ps to list the running containers.

Contributing

The Debezium community welcomes anyone that wants to help out in any way, whether that includes reporting problems, helping with documentation, or contributing code changes to fix bugs, add tests, or implement new features. See this document for details.