= Avro Serialization
include::../_attributes.adoc[]
:toc:
:toc-placement: macro
:linkattrs:
:icons: font
:source-highlighter: highlight.js
toc::[]
Debezium connectors are used with the Kafka Connect framework to capture changes in databases and generate change events.
The Kafka Connect workers then apply the transformations configured for the connector to each message that the connector generates,
serialize each message key and value into a binary form using the configured https://kafka.apache.org/documentation/#connect_running[_converters_],
and finally write each message into the correct Kafka topic.
The converters can either be specified in the Kafka Connect worker configuration,
in which case the same converters are used for all connectors deployed to that worker's cluster.
Alternatively, they can be specified for an individual connector.
Kafka Connect comes with a _JSON converter_ that serializes the message keys and values into JSON documents.
The JSON converter can be configured to include or exclude the message schema using the `key.converter.schemas.enable` and `value.converter.schemas.enable` properties.
Our xref:tutorial.adoc[tutorial] shows what the messages look like when both payload and schemas are included, but the schemas make the messages very verbose.
If you want your messages serialized with JSON, consider setting these properties to `false` to exclude the verbose schema information.
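
As a minimal sketch of the JSON option, the following commands append converter settings with schemas disabled to a worker configuration file.
The file path is a placeholder (an assumption, not something the Debezium images require); `org.apache.kafka.connect.json.JsonConverter` is the JSON converter class that ships with Kafka Connect.

[source,shell]
----
# WORKER_PROPS is a hypothetical path; point it at your own worker configuration file.
WORKER_PROPS="${WORKER_PROPS:-./connect-distributed.properties}"

# Append JSON converter settings that omit the embedded schema from every message.
cat >> "$WORKER_PROPS" <<'EOF'
key.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=false
value.converter=org.apache.kafka.connect.json.JsonConverter
value.converter.schemas.enable=false
EOF
----

The same four properties can instead be placed in an individual connector's configuration to override the worker defaults for that connector only.
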
Alternatively, you can serialize the message keys and values using https://avro.apache.org/[Apache Avro].
The Avro binary format is extremely compact and efficient, and Avro schemas make it possible to ensure that the messages have the correct structure.
Avro's schema evolution mechanism makes it possible to evolve the schemas over time,
which is essential for Debezium connectors that dynamically generate the message schemas to match the structure of the database tables.
Over time, the change events captured by Debezium connectors and written by Kafka Connect into a topic may have different versions of the same schema,
and Avro serialization makes it far easier for consumers to adapt to the changing schema.
== The Apicurio API and Schema Registry
The open-source project https://github.com/Apicurio/apicurio-registry[Apicurio Registry] provides several components that work with Avro:
* An Avro converter that can be used in Kafka Connect workers to map the Kafka Connect schemas into Avro schemas and to then use those Avro schemas to serialize the message keys and values into the very compact Avro binary form.
* An API/Schema registry that tracks all of the Avro schemas used in Kafka topics, and where the Avro Converter sends the generated Avro schemas.
Since the Avro schemas are stored in this registry, each message need only include a tiny _schema identifier_.
This makes each message even smaller, and for an I/O-bound system like Kafka this means higher total throughput for producers and consumers.
* Avro _Serdes_ (serializers and deserializers) for Kafka producers and consumers.
Any Kafka consumer applications you write to consume change events can use the Avro Serdes to deserialize the change events.

You can install these components into any Kafka distribution and use them with Kafka Connect.

[NOTE]
====
The Apicurio project also provides a JSON converter that can be used with the Apicurio registry.
This combines the advantage of less verbose messages (as messages don't contain the schema information themselves, but only a schema id)
with human-readable JSON.
====
Another option is using the Confluent schema registry, which is described further below.
== Technical Information
A system that wants to use Avro serialization needs to complete three steps:

* Deploy an https://github.com/Apicurio/apicurio-registry[Apicurio API/Schema Registry] instance
* Install the Avro converter from https://repo1.maven.org/maven2/io/apicurio/apicurio-registry-distro-connect-converter/{apicurio-version}/apicurio-registry-distro-connect-converter-{apicurio-version}-converter.tar.gz[the installation package] into Kafka's _libs_ directory or directly into a plug-in directory
* Use the following properties to configure the Kafka Connect instance:

[source]
----
key.converter=io.apicurio.registry.utils.converter.AvroConverter
key.converter.apicurio.registry.url=http://apicurio:8080
key.converter.apicurio.registry.converter.serializer=io.apicurio.registry.utils.serde.AvroKafkaSerializer
key.converter.apicurio.registry.converter.deserializer=io.apicurio.registry.utils.serde.AvroKafkaDeserializer
key.converter.apicurio.registry.global-id=io.apicurio.registry.utils.serde.strategy.AutoRegisterIdStrategy
value.converter=io.apicurio.registry.utils.converter.AvroConverter
value.converter.apicurio.registry.url=http://apicurio:8080
value.converter.apicurio.registry.converter.serializer=io.apicurio.registry.utils.serde.AvroKafkaSerializer
value.converter.apicurio.registry.converter.deserializer=io.apicurio.registry.utils.serde.AvroKafkaDeserializer
value.converter.apicurio.registry.global-id=io.apicurio.registry.utils.serde.strategy.AutoRegisterIdStrategy
----
Note that Kafka Connect internally always uses the JSON key/value converters for storing configuration and offsets.
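
The key and value settings listed above are symmetric, so one way to avoid copy-and-paste mistakes is to generate them.
The function below is only a sketch: its name is hypothetical, and the registry URL defaults to the `http://apicurio:8080` address used in the example above.

[source,shell]
----
# REGISTRY_URL mirrors the example above; adjust it for your environment.
REGISTRY_URL="${REGISTRY_URL:-http://apicurio:8080}"

# Print the Apicurio Avro converter settings for both the key and the value side.
apicurio_converter_properties() {
  for side in key value; do
    prefix="${side}.converter"
    printf '%s=%s\n' "$prefix" 'io.apicurio.registry.utils.converter.AvroConverter'
    printf '%s.apicurio.registry.url=%s\n' "$prefix" "$REGISTRY_URL"
    printf '%s.apicurio.registry.converter.serializer=%s\n' "$prefix" 'io.apicurio.registry.utils.serde.AvroKafkaSerializer'
    printf '%s.apicurio.registry.converter.deserializer=%s\n' "$prefix" 'io.apicurio.registry.utils.serde.AvroKafkaDeserializer'
    printf '%s.apicurio.registry.global-id=%s\n' "$prefix" 'io.apicurio.registry.utils.serde.strategy.AutoRegisterIdStrategy'
  done
}
----

Calling `apicurio_converter_properties` prints the ten property lines, which can then be redirected into the worker configuration file.
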
== Debezium Container Images
Deploy an Apicurio Registry instance (this example uses a non-production in-memory instance):
[source]
[subs="attributes"]
----
docker run -it --rm --name apicurio \
-p 8080:8080 apicurio/apicurio-registry-mem:{apicurio-version}
----
Build a Debezium image that includes the Avro converter, using this https://github.com/debezium/debezium-examples/blob/master/tutorial/debezium-with-apicurio/Dockerfile[Dockerfile]:
[source]
[subs="attributes"]
----
docker build --build-arg DEBEZIUM_VERSION={debezium-docker-label} -t debezium/connect-apicurio:{debezium-docker-label} .
----
Run a Kafka Connect image configured to use Avro:
[source]
[subs="attributes"]
----
docker run -it --rm --name connect \
--link zookeeper:zookeeper \
--link kafka:kafka \
--link mysql:mysql \
--link apicurio:apicurio \
-e GROUP_ID=1 \
-e CONFIG_STORAGE_TOPIC=my_connect_configs \
-e OFFSET_STORAGE_TOPIC=my_connect_offsets \
-e KEY_CONVERTER=io.apicurio.registry.utils.converter.AvroConverter \
-e VALUE_CONVERTER=io.apicurio.registry.utils.converter.AvroConverter \
-e CONNECT_KEY_CONVERTER=io.apicurio.registry.utils.converter.AvroConverter \
-e CONNECT_KEY_CONVERTER_APICURIO_REGISTRY_URL=http://apicurio:8080 \
-e CONNECT_KEY_CONVERTER_APICURIO_REGISTRY_CONVERTER_SERIALIZER=io.apicurio.registry.utils.serde.AvroKafkaSerializer \
-e CONNECT_KEY_CONVERTER_APICURIO_REGISTRY_CONVERTER_DESERIALIZER=io.apicurio.registry.utils.serde.AvroKafkaDeserializer \
-e CONNECT_KEY_CONVERTER_APICURIO_REGISTRY_GLOBAL-ID=io.apicurio.registry.utils.serde.strategy.AutoRegisterIdStrategy \
-e CONNECT_VALUE_CONVERTER=io.apicurio.registry.utils.converter.AvroConverter \
-e CONNECT_VALUE_CONVERTER_APICURIO_REGISTRY_URL=http://apicurio:8080 \
-e CONNECT_VALUE_CONVERTER_APICURIO_REGISTRY_CONVERTER_SERIALIZER=io.apicurio.registry.utils.serde.AvroKafkaSerializer \
-e CONNECT_VALUE_CONVERTER_APICURIO_REGISTRY_CONVERTER_DESERIALIZER=io.apicurio.registry.utils.serde.AvroKafkaDeserializer \
-e CONNECT_VALUE_CONVERTER_APICURIO_REGISTRY_GLOBAL-ID=io.apicurio.registry.utils.serde.strategy.AutoRegisterIdStrategy \
-p 8083:8083 debezium/connect-apicurio:{debezium-docker-label}
----
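
The `CONNECT_`-prefixed variables in the command above are turned into worker properties by the container's start-up script.
The function below sketches that mapping as we understand it: drop the `CONNECT_` prefix, lowercase the rest, and replace underscores with dots.
It is an illustration under that assumption rather than the image's actual script, and the function name is hypothetical, so check the image's entrypoint if the exact behavior matters.

[source,shell]
----
# Hypothetical helper: derive the Kafka Connect property name from a CONNECT_* variable name.
connect_env_to_property() {
  printf '%s\n' "$1" | sed 's/^CONNECT_//' | tr '[:upper:]' '[:lower:]' | tr '_' '.'
}

connect_env_to_property CONNECT_VALUE_CONVERTER_APICURIO_REGISTRY_URL
connect_env_to_property CONNECT_KEY_CONVERTER_APICURIO_REGISTRY_GLOBAL-ID
----

The two calls print `value.converter.apicurio.registry.url` and `key.converter.apicurio.registry.global-id`.
The hyphen passes through unchanged, which is why `GLOBAL-ID` must keep it in the variable name.
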
[[avro-naming]]
== Naming
As stated in the Avro link:https://avro.apache.org/docs/current/spec.html#names[documentation], names must adhere to the following rules:
* Start with `[A-Za-z_]`
* Subsequently contain only `[A-Za-z0-9_]` characters

Debezium uses the column's name as the basis for the corresponding Avro field.
This can lead to problems during serialization if the column name does not also adhere to the Avro naming rules above.
Debezium provides a configuration option, `sanitize.field.names`, that can be set to `true` if you have columns that do not adhere to the rules above, allowing those fields to be serialized without having to modify your schema.
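
To see which column names would need sanitizing, the naming rules above can be expressed as a regular expression.
The sketch below checks a few made-up column names; both function names are hypothetical, and the clean-up function only illustrates the idea and is not the algorithm Debezium applies when `sanitize.field.names` is enabled.

[source,shell]
----
# Succeeds when the name satisfies the Avro naming rules listed above.
is_valid_avro_name() {
  printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[A-Za-z_][A-Za-z0-9_]*$'
}

# Illustrative clean-up only, not Debezium's actual algorithm:
# replace every disallowed character with "_" and prefix "_" if the name starts with a digit.
sanitize_name_sketch() {
  printf '%s\n' "$1" | LC_ALL=C sed -e 's/[^A-Za-z0-9_]/_/g' -e 's/^[0-9]/_&/'
}

# Made-up column names used purely as sample input.
for column in order_id 1st-column "unit price"; do
  if is_valid_avro_name "$column"; then
    printf '%s: valid\n' "$column"
  else
    printf '%s: invalid, for example %s\n' "$column" "$(sanitize_name_sketch "$column")"
  fi
done
----
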
== Confluent Schema Registry
Confluent develops an alternative https://github.com/confluentinc/schema-registry[schema registry] implementation.
Its configuration is slightly different; use the following properties:

[source]
----
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
----
An instance of the Confluent Schema Registry can be deployed like so:
[source]
----
docker run -it --rm --name schema-registry \
--link zookeeper \
-e SCHEMA_REGISTRY_KAFKASTORE_CONNECTION_URL=zookeeper:2181 \
-e SCHEMA_REGISTRY_HOST_NAME=schema-registry \
-e SCHEMA_REGISTRY_LISTENERS=http://schema-registry:8081 \
-p 8081:8081 confluentinc/cp-schema-registry
----
Run a Kafka Connect image configured to use Avro:
[source]
[subs="attributes"]
----
docker run -it --rm --name connect \
--link zookeeper:zookeeper \
--link kafka:kafka \
--link mysql:mysql \
--link schema-registry:schema-registry \
-e GROUP_ID=1 \
-e CONFIG_STORAGE_TOPIC=my_connect_configs \
-e OFFSET_STORAGE_TOPIC=my_connect_offsets \
-e KEY_CONVERTER=io.confluent.connect.avro.AvroConverter \
-e VALUE_CONVERTER=io.confluent.connect.avro.AvroConverter \
-e CONNECT_KEY_CONVERTER_SCHEMA_REGISTRY_URL=http://schema-registry:8081 \
-e CONNECT_VALUE_CONVERTER_SCHEMA_REGISTRY_URL=http://schema-registry:8081 \
-p 8083:8083 debezium/connect:{debezium-docker-label}
----
Run a console consumer that reads new Avro messages from the `db.myschema.mytable` topic and decodes them to JSON:
[source]
[subs="attributes"]
----
docker run -it --rm --name avro-consumer \
--link zookeeper:zookeeper \
--link kafka:kafka \
--link mysql:mysql \
--link schema-registry:schema-registry \
debezium/connect:{debezium-docker-label} \
/kafka/bin/kafka-console-consumer.sh \
--bootstrap-server kafka:9092 \
--property print.key=true \
--formatter io.confluent.kafka.formatter.AvroMessageFormatter \
--property schema.registry.url=http://schema-registry:8081 \
--topic db.myschema.mytable
----
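
To check that the converter registered schemas for the topic, you can look up the corresponding registry subjects.
The sketch below assumes the converter's default subject naming, in which a topic's key and value schemas are stored under `<topic>-key` and `<topic>-value`; the helper name is hypothetical.

[source,shell]
----
# Hypothetical helper: subject name under the default topic-based naming strategy.
# $1 = topic name, $2 = "key" or "value"
subject_for() {
  printf '%s-%s\n' "$1" "$2"
}

TOPIC="db.myschema.mytable"
for part in key value; do
  printf 'subject for the %s schema: %s\n' "$part" "$(subject_for "$TOPIC" "$part")"
done
----

With the registry reachable from your machine, `curl http://localhost:8081/subjects` lists the registered subjects, and `curl http://localhost:8081/subjects/db.myschema.mytable-value/versions/latest` returns the latest value schema for the topic.
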
== Getting More Information
link:/blog/2016/09/19/Serializing-Debezium-events-with-Avro/[This post] from the Debezium blog
describes the concepts of serializers, converters, and so on, and discusses the advantages of using Avro.
Note that some details of Kafka Connect converters have changed slightly since that post was written.
For a complete example of using Avro as the message format for Debezium data change events,
please see the https://github.com/debezium/debezium-examples/tree/master/tutorial#using-mysql-and-the-avro-message-format[MySQL and the Avro message format] tutorial example.