Getting started with the Kafka Connect Cassandra Source

Part 1 sourcing data from Cassandra into Apache Kafka
photo of Mike Barlotta
Mike Barlotta

Mike Barlotta, Agile Data Engineer at WalmartLabs introduces how Kafka Connect and Stream Reactor can be leveraged to bring data from Cassandra into Apache Kafka.



This post will look at how to setup and tune the Cassandra Source connector that is available from Landoop. The Cassandra Source connector is used to read data from a Cassandra table, writing the contents into a Kafka topic using only a configuration file. This enables data that has been saved to be easily turned into an event stream.

In our example we will be capturing data representing a pack (i.e. a large box) of items being shipped. Each pack is pushed to consumers in a JSON format on a Kafka topic.

The Cassandra data model and Cassandra Source connector

Modeling data in Cassandra must be done around the queries that are needed to access the data (see this article for details). Typically this means that there will be one table for each query and data (in our case about the pack) will be duplicated across numerous tables.

Regardless of the other tables used for the product, the Cassandra Source connector needs a table that will allow us to query for data using a time range. The connector is designed around its ability to generate a CQL query based on configuration. It uses this query to retrieve data from the table that is available within a configurable time range. Once all of this data has been published, Kafka Connect will mark the upper end of the time range as an offset. The connector will then query the table for more data using the next time range starting with the date/time stored in the offset. We will look at how to configure this later. For now we want to focus on the constraints for the table. Since Cassandra doesn’t support joins, the table we are pulling data from must have all of the data that we want to put onto the Kafka topic. Data in other tables will not be available to Kafka Connect.

In it’s simplest form a table used by the Cassandra Source connector might look like this:

CREATE TABLE IF NOT EXISTS "pack_events" (
    event_id TEXT,
    event_ts TIMESTAMP,
    event_data TEXT,
PRIMARY KEY ((event_id),event_ts));

The event_id is the partition key. This is used by Cassandra to determine which nodes in the cluster will store the data. The event_ts is part of the cluster key. It determines the order of the data within the partition (see this article for details). It is also the column that is used by the Cassandra source connector to manage time ranges. In this example, the event_data column stores the JSON representation of the pack.

This is not the only structure for a table that will work. The table that is queried by the Cassandra Source connector can use numerous columns to represent the partition key and the data. However, the connector requires a single time based column (either TIMESTAMP or TIMEUUID) in order to work correctly.

This would be an equally valid table for use with the Cassandra Source connector.

CREATE TABLE IF NOT EXISTS "kc_events" (
    event_id1 TEXT,
    event_id2 TEXT,
    event_ts TIMEUUID,
    event_data1 TEXT,
    event_data2 TEXT,
PRIMARY KEY ((event_id1, event_id2)));

The most efficient way to access data in this table is to query for data with the partition key. This would allow Cassandra to quickly identify the node containing the data we are interested in.

SELECT * FROM pack_events WHERE event_id = "1234";

However, the Cassandra Source connector has no way of knowing the ids of the data that it will need to publish to a Kafka topic. That is why it uses a time range.

The reason we can’t use the event_ts as the partition key is because Cassandra does not support these operators (>, >=, <=, <) on the partition key when querying. And without these we would not be able to query across date/time ranges (see this article for details).

There’s just one more thing. If we tried to run the following query it would fail.

SELECT * FROM pack_events
WHERE event_ts > ‘2018–01–22T20:28:20.869Z’
AND event_ts <= '2018-01-22T20:28:50.869Z';

The connector must supply the ALLOW FILTERING option to the end of this query for it to work. This addition allows Cassandra to search all of the nodes in the cluster for the data in the specified time range (see this article for details).

Configuring the connector: KCQL basics

The Landoop connectors are configured using Kafka Connect Query Language (KCQL). This provides a concise and consistent way to configure the connectors (at least the ones from Landoop). The KCQL and other basic properties are provided via a JSON formatted property file.

For the sake of this post, let’s create a file named connect-cassandra-source.json

{
  "name": "packs",
  "config": {
   "tasks.max": "1",
   "connector.class": …

The name of the connector needs to be unique across all the connectors installed into Kafka Connect.

The connector.class is used to specify which connector is being used.​​

  • com.datamountaineer.streamreactor.connect.cassandra.source.CassandraSourceConnector

The next set of configuration (shown below) is used to specify the information needed to connect to the Cassandra cluster and which keyspace to use.

  • connect.cassandra.contact.points
  • connect.cassandra.port
  • connect.cassandra.username
  • connect.cassandra.password
  • connect.cassandra.consistency.level
  • connect.cassandra.key.space
{
  "name": "packs",
  "config": {
    "tasks.max": "1",
    "connector.class": "com.datamountaineer.streamreactor.connect.cassandra.source.CassandraSourceConnector",
    "connect.cassandra.contact.points": "localhost",
    "connect.cassandra.port": 9042,
    "connect.cassandra.username": "cassandra",
    "connect.cassandra.password": "cassandra",
    "connect.cassandra.consistency.level": "LOCAL_ONE",
    "connect.cassandra.key.space": "blog",

    "connect.cassandra.import.mode": "incremental",
    "connect.cassandra.kcql": "INSERT INTO test_topic SELECT event_data, event_ts FROM pack_events IGNORE event_ts PK event_ts WITHUNWRAP INCREMENTALMODE=TIMESTAMP",

     …
  }
}

There are two values for the connect.cassandra.import.mode. Those are bulk and incremental. The bulk option will query everything in the table every time that the Kafka Connect polling occurs. We will set this to incremental.

The interesting part of the configuration is the connect.cassandra.kcql property (shown above). The KCQL statement tells the connector which table in the Cassandra cluster to use, how to use the columns on the table, and where to publish the data.

The first part of the KCQL statement tells the connector the name of the Kafka topic where the data will be published. In our case that is the topic named test_topic.

INSERT INTO test_topic

The next part of the KCQL statement tells the connector how to deal with the table. The SELECT/FROM specifies the table to poll with the queries. It also specifies the columns whose values should be retrieved. The column that keeps track of the date/time must be part of the SELECT statement. However, if we don’t want that data as part of what we publish to the Kafka topic we can use the IGNORE.

SELECT event_data, event_ts FROM pack_events IGNORE event_ts

The next part of the statement, the PK, tells the connector which of the columns is used to manage the date/time. This is considered the primary key for the connector.

PK event_ts WITHUNWRAP INCREMENTALMODE="TIMESTAMP"

The INCREMENTALMODE tells the connector what the data type of the PK column is. That is going to be either TIMESTAMP or TIMEUUID.

Finally, the WITHUNWRAP option tells the connector to publish the data to the topic as a String rather than as a JSON object.

For example, if we had the following value in the event_data column:

{ "foo":"bar" }

We would want to publish this as seen above.

Leaving the WITHUNWRAP option off will result in the following value being published to the topic.

{
  "schema": {
    "type": "struct",
    "fields": [{
      "type": "string",
      "optional": true,
      "field": "event_data"
    }],
    "optional": false,
    "name": "blog.pack_events"
  },
  "payload": {
    "event_data": "{\"foo\":\"bar\"}"
  }
}

If we leave WITHUNWRAP off, when using the StringConverter (more on that later) we would get the following:

Struct:{event_data={"foo":"bar"}}

We will need to use the combination of WITHUNWRAP and the StringConverter to get the result we want.

Configuring the connector: Tuning Parameters

We’ll explore these in another post. But for now let’s start looking for data in our table with a starting date/time of today. We’ll also poll every second.

{
  "name": "packs",
  "config": {
    "tasks.max": "1",
    …
    "connect.cassandra.initial.offset": "2018–01–22 00:00:00.0000000Z",
    "connect.cassandra.import.poll.interval": 1000
  }
}

Setting up the infrastructure

We will be using the following products:

  • Apache Cassandra 3.11.1
  • Apache Kafka and Kafka Connect 1.0
  • Landoop Cassandra Source 1.0

Installing Cassandra

Installation instructions for Apache Cassandra can be found on the web ( link). Once installed and started the cluster can be verified using the following command:

nodetool -h [IP] status

this will generate a response as follows:

Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address   Load       Tokens       Owns (effective)  Host ID   Rack
UN  10.x.x.x  96.13 GiB   64           39.6%            [UUID]    r6
UN  10.x.x.x  148.98 GiB  64           33.6%            [UUID]    r5
UN  10.x.x.x  88.08 GiB   64           36.4%            [UUID]    r5
UN  10.x.x.x  97.96 GiB   64           30.4%            [UUID]    r6
UN  10.x.x.x  146.89 GiB  64           33.2%            [UUID]    r7
UN  10.x.x.x  205.24 GiB  64           36.8%            [UUID]    r7

Installing Kafka and Kafka Connect

Kafka Connect is shipped and installed as part of Apache Kafka. Instructions for these are also available on the web (link).

  1. Download the tar file (link).
  2. Install the tar file
tar -xzf kafka_2.11–1.0.0.tgz
cd kafka_2.11–1.0.0

Starting Kafka

This post will not attempt to explain the architecture behind a Kafka cluster. However, a typical installation will have several Kafka brokers and Apache Zookeeper.

To run Kafka, first start Zookeeper, then start the Kafka brokers. The commands below assume a local installation with only one node.

bin/zookeeper-server-start.sh config/zookeeper.properties

and

bin/kafka-server-start.sh config/server.properties

Once we have Kafka installed and running, we need to create four topics. One is used by our application to publish our pack JSON. The other three are required by Kafka Connect. We will continue to assume that most are running this initially on a laptop so we will set the replication factor to 1.

bin/kafka-topics.sh — create — topic test_topic -zookeeper localhost:2181 — replication-factor 1 — partitions 3

and

bin/kafka-topics.sh — create — zookeeper localhost:2181 — topic connect-configs — replication-factor 1 — partitions 1 — config cleanup.policy=compact
bin/kafka-topics.sh — create — zookeeper localhost:2181 — topic connect-offsets — replication-factor 1 — partitions 50 — config cleanup.policy=compact
bin/kafka-topics.sh — create — zookeeper localhost:2181 — topic connect-status — replication-factor 1 — partitions 10 — config cleanup.policy=compact

In order to verify that the four topics have been created, run the following command:

bin/kafka-topics.sh — list — zookeeper localhost:2181

Installing the Cassandra Source connector

Landoop offers numerous connectors for Kafka Connect. These are all available as open source. The first thing we need to do is download the Cassandra Source connector jar file (link).

  • kafka-connect-cassandra-1.0.0–1.0.0-all.tar.gz

Unzip the tar file and copy the jar file to the libs folder under the Kafka install directory.

Configuring Kafka Connect

We need to tell Kafka Connect where the Kafka cluster is. In the config folder where Kafka was installed we will find the file: connect-distributed.properties. Look for the bootstrap.servers key. Update that to point to the cluster.

bootstrap.servers=localhost:9092

Starting Kafka Connect

We can now start up our distributed Kafka Connect service. For more information on stand-alone vs distributed mode, see the documentation (

In case you are wondering , “Data Mountaineer” is a company from Netherlands that merged with Landoop (link).

Adding the Cassandra Source connector

Kafka Connect has a REST API to interact with connectors (check this out for details on the API). We need to add the Cassandra Source connector to the Kafka Connect. This is done by sending the property file (connect-cassandra-source.json) to Kafka Connect through the REST API.

curl -X POST -H "Content-Type: application/json" -d @connect-cassandra-source.json localhost:8083/connectors

Once we have successfully loaded the connector, we can check to see the installed connectors using this API:

curl localhost:8083/connectors

That should return a list of the connectors by their configured names.

["packs"]

Testing the Cassandra Source connector

In order to test everything out we will need to insert some data into our table.

INSERT INTO pack_events (event_id, event_ts, event_data)
VALUES (‘500’, ‘2018–01–22T20:28:50.869Z’, ‘{"foo":"bar"}’);

We can check what is being written to the Kafka topic by running the following command:

bin/kafka-console-consumer.sh — bootstrap-server localhost:9092 — topic test_topic

At this point, we might be surprised to see something like this:

{
  "schema":{
    "type":"string",
    "optional":false
  },
  "payload":"{\"foo\":\"bar\"}"
}

That is better than what we were getting without WITHUNWRAP but isn’t exactly what we were hoping for. To get the JSON value that was written to the table column we need to update the connect-distributed.properties file. Open this up and look for JsonConverter. Replace those lines with the following:

key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.storage.StringConverter

Restart Kafka Connect. Insert another row into the table. Now we should get what we want.

{ "foo":"bar" }

Happy coding!

Read the second part of this article here - Tuning the Kafka Connect Cassandra Source (part 2)

References


Share this article

Did you like this article?

Subscribe to get new blogs, updates & news

Follow us for news, updates and releases!
@LandoopLtd
LENSES
For Apache Kafka ®
Download Now
Share this article


2 Minute Overview


Discover awesome features


Community


Join us at Landoop Community


Resources


Repos, Docs, Trainings, Tutorials


Free Download


ALL-IN-ONE free for developers!