Streaming Text Files to Kafka

Ahmed Elbahtemy edited this page Apr 16, 2019 · 19 revisions

Use Case Overview

In this use case, we use Brooklin to create datastreams to publish text file content to a locally deployed instance of Apache Kafka.

Instructions

1. Set up Kafka

  1. Download a Kafka tarball and untar it. This guide uses kafka_2.12-2.2.0; adjust the commands below if your version differs.
    tar -xzf kafka_2.12-2.2.0.tgz
    cd kafka_2.12-2.2.0
  2. Start a ZooKeeper server
    bin/zookeeper-server-start.sh config/zookeeper.properties
  3. Start a Kafka server
    bin/kafka-server-start.sh config/server.properties
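  Before moving on, you can optionally confirm the broker is up by listing its topics (a quick sanity check, assuming the default listener on localhost:9092):

```shell
# From the Kafka directory: list topics on the local broker.
# A fresh install should return an empty list (or only internal topics).
bin/kafka-topics.sh --list --bootstrap-server localhost:9092
```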

2. Set up Brooklin

  1. Download the latest tarball (tgz) from Brooklin releases to a convenient location on your computer.
  2. Untar the Brooklin tarball. This guide assumes version 1.0.0; adjust the commands if your download differs.
    tar -xzf brooklin-1.0.0.tgz
    cd brooklin-1.0.0 
  3. Run Brooklin
    bin/brooklin-server-start.sh config/server.properties
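  Once the server is running, you can confirm its REST endpoint is reachable by listing all datastreams; before any have been created, the result should be empty. (This is the same REST client used in the next step, assuming the default port 32311.)

```shell
# From the Brooklin directory: list all datastreams (none exist yet).
bin/brooklin-rest-client.sh -o READALL -u http://localhost:32311/
```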

3. Create a Datastream

  1. Create a datastream to stream the contents of any file of your choice to Kafka.

    # Replace NOTICE below with a file path of your choice or leave it as 
    # is if you would like to use the NOTICE file as an example text file
    bin/brooklin-rest-client.sh -o CREATE -u http://localhost:32311/ -n first-file-datastream -s NOTICE -c file -p 1 -t kafka -m '{"owner":"test-user"}'

    Here are the options we used to create this datastream:

    -o CREATE                      The operation is datastream creation
    -u http://localhost:32311/     Datastream Management Service URI
    -n first-file-datastream       Datastream name
    -s NOTICE                      Datastream source URI (source file path in this case)
    -c file                        Connector name ("file" refers to FileConnector)
    -p 1                           Number of source partitions
    -t kafka                       Transport provider name ("kafka" refers to KafkaTransportProvider)
    -m '{"owner":"test-user"}'     Datastream metadata (specifying datastream owner is mandatory)
    
  2. Verify the datastream creation by requesting all datastream metadata from Brooklin.

    bin/brooklin-rest-client.sh -o READALL -u http://localhost:32311/
    
  3. You can also view the streaming progress by querying the diagnostics REST endpoint of the Datastream Management Service.

    curl -s "http://localhost:32311/diag?scope=file&type=connector&q=status&content=position"

4. Verify the Data Transfer to Kafka

  1. Verify a Kafka topic has been created to hold the data of your newly created datastream. The topic name will have the datastream name (i.e. first-file-datastream) as a prefix.

    cd <kafka-dir>  # Replace with Kafka directory
    bin/kafka-topics.sh --list --bootstrap-server localhost:9092
  2. Print the Kafka topic contents

    # Replace <topic-name> below with name of Kafka topic
    bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic <topic-name> --from-beginning
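  Since the topic name is prefixed with the datastream name, the two commands above can be combined into a short lookup-and-consume sketch (assuming the default broker address and the datastream name created earlier):

```shell
# Find the topic created for first-file-datastream and print its contents.
topic=$(bin/kafka-topics.sh --list --bootstrap-server localhost:9092 \
  | grep '^first-file-datastream')
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic "$topic" --from-beginning
```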

5. Create More Datastreams

Feel free to create more datastreams to publish more files to Kafka.
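For example, the CREATE command from step 3 can be reused with only the name and source path changed; the sketch below streams a hypothetical file named LICENSE from the Brooklin directory (substitute any file path you like):

```shell
# Create a second datastream; -n and -s are the only flags that change.
bin/brooklin-rest-client.sh -o CREATE -u http://localhost:32311/ \
  -n second-file-datastream -s LICENSE -c file -p 1 -t kafka \
  -m '{"owner":"test-user"}'
```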

6. Stop Brooklin and Kafka

When you are done, run the following commands to stop all running apps.

# Replace <brooklin-dir> and <kafka-dir> with Brooklin and Kafka directories, respectively
<brooklin-dir>/bin/brooklin-server-stop.sh
<kafka-dir>/bin/kafka-server-stop.sh
<kafka-dir>/bin/zookeeper-server-stop.sh