Streaming Text Files to Kafka

Ahmed Elbahtemy edited this page Apr 18, 2019 · 19 revisions

Use Case Overview

In this use case, we create Brooklin datastreams to publish text file contents to a locally deployed instance of Apache Kafka.

Instructions

1. Set up Kafka

  1. Download the latest Kafka tarball and untar it.
    tar -xzf kafka_2.12-2.2.0.tgz
    cd kafka_2.12-2.2.0
  2. Start a ZooKeeper server
    bin/zookeeper-server-start.sh config/zookeeper.properties
  3. Start a Kafka server
    bin/kafka-server-start.sh config/server.properties

2. Set up Brooklin

  1. Download the latest tarball (tgz) from Brooklin releases.
  2. Untar the Brooklin tarball
    tar -xzf brooklin-1.0.0.tgz
    cd brooklin-1.0.0 
  3. Run Brooklin
    bin/brooklin-server-start.sh config/server.properties
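
If you would rather not stream Brooklin's NOTICE file in the next step, you can generate a small text file of your own first (`sample.txt` is just an assumed name; any text file works):

```shell
# Create a small sample text file to stream to Kafka.
# sample.txt is an arbitrary name chosen for this example.
printf 'hello brooklin\nhello kafka\nhello zookeeper\n' > sample.txt

# Sanity-check the file contents.
cat sample.txt
```

You would then pass `sample.txt` to the `-s` option in the next step instead of `NOTICE`.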

3. Create a Datastream

  1. Create a datastream to stream the contents of any file of your choice to Kafka.

    # Replace NOTICE below with a file path of your choice or leave it as 
    # is if you would like to use the NOTICE file as an example text file
    bin/brooklin-rest-client.sh -o CREATE -u http://localhost:32311/ -n first-file-datastream -s NOTICE -c file -p 1 -t kafka -m '{"owner":"test-user"}'

    Here are the options we used to create this datastream:

    -o CREATE                      The operation is datastream creation
-u http://localhost:32311/     Datastream Management Service URI
    -n first-file-datastream       Datastream name
    -s NOTICE                      Datastream source URI (source file path in this case)
    -c file                        Connector name ("file" refers to FileConnector)
    -p 1                           Number of source partitions
    -t kafka                       Transport provider name ("kafka" refers to KafkaTransportProvider)
    -m '{"owner":"test-user"}'     Datastream metadata (specifying datastream owner is mandatory)
    
  2. Verify the datastream creation by requesting all datastream metadata from Brooklin using the command line REST client.

    bin/brooklin-rest-client.sh -o READALL -u http://localhost:32311/
  3. You can also view the streaming progress by querying the diagnostics REST endpoint of the Datastream Management Service.

    curl -s "http://localhost:32311/diag?scope=file&type=connector&q=status&content=position"
  4. Additionally, you can view more information about the datastreams and their DatastreamTasks by querying the health monitoring REST endpoint of the Datastream Management Service.

    curl -s "http://localhost:32311/health"
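
Both endpoints return JSON, which is easier to read when piped through Python's standard-library `json.tool` pretty-printer. Since the real responses require a running Brooklin server, the example below uses a stand-in payload (the field names are illustrative, not Brooklin's actual response schema):

```shell
# Pretty-print JSON with Python's stdlib json.tool. Once Brooklin is
# running, pipe the real response through it the same way, e.g.:
#   curl -s "http://localhost:32311/health" | python3 -m json.tool
# The echoed payload below is a stand-in for illustration only.
echo '{"datastream": "first-file-datastream", "status": "READY"}' | python3 -m json.tool
```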

4. Verify the Data Transfer to Kafka

  1. Verify that a Kafka topic has been created to hold the data of your newly created datastream. The topic name is prefixed with the datastream name (first-file-datastream in this case).

    cd <kafka-dir>  # Replace with Kafka directory
    bin/kafka-topics.sh --list --bootstrap-server localhost:9092
  2. Print the Kafka topic contents

    # Replace <topic-name> below with the name of the Kafka topic
    bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic <topic-name> --from-beginning

5. Create More Datastreams

Feel free to create more datastreams to publish more files to Kafka.
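
For example, assuming Brooklin is still running and you have another text file to stream (sample.txt is an assumed name here, as is the datastream name), only the `-n` and `-s` options need to change:

```shell
# Hypothetical second datastream; the datastream name and source file
# are placeholders chosen for this example.
# Requires Brooklin, Kafka, and ZooKeeper to be running.
bin/brooklin-rest-client.sh -o CREATE -u http://localhost:32311/ \
  -n second-file-datastream -s sample.txt -c file -p 1 -t kafka \
  -m '{"owner":"test-user"}'
```

Each new datastream gets its own Kafka topic, again prefixed with the datastream name.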

6. Stop Brooklin, Kafka, and ZooKeeper

When you are done, run the following commands to stop all running apps.

# Replace <brooklin-dir> and <kafka-dir> with Brooklin and Kafka directories, respectively
<brooklin-dir>/bin/brooklin-server-stop.sh
<kafka-dir>/bin/kafka-server-stop.sh
<kafka-dir>/bin/zookeeper-server-stop.sh