docs: upgrade theme and restructure configurations (#178)
* docs: cleanup configurations

* docs: update advance configurations

* docs: fix broken links
ravisuhag authored Jun 30, 2022
1 parent 9a4b24c commit ab23d10
Showing 33 changed files with 666 additions and 1,953 deletions.
File renamed without changes.
File renamed without changes.
File renamed without changes.
Binary file not shown.
File renamed without changes.
File renamed without changes.
4 changes: 2 additions & 2 deletions docs/docs/concepts/architecture.md
@@ -87,7 +87,7 @@ The section details all integrating systems for Firehose deployment. These are e

### Kafka

- The Kafka topic\(s\) where Firehose reads from. The [`SOURCE_KAFKA_TOPIC`](../reference/configuration/#source_kafka_topic) config can be set in Firehose.
- The Kafka topic\(s\) where Firehose reads from. The [`SOURCE_KAFKA_TOPIC`](../advance/generic#source_kafka_topic) config can be set in Firehose.

### ProtoDescriptors

@@ -105,4 +105,4 @@ The section details all integrating systems for Firehose deployment. These are e

- InfluxDB - time-series database where all Firehose metrics are stored. Integration through the Telegraf component.

For a complete set of configurations please refer to the sink-specific [configuration](../reference/configuration/).
For a complete set of configurations please refer to the sink-specific [configuration](../advance/generic/).
37 changes: 18 additions & 19 deletions docs/docs/contribute/development.md
@@ -2,9 +2,9 @@

The following guide will help you quickly run Firehose on your local machine. The main components of Firehose are:

* Consumer: Handles data consumption from Kafka.
* Sink: Package which handles sinking data.
* Metrics: Handles the metrics via StatsD client
- Consumer: Handles data consumption from Kafka.
- Sink: Package which handles sinking data.
- Metrics: Handles the metrics via the StatsD client.

## Requirements

@@ -20,20 +20,20 @@ export PATH=~/Downloads/jdk1.8.0_291/bin:$PATH

Firehose environment variables can be configured in either of the following ways -

* append a new line at the end of `env/local.properties` file. Variables declared in `local.properties` file are automatically added to the environment during runtime.
* run `export SAMPLE_VARIABLE=287` on a UNIX shell, to directly assign the required environment variable.
- append a new line at the end of `env/local.properties` file. Variables declared in `local.properties` file are automatically added to the environment during runtime.
- run `export SAMPLE_VARIABLE=287` on a UNIX shell, to directly assign the required environment variable.
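
A minimal sketch of both approaches, reusing the `SAMPLE_VARIABLE` placeholder from above:

```bash
# Option 1: append the variable to the local properties file
echo "SAMPLE_VARIABLE=287" >> env/local.properties

# Option 2: export it directly in the current shell session
export SAMPLE_VARIABLE=287
```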

### Kafka Server

Apache Kafka server service must be set up, from which Firehose's Kafka consumer will pull messages. Kafka Server version greater than 2.4 is currently supported by Firehose. Kafka Server URL and port address, as well as other Kafka-specific parameters must be configured in the corresponding environment variables as defined in the [Generic configuration](../reference/configuration/#generic) section.
An Apache Kafka server must be set up, from which Firehose's Kafka consumer will pull messages. Kafka server versions greater than 2.4 are currently supported by Firehose. The Kafka server URL and port address, as well as other Kafka-specific parameters, must be configured in the corresponding environment variables as defined in the [Generic configuration](../advance/generic) section.

Read the [official guide](https://kafka.apache.org/quickstart) on how to install and configure Apache Kafka Server.
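
As a rough sketch (paths, the topic name, and the broker address are illustrative), a single-node setup using the scripts shipped with the Kafka distribution might look like this; the `SOURCE_KAFKA_*` variables are the ones referenced in the Generic configuration section:

```bash
# From the Kafka distribution directory: start ZooKeeper and a single broker
bin/zookeeper-server-start.sh config/zookeeper.properties &
bin/kafka-server-start.sh config/server.properties &

# Create a topic for Firehose to consume from (topic name is illustrative)
bin/kafka-topics.sh --create --topic sample-topic --bootstrap-server localhost:9092

# Point Firehose at the broker and topic
echo "SOURCE_KAFKA_BROKERS=localhost:9092" >> env/local.properties
echo "SOURCE_KAFKA_TOPIC=sample-topic" >> env/local.properties
```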

### Destination Sink Server

The sink to which Firehose will stream Kafka's data must have its corresponding server set up and configured. The URL and port of the database server or HTTP/GRPC endpoint, along with other sink-specific parameters, must be configured in the environment variables corresponding to that particular sink.

Configuration parameter variables of each sink can be found in the [Configurations](../reference/configuration/) section.
Configuration parameter variables of each sink can be found in the [Configurations](../advance/generic/) section.

### Schema Registry

@@ -45,18 +45,18 @@ Refer [this guide](https://github.com/odpf/stencil/tree/master/server#readme) on

Firehose sends critical metrics via the StatsD client. Refer to the [Monitoring](../concepts/monitoring.md#setting-up-grafana-with-firehose) section for details on how to set up Firehose with Grafana. Alternatively, you can set up any other visualization platform for monitoring Firehose. Following are the typical requirements -

* StatsD host \(e.g. Telegraf\) for aggregation of metrics from Firehose StatsD client
* A time-series database \(e.g. InfluxDB\) to store the metrics
* GUI visualization dashboard \(e.g. Grafana\) for detailed visualisation of metrics
- StatsD host \(e.g. Telegraf\) for aggregation of metrics from Firehose StatsD client
- A time-series database \(e.g. InfluxDB\) to store the metrics
- GUI visualization dashboard \(e.g. Grafana\) for detailed visualisation of metrics
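
One possible way to stand this stack up locally is with Docker; the images below are the official ones, but the ports, the Telegraf config file, and the wiring between the components are illustrative and depend on your setup:

```bash
# Time-series database for storing metrics
docker run -d --name influxdb -p 8086:8086 influxdb:1.8

# StatsD host that aggregates metrics from the Firehose StatsD client
# (assumes a telegraf.conf with a statsd input and an influxdb output)
docker run -d --name telegraf -p 8125:8125/udp \
  -v "$(pwd)/telegraf.conf:/etc/telegraf/telegraf.conf:ro" telegraf

# Dashboard for visualisation of the metrics
docker run -d --name grafana -p 3000:3000 grafana/grafana
```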

## Running locally

```bash
# Clone the repo
$ git clone https://github.com/odpf/firehose.git
$ git clone https://github.com/odpf/firehose.git

# Build the jar
$ ./gradlew clean build
$ ./gradlew clean build

# Configure env variables
$ cat env/local.properties
@@ -65,7 +65,7 @@
$ ./gradlew runConsumer
```

**Note:** Sample configuration for other sinks along with some advanced configurations can be found [here](../reference/configuration/)
**Note:** Sample configuration for other sinks along with some advanced configurations can be found [here](../advance/generic/)

### Running tests

@@ -99,9 +99,8 @@ $ git commit -s -m "feat: my first commit"

#### Good practices to keep in mind

* Follow the [conventional commit](https://www.conventionalcommits.org/en/v1.0.0/) format for all commit messages.
* Fill in the description based on the default template configured when you first open the PR
* Include kind label when opening the PR
* Add WIP: to PR name if more work needs to be done prior to review
* Avoid force-pushing as it makes reviewing difficult

- Follow the [conventional commit](https://www.conventionalcommits.org/en/v1.0.0/) format for all commit messages.
- Fill in the description based on the default template configured when you first open the PR
- Include kind label when opening the PR
- Add WIP: to PR name if more work needs to be done prior to review
- Avoid force-pushing as it makes reviewing difficult
36 changes: 17 additions & 19 deletions docs/docs/guides/create_firehose.md
@@ -2,13 +2,9 @@

This page contains how-to guides for creating a Firehose with different sinks and describes their features.

{% hint style="info" %}
If you'd like to connect to a sink which is not yet supported, you can create a new sink by following the [contribution guidelines](../contribute/contribution.md)
{% endhint %}

## Create a Log Sink

Firehose provides a log sink to make it easy to consume messages in [standard output](https://en.wikipedia.org/wiki/Standard_streams#Standard_output_%28stdout%29). A log sink firehose requires the following [variables](../reference/configurations) to be set. Firehose log sink can work in key as well as message parsing mode configured through [`KAFKA_RECORD_PARSER_MODE`](../reference/configurations#kafka_record_parser_mode)
Firehose provides a log sink to make it easy to consume messages in [standard output](https://en.wikipedia.org/wiki/Standard_streams#Standard_output_%28stdout%29). A log sink Firehose requires the following [variables](../advance/generic.md) to be set. The Firehose log sink can work in key as well as message parsing mode, configured through [`KAFKA_RECORD_PARSER_MODE`](../advance/generic.md#kafka_record_parser_mode).

An example log sink configuration:

@@ -36,11 +32,16 @@ event_timestamp {
}
```

## Define generic configurations

- These are the configurations that remain common across all the sink types.
- You don’t necessarily need to modify them; it is recommended to use them with the default values. More details [here](../advance/generic#standard).
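
As a hedged sketch, a local setup typically only needs a handful of these generic settings; the values below are placeholders, and the exact variable list should be checked against the generic configuration reference:

```bash
# Append the generic settings to env/local.properties (values are placeholders)
cat >> env/local.properties <<'EOF'
SOURCE_KAFKA_BROKERS=localhost:9092
SOURCE_KAFKA_TOPIC=sample-topic
SOURCE_KAFKA_CONSUMER_GROUP_ID=sample-firehose-group
INPUT_SCHEMA_PROTO_CLASS=com.example.SampleLogMessage
SINK_TYPE=log
EOF
```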

## Create an HTTP Sink

Firehose [HTTP](https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol) sink allows users to read data from Kafka and write to an HTTP endpoint. It requires the following [variables](../sinks/http-sink.md#http-sink) to be set. You need to create your own HTTP endpoint so that Firehose can send data to it.

### Supported Methods
### Supported methods

Firehose supports `PUT` and `POST` verbs in its HTTP sink. The method can be configured using [`SINK_HTTP_REQUEST_METHOD`](../sinks/http-sink.md#sink_http_request_method).
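
A minimal sketch of switching the verb; the endpoint URL is a placeholder, and `SINK_HTTP_SERVICE_URL` should be checked against the HTTP sink configuration reference:

```bash
# Select the HTTP sink and the verb it uses (endpoint URL is a placeholder)
cat >> env/local.properties <<'EOF'
SINK_TYPE=http
SINK_HTTP_REQUEST_METHOD=put
SINK_HTTP_SERVICE_URL=http://localhost:8080/api/v1/events
EOF
```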

@@ -76,20 +77,20 @@ Limitations:
- Validation is done only at the level of the JSON template being valid JSON. After data has been substituted, the resulting string may or may not be valid JSON, so users must do proper testing/validation on the service side.
- If selecting fields from complex data types like repeated/messages/map of proto, the user must first filter accordingly, as selecting a field that does not exist would fail.

## Create a JDBC SINK
## Create a JDBC sink

- Supports only PostgreSQL as of now.
- Data read from Kafka is written to the PostgreSQL database, and it requires the following [variables](../sinks/jdbc-sink.md#jdbc-sink) to be set.

_**Note: The schema \(table, columns, and any constraints\) used in the Firehose configuration must already exist in the database.**_
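
For example, the table could be created up front with `psql`; the database, table, and column names below are hypothetical and must match whatever your Firehose configuration maps columns to:

```bash
# Pre-create the target table; Firehose will not create it for you
# (database, table, and column names are hypothetical)
psql -h localhost -U firehose -d sample_db <<'SQL'
CREATE TABLE IF NOT EXISTS sample_events (
    id         TEXT PRIMARY KEY,
    payload    JSONB,
    created_at TIMESTAMP
);
SQL
```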

## Create an InfluxDB Sink
## Create an InfluxDB sink

- Data read from Kafka is written to the InfluxDB time-series database and it requires the following [variables](../sinks/influxdb-sink.md#influx-sink) to be set.

_**Note:**_ [_**DATABASE**_](../sinks/influxdb-sink.md#sink_influx_db_name) _**and**_ [_**RETENTION POLICY**_](../sinks/influxdb-sink.md#sink_influx_retention_policy) _**used in the Firehose configuration must already exist in InfluxDB; creating them is outside the scope of Firehose and won’t happen automatically.**_
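
For example, both can be created up front with the InfluxDB 1.x CLI; the database name, policy name, and duration below are illustrative:

```bash
# Pre-create the database and retention policy; Firehose will not create them
influx -execute 'CREATE DATABASE sample_metrics'
influx -execute 'CREATE RETENTION POLICY "rp_30d" ON "sample_metrics" DURATION 30d REPLICATION 1 DEFAULT'
```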

## Create a Redis Sink
## Create a Redis sink

- It requires the following [variables](../sinks/redis-sink.md) to be set.
- Redis sink can be created in 2 different modes based on the value of [`SINK_REDIS_DATA_TYPE`](../sinks/redis-sink.md#sink_redis_data_type): HashSet or List.
@@ -99,40 +100,37 @@ _**Note:**_ [_**DATABASE**_](../sinks/influxdb-sink.md#sink_influx_db_name) _**a
- Redis sink also supports different [Deployment Types](../sinks/redis-sink.md#sink_redis_deployment_type): `Standalone` and `Cluster`.
- Limitation: Firehose Redis sink only supports HashSet and List entries as of now.

## Create an Elasticsearch Sink
## Create an Elasticsearch sink

- It requires the following [variables](../sinks/elasticsearch-sink.md) to be set.
- In the Elasticsearch sink, each message is converted into a document in the specified index with the Document type and ID as specified by the user.
- Elasticsearch sink supports reading messages in both JSON and Protobuf formats.
- Using [Routing Key](../sinks/elasticsearch-sink.md#sink_es_routing_key_name) one can route documents to a particular shard in Elasticsearch.

## Create a GRPC Sink
## Create a GRPC sink

- Data read from Kafka is written to a GRPC endpoint, and it requires the following [variables](../sinks/grpc-sink.md) to be set.
- You need to create your own GRPC endpoint so that Firehose can send data to it. The response proto should have a field “success” with a value of true or false.
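
A hypothetical shape for such a response message (the file, package, and message names are made up; only the boolean `success` field is what the sink expects):

```bash
# Write out a hypothetical response proto for the GRPC endpoint
cat > sample_grpc_response.proto <<'EOF'
syntax = "proto3";

package sample;

message SampleGrpcResponse {
  // Firehose checks this field to decide whether the push succeeded
  bool success = 1;
}
EOF
```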

## Create an MongoDB Sink
## Create a MongoDB sink

- It requires the following [variables](../sinks/mongo-sink.md) to be set.
- In the MongoDB sink, each message is converted into a BSON document and then inserted/updated/upserted into the specified Mongo collection.
- MongoDB sink supports reading messages in both JSON and Protobuf formats.

## Define Standard Configurations

- These are the configurations that remain common across all the Sink Types.
- You don’t need to modify them necessarily, It is recommended to use them with the default values. More details [here](../reference/configuration#standard).

## Create a Blob Sink
## Create a Blob sink

- It requires the following [variables](../sinks/blob-sink.md) to be set.
- Only supports Google Cloud Storage for now.
- Only supports writing protobuf messages to the Apache Parquet file format for now.
- The protobuf message needs to have a `google.protobuf.Timestamp` field as the partitioning timestamp; the `event_timestamp` field is usually used.
- A Google Cloud credential with the necessary Google Cloud Storage permissions is required to run this sink.

## Create a Bigquery Sink
## Create a Bigquery sink

- It requires the following [variables](../sinks/bigquery-sink.md) to be set.
- This sink generates the BigQuery schema from the protobuf message schema and updates the BigQuery table with the latest generated schema.
- A `google.protobuf.Timestamp` field in the protobuf message might be needed when table partitioning is enabled.
- A Google Cloud credential with the necessary BigQuery permissions is required to run this sink.

If you'd like to connect to a sink which is not yet supported, you can create a new sink by following the [contribution guidelines](../contribute/contribution.md).
3 changes: 1 addition & 2 deletions docs/docs/guides/manage.md
@@ -2,7 +2,7 @@

## Consumer Lag

When it comes to decreasing the topic lag, it often helps to have the environment variable - [`SOURCE_KAFKA_CONSUMER_CONFIG_MAX_POLL_RECORDS`](../reference/configuration/#source_kafka_consumer_config_max_poll_records) config to be increased from the default of 500 to something higher.
When it comes to decreasing the topic lag, it often helps to increase the [`SOURCE_KAFKA_CONSUMER_CONFIG_MAX_POLL_RECORDS`](../advance/generic/#source_kafka_consumer_config_max_poll_records) config from the default of 500 to something higher.
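
For example (the target value is illustrative and should be tuned against your sink's throughput):

```bash
# Raise the per-poll batch size from the default of 500
echo "SOURCE_KAFKA_CONSUMER_CONFIG_MAX_POLL_RECORDS=5000" >> env/local.properties
```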

Additionally, you can increase the number of workers in Firehose, which will effectively multiply the number of records being processed by Firehose. However, please be mindful of the caveat mentioned below.

@@ -11,4 +11,3 @@ Additionally, you can increase the workers in the Firehose which will effectivel
Be mindful of the fact that your sink also needs to be able to process this higher volume of data being pushed to it; if it is not, this will only compound the problem of increasing lag.

Alternatively, if your underlying sink is not able to handle the increased \(or default\) volume of data being pushed to it, adding a filter condition in Firehose to ignore unnecessary messages in the topic will help bring down the volume of data being processed by the sink.
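
As a sketch, assuming the JEXL-based filter variables from Firehose's filter configuration (the exact variable names can vary between Firehose versions, and the proto class and expression below are purely illustrative):

```bash
# Drop messages the sink does not need before they ever reach it
# (variable names assume the JEXL filter configuration; values are illustrative)
cat >> env/local.properties <<'EOF'
FILTER_JEXL_DATA_SOURCE=message
FILTER_JEXL_SCHEMA_PROTO_CLASS=com.example.SampleLogMessage
FILTER_JEXL_EXPRESSION=sampleLogMessage.getEventType() == 'CLICK'
EOF
```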

11 changes: 1 addition & 10 deletions docs/docs/introduction.md
@@ -35,21 +35,12 @@ Following sinks are supported in the Firehose
- [Bigquery](https://cloud.google.com/bigquery) - A data warehouse provided by Google Cloud
- [Blob Storage](https://gocloud.dev/howto/blob/) - A data storage architecture for large stores of unstructured data like google cloud storage, amazon s3, apache hadoop distributed filesystem

## How is Firehose different from Kafka-Connect?

- **Ease of use:** Firehose is easier to install, and using different sinks only requires changing a few configurations. When used in distributed mode across multiple nodes, it requires connectors to be installed across all the workers within your Kafka-Connect cluster.
- **Filtering:** Value-based filtering is much easier to implement as compared to Kafka-Connect. Requires no additional plugins/schema-registry to be installed.
- **Extensible:** Provides a comprehensible abstract sink contract making it easier to add a new sink in Firehose. Firehose also comes with an inbuilt serialization/deserialization and doesn't require any converters and serializers when implementing a new sink.
- **Easy monitoring:** Firehose provides a detailed health dashboard \(Grafana\) for effortless monitoring.
- **Connectors:** Some of the Kafka connect available connectors usually have limitations. Its usually rare to find all the required features in a single connector and so is to find documentation for the same
- **Fully open-source:** Firehose is completely open-source while separation of commercial and open-source features is not very structured in Kafka Connect and for monitoring and advanced features, confluent control center requires an enterprise subscription

## How can I get started?

Explore the following resources to get started with Firehose:

- [Guides](./guides/create_firehose.md) provide guidance on creating Firehose with different sinks.
- [Concepts](./concepts/overview.md) describe all important Firehose concepts.
- [FAQs](./reference/faq.md) lists some frequently asked questions about Firehose and related components.
- [Reference](./reference/configuration/) contains details about configurations, metrics, FAQs, and other aspects of Firehose.
- [Reference](./advance/generic/) contains details about configurations, metrics, FAQs, and other aspects of Firehose.
- [Contributing](./contribute/contribution.md) contains resources for anyone who wants to contribute to Firehose.
