Commit

docs: improve documentation site (#76)
* docs: integrate docusaurus

* docs: update key features

* docs: add users and fix broken paths

* docs: add help section
ravisuhag authored Oct 1, 2021
1 parent 41dd8c5 commit f00e3c7
Showing 88 changed files with 35,124 additions and 261 deletions.
4 changes: 0 additions & 4 deletions .gitbook.yaml

This file was deleted.

32 changes: 32 additions & 0 deletions .github/workflows/docs.yml
@@ -0,0 +1,32 @@
name: Build

on:
  push:
    branches:
      - main
  workflow_dispatch:

jobs:
  documentation:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-node@v2
      - name: Installation
        uses: bahmutov/npm-install@v1
        with:
          install-command: yarn
          working-directory: docs
      - name: Build docs
        working-directory: docs
        run: yarn build
      - name: Deploy docs
        env:
          GIT_USER: ravisuhag
          GIT_PASS: ${{ secrets.DOCU_RS_TOKEN }}
          DEPLOYMENT_BRANCH: gh-pages
          CURRENT_BRANCH: main
        working-directory: docs
        run: |
          git config --global user.email "[email protected]"
          git config --global user.name "ravisuhag"
          yarn deploy
2 changes: 1 addition & 1 deletion README.md
@@ -7,7 +7,7 @@
Dagger or Data Aggregator is an easy-to-use, configuration-over-code, cloud-native framework built on top of Apache Flink for stateful processing of real-time streaming data. With Dagger, you don't need to write custom applications or manage resources to process data in real time.
Instead, you can write SQL queries to process and analyze streaming data.

<p align="center"><img src="./docs/assets/overview.svg" /></p>
<p align="center"><img src="./docs/static/img/overview.svg" /></p>

## Key Features
Discover why you should use Dagger
20 changes: 20 additions & 0 deletions docs/.gitignore
@@ -0,0 +1,20 @@
# Dependencies
/node_modules

# Production
/build

# Generated files
.docusaurus
.cache-loader

# Misc
.DS_Store
.env.local
.env.development.local
.env.test.local
.env.production.local

npm-debug.log*
yarn-debug.log*
yarn-error.log*
52 changes: 25 additions & 27 deletions docs/README.md
@@ -1,35 +1,33 @@
# Introduction
Dagger or Data Aggregator is an easy-to-use, configuration over code, cloud-native framework built on top of Apache Flink for stateful processing of real-time streaming data. With Dagger, you don't need to write custom applications or manage resources to process data in real-time.
Instead, you can write SQLs to do the processing and analysis on streaming data.
# Website

![](./assets/overview.svg)
This website is built using [Docusaurus 2](https://docusaurus.io/), a modern static website generator.

## Key Features
Discover why to use Dagger
### Installation

* **Processing:** Dagger can transform, aggregate, join and enrich Protobuf data in real-time.
* **Scale:** Dagger scales in an instant, both vertically and horizontally, for a high-performance streaming sink and zero data drops.
* **Extensibility:** Add your own sink to Dagger with a clearly defined interface, or choose from the ones already provided.
* **Pluggability:** Add custom business logic in the form of plugins (UDFs, Transformers, Preprocessors and Post Processors) independent of the core logic.
* **Metrics:** Always know what’s going on with your deployment with built-in [monitoring](docs/../reference/metrics.md) of throughput, response times, errors and more.
```
$ yarn
```

## What problems does Dagger solve?
* Map reduce -> [SQL](https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/table/sql.html)
* Enrichment -> [Post Processors](docs/../advance/post_processor.md)
* Aggregation -> [SQL](https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/table/sql.html), [UDFs](docs/../guides/use_udf.md)
* Masking -> [Hash Transformer](docs/../reference/transformers.md#HashTransformer)
* Deduplication -> [Deduplication Transformer](docs/../reference/transformers.md#DeDuplicationTransformer)
* Realtime long window processing -> [Longbow](docs/../advance/longbow.md)
### Local Development

To know more, follow the detailed [documentation](https://odpf.gitbook.io/dagger).
```
$ yarn start
```

## Usage
This command starts a local development server and opens up a browser window. Most changes are reflected live without having to restart the server.

Explore the following resources to get started with Dagger:
### Build

* [Guides](docs/../guides/overview.md) provides guidance on [creating Dagger](docs/../guides/overview.md) with different sinks.
* [Concepts](docs/../concepts/overview.md) describes all important Dagger concepts.
* [Advance](docs/../advance/overview.md) contains details regarding advanced features of Dagger.
* [Reference](docs/../reference/overview.md) contains details about configurations, metrics and other aspects of Dagger.
* [Contribute](docs/../contribute/contribution.md) contains resources for anyone who wants to contribute to Dagger.
* [Usecase](docs/../usecase/overview.md) describes example use cases which can be solved via Dagger.
```
$ yarn build
```

This command generates static content into the `build` directory, which can be served using any static content hosting service.
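
For a quick local check of the production build, you can serve the `build` directory before deploying. A minimal sketch, assuming the default Docusaurus template scripts are present in `package.json`:

```
$ yarn serve
```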

### Deployment

```
$ GIT_USER=<Your GitHub username> USE_SSH=true yarn deploy
```

If you are using GitHub pages for hosting, this command is a convenient way to build the website and push to the `gh-pages` branch.
53 changes: 0 additions & 53 deletions docs/SUMMARY.md

This file was deleted.

89 changes: 0 additions & 89 deletions docs/assets/overview.svg

This file was deleted.

3 changes: 3 additions & 0 deletions docs/babel.config.js
@@ -0,0 +1,3 @@
module.exports = {
  presets: [require.resolve('@docusaurus/core/lib/babel/preset')],
};
11 changes: 11 additions & 0 deletions docs/blog/2021-08-20-dagger-launch.md
@@ -0,0 +1,11 @@
---
slug: introducing-dagger
title: Introducing Dagger
authors:
  name: Ravi Suhag
  title: Maintainer
  url: https://github.com/ravisuhag
tags: [odpf, dagger]
---

We are live!
5 changes: 5 additions & 0 deletions docs/blog/authors.yml
@@ -0,0 +1,5 @@
ravisuhag:
  name: Ravi Suhag
  title: Maintainer
  url: https://github.com/ravisuhag
  image_url: https://github.com/ravisuhag.png
9 changes: 5 additions & 4 deletions docs/advance/DARTS.md → docs/docs/advance/DARTS.md
@@ -1,4 +1,5 @@
# Introduction
# Darts

In data streaming pipelines, the event itself does not always carry all the data you need. A common scenario is when some information exists as static reference data that you need at runtime. DARTS (Dagger Refer-Table Service) allows you to join streaming data with such a reference data store. It supports reference data stores in the form of a list or a <key, value> map. It enables the refer-table with the help of [UDFs](docs/../../guides/use_udf.md) which can be used in the SQL query. Currently, we only support GCS as a reference data source.

# Types of DARTS
@@ -13,7 +14,7 @@ This UDF can be used in cases where we want to fetch static information from a <
Let's assume we need to find out the number of bookings completed in a particular district per minute. The input schema only has information about service_area_id, not the district. The mapping of service_area_id to district is present in a static key-value map. We can use DartGet to bring this information into our query; a sketch of such a query follows the figure below.

<p align="center">
<img src="../assets/dart-get.png" width="80%"/>
<img src="/img/dart-get.png" width="80%"/>
</p>
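
A hedged sketch of how such a query might look. The reference path, the field names, and DartGet's argument order (assumed here as reference path, lookup key, refresh interval in hours) are illustrative; check the UDF contract for the actual signature.

```sql
-- Illustrative only: DartGet argument order is assumed as
-- (reference path in GCS, lookup key, refresh interval in hours).
SELECT
  DartGet('booking/district_mapping.json', service_area_id, 24) AS district,
  TUMBLE_END(rowtime, INTERVAL '1' MINUTE) AS window_end,
  COUNT(1) AS completed_bookings
FROM booking
WHERE status = 'COMPLETED'
GROUP BY
  DartGet('booking/district_mapping.json', service_area_id, 24),
  TUMBLE(rowtime, INTERVAL '1' MINUTE)
```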

Sample input schema for booking
@@ -63,7 +64,7 @@ This UDF can be used in cases where we want to verify the presence of a key in a
Let's assume we need to count the number of completed bookings per minute, excluding a few blacklisted customers. The static list of blacklisted customers is stored remotely. We can use the DartContains UDF here; a sketch follows the figure below.

<p align="center">
<img src="../assets/dart-contains.png" width="80%"/>
<img src="/img/dart-contains.png" width="80%"/>
</p>
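
A hedged sketch under the same assumptions as above; DartContains is assumed to take a reference path, the field to check, and a refresh interval in hours, and to return true when the value is present in the list.

```sql
-- Illustrative only: argument order and return type of DartContains are assumed.
SELECT
  TUMBLE_END(rowtime, INTERVAL '1' MINUTE) AS window_end,
  COUNT(1) AS completed_bookings
FROM booking
WHERE status = 'COMPLETED'
  AND NOT DartContains('booking/blacklisted_customers.json', customer_id, 24)
GROUP BY TUMBLE(rowtime, INTERVAL '1' MINUTE)
```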

Sample input schema for booking
@@ -111,7 +112,7 @@ Most of DARTS configurations are via UDF contract, for other global configs refe

### Persistence layer
We store all data references in GCS. We recommend using GCS where the data reference is not too big and updates are very few. GCS also offers a cost-effective solution.
We currently support storing data in a specific bucket, with a custom folder for every dart. The data is always kept in a file and you have to pass the relative path <custom-folder>/filename.json. There should not be many reads as every time we read the list from GCS we read the whole selected list and cache it in Dagger.
We currently support storing data in a specific bucket, with a custom folder for every dart. The data is always kept in a file and you have to pass the relative path `<custom-folder>/filename.json`. There should not be many reads as every time we read the list from GCS we read the whole selected list and cache it in Dagger.

### Caching mechanism
Dart fetches the data from GCS after a configurable refresh period, or when the data is missing from the cache or the cache is empty. After Dart fetches the data, it stores it in the application state.
4 changes: 2 additions & 2 deletions docs/advance/longbow.md → docs/docs/advance/longbow.md
@@ -1,4 +1,4 @@
# Introduction
# Longbow
This is another type of processor, also applied after SQL query processing in the Dagger workflow. This feature allows users to aggregate data over long windows in real time. For certain use cases, you need the historical data for an event in a given context, e.g. for a booking event, a risk-prevention system would be interested in the customer's pattern over the last month.
Longbow solves this problem: the entire historical context is added to the same event, which allows downstream systems to process data without any external dependency. In order to achieve this, we store the historical data in an external data source. After evaluating a lot of data sources we found [Bigtable](https://cloud.google.com/bigtable) to be a good fit, primarily because of its low scan-query latencies. This currently works only for the Kafka sink.

@@ -35,7 +35,7 @@ This component is responsible for reading the historical data from Bigtable and
In this example, let's assume we have booking events in a Kafka cluster and we want the order numbers and driver ids for each customer over the last 30 days. Here, customer_id becomes the longbow_key; a sketch of such a query follows the figure below.

<p align="center">
<img src="../assets/longbow.png" width="80%"/>
<img src="/img/longbow.png" width="80%"/>
</p>
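
A hedged sketch of a longbow query. Longbow is assumed here to be driven by column aliases in the SQL (longbow_key, longbow_data1..N, longbow_duration); the input field names are hypothetical, and the exact set of required columns is defined by the longbow configuration reference.

```sql
-- Illustrative only: column aliases drive longbow; input field names are hypothetical.
SELECT
  customer_id AS longbow_key,
  order_number AS longbow_data1,
  driver_id AS longbow_data2,
  '30d' AS longbow_duration,
  rowtime,
  event_timestamp
FROM booking
WHERE status = 'COMPLETED'
```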

Sample input schema for booking
@@ -1,4 +1,5 @@
# Introduction
# Longbow+

Longbow+ is an enhanced version of [longbow](docs/../../advance/longbow.md). It is also used for long windowed aggregations. One of the limitations of longbow was the lack of support for complex data types. Longbow+ lets you select as many complex fields as required, and it has its own DSL to query these complex fields. This currently works only for the Kafka sink.

# Components
@@ -26,7 +27,7 @@ LongbowWrite has two responsibilities, to write an incoming event to BigTable an
In this example, let's assume we have booking events in a Kafka cluster and we want the order numbers, driver ids and location (a complex field) for each customer over the last 30 days. Here, customer_id becomes the longbow_write_key.

<p align="center">
<img src="../assets/longbowplus-writer.png" width="80%"/>
<img src="/img/longbowplus-writer.png" width="80%"/>
</p>

Sample input schema for booking
@@ -122,7 +123,7 @@ It reads the output of LongbowWrite and fetches the data for a particular key an
In this example, we are consuming the output from the longbow_writer example mentioned above.

<p align="center">
<img src="../assets/longbowplus-reader.png" width="80%"/>
<img src="/img/longbowplus-reader.png" width="80%"/>
</p>

Sample output schema for longbow reader output
File renamed without changes.
@@ -1,17 +1,4 @@
# Table of Contents
* [Introduction](post_processor.md#introduction)
* [Flow of Execution](post_processor.md#flow-of-execution)
* [Types of Post Processors](post_processor.md#types-of-post-processors)
* [External Post Processor](post_processor.md#external-post-processor)
* [Elasticsearch](post_processor.md#elasticsearch)
* [HTTP](post_processor.md#http)
* [Postgres](post_processor.md#postgres)
* [GRPC](post_processor.md#grpc)
* [Internal Post Processor](post_processor.md#internal-post-processor)
* [Transformers](post_processor.md#transformers)
* [Post Processor requirements](post_processor.md#post-processor-requirements)

# Introduction
# Post Processors
Post Processors let you perform custom stream processing after the SQL processing is done. Complex transformation, enrichment and aggregation use cases are difficult to execute and maintain in SQL. Post Processors solve this problem through code and/or configuration. They can be used to enrich the stream from external sources (HTTP, ElasticSearch, PostgresDB, GRPC), enhance data points using a function or query, and transform data through user-defined code.

All the post processors mentioned in this doc can be applied sequentially, which lets you pull information from multiple external data sources and apply as many transformers as required. The output of one processor becomes the input of the next, and the final result is pushed to the configured sink.
@@ -22,21 +9,21 @@ In the flow of Post Processors, all types of processors viz; External Post Proce
* Let's assume that you want to find the cashback given for a particular order number from an external API endpoint. You can use an [HTTP external post-processor](post_processor.md#http) for this; an illustrative configuration sketch follows these examples. Here is a basic data flow diagram.

<p align="center">
<img src="../assets/external-http-post-processor.png" width="80%"/>
<img src="/img/external-http-post-processor.png" width="80%"/>
</p>

* In the above example, assume you also want to output customer_id and amount, which are fields from the input proto. An [Internal Post Processor](post_processor.md#internal-post-processor) can be used for selecting these fields from the input stream.

<p align="center">
<img src="../assets/external-internal-post-processor.png" width="80%"/>
<img src="/img/external-internal-post-processor.png" width="80%"/>
</p>

* After getting customer_id, amount and cashback amount, you may want to round off the cashback amount. For this, you can write a custom [transformer](docs/../../guides/use_transformer.md) which is a simple Java Flink Map function to calculate the round-off amount.

**Note:** All the above processors are chained sequentially on the output of the previous processor. The order of execution is determined via the order provided in JSON config.

<p align="center">
<img src="../assets/external-internal-transformer-post-processor.png" width="80%"/>
<img src="/img/external-internal-transformer-post-processor.png" width="80%"/>
</p>
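
Putting the three steps above together, a post-processor configuration might look roughly like the sketch below. This is an illustrative shape only: the endpoint, request pattern, field names and the `RoundOffTransformer` class are hypothetical, and the exact configuration keys should be checked against the post-processor reference.

```json
{
  "external_source": {
    "http": [
      {
        "endpoint": "http://cashback-service",
        "verb": "get",
        "request_pattern": "/orders/%s/cashback",
        "request_variables": "order_number",
        "output_mapping": {
          "cashback_amount": { "path": "$.data.cashback" }
        }
      }
    ]
  },
  "internal_source": [
    { "output_field": "customer_id", "value": "customer_id", "type": "sql" },
    { "output_field": "amount", "value": "amount", "type": "sql" }
  ],
  "transformers": [
    {
      "transformation_class": "com.example.RoundOffTransformer",
      "transformation_arguments": { "field": "cashback_amount" }
    }
  ]
}
```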

# Types of Post Processors
@@ -1,4 +1,4 @@
# Introduction
# Pre Processors
Pre processors let users add Flink [operators/transformations](https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/stream/operators) before the stream is passed on to the SQL query. Each stream registered on Dagger can have chained pre processors. They run and transform the data before SQL processing.
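
A minimal sketch of what a pre-processor configuration might look like. The configuration keys and the transformer class shown here are assumptions; refer to the transformer reference for the classes actually available.

```json
{
  "table_transformers": [
    {
      "table_name": "booking",
      "transformers": [
        {
          "transformation_class": "io.odpf.dagger.functions.transformers.InvalidRecordFilterTransformer"
        }
      ]
    }
  ]
}
```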

# Type of Preprocessors
@@ -8,7 +8,7 @@ Currently, there is only one type of pre-processor.
# Data flow in preprocessors

<p align="center">
<img src="../assets/pre-processor.png" />
<img src="/img/pre-processor.png" />
</p>

In the above diagram:
@@ -1,12 +1,12 @@
# Dagger Architecture
# Architecture

Dagger or Data Aggregator is a cloud native framework for processing real-time streaming data built on top of Apache Flink.

## System Design

### Components

![Dagger Architecture](../assets/architecture.png)
![Dagger Architecture](/img/architecture.png)

_**Stream**_

4 changes: 2 additions & 2 deletions docs/concepts/basics.md → docs/docs/concepts/basics.md
@@ -50,12 +50,12 @@ Dagger provides two different types of windows
- Tumbling Windows

Each element is assigned to a window of the specified size. Tumbling windows have a fixed size and do not overlap. For example, if you specify a tumbling window with a size of 5 minutes, the current window will be evaluated and a new window will be started every five minutes, as illustrated by the following figure; a Flink SQL sketch of both window types follows the figures in this section. (image credit: [Flink Operators](https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/stream/operators/windows.html))
![Tumble Window](../assets/tumble.png)
![Tumble Window](/img/tumble.png)

- Sliding Windows

Each element gets assigned to windows of fixed length. An additional window slide parameter controls how frequently a sliding window is started. Hence, sliding windows can be overlapping if the slide is smaller than the window size. In this case, elements are assigned to multiple windows. (image credit: [Flink Operators](https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/stream/operators/windows.html))
![Sliding Window](../assets/sliding.png)
![Sliding Window](/img/sliding.png)
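
In Dagger's SQL these windows map to Flink's group-window functions. A minimal sketch, assuming a `booking` stream with a `rowtime` time attribute (stream and field names are illustrative):

```sql
-- Tumbling window: bookings counted per fixed, non-overlapping 5-minute window
SELECT
  TUMBLE_END(rowtime, INTERVAL '5' MINUTE) AS window_end,
  COUNT(1) AS bookings
FROM booking
GROUP BY TUMBLE(rowtime, INTERVAL '5' MINUTE);

-- Sliding (hop) window: a 10-minute window that advances every 5 minutes
SELECT
  HOP_END(rowtime, INTERVAL '5' MINUTE, INTERVAL '10' MINUTE) AS window_end,
  COUNT(1) AS bookings
FROM booking
GROUP BY HOP(rowtime, INTERVAL '5' MINUTE, INTERVAL '10' MINUTE);
```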

### Rowtime

@@ -1,9 +1,9 @@
## Dagger Lifecycle
# Lifecycle

Architecturally, after a Dagger is created, it goes through several stages before materializing the results to an output stream.

<p align="center">
<img src="../assets/dagger-lifecycle.png" />
<img src="/img/dagger-lifecycle.png" />
</p>

- `Stage-1` : Dagger registers all defined configurations. The JobManager validates the configurations and the query, and creates a job graph.
File renamed without changes.
File renamed without changes.
File renamed without changes.
@@ -2,7 +2,7 @@

The following is a set of guidelines for contributing to Dagger. These are mostly guidelines, not hard and fast rules. Use your best judgment, and feel free to propose changes to this document via a pull request. Here are some important resources:

- [Concepts](docs/../../concepts) section explains the Dagger architecture.
- [Concepts](docs/../../concepts/overview) section explains the Dagger architecture.
- Our [roadmap](docs/../../roadmap.md) is the 10,000-foot view of where we're heading in the near future.
- Github [issues](https://github.com/odpf/dagger/issues) track the ongoing and reported issues.

File renamed without changes.