Commit

docs: improve documentation site (#76)
* docs: integrate docusaurus

* docs: update key features

* docs: add users and fix broken paths

* docs: add help section
ravisuhag authored Oct 1, 2021
1 parent 41dd8c5 commit f00e3c7
Showing 88 changed files with 35,124 additions and 261 deletions.
4 changes: 0 additions & 4 deletions .gitbook.yaml

This file was deleted.

32 changes: 32 additions & 0 deletions .github/workflows/docs.yml
@@ -0,0 +1,32 @@
name: Build

on:
  push:
    branches:
      - main
  workflow_dispatch:

jobs:
  documentation:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-node@v2
      - name: Installation
        uses: bahmutov/npm-install@v1
        with:
          install-command: yarn
          working-directory: docs
      - name: Build docs
        working-directory: docs
        run: yarn build
      - name: Deploy docs
        env:
          GIT_USER: ravisuhag
          GIT_PASS: ${{ secrets.DOCU_RS_TOKEN }}
          DEPLOYMENT_BRANCH: gh-pages
          CURRENT_BRANCH: main
        working-directory: docs
        run: |
          git config --global user.email "[email protected]"
          git config --global user.name "ravisuhag"
          yarn deploy
2 changes: 1 addition & 1 deletion README.md
@@ -7,7 +7,7 @@
Dagger or Data Aggregator is an easy-to-use, configuration-over-code, cloud-native framework built on top of Apache Flink for stateful processing of real-time streaming data. With Dagger, you don't need to write custom applications or manage resources to process data in real time.
Instead, you can write SQL queries to process and analyze streaming data.

<p align="center"><img src="./docs/assets/overview.svg" /></p>
<p align="center"><img src="./docs/static/img/overview.svg" /></p>

## Key Features
Discover why you should use Dagger
20 changes: 20 additions & 0 deletions docs/.gitignore
@@ -0,0 +1,20 @@
# Dependencies
/node_modules

# Production
/build

# Generated files
.docusaurus
.cache-loader

# Misc
.DS_Store
.env.local
.env.development.local
.env.test.local
.env.production.local

npm-debug.log*
yarn-debug.log*
yarn-error.log*
52 changes: 25 additions & 27 deletions docs/README.md
@@ -1,35 +1,33 @@
# Introduction
Dagger or Data Aggregator is an easy-to-use, configuration over code, cloud-native framework built on top of Apache Flink for stateful processing of real-time streaming data. With Dagger, you don't need to write custom applications or manage resources to process data in real-time.
Instead, you can write SQLs to do the processing and analysis on streaming data.
# Website

![](./assets/overview.svg)
This website is built using [Docusaurus 2](https://docusaurus.io/), a modern static website generator.

## Key Features
Discover why to use Dagger
### Installation

* **Processing:** Dagger can transform, aggregate, join and enrich Protobuf data in real-time.
* **Scale:** Dagger scales in an instant, both vertically and horizontally, for a high-performance streaming sink and zero data drops.
* **Extensibility:** Add your own sink to Dagger with a clearly defined interface, or choose from the ones already provided.
* **Pluggability:** Add custom business logic in the form of plugins (UDFs, Transformers, Preprocessors and Post Processors) independent of the core logic.
* **Metrics:** Always know what’s going on with your deployment with built-in [monitoring](docs/../reference/metrics.md) of throughput, response times, errors and more.
```
$ yarn
```

## What problems does Dagger solve?
* Map reduce -> [SQL](https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/table/sql.html)
* Enrichment -> [Post Processors](docs/../advance/post_processor.md)
* Aggregation -> [SQL](https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/table/sql.html), [UDFs](docs/../guides/use_udf.md)
* Masking -> [Hash Transformer](docs/../reference/transformers.md#HashTransformer)
* Deduplication -> [Deduplication Transformer](docs/../reference/transformers.md#DeDuplicationTransformer)
* Realtime long window processing -> [Longbow](docs/../advance/longbow.md)
### Local Development

To know more, follow the detailed [documentation](https://odpf.gitbook.io/dagger).
```
$ yarn start
```

## Usage
This command starts a local development server and opens up a browser window. Most changes are reflected live without having to restart the server.

Explore the following resources to get started with Dagger:
### Build

* [Guides](docs/../guides/overview.md) provides guidance on [creating Dagger](docs/../guides/overview.md) with different sinks.
* [Concepts](docs/../concepts/overview.md) describes all important Dagger concepts.
* [Advance](docs/../advance/overview.md) contains details regarding advanced features of Dagger.
* [Reference](docs/../reference/overview.md) contains details about configurations, metrics and other aspects of Dagger.
* [Contribute](docs/../contribute/contribution.md) contains resources for anyone who wants to contribute to Dagger.
* [Usecase](docs/../usecase/overview.md) describes example use cases which can be solved via Dagger.
```
$ yarn build
```

This command generates static content into the `build` directory, which can be served using any static content hosting service.
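
For a quick local check of the production build, you can serve the `build` directory before deploying. A minimal sketch, assuming the default Docusaurus template scripts are present in `package.json`:

```
$ yarn serve
```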

### Deployment

```
$ GIT_USER=<Your GitHub username> USE_SSH=true yarn deploy
```

If you are using GitHub pages for hosting, this command is a convenient way to build the website and push to the `gh-pages` branch.
53 changes: 0 additions & 53 deletions docs/SUMMARY.md

This file was deleted.

89 changes: 0 additions & 89 deletions docs/assets/overview.svg

This file was deleted.

3 changes: 3 additions & 0 deletions docs/babel.config.js
@@ -0,0 +1,3 @@
module.exports = {
  presets: [require.resolve('@docusaurus/core/lib/babel/preset')],
};
11 changes: 11 additions & 0 deletions docs/blog/2021-08-20-dagger-launch.md
@@ -0,0 +1,11 @@
---
slug: introducing-dagger
title: Introducing Dagger
authors:
  name: Ravi Suhag
  title: Maintainer
  url: https://github.com/ravisuhag
tags: [odpf, dagger]
---

We are live!
5 changes: 5 additions & 0 deletions docs/blog/authors.yml
@@ -0,0 +1,5 @@
ravisuhag:
  name: Ravi Suhag
  title: Maintainer
  url: https://github.com/ravisuhag
  image_url: https://github.com/ravisuhag.png
9 changes: 5 additions & 4 deletions docs/advance/DARTS.md → docs/docs/advance/DARTS.md
@@ -1,4 +1,5 @@
# Introduction
# Darts

In data streaming pipelines, the event itself does not always carry all the data you need. A common scenario is when some information exists as static reference data that you need at runtime. DARTS (Dagger Refer-Table Service) allows you to join streaming data with such a reference data store. It supports reference data stores in the form of a list or a <key, value> map. It enables the refer-table with the help of [UDFs](docs/../../guides/use_udf.md) which can be used in the SQL query. Currently, we only support GCS as a reference data source.

# Types of DARTS
@@ -13,7 +14,7 @@ This UDF can be used in cases where we want to fetch static information from a <
Let's assume we need to find out the number of bookings completed in a particular district per minute. The input schema only has information about service_area_id, not the district. The mapping of service_area_id to district is present in a static key-value map. We can use DartGet to bring this information into our query; a sketch of such a query follows the figure below.

<p align="center">
<img src="../assets/dart-get.png" width="80%"/>
<img src="/img/dart-get.png" width="80%"/>
</p>
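
A hedged sketch of how such a query might look. The reference path, the field names, and DartGet's argument order (assumed here as reference path, lookup key, refresh interval in hours) are illustrative; check the UDF contract for the actual signature.

```sql
-- Illustrative only: DartGet argument order is assumed as
-- (reference path in GCS, lookup key, refresh interval in hours).
SELECT
  DartGet('booking/district_mapping.json', service_area_id, 24) AS district,
  TUMBLE_END(rowtime, INTERVAL '1' MINUTE) AS window_end,
  COUNT(1) AS completed_bookings
FROM booking
WHERE status = 'COMPLETED'
GROUP BY
  DartGet('booking/district_mapping.json', service_area_id, 24),
  TUMBLE(rowtime, INTERVAL '1' MINUTE)
```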

Sample input schema for booking
@@ -63,7 +64,7 @@ This UDF can be used in cases where we want to verify the presence of a key in a
Let's assume we need to count the number of completed bookings per minute, excluding a few blacklisted customers. The static list of blacklisted customers is stored remotely. We can use the DartContains UDF here; a sketch follows the figure below.

<p align="center">
<img src="../assets/dart-contains.png" width="80%"/>
<img src="/img/dart-contains.png" width="80%"/>
</p>
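
A hedged sketch under the same assumptions as above; DartContains is assumed to take a reference path, the field to check, and a refresh interval in hours, and to return true when the value is present in the list.

```sql
-- Illustrative only: argument order and return type of DartContains are assumed.
SELECT
  TUMBLE_END(rowtime, INTERVAL '1' MINUTE) AS window_end,
  COUNT(1) AS completed_bookings
FROM booking
WHERE status = 'COMPLETED'
  AND NOT DartContains('booking/blacklisted_customers.json', customer_id, 24)
GROUP BY TUMBLE(rowtime, INTERVAL '1' MINUTE)
```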

Sample input schema for booking
@@ -111,7 +112,7 @@ Most of DARTS configurations are via UDF contract, for other global configs refe

### Persistence layer
We store all data references in GCS. We recommend using GCS where the data reference is not too big and updates are very few. GCS also offers a cost-effective solution.
We currently support storing data in a specific bucket, with a custom folder for every dart. The data is always kept in a file and you have to pass the relative path <custom-folder>/filename.json. There should not be many reads as every time we read the list from GCS we read the whole selected list and cache it in Dagger.
We currently support storing data in a specific bucket, with a custom folder for every dart. The data is always kept in a file and you have to pass the relative path `<custom-folder>/filename.json`. There should not be many reads as every time we read the list from GCS we read the whole selected list and cache it in Dagger.

### Caching mechanism
Dart fetches the data from GCS after a configurable refresh period, or when the data is missing from the cache or the cache is empty. After Dart fetches the data, it stores it in the application state.
4 changes: 2 additions & 2 deletions docs/advance/longbow.md → docs/docs/advance/longbow.md
@@ -1,4 +1,4 @@
# Introduction
# Longbow
This is another type of processor, also applied after SQL query processing in the Dagger workflow. This feature allows users to aggregate data over long windows in real time. For certain use cases, you need the historical data for an event in a given context, e.g. for a booking event, a risk-prevention system would be interested in the customer's pattern over the last month.
Longbow solves this problem: the entire historical context is added to the same event, which allows downstream systems to process data without any external dependency. In order to achieve this, we store the historical data in an external data source. After evaluating a lot of data sources we found [Bigtable](https://cloud.google.com/bigtable) to be a good fit, primarily because of its low scan-query latencies. This currently works only for the Kafka sink.

@@ -35,7 +35,7 @@ This component is responsible for reading the historical data from Bigtable and
In this example, let's assume we have booking events in a Kafka cluster and we want the order numbers and driver ids for each customer over the last 30 days. Here, customer_id becomes the longbow_key; a sketch of such a query follows the figure below.

<p align="center">
<img src="../assets/longbow.png" width="80%"/>
<img src="/img/longbow.png" width="80%"/>
</p>
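
A hedged sketch of a longbow query. Longbow is assumed here to be driven by column aliases in the SQL (longbow_key, longbow_data1..N, longbow_duration); the input field names are hypothetical, and the exact set of required columns is defined by the longbow configuration reference.

```sql
-- Illustrative only: column aliases drive longbow; input field names are hypothetical.
SELECT
  customer_id AS longbow_key,
  order_number AS longbow_data1,
  driver_id AS longbow_data2,
  '30d' AS longbow_duration,
  rowtime,
  event_timestamp
FROM booking
WHERE status = 'COMPLETED'
```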

Sample input schema for booking
@@ -1,4 +1,5 @@
# Introduction
# Longbow+

Longbow+ is an enhanced version of [longbow](docs/../../advance/longbow.md). It is also used for long windowed aggregations. One of the limitations of longbow was the lack of support for complex data types. Longbow+ lets you select as many complex fields as required, and it has its own DSL to query these complex fields. This currently works only for the Kafka sink.

# Components
@@ -26,7 +27,7 @@ LongbowWrite has two responsibilities, to write an incoming event to BigTable an
In this example, let's assume we have booking events in a Kafka cluster and we want the order numbers, driver ids and location (a complex field) for each customer over the last 30 days. Here, customer_id becomes the longbow_write_key.

<p align="center">
<img src="../assets/longbowplus-writer.png" width="80%"/>
<img src="/img/longbowplus-writer.png" width="80%"/>
</p>

Sample input schema for booking
@@ -122,7 +123,7 @@ It reads the output of LongbowWrite and fetches the data for a particular key an
In this example, we are consuming the output from the longbow_writer example mentioned above.

<p align="center">
<img src="../assets/longbowplus-reader.png" width="80%"/>
<img src="/img/longbowplus-reader.png" width="80%"/>
</p>

Sample output schema for longbow reader output
File renamed without changes.
@@ -1,17 +1,4 @@
# Table of Contents
* [Introduction](post_processor.md#introduction)
* [Flow of Execution](post_processor.md#flow-of-execution)
* [Types of Post Processors](post_processor.md#types-of-post-processors)
* [External Post Processor](post_processor.md#external-post-processor)
* [Elasticsearch](post_processor.md#elasticsearch)
* [HTTP](post_processor.md#http)
* [Postgres](post_processor.md#postgres)
* [GRPC](post_processor.md#grpc)
* [Internal Post Processor](post_processor.md#internal-post-processor)
* [Transformers](post_processor.md#transformers)
* [Post Processor requirements](post_processor.md#post-processor-requirements)

# Introduction
# Post Processors
Post Processors let you perform custom stream processing after the SQL processing is done. Complex transformation, enrichment and aggregation use cases are difficult to execute and maintain in SQL. Post Processors solve this problem through code and/or configuration. They can be used to enrich the stream from external sources (HTTP, ElasticSearch, PostgresDB, GRPC), enhance data points using a function or query, and transform data through user-defined code.

All the post processors mentioned in this doc can be applied sequentially, which lets you pull information from multiple external data sources and apply as many transformers as required. The output of one processor becomes the input of the next, and the final result is pushed to the configured sink.
@@ -22,21 +9,21 @@ In the flow of Post Processors, all types of processors viz; External Post Proce
* Let's assume that you want to find the cashback given for a particular order number from an external API endpoint. You can use an [HTTP external post-processor](post_processor.md#http) for this; an illustrative configuration sketch follows these examples. Here is a basic data flow diagram.

<p align="center">
<img src="../assets/external-http-post-processor.png" width="80%"/>
<img src="/img/external-http-post-processor.png" width="80%"/>
</p>

* In the above example, assume you also want to output customer_id and amount, which are fields from the input proto. An [Internal Post Processor](post_processor.md#internal-post-processor) can be used for selecting these fields from the input stream.

<p align="center">
<img src="../assets/external-internal-post-processor.png" width="80%"/>
<img src="/img/external-internal-post-processor.png" width="80%"/>
</p>

* After getting customer_id, amount and cashback amount, you may want to round off the cashback amount. For this, you can write a custom [transformer](docs/../../guides/use_transformer.md) which is a simple Java Flink Map function to calculate the round-off amount.

**Note:** All the above processors are chained sequentially on the output of the previous processor. The order of execution is determined via the order provided in JSON config.

<p align="center">
<img src="../assets/external-internal-transformer-post-processor.png" width="80%"/>
<img src="/img/external-internal-transformer-post-processor.png" width="80%"/>
</p>
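
Putting the three steps above together, a post-processor configuration might look roughly like the sketch below. This is an illustrative shape only: the endpoint, request pattern, field names and the `RoundOffTransformer` class are hypothetical, and the exact configuration keys should be checked against the post-processor reference.

```json
{
  "external_source": {
    "http": [
      {
        "endpoint": "http://cashback-service",
        "verb": "get",
        "request_pattern": "/orders/%s/cashback",
        "request_variables": "order_number",
        "output_mapping": {
          "cashback_amount": { "path": "$.data.cashback" }
        }
      }
    ]
  },
  "internal_source": [
    { "output_field": "customer_id", "value": "customer_id", "type": "sql" },
    { "output_field": "amount", "value": "amount", "type": "sql" }
  ],
  "transformers": [
    {
      "transformation_class": "com.example.RoundOffTransformer",
      "transformation_arguments": { "field": "cashback_amount" }
    }
  ]
}
```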

# Types of Post Processors
@@ -1,4 +1,4 @@
# Introduction
# Pre Processors
Pre processors let users add Flink [operators/transformations](https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/stream/operators) before the stream is passed on to the SQL query. Each stream registered on Dagger can have chained pre processors. They run and transform the data before SQL processing.
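
A minimal sketch of what a pre-processor configuration might look like. The configuration keys and the transformer class shown here are assumptions; refer to the transformer reference for the classes actually available.

```json
{
  "table_transformers": [
    {
      "table_name": "booking",
      "transformers": [
        {
          "transformation_class": "io.odpf.dagger.functions.transformers.InvalidRecordFilterTransformer"
        }
      ]
    }
  ]
}
```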

# Type of Preprocessors
@@ -8,7 +8,7 @@ Currently, there is only one type of pre-processor.
# Data flow in preprocessors

<p align="center">
<img src="../assets/pre-processor.png" />
<img src="/img/pre-processor.png" />
</p>

In the above diagram:
@@ -1,12 +1,12 @@
# Dagger Architecture
# Architecture

Dagger or Data Aggregator is a cloud native framework for processing real-time streaming data built on top of Apache Flink.

## System Design

### Components

![Dagger Architecture](../assets/architecture.png)
![Dagger Architecture](/img/architecture.png)

_**Stream**_

4 changes: 2 additions & 2 deletions docs/concepts/basics.md → docs/docs/concepts/basics.md
@@ -50,12 +50,12 @@ Dagger provides two different types of windows
- Tumbling Windows

Each element is assigned to a window of the specified size. Tumbling windows have a fixed size and do not overlap. For example, if you specify a tumbling window with a size of 5 minutes, the current window will be evaluated and a new window will be started every five minutes, as illustrated by the following figure; a Flink SQL sketch of both window types follows the figures in this section. (image credit: [Flink Operators](https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/stream/operators/windows.html))
![Tumble Window](../assets/tumble.png)
![Tumble Window](/img/tumble.png)

- Sliding Windows

Each element gets assigned to windows of fixed length. An additional window slide parameter controls how frequently a sliding window is started. Hence, sliding windows can be overlapping if the slide is smaller than the window size. In this case, elements are assigned to multiple windows. (image credit: [Flink Operators](https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/stream/operators/windows.html))
![Sliding Window](../assets/sliding.png)
![Sliding Window](/img/sliding.png)
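
In Dagger's SQL these windows map to Flink's group-window functions. A minimal sketch, assuming a `booking` stream with a `rowtime` time attribute (stream and field names are illustrative):

```sql
-- Tumbling window: bookings counted per fixed, non-overlapping 5-minute window
SELECT
  TUMBLE_END(rowtime, INTERVAL '5' MINUTE) AS window_end,
  COUNT(1) AS bookings
FROM booking
GROUP BY TUMBLE(rowtime, INTERVAL '5' MINUTE);

-- Sliding (hop) window: a 10-minute window that advances every 5 minutes
SELECT
  HOP_END(rowtime, INTERVAL '5' MINUTE, INTERVAL '10' MINUTE) AS window_end,
  COUNT(1) AS bookings
FROM booking
GROUP BY HOP(rowtime, INTERVAL '5' MINUTE, INTERVAL '10' MINUTE);
```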

### Rowtime

@@ -1,9 +1,9 @@
## Dagger Lifecycle
# Lifecycle

Architecturally, after a Dagger is created, it goes through several stages before materializing the results to an output stream.

<p align="center">
<img src="../assets/dagger-lifecycle.png" />
<img src="/img/dagger-lifecycle.png" />
</p>

- `Stage-1` : Dagger registers all defined configurations. The JobManager validates the configurations and the query, and creates a job graph.
File renamed without changes.
File renamed without changes.
File renamed without changes.
@@ -2,7 +2,7 @@

The following is a set of guidelines for contributing to Dagger. These are mostly guidelines, not hard and fast rules. Use your best judgment, and feel free to propose changes to this document via a pull request. Here are some important resources:

- [Concepts](docs/../../concepts) section explains the Dagger architecture.
- [Concepts](docs/../../concepts/overview) section explains the Dagger architecture.
- Our [roadmap](docs/../../roadmap.md) is the 10,000-foot view of where we're heading in the near future.
- Github [issues](https://github.com/odpf/dagger/issues) track the ongoing and reported issues.

File renamed without changes.