docs: added en version of usecase_byzer (4paradigm#2487)

Matagits · Sep 28, 2022 · 89b3592
1 parent a07ce85
commit 89b3592
Showing 4 changed files with 288 additions and 10 deletions.
diff --git a/docs/en/use_case/OpenMLDB_Byzer_taxi.md b/docs/en/use_case/OpenMLDB_Byzer_taxi.md
@@ -0,0 +1,276 @@
+# Build End-to-end Machine Learning Applications Based on SQL (OpenMLDB + Byzer)
+
+This tutorial will show you how to complete a machine learning workflow with the help of [OpenMLDB](https://github.com/4paradigm/OpenMLDB) and [Byzer](https://www.byzer.org/home).
+OpenMLDB will compute real-time features based on the data and queries from Byzer, and then return results to Byzer for subsequent model training and inference.
+
+## 1. Preparations
+
+### 1.1 Install OpenMLDB
+
+1. The demo will use the OpenMLDB cluster version running in Docker. See [OpenMLDB Quickstart](../quickstart/openmldb_quickstart.md) for detailed installation procedures.
+2. Please modify the OpenMLDB IP configuration so that the Byzer engine can access the OpenMLDB service from outside the container. See [IP Configuration](../reference/ip_tips.md) for detailed guidance.
+
+### 1.2 Install the Byzer Engine and the Byzer Notebook 
+
+1. For detailed installation procedures of the Byzer engine, see [Byzer Language Doc](https://docs.byzer.org/#/byzer-lang/en-us/).
+
+2. We have to use the [OpenMLDB plugin](https://github.com/byzer-org/byzer-extension/tree/master/byzer-openmldb) developed by Byzer to transmit messages between the two platforms. To use a plugin in Byzer, please configure `streaming.datalake.path`; see [the manual of Byzer Configuration](https://docs.byzer.org/#/byzer-lang/zh-cn/installation/configuration/byzer-lang-configuration) for details.
+
+3. Byzer Notebook is used in this demo. Please install it after installing the Byzer engine. You can also use the [VSCode Byzer plugin](https://docs.byzer.org/#/byzer-lang/zh-cn/installation/vscode/byzer-vscode-extension-installation) to connect to your Byzer engine. The interface of Byzer Notebook is shown below; see [Byzer Notebook Doc](https://docs.byzer.org/#/byzer-notebook/zh-cn/) for more information.
+
+![Byzer_Notebook](images/Byzer_Notebook.jpg)
+
+
+### 1.3 Dataset Preparation
+In this case, the dataset comes from the Kaggle taxi trip duration prediction problem. If it is not in your Byzer `Deltalake`, [download](https://www.kaggle.com/c/nyc-taxi-trip-duration/overview) it first. Please remember to import it into Byzer Notebook after downloading it.
+
+
+## 2. The Workflow of Machine Learning
+
+### 2.1 Load the Dataset
+
+Please import the original dataset into the `File System` of Byzer Notebook; it will automatically be stored under the path `tmp/upload`.
+Use the Byzer Lang `load` command as below to load this dataset.
+```sql
+load csv.`tmp/upload/train.csv` where delimiter=","
+and header = "true"
+as taxi_tour_table_train_simple;
+```
+
+### 2.2 Import the Dataset into OpenMLDB
+
+Install the OpenMLDB plugin in Byzer.
+
+```sql
+!plugin app add - "byzer-openmldb-3.0";
+```
+
+Now you can use this plugin to connect to OpenMLDB. **Please make sure the OpenMLDB engine has started and there is a database named `db1` before you run the following code block in Byzer Notebook.**
+
+```sql
+run command as FeatureStoreExt.`` where
+zkAddress="172.17.0.2:7527"
+and `sql-0`='''
+SET @@execute_mode='offline';
+'''
+and `sql-1`='''
+SET @@job_timeout=20000000;
+'''
+and `sql-2`='''
+CREATE TABLE t1(id string, vendor_id int, pickup_datetime timestamp, dropoff_datetime timestamp, passenger_count int, pickup_longitude double, pickup_latitude double, dropoff_longitude double, dropoff_latitude double, store_and_fwd_flag string, trip_duration int);
+'''
+and `sql-3`='''
+LOAD DATA INFILE 'tmp/upload/train.csv'
+INTO TABLE t1 options(format='csv',header=true,mode='append');
+'''
+and db="db1"
+and action="ddl";
+```
+
+```{note}
+1. The port number in `zkAddress` should match the IP configuration in the files under the OpenMLDB `conf/` path.
+2. To check whether the `byzer-openmldb-3.0` plugin has been installed successfully, look at the `streaming.plugin.clzznames` property in the `\byzer.properties.override` file under the `$BYZER_HOME\conf` path of Byzer. After a successful installation, it contains the main class name `tech.mlsql.plugins.openmldb.ByzerApp`.
+3. If the plugin installation fails, download the `.jar` file and [install it offline](https://docs.byzer.org/#/byzer-lang/zh-cn/extension/installation/offline_install).
+```
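+
+The command above expects the database `db1` to exist. If it does not, it can be created beforehand. The statements below are a minimal sketch, assuming they are run in the OpenMLDB CLI (or another OpenMLDB SQL client) rather than through the Byzer plugin:
+
+```sql
+-- Create the database that the Byzer commands in this tutorial connect to.
+CREATE DATABASE db1;
+-- List the databases to confirm that db1 is present.
+SHOW DATABASES;
+```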
+
+### 2.3 Real-time Feature Extractions
+
+The features developed in Section 2.3 of [OpenMLDB + LightGBM: Taxi Trip Duration Prediction](./lightgbm_demo.md) will be used in this demo.
+The processed data will be exported to a local `csv` file.
+
+```sql
+run command as FeatureStoreExt.`` where
+zkAddress="172.17.0.2:7527"
+and `sql-0`='''
+SET @@execute_mode='offline';
+'''
+and `sql-1`='''
+SET @@job_timeout=20000000;
+'''
+and `sql-2`='''
+SELECT trip_duration, passenger_count,
+sum(pickup_latitude) OVER w AS vendor_sum_pl,
+max(pickup_latitude) OVER w AS vendor_max_pl,
+min(pickup_latitude) OVER w AS vendor_min_pl,
+avg(pickup_latitude) OVER w AS vendor_avg_pl,
+sum(pickup_latitude) OVER w2 AS pc_sum_pl,
+max(pickup_latitude) OVER w2 AS pc_max_pl,
+min(pickup_latitude) OVER w2 AS pc_min_pl,
+avg(pickup_latitude) OVER w2 AS pc_avg_pl,
+count(vendor_id) OVER w2 AS pc_cnt,
+count(vendor_id) OVER w AS vendor_cnt
+FROM t1
+WINDOW w AS(PARTITION BY vendor_id ORDER BY pickup_datetime ROWS_RANGE BETWEEN 1d PRECEDING AND CURRENT ROW),
+w2 AS(PARTITION BY passenger_count ORDER BY pickup_datetime ROWS_RANGE BETWEEN 1d PRECEDING AND CURRENT ROW) INTO OUTFILE '/tmp/feature_data';
+'''
+and db="db1"
+and action="ddl";
+```
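+
+Section 2.4 reads from a Byzer table named `feature_data`. The following is a minimal sketch of registering the exported file under that name. It reuses the `load` syntax from Section 2.1 and assumes that `/tmp/feature_data` is reachable from the Byzer engine and was written as CSV with a header row:
+
+```sql
+-- Assumption: the OUTFILE path is visible to Byzer and holds CSV with a header row.
+load csv.`/tmp/feature_data` where delimiter=","
+and header = "true"
+as feature_data;
+```
+
+The same pattern would apply to `/tmp/feature_data_test` and the `feature_data_test` table used in Section 2.8.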
+
+
+
+### 2.4 Data Vectorization 
+Convert all `int` type fields to `double` in Byzer Notebook.
+
+```sql
+select *, 
+cast(passenger_count as double) as passenger_count_d,
+cast(pc_cnt as double) as pc_cnt_d,
+cast(vendor_cnt as double) as vendor_cnt_d
+from feature_data
+as new_feature_data;
+```
+
+Then merge all the fields into a vector.
+
+```sql
+select vec_dense(array(
+passenger_count_d,
+vendor_sum_pl,
+vendor_max_pl,
+vendor_min_pl,
+vendor_avg_pl,
+pc_sum_pl,
+pc_max_pl,
+pc_min_pl,
+pc_avg_pl,
+pc_cnt_d,
+vendor_cnt_d
+)) as features,cast(trip_duration as double) as label
+from new_feature_data
+as trainning_table;
+
+```
+
+
+
+### 2.5 Training
+
+Use the `train` Byzer Lang command and its [built-in Linear Regression Algorithm](https://docs.byzer.org/#/byzer-lang/zh-cn/ml/algs/linear_regression) to train the model, and save it to `/model/tax-trip`.
+
+```sql
+train trainning_table as LinearRegression.`/model/tax-trip` where
+
+keepVersion="true"
+
+and evaluateTable="trainning_table"
+and `fitParam.0.labelCol`="label"
+and `fitParam.0.featuresCol`= "features"
+and `fitParam.0.maxIter`="50";
+
+```
+
+```{note}
+To check the parameters of Byzer's built-in Linear Regression algorithm, use the `!show et/params/LinearRegression;` command.
+```
+
+### 2.6 Feature Deployment
+
+Deploy the feature extraction script to OpenMLDB: copy the best-performing feature extraction code and set `execute_mode` to `online`.
+The following example reuses the code from the feature extraction step above for demonstration only; it might not be the best-performing version.
+```sql
+run command as FeatureStoreExt.`` where
+zkAddress="172.17.0.2:7527"
+and `sql-0`='''
+SET @@execute_mode='online';
+'''
+and `sql-1`='''
+SET @@job_timeout=20000000;
+'''
+and `sql-2`='''
+SELECT trip_duration, passenger_count,
+sum(pickup_latitude) OVER w AS vendor_sum_pl,
+max(pickup_latitude) OVER w AS vendor_max_pl,
+min(pickup_latitude) OVER w AS vendor_min_pl,
+avg(pickup_latitude) OVER w AS vendor_avg_pl,
+sum(pickup_latitude) OVER w2 AS pc_sum_pl,
+max(pickup_latitude) OVER w2 AS pc_max_pl,
+min(pickup_latitude) OVER w2 AS pc_min_pl,
+avg(pickup_latitude) OVER w2 AS pc_avg_pl,
+count(vendor_id) OVER w2 AS pc_cnt,
+count(vendor_id) OVER w AS vendor_cnt
+FROM t1
+WINDOW w AS(PARTITION BY vendor_id ORDER BY pickup_datetime ROWS_RANGE BETWEEN 1d PRECEDING AND CURRENT ROW),
+w2 AS(PARTITION BY passenger_count ORDER BY pickup_datetime ROWS_RANGE BETWEEN 1d PRECEDING AND CURRENT ROW) INTO OUTFILE '/tmp/feature_data_test';
+'''
+and db="db1"
+and action="ddl";
+
+```
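+
+In OpenMLDB itself, an online feature script is usually registered with the `DEPLOY` statement. The following is a shortened sketch, assuming it is run in the OpenMLDB CLI against a `db1` that already contains `t1`. The deployment name `taxi_features` is hypothetical, and only two of the features are kept for brevity:
+
+```sql
+USE db1;
+SET @@execute_mode='online';
+-- Register the feature query under a hypothetical deployment name.
+DEPLOY taxi_features SELECT trip_duration, passenger_count,
+sum(pickup_latitude) OVER w AS vendor_sum_pl,
+count(vendor_id) OVER w AS vendor_cnt
+FROM t1
+WINDOW w AS (PARTITION BY vendor_id ORDER BY pickup_datetime ROWS_RANGE BETWEEN 1d PRECEDING AND CURRENT ROW);
+-- List the deployments to confirm the registration.
+SHOW DEPLOYMENTS;
+```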
+
+Import the online data: the following example uses the test set from Kaggle. In production, a real-time data source can be connected instead.
+
+```sql
+run command as FeatureStoreExt.`` where
+zkAddress="172.17.0.2:7527"
+and `sql-0`='''
+SET @@execute_mode='online';
+'''
+and `sql-1`='''
+SET @@job_timeout=20000000;
+'''
+and `sql-2`='''
+CREATE TABLE t1(id string, vendor_id int, pickup_datetime timestamp, dropoff_datetime timestamp, passenger_count int, pickup_longitude double, pickup_latitude double, dropoff_longitude double, dropoff_latitude double, store_and_fwd_flag string, trip_duration int);
+'''
+and `sql-3`='''
+LOAD DATA INFILE 'tmp/upload/test.csv'
+INTO TABLE t1 options(format='csv',header=true,mode='append');
+'''
+and db="db1"
+and action="ddl";
+```
+
+
+
+### 2.7 Model Deployment
+
+Register the previously trained and saved model as a UDF in Byzer Notebook so that it can be called directly.
+
+```sql
+register LinearRegression.`/model/tax-trip` as tax_trip_model_predict;
+```
+
+### 2.8 Prediction
+
+Convert all `int` type fields of the online dataset, after it has been processed by OpenMLDB, to `double`.
+
+```sql
+select *, 
+cast(passenger_count as double) as passenger_count_d,
+cast(pc_cnt as double) as pc_cnt_d,
+cast(vendor_cnt as double) as vendor_cnt_d
+from feature_data_test
+as new_feature_data_test;
+```
+
+Then merge all the fields into a vector.
+
+
+```sql
+select vec_dense(array(
+passenger_count_d,
+vendor_sum_pl,
+vendor_max_pl,
+vendor_min_pl,
+vendor_avg_pl,
+pc_sum_pl,
+pc_max_pl,
+pc_min_pl,
+pc_avg_pl,
+pc_cnt_d,
+vendor_cnt_d
+)) as features
+from new_feature_data_test
+as testing_table;
+```
+
+Use this processed test set to predict.
+
+```sql
+select tax_trip_model_predict(features) as predict_label from testing_table as predict_result;
+```
+
+
+
+
+
diff --git a/docs/en/use_case/images/Byzer_Notebook.jpg b/docs/en/use_case/images/Byzer_Notebook.jpg
diff --git a/docs/en/use_case/index.rst b/docs/en/use_case/index.rst
@@ -10,3 +10,5 @@ Use Cases
     kafka_connector_demo
     dolphinscheduler_task_demo
     JD_recommendation_en
+    OpenMLDB_Byzer_taxi
+
diff --git a/docs/zh/use_case/OpenMLDB_Byzer_taxi.md b/docs/zh/use_case/OpenMLDB_Byzer_taxi.md
@@ -1,6 +1,6 @@
 #  OpenMLDB + Byzer: 基于 SQL 打造端到端机器学习应用
 
-本文示范如何使用[OpenMLDB](https://github.com/4paradigm/OpenMLDB)和 [Byzer]([A Programming Language Designed For Big Data and AI (byzer.org)](https://www.byzer.org/home)) 联合完成一个完整的机器学习应用。OpenMLDB在本例中接收Byzer发送的指令和数据，完成数据的实时特征计算，并经特征工程处理后的数据集返回Byzer，供其进行后续的机器学习训练和预测。
+本文示范如何使用[OpenMLDB](https://github.com/4paradigm/OpenMLDB)和 [Byzer](https://www.byzer.org/home) 联合完成一个完整的机器学习应用。OpenMLDB在本例中接收Byzer发送的指令和数据，完成数据的实时特征计算，并经特征工程处理后的数据集返回Byzer，供其进行后续的机器学习训练和预测。
 
 ## 1. 准备工作
 
@@ -65,13 +65,13 @@ and db="db1"
 and action="ddl";
 ```
 
-````
 ```{note}
 1. zkAddress的端口号应与配置IP时的conf文件夹下各相关文件保持一致
-2. 可以通过$BYZER_HOME\conf路径下的\byzer.properties.override文件中的属性`streaming.plugin.clzznames`检查byzer-openmldb-3.0插件是否成功安装。如果成功安装了该插件，可以看到主类名`tech.mlsql.plugins.openmldb.ByzerApp`。
+2. 可以通过 $BYZER_HOME\conf 路径下的 \byzer.properties.override 文件中的属性`streaming.plugin.clzznames`检查byzer-openmldb-3.0插件是否成功安装。如果成功安装了该插件，可以看到主类名`tech.mlsql.plugins.openmldb.ByzerApp`。
 3. 若未成功安装，可以手动下载jar包再以[离线方式](https://docs.byzer.org/#/byzer-lang/zh-cn/extension/installation/offline_install)安装配置。
 ```
-````
+
+
 
 ### 2.3 进行实时特征计算
 
@@ -110,7 +110,7 @@ and action="ddl";
 
 ### 2.4 数据向量化
 
-在Byzer Noetbbook中将所有int 类型字段都转化为 double。
+在Byzer Notebook中将所有int 类型字段都转化为 double。
 
 ```sql
 select *, 
@@ -146,7 +146,7 @@ as trainning_table;
 
 ### 2.5 模型训练
 
-使用Byzer Lang的train命令和其[内置的线性回归算法](https://docs.byzer.org/#/byzer-lang/zh-cn/ml/algs/linear_regression)训练模型，并将训练好的模型保存到/model/tax-trip路径下。
+使用Byzer Lang的`train`命令和其[内置的线性回归算法](https://docs.byzer.org/#/byzer-lang/zh-cn/ml/algs/linear_regression)训练模型，并将训练好的模型保存到/model/tax-trip路径下。
 
 ```sql
 train trainning_table as LinearRegression.`/model/tax-trip` where
@@ -166,7 +166,7 @@ and `fitParam.0.maxIter`="50";
 
 ### 2.6 特征部署
 
-将特征计算逻辑部署到OpenMLDB上：将最满意的一次特征计算的代码拷贝后修改执行模式为online即可。本例使用的是前文展示的特征工程中的代码。
+将特征计算逻辑部署到OpenMLDB上：将最满意的一次特征计算的代码拷贝后修改执行模式为online即可。本例使用的是前文展示的特征工程中的代码，仅作展示，或许并非表现最优。
 
 ```sql
 run command as FeatureStoreExt.`` where
@@ -224,15 +224,15 @@ and action="ddl";
 
 ### 2.7 模型部署
 
-在Byzer Noetbook中将之前保存的、训练好的模型注册为一个可以直接使用的函数。
+在Byzer Notebook中将之前保存的、训练好的模型注册为一个可以直接使用的函数。
 
 ```sql
 register LinearRegression.`/model/tax-trip` as tax_trip_model_predict;
 ```
 
 ### 2.8 预测
 
-将经OpenMLDB处理后的数据集所有int类型字段转成double。
+将经OpenMLDB处理后的在线数据集的所有int类型字段转成double。
 
 ```sql
 select *, 
@@ -263,7 +263,7 @@ from new_feature_data_test
 as testing_table;
 ```
 
-使用处理后的训练集进行预测。
+使用处理后的测试集进行预测。
 
 ```sql
 select tax_trip_model_predict(testing_table) as predict_label;