diff --git a/docs/en/use_case/OpenMLDB_Byzer_taxi.md b/docs/en/use_case/OpenMLDB_Byzer_taxi.md new file mode 100644 index 00000000000..9554f77ea87 --- /dev/null +++ b/docs/en/use_case/OpenMLDB_Byzer_taxi.md @@ -0,0 +1,276 @@ +# Build End-to-end Machine Learning Applications Based on SQL (OpenMLDB + Byzer) + +This tutorial will show you how to complete a machine learning workflow with the help of [OpenMLDB](https://github.com/4paradigm/OpenMLDB) and [Byzer](https://www.byzer.org/home). +OpenMLDB will compute real-time features based on the data and queries from Byzer, and then return results to Byzer for subsequent model training and inference. + +## 1. Preparations + +### 1.1 Install OpenMLDB + +1. The demo will use the OpenMLDB cluster version running in Docker. See [OpenMLDB Quickstart](../quickstart/openmldb_quickstart.md) for detail installation procedures. +2. Please modify the OpenMLDB IP configuration in order to enable the Byzer engine to access the OpenMLDB service out of the container. See [IP Configuration](../reference/ip_tips.md) for detail guidance. + +### 1.2 Install the Byzer Engine and the Byzer Notebook + +1. For detail installation procedures of Byzer engine, see [Byzer Language Doc](https://docs.byzer.org/#/byzer-lang/en-us/). + +2. We have to use the [OpenMLDB plugin](https://github.com/byzer-org/byzer-extension/tree/master/byzer-openmldb) developed by Byzer to transmit messages between two platforms. To use a plugin in Byzer, please configure `streaming.datalake.path`, see [the manual of Byzer Configuration](https://docs.byzer.org/#/byzer-lang/zh-cn/installation/configuration/byzer-lang-configuration) for detail. + +3. Byzer Notebook is used in this demo. Please install it after the installation of Byzer engine. You can also use the [VSCode Byzer plugin](https://docs.byzer.org/#/byzer-lang/zh-cn/installation/vscode/byzer-vscode-extension-installation) to connect your Byzer engine. The interface of Byzer Notebook is shown below, see [Byzer Notebook Doc](https://docs.byzer.org/#/byzer-notebook/zh-cn/) for more about it. + +![Byzer_Notebook](images/Byzer_Notebook.jpg) + + +### 1.3 Dataset Preparation +In this case, the dataset comes from the Kaggle taxi trip duration prediction problem. If it is not in your Byzer `Deltalake`, [download](https://www.kaggle.com/c/nyc-taxi-trip-duration/overview) it first. Please remember to import it into Byzer Notebook after download. + + +## 2. The Workflow of Machine Learning + +### 2.1 Load the Dataset + +Please import the origin dataset into the `File System` of Byzer Notebook, it will automatically generate the storage path `tmp/upload`. +Use the `load` Byzer Lang command as below to load this dataset. +```sql +load csv.`tmp/upload/train.csv` where delimiter="," +and header = "true" +as taxi_tour_table_train_simple; +``` + +### 2.2 Import the Dataset into OpenMLDB + +Install the OpenMLDB plugin in Byzer. + +```sql +!plugin app add - "byzer-openmldb-3.0"; +``` + +Now you can use this plugin to connect OpenMLDB. **Please make sure the OpenMLDB engine has started and there is a database named `db1` before you run the following code block in Byzer Notebook.** + +```sql +run command as FeatureStoreExt.`` where +zkAddress="172.17.0.2:7527" +and `sql-0`=''' +SET @@execute_mode='offline'; +''' +and `sql-1`=''' +SET @@job_timeout=20000000; +''' +and `sql-2`=''' +CREATE TABLE t1(id string, vendor_id int, pickup_datetime timestamp, dropoff_datetime timestamp, passenger_count int, pickup_longitude double, pickup_latitude double, dropoff_longitude double, dropoff_latitude double, store_and_fwd_flag string, trip_duration int); +''' +and `sql-3`=''' +LOAD DATA INFILE 'tmp/upload/train.csv' +INTO TABLE t1 options(format='csv',header=true,mode='append'); +''' +and db="db1" +and action="ddl"; +``` + +```{note} +1. The port number of zkAddress should correspond with the files' IP configuration under the OpenMLDB `conf/` path. +2. You can check the `streaming.plugin.clzznames` of the `\byzer.properties.override` file, which is under the `$BYZER_HOME\conf` path of Byzer, to see if the `byzer-openmldb-3.0` plugin is successfully installed. You can see the main class name `tech.mlsql.plugins.openmldb.ByzerApp` after installation. +3. If the plugin installation fail, download the `.jar` files and [install it offline](https://docs.byzer.org/#/byzer-lang/zh-cn/extension/installation/offline_install). +``` + +### 2.3 Real-time Feature Extractions + +The features developed in the [OpenMLDB + LightGBM: Taxi Trip Duration Prediction](./lightgbm_demo.md) Section 2.3 will be used in this demo. +The processed data will be exported to a local `csv` file. + +```sql +run command as FeatureStoreExt.`` where +zkAddress="172.17.0.2:7527" +and `sql-0`=''' +SET @@execute_mode='offline'; +''' +and `sql-1`=''' +SET @@job_timeout=20000000; +''' +and `sql-2`=''' +SELECT trp_duration, passanger_count, +sum(pickup_latitude) OVER w AS vendor_sum_pl, +max(pickup_latitude) OVER w AS vendor_max_pl, +min(pickup_latitude) OVER w AS vendor_min_pl, +avg(pickup_latitude) OVER W AS vendor_avg_pl, +sum(pickup_latitude) OVER w2 AS pc_sum_pl, +max(pickup_latitude) OVER w2 AS pc_max_pl, +min(pickup_latitude) OVER w2 AS pc_min_pl, +avg(pickup_latitude) OVER w2 AS pc_avg_pl, +count(vendor_id) OVER w2 AS pc_cnt, +count(vendor_id) OVER w AS vendor_cnt +FROM t1 +WINDOW w AS(PARTITION BY vendor_id ORDER BY ickup_datetime ROWS_RANGE BETWEEN 1d PRECEDING AND CURRENT ROW), +w2 AS(PARTITION BY passenger_count ORDER BY pickup_datetime ROWS_RANGE BETWEEN 1d PRECEDING AND CURRENT ROW) INTO OUTFILE '/tmp/feature_data'; +''' +and db="db1" +and action="ddl"; +``` + + + +### 2.4 Data Vectorization +Convert all `int` type fields to `double` in Byzer Notebook. + +```sql +select *, +cast(passenger_count as double) as passenger_count_d, +cast(pc_cnt as double) as pc_cnt_d, +cast(vendor_cnt as double) as vendor_cnt_d +from feature_data +as new_feature_data; +``` + +Then merge all the fields into a vector. + +```sql +select vec_dense(array( +passenger_count_d, +vendor_sum_pl, +vendor_max_pl, +vendor_min_pl, +vendor_avg_pl, +pc_sum_pl, +pc_max_pl, +pc_min_pl, +pc_avg_pl, +pc_cnt_d, +vendor_cnt +)) as features,cast(trip_duration as double) as label +from new_feature_data +as trainning_table; + +``` + + + +### 2.5 Training + +Use the `train` Byzer Lang command and its [built-in Linear Regression Algorithm](https://docs.byzer.org/#/byzer-lang/zh-cn/ml/algs/linear_regression) to train the model, and save it to `/model/tax-trip`. + +```sql +train trainning_table as LinearRegression.`/model/tax-trip` where + +keepVersion="true" + +and evaluateTable="trainning_table" +and `fitParam.0.labelCol`="label" +and `fitParam.0.featuresCol`= "features" +and `fitParam.0.maxIter`="50"; + +``` + +```{note} +To check the parameters of Byzer's inbuilt Linear Regression Algorithm, please use `!show et/params/LinearRegression;` command. +``` + +### 2.6 Feature Deployment + +Deploy the feature extraction script onto OpenMLDB: copy the best performance code and set the `execute_mode` to `online`. +The following example uses the code the same as that in the feature extraction, which might not be the 'best'. +```sql +run command as FeatureStoreExt.`` where +zkAddress="172.17.0.2:7527" +and `sql-0`=''' +SET @@execute_mode='online'; +''' +and `sql-1`=''' +SET @@job_timeout=20000000; +''' +and `sql-2`=''' +SELECT trp_duration, passanger_count, +sum(pickup_latitude) OVER w AS vendor_sum_pl, +max(pickup_latitude) OVER w AS vendor_max_pl, +min(pickup_latitude) OVER w AS vendor_min_pl, +avg(pickup_latitude) OVER W AS vendor_avg_pl, +sum(pickup_latitude) OVER w2 AS pc_sum_pl, +max(pickup_latitude) OVER w2 AS pc_max_pl, +min(pickup_latitude) OVER w2 AS pc_min_pl, +avg(pickup_latitude) OVER w2 AS pc_avg_pl, +count(vendor_id) OVER w2 AS pc_cnt, +count(vendor_id) OVER w AS vendor_cnt +FROM t1 +WINDOW w AS(PARTITION BY vendor_id ORDER BY ickup_datetime ROWS_RANGE BETWEEN 1d PRECEDING AND CURRENT ROW), +w2 AS(PARTITION BY passenger_count ORDER BY pickup_datetime ROWS_RANGE BETWEEN 1d PRECEDING AND CURRENT ROW) INTO OUTFILE '/tmp/feature_data_test'; +''' +and db="db1" +and action="ddl"; + +``` + +Import the online data: the following example uses the test set from Kaggle, real-time data source can be connected instead in production. + +```sql +run command as FeatureStoreExt.`` where +zkAddress="172.17.0.2:7527" +and `sql-0`=''' +SET @@execute_mode='online'; +''' +and `sql-1`=''' +SET @@job_timeout=20000000; +''' +and `sql-2`=''' +CREATE TABLE t1(id string, vendor_id int, pickup_datetime timestamp, dropoff_datetime timestamp, passenger_count int, pickup_longitude double, pickup_latitude double, dropoff_longitude double, dropoff_latitude double, store_and_fwd_flag string, trip_duration int); +''' +and `sql-3`=''' +LOAD DATA INFILE 'tmp/upload/test.csv' +INTO TABLE t1 options(format='csv',header=true,mode='append'); +''' +and db="db1" +and action="ddl"; +``` + + + +### 2.7 Model Deployment + +Register the previously trained and saved model as a UDF function in Byzer Notebook in order to use it more conveniently. + +```sql +register LinearRegression.`/model/tax-trip` as tax_trip_model_predict; +``` + +### 2.8 Prediction + +Convert all `int` type fields of the online dataset, after processed by OpenMLDB, to `double`. + +```sql +select *, +cast(passenger_count as double) as passenger_count_d, +cast(pc_cnt as double) as pc_cnt_d, +cast(vendor_cnt as double) as vendor_cnt_d +from feature_data_test +as new_feature_data_test; +``` + +Then merge all the fields into a vector. + + +```sql +select vec_dense(array( +passenger_count_d, +vendor_sum_pl, +vendor_max_pl, +vendor_min_pl, +vendor_avg_pl, +pc_sum_pl, +pc_max_pl, +pc_min_pl, +pc_avg_pl, +pc_cnt_d, +vendor_cnt +)) as features, +from new_feature_data_test +as testing_table; +``` + +Use this processed test set to predict. + +```sql +select tax_trip_model_predict(testing_table) as predict_label; +``` + + + + + diff --git a/docs/en/use_case/images/Byzer_Notebook.jpg b/docs/en/use_case/images/Byzer_Notebook.jpg new file mode 100644 index 00000000000..18ae0f85739 Binary files /dev/null and b/docs/en/use_case/images/Byzer_Notebook.jpg differ diff --git a/docs/en/use_case/index.rst b/docs/en/use_case/index.rst index dd6e7dabd59..770b85ca958 100644 --- a/docs/en/use_case/index.rst +++ b/docs/en/use_case/index.rst @@ -10,3 +10,5 @@ Use Cases kafka_connector_demo dolphinscheduler_task_demo JD_recommendation_en + OpenMLDB_Byzer_taxi + diff --git a/docs/zh/use_case/OpenMLDB_Byzer_taxi.md b/docs/zh/use_case/OpenMLDB_Byzer_taxi.md index 34aa1ae8355..16499b95868 100644 --- a/docs/zh/use_case/OpenMLDB_Byzer_taxi.md +++ b/docs/zh/use_case/OpenMLDB_Byzer_taxi.md @@ -1,6 +1,6 @@ # OpenMLDB + Byzer: 基于 SQL 打造端到端机器学习应用 -本文示范如何使用[OpenMLDB](https://github.com/4paradigm/OpenMLDB)和 [Byzer]([A Programming Language Designed For Big Data and AI (byzer.org)](https://www.byzer.org/home)) 联合完成一个完整的机器学习应用。OpenMLDB在本例中接收Byzer发送的指令和数据,完成数据的实时特征计算,并经特征工程处理后的数据集返回Byzer,供其进行后续的机器学习训练和预测。 +本文示范如何使用[OpenMLDB](https://github.com/4paradigm/OpenMLDB)和 [Byzer](https://www.byzer.org/home) 联合完成一个完整的机器学习应用。OpenMLDB在本例中接收Byzer发送的指令和数据,完成数据的实时特征计算,并经特征工程处理后的数据集返回Byzer,供其进行后续的机器学习训练和预测。 ## 1. 准备工作 @@ -65,13 +65,13 @@ and db="db1" and action="ddl"; ``` -```` ```{note} 1. zkAddress的端口号应与配置IP时的conf文件夹下各相关文件保持一致 -2. 可以通过$BYZER_HOME\conf路径下的\byzer.properties.override文件中的属性`streaming.plugin.clzznames`检查byzer-openmldb-3.0插件是否成功安装。如果成功安装了该插件,可以看到主类名`tech.mlsql.plugins.openmldb.ByzerApp`。 +2. 可以通过 $BYZER_HOME\conf 路径下的 \byzer.properties.override 文件中的属性`streaming.plugin.clzznames`检查byzer-openmldb-3.0插件是否成功安装。如果成功安装了该插件,可以看到主类名`tech.mlsql.plugins.openmldb.ByzerApp`。 3. 若未成功安装,可以手动下载jar包再以[离线方式](https://docs.byzer.org/#/byzer-lang/zh-cn/extension/installation/offline_install)安装配置。 ``` -```` + + ### 2.3 进行实时特征计算 @@ -110,7 +110,7 @@ and action="ddl"; ### 2.4 数据向量化 -在Byzer Noetbbook中将所有int 类型字段都转化为 double。 +在Byzer Notebook中将所有int 类型字段都转化为 double。 ```sql select *, @@ -146,7 +146,7 @@ as trainning_table; ### 2.5 模型训练 -使用Byzer Lang的train命令和其[内置的线性回归算法](https://docs.byzer.org/#/byzer-lang/zh-cn/ml/algs/linear_regression)训练模型,并将训练好的模型保存到/model/tax-trip路径下。 +使用Byzer Lang的`train`命令和其[内置的线性回归算法](https://docs.byzer.org/#/byzer-lang/zh-cn/ml/algs/linear_regression)训练模型,并将训练好的模型保存到/model/tax-trip路径下。 ```sql train trainning_table as LinearRegression.`/model/tax-trip` where @@ -166,7 +166,7 @@ and `fitParam.0.maxIter`="50"; ### 2.6 特征部署 -将特征计算逻辑部署到OpenMLDB上:将最满意的一次特征计算的代码拷贝后修改执行模式为online即可。本例使用的是前文展示的特征工程中的代码。 +将特征计算逻辑部署到OpenMLDB上:将最满意的一次特征计算的代码拷贝后修改执行模式为online即可。本例使用的是前文展示的特征工程中的代码,仅作展示,或许并非表现最优。 ```sql run command as FeatureStoreExt.`` where @@ -224,7 +224,7 @@ and action="ddl"; ### 2.7 模型部署 -在Byzer Noetbook中将之前保存的、训练好的模型注册为一个可以直接使用的函数。 +在Byzer Notebook中将之前保存的、训练好的模型注册为一个可以直接使用的函数。 ```sql register LinearRegression.`/model/tax-trip` as tax_trip_model_predict; @@ -232,7 +232,7 @@ register LinearRegression.`/model/tax-trip` as tax_trip_model_predict; ### 2.8 预测 -将经OpenMLDB处理后的数据集所有int类型字段转成double。 +将经OpenMLDB处理后的在线数据集的所有int类型字段转成double。 ```sql select *, @@ -263,7 +263,7 @@ from new_feature_data_test as testing_table; ``` -使用处理后的训练集进行预测。 +使用处理后的测试集进行预测。 ```sql select tax_trip_model_predict(testing_table) as predict_label;