forked from 4paradigm/OpenMLDB
-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
docs: added en version of usecase_byzer (4paradigm#2487)
- Loading branch information
1 parent
a07ce85
commit 89b3592
Showing
4 changed files
with
288 additions
and
10 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,276 @@ | ||
# Build End-to-end Machine Learning Applications Based on SQL (OpenMLDB + Byzer) | ||
|
||
This tutorial will show you how to complete a machine learning workflow with the help of [OpenMLDB](https://github.com/4paradigm/OpenMLDB) and [Byzer](https://www.byzer.org/home). | ||
OpenMLDB will compute real-time features based on the data and queries from Byzer, and then return results to Byzer for subsequent model training and inference. | ||
|
||
## 1. Preparations | ||
|
||
### 1.1 Install OpenMLDB | ||
|
||
1. The demo will use the OpenMLDB cluster version running in Docker. See [OpenMLDB Quickstart](../quickstart/openmldb_quickstart.md) for detail installation procedures. | ||
2. Please modify the OpenMLDB IP configuration in order to enable the Byzer engine to access the OpenMLDB service out of the container. See [IP Configuration](../reference/ip_tips.md) for detail guidance. | ||
|
||
### 1.2 Install the Byzer Engine and the Byzer Notebook | ||
|
||
1. For detail installation procedures of Byzer engine, see [Byzer Language Doc](https://docs.byzer.org/#/byzer-lang/en-us/). | ||
|
||
2. We have to use the [OpenMLDB plugin](https://github.com/byzer-org/byzer-extension/tree/master/byzer-openmldb) developed by Byzer to transmit messages between two platforms. To use a plugin in Byzer, please configure `streaming.datalake.path`, see [the manual of Byzer Configuration](https://docs.byzer.org/#/byzer-lang/zh-cn/installation/configuration/byzer-lang-configuration) for detail. | ||
|
||
3. Byzer Notebook is used in this demo. Please install it after the installation of Byzer engine. You can also use the [VSCode Byzer plugin](https://docs.byzer.org/#/byzer-lang/zh-cn/installation/vscode/byzer-vscode-extension-installation) to connect your Byzer engine. The interface of Byzer Notebook is shown below, see [Byzer Notebook Doc](https://docs.byzer.org/#/byzer-notebook/zh-cn/) for more about it. | ||
|
||
 | ||
|
||
|
||
### 1.3 Dataset Preparation | ||
In this case, the dataset comes from the Kaggle taxi trip duration prediction problem. If it is not in your Byzer `Deltalake`, [download](https://www.kaggle.com/c/nyc-taxi-trip-duration/overview) it first. Please remember to import it into Byzer Notebook after download. | ||
|
||
|
||
## 2. The Workflow of Machine Learning | ||
|
||
### 2.1 Load the Dataset | ||
|
||
Please import the origin dataset into the `File System` of Byzer Notebook, it will automatically generate the storage path `tmp/upload`. | ||
Use the `load` Byzer Lang command as below to load this dataset. | ||
```sql | ||
load csv.`tmp/upload/train.csv` where delimiter="," | ||
and header = "true" | ||
as taxi_tour_table_train_simple; | ||
``` | ||
|
||
### 2.2 Import the Dataset into OpenMLDB | ||
|
||
Install the OpenMLDB plugin in Byzer. | ||
|
||
```sql | ||
!plugin app add - "byzer-openmldb-3.0"; | ||
``` | ||
|
||
Now you can use this plugin to connect OpenMLDB. **Please make sure the OpenMLDB engine has started and there is a database named `db1` before you run the following code block in Byzer Notebook.** | ||
|
||
```sql | ||
run command as FeatureStoreExt.`` where | ||
zkAddress="172.17.0.2:7527" | ||
and `sql-0`=''' | ||
SET @@execute_mode='offline'; | ||
''' | ||
and `sql-1`=''' | ||
SET @@job_timeout=20000000; | ||
''' | ||
and `sql-2`=''' | ||
CREATE TABLE t1(id string, vendor_id int, pickup_datetime timestamp, dropoff_datetime timestamp, passenger_count int, pickup_longitude double, pickup_latitude double, dropoff_longitude double, dropoff_latitude double, store_and_fwd_flag string, trip_duration int); | ||
''' | ||
and `sql-3`=''' | ||
LOAD DATA INFILE 'tmp/upload/train.csv' | ||
INTO TABLE t1 options(format='csv',header=true,mode='append'); | ||
''' | ||
and db="db1" | ||
and action="ddl"; | ||
``` | ||
|
||
```{note} | ||
1. The port number of zkAddress should correspond with the files' IP configuration under the OpenMLDB `conf/` path. | ||
2. You can check the `streaming.plugin.clzznames` of the `\byzer.properties.override` file, which is under the `$BYZER_HOME\conf` path of Byzer, to see if the `byzer-openmldb-3.0` plugin is successfully installed. You can see the main class name `tech.mlsql.plugins.openmldb.ByzerApp` after installation. | ||
3. If the plugin installation fail, download the `.jar` files and [install it offline](https://docs.byzer.org/#/byzer-lang/zh-cn/extension/installation/offline_install). | ||
``` | ||
|
||
### 2.3 Real-time Feature Extractions | ||
|
||
The features developed in the [OpenMLDB + LightGBM: Taxi Trip Duration Prediction](./lightgbm_demo.md) Section 2.3 will be used in this demo. | ||
The processed data will be exported to a local `csv` file. | ||
|
||
```sql | ||
run command as FeatureStoreExt.`` where | ||
zkAddress="172.17.0.2:7527" | ||
and `sql-0`=''' | ||
SET @@execute_mode='offline'; | ||
''' | ||
and `sql-1`=''' | ||
SET @@job_timeout=20000000; | ||
''' | ||
and `sql-2`=''' | ||
SELECT trp_duration, passanger_count, | ||
sum(pickup_latitude) OVER w AS vendor_sum_pl, | ||
max(pickup_latitude) OVER w AS vendor_max_pl, | ||
min(pickup_latitude) OVER w AS vendor_min_pl, | ||
avg(pickup_latitude) OVER W AS vendor_avg_pl, | ||
sum(pickup_latitude) OVER w2 AS pc_sum_pl, | ||
max(pickup_latitude) OVER w2 AS pc_max_pl, | ||
min(pickup_latitude) OVER w2 AS pc_min_pl, | ||
avg(pickup_latitude) OVER w2 AS pc_avg_pl, | ||
count(vendor_id) OVER w2 AS pc_cnt, | ||
count(vendor_id) OVER w AS vendor_cnt | ||
FROM t1 | ||
WINDOW w AS(PARTITION BY vendor_id ORDER BY ickup_datetime ROWS_RANGE BETWEEN 1d PRECEDING AND CURRENT ROW), | ||
w2 AS(PARTITION BY passenger_count ORDER BY pickup_datetime ROWS_RANGE BETWEEN 1d PRECEDING AND CURRENT ROW) INTO OUTFILE '/tmp/feature_data'; | ||
''' | ||
and db="db1" | ||
and action="ddl"; | ||
``` | ||
|
||
|
||
|
||
### 2.4 Data Vectorization | ||
Convert all `int` type fields to `double` in Byzer Notebook. | ||
|
||
```sql | ||
select *, | ||
cast(passenger_count as double) as passenger_count_d, | ||
cast(pc_cnt as double) as pc_cnt_d, | ||
cast(vendor_cnt as double) as vendor_cnt_d | ||
from feature_data | ||
as new_feature_data; | ||
``` | ||
|
||
Then merge all the fields into a vector. | ||
|
||
```sql | ||
select vec_dense(array( | ||
passenger_count_d, | ||
vendor_sum_pl, | ||
vendor_max_pl, | ||
vendor_min_pl, | ||
vendor_avg_pl, | ||
pc_sum_pl, | ||
pc_max_pl, | ||
pc_min_pl, | ||
pc_avg_pl, | ||
pc_cnt_d, | ||
vendor_cnt | ||
)) as features,cast(trip_duration as double) as label | ||
from new_feature_data | ||
as trainning_table; | ||
|
||
``` | ||
|
||
|
||
|
||
### 2.5 Training | ||
|
||
Use the `train` Byzer Lang command and its [built-in Linear Regression Algorithm](https://docs.byzer.org/#/byzer-lang/zh-cn/ml/algs/linear_regression) to train the model, and save it to `/model/tax-trip`. | ||
|
||
```sql | ||
train trainning_table as LinearRegression.`/model/tax-trip` where | ||
|
||
keepVersion="true" | ||
|
||
and evaluateTable="trainning_table" | ||
and `fitParam.0.labelCol`="label" | ||
and `fitParam.0.featuresCol`= "features" | ||
and `fitParam.0.maxIter`="50"; | ||
|
||
``` | ||
|
||
```{note} | ||
To check the parameters of Byzer's inbuilt Linear Regression Algorithm, please use `!show et/params/LinearRegression;` command. | ||
``` | ||
|
||
### 2.6 Feature Deployment | ||
|
||
Deploy the feature extraction script onto OpenMLDB: copy the best performance code and set the `execute_mode` to `online`. | ||
The following example uses the code the same as that in the feature extraction, which might not be the 'best'. | ||
```sql | ||
run command as FeatureStoreExt.`` where | ||
zkAddress="172.17.0.2:7527" | ||
and `sql-0`=''' | ||
SET @@execute_mode='online'; | ||
''' | ||
and `sql-1`=''' | ||
SET @@job_timeout=20000000; | ||
''' | ||
and `sql-2`=''' | ||
SELECT trp_duration, passanger_count, | ||
sum(pickup_latitude) OVER w AS vendor_sum_pl, | ||
max(pickup_latitude) OVER w AS vendor_max_pl, | ||
min(pickup_latitude) OVER w AS vendor_min_pl, | ||
avg(pickup_latitude) OVER W AS vendor_avg_pl, | ||
sum(pickup_latitude) OVER w2 AS pc_sum_pl, | ||
max(pickup_latitude) OVER w2 AS pc_max_pl, | ||
min(pickup_latitude) OVER w2 AS pc_min_pl, | ||
avg(pickup_latitude) OVER w2 AS pc_avg_pl, | ||
count(vendor_id) OVER w2 AS pc_cnt, | ||
count(vendor_id) OVER w AS vendor_cnt | ||
FROM t1 | ||
WINDOW w AS(PARTITION BY vendor_id ORDER BY ickup_datetime ROWS_RANGE BETWEEN 1d PRECEDING AND CURRENT ROW), | ||
w2 AS(PARTITION BY passenger_count ORDER BY pickup_datetime ROWS_RANGE BETWEEN 1d PRECEDING AND CURRENT ROW) INTO OUTFILE '/tmp/feature_data_test'; | ||
''' | ||
and db="db1" | ||
and action="ddl"; | ||
|
||
``` | ||
|
||
Import the online data: the following example uses the test set from Kaggle, real-time data source can be connected instead in production. | ||
|
||
```sql | ||
run command as FeatureStoreExt.`` where | ||
zkAddress="172.17.0.2:7527" | ||
and `sql-0`=''' | ||
SET @@execute_mode='online'; | ||
''' | ||
and `sql-1`=''' | ||
SET @@job_timeout=20000000; | ||
''' | ||
and `sql-2`=''' | ||
CREATE TABLE t1(id string, vendor_id int, pickup_datetime timestamp, dropoff_datetime timestamp, passenger_count int, pickup_longitude double, pickup_latitude double, dropoff_longitude double, dropoff_latitude double, store_and_fwd_flag string, trip_duration int); | ||
''' | ||
and `sql-3`=''' | ||
LOAD DATA INFILE 'tmp/upload/test.csv' | ||
INTO TABLE t1 options(format='csv',header=true,mode='append'); | ||
''' | ||
and db="db1" | ||
and action="ddl"; | ||
``` | ||
|
||
|
||
|
||
### 2.7 Model Deployment | ||
|
||
Register the previously trained and saved model as a UDF function in Byzer Notebook in order to use it more conveniently. | ||
|
||
```sql | ||
register LinearRegression.`/model/tax-trip` as tax_trip_model_predict; | ||
``` | ||
|
||
### 2.8 Prediction | ||
|
||
Convert all `int` type fields of the online dataset, after processed by OpenMLDB, to `double`. | ||
|
||
```sql | ||
select *, | ||
cast(passenger_count as double) as passenger_count_d, | ||
cast(pc_cnt as double) as pc_cnt_d, | ||
cast(vendor_cnt as double) as vendor_cnt_d | ||
from feature_data_test | ||
as new_feature_data_test; | ||
``` | ||
|
||
Then merge all the fields into a vector. | ||
|
||
|
||
```sql | ||
select vec_dense(array( | ||
passenger_count_d, | ||
vendor_sum_pl, | ||
vendor_max_pl, | ||
vendor_min_pl, | ||
vendor_avg_pl, | ||
pc_sum_pl, | ||
pc_max_pl, | ||
pc_min_pl, | ||
pc_avg_pl, | ||
pc_cnt_d, | ||
vendor_cnt | ||
)) as features, | ||
from new_feature_data_test | ||
as testing_table; | ||
``` | ||
|
||
Use this processed test set to predict. | ||
|
||
```sql | ||
select tax_trip_model_predict(testing_table) as predict_label; | ||
``` | ||
|
||
|
||
|
||
|
||
|
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -10,3 +10,5 @@ Use Cases | |
kafka_connector_demo | ||
dolphinscheduler_task_demo | ||
JD_recommendation_en | ||
OpenMLDB_Byzer_taxi | ||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters