From c73d17edc3260f14624f5dbd9fe5c2d4c68d8a63 Mon Sep 17 00:00:00 2001
From: Siqi Wang
Date: Wed, 21 Feb 2024 12:02:41 +0800
Subject: [PATCH] remove unused files for usecase folder (#3753)

---
 docs/en/use_case/airflow_provider_demo.md     | 123 --------
 .../en/use_case/dolphinscheduler_task_demo.md | 215 --------------
 docs/en/use_case/kafka_connector_demo.md      | 224 ---------------
 docs/en/use_case/lightgbm_demo.md             | 206 -------------
 docs/en/use_case/pulsar_connector_demo.md     | 270 ------------------
 5 files changed, 1038 deletions(-)
 delete mode 100644 docs/en/use_case/airflow_provider_demo.md
 delete mode 100644 docs/en/use_case/dolphinscheduler_task_demo.md
 delete mode 100644 docs/en/use_case/kafka_connector_demo.md
 delete mode 100644 docs/en/use_case/lightgbm_demo.md
 delete mode 100644 docs/en/use_case/pulsar_connector_demo.md

diff --git a/docs/en/use_case/airflow_provider_demo.md b/docs/en/use_case/airflow_provider_demo.md
deleted file mode 100644
index 9019ba2c5a6..00000000000
--- a/docs/en/use_case/airflow_provider_demo.md
+++ /dev/null
@@ -1,123 +0,0 @@
# Airflow OpenMLDB Provider
We provide the [Airflow OpenMLDB Provider](https://github.com/4paradigm/OpenMLDB/tree/main/extensions/airflow-provider-openmldb) to make it easier to use OpenMLDB in Airflow DAGs.
This guide uses Airflow to manage the training and deployment tasks in the [TalkingData Demo](talkingdata_demo).

## TalkingData DAG

We will use the DAG created by [example_openmldb_complex.py](https://github.com/4paradigm/OpenMLDB/blob/main/extensions/airflow-provider-openmldb/openmldb_provider/example_dags/example_openmldb_complex.py) in Airflow.
You can import the DAG into Airflow and run it directly.
![airflow dag](images/airflow_dag.png)

The workflow of the DAG is shown above. The tables are created first, then the offline data is imported and processed for feature extraction. After training, if the AUC of the model is greater than 99.0, the SQL script and the model are deployed. Otherwise, the workflow reports failure.

## Demo

The DAG above completes the feature extraction and deployment work in the [TalkingData Demo](talkingdata_demo); the predict_server in that demo is responsible for real-time prediction after deployment.

### 0 Preparations

#### 0.1 Download the DAG
Both the DAG and the training script can be obtained by downloading [airflow_demo_files](https://openmldb.ai/download/airflow_demo/airflow_demo_files.tar.gz).

```
wget https://openmldb.ai/download/airflow_demo/airflow_demo_files.tar.gz
tar zxf airflow_demo_files.tar.gz
ls airflow_demo_files
```
For the newest version, please visit [GitHub example_dags](https://github.com/4paradigm/OpenMLDB/tree/main/extensions/airflow-provider-openmldb/openmldb_provider/example_dags).


#### 0.2 Start the Docker Image

- It is recommended to install and start both the OpenMLDB image and Airflow in Docker.
- The container's port needs to be exposed for the Airflow web login.
- Please mount the previously downloaded files to the path `/work/airflow/dags`, where Airflow looks for DAGs.

```
docker run -p 8080:8080 -v `pwd`/airflow_demo_files:/work/airflow/dags -it 4pdosc/openmldb:0.8.4 bash
```

#### 0.3 Download and Install Airflow and the Airflow OpenMLDB Provider
Run the following command in Docker.
```
pip3 install airflow-provider-openmldb
```
Since the Airflow OpenMLDB Provider depends on Airflow, installing it pulls in Airflow as well.
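You can quickly confirm that both packages were installed (a minimal sketch; it assumes the provider's top-level module is named `openmldb_provider`, matching the repository layout linked above):

```python
# Sanity check: both Airflow and the OpenMLDB provider should import cleanly.
import airflow
import openmldb_provider  # module name assumed from the provider repository layout

print("Airflow", airflow.__version__, "with the OpenMLDB provider installed")
```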

#### 0.4 Prepare the Dataset
Since the data import path of the DAG is `/tmp/train_sample.csv`, we have to copy the data file to the `/tmp` directory.
```
cp /work/talkingdata/train_sample.csv /tmp/
```

### 1 Start OpenMLDB and Airflow
The following commands start the OpenMLDB cluster, the `predict_server` (which supports deployment and testing), and the standalone Airflow.

```
/work/init.sh
python3 /work/talkingdata/predict_server.py --no-init > predict.log 2>&1 &
export AIRFLOW_HOME=/work/airflow
cd /work/airflow
airflow standalone
```

The username and the password for the Airflow standalone are shown in the picture below.

![airflow login](images/airflow_login.png)

Please visit `http://localhost:8080` and enter the username and the password as shown.

```{caution}
`airflow standalone` is a foreground process; exiting it terminates all Airflow processes.
You can quit Airflow after the DAG has finished and then go to [Step 3](#3-test), or simply put the Airflow process in the background.
```

### 2 Run the DAG
Open the DAG example_openmldb_complex in the Airflow web UI and click `Code` to check the details of the DAG.
![dag home](images/dag_home.png)

You can see the `openmldb_conn_id` used in `Code`. The DAG does not use the address of OpenMLDB directly but refers to this connection, so we need to create a new connection with the same name.
![dag code](images/dag_code.png)

#### 2.1 Create the Connection
Click 'Connections' in the 'Admin' menu.
![connection](images/connection.png)

Add a connection.
![add connection](images/add_connection.png)

Please use the address of the OpenMLDB API server rather than the address of ZooKeeper, as the Airflow OpenMLDB Provider connects to the OpenMLDB API server.

![connection settings](images/connection_settings.png)

The created connection is shown in the picture below.
![display](images/connection_display.png)

#### 2.2 Run the DAG
Run the DAG to complete one round of model training, SQL deployment, and model deployment.
A successful run should look like the following figure.
![dag run](images/dag_run.png)

### 3 Test

If you ran Airflow in the foreground, you can quit it now, as the subsequent steps do not depend on it.
#### 3.1 Import the Online Data
Although the DAG has deployed the SQL and the model, there is no data in the online database yet.
You should run the following command to import the online data.
```
curl -X POST http://127.0.0.1:9080/dbs/example_db -d'{"mode":"online", "sql":"load data infile \"file:///tmp/train_sample.csv\" into table example_table options(mode=\"append\");"}'
```
This is an asynchronous operation, but it won't take long because of the small data size.
If you want to check the execution state of the command, use `SHOW JOBS`.
```
curl -X POST http://127.0.0.1:9080/dbs/example_db -d'{"mode":"online", "sql":"show jobs"}'
```

#### 3.2 Prediction
Run the following prediction script, which uses the latest deployed SQL and model.
```
python3 /work/talkingdata/predict.py
```
The result is shown below.
![result](images/airflow_test_result.png)
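As a scripting alternative, the two curl calls in Step 3.1 can also be issued from Python; a minimal sketch, assuming the `requests` package is available in the container:

```python
import requests

# OpenMLDB API server endpoint used throughout this demo
api = "http://127.0.0.1:9080/dbs/example_db"

# Import the online data (an asynchronous job, same as the first curl call above)
load_sql = ('load data infile "file:///tmp/train_sample.csv" '
            'into table example_table options(mode="append");')
print(requests.post(api, json={"mode": "online", "sql": load_sql}).json())

# Check the job state, same as the second curl call above
print(requests.post(api, json={"mode": "online", "sql": "show jobs"}).json())
```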

diff --git a/docs/en/use_case/dolphinscheduler_task_demo.md b/docs/en/use_case/dolphinscheduler_task_demo.md
deleted file mode 100644
index 5a4a8e6bfb8..00000000000
--- a/docs/en/use_case/dolphinscheduler_task_demo.md
+++ /dev/null
@@ -1,215 +0,0 @@
# Building End-to-End MLOps Workflows (OpenMLDB + DolphinScheduler)

## Background
In the closed loop of machine learning applications from development to deployment, data processing, feature engineering, and model training often consume a great deal of time and manpower. To facilitate the development and deployment of AI applications, we have developed the DolphinScheduler OpenMLDB Task, which integrates feature engineering into DolphinScheduler workflows to build end-to-end MLOps pipelines. This article briefly introduces and demonstrates how the DolphinScheduler OpenMLDB Task works.

```{seealso}
See the [DolphinScheduler OpenMLDB Task Official Documentation](https://dolphinscheduler.apache.org/en-us/docs/3.1.5/guide/task/openmldb) for full details.
```

## Scenarios and Functions
### Why We Need the DolphinScheduler OpenMLDB Task

![image-20220610170510779](../../zh/use_case/images/ecosystem.png)

As an open-source machine learning database that provides a full-stack solution for data and feature engineering, a key goal for OpenMLDB is to improve ease of use and integrate with the open-source ecosystem. As shown in the figure above, access to data sources makes it easier for data in DataOps to feed into OpenMLDB, and the features produced by OpenMLDB also need to flow smoothly into ModelOps for training.

In this article, we focus on the integration with the workflow scheduling platform DolphinScheduler. The DolphinScheduler OpenMLDB Task makes it easier to operate OpenMLDB; at the same time, OpenMLDB tasks become managed by the workflow engine and are fully automated.

### What the DolphinScheduler OpenMLDB Task Can Do

By writing OpenMLDB tasks, we can cover OpenMLDB's requirements for offline import, feature extraction, SQL deployment, real-time data import, and so on. We can build an end-to-end machine learning pipeline with OpenMLDB on DolphinScheduler.

![image-20220610174647990](../../zh/use_case/images/task_func.png)

For example, a typical OpenMLDB-based machine learning workflow is shown in the figure above. Steps 1-4 in the process correspond to offline data import, feature extraction, SQL deployment, and real-time data import, all of which can be written as DolphinScheduler OpenMLDB Tasks.

In addition to the feature engineering done by OpenMLDB, prediction also requires model inference. So next, based on the TalkingData advertising fraud detection scenario from the Kaggle competition, we will demonstrate how to use the DolphinScheduler OpenMLDB Task to build an end-to-end machine learning pipeline. For details of the TalkingData competition, see [talkingdata-adtracking-fraud-detection](https://www.kaggle.com/competitions/talkingdata-adtracking-fraud-detection/discussion).

## Demo
### Configuration

**Use the OpenMLDB Docker Image**

The demo can run on macOS or Linux; the OpenMLDB docker image is recommended. We'll start OpenMLDB and DolphinScheduler in the same container and expose the DolphinScheduler web port:
```
docker run -it -p 12345:12345 4pdosc/openmldb:0.8.4 bash
```

```{attention}
DolphinScheduler requires an operating system user with `sudo` permission.
Therefore, it is recommended to download and start DolphinScheduler inside the OpenMLDB container. Otherwise, please prepare an operating system user with sudo permission.
```

The docker image does not ship with sudo, but DolphinScheduler needs it at runtime, so install it:
```
apt update && apt install sudo
```

DolphinScheduler runs tasks with sh, but the docker image's default sh is `dash`. Change it to `bash`:
```
dpkg-reconfigure dash
```
And answer `no` when prompted.

**Source Data**

The workflow will load data from `/tmp/train_sample.csv`, so prepare it:
```
curl -SLo /tmp/train_sample.csv https://openmldb.ai/download/dolphinschduler-task/train_sample.csv
```

**Start the OpenMLDB Cluster and the Predict Server**

In the container, you can directly run the following command to start the OpenMLDB cluster.
```
./init.sh
```

We will complete a workflow of importing data, offline training, and, after successful training, deploying the SQL and the model online. For the online part of the model, you can use the simple predict server in `/work/talkingdata`. Run it in the background:
```
cd /work
curl -SLo predict_server.py https://openmldb.ai/download/dolphinschduler-task/predict_server.py
python3 predict_server.py --no-init > predict.log 2>&1 &
```
```{tip}
If the online prediction test reports errors, please check the log `/work/predict.log`.
```

**Start DolphinScheduler**

You can download the DolphinScheduler package from the [official site](https://dolphinscheduler.apache.org/zh-cn/download/3.1.5), or from the mirror we prepared: [dolphinscheduler-bin download link](http://openmldb.ai/download/dolphinschduler-task/apache-dolphinscheduler-dev-3.1.5-bin.tar.gz).

Start the DolphinScheduler standalone version. The steps are as follows; for more information, please refer to the [Official Documentation](https://dolphinscheduler.apache.org/en-us/docs/3.1.5/guide/installation/standalone).
```
curl -SLO https://dlcdn.apache.org/dolphinscheduler/3.1.5/apache-dolphinscheduler-3.1.5-bin.tar.gz
# mirror: curl -SLO http://openmldb.ai/download/dolphinschduler-task/apache-dolphinscheduler-dev-3.1.5-bin.tar.gz
tar -xvzf apache-dolphinscheduler-*-bin.tar.gz
cd apache-dolphinscheduler-*-bin
sed -i s#/opt/soft/python#/usr/bin/python3#g bin/env/dolphinscheduler_env.sh
sh ./bin/dolphinscheduler-daemon.sh start standalone-server
```

```{hint}
The OpenMLDB Task in old versions (< 3.1.2) is broken and cannot work; please use a newer package (>= 3.1.3). If you need an older version of DolphinScheduler, ask us for a fixed build.

In higher versions of DolphinScheduler, `bin/env/dolphinscheduler_env.sh` may have changed and `PYTHON_HOME` may need to be appended to it; run `echo "export PYTHON_HOME=/usr/bin/python3" >> bin/env/dolphinscheduler_env.sh`.

We have set the Python environment by modifying `PYTHON_HOME` in `bin/env/dolphinscheduler_env.sh`, as shown in the code above (the Python Task needs an explicitly set Python environment, because we use Python 3). If you have already started DolphinScheduler, you can also set the environment on the web page after startup. The setting method is as follows. **Note that in this case, you must confirm that all tasks in the workflow use this environment.**

Note that a temporary environment variable `PYTHON_HOME` configured before DolphinScheduler standalone starts does not affect the environment of the worker server.
-``` - -Now you can login to DolphinScheduler at http://localhost:12345/dolphinscheduler/ui (If you access it by another machine, use the IP address). The default user name and password are: admin/dolphinscheduler123。 - -```{note} -The worker server of DolphinScheduler requires the OpenMLDB Python SDK. The worker of DolphinScheduler standalone is the local machine, so you only need to install the OpenMLDB Python SDK on the local machine. The Python SDK is ready in our OpenMLDB image. If you are not running the docker image, install the SDK by `pip3 install openmldb`. -``` - -**Download workflow json** - -Workflows can be created manually. In this example, we directly provide JSON workflow files, [Click to Download](http://openmldb.ai/download/dolphinschduler-task/workflow_openmldb_demo.json), and you can directly import it later into the DolphinScheduler environment and make simple modifications to complete the whole workflow. - -Note that, you should download the workflow file in the machine which you open the browser. We'll upload the file on web. - -### Demo Steps - -#### Step 1. Initialize Configuration - -You need to first create a tenant in the DolphinScheduler Web, and then enter the tenant management interface, fill in the operating system user with sudo permission, and use the default for the queue. You can use **root** if you run it in the docker container. - -![create tenant](../../zh/use_case/images/ds_create_tenant.png) - -Then you need to bind the tenant to the user. For simplicity, we directly bind to the admin user. Enter the user management page and click edit admin user. - -![bind tenant](../../zh/use_case/images/ds_bind_tenant.png) - -After binding, the user status is similar to the following figure. - -![bind status](../../zh/use_case/images/ds_bind_status.png) - -#### Step 2. Create Workflow -In the DolphinScheduler, you need to create a project first, and then create a workflow in the project. Therefore, first create a test project, as shown in the following figure. Click create a project and enter the project. - -![create project](../../zh/use_case/images/ds_create_project.png) - -![project](../../zh/use_case/images/ds_project.png) - -After entering the project, you can import the [downloaded workflow file](https://github.com/4paradigm/OpenMLDB/releases/download/v0.5.1/workflow_openmldb_demo.json). As shown in the following figure, please click Import workflow in the workflow definition interface. - -![import workflow](../../zh/use_case/images/ds_import_workflow.png) - -After the import, the workflow will appear in the workflow list, similar to the following figure. - -![workflow list](../../zh/use_case/images/ds_workflow_list.png) - -Then you click the workflow name to view the workflow details, as shown in the following figure. - -![workflow detail](../../zh/use_case/images/ds_workflow_detail.png) - -**Note**: This needs to be modified because the task ID will change after importing the workflow. In particular, the upstream and downstream id in the switch task do not exist and need to be manually changed. - -![switch](../../zh/use_case/images/ds_switch.png) - -As shown in the above figure, there is a non-existent ID in the settings of the switch task. Please change the successful and failed "branch flow" and "pre-check condition" to the task of the current workflow. - -The correct result is shown in the following figure: - -![right](../../zh/use_case/images/ds_switch_right.png) - -After modification, we save the workflow. 
The imported workflow uses the default tenant, which can be run as-is. If you want to specify your own tenant, select the tenant when saving the workflow, as shown in the following figure.
![set tenant](../../zh/use_case/images/ds_set_tenant.png)

#### Step 3. Online Operation

After saving the workflow, you need to bring it online before running it; the run button lights up only after the workflow is online, as shown in the following figure.

![run](../../zh/use_case/images/ds_run.png)

Please click run and wait for the workflow to complete. You can view the workflow running details in the Workflow Instance interface, as shown in the following figure.

![run status](../../zh/use_case/images/ds_run_status.png)

To demonstrate a successful launch, the validation task does not perform actual validation; it directly reports success and flows into the deploy branch. After the deploy branch runs, the deploy SQL and subsequent tasks succeed, and the predict server receives the latest model.

```{note}
If an instance in Workflow Instance shows `Failed`, click the instance name to jump to the detail page. Double-click the failed task, then click `View log` in the top right-hand corner and check the log for detailed error messages.

The `load offline data`, `feature extraction` and `load online` tasks may succeed in DolphinScheduler while the corresponding jobs fail in OpenMLDB. In that case the `train` task may fail with the error 'No object to concatenate' (traceback in `pd.concat`), which means there is no feature source.

If something is wrong, please check the real state of each job in OpenMLDB. You can run `echo "show jobs;" | /work/openmldb/bin/openmldb --zk_cluster=127.0.0.1:2181 --zk_root_path=/openmldb --role=sql_client`. If a job's state is `FAILED`, find the job log; see [job log path](../../zh/quickstart/beginner_must_read.md#离线) for where to find it.
```

#### Step 4. Online Prediction Test
The predict server also provides an online prediction service, requested through `POST /predict`. We simply construct a real-time request and send it to the predict server.
```
curl -X POST 127.0.0.1:8881/predict -d '{"ip": 114904,
    "app": 11,
    "device": 1,
    "os": 15,
    "channel": 319,
    "click_time": 1509960088000,
    "is_attributed": 0}'
```
The returned result is as follows:

![predict](images/ds_predict.png)

#### Supplement

If you rerun the workflow, the `deploy sql` task may fail because the deployment `demo` already exists. Please delete the deployment in the container before rerunning the workflow:
```
/work/openmldb/bin/openmldb --zk_cluster=127.0.0.1:2181 --zk_root_path=/openmldb --role=sql_client --database=demo_db --interactive=false --cmd="drop deployment demo;"
```

You can check whether the deployment has been deleted:
```
/work/openmldb/bin/openmldb --zk_cluster=127.0.0.1:2181 --zk_root_path=/openmldb --role=sql_client --database=demo_db --interactive=false --cmd="show deployment demo;"
```

Restart the DolphinScheduler server (the metadata will be cleared, so you need to redo the configuration and create the workflow again):
```
./bin/dolphinscheduler-daemon.sh stop standalone-server
./bin/dolphinscheduler-daemon.sh start standalone-server
```

If you want to persist the metadata, see [Pseudo-Cluster Deployment](https://dolphinscheduler.apache.org/en-us/docs/3.1.5/guide/installation/pseudo-cluster) and configure a database.
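As a final note on the Step 4 prediction test: the same request can be sent from Python when scripting the test, for example after a workflow rerun. A minimal sketch, assuming the `requests` package:

```python
import requests

# Same payload as the curl request in Step 4
row = {"ip": 114904, "app": 11, "device": 1, "os": 15,
       "channel": 319, "click_time": 1509960088000, "is_attributed": 0}
resp = requests.post("http://127.0.0.1:8881/predict", json=row)
print(resp.text)  # the predicted result returned by the predict server
```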
diff --git a/docs/en/use_case/kafka_connector_demo.md b/docs/en/use_case/kafka_connector_demo.md
deleted file mode 100644
index 70288b0001d..00000000000
--- a/docs/en/use_case/kafka_connector_demo.md
+++ /dev/null
@@ -1,224 +0,0 @@
# Importing Real-Time Data Streams from Kafka

## Introduction

Apache Kafka is an event streaming platform. It can serve as an online data source for OpenMLDB, importing real-time data streams into OpenMLDB's online storage. For more information about Kafka, please refer to the official website https://kafka.apache.org/. We have developed a Kafka connector that bridges Kafka and OpenMLDB seamlessly. In this document, you will learn the concepts and usage of this connector.

Please note that, to keep the demonstration simple, this article starts the connector in Kafka Connect standalone mode. The connector can also be started in distributed mode.

:::{seealso}

For the implementation of the OpenMLDB Kafka Connector, please refer to [extensions/kafka-connect-jdbc](https://github.com/4paradigm/OpenMLDB/tree/main/extensions/kafka-connect-jdbc).
:::

## Overview

### Download and Preparation

- Download Kafka: please click [kafka downloads](https://kafka.apache.org/downloads) to download `kafka_2.13-3.1.0.tgz`.
- Download the connector package and its dependencies: please click [kafka-connect-jdbc.tgz](https://github.com/4paradigm/OpenMLDB/releases/download/v0.5.0/kafka-connect-jdbc.tgz).
- Download the configuration and script files used for the demonstration in this article: please click [kafka_demo_files.tgz](http://openmldb.ai/download/kafka-connector/kafka_demo_files.tgz).

This article starts OpenMLDB in a docker container, so there is no need to download OpenMLDB separately. Moreover, Kafka and the connector can be started in the same container. We recommend that you save the three downloaded packages to the same directory. Let's assume that the packages are in the `/work/kafka` directory.

```
docker run -it -v `pwd`:/work/kafka --name openmldb 4pdosc/openmldb:0.8.4 bash
```

### Steps

The brief process of using the connector is shown in the figure below. We will describe each step in detail next.

In general, the process can be summarized in four steps:

1. Start OpenMLDB and create a database
2. Start Kafka and create a topic
3. Start the OpenMLDB Kafka Connector
4. Proceed with testing or normal use

![demo steps](../../zh/use_case/images/kafka_connector_steps.png)


## Step 1: Start OpenMLDB and Create a Database

### Start the OpenMLDB Cluster

In the OpenMLDB container, start the cluster:

```
/work/init.sh
```

:::{caution}

At present, only the OpenMLDB cluster version can be the sink target, and data is written only to the online storage of the cluster.
:::

### Create a Database

We can quickly create a database through a pipe without logging into the client CLI:

```
echo "create database kafka_test;" | /work/openmldb/bin/openmldb --zk_cluster=127.0.0.1:2181 --zk_root_path=/openmldb --role=sql_client
```

## Step 2: Start Kafka and Create a Topic

### Start Kafka

Unpack Kafka and start it using the start script.

```
cd kafka
tar -xzf kafka_2.13-3.1.0.tgz
cd kafka_2.13-3.1.0
./bin/kafka-server-start.sh -daemon config/server.properties
```

:::{note}

The OpenMLDB service already uses port 2181 to run ZooKeeper, and Kafka does not need to start ZooKeeper again.
Therefore, you only need to start the server here.
:::

You can use `ps` to check whether Kafka is working normally. If the start failed, check the log `logs/server.log`.

```
ps axu|grep kafka
```

### Create a Topic

We create a topic named `topic1`. Please note that special characters should not appear in topic names.

```
./bin/kafka-topics.sh --create --topic topic1 --bootstrap-server localhost:9092
```

You can `describe` the topic to confirm that it is normal.

```
./bin/kafka-topics.sh --describe --topic topic1 --bootstrap-server localhost:9092
```

![topic status](../../zh/use_case/images/kafka_topic_describe.png)

## Step 3: Start the Connector

First, unpack the connector and the kafka_demo_files packages in `/work/kafka`.

```
cd /work/kafka
tar zxf kafka-connect-jdbc.tgz
tar zxf kafka_demo_files.tgz
```

kafka_demo_files contains the configuration files required to start the connector. Make sure the connector plugin is placed in the correct location.

The first configuration file is the configuration of the connector itself, `connect-standalone.properties`. The key option is `plugin.path`:

```
plugin.path=/usr/local/share/java
```

The connector and all dependent packages required to run it need to be put into this directory. The commands are as follows:

```
mkdir -p /usr/local/share/java
cp -r /work/kafka/kafka-connect-jdbc /usr/local/share/java/
```

The second configuration file is `openmldb-sink.properties`, which configures the connection to the OpenMLDB cluster, as follows:

```
name=test-sink
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
tasks.max=1
topics=topic1
connection.url=jdbc:openmldb:///kafka_test?zk=127.0.0.1:2181&zkPath=/openmldb
auto.create=true
```

```{tip}
See [Configuring Connectors](https://kafka.apache.org/documentation/#connect_configuring) for full details about the configuration options.

The option `connection.url` must be a correct OpenMLDB address and database, and the database must exist.
```

In the connection configuration, you need to fill in the correct OpenMLDB URL address. The connector receives messages from topic1 and creates the table automatically (auto.create).

Next, start the connector in Kafka Connect standalone mode.

```
cd /work/kafka/kafka_2.13-3.1.0
./bin/connect-standalone.sh -daemon ../kafka_demo_files/connect-standalone.properties ../kafka_demo_files/openmldb-sink.properties
```

Check in `logs/connect.log` whether the connector has started and is correctly connected to the OpenMLDB cluster. Under normal circumstances, the log should contain `Executing sink task`.

## Step 4: Test

### Send Messages

We use the console producer provided by Kafka as the message sending tool for testing.

Since we haven't created a table yet, our message should carry a schema to help Kafka parse the message and write it to OpenMLDB.
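The full schema-bearing test message is shown below and also ships as `kafka_demo_files/message`. Besides the console producer used next, any Kafka client can send it; here is a minimal sketch using the `kafka-python` package (an assumption — the demo itself only uses the console producer):

```python
# pip3 install kafka-python  (assumed client library; any Kafka producer works)
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
# Send the schema+payload JSON message prepared in kafka_demo_files
with open("/work/kafka/kafka_demo_files/message", "rb") as f:
    producer.send("topic1", f.read())
producer.flush()
```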

```
{"schema":{"type":"struct","fields":[{"type":"int16","optional":true,"field":"c1_int16"},{"type":"int32","optional":true,"field":"c2_int32"},{"type":"int64","optional":true,"field":"c3_int64"},{"type":"float","optional":true,"field":"c4_float"},{"type":"double","optional":true,"field":"c5_double"},{"type":"boolean","optional":true,"field":"c6_boolean"},{"type":"string","optional":true,"field":"c7_string"},{"type":"int64","name":"org.apache.kafka.connect.data.Date","optional":true,"field":"c8_date"},{"type":"int64","name":"org.apache.kafka.connect.data.Timestamp","optional":true,"field":"c9_timestamp"}],"optional":false,"name":"foobar"},"payload":{"c1_int16":1,"c2_int32":2,"c3_int64":3,"c4_float":4.4,"c5_double":5.555,"c6_boolean":true,"c7_string":"c77777","c8_date":19109,"c9_timestamp":1651051906000}}
```

The message above is saved in the file `kafka_demo_files/message`, so you can send it to Kafka directly with the console producer.

```
./bin/kafka-console-producer.sh --topic topic1 --bootstrap-server localhost:9092 < ../kafka_demo_files/message
```

```{tip}
If you want to send messages without a schema but you don't have a Schema Registry, you can create the table in OpenMLDB first and set `auto.schema=true` in the Kafka connector; see the [kafka connect jdbc doc](https://github.com/4paradigm/OpenMLDB/blob/main/extensions/kafka-connect-jdbc/DEVELOP.md) for full details. This only works with the JsonConverter.
```

### Check Results

We can query OpenMLDB to check whether the insertion succeeded. The query script `kafka_demo_files/select.sql` is as follows:

```
set @@execute_mode='online';
use kafka_test;
select * from topic1;
```

Run the query script directly:

```
/work/openmldb/bin/openmldb --zk_cluster=127.0.0.1:2181 --zk_root_path=/openmldb --role=sql_client < ../kafka_demo_files/select.sql
```

![openmldb result](../../zh/use_case/images/kafka_openmldb_result.png)

## Debug

### Logs

The Kafka server log is `logs/server.log`; check it if the Kafka server does not work.

The connector log is `logs/connect.log`; check it if the producer failed or you cannot find the result in OpenMLDB.

### Reinitialization

If you hit an error, you can reinitialize the environment and retry.

To terminate Kafka, kill the two daemon processes:
```
ps axu|grep kafka | grep -v grep | awk '{print $2}' | xargs kill -9
```

To delete the data, see [TERMINATE THE KAFKA ENVIRONMENT](https://kafka.apache.org/quickstart#quickstart_kafkaterminate):
```
rm -rf /tmp/kafka-logs /tmp/kraft-combined-logs
```

Please do NOT kill the ZooKeeper process or delete `/tmp/zookeeper` here, because OpenMLDB uses the same ZooKeeper cluster. The ZooKeeper process and its data directory are handled when we reinitialize the OpenMLDB cluster:
```
/work/init.sh
```
Then create the database in OpenMLDB, start Kafka, and so on.

diff --git a/docs/en/use_case/lightgbm_demo.md b/docs/en/use_case/lightgbm_demo.md
deleted file mode 100644
index c1310fdea66..00000000000
--- a/docs/en/use_case/lightgbm_demo.md
+++ /dev/null
@@ -1,206 +0,0 @@
# OpenMLDB + LightGBM: Taxi Trip Duration Prediction

In this document, we take [the taxi trip duration prediction problem on Kaggle](https://www.kaggle.com/c/nyc-taxi-trip-duration/overview) as an example to demonstrate how to use OpenMLDB and LightGBM together to build a complete machine learning application.
Note that: (1) this case is based on the OpenMLDB cluster version; (2) this document uses the pre-compiled docker image. If you want to test in an OpenMLDB environment compiled and built by yourself, you need to configure and use our [Spark Distribution for Feature Engineering Optimization](https://github.com/4paradigm/spark). Please refer to the relevant documents on [compilation](https://openmldb.ai/docs/en/main/deploy/compile.html) (see the chapter "Spark Distribution Optimized for OpenMLDB") and the [installation and deployment documents](https://openmldb.ai/docs/en/main/deploy/install_deploy.html) (see the section [Deploy TaskManager](https://openmldb.ai/docs/en/main/deploy/install_deploy.html#deploy-taskmanager)).

### 1. Preparation and Preliminary Knowledge

#### 1.1. Pull and Start the OpenMLDB Docker Image

- Note: Please make sure that the Docker Engine version is >= 18.03.

- Pull the OpenMLDB docker image and run the corresponding container:

```bash
docker run -it 4pdosc/openmldb:0.8.4 bash
```

The image is preinstalled with OpenMLDB and preloaded with all scripts, third-party libraries, open-source tools and training data required for this case.

```{note}
Note that all the commands below run in the docker container by default, and are assumed to be in the default directory (`/work/taxi-trip`).
```

#### 1.2. Initialize the Environment

```bash
./init.sh
cd taxi-trip
```

The init.sh script provided in the image helps users quickly initialize the environment, including:

- Configuring ZooKeeper

- Starting the cluster version of OpenMLDB

#### 1.3. Start the OpenMLDB CLI Client

```bash
/work/openmldb/bin/openmldb --zk_cluster=127.0.0.1:2181 --zk_root_path=/openmldb --role=sql_client
```

```{note}
Note that most of the commands in this tutorial are executed under the OpenMLDB CLI. To distinguish them from ordinary shell commands, the commands executed under the OpenMLDB CLI use a special prompt of >.
```

#### 1.4. Preliminary Knowledge: Non-Blocking Tasks of the Cluster Version

Some commands in the cluster version are non-blocking tasks, including `LOAD DATA` in online mode and the `LOAD DATA`, `SELECT`, and `SELECT INTO` commands in offline mode. After submitting a task, you can use commands such as `SHOW JOBS` and `SHOW JOB` to view the task progress. For details, see the offline task management document.

### 2. Machine Learning Based on OpenMLDB and LightGBM

#### 2.1. Create Databases and Data Tables

The following commands are executed in the OpenMLDB CLI environment.

```sql
> CREATE DATABASE demo_db;
> USE demo_db;
> CREATE TABLE t1(id string, vendor_id int, pickup_datetime timestamp, dropoff_datetime timestamp, passenger_count int, pickup_longitude double, pickup_latitude double, dropoff_longitude double, dropoff_latitude double, store_and_fwd_flag string, trip_duration int);
```

#### 2.2. Offline Data Preparation

First, you need to switch to offline execution mode. Then, import the sample data `/work/taxi-trip/data/taxi_tour_table_train_simple.snappy.parquet` as offline data for offline feature calculation.

The following commands are executed under the OpenMLDB CLI.

```sql
> USE demo_db;
> SET @@execute_mode='offline';
> LOAD DATA INFILE '/work/taxi-trip/data/taxi_tour_table_train_simple.snappy.parquet' INTO TABLE t1 options(format='parquet', header=true, mode='append');
```

```{note}
Note that `LOAD DATA` is a non-blocking task.
You can use the command `SHOW JOBS` to view the running status of the task. Please wait for the task to run successfully (`state` becomes `FINISHED`) before proceeding to the next step.
```

#### 2.3. The Feature Extraction Script

Usually, users need to analyze the data according to the goal of machine learning before designing features, and then design and investigate features according to the analysis. Data analysis and feature research for machine learning are beyond the scope of this document and will not be expanded on here. We assume that users already have basic theoretical knowledge of machine learning, the ability to solve machine learning problems, and the ability to understand SQL syntax and use it to construct features.

For this case, the user has designed the following features after analysis and research:

| Feature Name    | Feature Meaning                                                                               | SQL Feature Representation              |
| --------------- | --------------------------------------------------------------------------------------------- | --------------------------------------- |
| trip_duration   | Trip duration of a single trip                                                                 | `trip_duration`                          |
| passenger_count | Number of passengers                                                                           | `passenger_count`                        |
| vendor_sum_pl   | Sum of `pickup_latitude` over trips of the same vendor within the past 1-day window            | `sum(pickup_latitude) OVER w`            |
| vendor_max_pl   | Max of `pickup_latitude` over trips of the same vendor within the past 1-day window            | `max(pickup_latitude) OVER w`            |
| vendor_min_pl   | Min of `pickup_latitude` over trips of the same vendor within the past 1-day window            | `min(pickup_latitude) OVER w`            |
| vendor_avg_pl   | Average of `pickup_latitude` over trips of the same vendor within the past 1-day window        | `avg(pickup_latitude) OVER w`            |
| pc_sum_pl       | Sum of `pickup_latitude` over trips with the same passenger count within the past 1-day window | `sum(pickup_latitude) OVER w2`           |
| pc_max_pl       | Max of `pickup_latitude` over trips with the same passenger count within the past 1-day window | `max(pickup_latitude) OVER w2`           |
| pc_min_pl       | Min of `pickup_latitude` over trips with the same passenger count within the past 1-day window | `min(pickup_latitude) OVER w2`           |
| pc_avg_pl       | Average of `pickup_latitude` over trips with the same passenger count within the past 1-day window | `avg(pickup_latitude) OVER w2`       |
| pc_cnt          | Total number of trips with the same passenger count within the past 1-day window               | `count(vendor_id) OVER w2`               |
| vendor_cnt      | Total number of trips of the same vendor within the past 1-day window                          | `count(vendor_id) OVER w AS vendor_cnt`  |
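To make the window semantics concrete, here is a small pandas illustration with made-up rows (not part of the demo data) of what `sum(pickup_latitude) OVER w` computes: for each row, aggregate over rows of the same `vendor_id` whose `pickup_datetime` lies within the preceding 1 day up to the current row (behavior at the exact 1-day edge may differ slightly between pandas and OpenMLDB):

```python
import pandas as pd

# Three made-up trips of one vendor; the third is more than 1 day after the first.
df = pd.DataFrame({
    "vendor_id": [1, 1, 1],
    "pickup_datetime": pd.to_datetime(
        ["2016-01-01 10:00", "2016-01-01 18:00", "2016-01-02 17:00"]),
    "pickup_latitude": [40.77, 40.75, 40.76],
}).sort_values("pickup_datetime").set_index("pickup_datetime")

# Rough pandas analogue of ROWS_RANGE BETWEEN 1d PRECEDING AND CURRENT ROW
df["vendor_sum_pl"] = (df.groupby("vendor_id")["pickup_latitude"]
                         .transform(lambda s: s.rolling("1d").sum()))
print(df)  # the third row sums only the trips within its trailing day
```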
#### 2.4. Offline Feature Extraction

In offline mode, the user extracts features and outputs the results to the `/tmp/feature_data` directory for subsequent model training. The `SELECT` command below corresponds to the SQL feature extraction script generated from the above table. The following commands are executed under the OpenMLDB CLI.

```sql
> USE demo_db;
> SET @@execute_mode='offline';
> SELECT trip_duration, passenger_count,
sum(pickup_latitude) OVER w AS vendor_sum_pl,
max(pickup_latitude) OVER w AS vendor_max_pl,
min(pickup_latitude) OVER w AS vendor_min_pl,
avg(pickup_latitude) OVER w AS vendor_avg_pl,
sum(pickup_latitude) OVER w2 AS pc_sum_pl,
max(pickup_latitude) OVER w2 AS pc_max_pl,
min(pickup_latitude) OVER w2 AS pc_min_pl,
avg(pickup_latitude) OVER w2 AS pc_avg_pl,
count(vendor_id) OVER w2 AS pc_cnt,
count(vendor_id) OVER w AS vendor_cnt
FROM t1
WINDOW w AS (PARTITION BY vendor_id ORDER BY pickup_datetime ROWS_RANGE BETWEEN 1d PRECEDING AND CURRENT ROW),
w2 AS (PARTITION BY passenger_count ORDER BY pickup_datetime ROWS_RANGE BETWEEN 1d PRECEDING AND CURRENT ROW) INTO OUTFILE '/tmp/feature_data';
```

Note that `SELECT INTO` in the cluster version is a non-blocking task. You can use the command `SHOW JOBS` to view the running status of the task. Please wait for the task to run successfully (`state` becomes `FINISHED`) before proceeding to the next step.

#### 2.5. Model Training

1. Model training is not carried out in OpenMLDB, so first exit the OpenMLDB CLI with the `quit` command.

```bash
> quit
```

2. Then, in the command line, execute train.py. It uses the open-source training tool `lightgbm` to train the model based on the offline features generated in the previous step, and the training result is stored in `/tmp/model.txt`.

```bash
python3 train.py /tmp/feature_data /tmp/model.txt
```

#### 2.6. Online SQL Deployment

Assuming that the model trained on the features designed in Section 2.3 meets expectations, the next step is to deploy the feature extraction SQL script online to provide real-time feature extraction.

1. Restart the OpenMLDB CLI for online SQL deployment:

```bash
/work/openmldb/bin/openmldb --zk_cluster=127.0.0.1:2181 --zk_root_path=/openmldb --role=sql_client
```

2. Execute the online deployment; the following commands are executed in the OpenMLDB CLI.

```sql
> USE demo_db;
> SET @@execute_mode='online';
> DEPLOY demo OPTIONS(RANGE_BIAS='inf', ROWS_BIAS='inf') SELECT trip_duration, passenger_count,
sum(pickup_latitude) OVER w AS vendor_sum_pl,
max(pickup_latitude) OVER w AS vendor_max_pl,
min(pickup_latitude) OVER w AS vendor_min_pl,
avg(pickup_latitude) OVER w AS vendor_avg_pl,
sum(pickup_latitude) OVER w2 AS pc_sum_pl,
max(pickup_latitude) OVER w2 AS pc_max_pl,
min(pickup_latitude) OVER w2 AS pc_min_pl,
avg(pickup_latitude) OVER w2 AS pc_avg_pl,
count(vendor_id) OVER w2 AS pc_cnt,
count(vendor_id) OVER w AS vendor_cnt
FROM t1
WINDOW w AS (PARTITION BY vendor_id ORDER BY pickup_datetime ROWS_RANGE BETWEEN 1d PRECEDING AND CURRENT ROW),
w2 AS (PARTITION BY passenger_count ORDER BY pickup_datetime ROWS_RANGE BETWEEN 1d PRECEDING AND CURRENT ROW);
```

#### 2.7. Online Data Import

We need to import data for real-time feature extraction. First, switch to **online** execution mode. Then, in online mode, import the sample data `/work/taxi-trip/data/taxi_tour_table_train_simple.csv` as the online data source. The following commands are executed under the OpenMLDB CLI.

```sql
> USE demo_db;
> SET @@execute_mode='online';
> LOAD DATA INFILE 'file:///work/taxi-trip/data/taxi_tour_table_train_simple.csv' INTO TABLE t1 options(format='csv', header=true, mode='append');
```

Note that `LOAD DATA` in the cluster version is a non-blocking task.
You can use the command `SHOW JOBS` to view the running status of the task. Please wait for the task to run successfully (`state` becomes `FINISHED`) before proceeding to the next step.

#### 2.8. Start the Online Prediction Service

1. If you have not exited the OpenMLDB CLI, exit it with the `quit` command.
2. Start the prediction service from the command line:

```
./start_predict_server.sh 127.0.0.1:9080 /tmp/model.txt
```

#### 2.9. Send a Real-Time Request

The `predict.py` script sends a line of request data to the prediction service, receives the returned result, and prints it out.

```bash
# Run inference with an HTTP request
python3 predict.py
# The following output is expected (the numbers might be slightly different)
----------------ins---------------
[[ 2.       40.774097 40.774097 40.774097 40.774097 40.774097 40.774097
  40.774097 40.774097  1.        1.      ]]
---------------predict trip_duration -------------
848.014745715936 s
```

diff --git a/docs/en/use_case/pulsar_connector_demo.md b/docs/en/use_case/pulsar_connector_demo.md
deleted file mode 100644
index dd3733d291b..00000000000
--- a/docs/en/use_case/pulsar_connector_demo.md
+++ /dev/null
@@ -1,270 +0,0 @@
# Importing Real-Time Data Streams from Pulsar

## Introduction

Apache Pulsar is a cloud-native, distributed messaging and streaming platform. It can be used as an online data source for OpenMLDB to import real-time data streams. You can learn more about Pulsar from the project website [https://pulsar.apache.org/](https://pulsar.apache.org/). We have developed an OpenMLDB JDBC Connector that works seamlessly with Pulsar. In this document, you will learn the concepts and usage of this connector.

Note that, for the sake of simplicity, this document uses Pulsar standalone, an OpenMLDB cluster, and a simple JSON message producer to show how the OpenMLDB JDBC Connector works. The connector also works well with a Pulsar cluster.

## Overview

### Download

- You can download the entire demo package [here](https://openmldb.ai/download/pulsar-connector/files.tar.gz); it contains everything needed by this demo, including the connector nar, schema files, and config files.

- If you would like to download the connector only, you can [download it here](https://github.com/4paradigm/OpenMLDB/releases/download/v0.4.4/pulsar-io-jdbc-openmldb-2.11.0-SNAPSHOT.nar) from the OpenMLDB releases.

### Workflow

The figure below summarizes the workflow of using this connector; we explain the details later. We have also recorded the steps on a [terminalizer page](https://terminalizer.com/view/be2309235671) for easy reference, and you can download the demo script [demo.yml](https://github.com/vagetablechicken/pulsar-openmldb-connector-demo/blob/main/demo.yml) as well.
![demo steps](images/demo_steps.png)


## Step 1
### Create the OpenMLDB Cluster
The simplest way is to start it with docker. We also need to create a test table; see [Get started with cluster version of OpenMLDB](https://openmldb.ai/docs/en/v0.5/quickstart/openmldb_quickstart.html#get-started-with-cluster-version-of-openmldb).
```{caution}
Only the OpenMLDB cluster mode can be the sink destination, and data is written only to online storage.
```

We recommend that you use the 'host network' to run docker, and bind the 'files' volume too, since the SQL scripts are in it.
-``` -docker run -dit --network host -v `pwd`/files:/work/pulsar_files --name openmldb 4pdosc/openmldb:0.8.4 bash -docker exec -it openmldb bash -``` -```{note} -Even the host network, docker on macOS cannot support connecting to the container from the host. You only can connect openmldb cluster in other containers, like pulsar container. -``` -In OpenMLDB container, start the cluster: -``` -./init.sh -``` -### Create table -We use a script to create the table, create.sql content: -``` -create database pulsar_test; -use pulsar_test; -create table connector_test(id string, vendor_id int, pickup_datetime bigint, dropoff_datetime bigint, passenger_count int, pickup_longitude double, pickup_latitude double, dropoff_longitude double, dropoff_latitude double, store_and_fwd_flag string, trip_duration int); -desc connector_test; -``` -Run the script: -``` -/work/openmldb/bin/openmldb --zk_cluster=127.0.0.1:2181 --zk_root_path=/openmldb --role=sql_client < /work/pulsar_files/create.sql -``` - -![table desc](images/table.png) - -```{note} -JSONSchema and JDBC base connector can't support 'java.sql.Timestamp' now. So we use 'bigint' to be the timestamp column type(it works in OpenMLDB). -``` -## Step 2 -### Start Pulsar Standalone -It’s simpler and quicker to run Pulsar in docker. - -We **recommend** that you use 'host network' to run docker, to avoid network problems about docker containers. - -And we need to use pulsar-admin to create a sink, it’s in the docker container. So we will run the container in bash first, and run cmds in it. - -Don’t forget to bind the dir ‘files’. - -``` -docker run -dit --network host -v `pwd`/files:/pulsar/files --name pulsar apachepulsar/pulsar:2.9.1 bash -docker exec -it pulsar bash -``` - -In Pulsar container, start the pulsar standalone server. -``` -bin/pulsar-daemon start standalone --zookeeper-port 5181 -``` -```{note} -OpenMLDB want to use the port 2181, so we should change the zk port here. We will use zk port 2181 to connect OpenMLDB, but zk port in Pulsar standalone won’t affect anything. -``` -You can `ps` to check if the pulsar runs well. If failed, check the standalone server log `logs/pulsar-standalone-....log`. -``` -ps axu|grep pulsar -``` - -When you start a local standalone cluster, a public/default namespace is created automatically. The namespace is used for development purposes, ref [pulsar doc](https://pulsar.apache.org/docs/en/2.9.0/standalone/#start-pulsar-standalone). - -**We will create the sink in the namespace**. - -```{seealso} -If you really want to start pulsar locally, see [Set up a standalone Pulsar locally](https://pulsar.apache.org/docs/en/standalone/). -``` -#### Q&A -Q: -``` -2022-04-07T03:15:59,289+0000 [main] INFO org.apache.zookeeper.server.NIOServerCnxnFactory - binding to port 0.0.0.0/0.0.0.0:5181 -2022-04-07T03:15:59,289+0000 [main] ERROR org.apache.pulsar.zookeeper.LocalBookkeeperEnsemble - Exception while instantiating ZooKeeper -java.net.BindException: Address already in use -``` -How to fix it? -A: Pulsar wants an unused address to start zk server,5181 is used too. Change another port in '--zookeeper-port'. - -Q: 8080 is already used? -A: change the port 'webServicePort' in `conf/standalone.conf`. Don’t forget the 'webServiceUrl' in `conf/client.conf`, pulsar-admin needs the conf. - -Q: 6650 is already used? -A: change 'brokerServicePort' in `conf/standalone.conf` and 'brokerServiceUrl' in `conf/client.conf`. - -### Connector installation(Optional) -In the previous step, we bind mount ‘files’, the connector nar is in it. 
We'll use the 'non built-in connector' mode to set up the connector (using 'archive' in the sink config).

If you really want the connector to be a built-in connector, copy it to 'connectors':
```
mkdir connectors
cp files/pulsar-io-jdbc-openmldb-2.11.0-SNAPSHOT.nar connectors/
```
If you want to change or add connectors, you can reload them while Pulsar standalone is running:
```
bin/pulsar-admin sinks reload
```

The built-in OpenMLDB connector's sink type is 'jdbc-openmldb'.

### Create the Sink
We use the 'public/default' namespace to create the sink, for which we need a sink config file, `files/pulsar-openmldb-jdbc-sink.yaml`, with the content:
```
 tenant: "public"
 namespace: "default"
 name: "openmldb-test-sink"
 archive: "files/pulsar-io-jdbc-openmldb-2.11.0-SNAPSHOT.nar"
 inputs: ["test_openmldb"]
 configs:
  jdbcUrl: "jdbc:openmldb:///pulsar_test?zk=localhost:2181&zkPath=/openmldb"
  tableName: "connector_test"
```
```{describe}
'name' is the sink name.

'archive' points to the sink connector nar, so we use the OpenMLDB connector as a non built-in connector.

'inputs' lists the topic names; we use only one here.

'configs' is the JDBC configuration used to connect to the OpenMLDB cluster.
```

Then create the sink and check it; note that the input topic is 'test_openmldb'.
```
./bin/pulsar-admin sinks create --sink-config-file files/pulsar-openmldb-jdbc-sink.yaml
./bin/pulsar-admin sinks status --name openmldb-test-sink
```
![init sink status](images/init_sink_status.png)

### Create the Schema
Upload the schema to topic 'test_openmldb'; the schema type is JSON. We'll produce JSON messages with the same schema later. The schema file is 'files/openmldb-table-schema'.
Schema content:
```
 {
 "type": "JSON",
 "schema":"{\"type\":\"record\",\"name\":\"OpenMLDBSchema\",\"namespace\":\"com.foo\",\"fields\":[{\"name\":\"id\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"vendor_id\",\"type\":\"int\"},{\"name\":\"pickup_datetime\",\"type\":\"long\"},{\"name\":\"dropoff_datetime\",\"type\":\"long\"},{\"name\":\"passenger_count\",\"type\":\"int\"},{\"name\":\"pickup_longitude\",\"type\":\"double\"},{\"name\":\"pickup_latitude\",\"type\":\"double\"},{\"name\":\"dropoff_longitude\",\"type\":\"double\"},{\"name\":\"dropoff_latitude\",\"type\":\"double\"},{\"name\":\"store_and_fwd_flag\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"trip_duration\",\"type\":\"int\"}]}",
 "properties": {}
 }
```

Upload the schema and check it with these commands:
```
./bin/pulsar-admin schemas upload test_openmldb -f ./files/openmldb-table-schema
./bin/pulsar-admin schemas get test_openmldb
```
For demonstration purposes, the fields part is omitted in the output. The result is as follows:
![topic schema](images/topic_schema.png)
## Test
### Send Messages
We use the first 2 rows of the sample data (in the OpenMLDB docker image, `data/taxi_tour_table_train_simple.csv`) as the test messages, as shown below.

![test data](images/test_data.png)

#### Java Producer
The producer Java code is in the [demo producer](https://github.com/vagetablechicken/pulsar-client-java). The essential code is shown in this snippet: ![snippet](images/producer_code.png)

The producer sends the 2 messages to topic 'test_openmldb'. Pulsar then reads the messages and writes them to the OpenMLDB cluster's online storage.
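Either producer boils down to the same three steps: connect to the broker, create a producer on 'test_openmldb', and send the rows as JSON. A rough Python sketch of such a producer (an assumption based on the `pulsar-client` package; the shipped Java and Python demo clients may differ in detail, e.g. they may attach the uploaded JSON schema explicitly, and the row values below are placeholders, not the real sample rows):

```python
import json
import pulsar

client = pulsar.Client("pulsar://localhost:6650")  # default broker service URL
producer = client.create_producer("test_openmldb")

# Hypothetical row shaped like the table schema above
rows = [
    {"id": "id0000001", "vendor_id": 1, "pickup_datetime": 1467302350000,
     "dropoff_datetime": 1467304896000, "passenger_count": 1,
     "pickup_longitude": -73.98, "pickup_latitude": 40.76,
     "dropoff_longitude": -73.96, "dropoff_latitude": 40.77,
     "store_and_fwd_flag": "N", "trip_duration": 2546},
]
for row in rows:
    producer.send(json.dumps(row).encode("utf-8"))
client.close()
```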
- -``` -java -cp files/pulsar-client-java-1.0-SNAPSHOT-jar-with-dependencies.jar org.example.Client -``` - -#### Python Producer -You can write the Producer in Python, please check the code in `files/pulsar_client.py`. -Before run it, you should install the pulsar python client: -``` -pip3 install pulsar-client==2.9.1 -``` -Then run the producer: -``` -python3 files/pulsar_client.py -``` - -### Check -#### Check in Pulsar -We can check the sink status: -``` -./bin/pulsar-admin sinks status --name openmldb-test-sink -``` -![sink status](images/sink_status.png) -```{note} -"numReadFromPulsar": pulsar sent 2 messages to the sink instance. - -"numWrittenToSink": sink instance write 2 messages to OpenMLDB. -``` - -#### Check in OpenMLDB -And we can get these messages data in the OpenMLDB table’s **online storage** now. -The script select.sql content: -``` -set @@execute_mode='online'; -use pulsar_test; -select *, string(timestamp(pickup_datetime)), string(timestamp(dropoff_datetime)) from connector_test; -``` -In OpenMLDB container, run: -``` -/work/openmldb/bin/openmldb --zk_cluster=127.0.0.1:2181 --zk_root_path=/openmldb --role=sql_client < /work/pulsar_files/select.sql -``` -![openmldb result](images/openmldb_result.png) - -### Debug - -If the OpenMLDB table doesn't have the data, but the sinks status shows it has written to OpenMLDB, the sink instance may have some problems. You should check the sink log, the path is `logs/functions/public/default/openmldb-test-sink/openmldb-test-sink-0.log`. If you use another sink name, the path will change. - -Pulsar will retry to write the failed messages. So if you sent the wrong message 1 and then sent the right message 2, even the right message 2 has written to OpenMLDB, the wrong message 1 will be sent and print the error in log. It's confusing. We'd recommend you to truncate the topic before testing again. -``` -./bin/pulsar-admin topics truncate persistent://public/default/test_openmldb -``` -If you use another sink name, you can get it by `./bin/pulsar-admin topics list public/default`. - -#### debug log - -If the sink instance log is not enough, you can open the debug level of log. You should modify the log config, and restart the sink instance. - -`vim conf/functions_log4j2.xml` and modify it: - -```xml - - pulsar.log.level - debug - -``` -```xml - - ${sys:pulsar.log.level} - - ${sys:pulsar.log.appender} - ${sys:pulsar.log.level} - - -``` - -Then restart the sink instance: -``` -./bin/pulsar-admin sinks restart --name openmldb-test-sink -``` - -#### reinitialize Pulsar -``` -bin/pulsar-daemon stop standalone --zookeeper-port 5181 -rm -r data logs -bin/pulsar-daemon start standalone --zookeeper-port 5181 -```