Merge branch '4paradigm:main' into main
Matagits authored Apr 7, 2024
2 parents 07cf0bb + 0cef78a commit 39ac6ad
Showing 9 changed files with 208 additions and 108 deletions.
4 changes: 4 additions & 0 deletions docs/en/deploy/install_deploy.md
@@ -22,6 +22,10 @@ Generally, ldd version should be >= 2.17, and GLIBC_2.17 should be present in li

If you need to deploy ZooKeeper and TaskManager, you need a Java runtime environment.

The servers require Java 1.8 or above.

ZooKeeper Client 3.4.14 requires `Java 1.7` - `Java 13`. The Java SDK depends on this client, so it is bound by the same requirement and should not run on a higher Java version. If you wish to use zkCli, please use `Java 1.8` or `Java 11`.
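
For example, you can quickly check which Java runtime is active on a host before deploying (a minimal check; the exact output format varies by JDK vendor):

```bash
# Print the active Java version; it should report 1.8 or newer for the servers,
# and stay within 1.7 - 13 if this JVM also runs the Java SDK / ZooKeeper client.
java -version
```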

### Hardware

Regarding hardware requirements:
79 changes: 59 additions & 20 deletions docs/en/quickstart/sdk/python_sdk.md
@@ -116,6 +116,8 @@ cursor.close()

This section demonstrates the use of the Python SDK through OpenMLDB SQLAlchemy. Similarly, if any of the DBAPI interfaces fail, they will raise a `DatabaseError` exception. Users can catch and handle this exception as needed. The handling of return values should follow the SQLAlchemy standard.

The integrated SQLAlchemy defaults to version 2.0 while remaining compatible with the older 1.4. If your SQLAlchemy version is 1.4, adjust the interface names according to the [version differences](python_sdk.md#sqlalchemy-version-differences). OpenMLDB SDK versions 0.8.5 and earlier support only SQLAlchemy 1.4; versions later than 0.8.5 also support SQLAlchemy 2.0.
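
For instance, a script that must run against either SQLAlchemy version could branch on the installed version (a minimal sketch; `connection` is assumed to be created as shown in the Create Connection section below):

```python
import sqlalchemy

# Pick the raw-SQL entry point that matches the installed SQLAlchemy major version:
# connection.execute() for 1.4, connection.exec_driver_sql() for 2.0.
if sqlalchemy.__version__.startswith("1."):
    run_sql = connection.execute
else:
    run_sql = connection.exec_driver_sql

run_sql("SELECT 1;")
```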

### Create Connection

@@ -134,98 +136,135 @@ connection = engine.connect()
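
The connection-creation code itself is collapsed in this diff; a minimal sketch of it, assuming a local ZooKeeper at `127.0.0.1:2181` with root path `/openmldb` and a database named `db1`, would look roughly like this:

```python
import sqlalchemy as db

# Hypothetical cluster address and zkPath -- replace with your own deployment's values.
engine = db.create_engine("openmldb:///db1?zk=127.0.0.1:2181&zkPath=/openmldb")
connection = engine.connect()
```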

### Create Database

Use the `connection.execute()` interface to create database `db1`:
Use the `connection.exec_driver_sql()` interface to create database `db1`:

```python
try:
    connection.execute("CREATE DATABASE db1")
    connection.exec_driver_sql("CREATE DATABASE db1")
except Exception as e:
    print(e)

connection.execute("USE db1")
connection.exec_driver_sql("USE db1")
```

### Create Table

Use the `connection.execute()` interface to create table `t1`:
Use the `connection.exec_driver_sql()` interface to create table `t1`:

```python
try:
    connection.execute("CREATE TABLE t1 ( col1 bigint, col2 date, col3 string, col4 string, col5 int, index(key=col3, ts=col1))")
    connection.exec_driver_sql("CREATE TABLE t1 ( col1 bigint, col2 date, col3 string, col4 string, col5 int, index(key=col3, ts=col1))")
except Exception as e:
    print(e)
```

### Insert Data into Table

Use the `connection.execute(ddl)` interface to execute a SQL insert statement and insert data into the table:
Use the `connection.exec_driver_sql(ddl)` interface to execute a SQL insert statement and insert data into the table:

```python
try:
    connection.execute("INSERT INTO t1 VALUES(1000, '2020-12-25', 'guangdon', 'shenzhen', 1);")
    connection.exec_driver_sql("INSERT INTO t1 VALUES(1000, '2020-12-25', 'guangdon', 'shenzhen', 1);")
except Exception as e:
    print(e)
```

Use the `connection.execute(ddl, data)` interface to execute a SQL insert statement with placeholders. You can specify the inserted data dynamically and insert multiple rows:
Use the `connection.exec_driver_sql(ddl, data)` interface to execute a SQL insert statement with placeholders. You can specify the inserted data dynamically and insert multiple rows:

```python
try:
    insert = "INSERT INTO t1 VALUES(1002, '2020-12-27', ?, ?, 3);"
    connection.execute(insert, ({"col3":"fujian", "col4":"fuzhou"}))
    connection.execute(insert, [{"col3":"jiangsu", "col4":"nanjing"}, {"col3":"zhejiang", "col4":"hangzhou"}])
    connection.exec_driver_sql(insert, ({"col3":"fujian", "col4":"fuzhou"}))
    connection.exec_driver_sql(insert, [{"col3":"jiangsu", "col4":"nanjing"}, {"col3":"zhejiang", "col4":"hangzhou"}])
except Exception as e:
    print(e)
```

### Execute SQL Batch Query

Use the `connection.execute(sql)` interface to execute SQL batch query statements:
Use the `connection.exec_driver_sql(sql)` interface to execute SQL batch query statements:

```python
try:
    rs = connection.execute("SELECT * FROM t1")
    rs = connection.exec_driver_sql("SELECT * FROM t1")
    for row in rs:
        print(row)
    rs = connection.execute("SELECT * FROM t1 WHERE col3 = ?;", ('hefei'))
    rs = connection.execute("SELECT * FROM t1 WHERE col3 = ?;", [('hefei'), ('shanghai')])
    rs = connection.exec_driver_sql("SELECT * FROM t1 WHERE col3 = ?;", tuple(['hefei']))
except Exception as e:
    print(e)
```

### Execute SQL Query

Use the `connection.execute(sql, request)` interface to execute a SQL request query. You can put the input data into the second parameter of the execute function:
Use the `connection.exec_driver_sql(sql, request)` interface to execute a SQL request query. You can put the input data into the second parameter of the execute function:

```python
try:
    rs = connection.execute("SELECT * FROM t1", ({"col1":9999, "col2":'2020-12-27', "col3":'zhejiang', "col4":'hangzhou', "col5":100}))
    rs = connection.exec_driver_sql("SELECT * FROM t1", ({"col1":9999, "col2":'2020-12-27', "col3":'zhejiang', "col4":'hangzhou', "col5":100}))
except Exception as e:
    print(e)
```

### Delete Table

Use the `connection.execute(ddl)` interface to delete table `t1`:
Use the `connection.exec_driver_sql(ddl)` interface to delete table `t1`:

```python
try:
    connection.execute("DROP TABLE t1")
    connection.exec_driver_sql("DROP TABLE t1")
except Exception as e:
    print(e)
```

### Delete Database

Use the `connection.execute(ddl)` interface to delete database `db1`:
Use the `connection.exec_driver_sql(ddl)` interface to delete database `db1`:

```python
try:
    connection.execute("DROP DATABASE db1")
    connection.exec_driver_sql("DROP DATABASE db1")
except Exception as e:
    print(e)
```

### SQLAlchemy Version Differences

Differences in native SQL usage: SQLAlchemy 1.4 uses the method `connection.execute()`, while SQLAlchemy 2.0 uses the method `connection.exec_driver_sql()`. The general differences between these two methods are shown below; for more details, refer to the official documentation.

```python
# DDL Example1 - [SQLAlchemy 1.4]
connection.execute("CREATE TABLE t1 (col1 bigint, col2 date)")
# DDL Example1 - [SQLAlchemy 2.0]
connection.exec_driver_sql("CREATE TABLE t1 (col1 bigint, col2 date)")

# Insert Example1 - [SQLAlchemy 1.4]
connection.execute("INSERT INTO t1 VALUES(1000, '2020-12-25');")
connection.execute("INSERT INTO t1 VALUES(?, ?);", ({"col1":1001, "col2":"2020-12-26"}))
connection.execute("INSERT INTO t1 VALUES(?, ?);", [{"col1":1002, "col2":"2020-12-27"}])
# Insert Example1 - [SQLAlchemy 2.0]
connection.exec_driver_sql("INSERT INTO t1 VALUES(1000, '2020-12-25');")
connection.exec_driver_sql("INSERT INTO t1 VALUES(?, ?);", ({"col1":1001, "col2":"2020-12-26"}))
connection.exec_driver_sql("INSERT INTO t1 VALUES(?, ?);", [{"col1":1002, "col2":"2020-12-27"}])

# Query Example1 - [SQLAlchemy 1.4] - Native SQL Query
connection.execute("select * from t1 where col3 = ?;", 'hefei')
connection.execute("select * from t1 where col3 = ?;", ['hefei'])
connection.execute("select * from t1 where col3 = ?;", [('hefei')])
# Query Example1 - [SQLAlchemy 2.0] - Native SQL Query
connection.exec_driver_sql("select * from t1 where col3 = ?;", tuple(['hefei']))

# Query Example2 - [SQLAlchemy 1.4] - ORM Query
connection.execute(select([self.test_table]))
# Query Example2 - [SQLAlchemy 2.0] - ORM Query
connection.execute(select(self.test_table))

# Query Example3 - [SQLAlchemy 1.4] - SQL Request Query
connection.execute("SELECT * FROM t1", ({"col1":9999, "col2":'2020-12-28'}))
# Query Example3 - [SQLAlchemy 2.0] - SQL Request Query
connection.exec_driver_sql("SELECT * FROM t1", ({"col1":9999, "col2":'2020-12-28'}))

```

## Notebook Magic Function

The OpenMLDB Python SDK supports Notebook magic function extensions. Use the following statement to register the function.
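
The registration statement itself is collapsed in this diff. Based on other OpenMLDB Python SDK documentation it is roughly the following; treat the connection arguments and the `openmldb.sql_magic.register` call as assumptions to verify against the full file:

```python
import openmldb

# Assumed API and placeholder cluster address -- verify against the full python_sdk.md.
db = openmldb.dbapi.connect(database="db1", zk="127.0.0.1:2181", zkPath="/openmldb")
openmldb.sql_magic.register(db)
```
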
35 changes: 32 additions & 3 deletions docs/zh/deploy/conf.md
@@ -273,13 +273,15 @@ batchjob.jar.path=
namenode.uri=
offline.data.prefix=file:///tmp/openmldb_offline_storage/
hadoop.conf.dir=
hadoop.user.name=
#enable.hive.support=false
```

### Spark Config Details

The key configurations to pay attention to in the Spark Config are as follows:

<a id="about-config-env"></a>
```{note}
Understand the relationship between configuration items and environment variables.
@@ -307,7 +309,21 @@ The TaskManager only accepts `local` and its variants, `yarn`, `yarn-cluster`, `yarn-client

In local mode, the Spark job runs locally (on the host where the TaskManager is located). This mode does not require much configuration; only two points need attention:
- The offline table storage path `offline.data.prefix` defaults to `file:///tmp/openmldb_offline_storage/`, i.e. the `/tmp` directory on the TaskManager host. You can change this configuration to another directory.
- It can be configured as an HDFS path. Doing so requires setting, **before starting the TaskManager**, the environment variable `HADOOP_CONF_DIR` to the directory containing the Hadoop configuration files (note that this is an environment variable, not a TaskManager configuration item); the directory should contain Hadoop configuration files such as `core-site.xml` and `hdfs-site.xml`. See the [Spark documentation](https://spark.apache.org/docs/3.2.1/configuration.html#inheriting-hadoop-cluster-configuration) for more.
- It can be configured as an HDFS path. If it is, the variables `hadoop.conf.dir` and `hadoop.user.name` must be configured correctly. `hadoop.conf.dir` is the directory containing the Hadoop configuration files (note that this directory is on the TaskManager node; it should contain Hadoop configuration files such as `core-site.xml` and `hdfs-site.xml`; see the [Spark documentation](https://spark.apache.org/docs/3.2.1/configuration.html#inheriting-hadoop-cluster-configuration) for more), and `hadoop.user.name` is the Hadoop user to run as. These two variables can be configured in any of the following three ways:
  1. Configure the variables `hadoop.conf.dir` and `hadoop.user.name` in the `conf/taskmanager.properties` configuration file (see the sketch after the note below)
  2. Set the environment variables `HADOOP_CONF_DIR` and `HADOOP_USER_NAME` (on the TaskManager node) **before starting the TaskManager**
  3. Copy the Hadoop configuration files (`core-site.xml`, `hdfs-site.xml`, etc.) into the `{spark.home}/conf` directory
  > An sbin deployment cannot pass through unspecified variables; currently the TaskManager only forwards the environment variables `SPARK_HOME` and `RUNNER_JAVA_HOME`. So for an sbin deployment, prefer the first method.
  >
  > If you use the second method, the environment variables `HADOOP_CONF_DIR` and `HADOOP_USER_NAME` should preferably be made permanent. If you do not want them to be permanent, you can set them temporarily in one session and then start the TaskManager, for example:
  > ```bash
  > cd <OpenMLDB deployment root directory>
  > export HADOOP_CONF_DIR=<replace with the Hadoop configuration directory>
  > export HADOOP_USER_NAME=<replace with the Hadoop user name>
  > bash bin/start.sh start taskmanager
  > ```
  >
  > For the scope in which environment variables take effect, see <a href="#about-config-env">Understand the relationship between configuration items and environment variables</a>.
```{note}
An HDFS path currently requires `namenode.uri` to be configured; when dropping an offline table, the HDFS FileSystem at `namenode.uri` is connected to and the offline table's storage directory (Offline Table Path) is deleted. This configuration item will be deprecated in the future.
```
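
For example, with method 1 the relevant entries in `conf/taskmanager.properties` might look like the following (all paths and the user name are placeholders for illustration):

```properties
# Placeholder values -- point hadoop.conf.dir at the directory on the TaskManager node
# that holds core-site.xml / hdfs-site.xml, and set the Hadoop user to run as.
hadoop.conf.dir=/opt/hadoop/etc/hadoop
hadoop.user.name=openmldb
# Offline data stored on HDFS instead of the local /tmp default.
offline.data.prefix=hdfs:///openmldb/offline_storage/
```
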
@@ -319,9 +335,22 @@ In local mode, the Spark job runs locally (on the TaskManager host); this mode


##### yarn/yarn-cluster Mode

"yarn" and "yarn-cluster" are the same mode: the Spark job runs on a Yarn cluster. This mode requires a fair number of configuration parameters, mainly including:
- **Before starting the TaskManager**, set the environment variable `HADOOP_CONF_DIR` to the directory containing the Hadoop and Yarn configuration files. The directory should contain Hadoop's `core-site.xml`, `hdfs-site.xml`, Yarn's `yarn-site.xml`, and other configuration files; see the [Spark documentation](https://spark.apache.org/docs/3.2.1/running-on-yarn.html#launching-spark-on-yarn).
- Correctly configure the variables `hadoop.conf.dir` and `hadoop.user.name`. `hadoop.conf.dir` is the directory containing the Hadoop and Yarn configuration files (note that this directory is on the TaskManager node; it should contain Hadoop's `core-site.xml`, `hdfs-site.xml`, `yarn-site.xml`, and other configuration files; see the [Spark documentation](https://spark.apache.org/docs/3.2.1/running-on-yarn.html#launching-spark-on-yarn)), and `hadoop.user.name` is the Hadoop user to run as. These two variables can be configured in any of the following three ways:
  1. Configure the variables `hadoop.conf.dir` and `hadoop.user.name` in the `conf/taskmanager.properties` configuration file
  2. Set the environment variables `HADOOP_CONF_DIR` and `HADOOP_USER_NAME` (on the TaskManager node) **before starting the TaskManager**
  3. Copy the Hadoop and Yarn configuration files (`core-site.xml`, `hdfs-site.xml`, etc.) into the `{spark.home}/conf` directory
  > An sbin deployment cannot pass through unspecified variables; currently the TaskManager only forwards the environment variables `SPARK_HOME` and `RUNNER_JAVA_HOME`. So for an sbin deployment, prefer the first method.
  >
  > If you use the second method, the environment variables `HADOOP_CONF_DIR` and `HADOOP_USER_NAME` should preferably be made permanent. If you do not want them to be permanent, you can set them temporarily in one session and then start the TaskManager, for example:
  > ```bash
  > cd <OpenMLDB deployment root directory>
  > export HADOOP_CONF_DIR=<replace with the Hadoop configuration directory>
  > export HADOOP_USER_NAME=<replace with the Hadoop user name>
  > bash bin/start.sh start taskmanager
  > ```
  >
  > For the scope in which environment variables take effect, see <a href="#about-config-env">Understand the relationship between configuration items and environment variables</a>.
- `spark.yarn.jars` configures the location of the Spark runtime jars that Yarn needs to read; it must be an `hdfs://` address. You can upload the `jars` directory from the extracted [OpenMLDB Spark distribution](../../tutorial/openmldbspark_distribution.md) to HDFS and set this parameter to `hdfs://<hdfs_path>/jars/*` (note the wildcard). [If this parameter is not configured, Yarn packages and uploads `$SPARK_HOME/jars`, and does so for every offline job](https://spark.apache.org/docs/3.2.1/running-on-yarn.html#preparations), which is inefficient, so configuring it is recommended.
- `batchjob.jar.path` must be an HDFS path (down to the jar file name). Upload the batchjob jar to HDFS and set this parameter to the corresponding address, so that all Workers in the Yarn cluster can access the batchjob package.
- `offline.data.prefix` must be an HDFS path so that all Workers in the Yarn cluster can read and write data. It should use the address of the Hadoop cluster given by the `HADOOP_CONF_DIR` environment variable configured above. (The Yarn-related entries are sketched together after this list.)
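
For instance, the Yarn-related entries in `conf/taskmanager.properties` might look like the following (every HDFS path below is a placeholder to be replaced with your own cluster's locations):

```properties
# Placeholder HDFS locations -- substitute your own paths and jar file name.
spark.yarn.jars=hdfs:///openmldb/spark/jars/*
batchjob.jar.path=hdfs:///openmldb/batchjob/openmldb-batchjob.jar
offline.data.prefix=hdfs:///openmldb/offline_storage/
```
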
3 changes: 3 additions & 0 deletions docs/zh/deploy/install_deploy.md
@@ -21,6 +21,9 @@ strings /lib64/libc.so.6 | grep ^GLIBC_

If you need to deploy ZooKeeper and TaskManager, a Java runtime environment is required.

> Both servers require `Java 1.8` or above.
> ZooKeeper Client 3.4.14 requires `Java 1.7` - `Java 13`. The Java SDK also uses this client, so the same constraint applies and a higher Java version cannot be used. If you wish to use zkCli, `Java 1.8` or `Java 11` is recommended.
### Hardware

* CPU: