Merge branch '4paradigm:main' into main
Matagits authored Apr 7, 2024
2 parents 07cf0bb + 0cef78a commit 39ac6ad
Showing 9 changed files with 208 additions and 108 deletions.
4 changes: 4 additions & 0 deletions docs/en/deploy/install_deploy.md
@@ -22,6 +22,10 @@ Generally, ldd version should be >= 2.17, and GLIBC_2.17 should be present in li

If you need to deploy ZooKeeper and TaskManager, you need a Java runtime environment.

The servers require Java 1.8 or above.

ZooKeeper Client 3.4.14 requires `Java 1.7` - `Java 13`. The Java SDK depends on this client, so it is bound by the same requirement and should not run on a higher Java version. If you wish to use zkCli, please use `Java 1.8` or `Java 11`.
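
For example, you can quickly check which Java runtime is active on a host before deploying (a minimal check; the exact output format varies by JDK vendor):

```bash
# Print the active Java version; it should report 1.8 or newer for the servers,
# and stay within 1.7 - 13 if this JVM also runs the Java SDK / ZooKeeper client.
java -version
```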

### Hardware

Regarding hardware requirements:
79 changes: 59 additions & 20 deletions docs/en/quickstart/sdk/python_sdk.md
@@ -116,6 +116,8 @@ cursor.close()

This section demonstrates the use of the Python SDK through OpenMLDB SQLAlchemy. Similarly, if any of the DBAPI interfaces fail, they will raise a `DatabaseError` exception. Users can catch and handle this exception as needed. The handling of return values should follow the SQLAlchemy standard.

The integrated SQLAlchemy defaults to version 2.0 while remaining compatible with the older 1.4. If your SQLAlchemy version is 1.4, adjust the interface names according to the [version differences](python_sdk.md#sqlalchemy-version-differences). OpenMLDB SDK versions 0.8.5 and earlier support only SQLAlchemy 1.4; versions later than 0.8.5 also support SQLAlchemy 2.0.
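
For instance, a script that must run against either SQLAlchemy version could branch on the installed version (a minimal sketch; `connection` is assumed to be created as shown in the Create Connection section below):

```python
import sqlalchemy

# Pick the raw-SQL entry point that matches the installed SQLAlchemy major version:
# connection.execute() for 1.4, connection.exec_driver_sql() for 2.0.
if sqlalchemy.__version__.startswith("1."):
    run_sql = connection.execute
else:
    run_sql = connection.exec_driver_sql

run_sql("SELECT 1;")
```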

### Create Connection

@@ -134,98 +136,135 @@ connection = engine.connect()
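
The connection-creation code itself is collapsed in this diff; a minimal sketch of it, assuming a local ZooKeeper at `127.0.0.1:2181` with root path `/openmldb` and a database named `db1`, would look roughly like this:

```python
import sqlalchemy as db

# Hypothetical cluster address and zkPath -- replace with your own deployment's values.
engine = db.create_engine("openmldb:///db1?zk=127.0.0.1:2181&zkPath=/openmldb")
connection = engine.connect()
```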

### Create Database

Use the `connection.execute()` interface to create database `db1`:
Use the `connection.exec_driver_sql()` interface to create database `db1`:

```python
try:
    connection.execute("CREATE DATABASE db1")
    connection.exec_driver_sql("CREATE DATABASE db1")
except Exception as e:
    print(e)

connection.execute("USE db1")
connection.exec_driver_sql("USE db1")
```

### Create Table

Use the `connection.execute()` interface to create table `t1`:
Use the `connection.exec_driver_sql()` interface to create table `t1`:

```python
try:
    connection.execute("CREATE TABLE t1 ( col1 bigint, col2 date, col3 string, col4 string, col5 int, index(key=col3, ts=col1))")
    connection.exec_driver_sql("CREATE TABLE t1 ( col1 bigint, col2 date, col3 string, col4 string, col5 int, index(key=col3, ts=col1))")
except Exception as e:
    print(e)
```

### Insert Data into Table

Use the `connection.execute(ddl)` interface to execute a SQL insert statement and insert data into the table:
Use the `connection.exec_driver_sql(ddl)` interface to execute a SQL insert statement and insert data into the table:

```python
try:
    connection.execute("INSERT INTO t1 VALUES(1000, '2020-12-25', 'guangdon', 'shenzhen', 1);")
    connection.exec_driver_sql("INSERT INTO t1 VALUES(1000, '2020-12-25', 'guangdon', 'shenzhen', 1);")
except Exception as e:
    print(e)
```

Use the `connection.execute(ddl, data)` interface to execute a SQL insert statement with placeholders. You can specify the inserted data dynamically and insert multiple rows:
Use the `connection.exec_driver_sql(ddl, data)` interface to execute a SQL insert statement with placeholders. You can specify the inserted data dynamically and insert multiple rows:

```python
try:
    insert = "INSERT INTO t1 VALUES(1002, '2020-12-27', ?, ?, 3);"
    connection.execute(insert, ({"col3":"fujian", "col4":"fuzhou"}))
    connection.execute(insert, [{"col3":"jiangsu", "col4":"nanjing"}, {"col3":"zhejiang", "col4":"hangzhou"}])
    connection.exec_driver_sql(insert, ({"col3":"fujian", "col4":"fuzhou"}))
    connection.exec_driver_sql(insert, [{"col3":"jiangsu", "col4":"nanjing"}, {"col3":"zhejiang", "col4":"hangzhou"}])
except Exception as e:
    print(e)
```

### Execute SQL Batch Query

Use the `connection.execute(sql)` interface to execute SQL batch query statements:
Use the `connection.exec_driver_sql(sql)` interface to execute SQL batch query statements:

```python
try:
    rs = connection.execute("SELECT * FROM t1")
    rs = connection.exec_driver_sql("SELECT * FROM t1")
    for row in rs:
        print(row)
    rs = connection.execute("SELECT * FROM t1 WHERE col3 = ?;", ('hefei'))
    rs = connection.execute("SELECT * FROM t1 WHERE col3 = ?;", [('hefei'), ('shanghai')])
    rs = connection.exec_driver_sql("SELECT * FROM t1 WHERE col3 = ?;", tuple(['hefei']))
except Exception as e:
    print(e)
```

### Execute SQL Query

Use the `connection.execute(sql, request)` interface to execute a SQL request query. You can put the input data into the second parameter of the execute function:
Use the `connection.exec_driver_sql(sql, request)` interface to execute a SQL request query. You can put the input data into the second parameter of the execute function:

```python
try:
    rs = connection.execute("SELECT * FROM t1", ({"col1":9999, "col2":'2020-12-27', "col3":'zhejiang', "col4":'hangzhou', "col5":100}))
    rs = connection.exec_driver_sql("SELECT * FROM t1", ({"col1":9999, "col2":'2020-12-27', "col3":'zhejiang', "col4":'hangzhou', "col5":100}))
except Exception as e:
    print(e)
```

### Delete Table

Use the `connection.execute(ddl)` interface to delete table `t1`:
Use the `connection.exec_driver_sql(ddl)` interface to delete table `t1`:

```python
try:
    connection.execute("DROP TABLE t1")
    connection.exec_driver_sql("DROP TABLE t1")
except Exception as e:
    print(e)
```

### Delete Database

Use the `connection.execute(ddl)` interface to delete database `db1`:
Use the `connection.exec_driver_sql(ddl)` interface to delete database `db1`:

```python
try:
    connection.execute("DROP DATABASE db1")
    connection.exec_driver_sql("DROP DATABASE db1")
except Exception as e:
    print(e)
```

### SQLAlchemy Version Differences

Differences in native SQL usage: SQLAlchemy 1.4 uses the method `connection.execute()`, while SQLAlchemy 2.0 uses the method `connection.exec_driver_sql()`. The general differences between these two methods are shown below; for more details, refer to the official documentation.

```python
# DDL Example1 - [SQLAlchemy 1.4]
connection.execute("CREATE TABLE t1 (col1 bigint, col2 date)")
# DDL Example1 - [SQLAlchemy 2.0]
connection.exec_driver_sql("CREATE TABLE t1 (col1 bigint, col2 date)")

# Insert Example1 - [SQLAlchemy 1.4]
connection.execute("INSERT INTO t1 VALUES(1000, '2020-12-25');")
connection.execute("INSERT INTO t1 VALUES(?, ?);", ({"col1":1001, "col2":"2020-12-26"}))
connection.execute("INSERT INTO t1 VALUES(?, ?);", [{"col1":1002, "col2":"2020-12-27"}])
# Insert Example1 - [SQLAlchemy 2.0]
connection.exec_driver_sql("INSERT INTO t1 VALUES(1000, '2020-12-25');")
connection.exec_driver_sql("INSERT INTO t1 VALUES(?, ?);", ({"col1":1001, "col2":"2020-12-26"}))
connection.exec_driver_sql("INSERT INTO t1 VALUES(?, ?);", [{"col1":1002, "col2":"2020-12-27"}])

# Query Example1 - [SQLAlchemy 1.4] - Native SQL Query
connection.execute("select * from t1 where col3 = ?;", 'hefei')
connection.execute("select * from t1 where col3 = ?;", ['hefei'])
connection.execute("select * from t1 where col3 = ?;", [('hefei')])
# Query Example1 - [SQLAlchemy 2.0] - Native SQL Query
connection.exec_driver_sql("select * from t1 where col3 = ?;", tuple(['hefei']))

# Query Example2 - [SQLAlchemy 1.4] - ORM Query
connection.execute(select([self.test_table]))
# Query Example2 - [SQLAlchemy 2.0] - ORM Query
connection.execute(select(self.test_table))

# Query Example3 - [SQLAlchemy 1.4] - SQL Request Query
connection.execute("SELECT * FROM t1", ({"col1":9999, "col2":'2020-12-28'}))
# Query Example3 - [SQLAlchemy 2.0] - SQL Request Query
connection.exec_driver_sql("SELECT * FROM t1", ({"col1":9999, "col2":'2020-12-28'}))

```

## Notebook Magic Function

The OpenMLDB Python SDK supports Notebook magic function extensions. Use the following statement to register the function.
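
The registration statement itself is collapsed in this diff. Based on other OpenMLDB Python SDK documentation it is roughly the following; treat the connection arguments and the `openmldb.sql_magic.register` call as assumptions to verify against the full file:

```python
import openmldb

# Assumed API and placeholder cluster address -- verify against the full python_sdk.md.
db = openmldb.dbapi.connect(database="db1", zk="127.0.0.1:2181", zkPath="/openmldb")
openmldb.sql_magic.register(db)
```
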
35 changes: 32 additions & 3 deletions docs/zh/deploy/conf.md
@@ -273,13 +273,15 @@ batchjob.jar.path=
namenode.uri=
offline.data.prefix=file:///tmp/openmldb_offline_storage/
hadoop.conf.dir=
hadoop.user.name=
#enable.hive.support=false
```

### Spark Config Details

The key configurations to pay attention to in the Spark Config are as follows:

<a id="about-config-env"></a>
```{note}
Understand the relationship between configuration items and environment variables.
@@ -307,7 +309,21 @@ The TaskManager only accepts `local` and its variants, `yarn`, `yarn-cluster`, `yarn-client

In local mode, the Spark job runs locally (on the host where the TaskManager is located). This mode does not require much configuration; only two points need attention:
- The offline table storage path `offline.data.prefix` defaults to `file:///tmp/openmldb_offline_storage/`, i.e. the `/tmp` directory on the TaskManager host. You can change this configuration to another directory.
- It can be configured as an HDFS path. Doing so requires setting, **before starting the TaskManager**, the environment variable `HADOOP_CONF_DIR` to the directory containing the Hadoop configuration files (note that this is an environment variable, not a TaskManager configuration item); the directory should contain Hadoop configuration files such as `core-site.xml` and `hdfs-site.xml`. See the [Spark documentation](https://spark.apache.org/docs/3.2.1/configuration.html#inheriting-hadoop-cluster-configuration) for more.
- It can be configured as an HDFS path. If it is, the variables `hadoop.conf.dir` and `hadoop.user.name` must be configured correctly. `hadoop.conf.dir` is the directory containing the Hadoop configuration files (note that this directory is on the TaskManager node; it should contain Hadoop configuration files such as `core-site.xml` and `hdfs-site.xml`; see the [Spark documentation](https://spark.apache.org/docs/3.2.1/configuration.html#inheriting-hadoop-cluster-configuration) for more), and `hadoop.user.name` is the Hadoop user to run as. These two variables can be configured in any of the following three ways:
  1. Configure the variables `hadoop.conf.dir` and `hadoop.user.name` in the `conf/taskmanager.properties` configuration file (see the sketch after the note below)
  2. Set the environment variables `HADOOP_CONF_DIR` and `HADOOP_USER_NAME` (on the TaskManager node) **before starting the TaskManager**
  3. Copy the Hadoop configuration files (`core-site.xml`, `hdfs-site.xml`, etc.) into the `{spark.home}/conf` directory
  > An sbin deployment cannot pass through unspecified variables; currently the TaskManager only forwards the environment variables `SPARK_HOME` and `RUNNER_JAVA_HOME`. So for an sbin deployment, prefer the first method.
  >
  > If you use the second method, the environment variables `HADOOP_CONF_DIR` and `HADOOP_USER_NAME` should preferably be made permanent. If you do not want them to be permanent, you can set them temporarily in one session and then start the TaskManager, for example:
  > ```bash
  > cd <OpenMLDB deployment root directory>
  > export HADOOP_CONF_DIR=<replace with the Hadoop configuration directory>
  > export HADOOP_USER_NAME=<replace with the Hadoop user name>
  > bash bin/start.sh start taskmanager
  > ```
  >
  > For the scope in which environment variables take effect, see <a href="#about-config-env">Understand the relationship between configuration items and environment variables</a>.
```{note}
An HDFS path currently requires `namenode.uri` to be configured; when dropping an offline table, the HDFS FileSystem at `namenode.uri` is connected to and the offline table's storage directory (Offline Table Path) is deleted. This configuration item will be deprecated in the future.
```
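
For example, with method 1 the relevant entries in `conf/taskmanager.properties` might look like the following (all paths and the user name are placeholders for illustration):

```properties
# Placeholder values -- point hadoop.conf.dir at the directory on the TaskManager node
# that holds core-site.xml / hdfs-site.xml, and set the Hadoop user to run as.
hadoop.conf.dir=/opt/hadoop/etc/hadoop
hadoop.user.name=openmldb
# Offline data stored on HDFS instead of the local /tmp default.
offline.data.prefix=hdfs:///openmldb/offline_storage/
```
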
@@ -319,9 +335,22 @@ In local mode, the Spark job runs locally (on the TaskManager host); this mode


##### yarn/yarn-cluster Mode

"yarn" and "yarn-cluster" are the same mode: the Spark job runs on a Yarn cluster. This mode requires a fair number of configuration parameters, mainly including:
- **Before starting the TaskManager**, set the environment variable `HADOOP_CONF_DIR` to the directory containing the Hadoop and Yarn configuration files. The directory should contain Hadoop's `core-site.xml`, `hdfs-site.xml`, Yarn's `yarn-site.xml`, and other configuration files; see the [Spark documentation](https://spark.apache.org/docs/3.2.1/running-on-yarn.html#launching-spark-on-yarn).
- Correctly configure the variables `hadoop.conf.dir` and `hadoop.user.name`. `hadoop.conf.dir` is the directory containing the Hadoop and Yarn configuration files (note that this directory is on the TaskManager node; it should contain Hadoop's `core-site.xml`, `hdfs-site.xml`, `yarn-site.xml`, and other configuration files; see the [Spark documentation](https://spark.apache.org/docs/3.2.1/running-on-yarn.html#launching-spark-on-yarn)), and `hadoop.user.name` is the Hadoop user to run as. These two variables can be configured in any of the following three ways:
  1. Configure the variables `hadoop.conf.dir` and `hadoop.user.name` in the `conf/taskmanager.properties` configuration file
  2. Set the environment variables `HADOOP_CONF_DIR` and `HADOOP_USER_NAME` (on the TaskManager node) **before starting the TaskManager**
  3. Copy the Hadoop and Yarn configuration files (`core-site.xml`, `hdfs-site.xml`, etc.) into the `{spark.home}/conf` directory
  > An sbin deployment cannot pass through unspecified variables; currently the TaskManager only forwards the environment variables `SPARK_HOME` and `RUNNER_JAVA_HOME`. So for an sbin deployment, prefer the first method.
  >
  > If you use the second method, the environment variables `HADOOP_CONF_DIR` and `HADOOP_USER_NAME` should preferably be made permanent. If you do not want them to be permanent, you can set them temporarily in one session and then start the TaskManager, for example:
  > ```bash
  > cd <OpenMLDB deployment root directory>
  > export HADOOP_CONF_DIR=<replace with the Hadoop configuration directory>
  > export HADOOP_USER_NAME=<replace with the Hadoop user name>
  > bash bin/start.sh start taskmanager
  > ```
  >
  > For the scope in which environment variables take effect, see <a href="#about-config-env">Understand the relationship between configuration items and environment variables</a>.
- `spark.yarn.jars` configures the location of the Spark runtime jars that Yarn needs to read; it must be an `hdfs://` address. You can upload the `jars` directory from the extracted [OpenMLDB Spark distribution](../../tutorial/openmldbspark_distribution.md) to HDFS and set this parameter to `hdfs://<hdfs_path>/jars/*` (note the wildcard). [If this parameter is not configured, Yarn packages and uploads `$SPARK_HOME/jars`, and does so for every offline job](https://spark.apache.org/docs/3.2.1/running-on-yarn.html#preparations), which is inefficient, so configuring it is recommended.
- `batchjob.jar.path` must be an HDFS path (down to the jar file name). Upload the batchjob jar to HDFS and set this parameter to the corresponding address, so that all Workers in the Yarn cluster can access the batchjob package.
- `offline.data.prefix` must be an HDFS path so that all Workers in the Yarn cluster can read and write data. It should use the address of the Hadoop cluster given by the `HADOOP_CONF_DIR` environment variable configured above. (The Yarn-related entries are sketched together after this list.)
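
For instance, the Yarn-related entries in `conf/taskmanager.properties` might look like the following (every HDFS path below is a placeholder to be replaced with your own cluster's locations):

```properties
# Placeholder HDFS locations -- substitute your own paths and jar file name.
spark.yarn.jars=hdfs:///openmldb/spark/jars/*
batchjob.jar.path=hdfs:///openmldb/batchjob/openmldb-batchjob.jar
offline.data.prefix=hdfs:///openmldb/offline_storage/
```
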
3 changes: 3 additions & 0 deletions docs/zh/deploy/install_deploy.md
@@ -21,6 +21,9 @@ strings /lib64/libc.so.6 | grep ^GLIBC_

If you need to deploy ZooKeeper and TaskManager, a Java runtime environment is required.

> Both servers require `Java 1.8` or above.
> ZooKeeper Client 3.4.14 requires `Java 1.7` - `Java 13`. The Java SDK also uses this client, so the same constraint applies and a higher Java version cannot be used. If you wish to use zkCli, `Java 1.8` or `Java 11` is recommended.
### Hardware

* CPU: