feat: add read-write tidb sdk and update related usage docs (#3815)
* feat: add read-write tidb sdk and update related usage docs

* feat: optimize tidb document description

* feat: optimize tidb document description

---------

Co-authored-by: Yuan Haitao <[email protected]>
yht520100 and Yuan Haitao authored Mar 28, 2024
1 parent 1979328 commit 35e2088
Showing 5 changed files with 165 additions and 1 deletion.
3 changes: 2 additions & 1 deletion docs/en/integration/offline_data_sources/index.rst
@@ -7,4 +7,5 @@ Offline Data Source

hive
s3
iceberg
iceberg
tidb
80 changes: 80 additions & 0 deletions docs/en/integration/offline_data_sources/tidb.md
@@ -0,0 +1,80 @@
# TiDB

## Introduction

[TiDB](https://docs.pingcap.com/) is an open-source distributed relational database. Its key features include horizontal scaling, financial-grade high availability, real-time HTAP, a cloud-native architecture, and compatibility with the MySQL 5.7 protocol and ecosystem. OpenMLDB can use TiDB as an offline storage engine, both to import data and to export feature computation data.

## Usage

### Installation

[OpenMLDB Spark Distribution](../../tutorial/openmldbspark_distribution.md) v0.8.5 and later use TiSpark to interact with TiDB. The current release bundles the TiSpark 3.1.x dependencies (`tispark-assembly-3.2_2.12-3.1.5.jar`, `mysql-connector-java-8.0.29.jar`). If the bundled TiSpark version is not compatible with your TiDB version, refer to the [TiSpark documentation](https://docs.pingcap.com/tidb/stable/tispark-overview) for compatible dependencies and add them to Spark's classpath/jars.


### Configuration

The TiDB settings must be added to the Spark configuration. There are two ways to do so:

- taskmanager.properties(.template): Add the TiDB settings to the `spark.default.conf` property, then restart the taskmanager.
- CLI: Add the settings to the ini conf file and start the CLI with `--spark_conf`. For details, refer to [Client Spark Configuration File](../../reference/client_config/client_spark_config.md).

For details on TiDB configurations for TiSpark, refer to [TiSpark Configuration](https://docs.pingcap.com/tidb/stable/tispark-overview#tispark-configurations).

For example, configuration in `taskmanager.properties(.template)`:

```properties
spark.default.conf=spark.sql.extensions=org.apache.spark.sql.TiExtensions;spark.sql.catalog.tidb_catalog=org.apache.spark.sql.catalyst.catalog.TiCatalog;spark.sql.catalog.tidb_catalog.pd.addresses=127.0.0.1:2379;spark.tispark.pd.addresses=127.0.0.1:2379;spark.sql.tidb.addr=127.0.0.1;spark.sql.tidb.port=4000;spark.sql.tidb.user=root;spark.sql.tidb.password=root;
```

Once either configuration takes effect, access TiDB tables using the format `tidb_catalog.<db_name>.<table_name>`. If you do not want to write the `tidb_catalog` prefix, set `spark.sql.catalog.default=tidb_catalog` in the configuration. TiDB tables can then be accessed using the format `<db_name>.<table_name>`.
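
Because the `spark.default.conf` value is a single line of `key=value;` pairs, it is easy to mistype by hand. The following is a minimal Python sketch, not part of OpenMLDB, that assembles the line from a dictionary. The helper name `build_spark_default_conf` is hypothetical, and the addresses and credentials are the placeholder values from the example above.

```python
def build_spark_default_conf(settings):
    """Join Spark settings into the `key=value;` form used by spark.default.conf.

    Hypothetical helper for illustration only; it is not an OpenMLDB API.
    """
    return "spark.default.conf=" + "".join(f"{key}={value};" for key, value in settings.items())


# Placeholder values copied from the taskmanager.properties example.
tidb_settings = {
    "spark.sql.extensions": "org.apache.spark.sql.TiExtensions",
    "spark.sql.catalog.tidb_catalog": "org.apache.spark.sql.catalyst.catalog.TiCatalog",
    "spark.sql.catalog.tidb_catalog.pd.addresses": "127.0.0.1:2379",
    "spark.tispark.pd.addresses": "127.0.0.1:2379",
    "spark.sql.tidb.addr": "127.0.0.1",
    "spark.sql.tidb.port": 4000,
    "spark.sql.tidb.user": "root",
    "spark.sql.tidb.password": "root",
}

conf_line = build_spark_default_conf(tidb_settings)
print(conf_line)
```

Keeping the settings in a mapping makes it straightforward to swap in real PD and TiDB addresses before pasting the resulting line into `taskmanager.properties`.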

## Data Format

For the TiDB type system, see [TiDB Schema](https://docs.pingcap.com/tidb/stable/data-type-overview). Currently, only the following mappings between OpenMLDB and TiDB data formats are supported:

| OpenMLDB Data Format | TiDB Data Format |
|----------------------|-------------------------|
| BOOL | BOOL |
| SMALLINT | Currently not supported |
| INT | Currently not supported |
| BIGINT | BIGINT |
| FLOAT | FLOAT |
| DOUBLE | DOUBLE |
| DATE | DATE |
| TIMESTAMP | TIMESTAMP |
| STRING | VARCHAR(M) |

## Importing TiDB Data into OpenMLDB

Data from TiDB is imported with the [`LOAD DATA INFILE`](../../openmldb_sql/dml/LOAD_DATA_STATEMENT.md) statement, using a URI of the form `tidb://tidb_catalog.[db].[table]`. Note:

- Both offline and online engines can import TiDB data sources.
- TiDB import supports soft links, which avoid hard copies and ensure that OpenMLDB always reads the latest data from TiDB. To import through a soft link, set `deep_copy=false`.
- In `OPTIONS`, only `deep_copy`, `mode`, and `sql` are supported.

For example:

```sql
LOAD DATA INFILE 'tidb://tidb_catalog.db1.t1' INTO TABLE t1 OPTIONS(deep_copy=false);
```

Data loading also supports filtering TiDB table data with a SQL statement. The SQL must conform to SparkSQL syntax, and tables in the query are referenced by their registered names, without the `tidb://` prefix.

For example:

```sql
LOAD DATA INFILE 'tidb://tidb_catalog.db1.t1' INTO TABLE tidb_catalog.db1.t1 OPTIONS(deep_copy=true, sql='SELECT * FROM tidb_catalog.db1.t1 where key=\"foo\"')
```
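
The import rules above (a `tidb://` URI and only the `deep_copy`, `mode`, and `sql` options) can be sketched as a small statement builder. This is an illustrative Python sketch under stated assumptions, not an OpenMLDB client API: the names `load_data_statement`, `render_option`, and `SUPPORTED_OPTIONS` are hypothetical, and string option values are inserted as-is, so any quoting inside `sql` is left to the caller.

```python
SUPPORTED_OPTIONS = ("deep_copy", "mode", "sql")


def render_option(value):
    """Render booleans as true/false and any other value as a single-quoted string."""
    if isinstance(value, bool):
        return "true" if value else "false"
    return f"'{value}'"


def load_data_statement(tidb_table, target_table, **options):
    """Build a LOAD DATA INFILE statement for a TiDB source (hypothetical helper)."""
    unsupported = sorted(set(options) - set(SUPPORTED_OPTIONS))
    if unsupported:
        # The documentation lists only deep_copy, mode, and sql for TiDB import.
        raise ValueError(f"unsupported options for TiDB import: {unsupported}")
    rendered = ", ".join(f"{name}={render_option(value)}" for name, value in options.items())
    return f"LOAD DATA INFILE 'tidb://{tidb_table}' INTO TABLE {target_table} OPTIONS({rendered});"


statement = load_data_statement("tidb_catalog.db1.t1", "t1", deep_copy=False)
print(statement)
```

Rejecting unknown option names up front surfaces a typo such as `deepcopy` before the statement is submitted as a job.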

## Exporting OpenMLDB Offline Engine Data to TiDB

Data is exported from OpenMLDB to TiDB with the [`SELECT INTO`](../../openmldb_sql/dql/SELECT_INTO_STATEMENT.md) statement, using a URI of the form `tidb://tidb_catalog.[db].[table]`. Note:

- The database and table must already exist. Currently, automatic creation of non-existent databases or tables is not supported.
- In `OPTIONS`, only `mode` (the export mode) takes effect, and it is mandatory. All other options are ignored.

For example:

```sql
SELECT col1, col2, col3 FROM t1 INTO OUTFILE 'tidb://tidb_catalog.db1.t1' options(mode='append');
```
1 change: 1 addition & 0 deletions docs/zh/integration/offline_data_sources/index.rst
@@ -8,3 +8,4 @@
hive
s3
iceberg
tidb
80 changes: 80 additions & 0 deletions docs/zh/integration/offline_data_sources/tidb.md
@@ -0,0 +1,80 @@
# TiDB

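## Introduction

[TiDB](https://docs.pingcap.com/zh/) is an open-source distributed relational database. Its key features include horizontal scaling, financial-grade high availability, real-time HTAP, a cloud-native distributed architecture, and compatibility with the MySQL 5.7 protocol and the MySQL ecosystem. OpenMLDB supports TiDB as an offline storage engine for reading and exporting feature computation data.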

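## Usage

### Installation

[OpenMLDB Spark Distribution](../../tutorial/openmldbspark_distribution.md) v0.8.5 and later use TiSpark to operate on TiDB. The current release already includes the TiSpark 3.1.x dependencies (tispark-assembly-3.2_2.12-3.1.5.jar, mysql-connector-java-8.0.29.jar). If that TiSpark version is not compatible with your TiDB version, find and download the matching TiSpark dependencies from the [TiSpark documentation](https://docs.pingcap.com/zh/tidb/stable/tispark-overview) and add them to Spark's classpath/jars.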


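### Configuration

The TiDB settings must be added to the Spark configuration. There are two ways to do so:

- taskmanager.properties(.template): Add the TiDB settings to the `spark.default.conf` property, then restart the taskmanager.
- CLI: Add the settings to the ini conf file and start the CLI with `--spark_conf`. For details, refer to [Client Spark Configuration File](../../reference/client_config/client_spark_config.md).

For details on the TiDB settings for TiSpark, refer to [TiSpark Configuration](https://docs.pingcap.com/zh/tidb/stable/tispark-overview#tispark-%E9%85%8D%E7%BD%AE).

For example, the configuration in `taskmanager.properties(.template)`: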

```properties
spark.default.conf=spark.sql.extensions=org.apache.spark.sql.TiExtensions;spark.sql.catalog.tidb_catalog=org.apache.spark.sql.catalyst.catalog.TiCatalog;spark.sql.catalog.tidb_catalog.pd.addresses=127.0.0.1:2379;spark.tispark.pd.addresses=127.0.0.1:2379;spark.sql.tidb.addr=127.0.0.1;spark.sql.tidb.port=4000;spark.sql.tidb.user=root;spark.sql.tidb.password=root;
```

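Once either configuration takes effect, access TiDB tables using the format `tidb_catalog.<db_name>.<table_name>`. If you do not want to write the `tidb_catalog` prefix, set `spark.sql.catalog.default=tidb_catalog` in the configuration. TiDB tables can then be accessed using the format `<db_name>.<table_name>`.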

## Data Format

For the TiDB schema, see [TiDB Schema](https://docs.pingcap.com/zh/tidb/stable/data-type-overview). Currently, only the following TiDB data formats are supported:

| OpenMLDB Data Format | TiDB Data Format        |
|----------------------|-------------------------|
| BOOL                 | BOOL                    |
| SMALLINT             | Currently not supported |
| INT                  | Currently not supported |
| BIGINT               | BIGINT                  |
| FLOAT                | FLOAT                   |
| DOUBLE               | DOUBLE                  |
| DATE                 | DATE                    |
| TIMESTAMP            | TIMESTAMP               |
| STRING               | VARCHAR(M)              |

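## Importing TiDB Data into OpenMLDB

Data from TiDB is imported with the [`LOAD DATA INFILE`](../../openmldb_sql/dml/LOAD_DATA_STATEMENT.md) statement, using a URI of the form `tidb://tidb_catalog.[db].[table]`. Note:

- Both offline and online engines can import TiDB data sources.
- TiDB import supports soft links, which avoid hard copies and ensure that OpenMLDB always reads the latest data from TiDB. To import through a soft link, set `deep_copy=false`.
- In `OPTIONS`, only `deep_copy`, `mode`, and `sql` take effect.

For example: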

```sql
LOAD DATA INFILE 'tidb://tidb_catalog.db1.t1' INTO TABLE t1 OPTIONS(deep_copy=false);
```

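Data loading also supports filtering TiDB table data with a SQL statement. The SQL must conform to SparkSQL syntax, and tables in the query are referenced by their registered names, without the `tidb://` prefix.

For example: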

```sql
LOAD DATA INFILE 'tidb://tidb_catalog.db1.t1' INTO TABLE tidb_catalog.db1.t1 OPTIONS(deep_copy=true, sql='SELECT * FROM tidb_catalog.db1.t1 where key=\"foo\"')
```

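## Exporting OpenMLDB Offline Engine Data to TiDB

Data is exported from OpenMLDB to TiDB with the [`SELECT INTO`](../../openmldb_sql/dql/SELECT_INTO_STATEMENT.md) statement, using a URI of the form `tidb://tidb_catalog.[db].[table]`. Note:

- The database and table must already exist. Currently, automatic creation of non-existent databases or tables is not supported.
- In `OPTIONS`, only `mode` (the export mode) takes effect, and it is mandatory. All other options are ignored.

For example: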

```sql
SELECT col1, col2, col3 FROM t1 INTO OUTFILE 'tidb://tidb_catalog.db1.t1' options(mode='append');
```
@@ -225,6 +225,8 @@ object HybridseUtil {
      "iceberg"
    } else if (file.toLowerCase().startsWith("openmldb://")) {
      "openmldb" // TODO(hw): no doc for it
    } else if (file.toLowerCase().startsWith("tidb://")) {
      "tidb"
    } else {
      parseOption(getOptionFromNode(node, "format"), "csv", getStringOrDefault).toLowerCase
    }
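
The branch added in this hunk can be read as a dispatch on the URI prefix. The Python sketch below mirrors only the branches visible in the hunk: `openmldb://`, `tidb://`, and the fallback to the `format` option with a `csv` default. Earlier branches, such as the one returning `"iceberg"`, are cut off in the diff and are left out. The function name `detect_format` and its parameters are hypothetical, and this is not the Scala implementation itself.

```python
def detect_format(file, format_option=None):
    """Pick a data source format from the URI prefix, else from the format option.

    Hypothetical Python rendering of the visible Scala branches only.
    """
    lowered = file.lower()  # prefix matching is case-insensitive, as in the hunk
    if lowered.startswith("openmldb://"):
        return "openmldb"
    if lowered.startswith("tidb://"):
        return "tidb"
    # Fall back to the user-supplied format option, defaulting to csv.
    return (format_option or "csv").lower()


print(detect_format("tidb://tidb_catalog.db1.t1"))
print(detect_format("/tmp/data", "Parquet"))
```

Because the prefix check runs before the `format` option is consulted, a `tidb://` URI selects the TiDB path even if a different `format` is passed.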