fix: local load/write validate and refactor
vagetablechicken committed Dec 7, 2023
1 parent a8c0226 commit fd678fa
Showing 9 changed files with 490 additions and 337 deletions.
4 changes: 2 additions & 2 deletions docs/zh/openmldb_sql/dml/LOAD_DATA_STATEMENT.md
@@ -50,8 +50,8 @@ FilePathPattern
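| header | Boolean | true | Whether the file contains a header row; defaults to `true`. |
| null_value | String | null | The NULL value; defaults to `"null"`. When loading, strings equal to null_value are converted to `"null"` and inserted into the table. |
| format | String | csv | Format of the imported file:<br />`csv`: the default when format is not specified explicitly.<br />`parquet`: the cluster version also supports importing parquet files; the standalone version does not. |
-| quote | String | " | The quote (enclosing) string for input data. Its length must be <= 1.<br />With load_mode=`cluster` the default is the double quote `"`. Once a quote character is configured, content enclosed by it is parsed as a single unit. For example, with the quote string set to "#", `1, 1.0, #This is a string field, even there is a comma#, normal_string` is parsed into four fields: the first is the integer 1, the second is the float 1.0, the third is a string, and the fourth, although not quoted, is also a string.<br /> **With load_mode=`local` the default is `\0`, and quote characters are not processed.** |
-| mode | String | "error_if_exists" | Import mode:<br />`error_if_exists`: offline mode only; reports an error if the offline table already contains data.<br />`overwrite`: offline mode only; the data overwrites the offline table's data.<br />`append`: available in both offline and online modes; if the file already exists, the data is appended to it. |
+| quote | String | " | The quote (enclosing) string for input data. Its length must be <= 1.<br />With load_mode=`cluster` the default is the double quote `"`. Once a quote character is configured, content enclosed by it is parsed as a single unit. For example, with the quote string set to "#", `1, 1.0, #This is a string field, even there is a comma#, normal_string` is parsed into four fields: the first is the integer 1, the second is the float 1.0, the third is a string, and the fourth, although not quoted, is also a string.<br /> **With load_mode=`local` the default is `\0` (an empty string may also be assigned), and quote characters are not processed.** |
+| mode | String | "error_if_exists" | Import mode:<br />`error_if_exists`: offline mode only; reports an error if the offline table already contains data.<br />`overwrite`: offline mode only; the data overwrites the offline table's data.<br />`append`: available in both offline and online modes; if the file already exists, the data is appended to it.<br /> **With load_mode=`local` the default is `append`.** |
| deep_copy | Boolean | true | `deep_copy=false` is supported only for offline load; the `INFILE` path can be set to the table's offline storage address, so no hard copy is needed. |
| load_mode | String | cluster | `load_mode='local'` only supports importing local csv files into online storage; it inserts data synchronously through the local client.<br /> `load_mode='cluster'` is supported only by the cluster version; it inserts data through Spark and supports synchronous or asynchronous mode. |
| thread | Integer | 1 | Takes effect only when importing local files, i.e. with `load_mode='local'` or in the standalone version; the number of threads used for local insertion. The maximum is `50`. |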
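
The `quote` row above describes how a quote character groups delimiter-containing text into one field. The following is a minimal sketch of that behaviour, not OpenMLDB's actual parser: `SplitLine` and `Trim` are hypothetical helpers, and trimming spaces around each field is an assumption made so that the documented example produces clean values.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: strips leading and trailing spaces from a field.
std::string Trim(const std::string& s) {
    const auto first = s.find_first_not_of(' ');
    if (first == std::string::npos) return "";
    const auto last = s.find_last_not_of(' ');
    return s.substr(first, last - first + 1);
}

// Hypothetical helper, not the parser used by OpenMLDB.
// Splits one line on `delimiter`; text between two `quote` characters is kept
// as a single field. Passing quote == '\0' disables quote handling, mirroring
// the documented default for load_mode='local'.
std::vector<std::string> SplitLine(const std::string& line, char delimiter, char quote) {
    std::vector<std::string> fields;
    std::string current;
    bool in_quotes = false;
    for (char c : line) {
        if (quote != '\0' && c == quote) {
            in_quotes = !in_quotes;  // toggle state and drop the quote character
        } else if (c == delimiter && !in_quotes) {
            fields.push_back(Trim(current));
            current.clear();
        } else {
            current.push_back(c);
        }
    }
    fields.push_back(Trim(current));
    return fields;
}
```

With `delimiter = ','` and `quote = '#'`, the example line from the table splits into four fields, and the comma inside the `#...#` span stays part of the third field. With `quote = '\0'` the `#` characters are treated as ordinary data.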
6 changes: 4 additions & 2 deletions docs/zh/openmldb_sql/dql/SELECT_INTO_STATEMENT.md
@@ -28,7 +28,7 @@ options_list:
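| null_value | String | null | Fill value for NULL; defaults to `"null"`. |
| format | String | csv | Output file format:<br />`csv`: the default when format is not specified explicitly.<br />`parquet`: the cluster version's offline mode supports exporting parquet files; cluster online mode and the standalone version do not. |
| mode | String | error_if_exists | Output mode:<br />`error_if_exists`: reports an error if the file already exists.<br />`overwrite`: if the file already exists, the data overwrites its contents.<br />`append`: if the file already exists, the data is appended to it.<br />Defaults to `error_if_exists` when not configured explicitly. |
-| quote | String | "" | The quote string for output data; its length must be <= 1. The default is "", meaning the output quote string is empty. When a quote string is configured, it is used to enclose a field. For example, with the quote string set to `"#"` and raw data {1, 1.0, This is a string, with comma}, the output text is `1, 1.0, #This is a string, with comma#`. |
+| quote | String | " | The quote string for output data; its length must be <= 1. The default is the double quote `"`, meaning the output quote string is empty. When a quote string is configured, it is used to enclose a field. For example, with the quote string set to `"#"` and raw data {1, 1.0, This is a string, with comma}, the output text is `1, 1.0, #This is a string, with comma#`.<br /> **In standalone mode or cluster online mode the default is `\0` (an empty string may also be assigned), and quote characters are not processed.** |
| coalesce | Int | 0 | Supported only in the cluster version's offline mode. The default is 0, which performs no merging (multiple files may be output); it can specify how many files are finally output. For example, coalesce=1 merges all parts into one file. |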
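
The output-side `quote` option above can be illustrated in the same way. This is a sketch only: `OutField` and `FormatRow` are hypothetical names, and wrapping only string-typed fields is an assumption inferred from the documented example, where `1` and `1.0` are left unquoted.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical representation of one output value and whether it is a string.
struct OutField {
    std::string text;
    bool is_string;
};

// Hypothetical helper, not OpenMLDB's writer.
// Joins the fields with `delimiter` and wraps string fields in `quote`.
// quote == '\0' means no quoting, mirroring the documented standalone and
// cluster online default.
std::string FormatRow(const std::vector<OutField>& row, const std::string& delimiter, char quote) {
    std::string out;
    for (std::size_t i = 0; i < row.size(); ++i) {
        if (i > 0) out += delimiter;
        const bool wrap = row[i].is_string && quote != '\0';
        if (wrap) out.push_back(quote);
        out += row[i].text;
        if (wrap) out.push_back(quote);
    }
    return out;
}
```

Called with the documented raw data, a `", "` delimiter, and `quote = '#'`, this yields `1, 1.0, #This is a string, with comma#`, matching the example in the table.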


@@ -44,7 +44,9 @@ SELECT ... INTO OUTFILE 'file_path' OPTIONS (key = value, ...)

## FilePath

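-FilePath supports three schemes: 'file://', 'hdfs://', and 'hive://'. For 'file://' and 'hdfs://' the address is a directory, not a file name; 'hive://' exports to a Hive table, in the form `hive://<db>.<table>`
+In cluster offline mode, FilePath supports three schemes: 'file://', 'hdfs://', and 'hive://'. For 'file://' and 'hdfs://' the address is a directory, not a file name; 'hive://' exports to a Hive table, in the form `hive://<db>.<table>`
+
+In the standalone version or cluster online mode, FilePath must use the file scheme and must be a file name, not a directory.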
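
The standalone and cluster online rule can be sketched as a string-level check. `IsLocalOutfilePath` is a hypothetical helper, not part of OpenMLDB. It assumes that a path without any scheme counts as a local file, and it only rejects paths that end in `/`; detecting an existing directory would need a real filesystem query.

```cpp
#include <string>

// Hypothetical validation helper, for illustration only.
// Accepts "file://..." paths and, by assumption, paths with no scheme at all.
// Rejects other schemes (e.g. hdfs://, hive://), empty paths, and paths that
// end in '/', which are taken to name a directory.
bool IsLocalOutfilePath(const std::string& path) {
    const std::string file_scheme = "file://";
    std::string rest = path;
    if (path.find("://") != std::string::npos) {
        if (path.compare(0, file_scheme.size(), file_scheme) != 0) {
            return false;  // some scheme other than file://
        }
        rest = path.substr(file_scheme.size());
    }
    return !rest.empty() && rest.back() != '/';
}
```

For example, `file:///tmp/out.csv` passes, while `hdfs://namenode/out` and `/tmp/out/` are rejected.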

## Hive Support

2 changes: 1 addition & 1 deletion hybridse/include/node/sql_node.h
@@ -981,7 +981,7 @@ class ConstNode : public ExprNode {
case kInt64:
return static_cast<int64_t>(val_.vlong);
case kFloat:
-return static_cast<int64_t>(val_.vfloat);
+return static_cast<int64_t>(val_.vfloat); // TODO: why int64_t

[cpplint] hybridse/include/node/sql_node.h:984: At least two spaces is best between code and comments [whitespace/comments] [2]
[cpplint] hybridse/include/node/sql_node.h:984: Missing username in TODO; it should look like "// TODO(my_username): Stuff." [readability/todo] [2]
case kDouble:
return static_cast<int64_t>(val_.vdouble);
default: {
3 changes: 3 additions & 0 deletions src/sdk/CMakeLists.txt
@@ -68,6 +68,9 @@ if(TESTING_ENABLE)

add_executable(split_test split_test.cc)
target_link_libraries(split_test ${BIN_LIBS})
+
+add_executable(options_map_parser_test options_map_parser_test.cc)
+target_link_libraries(options_map_parser_test ${BIN_LIBS})
endif()

set(SDK_LIBS openmldb_sdk openmldb_catalog client zk_client schema openmldb_flags openmldb_codec openmldb_proto base hybridse_sdk zookeeper_mt)
217 changes: 0 additions & 217 deletions src/sdk/file_option_parser.h

This file was deleted.
