[Feature][File] Support config null format for text file read (#8109)
hailin0 authored Nov 27, 2024
1 parent 555f5eb commit 2dbf02d
Showing 25 changed files with 208 additions and 14 deletions.
10 changes: 9 additions & 1 deletion docs/en/connector-v2/source/FtpFile.md
@@ -38,7 +38,7 @@ If you use SeaTunnel Engine, It automatically integrated the hadoop jar when you

## Options

| name | type | required | default value |
|---------------------------|---------|----------|---------------------|
| host | string | yes | - |
| port | int | yes | - |
@@ -62,6 +62,7 @@ If you use SeaTunnel Engine, It automatically integrated the hadoop jar when you
| compress_codec | string | no | none |
| archive_compress_codec | string | no | none |
| encoding | string | no | UTF-8 |
| null_format | string | no | - |
| common-options | | no | - |

### host [string]
@@ -336,6 +337,13 @@ The compress codec of archive files and the details that supported as the follow
Only used when file_format_type is json,text,csv,xml.
The encoding of the file to read. This param will be parsed by `Charset.forName(encoding)`.

### null_format [string]

Only used when file_format_type is text.
Defines which strings are read as null.

For example: `\N`
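
The effect of this option can be illustrated with a minimal Java sketch. It is not SeaTunnel's reader code: the class `NullFormatSketch` and the helper `parseLine` are hypothetical names, and the sketch assumes a field becomes null only when it equals the configured string exactly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class NullFormatSketch {

    // Hypothetical helper (an assumption for illustration, not a SeaTunnel API):
    // splits one text line and maps fields equal to nullFormat to null.
    public static List<String> parseLine(String line, String delimiter, String nullFormat) {
        List<String> fields = new ArrayList<>();
        // Pattern.quote treats the delimiter literally; limit -1 keeps trailing empty fields.
        for (String raw : line.split(Pattern.quote(delimiter), -1)) {
            boolean isNull = nullFormat != null && nullFormat.equals(raw);
            fields.add(isNull ? null : raw);
        }
        return fields;
    }

    public static void main(String[] args) {
        // The Java literal "\\N" is the two-character string \N.
        List<String> row = parseLine("1|\\N|alice", "|", "\\N");
        System.out.println(row); // prints [1, null, alice]
    }
}
```

When `nullFormat` is null, which corresponds to leaving the option unset in this sketch, every field is kept as the raw string.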

### common options

Source plugin common parameters, please refer to [Source Common Options](../source-common-options.md) for details.
1 change: 1 addition & 0 deletions docs/en/connector-v2/source/HdfsFile.md
@@ -66,6 +66,7 @@ Read data from hdfs file system.
| compress_codec | string | no | none | The compress codec of files |
| archive_compress_codec | string | no | none | |
| encoding | string | no | UTF-8 | |
| null_format | string | no | - | Only used when file_format_type is text. Defines which strings are read as null, e.g. `\N`. |
| common-options | | no | - | Source plugin common parameters, please refer to [Source Common Options](../source-common-options.md) for details. |

### delimiter/field_delimiter [string]
8 changes: 8 additions & 0 deletions docs/en/connector-v2/source/LocalFile.md
@@ -62,6 +62,7 @@ If you use SeaTunnel Engine, It automatically integrated the hadoop jar when you
| compress_codec | string | no | none |
| archive_compress_codec | string | no | none |
| encoding | string | no | UTF-8 |
| null_format | string | no | - |
| common-options | | no | - |
| tables_configs | list | no | used to define a multiple table task |

@@ -330,6 +331,13 @@ The compress codec of archive files and the details that supported as the follow
Only used when file_format_type is json,text,csv,xml.
The encoding of the file to read. This param will be parsed by `Charset.forName(encoding)`.

### null_format [string]

Only used when file_format_type is text.
Defines which strings are read as null.

For example: `\N`

### common options

Source plugin common parameters, please refer to [Source Common Options](../source-common-options.md) for details
1 change: 1 addition & 0 deletions docs/en/connector-v2/source/OssFile.md
@@ -211,6 +211,7 @@ If you assign file type to `parquet` `orc`, schema option not required, connecto
| xml_use_attr_format | boolean | no | - | Specifies whether to process data using the tag attribute format, only used when file_format is xml. |
| compress_codec | string | no | none | Which compress codec the files used. |
| encoding | string | no | UTF-8 | |
| null_format | string | no | - | Only used when file_format_type is text. Defines which strings are read as null, e.g. `\N`. |
| file_filter_pattern | string | no | | Filter pattern, which used for filtering files. |
| common-options | config | no | - | Source plugin common parameters, please refer to [Source Common Options](../source-common-options.md) for details. |

8 changes: 8 additions & 0 deletions docs/en/connector-v2/source/OssJindoFile.md
@@ -72,6 +72,7 @@ It only supports hadoop version **2.9.X+**.
| compress_codec | string | no | none |
| archive_compress_codec | string | no | none |
| encoding | string | no | UTF-8 |
| null_format | string | no | - |
| common-options | | no | - |

### path [string]
@@ -343,6 +344,13 @@ The compress codec of archive files and the details that supported as the follow
Only used when file_format_type is json,text,csv,xml.
The encoding of the file to read. This param will be parsed by `Charset.forName(encoding)`.

### null_format [string]

Only used when file_format_type is text.
Defines which strings are read as null.

For example: `\N`

### common options

Source plugin common parameters, please refer to [Source Common Options](../source-common-options.md) for details.
1 change: 1 addition & 0 deletions docs/en/connector-v2/source/S3File.md
@@ -220,6 +220,7 @@ If you assign file type to `parquet` `orc`, schema option not required, connecto
| compress_codec | string | no | none | |
| archive_compress_codec | string | no | none | |
| encoding | string | no | UTF-8 | |
| null_format | string | no | - | Only used when file_format_type is text. Defines which strings are read as null, e.g. `\N`. |
| file_filter_pattern | string | no | | Filter pattern, which used for filtering files. |
| common-options | | no | - | Source plugin common parameters, please refer to [Source Common Options](../source-common-options.md) for details. |

3 changes: 2 additions & 1 deletion docs/en/connector-v2/source/SftpFile.md
@@ -71,7 +71,7 @@ The File does not have a specific type list, and we can indicate which SeaTunnel

## Source Options

| Name | Type | Required | default value | Description |
|---------------------------|---------|----------|---------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| host | String | Yes | - | The target sftp host is required |
| port | Int | Yes | - | The target sftp port is required |
@@ -94,6 +94,7 @@ The File does not have a specific type list, and we can indicate which SeaTunnel
| compress_codec | String | No | None | The compress codec of files and the details that supported as the following shown: <br/> - txt: `lzo` `None` <br/> - json: `lzo` `None` <br/> - csv: `lzo` `None` <br/> - orc: `lzo` `snappy` `lz4` `zlib` `None` <br/> - parquet: `lzo` `snappy` `lz4` `gzip` `brotli` `zstd` `None` <br/> Tips: excel type does Not support any compression format |
| archive_compress_codec | string | no | none | |
| encoding | string | no | UTF-8 | |
| null_format | string | no | - | Only used when file_format_type is text. Defines which strings are read as null, e.g. `\N`. |
| common-options | | No | - | Source plugin common parameters, please refer to [Source Common Options](../source-common-options.md) for details. |

### file_filter_pattern [string]
1 change: 1 addition & 0 deletions docs/zh/connector-v2/source/HdfsFile.md
@@ -56,6 +56,7 @@
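| kerberos_keytab_path | string || - | Path to the kerberos keytab. |
| skip_header_row_number | long || 0 | Skips the first few lines, but only for txt and csv. For example, with `skip_header_row_number = 2`, Seatunnel skips the first two lines of the source files. |
| file_filter_pattern | string || - | Filter pattern, used for filtering files. |
| null_format | string || - | Defines which strings can be represented as null, but only applies to txt and csv. For example: `\N` |
| schema | config || - | Schema fields of the upstream data. |
| sheet_name | string || - | The sheet of the workbook to read; only used when the file format is excel. |
| compress_codec | string || none | The compression codec of the files. |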
@@ -285,11 +285,11 @@ private boolean compareMapValue(Map<?, ?> value, MapType<?, ?> type, Map<?, ?> c

     private Boolean checkType(Object value, SeaTunnelDataType<?> fieldType) {
         if (value == null) {
-            if (fieldType.getSqlType() == SqlType.NULL) {
-                return true;
-            } else {
-                return false;
-            }
+            return true;
         }
+
+        if (fieldType.getSqlType() == SqlType.NULL) {
+            return false;
+        }
 
         if (fieldType.getSqlType() == SqlType.ROW) {
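
The behavioral change in this `checkType` hunk can be sketched in isolation. The example below is only an illustration under stated assumptions: `NullCheckSketch` and its small `SqlType` enum are hypothetical stand-ins for the SeaTunnel types, and the type checks after the two null-related branches are simplified placeholders, not the real implementation.

```java
public class NullCheckSketch {

    // Hypothetical stand-in for SeaTunnel's SqlType; only the values needed here.
    public enum SqlType { NULL, STRING, INT }

    // Follows the branch order of the updated hunk: a null value now passes
    // for any field type, and a non-null value never matches the NULL type.
    public static boolean checkType(Object value, SqlType fieldType) {
        if (value == null) {
            return true;
        }
        if (fieldType == SqlType.NULL) {
            return false;
        }
        // Simplified placeholder checks (assumption, not the real logic).
        switch (fieldType) {
            case STRING:
                return value instanceof String;
            case INT:
                return value instanceof Integer;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(checkType(null, SqlType.INT));  // prints true
        System.out.println(checkType("x", SqlType.NULL));  // prints false
    }
}
```

Before this change, a null value passed only when the field type itself was NULL, so a null in an ordinary column failed the check.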
@@ -48,6 +48,12 @@ public class BaseSourceConfigOptions {
                    .withDescription(
                            "The separator between columns in a row of data. Only needed by `text` file format");

    public static final Option<String> NULL_FORMAT =
            Options.key("null_format")
                    .stringType()
                    .noDefaultValue()
                    .withDescription("The string that represents a null value");

    public static final Option<String> ENCODING =
            Options.key("encoding")
                    .stringType()