Merge pull request #24 from KxSystems/null_support
Null support
vgrechin-kx authored Mar 14, 2023
2 parents a40803e + 142300a commit 0c4e452
Showing 34 changed files with 4,987 additions and 864 deletions.
6 changes: 6 additions & 0 deletions .gitignore
@@ -0,0 +1,6 @@
arrowkdb.code-workspace
.vscode/
build/
test.q
unit.q
*.user
6 changes: 3 additions & 3 deletions .travis.yml
@@ -82,16 +82,16 @@ before_install:
 script:
   - if [[ $TESTS == "True" && "x$OD" != "x" && "x$QLIC_KC" != "x" ]]; then
       curl -o test.q -L https://github.com/KxSystems/hdf5/raw/master/test.q;
-      q test.q tests/ -q;
+      q test.q tests -q && q test.q tests/null_mapping -q && q test.q tests/null_bitmap -q;
     fi
   - if [[ $TRAVIS_OS_NAME == "windows" && $BUILD == "True" ]]; then
       7z a -tzip -r $FILE_NAME ./cmake/$FILE_ROOT/*;
     elif [[ $BUILD == "True" && ( $TRAVIS_OS_NAME == "linux" || $TRAVIS_OS_NAME == "osx" ) ]]; then
       tar -zcvf $FILE_NAME -C cmake/$FILE_ROOT .;
     elif [[ $TRAVIS_OS_NAME == "windows" ]]; then
-      7z a -tzip $FILE_NAME README.md install.bat LICENSE q examples proto;
+      7z a -tzip $FILE_NAME README.md install.bat LICENSE q docs examples proto;
     elif [[ $TRAVIS_OS_NAME == "linux" || $TRAVIS_OS_NAME == "osx" ]]; then
-      tar -zcvf $FILE_NAME README.md install.sh LICENSE q examples proto;
+      tar -zcvf $FILE_NAME README.md install.sh LICENSE q docs examples proto;
     fi

 deploy:
3 changes: 3 additions & 0 deletions CMakeLists.txt
@@ -7,6 +7,9 @@ project(arrowkdb CXX)

 set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wall -DKXVER=3")
 set(CMAKE_CXX_STANDARD 14)
+IF(APPLE)
+set(CMAKE_CXX_STANDARD 17)
+endif()
 set(CMAKE_CXX_STANDARD_REQUIRED ON)
 set(CMAKE_CXX_EXTENSIONS OFF)

66 changes: 66 additions & 0 deletions docs/null-bitmap.md
@@ -0,0 +1,66 @@
# Arrowkdb null bitmap

## Problem

Previously arrowkdb ignored the null bitmap when reading or writing an arrow array, for the following reasons:

- Mapping arrow nulls to kdb null values produces awkward corner cases (for example, kdb has no boolean or byte null).

- Mapping to kdb nulls reduces performance.

Users have requested that arrowkdb provide the null bitmap when reading an arrow array so that they can use it in their applications.

When reading an arrow array using arrowkdb the user can now choose to return the null bitmap as well as the data values. The shape of the null bitmap structure is exactly the same as the data structure.  It is then left to the user to interpret the two structures as appropriate for their application.

Note: there is currently no support for null bitmap with the writer functions. 


## Implementation

The null bitmap feature is supported when reading:

* Parquet files
* Arrow IPC files
* Arrow IPC streams

To demonstrate it we first use the null mapping support to create a Parquet file containing nulls (although you can read null bitmaps from files generated by other writers such as PyArrow):

```q
q)options:(``NULL_MAPPING)!(::;`bool`uint8`int8`uint16`int16`uint32`int32`uint64`int64`float16`float32`float64`date32`date64`month_interval`day_time_interval`timestamp`time32`time64`duration`utf8`large_utf8`binary`large_binary`fixed_size_binary!(0b;0x00;0x00;0Nh;0Nh;0Ni;0Ni;0N;0N;0Nh;0Ne;0n;0Nd;0Np;0Nm;0Nn;0Np;0Nt;0Nn;0Nn;"";"";`byte$"";`byte$"";`byte$""))
q)table:([]col1:0N 1 2; col2:1.1 0n 2.2; col3:("aa"; "bb"; ""))
q).arrowkdb.pq.writeParquetFromTable["file.parquet";table;options]
```


Each reader function in arrowkdb takes an options dictionary.  A new `WITH_NULL_BITMAP` option has been added.  When this option is set, the reader functions return a two-item mixed list (the data values followed by the null bitmap) rather than the data values alone:

```q
q)read_results:.arrowkdb.pq.readParquetToTable["file.parquet";(enlist `WITH_NULL_BITMAP)!enlist 1]
q)read_results
+`col1`col2`col3!(0 1 2;1.1 0 2.2;("aa";"bb";""))
+`col1`col2`col3!(100b;010b;001b)
```

The null bitmap is returned as a separate kdb structure with the same shape as the data values:

```q
q)first read_results
col1 col2 col3
--------------
0    1.1  "aa"
1    0    "bb"
2    2.2  ""
q)last read_results
col1 col2 col3
--------------
1    0    0
0    1    0
0    0    1
```
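
One way to interpret the two structures is to overwrite the flagged entries with kdb nulls. The following is a minimal sketch and not part of arrowkdb: the `data` and `bitmap` tables are typed in by hand to match the output above (in practice they would be `first read_results` and `last read_results`), and the name `fixed` is purely illustrative.

```q
q)data:([]col1:0 1 2; col2:1.1 0 2.2; col3:("aa";"bb";""))
q)bitmap:([]col1:100b; col2:010b; col3:001b)
q)/ where the bitmap is set, substitute the kdb null for that column type
q)fixed:update col1:?[bitmap`col1;0N;col1], col2:?[bitmap`col2;0n;col2] from data
q)exec col1 from fixed
0N 1 2
q)exec col2 from fixed
1.1 0n 2.2
```

`col3` is left untouched here because kdb has no way to distinguish a null string from an empty one; an application that needs that distinction would have to keep consulting the bitmap column.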


## Limitations

- The use of a null bitmap with the writer functions is not supported. 

- Since the null bitmap structure and data structure must have the same shape, arrow arrays which use nested datatypes (list, map, struct, union, dictionaries) where the parent array contains null values cannot be represented.  For example, an array with a struct datatype in arrow can have either null child field values or the parent struct value could be null.  The null bitmap structure will only reflect the null bitmap of the child field datatypes.
109 changes: 109 additions & 0 deletions docs/null-mapping.md
@@ -0,0 +1,109 @@
# Arrowkdb null mapping

## Problem

Previously arrowkdb ignored the null bitmap when reading or writing an arrow array. Users have requested that arrowkdb maps arrow nulls into kdb.

Unlike arrow, not all kdb types have a null value and those that do overload one value in the range (the 0N* values typically map to INT_MIN or NaN). 

For example:

- Each item in an arrow boolean array can be 0b, 1b or NULL.  kdb has no boolean null. 

- kdb doesn't have a byte null. 

- Unlike arrow, kdb can't distinguish between:
  - a null string and empty string. 
  - the " " character and null.
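
The points above can be checked in a q session with the built-in `null` keyword. This is only an illustrative sketch of standard kdb behaviour and does not involve arrowkdb:

```q
q)null 0b
0b
q)null 0x00
0b
q)null " "
1b
q)null each (0Ni;0n;0Nd)
111b
```

No boolean or byte value is ever reported as null, the space character is the char null, and the typed `0N*` values are the only nulls available for the numeric and temporal types.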


## Implementation

When reading and writing an arrow array using arrowkdb the user can now choose whether to map arrow nulls. Each reader and writer function in arrowkdb takes an options dictionary.  A new `NULL_MAPPING` option has been added. Its value is a dictionary from datatype to null value, which lets the user specify whether an arrow datatype should be null mapped and what value to use for null in kdb.

> :warning: **An identity function (::) may be required in the options dictionary values**
>
> The options dictionary values list can be `7h`, `11h` or a mixed list of `-7|-11|4|99|101h`. Therefore, if the only option set is NULL_MAPPING, an additional empty key with the identity function (::) as its value must be included in the options to make the values a mixed list.

The following Arrow datatypes are supported, along with possible null mapping values:

```q
q)options:(``NULL_MAPPING)!(::;`bool`uint8`int8`uint16`int16`uint32`int32`uint64`int64`float16`float32`float64`date32`date64`month_interval`day_time_interval`timestamp`time32`time64`duration`utf8`large_utf8`binary`large_binary`fixed_size_binary!(0b;0x00;0x00;0Nh;0Nh;0Ni;0Ni;0N;0N;0Nh;0Ne;0Nf;0Nd;0Np;0Nm;0Nn;0Np;0Nt;0Nn;0Nn;"";"";`byte$"";`byte$"";`byte$""))
q)options`NULL_MAPPING
bool             | 0b
uint8            | 0x00
int8             | 0x00
uint16           | 0Nh
int16            | 0Nh
uint32           | 0Ni
int32            | 0Ni
uint64           | 0N
int64            | 0N
float16          | 0Nh
float32          | 0Ne
float64          | 0n
date32           | 0Nd
date64           | 0Np
month_interval   | 0Nm
day_time_interval| 0Nn
timestamp        | 0Np
time32           | 0Nt
time64           | 0Nn
duration         | 0Nn
utf8             | ""
large_utf8       | ""
binary           | `byte$()
large_binary     | `byte$()
fixed_size_binary| `byte$()
```

The type of each value in this dictionary must be the atomic type of the corresponding list representation for that datatype.  Where a datatype isn't present in this dictionary, arrowkdb will ignore the null bitmap (as per the previous behaviour).
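
To illustrate both the partial mapping and the identity-function warning above, here is a small sketch that maps only three datatypes. The variable names `mapping` and `options` are illustrative; the point is the type of the options dictionary's values list with and without the extra empty key:

```q
q)mapping:`int64`float64`utf8!(0N;0n;"")
q)/ with the empty key and (::), the values form a mixed list
q)options:(``NULL_MAPPING)!(::;mapping)
q)type value options
0h
q)/ without it, enlisting a single dictionary produces a one-row table instead
q)type value (enlist`NULL_MAPPING)!enlist mapping
98h
```

Datatypes left out of `mapping` (for example `int32`) would continue to have their null bitmap ignored, as described above.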

## Example

Using this null mapping, we can pretty print an arrow table in which the kdb nulls have been mapped to arrow nulls:

```q
q)options:(``NULL_MAPPING)!(::;`bool`uint8`int8`uint16`int16`uint32`int32`uint64`int64`float16`float32`float64`date32`date64`month_interval`day_time_interval`timestamp`time32`time64`duration`utf8`large_utf8`binary`large_binary`fixed_size_binary!(0b;0x00;0x00;0Nh;0Nh;0Ni;0Ni;0N;0N;0Nh;0Ne;0n;0Nd;0Np;0Nm;0Nn;0Np;0Nt;0Nn;0Nn;"";"";`byte$"";`byte$"";`byte$""))
q)table:([]col1:0N 1 2; col2:1.1 0n 2.2; col3:("aa"; "bb"; ""))
q).arrowkdb.tb.prettyPrintTableFromTable[table;options]
col1: int64
col2: double
col3: string
----
col1:
  [
    [
      null,
      1,
      2
    ]
  ]
col2:
  [
    [
      1.1,
      null,
      2.2
    ]
  ]
col3:
  [
    [
      "aa",
      "bb",
      null
    ]
  ]
q)
```
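
Because the same option applies to the reader functions, a round trip through a Parquet file can be sketched as follows. This assumes arrowkdb is already loaded; the file name `roundtrip.parquet` and the variable `readback` are hypothetical, and the mapping is deliberately restricted to the three datatypes used by the table:

```q
q)options:(``NULL_MAPPING)!(::;`int64`float64`utf8!(0N;0n;""))
q)table:([]col1:0N 1 2; col2:1.1 0n 2.2; col3:("aa";"bb";""))
q)/ kdb nulls are written as arrow nulls
q).arrowkdb.pq.writeParquetFromTable["roundtrip.parquet";table;options]
q)/ arrow nulls are mapped back to the configured kdb values on read
q)readback:.arrowkdb.pq.readParquetToTable["roundtrip.parquet";options]
q)table~readback
```

No result is shown for the final comparison because it depends on the installed arrowkdb build; it is included only as a way to check that the mapping behaves symmetrically.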




## Limitations

- There is no null mapping for arrow arrays which use nested datatypes (list, map, struct, union, dictionaries) where the parent array contains null values.  For example, an array with a struct datatype in arrow can have either null child field values or the parent struct value could be null.  Arrowkdb will only map nulls for the child fields using the above mapping.

- There is a loss of performance when choosing to map nulls, but this should not be significant. 