Commit

Updates to the null checks. The manual of detecting nulls.
piotrczarnas committed Feb 7, 2024
1 parent 4e2fdf5 commit 75efd7b
Showing 11 changed files with 332 additions and 29 deletions.
Original file line number Diff line number Diff line change
@@ -1,12 +1,309 @@
# Detecting data quality issues with nulls
Read this guide to learn what types of data quality checks are supported in DQOps to detect issues related to nulls.
The data quality checks are configured in the `nulls` category in DQOps.
# Detecting empty and incomplete columns
Read this guide to learn how to detect empty columns or incomplete columns containing too many null values in a dataset.

## Nulls category
Data quality checks that are detecting issues related to nulls are listed below.
The data quality checks that detect empty and incomplete columns are configured in the `nulls` category in DQOps.

## Issues with empty and incomplete columns
DQOps categorizes data quality issues with empty and incomplete columns in the *Completeness* data quality dimension.

### Types of completeness issues
We identify four types of data completeness issues related to the null values in a dataset.

- Incomplete columns that contain a few null values.

- Partially incomplete columns that are expected to contain some null values,
but the measured percentage of null values is higher than anticipated.

- Empty columns that have no values. Empty columns were most likely defined to store some information that was never provided.

- Inconsistently incomplete columns whose percentage of null values changes over time.
DQOps detects these types of issues using time series anomaly detection.

### The causes of completeness issues
Null values appear in the dataset for several reasons.

- A human error during data entry leaves a required field empty.

- The field is optional, and the process followed by users does not require entering the information.

- A bug in the data entry form skipped the validation of required fields.

- A bug in the data transformation or mapping code did not pass the value of a required field downstream.

- The field was required, but the restriction was lifted later.

### Problems caused by incomplete columns
Null values can cause problems in several data analytics and data transformation areas.

- Dashboards will show lower numbers because null values are not included in calculations.

- Some SQL queries that use filters may fail to return any records when a column has any null values.
This is caused by the three-valued logic that SQL uses when comparing values to null.
For example, the SQL filter `WHERE product_id NOT IN (SELECT nullable_product_id FROM table_with_nulls)`
matches no rows when the subquery returns at least one null value, because the comparison to null evaluates to unknown instead of true.
*This query type is a particular case that forces any SQL-compliant database engine to return no rows.*
These types of data quality issues are hard to find, especially when a Business Intelligence engine generates the queries.
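
Both problems listed above can be reproduced with a few rows of data. The following is a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in for any SQL engine. The `products` table and its contents are hypothetical and exist only for this illustration.

```python
import sqlite3

# In-memory database with hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (product_id INTEGER);
    INSERT INTO products VALUES (1), (2), (3);
    CREATE TABLE table_with_nulls (nullable_product_id INTEGER);
    INSERT INTO table_with_nulls VALUES (1), (NULL);
""")

# Aggregates skip nulls: COUNT(column) counts only non-null values.
total_rows, counted = conn.execute(
    "SELECT COUNT(*), COUNT(nullable_product_id) FROM table_with_nulls"
).fetchone()

# NOT IN against a set that contains NULL never evaluates to TRUE.
not_in_rows = conn.execute(
    "SELECT product_id FROM products "
    "WHERE product_id NOT IN (SELECT nullable_product_id FROM table_with_nulls)"
).fetchall()

# Excluding the nulls from the subquery restores the expected result.
fixed_rows = conn.execute(
    "SELECT product_id FROM products "
    "WHERE product_id NOT IN ("
    "  SELECT nullable_product_id FROM table_with_nulls"
    "  WHERE nullable_product_id IS NOT NULL) "
    "ORDER BY product_id"
).fetchall()

print(total_rows, counted)  # the table has 2 rows, but only 1 is counted
print(not_in_rows)          # empty list
print(fixed_rows)           # products 2 and 3
```

The second query returns no rows even though products 2 and 3 are clearly absent from `table_with_nulls`, which is why a single null value can silently empty a report.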


## Incomplete columns
We say that a column is incomplete when it contains some null values.
The following example shows the data profiling statistics of a column with over 16% of null values.

![Column with null values profiling statistics](https://dqops.com/docs/images/concepts/categories-of-data-quality-checks/null-count-profiling-statistics-min.png){ loading=lazy }

### Detect incomplete columns with UI
DQOps uses a [*nulls_count*](../checks/column/nulls/nulls-count.md) data quality check to count null values.
It raises a data quality issue when any null values are found.

The default value of the *max_count* parameter is 0, which asserts that no null values are present.

![Detect incomplete columns with some null values using a data quality check](https://dqops.com/docs/images/concepts/categories-of-data-quality-checks/incomplete-column-with-some-nulls-check-in-editor-min.png){ loading=lazy }

### Detect incomplete columns in YAML
The following example shows a [*nulls_count*](../checks/column/nulls/nulls-count.md) check configured in a YAML file.

``` { .yaml linenums="1" hl_lines="13-15" }
# yaml-language-server: $schema=https://cloud.dqops.com/dqo-yaml-schema/TableYaml-schema.json
apiVersion: dqo/v1
kind: table
spec:
  columns:
    street_number:
      type_snapshot:
        column_type: STRING
        nullable: true
      monitoring_checks:
        daily:
          nulls:
            daily_nulls_count:
              error:
                max_count: 0
```

## Partially incomplete columns
A partially incomplete column is nullable, and it is acceptable to store some null values in the column, but the column contains too many null values.

We can define a minimum completeness level in many ways:

- A maximum number of rows containing a null value in the column.
We can use the [*nulls_count*](../checks/column/nulls/nulls-count.md) data quality check with a higher value of the *max_count* parameter.

- A maximum percentage of null values.
DQOps has a [*nulls_percent*](../checks/column/nulls/nulls-percent.md) data quality check for that purpose.

- A minimum number of rows that must have a non-null value.
DQOps has a dedicated data quality check [*not_nulls_count*](../checks/column/nulls/not-nulls-count.md) that also detects empty columns.

- Or a minimum percentage of non-null values in a column.
The data quality check [*not_nulls_percent*](../checks/column/nulls/not-nulls-percent.md) supports this case.
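
The four measures listed above are related by simple arithmetic. The sketch below computes them for an in-memory list of values. It is only an illustration: the function name `completeness_metrics` is hypothetical, DQOps calculates these values with SQL queries on the data source, and returning `0.0` percent for an empty input is an assumption made here.

```python
from typing import Dict, Optional, Sequence


def completeness_metrics(values: Sequence[Optional[object]]) -> Dict[str, float]:
    """Return the four completeness measures for one column."""
    total = len(values)
    nulls = sum(1 for value in values if value is None)
    not_nulls = total - nulls
    return {
        "nulls_count": nulls,
        # Assumption: report 0.0 percent when there are no rows at all.
        "nulls_percent": 100.0 * nulls / total if total else 0.0,
        "not_nulls_count": not_nulls,
        "not_nulls_percent": 100.0 * not_nulls / total if total else 0.0,
    }


# Hypothetical column with 8 rows, 2 of them null.
metrics = completeness_metrics(["12", None, "7", "3", None, "41", "5", "9"])
print(metrics)
```

For this sample, the null count is 2 (25% of rows) and the non-null count is 6 (75% of rows), so a *max_count* of 0 would fail while a *max_percent* of 30 would pass.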

### Detect in UI
The [*nulls_percent*](../checks/column/nulls/nulls-percent.md) check measures the percentage of null values in a column.
DQOps supports configuring multiple alert severity levels by setting a different threshold for each level.

The following example raises a warning severity issue when the percentage of null values is above 16%.
An issue at an error severity level is raised when the percentage of null values exceeds 20%.

![Detect incomplete columns with a minimum accepted percentage of nulls](https://dqops.com/docs/images/concepts/categories-of-data-quality-checks/detect-incomplete-columns-with-accepted-percent-of-nulls-min.png){ loading=lazy }

### Detect in YAML
The configuration of the [*nulls_percent*](../checks/column/nulls/nulls-percent.md) check is straightforward in YAML.

``` { .yaml linenums="1" hl_lines="13-17" }
# yaml-language-server: $schema=https://cloud.dqops.com/dqo-yaml-schema/TableYaml-schema.json
apiVersion: dqo/v1
kind: table
spec:
  columns:
    street_number:
      type_snapshot:
        column_type: STRING
        nullable: true
      monitoring_checks:
        daily:
          nulls:
            daily_nulls_percent:
              warning:
                max_percent: 16.0
              error:
                max_percent: 20.0
```
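
The way two thresholds map to one severity can be sketched in a few lines of Python. This is not the DQOps rule implementation. The function `evaluate_max_percent` is a hypothetical helper, and it assumes an issue is raised only when the measured value is strictly above the threshold.

```python
from typing import Optional


def evaluate_max_percent(
    actual_percent: float,
    warning: Optional[float] = None,
    error: Optional[float] = None,
) -> str:
    """Return the highest severity whose threshold is exceeded."""
    # Check the most severe threshold first so it takes precedence.
    if error is not None and actual_percent > error:
        return "error"
    if warning is not None and actual_percent > warning:
        return "warning"
    return "valid"


print(evaluate_max_percent(10.0, warning=16.0, error=20.0))  # valid
print(evaluate_max_percent(16.5, warning=16.0, error=20.0))  # warning
print(evaluate_max_percent(21.0, warning=16.0, error=20.0))  # error
```

Checking the error threshold before the warning threshold means a column with 21% of nulls is reported once, at the error level, instead of twice.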

## Detect empty columns
We say a column is empty when it has no values and all rows contain only nulls.

DQOps detects empty columns using the [*not_nulls_count*](../checks/column/nulls/not-nulls-count.md)
data quality check with the default configuration.

The [*not_nulls_count*](../checks/column/nulls/not-nulls-count.md) check has a rule parameter *min_count* that verifies a minimum number of rows containing a value.
The default value is 1, so the check fails for an empty column, which has no rows with a non-null value.

### Detect empty columns in UI
The [*not_nulls_count*](../checks/column/nulls/not-nulls-count.md) check configured with the default settings finds empty columns.
The following screen shows a valid column that was not empty.

![Detect empty columns in tables with a data quality check](https://dqops.com/docs/images/concepts/categories-of-data-quality-checks/detect-empty-tables-check-min.png){ loading=lazy }

### Detect empty columns in YAML
The configuration of the [*not_nulls_count*](../checks/column/nulls/not-nulls-count.md) check that detects empty columns is shown below.

``` { .yaml linenums="1" hl_lines="13-15" }
# yaml-language-server: $schema=https://cloud.dqops.com/dqo-yaml-schema/TableYaml-schema.json
apiVersion: dqo/v1
kind: table
spec:
  columns:
    street_number:
      type_snapshot:
        column_type: STRING
        nullable: true
      monitoring_checks:
        daily:
          nulls:
            daily_not_nulls_count:
              error:
                min_count: 1
```


## Detect a minimum number of non-null values
The configuration of the [*not_nulls_count*](../checks/column/nulls/not-nulls-count.md) check is easy to adapt to detect columns
that should have at least a given number of non-null values.

The minimum accepted number of non-null values is configured by setting the *min_count* parameter to a desired count.

### Detect in UI
The following example shows how to assert that a column contains at least 1500000 non-null values.

![Detect columns with too few non-null values in a column](https://dqops.com/docs/images/concepts/categories-of-data-quality-checks/detect-columns-with-too-little-not-null-values-min.png){ loading=lazy }

### Detect in YAML
The configuration of the [*not_nulls_count*](../checks/column/nulls/not-nulls-count.md) check in a YAML file only uses a different value of the *min_count* parameter.

``` { .yaml linenums="1" hl_lines="15" }
# yaml-language-server: $schema=https://cloud.dqops.com/dqo-yaml-schema/TableYaml-schema.json
apiVersion: dqo/v1
kind: table
spec:
  columns:
    street_number:
      type_snapshot:
        column_type: STRING
        nullable: true
      monitoring_checks:
        daily:
          nulls:
            daily_not_nulls_count:
              error:
                min_count: 1500000
```

## Anomalies of data completeness
An unexpected change in the percentage of null values is a noticeable data anomaly.
DQOps uses time series anomaly detection to identify these issues.
The [*nulls_percent_anomaly*](../checks/column/nulls/nulls-percent-anomaly.md) check measures the percentage of null values for each day and raises data quality issues for anomalies.

The [*nulls_percent_anomaly*](../checks/column/nulls/nulls-percent-anomaly.md) check supports two methods of operation.

- A [daily monitoring](../dqo-concepts/definition-of-data-quality-checks/data-observability-monitoring-checks.md#daily-monitoring-checks)
check measures the percentage of all rows in a monitored table that contain null values.

- A [daily partition](../dqo-concepts/definition-of-data-quality-checks/partition-checks.md#daily-partitioning)
check analyzes every daily partition.
The check raises a data quality issue when the percentage of null values changes unexpectedly between daily partitions.
This can happen when the transformation logic in the data pipeline was recently modified
and the new transformation fails to convert some values.
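
The idea of flagging an unexpected change in a daily series of null percentages can be sketched with a simple z-score test. This guide does not describe the algorithm that DQOps uses, so the code below is a stand-in technique and not the DQOps implementation. The function name and the `max_z_score` parameter are hypothetical, and `max_z_score` is not the same thing as the *anomaly_percent* rule parameter.

```python
from statistics import mean, stdev
from typing import Sequence


def is_null_percent_anomaly(
    history: Sequence[float], latest: float, max_z_score: float = 3.0
) -> bool:
    """Flag `latest` when it lies far from the historical mean."""
    if len(history) < 2:
        # Not enough data to estimate the spread of past readings.
        return False
    center = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat history: any different value is unexpected.
        return latest != center
    return abs(latest - center) / spread > max_z_score


# Hypothetical daily percentages of null values in one column.
history = [16.1, 16.4, 15.9, 16.2, 16.0, 16.3, 16.1]
print(is_null_percent_anomaly(history, 16.2))  # False, within the usual range
print(is_null_percent_anomaly(history, 24.0))  # True, a sudden jump
```

A fixed *max_percent* threshold of 20% would also catch the jump to 24%, but it would miss a drop to 5%, which the comparison against recent history reports as well.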

### Configuring completeness anomaly detection in UI
The following sample shows how to configure the
[*daily_partition_nulls_percent_anomaly*](../checks/column/nulls/nulls-percent-anomaly.md#daily-partition-nulls-percent-anomaly)
check for detecting null anomalies across daily partitions.
The configuration of the
[*daily_nulls_percent_anomaly*](../checks/column/nulls/nulls-percent-anomaly.md#daily-nulls-percent-anomaly) check
that monitors the whole table every day is the same,
but the [*daily_nulls_percent_anomaly*](../checks/column/nulls/nulls-percent-anomaly.md#daily-nulls-percent-anomaly) check
requires 30 days of monitoring before it will show any results.

![Detect anomalies in the percentage of null values in a column](https://dqops.com/docs/images/concepts/categories-of-data-quality-checks/detect-anomalies-in-percent-of-null-values-in-date-partitions-min.png){ loading=lazy }

### Configuring completeness anomaly detection in YAML
The [*nulls_percent_anomaly*](../checks/column/nulls/nulls-percent-anomaly.md) check only requires the configuration of the *anomaly_percent* parameters
for each [issue severity level](../dqo-concepts/definition-of-data-quality-checks/index.md#issue-severity-levels).

This example shows the configuration of a daily monitoring check that measures the percentage of null values in the whole table.

``` { .yaml linenums="1" hl_lines="13-15" }
# yaml-language-server: $schema=https://cloud.dqops.com/dqo-yaml-schema/TableYaml-schema.json
apiVersion: dqo/v1
kind: table
spec:
  columns:
    street_number:
      type_snapshot:
        column_type: STRING
        nullable: true
      monitoring_checks:
        daily:
          nulls:
            daily_nulls_percent_anomaly:
              error:
                anomaly_percent: 0.5
```

The configuration of the [*daily_partition_nulls_percent_anomaly*](../checks/column/nulls/nulls-percent-anomaly.md#daily-partition-nulls-percent-anomaly)
check for analyzing daily partitions is similar.

``` { .yaml linenums="1" hl_lines="18-22" }
# yaml-language-server: $schema=https://cloud.dqops.com/dqo-yaml-schema/TableYaml-schema.json
apiVersion: dqo/v1
kind: table
spec:
  timestamp_columns:
    partition_by_column: created_date
  incremental_time_window:
    daily_partitioning_recent_days: 7
    monthly_partitioning_recent_months: 1
  columns:
    street_number:
      type_snapshot:
        column_type: STRING
        nullable: true
      partitioned_checks:
        daily:
          nulls:
            daily_partition_nulls_percent_anomaly:
              warning:
                anomaly_percent: 3.0
              error:
                anomaly_percent: 0.5
```
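
A partition check produces one measure for each daily partition instead of one for the whole table. The sketch below shows that grouping on in-memory rows. It only illustrates the concept: DQOps runs the aggregation in SQL on the data source, the function `nulls_percent_by_day` is hypothetical, and the column names are borrowed from the YAML example above.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, Optional, Tuple

Row = Tuple[date, Optional[str]]  # (created_date, street_number)


def nulls_percent_by_day(rows: Iterable[Row]) -> Dict[date, float]:
    """Return the percentage of null street numbers per daily partition."""
    totals: Dict[date, int] = defaultdict(int)
    nulls: Dict[date, int] = defaultdict(int)
    for created_date, street_number in rows:
        totals[created_date] += 1
        if street_number is None:
            nulls[created_date] += 1
    return {day: 100.0 * nulls[day] / totals[day] for day in sorted(totals)}


# Hypothetical rows: the second day has far more nulls than the first.
rows = [
    (date(2024, 2, 5), "12"), (date(2024, 2, 5), None),
    (date(2024, 2, 5), "7"), (date(2024, 2, 5), "3"),
    (date(2024, 2, 6), None), (date(2024, 2, 6), None),
    (date(2024, 2, 6), "41"), (date(2024, 2, 6), None),
]
per_day = nulls_percent_by_day(rows)
print(per_day)
```

Here the partition for 2024-02-05 has 25% of nulls and the partition for 2024-02-06 has 75%, a difference that a whole-table percentage of 50% would hide.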

## Completeness data quality issues
The data quality dashboards for monitoring null values are found in the *Data Quality Dimensions -> Completeness* folder.

### Whole table monitoring dashboards
The **Current completeness issues on columns** dashboard shows an aggregated view of active data quality issues with empty or incomplete columns.

This dashboard shows only the status of the most recent evaluation of all data quality checks for null values.
When the data is fixed in the data source, and the failed data quality check is rerun,
the issue will disappear from the dashboard.

![Data quality dashboard showing empty and incomplete detected by data quality checks](https://dqops.com/docs/images/concepts/categories-of-data-quality-checks/empty-and-incomplete-columns-issues-shown-on-monitoring-dashboard-min.png){ loading=lazy }

### Partition monitoring dashboards
DQOps has a separate set of data quality dashboards for partitioned data.
These dashboards are found in the "Partitions" folder. They show issues for every daily or monthly partition.

The top section of the partition's *Current completeness issues on columns* dashboard shows the data sources,
affected tables, and the types of completeness issues.

![Partitions with nulls shown on a dashboard - the filters](https://dqops.com/docs/images/concepts/categories-of-data-quality-checks/partition-completeness-status-dashboard-top-min.png){ loading=lazy }

The next section shows more details about incomplete or empty columns.
The status identifies the highest severity issue by color.

![List of columns in a partitioned table that have incomplete data shown on a data quality dashboard](https://dqops.com/docs/images/concepts/categories-of-data-quality-checks/completeness-issues-dashboard-column-details-min.png){ loading=lazy }

## Detecting nulls issues
How to detect nulls data quality issues.

## Use cases
| **Name of the example** | **Description** |
Original file line number Diff line number Diff line change
@@ -36,12 +36,12 @@ const DeleteOnlyDataDialog = ({
selectedReference,
checkTypes
}: DeleteOnlyDataDialogProps) => {
- const [startDate, setStartDate] = useState(new Date());
+ const [startDate, setStartDate] = useState(new Date(new Date().getTime() - 1000 * 3600 * 24 * 30));
const [endDate, setEndDate] = useState(new Date());
- const [mode, setMode] = useState('all');
+ const [mode, setMode] = useState('part');
const [params, setParams] = useState({
deleteErrors: true,
- deleteStatistics: true,
+ deleteStatistics: false,
deleteCheckResults: true,
deleteSensorReadouts: true
});
Original file line number Diff line number Diff line change
@@ -181,6 +181,6 @@ public boolean isStandard() {
*/
@Override
public DefaultDataQualityDimensions getDefaultDataQualityDimension() {
- return DefaultDataQualityDimensions.Consistency;
+ return DefaultDataQualityDimensions.Completeness;
}
}
Original file line number Diff line number Diff line change
@@ -169,6 +169,6 @@ protected ChildHierarchyNodeFieldMap getChildMap() {
*/
@Override
public DefaultDataQualityDimensions getDefaultDataQualityDimension() {
- return DefaultDataQualityDimensions.Consistency;
+ return DefaultDataQualityDimensions.Completeness;
}
}
Original file line number Diff line number Diff line change
@@ -169,6 +169,6 @@ protected ChildHierarchyNodeFieldMap getChildMap() {
*/
@Override
public DefaultDataQualityDimensions getDefaultDataQualityDimension() {
- return DefaultDataQualityDimensions.Consistency;
+ return DefaultDataQualityDimensions.Completeness;
}
}
Original file line number Diff line number Diff line change
@@ -169,6 +169,6 @@ protected ChildHierarchyNodeFieldMap getChildMap() {
*/
@Override
public DefaultDataQualityDimensions getDefaultDataQualityDimension() {
- return DefaultDataQualityDimensions.Consistency;
+ return DefaultDataQualityDimensions.Completeness;
}
}
Original file line number Diff line number Diff line change
@@ -169,6 +169,6 @@ protected ChildHierarchyNodeFieldMap getChildMap() {
*/
@Override
public DefaultDataQualityDimensions getDefaultDataQualityDimension() {
- return DefaultDataQualityDimensions.Consistency;
+ return DefaultDataQualityDimensions.Completeness;
}
}