
Commit

Put SQL permission statements into closed block
pflooky committed Aug 25, 2023
1 parent 1891abf commit afd490c
Showing 12 changed files with 280 additions and 298 deletions.
54 changes: 27 additions & 27 deletions docs/setup/configuration.md
@@ -10,17 +10,17 @@ seen [here](../sample/docker/data/custom/application.conf).

Flags are used to control which processes are executed when you run Data Caterer.

| Config | Default | Paid | Description |
|------------------------------|---------|------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| enableGenerateData | true | N | Enable/disable data generation |
| enableCount | true | N | Count the number of records generated. Can be disabled to improve performance |
| enableFailOnError | true | N | Whilst saving generated data, if there is an error, it will stop any further data from being generated |
| enableSaveSinkMetadata | true | N | Enable/disable HTML reports summarising data generated, metadata of data generated (if `enableSinkMetadata` is enabled) and validation results (if `enableValidation` is enabled) |
| enableSinkMetadata | true | N | Run data profiling for the generated data. Shown in HTML reports if `enableSaveSinkMetadata` is enabled |
| enableValidation | false | N | Run validations as described in plan. Results can be viewed from logs or from HTML report if `enableSaveSinkMetadata` is enabled |
| enableGeneratePlanAndTasks | false | Y | Enable/disable plan and task auto generation based off data source connections |
| enableRecordTracking | false | Y | Enable/disable which data records have been generated for any data source |
| enableDeleteGeneratedRecords | false | Y | Delete all generated records based off record tracking (if `enableRecordTracking` has been set to true) |
| Config | Default | Paid | Description |
|--------------------------------|---------|------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `enableGenerateData` | true | N | Enable/disable data generation |
| `enableCount` | true | N | Count the number of records generated. Can be disabled to improve performance |
| `enableFailOnError` | true | N | Whilst saving generated data, if there is an error, it will stop any further data from being generated |
| `enableSaveSinkMetadata` | true | N | Enable/disable HTML reports summarising data generated, metadata of data generated (if `enableSinkMetadata` is enabled) and validation results (if `enableValidation` is enabled) |
| `enableSinkMetadata` | true | N | Run data profiling for the generated data. Shown in HTML reports if `enableSaveSinkMetadata` is enabled |
| `enableValidation` | false | N | Run validations as described in plan. Results can be viewed from logs or from HTML report if `enableSaveSinkMetadata` is enabled |
| `enableGeneratePlanAndTasks` | false | Y | Enable/disable plan and task auto generation based off data source connections |
| `enableRecordTracking` | false | Y | Enable/disable which data records have been generated for any data source |
| `enableDeleteGeneratedRecords` | false | Y | Delete all generated records based off record tracking (if `enableRecordTracking` has been set to true) |
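
For reference, a minimal sketch of how a few of these flags might be set in a custom `application.conf`, assuming they sit under a `flags` block as in the sample configuration linked above (block name and key placement should be checked against your own config):

```
flags {
    # keep data generation on, but skip counting records for faster runs
    enableGenerateData = true
    enableCount = false
    # produce the HTML report and run validations defined in the plan
    enableSaveSinkMetadata = true
    enableValidation = true
}
```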

## Folders

@@ -29,14 +29,14 @@ records generated.

These folder pathways can be defined as a cloud storage pathway (i.e. `s3a://my-bucket/task`).

| Config | Default | Paid | Description |
|--------------------------------|-----------------------------------------|------|---------------------------------------------------------------------------------------------------------------------|
| planFilePath | /opt/app/plan/customer-create-plan.yaml | N | Plan file path to use when generating and/or validating data |
| taskFolderPath | /opt/app/task | N | Task folder path that contains all the task files (can have nested directories) |
| validationFolderPath | /opt/app/validation | N | Validation folder path that contains all the validation files (can have nested directories) |
| generatedDataResultsFolderPath | /opt/app/html | N | Where HTML reports get generated that contain information about data generated along with any validations performed |
| generatedPlanAndTaskFolderPath | /tmp | Y | Folder path where generated plan and task files will be saved |
| recordTrackingFolderPath | /opt/app/record-tracking | Y | Where record tracking parquet files get saved |
| Config | Default | Paid | Description |
|----------------------------------|-----------------------------------------|------|---------------------------------------------------------------------------------------------------------------------|
| `planFilePath` | /opt/app/plan/customer-create-plan.yaml | N | Plan file path to use when generating and/or validating data |
| `taskFolderPath` | /opt/app/task | N | Task folder path that contains all the task files (can have nested directories) |
| `validationFolderPath` | /opt/app/validation | N | Validation folder path that contains all the validation files (can have nested directories) |
| `generatedDataResultsFolderPath` | /opt/app/html | N | Where HTML reports get generated that contain information about data generated along with any validations performed |
| `generatedPlanAndTaskFolderPath` | /tmp | Y | Folder path where generated plan and task files will be saved |
| `recordTrackingFolderPath` | /opt/app/record-tracking | Y | Where record tracking parquet files get saved |
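
As an illustration only, these paths might be overridden in a custom `application.conf` along the following lines, assuming they live under a `folders` block; the custom paths and the S3 bucket name are placeholders, not values from this commit:

```
folders {
    planFilePath = "/opt/app/custom/plan/my-plan.yaml"
    taskFolderPath = "/opt/app/custom/task"
    validationFolderPath = "/opt/app/custom/validation"
    # cloud storage pathways are also accepted, e.g. an S3 bucket
    recordTrackingFolderPath = "s3a://my-bucket/record-tracking"
}
```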

## Metadata

@@ -51,11 +51,11 @@ You may face issues if the number of records in the data source is large as data
Similarly, it can be expensive
when analysing the generated data if the number of records generated is large.

| Config | Default | Paid | Description |
|------------------------------------|---------|------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| numRecordsFromDataSource | 10000 | Y | Number of records read in from the data source |
| numRecordsForAnalysis | 10000 | Y | Number of records used for data profiling from the records gathered in `numRecordsFromDataSource` |
| oneOfDistinctCountVsCountThreshold | 0.1 | Y | Threshold ratio to determine if a field is of type `oneOf` (i.e. a field called `status` that only contains `open` or `closed`. Distinct count = 2, total count = 10, ratio = 2 / 10 = 0.2 therefore marked as `oneOf`) |
| Config | Default | Paid | Description |
|--------------------------------------|---------|------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `numRecordsFromDataSource` | 10000 | Y | Number of records read in from the data source |
| `numRecordsForAnalysis` | 10000 | Y | Number of records used for data profiling from the records gathered in `numRecordsFromDataSource` |
| `oneOfDistinctCountVsCountThreshold` | 0.1 | Y | Threshold ratio to determine if a field is of type `oneOf` (i.e. a field called `status` that only contains `open` or `closed`. Distinct count = 2, total count = 10, ratio = 2 / 10 = 0.2 therefore marked as `oneOf`) |
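
A hedged sketch of tuning these values, assuming they sit under a `metadata` block in `application.conf` (the numbers below are illustrative):

```
metadata {
    # sample fewer records from large data sources to keep profiling cheap
    numRecordsFromDataSource = 1000
    numRecordsForAnalysis = 1000
    # raise the threshold so only very low-cardinality fields become oneOf
    oneOfDistinctCountVsCountThreshold = 0.3
}
```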

## Generation

@@ -64,6 +64,6 @@ sources prone to failure under load.
To help alleviate these issues or speed up performance, you can control the number of records that get generated in each
batch.

| Config | Default | Paid | Description |
|--------------------|---------|------|-----------------------------------------------------------------|
| numRecordsPerBatch | 100000 | N | Number of records across all data sources to generate per batch |
| Config | Default | Paid | Description |
|----------------------|---------|------|-----------------------------------------------------------------|
| `numRecordsPerBatch` | 100000 | N | Number of records across all data sources to generate per batch |
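
For example, a smaller batch size could be set as below, assuming the key sits under a `generation` block in `application.conf`:

```
generation {
    # smaller batches reduce memory pressure and load on fragile sinks
    numRecordsPerBatch = 10000
}
```
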
86 changes: 41 additions & 45 deletions docs/setup/connection/connection.md
@@ -22,16 +22,17 @@ All connection details follow the same pattern.
}
```
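
As a concrete, illustrative instance of that pattern, a JDBC Postgres connection named `customerPostgres` might look like the sketch below; the connection name, credentials and driver class are placeholders, not values taken from this commit:

```
jdbc {
    customerPostgres {
        # Spark JDBC connection options for a local Postgres instance
        url = "jdbc:postgresql://localhost:5432/customer"
        user = "postgres"
        password = "postgres"
        driver = "org.postgresql.Driver"
    }
}
```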

When defining a configuration value that can be defined by a system property or environment variable at runtime, you can
define that via the following:

```
url = "localhost"
url = ${?POSTGRES_URL}
```

The above defines that if there is a system property or environment variable named `POSTGRES_URL`, then that value will
be used for the `url`, otherwise, it will default to `localhost`.
!!! info "Overriding configuration"
When defining a configuration value that can be defined by a system property or environment variable at runtime, you can
define that via the following:

```
url = "localhost"
url = ${?POSTGRES_URL}
```

The above defines that if there is a system property or environment variable named `POSTGRES_URL`, then that value will
be used for the `url`, otherwise, it will default to `localhost`.
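
Building on the illustrative `customerPostgres` connection sketched earlier, several values could be made overridable at runtime in the same way; this is only a sketch, and the `POSTGRES_USER`/`POSTGRES_PASSWORD` variable names are assumptions rather than names defined by this commit:

```
jdbc {
    customerPostgres {
        # each value falls back to the literal if the variable is not set
        url = "jdbc:postgresql://localhost:5432/customer"
        url = ${?POSTGRES_URL}
        user = "postgres"
        user = ${?POSTGRES_USER}
        password = "postgres"
        password = ${?POSTGRES_PASSWORD}
    }
}
```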

### Example task per data source

@@ -126,46 +127,43 @@ jdbc {
}
```

Ensure that the user has write permission so it is able to save the table to the target tables.
<details>
Ensure that the user has write permission, so it is able to save the table to the target tables.

```sql
GRANT INSERT ON <schema>.<table> TO <user>;
```
??? tip "SQL Permission Statements"

</details>
```sql
GRANT INSERT ON <schema>.<table> TO <user>;
```

#### Postgres

##### Permissions

Following permissions are required when generating plan and tasks:
<details>

```sql
GRANT SELECT ON information_schema.tables TO < user >;
GRANT SELECT ON information_schema.columns TO < user >;
GRANT SELECT ON information_schema.key_column_usage TO < user >;
GRANT SELECT ON information_schema.table_constraints TO < user >;
GRANT SELECT ON information_schema.constraint_column_usage TO < user >;
```

</details>
??? tip "SQL Permission Statements"

```sql
GRANT SELECT ON information_schema.tables TO < user >;
GRANT SELECT ON information_schema.columns TO < user >;
GRANT SELECT ON information_schema.key_column_usage TO < user >;
GRANT SELECT ON information_schema.table_constraints TO < user >;
GRANT SELECT ON information_schema.constraint_column_usage TO < user >;
```

#### MySQL

##### Permissions

Following permissions are required when generating plan and tasks:
<details>

```sql
GRANT SELECT ON information_schema.columns TO < user >;
GRANT SELECT ON information_schema.statistics TO < user >;
GRANT SELECT ON information_schema.key_column_usage TO < user >;
```
??? tip "SQL Permission Statements"

</details>
```sql
GRANT SELECT ON information_schema.columns TO < user >;
GRANT SELECT ON information_schema.statistics TO < user >;
GRANT SELECT ON information_schema.key_column_usage TO < user >;
```

### Cassandra

@@ -189,24 +187,22 @@ org.apache.spark.sql.cassandra {

##### Permissions

Ensure that the user has write permission so it is able to save the table to the target tables.
<details>
Ensure that the user has write permission, so it is able to save the table to the target tables.

```sql
GRANT INSERT ON <schema>.<table> TO <user>;
```
??? tip "CQL Permission Statements"

</details>
```sql
GRANT INSERT ON <schema>.<table> TO <user>;
```

Following permissions are required when generating plan and tasks:
<details>

```sql
GRANT SELECT ON system_schema.tables TO <user>;
GRANT SELECT ON system_schema.columns TO <user>;
```
??? tip "CQL Permission Statements"

</details>
```sql
GRANT SELECT ON system_schema.tables TO <user>;
GRANT SELECT ON system_schema.columns TO <user>;
```

### Kafka

3 changes: 3 additions & 0 deletions mkdocs.yml
@@ -40,6 +40,9 @@ markdown_extensions:
  - attr_list
  - def_list
  - md_in_html
  - admonition
  - pymdownx.details
  - pymdownx.superfences
  - pymdownx.emoji:
      emoji_index: !!python/name:materialx.emoji.twemoji
      emoji_generator: !!python/name:materialx.emoji.to_svg