Feat/add vlmd extract #14

Merged
merged 51 commits into from
Jan 21, 2025
2f29ca3
(HP-1727): add validation of json object for extract
george42-ctds Oct 24, 2024
8a96343
(HP-1727): move get_schema to after checking input
george42-ctds Oct 29, 2024
943154e
(HP-1727): move some utils from validate to parent directory
george42-ctds Oct 29, 2024
11084ba
(HP-1727): handle array data in csv validation
george42-ctds Oct 29, 2024
e4bfe73
(HP-1727): add file_utils
george42-ctds Oct 29, 2024
10e4a56
(HP-1727): update config for extract
george42-ctds Oct 29, 2024
c7968cd
(HP-1727): remove unused import
george42-ctds Oct 29, 2024
29cf039
(HP-1727): add mappings
george42-ctds Oct 29, 2024
030c263
(HP-1727): add extract utils
george42-ctds Oct 29, 2024
5da921a
(HP-1727): add extract module
george42-ctds Oct 30, 2024
c96b520
(HP-1727): fix ValidationError handling
george42-ctds Oct 31, 2024
ee6db2a
(HP-1727): clean up exceptions
george42-ctds Oct 31, 2024
e552a56
(HP-1727): add unit tests for conversion
george42-ctds Oct 31, 2024
6fbf7f8
(HP-1727): add unit tests
george42-ctds Oct 31, 2024
0d09bc4
(HP-1727): change case in logger name
george42-ctds Oct 31, 2024
ad13dfe
(HP-1727): update READMEs
george42-ctds Oct 31, 2024
df031d3
(HP-1727): add unit tests for conversion
george42-ctds Oct 31, 2024
32d7281
(HP-1727): add unit test for file writing
george42-ctds Nov 1, 2024
9f88142
Update contact e-mail, re-lock
george42-ctds Nov 5, 2024
51acbf8
(HP-1827): update constraint type from string to integer in csv schema
george42-ctds Dec 17, 2024
9ad2e5e
(HP-1827): add new test data
george42-ctds Dec 17, 2024
af516d5
(HP-1827): handle empty output_dir
george42-ctds Dec 17, 2024
485ef79
(HP-1827): move read method to utils
george42-ctds Dec 17, 2024
83fcd21
(HP-1827): add new validate_extract module
george42-ctds Dec 17, 2024
2b559e2
(HP-1827): add new tests for validate_extract module
george42-ctds Dec 17, 2024
e975cf8
(HP-1827): add fixes for existing extract and validate modules
george42-ctds Dec 17, 2024
7c120de
(HP-1827): update vlmd README to describe combined validate_extract
george42-ctds Dec 17, 2024
caba140
(HP-1827) raise error for zero-length converted csv data
george42-ctds Dec 19, 2024
25b822e
(HP-1827) remove unneeded test code
george42-ctds Dec 19, 2024
e19b5a9
(HP-1827) remove replicated extract and validate code
george42-ctds Dec 19, 2024
d9bc46e
(HP-1827) update import to use new combined module
george42-ctds Dec 19, 2024
dc854ee
(HP-1827) removed duplicated test
george42-ctds Dec 19, 2024
47acef8
(HP-1827) recover extract, leave logic in validate
george42-ctds Dec 23, 2024
10178ee
(HP-1827) recover tests for extract
george42-ctds Dec 23, 2024
8796a5e
(HP-1827) recover extract in README
george42-ctds Dec 23, 2024
79a8a01
(HP-1827) remove unused imports
george42-ctds Dec 23, 2024
46b08f9
(HP-1827) add test for additional properties
george42-ctds Dec 23, 2024
2b0b51e
(HP-1827) add input file for new test
george42-ctds Dec 23, 2024
dc2c0d1
(HP-1827) Change variable names to snake case
george42-ctds Jan 7, 2025
5f816fa
(HP-1827) Change variable names to snake case
george42-ctds Jan 7, 2025
cc771ed
(HP-1827) raise error for non-integer array index, add unit test
george42-ctds Jan 8, 2025
330463a
(HP-1827) use regex for checking array indexing in prop names
george42-ctds Jan 8, 2025
b1aff01
(HP-1827) initialize is_one_unique
george42-ctds Jan 13, 2025
eeb5d1f
(HP-1827) add unit tests
george42-ctds Jan 13, 2025
a7051fe
(HP-1827) change variables to snake case
george42-ctds Jan 14, 2025
959f736
(HP-1827) fix bug in conversion of csv 'custom', update tests
george42-ctds Jan 15, 2025
d5ca096
(HP-1827) move parse block to function, add unit test
george42-ctds Jan 15, 2025
d868716
(HP-1827) change key name to snake case
george42-ctds Jan 15, 2025
54b8af2
(HP-1827) removed 'csv_validator', now uses 'jsonschema.Validate'
george42-ctds Jan 15, 2025
1c4e3ae
(HP1827) re-order imports
george42-ctds Jan 16, 2025
112eba5
(HP-1827) minor reformatting
george42-ctds Jan 16, 2025
4 changes: 2 additions & 2 deletions README.md
@@ -15,9 +15,9 @@ In the notebooks directory there are jupyter notebooks that may be used to downl

These notebooks perform optimally within a HEAL Gen3 Workspace and the notebooks will be automatically installed to a user's workspace when the workspace is initiated. However, you may also use these notebooks on your local machine.

### VLMD validation
### VLMD extraction and validation

The [VLMD validation docs](heal/vlmd/README.md) describe how to use the SDK for validating VLMD dictionaries.
The [VLMD docs](heal/vlmd/README.md) describe how to use the SDK for extracting and validating VLMD dictionaries.

### Run tests

57 changes: 48 additions & 9 deletions heal/vlmd/README.md
@@ -2,32 +2,71 @@

## VLMD validation

This module validates VLMD data dictionaries against stored schemas.
This module validates VLMD data dictionaries against stored schemas. The `vlmd_validate()` method
will attempt an extraction as part of the validation process.

The `vlmd_validate()` method raises a `jsonschema.ValidationError` for an invalid input file.
The `vlmd_validate()` method raises a `jsonschema.ValidationError` for an invalid input file and
raises an `ExtractionError` if the input file cannot be converted.

Example validation code:

```
from jsonschema import ValidationError

from heal.vlmd import vlmd_validate
from heal.vlmd import vlmd_validate, ExtractionError

input_file = "vlmd_dd.json"
try:
    vlmd_validate(input_file)
except ValidationError as e:
    # handle validation error

except ValidationError as v_err:
    # handle validation error

except ExtractionError as e_err:
    # handle extraction error

```

## VLMD extract

The extract module implements extraction and conversion of dictionaries into different formats.

The current formats are csv, json, and tsv.

The `vlmd_extract()` method raises a `jsonschema.ValidationError` for an invalid input file
and raises an `ExtractionError` for any other type of error.

Example extraction code:

```
from jsonschema import ValidationError

from heal.vlmd import vlmd_extract, ExtractionError

try:
    vlmd_extract("vlmd_for_extraction.csv", output_dir="./output")

except ValidationError as v_err:
    # handle validation error

except ExtractionError as e_err:
    # handle extraction error
```

The above will write a HEAL-compliant VLMD json dictionary to

`output/heal-dd_vlmd_for_extraction.json`
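
As a rough illustration of the file name shown above, the output path appears to combine the `heal-dd` prefix, the input file stem, and the output extension. The helper `build_output_path` below is hypothetical and not part of the SDK; it only mirrors the example file name, on the assumption that an underscore joins the prefix and the stem.

```python
from pathlib import Path

# Assumed to match OUTPUT_FILE_PREFIX in heal/vlmd/config.py.
OUTPUT_FILE_PREFIX = "heal-dd"


def build_output_path(input_file, output_dir=".", output_type="json"):
    """Hypothetical helper: derive the output path from the input file name."""
    stem = Path(input_file).stem  # "vlmd_for_extraction.csv" -> "vlmd_for_extraction"
    return Path(output_dir) / f"{OUTPUT_FILE_PREFIX}_{stem}.{output_type}"


path = build_output_path("vlmd_for_extraction.csv", output_dir="./output")
print(path.as_posix())
```

The actual naming logic lives in the extract module and its file utilities, so treat this only as a way to predict where the converted dictionary is likely to land.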

### Adding new validators
## Adding new file types for extraction and validation

The module currently validates the following types of dictionaries: csv, json, tsv.
The above modules currently handle the following types of dictionaries: csv, json, tsv.

To add code for a new dictionary file type:

* Create a new schema for the data type or validate against the existing json schema
* Create a new validator module for the new file type
* Call the new module from the `validator.py` module
* If possible, create a new validator module for the new file type
* Call the new validator module from the `validate.py` module
* Create a new extractor module for the new file type, possibly using `pandas`
* Call the new extractor module from the `conversion.py` module
* Add new file writing utilities if saving converted dictionaries in the new format
* Create unit tests as needed for new code
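
The extractor steps above might look like the following sketch. It assumes that `conversion.py` selects an extractor through two lookup tables, one from file extension to input type and one from input type to function. The `.xlsx` extension, the `"xlsx-data-dict"` key, `convert_datadict_xlsx`, and `detect_and_convert` are hypothetical names chosen for illustration.

```python
from pathlib import Path


def convert_datadict_xlsx(input_filepath, data_dictionary_props):
    """Hypothetical extractor stub; a real one would parse the file contents."""
    return {"template_json": {"fields": [], **data_dictionary_props}}


# Stand-ins for the lookup tables in conversion.py (existing extractors omitted).
ext_map = {".csv": "csv-data-dict", ".json": "json-template"}
choice_fxn = {}

# Register the new file type in both tables.
ext_map[".xlsx"] = "xlsx-data-dict"
choice_fxn["xlsx-data-dict"] = convert_datadict_xlsx


def detect_and_convert(input_filepath, props=None):
    """Look up the extractor by file extension and call it."""
    path = Path(input_filepath)
    input_type = ext_map.get(path.suffix)
    if input_type not in choice_fxn:
        raise ValueError(f"Unexpected input_type '{input_type}'")
    return choice_fxn[input_type](path, props or {})
```

An extension missing from the first table resolves to `None` and is rejected, so a new file type needs an entry in both tables before it can be converted.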
5 changes: 4 additions & 1 deletion heal/vlmd/__init__.py
@@ -1 +1,4 @@
from heal.vlmd.validate.validate import vlmd_validate
from heal.vlmd.validate.validate import vlmd_validate, ExtractionError

# place 'extract' import after 'validate' import
from heal.vlmd.extract.extract import vlmd_extract
14 changes: 14 additions & 0 deletions heal/vlmd/config.py
@@ -1,12 +1,26 @@
import json

# file prefix
OUTPUT_FILE_PREFIX = "heal-dd"

# file suffixes
ALLOWED_INPUT_TYPES = ["csv", "tsv", "json"]
ALLOWED_FILE_TYPES = ["auto"] + ALLOWED_INPUT_TYPES
ALLOWED_SCHEMA_TYPES = ["auto", "csv", "json", "tsv"]
ALLOWED_OUTPUT_TYPES = ["csv", "json"]

# schemas
csv_schema_file = "heal/vlmd/schemas/heal_csv.json"
with open(csv_schema_file, "r") as f:
    CSV_SCHEMA = json.load(f)

json_schema_file = "heal/vlmd/schemas/heal_json.json"
with open(json_schema_file, "r") as f:
    JSON_SCHEMA = json.load(f)

# schema
JSON_SCHEMA_VERSION = JSON_SCHEMA.get("version", "0.3.2")
TOP_LEVEL_PROPS = {
    "schemaVersion": JSON_SCHEMA_VERSION,
    "title": "HEAL Data Dictionary",
}
91 changes: 91 additions & 0 deletions heal/vlmd/extract/conversion.py
@@ -0,0 +1,91 @@
from functools import partial
from pathlib import Path

from cdislogging import get_logger

from heal.vlmd import mappings
from heal.vlmd.config import JSON_SCHEMA, TOP_LEVEL_PROPS
from heal.vlmd.extract.csv_dict_conversion import convert_datadict_csv
from heal.vlmd.extract.json_dict_conversion import convert_template_json
from heal.vlmd.utils import clean_json_fields

logger = get_logger("vlmd-conversion", log_level="debug")

choice_fxn = {
    "csv-data-dict": partial(
        convert_datadict_csv,
        rename_map=mappings.rename_map,
        recode_map=mappings.recode_map,
    ),
    "json-template": convert_template_json,
}

ext_map = {
    ".csv": "csv-data-dict",
    ".json": "json-template",
}


def _detect_input_type(filepath, ext_to_input_type=ext_map):
    ext = filepath.suffix
    input_type = ext_to_input_type.get(ext, None)
    return input_type


def convert_to_vlmd(
    input_filepath,
    input_type=None,
    data_dictionary_props=None,
) -> dict:
    """
    Converts a data dictionary to HEAL compliant json or csv format.

    Args
        input_filepath (str): Path to input file. Currently converts data dictionaries in csv, json, and tsv.
        input_type (str): The input type. See keys of 'choice_fxn' dict for options, currently:
            csv-data-dict, json-template.
        data_dictionary_props (dict):
            The other data-dictionary level properties. By default, will give the data_dictionary `title` property as the file name stem.

    Returns
        Dictionary with:
            1. csvtemplated array of fields.
            2. jsontemplated data dictionary object as specified by an originally drafted design doc.
                That is, a dictionary with title:<title>,description:<description>,data_dictionary:<fields>
                where data dictionary is an array of fields as specified by the JSON schema.

    """

    input_filepath = Path(input_filepath)

    input_type = input_type or _detect_input_type(input_filepath)
    logger.debug(f"Converting file '{input_filepath}' of input_type '{input_type}'")
    if input_type not in choice_fxn.keys():
        logger.error(f"Unexpected input type {input_type}")
        raise ValueError(
            f"Unexpected input_type '{input_type}', not in {choice_fxn.keys()}"
        )

    # get data dictionary package based on the input type
    data_dictionary_props = data_dictionary_props or {}
    data_dictionary_package = choice_fxn[input_type](
        input_filepath, data_dictionary_props
    )
    logger.debug(f"Data Dictionary Package keys {data_dictionary_package.keys()}")

    # For now we return the csv and json in one package.
    # If any multiple data dictionaries are needed then implement the methods in
    # https://github.com/HEAL/healdata-utils/blob/5080227454d8e731d46a51aa6933c93523eb3b9a/src/healdata_utils/conversion.py#L196
    package = data_dictionary_package

    # add schema version
    for field in package["template_csv"]["fields"]:
        field.update({"schemaVersion": JSON_SCHEMA["version"], **field})

    # remove empty json fields, add schema version (in TOP_LEVEL_PROPS)
    package["template_json"]["fields"] = clean_json_fields(
        package["template_json"]["fields"]
    )
    package["template_json"] = {**TOP_LEVEL_PROPS, **dict(package["template_json"])}

    return package
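
The two schema-version steps at the end of `convert_to_vlmd` rely on dictionary unpacking order, where keys unpacked later override earlier ones. A minimal sketch with plain dictionaries follows; the field values are made up, and the version string `"0.3.2"` is assumed from the fallback in `config.py`.

```python
SCHEMA_VERSION = "0.3.2"  # assumed value for illustration
TOP_LEVEL_PROPS = {"schemaVersion": SCHEMA_VERSION, "title": "HEAL Data Dictionary"}

# A field that already carries a version keeps it, because **field is unpacked last.
field_with_version = {"name": "age", "schemaVersion": "0.2.0"}
field_with_version.update({"schemaVersion": SCHEMA_VERSION, **field_with_version})

# A field without a version receives the default.
field_without_version = {"name": "sex"}
field_without_version.update({"schemaVersion": SCHEMA_VERSION, **field_without_version})

# Top-level props act as defaults; values from the converted dictionary win.
template_json = {"title": "My study dictionary", "fields": []}
merged = {**TOP_LEVEL_PROPS, **template_json}
```

In both steps the schema values behave as defaults rather than overrides, so a `schemaVersion` or `title` already present in the input survives the conversion.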