Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat/add vlmd extract #14

Open
wants to merge 42 commits into
base: master
Choose a base branch
from
Open

Feat/add vlmd extract #14

wants to merge 42 commits into from

Conversation

george42-ctds
Copy link
Collaborator

@george42-ctds george42-ctds commented Nov 1, 2024

JIRA ticket: HP-1727

New Features

  • Add VLMD extract

Breaking Changes

Bug Fixes

Improvements

Dependency updates

Deployment changes

Copy link

github-actions bot commented Nov 1, 2024

The style in this PR agrees with black. ✔️

This formatting comment was generated automatically by a script in uc-cdis/wool.

Copy link

@smvgarcia smvgarcia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

email change looks correct!

@@ -8,7 +8,7 @@
from heal.vlmd.validate.csv_validator import vlmd_validate_csv
from heal.vlmd.validate.json_validator import vlmd_validate_json

logger = get_logger("VALIDATE", log_level="debug")
logger = get_logger("vlmd-validate", log_level="debug")
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if the log_level specifications should be removed from child level methods. If they were set at a higher level then we would have more control in changing the level, say when invoked by the CLI.

@@ -8,7 +9,7 @@
from heal.vlmd.config import ALLOWED_SCHEMA_TYPES
from heal.vlmd.validate.utils import get_schema

logger = get_logger("JSON_VALIDATOR", log_level="debug")
logger = get_logger("json-validator", log_level="debug")

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is another example where the log_level in a child method could be removed and controlled from a higher level method.

)

propnames_exp = "|".join("^" + name + "$" for name in props_to_rearrange)
names_to_rearrange = [name for name in propnames if re.match(propnames_exp, name)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have to do re.match() here? I think what the propnames_exp and re.match() do here is the same as if name in props_to_rearrange, or maybe I'm missing something

is_one_unique = len(flat_fields_df[name].map(str).unique()) == 1
else:
is_one_unique = flat_fields_df[name].nunique() == 1
if is_one_unique:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is_one_unique might have not be defined at here

if is_one_unique:
flat_record[name] = flat_fields_df.pop(name).iloc[0]
if isinstance(flat_record[name], pd.Series):
flat_record[name] = flat_record[name].to_list()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I wrote this block of logic in the old healdata-utils code to patch some urgent bugs. But I didn't get to the root of the problem, so it is more like a band-aid fix
What I really hope we can do is to figure out why flat_record[name] can have so many possible types (pd.DataFrame, pd.Series, etc.). I feel that we might not need to support dealing with that many different types of flat_record[name] here, if we can fix the logic upstream to make sure the value of flat_record[name] will only ends up in one type

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can also create another ticket for what I said above, and put a FIXME note here

droplist: dict = None,
item_sep: str = "|",
keyval_sep: str = "=",
) -> dict:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are many variable / function names in this file that should be in snake cases but are not. We should change them



def convert_templatejson(
jsontemplate,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comments for snake cases for this file as well

for prop_name in nested_names[:-1]:
if "[" in prop_name:
array_name, array_index = prop_name.split("[")
array_index = int(array_index[:-1])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably needs some error handling if the casting fails here

else:
prop_json[array_name][array_index] = {array_name: prop}
else:
prop_json[last_prop_name] = prop
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

L195-L210 is very similar to L176-L193 above, except it does something special at the end for the last element. Probably better if written this way to combine the two

nested_names_length = len(nested_names)
for i, prop_name in enumerate(nested_names):
  if "[" in prop_name:
  ...
  if i == nested_names_length - 1
  // do something for the last element
  else:
  // do other things for other elements

logger.debug("Validating data")
jsonschema.validate(instance=data, schema=schema)
logger.debug(data)
# jsonschema.validate(instance=data, schema=schema)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no longer needed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants