Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

modified prov-info and <id>/prov-info to use dataset_type instead of … #596

Merged
merged 4 commits into from
Jan 17, 2024

Conversation

DerekFurstPitt
Copy link
Contributor

modified prov-info and /prov-info to use dataset_type instead of data
types. samples/prov-info didn't appear to include any references. Also
applied an identical change to what was done in the data-sankey branch to
make validate_organ_code accept an optional argument that includes the organ
dictionary so that ontology api doesn't need called thousands of times.

…data

types. samples/prov-info didn't appear to include any references. Also
applied an identical change to what was done in the data-sankey branch to
make validate_organ_code accept an optional argument that includes the organ
dictionary so that ontology api doesn't need called thousands of times.
@@ -2721,7 +2719,7 @@ def get_prov_info():
# Token is not required, but if an invalid token is provided,
# we need to tell the client with a 401 error
validate_token_if_auth_header_exists(request)

organ_types_dict = schema_manager.get_organ_types()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the organ_types_dict is assigned twice with the same call:

entity-api/src/app.py

Lines 2722 to 2728 in 95b243e

organ_types_dict = schema_manager.get_organ_types()
if user_in_hubmap_read_group(request):
published_only = False
# Parsing the organ types yaml has to be done here rather than calling schema.schema_triggers.get_organ_description
# because that would require using a urllib request for each dataset
organ_types_dict = schema_manager.get_organ_types()

Just keep one.

@@ -3016,7 +2992,7 @@ def get_prov_info_for_dataset(id):
# Token is not required, but if an invalid token provided,
# we need to tell the client with a 401 error
validate_token_if_auth_header_exists(request)

organ_types_dict = schema_manager.get_organ_types()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found duplicate in /datasets/<id>/prov-info too

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yuanzhou I wonder if this duplication was caused because I applied the same change in 2 different branches. I'll fix this real quick. So what happens is, whenever that validate organ code function got added, it made an external call to ontology api (this was probably added while converting yaml references to calls to ontology) but since validate organ code is called once per dataset, it ended up making the prov info endpoints take several minutes to run. I saw that this validate organ code function gets called multiple times throughout entity api and I wanted to createa fix that required as little reworking and re-testing of those other endpoints as possible, so I just added an optional argument to that function where that dict can be passed in so in situations where that function is called inside a loop, its not making constant calls to ontology. The thing is, I had to make this change in 2 different branches in entity-api, so I was curious if there would be any sort of duplication. Will push fix in a couple minutes

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see the duplicates in the sankey data PR, could just be in this prov-info branch.

That's good performance optimization you did even though that organ call internally is being cached.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes my concern was less about the speed of the prov-info endpoints (since as you mentioned, its cached) but rather clogging up ontology with thousands of requests.

@DerekFurstPitt
Copy link
Contributor Author

@yuanzhou those duplicates have been removed

@yuanzhou yuanzhou merged commit 56899dc into dev-integrate Jan 17, 2024
2 checks passed
@yuanzhou yuanzhou deleted the Derek-Furst/prov-info-dataset-type branch January 26, 2024 18:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants