Harvard Dataverse file retriever implementation for Heal SDK #11
Merged
27 commits
25d80c1 black formatting (piotrsenkow)
e24fc88 black formatting (piotrsenkow)
55c1a9f adding unit testing and modified logic for harvard dataverse file ret… (piotrsenkow)
d9bb4a1 missing variables breaking unit test (piotrsenkow)
62ac89f fixing breaking unit test (piotrsenkow)
393f158 fixing breaking tests (piotrsenkow)
df99f92 fixing breaking tests (piotrsenkow)
93b1364 fixing breaking tests (piotrsenkow)
4a4f2d2 fixing breaking tests (piotrsenkow)
cbe00cd fixing breaking tests (piotrsenkow)
dcf78b8 fixing breaking tests (piotrsenkow)
4b21499 fixing breaking tests (piotrsenkow)
f4aa6a5 fixing breaking tests (piotrsenkow)
17529f3 fixing breaking tests (piotrsenkow)
b1533f5 fixing breaking tests (piotrsenkow)
14b9e53 Merge branch 'master' into piotr/harvard (piotrsenkow)
92255f4 refactor and adding utils (piotrsenkow)
d67c2f3 black (piotrsenkow)
009f18e broken import due to refactoring (piotrsenkow)
c9fe359 changing parameter names in unit tests as functions have been slightl… (piotrsenkow)
1dca4cb refactoring (piotrsenkow)
1b41e11 refactoring (piotrsenkow)
9466551 refactoring (piotrsenkow)
b4f00a7 refactor (piotrsenkow)
5a8ce80 Merge branch 'master' into piotr/harvard (piotrsenkow)
f5fcf7e Adding TODO disclaimer about not having to use WTS token to access ha… (piotrsenkow)
b89b3e3 Merge branch 'piotr/harvard' of https://github.com/uc-cdis/heal-platf… (piotrsenkow)
@@ -0,0 +1,130 @@
"""
This module includes an external file retriever function intended to be called
by the external_files_download module in the Gen3-SDK.

The retriever function sends requests to the Harvard Dataverse for downloading studies or files.
TODO: QDR and Harvard Dataverse use the same Dataverse API; however, we do NOT need a WTS token
to access Harvard Dataverse, so this retriever sends no auth headers.
The Dataverse documentation describes how to download studies:
https://guides.dataverse.org/en/latest/api/dataaccess.html#basic-download-by-dataset
"""

from pathlib import Path
from typing import Dict, List
from utils import unpackage_object, get_id, download_from_url

from cdislogging import get_logger
from gen3.tools.download.drs_download import DownloadStatus

# use the module name, not the literal string "__name__", as the logger name
logger = get_logger(__name__, log_level="debug")


def get_harvard_dataverse_files(
    file_metadata_list: List, download_path: str = "."
) -> Dict:
    """
    Retrieves external data from the Harvard Dataverse.

    Args:
        file_metadata_list (List of Dict): list of studies or files
        download_path (str): path to download files and unpack

    Returns:
        Dict of download status keyed by ID, or None if the download path
        does not exist or no item in the list had a usable ID
    """
    if not Path(download_path).exists():
        logger.critical(f"Download path does not exist: {download_path}")
        return None

    completed = {}
    logger.debug(f"Input file metadata list={file_metadata_list}")

    for file_metadata in file_metadata_list:
        id = get_id(file_metadata)
        if id is None:
            logger.warning(
                f"Could not find 'study_id' or 'file_id' in metadata {file_metadata}"
            )
            continue
        logger.info(f"ID = {id}")
        completed[id] = DownloadStatus(filename=id, status="pending")

        download_url = get_download_url_for_harvard_dataverse(file_metadata)
        if download_url is None:
            logger.critical(f"Could not get download_url for {id}")
            completed[id].status = "invalid url"
            continue

        logger.debug(f"Ready to send request to download_url: GET {download_url}")
        # no auth headers: Harvard Dataverse does not require a WTS token
        downloaded_file = download_from_url(
            api_url=download_url,
            headers=None,
            download_path=download_path,
        )
        if downloaded_file is None:
            completed[id].status = "failed"
            continue

        if downloaded_file.endswith("zip"):
            # unpack if download is zip file
            try:
                logger.debug(f"Ready to unpack {downloaded_file}.")
                unpackage_object(filepath=downloaded_file)
            except Exception as e:
                logger.critical(f"{id} had an issue while being unpackaged: {e}")
                completed[id].status = "failed"
            else:
                # mark as downloaded only when unpacking succeeded, so a
                # "failed" status is not overwritten
                completed[id].status = "downloaded"
            # remove the zip file whether or not unpacking succeeded
            Path(downloaded_file).unlink()
        else:
            completed[id].status = "downloaded"

    if not completed:
        return None
    return completed


def get_download_url_for_harvard_dataverse(file_metadata: Dict) -> str:
    """
    Get the download url for Harvard Dataverse.

    Args:
        file_metadata (Dict): must contain 'study_id', which is used as the
            dataset's persistentId; a truthy 'use_harvard_staging' value
            switches the request to the demo server

    Returns:
        url, None if 'study_id' is missing
    """
    base_url = "https://dataverse.harvard.edu/api/access"
    if "use_harvard_staging" in file_metadata and bool(
        file_metadata["use_harvard_staging"]
    ):
        base_url = "https://demo.dataverse.org/api/access"
    if "study_id" in file_metadata:
        url = f"{base_url}/dataset/:persistentId/?persistentId={file_metadata.get('study_id')}"
    else:
        url = None

    return url


def is_valid_harvard_file_metadata(file_metadata: Dict) -> bool:
    """
    Check that the file_metadata has the required key: 'study_id'.

    Args:
        file_metadata (Dict)

    Returns:
        True if valid file_metadata object.
    """
    if not isinstance(file_metadata, dict):
        logger.critical(f"Invalid metadata - item is not a dict: {file_metadata}")
        return False
    if "study_id" not in file_metadata:
        logger.critical(
            f"Invalid metadata - missing required Harvard Dataverse keys {file_metadata}"
        )
        return False
    return True
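To make the URL logic in the diff easier to check by eye, here is a minimal standalone sketch of it. The function name `build_dataset_url`, the two module-level constants, and the sample DOI are hypothetical and used only for illustration; the base URLs and the `study_id` / `use_harvard_staging` keys are taken from the diff above.

```python
from typing import Dict, Optional

# Base URLs copied from the diff; the constant names are illustrative only.
PRODUCTION_BASE = "https://dataverse.harvard.edu/api/access"
STAGING_BASE = "https://demo.dataverse.org/api/access"


def build_dataset_url(file_metadata: Dict) -> Optional[str]:
    """Standalone copy of the URL logic in the diff, for illustration only."""
    # A truthy 'use_harvard_staging' value selects the demo server.
    if file_metadata.get("use_harvard_staging"):
        base_url = STAGING_BASE
    else:
        base_url = PRODUCTION_BASE
    # Without a 'study_id' key there is nothing to download.
    if "study_id" not in file_metadata:
        return None
    return f"{base_url}/dataset/:persistentId/?persistentId={file_metadata['study_id']}"


# Placeholder identifier, not a real dataset.
sample = {"study_id": "doi:10.7910/DVN/EXAMPLE"}
print(build_dataset_url(sample))
print(build_dataset_url({**sample, "use_harvard_staging": True}))
print(build_dataset_url({"file_id": "12345"}))
```

Metadata that carries only a `file_id` yields `None` here, which matches the "invalid url" status path in `get_harvard_dataverse_files`.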
Let's add some TODO notes in here to remind us that the auth parts are missing.
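One possible shape for such a TODO is sketched below: a small helper that returns `None` for anonymous access (the current behavior, `headers=None`) and an auth header if a token is ever required. The helper name `build_request_headers` and its `api_token` parameter are hypothetical. The `X-Dataverse-key` header is the one the Dataverse API guide describes for API tokens; whether a WTS-issued token would be passed that way in this SDK is an assumption, not something the PR confirms.

```python
from typing import Dict, Optional


def build_request_headers(api_token: Optional[str] = None) -> Optional[Dict[str, str]]:
    """Return request headers for a Dataverse call, or None for anonymous access."""
    if not api_token:
        # TODO: Harvard Dataverse is accessed without auth for now,
        # which matches headers=None in the retriever.
        return None
    # Assumption: a token, if needed later, goes in Dataverse's API-token header.
    return {"X-Dataverse-key": api_token}


print(build_request_headers())
print(build_request_headers("example-token"))
```

Keeping the anonymous case as `None` means the existing `download_from_url(..., headers=None, ...)` call would not need to change until auth is actually introduced.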