Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add context (res id and name) when downloading data #24

Merged
merged 19 commits into from
Jun 13, 2024

Conversation

Moohan
Copy link
Member

@Moohan Moohan commented Mar 28, 2024

The problem was highlighted in this teams chat, when using get_dataset if the dataset doesn't include a variable like date it becomes impossible to tell which rows are from which resource as get_dataset just joins them into a single tibble.

This PR introduces an include_context parameter which is FALSE by default to both get_resource and get_dataset if TRUE they will now append a few variables: "res_id", "res_name", "res_created_date" and "res_modified_date" to the front of the data.

For get_dataset this is achieved with no additional API calls as all the needed info is included when we hit package_show for the list of IDs. For get_resource this involves an additional API call to resource_show (we haven't used this before).

I chose to have them both be FALSE by default so it won't break any existing code.

@csillasch
Copy link
Contributor

This is great! I`m probably leaning toward having it optional (but perhaps default to TRUE for get_dataset) and I think it might be worth having the option for get_resource as well (but would perhaps default to FALSE for get_resource)?

Moohan added 6 commits April 3, 2024 15:40
Also, improve the clarity of the error message (likely only developers will see this error)
This is currently used throughout the package.
Also make tests skip if it looks like the OpenData platform isn't available.
@Moohan Moohan requested review from csillasch and ross-hull April 10, 2024 15:40
@Moohan Moohan marked this pull request as ready for review April 10, 2024 15:46
@Moohan
Copy link
Member Author

Moohan commented Apr 11, 2024

From discussion

  • Add some more tests with other datasets and resources (d3eaaf84-0f3b-4fb8-9460-e33503095fbe).
  • Should missing modified date be created date.
  • Parse the dates as a date
  • Deal with dump extract

@Moohan
Copy link
Member Author

Moohan commented Apr 19, 2024

@csillasch and @ross-hull I think this is done now.

The modified date still returns as NA if nothing is returned from the API (we discussed possibly setting it to the created date but don't remember the final decision?

We talked about adding some additional tests, I've added some but please let me know if you want anything else specifically testing - or importantly if you find a resource / dataset this isn't working for in which case I'll definitely add that and fix whatever the issue is!

@ross-hull
Copy link
Contributor

This looks good to me. And using it as a user was simple. One thing that I wondered is if it is worth having the same naming convention between the new context columns, as is already used in the open data? The context columns are snake case while the others are upper camel.

Copy link
Contributor

@csillasch csillasch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

I spotted unexpected behaviour when testing on the General Practitioner Contact Details dataset: modified date is returned correctly using get_dataset but returns NULL with get_resource("647b256e-4a03-4963-8402-bf559c9e2fff")

Suggested a change to sort this which works for the resource but might be worth testing against some of the others - especially where it worked fine previously.

R/get_resource.R Outdated Show resolved Hide resolved
@Moohan
Copy link
Member Author

Moohan commented Apr 24, 2024

This looks good to me. And using it as a user was simple. One thing that I wondered is if it is worth having the same naming convention between the new context columns, as is already used in the open data? The context columns are snake case while the others are upper camel.

Fair enough, do these names look good?

"ResId" (or "ResID"?)
"ResName"
"ResCreatedDate"
"ResModifiedDate"

@Moohan Moohan requested a review from csillasch April 24, 2024 14:41
Copy link
Contributor

@csillasch csillasch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Get_dataset seems to handle it fine but get_resource is returning the dates as lists - all your changes look OK to me so not sure how/why this is happening

This was the advice I got back from the `{vctrs}` team.
csillasch
csillasch previously approved these changes May 9, 2024
@csillasch csillasch merged commit 7298938 into Public-Health-Scotland:master Jun 13, 2024
12 checks passed
@Moohan Moohan deleted the add_context branch June 19, 2024 14:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants