Use REST APIs to resolve DOIs + cleanup dataverse provider #1390
Conversation
As I was debugging jupyterhub#1388, I realized that PR actually broke the dataverse provider, but the existing test was mocking so much that we didn't actually catch it! IMO, since the point of contentproviders is to integrate with external content providers, their tests should be integration tests so we can catch issues with them more easily. Integration tests would have caught https://jupyter.zulipchat.com/#narrow/channel/103349-ask-anything/topic/Binder.20Dataverse.20error more cleanly than how it happened, for example.

This PR removes all mocks from the dataverse test, and we immediately benefit - it shows us that the dataverse provider *only* actually handles DOIs, and not direct URLs! So even though we technically had tests earlier that showed our dataverse provider supporting direct dataverse URLs, that simply was not true. Now we actually catch the failure.

I will try to see if we can use a demo or test instance for the fetch test, though, so we don't screw up download stats even more for the existing test DOI we use.
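(To make the intent concrete, here is a rough sketch of what an un-mocked, integration-style test could look like. The import path and assertion are assumptions about repo2docker's layout, and the DOI is the Harvard Dataverse dataset referenced later in this thread - this is not the PR's actual test code:)

```python
# Sketch of an un-mocked, integration-style test: it makes real network
# calls, so failures in the external service surface here instead of
# being hidden by mocks. Names and paths are assumptions, not the PR's code.
from repo2docker.contentproviders import Dataverse


def test_detect_dataverse_dataset_doi():
    provider = Dataverse()
    # detect() should recognise a Dataverse dataset DOI and return a spec
    # (rather than None), using real DOI resolution over the network.
    spec = provider.detect("10.7910/DVN/6ZXAGT")
    assert spec is not None
```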
dang, so
Looks good! As I mentioned, tests are failing, but that's expected.
I did make a couple suggestions about comments and unused code.
Co-authored-by: Philip Durbin <[email protected]>
We no longer follow redirects, so this is the canonical URL
Ready for review!
Nice! Only one question about the content id hash, but LGTM!
@@ -129,4 +213,4 @@ def fetch(self, spec, output_dir, yield_output=False):
     @property
     def content_id(self):
         """The Dataverse persistent identifier."""
-        return self.record_id
+        return hashlib.sha256(self.url.encode()).hexdigest()
Is there a risk that a 64-character hash might make image names too long?
https://github.com/opencontainers/distribution-spec/blob/main/spec.md#pulling-manifests says many implementations limit the hostname and name of the image in total to 256 chars. I think this means it may be good enough and not a problem?
Alternatively, I can go back to parsing persistent_id at detect time, instead of at fetch time, and set it that way. I think part of the confusion here is around detect semantics and when content_id is called. Ideally detect should be stateless and simply be used to, well, detect things! But we seem to treat it as also the thing that sets .content_id, so it's a little bit of a mess. I'm happy to treat that as a different refactor though.
Choice to be made
- Leave this as is
- Set this to be persistent_id instead, and move persistent_id parsing back into detect

Happy to do either!
But we seem to treat it as also the thing that sets .content_id so it's a little bit of a mess.
Yeah, that certainly doesn't sound right. It looks to me like we also only access content_id after calling fetch. Is it possible that the issue you are seeing is only in the tests, not how r2d actually behaves? What happens if you raise in content_id if fetch hasn't been called?
If it's just the tests and persistent_id is defined after fetch, then keeping persistent_id seems nice here, and maybe we can fix the tests to be more realistic. And make it explicit that content_id cannot be assumed to be available until fetch has been called?
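(For illustration, a minimal sketch of what that explicitness might look like, assuming fetch() is what sets the persistent id; the class and attribute names are illustrative, not repo2docker's actual code:)

```python
class DataverseLike:
    """Toy stand-in for a content provider; names are illustrative only."""

    def fetch(self, spec, output_dir):
        # In the real provider, fetch() does the actual work; here we just
        # record a persistent id so the guard below has something to check.
        self.persistent_id = spec

    @property
    def content_id(self):
        # Fail loudly if content_id is read before fetch() has run.
        if getattr(self, "persistent_id", None) is None:
            raise RuntimeError("content_id is only available after fetch() has been called")
        return self.persistent_id
```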
A tangent I went on about hash length that I'm not sure is relevant anymore, but I already wrote it down. Feel free to ignore:
Initially, I thought the content id was the full thing, but of course it's the 'ref' that goes after the doi itself. Running a test gives this 106-character image name:
r2ddoi-3a10-2e7910-2fdvn-2f6zxagt-2f3yrryjec19b07b80bf8eeb95f669a51f64efb7f647f91cf1b1f6ccbef736396ba936ef
Since we're in the namespace of the doi, collision probability is super low. We truncate to the short hash in git. So maybe truncate this hash, or use a natively shorter hash function like hashlib.blake2s(self.url.encode(), digest_size=10).hexdigest() (blake2 is in hashlib.algorithms_guaranteed since 3.6, I think).
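(A quick, self-contained comparison of the two options; the URL below is only a made-up example built from the DOI used elsewhere in this thread:)

```python
import hashlib

# Hypothetical dataset URL, for illustration only
url = "https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/6ZXAGT"

print(hashlib.sha256(url.encode()).hexdigest())                   # 64 hex chars
print(hashlib.blake2s(url.encode(), digest_size=10).hexdigest())  # 20 hex chars
```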
ooooh, fixing the tests seems the right thing to do! I'll take a peek.
hey, it looks like I already fixed the tests, so it's all fine now! Back to using persistent_id as the identifier, but this time it's 'proper' - so if we get a file persistent_id, we resolve it to the dataset persistent_id and use that. So if multiple different folks try to use different files from the same dataset, it will lead to cache reuse now!
Great! All looks right to me, then.
We were:
1. In some cases, directly using requests
2. In some cases, directly using the stdlib's urlopen
3. In some cases, using a method named urlopen that simply passed things through to requests

This is unnecessarily confusing, and seems to have been done primarily for the benefit of mocking the network calls. However, as described in the recently merged jupyterhub#1390, I don't think mocking is appropriate here, as it means we don't actually catch problems. This PR mostly focuses on unifying to using requests directly, with as little indirection as possible. Any tests that were directly using mocks here will be replaced with something that tests things more directly, as appropriate.
(Lots of credit to @pdurbin for helping debug and co-produce this work)
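(As a rough illustration of the direction, not the PR's actual code, such unification could look like a single shared requests.Session on a provider base class; the class and helper names below are assumptions:)

```python
import requests


class ContentProviderBase:
    """Sketch only: give every content provider one shared requests.Session
    so all HTTP goes through a single, obvious path (names are illustrative,
    not repo2docker's actual classes)."""

    def __init__(self):
        self.session = requests.Session()
        self.session.headers.update({"user-agent": "repo2docker"})

    def fetch_json(self, url, **params):
        """Small helper: GET a URL and return parsed JSON, raising on HTTP errors."""
        resp = self.session.get(url, params=params)
        resp.raise_for_status()
        return resp.json()
```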
Evolution of smaller PRs #1388 and #1389
While debugging #1388, I realized that we were basically
relying on HTTP redirects as an API for DOI resolution. This is fragile because we don't actually know
how many redirects we may get, and there may be things that break the redirect chain in the middle
(such as dataverse WAF rules, ref https://jupyter.zulipchat.com/#narrow/channel/103349-ask-anything/topic/Binder.20Dataverse.20error)
This PR tries to:

Making DOI resolution more deterministic

Right now, we are making GET requests to https://doi.org/<doi>, and following all redirects to figure out where a DOI resolves to. This has multiple problems - among them, we have to follow an unknown number of redirects, and we download the full content of the final page (via GET) only to immediately discard it.

doi.org has an actual JSON API, and we directly hit that instead. We will now always get an answer with one HTTP request, and not have to follow an arbitrary number of redirects!
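(As a rough sketch of what hitting the JSON API can look like - the endpoint is doi.org's public handle API, but the helper below is illustrative and not necessarily what the PR implements:)

```python
import requests


def resolve_doi(doi):
    """Resolve a DOI to its registered URL with a single request to the
    doi.org handle API, instead of following an unknown number of redirects.
    Illustrative helper; not necessarily how the provider implements it."""
    resp = requests.get(f"https://doi.org/api/handles/{doi}")
    resp.raise_for_status()
    for value in resp.json().get("values", []):
        if value.get("type") == "URL":
            return value["data"]["value"]
    return None


# e.g. resolve_doi("10.7910/DVN/6ZXAGT")
```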
Rely on the dataverse API directly

By no longer relying on HTTP redirects, we did lose one bit of functionality - dataverse has /citation URLs that resolve into either an individual file's doi (via /file.xhtml) or an entire dataset's doi (via /dataset.xhtml). The existing behavior was to guess which of the two we had. This PR stops guessing and uses the API to implement that functionality: we can now directly handle /citation dataverse URLs, and we use the API to figure out what's going on.
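(For a rough sense of what "use the API instead of guessing" can look like, here is an illustrative sketch against the public Dataverse native API. The helper name, flow, and JSON field names are assumptions to be checked against a live installation, not the provider's actual code:)

```python
import requests


def resolve_to_dataset_pid(base_url, persistent_id):
    """Sketch: map either a dataset DOI or a file DOI to the dataset's
    persistent id by asking the Dataverse native API, rather than guessing
    from redirect targets. Endpoints per the public Dataverse API docs;
    JSON field names should be verified against a real installation."""
    # First, try it as a dataset persistent id.
    resp = requests.get(
        f"{base_url}/api/datasets/:persistentId",
        params={"persistentId": persistent_id},
    )
    if resp.ok:
        return persistent_id

    # Otherwise, treat it as a file persistent id and ask which dataset
    # version it belongs to.
    resp = requests.get(
        f"{base_url}/api/files/:persistentId",
        params={"persistentId": persistent_id, "returnDatasetVersion": "true"},
    )
    resp.raise_for_status()
    return resp.json()["data"]["datasetVersion"]["datasetPersistentId"]


# e.g. resolve_to_dataset_pid("https://dataverse.harvard.edu", "doi:10.7910/DVN/6ZXAGT")
```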
Reduce the amount of mocking in contentprovider tests

IMO, since the point of contentproviders is to integrate with external content providers, their tests should be integration tests so we can catch issues with them more easily. Integration tests would have caught https://jupyter.zulipchat.com/#narrow/channel/103349-ask-anything/topic/Binder.20Dataverse.20error more cleanly than how it happened, for example.
It also immediately showed me that some of the fixtures we were using to test were not accurate, and I've updated the test fixtures to match what we actually expect them to return.
This PR:
- Moves the dataverse tests into tests/contentprovider, where I expect more of the contentproviders to join (in the future)
- Adjusts the detect method in the dataverse provider slightly to make it easier to test.

In a future PR, I'd like to remove more of at least the DOI provider mocks, and simplify the code a little as a result.