Use REST APIs to resolve DOIs + cleanup dataverse provider #1390

Merged: 20 commits into jupyterhub:main on Dec 20, 2024

Conversation

@yuvipanda (Collaborator) commented Dec 17, 2024

(Lots of credit to @pdurbin for helping debug and co-produce this work)

Evolution of smaller PRs #1388 and #1389

While debugging #1388, I realized that we were basically
relying on HTTP redirects as an API for DOI resolution. This is fragile because we don't actually know
how many redirects we may get, and there may be things that break the redirect chain in the middle
(such as dataverse WAF rules, ref https://jupyter.zulipchat.com/#narrow/channel/103349-ask-anything/topic/Binder.20Dataverse.20error)

This PR tries to:

  1. Make our DOI resolution more deterministic (only one API call)
  2. Make our dataverse contentprovider more robust, by relying on dataverse APIs for more things
  3. Reduce the amount of mocking in provider tests, so they can actually catch errors in our code more easily

Making DOI resolution more deterministic

Right now, we are making GET requests to https://doi.org/<doi>, and following all redirects to figure out where a DOI resolves to. This has multiple problems:

  1. It isn't deterministic, as we can have many redirects of dubious quality and reliability - not just from doi.org but from the providers themselves
  2. It transfers a lot of unnecessary data (the body of each GET response) that we immediately discard
  3. It can sometimes trigger firewalls and rate limits on data providers, which is hard to debug (the spec simply fails to be detected as a DOI-based provider, and falls back to the git provider)

doi.org has an actual JSON API, and we now hit that directly instead. We always get an answer with a single HTTP request, and no longer have to follow an arbitrary number of redirects!
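
A minimal sketch of what single-request resolution can look like, using the REST API that the doi.org handle proxy exposes (the exact endpoint and parsing in this PR may differ; the example DOI is the one that shows up later in this thread):

# Sketch: resolve a DOI with one GET against doi.org's JSON API,
# instead of following an open-ended redirect chain.
import requests

def resolve_doi(doi: str) -> str:
    resp = requests.get(f"https://doi.org/api/handles/{doi}")
    resp.raise_for_status()
    # The handle record lists typed values; the "URL" entry is the target
    for value in resp.json()["values"]:
        if value["type"] == "URL":
            return value["data"]["value"]
    raise ValueError(f"DOI {doi} does not resolve to a URL")

print(resolve_doi("10.7910/DVN/6ZXAGT"))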

Rely on the dataverse API directly

By no longer relying on HTTP redirects, we did lose one bit of functionality - dataverse has /citation URLs that resolve to either an individual file's doi (via /file.xhtml) or an entire dataset's doi (via /dataset.xhtml). The existing behavior was:

  1. If we are dealing with a dataset, fetch the entire dataset
  2. If we are dealing with a file, guess at the dataset the file belongs to, and fetch the entire dataset

This PR stops guessing, and uses the dataverse API to implement the functionality: we can now handle /citation dataverse URLs directly, asking the API what an identifier points to instead of guessing at it.
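
As a sketch of the idea (not the exact code in this PR), resolving a file's persistent id to the dataset that contains it can look like this, assuming the Dataverse Files API's returnDatasetVersion parameter; names and error handling are illustrative:

# Sketch: ask the Dataverse native API which dataset a file belongs to,
# instead of guessing.
import requests

def dataset_pid_for_file(base_url: str, file_pid: str) -> str:
    api_url = (
        f"{base_url}/api/files/:persistentId"
        f"?persistentId={file_pid}&returnDatasetVersion=true"
    )
    resp = requests.get(api_url)
    resp.raise_for_status()
    # The dataset version record carries the dataset-level persistent id
    return resp.json()["data"]["datasetVersion"]["datasetPersistentId"]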

Reduce the amount of mocking in contentprovider tests

IMO, since the point of contentproviders is to integrate with external content providers, their tests should be integration tests, so we can catch issues with them more easily. For example, integration tests would have caught https://jupyter.zulipchat.com/#narrow/channel/103349-ask-anything/topic/Binder.20Dataverse.20error more cleanly than how it actually played out.

It also immediately showed me that some of the fixtures we were using in tests were not accurate, and I've updated the test fixtures to match what we actually expect the APIs to return.

This PR:

  1. Makes the dataverse contentprovider test more of an integration test, so we actually catch bugs (see the sketch after this list)
  2. Moves it to tests/contentprovider, where I expect more of the contentproviders to join in the future
  3. Refactors the detect method in the dataverse provider slightly to make it easier to test
  4. Removes just enough mocking in other doi based providers to get tests to pass
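
A sketch of what such an unmocked test can look like (the DOI is the one that appears later in this thread; the exact assertions in tests/contentprovider may differ):

# Sketch of an integration-style test: no mocks, real network calls.
from repo2docker.contentproviders import Dataverse

def test_detect_dataverse_doi():
    # Hits doi.org and the Dataverse installation for real, so a WAF rule
    # or an API change actually fails the test instead of being mocked away
    assert Dataverse().detect("10.7910/DVN/6ZXAGT") is not None

def test_detect_non_doi_url():
    # A plain git URL should not be claimed by the dataverse provider
    assert Dataverse().detect("https://github.com/jupyterhub/repo2docker") is None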

In a future PR, I'd like to remove more of at least the DOI provider mocks, and simplify the code a little as a result.

As I was debugging jupyterhub#1388,
I realized that PR actually broke the dataverse provider, but the
existing test was mocking so much that we didn't actually catch it!

This PR removes all mocks from the dataverse test, and we immediately
benefit - it shows us that the dataverse provider *only* actually
handles DOIs, not direct URLs! So even though we technically had
tests showing our dataverse provider supporting direct dataverse
URLs, that support simply did not exist. Now we actually catch the
failure.

I will try to see if we can use a demo or test instance for the
fetch test though, so we don't screw up download stats even more
for the existing test doi we use.
@yuvipanda marked this pull request as draft December 17, 2024 00:06
@yuvipanda (Collaborator, Author) commented:

dang, so /citation can also forward to a file.xhtml url, which the dataverse API calls we are making do not support

@pdurbin (Contributor) left a comment:

Looks good! As I mentioned, tests are failing, but that's expected.

I did make a couple suggestions about comments and unused code.

(3 inline review threads on repo2docker/contentproviders/dataverse.py - outdated, resolved)
@yuvipanda changed the title from "[WIP] Cleanup dataverse provider" to "[WIP] Use REST APIs to resolve DOIs + cleanup dataverse provider" on Dec 17, 2024
We no longer follow redirects, so this is the canonical URL
@yuvipanda changed the title from "[WIP] Use REST APIs to resolve DOIs + cleanup dataverse provider" to "Use REST APIs to resolve DOIs + cleanup dataverse provider" on Dec 17, 2024
@yuvipanda requested a review from manics December 17, 2024 22:16
@yuvipanda marked this pull request as ready for review December 17, 2024 22:16
@yuvipanda (Collaborator, Author) commented:

Ready for review!

@minrk (Member) left a comment:

Nice! Only one question about the content id hash, but LGTM!

@@ -129,4 +213,4 @@ def fetch(self, spec, output_dir, yield_output=False):
     @property
     def content_id(self):
         """The Dataverse persistent identifier."""
-        return self.record_id
+        return hashlib.sha256(self.url.encode()).hexdigest()
@minrk (Member):

Is there a risk that a 64 character hash might make image names too long?

@yuvipanda (Collaborator, Author):

https://github.com/opencontainers/distribution-spec/blob/main/spec.md#pulling-manifests says many implementations limit the hostname and name of the image in total to 256 chars. I think this means it may be good enough and not a problem?

Alternatively, I can go back to parsing persistent_id at detect time, instead of at fetch time, and set it that way. I think part of the confusion here is around detect semantics and when content_id is called. Ideally, detect should be stateless and simply be used to, well, detect things! But we seem to treat it as also the thing that sets .content_id, so it's a bit of a mess. I'm happy to treat that as a separate refactor though.

Choice to be made

  1. Leave this as is
  2. Set this to be persistent_id instead, and move persistent_id parsing back into detect

Happy to do either!

@minrk (Member):

> But we seem to treat it as also the thing that sets .content_id so it's a little bit of a mess.

Yeah, that certainly doesn't sound right. It looks to me like we also only access content_id after calling fetch. Is it possible that the issue you are seeing is only in the tests, not how r2d actually behaves? What happens if you raise in content_id if fetch hasn't been called?

If it's just the tests and persistent_id is defined after fetch, then keeping persistent_id seems nice here, and maybe we can fix the tests to be more realistic. And make it explicit that content_id cannot be assumed to be available until fetch has been called?
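
A sketch of what making that contract explicit could look like (the class and attribute names are illustrative, not actual repo2docker code):

# Sketch: content_id raises until fetch() has populated state.
# `persistent_id` stands in for whatever fetch() actually sets.
class DataverseProviderSketch:
    def __init__(self):
        self.persistent_id = None  # set by fetch()

    def fetch(self, spec):
        # ... real fetching work elided ...
        self.persistent_id = spec

    @property
    def content_id(self):
        if self.persistent_id is None:
            raise RuntimeError("content_id is undefined until fetch() has been called")
        return self.persistent_id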


A tangent I went on about hash length, that I'm not sure is relevant anymore, but already wrote down. Feel free to ignore:

Initially, I thought the content id was the full thing, but of course it's the 'ref' that goes after the doi itself. Running a test gives this 106-character image name:

r2ddoi-3a10-2e7910-2fdvn-2f6zxagt-2f3yrryjec19b07b80bf8eeb95f669a51f64efb7f647f91cf1b1f6ccbef736396ba936ef

Since we're in the namespace of the doi, collision probability is super low. We truncate to the short hash in git. So maybe truncate this hash, or use a natively shorter hash function like:

hashlib.blake2s(self.url.encode(), digest_size=10).hexdigest()

(blake2 is in hashlib.algorithms_guaranteed since 3.6, I think)
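
For illustration, truncating via digest_size shrinks the suffix from 64 to 20 hex characters (the URL here is a made-up example):

# Illustrative: compare digest lengths for the image-name suffix
import hashlib

url = "https://dataverse.harvard.edu/citation?persistentId=doi:10.7910/DVN/6ZXAGT"
print(len(hashlib.sha256(url.encode()).hexdigest()))                   # 64
print(len(hashlib.blake2s(url.encode(), digest_size=10).hexdigest()))  # 20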

@yuvipanda (Collaborator, Author):

ooooh, fixing the tests seems the right thing to do! I'll take a peek.

@yuvipanda (Collaborator, Author):

hey, it looks like I already fixed the tests, so it's all fine now! Back to using persistent_id as the identifier, but this time it's 'proper': if we get a file persistent_id, we resolve it to the dataset persistent_id and use that. So if multiple folks use different files from the same dataset, it leads to cache reuse now!

@minrk (Member):

Great! All looks right to me, then.

@minrk merged commit b7c1515 into jupyterhub:main Dec 20, 2024
19 checks passed
yuvipanda added a commit to yuvipanda/repo2docker that referenced this pull request Dec 20, 2024
We were:

1. In some cases, directly using requests
2. In some cases, directly using the stdlib's urlopen
3. In some cases, using a method named urlopen that simply passed
   things through to requests

This is unnecessarily confusing, and seems to primarily be done
for the benefit of mocking the network calls. However, as described
in the recently merged jupyterhub#1390,
I don't think mocking is appropriate here as it means we don't actually
catch problems.

This PR mostly focuses on unifying to using requests directly,
with as little indirection as possible. Any tests that were
directly using mocks here are replaced with something that tests
things more directly, as appropriate.
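
A sketch of the unified shape this describes (the class and method names are illustrative, not repo2docker's actual base class):

# Sketch: one consistent path to the network - requests only, through
# a shared Session. Names are illustrative.
import requests

class ContentProviderSketch:
    def __init__(self):
        # A single Session gives connection pooling and one place for headers
        self.session = requests.Session()
        self.session.headers.update({"user-agent": "repo2docker"})

    def _request(self, url, **kwargs):
        # Every provider calls this; no stdlib urlopen, and no wrapper
        # confusingly named urlopen that actually calls requests
        return self.session.get(url, **kwargs)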