Use REST APIs to resolve DOIs + cleanup dataverse provider #1390
Conversation
As I was debugging jupyterhub#1388, I realized that PR actually broke the dataverse provider, but the existing test was mocking so much that we didn't actually catch it! IMO, since the point of contentproviders is to integrate with external content providers, their tests should be integration tests so we can catch issues with them more easily. Integration tests would have caught https://jupyter.zulipchat.com/#narrow/channel/103349-ask-anything/topic/Binder.20Dataverse.20error more cleanly than how it happened, for example.

This PR removes all mocks from the dataverse test, and we immediately benefit - it shows us that the dataverse provider *only* actually handles DOIs, and not direct URLs! So even though we technically had tests earlier that showed our dataverse provider supporting direct dataverse URLs, that simply was not true. Now we actually catch the failure.

I will try to see if we can use a demo or test instance for the fetch test, though, so we don't screw up download stats even more for the existing test DOI we use.
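(To make the intent concrete, here is a rough sketch of what an un-mocked, integration-style test could look like. The import path and assertion are assumptions about repo2docker's layout, and the DOI is the Harvard Dataverse dataset referenced later in this thread - this is not the PR's actual test code:)

```python
# Sketch of an un-mocked, integration-style test: it makes real network
# calls, so failures in the external service surface here instead of
# being hidden by mocks. Names and paths are assumptions, not the PR's code.
from repo2docker.contentproviders import Dataverse


def test_detect_dataverse_dataset_doi():
    provider = Dataverse()
    # detect() should recognise a Dataverse dataset DOI and return a spec
    # (rather than None), using real DOI resolution over the network.
    spec = provider.detect("10.7910/DVN/6ZXAGT")
    assert spec is not None
```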
dang, so
Looks good! As I mentioned, tests are failing, but that's expected.
I did make a couple suggestions about comments and unused code.
Co-authored-by: Philip Durbin <[email protected]>
We no longer follow redirects, so this is the canonical URL
Ready for review!
Nice! Only one question about the content id hash, but LGTM!
@@ -129,4 +213,4 @@ def fetch(self, spec, output_dir, yield_output=False):
     @property
     def content_id(self):
         """The Dataverse persistent identifier."""
-        return self.record_id
+        return hashlib.sha256(self.url.encode()).hexdigest()
Is there a risk that a 64-character hash might make image names too long?
https://github.com/opencontainers/distribution-spec/blob/main/spec.md#pulling-manifests says many implementations limit the hostname and name of the image in total to 256 chars. I think this means it may be good enough and not a problem?
Alternatively, I can go back to parsing persistent_id at detect time, instead of at fetch time, and set it that way. I think part of the confusion here is around detect semantics and when content_id is called. Ideally detect should be stateless and simply be used to, well, detect things! But we seem to treat it as also the thing that sets .content_id, so it's a little bit of a mess. I'm happy to treat that as a different refactor though.
Choice to be made
- Leave this as is
- Set this to be persistent_id instead, and move persistent_id parsing back into detect

Happy to do either!
But we seem to treat it as also the thing that sets .content_id so it's a little bit of a mess.
Yeah, that certainly doesn't sound right. It looks to me like we also only access content_id after calling fetch. Is it possible that the issue you are seeing is only in the tests, not how r2d actually behaves? What happens if you raise in content_id if fetch hasn't been called?
If it's just the tests and persistent_id is defined after fetch, then keeping persistent_id seems nice here, and maybe we can fix the tests to be more realistic. And make it explicit that content_id cannot be assumed to be available until fetch has been called?
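(For illustration, a minimal sketch of what that explicitness might look like, assuming fetch() is what sets the persistent id; the class and attribute names are illustrative, not repo2docker's actual code:)

```python
class DataverseLike:
    """Toy stand-in for a content provider; names are illustrative only."""

    def fetch(self, spec, output_dir):
        # In the real provider, fetch() does the actual work; here we just
        # record a persistent id so the guard below has something to check.
        self.persistent_id = spec

    @property
    def content_id(self):
        # Fail loudly if content_id is read before fetch() has run.
        if getattr(self, "persistent_id", None) is None:
            raise RuntimeError("content_id is only available after fetch() has been called")
        return self.persistent_id
```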
A tangent I went on about hash length that I'm not sure is relevant anymore, but I already wrote it down. Feel free to ignore:
Initially, I thought the content id was the full thing, but of course it's the 'ref' that goes after the doi itself. Running a test gives this 106-character image name:
r2ddoi-3a10-2e7910-2fdvn-2f6zxagt-2f3yrryjec19b07b80bf8eeb95f669a51f64efb7f647f91cf1b1f6ccbef736396ba936ef
Since we're in the namespace of the doi, collision probability is super low. We truncate to the short hash in git. So maybe truncate this hash, or use a natively shorter hash function like hashlib.blake2s(self.url.encode(), digest_size=10).hexdigest() (blake2 is in hashlib.algorithms_guaranteed since 3.6, I think).
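(A quick, self-contained comparison of the two options; the URL below is only a made-up example built from the DOI used elsewhere in this thread:)

```python
import hashlib

# Hypothetical dataset URL, for illustration only
url = "https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/6ZXAGT"

print(hashlib.sha256(url.encode()).hexdigest())                   # 64 hex chars
print(hashlib.blake2s(url.encode(), digest_size=10).hexdigest())  # 20 hex chars
```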
ooooh, fixing the tests seems the right thing to do! I'll take a peek.
hey, it looks like I already fixed the tests, so it's all fine now! Back to using persistent_id as the identifier, but this time it's 'proper' - so if we get a file persistent_id, we resolve it to the dataset persistent_id and use that. So if multiple different folks try to use different files from the same dataset, it will lead to cache reuse now!
Great! All looks right to me, then.
We were:
1. In some cases, directly using requests
2. In some cases, directly using the stdlib's urlopen
3. In some cases, using a method named urlopen that simply passed things through to requests

This is unnecessarily confusing, and seems to have been done primarily for the benefit of mocking the network calls. However, as described in the recently merged jupyterhub#1390, I don't think mocking is appropriate here, as it means we don't actually catch problems. This PR mostly focuses on unifying to using requests directly, with as little indirection as possible. Any tests that were directly using mocks here will be replaced with something that tests things more directly, as appropriate.
(Lots of credit to @pdurbin for helping debug and co-produce this work)
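(As a rough illustration of the direction, not the PR's actual code, such unification could look like a single shared requests.Session on a provider base class; the class and helper names below are assumptions:)

```python
import requests


class ContentProviderBase:
    """Sketch only: give every content provider one shared requests.Session
    so all HTTP goes through a single, obvious path (names are illustrative,
    not repo2docker's actual classes)."""

    def __init__(self):
        self.session = requests.Session()
        self.session.headers.update({"user-agent": "repo2docker"})

    def fetch_json(self, url, **params):
        """Small helper: GET a URL and return parsed JSON, raising on HTTP errors."""
        resp = self.session.get(url, params=params)
        resp.raise_for_status()
        return resp.json()
```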
Evolution of smaller PRs #1388 and #1389
While debugging #1388, I realized that we were basically
relying on HTTP redirects as an API for DOI resolution. This is fragile because we don't actually know
how many redirects we may get, and there may be things that break the redirect chain in the middle
(such as dataverse WAF rules, ref https://jupyter.zulipchat.com/#narrow/channel/103349-ask-anything/topic/Binder.20Dataverse.20error)
This PR tries to:

Making DOI resolution more deterministic

Right now, we are making GET requests to https://doi.org/<doi>, and following all redirects to figure out where a DOI resolves to. This has multiple problems - among them, we have to follow an unknown number of redirects, and we download the full content of the final page (via GET) only to immediately discard it.

doi.org has an actual JSON API, and we directly hit that instead. We will now always get an answer with one HTTP request, and not have to follow an arbitrary number of redirects!
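(As a rough sketch of what hitting the JSON API can look like - the endpoint is doi.org's public handle API, but the helper below is illustrative and not necessarily what the PR implements:)

```python
import requests


def resolve_doi(doi):
    """Resolve a DOI to its registered URL with a single request to the
    doi.org handle API, instead of following an unknown number of redirects.
    Illustrative helper; not necessarily how the provider implements it."""
    resp = requests.get(f"https://doi.org/api/handles/{doi}")
    resp.raise_for_status()
    for value in resp.json().get("values", []):
        if value.get("type") == "URL":
            return value["data"]["value"]
    return None


# e.g. resolve_doi("10.7910/DVN/6ZXAGT")
```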
Rely on the dataverse API directly

By no longer relying on HTTP redirects, we did lose one bit of functionality - dataverse has /citation URLs that resolve into either an individual file's doi (via /file.xhtml) or an entire dataset's doi (via /dataset.xhtml). The existing behavior was to guess which of the two we had. This PR stops guessing and uses the API to implement that functionality: we can now directly handle /citation dataverse URLs, and we use the API to figure out what's going on.
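(For a rough sense of what "use the API instead of guessing" can look like, here is an illustrative sketch against the public Dataverse native API. The helper name, flow, and JSON field names are assumptions to be checked against a live installation, not the provider's actual code:)

```python
import requests


def resolve_to_dataset_pid(base_url, persistent_id):
    """Sketch: map either a dataset DOI or a file DOI to the dataset's
    persistent id by asking the Dataverse native API, rather than guessing
    from redirect targets. Endpoints per the public Dataverse API docs;
    JSON field names should be verified against a real installation."""
    # First, try it as a dataset persistent id.
    resp = requests.get(
        f"{base_url}/api/datasets/:persistentId",
        params={"persistentId": persistent_id},
    )
    if resp.ok:
        return persistent_id

    # Otherwise, treat it as a file persistent id and ask which dataset
    # version it belongs to.
    resp = requests.get(
        f"{base_url}/api/files/:persistentId",
        params={"persistentId": persistent_id, "returnDatasetVersion": "true"},
    )
    resp.raise_for_status()
    return resp.json()["data"]["datasetVersion"]["datasetPersistentId"]


# e.g. resolve_to_dataset_pid("https://dataverse.harvard.edu", "doi:10.7910/DVN/6ZXAGT")
```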
Reduce the amount of mocking in contentprovider tests

IMO, since the point of contentproviders is to integrate with external content providers, their tests should be integration tests so we can catch issues with them more easily. Integration tests would have caught https://jupyter.zulipchat.com/#narrow/channel/103349-ask-anything/topic/Binder.20Dataverse.20error more cleanly than how it happened, for example.
It also immediately showed me that some of the fixtures we were using to test were not accurate, and I've updated the test fixtures to match what we actually expect them to return.
This PR:
- Moves the dataverse tests into tests/contentprovider, where I expect more of the contentproviders to join (in the future)
- Adjusts the detect method in the dataverse provider slightly to make it easier to test.

In a future PR, I'd like to remove more of at least the DOI provider mocks, and simplify the code a little as a result.