Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resolve DOI more cleanly #1388

Closed
wants to merge 4 commits into from

Conversation

yuvipanda
Copy link
Collaborator

@yuvipanda yuvipanda commented Dec 14, 2024

Using doi.org, we only care to find out where the doi is pointing to. We don't need to go fetch the contents of that page fully.

By doing this, we:

  1. Make doi resolution much faster, as we now make only 1 request per doi resolution than two!
  2. Handle circumstances where dataverse (or other providers) restrict requests from datacenter IPs for subsets of URLs that serve HTML, rather than API requests (https://jupyter.zulipchat.com/#narrow/channel/103349-ask-anything/topic/Binder.20Dataverse.20error)

Using doi.org, we only care to find out *where* the doi is
pointing to. We don't need to go fetch the contents of that page
fully.
resp = self._request(f"https://doi.org/{doi}")
# We don't need to fetch the *contents* of the page the doi resolves to -
# only need to know what it redirects to.
resp = self._request(f"https://doi.org/{doi}", allow_redirects=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any rules on whether a DOI can redirect multiple times?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@manics great question that led me down a rabbithole! I'm now convinced this approach is not robust enough, and will open a different series of PRs - starting with #1389, and #1390 is a WIP

pdurbin added a commit to IQSS/repo2docker that referenced this pull request Dec 16, 2024
…1388

When the Dataverse content provider was added in jupyterhub#739 it had the
flexibility to operate directly on Dataverse files like this:

repo2docker https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/6ZXAGT/3YRRYJ

However, being able to operate only on datasets (files are stored
in datasets in Dataverse) is enough. That is, this will still work:

repo2docker doi:10.7910/DVN/TJCLKP

And that's all we need.

This simplification builds upon the work in jupyterhub#1388 where the content
of the dataset landing page is not retrieved from the DOI of the
dataset. Instead, the redirect location is fetched, which is all
the Dataverse content provider needs to determine which of the
100+ installations of Dataverse hosts the DOI.

This change should be a no-op for any installation of Datavese with
Binder integration enabled.

Harvard Dataverse (one of the 100+ installations) specifically is
not working with Binder due to a firewall that is blocking
https://dataverse.harvard.edu/citation
The simplification in this commit means that the Dataverse
content provider no longer needs to follow `/citation` to determine
what is on the other side (dataset.xhtml, file.xhtml, etc.). It
assumes that the DOI is always for a dataset (not a file), which
is the expectation we have always set for the Binder tool.

We are tracking Binder not working with Harvard Dataverse here:
IQSS/dataverse.harvard.edu#328
yuvipanda added a commit to yuvipanda/repo2docker that referenced this pull request Dec 16, 2024
As I was debugging jupyterhub#1388,
I realized that PR actually broke the dataverse provider, but the
existing test was mocking so much that we didn't actually catch it!

IMO, since the point of contentproviders is to integrate with
external content providers, they should be integration tests so we
can catch issues with them more easily. Integration tests would have
caught https://jupyter.zulipchat.com/#narrow/channel/103349-ask-anything/topic/Binder.20Dataverse.20error
more cleanly than how it happened for example.

This PR removes all mocks from the dataverse test, and we immediately
benefit - it shows us that the dataverse provider *only* actually
handles DOIs, and not direct URLs! So even though we technically
had tests earlier that showed our dataverse provider supporting direct
dataverse URLs, it simply was not true. So we actually catch the
failure.

I will try to see if we can use a demo or test instance for the
fetch test though, so we don't screw up download stats even more
for the existing test doi we use.
@yuvipanda yuvipanda closed this Dec 17, 2024
@yuvipanda
Copy link
Collaborator Author

Taking a different approach in #1390

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants