Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dataset pages cannot be opened after update to v6.2 with perma PID #10564

Closed
HenningTimm opened this issue May 15, 2024 · 5 comments
Closed

Dataset pages cannot be opened after update to v6.2 with perma PID #10564

HenningTimm opened this issue May 15, 2024 · 5 comments
Labels
Type: Bug a defect

Comments

@HenningTimm
Copy link
Contributor

Error Description

After updating to v6.2, I get a 404 error for all dataset pages. The page only reads "Page Not Found - The page you are looking for was not found." in a red box, reports 404 in the DevTools, and the Payara logs tell me "failed to retrieve version" as shown below:

[#|2024-05-13T16:26:17.001+0200|WARNUNG|Payara 6.2023.8|edu.harvard.iq.dataverse.DatasetPage|_ThreadID=94;_ThreadName=http-thread-pool::jk-connector(4);_TimeMillis=1715610377001;_LevelValue=900;|
  Failed to retrieve version|#]

This error only occurs for datasets that existed prior to the update and have a valid PID. Opening pages with made-up PIDs reports an internal server error when parsing the PID. The datasets show up in the SOLR index with all of their metadata and files can be downloaded, but downloading metadata for the files fails. Datasets created after the migration were not affected.

Underlying Cause

TL/DR: Our Dataverse was running a perma PID provider that used an authority that had a trailing dash, which we later used as separator after the v6.2 update. This (probably) caused splitting issues making the old datasets unavailable since Dataverse could not find them in the respective database column.

This error occurred when migrating from v6.1 to v6.2 with an existing perma PID provider.
Beginning with v6.2 a separator (i.e. the character that sits between the authority and the should of the generated PID) can also be configured for perma PID providers. Prior to v6.2, we had encoded this separator either in the Shoulder or in the Authority, to be able to visually tell the parts of the PID apart. However, we were a bit inconsistent between our staging setup and our production setup:

Staging Server:
  Authority: thz
  Shoulder: -ds
→ (thz)(-ds)(randomString)

Production Server:
  Authority: thz-
  Shoulder: ds-
→ (thz-)(ds-)(randomString)

During migration, we tried to unify this again. So, we switched to the following pattern when setting up the new PID provider (as described in the release notes. Cf. #10516 for some more information about which settings we added) :

Both Server:
  Authority: thz
  Separator: -
  Shoulder: ds-
-> (thz)(-)(ds-)(randomString)

This worked fine for the staging server, but caused the issue described above for our production system.

Our best guess is that this caused issues with splitting the PIDs so that the old datasets could not be accessed, since the trailing - of the authority was used as separator during parsing of the PID. This made PIDs generated before the migration inaccessible since they still expected the - to be there. This can nicely be seen in the database (see below). Working backwards from the ERROR and INFO message in the payara logs, I reached this part in the code which splits at the first occurrence of the separator. This means that the authority for both thz- and thz would be reported as thz. These entries cannot be found in the database though, since the older datasets have the authority thz-.

When we took a look at the database with the following query

SELECT d.authority, d.createdate, d.identifier, d.identifierregistered, d.indextime, d.protocol, d.storageidentifier
FROM dvobject d
where d.dtype = 'Dataset'
order by createdate desc
limit 10;

We got the following result (truncated to two lines, with some data redacted) on our production server. The first entry with authority thz was created after the update, the second one is a dataset that was there before the update.

 authority |       createdate        | identifier         | identifierregistered |        indextime        | protocol |  storageidentifier  
-----------+-------------------------+--------------------+----------------------+-------------------------+----------+---------------------
 thz       | 2024-05-14 15:02:27.685 | ds-<randomString>  | t                    | 2024-05-14 15:02:28.048 | perma    | s3://thz/ds-<randomString>
 thz-      | 2024-mm-dd 10:00:00.000 | ds-<randomString>  | t                    | 2024-05-13 17:22:53.777 | perma    | s3://thz-/ds-<randomString>

Workarounds and Fixes

So, the easiest workaround would be to not use a separator that is a suffix of the authority in the first place.

We had to revert our production server (restored a VM snapshot) and re-ran the update but used / as a separator instead of -. This solved the issue, however, we now need to use a new PID schema: (thz)(/)(ds-)(randomString). This keeps all old datasets available via the legacy provider.

Potentially (warning: this is untested and highly speculative) we could have also manually removed the trailing dashes from the authority-column in the database to keep our established PID pattern, however we were not brave enough to try this.

While we only just begin to dig into the code base, we will check if it is possible to match the whole prefix instead of splitting on the separator when parsing a PID. This will only be relevant for perma PIDs, since DOIs, Handles, etc. follow are more rigid structure. However, for perma PIDs this could probably lead to unintentional behavior, e.g. when the separator is contained anywhere within the authority, not just as a suffix.

Further Reading

@HenningTimm HenningTimm added the Type: Bug a defect label May 15, 2024
@qqmyers
Copy link
Member

qqmyers commented May 15, 2024

Interesting case. Permalinks and the idea of configurable separators definitely have some limitations as you can create ambiguous cases.

My guess is the code is working as ~designed - trying to parse a string like (thz)(-)(ds-)(randomString) with a perma provider that has - as a separator doesn't match dataset with authority thz- and identifier ds-(randomString) - that would match thz--ds-(randomString). (FWIW, the code you point to above doesn't run for permalinks - this parsePersistentId is used instead.) That's a consequence of the db not storing the separator and assuming it matches that of the pid provider. An alternate design could store the separator and that design would be needed to allow two perma providers that only differ by their separator. Even with that design, I'm not sure that allowing a provider to recognize a dataset who's authority includes the separator (and had a null separator) would be a good idea - that would allow two providers to recognize the same dataset. That said, any proposals/PRs to improve how PermaLinks work are welcome. Some limits on what characters can be in authorities and separators could avoid confusing cases and might be fine with everyone.

I can think of two possible ways to fix things:

  • just delete the '-' from the authority of the relevant datasets/datafiles in the database - as you suggest,
  • configure a second perma provider to recognize the legacy datasets and just make your new one with the - separator the default for new datasets.

The former is probably cleaner, but I think the latter is possible and would properly represent what you have now - old datasets created with one form of permalink and new ones with another. (We're not talking about base-urls here, but if you did have different ones for the two cases, you could support that with two providers.)

@cmbz
Copy link

cmbz commented May 20, 2024

Awaiting response from @HenningTimm. When we hear back, we will be able to decide next steps.

@sbarbosadataverse
Copy link

@HenningTimm Hi Tim, just revisiting this issue. We need your input to move this forward. Thanks!

@sbarbosadataverse sbarbosadataverse added the Status: Needs Input Applied to issues in need of input from someone currently unavailable label Jun 26, 2024
@sbarbosadataverse sbarbosadataverse removed the Status: Needs Input Applied to issues in need of input from someone currently unavailable label Jun 26, 2024
@HenningTimm
Copy link
Contributor Author

Sorry for the late reply. We (and by "we" I mean mostly @erodde) did some digging and found out what I did wrong. The PID parsing code behaves as intended and is pretty robust. I think it is safe to say that there is no bug to fix here and everything was caused by user error. Also I do not think that this needs a dedicated entry in the documentation. For the unlikely case that some else faces a similar issue, I will write a bit about what happened and how to avoid it.

What caused the error

Mostly, this was a false assumption on my part on how PIDs are parsed. I thought that as long as the PIDs look the same I could change how they are constructed. As a concrete example I thought moving from

(thz-)(ds-)(randomString)
   |    |        |
   |    |        Identifier
   |    Shoulder
   Authority

to

(thz)(-)(ds-)(randomString)
  |   |  |        |
  |   |  |        Identifier
  |   |  Shoulder
  |   Separator
  Authority

since it seemed like the right way to do it.

As I know now, this did not work, since authorities are stored in the database and matched against using startswith. So the old entries could no longer be found. Just removing the trailing - from the existing datasets would have worked. Staying with the old pattern and keeping the separator empty would probably also have worked. We added another provider with a different separator and that allowed us to circumvent the issue.

Additionally, we also tried some more mischievous combinations of authorities and separators, e.g. (foobar)(-) and (foo)(-) and Dataverse was able to tell them apart neatly.

Who could be affected

This error should only affect institutions that introduced PermaLink PIDs in v6.0 or v6.1 and then decided to restructure these PIDs with v6.2 by changing the authority. I would judge the amount of possibly affected people to be very low.

What should be done now

From my side, this issue can be closed without further action. I do not think this merits changes or additional documentation. If you would like, I could write a sentence or two for the v6.2 release notes on what to avoid, since that is the only step where this error can be introduced.

@pdurbin
Copy link
Member

pdurbin commented Jul 9, 2024

@HenningTimm I really appreciate the thorough writeup!

I agree, I don't think any code or doc change is needed. If anyone has a similar problem, I'll point them to your comment above.

I'll go ahead and close this. Thanks again. ❤️

@pdurbin pdurbin closed this as completed Jul 9, 2024
@cmbz cmbz removed the status in IQSS Dataverse Project Jul 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type: Bug a defect
Projects
Status: No status
Development

No branches or pull requests

5 participants