-
Notifications
You must be signed in to change notification settings - Fork 490
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dataset pages cannot be opened after update to v6.2 with perma PID #10564
Comments
Interesting case. Permalinks and the idea of configurable separators definitely have some limitations as you can create ambiguous cases. My guess is the code is working as ~designed - trying to parse a string like (thz)(-)(ds-)(randomString) with a perma provider that has - as a separator doesn't match dataset with authority thz- and identifier ds-(randomString) - that would match thz--ds-(randomString). (FWIW, the code you point to above doesn't run for permalinks - this parsePersistentId is used instead.) That's a consequence of the db not storing the separator and assuming it matches that of the pid provider. An alternate design could store the separator and that design would be needed to allow two perma providers that only differ by their separator. Even with that design, I'm not sure that allowing a provider to recognize a dataset who's authority includes the separator (and had a null separator) would be a good idea - that would allow two providers to recognize the same dataset. That said, any proposals/PRs to improve how PermaLinks work are welcome. Some limits on what characters can be in authorities and separators could avoid confusing cases and might be fine with everyone. I can think of two possible ways to fix things:
The former is probably cleaner, but I think the latter is possible and would properly represent what you have now - old datasets created with one form of permalink and new ones with another. (We're not talking about base-urls here, but if you did have different ones for the two cases, you could support that with two providers.) |
Awaiting response from @HenningTimm. When we hear back, we will be able to decide next steps. |
@HenningTimm Hi Tim, just revisiting this issue. We need your input to move this forward. Thanks! |
Sorry for the late reply. We (and by "we" I mean mostly @erodde) did some digging and found out what I did wrong. The PID parsing code behaves as intended and is pretty robust. I think it is safe to say that there is no bug to fix here and everything was caused by user error. Also I do not think that this needs a dedicated entry in the documentation. For the unlikely case that some else faces a similar issue, I will write a bit about what happened and how to avoid it. What caused the errorMostly, this was a false assumption on my part on how PIDs are parsed. I thought that as long as the PIDs look the same I could change how they are constructed. As a concrete example I thought moving from
to
since it seemed like the right way to do it. As I know now, this did not work, since authorities are stored in the database and matched against using Additionally, we also tried some more mischievous combinations of authorities and separators, e.g. Who could be affectedThis error should only affect institutions that introduced PermaLink PIDs in v6.0 or v6.1 and then decided to restructure these PIDs with v6.2 by changing the authority. I would judge the amount of possibly affected people to be very low. What should be done nowFrom my side, this issue can be closed without further action. I do not think this merits changes or additional documentation. If you would like, I could write a sentence or two for the v6.2 release notes on what to avoid, since that is the only step where this error can be introduced. |
@HenningTimm I really appreciate the thorough writeup! I agree, I don't think any code or doc change is needed. If anyone has a similar problem, I'll point them to your comment above. I'll go ahead and close this. Thanks again. ❤️ |
Error Description
After updating to v6.2, I get a 404 error for all dataset pages. The page only reads "Page Not Found - The page you are looking for was not found." in a red box, reports 404 in the DevTools, and the Payara logs tell me "failed to retrieve version" as shown below:
This error only occurs for datasets that existed prior to the update and have a valid PID. Opening pages with made-up PIDs reports an internal server error when parsing the PID. The datasets show up in the SOLR index with all of their metadata and files can be downloaded, but downloading metadata for the files fails. Datasets created after the migration were not affected.
Underlying Cause
TL/DR: Our Dataverse was running a perma PID provider that used an authority that had a trailing dash, which we later used as separator after the v6.2 update. This (probably) caused splitting issues making the old datasets unavailable since Dataverse could not find them in the respective database column.
This error occurred when migrating from v6.1 to v6.2 with an existing perma PID provider.
Beginning with v6.2 a separator (i.e. the character that sits between the authority and the should of the generated PID) can also be configured for perma PID providers. Prior to v6.2, we had encoded this separator either in the Shoulder or in the Authority, to be able to visually tell the parts of the PID apart. However, we were a bit inconsistent between our staging setup and our production setup:
During migration, we tried to unify this again. So, we switched to the following pattern when setting up the new PID provider (as described in the release notes. Cf. #10516 for some more information about which settings we added) :
This worked fine for the staging server, but caused the issue described above for our production system.
Our best guess is that this caused issues with splitting the PIDs so that the old datasets could not be accessed, since the trailing
-
of the authority was used as separator during parsing of the PID. This made PIDs generated before the migration inaccessible since they still expected the-
to be there. This can nicely be seen in the database (see below). Working backwards from the ERROR and INFO message in the payara logs, I reached this part in the code which splits at the first occurrence of the separator. This means that the authority for boththz-
andthz
would be reported asthz
. These entries cannot be found in the database though, since the older datasets have the authoritythz-
.When we took a look at the database with the following query
We got the following result (truncated to two lines, with some data redacted) on our production server. The first entry with authority
thz
was created after the update, the second one is a dataset that was there before the update.Workarounds and Fixes
So, the easiest workaround would be to not use a separator that is a suffix of the authority in the first place.
We had to revert our production server (restored a VM snapshot) and re-ran the update but used
/
as a separator instead of-
. This solved the issue, however, we now need to use a new PID schema:(thz)(/)(ds-)(randomString)
. This keeps all old datasets available via the legacy provider.Potentially (warning: this is untested and highly speculative) we could have also manually removed the trailing dashes from the authority-column in the database to keep our established PID pattern, however we were not brave enough to try this.
While we only just begin to dig into the code base, we will check if it is possible to match the whole prefix instead of splitting on the separator when parsing a PID. This will only be relevant for perma PIDs, since DOIs, Handles, etc. follow are more rigid structure. However, for perma PIDs this could probably lead to unintentional behavior, e.g. when the separator is contained anywhere within the authority, not just as a suffix.
Further Reading
The text was updated successfully, but these errors were encountered: