2023 Annual Preservation Assessment #2472

Closed
6 tasks done
carakey opened this issue May 1, 2023 · 8 comments

carakey commented May 1, 2023

During the CoreTrustSeal certification, we committed to and described specific actions we would take annually to assess preservation needs for ScholarsArchive@OSU, especially with regard to the file formats we store.

  • Review/address fixity reports
  • Inventory file formats currently in SA (using Solr)
  • Review/update the preferred file formats guide
  • Review/propose updates to the SA preservation policy
  • Since this is the inaugural assessment: Document procedures for continuing annual assessments
  • Produce a brief report

The target time period for this work is May 1-5, 2023, which is designated as Preservation Week by ALA. This is concurrent with the LIT Spring 2023 ScholarsArchive workcycle. Relevant information will be shared here as the assessment progresses.

@carakey carakey self-assigned this May 1, 2023
Author

carakey commented May 3, 2023

Preferred File Formats proposed updates are here: https://docs.google.com/document/d/1gPEFQA1xKTYqdxXEIxmgtS5HWRuV15OtrlUfx-0_wSc/edit#heading=h.xwvqgjt37477

Edits have been made in the LibGuide. The LibGuide includes a changelog.

https://guides.library.oregonstate.edu/Scholars-Archive/PreferredFileFormats

Update

Based on the file format inventory work, a few more changes were made to the LibGuide, and another changelog was added.

Author

carakey commented May 17, 2023

File Format Inventory:

  • Total filesets: 78,281
    • Standard PDFs: 72,846
    • PDF/As: 761
    • All other filesets, compared against the Preferred File Formats guide:
      • Highest Confidence formats: 1,548
      • Medium Confidence formats: 2,300
      • Lowest Confidence formats: 583
      • Not rated: 203 filesets with 33 unique file formats
      • Errors: 40
  • Unique file_format_tesim values: 184
  • Unique mime_type_ssi values: 60
  • Unique (normalized) file extensions: 108
  • Unique combinations of file_format, mime_type, and file extension values: 255
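The unique-value counts above could be tallied roughly as follows. This is a minimal Ruby sketch, not the procedure actually used: it assumes the FileSet documents have already been fetched from Solr as an array of hashes, that `file_format_tesim` is multi-valued, and that the file name lives under a hypothetical `"filename"` key. The sample values are made up for illustration.

```ruby
require "set"

# Normalize a file extension: lowercase, no leading dot; nil when absent.
def normalized_extension(filename)
  ext = File.extname(filename.to_s).downcase.delete_prefix(".")
  ext.empty? ? nil : ext
end

# docs: array of hashes shaped like Solr FileSet documents (assumed shape).
# Returns the number of unique values per field and of unique combinations.
def inventory_counts(docs)
  formats = Set.new
  mimes = Set.new
  extensions = Set.new
  combos = Set.new

  docs.each do |doc|
    format = Array(doc["file_format_tesim"]).first
    mime = doc["mime_type_ssi"]
    ext = normalized_extension(doc["filename"]) # "filename" is a hypothetical key

    formats << format if format
    mimes << mime if mime
    extensions << ext if ext
    combos << [format, mime, ext]
  end

  { formats: formats.size, mime_types: mimes.size,
    extensions: extensions.size, combinations: combos.size }
end

# Illustrative sample data, not taken from the repository.
SAMPLE_DOCS = [
  { "file_format_tesim" => ["pdf (Portable Document Format)"],
    "mime_type_ssi" => "application/pdf", "filename" => "thesis.PDF" },
  { "file_format_tesim" => ["pdf (Portable Document Format)"],
    "mime_type_ssi" => "application/pdf", "filename" => "poster.pdf" },
  { "file_format_tesim" => ["plain"],
    "mime_type_ssi" => "text/plain", "filename" => "data.csv" }
].freeze

p inventory_counts(SAMPLE_DOCS)
```

Normalizing the extension before counting is what keeps `thesis.PDF` and `poster.pdf` from being counted as two different extensions.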

Continuing/spinoff work involves:

  • Evaluating the unrated formats; considering whether to 'promote' any to the public preferred format guide and adding them there if so; or at a minimum, providing some local guidance to data curators to inform future submission review.
    • Finished the evaluation, and shared information with research data services for comment before updating the guide
  • Examining the errors; looking at possible patterns of problems with file characterization and ticketing any issues identified.

Author

carakey commented May 26, 2023

Continuing/spinoff work involves:

  • Examining the errors; looking at possible patterns of problems with file characterization and ticketing any issues identified.

There is some messiness in characterization, especially in how the file_format_tesim string values are assigned and how their parenthetical components get ordered. For instance, these three different values are assigned to EPUB filesets:

  • epub+zip (EPUB ebook data, Electronic Publication, ZIP Format)
  • epub+zip (EPUB ebook data, ZIP Format, Electronic Publication)
  • epub+zip (ZIP Format, EPUB ebook data, Electronic Publication)

...BUT, as far as I can tell, this inventory effort is the only time it creates a problem, and a minor one at that, so I think it's not worth spending time on.
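If a future inventory does want to collapse these variants, one option is to sort the parenthetical components before counting. A minimal Ruby sketch, assuming the values always have the shape `name (a, b, c)` and that no component contains a comma itself; `normalize_file_format` is a hypothetical helper, not something that exists in the application:

```ruby
# Rebuild "name (a, b, c)" with the parenthetical components sorted,
# so that component order no longer produces distinct values.
# Values without a parenthetical part are returned stripped but unchanged.
def normalize_file_format(value)
  match = value.strip.match(/\A(?<name>[^(]+?)\s*\((?<parts>.*)\)\z/)
  return value.strip unless match

  parts = match[:parts].split(",").map(&:strip).reject(&:empty?).sort
  "#{match[:name]} (#{parts.join(', ')})"
end

# The three EPUB values listed above.
EPUB_VALUES = [
  "epub+zip (EPUB ebook data, Electronic Publication, ZIP Format)",
  "epub+zip (EPUB ebook data, ZIP Format, Electronic Publication)",
  "epub+zip (ZIP Format, EPUB ebook data, Electronic Publication)"
].freeze

puts EPUB_VALUES.map { |v| normalize_file_format(v) }.uniq
```

With this normalization the three values collapse to a single string, which would only matter for counting during the inventory; the stored values stay untouched.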

Contributor

CGillen commented May 30, 2023

The fixity check found 153188 log messages, each corresponding to a file version. When those log messages are mapped to unique fileset ids, they account for 78243 filesets:

irb(main):038:0> logs.distinct.count('file_set_id')
   (65.2ms)  SELECT COUNT(DISTINCT `checksum_audit_logs`.`file_set_id`) FROM `checksum_audit_logs`
=> 78243

This is still 63 filesets short of what @carakey found in a bit of a post-mortem check:

q - has_model_ssim:FileSet // numFound:78306

There was a time gap of about 19 hours during which 1 new fileset was deposited, so we are 62 filesets short.
I'm still investigating those 62; I'll update this comment when I find anything out.

EDIT:

Well, the mystery deepens. There are actually 60 filesets that don't exist in the audit logs and 14 audit logs that don't match any fileset from the same timespan.
The 14 in the logs don't exist anymore, so those are probably just deletions between now and then.
The 60 in the Solr documents also don't exist anymore.
I'm not entirely sure where that leaves us, but I guess the mismatched numbers are fine? It must come down to solr documents that point to deleted filesets and audit logs for filesets that are also deleted.
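For reference, the comparison described here boils down to a two-way set difference. A minimal Ruby sketch, assuming the fileset ids from Solr and the `file_set_id` values from `checksum_audit_logs` have already been exported as plain arrays; `compare_ids` and the sample ids are hypothetical:

```ruby
require "set"

# Compare fileset ids known to Solr with ids present in the audit logs.
# Returns the ids found on only one side, sorted for stable output.
def compare_ids(solr_ids, audit_log_ids)
  solr = solr_ids.to_set
  logs = audit_log_ids.to_set
  {
    missing_from_logs: (solr - logs).to_a.sort, # in Solr, never audited
    missing_from_solr: (logs - solr).to_a.sort  # audited, no Solr document
  }
end

# Made-up ids for illustration; duplicates mimic one log row per file version.
SOLR_IDS = %w[aa11 bb22 cc33 dd44].freeze
AUDIT_LOG_IDS = %w[aa11 aa11 bb22 ee55].freeze

p compare_ids(SOLR_IDS, AUDIT_LOG_IDS)
```

Converting to sets first also absorbs the repeated ids that come from having one log row per file version.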

At least we figured out why there are WAY more audit logs than filesets: each log message corresponds to a file version, not a fileset.

Author

carakey commented Jun 1, 2023

> I'm not entirely sure where that leaves us, but I guess the mismatched numbers are fine? It must come down to solr documents that point to deleted filesets and audit logs for filesets that are also deleted.

I really appreciate @CGillen looking into this! I can let go of the mismatched numbers, but will be curious what kind of count we get next time after cleaning up some of the busted filesets. I think the outstanding issue with fixity checking is those emails not going out. Do we need a ticket for that?

Author

carakey commented Jun 2, 2023

re: fixity reporting emails, I reopened #2421.

Author

carakey commented Jun 13, 2023

Status update:

  • Fixity report review: complete per scope (though related work remains with emails)
  • Preservation policy: recommended changes have gone to SA@OSU User Group for approval; still need to be incorporated into wiki page (around 6/16)
  • Procedures documentation: still needs attention
  • Assessment report: a preview has gone to the Digital Preservation Interest Group; will finalize and post to library website after the policy update is complete

Author

carakey commented Aug 24, 2023

Assessment report and policy changes passed user group review and went live in June. Procedures doc exists but is rough, and will want more work in the next assessment.
