Clean up problem FileSets #2491

Open
5 of 7 tasks
carakey opened this issue May 25, 2023 · 4 comments

carakey commented May 25, 2023

Descriptive summary

During the fileset inventory piece of the 2023 Preservation Assessment, 97 problem filesets were identified and grouped for further action. Briefly, the work is:

  • Delete 31 ingest error filesets that have duplicate filenames
  • Delete 13 failed-test filesets
  • Delete 13 not-found filesets
  • Attempt to reindex / re-characterize 17 not fully characterized filesets
  • Investigate remediation options for 2 ingest error filesets without duplicate filenames
  • Investigate remediation options for 10 corrupted / empty files
  • Defer 11 filesets attached to works in review
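
As a quick arithmetic check, the seven group sizes listed above can be tallied against the stated total of 97. This is only an illustrative sketch; the dictionary labels are shorthand invented here, not names used in the assessment.

```python
# Group sizes copied from the summary list; the labels are informal shorthand.
GROUP_COUNTS = {
    "ingest error, duplicate filename (delete)": 31,
    "failed test (delete)": 13,
    "not found (delete)": 13,
    "not fully characterized (reindex)": 17,
    "ingest error, no duplicate filename (investigate)": 2,
    "corrupted or empty (investigate)": 10,
    "parent work in review (defer)": 11,
}


def total_filesets(counts):
    """Return the total number of filesets across all groups."""
    return sum(counts.values())


if __name__ == "__main__":
    for label, n in GROUP_COUNTS.items():
        print(f"{n:>3}  {label}")
    print(f"{total_filesets(GROUP_COUNTS):>3}  total")
```

The groups add up to the 97 problem filesets reported for the assessment.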

Further details and specific PIDs below, mainly for posterity.

Ingest errors with duplicate functional filenames elsewhere (25, plus 6 special cases)

Characteristics:

  • Fileset Solr record does not include file_size_lts or original_checksum_tesim
  • Fileset page loads at https://ir.library.oregonstate.edu/concern/file_sets/{pid}
  • Fileset page includes metadata - title, depositor, date uploaded
  • Fixity check field displays "Fixity checks have not yet been run on this object"
  • Fileset Solr record does not include mime_type_ssi or file_format_tesim
    • Characterization field displays "not yet characterized"
  • Fileset visibility is private
  • Download does not work
  • No parent work is linked from the fileset page
  • No Solr results are found for a parent using file_set_ids_ssim:{pid}
  • Deposited by admin between 2017-07-05 and 2017-07-11
  • A matching filename exists in another functional fileset
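
The Solr-related checks above could be expressed roughly as follows. This is a minimal sketch, not the procedure actually used in the assessment: the Solr endpoint is a placeholder, the `has_model_ssim:FileSet` filter is an assumption about how filesets are indexed, and the helper names are hypothetical. The field names and the two site URLs come from the list above.

```python
from urllib.parse import urlencode

BASE = "https://ir.library.oregonstate.edu"  # site URL as given in this issue
# Placeholder endpoint (assumption); the real Solr core is not named in this issue.
SOLR_SELECT = "http://localhost:8983/solr/hypothetical-core/select"


def fileset_page_url(pid):
    """URL of the fileset show page for a PID."""
    return f"{BASE}/concern/file_sets/{pid}"


def download_url(pid):
    """Direct download link for a PID."""
    return f"{BASE}/downloads/{pid}"


def missing_fields_query(fields, model="FileSet"):
    """Solr query for records of one model that lack every listed field.

    "-field:[* TO *]" excludes records where the field has any value.
    The has_model_ssim clause is an assumption about the index.
    """
    clauses = [f"has_model_ssim:{model}"]
    clauses += [f"-{field}:[* TO *]" for field in fields]
    return " AND ".join(clauses)


def parent_query(pid):
    """Solr query for works that list this fileset as a member."""
    return f"file_set_ids_ssim:{pid}"


def select_url(query, rows=10):
    """Full select URL for a query against the placeholder endpoint."""
    return f"{SOLR_SELECT}?{urlencode({'q': query, 'rows': rows, 'wt': 'json'})}"


if __name__ == "__main__":
    print(missing_fields_query(["file_size_lts", "original_checksum_tesim"]))
    print(select_url(parent_query("t722h937w"), rows=0))
    print(fileset_page_url("t722h937w"))
```

A parent query that returns zero results corresponds to the "No Solr results are found for a parent" characteristic.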

Actions:

  • Delete filesets

PIDs:

  • t722h937w
  • bz60cw81c
  • hd76s059j
  • 3t945r30c
  • 9019s305m
  • p2676v998
  • 41687h85j
  • dr26xx804
  • 5x21tf89m
  • bk128b566
  • kp78gg577
  • vd66w047h
  • 1831ck437
  • 7m01bm18h
  • z890rt77t
  • 05741s35n
  • mp48sd42g
  • 1544bp99v
  • bk128b54n
  • fj236262m
  • 3r074v48m
  • db78tc46z
  • dz010q587
  • 8p58pd537
  • 2b88qc29j

Special cases:

  • PIDs 0p096755p, 4f16c3641, dn39x2060, b2773w401: Same as the regular ingest error group, except that:
    • Fileset Solr mime_type_ssi and file_format_tesim have values
    • Characterization field displays file format and mime type
    • File is downloadable and usable with either the download button or the direct link, https://ir.library.oregonstate.edu/downloads/{pid}
  • PID 8s45q9241: Same as the regular ingest error group, except that the Fixity check field displays results of a check on 2023-05-16 and fileset visibility is public
  • PID n296x560p: Same as the regular ingest error group, except that it was deposited by simholt on 2021-04-15 and fileset visibility is public

Ingest errors WITHOUT duplicate functional filesets (2)

  • PID pz50gw661: Same as regular ingest error group, except no matching filename was found in another fileset
  • PID 41687p06x: Same as regular ingest error group, except no matching filename was found AND a link to a parent work is present and functional (gt54ks546). Also has a slightly later deposit date and public visibility.

Actions:

  • Look for backups
  • Contact creators
  • Otherwise delete

Failed tests (13)

Characteristics:

  • Fileset Solr record does not include file_size_lts or original_checksum_tesim
  • Fileset page loads at https://ir.library.oregonstate.edu/concern/file_sets/{pid}
  • Fileset page includes metadata - title, depositor, date uploaded
  • Fixity check field displays results of check on 2023-05-16
  • Characterization field displays file format and mime type
  • File is downloadable and usable with either the download button or direct link, https://ir.library.oregonstate.edu/downloads/{pid}
  • A link to a parent work is present but not functional ("unauthorized" though logged in as admin)
  • The displayed title of the parent work indicates it is a test object
  • No Solr results are found for a parent using file_set_ids_ssim:{pid}

Actions:

  • Delete filesets

| PID | Parent PID |
| --- | --- |
| 2514nt408 | ks65hm83j |
| tq57p0138 | cr56n810d |
| 5t34ss11x | cr56n810d |
| 7m01bt90g | cr56n810d |
| gf06g982z | cr56n810d |
| pr76fb20j | cr56n810d |
| 3n204641s | cr56n810d |
| bk128k233 | cr56n810d |
| zs25xh46t | cr56n810d |
| gh93h662p | cr56n810d |
| k3569c18d | cr56n810d |
| 7h149x77p | cr56n810d |
| 4t64gw31r | cr56n810d |
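
Grouping the failed-test filesets by parent makes it easy to see that they hang off only two test works. The sketch below uses the PID pairs listed above; the function and variable names are illustrative only.

```python
from collections import defaultdict

# (fileset PID, parent PID) pairs copied from the failed-tests list.
FAILED_TEST_PAIRS = [
    ("2514nt408", "ks65hm83j"),
    ("tq57p0138", "cr56n810d"),
    ("5t34ss11x", "cr56n810d"),
    ("7m01bt90g", "cr56n810d"),
    ("gf06g982z", "cr56n810d"),
    ("pr76fb20j", "cr56n810d"),
    ("3n204641s", "cr56n810d"),
    ("bk128k233", "cr56n810d"),
    ("zs25xh46t", "cr56n810d"),
    ("gh93h662p", "cr56n810d"),
    ("k3569c18d", "cr56n810d"),
    ("7h149x77p", "cr56n810d"),
    ("4t64gw31r", "cr56n810d"),
]


def group_by_parent(pairs):
    """Map each parent PID to the list of fileset PIDs attached to it."""
    grouped = defaultdict(list)
    for fileset_pid, parent_pid in pairs:
        grouped[parent_pid].append(fileset_pid)
    return dict(grouped)


if __name__ == "__main__":
    for parent, filesets in group_by_parent(FAILED_TEST_PAIRS).items():
        print(f"{parent}: {len(filesets)} fileset(s)")
```

Twelve of the thirteen filesets share the parent cr56n810d, so any cleanup of the parent test works themselves would involve only two objects.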

Fileset not found (13)

Characteristics:

Actions:

  • Delete filesets

PIDs:

  • 3n204545p
  • 5t34sr847
  • 8s45qh558
  • h415pj035
  • j098zj142
  • k0698g363
  • ks65hm497
  • m613n488s
  • nk322m84h
  • qb98mp03h
  • qz20t163g
  • rr172467w
  • vt150r99h

Not fully characterized filesets (17)

Characteristics:

  • Fileset Solr record does not include file_size_lts or original_checksum_tesim
  • Fileset page loads at https://ir.library.oregonstate.edu/concern/file_sets/{pid}
  • Fileset page includes metadata - title, depositor, date uploaded
  • Fixity check field displays results of check on 2023-05-16
  • Characterization field displays file format and mime type
  • File is downloadable and usable with either the download button or direct link, https://ir.library.oregonstate.edu/downloads/{pid}
  • A link to a parent work is present and functional and the work is deposited

Actions:

  • Attempt to resave/reindex filesets to trigger re-characterization

| PID | Parent PID |
| --- | --- |
| 4x51hs18f | x633f780p |
| dv140248c | q524jx251 |
| ws859p66s | 9c67wv964 |
| zp38wm33h | f1881t76x |
| pg15bp255 | 47429j15s |
| n8710002m | 47429j15s |
| mp48sm98k | 47429j15s |
| vx021p319 | g732dh894 |
| sn00b578r | r494vs997 |
| qr46r7630 | 2n49t902k |
| w0892b809 | 79407x82j |
| 6682xc069 | vh53x413c |
| ft848z204 | cr56n8315 |
| 3t946015w | dj52wd37x |
| 6682xb950 | 5m60r066x |
| 8s45qh957 | mc87pz461 |
| rx913z328 | 6q182t650 |

Corrupted or empty files (10)

Characteristics:

Actions:

  • Look for backups
  • Contact depositors
  • Otherwise leave as is

| PID | Parent PID |
| --- | --- |
| t148fp853 | m039kb83x |
| nc580t64j | m039kb83x |
| 5999n970s | m039kb83x |
| 76537308m | pg15bg78z |
| vh53wx19m | 8k71nj46h |
| n870zs774 | mk61rj671 |
| sb397b034 | mk61rj671 |
| 3j333427p | mk61rj671 |
| bg257g832 | mk61rj671 |
| ng451k461 | mk61rj671 |

Incomplete works (11)

Characteristics:

  • Fileset Solr record does not include file_size_lts or original_checksum_tesim
  • Parent work is still in review

Actions:

  • Defer until complete/deposited

PIDs:

  • ww72bk84z
  • 8s45qj261
  • mc87pz79k
  • bc386s629
  • 70795h27k
  • c247f146z
  • xd07h210z
  • x920g518n
  • 8k71nr403
  • s1784v26p
  • 6682xc23z

Related work

#2472
https://docs.google.com/spreadsheets/d/1GUY-jjCaObxC7d68Tx2Y3wqy9SD7TOQeDlj0jeHWNEo/edit#gid=846061299

@carakey carakey self-assigned this May 25, 2023

carakey commented Jun 1, 2023

I was able to delete 22 out of the 31 from the "ingest error" group through the UI, but 9 items persisted after using the delete button + confirm process. I'll add these to a batch delete list with the 13 from the "not-found" group, along with any others that refuse to be deleted in the UI.
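
One way to assemble such a batch delete list is sketched below. It is an illustration under stated assumptions: the nine persisting PIDs are not identified in this comment, so they are left as an empty placeholder to fill in, and the PID pattern (nine lowercase alphanumeric characters) is inferred from the PIDs in this issue rather than from any specification. The helper name is hypothetical.

```python
import re

# Assumption: PIDs are 9 lowercase letters or digits, as seen throughout this issue.
PID_PATTERN = re.compile(r"[0-9a-z]{9}")

# The 13 "not-found" PIDs listed in the issue description.
NOT_FOUND_PIDS = [
    "3n204545p", "5t34sr847", "8s45qh558", "h415pj035", "j098zj142",
    "k0698g363", "ks65hm497", "m613n488s", "nk322m84h", "qb98mp03h",
    "qz20t163g", "rr172467w", "vt150r99h",
]

# Placeholder: the 9 ingest-error PIDs that survived UI deletion are not
# named in this comment, so they must be filled in by hand.
PERSISTED_INGEST_ERROR_PIDS = []


def build_batch_list(*pid_groups):
    """Merge PID lists, preserving order, dropping duplicates, rejecting malformed PIDs."""
    seen = set()
    batch = []
    for group in pid_groups:
        for raw in group:
            pid = raw.strip()
            if not PID_PATTERN.fullmatch(pid):
                raise ValueError(f"unexpected PID format: {raw!r}")
            if pid not in seen:
                seen.add(pid)
                batch.append(pid)
    return batch


if __name__ == "__main__":
    for pid in build_batch_list(PERSISTED_INGEST_ERROR_PIDS, NOT_FOUND_PIDS):
        print(pid)
```

Validating the format before deletion is a cheap guard against a stray header row or truncated PID ending up in a destructive batch.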


carakey commented Jun 2, 2023

Ingest errors WITHOUT duplicate functional filesets (2)

PID pz50gw661: Same as regular ingest error group, except no matching filename was found in another fileset
PID 41687p06x: Same as regular ingest error group, except no matching filename was found AND a link to a parent work is present and functional (gt54ks546). Also has a slightly later deposit date and public visibility.

  • The work for 41687p06x was part of another SA work: one paper from a combined conference proceedings. I extracted its pages from the combined PDF and added them to the work as a new version. Successfully remediated.
  • I found the original file for pz50gw661 in Parthenon, and will create a new work (requires some original description based on the document; old DSpace info is lost).


carakey commented Sep 15, 2023

#2530 - will resolve the "not-found" filesets and the remaining "ingest error" filesets.

#2531 - will resolve the "not fully characterized" filesets (and some others identified more recently).


carakey commented May 25, 2024

Problem filesets from 2024 preservation assessment include:

  • 9 known zombies from #2530
  • 1 object pz50gw661 that has a long saga but at this point should be deleted
  • 1 object wm117x005 that is an apparent remnant of a deleted work and should be deleted
  • 1 known non-characterized object w0892b809 that needs additional investigation
  • 1 object n583z2641 that functions and has characterization data but a non-loading parent, needs additional investigation
  • 10 known empty / corrupted files from this ticket / last year
  • 7 files with inconsistencies in their characterization information, which I manually fixed during the assessment
