IQSS/10108 Stata mimetype refinement for direct upload #11054

qqmyers · 2024-11-26T22:26:27Z

What this PR does / why we need it: As noted in the issue and recently in https://groups.google.com/g/dataverse-community/c/TBx0TTins2k, our more-or-less extensible mimetype detection mechanism assumes it starts from a local temp file, which isn't available with direct uploads. The most visible problem is that all Stata *.dta files initially get the generic application/x-stata type, and ingest sends all of those to the Stata 13 ingestor, which then fails for Stata 14/15 files. With normal uploads, the local detector reads the first few bytes of the file and assigns a more specific type, e.g. application/x-stata-14 (or 15, etc.), which gets routed to the correct version of the Stata ingestion code.

Since determining the Stata version relies only on reading the first ~42 bytes of the file, and only needs those bytes in a buffer rather than a File, this PR adds code to retrieve the required bytes and run the Stata version check on direct upload (S3) and presumably on remote/Globus files on S3 (cases where the storageidentifier is provided during upload and where getInputStream works).
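A minimal sketch of the idea described above, not the code in this PR: read only the leading bytes from an InputStream and derive a version-specific type from them. The class and method names are hypothetical. The header prefix and the mapping of release numbers 117/118/119 to Stata 13/14/15 types are assumptions based on my understanding of the newer .dta format, and application/x-stata-13 is assumed by analogy with the types named above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration; not the class used in the PR. */
final class StataVersionSniffer {

    // Assumed start of the XML-like header written by newer .dta releases.
    private static final String HEADER_PREFIX = "<stata_dta><header><release>";
    // Prefix plus a three-digit release number is all the check needs.
    private static final int BYTES_NEEDED = HEADER_PREFIX.length() + 3;

    /** Reads only the leading bytes of the stream, then refines the type. */
    static String refine(InputStream in) throws IOException {
        return refineFromHeader(in.readNBytes(BYTES_NEEDED));
    }

    /** Works on a plain byte array, so no local File is required. */
    static String refineFromHeader(byte[] head) {
        String text = new String(head, StandardCharsets.US_ASCII);
        if (head.length < BYTES_NEEDED || !text.startsWith(HEADER_PREFIX)) {
            return "application/x-stata"; // fall back to the generic type
        }
        String release = text.substring(HEADER_PREFIX.length(), BYTES_NEEDED);
        switch (release) {
            case "117": return "application/x-stata-13"; // assumed mapping
            case "118": return "application/x-stata-14"; // assumed mapping
            case "119": return "application/x-stata-15"; // assumed mapping
            default:    return "application/x-stata";
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] sample = "<stata_dta><header><release>118</release>"
                .getBytes(StandardCharsets.US_ASCII);
        // The stream stands in for whatever getInputStream returns for a stored file.
        System.out.println(refine(new ByteArrayInputStream(sample)));
    }
}
```

Keeping the check on a byte array means the same logic can serve both the local temp file path and the direct upload path, which is the property the description relies on.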

The PR also has some notes about the potential to clean things up further and implement other mimetype detectors that only require a subset of bytes in a more extensible framework. I don't have any plan to implement this, but incremental steps to add other detectors to the direct upload path are probably possible if we see problems in other cases.

Which issue(s) this PR closes:

Closes Not ingested file into Dataverse #10108

Special notes for your reviewer: I also added a note that I think there's a no-op section now where we check files by extension when the extension is null. If someone knows more about the history there, perhaps we can restore whatever was supposed to happen there, or we could delete it.

The code also makes a slight logging change. Since the code was updated to allow getting mimetypes using the full filename, we've been getting info-level suggestions to add that specific filename to MimeTypeDetectionByFileName.properties, which is pretty noisy. I changed the code to keep info-level logging when the extension is checked and isn't in MimeTypeDetectionByFileExtension.properties, but dropped it to fine for the filename check.

Suggestions on how to test this: Find a Stata file with version 14 or 15 and upload it via direct upload/using a direct upload store, verify that it gets a mimetype including the version and is ingested (assuming ingest size limits, etc. allow). (Some Stata files have a *.dta extension - searching for that might help in finding an example.)

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?: included

Additional documentation: slightly updated the docs to indicate the Stata check is now done in direct upload.

coveralls · 2024-11-26T22:37:47Z

coverage: 22.751% (+0.2%) from 22.571%
when pulling df068fa on QualitativeDataRepository:IQSS/10108-StataMimeTypeRefinementForDIrectUpload
into a4d0127 on IQSS:develop.

qqmyers · 2025-01-27T17:58:10Z

src/test/java/edu/harvard/iq/dataverse/api/S3AccessIT.java

+//                .body("data.files[0].dataFile.storageIdentifier", startsWith(driverId + "://"));
+//
+//        String fileId = JsonPath.from(addFileResponse.body().asString()).getString("data.files[0].dataFile.id");
+        long size = 1000000000l;


I think this only works (here and the test above) because the localstack must be configured with this as the part size? If the part size were smaller, e.g. the min of 5 MB, using this would return URLs for multiple parts whereas the small test files would only be 1 part still.

Buh. I'm not sure.
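The part-size question above comes down to ceiling division. The sketch below assumes, as the comment suggests, that a store hands back one upload URL per part; the class and method names are hypothetical, and 5 MiB is used for the "min of 5 MB" mentioned above.

```java
/** Hypothetical helper; illustrates the arithmetic only. */
final class PartCount {

    /** Number of parts needed to cover sizeBytes with parts of partSizeBytes. */
    static long partsFor(long sizeBytes, long partSizeBytes) {
        if (sizeBytes <= 0 || partSizeBytes <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        // Ceiling division without floating point.
        return (sizeBytes + partSizeBytes - 1) / partSizeBytes;
    }

    public static void main(String[] args) {
        long declaredSize = 1_000_000_000L;      // the size used in the test
        long minPartSize = 5L * 1024 * 1024;     // 5 MiB

        // Part size equal to the declared size: a single part.
        System.out.println(partsFor(declaredSize, declaredSize)); // prints 1
        // Minimum part size: many parts for the same declared size.
        System.out.println(partsFor(declaredSize, minPartSize));  // prints 191
    }
}
```

Under that assumption, a declared size of 1000000000 would yield 191 URLs at the minimum part size, while the small test file itself would still fit in one part.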

pdurbin

@qqmyers I made a few tiny commits to docs and comments that I don't think you'll mind. I left a comment about a todo I don't understand. Please take a look.

@landreev I'm going to go ahead and approve this but you've worked on this code a lot so please feel free to fish it out of "ready for qa" if you'd like to take a look.

pdurbin · 2025-01-28T19:35:13Z

src/main/java/edu/harvard/iq/dataverse/util/FileUtil.java

@@ -495,6 +532,7 @@ public static String determineFileType(File f, String fileName) throws IOExcepti
                logger.fine("mime type recognized by extension: "+fileType);
            }
        } else {
+            //ToDo - if the extension is null, how can this call do anything


Suggested change

//ToDo - if the extension is null, how can this call do anything

I'm not sure why this todo is here so I'm suggesting we delete it. In this PR...

File Recognition - Add support for files without extensions #8744

... we added support for detecting files like Makefile or Dockerfile. Having no extension is legit. Maybe I'm just misunderstanding the todo. 🤷
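To make the question about a null extension concrete, here is a rough sketch of a name-first, extension-second lookup. It is not the FileUtil code: the class, the two maps (stand-ins for the two .properties files), and the mimetype values are all hypothetical. The log levels follow the change described in the PR text.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.logging.Logger;

/** Hypothetical lookup for illustration; not the FileUtil implementation. */
final class NameBasedTypeLookup {

    private static final Logger logger = Logger.getLogger(NameBasedTypeLookup.class.getName());

    // Stand-ins for the by-filename and by-extension properties files (illustrative values).
    private static final Map<String, String> BY_FILE_NAME =
            Map.of("Dockerfile", "application/x-docker-file", "Makefile", "text/x-makefile");
    private static final Map<String, String> BY_EXTENSION =
            Map.of("csv", "text/csv");

    static Optional<String> lookup(String fileName) {
        String byName = BY_FILE_NAME.get(fileName);
        if (byName != null) {
            return Optional.of(byName);
        }
        // A miss on the full name is routine, so it is logged quietly.
        logger.fine("no mimetype entry for file name: " + fileName);

        String extension = extensionOf(fileName);
        if (extension == null) {
            // With no extension there is nothing left to look up by extension.
            return Optional.empty();
        }
        String byExtension = BY_EXTENSION.get(extension);
        if (byExtension == null) {
            logger.info("no mimetype entry for extension: " + extension);
        }
        return Optional.ofNullable(byExtension);
    }

    /** Returns the lower-cased extension, or null when the name has none. */
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        boolean hasExtension = dot > 0 && dot < fileName.length() - 1;
        return hasExtension ? fileName.substring(dot + 1).toLowerCase(Locale.ROOT) : null;
    }

    public static void main(String[] args) {
        System.out.println(lookup("Dockerfile"));
        System.out.println(lookup("data.csv"));
        System.out.println(lookup("README"));
    }
}
```

In this sketch a file such as Makefile is resolved by its full name before the extension is ever consulted, so a null extension only matters for names that are in neither table.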

ofahimIQSS · 2025-02-03T21:21:20Z

continuous-integration/jenkins/pr-merge is failing on this ticket.

qqmyers · 2025-02-03T21:42:16Z

Guessing it's a timing issue - not waiting for ingest to finish. I've added a sleepForLock to fix it.
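For readers unfamiliar with this kind of fix, the general pattern is to poll until the asynchronous work is done instead of asserting immediately. The sketch below is a generic, hypothetical helper and not the sleepForLock used in the test suite; the condition passed in stands for "the ingest lock is gone".

```java
import java.util.function.BooleanSupplier;

/** Hypothetical polling helper; illustrates the pattern only. */
final class LockWait {

    /**
     * Checks the condition up to maxAttempts times, pausing between checks.
     * Returns true as soon as the condition holds, false if it never does.
     */
    static boolean waitUntil(BooleanSupplier done, int maxAttempts, long pauseMillis) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (done.getAsBoolean()) {
                return true;
            }
            try {
                Thread.sleep(pauseMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Stand-in condition: pretend ingest finishes after about 50 ms.
        boolean finished = waitUntil(() -> System.currentTimeMillis() - start >= 50, 20, 10);
        System.out.println("ingest finished: " + finished);
    }
}
```

A test would call such a helper after the upload and only then assert on the file's mimetype and ingest status.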

ofahimIQSS · 2025-02-03T21:53:14Z

Was able to reproduce the issue on internal and saw that the Stata 14 file wasn't getting ingested.

Tested the fix, looks good.

Going to merge the PR after continuous-integration/jenkins/pr-merge completes.

qqmyers added 10 commits November 26, 2024 16:51

try adding stata version checking for direct upload

f7817b6

restore IOException handling

8e2b183

call open on storageIO first

308f281

change try scope

3285f9c

safely set datafile owner

ecbe9d7

use opened storageIO

5593576

typo - use DTA check, change others to just use ByteBuffer

8287063

quiet logging in the by filename check

5d1fb04

color commentary on future changes

2d0ebb9

docs/release notes

ad9112d

Merge remote-tracking branch 'IQSS/develop' into IQSS/10108-StataMime…

976439f

…TypeRefinementForDIrectUpload

cmbz added the FY25 Sprint 14 (2025-01-02 - 2025-01-15) label Jan 15, 2025

qqmyers added the Size: 10 (a percentage of a sprint; 7 hours) label Jan 15, 2025

cmbz added the FY25 Sprint 15 (2025-01-15 - 2025-01-29) label Jan 15, 2025

pdurbin self-assigned this Jan 27, 2025

add direct upload test for Stata file IQSS#10108

45709b5

pdurbin reviewed Jan 27, 2025

src/test/java/edu/harvard/iq/dataverse/api/S3AccessIT.java

pdurbin assigned qqmyers Jan 27, 2025

qqmyers commented Jan 27, 2025

pdurbin added 7 commits January 27, 2025 13:40

test with mimeType=application/octet-stream IQSS#10108

614dc7e

demonstrate Stata file detection working, other cleanup IQSS#10108

af08077

death to spaces in file names IQSS#10108

4078adc

add issue and PR to notes IQSS#10108

ccbcb39

specify S3 IQSS#10108

fa35599

link to the guides from note IQSS#10108

f87e356

typo IQSS#10108

e0414a0

pdurbin approved these changes Jan 28, 2025

pdurbin requested a review from landreev January 28, 2025 19:39

pdurbin removed their assignment Jan 28, 2025

pdurbin unassigned qqmyers Jan 28, 2025

cmbz added the FY25 Sprint 16 (2025-01-29 - 2025-02-12) label Jan 30, 2025

ofahimIQSS self-assigned this Feb 3, 2025

add sleep for ingest

df068fa

ofahimIQSS merged commit 6be4f20 into IQSS:develop Feb 3, 2025
12 checks passed

ofahimIQSS removed their assignment Feb 3, 2025

pdurbin added this to the 6.6 milestone Feb 3, 2025
