Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IQSS/10108 Stata mimetype refinement for direct upload #11054

Conversation

qqmyers
Copy link
Member

@qqmyers qqmyers commented Nov 26, 2024

What this PR does / why we need it: As noted in the issue and recently in https://groups.google.com/g/dataverse-community/c/TBx0TTins2k, our ~extensible mimetype detector mechanism assumes starting from a local temp file which isn't useful with direct uploads. The most visible problem we have with this is that all Stata *.dta files get an application/x-stata type to start and our ingest sends all of those to the Stata 13 ingestor, which then fails for Stata 14/15 files. With normal uploads, the local detector reads the first few bytes of the file and assigns a more specific type, e.g. application/x-stata-14 (or 15, etc) which gets routed to the correct version of the Stata ingestion code.

Since the determination of Stata version only relies on reading the first ~42 bytes of the file, and only needs those in a buffer/doesn't require a File to start from, this PR adds code to retrieve the required bytes and run the Stata version check on direct upload(S3) and presumably remote files/Globus files on S3 (cases where the storageidentifier is provided during upload and where getInputStream works.)

The PR also has some notes about the potential to clean things up further and implement other mimetype detectors that only require a subset of bytes in a more extensible framework. I don't have any plan to implement this but incremental steps to add other detectors to the direct upload path are probably possible if there are other cases where we're seeing problems.

Which issue(s) this PR closes:

Special notes for your reviewer: I also added a note that I think there's a no-op section now where we check files by extension when the extension is null. If someone knows more about the history there, perhaps we can restore whatever was supposed to happen there, or we could delete it.

The code also makes a slight logging change. After the code was updated to allow getting Mimetypes using the full filename, we've been getting info-level suggestions to perhaps add that specific filename to MimeTypeDetectionByFileName.properties which is pretty noisy. I changed the code to keep info-level if the extension is checked and isn't in MimeTypeDetectionByFileExtension.properties but dropped it to fine for the filename check.

Suggestions on how to test this: Find a Stata file with version 14 or 15 and upload it via direct upload/using a direct upload store, verify that it gets a mimetype including the version and is ingested (assuming ingest size limits, etc. allow). (Some Stata files have a *.dta extension - searching for that might help in finding an example.)

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?: included

Additional documentation: slightly updated the docs to indicate the Stata check is now done in direct upload.

@coveralls
Copy link

coveralls commented Nov 26, 2024

Coverage Status

coverage: 22.751% (+0.2%) from 22.571%
when pulling df068fa on QualitativeDataRepository:IQSS/10108-StataMimeTypeRefinementForDIrectUpload
into a4d0127 on IQSS:develop.

@cmbz cmbz added the FY25 Sprint 14 FY25 Sprint 14 (2025-01-02 - 2025-01-15) label Jan 15, 2025
@qqmyers qqmyers added the Size: 10 A percentage of a sprint. 7 hours. label Jan 15, 2025
@cmbz cmbz added the FY25 Sprint 15 FY25 Sprint 15 (2025-01-15 - 2025-01-29) label Jan 15, 2025
@pdurbin pdurbin self-assigned this Jan 27, 2025
// .body("data.files[0].dataFile.storageIdentifier", startsWith(driverId + "://"));
//
// String fileId = JsonPath.from(addFileResponse.body().asString()).getString("data.files[0].dataFile.id");
long size = 1000000000l;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this only works (here and the test above) because the localstack must be configured with this as the part size? If the part size were smaller, e.g. the min of 5 MB, using this would return URLs for multiple parts whereas the small test files would only be 1 part still.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Buh. I'm not sure.

Copy link
Member

@pdurbin pdurbin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@qqmyers I made a few tiny commits to docs and comments that I don't think you'll mind. I left a comment about a todo I don't understand. Please take a look.

@landreev I'm going to go ahead and approve this but you've worked on this code a lot so please feel free to fish it out of "ready for qa" if you'd like to take a look.

@@ -495,6 +532,7 @@ public static String determineFileType(File f, String fileName) throws IOExcepti
logger.fine("mime type recognized by extension: "+fileType);
}
} else {
//ToDo - if the extension is null, how can this call do anything
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
//ToDo - if the extension is null, how can this call do anything

I'm not sure why this todo is here so I'm suggesting we delete it. In this PR...

... we added support for detecting files like Makefile or Dockerfile. Having no extension is legit. Maybe I'm just misunderstanding the todo. 🤷

@pdurbin pdurbin requested a review from landreev January 28, 2025 19:39
@pdurbin pdurbin removed their assignment Jan 28, 2025
@cmbz cmbz added the FY25 Sprint 16 FY25 Sprint 16 (2025-01-29 - 2025-02-12) label Jan 30, 2025
@ofahimIQSS ofahimIQSS self-assigned this Feb 3, 2025
@ofahimIQSS
Copy link
Contributor

continuous-integration/jenkins/pr-merge is failing on this ticket.

@qqmyers
Copy link
Member Author

qqmyers commented Feb 3, 2025

Guessing it's a timing issue - not waiting for ingest to finish. I've added a sleepForLock to fix it.

@ofahimIQSS
Copy link
Contributor

ofahimIQSS commented Feb 3, 2025

Was able to reproduce issue on internal and saw that stata 14 file wasn't getting ingested.

Tested the fix, looks good:
image

Going to merge PR after continuous-integration/jenkins/pr-merge completes

@ofahimIQSS ofahimIQSS merged commit 6be4f20 into IQSS:develop Feb 3, 2025
12 checks passed
@ofahimIQSS ofahimIQSS removed their assignment Feb 3, 2025
@pdurbin pdurbin added this to the 6.6 milestone Feb 3, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
FY25 Sprint 14 FY25 Sprint 14 (2025-01-02 - 2025-01-15) FY25 Sprint 15 FY25 Sprint 15 (2025-01-15 - 2025-01-29) FY25 Sprint 16 FY25 Sprint 16 (2025-01-29 - 2025-02-12) Size: 10 A percentage of a sprint. 7 hours.
Projects
Status: Done 🧹
Development

Successfully merging this pull request may close these issues.

Not ingested file into Dataverse
5 participants