-
Notifications
You must be signed in to change notification settings - Fork 500
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
IQSS/10108 Stata mimetype refinement for direct upload #11054
IQSS/10108 Stata mimetype refinement for direct upload #11054
Conversation
…TypeRefinementForDIrectUpload
// .body("data.files[0].dataFile.storageIdentifier", startsWith(driverId + "://")); | ||
// | ||
// String fileId = JsonPath.from(addFileResponse.body().asString()).getString("data.files[0].dataFile.id"); | ||
long size = 1000000000l; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this only works (here and the test above) because the localstack must be configured with this as the part size? If the part size were smaller, e.g. the min of 5 MB, using this would return URLs for multiple parts whereas the small test files would only be 1 part still.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Buh. I'm not sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@qqmyers I made a few tiny commits to docs and comments that I don't think you'll mind. I left a comment about a todo I don't understand. Please take a look.
@landreev I'm going to go ahead and approve this but you've worked on this code a lot so please feel free to fish it out of "ready for qa" if you'd like to take a look.
@@ -495,6 +532,7 @@ public static String determineFileType(File f, String fileName) throws IOExcepti | |||
logger.fine("mime type recognized by extension: "+fileType); | |||
} | |||
} else { | |||
//ToDo - if the extension is null, how can this call do anything |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
//ToDo - if the extension is null, how can this call do anything |
I'm not sure why this todo is here so I'm suggesting we delete it. In this PR...
... we added support for detecting files like Makefile or Dockerfile. Having no extension is legit. Maybe I'm just misunderstanding the todo. 🤷
continuous-integration/jenkins/pr-merge is failing on this ticket. |
Guessing it's a timing issue - not waiting for ingest to finish. I've added a sleepForLock to fix it. |
What this PR does / why we need it: As noted in the issue and recently in https://groups.google.com/g/dataverse-community/c/TBx0TTins2k, our ~extensible mimetype detector mechanism assumes starting from a local temp file which isn't useful with direct uploads. The most visible problem we have with this is that all Stata *.dta files get an application/x-stata type to start and our ingest sends all of those to the Stata 13 ingestor, which then fails for Stata 14/15 files. With normal uploads, the local detector reads the first few bytes of the file and assigns a more specific type, e.g. application/x-stata-14 (or 15, etc) which gets routed to the correct version of the Stata ingestion code.
Since the determination of Stata version only relies on reading the first ~42 bytes of the file, and only needs those in a buffer/doesn't require a File to start from, this PR adds code to retrieve the required bytes and run the Stata version check on direct upload(S3) and presumably remote files/Globus files on S3 (cases where the storageidentifier is provided during upload and where getInputStream works.)
The PR also has some notes about the potential to clean things up further and implement other mimetype detectors that only require a subset of bytes in a more extensible framework. I don't have any plan to implement this but incremental steps to add other detectors to the direct upload path are probably possible if there are other cases where we're seeing problems.
Which issue(s) this PR closes:
Special notes for your reviewer: I also added a note that I think there's a no-op section now where we check files by extension when the extension is null. If someone knows more about the history there, perhaps we can restore whatever was supposed to happen there, or we could delete it.
The code also makes a slight logging change. After the code was updated to allow getting Mimetypes using the full filename, we've been getting info-level suggestions to perhaps add that specific filename to MimeTypeDetectionByFileName.properties which is pretty noisy. I changed the code to keep info-level if the extension is checked and isn't in MimeTypeDetectionByFileExtension.properties but dropped it to fine for the filename check.
Suggestions on how to test this: Find a Stata file with version 14 or 15 and upload it via direct upload/using a direct upload store, verify that it gets a mimetype including the version and is ingested (assuming ingest size limits, etc. allow). (Some Stata files have a *.dta extension - searching for that might help in finding an example.)
Does this PR introduce a user interface change? If mockups are available, please link/include them here:
Is there a release notes update needed for this change?: included
Additional documentation: slightly updated the docs to indicate the Stata check is now done in direct upload.