Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

9002 allow direct upload setting #9003

Merged
merged 44 commits into from
Sep 29, 2023

Conversation

ErykKul
Copy link
Collaborator

@ErykKul ErykKul commented Sep 29, 2022

What this PR does / why we need it:
In some situations, direct upload might not work from the UI, e.g., when s3 storage is not accessible from the internet. This pull request adds an option to allow direct uploads via API only. This way, a third party application can use direct upload from within the internal network, while there is no direct download available to the users via UI.

Which issue(s) this PR closes:

Closes #9002

Special notes for your reviewer:
I have replaced Boolean.getBoolean("...") with Boolean.parseBoolean(System.getProperty("...")) in some places, as I think these were bugs. Let me know if I need to revert it.

Is there a release notes update needed for this change?:
yes

@coveralls
Copy link

coveralls commented Sep 29, 2022

Coverage Status

coverage: 20.047%. remained the same when pulling 3ecd118 on ErykKul:9002_allow_direct_upload_setting into 5fc7b30 on IQSS:develop.

@@ -545,7 +545,8 @@ List of S3 Storage Options
dataverse.files.<id>.label <?> **Required** label to be shown in the UI for this storage (none)
dataverse.files.<id>.bucket-name <?> The bucket name. See above. (none)
dataverse.files.<id>.download-redirect ``true``/``false`` Enable direct download or proxy through Dataverse. ``false``
dataverse.files.<id>.upload-redirect ``true``/``false`` Enable direct upload of files added to a dataset to the S3 store. ``false``
dataverse.files.<id>.upload-redirect ``true``/``false`` Enable direct upload of files added to a dataset in the S3 store. ``false``
dataverse.files.<id>.api-direct-upload ``true``/``false`` Enable direct upload of files added to a dataset via API only. ``false``
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left notes in the issue discussing whether this might be better as an allow-out-of-band option that could be moved outside of the S3 section. In that case it might be restricted to superuser if desired. In any case, it would work as in this PR w.r.t. having upload-redirect on supercedes it (you need to be able to call the /addFiles endpoint for directupload so it doesn't make sense to have upload-redirect=true and api-direct-upload (or allow-out-of-band-upload)=false.)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have renamed the option to allow-out-of-band-upload. I have also removed it from the s3 section and added this documentation in the general section:

When using integration tools, dataverse installation can be configured to allow out of band upload by setting the dataverse.files.\<id\>.allow-out-of-band-upload JVM option to true.
Files can be then uploaded by an integration tool with datasets/{id}/add api call, or uploaded directly to the storage and registerd in a dataset afterwards using the datasets/{id}/addFiles api call.
Notice that using S3-storage with dataverse.files.\<id\>.upload-redirect JVM option enabled supersedes the allow-out-of-band-upload and will enable direct upload even with allow-out-of-band-upload not set (or set to false).
In other words, dataverse.files.\<id\>.allow-out-of-band-upload option opens the datasets/{id}/add and datasets/{id}/addFiles api endpoints without redirecting uploads in the UI.
Enabling the upload-redirect option allows then direct upload automatically, without the need of enabling the allow-out-of-band-upload (setting it to false does not have any effect in that case).

@mreekie mreekie added the bk2211 label Nov 1, 2022
@ErykKul
Copy link
Collaborator Author

ErykKul commented Nov 16, 2022

@qqmyers @pdurbin I was wondering if other integration tools still work after addition of the redirect feature for s3? For example, I would expect that Data Capture Module (DCM) with rsync would not work anymore with the regular (posix) file storage (and the latest release of Dataverse)?

@qqmyers
Copy link
Member

qqmyers commented Nov 16, 2022

FWIW: This may be something that gets removed as I don't think there is much/any current use. That said, I think we've tried to keep it working for the specific store type it was originally developed for (posix or AWS S3). It's definitely possible that bugs have been introduced but I don't think just the option of direct upload should break it (enabling direct upload doesn't stop normal uploads). If you are trying to use it and see a specific problem, feel free to add an issue/PR - at a minimum if we know for sure it is broken, that may add priority to removing it. (Alternately, if you think DCM needs to stay because it supports some use case that is hard to address with other methods, let us know why and/or if there are ways it could/should be updated to be more useful).

@ErykKul
Copy link
Collaborator Author

ErykKul commented Nov 16, 2022

We are not using DCM, it was just an example. I was just wondering if I am the only one that run into this problem, I did not manage to upload files to posix system with the API (this is the reason for this pull request). Maybe nobody uses posix anymore, or maybe there is a workaround to upload files with the API without using S3? This is blocking for the integration tool I am working on.

@qqmyers
Copy link
Member

qqmyers commented Nov 16, 2022

Ah - I don't think DCM knows about the direct upload API at all - it was built long before that came in. It uses it's own mechanism to associate the uploaded files with the dataset.

Copy link
Contributor

@poikilotherm poikilotherm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of this PR as this might solve a problem for using direct upload at FZJ, using a Ceph-based S3 storage which is not exposed to the internet.

Code looks good to me, but it would be nice to polish the docs a bit. Thank you for your work on this!

doc/sphinx-guides/source/installation/config.rst Outdated Show resolved Hide resolved
doc/release-notes/4.20-release-notes.md Outdated Show resolved Hide resolved
@poikilotherm poikilotherm marked this pull request as draft November 17, 2022 13:56
@poikilotherm
Copy link
Contributor

poikilotherm commented Nov 17, 2022

@ErykKul forgive me being encroaching and converting this to a draft. Doing this to reflect the nature of this PR not being ready for valuation and being sent into the funnel. Trying to play with a better view of what is happening with community contributions and how to get them merged faster. See also https://github.com/orgs/IQSS/projects/36/views/1

@ErykKul
Copy link
Collaborator Author

ErykKul commented Nov 17, 2022

OK, thanks! I have asked @DieuwertjeBloemen to help me with the documentation. We will request the review when we will be ready.

@ErykKul
Copy link
Collaborator Author

ErykKul commented Nov 21, 2022

I have tested adding files with SWORD API, and it works as a workaround. It is only native API that throws "Dataset store configuration does not allow provided storageIdentifier".

@ErykKul ErykKul marked this pull request as ready for review November 21, 2022 09:46
@ErykKul
Copy link
Collaborator Author

ErykKul commented Sep 6, 2023

@qqmyers Thanks for the hint, I will investigate it. Also, if you believe it is better to make the isDirectUploadEnabled non-static and initialize the objects when none can be passed, I will do the refactoring. Other than that, I think that the "waiting" label can be removed. Thank you for your review!

@qqmyers qqmyers removed the Status: Needs Input Applied to issues in need of input from someone currently unavailable label Sep 6, 2023
@ErykKul
Copy link
Collaborator Author

ErykKul commented Sep 7, 2023

@qqmyers, the upload redirect did not work for us because of the CORS policy, and it has nothing to do with tags. I have tried to set it up with the "put-bucket-cors", like the Dataverse documentation explains, but it is not implemented on our S3 (501 error: The requested resource is not implemented). I will test it locally with a CORS proxy and check with our IT department if they can set the CORS policy somehow. Thanks again for pointing me in the right direction!

@qqmyers qqmyers added this to the 6.1 milestone Sep 7, 2023
Copy link
Member

@qqmyers qqmyers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good - asked for a couple potential changes, but overall this addresses #9002.

src/main/java/edu/harvard/iq/dataverse/util/FileUtil.java Outdated Show resolved Hide resolved
@@ -507,6 +507,10 @@ A Dataverse installation can alternately store files in a Swift or S3-compatible

A Dataverse installation may also be configured to reference some files (e.g. large and/or sensitive data) stored in a web-accessible trusted remote store.

A Dataverse installation can be configured to allow out of band upload by setting the ``dataverse.files.\<id\>.upload-out-of-band`` JVM option to ``true``.
By default, Dataverse support uploading files via the :ref:`add-file-api`. With S3 stores, a direct upload process can be enabled to allow sending the file directly to the S3 store (without any intermediate copies on the Dataverse server).
With the upload-out-of-band option enabled, it is also possible for file upload to be managed manually or via third-party tools, with the :ref:`add-file-metadata-api` call used to add metadata and inform Dataverse that a new file has been added to the relevant store.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The doc build is failing because the add-file-metadata-api ref doesn't exist.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made it such that it is not a 'ref' anymore. I was not sure if I should add some references here. Right now, it just states literally add-file-metadata-api, without it being any reference, etc.

@scolapasta
Copy link
Contributor

@ErykKul now that we've released 6.0, I'm moving waiting 6.1 issues back to the board.

But could you first handle the 6.0 merge and address any EE10 issues. Thanks.

Copy link
Member

@qqmyers qqmyers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I updated the direct upload api reference and fixed a typo. Ready for QA.

@qqmyers qqmyers removed their assignment Sep 18, 2023
@kcondon kcondon self-assigned this Sep 22, 2023
@kcondon
Copy link
Contributor

kcondon commented Sep 22, 2023

@ErykKul Would you please add a release note snippet mentioning the jvm option and an example of how to set it, and that the id refers to the name of the s3 store configured in Dataverse to which you want to apply this policy to? Thanks. I'm just missing some context when testing and so think it might help end users and when writing the main release notes.
Kevin

@kcondon
Copy link
Contributor

kcondon commented Sep 27, 2023

Thanks, @ErykKul

Adding release notes snippet, directly taken from installation guide.
Copy link
Member

@pdurbin pdurbin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's convert the release note to Markdown.

doc/release-notes/9002_allow_direct_upload_setting.md Outdated Show resolved Hide resolved
@ErykKul
Copy link
Collaborator Author

ErykKul commented Sep 28, 2023

@kcondon Thanks!

@kcondon kcondon merged commit ed8e966 into IQSS:develop Sep 29, 2023
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Feature: File Upload & Handling Size: 3 A percentage of a sprint. 2.1 hours.
Projects
Status: No status
Status: Closed
Status: WIP
Development

Successfully merging this pull request may close these issues.

isDirectUploadEnabled checks only if upload-redirect property is enabled for a storage driver
8 participants