
Dataset with large number of files #8928

Closed
linsherpa opened this issue Aug 23, 2022 · 5 comments
Labels
D: Dataset: large number of files https://github.com/IQSS/dataverse-pm/issues/27

Comments

@linsherpa

Hi,
I have a dataset that contains ~75k files (each less than 1 MB).

Problem:
I can clearly see and access (open) the files of the dataset by clicking "Files" in the facet area.
But when I click the dataset I receive a "500 Internal Server Error" (after a few minutes), with no updates in server.log.

However, if I make an API call, e.g. for the JSON representation of the dataset, I receive the following output in server.log (text file attached).
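For context, the call I mean is the native API's JSON representation of the dataset, roughly like this (the server URL and persistent ID below are placeholders):

```bash
# Fetch the JSON representation of a dataset via the Dataverse native API.
# SERVER_URL and PERSISTENT_ID are placeholders for the actual installation and dataset.
SERVER_URL=https://your.dataverse.example
PERSISTENT_ID=doi:10.5072/FK2/EXAMPLE

curl "$SERVER_URL/api/datasets/:persistentId/?persistentId=$PERSISTENT_ID"
```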

Are there any other solutions/ideas to tackle this, apart from zipping (double zipping) the files?
[Ticket Number Reference: #324896]
link: https://help.hmdc.harvard.edu/Ticket/Display.html?id=324896

Best Regards
Lincoln
error_serverlog_apicall.txt

@pdurbin
Member

pdurbin commented Aug 23, 2022

@linsherpa thanks for chatting and opening this issue and the ticket.

I'll note that out of the box Dataverse only allows you to unzip 1,000 files at a time from a zip file: https://guides.dataverse.org/en/5.11.1/installation/config.html#multipleuploadfileslimit ... That's the most official statement I could find about how many files are supported in a single dataset... not a very strong one.
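If it helps, that limit is a database setting that can be listed and changed with the admin settings API; a rough sketch (the value 5000 is just an example):

```bash
# List the current database settings; :MultipleUploadFilesLimit appears here
# only if it has been set explicitly (the default is 1000).
curl http://localhost:8080/api/admin/settings

# Raise the per-upload unzip limit (example value only).
curl -X PUT -d 5000 http://localhost:8080/api/admin/settings/:MultipleUploadFilesLimit
```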

As you mentioned, the practical workaround is probably to double zip the files.

For developers I'll mention that scripts/search/data/binary/1000files.zip has 1000 small files we can test with.
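For example, that zip can be pushed into an existing dataset with the native add-file API, roughly like this (API token, server URL, and persistent ID are placeholders):

```bash
# Placeholders: substitute a real API token, server, and dataset persistent ID.
API_TOKEN=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
SERVER_URL=http://localhost:8080
PERSISTENT_ID=doi:10.5072/FK2/EXAMPLE

# Add the test zip to the dataset; the zip is unpacked into individual files on upload,
# so this yields ~1000 small files to test against.
curl -H "X-Dataverse-key:$API_TOKEN" -X POST \
  -F "file=@scripts/search/data/binary/1000files.zip" \
  "$SERVER_URL/api/datasets/:persistentId/add?persistentId=$PERSISTENT_ID"
```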

Finally, here are some open issues related to large numbers of files in a dataset:

@qqmyers
Member

qqmyers commented Aug 23, 2022

FWIW: I think we know performance gets worse as the number of files increases, but I don't think there are any known hard limits. My first guess in general for errors would be timeouts or memory/temp space issues, i.e. it takes Dataverse too long to generate and send the JSON for 75K files and the connection gets closed. Other than the usual checks (looking in the logs and at server load, and looking in the browser dev console or going verbose with curl to see the specific status code and responder, i.e. for timeouts you can see whether it is a load balancer, Apache, etc. that timed out), I'm not sure what else to suggest. (It is surprising that a 500 error can occur without any log info, though.)
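As a concrete example of the curl check, something along these lines shows the status code, the timing, and (in the response headers) a hint about which layer answered; the URL and persistent ID are placeholders:

```bash
# -v prints request/response headers (e.g. Server:), which can hint at whether
# a load balancer, Apache, or the app server produced the error; -w adds timing.
curl -v -o /dev/null \
  -w 'status=%{http_code} total_time=%{time_total}s\n' \
  "https://your.dataverse.example/api/datasets/:persistentId/?persistentId=doi:10.5072/FK2/EXAMPLE"
```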

@linsherpa
Author

linsherpa commented Aug 24, 2022

Thank you @pdurbin and @qqmyers for your valuable suggestions :)

I increased the Timeout parameter of Apache in the config file and now I can see the dataset with all the files (although it takes some time).
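For anyone hitting the same problem, the change was along these lines (the value and file location are illustrative and depend on the installation; a proxied setup may also need ProxyTimeout or a timeout on the ProxyPass line):

```apache
# httpd.conf or the relevant vhost config (illustrative values only)
Timeout 600
ProxyTimeout 600
```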

@mreekie mreekie moved this to ▶SPRINT- NEEDS SIZING in IQSS Dataverse Project Feb 13, 2023
@mreekie mreekie added the D: Dataset: large number of files https://github.com/IQSS/dataverse-pm/issues/27 label Feb 13, 2023
@pdurbin
Member

pdurbin commented Feb 14, 2023

We're talking about this issue in tech hours. Here are some pain points for users:

  • Editing the title (or another piece of metadata) when there are many files (30,000). The save is prohibitively expensive. Affects depositors. Maybe removing the cascade will help. There are two cascades. We could write tests with the 1000files.zip file vs. 1 file: how long does it take to edit the title? (See the sketch after this list.)
  • Slow indexing of a dataset with 30,000 files. Affects sysadmins.
  • Only 20 files but many versions. Slow to make the next version. Multiplying effect. Affects depositors. Reindexing is slow?
  • ...
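A rough sketch of that title-edit measurement, assuming the native editMetadata API and placeholder IDs (the JSON shape follows the edit-metadata examples in the guides):

```bash
# Placeholders: substitute a real API token, server, and dataset persistent ID.
API_TOKEN=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
SERVER_URL=http://localhost:8080
PERSISTENT_ID=doi:10.5072/FK2/EXAMPLE

# A one-field metadata update that only changes the title.
cat > title.json <<'EOF'
{"fields": [{"typeName": "title", "value": "Timing test title"}]}
EOF

# Time the save; run against a 1-file dataset and a 1000files.zip dataset and compare.
time curl -H "X-Dataverse-key:$API_TOKEN" -X PUT \
  --upload-file title.json \
  "$SERVER_URL/api/datasets/:persistentId/editMetadata?persistentId=$PERSISTENT_ID&replace=true"
```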

Other discussion:

  • Does the new zip previewer/downloader help?
  • Creating a large JSON for the tree view?
  • Let's benchmark and measure perf.
  • Check open file handles.

@linsherpa
Author

  • The zip previewer could help to some extent.
  • For those who want to process the large files, uploading a large JSON file of the dataset could be of help (which we are currently also doing).

Other Point

  • Accessing a file in a dataset is not a problem; this can be done via the REST interface, provided its file ID is known.
    However, opening/clicking on a dataset containing a large number of files, where it takes a long time to sort all its files and display them to the user, is the bottleneck.
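For example, a single file can be fetched directly by its database ID, roughly like this (the server and file ID are placeholders):

```bash
# Download one file via the data access API; -O -J saves it under the
# filename sent by the server.
curl -O -J "https://your.dataverse.example/api/access/datafile/123456"
```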
