
Dataset with large number of files #8928

Closed
linsherpa opened this issue Aug 23, 2022 · 5 comments
Labels
D: Dataset: large number of files https://github.com/IQSS/dataverse-pm/issues/27

Comments

@linsherpa

Hi,
I have a dataset that contains ~75k files (each less than 1 MB).

Problem:
I can clearly see and access (open) the files of the dataset by clicking "Files" in the facet area.
But when I click the dataset I receive a "500 Internal Server Error" (after a few minutes), with no updates in server.log.

However, if I make an API call, e.g. for the JSON representation of the dataset, I receive the following output in server.log (text file attached).
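For context, the call I mean is the native API's JSON representation of the dataset, roughly like this (the server URL and persistent ID below are placeholders):

```bash
# Fetch the JSON representation of a dataset via the Dataverse native API.
# SERVER_URL and PERSISTENT_ID are placeholders for the actual installation and dataset.
SERVER_URL=https://your.dataverse.example
PERSISTENT_ID=doi:10.5072/FK2/EXAMPLE

curl "$SERVER_URL/api/datasets/:persistentId/?persistentId=$PERSISTENT_ID"
```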

Are there any other solutions/ideas to tackle this, apart from zipping (double zipping) the files?
[Ticket Number Reference: #324896]
link: https://help.hmdc.harvard.edu/Ticket/Display.html?id=324896

Best Regards
Lincoln
error_serverlog_apicall.txt

@pdurbin
Member

pdurbin commented Aug 23, 2022

@linsherpa thanks for chatting and opening this issue and the ticket.

I'll note that out of the box Dataverse only allows you to unzip 1,000 files at a time from a zip file: https://guides.dataverse.org/en/5.11.1/installation/config.html#multipleuploadfileslimit ... That's the most official statement I could find about how many files are supported in a single dataset... not a very strong one.
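If it helps, that limit is a database setting that can be listed and changed with the admin settings API; a rough sketch (the value 5000 is just an example):

```bash
# List the current database settings; :MultipleUploadFilesLimit appears here
# only if it has been set explicitly (the default is 1000).
curl http://localhost:8080/api/admin/settings

# Raise the per-upload unzip limit (example value only).
curl -X PUT -d 5000 http://localhost:8080/api/admin/settings/:MultipleUploadFilesLimit
```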

As you mentioned, the practical workaround is probably to double zip the files.

For developers I'll mention that scripts/search/data/binary/1000files.zip has 1000 small files we can test with.
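For example, that zip can be pushed into an existing dataset with the native add-file API, roughly like this (API token, server URL, and persistent ID are placeholders):

```bash
# Placeholders: substitute a real API token, server, and dataset persistent ID.
API_TOKEN=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
SERVER_URL=http://localhost:8080
PERSISTENT_ID=doi:10.5072/FK2/EXAMPLE

# Add the test zip to the dataset; the zip is unpacked into individual files on upload,
# so this yields ~1000 small files to test against.
curl -H "X-Dataverse-key:$API_TOKEN" -X POST \
  -F "file=@scripts/search/data/binary/1000files.zip" \
  "$SERVER_URL/api/datasets/:persistentId/add?persistentId=$PERSISTENT_ID"
```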

Finally, here are some open issues related to large numbers of files in a dataset:

@qqmyers
Member

qqmyers commented Aug 23, 2022

FWIW: I think we know performance gets worse as the number of files increases, but I don't think there are any known hard limits. My first guess in general for errors would be timeouts or memory/temp space issues, i.e. it takes Dataverse too long to generate and send the JSON for 75K files and the connection gets closed. Other than the usual checks (looking in the logs and at server load, and looking in the browser dev console or going verbose with curl to see the specific status code and responder, i.e. for timeouts you can see whether it is a load balancer, Apache, etc. that timed out), I'm not sure what else to suggest. (It is surprising that a 500 error can occur without any log info, though.)
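As a concrete example of the curl check, something along these lines shows the status code, the timing, and (in the response headers) a hint about which layer answered; the URL and persistent ID are placeholders:

```bash
# -v prints request/response headers (e.g. Server:), which can hint at whether
# a load balancer, Apache, or the app server produced the error; -w adds timing.
curl -v -o /dev/null \
  -w 'status=%{http_code} total_time=%{time_total}s\n' \
  "https://your.dataverse.example/api/datasets/:persistentId/?persistentId=doi:10.5072/FK2/EXAMPLE"
```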

@linsherpa
Author

linsherpa commented Aug 24, 2022

Thank you @pdurbin and @qqmyers for your valuable suggestions :)

I increased the Timeout parameter of Apache in the config file and now I can see the dataset with all the files (although it takes some time).
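For anyone hitting the same problem, the change was along these lines (the value and file location are illustrative and depend on the installation; a proxied setup may also need ProxyTimeout or a timeout on the ProxyPass line):

```apache
# httpd.conf or the relevant vhost config (illustrative values only)
Timeout 600
ProxyTimeout 600
```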

@mreekie mreekie moved this to ▶SPRINT- NEEDS SIZING in IQSS Dataverse Project Feb 13, 2023
@mreekie mreekie added the D: Dataset: large number of files https://github.com/IQSS/dataverse-pm/issues/27 label Feb 13, 2023
@pdurbin
Member

pdurbin commented Feb 14, 2023

We're talking about this issue in tech hours. Here are some pain points for users:

  • Editing the title (or another piece of metadata) when there are many files (30,000). The save is prohibitively expensive. Affects depositors. Maybe removing the cascade will help. There are two cascades. We could write tests with the 1000files.zip file vs. 1 file: how long does it take to edit the title? (See the sketch after this list.)
  • Slow indexing of a dataset with 30,000 files. Affects sysadmins.
  • Only 20 files but many versions. Slow to make the next version. Multiplying effect. Affects depositors. Reindexing is slow?
  • ...
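A rough sketch of that title-edit measurement, assuming the native editMetadata API and placeholder IDs (the JSON shape follows the edit-metadata examples in the guides):

```bash
# Placeholders: substitute a real API token, server, and dataset persistent ID.
API_TOKEN=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
SERVER_URL=http://localhost:8080
PERSISTENT_ID=doi:10.5072/FK2/EXAMPLE

# A one-field metadata update that only changes the title.
cat > title.json <<'EOF'
{"fields": [{"typeName": "title", "value": "Timing test title"}]}
EOF

# Time the save; run against a 1-file dataset and a 1000files.zip dataset and compare.
time curl -H "X-Dataverse-key:$API_TOKEN" -X PUT \
  --upload-file title.json \
  "$SERVER_URL/api/datasets/:persistentId/editMetadata?persistentId=$PERSISTENT_ID&replace=true"
```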

Other discussion:

  • Does the new zip previewer/downloader help?
  • Creating a large JSON for the tree view?
  • Let's benchmark and measure perf.
  • Check open file handles.

@linsherpa
Author

  • The zip previewer could help to some extent.
  • For those who want to process the large files, uploading a large JSON file of the dataset could be of help (which we are currently also doing).

Other Point

  • Accessing a file in a dataset is not a problem; this can be done via the REST interface, provided its file ID is known.
    However, opening/clicking on a dataset containing a large number of files, where it takes a long time to sort all its files and display them to the user, is the bottleneck.
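For example, a single file can be fetched directly by its database ID, roughly like this (the server and file ID are placeholders):

```bash
# Download one file via the data access API; -O -J saves it under the
# filename sent by the server.
curl -O -J "https://your.dataverse.example/api/access/datafile/123456"
```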
