slow download when media is flagged #299
Comments
Hi @lrossi79. By download speed, do you mean the time you spend waiting on the analysis panel before the download actually starts, or the transfer rate in kb/s?
No, I mean the actual download speed in terms of kb/s. We looked into the issue a little, and it seems that the download starts while the db is still joining the tables (so before the file is fully generated). Does this make any sense?
No, that doesn't make sense. The MySQL queries you see may be unrelated, or part of an 'orphaned' query, which happens when you start some analysis but close the browser tab (the query keeps running in the background). Just to be sure, you could restart MySQL to get rid of these queries and try again. But the CSV file will only be made available to the user once the script has finished, so TCAT should never affect download speed. The CSV is just a static file sitting in the analysis/cache directory, which is served directly by Apache. Are you sure this isn't a situation where something (an ISP or hosting provider) is throttling your bandwidth after a certain number of megabytes has been downloaded?
Hi,
Hi @lrossi79. You're correct about the export function returning immediately; I forgot, so it makes sense that the query is still running. As you are monitoring the MySQL server, can you send us the SQL statement with the JOIN that is taking a very long time? Best, Emile
Hi. I'm looking at the same TCAT instance as @lrossi79. Exporting from a bin with 1.2M selected tweets, this I believe is the SQL query without any of the additional mentions, links or media columns selected:
That causes no issues. It takes some seconds to get collected, and the 667MB file downloads at good speed. A problematic export, with the media columns included, has the same query:
But it is grinding through a very large number of these, at a rate of approximately 10 per second.
These are generated by I cancelled the query after some thousands. The For comparison, the following, conceptually similar SQL query outside of TCAT did not complete in 2.5 hours on our setup.
For the one bin I was investigating, we didn't have an index for the column So I investigated schema and indices, and added an index with
I believe this solved, or at least considerably improved, the issue with TCAT exports that @lrossi79 reports. Please observe that each of the bins needs to be re-indexed (let's talk about it). I am currently on a slow network, but the download speed of a TCAT export with media improved by at least three orders of magnitude after adding the index for Please note that all the times I report are from naïve runs, not using Monte Carlo methods for estimates, and without controlling for other processes competing for computing resources.
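The effect described above, a join that crawls until the joined column is indexed, can be reproduced in miniature. The following is only a sketch under loud assumptions: it uses SQLite through Python's `sqlite3` module rather than MySQL, and the table names `tweets` and `media`, the column `tweet_id`, and the index name `idx_media_tweet_id` are hypothetical stand-ins, not TCAT's actual schema. It prints the query plan before and after the index is created, which is roughly what `EXPLAIN` would show on the MySQL side.

```python
import sqlite3

# Hypothetical schema: "tweets", "media" and "tweet_id" are illustrative names,
# not TCAT's actual tables or columns.
QUERY = (
    "SELECT t.id, m.url "
    "FROM tweets AS t LEFT JOIN media AS m ON m.tweet_id = t.id"
)


def build_db(n_tweets=2000, media_every=4):
    """Create an in-memory database with one media row per few tweets."""
    conn = sqlite3.connect(":memory:")
    # SQLite may build a temporary index on its own; switch that off so the
    # unindexed plan shows the plain table scan.
    conn.execute("PRAGMA automatic_index = OFF")
    conn.execute("CREATE TABLE tweets (id INTEGER PRIMARY KEY, text TEXT)")
    conn.execute(
        "CREATE TABLE media (id INTEGER PRIMARY KEY, tweet_id INTEGER, url TEXT)"
    )
    conn.executemany(
        "INSERT INTO tweets (id, text) VALUES (?, ?)",
        [(i, f"tweet {i}") for i in range(n_tweets)],
    )
    conn.executemany(
        "INSERT INTO media (tweet_id, url) VALUES (?, ?)",
        [(i, f"https://example.org/{i}.jpg") for i in range(0, n_tweets, media_every)],
    )
    conn.commit()
    return conn


def plan(conn):
    """Return the planner's description of how QUERY would be executed."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY).fetchall()
    # The human-readable detail is the fourth column of each plan row.
    return "; ".join(row[3] for row in rows)


def add_index(conn):
    """Index the join column so each tweet's media rows are found by lookup."""
    conn.execute("CREATE INDEX idx_media_tweet_id ON media (tweet_id)")


if __name__ == "__main__":
    conn = build_db()
    print("before:", plan(conn))
    add_index(conn)
    print("after: ", plan(conn))
```

Without the index, every tweet row forces a scan of the whole media table, so the cost grows with the product of the two table sizes. With it, each tweet costs one index lookup. The exports return the same rows either way; only the time to produce them changes.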
…ns. Fixes digitalmethodsinitiative#299, for new bins.
Hi @xmacex. Thank you for the excellent analysis! I'll study this shortly. As for adding indexes, the procedure should go via One question: did you benchmark the original mod_export script (without the JOIN syntax) after you added the index?
Hi @dentoir. Without the index, the queries of this particular TCAT instance, for this particular bin with ~1.2M items and ~500000 media items the Again the same caveat, I'm doing this without a whole lot of scientific measuring or fancy PHP profiling – I am basically just setting |
Hello,
I've looked around and I can't really find anything about this. The issue is rather strange and I'm wondering if it's a configuration issue. I have a reasonably large archive (almost 2M tweets) that I need to export. I need the URLs to be resolved, but when I flag "media" to have the resolved URL in the exported file, the download speed is incredibly slow (which makes it impossible to download a 1 GB file). The download speed is normal if "media" is not selected.
I really don't understand this, but maybe it is some kind of known issue.
Thank you!