Diagnose, benchmark and provide guidance for loading large dataframe from BigQuery #329
I've also noticed that loading a large DataFrame from BigQuery is slow. My machine is outside of GCE, so I understand that there will be significant delays in downloading data. Still, my feeling (which I still need to confirm) is that exporting from BigQuery to GCS and downloading the file is quicker than loading the DataFrame directly.
Some initial profiling data:

Using
Using
Using

Time difference => more than 10 minutes! I'm using the legacy interface. I used the following code for profiling:
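(The original profiling snippet was not preserved here; below is a minimal sketch of the kind of wall-clock timing harness described, assuming the legacy datalab.bigquery interface. The query string is a placeholder and the exact call chain in the original comment may have differed.)

```python
# Minimal timing sketch (a reconstruction, not the original code).
# The datalab.bigquery import and Query(...).to_dataframe() call are
# assumptions about the legacy interface; the SQL is a placeholder.
import time

import datalab.bigquery as bq

SQL = 'SELECT * FROM [my-project:my_dataset.my_table] LIMIT 100000'

start = time.time()
df = bq.Query(SQL).to_dataframe()
elapsed = time.time() - start

print('Loaded %d rows in %.1f seconds' % (len(df), elapsed))
```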
Thanks for the investigation, @parthea; these numbers are helpful. Can you add more information on your experiments? Particularly:
I think using the new interface is also important. Thanks for the investigation.
I reverted my local changes because I didn't cater for _DEFAULT_PAGE_SIZE. I repeated the tests with the following code and the latest pydatalab at master, using my local setup (not in GCE). I also removed the LIMIT clause in my SQL query to test a larger size (1,667,267 rows).
Output:
Output:
Now using the latest master with the changes from PR #220
Output:
There is a significant performance improvement if we increase the page size to 100,000. Note that exporting data from BigQuery to GCS and then downloading it still seems to be faster on my local setup outside of GCE.
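For readers following along, here is a hedged sketch of what increasing the page size amounts to. It is written against the current google-cloud-bigquery client rather than the pydatalab internals discussed in this thread, and the project, dataset, and table names are placeholders.

```python
# Sketch: fetch a table with a larger page size so fewer tabledata.list
# requests are needed. Uses google-cloud-bigquery for illustration only;
# the pydatalab code discussed here wraps the same underlying API.
from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table('my-project.my_dataset.my_table')  # placeholder

# page_size controls how many rows each tabledata.list request returns,
# so a larger value means fewer HTTP round trips for the same table.
rows = client.list_rows(table, page_size=100000)
df = rows.to_dataframe()
print(df.shape)
```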
Thanks for the analysis. Here are my thoughts:
One more thing: it's probably better to just analyze the runtime for the
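(The specific call referred to above was lost in this copy of the thread. As a generic illustration, one way to analyze the runtime of a single call in isolation is cProfile; load_dataframe below is a stand-in for whichever call is being measured.)

```python
# Generic sketch: profile one call in isolation with cProfile.
# load_dataframe() is a placeholder for the call under test.
import cProfile
import pstats


def load_dataframe():
    # Stand-in for the table-to-DataFrame step being measured.
    pass


profiler = cProfile.Profile()
profiler.enable()
load_dataframe()
profiler.disable()

pstats.Stats(profiler).sort_stats('cumulative').print_stats(20)
```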
Here's a bit of my analysis to complete the picture. The numbers were all run on a GCE n1-standard-1 (1 vCPU, 3.75 GB memory) VM.

GCS Method (baseline):
Step 2: Downloading, using
Step 3: Import to dataframe, using
Total:

Datalab Method:
Page size: 100,000
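A rough sketch of the GCS-method steps above (export from BigQuery to GCS, download, read into pandas) might look like the following. It uses the current google-cloud-bigquery and google-cloud-storage clients rather than the exact tooling behind the numbers above; the bucket, table, and file names are placeholders, and it ignores multi-shard exports.

```python
# Sketch of the "GCS Method" baseline, with placeholder names.
import pandas as pd
from google.cloud import bigquery, storage

bq_client = bigquery.Client()
gcs_client = storage.Client()

table_id = 'my-project.my_dataset.my_table'    # placeholder
destination_uri = 'gs://my-bucket/export.csv'  # placeholder

# Step 1: export the table from BigQuery to GCS (CSV by default).
bq_client.extract_table(table_id, destination_uri).result()

# Step 2: download the exported file.
gcs_client.bucket('my-bucket').blob('export.csv').download_to_filename('/tmp/export.csv')

# Step 3: import into a DataFrame.
df = pd.read_csv('/tmp/export.csv')
print(df.shape)
```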
It's worth noting that the average page fetch time increases as the number of pages increases, which is interesting; maybe we need to dig deeper into this. Other than that, the numbers seem reasonable and on par with the discussion above. The low-hanging fruit here is to increase the page size, which should give us an easy 2x speedup. The pending PR and its discussion are relevant here. For the longer term, we should look into avoiding the tabledata.list API altogether.
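To check the observation that average page fetch time grows with the number of pages, per-page timing could be instrumented roughly as below (again using the current google-cloud-bigquery client for illustration, with placeholder names).

```python
# Sketch: time each page fetch individually to see whether per-page
# fetch time grows as more pages are consumed. Placeholder table name.
import time

from google.cloud import bigquery

client = bigquery.Client()
rows = client.list_rows('my-project.my_dataset.my_table', page_size=100000)

pages = iter(rows.pages)
page_times = []
while True:
    start = time.time()
    try:
        page = next(pages)  # each next() issues one tabledata.list request
    except StopIteration:
        break
    page_times.append(time.time() - start)
    print('page %d: %d rows in %.2fs'
          % (len(page_times), page.num_items, page_times[-1]))

if page_times:
    print('average page fetch: %.2fs' % (sum(page_times) / len(page_times)))
```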
Nice analysis! It was very easy to follow along. It's interesting that the BigQuery-to-GCS method is only beneficial with ~150,000+ rows (depending on the number of columns). Thanks for providing clarification about how the API requests are made. It's very helpful!
I apologize for the omission on my part. I found that the run time for
Yes, this is surprising. Do you think it would be helpful to profile memory usage to look for a memory leak? I would guess that memory usage should be consistent across requests with the same page_size.
Would this be something similar to the BigQuery-to-GCS method? If not, can you share some information about the available options?
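On the memory-profiling question above, a lightweight starting point would be the standard-library tracemalloc module, e.g. something like the sketch below (placeholder table name, current google-cloud-bigquery client; note it only tracks Python-level allocations).

```python
# Sketch: watch Python memory usage across page fetches to see whether
# it stays roughly constant for a fixed page_size. Placeholder names.
import tracemalloc

from google.cloud import bigquery

client = bigquery.Client()
rows = client.list_rows('my-project.my_dataset.my_table', page_size=100000)

tracemalloc.start()
for i, page in enumerate(rows.pages, start=1):
    list(page)  # materialize the rows in this page
    current, peak = tracemalloc.get_traced_memory()
    print('after page %d: current=%.1f MB, peak=%.1f MB'
          % (i, current / 1e6, peak / 1e6))
tracemalloc.stop()
```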
Thanks, Anthoniou and Yasser, for the detailed analysis and diagnosis.
One thing pointed out by @craigcitro is worth investigating: we use
@parthea sorry I missed your last question.
I was thinking something like a
Here's an updated graph using the changes discussed above. Opened #339 for these fixes.
Customer query (via Tahir F.)
Do we have any kind of benchmarks or recommendations for the GCE setup relative to the amount of data that would be brought into a pandas DataFrame?
His question is as follows:
Could you advise me on an appropriate GCE machine spec?
We upgraded the GCE spec, but it seems to use only 2% of the CPU and still takes 5 minutes to handle 500,000 rows of data in pandas.
Do you have any ideas for improving the performance of Datalab?
Is this related to a network or disk issue?