Large DataFrame to BigQuery is taking a lot of time or crashing the service #731
I'm looking into this now. It looks to me like a push of the 100,000-row dataframe should take about 30 minutes at the rate it is happening, and you can scale that linearly for the others. Did you actually wait that long?
One of the performance captures was for a dataset with 193 127 rows and 120 columns, which gives me an ingestion rate (from DataFrame) of 57 802 data points per second, or 401 rows per second (many of my columns are empty). Because of the bad performance and the crashes, I moved to doing the transformation in a DataFrame, saving it as a CSV, and loading the CSV into BigQuery. Now the performance is a little more acceptable; at least I'm able to process all my files. This is an example of the data-points-per-second performance of the whole process (loading, parsing, saving, and BigQuery ingestion of the CSV file):

[Figure: global performance benchmark, data points per second]

I also have a performance issue reading data back from BigQuery into a DataFrame for advanced editing; I haven't opened an issue for that yet. For now, while BigQuery to/from DataFrame is not faster for large tables, I'm storing DataFrames as MessagePack in a Google Cloud Storage bucket. I really hope the performance will be better one day; my feeling is that the problem is a JSON serialization issue. When I'm doing groupby and other data filters, the performance is more acceptable.
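In outline, the workaround amounts to something like the sketch below (pandas only; the file names are invented, and the upload of the outputs to the GCS bucket, e.g. with gsutil, is omitted):

```python
import pandas as pd

# Hypothetical input file; the real pipeline loads and cleans many such files.
df = pd.read_csv('raw_input.csv')

# ... transformations (groupby, filters, column clean-up) happen here ...

# Save the cleaned result as CSV for the BigQuery bulk load ...
df.to_csv('clean_output.csv', index=False)

# ... and as MessagePack for fast round-tripping of the DataFrame itself.
# (pandas of that era supported to_msgpack/read_msgpack; it has since been removed.)
df.to_msgpack('clean_output.msg')
```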
Inserting large amounts of data with Table.insert_data is not very optimal, I agree. The underlying APIs being used are meant for real-time streaming data with some volume constraints, not for bulk uploads. For the latter you really want to get the data into GCS first (outside Datalab), and then in Datalab you can use Table.load or query directly from GCS using a FederatedTable.
In the gauge plot, CleanCSV2BigQuery is using table.load (a CSV file from GCS). Is this the optimal way to do a fast bulk load into BigQuery? I'm using code like this:
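Roughly along these lines (a sketch only, assuming the datalab.bigquery API of that period; older releases used gcp.bigquery, and the dataset, bucket, and option names below are illustrative assumptions, not the original snippet):

```python
import datalab.bigquery as bq

# Hypothetical dataset, table, and bucket names.
table = bq.Table('mydataset.clean_events')

# Start a BigQuery load job from a CSV already staged in GCS. BigQuery does
# the bulk ingestion server-side, so Datalab only issues the initial request.
job = table.load('gs://my-bucket/clean/events.csv',
                 mode='append',
                 source_format='csv',
                 csv_options=bq.CSVOptions(skip_leading_rows=1))

# Job handling below is also an assumption about the Datalab job API.
job.wait()
if job.failed:
    print(job.fatal_error)
```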
The only reason I'm using Datalab in this very specific use case is batch task orchestration and performance monitoring. The BigQuery ingestion rate is 586 271 data points per second, or ~5 000 rows per second, as most of my files have more than 100 columns.
I believe Table.load is the best. At least from the Datalab side, it requires just a single REST call to initiate, unlike insert_data, where Datalab has to push the data itself in multiple POSTs.
Is it possible from Datalab to create a federated table in BigQuery?
You can reference an existing file in GCS by creating a FederatedTable.
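A rough sketch of that approach, again assuming the Datalab-era API (FederatedTable.from_storage, CSVOptions, the schema construction from a list of dicts, and the data_sources argument are best guesses at the names; the bucket, schema, and query are invented for illustration):

```python
import datalab.bigquery as bq

# Minimal schema for the CSV; federated CSV tables generally need an
# explicit schema (schema construction from dicts is assumed here).
schema = bq.Schema([{'name': 'user_id', 'type': 'STRING'},
                    {'name': 'value', 'type': 'FLOAT'}])

# Define a federated (external) table that reads the CSV in place from GCS,
# without loading it into BigQuery storage first.
events = bq.FederatedTable.from_storage(
    'gs://my-bucket/raw/events.csv',
    source_format='csv',
    csv_options=bq.CSVOptions(skip_leading_rows=1),
    schema=schema)

# Query it like a normal table by passing it in as a named data source.
q = bq.Query('SELECT user_id, SUM(value) AS total FROM events GROUP BY user_id',
             data_sources={'events': events})
df = q.results().to_dataframe()
```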
Bear in mind that when you use a federated table you lose the benefits of a table loaded natively into BigQuery: the data is read from GCS at query time rather than from BigQuery's optimized storage, so queries can be slower.
Moved to #18
See this issue for a more detailed discussion and a workaround via exporting to GCS as CSV/JSON and then reading it into a dataframe. It covers why the current approach is slow (it is not a recommended BigQuery API for large amounts of data) and includes some benchmarks.
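In rough outline, that workaround looks something like the sketch below (Table.extract and the wildcard export path are assumptions about the Datalab-era API; the table and bucket names are invented):

```python
import glob
import pandas as pd
import datalab.bigquery as bq

# Export the table to sharded CSV files in GCS; the extract job runs
# server-side, so it avoids streaming rows through the tabledata API.
job = bq.Table('mydataset.big_table').extract(
    'gs://my-bucket/export/big_table-*.csv', format='csv')
job.wait()

# Copy the exported shards locally (e.g. from a notebook cell):
#   !gsutil cp gs://my-bucket/export/big_table-*.csv /tmp/
# then parse them with pandas instead of the BigQuery-to-DataFrame path.
df = pd.concat(pd.read_csv(f) for f in glob.glob('/tmp/big_table-*.csv'))
```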
Gist provided a few days ago: https://gist.github.com/VelizarVESSELINOV/7b2e65cae44f9c1f7444