Large DataFrame to BigQuery is taking a lot of time or crashing the service #731
I'm looking into this now. It looks to me like a push of the 100,000-row dataframe should take about 30 minutes at the rate it is happening, and you can scale that linearly for the others. Did you actually wait that long?
One of the performance captures was for a dataset with 193 127 rows and 120 columns, which gives me an ingestion rate (from DataFrame) of 57 802 data points per second, or 401 rows per second (many of my columns are empty). Because of the bad performance and the crashes, I moved to doing the transformation in a DataFrame, saving it as a CSV, and loading the CSV into BigQuery. Now the performance is a little more acceptable; at least I'm able to process all my files. This is an example of the data-points-per-second performance of the whole process (loading, parsing, saving, and BigQuery ingestion of the CSV file):

[Figure: global performance benchmark, data points per second]

I also have a performance issue reading data back from BigQuery into a DataFrame for advanced editing; I haven't opened an issue for that yet. For now, while BigQuery to/from DataFrame is not faster for large tables, I'm storing DataFrames as MessagePack in a Google Cloud Storage bucket. I really hope the performance will be better one day; my feeling is that the problem is a JSON serialization issue. When I'm doing groupby and other data filters, the performance is more acceptable.
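In outline, the workaround amounts to something like the sketch below (pandas only; the file names are invented, and the upload of the outputs to the GCS bucket, e.g. with gsutil, is omitted):

```python
import pandas as pd

# Hypothetical input file; the real pipeline loads and cleans many such files.
df = pd.read_csv('raw_input.csv')

# ... transformations (groupby, filters, column clean-up) happen here ...

# Save the cleaned result as CSV for the BigQuery bulk load ...
df.to_csv('clean_output.csv', index=False)

# ... and as MessagePack for fast round-tripping of the DataFrame itself.
# (pandas of that era supported to_msgpack/read_msgpack; it has since been removed.)
df.to_msgpack('clean_output.msg')
```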
Inserting large amounts of data with Table.insert_data is not very optimal, I agree. The underlying APIs being used are meant for real-time streaming data with some volume constraints, not for bulk uploads. For the latter you really want to get the data into GCS first (outside Datalab), and then in Datalab you can use Table.load or query directly from GCS using a FederatedTable.
In the gauge plot, CleanCSV2BigQuery is using table.load (a CSV file from GCS). Is this the optimal way to do a fast bulk load into BigQuery? I'm using code like this:
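Roughly along these lines (a sketch only, assuming the datalab.bigquery API of that period; older releases used gcp.bigquery, and the dataset, bucket, and option names below are illustrative assumptions, not the original snippet):

```python
import datalab.bigquery as bq

# Hypothetical dataset, table, and bucket names.
table = bq.Table('mydataset.clean_events')

# Start a BigQuery load job from a CSV already staged in GCS. BigQuery does
# the bulk ingestion server-side, so Datalab only issues the initial request.
job = table.load('gs://my-bucket/clean/events.csv',
                 mode='append',
                 source_format='csv',
                 csv_options=bq.CSVOptions(skip_leading_rows=1))

# Job handling below is also an assumption about the Datalab job API.
job.wait()
if job.failed:
    print(job.fatal_error)
```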
The only reason I'm using Datalab in this very specific use case is batch task orchestration and performance monitoring. The BigQuery ingestion rate is 586 271 data points per second, or ~5 000 rows per second, as most of my files have more than 100 columns.
I believe Table.load is the best. At least from the Datalab side, it requires just a single REST call to initiate, unlike insert_data, where Datalab has to push the data itself in multiple POSTs.
Is it possible from Datalab to create a federated table in BigQuery?
You can reference an existing file in GCS by creating a FederatedTable.
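A rough sketch of that approach, again assuming the Datalab-era API (FederatedTable.from_storage, CSVOptions, the schema construction from a list of dicts, and the data_sources argument are best guesses at the names; the bucket, schema, and query are invented for illustration):

```python
import datalab.bigquery as bq

# Minimal schema for the CSV; federated CSV tables generally need an
# explicit schema (schema construction from dicts is assumed here).
schema = bq.Schema([{'name': 'user_id', 'type': 'STRING'},
                    {'name': 'value', 'type': 'FLOAT'}])

# Define a federated (external) table that reads the CSV in place from GCS,
# without loading it into BigQuery storage first.
events = bq.FederatedTable.from_storage(
    'gs://my-bucket/raw/events.csv',
    source_format='csv',
    csv_options=bq.CSVOptions(skip_leading_rows=1),
    schema=schema)

# Query it like a normal table by passing it in as a named data source.
q = bq.Query('SELECT user_id, SUM(value) AS total FROM events GROUP BY user_id',
             data_sources={'events': events})
df = q.results().to_dataframe()
```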
Bear in mind that when you use a federated table you lose the benefits of a table loaded natively into BigQuery: the data is read from GCS at query time rather than from BigQuery's optimized storage, so queries can be slower.
Moved to #18
See this issue for a more detailed discussion and a workaround via exporting to GCS as CSV/JSON and then reading it into a dataframe. It covers why the current approach is slow (it is not a recommended BigQuery API for large amounts of data) and includes some benchmarks.
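In rough outline, that workaround looks something like the sketch below (Table.extract and the wildcard export path are assumptions about the Datalab-era API; the table and bucket names are invented):

```python
import glob
import pandas as pd
import datalab.bigquery as bq

# Export the table to sharded CSV files in GCS; the extract job runs
# server-side, so it avoids streaming rows through the tabledata API.
job = bq.Table('mydataset.big_table').extract(
    'gs://my-bucket/export/big_table-*.csv', format='csv')
job.wait()

# Copy the exported shards locally (e.g. from a notebook cell):
#   !gsutil cp gs://my-bucket/export/big_table-*.csv /tmp/
# then parse them with pandas instead of the BigQuery-to-DataFrame path.
df = pd.concat(pd.read_csv(f) for f in glob.glob('/tmp/big_table-*.csv'))
```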
Gist provided a few days ago: https://gist.github.com/VelizarVESSELINOV/7b2e65cae44f9c1f7444