This repository was archived by the owner on Sep 3, 2022. It is now read-only.

Large DataFrame to BigQuery is taking a lot of time or crashing the service #731

Closed
VelizarVESSELINOV opened this issue Dec 18, 2015 · 11 comments

Comments

@VelizarVESSELINOV

Gist provided a few days ago: https://gist.github.com/VelizarVESSELINOV/7b2e65cae44f9c1f7444

@VelizarVESSELINOV VelizarVESSELINOV changed the title Large DataFrame to BigQuery is taking a lot of time or crash the service Large DataFrame to BigQuery is taking a lot of time or crashing the service Dec 18, 2015
@gramster
Contributor

gramster commented Jan 4, 2016

I'm looking into this now. It looks to me like a push of the 100,000-row DataFrame should take about 30 minutes at the rate it is happening, and you can scale that linearly for the others. Did you actually wait that long?

@VelizarVESSELINOV
Author

One of the performance captures was for a dataset with 193 127 rows and 120 columns:

  1. Elapsed time loading the CSV file on Datalab (s): 1.18227410316
  2. Elapsed time parsing the CSV and converting it to a DataFrame (s): 3.82195401192
  3. Elapsed time moving the data from the DataFrame to BigQuery (s): 400.9408288

That gives me an ingestion rate (from the DataFrame) of 57 802 data points per second, or 401 rows per second (as many of my columns are empty).
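For reference, the data-point rate above can be reproduced from the listed figures with a quick bit of arithmetic:

```python
# Reproduce the quoted ingestion rate from the figures above
rows, cols, seconds = 193127, 120, 400.9408288
print(rows * cols / seconds)  # ~57 800 data points per second
```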

Because of the bad performance and crashes, I moved to doing the transformation in a DataFrame, saving it as a CSV, and loading the CSV into BigQuery. Now the performance is a little more acceptable; at least I'm able to process all my files. Here is an example of the performance in data points per second for each processing step (loading, parsing, saving, and BigQuery ingestion of the CSV file):

[Chart: global performance benchmark in data points per second (csv_df_csv_bq perf)]

I also got a performance issue when reading data from BigQuery and accessing it in a DataFrame for advanced editing; I haven't opened an issue ticket for that yet. For now (while BigQuery from/to DataFrame is not faster for large tables), I'm using DataFrames stored as MessagePack in a Google Cloud Storage bucket.

I really hope the performance will be better one day; my feeling is that it is due to a JSON serialization issue. When I'm doing the groupby and other data filters, the performance is more acceptable.

@gramster
Contributor

gramster commented Jan 4, 2016

Inserting large amounts of data with Table.insert_data is not very efficient, I agree. The underlying APIs are meant for real-time streaming of data with some volume constraints, not for bulk uploads. For the latter you really want to get the data into GCS first (outside Datalab); then in Datalab you can use Table.load, or query directly from GCS using a FederatedTable.
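As an illustration of the pattern being suggested, here is a minimal sketch against the old gcp.bigquery Datalab wrapper discussed in this thread; the table name, the GCS path, and the exact bq.Table constructor form are assumptions, so adjust them for your environment:

```python
import gcp.bigquery as bq  # old Datalab BigQuery wrapper referenced in this thread

# Streaming path (what Table.insert_data uses): fine for small, real-time batches,
# but slow for bulk uploads of ~200k-row DataFrames.
table = bq.Table('mydataset.mytable')   # hypothetical dataset.table name
# table.insert_data(df)                 # assumed signature; avoid for bulk loads

# Bulk path: copy the CSV to GCS first (outside Datalab), then run one load job.
table.load('gs://my-bucket/my-data.csv',   # hypothetical GCS path
           mode='overwrite',
           source_format='csv',
           csv_options=bq.CSVOptions(skip_leading_rows=1))
```

The key difference is that the heavy data movement happens as a single BigQuery load job from GCS rather than as many streaming inserts.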

@VelizarVESSELINOV
Author

In the gauge plot, CleanCSV2BigQuery uses table.load() with a CSV file from GCS.

Is this the optimal way for fast bulk load in BigQuery?

I'm using code like this:

import gcp.bigquery as bq  # assumed import, matching the bq.CSVOptions reference below

table.load(google_bucket_csv.uri, mode='overwrite', source_format='csv',
           csv_options=bq.CSVOptions(skip_leading_rows=1))

The only reason I'm using Datalab in this very specific use case is for batch task orchestration and performance monitoring.

The BigQuery ingestion rate is 586 271 data points per second, or ~5 000 rows per second, as most of my files have more than 100 columns.

@gramster
Contributor

gramster commented Jan 5, 2016 via email

@VelizarVESSELINOV
Author

Is it possible from Datalab to create a federated table in BigQuery?
I can see how to do it using the UI (with more than 100 columns this is not feasible) or using the command-line tool, which is not practical as I have a big loop with a lot of activity before and after loading.
https://cloud.google.com/bigquery/federated-data-sources

@gramster
Contributor

gramster commented Jan 8, 2016

You can reference an existing file in GCS by creating a gcp.datalab.FederatedTable instance.

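For completeness, a sketch of the federated-table route along the lines of the "Using External Tables from BigQuery" tutorial referenced below; the FederatedTable.from_storage call, the Schema construction, and the data_sources argument to bq.Query are recalled from the old gcp.bigquery API, so treat the exact signatures as assumptions:

```python
import gcp.bigquery as bq

# Describe the CSV that already sits in GCS; with 100+ columns, building the schema
# in code is easier than the UI. Column names/types here are illustrative placeholders.
options = bq.CSVOptions(skip_leading_rows=1)
schema = bq.Schema([{'name': 'col_a', 'type': 'STRING'},
                    {'name': 'col_b', 'type': 'FLOAT'}])

ext = bq.FederatedTable.from_storage('gs://my-bucket/my-data.csv',  # hypothetical path
                                     csv_options=options,
                                     schema=schema)

# Query the GCS file in place, without loading it into a native BigQuery table.
results = bq.Query('SELECT col_a, col_b FROM ext_data',
                   data_sources={'ext_data': ext}).results()
```

As the next comment notes, this avoids the load step but gives up the columnar pricing and performance benefits, so it suits small tables or one-off queries.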

@gramster
Contributor

gramster commented Jan 8, 2016

Bear in mind that when you use a federated table you lose the benefit of a column-oriented store; BQ has to read all the columns. So querying that way can be quite a bit more expensive; you usually want to use it for small tables or one-off queries against most/all columns.

On Fri, Jan 8, 2016 at 3:04 PM, Velizar VESSELINOV wrote:

> Thanks, I just saw the example at
> https://github.com/GoogleCloudPlatform/datalab/blob/1a251cbf731d4a96685b45bf0a1ed6485f92db07/content/datalab/tutorials/BigQuery/Using%20External%20Tables%20from%20BigQuery.ipynb


@gramster
Contributor

gramster commented May 19, 2016

Moved to #18

@Di-Ku
Contributor

Di-Ku commented Apr 4, 2017

See this issue for a more detailed discussion and a workaround via exporting to GCS as CSV/JSON and then reading into a DataFrame. It covers why the current way is slow (not a recommended BigQuery API for large amounts of data) and includes some benchmarks.
