
Use JSONL to upload to BigQuery #86

Open · wants to merge 8 commits into base: jsonl

Conversation

judahrand (Contributor) commented Feb 21, 2022:

Problem


Proposed changes

  • Use JSONL rather than Avro to reduce computation and simplify the code.
  • Dump the JSONL to GCS and let BigQuery work out how to load the data efficiently (see the sketch below).
  • The only breaking change I'm aware of is that you will now need to define a gcs_bucket in the config.

#85
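As a rough illustration of the flow described above (the helper and object names here are hypothetical, not code from this PR): serialize each batch as newline-delimited JSON, stage it in the configured gcs_bucket, and let BigQuery load it from there.

```python
import io
import json

from google.cloud import bigquery, storage

def stage_and_load(records, project_id, gcs_bucket, table_id):
    # One JSON document per line: this is all "JSONL" means.
    data = io.BytesIO(
        "\n".join(json.dumps(r) for r in records).encode("utf-8")
    )

    # Stage the batch in GCS.
    blob = storage.Client(project=project_id).bucket(gcs_bucket).blob("stage/batch.jsonl")
    blob.upload_from_file(data, rewind=True)

    # Hand the parsing and loading off to BigQuery.
    job = bigquery.Client(project=project_id).load_table_from_uri(
        f"gs://{gcs_bucket}/{blob.name}",
        table_id,  # e.g. "project.dataset.table"
        job_config=bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
        ),
    )
    try:
        job.result()   # Wait for the load to finish.
    finally:
        blob.delete()  # Always clean up the staged file.
```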

Types of changes

What types of changes does your code introduce to target-bigquery?
Put an x in the boxes that apply

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation Update (if none of the other choices apply)

Checklist

  • Description above provides context of the change
  • I have added tests that prove my fix is effective or that my feature works
  • Unit tests for changes (not needed for documentation changes)
  • CI checks pass with my changes
  • Bumping version in setup.py is an individual PR and not mixed with feature or bugfix PRs
  • Commits follow "How to write a good git commit message"
  • Relevant documentation is updated including usage instructions

judahrand changed the title from "Upstream/jsonl" to "Use JSONL to upload to BigQuery" (Feb 21, 2022)
judahrand (Contributor, Author):

@jmriego This passes the integration tests on our fork, so I assume it will here too.

judahrand (Contributor, Author) commented Feb 21, 2022:

The integration tests also seem to run faster using this implementation. Obviously there's some noise, but I've seen around 24 minutes for the Avro implementation and around 19 minutes for this JSONL + GCS implementation.

jmriego (Owner) commented Feb 22, 2022:

That's really good! Do you mind giving me some time to test this in my environment? We have quite a few weird schemas and datasets we can use for testing.

README.md

@@ -50,7 +50,9 @@ Full list of options in `config.json`:

| Property   | Type   | Required? | Description                                        |
| ---------- | ------ | --------- | -------------------------------------------------- |
| project_id | String | Yes       | BigQuery project                                   |
| gcs_bucket | String | Yes       | Google Cloud Storage Bucket to use to stage files  |
jmriego (Owner):

Can we make this optional instead? I'd prefer not to force people to use GCS, since I know not everyone uses it, and it's a breaking change. It should be OK to upload to and read from GCS when this parameter is provided, and to use load_table_from_file when it isn't.
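For the sake of discussion, a minimal sketch of that fallback (the helper name and blob path are illustrative, not code from this PR):

```python
from google.cloud import bigquery, storage

def load_jsonl(client, file_obj, table_id, gcs_bucket=None):
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
    )
    if gcs_bucket:
        # Stage in GCS, load from the URI, and delete the staged object afterwards.
        blob = storage.Client().bucket(gcs_bucket).blob("stage/batch.jsonl")
        blob.upload_from_file(file_obj, rewind=True)
        try:
            client.load_table_from_uri(
                f"gs://{gcs_bucket}/{blob.name}", table_id, job_config=job_config
            ).result()
        finally:
            blob.delete()
    else:
        # No bucket configured: stream the file straight to BigQuery.
        client.load_table_from_file(
            file_obj, table_id, rewind=True, job_config=job_config
        ).result()
```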

judahrand (Contributor, Author):

IMO there's no real reason not to ingest via GCS. It's certainly not a priority for us, and I don't think I'll have the time to make the change and add the additional testing in the near future, though you're more than welcome to make the necessary changes. It will also uglify the code a bit.

jmriego (Owner):

The thing is that we've been using this without uploading to GCS for years now, and going through GCS would make PII compliance harder for us unless we deleted the files afterwards. I'm going to merge this into a new jsonl branch, make that change, and do some testing on it.

judahrand (Contributor, Author):

This implementation does delete the staged files:

```python
try:
    # Wait for the BigQuery load job to finish.
    job.result()
finally:
    # Delete the staged object whether or not the load succeeded.
    blob.delete()
```

jmriego (Owner):

Sounds good! Just for the sake of conversation: let's say we make GCS mandatory. What would the benefits be? I can see that it would be more similar to the Snowflake target maintained by PipelineWise, which forces you to use a stage or S3. But I don't think you'd gain any speed, any debugging capability (if we delete the file in a finally), or anything else. What do you think?

judahrand (Contributor, Author) commented Feb 23, 2022:

[Figure: chart comparing BigQuery load performance, from "Google BigQuery: The Definitive Guide", ch. 4: https://www.oreilly.com/library/view/google-bigquery-the/9781492044451/ch04.html]

This suggests that when your network is fast, loading via GCS is better. At least the way we're running this (and I'd guess it's common), both our DB and, obviously, BQ are in GCP, in the same region in fact, so the network connection is very fast (10 Gbps+). This makes GCS the obvious choice (and probably uncompressed better than compressed).

Do you disagree?

judahrand (Contributor, Author):

Also, given that we're already not using compression, the chart also suggests that loading via GCS is still quicker outside GCP.

jmriego changed the base branch from master to jsonl (February 22, 2022)
judahrand (Contributor, Author) commented Feb 24, 2022:

@jmriego We have now deployed these changes internally and have seen a dramatic increase in throughput, both in BQ ingestion times and in individual record processing.
