
Use JSONL to upload to BigQuery #86

Open · wants to merge 8 commits into base: jsonl

Conversation

judahrand (Contributor) commented Feb 21, 2022:

Problem


Proposed changes

  • Use JSONL rather than Avro to reduce computation and simplify the code.
  • Dump the JSONL to GCS and let BigQuery work out how to load the data efficiently (see the sketch below).
  • The only breaking change I'm aware of is that you will now need to define a gcs_bucket in the config.

#85
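As a rough illustration of the flow described above (the helper and object names here are hypothetical, not code from this PR): serialize each batch as newline-delimited JSON, stage it in the configured gcs_bucket, and let BigQuery load it from there.

```python
import io
import json

from google.cloud import bigquery, storage

def stage_and_load(records, project_id, gcs_bucket, table_id):
    # One JSON document per line: this is all "JSONL" means.
    data = io.BytesIO(
        "\n".join(json.dumps(r) for r in records).encode("utf-8")
    )

    # Stage the batch in GCS.
    blob = storage.Client(project=project_id).bucket(gcs_bucket).blob("stage/batch.jsonl")
    blob.upload_from_file(data, rewind=True)

    # Hand the parsing and loading off to BigQuery.
    job = bigquery.Client(project=project_id).load_table_from_uri(
        f"gs://{gcs_bucket}/{blob.name}",
        table_id,  # e.g. "project.dataset.table"
        job_config=bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
        ),
    )
    try:
        job.result()   # Wait for the load to finish.
    finally:
        blob.delete()  # Always clean up the staged file.
```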

Types of changes

What types of changes does your code introduce to target-bigquery?
Put an x in the boxes that apply

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation Update (if none of the other choices apply)

Checklist

  • Description above provides context of the change
  • I have added tests that prove my fix is effective or that my feature works
  • Unit tests for changes (not needed for documentation changes)
  • CI checks pass with my changes
  • Bumping version in setup.py is an individual PR and not mixed with feature or bugfix PRs
  • Commits follow "How to write a good git commit message"
  • Relevant documentation is updated including usage instructions

judahrand changed the title from "Upstream/jsonl" to "Use JSONL to upload to BigQuery" (Feb 21, 2022)
judahrand (Contributor, Author):

@jmriego This passes the integration tests on our fork, so I assume it will here too.

judahrand (Contributor, Author) commented Feb 21, 2022:

The integration tests also seem to run faster using this implementation. Obviously there's some noise, but I've seen around 24 minutes for the Avro implementation and around 19 minutes for this JSONL + GCS implementation.

jmriego (Owner) commented Feb 22, 2022:

That's really good! Do you mind giving me some time to test this in my environment? We have quite a few weird schemas and datasets we can use for testing.

README.md

@@ -50,7 +50,9 @@ Full list of options in `config.json`:

| Property   | Type   | Required? | Description                                        |
| ---------- | ------ | --------- | -------------------------------------------------- |
| project_id | String | Yes       | BigQuery project                                   |
| gcs_bucket | String | Yes       | Google Cloud Storage Bucket to use to stage files  |
jmriego (Owner):

Can we make this optional instead? I'd prefer not to force people to use GCS, since I know not everyone uses it, and it's a breaking change. It should be OK to upload to and read from GCS when this parameter is provided, and to use load_table_from_file when it isn't.
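For the sake of discussion, a minimal sketch of that fallback (the helper name and blob path are illustrative, not code from this PR):

```python
from google.cloud import bigquery, storage

def load_jsonl(client, file_obj, table_id, gcs_bucket=None):
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
    )
    if gcs_bucket:
        # Stage in GCS, load from the URI, and delete the staged object afterwards.
        blob = storage.Client().bucket(gcs_bucket).blob("stage/batch.jsonl")
        blob.upload_from_file(file_obj, rewind=True)
        try:
            client.load_table_from_uri(
                f"gs://{gcs_bucket}/{blob.name}", table_id, job_config=job_config
            ).result()
        finally:
            blob.delete()
    else:
        # No bucket configured: stream the file straight to BigQuery.
        client.load_table_from_file(
            file_obj, table_id, rewind=True, job_config=job_config
        ).result()
```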

judahrand (Contributor, Author):

IMO there's no real reason not to ingest via GCS. It's certainly not a priority for us, and I don't think I'll have the time to make the change and add the additional testing in the near future, though you're more than welcome to make the necessary changes. It will also uglify the code a bit.

jmriego (Owner):

The thing is that we've been using this without uploading to GCS for years now, and going through GCS would make PII compliance harder for us unless we deleted the files afterwards. I'm going to merge this into a new jsonl branch, make that change, and do some testing on it.

judahrand (Contributor, Author):

This implementation does delete the staged files:

```python
try:
    # Wait for the BigQuery load job to finish.
    job.result()
finally:
    # Delete the staged object whether or not the load succeeded.
    blob.delete()
```

jmriego (Owner):

Sounds good! Just for the sake of conversation: let's say we make GCS mandatory. What would the benefits be? I can see that it would be more similar to the Snowflake target maintained by PipelineWise, which forces you to use a stage or S3. But I don't think you'd gain any speed, any debugging capability (if we delete the file in a finally), or anything else. What do you think?

judahrand (Contributor, Author) commented Feb 23, 2022:

[Figure: chart comparing BigQuery load performance, from "Google BigQuery: The Definitive Guide", ch. 4: https://www.oreilly.com/library/view/google-bigquery-the/9781492044451/ch04.html]

This suggests that when your network is fast, loading via GCS is better. At least the way we're running this (and I'd guess it's common), both our DB and, obviously, BQ are in GCP, in the same region in fact, so the network connection is very fast (10 Gbps+). This makes GCS the obvious choice (and probably uncompressed better than compressed).

Do you disagree?

judahrand (Contributor, Author):

Also, given that we're already not using compression, the chart also suggests that loading via GCS is still quicker outside GCP.

jmriego changed the base branch from master to jsonl (February 22, 2022)
judahrand (Contributor, Author) commented Feb 24, 2022:

@jmriego We have now deployed these changes internally and have seen a dramatic increase in throughput, both in BQ ingestion times and in individual record processing.
