Use JSONL to upload to BigQuery #86
base: jsonl
Conversation
Force-pushed from e24c192 to 48e082e
Force-pushed from b39e869 to d614f15
Force-pushed from d614f15 to 05a273e
@jmriego This passes the integration tests on our fork, so I assume it will here too.
The integration tests also seem to run faster with this implementation. There's some noise in the numbers, but I've seen around 24 minutes for the Avro implementation and 19 minutes for this JSONL + GCS implementation.
Force-pushed from 6e3e257 to c484a7b
That's really good! Do you mind giving me some time to test this in my environment? We have quite a few weird schemas and data we can use for testing.
@@ -50,7 +50,9 @@ Full list of options in `config.json`:

 | Property   | Type   | Required? | Description                                        |
 | ---------- | ------ | --------- | -------------------------------------------------- |
 | project_id | String | Yes       | BigQuery project                                   |
+| gcs_bucket | String | Yes       | Google Cloud Storage Bucket to use to stage files  |
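For context, a `config.json` using the new option could look like this (placeholder values, and only the two options from the table above; the rest of the config is omitted):

```json
{
  "project_id": "my-gcp-project",
  "gcs_bucket": "my-staging-bucket"
}
```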
Can we make this optional instead? I'd prefer if we don't force people to use GCS, as I know not everyone is using it and it's a breaking change.
It should be OK to upload to and read from GCS if this parameter is provided, or use load_table_from_file if not.
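For the sake of discussion, here is a minimal sketch of that fallback. The function name, blob path, and structure are illustrative only (this is not the actual db_sync.py code); it assumes the google-cloud-bigquery and google-cloud-storage clients:

```python
from google.cloud import bigquery, storage


def load_jsonl(client: bigquery.Client, table: str, file_obj, job_config, gcs_bucket=None):
    """Load a newline-delimited JSON file into BigQuery.

    If gcs_bucket is set, stage the file in GCS and load it with
    load_table_from_uri; otherwise stream it directly with
    load_table_from_file.
    """
    if gcs_bucket:
        blob = storage.Client().bucket(gcs_bucket).blob("staging/records.jsonl")
        blob.upload_from_file(file_obj, rewind=True)
        try:
            job = client.load_table_from_uri(
                f"gs://{gcs_bucket}/{blob.name}", table, job_config=job_config
            )
            return job.result()
        finally:
            blob.delete()  # don't leave staged data behind
    job = client.load_table_from_file(file_obj, table, job_config=job_config, rewind=True)
    return job.result()
```

Both paths would take the same `job_config` (e.g. `bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON)`), so only the transport differs.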
IMO there's no real reason not to ingest via GCS. It's certainly not a priority for us, and I don't think I'd have the time to make the change and add the additional testing in the near future, though you're more than welcome to make the necessary changes. It would also uglify the code a bit.
The thing is that we have been using it without uploading to GCS for years now, and this would make PII compliance more difficult for us unless we deleted the files afterwards. I'm going to merge this into a new jsonl branch and I'll do that change and some testing on it.
This implementation does delete the files:

pipelinewise-target-bigquery/target_bigquery/db_sync.py, lines 443 to 446 (commit c1272d3):

    try:
        job.result()
    finally:
        blob.delete()
Sounds good! Just for the sake of conversation: let's say we make GCS mandatory. What would the benefits of doing that be?
I can see that it would be more similar to the Snowflake target maintained by PipelineWise, which forces you to use a stage or S3. But I don't think you'd gain any speed, debug capabilities (if we delete the file in a finally), or anything else.
What do you think?
From https://www.oreilly.com/library/view/google-bigquery-the/9781492044451/ch04.html
^ This suggests that when your network is fast, loading via GCS is better. At least the way we're running this (and I'd guess it's common), both our DB and (obviously) BQ are in GCP, in the same region in fact, so the network connection is very fast (10 Gbps+). That makes GCS the obvious choice (and probably uncompressed better than compressed).
Do you disagree?
Also, given that we're already not using compression, that suggests that even outside GCP, loading via GCS is still quicker.
Force-pushed from c484a7b to c1272d3
@jmriego We have now deployed these changes internally and have seen a dramatic increase in throughput, both in BQ ingestion times and in individual record processing.
Problem

Describe the problem your PR is trying to solve

Proposed changes

Stages data as JSONL files in a Google Cloud Storage bucket, configured via gcs_bucket in the config, and loads it into BigQuery from there. #85
Types of changes

What types of changes does your code introduce to target-bigquery? Put an `x` in the boxes that apply.

Checklist

- setup.py
- Is an individual PR and not mixed with feature or bugfix PRs