Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More bigquery Dataset extensions for dedupping #10

Merged
merged 11 commits into from
Dec 14, 2017
Merged

More bigquery Dataset extensions for dedupping #10

merged 11 commits into from
Dec 14, 2017

Conversation

gfr10598
Copy link
Contributor

@gfr10598 gfr10598 commented Dec 12, 2017

Adds DestinationQuery() and Dedup()

This was previously reviewed as m-lab/etl#360, but based on feedback decided to move it to the m-lab/go repo.


This change is Reviewable

@coveralls
Copy link
Collaborator

Coverage Status

Coverage decreased (-48.2%) to 32.867% when pulling a8c4196 on dedup into 61205ef on master.

@coveralls
Copy link
Collaborator

Coverage Status

Coverage decreased (-48.2%) to 32.867% when pulling de03734 on dedup into 61205ef on master.

@coveralls
Copy link
Collaborator

Coverage Status

Coverage decreased (-19.4%) to 32.867% when pulling b3ff35f on dedup into 7ac696d on master.

@coveralls
Copy link
Collaborator

Coverage Status

Coverage decreased (-19.4%) to 32.867% when pulling 49552b3 on dedup into 7ac696d on master.

@coveralls
Copy link
Collaborator

Coverage Status

Coverage decreased (-19.4%) to 32.867% when pulling 4c6b873 on dedup into 7ac696d on master.

@coveralls
Copy link
Collaborator

Coverage Status

Coverage decreased (-5.4%) to 46.853% when pulling a7bc474 on dedup into 7ac696d on master.

@gfr10598 gfr10598 requested review from stephen-soltesz and removed request for pboothe December 13, 2017 18:44
@stephen-soltesz
Copy link
Contributor

I'm wondering if Dedup is generic enough to warrant including in the m-lab/go packages?


Reviewed 1 of 1 files at r4.
Review status: 1 of 4 files reviewed at latest revision, 7 unresolved discussions.


bqext/dataset.go, line 166 at r5 (raw file):

///////////////////////////////////////////////////////////////////

// TODO - really should take the one that was parsed last, instead

If de-duplication depends on ordering of specific column names, such as "parse_time" does this belong in a generic package? Or, could this template be a parameter to Dedup?


bqext/dataset.go, line 192 at r5 (raw file):

	q := dsExt.DestinationQuery(queryString, dest.Table(table))
	if overwrite {
		q.QueryConfig.WriteDisposition = bigquery.WriteTruncate

Could write disposition be a parameter to DestinationQuery? The thinking being to possibly encapsulate more of the details of QueryConfig.


bqext/dataset_integration_test.go, line 87 at r5 (raw file):

// Note that a similar struct is defined in dataset.go, but this
// one is used for testing the QueryAndParse method.
type PartitionInfo struct {

Since this is only a local test dependency, can we call this type something else to prevent confusion? Should the type name be uncapitalized?

Or could we use bqext.PartitionInfo safely in the tests instead?


bqext/dataset_integration_test.go, line 140 at r5 (raw file):

}

func ClientOpts() []option.ClientOption {

clientOpts?


bqext/dataset_integration_test.go, line 143 at r5 (raw file):

	opts := []option.ClientOption{}
	if os.Getenv("TRAVIS") != "" {
		authOpt := option.WithCredentialsFile("../travis-testing.key")

This relative path could become fragile. At run time could we set an env variable to the location and check for that variable and use the location here?

e.g. TRAVIS_CREDENTIALS_FILE?


bqext/dataset_integration_test.go, line 158 at r5 (raw file):

	// First check that source table has expected number of rows.
	// TestDedupSrc should have 6 rows, of which 4 should be unique.

This test is very dependent on the state of an external project. This could discourage third-parties from considering adopting this for themselves.

Since this is a live test anyway, what if this test constructed the 6-row table that would be du-duped by the rest of the test? This would make the setup self-contained in code. And, it would be possible to re-target the test with a single line change. I could even imagine reading an environment variable for the project name.


bqext/dataset_test.go, line 136 at r5 (raw file):

func TestDestinationQuery(t *testing.T) {
	// Create a dummy client.
	client := cloudtest.NewChannelClient(make(chan *http.Response, 10))

Does this need a channel client? Can this use an 'ok client'? I don't see any pre-defined responses.


Comments from Reviewable

@gfr10598
Copy link
Contributor Author

Review status: 1 of 4 files reviewed at latest revision, 7 unresolved discussions.


bqext/dataset.go, line 166 at r5 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

If de-duplication depends on ordering of specific column names, such as "parse_time" does this belong in a generic package? Or, could this template be a parameter to Dedup?

It does not depend on column ordering. It only will depend on things like parse_time or task_filename in order to perform some sanity checks. I'd prefer to leave it until later to generalize it.

It definitely could be added as a parameter, so I'll add that now.
Do we want to make it even more general by using some kind of opts... variadic parameter?


bqext/dataset.go, line 192 at r5 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

Could write disposition be a parameter to DestinationQuery? The thinking being to possibly encapsulate more of the details of QueryConfig.

Hmm - let me think about that some more.


bqext/dataset_integration_test.go, line 87 at r5 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

Since this is only a local test dependency, can we call this type something else to prevent confusion? Should the type name be uncapitalized?

Or could we use bqext.PartitionInfo safely in the tests instead?

I'm not terribly concerned since this package is bqext_test. But I'll make it private.


bqext/dataset_integration_test.go, line 140 at r5 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

clientOpts?

Done.


bqext/dataset_integration_test.go, line 143 at r5 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

This relative path could become fragile. At run time could we set an env variable to the location and check for that variable and use the location here?

e.g. TRAVIS_CREDENTIALS_FILE?

I'll add an issue. Probably pretty easy to do.
#11


bqext/dataset_integration_test.go, line 158 at r5 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

This test is very dependent on the state of an external project. This could discourage third-parties from considering adopting this for themselves.

Since this is a live test anyway, what if this test constructed the 6-row table that would be du-duped by the rest of the test? This would make the setup self-contained in code. And, it would be possible to re-target the test with a single line change. I could even imagine reading an environment variable for the project name.

Agree. Already created #8
Added comment.


bqext/dataset_test.go, line 136 at r5 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

Does this need a channel client? Can this use an 'ok client'? I don't see any pre-defined responses.

Done.


Comments from Reviewable

@coveralls
Copy link
Collaborator

Coverage Status

Coverage decreased (-5.4%) to 46.853% when pulling ad63663 on dedup into 7ac696d on master.

@stephen-soltesz
Copy link
Contributor

Review status: 1 of 4 files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.


bqext/dataset.go, line 166 at r5 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…

It does not depend on column ordering. It only will depend on things like parse_time or task_filename in order to perform some sanity checks. I'd prefer to leave it until later to generalize it.

It definitely could be added as a parameter, so I'll add that now.
Do we want to make it even more general by using some kind of opts... variadic parameter?

By ordering I meant as-in "parsed last".

When I was reading this earlier, I think I was trying to imagine whether Dedup would be plausibly useful to a third-party. But, that may be too broad a criteria. If I imagine the goal for m-lab/go being to add generally useful routines that we may use across our own repos, then that's easier.

Something still feels funny. If I ran Dedup twice on the same source, would I get the same result? Are there any other implicit dependencies between the input table and the implementation of Dedup?

If dedupTemplate was a parameter, perhaps the functionality of Dedup becomes immediately more generic. The operation becomes something like Filter or Record or something -- taking data from a query and writing it to a new table, "dedup" being one application of it.


Comments from Reviewable

@gfr10598
Copy link
Contributor Author

Review status: 1 of 4 files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.


bqext/dataset.go, line 166 at r5 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

By ordering I meant as-in "parsed last".

When I was reading this earlier, I think I was trying to imagine whether Dedup would be plausibly useful to a third-party. But, that may be too broad a criteria. If I imagine the goal for m-lab/go being to add generally useful routines that we may use across our own repos, then that's easier.

Something still feels funny. If I ran Dedup twice on the same source, would I get the same result? Are there any other implicit dependencies between the input table and the implementation of Dedup?

If dedupTemplate was a parameter, perhaps the functionality of Dedup becomes immediately more generic. The operation becomes something like Filter or Record or something -- taking data from a query and writing it to a new table, "dedup" being one application of it.

I'm definitely using the "useful to other repos in m-lab" criteria. Irene might use this, except she isn't using Go. 8-(

Yes if you run Dedup twice you should get the same result. I do not believe there are any other implicit dependencies.

I kinda like the idea of making this a Filter method, but I don't like the idea of injecting an arbitrary query. Then it starts to look like DestinationQuery, I think.

I could, however, extract the bottom bit of the code into a generic ExecQuery or something like that, for queries with not output expected. Let me try that out and you can see what you think.

With this extraction, I'd also be content moving the Dedup method itself back to the etl cmd/dedup package.


Comments from Reviewable

@gfr10598
Copy link
Contributor Author

Review status: 1 of 4 files reviewed at latest revision, 2 unresolved discussions.


bqext/dataset.go, line 192 at r5 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…

Hmm - let me think about that some more.

I think it should be and will add that.


Comments from Reviewable

@coveralls
Copy link
Collaborator

Coverage Status

Coverage decreased (-5.04%) to 47.183% when pulling bb80dce on dedup into 7ac696d on master.

@coveralls
Copy link
Collaborator

Coverage Status

Coverage decreased (-4.3%) to 47.887% when pulling dce1458 on dedup into 7ac696d on master.

@coveralls
Copy link
Collaborator

Coverage Status

Coverage decreased (-4.3%) to 47.887% when pulling e9b7a3c on dedup into 7ac696d on master.

@stephen-soltesz
Copy link
Contributor

Review status: 1 of 4 files reviewed at latest revision, 5 unresolved discussions.


bqext/dataset.go, line 166 at r5 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…

I'm definitely using the "useful to other repos in m-lab" criteria. Irene might use this, except she isn't using Go. 8-(

Yes if you run Dedup twice you should get the same result. I do not believe there are any other implicit dependencies.

I kinda like the idea of making this a Filter method, but I don't like the idea of injecting an arbitrary query. Then it starts to look like DestinationQuery, I think.

I could, however, extract the bottom bit of the code into a generic ExecQuery or something like that, for queries with not output expected. Let me try that out and you can see what you think.

With this extraction, I'd also be content moving the Dedup method itself back to the etl cmd/dedup package.

I like the separation between "exec query" and "dedup" now.


bqext/dataset.go, line 165 at r9 (raw file):

// ExecDestQuery constructs a destination query, executes it, and returns status or error.
func (dsExt *Dataset) ExecDestQuery(query string, disposition bigquery.TableWriteDisposition, destTable *bigquery.Table) (*bigquery.JobStatus, error) {

nit: I would advocate for symmetry across names like DestinationQuery and ExecDestQuery. i.e. either both use the full name or both use the short names.


bqext/dataset.go, line 165 at r9 (raw file):

// ExecDestQuery constructs a destination query, executes it, and returns status or error.
func (dsExt *Dataset) ExecDestQuery(query string, disposition bigquery.TableWriteDisposition, destTable *bigquery.Table) (*bigquery.JobStatus, error) {

destTable is a much better parameter than the three project/dataset/table parameters before. 👍


bqext/dataset.go, line 165 at r9 (raw file):

// ExecDestQuery constructs a destination query, executes it, and returns status or error.
func (dsExt *Dataset) ExecDestQuery(query string, disposition bigquery.TableWriteDisposition, destTable *bigquery.Table) (*bigquery.JobStatus, error) {

ExecDestQuery sounds like it executes a value returned by DestinationQuery. Is that possible? Running and Waiting for a query seems like a generally useful pattern.


Comments from Reviewable

@gfr10598
Copy link
Contributor Author

Review status: 1 of 4 files reviewed at latest revision, 4 unresolved discussions.


bqext/dataset.go, line 165 at r9 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

nit: I would advocate for symmetry across names like DestinationQuery and ExecDestQuery. i.e. either both use the full name or both use the short names.

Done.


bqext/dataset.go, line 165 at r9 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

ExecDestQuery sounds like it executes a value returned by DestinationQuery. Is that possible? Running and Waiting for a query seems like a generally useful pattern.

Done.


Comments from Reviewable

@gfr10598
Copy link
Contributor Author

PTAL.


Comments from Reviewable

@stephen-soltesz
Copy link
Contributor

:lgtm:


Review status: 1 of 4 files reviewed at latest revision, 1 unresolved discussion.


bqext/dataset.go, line 144 at r10 (raw file):

}

// DestinationQuery constructs a query with common Config settings for

s/DestinationQuery/DestQuery/


Comments from Reviewable

@coveralls
Copy link
Collaborator

Coverage Status

Coverage decreased (-6.3%) to 45.946% when pulling a8f3604 on dedup into 7ac696d on master.

@gfr10598 gfr10598 merged commit fce6e7b into master Dec 14, 2017
@gfr10598 gfr10598 deleted the dedup branch December 14, 2017 20:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants