-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
More bigquery Dataset extensions for dedupping #10
Conversation
I'm wondering if Dedup is generic enough to warrant including in the m-lab/go packages? Reviewed 1 of 1 files at r4. bqext/dataset.go, line 166 at r5 (raw file):
If de-duplication depends on ordering of specific column names, such as "parse_time" does this belong in a generic package? Or, could this template be a parameter to Dedup? bqext/dataset.go, line 192 at r5 (raw file):
Could write disposition be a parameter to bqext/dataset_integration_test.go, line 87 at r5 (raw file):
Since this is only a local test dependency, can we call this type something else to prevent confusion? Should the type name be uncapitalized? Or could we use bqext/dataset_integration_test.go, line 140 at r5 (raw file):
bqext/dataset_integration_test.go, line 143 at r5 (raw file):
This relative path could become fragile. At run time could we set an env variable to the location and check for that variable and use the location here? e.g. bqext/dataset_integration_test.go, line 158 at r5 (raw file):
This test is very dependent on the state of an external project. This could discourage third-parties from considering adopting this for themselves. Since this is a live test anyway, what if this test constructed the 6-row table that would be du-duped by the rest of the test? This would make the setup self-contained in code. And, it would be possible to re-target the test with a single line change. I could even imagine reading an environment variable for the project name. bqext/dataset_test.go, line 136 at r5 (raw file):
Does this need a channel client? Can this use an 'ok client'? I don't see any pre-defined responses. Comments from Reviewable |
Review status: 1 of 4 files reviewed at latest revision, 7 unresolved discussions. bqext/dataset.go, line 166 at r5 (raw file): Previously, stephen-soltesz (Stephen Soltesz) wrote…
It does not depend on column ordering. It only will depend on things like parse_time or task_filename in order to perform some sanity checks. I'd prefer to leave it until later to generalize it. It definitely could be added as a parameter, so I'll add that now. bqext/dataset.go, line 192 at r5 (raw file): Previously, stephen-soltesz (Stephen Soltesz) wrote…
Hmm - let me think about that some more. bqext/dataset_integration_test.go, line 87 at r5 (raw file): Previously, stephen-soltesz (Stephen Soltesz) wrote…
I'm not terribly concerned since this package is bqext_test. But I'll make it private. bqext/dataset_integration_test.go, line 140 at r5 (raw file): Previously, stephen-soltesz (Stephen Soltesz) wrote…
Done. bqext/dataset_integration_test.go, line 143 at r5 (raw file): Previously, stephen-soltesz (Stephen Soltesz) wrote…
I'll add an issue. Probably pretty easy to do. bqext/dataset_integration_test.go, line 158 at r5 (raw file): Previously, stephen-soltesz (Stephen Soltesz) wrote…
Agree. Already created #8 bqext/dataset_test.go, line 136 at r5 (raw file): Previously, stephen-soltesz (Stephen Soltesz) wrote…
Done. Comments from Reviewable |
Review status: 1 of 4 files reviewed at latest revision, 2 unresolved discussions, some commit checks failed. bqext/dataset.go, line 166 at r5 (raw file): Previously, gfr10598 (Gregory Russell) wrote…
By ordering I meant as-in "parsed last". When I was reading this earlier, I think I was trying to imagine whether Dedup would be plausibly useful to a third-party. But, that may be too broad a criteria. If I imagine the goal for m-lab/go being to add generally useful routines that we may use across our own repos, then that's easier. Something still feels funny. If I ran Dedup twice on the same source, would I get the same result? Are there any other implicit dependencies between the input table and the implementation of Dedup? If dedupTemplate was a parameter, perhaps the functionality of Dedup becomes immediately more generic. The operation becomes something like Comments from Reviewable |
Review status: 1 of 4 files reviewed at latest revision, 2 unresolved discussions, some commit checks failed. bqext/dataset.go, line 166 at r5 (raw file): Previously, stephen-soltesz (Stephen Soltesz) wrote…
I'm definitely using the "useful to other repos in m-lab" criteria. Irene might use this, except she isn't using Go. 8-( Yes if you run Dedup twice you should get the same result. I do not believe there are any other implicit dependencies. I kinda like the idea of making this a Filter method, but I don't like the idea of injecting an arbitrary query. Then it starts to look like DestinationQuery, I think. I could, however, extract the bottom bit of the code into a generic ExecQuery or something like that, for queries with not output expected. Let me try that out and you can see what you think. With this extraction, I'd also be content moving the Dedup method itself back to the etl cmd/dedup package. Comments from Reviewable |
Review status: 1 of 4 files reviewed at latest revision, 2 unresolved discussions. bqext/dataset.go, line 192 at r5 (raw file): Previously, gfr10598 (Gregory Russell) wrote…
I think it should be and will add that. Comments from Reviewable |
Review status: 1 of 4 files reviewed at latest revision, 5 unresolved discussions. bqext/dataset.go, line 166 at r5 (raw file): Previously, gfr10598 (Gregory Russell) wrote…
I like the separation between "exec query" and "dedup" now. bqext/dataset.go, line 165 at r9 (raw file):
nit: I would advocate for symmetry across names like bqext/dataset.go, line 165 at r9 (raw file):
bqext/dataset.go, line 165 at r9 (raw file):
ExecDestQuery sounds like it executes a value returned by DestinationQuery. Is that possible? Running and Waiting for a query seems like a generally useful pattern. Comments from Reviewable |
Review status: 1 of 4 files reviewed at latest revision, 4 unresolved discussions. bqext/dataset.go, line 165 at r9 (raw file): Previously, stephen-soltesz (Stephen Soltesz) wrote…
Done. bqext/dataset.go, line 165 at r9 (raw file): Previously, stephen-soltesz (Stephen Soltesz) wrote…
Done. Comments from Reviewable |
PTAL. Comments from Reviewable |
Review status: 1 of 4 files reviewed at latest revision, 1 unresolved discussion. bqext/dataset.go, line 144 at r10 (raw file):
s/DestinationQuery/DestQuery/ Comments from Reviewable |
Adds DestinationQuery() and Dedup()
This was previously reviewed as m-lab/etl#360, but based on feedback decided to move it to the m-lab/go repo.
This change is