adding configuration options to uniques functionality #224

SimonLangerQC · 2024-06-03T16:25:32Z

No description provided.

ivergara

Quite some lines! Greatly appreciated.

As written in comments, perhaps you can leave the postprocessing parts of this PR out to keep its size more manageable and to have a better discussion on what to do there. I'm not strongly against dealing with the postprocessing here.
The names of the utility functions feel a bit wonky to me, but right now I don't know how to name them better.
In the utility functions, you use filter and reduce with lambda functions, calling list at the end. Is there any chance to change some of them to list comprehensions instead?
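
To make the last point concrete, here is a minimal sketch of the kind of rewrite meant. The function names and the row type are hypothetical and not taken from the PR; the two functions are intended to behave identically.

```python
from typing import List, Optional, Tuple

# Hypothetical row type for illustration: a tuple of nullable integers.
Row = Tuple[Optional[int], ...]


def drop_all_null_rows_functional(rows: List[Row]) -> List[Row]:
    # filter + lambda, materialized with list() at the end.
    return list(filter(lambda row: not all(v is None for v in row), rows))


def drop_all_null_rows_comprehension(rows: List[Row]) -> List[Row]:
    # Same behavior expressed as a list comprehension.
    return [row for row in rows if not all(v is None for v in row)]


rows = [(1, None), (None, None), (2, 3)]
result_functional = drop_all_null_rows_functional(rows)
result_comprehension = drop_all_null_rows_comprehension(rows)
```

The comprehension keeps the condition next to the loop variable and avoids the trailing list() call.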

ivergara · 2024-06-04T20:54:09Z

src/datajudge/constraints/uniques.py

+    It can be configured in more detail by supplying a custom ``filter_func`` function.
+    Some exemplary implementations are available in this module as ``datajudge.utils.util_filternull_default_deprecated``,
+    ``datajudge.utils.util_filternull_never``, ``datajudge.utils.util_filternull_element_or_tuple_all``, ``datajudge.utils.util_filternull_element_or_tuple_any``.
+    For new deployments, using one of the above filters or a custom one is recommended.


The word deployment in this line seems odd to me. I'm not sure that this sentence is actually needed at all.

ivergara · 2024-06-04T20:56:21Z

src/datajudge/constraints/uniques.py

+    For new deployments, using one of the above filters or a custom one is recommended.
+    Passing None as the argument is equivalent to ``datajudge.utils.util_filternull_default_deprecated``, but triggers a warning.
+    The deprecated default may change in future versions.
+    To silence the warning, set ``filter_func`` explicitly.


One could also silence it by configuring the warnings module to do so.

I think the fact that it's a warning indicating eventual future deprecation should be enough.
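
For reference, a minimal sketch of the two ways to avoid the warning discussed here. The function `check_uniques` is a hypothetical stand-in, and the use of `DeprecationWarning` is an assumption; the PR text does not say which warning category is raised.

```python
import warnings


def _drop_none(values):
    # Stand-in for the legacy default filter (assumption, not the PR's code).
    return [v for v in values if v is not None]


def check_uniques(values, filter_func=None):
    # Hypothetical stand-in for a constraint that warns when no filter is given.
    if filter_func is None:
        warnings.warn(
            "filter_func not set; falling back to the legacy default",
            DeprecationWarning,
            stacklevel=2,
        )
        filter_func = _drop_none
    return filter_func(values)


# Option 1: silence the warning via the warnings module, scoped to a block.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=DeprecationWarning)
    silenced = check_uniques([1, None, 2])

# Option 2: pass filter_func explicitly, so no warning is raised at all.
explicit = check_uniques([1, None, 2], filter_func=list)
```

Option 1 needs no change to the call itself, while option 2 also documents the intended NULL handling at the call site.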

src/datajudge/constraints/uniques.py

ivergara · 2024-06-04T21:10:18Z

src/datajudge/utils.py

@@ -42,3 +44,106 @@ def format_difference(
        f"{s1[:diff_idx]}{_fmt_diff_part(s1, diff_idx)}",
        f"{s2[:diff_idx]}{_fmt_diff_part(s2, diff_idx)}",
    )
+
+
+def util_output_postprocessing_sorter(


I was going to suggest a move like this. Now, since these new functions are already in a utils namespace, it doesn't make much sense to prepend such a prefix to each function name.

ivergara · 2024-06-04T21:11:48Z

src/datajudge/utils.py

+    return [elem[1:] for elem in lst], [-elem[0] for elem in lst]
+
+
+def util_filternull_default_deprecated(values: List[T]) -> List[T]:


This has been the default behavior until now, and it is what you use when None is passed. But it could still make sense to use it in the future, right? If so, I don't think that the function itself is deprecated, and having that word in the function name is misleading.

My main issue with the current default behavior is that if the user adds a second column to the unique constraint, the NULL-filtering no longer works, even if both the original and second column are NULL.

I've renamed it to filternull_element
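
The multi-column issue described above can be sketched as follows. These three functions are hypothetical illustrations of element-wise versus tuple-aware NULL filtering; they are one reading of the filter names mentioned in this PR, not the code in `datajudge.utils`.

```python
from typing import Any, List


def drop_null_elements(values: List[Any]) -> List[Any]:
    # Element-wise check only: a tuple is never None, so tuples always pass.
    return [v for v in values if v is not None]


def drop_null_or_all_null_tuples(values: List[Any]) -> List[Any]:
    # Additionally drop tuples whose entries are all None.
    return [
        v
        for v in values
        if v is not None
        and not (isinstance(v, tuple) and all(x is None for x in v))
    ]


def drop_null_or_any_null_tuples(values: List[Any]) -> List[Any]:
    # Additionally drop tuples with at least one None entry.
    return [
        v
        for v in values
        if v is not None
        and not (isinstance(v, tuple) and any(x is None for x in v))
    ]


single_column = [1, None, 2]
two_columns = [(1, "a"), (None, None), (2, None)]

element_single = drop_null_elements(single_column)
element_multi = drop_null_elements(two_columns)
all_null_multi = drop_null_or_all_null_tuples(two_columns)
any_null_multi = drop_null_or_any_null_tuples(two_columns)
```

With one column the element-wise filter removes the NULL, but with two columns the row `(None, None)` survives, which is the behavior the comment above objects to.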

ivergara · 2024-06-04T21:13:59Z

tests/integration/test_integration.py

+        (
+            negation,
+            ["col_int"],
+            [


Maybe we can instruct our linter/formatter not to put one element per line in this list? It is unnecessarily bloating the line count. Same for the other parameterizations.

Alternatively, perhaps you can do something like list(range(30)) instead.
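
A small sketch of both suggestions. The values are made up for illustration, and the `# fmt: off` / `# fmt: on` markers assume a formatter that honors them (Black and the Ruff formatter do); which formatter the project uses is not stated here.

```python
# One element per line, as a formatter honoring trailing commas lays it out.
expected_explicit = [
    0,
    1,
    2,
    3,
    4,
    5,
]

# Compact alternative when the values form a simple sequence.
expected_compact = list(range(6))

# For irregular values, ask the formatter to leave the literal alone.
# fmt: off
expected_irregular = [0, 1, 1, 2, 3, 5, 8, 13]
# fmt: on
```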

SimonLangerQC · 2024-06-05T07:52:09Z

Thanks for the thorough review @ivergara :) - I've integrated your suggestions or replied to your comments.

Factoring out the postprocessing imo is not a great idea, since this would break the _with_outputcheck tests, which are also used to verify the updated null-filtering, etc.

…y scripts for postgres integration testing from a fresh db every time

SimonLangerQC · 2024-06-05T08:55:42Z

I've now moved the additional output configuration options to the base Constraint class, allowing for making use of sorting and slicing for other types of constraints as well, such as the functional dependency constraint

run_integration_tests_postgres.sh

Co-authored-by: Ignacio Vergara Kausel <[email protected]>

src/datajudge/constraints/base.py

tests/integration/test_integration.py

ivergara · 2024-06-05T13:57:18Z

src/datajudge/constraints/base.py

-            [Collection, Optional[Collection]], Collection
-        ] = None,
-        output_remainder_slicer=slice(5),
+        output_processors: List[OutputProcessor] = None,


As discussed, you could define this as OutputProcessor | List[OutputProcessor], and in the body you can do something like

if not isinstance(output_processor, list): output_processor = [output_processor]
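
Spelled out, the normalization could look like the sketch below. The `OutputProcessor` alias is an assumption modeled on the callable signature in the diff above, and `normalize_output_processors` is a hypothetical helper name. `Union` is used instead of `|` so the sketch also runs on Python versions before 3.10.

```python
from typing import Callable, Collection, List, Optional, Tuple, Union

# Assumed shape of an output processor: takes the values and optional counts.
OutputProcessor = Callable[
    [Collection, Optional[Collection]],
    Tuple[Collection, Optional[Collection]],
]


def normalize_output_processors(
    output_processors: Optional[Union[OutputProcessor, List[OutputProcessor]]],
) -> List[OutputProcessor]:
    # None means "no processing"; a single callable is wrapped in a list.
    if output_processors is None:
        return []
    if not isinstance(output_processors, list):
        return [output_processors]
    return output_processors


def identity(collection, counts=None):
    # Trivial processor used only to exercise the helper.
    return collection, counts


from_none = normalize_output_processors(None)
from_single = normalize_output_processors(identity)
from_list = normalize_output_processors([identity, identity])
```

Normalizing once at the boundary lets the rest of the code iterate over a list without special cases.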

kklein

Thanks a lot for your PR @SimonLangerQC - much appreciated!

I gave it a look; don't be surprised by the number of comments - most of them are merely docs-related polishing. :)

src/datajudge/requirements.py

src/datajudge/utils.py

Co-authored-by: Kevin Klein <[email protected]>

SimonLangerQC · 2024-06-07T07:55:16Z

Thanks for your comments @kklein - I've added your suggestions :)

src/datajudge/constraints/base.py

kklein · 2024-06-09T08:14:47Z

src/datajudge/utils.py

+    ) -> Collection: ...
+
+
+def output_processor_sort(


Might it be useful to have unit tests for output_processor_sort, output_processor_limit and sort_tuple_none_aware in datajudge/tests/unit/test_unit.py?
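
As a rough idea of what such unit tests could exercise, here is a self-contained sketch. The functions `none_aware_key`, `sort_by_count` and `limit_output` are hypothetical re-implementations written for this example; they are not the PR's `output_processor_sort`, `output_processor_limit` or `sort_tuple_none_aware`, whose exact signatures are not shown here. The default limit of 100 follows the commit message "set default to limit 100 elements".

```python
from typing import Any, List, Optional, Sequence, Tuple

Row = Tuple[Any, ...]


def none_aware_key(row: Row) -> Tuple[Tuple[bool, Any], ...]:
    # None sorts after every real value and is never compared with "<".
    return tuple((value is None, value) for value in row)


def sort_by_count(
    rows: Sequence[Row], counts: Optional[Sequence[int]] = None
) -> Tuple[List[Row], Optional[List[int]]]:
    if counts is None:
        return sorted(rows, key=none_aware_key), None
    # Descending by count; ties are broken by the row itself.
    order = sorted(
        range(len(rows)), key=lambda i: (-counts[i], none_aware_key(rows[i]))
    )
    return [rows[i] for i in order], [counts[i] for i in order]


def limit_output(
    rows: Sequence[Any], counts: Optional[Sequence[int]] = None, limit: int = 100
) -> Tuple[List[Any], Optional[List[int]]]:
    # Keep only the first `limit` entries of both sequences.
    limited_counts = None if counts is None else list(counts[:limit])
    return list(rows[:limit]), limited_counts


rows = [(2, None), (1, "b"), (None, "a")]
sorted_rows, sorted_counts = sort_by_count(rows, [1, 3, 3])
plain_sorted, plain_counts = sort_by_count(rows)
limited_rows, limited_counts = limit_output(list(range(5)), limit=2)
```

Rows containing NULLs are the interesting case, since plain `sorted` would raise a TypeError when comparing None with an integer.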

kklein

LGTM! :)

codecov · 2024-06-11T10:28:00Z

Codecov Report

Attention: Patch coverage is 93.02326% with 6 lines in your changes missing coverage. Please review.

Project coverage is 92.44%. Comparing base (246bb47) to head (62f6877).
Report is 86 commits behind head on main.

Files with missing lines	Patch %	Lines
src/datajudge/constraints/uniques.py	88.00%	3 Missing ⚠️
src/datajudge/utils.py	94.00%	3 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #224      +/-   ##
==========================================
- Coverage   93.03%   92.44%   -0.59%     
==========================================
  Files          18       18              
  Lines        1894     1973      +79     
==========================================
+ Hits         1762     1824      +62     
- Misses        132      149      +17


adding configuration options to uniques functionality (#224) Docs update Co-authored-by: Kevin Klein <[email protected]> Docs update Co-authored-by: Kevin Klein <[email protected]> Docs update Co-authored-by: Kevin Klein <[email protected]> Docs update Co-authored-by: Kevin Klein <[email protected]> Docs update Co-authored-by: Kevin Klein <[email protected]> Docs update Co-authored-by: Kevin Klein <[email protected]> Docs update Co-authored-by: Kevin Klein <[email protected]> Docs update Co-authored-by: Kevin Klein <[email protected]> Docs update Co-authored-by: Kevin Klein <[email protected]> Docs update Co-authored-by: Kevin Klein <[email protected]> Docs update Co-authored-by: Kevin Klein <[email protected]> Docs update Co-authored-by: Kevin Klein <[email protected]> Docs update Co-authored-by: Kevin Klein <[email protected]> Docs update Co-authored-by: Kevin Klein <[email protected]> Docs update Co-authored-by: Kevin Klein <[email protected]> Docs update Co-authored-by: Kevin Klein <[email protected]> Docs update Co-authored-by: Kevin Klein <[email protected]> update doc string on null columns everywhere and fix typo Update docs Co-authored-by: Kevin Klein <[email protected]> Update docs Co-authored-by: Kevin Klein <[email protected]> Update docs Co-authored-by: Kevin Klein <[email protected]> docs updates update docs filternull docs clarification replace assert by raise ValueError shorten name to apply_output_formatting add unit tests for new utils functions set default to limit 100 elements ensure all relevant tests run for impala and ensure they pass disable extralong test for bigquery due to slow speed capitalization test handle parallel if table already created

adding configuration options to uniques functionality

9ea8866

SimonLangerQC requested review from ivergara and kklein June 3, 2024 16:25

SimonLangerQC added 2 commits June 4, 2024 18:35

improve docstrings

92f2933

move util_ functions to datajudge.utils

63e9634

ivergara reviewed Jun 4, 2024

updates following comments

3782955

add configuration options to functional dependency checks, and utilit…

6cad13e

…y scripts for postgres integration testing from a fresh db every time

SimonLangerQC requested a review from ivergara June 5, 2024 08:56

ivergara reviewed Jun 5, 2024

run_integration_tests_postgres.sh (outdated, resolved)

fix typo in run_integration_tests_postgres.sh

7dec26e

Co-authored-by: Ignacio Vergara Kausel <[email protected]>

ivergara reviewed Jun 5, 2024

src/datajudge/constraints/base.py (outdated, resolved)

ivergara reviewed Jun 5, 2024

src/datajudge/constraints/base.py (outdated, resolved)

ivergara reviewed Jun 5, 2024

tests/integration/test_integration.py (resolved)

SimonLangerQC added 2 commits June 5, 2024 14:06

rename to output_processor

6322310

output_processor only

308ff99

SimonLangerQC force-pushed the uniques_improvements branch from 024b0f5 to 308ff99 Compare June 5, 2024 13:32

SimonLangerQC requested a review from ivergara June 5, 2024 13:51

ivergara reviewed Jun 5, 2024

SimonLangerQC added 2 commits June 5, 2024 16:13

allow for single output processor

c1fec1a

add output_processor_limit

91ea51c

SimonLangerQC requested a review from ivergara June 6, 2024 14:29

kklein reviewed Jun 6, 2024

SimonLangerQC and others added 5 commits June 7, 2024 09:06

Docs update

0f94589

Co-authored-by: Kevin Klein <[email protected]>

Docs update

693b29b

Co-authored-by: Kevin Klein <[email protected]>

Docs update

5c0c03b

Co-authored-by: Kevin Klein <[email protected]>

Docs update

52f993d

Co-authored-by: Kevin Klein <[email protected]>

Docs update

887b0e6

Co-authored-by: Kevin Klein <[email protected]>

SimonLangerQC and others added 12 commits June 7, 2024 09:11

Docs update

151d53b

Co-authored-by: Kevin Klein <[email protected]>

Docs update

2f99478

Co-authored-by: Kevin Klein <[email protected]>

Docs update

ea326ad

Co-authored-by: Kevin Klein <[email protected]>

Docs update

c712205

Co-authored-by: Kevin Klein <[email protected]>

update doc string on null columns everywhere and fix typo

9eb3433

Update docs

e6c396a

Co-authored-by: Kevin Klein <[email protected]>

Update docs

3ca2003

Co-authored-by: Kevin Klein <[email protected]>

Update docs

4ddda10

Co-authored-by: Kevin Klein <[email protected]>

docs updates

0502720

update docs

cf42e38

filternull docs clarification

409b611

replace assert by raise ValueError

536096a

SimonLangerQC requested a review from kklein June 7, 2024 07:54

kklein reviewed Jun 9, 2024

SimonLangerQC added 2 commits June 10, 2024 08:39

shorten name to apply_output_formatting

0067b84

add unit tests for new utils functions

143f0f9

SimonLangerQC requested a review from kklein June 10, 2024 08:13

set default to limit 100 elements

b8842a7

kklein approved these changes Jun 10, 2024

SimonLangerQC added the ready label Jun 11, 2024

SimonLangerQC added 3 commits June 11, 2024 16:06

ensure all relevant tests run for impala and ensure they pass

0041e99

disable extralong test for bigquery due to slow speed

cb63bef

capitalization test handle parallel if table already created

62f6877

SimonLangerQC merged commit 72ffd10 into main Jun 12, 2024
37 of 39 checks passed

SimonLangerQC deleted the uniques_improvements branch June 12, 2024 08:43

SimonLangerQC mentioned this pull request Jun 28, 2024

Allow disable cache #234

Merged
