Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

adding configuration options to uniques functionality #224

Merged
merged 43 commits into from
Jun 12, 2024

Conversation

SimonLangerQC
Copy link
Contributor

No description provided.

@SimonLangerQC SimonLangerQC requested review from ivergara and kklein June 3, 2024 16:25
Copy link
Collaborator

@ivergara ivergara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quite some lines! Greatly appreciated.

  1. As written in comments, perhaps you can leave the postprocessing parts of this PR out to keep its size more manageable and to have a better discussion on what to do there. I'm not strongly against dealing with the postprocessing here.

  2. The names of the utility functions feel a bit wonky to me. But right now I dont'know how to name them better.

  3. In the utility functions, you use filter, reduce with lambda functions calling list at the end. Is there any chance to change some of them to list comprehensions instead?

It can be configured in more detail by supplying a custom ``filter_func`` function.
Some exemplary implementations are available in this module as ``datajudge.utils.util_filternull_default_deprecated``,
``datajudge.utils.util_filternull_never``, ``datajudge.utils.util_filternull_element_or_tuple_all``, ``datajudge.utils.util_filternull_element_or_tuple_any``.
For new deployments, using one of the above filters or a custom one is recommended.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The word deployment in this line seems odd to me. I'm not sure that this sentence is actually needed at all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

For new deployments, using one of the above filters or a custom one is recommended.
Passing None as the argument is equivalent to ``datajudge.utils.util_filternull_default_deprecated``, but triggers a warning.
The deprecated default may change in future versions.
To silence the warning, set ``filter_func`` explicitly.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One could silence it by configuring the warming module to do so.

I think the fact that's a warning indicating eventual future deprecation should be enough.

@@ -42,3 +44,106 @@ def format_difference(
f"{s1[:diff_idx]}{_fmt_diff_part(s1, diff_idx)}",
f"{s2[:diff_idx]}{_fmt_diff_part(s2, diff_idx)}",
)


def util_output_postprocessing_sorter(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was goign to suggest a move like this. Now, since these new functions are already in a utils namespace, doesn't make much sense to prepend such set of characters for each function.

return [elem[1:] for elem in lst], [-elem[0] for elem in lst]


def util_filternull_default_deprecated(values: List[T]) -> List[T]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the default behavior until now, which you're using when the None is used. But eventually still makes sense to use it in the future right? If so, I don't think that the function itself is deprecated and having that word in the function name is misleading.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main issue with the current default behavior is that if the user adds a second column to the unique constraint, the NULL-filtering no longer works, even if both the original and second column are NULL.

I've renamed it to filternull_element

(
negation,
["col_int"],
[
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we instruct our linter/formatter not to do one element per line in this list? Is really unnecessarily bloating the line count. Same for the other parameterizations.

Alternatively, perhaps you can do something like list(range(30)) instead.

@SimonLangerQC
Copy link
Contributor Author

Thanks for the thorough review @ivergara :) - I've integrated your suggestions or replied to your comments.

Factoring out the postprocessing imo is not a great idea, since this would break the _with_outputcheck tests, which are also used to verify the updated null-filtering, etc.

…y scripts for postgres integration testing from a fresh db every time
@SimonLangerQC
Copy link
Contributor Author

I've now moved the additional output configuration options to the base Constraint class, allowing for making use of sorting and slicing for other types of constraints as well, such as the functional dependency constraint

@SimonLangerQC SimonLangerQC requested a review from ivergara June 5, 2024 08:56
@SimonLangerQC SimonLangerQC force-pushed the uniques_improvements branch from 024b0f5 to 308ff99 Compare June 5, 2024 13:32
@SimonLangerQC SimonLangerQC requested a review from ivergara June 5, 2024 13:51
[Collection, Optional[Collection]], Collection
] = None,
output_remainder_slicer=slice(5),
output_processors: List[OutputProcessor] = None,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed you could define this as OutputProcessor | List[OutputProcessor] and in the body you can do something like

if not isinstance(output_processor, List]:
    output_processor = [output_processor]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@SimonLangerQC SimonLangerQC requested a review from ivergara June 6, 2024 14:29
Copy link
Collaborator

@kklein kklein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for your PR @SimonLangerQC - much appreciated!

I gave it a look; don't be surprised by the number of comments - most of them are merely docs-related polishing. :)

SimonLangerQC and others added 5 commits June 7, 2024 09:06
Co-authored-by: Kevin Klein <[email protected]>
Co-authored-by: Kevin Klein <[email protected]>
Co-authored-by: Kevin Klein <[email protected]>
Co-authored-by: Kevin Klein <[email protected]>
Co-authored-by: Kevin Klein <[email protected]>
@SimonLangerQC SimonLangerQC requested a review from kklein June 7, 2024 07:54
@SimonLangerQC
Copy link
Contributor Author

Thanks for your comments @kklein - I've added your suggestions :)

) -> Collection: ...


def output_processor_sort(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might it be useful to have unit tests for output_processor_sort, output_processor_limit and sort_tuple_none_aware in datajudge/tests/unit/test_unit.py?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@SimonLangerQC SimonLangerQC requested a review from kklein June 10, 2024 08:13
Copy link
Collaborator

@kklein kklein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! :)

Copy link

codecov bot commented Jun 11, 2024

Codecov Report

Attention: Patch coverage is 93.02326% with 6 lines in your changes missing coverage. Please review.

Project coverage is 92.44%. Comparing base (246bb47) to head (62f6877).
Report is 86 commits behind head on main.

Files with missing lines Patch % Lines
src/datajudge/constraints/uniques.py 88.00% 3 Missing ⚠️
src/datajudge/utils.py 94.00% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #224      +/-   ##
==========================================
- Coverage   93.03%   92.44%   -0.59%     
==========================================
  Files          18       18              
  Lines        1894     1973      +79     
==========================================
+ Hits         1762     1824      +62     
- Misses        132      149      +17     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@SimonLangerQC SimonLangerQC merged commit 72ffd10 into main Jun 12, 2024
37 of 39 checks passed
@SimonLangerQC SimonLangerQC deleted the uniques_improvements branch June 12, 2024 08:43
kklein pushed a commit that referenced this pull request Jun 25, 2024
Docs update

Co-authored-by: Kevin Klein <[email protected]>

Docs update

Co-authored-by: Kevin Klein <[email protected]>

Docs update

Co-authored-by: Kevin Klein <[email protected]>

Docs update

Co-authored-by: Kevin Klein <[email protected]>

Docs update

Co-authored-by: Kevin Klein <[email protected]>

Docs update

Co-authored-by: Kevin Klein <[email protected]>

Docs update

Co-authored-by: Kevin Klein <[email protected]>

Docs update

Co-authored-by: Kevin Klein <[email protected]>

Docs update

Co-authored-by: Kevin Klein <[email protected]>

Docs update

Co-authored-by: Kevin Klein <[email protected]>

Docs update

Co-authored-by: Kevin Klein <[email protected]>

Docs update

Co-authored-by: Kevin Klein <[email protected]>

Docs update

Co-authored-by: Kevin Klein <[email protected]>

Docs update

Co-authored-by: Kevin Klein <[email protected]>

Docs update

Co-authored-by: Kevin Klein <[email protected]>

Docs update

Co-authored-by: Kevin Klein <[email protected]>

Docs update

Co-authored-by: Kevin Klein <[email protected]>

update doc string on null columns everywhere and fix typo

Update docs

Co-authored-by: Kevin Klein <[email protected]>

Update docs

Co-authored-by: Kevin Klein <[email protected]>

Update docs

Co-authored-by: Kevin Klein <[email protected]>

docs updates

update docs

filternull docs clarification

replace assert by raise ValueError

shorten name to apply_output_formatting

add unit tests for new utils functions

set default to limit 100 elements

ensure all relevant tests run for impala and ensure they pass

disable extralong test for bigquery due to slow speed

capitalization test handle parallel if table already created
kklein pushed a commit that referenced this pull request Jun 25, 2024
adding configuration options to uniques functionality (#224)

Docs update

Co-authored-by: Kevin Klein <[email protected]>

Docs update

Co-authored-by: Kevin Klein <[email protected]>

Docs update

Co-authored-by: Kevin Klein <[email protected]>

Docs update

Co-authored-by: Kevin Klein <[email protected]>

Docs update

Co-authored-by: Kevin Klein <[email protected]>

Docs update

Co-authored-by: Kevin Klein <[email protected]>

Docs update

Co-authored-by: Kevin Klein <[email protected]>

Docs update

Co-authored-by: Kevin Klein <[email protected]>

Docs update

Co-authored-by: Kevin Klein <[email protected]>

Docs update

Co-authored-by: Kevin Klein <[email protected]>

Docs update

Co-authored-by: Kevin Klein <[email protected]>

Docs update

Co-authored-by: Kevin Klein <[email protected]>

Docs update

Co-authored-by: Kevin Klein <[email protected]>

Docs update

Co-authored-by: Kevin Klein <[email protected]>

Docs update

Co-authored-by: Kevin Klein <[email protected]>

Docs update

Co-authored-by: Kevin Klein <[email protected]>

Docs update

Co-authored-by: Kevin Klein <[email protected]>

update doc string on null columns everywhere and fix typo

Update docs

Co-authored-by: Kevin Klein <[email protected]>

Update docs

Co-authored-by: Kevin Klein <[email protected]>

Update docs

Co-authored-by: Kevin Klein <[email protected]>

docs updates

update docs

filternull docs clarification

replace assert by raise ValueError

shorten name to apply_output_formatting

add unit tests for new utils functions

set default to limit 100 elements

ensure all relevant tests run for impala and ensure they pass

disable extralong test for bigquery due to slow speed

capitalization test handle parallel if table already created
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants