Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(rules): config structure #47

Merged
merged 20 commits into from
Aug 8, 2024
Merged

refactor(rules): config structure #47

merged 20 commits into from
Aug 8, 2024

Conversation

cmdoret
Copy link
Member

@cmdoret cmdoret commented Aug 6, 2024

Proposed Changes

Changes the rule specification format to match the proposal in #45.

  • adapt matching logic and associated tests, simplifying where possible.
  • move invert cli flag into config (single source of truth for pseudonymization rules)
  • move top-level function match_rules into the rules module
  • mark fields as optional in serde so that users don't need to add all empty fields.

The mask and triple were cloned a few times by the matching rules, the refactored version passes everything by reference, resulting in a 13% speedup (15.9s -> 14.1s on a 2.3M triples file). That's ~164k triples/second.

Types of Changes

What types of changes does your code introduce? Put an x in the boxes that
apply

  • A bug fix (non-breaking change which fixes an issue).
  • A new feature (non-breaking change which adds functionality).
  • A breaking change (fix or feature that would cause existing
    functionality to not work as expected).
  • A non-productive update (documentation, tooling, etc. if none of the
    other choices apply).

Checklist

Put an x in the boxes that apply. You can also fill these out after creating
the PR. If you're unsure about any of them, don't hesitate to ask. We're here to
help! This is simply a reminder of what we are going to look for before merging
your code.

  • I have read the
    CONTRIBUTING
    guidelines.
  • I have added tests that prove my fix is effective or that my feature
    works.
  • I have added the necessary documentation (if appropriate).

@cmdoret cmdoret self-assigned this Aug 6, 2024
@cmdoret cmdoret linked an issue Aug 6, 2024 that may be closed by this pull request
4 tasks
@cmdoret cmdoret requested a review from supermaxiste August 6, 2024 13:17
Copy link
Member

@supermaxiste supermaxiste left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very well done @cmdoret 🥳

praise: the code for rules looks both more structured and cleaner
praise: the new config is definitely easier to understand and use

I only have a major comment and an open question related to it:

issue: Since we're playing with the rules, I tried a test run on my end after compiling your code and noticed that all literal values were not pseudonymized in the output. After a bit of digging I stumbled upon a matching condition that was looking for only NamedNode as a type and used _ for any other case. I suggested avoiding this when possible because it goes a bit against Rust's principle of explicit typing, let me know your thoughts.

./target/debug/tripsu pseudo --index tests/data/type_map.nt --config tests/data/config.yaml tests/data/test.nt
 INFO Args: PseudoArgs { index: "tests/data/type_map.nt", input: "tests/data/test.nt", config: "tests/data/config.yaml", output: "-", secret: None }
<http://example.org/4cbe42b24189cb18b362a8cd0e952c482ad3b757ae96dc2590ba8b94d0911895> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .
<http://example.org/4cbe42b24189cb18b362a8cd0e952c482ad3b757ae96dc2590ba8b94d0911895> <http://xmlns.com/foaf/0.1/holdsAccount> <http://example.org/a7ee682dbe9f90d4b52c156744757a4aae47df7da05f2ac91261fb4c66cb5340> .
<http://example.org/a7ee682dbe9f90d4b52c156744757a4aae47df7da05f2ac91261fb4c66cb5340> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/OnlineAccount> .
<http://example.org/a7ee682dbe9f90d4b52c156744757a4aae47df7da05f2ac91261fb4c66cb5340> <http://schema.org/name> "my_account32" . <- should be pseudo
<http://example.org/a7ee682dbe9f90d4b52c156744757a4aae47df7da05f2ac91261fb4c66cb5340> <http://schema.org/accessCode> "secret-123" . <- should be pseudo
<http://example.org/4cbe42b24189cb18b362a8cd0e952c482ad3b757ae96dc2590ba8b94d0911895> <http://schema.org/name> "Alice" . <- should be pseudo
[...]

suggestion(tests): the bug above is very funny because all unit tests pass, meaning that all of our main functions work. The issue arises when you run the whole pipeline and a "side" function doesn't do everything it's supposed to. How can we address this?
I suggest adding a test function in pass_second to test the encryption of several triples using rstest. Let me know if you have a better idea 👍

src/rules.rs Outdated Show resolved Hide resolved
src/rules.rs Outdated Show resolved Hide resolved
src/rules.rs Outdated Show resolved Hide resolved
@cmdoret cmdoret requested a review from supermaxiste August 6, 2024 17:33
Copy link
Member

@supermaxiste supermaxiste left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for pointing out the wrong line for the bug and excellent job with fixing and the tests 😎

praise: seeing the index! macro in action with multiple entries is so clean!
praise: the test for triples is exactly how I imagined it
praise: thank you for renaming pass_first and pass_second, now it feels like we're 100% in release state

All looks good to me!

@cmdoret cmdoret merged commit f879635 into main Aug 8, 2024
7 checks passed
@cmdoret cmdoret deleted the refactor/config-fmt branch August 8, 2024 11:36
@cmdoret cmdoret removed a link to an issue Aug 8, 2024
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants