RFC: add optional persistence to the lookup table #39

Closed
wants to merge 1 commit into from


@ugtar ugtar commented Oct 11, 2018

Signed-off-by: Uri Okrent [email protected]

This is just an RFC for an idea I was toying with. It is in no way mergeable, but it is a working proof of concept (Python 3 only). Basically, I would like to be able to persist the identifier lookup table so that identifiers for filth remain consistent across runs.

Generally, you create a scrubber as usual:
s = scrubadub.Scrubber()

Tell the scrubber that you would like to persist identifiers, and where:
s.persist_identifiers('/path/to/persistent/table')

Work as usual:
In [2]: text = "Mike is so cool, Joe is also alright"
In [5]: s.clean(text, replace_with='identifier')
Out[5]: '{{NAME-0}} is so cool, {{NAME-1}} is also alright'

Save the lookup table:
s.save_identifiers()

Later, to use your persisted table, call persist_identifiers with the same path as before:
s.persist_identifiers('/path/to/persistent/table')

Now you should get the same identifiers for names that are already in the table:
In [6]: text = "Joe is ok, but Mark is my man"
In [7]: s.clean(text, replace_with='identifier')
Out[7]: '{{NAME-1}} is ok, but {{NAME-2}} is my man'

I realize persisting the table opens up a bit of a can of worms in terms of security, so the table stores a simple one-way hash of each key plus a salt rather than the key itself. This prevents an attacker from simply cracking open the persistence database and reversing the scrubbing, or from using a rainbow table to do the same.
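To make the salted-hash idea concrete, here is a minimal sketch of a lookup table that never stores the original text. It is not the code from this PR: `hash_key` and `IdentifierTable` are hypothetical names, and the use of PBKDF2 with 100,000 iterations is an assumption chosen for illustration.

```python
import hashlib
import os


def hash_key(text: str, salt: bytes) -> str:
    """Return a salted one-way digest of `text` (hypothetical helper)."""
    # PBKDF2 is deliberately slow, which makes brute-forcing short names costlier.
    digest = hashlib.pbkdf2_hmac("sha256", text.encode("utf-8"), salt, 100_000)
    return digest.hex()


class IdentifierTable:
    """Maps hashed filth text to a stable integer identifier (illustrative only)."""

    def __init__(self, salt: bytes) -> None:
        self.salt = salt
        self.ids = {}  # hashed key -> integer identifier

    def lookup(self, text: str) -> int:
        key = hash_key(text, self.salt)
        # Assign the next free identifier the first time a key is seen.
        return self.ids.setdefault(key, len(self.ids))


salt = os.urandom(16)  # would need to be kept secret and reused across runs
table = IdentifierTable(salt)
print(table.lookup("Mike"), table.lookup("Joe"), table.lookup("Joe"))  # 0 1 1
```

Only the digests end up in `table.ids`, so someone who obtains the table without the salt cannot read the names back directly. Names drawn from a small set can still be guessed if the salt leaks, so the salt needs the same protection as any other secret.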

Thoughts?

@thomasbird
Member

Hello and thanks for the contribution!

First off I want to apologise that no one replied to this MR sooner.

I like the idea of this feature; it makes a lot of sense to me. However, in the last three years scrubadub has moved on quite a bit, and I think this MR cannot be merged as it is.

The way you would do this in scrubadub today would be to create a new PostProcessor, perhaps deriving from FilthReplacer, and implement the persistence in there. I would also suggest not using pickle in the persistence layer: pickle can be a source of security issues (unpickling untrusted data can execute arbitrary code), and pickle files are not necessarily compatible across Python versions. Instead, I would suggest a CSV file.
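As a rough illustration of the CSV suggestion, the sketch below saves and reloads a lookup table using only the standard library. It deliberately does not touch scrubadub: `save_table`, `load_table`, the column names and the sample keys are all hypothetical, and wiring this into a PostProcessor subclass is left out because the required hooks are not shown in this thread.

```python
import csv
import os
import tempfile


def save_table(path: str, table: dict) -> None:
    """Write {hashed_key: identifier} rows to a CSV file (hypothetical helper)."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["hashed_key", "identifier"])
        for hashed_key, identifier in table.items():
            writer.writerow([hashed_key, identifier])


def load_table(path: str) -> dict:
    """Read the table back; a missing file yields an empty table."""
    if not os.path.exists(path):
        return {}
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        return {row["hashed_key"]: int(row["identifier"]) for row in reader}


# Round trip through a temporary file; the keys stand in for salted digests.
path = os.path.join(tempfile.mkdtemp(), "identifiers.csv")
save_table(path, {"9f2c": 0, "41ab": 1})
print(load_table(path))  # {'9f2c': 0, '41ab': 1}
```

Unlike pickle, loading this file only parses text, so a tampered file can corrupt identifiers but cannot run code, and the format does not depend on the Python version that wrote it.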

Since this MR is so stale I'm going to close this and create a linked issue instead.

Thanks again,
Thomas

@thomasbird thomasbird closed this Oct 6, 2021
@thomasbird thomasbird linked an issue Oct 6, 2021 that may be closed by this pull request
@thomasbird
Member

thomasbird commented Oct 6, 2021

Thinking about this a little more, you can now add the hash of the Filth into the replaced text, so for example:

import scrubadub
text = "contact Joe Duffy at [email protected]"
scrubber = scrubadub.Scrubber(
    detector_list=[
        'email', 'spacy'
    ],
    post_processor_list=[
        scrubadub.post_processors.FilthReplacer(include_hash=True),
        scrubadub.post_processors.PrefixSuffixReplacer(),
    ],
)
scrubber.clean(text)
# returns: 'contact {{NAME-99ACF74CCE90D307}} at {{EMAIL-04B67CC2890A2B7D}}'

This is kind of similar to what you wanted to achieve (I think): the same email is always replaced by the same placeholder, since the placeholder contains part of the hash.
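The reason a hash-based placeholder stays consistent without any stored table can be sketched in a few lines. This is not scrubadub's actual hashing scheme, and its digests will differ: the `placeholder` function, the choice of HMAC-SHA256 and the 16-character truncation are assumptions made to mirror the shape of the output above.

```python
import hashlib
import hmac


def placeholder(filth_type: str, text: str, salt: bytes, length: int = 16) -> str:
    """Build a deterministic placeholder from a keyed hash (hypothetical helper)."""
    # The salt acts as the HMAC key, so the digest cannot be recomputed without it.
    digest = hmac.new(salt, text.encode("utf-8"), hashlib.sha256).hexdigest()
    return "{{" + filth_type.upper() + "-" + digest[:length].upper() + "}}"


salt = b"keep-this-secret"  # illustrative value; a real salt should be random
first = placeholder("name", "Joe Duffy", salt)
second = placeholder("name", "Joe Duffy", salt)
other = placeholder("name", "Mark", salt)
print(first == second, first == other)  # True False
```

Because the placeholder is a pure function of the text and the salt, two separate runs agree as long as they share the salt, which removes the need to persist a lookup table at all. The trade-off is that identifiers are long hex strings rather than small sequential numbers.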

Development

Successfully merging this pull request may close these issues.

Persist Filth lookup table