RFC: add optional persistence to the lookup table #39

ugtar · 2018-10-11T01:40:20Z

Signed-off-by: Uri Okrent [email protected]

This is just an RFC for an idea I was toying with. It is in no way mergeable, but it is a working proof of concept (Python 3 only). Basically, I would like to be able to persist the identifier lookup table so that identifiers for filth remain consistent across runs.

Generally, the usage is to create a scrubber as you normally would:
s = scrubadub.Scrubber()

Tell the scrubber that you would like to persist identifiers, and where:
s.persist_identifiers('/path/to/persistent/table')

Work as usual:
In [2]: text = "Mike is so cool, Joe is also alright"
In [5]: s.clean(text, replace_with='identifier')
Out[5]: '{{NAME-0}} is so cool, {{NAME-1}} is also alright'

Save the lookup table:
s.save_identifiers()

Later, to use your persisted table, call persist_identifiers with the same path as before:
s.persist_identifiers('/path/to/persistent/table')

Now you should get the same identifiers for names that were already in the table:
In [6]: text = "Joe is ok, but Mark is my man"
In [7]: s.clean(text, replace_with='identifier')
Out[7]: '{{NAME-1}} is ok, but {{NAME-2}} is my man'

I realize persisting the table opens up a bit of a can of worms in terms of security, so the table stores a simple salted one-way hash of each key rather than the key itself. This prevents an attacker from simply cracking open the persistence database and reversing the scrubbing, or from using a rainbow table to do the same.
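To make the salted-hash idea concrete, here is a minimal standalone sketch. It is not the code from this PR: the names `hash_key` and `HashedLookup` are hypothetical, and PBKDF2 is an assumed choice of one-way function. The point is only that the table is keyed by a salted digest rather than by the raw text, while identifiers stay stable for repeated input.

```python
import hashlib
import os


def hash_key(text, salt):
    """Return a salted one-way digest of a piece of filth (PBKDF2 is an assumed choice)."""
    return hashlib.pbkdf2_hmac("sha256", text.encode("utf-8"), salt, 100_000).hex()


class HashedLookup:
    """Hypothetical lookup table: salted digest -> stable integer identifier."""

    def __init__(self, salt=None):
        # A fresh random salt per table defeats precomputed (rainbow) tables.
        self.salt = salt if salt is not None else os.urandom(16)
        self.table = {}

    def identifier(self, text):
        key = hash_key(text, self.salt)
        if key not in self.table:
            # New filth gets the next free identifier.
            self.table[key] = len(self.table)
        return self.table[key]


lookup = HashedLookup()
ids = [lookup.identifier(name) for name in ("Mike", "Joe", "Joe", "Mark")]
print(ids)  # [0, 1, 1, 2]
```

Note that the salt has to be stored alongside the table for later runs to reproduce the same digests, so this protects against precomputed tables rather than against a targeted guess of a known name.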

Thoughts?

Signed-off-by: Uri Okrent <[email protected]>

thomasbird · 2021-10-06T15:55:02Z

Hello and thanks for the contribution!

First off I want to apologise that no one replied to this MR sooner.

I like the idea of this feature; it makes a lot of sense to me. However, in the last three years scrubadub has moved on quite a bit, and I think this MR cannot be merged as it is.

The way to do this in scrubadub today would be to create a new PostProcessor, perhaps deriving from FilthReplacer, and implement the persistence there. I would also suggest not using pickle in the persistence layer: pickle can be a source of security issues (loading a pickle file can execute arbitrary code), and pickle files are not necessarily compatible across Python versions. Instead, I would suggest a CSV file.
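As a rough illustration of the CSV suggestion, the following sketch saves and loads a lookup table using only the standard library. It is deliberately independent of scrubadub: the function names, the column names, and the assumption that the table maps a hashed key to an integer identifier are all hypothetical, and wiring this into a PostProcessor is left out.

```python
import csv
import os


def save_table(table, path):
    """Write {hashed_key: identifier} pairs to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["hashed_key", "identifier"])
        for key, identifier in table.items():
            writer.writerow([key, identifier])


def load_table(path):
    """Read the table back; a missing file yields an empty table."""
    if not os.path.exists(path):
        return {}
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        return {row["hashed_key"]: int(row["identifier"]) for row in reader}
```

Unlike unpickling, reading this file only ever produces strings and integers, and the format does not depend on the Python version that wrote it.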

Since this MR is so stale I'm going to close this and create a linked issue instead.

Thanks again,
Thomas

thomasbird · 2021-10-06T16:03:59Z

Thinking about this a little more, you can now add the hash of the Filth into the replaced text, so for example:

import scrubadub
text = "contact Joe Duffy at [email protected]"
scrubber = scrubadub.Scrubber(
    detector_list=[
        'email', 'spacy'
    ],
    post_processor_list=[
        scrubadub.post_processors.FilthReplacer(include_hash=True),
        scrubadub.post_processors.PrefixSuffixReplacer(),
    ],
)
scrubber.clean(text)
# returns: 'contact {{NAME-99ACF74CCE90D307}} at {{EMAIL-04B67CC2890A2B7D}}'

This is kind of similar to what you wanted to achieve (I think): the same email is always replaced by the same placeholder, since the placeholder contains part of the hash.
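The reason this needs no persisted table can be shown with a small standalone sketch. This is not how scrubadub computes its hashes: the `placeholder` function, the use of HMAC-SHA256, the secret key, and the 16-character truncation are all assumptions made for illustration. Because the placeholder is derived purely from the text and a fixed key, the same input yields the same placeholder on every run.

```python
import hashlib
import hmac


def placeholder(filth_type, text, key):
    """Build a placeholder that embeds a truncated keyed hash of the filth text."""
    digest = hmac.new(key, text.encode("utf-8"), hashlib.sha256).hexdigest()
    # Keep only part of the digest so the placeholder stays short.
    return "{{" + filth_type.upper() + "-" + digest[:16].upper() + "}}"


key = b"hypothetical-secret-key"
first = placeholder("email", "joe@example.com", key)
second = placeholder("email", "joe@example.com", key)
other = placeholder("email", "mark@example.com", key)
print(first == second, first == other)
```

The trade-off compared with a lookup table is that the identifiers are opaque hashes rather than small sequential numbers, and consistency across runs depends on reusing the same key.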

RFC: add optional persistence to the lookup table

40f74d3

Signed-off-by: Uri Okrent <[email protected]>

ugtar force-pushed the master branch from ff7d6a2 to 40f74d3 Compare October 16, 2018 15:03

thomasbird closed this Oct 6, 2021

thomasbird mentioned this pull request Oct 6, 2021

Persist Filth lookup table #130

Open

thomasbird linked an issue Oct 6, 2021 that may be closed by this pull request

Persist Filth lookup table #130

Open
