RFC: add optional persistence to the lookup table #39
Signed-off-by: Uri Okrent [email protected]
This is just an RFC for an idea I was toying with. It is in no way mergeable, but it is a working proof of concept (Python 3 only). Basically, I would like to be able to persist the identifier lookup table so that identifiers for filth remain consistent across runs.
The general usage is as follows. Create a scrubber as usual:
s = scrubadub.Scrubber()
Tell the scrubber that you would like to persist identifiers, and where:
s.persist_identifiers('/path/to/persistent/table')
Work as usual:
In [2]: text = "Mike is so cool, Joe is also alright"
In [5]: s.clean(text, replace_with='identifier')
Out[5]: '{{NAME-0}} is so cool, {{NAME-1}} is also alright'
Save the lookup table:
s.save_identifiers()
Later, to use your persisted table, call persist_identifiers with the same path as before:
s.persist_identifiers('/path/to/persistent/table')
Now you should get the same identifiers for filth that is already in the table:
In [6]: text = "Joe is ok, but Mark is my man"
In [7]: s.clean(text, replace_with='identifier')
Out[7]: '{{NAME-1}} is ok, but {{NAME-2}} is my man'
I realize that persisting the table opens up a bit of a can of worms in terms of security, so I store a simple one-way hash of each key plus a salt instead of the key itself. This is meant to prevent an attacker from simply cracking open the persistence database and reversing the scrubbing, or from using a rainbow table to do the same.
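For discussion, here is a minimal standalone sketch of the salted-hash idea. It is not the code in this PR and not scrubadub's API: the PersistentLookup class, its method names, the JSON file layout, and the choice of SHA-256 with a 16-byte random salt are all assumptions made for illustration.

```python
import hashlib
import json
import os
import tempfile


class PersistentLookup:
    """Hypothetical stand-in for the identifier lookup table (not scrubadub's class)."""

    def __init__(self, path):
        self.path = path
        if os.path.exists(path):
            # Reuse the salt and table written by an earlier run.
            with open(path) as f:
                data = json.load(f)
            self.salt = bytes.fromhex(data["salt"])
            self.table = data["table"]
        else:
            # First run: fresh random salt, empty table.
            self.salt = os.urandom(16)
            self.table = {}

    def _hash(self, kind, text):
        # Salted one-way hash, so the file never stores the raw filth text.
        digest = hashlib.sha256(self.salt + kind.encode() + b"\0" + text.encode())
        return digest.hexdigest()

    def identifier(self, kind, text):
        key = self._hash(kind, text)
        if key not in self.table:
            # Numbering is shared across all filth kinds in this sketch.
            self.table[key] = len(self.table)
        return "{{%s-%d}}" % (kind.upper(), self.table[key])

    def save(self):
        with open(self.path, "w") as f:
            json.dump({"salt": self.salt.hex(), "table": self.table}, f)


# First run: assign identifiers and persist the table.
path = os.path.join(tempfile.mkdtemp(), "table.json")
first = PersistentLookup(path)
mike = first.identifier("name", "Mike")
joe = first.identifier("name", "Joe")
first.save()

# Later run: the same path yields the same identifier for Joe.
second = PersistentLookup(path)
joe_again = second.identifier("name", "Joe")
mark = second.identifier("name", "Mark")
```

In this sketch the salt is stored next to the hashes. That blocks precomputed rainbow tables, but someone holding the file could still test guessed names one at a time, so keeping the salt somewhere else may be worth considering.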
Thoughts?