### Description

Pseudonymization of data with the ability to retrieve the original text (deanonymization) has been implemented. To protect private data, for example when querying external APIs (OpenAI), it is worth pseudonymizing sensitive values before they are sent. After the model responds, however, it is useful to restore the data to its original form.

I implemented the `PresidioReversibleAnonymizer`, which consists of two parts:

1. anonymization - it works the same way as `PresidioAnonymizer`, plus the object itself stores a mapping of made-up values to original ones, for example:
```
{
    "PERSON": {
        "<anonymized>": "<original>",
        "John Doe": "Slim Shady"
    },
    "PHONE_NUMBER": {
        "111-111-1111": "555-555-5555"
    }
    ...
}
```
2. deanonymization - using the mapping described above, it matches fake data with original data and then substitutes it.

Between anonymization and deanonymization, the user can perform other operations, for example passing the output to an LLM.

### Future works

- **instance anonymization** - at this point, each occurrence of PII is treated as a separate entity and anonymized separately. As a result, two occurrences of the name John Doe in the text will be changed to two different names. It is therefore worth introducing support for full instance detection, so that repeated occurrences are treated as a single object.
- **better matching and substitution of fake values for real ones** - currently the strategy is based on matching full strings and then substituting them. Because language models are non-deterministic, the value in the answer may be slightly changed (e.g. *John Doe* -> *John* or *Main St, New York* -> *New York*), and the substitution is then no longer possible. It is therefore worth adjusting the matching to your needs.
- **Q&A with anonymization** - once all the functionality is written, I think it would be a useful documentation resource to add a notebook about retrieval from documents using anonymization: an iterative process of adding new recognizers to fit the data, lessons learned, and what to look out for.

### Twitter handle

@deepsense_ai / @MaksOpp

---------

Co-authored-by: MaksOpp <[email protected]>
Co-authored-by: Bagatur <[email protected]>
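
The two-part flow from the description (anonymize while recording a fake-to-original mapping, then deanonymize by exact string substitution) can be illustrated with a minimal, standard-library-only sketch. This is not the `PresidioReversibleAnonymizer` implementation: the class name `ReversibleAnonymizerSketch`, the regex "detectors" in `PATTERNS`, and the fixed pools in `FAKE_POOLS` are all hypothetical stand-ins for Presidio's recognizers and fake-value generation.

```python
import re

# Hypothetical stand-ins for a real PII detector (assumption, not Presidio).
PATTERNS = {
    "PERSON": re.compile(r"\bSlim Shady\b"),
    "PHONE_NUMBER": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}

# Hypothetical pools of made-up values; a real anonymizer would generate them.
FAKE_POOLS = {
    "PERSON": ["John Doe", "Jane Roe"],
    "PHONE_NUMBER": ["111-111-1111", "222-222-2222"],
}


class ReversibleAnonymizerSketch:
    """Illustrative only: stores {entity: {fake: original}} while anonymizing."""

    def __init__(self):
        self.mapping = {entity: {} for entity in PATTERNS}
        self._fakes = {entity: iter(pool) for entity, pool in FAKE_POOLS.items()}

    def anonymize(self, text):
        for entity, pattern in PATTERNS.items():
            def substitute(match, entity=entity):
                # Each occurrence gets its own fake value, mirroring the
                # per-occurrence behaviour noted under "Future works".
                fake = next(self._fakes[entity])
                self.mapping[entity][fake] = match.group(0)
                return fake

            text = pattern.sub(substitute, text)
        return text

    def deanonymize(self, text):
        # Exact full-string matching: a fake value that the model altered
        # (e.g. "John Doe" shortened to "John") would not be restored.
        for entity_mapping in self.mapping.values():
            for fake, original in entity_mapping.items():
                text = text.replace(fake, original)
        return text


anonymizer = ReversibleAnonymizerSketch()
original_text = "Slim Shady lost his wallet, call 555-555-5555."
masked = anonymizer.anonymize(original_text)

# Stand-in for an LLM response that reuses the fake values verbatim.
response = "Dear John Doe, we will call 111-111-1111."
restored = anonymizer.deanonymize(response)
```

Because the mapping is keyed by the fake value, deanonymization only needs the text returned by the intermediate step, not the original input.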
1 parent 67696fe, commit 4cc4534
Showing 9 changed files with 988 additions and 80 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -28,7 +28,7 @@ | |
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"execution_count": 1, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
|
@@ -47,16 +47,16 @@ | |
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 14, | ||
"execution_count": 2, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"'My name is Mrs. Rachel Chen DDS, call me at 849-829-7628x073 or email me at christopherfrey@example.org'" | ||
"'My name is Laura Ruiz, call me at +1-412-982-8374x13414 or email me at javierwatkins@example.net'" | ||
] | ||
}, | ||
"execution_count": 14, | ||
"execution_count": 2, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
|
@@ -82,7 +82,7 @@ | |
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"execution_count": 3, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
|
@@ -94,35 +94,53 @@ | |
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 16, | ||
"execution_count": 4, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"text = f\"\"\"Slim Shady recently lost his wallet. \n", | ||
"Inside is some cash and his credit card with the number 4916 0387 9536 0861. \n", | ||
"If you would find it, please call at 313-666-7440 or write an email here: [email protected].\"\"\"" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 5, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"AIMessage(content='You can find our super secret data at https://www.ross.com/', additional_kwargs={}, example=False)" | ||
] | ||
}, | ||
"execution_count": 16, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"Dear Sir/Madam,\n", | ||
"\n", | ||
"We regret to inform you that Richard Fields has recently misplaced his wallet, which contains a sum of cash and his credit card bearing the number 30479847307774. \n", | ||
"\n", | ||
"Should you happen to come across it, we kindly request that you contact us immediately at 6439182672 or via email at [email protected].\n", | ||
"\n", | ||
"Thank you for your attention to this matter.\n", | ||
"\n", | ||
"Yours faithfully,\n", | ||
"\n", | ||
"[Your Name]\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"from langchain.prompts.prompt import PromptTemplate\n", | ||
"from langchain.chat_models import ChatOpenAI\n", | ||
"from langchain.schema.runnable import RunnablePassthrough\n", | ||
"\n", | ||
"template = \"\"\"According to this text, where can you find our super secret data?\n", | ||
"anonymizer = PresidioAnonymizer()\n", | ||
"\n", | ||
"{anonymized_text}\n", | ||
"template = \"\"\"Rewrite this text into an official, short email:\n", | ||
"\n", | ||
"Answer:\"\"\"\n", | ||
"{anonymized_text}\"\"\"\n", | ||
"prompt = PromptTemplate.from_template(template)\n", | ||
"llm = ChatOpenAI()\n", | ||
"llm = ChatOpenAI(temperature=0)\n", | ||
"\n", | ||
"chain = {\"anonymized_text\": anonymizer.anonymize} | prompt | llm\n", | ||
"chain.invoke(\"You can find our super secret data at https://supersecretdata.com\")" | ||
"response = chain.invoke(text)\n", | ||
"print(response.content)" | ||
] | ||
}, | ||
{ | ||
|
@@ -135,16 +153,16 @@ | |
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 18, | ||
"execution_count": 6, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"'My name is Gabrielle Edwards, call me at 313-666-7440 or email me at [email protected]'" | ||
"'My name is Adrian Fleming, call me at 313-666-7440 or email me at [email protected]'" | ||
] | ||
}, | ||
"execution_count": 18, | ||
"execution_count": 6, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
|
@@ -166,16 +184,16 @@ | |
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 3, | ||
"execution_count": 7, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"'My name is Victoria Mckinney, call me at 713-549-8623 or email me at [email protected]'" | ||
"'My name is Justin Miller, call me at 761-824-1889 or email me at [email protected]'" | ||
] | ||
}, | ||
"execution_count": 3, | ||
"execution_count": 7, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
|
@@ -201,16 +219,16 @@ | |
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 4, | ||
"execution_count": 8, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"'My name is Billy Russo, call me at 970-996-9453x038 or email me at jamie80@example.org'" | ||
"'My name is Dr. Jennifer Baker, call me at (508)839-9329x232 or email me at ehamilton@example.com'" | ||
] | ||
}, | ||
"execution_count": 4, | ||
"execution_count": 8, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
|
@@ -232,16 +250,16 @@ | |
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 5, | ||
"execution_count": 9, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"'My polish phone number is EVIA70648911396944'" | ||
"'My polish phone number is NRGN41434238921378'" | ||
] | ||
}, | ||
"execution_count": 5, | ||
"execution_count": 9, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
|
@@ -261,7 +279,7 @@ | |
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 6, | ||
"execution_count": 10, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
|
@@ -291,7 +309,7 @@ | |
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 7, | ||
"execution_count": 11, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
|
@@ -308,7 +326,7 @@ | |
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 8, | ||
"execution_count": 12, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
|
@@ -337,16 +355,16 @@ | |
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 9, | ||
"execution_count": 13, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"'+48 533 220 543'" | ||
"'511 622 683'" | ||
] | ||
}, | ||
"execution_count": 9, | ||
"execution_count": 13, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
|
@@ -374,7 +392,7 @@ | |
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 10, | ||
"execution_count": 14, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
|
@@ -389,7 +407,7 @@ | |
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 11, | ||
"execution_count": 15, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
|
@@ -398,16 +416,16 @@ | |
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 12, | ||
"execution_count": 16, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"'My polish phone number is +48 692 715 636'" | ||
"'My polish phone number is +48 734 630 977'" | ||
] | ||
}, | ||
"execution_count": 12, | ||
"execution_count": 16, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
|
@@ -443,7 +461,7 @@ | |
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.9.1" | ||
"version": "3.11.4" | ||
} | ||
}, | ||
"nbformat": 4, | ||
|