Data deanonymization (#10093)
### Description

This PR implements pseudonymization of data with the ability to retrieve
the original text (deanonymization). To protect private data, for
example when querying external APIs (OpenAI), it is worth pseudonymizing
sensitive values before they are sent. After the model responds,
however, the data should be restored to its original form.

I implemented the `PresidioReversibleAnonymizer`, which consists of two
parts:

1. anonymization - it works the same way as `PresidioAnonymizer`, plus
the object itself stores a mapping of made-up values to original ones,
for example:
```
    {
        "PERSON": {
            "<anonymized>": "<original>",
            "John Doe": "Slim Shady"
        },
        "PHONE_NUMBER": {
            "111-111-1111": "555-555-5555"
        }
        ...
    }
```

2. deanonymization - using the mapping described above, it matches fake
data with original data and then substitutes it.

Between anonymization and deanonymization, the user can perform other
operations, for example passing the anonymized output to an LLM.
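
As a rough illustration of the deanonymization step described above, the
sketch below substitutes fake values back by exact string matching over a
mapping shaped like the example. The `deanonymize` helper and the sample
strings are hypothetical; this is not the `PresidioReversibleAnonymizer`
implementation, only a minimal model of the idea.

```python
from typing import Dict

# Mapping layout taken from the description:
# entity type -> {fake value: original value}
Mapping = Dict[str, Dict[str, str]]


def deanonymize(text: str, mapping: Mapping) -> str:
    """Replace every fake value found in `text` with its original value.

    Hypothetical helper: plain exact-string substitution, nothing more.
    """
    for fakes_to_originals in mapping.values():
        for fake, original in fakes_to_originals.items():
            text = text.replace(fake, original)
    return text


mapping: Mapping = {
    "PERSON": {"John Doe": "Slim Shady"},
    "PHONE_NUMBER": {"111-111-1111": "555-555-5555"},
}

# Stand-in for a model response that still contains the fake values.
llm_output = "Please contact John Doe at 111-111-1111."
restored = deanonymize(llm_output, mapping)
print(restored)
```

Because the substitution is exact, any fake value that the model rewrites
(shortens, reformats) would be left untouched by a helper like this.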

### Future work

- **instance anonymization** - at this point, each occurrence of PII is
treated as a separate entity and separately anonymized. Therefore, two
occurrences of the name John Doe in the text will be changed to two
different names. It is therefore worth introducing support for full
instance detection, so that repeated occurrences are treated as a single
object.
- **better matching and substitution of fake values for real ones** -
currently the strategy is based on matching full strings and then
substituting them. Because language models are nondeterministic, the
value in the answer may be slightly changed (e.g. *John Doe*
-> *John* or *Main St, New York* -> *New York*), in which case the
substitution is no longer possible. It is therefore worth adjusting the
matching to your needs.
- **Q&A with anonymization** - once all the functionality is written, a
useful documentation resource would be a notebook about retrieval from
documents using anonymization: an iterative process of adding new
recognizers to fit the data, lessons learned, and what to look out for.
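
The first two items above could be approached in many ways; the sketch
below is one possibility, not something this PR implements. The helper
names, the `PERSON_n` placeholder scheme, the `detected` list (standing in
for a PII detector's output), and the 0.85 threshold are all assumptions
made for illustration. It uses only the standard library (`difflib`).

```python
import re
from difflib import SequenceMatcher
from itertools import count
from typing import Dict, List


def anonymize_instances(text: str, detected: List[str], mapping: Dict[str, str]) -> str:
    """Reuse one fake value per distinct original (hypothetical helper).

    `mapping` is fake -> original and is updated in place.
    """
    original_to_fake = {orig: fake for fake, orig in mapping.items()}
    counter = count(len(mapping) + 1)
    for original in detected:
        if original not in original_to_fake:
            fake = f"PERSON_{next(counter)}"  # placeholder scheme is an assumption
            original_to_fake[original] = fake
            mapping[fake] = original
        text = text.replace(original, original_to_fake[original])
    return text


def fuzzy_deanonymize(text: str, mapping: Dict[str, str], threshold: float = 0.85) -> str:
    """Replace the closest word window to each fake value, if similar enough.

    Only the single best window per fake value is replaced, and windows are
    built from `\\w+` tokens, so this suits name-like values only.
    """
    for fake, original in mapping.items():
        n_words = len(fake.split())
        words = list(re.finditer(r"\w+", text))
        best_ratio, best_span = 0.0, None
        for i in range(len(words) - n_words + 1):
            start, end = words[i].start(), words[i + n_words - 1].end()
            ratio = SequenceMatcher(None, text[start:end].lower(), fake.lower()).ratio()
            if ratio > best_ratio:
                best_ratio, best_span = ratio, (start, end)
        if best_span is not None and best_ratio >= threshold:
            start, end = best_span
            text = text[:start] + original + text[end:]
    return text


instance_mapping: Dict[str, str] = {}
anonymized = anonymize_instances(
    "Slim Shady called. Later Slim Shady emailed.",
    ["Slim Shady", "Slim Shady"],
    instance_mapping,
)
print(anonymized)

# The model response misspells the fake name; exact matching would miss it.
print(fuzzy_deanonymize("Jon Doe lost his wallet.", {"John Doe": "Slim Shady"}))
```

A stricter threshold reduces false substitutions but misses larger
rewrites such as *John Doe* -> *John*; handling those would need a
different strategy (for example, matching individual name tokens).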

### Twitter handle
@deepsense_ai / @MaksOpp

---------

Co-authored-by: MaksOpp <[email protected]>
Co-authored-by: Bagatur <[email protected]>
3 people authored Sep 7, 2023
1 parent 67696fe commit 4cc4534
Showing 9 changed files with 988 additions and 80 deletions.
106 changes: 62 additions & 44 deletions docs/extras/guides/privacy/presidio_data_anonymization.ipynb
@@ -28,7 +28,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
@@ -47,16 +47,16 @@
},
{
"cell_type": "code",
"execution_count": 14,
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'My name is Mrs. Rachel Chen DDS, call me at 849-829-7628x073 or email me at christopherfrey@example.org'"
"'My name is Laura Ruiz, call me at +1-412-982-8374x13414 or email me at javierwatkins@example.net'"
]
},
"execution_count": 14,
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
@@ -82,7 +82,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
@@ -94,35 +94,53 @@
},
{
"cell_type": "code",
"execution_count": 16,
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"text = f\"\"\"Slim Shady recently lost his wallet. \n",
"Inside is some cash and his credit card with the number 4916 0387 9536 0861. \n",
"If you would find it, please call at 313-666-7440 or write an email here: [email protected].\"\"\""
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"AIMessage(content='You can find our super secret data at https://www.ross.com/', additional_kwargs={}, example=False)"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
"name": "stdout",
"output_type": "stream",
"text": [
"Dear Sir/Madam,\n",
"\n",
"We regret to inform you that Richard Fields has recently misplaced his wallet, which contains a sum of cash and his credit card bearing the number 30479847307774. \n",
"\n",
"Should you happen to come across it, we kindly request that you contact us immediately at 6439182672 or via email at [email protected].\n",
"\n",
"Thank you for your attention to this matter.\n",
"\n",
"Yours faithfully,\n",
"\n",
"[Your Name]\n"
]
}
],
"source": [
"from langchain.prompts.prompt import PromptTemplate\n",
"from langchain.chat_models import ChatOpenAI\n",
"from langchain.schema.runnable import RunnablePassthrough\n",
"\n",
"template = \"\"\"According to this text, where can you find our super secret data?\n",
"anonymizer = PresidioAnonymizer()\n",
"\n",
"{anonymized_text}\n",
"template = \"\"\"Rewrite this text into an official, short email:\n",
"\n",
"Answer:\"\"\"\n",
"{anonymized_text}\"\"\"\n",
"prompt = PromptTemplate.from_template(template)\n",
"llm = ChatOpenAI()\n",
"llm = ChatOpenAI(temperature=0)\n",
"\n",
"chain = {\"anonymized_text\": anonymizer.anonymize} | prompt | llm\n",
"chain.invoke(\"You can find our super secret data at https://supersecretdata.com\")"
"response = chain.invoke(text)\n",
"print(response.content)"
]
},
{
@@ -135,16 +153,16 @@
},
{
"cell_type": "code",
"execution_count": 18,
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'My name is Gabrielle Edwards, call me at 313-666-7440 or email me at [email protected]'"
"'My name is Adrian Fleming, call me at 313-666-7440 or email me at [email protected]'"
]
},
"execution_count": 18,
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
@@ -166,16 +184,16 @@
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'My name is Victoria Mckinney, call me at 713-549-8623 or email me at [email protected]'"
"'My name is Justin Miller, call me at 761-824-1889 or email me at [email protected]'"
]
},
"execution_count": 3,
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
@@ -201,16 +219,16 @@
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'My name is Billy Russo, call me at 970-996-9453x038 or email me at jamie80@example.org'"
"'My name is Dr. Jennifer Baker, call me at (508)839-9329x232 or email me at ehamilton@example.com'"
]
},
"execution_count": 4,
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
@@ -232,16 +250,16 @@
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'My polish phone number is EVIA70648911396944'"
"'My polish phone number is NRGN41434238921378'"
]
},
"execution_count": 5,
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
@@ -261,7 +279,7 @@
},
{
"cell_type": "code",
"execution_count": 6,
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
@@ -291,7 +309,7 @@
},
{
"cell_type": "code",
"execution_count": 7,
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
@@ -308,7 +326,7 @@
},
{
"cell_type": "code",
"execution_count": 8,
"execution_count": 12,
"metadata": {},
"outputs": [
{
@@ -337,16 +355,16 @@
},
{
"cell_type": "code",
"execution_count": 9,
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'+48 533 220 543'"
"'511 622 683'"
]
},
"execution_count": 9,
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
@@ -374,7 +392,7 @@
},
{
"cell_type": "code",
"execution_count": 10,
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
@@ -389,7 +407,7 @@
},
{
"cell_type": "code",
"execution_count": 11,
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
@@ -398,16 +416,16 @@
},
{
"cell_type": "code",
"execution_count": 12,
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'My polish phone number is +48 692 715 636'"
"'My polish phone number is +48 734 630 977'"
]
},
"execution_count": 12,
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
@@ -443,7 +461,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
"version": "3.11.4"
}
},
"nbformat": 4,