forked from langchain-ai/langchain
Commit
Showing 2 changed files with 42 additions and 101 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -6,12 +6,12 @@ | |
"source": [ | ||
"# QA with private data protection\n", | ||
"\n", | ||
"[![Open In Collab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/use_cases/question_answering/how_to/qa_privacy_protection.ipynb)\n", | ||
"[![Open In Collab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/use_cases/question_answering/qa_privacy_protection.ipynb)\n", | ||
"\n", | ||
"\n", | ||
"In this notebook, we will look at building a basic system for question answering, based on private data. Before feeding the LLM with this data, we need to protect it so that it doesn't go to an external API (e.g. OpenAI, Anthropic). Then, after receiving the model output, we would like the data to be restored to its original form. Below you can observe an example flow of this QA system:\n", | ||
"\n", | ||
"<img src=\"/img/qa_privacy_protection.png\" width=\"800\">\n", | ||
"<img src=\"/img/qa_privacy_protection.png\" width=\"800\"/>\n", | ||
"\n", | ||
"\n", | ||
"In the following notebook, we will not go into the details of how the anonymizer works. If you are interested, please visit [this part of the documentation](https://python.langchain.com/docs/guides/privacy/presidio_data_anonymization/).\n", | ||
|
@@ -34,83 +34,58 @@ | |
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 2, | ||
"execution_count": 1, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"1" | ||
] | ||
}, | ||
"execution_count": 2, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"outputs": [], | ||
"source": [ | ||
"from langchain.document_loaders import TextLoader\n", | ||
"document_content = \"\"\"Date: October 19, 2021\n", | ||
" Witness: John Doe\n", | ||
" Subject: Testimony Regarding the Loss of Wallet\n", | ||
"\n", | ||
" Testimony Content:\n", | ||
"\n", | ||
" Hello Officer,\n", | ||
"\n", | ||
" My name is John Doe and on October 19, 2021, my wallet was stolen in the vicinity of Kilmarnock during a bike trip. This wallet contains some very important things to me.\n", | ||
"\n", | ||
"# Load test file with PII entities\n", | ||
"loader = TextLoader(\"text_with_private_data.txt\")\n", | ||
" Firstly, the wallet contains my credit card with number 4111 1111 1111 1111, which is registered under my name and linked to my bank account, PL61109010140000071219812874.\n", | ||
"\n", | ||
"documents = loader.load_and_split()\n", | ||
"len(documents)" | ||
" Additionally, the wallet had a driver's license - DL No: 999000680 issued to my name. It also houses my Social Security Number, 602-76-4532. \n", | ||
"\n", | ||
" What's more, I had my polish identity card there, with the number ABC123456.\n", | ||
"\n", | ||
" I would like this data to be secured and protected in all possible ways. I believe It was stolen at 9:30 AM.\n", | ||
"\n", | ||
" In case any information arises regarding my wallet, please reach out to me on my phone number, 999-888-7777, or through my personal email, [email protected].\n", | ||
"\n", | ||
" Please consider this information to be highly confidential and respect my privacy. \n", | ||
"\n", | ||
" The bank has been informed about the stolen credit card and necessary actions have been taken from their end. They will be reachable at their official email, [email protected].\n", | ||
" My representative there is Victoria Cherry (her business phone: 987-654-3210).\n", | ||
"\n", | ||
" Thank you for your assistance,\n", | ||
"\n", | ||
" John Doe\"\"\"" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"cell_type": "code", | ||
"execution_count": 4, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"We only have one document, so before we move on to creating a QA system, let's focus on its content to begin with.\n", | ||
"from langchain.schema import Document\n", | ||
"\n", | ||
"You may observe that the text contains many different PII values, some types occur repeatedly (names, phone numbers, emails), and some specific PIIs are repeated (John Doe)." | ||
"documents = [Document(page_content=document_content)]" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 3, | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"Date: October 19, 2021\n", | ||
"Witness: John Doe\n", | ||
"Subject: Testimony Regarding the Loss of Wallet\n", | ||
"\n", | ||
"Testimony Content:\n", | ||
"\n", | ||
"Hello Officer,\n", | ||
"\n", | ||
"My name is John Doe and on October 19, 2021, my wallet was stolen in the vicinity of Kilmarnock during a bike trip. This wallet contains some very important things to me.\n", | ||
"\n", | ||
"Firstly, the wallet contains my credit card with number 4111 1111 1111 1111, which is registered under my name and linked to my bank account, PL61109010140000071219812874.\n", | ||
"\n", | ||
"Additionally, the wallet had a driver's license - DL No: 999000680 issued to my name. It also houses my Social Security Number, 602-76-4532. \n", | ||
"\n", | ||
"What's more, I had my polish identity card there, with the number ABC123456.\n", | ||
"\n", | ||
"I would like this data to be secured and protected in all possible ways. I believe It was stolen at 9:30 AM.\n", | ||
"\n", | ||
"In case any information arises regarding my wallet, please reach out to me on my phone number, 999-888-7777, or through my personal email, [email protected].\n", | ||
"\n", | ||
"Please consider this information to be highly confidential and respect my privacy. \n", | ||
"\n", | ||
"The bank has been informed about the stolen credit card and necessary actions have been taken from their end. They will be reachable at their official email, [email protected].\n", | ||
"My representative there is Victoria Cherry (her business phone: 987-654-3210).\n", | ||
"\n", | ||
"Thank you for your assistance,\n", | ||
"\n", | ||
"John Doe\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"document_content = documents[0].page_content\n", | ||
"We only have one document, so before we move on to creating a QA system, let's focus on its content to begin with.\n", | ||
"\n", | ||
"print(document_content)" | ||
"You may observe that the text contains many different PII values, some types occur repeatedly (names, phone numbers, emails), and some specific PIIs are repeated (John Doe)." | ||
] | ||
}, | ||
{ | ||
|
@@ -656,10 +631,7 @@ | |
"from langchain.embeddings.openai import OpenAIEmbeddings\n", | ||
"from langchain.vectorstores import FAISS\n", | ||
"\n", | ||
"# 2. Load the data\n", | ||
"loader = TextLoader(\"text_with_private_data.txt\")\n", | ||
"documents = loader.load()\n", | ||
"\n", | ||
"# 2. Load the data: In our case data's already loaded\n", | ||
"# 3. Anonymize the data before indexing\n", | ||
"for doc in documents:\n", | ||
" doc.page_content = anonymizer.anonymize(doc.page_content)\n", | ||
|
@@ -856,9 +828,6 @@ | |
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"loader = TextLoader(\"text_with_private_data.txt\")\n", | ||
"documents = loader.load()\n", | ||
"\n", | ||
"text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)\n", | ||
"chunks = text_splitter.split_documents(documents)\n", | ||
"\n", | ||
|
@@ -982,9 +951,9 @@ | |
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3 (ipykernel)", | ||
"display_name": "poetry-venv", | ||
"language": "python", | ||
"name": "python3" | ||
"name": "poetry-venv" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
|
@@ -996,7 +965,7 @@ | |
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.11.4" | ||
"version": "3.9.1" | ||
} | ||
}, | ||
"nbformat": 4, | ||
|
28 changes: 0 additions & 28 deletions
docs/docs/use_cases/question_answering/text_with_private_data.txt
This file was deleted.
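Note on the change: the commit replaces the `TextLoader("text_with_private_data.txt")` calls with an inline `document_content` string wrapped in a `Document`, so the notebook no longer depends on the deleted text file. The sketch below illustrates that pattern together with the anonymize-then-restore flow the notebook's introduction describes. It is only an illustration under stated assumptions: `Document` here is a local stand-in for `langchain.schema.Document`, and `ToyReversibleAnonymizer` is a hypothetical, phone-number-only replacement for the notebook's `anonymizer` object, whose construction is not shown in this diff.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Document:
    """Local stand-in for langchain.schema.Document (only page_content is modelled)."""
    page_content: str


@dataclass
class ToyReversibleAnonymizer:
    """Hypothetical helper, not the notebook's anonymizer.

    Replaces 3-3-4 digit phone numbers with placeholders and remembers
    the placeholder -> original mapping so the text can be restored.
    """
    mapping: dict = field(default_factory=dict)

    def anonymize(self, text: str) -> str:
        def replace(match: re.Match) -> str:
            original = match.group(0)
            # Reuse the existing placeholder when the same value repeats.
            for placeholder, value in self.mapping.items():
                if value == original:
                    return placeholder
            placeholder = f"<PHONE_{len(self.mapping) + 1}>"
            self.mapping[placeholder] = original
            return placeholder

        return re.sub(r"\b\d{3}-\d{3}-\d{4}\b", replace, text)

    def deanonymize(self, text: str) -> str:
        # Substitute every placeholder back with its original value.
        for placeholder, value in self.mapping.items():
            text = text.replace(placeholder, value)
        return text


# Data is defined inline instead of being read from a file.
document_content = "Reach me at 999-888-7777. My representative's phone: 987-654-3210."
documents = [Document(page_content=document_content)]

# Anonymize before the text would be indexed or sent to an external API.
anonymizer = ToyReversibleAnonymizer()
for doc in documents:
    doc.page_content = anonymizer.anonymize(doc.page_content)

# Restore the original values, as would be done on the model output.
restored = anonymizer.deanonymize(documents[0].page_content)
```

Because `documents` is built once from the inline string, later cells can reuse it directly, which is why the repeated loader calls could be dropped in the hunks at lines 656 and 856.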