cr
baskaryan committed Oct 11, 2023
1 parent d7e4d98 commit be48f56
Showing 2 changed files with 42 additions and 101 deletions.
@@ -6,12 +6,12 @@
"source": [
"# QA with private data protection\n",
"\n",
"[![Open In Collab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/use_cases/question_answering/how_to/qa_privacy_protection.ipynb)\n",
"[![Open In Collab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/use_cases/question_answering/qa_privacy_protection.ipynb)\n",
"\n",
"\n",
"In this notebook, we will look at building a basic system for question answering, based on private data. Before feeding the LLM with this data, we need to protect it so that it doesn't go to an external API (e.g. OpenAI, Anthropic). Then, after receiving the model output, we would like the data to be restored to its original form. Below you can observe an example flow of this QA system:\n",
"\n",
"<img src=\"/img/qa_privacy_protection.png\" width=\"800\">\n",
"<img src=\"/img/qa_privacy_protection.png\" width=\"800\"/>\n",
"\n",
"\n",
"In the following notebook, we will not go into the details of how the anonymizer works. If you are interested, please visit [this part of the documentation](https://python.langchain.com/docs/guides/privacy/presidio_data_anonymization/).\n",
@@ -34,83 +34,58 @@
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"outputs": [],
"source": [
"from langchain.document_loaders import TextLoader\n",
"document_content = \"\"\"Date: October 19, 2021\n",
" Witness: John Doe\n",
" Subject: Testimony Regarding the Loss of Wallet\n",
"\n",
" Testimony Content:\n",
"\n",
" Hello Officer,\n",
"\n",
" My name is John Doe and on October 19, 2021, my wallet was stolen in the vicinity of Kilmarnock during a bike trip. This wallet contains some very important things to me.\n",
"\n",
"# Load test file with PII entities\n",
"loader = TextLoader(\"text_with_private_data.txt\")\n",
" Firstly, the wallet contains my credit card with number 4111 1111 1111 1111, which is registered under my name and linked to my bank account, PL61109010140000071219812874.\n",
"\n",
"documents = loader.load_and_split()\n",
"len(documents)"
" Additionally, the wallet had a driver's license - DL No: 999000680 issued to my name. It also houses my Social Security Number, 602-76-4532. \n",
"\n",
" What's more, I had my polish identity card there, with the number ABC123456.\n",
"\n",
" I would like this data to be secured and protected in all possible ways. I believe It was stolen at 9:30 AM.\n",
"\n",
" In case any information arises regarding my wallet, please reach out to me on my phone number, 999-888-7777, or through my personal email, [email protected].\n",
"\n",
" Please consider this information to be highly confidential and respect my privacy. \n",
"\n",
" The bank has been informed about the stolen credit card and necessary actions have been taken from their end. They will be reachable at their official email, [email protected].\n",
" My representative there is Victoria Cherry (her business phone: 987-654-3210).\n",
"\n",
" Thank you for your assistance,\n",
"\n",
" John Doe\"\"\""
]
},
{
"cell_type": "markdown",
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"We only have one document, so before we move on to creating a QA system, let's focus on its content to begin with.\n",
"from langchain.schema import Document\n",
"\n",
"You may observe that the text contains many different PII values, some types occur repeatedly (names, phone numbers, emails), and some specific PIIs are repeated (John Doe)."
"documents = [Document(page_content=document_content)]"
]
},
{
"cell_type": "code",
"execution_count": 3,
"cell_type": "markdown",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Date: October 19, 2021\n",
"Witness: John Doe\n",
"Subject: Testimony Regarding the Loss of Wallet\n",
"\n",
"Testimony Content:\n",
"\n",
"Hello Officer,\n",
"\n",
"My name is John Doe and on October 19, 2021, my wallet was stolen in the vicinity of Kilmarnock during a bike trip. This wallet contains some very important things to me.\n",
"\n",
"Firstly, the wallet contains my credit card with number 4111 1111 1111 1111, which is registered under my name and linked to my bank account, PL61109010140000071219812874.\n",
"\n",
"Additionally, the wallet had a driver's license - DL No: 999000680 issued to my name. It also houses my Social Security Number, 602-76-4532. \n",
"\n",
"What's more, I had my polish identity card there, with the number ABC123456.\n",
"\n",
"I would like this data to be secured and protected in all possible ways. I believe It was stolen at 9:30 AM.\n",
"\n",
"In case any information arises regarding my wallet, please reach out to me on my phone number, 999-888-7777, or through my personal email, [email protected].\n",
"\n",
"Please consider this information to be highly confidential and respect my privacy. \n",
"\n",
"The bank has been informed about the stolen credit card and necessary actions have been taken from their end. They will be reachable at their official email, [email protected].\n",
"My representative there is Victoria Cherry (her business phone: 987-654-3210).\n",
"\n",
"Thank you for your assistance,\n",
"\n",
"John Doe\n"
]
}
],
"source": [
"document_content = documents[0].page_content\n",
"We only have one document, so before we move on to creating a QA system, let's focus on its content to begin with.\n",
"\n",
"print(document_content)"
"You may observe that the text contains many different PII values, some types occur repeatedly (names, phone numbers, emails), and some specific PIIs are repeated (John Doe)."
]
},
{
@@ -656,10 +631,7 @@
"from langchain.embeddings.openai import OpenAIEmbeddings\n",
"from langchain.vectorstores import FAISS\n",
"\n",
"# 2. Load the data\n",
"loader = TextLoader(\"text_with_private_data.txt\")\n",
"documents = loader.load()\n",
"\n",
"# 2. Load the data: In our case data's already loaded\n",
"# 3. Anonymize the data before indexing\n",
"for doc in documents:\n",
" doc.page_content = anonymizer.anonymize(doc.page_content)\n",
@@ -856,9 +828,6 @@
"metadata": {},
"outputs": [],
"source": [
"loader = TextLoader(\"text_with_private_data.txt\")\n",
"documents = loader.load()\n",
"\n",
"text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)\n",
"chunks = text_splitter.split_documents(documents)\n",
"\n",
@@ -982,9 +951,9 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"display_name": "poetry-venv",
"language": "python",
"name": "python3"
"name": "poetry-venv"
},
"language_info": {
"codemirror_mode": {
@@ -996,7 +965,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
"version": "3.9.1"
}
},
"nbformat": 4,
28 changes: 0 additions & 28 deletions docs/docs/use_cases/question_answering/text_with_private_data.txt

This file was deleted.
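
The change in this commit replaces the file-based loader with a document built from an in-memory string, which the notebook then anonymizes before indexing and restores after the model answers. A minimal sketch of that round trip follows. It rests on stated assumptions: `Document` here is a hypothetical stand-in for `langchain.schema.Document` (text field only), and `ToyAnonymizer` is a hypothetical regex-based substitute, not the Presidio-based anonymizer the notebook uses. The two patterns and the shortened sample text are also illustrative.

```python
import re
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Document:
    """Hypothetical stand-in for langchain.schema.Document (text only)."""
    page_content: str


@dataclass
class ToyAnonymizer:
    """Hypothetical reversible anonymizer; NOT the notebook's Presidio-based one."""
    patterns: Dict[str, str]  # entity label -> regular expression
    mapping: Dict[str, str] = field(default_factory=dict)  # placeholder -> original

    def _placeholder_for(self, label: str, original: str) -> str:
        # Reuse the placeholder if this exact value was seen before,
        # so repeated PII maps to one consistent token.
        for placeholder, value in self.mapping.items():
            if value == original:
                return placeholder
        placeholder = f"<{label}_{len(self.mapping) + 1}>"
        self.mapping[placeholder] = original
        return placeholder

    def anonymize(self, text: str) -> str:
        # Replace every match of every pattern with its placeholder.
        for label, pattern in self.patterns.items():
            text = re.sub(
                pattern,
                lambda m, label=label: self._placeholder_for(label, m.group(0)),
                text,
            )
        return text

    def deanonymize(self, text: str) -> str:
        # Substitute each placeholder back with the value it replaced.
        for placeholder, original in self.mapping.items():
            text = text.replace(placeholder, original)
        return text


# Build the document from a string instead of reading a text file.
document_content = (
    "My Social Security Number is 602-76-4532. "
    "Please reach out to me on my phone number, 999-888-7777."
)
documents = [Document(page_content=document_content)]

anonymizer = ToyAnonymizer(
    patterns={
        "PHONE": r"\b\d{3}-\d{3}-\d{4}\b",
        "SSN": r"\b\d{3}-\d{2}-\d{4}\b",
    }
)

# Anonymize in place before the text would be indexed or sent anywhere.
for doc in documents:
    doc.page_content = anonymizer.anonymize(doc.page_content)

# Restore the original values, as would be done on the model's output.
restored = anonymizer.deanonymize(documents[0].page_content)
```

Keeping the placeholder-to-value mapping on the anonymizer object is what makes the restore step possible; the same instance has to be used for both directions.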
