From be48f5619f7a60fb36727271953f785a93a6b5ac Mon Sep 17 00:00:00 2001 From: Bagatur Date: Wed, 11 Oct 2023 13:21:41 -0700 Subject: [PATCH] cr --- .../qa_privacy_protection.ipynb | 115 +++++++----------- .../text_with_private_data.txt | 28 ----- 2 files changed, 42 insertions(+), 101 deletions(-) rename docs/docs/{use_cases/question_answering => guides/privacy/presidio_data_anonymization}/qa_privacy_protection.ipynb (92%) delete mode 100644 docs/docs/use_cases/question_answering/text_with_private_data.txt diff --git a/docs/docs/use_cases/question_answering/qa_privacy_protection.ipynb b/docs/docs/guides/privacy/presidio_data_anonymization/qa_privacy_protection.ipynb similarity index 92% rename from docs/docs/use_cases/question_answering/qa_privacy_protection.ipynb rename to docs/docs/guides/privacy/presidio_data_anonymization/qa_privacy_protection.ipynb index f106ce66ddb36..df08756fb493b 100644 --- a/docs/docs/use_cases/question_answering/qa_privacy_protection.ipynb +++ b/docs/docs/guides/privacy/presidio_data_anonymization/qa_privacy_protection.ipynb @@ -6,12 +6,12 @@ "source": [ "# QA with private data protection\n", "\n", - "[![Open In Collab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/use_cases/question_answering/how_to/qa_privacy_protection.ipynb)\n", + "[![Open In Collab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/use_cases/question_answering/qa_privacy_protection.ipynb)\n", "\n", "\n", "In this notebook, we will look at building a basic system for question answering, based on private data. Before feeding the LLM with this data, we need to protect it so that it doesn't go to an external API (e.g. OpenAI, Anthropic). Then, after receiving the model output, we would like the data to be restored to its original form. 
Below you can observe an example flow of this QA system:\n", "\n", - "\n", + "\n", "\n", "\n", "In the following notebook, we will not go into the details of how the anonymizer works. If you are interested, please visit [this part of the documentation](https://python.langchain.com/docs/guides/privacy/presidio_data_anonymization/).\n", @@ -34,83 +34,58 @@ }, { "cell_type": "code", - "execution_count": 2, + "execution_count": 1, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "1" - ] - }, - "execution_count": 2, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ - "from langchain.document_loaders import TextLoader\n", + "document_content = \"\"\"Date: October 19, 2021\n", + " Witness: John Doe\n", + " Subject: Testimony Regarding the Loss of Wallet\n", + "\n", + " Testimony Content:\n", + "\n", + " Hello Officer,\n", + "\n", + " My name is John Doe and on October 19, 2021, my wallet was stolen in the vicinity of Kilmarnock during a bike trip. This wallet contains some very important things to me.\n", "\n", - "# Load test file with PII entities\n", - "loader = TextLoader(\"text_with_private_data.txt\")\n", + " Firstly, the wallet contains my credit card with number 4111 1111 1111 1111, which is registered under my name and linked to my bank account, PL61109010140000071219812874.\n", "\n", - "documents = loader.load_and_split()\n", - "len(documents)" + " Additionally, the wallet had a driver's license - DL No: 999000680 issued to my name. It also houses my Social Security Number, 602-76-4532. \n", + "\n", + " What's more, I had my polish identity card there, with the number ABC123456.\n", + "\n", + " I would like this data to be secured and protected in all possible ways. 
I believe It was stolen at 9:30 AM.\n", + "\n", + " In case any information arises regarding my wallet, please reach out to me on my phone number, 999-888-7777, or through my personal email, johndoe@example.com.\n", + "\n", + " Please consider this information to be highly confidential and respect my privacy. \n", + "\n", + " The bank has been informed about the stolen credit card and necessary actions have been taken from their end. They will be reachable at their official email, support@bankname.com.\n", + " My representative there is Victoria Cherry (her business phone: 987-654-3210).\n", + "\n", + " Thank you for your assistance,\n", + "\n", + " John Doe\"\"\"" ] }, { - "cell_type": "markdown", + "cell_type": "code", + "execution_count": 4, "metadata": {}, + "outputs": [], "source": [ - "We only have one document, so before we move on to creating a QA system, let's focus on its content to begin with.\n", + "from langchain.schema import Document\n", "\n", - "You may observe that the text contains many different PII values, some types occur repeatedly (names, phone numbers, emails), and some specific PIIs are repeated (John Doe)." + "documents = [Document(page_content=document_content)]" ] }, { - "cell_type": "code", - "execution_count": 3, + "cell_type": "markdown", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Date: October 19, 2021\n", - "Witness: John Doe\n", - "Subject: Testimony Regarding the Loss of Wallet\n", - "\n", - "Testimony Content:\n", - "\n", - "Hello Officer,\n", - "\n", - "My name is John Doe and on October 19, 2021, my wallet was stolen in the vicinity of Kilmarnock during a bike trip. 
This wallet contains some very important things to me.\n", - "\n", - "Firstly, the wallet contains my credit card with number 4111 1111 1111 1111, which is registered under my name and linked to my bank account, PL61109010140000071219812874.\n", - "\n", - "Additionally, the wallet had a driver's license - DL No: 999000680 issued to my name. It also houses my Social Security Number, 602-76-4532. \n", - "\n", - "What's more, I had my polish identity card there, with the number ABC123456.\n", - "\n", - "I would like this data to be secured and protected in all possible ways. I believe It was stolen at 9:30 AM.\n", - "\n", - "In case any information arises regarding my wallet, please reach out to me on my phone number, 999-888-7777, or through my personal email, johndoe@example.com.\n", - "\n", - "Please consider this information to be highly confidential and respect my privacy. \n", - "\n", - "The bank has been informed about the stolen credit card and necessary actions have been taken from their end. They will be reachable at their official email, support@bankname.com.\n", - "My representative there is Victoria Cherry (her business phone: 987-654-3210).\n", - "\n", - "Thank you for your assistance,\n", - "\n", - "John Doe\n" - ] - } - ], "source": [ - "document_content = documents[0].page_content\n", + "We only have one document, so before we move on to creating a QA system, let's focus on its content to begin with.\n", "\n", - "print(document_content)" + "You may observe that the text contains many different PII values, some types occur repeatedly (names, phone numbers, emails), and some specific PIIs are repeated (John Doe)." ] }, { @@ -656,10 +631,7 @@ "from langchain.embeddings.openai import OpenAIEmbeddings\n", "from langchain.vectorstores import FAISS\n", "\n", - "# 2. Load the data\n", - "loader = TextLoader(\"text_with_private_data.txt\")\n", - "documents = loader.load()\n", - "\n", + "# 2. Load the data: In our case data's already loaded\n", "# 3. 
Anonymize the data before indexing\n", "for doc in documents:\n", " doc.page_content = anonymizer.anonymize(doc.page_content)\n", @@ -856,9 +828,6 @@ "metadata": {}, "outputs": [], "source": [ - "loader = TextLoader(\"text_with_private_data.txt\")\n", - "documents = loader.load()\n", - "\n", "text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)\n", "chunks = text_splitter.split_documents(documents)\n", "\n", @@ -982,9 +951,9 @@ ], "metadata": { "kernelspec": { - "display_name": "Python 3 (ipykernel)", + "display_name": "poetry-venv", "language": "python", - "name": "python3" + "name": "poetry-venv" }, "language_info": { "codemirror_mode": { @@ -996,7 +965,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.11.4" + "version": "3.9.1" } }, "nbformat": 4, diff --git a/docs/docs/use_cases/question_answering/text_with_private_data.txt b/docs/docs/use_cases/question_answering/text_with_private_data.txt deleted file mode 100644 index 75c41931ae323..0000000000000 --- a/docs/docs/use_cases/question_answering/text_with_private_data.txt +++ /dev/null @@ -1,28 +0,0 @@ -Date: October 19, 2021 -Witness: John Doe -Subject: Testimony Regarding the Loss of Wallet - -Testimony Content: - -Hello Officer, - -My name is John Doe and on October 19, 2021, my wallet was stolen in the vicinity of Kilmarnock during a bike trip. This wallet contains some very important things to me. - -Firstly, the wallet contains my credit card with number 4111 1111 1111 1111, which is registered under my name and linked to my bank account, PL61109010140000071219812874. - -Additionally, the wallet had a driver's license - DL No: 999000680 issued to my name. It also houses my Social Security Number, 602-76-4532. - -What's more, I had my polish identity card there, with the number ABC123456. - -I would like this data to be secured and protected in all possible ways. I believe It was stolen at 9:30 AM. 
- -In case any information arises regarding my wallet, please reach out to me on my phone number, 999-888-7777, or through my personal email, johndoe@example.com. - -Please consider this information to be highly confidential and respect my privacy. - -The bank has been informed about the stolen credit card and necessary actions have been taken from their end. They will be reachable at their official email, support@bankname.com. -My representative there is Victoria Cherry (her business phone: 987-654-3210). - -Thank you for your assistance, - -John Doe \ No newline at end of file
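Review note: the patch replaces the `TextLoader("text_with_private_data.txt")` step with an inline `document_content` string wrapped in a `Document`, and the later cells anonymize `doc.page_content` in place before indexing. The sketch below mimics that flow in plain Python, under stated assumptions: the `Document` dataclass is a hypothetical stand-in for `langchain.schema.Document`, and the regex-based `anonymize` function is a deliberately naive placeholder for the notebook's Presidio-based anonymizer, not its actual behavior.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical stand-in for langchain.schema.Document (only the fields used here)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


# Assumption: two naive patterns for illustration only; the notebook relies on
# a Presidio-based anonymizer, which detects far more entity types.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def anonymize(text: str) -> str:
    """Replace every match of each pattern with a <LABEL> placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text


# Inline content instead of reading a separate text file (one sentence of the testimony).
document_content = (
    "Please reach out to me on my phone number, 999-888-7777, "
    "or through my personal email, johndoe@example.com."
)
documents = [Document(page_content=document_content)]

# Same loop shape as the notebook: anonymize each document in place before indexing.
for doc in documents:
    doc.page_content = anonymize(doc.page_content)

print(documents[0].page_content)
```

Keeping the sample text inline removes the notebook's dependency on a file at a relative path, which matters when the notebook is moved between directories, as it is in this patch.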