From 947934443d3756b315aab5996ea31ea6c8233dbd Mon Sep 17 00:00:00 2001 From: maks-operlejn-ds Date: Tue, 10 Oct 2023 23:46:37 +0000 Subject: [PATCH] Updated notebook with descriptions and better LCEL --- .../how_to/qa_privacy_protection.ipynb | 430 ++++++++++++------ .../how_to/text_with_private_data.txt | 2 +- 2 files changed, 300 insertions(+), 132 deletions(-) diff --git a/docs/docs_skeleton/docs/use_cases/question_answering/how_to/qa_privacy_protection.ipynb b/docs/docs_skeleton/docs/use_cases/question_answering/how_to/qa_privacy_protection.ipynb index de9b5b5eeffd2..f11c026b4d276 100644 --- a/docs/docs_skeleton/docs/use_cases/question_answering/how_to/qa_privacy_protection.ipynb +++ b/docs/docs_skeleton/docs/use_cases/question_answering/how_to/qa_privacy_protection.ipynb @@ -9,8 +9,11 @@ "[![Open In Collab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/extras/guides/privacy/presidio_data_anonymization/reversible.ipynb)\n", "\n", "\n", - "[TODO: opis]\n", + "In this notebook, we will look at building a basic system for question answering, based on private data. Before feeding the LLM with this data, we need to protect it so that it doesn't go to an external API (e.g. OpenAI, Anthropic). Then, after receiving the model output, we would like the data to be restored to its original form. Below you can observe an example flow of this QA system:\n", "\n", + "\n", + "\n", + "In the following notebook, we will not go into the details of how the anonymizer works. 
If you are interested, please visit [this part of the documentation](https://python.langchain.com/docs/guides/privacy/presidio_data_anonymization/).\n", "\n", "## Quickstart\n", "\n", @@ -30,7 +33,7 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 2, "metadata": {}, "outputs": [ { @@ -39,7 +42,7 @@ "1" ] }, - "execution_count": 3, + "execution_count": 2, "metadata": {}, "output_type": "execute_result" } @@ -47,6 +50,7 @@ "source": [ "from langchain.document_loaders import TextLoader\n", "\n", + "# Load test file with PII entities\n", "loader = TextLoader(\"text_with_private_data.txt\")\n", "\n", "documents = loader.load_and_split()\n", @@ -54,17 +58,17 @@ ] }, { - "cell_type": "code", - "execution_count": 4, + "cell_type": "markdown", "metadata": {}, - "outputs": [], "source": [ - "document_content = documents[0].page_content" + "We only have one document, so before we move on to creating a QA system, let's focus on its content to begin with.\n", + "\n", + "You may observe that the text contains many different PII values, some types occur repeatedly (names, phone numbers, emails), and some specific PIIs are repeated (Maks Operlejn)." ] }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 3, "metadata": {}, "outputs": [ { @@ -81,7 +85,7 @@ "\n", "My name is Maks Operlejn and on October 19, 2021, my wallet was stolen in the vicinity of Kilmarnock during a bike trip. This wallet contains some very important things to me.\n", "\n", - "Firstly, the wallet contains my credit card with number 5412 5412 5412 5412, which is registered under my name and linked to my bank account, PL61109010140000071219812874.\n", + "Firstly, the wallet contains my credit card with number 4111 1111 1111 1111, which is registered under my name and linked to my bank account, PL61109010140000071219812874.\n", "\n", "Additionally, the wallet had a driver's license - DL No: 999000680 issued to my name. 
It also houses my Social Security Number, 602-76-4532. \n", "\n", @@ -103,12 +107,14 @@ } ], "source": [ + "document_content = documents[0].page_content\n", + "\n", "print(document_content)" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 4, "metadata": {}, "outputs": [], "source": [ @@ -123,9 +129,16 @@ " print(colored_string)" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's proceed and try to anonymize the text with the default settings. For now, we don't replace the data with synthetic values; we just mark it with markers (e.g. `<PERSON>`), so we set `add_default_faker_operators=False`:" + ] + }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 5, "metadata": {}, "outputs": [ { @@ -142,7 +155,7 @@ "\n", "My name is \u001b[31m\u001b[0m and on \u001b[31m\u001b[0m, my wallet was stolen in the vicinity of \u001b[31m\u001b[0m during a bike trip. This wallet contains some very important things to me.\n", "\n", - "Firstly, the wallet contains my credit card with number 5412 5412 5412 5412, which is registered under my name and linked to my bank account, \u001b[31m\u001b[0m.\n", + "Firstly, the wallet contains my credit card with number \u001b[31m\u001b[0m, which is registered under my name and linked to my bank account, \u001b[31m\u001b[0m.\n", "\n", "Additionally, the wallet had a driver's license - DL No: \u001b[31m\u001b[0m issued to my name. It also houses my Social Security Number, \u001b[31m\u001b[0m. \n", "\n", @@ -155,7 +168,7 @@ "Please consider this information to be highly confidential and respect my privacy. \n", "\n", "The bank has been informed about the stolen credit card and necessary actions have been taken from their end. 
They will be reachable at their official email, \u001b[31m\u001b[0m.\n", - "My representative there is \u001b[31m\u001b[0m (her business phone: \u001b[31m\u001b[0m).\n", + "My representative there is \u001b[31m\u001b[0m (her business phone: \u001b[31m\u001b[0m).\n", "\n", "Thank you for your assistance,\n", "\n", @@ -173,23 +186,31 @@ "print_colored_pii(anonymizer.anonymize(document_content))" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's also look at the mapping between original and anonymized values:" + ] + }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "{'DATE_TIME': {'': 'October 19, 2021', '': '9:30 AM'},\n", + "{'CREDIT_CARD': {'': '4111 1111 1111 1111'},\n", + " 'DATE_TIME': {'': 'October 19, 2021', '': '9:30 AM'},\n", " 'EMAIL_ADDRESS': {'': 'maksoperlejn@example.com',\n", " '': 'support@bankname.com'},\n", " 'IBAN_CODE': {'': 'PL61109010140000071219812874'},\n", " 'LOCATION': {'': 'Kilmarnock'},\n", " 'PERSON': {'': 'Maks Operlejn', '': 'Victoria Cherry'},\n", - " 'PHONE_NUMBER': {'': '999-888-7777',\n", - " '': '987-654-3210'},\n", + " 'PHONE_NUMBER': {'': '999-888-7777'},\n", + " 'UK_NHS': {'': '987-654-3210'},\n", " 'US_DRIVER_LICENSE': {'': '999000680'},\n", " 'US_SSN': {'': '602-76-4532'}}\n" ] @@ -201,9 +222,24 @@ "pprint.pprint(anonymizer.deanonymizer_mapping)" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In general, the anonymizer works pretty well, but there are two things to improve here:\n", + "\n", + "1. Datetime redundancy - we have two different entities recognized as `DATE_TIME`, but they contain different types of information. The first one is a date (*October 19, 2021*), the second one is a time (*9:30 AM*). We can improve this by adding a new recognizer to the anonymizer, which will treat time separately from the date.\n", + "2. 
Polish ID - the Polish ID has a unique pattern, which is not part of the anonymizer's default recognizers. The value *ABC123456* is not anonymized.\n", + "\n", + "The solution is simple: we need to add new recognizers to the anonymizer. You can read more about it in the [Presidio documentation](https://microsoft.github.io/presidio/analyzer/adding_recognizers/).\n", + "\n", + "\n", + "Let's add new recognizers:" + ] + }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 7, "metadata": {}, "outputs": [], "source": [ @@ -229,9 +265,16 @@ "time_recognizer = PatternRecognizer(supported_entity=\"TIME\", patterns=[time_pattern])" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "And now, we're adding recognizers to our anonymizer:" + ] + }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 8, "metadata": {}, "outputs": [], "source": [ @@ -239,18 +282,32 @@ "anonymizer.add_recognizer(time_recognizer)" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that our anonymization instance remembers previously detected and anonymized values, including those that were not detected correctly (e.g., *\"9:30 AM\"* taken as `DATE_TIME`). So it's worth removing this value, or resetting the entire mapping now that our recognizers have been updated:" + ] + }, { "cell_type": "code", - "execution_count": 10, + "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "anonymizer.reset_deanonymizer_mapping()" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's anonymize the text and see the results:" + ] + }, { "cell_type": "code", - "execution_count": 11, + "execution_count": 10, "metadata": {}, "outputs": [ { @@ -267,7 +324,7 @@ "\n", "My name is \u001b[31m\u001b[0m and on \u001b[31m\u001b[0m, my wallet was stolen in the vicinity of \u001b[31m\u001b[0m during a bike trip. 
This wallet contains some very important things to me.\n", "\n", - "Firstly, the wallet contains my credit card with number 5412 5412 5412 5412, which is registered under my name and linked to my bank account, \u001b[31m\u001b[0m.\n", + "Firstly, the wallet contains my credit card with number \u001b[31m\u001b[0m, which is registered under my name and linked to my bank account, \u001b[31m\u001b[0m.\n", "\n", "Additionally, the wallet had a driver's license - DL No: \u001b[31m\u001b[0m issued to my name. It also houses my Social Security Number, \u001b[31m\u001b[0m. \n", "\n", @@ -280,7 +337,7 @@ "Please consider this information to be highly confidential and respect my privacy. \n", "\n", "The bank has been informed about the stolen credit card and necessary actions have been taken from their end. They will be reachable at their official email, \u001b[31m\u001b[0m.\n", - "My representative there is \u001b[31m\u001b[0m (her business phone: \u001b[31m\u001b[0m).\n", + "My representative there is \u001b[31m\u001b[0m (her business phone: \u001b[31m\u001b[0m).\n", "\n", "Thank you for your assistance,\n", "\n", @@ -294,23 +351,24 @@ }, { "cell_type": "code", - "execution_count": 12, + "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "{'DATE_TIME': {'': 'October 19, 2021'},\n", + "{'CREDIT_CARD': {'': '4111 1111 1111 1111'},\n", + " 'DATE_TIME': {'': 'October 19, 2021'},\n", " 'EMAIL_ADDRESS': {'': 'maksoperlejn@example.com',\n", " '': 'support@bankname.com'},\n", " 'IBAN_CODE': {'': 'PL61109010140000071219812874'},\n", " 'LOCATION': {'': 'Kilmarnock'},\n", " 'PERSON': {'': 'Maks Operlejn', '': 'Victoria Cherry'},\n", - " 'PHONE_NUMBER': {'': '999-888-7777',\n", - " '': '987-654-3210'},\n", + " 'PHONE_NUMBER': {'': '999-888-7777'},\n", " 'POLISH_ID': {'': 'ABC123456'},\n", " 'TIME': {'