Add conversational entity linking into REL (#150)
* Add submodule for conversational entity linking

* Implement API for conversational entity linking

* Add script for testing the server response

* Re-implement server using fastapi/pydantic

* Implement conversational entity linking in server

* Refactor response handler code to submodule

* Remove `rel_ed` and make ConvEL depend on REL MD

* Add some docs for conversational entity linking

* Skip CREL test if running on CI

We have no way of testing because the data are not available.

* Remove unused code
stefsmeets authored Jan 24, 2023
1 parent b153d67 commit 0018b57
Showing 22 changed files with 2,294 additions and 167 deletions.
71 changes: 71 additions & 0 deletions docs/tutorials/conversations.md
@@ -0,0 +1,71 @@
# Conversational entity linking

The `crel` submodule provides the conversational entity linking tool trained on the [ConEL-2 dataset](https://github.com/informagi/conversational-entity-linking-2022#conel-2-conversational-entity-linking-dataset).

Unlike existing EL methods, `crel` is designed to identify both named entities and concepts.
It also uses coreference resolution to identify personal entities (e.g., linking "my city" to a previously mentioned London) and references to explicit entity mentions in the conversation.

This tutorial describes how to get started with conversational entity linking on a local machine.

For more information, see the original [repository on conversational entity linking](https://github.com/informagi/conversational-entity-linking-2022).

## Start with your local environment

### Step 1: Download models

First, download the models below:

- **MD for concepts and NEs**:
+ [Click here to download models](https://drive.google.com/file/d/1OoC2XZp4uBy0eB_EIuIhEHdcLEry2LtU/view?usp=sharing)
+ Extract `bert_conv-td` to your `base_url`
- **Personal Entity Linking**:
+ [Click here to download models](https://drive.google.com/file/d/1-jW8xkxh5GV-OuUBfMeT2Tk7tEzvH181/view?usp=sharing)
+ Extract `s2e_ast_onto` to your `base_url`

Additionally, conversational entity linking uses the wiki 2019 dataset. For more information on where to place the data and how to set the `base_url`, check out [this page](../how_to_get_started). If set up correctly, your `base_url` should contain these directories:


```bash
.
├── bert_conv-td
├── s2e_ast_onto
└── wiki_2019
```
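
To double-check the layout, here is a minimal sketch that verifies the three directories exist. It reads `base_url` from a `REL_BASE_URL` environment variable, the same convention used by the test scripts in this commit; the fallback path is a placeholder.

```python
import os
from pathlib import Path

# REL_BASE_URL is a convention used by the scripts in this commit,
# not something REL itself requires; adjust the fallback as needed.
base_url = Path(os.environ.get("REL_BASE_URL", "~/rel_data")).expanduser()
for name in ("bert_conv-td", "s2e_ast_onto", "wiki_2019"):
    status = "ok" if (base_url / name).is_dir() else "MISSING"
    print(f"{name}: {status}")
```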


### Step 2: Example code

This example shows how to link a short conversation. Note that the speakers must be named "USER" and "SYSTEM".


```python
>>> from REL.crel.conv_el import ConvEL
>>>
>>> cel = ConvEL(base_url="C:/path/to/base_url/")
>>>
>>> conversation = [
...     {"speaker": "USER",
...      "utterance": "I am allergic to tomatoes but we have a lot of famous Italian restaurants here in London.",},
...     {"speaker": "SYSTEM",
...      "utterance": "Some people are allergic to histamine in tomatoes.",},
...     {"speaker": "USER",
...      "utterance": "Talking of food, can you recommend me a restaurant in my city for our anniversary?",},
... ]
>>>
>>> annotated = cel.annotate(conversation)
>>> [item for item in annotated if item['speaker'] == 'USER']
[{'speaker': 'USER',
  'utterance': 'I am allergic to tomatoes but we have a lot of famous Italian restaurants here in London.',
  'annotations': [[17, 8, 'tomatoes', 'Tomato'],
                  [54, 19, 'Italian restaurants', 'Italian_cuisine'],
                  [82, 6, 'London', 'London']]},
 {'speaker': 'USER',
  'utterance': 'Talking of food, can you recommend me a restaurant in my city for our anniversary?',
  'annotations': [[11, 4, 'food', 'Food'],
                  [40, 10, 'restaurant', 'Restaurant'],
                  [54, 7, 'my city', 'London']]}]
```
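
Each annotation is a `[start, length, mention, entity]` quadruple, where `start` and `length` are character offsets into the utterance and `entity` is the linked entity. A minimal sketch for consuming the result (using the `annotated` list from the example above):

```python
for item in annotated:
    if item["speaker"] != "USER":
        continue
    for start, length, mention, entity in item["annotations"]:
        # The offsets index into the utterance string.
        assert item["utterance"][start:start + length] == mention
        print(f"{mention!r} -> {entity}")
```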

1 change: 1 addition & 0 deletions docs/tutorials/index.md
@@ -14,3 +14,4 @@ The remainder of the tutorials are optional and for users who wish to e.g. train
 5. [Reproducing our results](reproducing_our_results/)
 6. [REL as systemd service](systemd_instructions/)
 7. [Notes on using custom models](custom_models/)
+8. [Conversational entity linking](conversations/)
4 changes: 2 additions & 2 deletions mkdocs.yml
@@ -10,6 +10,7 @@ nav:
       - tutorials/reproducing_our_results.md
       - tutorials/systemd_instructions.md
       - tutorials/custom_models.md
+      - tutorials/conversations.md
   - Python API reference:
       - api/entity_disambiguation.md
       - api/generate_train_test.md
@@ -72,11 +73,10 @@ plugins:
             - https://numpy.org/doc/stable/objects.inv
             - https://docs.scipy.org/doc/scipy/objects.inv
             - https://pandas.pydata.org/docs/objects.inv
-          selection:
+          options:
             docstring_style: sphinx
             docstring_options:
               ignore_init_summary: yes
-          rendering:
             show_submodules: no
             show_source: true
             docstring_section_style: list
9 changes: 6 additions & 3 deletions requirements.txt
@@ -1,7 +1,10 @@
+anyascii
 colorama
-konoha
+fastapi
 flair>=0.11
+konoha
+nltk
+pydantic
 segtok
 torch
-nltk
-anyascii
+uvicorn
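
The new `fastapi`, `pydantic`, and `uvicorn` entries correspond to the server re-implementation mentioned in the commit message. For orientation, a minimal sketch of how these pieces typically fit together (illustrative only, not REL's actual server code; all names here are invented):

```python
from fastapi import FastAPI
from pydantic import BaseModel


class Document(BaseModel):
    # Same payload shape as `text1` in scripts/test_server.py
    text: str
    spans: list = []


app = FastAPI()


@app.post("/")
def annotate(doc: Document):
    # A real handler would run mention detection and disambiguation here.
    return []

# Run with, e.g.: uvicorn sketch:app --port 5555
```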
9 changes: 6 additions & 3 deletions scripts/efficiency_test.py
@@ -1,12 +1,15 @@
 import numpy as np
 import requests
+import os
 
 from REL.training_datasets import TrainingEvaluationDatasets
 
 np.random.seed(seed=42)
 
-base_url = "/Users/vanhulsm/Desktop/projects/data/"
-wiki_version = "wiki_2014"
+base_url = os.environ.get("REL_BASE_URL")
+wiki_version = "wiki_2019"
+host = 'localhost'
+port = '5555'
 datasets = TrainingEvaluationDatasets(base_url, wiki_version).load()["aida_testB"]
 
 # random_docs = np.random.choice(list(datasets.keys()), 50)
@@ -40,7 +43,7 @@
         print(myjson)
 
         print("Output API:")
-        print(requests.post("http://192.168.178.11:1235", json=myjson).json())
+        print(requests.post(f"http://{host}:{port}", json=myjson).json())
         print("----------------------------")


62 changes: 62 additions & 0 deletions scripts/test_server.py
@@ -0,0 +1,62 @@
import os
import requests

# Script for testing the implementation of the conversational entity linking API
#
# To run the server:
#
#   python ./src/REL/server.py $REL_BASE_URL wiki_2019      (bash)
# or
#   python .\src\REL\server.py $env:REL_BASE_URL wiki_2019  (PowerShell)
#
# Set $REL_BASE_URL to where your data are stored (`base_url`)
#
# These paths must exist:
# - `$REL_BASE_URL/bert_conv-td`
# - `$REL_BASE_URL/s2e_ast_onto`
#
# (see https://github.com/informagi/conversational-entity-linking-2022/tree/main/tool#step-1-download-models)
#


host = "localhost"
port = "5555"

text1 = {
    "text": "REL is a modular Entity Linking package that can both be integrated in existing pipelines or be used as an API.",
    "spans": [],
}

conv1 = {
    "text": [
        {
            "speaker": "USER",
            "utterance": "I am allergic to tomatoes but we have a lot of famous Italian restaurants here in London.",
        },
        {
            "speaker": "SYSTEM",
            "utterance": "Some people are allergic to histamine in tomatoes.",
        },
        {
            "speaker": "USER",
            "utterance": "Talking of food, can you recommend me a restaurant in my city for our anniversary?",
        },
    ]
}


for endpoint, myjson in (
    ("", text1),
    ("conversation/", conv1),
):
    print("Input API:")
    print(myjson)
    print()
    print("Output API:")
    print(requests.post(f"http://{host}:{port}/{endpoint}", json=myjson).json())
    print("----------------------------")
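
Based on the tutorial example above, a successful request to the `conversation/` endpoint should return per-turn annotations. The sketch below shows the general shape only (values copied from the tutorial example, not captured from a live server):

```python
# Illustrative response shape for the conversation/ endpoint
# (assumed to mirror ConvEL.annotate(); not captured output):
expected_shape = [
    {
        "speaker": "USER",
        "utterance": "I am allergic to tomatoes but ...",
        "annotations": [[17, 8, "tomatoes", "Tomato"]],
    },
    # ... one entry per conversation turn ...
]
```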

10 changes: 7 additions & 3 deletions setup.cfg
@@ -43,13 +43,17 @@ package_dir =
     = src
 include_package_data = True
 install_requires =
+    anyascii
     colorama
-    konoha
+    fastapi
     flair>=0.11
+    konoha
+    nltk
+    pydantic
     segtok
+    spacy
     torch
-    nltk
-    anyascii
+    uvicorn
 
 [options.extras_require]
 develop =
Empty file added src/REL/crel/__init__.py
Empty file.
94 changes: 94 additions & 0 deletions src/REL/crel/bert_md.py
@@ -0,0 +1,94 @@
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline


class BERT_MD:
    def __init__(self, file_pretrained):
        """
        Args:
            file_pretrained: path to the fine-tuned model directory,
                e.g., "./tmp/ft-conel/"
        Note:
            The output of self.ner_model(s_input) looks like:
            - s_input: e.g., 'Burger King franchise'
            - return: e.g., [{'entity': 'B-ment', 'score': 0.99364895, 'index': 1, 'word': 'Burger', 'start': 0, 'end': 6}, ...]
        """

        model = AutoModelForTokenClassification.from_pretrained(file_pretrained)
        device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        model.to(device)
        tokenizer = AutoTokenizer.from_pretrained(file_pretrained)
        self.ner_model = pipeline(
            "ner",
            model=model,
            tokenizer=tokenizer,
            device=device.index if device.index is not None else -1,
            ignore_labels=[],
        )

    def md(self, s, flag_warning=False):
        """Perform mention detection.

        Args:
            s: input string
            flag_warning: if True, print warning messages
        Returns:
            REL-style annotation results: [(start_position, length, mention), ...]
            E.g., [[0, 15, 'The Netherlands'], ...]
        """

        ann = self.ner_model(s)  # Get annotation results from the BERT-NER model

        ret = []
        pos_start, pos_end = -1, -1  # Span of the mention currently being built

        for i in range(len(ann)):
            w, ner = ann[i]["word"], ann[i]["entity"]
            assert ner in [
                "B-ment",
                "I-ment",
                "O",
            ], f"Unexpected ner tag: {ner}. If you use BERT-NER as it is, then you should set flag_use_normal_bert_ner_tag=True."
            if ner == "B-ment" and w[:2] != "##":
                if (pos_start != -1) and (pos_end != -1):  # A mention is already open
                    ret.append(
                        [pos_start, pos_end - pos_start, s[pos_start:pos_end]]
                    )  # Save the previously identified mention
                    pos_start, pos_end = -1, -1  # Reset
                pos_start, pos_end = ann[i]["start"], ann[i]["end"]  # Open a new mention

            elif ner == "B-ment" and w[:2] == "##":
                if (ann[i]["index"] == ann[i - 1]["index"] + 1) and (
                    ann[i - 1]["entity"] != "B-ment"
                ):  # The previous token carries a label that is NOT "B-ment"
                    # (i.e., a ##xxx subword should not begin an entity)
                    if flag_warning:
                        print(
                            f"WARNING: ##xxx (in this case {w}) should not be the begin of the entity"
                        )

            elif (
                i > 0
                and (ner == "I-ment")
                and (ann[i]["index"] == ann[i - 1]["index"] + 1)
            ):  # w is I-ment and directly follows the previous token
                pos_end = ann[i]["end"]  # Extend the current mention

            # This only happens when flag_ignore_o is False
            elif (
                ner == "O"
                and w[:2] == "##"
                and (
                    ann[i - 1]["entity"] == "B-ment" or ann[i - 1]["entity"] == "I-ment"
                )
            ):  # w is an "O"-labelled ##xxx subword directly after B-ment or I-ment
                pos_end = ann[i]["end"]  # Extend the current mention

        # Append the remaining mention, if any
        if (pos_start != -1) and (pos_end != -1):
            ret.append(
                [pos_start, pos_end - pos_start, s[pos_start:pos_end]]
            )  # Save the last mention

        return ret
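
A hypothetical usage sketch (the path is a placeholder pointing at the `bert_conv-td` checkpoint from the tutorial; the output format follows the `md()` docstring):

```python
# Placeholder path for illustration; point it at your extracted checkpoint.
md_model = BERT_MD("/path/to/base_url/bert_conv-td")
mentions = md_model.md("The Netherlands is famous for its tulips.")
print(mentions)  # REL-style triples, e.g. [[0, 15, 'The Netherlands'], ...]
```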