Add conversational entity linking into REL (#150)
* Add submodule for conversational entity linking

* Implement API for conversational entity linking

* Add script for testing the server response

* Re-implement server using fastapi/pydantic

* Implement conversational entity linking in server

* Refactor response handler code to submodule

* Remove `rel_ed` and make ConvEL depend on REL MD

* Add some docs for conversational entity linking

* Skip CREL test if running on CI

We have no way of testing because the data are not available.

* Remove unused code
stefsmeets authored Jan 24, 2023
1 parent b153d67 commit 0018b57
Showing 22 changed files with 2,294 additions and 167 deletions.
71 changes: 71 additions & 0 deletions docs/tutorials/conversations.md
@@ -0,0 +1,71 @@
# Conversational entity linking

The `crel` submodule provides the conversational entity linking tool trained on the [ConEL-2 dataset](https://github.com/informagi/conversational-entity-linking-2022#conel-2-conversational-entity-linking-dataset).

Unlike existing EL methods, `crel` is designed to identify both named entities and concepts.
It also uses coreference resolution to identify personal entities (e.g., linking "my city" to a previously mentioned London) and references to explicit entity mentions in the conversation.

This tutorial describes how to get started with conversational entity linking on a local machine.

For more information, see the original [repository on conversational entity linking](https://github.com/informagi/conversational-entity-linking-2022).

## Start with your local environment

### Step 1: Download models

First, download the models below:

- **MD for concepts and NEs**:
+ [Click here to download models](https://drive.google.com/file/d/1OoC2XZp4uBy0eB_EIuIhEHdcLEry2LtU/view?usp=sharing)
+ Extract `bert_conv-td` to your `base_url`
- **Personal Entity Linking**:
+ [Click here to download models](https://drive.google.com/file/d/1-jW8xkxh5GV-OuUBfMeT2Tk7tEzvH181/view?usp=sharing)
+ Extract `s2e_ast_onto` to your `base_url`

Additionally, conversational entity linking uses the wiki 2019 dataset. For more information on where to place the data and how to set the `base_url`, check out [this page](../how_to_get_started). If set up correctly, your `base_url` should contain these directories:


```bash
.
├── bert_conv-td
├── s2e_ast_onto
└── wiki_2019
```
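
To double-check the layout, here is a minimal sketch that verifies the three directories exist. It reads `base_url` from a `REL_BASE_URL` environment variable, the same convention used by the test scripts in this commit; the fallback path is a placeholder.

```python
import os
from pathlib import Path

# REL_BASE_URL is a convention used by the scripts in this commit,
# not something REL itself requires; adjust the fallback as needed.
base_url = Path(os.environ.get("REL_BASE_URL", "~/rel_data")).expanduser()
for name in ("bert_conv-td", "s2e_ast_onto", "wiki_2019"):
    status = "ok" if (base_url / name).is_dir() else "MISSING"
    print(f"{name}: {status}")
```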


### Step 2: Example code

This example shows how to link a short conversation. Note that the speakers must be named "USER" and "SYSTEM".


```python
>>> from REL.crel.conv_el import ConvEL
>>>
>>> cel = ConvEL(base_url="C:/path/to/base_url/")
>>>
>>> conversation = [
...     {"speaker": "USER",
...      "utterance": "I am allergic to tomatoes but we have a lot of famous Italian restaurants here in London.",},
...     {"speaker": "SYSTEM",
...      "utterance": "Some people are allergic to histamine in tomatoes.",},
...     {"speaker": "USER",
...      "utterance": "Talking of food, can you recommend me a restaurant in my city for our anniversary?",},
... ]
>>>
>>> annotated = cel.annotate(conversation)
>>> [item for item in annotated if item['speaker'] == 'USER']
[{'speaker': 'USER',
  'utterance': 'I am allergic to tomatoes but we have a lot of famous Italian restaurants here in London.',
  'annotations': [[17, 8, 'tomatoes', 'Tomato'],
                  [54, 19, 'Italian restaurants', 'Italian_cuisine'],
                  [82, 6, 'London', 'London']]},
 {'speaker': 'USER',
  'utterance': 'Talking of food, can you recommend me a restaurant in my city for our anniversary?',
  'annotations': [[11, 4, 'food', 'Food'],
                  [40, 10, 'restaurant', 'Restaurant'],
                  [54, 7, 'my city', 'London']]}]
```
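
Each annotation is a `[start, length, mention, entity]` quadruple, where `start` and `length` are character offsets into the utterance and `entity` is the linked entity. A minimal sketch for consuming the result (using the `annotated` list from the example above):

```python
for item in annotated:
    if item["speaker"] != "USER":
        continue
    for start, length, mention, entity in item["annotations"]:
        # The offsets index into the utterance string.
        assert item["utterance"][start:start + length] == mention
        print(f"{mention!r} -> {entity}")
```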

1 change: 1 addition & 0 deletions docs/tutorials/index.md
@@ -14,3 +14,4 @@ The remainder of the tutorials are optional and for users who wish to e.g. train
 5. [Reproducing our results](reproducing_our_results/)
 6. [REL as systemd service](systemd_instructions/)
 7. [Notes on using custom models](custom_models/)
+8. [Conversational entity linking](conversations/)
4 changes: 2 additions & 2 deletions mkdocs.yml
@@ -10,6 +10,7 @@ nav:
       - tutorials/reproducing_our_results.md
       - tutorials/systemd_instructions.md
       - tutorials/custom_models.md
+      - tutorials/conversations.md
   - Python API reference:
       - api/entity_disambiguation.md
       - api/generate_train_test.md
@@ -72,11 +73,10 @@ plugins:
             - https://numpy.org/doc/stable/objects.inv
             - https://docs.scipy.org/doc/scipy/objects.inv
             - https://pandas.pydata.org/docs/objects.inv
-          selection:
+          options:
             docstring_style: sphinx
             docstring_options:
               ignore_init_summary: yes
-          rendering:
             show_submodules: no
             show_source: true
             docstring_section_style: list
9 changes: 6 additions & 3 deletions requirements.txt
@@ -1,7 +1,10 @@
+anyascii
 colorama
-konoha
+fastapi
 flair>=0.11
+konoha
+nltk
+pydantic
 segtok
 torch
-nltk
-anyascii
+uvicorn
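
The new `fastapi`, `pydantic`, and `uvicorn` entries correspond to the server re-implementation mentioned in the commit message. For orientation, a minimal sketch of how these pieces typically fit together (illustrative only, not REL's actual server code; all names here are invented):

```python
from fastapi import FastAPI
from pydantic import BaseModel


class Document(BaseModel):
    # Same payload shape as `text1` in scripts/test_server.py
    text: str
    spans: list = []


app = FastAPI()


@app.post("/")
def annotate(doc: Document):
    # A real handler would run mention detection and disambiguation here.
    return []

# Run with, e.g.: uvicorn sketch:app --port 5555
```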
9 changes: 6 additions & 3 deletions scripts/efficiency_test.py
@@ -1,12 +1,15 @@
 import numpy as np
 import requests
+import os
 
 from REL.training_datasets import TrainingEvaluationDatasets
 
 np.random.seed(seed=42)
 
-base_url = "/Users/vanhulsm/Desktop/projects/data/"
-wiki_version = "wiki_2014"
+base_url = os.environ.get("REL_BASE_URL")
+wiki_version = "wiki_2019"
+host = 'localhost'
+port = '5555'
 datasets = TrainingEvaluationDatasets(base_url, wiki_version).load()["aida_testB"]
 
 # random_docs = np.random.choice(list(datasets.keys()), 50)
@@ -40,7 +43,7 @@
         print(myjson)
 
         print("Output API:")
-        print(requests.post("http://192.168.178.11:1235", json=myjson).json())
+        print(requests.post(f"http://{host}:{port}", json=myjson).json())
         print("----------------------------")


62 changes: 62 additions & 0 deletions scripts/test_server.py
@@ -0,0 +1,62 @@
import os
import requests

# Script for testing the implementation of the conversational entity linking API
#
# To run the server:
#
#   python ./src/REL/server.py $REL_BASE_URL wiki_2019      (bash)
# or
#   python .\src\REL\server.py $env:REL_BASE_URL wiki_2019  (PowerShell)
#
# Set $REL_BASE_URL to where your data are stored (`base_url`)
#
# These paths must exist:
# - `$REL_BASE_URL/bert_conv-td`
# - `$REL_BASE_URL/s2e_ast_onto`
#
# (see https://github.com/informagi/conversational-entity-linking-2022/tree/main/tool#step-1-download-models)
#


host = "localhost"
port = "5555"

text1 = {
    "text": "REL is a modular Entity Linking package that can both be integrated in existing pipelines or be used as an API.",
    "spans": [],
}

conv1 = {
    "text": [
        {
            "speaker": "USER",
            "utterance": "I am allergic to tomatoes but we have a lot of famous Italian restaurants here in London.",
        },
        {
            "speaker": "SYSTEM",
            "utterance": "Some people are allergic to histamine in tomatoes.",
        },
        {
            "speaker": "USER",
            "utterance": "Talking of food, can you recommend me a restaurant in my city for our anniversary?",
        },
    ]
}


for endpoint, myjson in (
    ("", text1),
    ("conversation/", conv1),
):
    print("Input API:")
    print(myjson)
    print()
    print("Output API:")
    print(requests.post(f"http://{host}:{port}/{endpoint}", json=myjson).json())
    print("----------------------------")
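
Based on the tutorial example above, a successful request to the `conversation/` endpoint should return per-turn annotations. The sketch below shows the general shape only (values copied from the tutorial example, not captured from a live server):

```python
# Illustrative response shape for the conversation/ endpoint
# (assumed to mirror ConvEL.annotate(); not captured output):
expected_shape = [
    {
        "speaker": "USER",
        "utterance": "I am allergic to tomatoes but ...",
        "annotations": [[17, 8, "tomatoes", "Tomato"]],
    },
    # ... one entry per conversation turn ...
]
```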

10 changes: 7 additions & 3 deletions setup.cfg
@@ -43,13 +43,17 @@ package_dir =
     = src
 include_package_data = True
 install_requires =
+    anyascii
     colorama
-    konoha
+    fastapi
     flair>=0.11
+    konoha
+    nltk
+    pydantic
     segtok
+    spacy
     torch
-    nltk
-    anyascii
+    uvicorn
 
 [options.extras_require]
 develop =
Empty file added src/REL/crel/__init__.py
Empty file.
94 changes: 94 additions & 0 deletions src/REL/crel/bert_md.py
@@ -0,0 +1,94 @@
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline


class BERT_MD:
    def __init__(self, file_pretrained):
        """
        Args:
            file_pretrained: path to the fine-tuned model directory,
                e.g., "./tmp/ft-conel/"
        Note:
            The output of self.ner_model(s_input) looks like:
            - s_input: e.g., 'Burger King franchise'
            - return: e.g., [{'entity': 'B-ment', 'score': 0.99364895, 'index': 1, 'word': 'Burger', 'start': 0, 'end': 6}, ...]
        """

        model = AutoModelForTokenClassification.from_pretrained(file_pretrained)
        device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        model.to(device)
        tokenizer = AutoTokenizer.from_pretrained(file_pretrained)
        self.ner_model = pipeline(
            "ner",
            model=model,
            tokenizer=tokenizer,
            device=device.index if device.index is not None else -1,
            ignore_labels=[],
        )

    def md(self, s, flag_warning=False):
        """Perform mention detection.

        Args:
            s: input string
            flag_warning: if True, print warning messages
        Returns:
            REL-style annotation results: [(start_position, length, mention), ...]
            E.g., [[0, 15, 'The Netherlands'], ...]
        """

        ann = self.ner_model(s)  # Get annotation results from the BERT-NER model

        ret = []
        pos_start, pos_end = -1, -1  # Span of the mention currently being built

        for i in range(len(ann)):
            w, ner = ann[i]["word"], ann[i]["entity"]
            assert ner in [
                "B-ment",
                "I-ment",
                "O",
            ], f"Unexpected ner tag: {ner}. If you use BERT-NER as it is, then you should set flag_use_normal_bert_ner_tag=True."
            if ner == "B-ment" and w[:2] != "##":
                if (pos_start != -1) and (pos_end != -1):  # A mention is already open
                    ret.append(
                        [pos_start, pos_end - pos_start, s[pos_start:pos_end]]
                    )  # Save the previously identified mention
                    pos_start, pos_end = -1, -1  # Reset
                pos_start, pos_end = ann[i]["start"], ann[i]["end"]  # Open a new mention

            elif ner == "B-ment" and w[:2] == "##":
                if (ann[i]["index"] == ann[i - 1]["index"] + 1) and (
                    ann[i - 1]["entity"] != "B-ment"
                ):  # The previous token carries a label that is NOT "B-ment"
                    # (i.e., a ##xxx subword should not begin an entity)
                    if flag_warning:
                        print(
                            f"WARNING: ##xxx (in this case {w}) should not be the begin of the entity"
                        )

            elif (
                i > 0
                and (ner == "I-ment")
                and (ann[i]["index"] == ann[i - 1]["index"] + 1)
            ):  # w is I-ment and directly follows the previous token
                pos_end = ann[i]["end"]  # Extend the current mention

            # This only happens when flag_ignore_o is False
            elif (
                ner == "O"
                and w[:2] == "##"
                and (
                    ann[i - 1]["entity"] == "B-ment" or ann[i - 1]["entity"] == "I-ment"
                )
            ):  # w is an "O"-labelled ##xxx subword directly after B-ment or I-ment
                pos_end = ann[i]["end"]  # Extend the current mention

        # Append the remaining mention, if any
        if (pos_start != -1) and (pos_end != -1):
            ret.append(
                [pos_start, pos_end - pos_start, s[pos_start:pos_end]]
            )  # Save the last mention

        return ret
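
A hypothetical usage sketch (the path is a placeholder pointing at the `bert_conv-td` checkpoint from the tutorial; the output format follows the `md()` docstring):

```python
# Placeholder path for illustration; point it at your extracted checkpoint.
md_model = BERT_MD("/path/to/base_url/bert_conv-td")
mentions = md_model.md("The Netherlands is famous for its tulips.")
print(mentions)  # REL-style triples, e.g. [[0, 15, 'The Netherlands'], ...]
```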