Add conversational entity linking into REL (#150)

* Add submodule for conversational entity linking
* Implement API for conversational entity linking
* Add script for testing the server response
* Re-implement server using fastapi/pydantic
* Implement conversational entity linking in server
* Refactor response handler code to submodule
* Remove `rel_ed` and make ConvEL depend on REL MD
* Add some docs for conversational entity linking
* Skip CREL test if running on CI (we have no way of testing, because the data are not available)
* Remove unused code
1 parent b153d67 · commit 0018b57 · Showing 22 changed files with 2,294 additions and 167 deletions.
@@ -0,0 +1,71 @@

# Conversational entity linking

The `crel` submodule provides the conversational entity linking tool trained on the [ConEL-2 dataset](https://github.com/informagi/conversational-entity-linking-2022#conel-2-conversational-entity-linking-dataset).

Unlike existing EL methods, `crel` is developed to identify both named entities and concepts.
It also uses coreference resolution techniques to identify personal entities and references to the explicit entity mentions in the conversations.

This tutorial describes how to get started with conversational entity linking on a local machine.

For more information, see the original [repository on conversational entity linking](https://github.com/informagi/conversational-entity-linking-2022).
## Start with your local environment

### Step 1: Download models

First, download the models below:

- **MD for concepts and NEs**:
    + [Click here to download models](https://drive.google.com/file/d/1OoC2XZp4uBy0eB_EIuIhEHdcLEry2LtU/view?usp=sharing)
    + Extract `bert_conv-td` to your `base_url`
- **Personal Entity Linking**:
    + [Click here to download models](https://drive.google.com/file/d/1-jW8xkxh5GV-OuUBfMeT2Tk7tEzvH181/view?usp=sharing)
    + Extract `s2e_ast_onto` to your `base_url`

Additionally, conversational entity linking uses the wiki 2019 dataset. For more information on where to place the data and the `base_url`, check out [this page](../how_to_get_started). If set up correctly, your `base_url` should contain these directories:
```bash
.
├── bert_conv-td
├── s2e_ast_onto
└── wiki_2019
```
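As a quick sanity check, you can confirm this layout from Python before starting; a minimal sketch, assuming `base_url` points at the directory shown above (the path below is a placeholder):

```python
from pathlib import Path

base_url = Path("C:/path/to/base_url/")  # placeholder, adjust to your setup

# The tutorial expects these three directories inside base_url.
for name in ("bert_conv-td", "s2e_ast_onto", "wiki_2019"):
    print(name, "found" if (base_url / name).is_dir() else "MISSING")
```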
### Step 2: Example code

This example shows how to link a short conversation. Note that the speakers must be named "USER" and "SYSTEM".
```python
>>> from REL.crel.conv_el import ConvEL
>>>
>>> cel = ConvEL(base_url="C:/path/to/base_url/")
>>>
>>> conversation = [
...     {"speaker": "USER",
...      "utterance": "I am allergic to tomatoes but we have a lot of famous Italian restaurants here in London."},
...     {"speaker": "SYSTEM",
...      "utterance": "Some people are allergic to histamine in tomatoes."},
...     {"speaker": "USER",
...      "utterance": "Talking of food, can you recommend me a restaurant in my city for our anniversary?"},
... ]
>>>
>>> annotated = cel.annotate(conversation)
>>> [item for item in annotated if item['speaker'] == 'USER']
[{'speaker': 'USER',
  'utterance': 'I am allergic to tomatoes but we have a lot of famous Italian restaurants here in London.',
  'annotations': [[17, 8, 'tomatoes', 'Tomato'],
                  [54, 19, 'Italian restaurants', 'Italian_cuisine'],
                  [82, 6, 'London', 'London']]},
 {'speaker': 'USER',
  'utterance': 'Talking of food, can you recommend me a restaurant in my city for our anniversary?',
  'annotations': [[11, 4, 'food', 'Food'],
                  [40, 10, 'restaurant', 'Restaurant'],
                  [54, 7, 'my city', 'London']]}]
```
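If you prefer to go through the REL server rather than calling `ConvEL` directly, this commit also exposes a `conversation/` endpoint. The snippet below mirrors the bundled test script and assumes a server running locally on port 5555 (host and port are assumptions, adjust to your setup):

```python
import requests

# Assumes the REL server is running locally, e.g.:
#   python src/REL/server.py $REL_BASE_URL wiki_2019
conversation = {
    "text": [
        {"speaker": "USER",
         "utterance": "I am allergic to tomatoes but we have a lot of famous Italian restaurants here in London."},
        {"speaker": "SYSTEM",
         "utterance": "Some people are allergic to histamine in tomatoes."},
    ]
}

response = requests.post("http://localhost:5555/conversation/", json=conversation)
print(response.json())
```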
@@ -1,7 +1,10 @@
+anyascii
 colorama
-konoha
+fastapi
 flair>=0.11
+konoha
+nltk
+pydantic
 segtok
 torch
-nltk
-anyascii
+uvicorn
@@ -0,0 +1,62 @@

import requests

# Script for testing the implementation of the conversational entity linking API
#
# To run the server:
#
#     python .\src\REL\server.py $REL_BASE_URL wiki_2019
# or
#     python .\src\REL\server.py $env:REL_BASE_URL wiki_2019
#
# Set $REL_BASE_URL to where your data are stored (`base_url`).
#
# These paths must exist:
# - `$REL_BASE_URL/bert_conv`
# - `$REL_BASE_URL/s2e_ast_onto`
#
# (see https://github.com/informagi/conversational-entity-linking-2022/tree/main/tool#step-1-download-models)

host = 'localhost'
port = '5555'

text1 = {
    "text": "REL is a modular Entity Linking package that can both be integrated in existing pipelines or be used as an API.",
    "spans": []
}

conv1 = {
    "text": [
        {
            "speaker": "USER",
            "utterance": "I am allergic to tomatoes but we have a lot of famous Italian restaurants here in London.",
        },
        {
            "speaker": "SYSTEM",
            "utterance": "Some people are allergic to histamine in tomatoes.",
        },
        {
            "speaker": "USER",
            "utterance": "Talking of food, can you recommend me a restaurant in my city for our anniversary?",
        },
    ]
}

for endpoint, myjson in (
    ('', text1),
    ('conversation/', conv1)
):
    print('Input API:')
    print(myjson)
    print()
    print('Output API:')
    print(requests.post(f"http://{host}:{port}/{endpoint}", json=myjson).json())
    print('----------------------------')
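The commit message notes that the server was re-implemented with fastapi/pydantic and gained the `conversation/` endpoint exercised above. For readers unfamiliar with that stack, here is a rough, illustrative sketch of an endpoint accepting the `conv1` payload shape; the class names and the handler body are assumptions for illustration, not REL's actual server code:

```python
# Illustrative only: a minimal FastAPI endpoint matching the conv1 request shape.
from typing import List

from fastapi import FastAPI
from pydantic import BaseModel


class Turn(BaseModel):
    speaker: str
    utterance: str


class ConversationRequest(BaseModel):
    text: List[Turn]


app = FastAPI()


@app.post("/conversation/")
def annotate_conversation(request: ConversationRequest):
    # A real handler would call ConvEL.annotate() here; this sketch just echoes the turns.
    return [{"speaker": turn.speaker, "utterance": turn.utterance} for turn in request.text]

# Run with, e.g.: uvicorn this_module:app --port 5555
```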
Empty file.
@@ -0,0 +1,94 @@

import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline


class BERT_MD:
    def __init__(self, file_pretrained):
        """
        Args:
            file_pretrained: path to the fine-tuned model, e.g. "./tmp/ft-conel/"

        Note:
            The output of self.ner_model(s_input) looks like:
            - s_input: e.g., 'Burger King franchise'
            - return: e.g., [{'entity': 'B-ment', 'score': 0.99364895, 'index': 1, 'word': 'Burger', 'start': 0, 'end': 6}, ...]
        """
        model = AutoModelForTokenClassification.from_pretrained(file_pretrained)
        device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        model.to(device)
        tokenizer = AutoTokenizer.from_pretrained(file_pretrained)
        self.ner_model = pipeline(
            "ner",
            model=model,
            tokenizer=tokenizer,
            device=device.index if device.index is not None else -1,
            ignore_labels=[],
        )

    def md(self, s, flag_warning=False):
        """Perform mention detection.

        Args:
            s: input string
            flag_warning: if True, print warning messages

        Returns:
            REL-style annotation results: [[start_position, length, mention], ...]
            E.g., [[0, 15, 'The Netherlands'], ...]
        """
        ann = self.ner_model(s)  # Get annotation results from the BERT-NER model

        ret = []
        pos_start, pos_end = -1, -1  # Initialize variables

        for i in range(len(ann)):
            w, ner = ann[i]["word"], ann[i]["entity"]
            assert ner in [
                "B-ment",
                "I-ment",
                "O",
            ], f"Unexpected ner tag: {ner}. If you use BERT-NER as it is, then you should flag_use_normal_bert_ner_tag=True."

            if ner == "B-ment" and w[:2] != "##":
                if (pos_start != -1) and (pos_end != -1):  # A mention is already pending
                    ret.append(
                        [pos_start, pos_end - pos_start, s[pos_start:pos_end]]
                    )  # Save the previously identified mention
                    pos_start, pos_end = -1, -1  # Initialize
                pos_start, pos_end = ann[i]["start"], ann[i]["end"]

            elif ner == "B-ment" and w[:2] == "##":
                if (ann[i]["index"] == ann[i - 1]["index"] + 1) and (
                    ann[i - 1]["entity"] != "B-ment"
                ):  # Previous token is labeled but NOT "B-ment" (i.e., ##xxx should not begin an entity)
                    if flag_warning:
                        print(
                            f"WARNING: ##xxx (in this case {w}) should not be the begin of the entity"
                        )

            elif (
                i > 0
                and (ner == "I-ment")
                and (ann[i]["index"] == ann[i - 1]["index"] + 1)
            ):  # w is I-ment and the previous token belongs to the same mention
                pos_end = ann[i]["end"]  # Update pos_end

            # This only happens when flag_ignore_o is False
            elif (
                ner == "O"
                and w[:2] == "##"
                and (
                    ann[i - 1]["entity"] == "B-ment" or ann[i - 1]["entity"] == "I-ment"
                )
            ):  # w is "O" and a ##xxx continuation of a B-ment or I-ment token
                pos_end = ann[i]["end"]  # Update pos_end

        # Append the remaining mention, if any
        if (pos_start != -1) and (pos_end != -1):
            ret.append(
                [pos_start, pos_end - pos_start, s[pos_start:pos_end]]
            )  # Save the last mention

        return ret
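For context, a usage sketch of the wrapper above; the model path is a placeholder and assumes the `bert_conv-td` model from Step 1 of the tutorial has been extracted there:

```python
# Hypothetical usage of the BERT_MD mention-detection wrapper defined above.
md_model = BERT_MD("C:/path/to/base_url/bert_conv-td")  # placeholder path
mentions = md_model.md("I am allergic to tomatoes but I love Italian restaurants.")
print(mentions)  # e.g., [[17, 8, 'tomatoes'], ...] as REL-style [start, length, mention] triples
```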