Mention detection with Bert #151

Open
wants to merge 62 commits into
base: main
Changes from 59 commits
Commits
af0da5e
efficiency test without server
eriktks Dec 15, 2022
26072db
efficiency test without server
eriktks Dec 15, 2022
5f850dc
efficiency test without server
eriktks Dec 15, 2022
ca1b937
efficiency test without server
eriktks Dec 15, 2022
234e886
efficiency test without server
eriktks Dec 15, 2022
f8f3d7e
fixed bert server usage
eriktks Dec 16, 2022
f3914e9
fixed gerbil test problem
eriktks Dec 16, 2022
84f28d7
added multilingual bert
eriktks Dec 20, 2022
67696a9
refactored code
eriktks Dec 22, 2022
0af9d19
refactored code
eriktks Dec 23, 2022
8c18251
smooth installation updates
eriktks Jan 5, 2023
a6ae211
fixed tests/test_ed_pipeline.py
eriktks Jan 5, 2023
1dd3a54
made required arguments optional
eriktks Jan 5, 2023
e7b1604
code cleanup
eriktks Jan 13, 2023
cd24c27
prune word-internal mentions
eriktks Jan 13, 2023
f2a5514
solve initials bug
eriktks Jan 13, 2023
0b90309
file path standardization
eriktks Jan 13, 2023
dc008c0
flagged flair with splitting
eriktks Jan 16, 2023
151bac4
move evaluate_predictions.py to scripts
eriktks Jan 19, 2023
894837d
move evaluate_predictions.py to scripts
eriktks Jan 19, 2023
08a5938
simplified NER tagger selection
eriktks Jan 30, 2023
083f7f5
skipped tests requiring data
eriktks Jan 30, 2023
061edc1
add defaults for arguments
eriktks Jan 30, 2023
369275f
replace next by continue
eriktks Jan 30, 2023
4e6703e
replace with list comprhension
eriktks Jan 30, 2023
7b15c15
simplify code
eriktks Jan 30, 2023
1ca5fbe
values became keyword arguments
eriktks Jan 30, 2023
9e3dae1
string formatting replaced rounding
eriktks Jan 30, 2023
25a5bf1
Update tests/test_evaluate_predictions.py
eriktks Feb 14, 2023
508483a
Update tests/test_evaluate_predictions.py
eriktks Feb 14, 2023
cd86c35
Update tests/test_evaluate_predictions.py
eriktks Feb 14, 2023
fa188e8
Update tests/test_evaluate_predictions.py
eriktks Feb 14, 2023
cbdc791
fixed data format
eriktks Feb 14, 2023
d31135e
make tests work
eriktks Feb 13, 2024
10a3d87
make tests work
eriktks Feb 13, 2024
93899ed
make tests work
eriktks Feb 13, 2024
72b31e0
make tests work
eriktks Feb 13, 2024
44dc91d
removed redundant function
eriktks Feb 13, 2024
ea80f2f
use_server on same level
eriktks Feb 20, 2024
83ff7e6
fixed unreadable code
eriktks Feb 27, 2024
94cc303
base_url for defining path
eriktks Feb 27, 2024
cdcf86c
use startswith iso re.search
eriktks Feb 27, 2024
8a05197
simplified computations
eriktks Feb 27, 2024
565a96a
print with % iso f
eriktks Mar 5, 2024
9d713cc
removed is_flair function argument
eriktks Mar 5, 2024
afbb17f
removed function argument tagger_ner_name
eriktks Mar 5, 2024
6b5b10d
updated combine_entities output format
eriktks Mar 5, 2024
c21e85f
replaced re.sub by str.removeprefix
eriktks Mar 19, 2024
bdc149c
removed redundant re calls
eriktks Mar 19, 2024
2e73d0c
crash without loaded model
eriktks Mar 19, 2024
3bf20d2
simplified split_docs_value variable
eriktks Mar 19, 2024
ef5c5a9
removed hard-coded paths
eriktks Mar 19, 2024
e7304fb
enabling manual action run
eriktks Mar 19, 2024
7cd274f
changing python version
eriktks Mar 19, 2024
e500990
chnaged pytest arguments
eriktks Mar 19, 2024
4858dce
fixing merge conflicts
eriktks Mar 28, 2024
425d917
solving most merge conflicts
eriktks Mar 29, 2024
191b08a
corrected incomplete path
eriktks Apr 8, 2024
59fb3ae
added documentation for ner
eriktks Apr 8, 2024
19ca5a4
restricted scipy version
eriktks Apr 9, 2024
abf4152
restricted scipy version
eriktks Apr 9, 2024
becdac1
restricted scipy version
eriktks Apr 9, 2024
6 changes: 4 additions & 2 deletions .github/workflows/build.yml
@@ -7,13 +7,15 @@ on:
       - main
   pull_request:
     branches: [ main ]
+  workflow_dispatch:
+

 jobs:
   build:
     runs-on: ubuntu-latest
     strategy:
       matrix:
-        python-version: [3.7, 3.8]
+        python-version: [3.9]

     steps:
     - uses: actions/checkout@v3
@@ -40,4 +42,4 @@ jobs:

     - name: Test with pytest
       run: |
-        pytest -W ignore
+        pytest tests
3 changes: 3 additions & 0 deletions .gitignore
@@ -106,6 +106,7 @@ celerybeat.pid
 .venv
 env/
 venv/
+venv3/
 ENV/
 env.bak/
 venv.bak/
@@ -133,3 +134,5 @@ dmypy.json

 # Project specific
 /data
+data
+000README
14 changes: 14 additions & 0 deletions README.md
@@ -5,6 +5,20 @@
 [![PyPI - Python Version](https://img.shields.io/pypi/pyversions/radboud-el)](https://pypi.org/project/radboud-el/)
 [![PyPI](https://img.shields.io/pypi/v/radboud-el.svg?style=flat)](https://pypi.org/project/radboud-el/)

+---
+
+Example tests:
+
+* Flair: `python3 scripts/efficiency_test.py --process_sentences`
+* Bert: `python3 scripts/efficiency_test.py --use_bert_base_uncased --split_docs_value 500`
+* Server (slower):
+    * `python3 src/REL/server.py --use_bert_base_uncased --split_docs_value 500 --ed-model ed-wiki-2019 data wiki_2019`
+    * `python3 scripts/efficiency_test.py --use_server`
+
+Needs installation of REL documents in directory `doc` (`ed-wiki-2019`, `generic` and `wiki_2019`)
+
+---
+
 REL is a modular Entity Linking package that is provided as a Python package as well as a web API. REL has various meanings - one might first notice that it stands for relation, which is a suiting name for the problems that can be tackled with this package. Additionally, in Dutch a 'rel' means a disturbance of the public order, which is exactly what we aim to achieve with the release of this package.

 REL utilizes *English* Wikipedia as a knowledge base and can be used for the following tasks:
11 changes: 11 additions & 0 deletions conftest.py
@@ -0,0 +1,11 @@
import os
import pytest


def pytest_addoption(parser):
    parser.addoption("--base_url", action="store", default=os.path.dirname(__file__) + "/src/data/")


@pytest.fixture
def base_url(request):
    return request.config.getoption("--base_url")
2 changes: 1 addition & 1 deletion docs/tutorials/custom_models.md
@@ -20,7 +20,7 @@ model, you can only use a local filepath.

 NER and ED models that we provide as part of REL can be loaded easily using
 aliases. Available models are listed
-[on the REL repository](https://github.com/informagi/REL/tree/master/REL/models/models.json).
+[on the REL repository](https://github.com/informagi/REL/tree/master/src/REL/models/models.json).
 All models that need to be downloaded from the web are cached for subsequent
 use.

13 changes: 7 additions & 6 deletions docs/tutorials/index.md
@@ -9,9 +9,10 @@ The remainder of the tutorials are optional and for users who wish to e.g. train

 1. [How to get started (project folder and structure).](how_to_get_started/)
 2. [End-to-End Entity Linking.](e2e_entity_linking/)
-3. [Evaluate on GERBIL.](evaluate_gerbil/)
-4. [Deploy REL for a new Wikipedia corpus](deploy_REL_new_wiki/):
-5. [Reproducing our results](reproducing_our_results/)
-6. [REL server](server/)
-7. [Notes on using custom models](custom_models/)
-7. [Conversational entity linking](conversations/)
+3. [Mention Detection models.](ner/)
+4. [Evaluate on GERBIL.](evaluate_gerbil/)
+5. [Deploy REL for a new Wikipedia corpus](deploy_REL_new_wiki/):
+6. [Reproducing our results](reproducing_our_results/)
+7. [REL server](server/)
+8. [Notes on using custom models](custom_models/)
+9. [Conversational entity linking](conversations/)
24 changes: 24 additions & 0 deletions docs/tutorials/ner.md
@@ -0,0 +1,24 @@
# Mention Detection models

REL offers different named entity models for mention detection:

- `flair`: named entity model for English, expects upper and lower case text (default)
- `bert_base_cased`: basic named entity model for English, expects upper and lower case text
- `bert_base_uncased`: basic named entity model for English, expects lower case text
- `bert_large_cased`: extensive named entity model for English, expects upper and lower case text
- `bert_large_uncased`: extensive named entity model for English, expects lower case text
- `bert_multilingual`: multilingual named entity model, expects upper and lower case text

To change the default Flair model, specify the required model with the `--tagger_ner_name` option, for example when calling the server:

```bash
python src/REL/server.py --tagger_ner_name bert_base_cased
```

or specify the model in the `tagger_ner` parameter of a mention detection call:

```python
mentions_dataset, n_mentions = mention_detection.find_mentions(docs, tagger_ner="bert_base_cased")
```

The available named entity models are specified in the file `src/REL/ner/set_tagger_ner.py`. The file names refer to locations on the website huggingface.co, for example https://huggingface.co/flair/ner-english-fast . The file can be extended with new models, for example for other languages.
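
Because the uncased models are described as expecting lower case text, a caller may want to normalise input before tagging. A minimal sketch, assuming the model names listed above; `prepare_text` is a hypothetical helper and not part of REL, and whether REL or the underlying tokenizer already lower-cases the input is not shown in this diff.

```python
def prepare_text(text: str, tagger_ner_name: str) -> str:
    """Lower-case the text only for models whose name ends in "_uncased".

    Hypothetical helper: the naming convention is taken from the model
    list in this tutorial, not from REL's code.
    """
    if tagger_ner_name.endswith("_uncased"):
        return text.lower()
    # Cased and multilingual models receive the text unchanged.
    return text


print(prepare_text("Paris Hilton visited Paris", "bert_base_uncased"))
print(prepare_text("Paris Hilton visited Paris", "flair"))
```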
86 changes: 0 additions & 86 deletions scripts/efficiency_test.py

This file was deleted.

10 changes: 0 additions & 10 deletions scripts/gerbil_middleware/Makefile

This file was deleted.

14 changes: 7 additions & 7 deletions setup.cfg
@@ -43,17 +43,16 @@ package_dir =
= src
include_package_data = True
install_requires =
anyascii
colorama
fastapi
flair>=0.11
konoha
nltk
pydantic
flair>=0.11
segtok
spacy
torch
uvicorn
nltk
anyascii
termcolor
syntok
spacy

[options.extras_require]
develop =
@@ -80,3 +79,4 @@ where = src

# [options.entry_points]
# console_scripts =

2 changes: 1 addition & 1 deletion src/REL/crel/s2e_pe/data.py
@@ -210,7 +210,7 @@ def pad_batch(self, batch, max_length):
 example[0],
 use_fast=False,
 add_special_tokens=True,
-pad_to_max_length=True,
+padding='longest',
 max_length=max_length,
 return_attention_mask=True,
 return_tensors="pt",
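
The two padding settings in this hunk are not equivalent. As far as the Hugging Face tokenizer documentation describes them, the deprecated `pad_to_max_length=True` pads every sequence to `max_length`, whereas `padding='longest'` pads only to the longest sequence in the batch. The pure-Python sketch below illustrates that difference; `pad_batch_ids` and `PAD_ID` are hypothetical names, and the code does not call the tokenizer.

```python
from typing import List, Optional

PAD_ID = 0  # assumed padding token id, for illustration only


def pad_batch_ids(batch: List[List[int]], strategy: str,
                  max_length: Optional[int] = None) -> List[List[int]]:
    """Pad token-id lists to a common length.

    strategy="longest": pad to the longest sequence in the batch.
    strategy="max_length": pad every sequence to `max_length`.
    """
    if strategy == "longest":
        target = max(len(ids) for ids in batch)
    elif strategy == "max_length":
        if max_length is None:
            raise ValueError("max_length is required for strategy='max_length'")
        target = max_length
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    # Sequences longer than the target are returned unchanged:
    # truncation is a separate setting and is not modelled here.
    return [ids + [PAD_ID] * (target - len(ids)) for ids in batch]


batch = [[101, 7, 102], [101, 7, 8, 9, 102]]
print(pad_batch_ids(batch, "longest"))
print(pad_batch_ids(batch, "max_length", max_length=8))
```

With `"longest"` both rows end up five ids long; with `"max_length"` and `max_length=8` both rows are padded to eight.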
12 changes: 9 additions & 3 deletions src/REL/db/base.py
@@ -186,9 +186,15 @@ def lookup_wik(self, w, table_name, column):
             "select {} from {} where word = :word".format(column, table_name),
             {"word": w},
         ).fetchone()
-        res = (
-            e if e is None else json.loads(e[0].decode()) if column == "p_e_m" else e[0]
-        )
+        if not e:
+            res = None
+        elif column == "p_e_m":
+            try:
+                res = json.loads(e[0].decode())
+            except AttributeError:
+                res = json.loads("".join(chr(int(x, 2)) for x in e[0].split()))
+        else:
+            res = e[0]

         return res

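
The new `p_e_m` branch accepts two stored representations: a bytes value holding JSON and, as a fallback, a string of whitespace-separated binary numbers with one number per character. A self-contained sketch of that decoding follows; `decode_p_e_m` is a hypothetical name standing in for the logic inside `lookup_wik`, and the sample payload is made up for illustration.

```python
import json


def decode_p_e_m(raw):
    """Decode a stored p_e_m value into a Python object."""
    try:
        # bytes: decode to str, then parse the JSON text
        return json.loads(raw.decode())
    except AttributeError:
        # str has no .decode(): treat it as space-separated binary code points
        return json.loads("".join(chr(int(x, 2)) for x in raw.split()))


payload = '[["Paris", 0.9]]'  # made-up example value
as_bytes = payload.encode()
as_bits = " ".join(format(ord(c), "b") for c in payload)

print(decode_p_e_m(as_bytes))
print(decode_p_e_m(as_bits))
```

Both calls yield the same parsed list, which is the point of the fallback: the caller does not need to know which representation the database row uses.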