
Mention detection with Bert #151

Open · wants to merge 62 commits into main
Conversation

@eriktks eriktks commented Jan 5, 2023

New version of REL with the option to use Bert instead of Flair for mention detection (named entity recognition).
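
For illustration only, here is a minimal sketch of what Bert-based mention detection looks like in general, using the HuggingFace transformers pipeline. This is not the code in this PR; the model name (dslim/bert-base-NER) and the example text are stand-in assumptions:

# Illustrative sketch only, not REL's implementation from this PR.
# Bert-based mention detection (NER) via the HuggingFace pipeline,
# with a publicly available BERT NER model as a stand-in.
from transformers import pipeline

tagger = pipeline(
    "ner", model="dslim/bert-base-NER", aggregation_strategy="simple"
)

for mention in tagger("Arsenal beat Manchester United at Old Trafford."):
    # Each mention carries a label, a confidence score, and char offsets.
    print(
        mention["entity_group"], mention["word"],
        mention["start"], mention["end"], round(float(mention["score"]), 3),
    )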

Installation

  1. python3 -m venv --prompt REL venv3
  2. source venv3/bin/activate
  3. git clone https://github.com/informagi/REL.git
  4. cd REL
  5. git checkout md-with-bert-2
  6. pip install -e '.[develop]'
  7. mkdir src/data
  8. cd src/data
  9. curl -O http://gem.cs.ru.nl/generic.tar.gz
  10. curl -O http://gem.cs.ru.nl/ed-wiki-2019.tar.gz
  11. curl -O http://gem.cs.ru.nl/wiki_2019.tar.gz
  12. tar zxf wiki_2019.tar.gz
  13. tar zxf ed-wiki-2019.tar.gz
  14. tar zxf generic.tar.gz
  15. cd ../..

Testing

Run the 13 tests:

  • pytest tests

Some warnings might be reported, but all tests should succeed.

List test options:

  • python3 src/scripts/efficiency_test.py -h

Process one test document with Flair, sentence-by-sentence, and report the performance:

  • python3 src/scripts/efficiency_test.py --tagger_ner_name flair --max_docs 1 --process_sentences
    ...
    Results: PMD RMD FMD PEL REL FEL: 94.9% 63.8% 76.3% | 69.2% 46.6% 55.7%

Process all 50 test documents with uncased Bert base, document-by-document, with a maximum of 500 tokens, and report the performance:

  • python3 src/scripts/efficiency_test.py --tagger_ner_name bert_base_uncased --split_docs_value 500
    ...
    Results: PMD RMD FMD PEL REL FEL: 93.5% 62.4% 74.8% | 62.9% 42.0% 50.3%

Use the server to process one test document with Flair, sentence-by-sentence, and report the performance:

  • python3 src/REL/server.py
  • python3 src/scripts/efficiency_test.py --max_docs 1 --use_server
    ...
    Results: PMD RMD FMD PEL REL FEL: 95.0% 63.3% 76.0% | 67.5% 45.0% 54.0%
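
For reference, the six columns after "Results:" appear to be precision (P), recall (R), and F1 (F) for mention detection (MD) and entity linking (EL). That reading is an assumption, but it is consistent with the numbers: every reported F-score equals 2PR/(P+R) for the P and R next to it, as this quick check shows:

# Sanity check (assumption: PMD/RMD/FMD and PEL/REL/FEL are precision,
# recall, and F1 for mention detection and entity linking).
def f1(p: float, r: float) -> float:
    return 2 * p * r / (p + r)

# (precision, recall, reported F1) triples from the runs above, in percent.
runs = [
    (94.9, 63.8, 76.3), (69.2, 46.6, 55.7),  # Flair, 1 doc
    (93.5, 62.4, 74.8), (62.9, 42.0, 50.3),  # Bert base uncased, 50 docs
    (95.0, 63.3, 76.0), (67.5, 45.0, 54.0),  # Flair via server, 1 doc
]
for p, r, reported in runs:
    assert abs(f1(p, r) - reported) < 0.1, (p, r, reported)
print("All reported F-scores match 2PR/(P+R) to rounding.")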

@eriktks eriktks requested a review from stefsmeets January 5, 2023 15:28

@stefsmeets stefsmeets left a comment


Hi @eriktks, this is a very long PR, so a very long review 😅

I found it very tricky to review, because there are small changes everywhere.

I have two main impressions. First, the way the code is currently structured makes it quite tricky to add a new tagger. The code itself is written with Flair in mind, and there is no interface for adding a new tagger (like Bert). I'm sure the code works well, but I would have liked to see a clean interface between Bert and Flair, so that any new tagger just has to implement this interface and it just works. I guess this is a bit of a pipe dream, but I think we should still strive to make small steps toward it.
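
A minimal sketch of such a tagger interface (hypothetical names, not code from this PR or from REL): each backend implements one small contract, and the mention-detection pipeline depends only on that contract.

# Hypothetical sketch of a tagger interface, illustrative only.
from dataclasses import dataclass
from typing import Protocol

@dataclass
class Mention:
    text: str     # surface form of the detected mention
    start: int    # character offset into the input document
    end: int
    label: str    # NER label, e.g. "PER", "ORG", "LOC"
    score: float  # tagger confidence

class Tagger(Protocol):
    def tag(self, text: str) -> list[Mention]:
        """Return all mentions detected in text."""
        ...

class FlairTagger:
    def tag(self, text: str) -> list[Mention]:
        ...  # wrap flair's SequenceTagger here

class BertTagger:
    def tag(self, text: str) -> list[Mention]:
        ...  # wrap a transformers NER pipeline here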

Second, most of the code is written in a way that makes it very difficult to grasp what is going on. This is an issue with the existing code, but also with the code in this PR: lots of nested dict and list lookups, lots of nested if statements and conditionals. I didn't even attempt to understand what most of it does. This also means it is difficult for me to form a picture of what is going on (see point 1).

  • evaluate_predictions.py - This code seems to be used only inside the efficiency_test.py script. I don't think it should be part of the main REL module; I think it belongs inside the scripts directory.
  • mention_detection.py - This module was already quite complex, and I think the changes you made do not help to make it easier to read. In fact, the Bert and Flair paths are completely entangled inside the code. For me at least, it is impossible to read it and get a feel for what is going on. I'm afraid this will make the code very difficult to debug, maintain, or extend with a new tagger... 🫣
  • Have you thought about how to test this code on github actions? This is something I'm struggling with a bit, because there is no good 'small' dataset to test with at the moment.
  • Related to that, can you make sure that the GitHub actions at least do not fail? Maybe you can use markers to mark tests that will not run in the CI (see the sketch after this list)?
  • Other than that, I left a bunch of remarks with my thoughts on how this code could be improved. Have a look and see what you think.
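
A sketch of the marker idea (the marker and test names are hypothetical, not from this repo): register a custom marker, tag the data-heavy tests with it, and let the CI deselect them.

# Hypothetical sketch of pytest markers for tests that cannot run in CI.
# Register the marker first, e.g. in pyproject.toml:
#   [tool.pytest.ini_options]
#   markers = ["needs_data: requires the wiki_2019 dumps, skip on CI"]
import pytest

@pytest.mark.needs_data
def test_mention_detection_on_full_dump():
    ...  # loads src/data/wiki_2019, too heavy for GitHub Actions

# The CI job then deselects these tests with:  pytest -m "not needs_data"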

Edit:
Just remembered that you can also use this decorator to skip the tests on the CI (I use this elsewhere):

import os
import pytest

@pytest.mark.skipif(
    os.getenv("GITHUB_ACTIONS") == "true", reason="No way of testing this on Github actions."
)
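
Applied to one of the tests, that might look like this (the test name is hypothetical):

@pytest.mark.skipif(
    os.getenv("GITHUB_ACTIONS") == "true", reason="No way of testing this on Github actions."
)
def test_efficiency_with_bert():
    ...  # needs the downloaded wiki_2019 data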

(Inline review comments on scripts/efficiency_test.py, tests/test_evaluate_predictions.py, and tests/test_flair_md.py, all outdated and resolved.)

@eriktks eriktks commented Mar 19, 2024

GitHub Actions build now succeeds after changing the Python version and the pytest arguments.

@stefsmeets stefsmeets self-requested a review April 8, 2024 14:46
stefsmeets previously approved these changes Apr 8, 2024

@stefsmeets stefsmeets left a comment


We discussed this offline. The changes look good to me! 🚀
