Coreference search with LSH #153

Open · wants to merge 43 commits into base `main`

Conversation

f-hafner

Make REL more scalable by using locality-sensitive hashing to reduce search space for coreference search

Features added

1. add coref switch search_corefs to REL

  • 3 options
    • "all": for each mention, search all other mentions for a coreference (the current default)
    • "off": do not search for coreferences
    • "lsh": First, find similar mentions with locality-sensitive hashing. Then, search for coreferences in the reduced set of comparison groups.
  • The main adjustment is in REL/src/REL/training_datasets.py, specifically in the function with_coref. The function is called unless search_corefs="off". For the "lsh" option, it uses the LSHRandomProjections class from the new module lsh.py in REL.
  • with_coref now also returns an indicator for whether a mention was identified as a coreference (along with the other mention attributes, such as candidates), and this indicator is passed on to the final output.
  • Behavior
    • default is unchanged: use "all"
    • parameters (see below) for LSH are currently not exposed at the top level for ED, and two of them are hard-coded in with_coref
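
The switch described above can be sketched as a simple dispatch. This is a hedged illustration, not the actual `with_coref` code: `coref_groups` is a hypothetical name, and the first-letter bucketing merely stands in for `LSHRandomProjections`.

```python
from typing import Dict, List

def coref_groups(mentions: List[str],
                 search_corefs: str = "all") -> Dict[str, List[str]]:
    """Return, for each mention, the other mentions to scan for coreferences.

    Simplified stand-in for the dispatch around with_coref:
      "all" - every other mention is a candidate (the current default)
      "off" - no candidates; coreference search is skipped entirely
      "lsh" - only mentions in the same LSH bucket (faked here by first letter)
    """
    if search_corefs == "off":
        return {m: [] for m in mentions}
    if search_corefs == "lsh":
        # Placeholder for LSHRandomProjections: bucket mentions by first letter.
        return {m: [o for o in mentions if o != m and o[:1] == m[:1]]
                for m in mentions}
    return {m: [o for o in mentions if o != m] for m in mentions}  # "all"
```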

2. REL.lsh.LSHRandomProjections implements LSH with Random projections

  • Key parameters: shingle size, number of bands, band length
    • shingle size defines the number of adjacent letters that define one feature
    • the number of bands and the band length determine the length of the binary signature, which will be used to assess the similarity between mentions
  • Procedure
    • the signature is cut into n_bands bands of equal length
    • for each band, collect the mentions whose signatures have exactly the same bits in that band
    • the candidate mentions for a mention $i$ are all mentions whose signatures match $i$'s signature in at least one band
  • Choose optimal parameters for LSH (done outside REL, see here)
    • band length = log(number of mentions), following theory
    • shingle size, number of bands
      • load coreferring mentions from AIDA test a
        • mention cur_m is a coreference for mention m if
          1. m contains cur_m as a word, alongside other words; e.g. cur_m = "hendrix", m = "jimi hendrix"
          2. the gold entity of m and cur_m are the same (this is different from __find_coref where we do not have ground truth)
      • informally find parameters that maximise recall
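
The banding procedure above can be sketched as follows. This is a minimal, self-contained illustration under assumed details, not REL's `LSHRandomProjections` implementation: the one-hot shingle encoding, default parameter values, and all function names here are assumptions.

```python
import numpy as np

def shingles(mention: str, k: int = 2) -> set:
    """Character k-shingles: the sets of k adjacent letters (k = shingle size)."""
    return {mention[i:i + k] for i in range(len(mention) - k + 1)}

def lsh_candidates(mentions, k=2, n_bands=4, band_length=3, seed=0):
    """Sketch of LSH with random projections.

    Signature length = n_bands * band_length bits; two mentions become
    candidate coreferences if any of their bands match exactly.
    """
    rng = np.random.default_rng(seed)
    vocab = sorted(set().union(*(shingles(m, k) for m in mentions)))
    index = {s: i for i, s in enumerate(vocab)}
    # One-hot shingle vectors, one row per mention.
    X = np.zeros((len(mentions), len(vocab)))
    for row, m in enumerate(mentions):
        for s in shingles(m, k):
            X[row, index[s]] = 1.0
    # Random hyperplanes give a binary signature of n_bands * band_length bits.
    planes = rng.standard_normal((len(vocab), n_bands * band_length))
    sig = (X @ planes >= 0).astype(int)
    # Cut the signature into bands; mentions sharing any exact band match.
    candidates = {m: set() for m in mentions}
    for b in range(n_bands):
        band = sig[:, b * band_length:(b + 1) * band_length]
        buckets = {}
        for row, m in enumerate(mentions):
            buckets.setdefault(tuple(band[row]), []).append(m)
        for group in buckets.values():
            for m in group:
                candidates[m].update(o for o in group if o != m)
    return candidates
```

Mentions with identical shingle sets get identical signatures, so they always land in the same buckets; dissimilar mentions only collide by chance, which shrinks with longer bands.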

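The labelling rule used for the parameter search (conditions 1 and 2 above) could be written as follows. Everything here is illustrative: `is_coref_candidate` and the `gold` mention-to-entity mapping are hypothetical names, and the handling of multi-word cur_m is an assumption.

```python
def is_coref_candidate(cur_m: str, m: str, gold: dict) -> bool:
    """Label cur_m as a coreference of m for the parameter search:
    (1) m contains cur_m as a word, alongside other words
        (e.g. cur_m = "hendrix", m = "jimi hendrix"), and
    (2) both mentions map to the same gold entity
        (unlike __find_coref, which has no ground truth available).
    """
    cur_words, m_words = cur_m.split(), m.split()
    contains = set(cur_words) <= set(m_words) and len(m_words) > len(cur_words)
    return contains and gold.get(cur_m) == gold.get(m)
```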
3. Efficiency tests in scripts/

  • Update efficiency_test.py
    • add coref switch and other command-line options
    • when running on local computer (server=False):
      • profile the ED call and save as a dataframe
      • scaling of ED: run ED on multiply-stacked mentions from test data and time execution
    • All of this is for running on my local computer. Which of the options should we keep? What needs to be changed for server=True?
  • Add script run_efficiency_tests.sh that runs efficiency_test.py with the different coref options
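
The "profile the ED call and save as a dataframe" step might look roughly like this sketch; `profile_rows` is a hypothetical helper, not the code in efficiency_test.py, and the returned rows can be loaded with `pandas.DataFrame(rows)`.

```python
import cProfile
import pstats

def profile_rows(func, *args, **kwargs):
    """Run func under cProfile and return (result, per-function stat rows).

    Each row is a plain dict so the list can be turned into a dataframe
    with pandas.DataFrame(rows).
    """
    prof = cProfile.Profile()
    result = prof.runcall(func, *args, **kwargs)
    rows = []
    # pstats.Stats.stats maps (file, line, name) -> (cc, nc, tottime, cumtime, callers)
    for (path, line, name), (cc, nc, tt, ct, callers) in pstats.Stats(prof).stats.items():
        rows.append({"function": name, "file": path, "line": line,
                     "ncalls": nc, "tottime": tt, "cumtime": ct})
    rows.sort(key=lambda r: r["cumtime"], reverse=True)
    return result, rows
```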

Installation (I have not reproduced this recently; let me know if it fails)

cd rel20
git clone [email protected]:f-hafner/REL.git

# get data; taken from https://github.com/informagi/REL/pull/151
mkdir data
cd data
curl -O http://gem.cs.ru.nl/generic.tar.gz
curl -O http://gem.cs.ru.nl/ed-wiki-2019.tar.gz
curl -O http://gem.cs.ru.nl/wiki_2019.tar.gz
tar zxf wiki_2019.tar.gz
tar zxf ed-wiki-2019.tar.gz
tar zxf generic.tar.gz

# install from branch
cd ../REL
git checkout flavio/coref-lsh
conda create -n rel python=3.7
conda activate rel
pip install -e .[develop]

Testing

pytest tests/test_lsh.py # mostly unit tests
# update `base_url` in `scripts/efficiency_test.py`
python scripts/efficiency_test.py --search_corefs "lsh" # only on local laptop, with CPU

Notes/questions/to dos

  • conventions for docstrings? I have tried to follow the style used in the existing code
  • organization of functions in lsh: should they go elsewhere?
  • more tests for end-to-end lsh?
  • should we add another option "auto" which automatically selects "lsh" for large documents?
  • (are there obvious inefficiencies that have a simple fix?)

I have been sitting on this for long enough now, so let me know if anything important is missing for a review.

- with_coref option: change name to no_corefs, default False
- add options to efficiency test
- bash script for multiple runs with efficiency test and different options
- no coref search, 'all', 'lsh'
- update ED class and efficiency_test.py accordingly
f-hafner commented Jan 26, 2023

And here is the report on performance: https://github.com/rel20/rel_coref_experiments/blob/main/tex/coreferences.pdf

@f-hafner f-hafner marked this pull request as ready for review February 20, 2023 07:09