Coreference search with LSH #153

Open · wants to merge 43 commits into base `main`

Conversation

f-hafner

Make REL more scalable by using locality-sensitive hashing to reduce search space for coreference search

Features added

1. add coref switch search_corefs to REL

  • 3 options
    • "all": for each mention, search all other mentions for a coreference (the current default)
    • "off": do not search for coreferences
    • "lsh": First, find similar mentions with locality-sensitive hashing. Then, search for coreferences in the reduced set of comparison groups.
  • The main adjustment is in REL/src/REL/training_datasets.py, specifically in the function with_coref. The function is called unless search_corefs="off". For the "lsh" option, it uses the LSHRandomProjections class from the new module lsh.py in REL.
  • with_coref now also returns an indicator for whether a mention was identified as a coreference (along with the other mention attributes, such as candidates), and this indicator is passed on to the final output.
  • Behavior
    • default is unchanged: use "all"
    • parameters (see below) for LSH are currently not exposed at the top level for ED, and two of them are hard-coded in with_coref
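
The switch described above can be sketched as a simple dispatch. This is a hedged illustration, not the actual `with_coref` code: `coref_groups` is a hypothetical name, and the first-letter bucketing merely stands in for `LSHRandomProjections`.

```python
from typing import Dict, List

def coref_groups(mentions: List[str],
                 search_corefs: str = "all") -> Dict[str, List[str]]:
    """Return, for each mention, the other mentions to scan for coreferences.

    Simplified stand-in for the dispatch around with_coref:
      "all" - every other mention is a candidate (the current default)
      "off" - no candidates; coreference search is skipped entirely
      "lsh" - only mentions in the same LSH bucket (faked here by first letter)
    """
    if search_corefs == "off":
        return {m: [] for m in mentions}
    if search_corefs == "lsh":
        # Placeholder for LSHRandomProjections: bucket mentions by first letter.
        return {m: [o for o in mentions if o != m and o[:1] == m[:1]]
                for m in mentions}
    return {m: [o for o in mentions if o != m] for m in mentions}  # "all"
```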

2. REL.lsh.LSHRandomProjections implements LSH with Random projections

  • Key parameters: shingle size, number of bands, band length
    • shingle size defines the number of adjacent letters that define one feature
    • the number of bands and the band length determine the length of the binary signature, which will be used to assess the similarity between mentions
  • Procedure
    • the signature is cut into n_bands bands of equal length
    • for each band, collect the mentions whose signatures have exactly the same bits in that band
    • the candidate mentions for a mention $i$ are all mentions whose signatures match $i$'s signature in at least one band
  • Choose optimal parameters for LSH (done outside REL, see here)
    • band length = log(number of mentions), following theory
    • shingle size, number of bands
      • load coreferring mentions from AIDA test a
        • mention cur_m is a coreference for mention m if
          1. m contains cur_m as a word, alongside other words; e.g. cur_m = "hendrix", m = "jimi hendrix"
          2. the gold entity of m and cur_m are the same (this is different from __find_coref where we do not have ground truth)
      • informally find parameters that maximise recall
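
The banding procedure above can be sketched as follows. This is a minimal, self-contained illustration under assumed details, not REL's `LSHRandomProjections` implementation: the one-hot shingle encoding, default parameter values, and all function names here are assumptions.

```python
import numpy as np

def shingles(mention: str, k: int = 2) -> set:
    """Character k-shingles: the sets of k adjacent letters (k = shingle size)."""
    return {mention[i:i + k] for i in range(len(mention) - k + 1)}

def lsh_candidates(mentions, k=2, n_bands=4, band_length=3, seed=0):
    """Sketch of LSH with random projections.

    Signature length = n_bands * band_length bits; two mentions become
    candidate coreferences if any of their bands match exactly.
    """
    rng = np.random.default_rng(seed)
    vocab = sorted(set().union(*(shingles(m, k) for m in mentions)))
    index = {s: i for i, s in enumerate(vocab)}
    # One-hot shingle vectors, one row per mention.
    X = np.zeros((len(mentions), len(vocab)))
    for row, m in enumerate(mentions):
        for s in shingles(m, k):
            X[row, index[s]] = 1.0
    # Random hyperplanes give a binary signature of n_bands * band_length bits.
    planes = rng.standard_normal((len(vocab), n_bands * band_length))
    sig = (X @ planes >= 0).astype(int)
    # Cut the signature into bands; mentions sharing any exact band match.
    candidates = {m: set() for m in mentions}
    for b in range(n_bands):
        band = sig[:, b * band_length:(b + 1) * band_length]
        buckets = {}
        for row, m in enumerate(mentions):
            buckets.setdefault(tuple(band[row]), []).append(m)
        for group in buckets.values():
            for m in group:
                candidates[m].update(o for o in group if o != m)
    return candidates
```

Mentions with identical shingle sets get identical signatures, so they always land in the same buckets; dissimilar mentions only collide by chance, which shrinks with longer bands.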

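The labelling rule used for the parameter search (conditions 1 and 2 above) could be written as follows. Everything here is illustrative: `is_coref_candidate` and the `gold` mention-to-entity mapping are hypothetical names, and the handling of multi-word cur_m is an assumption.

```python
def is_coref_candidate(cur_m: str, m: str, gold: dict) -> bool:
    """Label cur_m as a coreference of m for the parameter search:
    (1) m contains cur_m as a word, alongside other words
        (e.g. cur_m = "hendrix", m = "jimi hendrix"), and
    (2) both mentions map to the same gold entity
        (unlike __find_coref, which has no ground truth available).
    """
    cur_words, m_words = cur_m.split(), m.split()
    contains = set(cur_words) <= set(m_words) and len(m_words) > len(cur_words)
    return contains and gold.get(cur_m) == gold.get(m)
```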
3. Efficiency tests in scripts/

  • Update efficiency_test.py
    • add coref switch and other command-line options
    • when running on local computer (server=False):
      • profile the ED call and save as a dataframe
      • scaling of ED: run ED on multiply-stacked mentions from test data and time execution
    • All of this is for running on my local computer. Which of the options should we keep? What needs to be changed for server=True?
  • Add script run_efficiency_tests.sh that runs efficiency_test.py with the different coref options
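
The "profile the ED call and save as a dataframe" step might look roughly like this sketch; `profile_rows` is a hypothetical helper, not the code in efficiency_test.py, and the returned rows can be loaded with `pandas.DataFrame(rows)`.

```python
import cProfile
import pstats

def profile_rows(func, *args, **kwargs):
    """Run func under cProfile and return (result, per-function stat rows).

    Each row is a plain dict so the list can be turned into a dataframe
    with pandas.DataFrame(rows).
    """
    prof = cProfile.Profile()
    result = prof.runcall(func, *args, **kwargs)
    rows = []
    # pstats.Stats.stats maps (file, line, name) -> (cc, nc, tottime, cumtime, callers)
    for (path, line, name), (cc, nc, tt, ct, callers) in pstats.Stats(prof).stats.items():
        rows.append({"function": name, "file": path, "line": line,
                     "ncalls": nc, "tottime": tt, "cumtime": ct})
    rows.sort(key=lambda r: r["cumtime"], reverse=True)
    return result, rows
```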

Installation (I have not reproduced this recently; let me know if it fails)

cd rel20
git clone [email protected]:f-hafner/REL.git

# get data; taken from https://github.com/informagi/REL/pull/151
mkdir data
cd data
curl -O http://gem.cs.ru.nl/generic.tar.gz
curl -O http://gem.cs.ru.nl/ed-wiki-2019.tar.gz
curl -O http://gem.cs.ru.nl/wiki_2019.tar.gz
tar zxf wiki_2019.tar.gz
tar zxf ed-wiki-2019.tar.gz
tar zxf generic.tar.gz

# install from branch
cd ../REL
git checkout flavio/coref-lsh
conda create -n rel python=3.7
conda activate rel
pip install -e .[develop]

Testing

pytest tests/test_lsh.py # mostly unit tests
# update `base_url` in `scripts/efficiency_test.py`
python scripts/efficiency_test.py --search_corefs "lsh" # only on local laptop, with CPU

Notes/questions/to dos

  • conventions for docstrings? I have tried to follow the style used in the existing code
  • organization of functions in lsh: should they go elsewhere?
  • more tests for end-to-end lsh?
  • should we add another option "auto" which automatically selects "lsh" for large documents?
  • (are there obvious inefficiencies that have a simple fix?)

I have been sitting on this for long enough now, so let me know if anything important is missing for a review.

- with_coref option: change name to no_corefs, default False
- add options to efficiency test
- bash script for multiple runs with efficiency test and different options
- no coref search, 'all', 'lsh'
- update ED class and efficiency_test.py accordingly
f-hafner commented Jan 26, 2023

And here is the report on performance: https://github.com/rel20/rel_coref_experiments/blob/main/tex/coreferences.pdf

@f-hafner f-hafner marked this pull request as ready for review February 20, 2023 07:09