Code repo for the CoNLL 2021 paper:
MirrorWiC: On Eliciting Word-in-Context Representations from Pretrained Language Models
by Qianchu Liu*, Fangyu Liu*, Nigel Collier, Anna Korhonen, Ivan Vulić
MirrorWiC is a fully unsupervised approach to improving word-in-context (WiC) representations in pretrained language models, achieved via a simple and efficient WiC-targeted fine-tuning procedure. The proposed method leverages only raw texts sampled from Wikipedia, assuming no sense-annotated data, and learns context-aware word representations within a standard contrastive learning setup.
model | WiC (dev) | Usim |
---|---|---|
baseline: bert-base-uncased | 68.49 | 54.52 |
mirrorwic-bert-base-uncased | 71.94 | 61.82 |
mirrorwic-roberta-base | 71.15 | 57.95 |
mirrorwic-deberta-base | 71.78 | 62.79 |
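For intuition, the WiC-targeted fine-tuning uses a standard contrastive (InfoNCE-style) objective: two encodings of the same marked sentence are treated as a positive pair, and all other in-batch examples as negatives. The snippet below is a minimal sketch of such a loss, not the repo's implementation; the function name and temperature value are illustrative.

```python
import torch
import torch.nn.functional as F

def infonce_loss(view1: torch.Tensor, view2: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """view1[i] and view2[i] are two encodings of the same (sentence, target word);
    every other row in the batch serves as an in-batch negative."""
    z1 = F.normalize(view1, dim=-1)
    z2 = F.normalize(view2, dim=-1)
    logits = z1 @ z2.t() / temperature                   # (batch, batch) cosine similarities
    labels = torch.arange(z1.size(0), device=z1.device)  # positives sit on the diagonal
    return F.cross_entropy(logits, labels)
```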
- Preprocess train data:
Run the following to convert a text file (one sentence per line) into WiC-formatted training data. ./train_data/en_wiki.txt provides an example input file. In the output data, each target word is marked with brackets, and random erasing with masking is applied.
>> python get_mirrorwic_traindata.py \
--data [input data] \
--lg [language] \
--random_er [random erasing length]
E.g.
>> python get_mirrorwic_traindata.py \
--data ./train_data/en_wiki.txt \
--lg en \
--random_er 10
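To get a feel for the output format, here is a rough sketch of the two operations described above applied to a single sentence: bracketing one word as the WiC target and mask-erasing a few other tokens. This is an illustration only; the exact target selection and erasing policy of get_mirrorwic_traindata.py may differ.

```python
import random

def mark_and_erase(sentence: str, erase_len: int = 10, mask_token: str = '[MASK]') -> str:
    """Illustrative only: bracket one random word as the target and replace
    other random tokens with the mask token until roughly `erase_len`
    characters have been erased (cf. --random_er)."""
    tokens = sentence.split()
    target_idx = random.randrange(len(tokens))
    tokens[target_idx] = f"[ {tokens[target_idx]} ]"

    candidates = [i for i in range(len(tokens)) if i != target_idx]
    random.shuffle(candidates)
    erased = 0
    for i in candidates:
        if erased >= erase_len:
            break
        erased += len(tokens[i])
        tokens[i] = mask_token
    return ' '.join(tokens)

print(mark_and_erase('The bank raised interest rates again this quarter .'))
```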
- Train:
>> cd train_scripts
>> bash ./mirror_wic.sh [CUDA] [training data] [base model] [dropout]
E.g.
>> bash ./mirror_wic.sh 1,0 ../train_data/en_wiki.txt.mirror.wic.re10 bert-base-uncased 0.4
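The [dropout] argument controls the stochasticity that turns one marked sentence into two different views forming a positive pair: encoding the same input twice with dropout active yields two slightly different representations. The sketch below illustrates that idea only; it is not the training script (it uses mean pooling for brevity, whereas MirrorWiC works with target-word representations, and the 0.4 dropout value is simply taken from the example above).

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name,
                                  hidden_dropout_prob=0.4,
                                  attention_probs_dropout_prob=0.4)
model.train()  # keep dropout active so the two passes differ

batch = tokenizer(['This is a [ sample ] .'], return_tensors='pt')
with torch.no_grad():  # gradients omitted in this illustration
    view1 = model(**batch).last_hidden_state.mean(dim=1)
    view2 = model(**batch).last_hidden_state.mean(dim=1)
# view1 and view2 form a positive pair for a contrastive loss like the one sketched above
```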
- Evaluate:
Download the evaluation data from here, and put the folder in the root directory. Then run:
>> cd evaluation_scripts
>> bash ./eval.sh [task] [model] [cuda]
[task] can be one of: wic, wic-tsv, usim, cosimlex, wsd, am2ico, xlwic
E.g.
>> bash ./eval.sh usim cambridgeltl/mirrorwic-bert-base-uncased 0
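Most of these tasks boil down to comparing target-word embeddings across contexts. As an example of the protocol, Usim is typically scored as the Spearman correlation between model cosine similarities and graded human similarity judgements; the sketch below shows that computation on pre-extracted embeddings (it is not the repo's eval script, and the input format is assumed).

```python
import numpy as np
from scipy.stats import spearmanr

def usim_style_score(embs_a, embs_b, gold_scores):
    """embs_a[i], embs_b[i]: 1-D target-word embeddings for the two contexts of
    pair i; gold_scores[i]: graded human similarity judgement for that pair.
    Returns Spearman's rho between model and human similarities."""
    sims = [float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
            for a, b in zip(embs_a, embs_b)]
    return spearmanr(sims, gold_scores).correlation
```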
You can get the MirrorWiC embeddings by running the following:
>> from evaluation_scripts.src.helpers import get_embed
>> from transformers import AutoTokenizer, AutoModel
>> model = AutoModel.from_pretrained('cambridgeltl/mirrorwic-bert-base-uncased')
>> tokenizer = AutoTokenizer.from_pretrained('cambridgeltl/mirrorwic-bert-base-uncased')
>> texts = ['This is a [ sample ] .', 'This is another [ sample ] .']  # target words are marked with brackets
>> embeddings = get_embed(texts, tokenizer, model, flag='token', layer_start=9, layer_end=13, maxlen=64)  # average the target-word embedding over the top 4 layers (hidden states 9-12)
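Assuming get_embed returns one vector per input text (see evaluation_scripts/src/helpers.py for its exact return type), a cosine similarity between the two target-word embeddings then gives a WiC-style similarity score:
>> import torch
>> e1, e2 = torch.as_tensor(embeddings[0]), torch.as_tensor(embeddings[1])
>> sim = torch.nn.functional.cosine_similarity(e1.view(1, -1), e2.view(1, -1))  # higher = the two usages of 'sample' are closer in meaning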
If you use MirrorWiC in your work, please cite:
@inproceedings{liu2021mirrorwic,
    title={MirrorWiC: On Eliciting Word-in-Context Representations from Pretrained Language Models},
    author={Liu, Qianchu and Liu, Fangyu and Collier, Nigel and Korhonen, Anna and Vuli{\'c}, Ivan},
    booktitle={Proceedings of the 25th Conference on Computational Natural Language Learning (CoNLL)},
    year={2021}
}
This code is adapted from mirror-bert.