You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
In order to reproduce the conll score reported in BERT paper (92.4 bert-base and 92.8 bert-large) one trick is to apply a truecaser on article titles (all upper case sentences) as preprocessing step for conll train/dev/test. This can be simply done with the following method.
#https://github.com/daltonfury42/truecase
#pip install truecase
import truecase
import re
# original tokens
#['FULL', 'FEES', '1.875', 'REOFFER', '99.32', 'SPREAD', '+20', 'BP']
def truecase_sentence(tokens):
word_lst = [(w, idx) for idx, w in enumerate(tokens) if all(c.isalpha() for c in w)]
lst = [w for w, _ in word_lst if re.match(r'\b[A-Z\.\-]+\b', w)]
if len(lst) and len(lst) == len(word_lst):
parts = truecase.get_true_case(' '.join(lst)).split()
# the trucaser have its own tokenization ...
# skip if the number of word dosen't match
if len(parts) != len(word_lst): return tokens
for (w, idx), nw in zip(word_lst, parts):
tokens[idx] = nw
# truecased tokens
#['Full', 'fees', '1.875', 'Reoffer', '99.32', 'spread', '+20', 'BP']
Also, i found useful to use : very small learning rate (5e-6) \ large batch size (128) \ high epoch num (>40).
With these configurations and preprocessing, I was able to reach 92.8 with bert-large.
The text was updated successfully, but these errors were encountered:
In order to reproduce the conll score reported in BERT paper (92.4 bert-base and 92.8 bert-large) one trick is to apply a truecaser on article titles (all upper case sentences) as preprocessing step for conll train/dev/test. This can be simply done with the following method.
Also, i found useful to use : very small learning rate (5e-6) \ large batch size (128) \ high epoch num (>40).
With these configurations and preprocessing, I was able to reach 92.8 with bert-large.
The text was updated successfully, but these errors were encountered: