Named Entity Recognition (NER)
Find all personal names in the biographies.
Overview
We combine as many NER tools as we can to increase the recall of NER:
- Stanford CoreNLP: PERSON tag from NER output
- Jieba: nr tag from POS output
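As a minimal sketch of how the outputs of several tools could be merged, the snippet below takes the union of all tokens tagged as a person. It assumes each tool's output has already been converted to a list of (word, tag) pairs; the function name `collect_person_candidates` and the sample tokens are hypothetical and not taken from the project code.

```python
def collect_person_candidates(tagged_outputs, person_tags=frozenset({"PERSON", "nr"})):
    """Return the union of all words that any tool tagged as a person."""
    candidates = set()
    for tokens in tagged_outputs:
        for word, tag in tokens:
            if tag in person_tags:
                candidates.add(word)
    return candidates


# Made-up sample output, one list of (word, tag) pairs per tool.
corenlp_tokens = [("王喬峰", "PERSON"), ("士林", "LOCATION")]
jieba_tokens = [("喬峰", "nr"), ("出生", "v"), ("林大同", "nr")]

names = collect_person_candidates([corenlp_tokens, jieba_tokens])
```

Taking the union favors recall: a name found by only one tool is still kept, at the cost of more false positives for later filtering to remove.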
We can also give Jieba a user-defined dictionary containing the biographees' names and the names listed in the appendix. This makes Jieba more likely to recognize those names correctly, which increases the precision of NER.
We also use regular expressions to find the names of the biographee's family members together with their relations to the biographee.
See Use regex to extract kinship.
We apply five filters to increase the precision of NER.
Placenames such as "士林" are often wrongly recognized as names, so we collect all placenames in Taiwan and China (mainly based on administrative divisions) and filter them out. We also filter out the pattern "placename + 人", such as "福建人", which means a person whose hometown is "福建".
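A minimal sketch of this placename filter, assuming the collected placenames are available as a Python set. The helper name `is_placename` and the two-entry `PLACENAMES` set are illustrative only; the real list would cover the administrative divisions mentioned above.

```python
def is_placename(candidate, placenames):
    """True if the candidate is a placename or has the form 'placename + 人'."""
    if candidate in placenames:
        return True
    # "福建人" -> strip the trailing "人" and look up "福建".
    return candidate.endswith("人") and candidate[:-1] in placenames


# Tiny illustrative set, not the real placename list.
PLACENAMES = {"士林", "福建"}

candidates = ["士林", "福建人", "王喬峰"]
kept = [c for c in candidates if not is_placename(c, PLACENAMES)]
```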
Every name contains a surname, so we take the "Hundred Family Surnames" plus the 7,000 most frequent Japanese surnames crawled from "名字由来" as the set of valid surnames, and filter out every NER result that does not begin with a valid surname.
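The surname check can be sketched as a prefix test against the surname set. The snippet assumes surnames may be one or more characters long (for example compound Chinese surnames or Japanese surnames), and it additionally assumes at least one character must follow the surname; that second rule is an assumption of this sketch, not something stated above. `has_valid_surname` and the small `SURNAMES` set are hypothetical.

```python
def has_valid_surname(candidate, surnames):
    """True if the candidate starts with a known surname followed by a given name."""
    return any(
        candidate.startswith(surname) and len(candidate) > len(surname)
        for surname in surnames
    )


# Tiny illustrative set, not the real surname list.
SURNAMES = {"王", "歐陽", "田中"}

candidates = ["王喬峰", "歐陽修", "田中一郎", "出生", "王"]
kept = [c for c in candidates if has_valid_surname(c, SURNAMES)]
```

With several thousand surnames, grouping them by length and testing only the matching prefixes of each candidate would avoid scanning the whole set.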
Chinese and Japanese names are normally 2 to 4 characters long, so NER results outside this range are regarded as invalid and filtered out.
No name should be a substring of another name. For example, the name "王喬峰" often yields two NER results, "王喬峰" and "喬峰", especially when several NER tools are used together. In this case we keep the longer result and exclude the shorter one.
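One way to implement the "keep the longer result" rule is to visit candidates from longest to shortest and drop any candidate contained in one already kept. This is a sketch under that assumption; `drop_substrings` is a hypothetical name and the project may implement the rule differently.

```python
def drop_substrings(candidates):
    """Keep only candidates that are not a substring of a longer candidate."""
    # Longest first, so any containing name is seen before the names it contains.
    ordered = sorted(set(candidates), key=lambda c: (-len(c), c))
    kept = []
    for candidate in ordered:
        if not any(candidate in longer for longer in kept):
            kept.append(candidate)
    return kept


result = drop_substrings(["王喬峰", "喬峰", "林大同", "王喬峰"])
```

Here "喬峰" is dropped because it occurs inside "王喬峰", while "林大同" is kept. The comparison is quadratic in the number of candidates, which should be acceptable for the candidates of a single biography.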
Even after the filters above, some frequent words are still wrongly accepted as valid names, such as "伯父", "伯母" and "於民國". By explicitly excluding these words, we can slightly raise precision.
Although the result is effective, with a precision of 0.82 and a recall of 0.78 on a random sample of 10 biographies, the methods above are neither complete nor perfect.
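For reference, precision and recall can be computed from the set of predicted names and the set of hand-labelled names as sketched below. The sample sets are made up purely to show the arithmetic; they are not the project's evaluation data and do not reproduce the figures above.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) of predicted names against gold names."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Made-up example: 2 of 4 predictions are correct, 2 of 3 gold names are found.
predicted = {"王喬峰", "林大同", "公正", "士林"}
gold = {"王喬峰", "林大同", "羅尼"}
precision, recall = precision_recall(predicted, gold)
```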
The process does not handle two different people who share the same name, so we cannot tell them apart.
A possible solution is to identify a person by both name and the period in which he or she lived. A biography provides much information about the biographee's period, so this approach looks promising.
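A rough sketch of that idea, which is not implemented in the process described here: treat two mentions as possibly the same person only if the names match and the periods overlap. The `Mention` record, its year fields, and the sample names and years are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    name: str
    start_year: int  # earliest year associated with the person in the text
    end_year: int    # latest year associated with the person in the text


def may_be_same_person(a, b):
    """Same name and overlapping periods; otherwise assume different people."""
    periods_overlap = a.start_year <= b.end_year and b.start_year <= a.end_year
    return a.name == b.name and periods_overlap


# Made-up mentions for illustration.
first = Mention("王大明", 1900, 1960)
second = Mention("王大明", 1920, 1950)
third = Mention("王大明", 1980, 2010)

same = may_be_same_person(first, second)
different = may_be_same_person(first, third)
```

Overlapping periods only make a match plausible; two contemporaries with the same name would still be merged, so further context would be needed in that case.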
The main false positives are words that are not actually names but begin with a valid surname, such as "公正", which means fairness, while "公" is a valid surname.
This case is almost impossible to detect from the word alone, because there may really be a person called "公正"; we need the context to decide whether it is a valid name.
Fortunately, even people without a Chinese or Japanese name usually have a Chinese transliteration of their name, such as "Roni" -> "羅尼"; since "羅" is a valid Chinese surname, the name is caught correctly. But the result is a false negative whenever the transliteration does not begin with a valid surname.
To deal with this case we could collect frequent surnames in English and other languages, but that would be heavy work and is not a very smart approach.