Named Entity Recognition (NER)

richardyy1188 edited this page Aug 24, 2018 · 8 revisions

Goal

Find all person names in the biographies.

Process

Overview

[Figure: diagram of the proposed NER method]

1. Use NER Tools to get names

Use as many NER tools as we can, to increase the recall of NER.

  • Stanford CoreNLP: PERSON tag from the NER output
  • Jieba: nr tag from the POS output
    We can also supply Jieba with a user-defined dictionary containing the biographees' names and the personal names listed in the appendix. This makes Jieba more likely to recognize those names correctly, which increases the precision of NER.
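
As a rough illustration of how the outputs of several tools can be combined, the sketch below takes the union of the person-tagged words. It assumes each tool's output has already been converted to a list of (word, tag) pairs; the function names and the sample outputs are hypothetical and are not taken from the project.

```python
from typing import Iterable, Set, Tuple

TaggedToken = Tuple[str, str]  # (word, tag) pair as produced by a tagger


def person_candidates(tagged: Iterable[TaggedToken], person_tag: str) -> Set[str]:
    """Keep only the words whose tag marks them as a person name."""
    return {word for word, tag in tagged if tag == person_tag}


def merge_candidates(corenlp_tokens: Iterable[TaggedToken],
                     jieba_tokens: Iterable[TaggedToken]) -> Set[str]:
    """Union the candidates of both tools (union favors recall over precision)."""
    return (person_candidates(corenlp_tokens, "PERSON")
            | person_candidates(jieba_tokens, "nr"))


# Hand-written sample outputs, for illustration only (not real tool output).
corenlp_out = [("王喬峰", "PERSON"), ("士林", "PERSON"), ("出生", "O")]
jieba_out = [("喬峰", "nr"), ("出生", "v"), ("李大同", "nr")]

candidates = merge_candidates(corenlp_out, jieba_out)
```

Taking the union keeps noisy candidates such as "士林" and "喬峰", which is why the filters in step 3 are needed.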

2. Use regex to get names

We use regular expressions to find the names of the biographee's family members and their relations to the biographee at the same time.
See Use regex to extract kinship.
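
The actual patterns are described on the linked page. Purely to illustrate the idea of capturing a relation and a name in one match, here is a minimal sketch with a hypothetical pattern: a kinship word, followed by 2 to 4 CJK characters, followed by punctuation. The pattern, the list of kinship words, and the sample sentence are all assumptions.

```python
import re

# Hypothetical pattern, not the project's real one:
#   relation: a single kinship character
#   name:     2 to 4 CJK characters directly after it
#   lookahead: the name must be followed by Chinese punctuation
KINSHIP_PATTERN = re.compile(
    r"(?P<relation>父|母|妻|兄|弟)"
    r"(?P<name>[\u4e00-\u9fff]{2,4})"
    r"(?=[，。、；])"
)


def extract_kinship(text):
    """Return (relation, name) pairs found in the text."""
    return [(m.group("relation"), m.group("name"))
            for m in KINSHIP_PATTERN.finditer(text)]


pairs = extract_kinship("父王大明，母陳美玉。")
```

A single match yields both the name and its relation to the biographee, so no second pass is needed to link them.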

3. Filter the names

Five filters are applied to increase the precision of NER.

3.1 Not Placename

Place names are often wrongly recognized as personal names, for example "士林". We therefore collect place names in Taiwan and China (mainly based on administrative divisions) and filter out any NER result that matches one. We also filter out results of the form "place name + 人", such as "福建人", which means a person whose hometown is "福建".
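
A minimal sketch of this filter is shown below. The place-name set here is a tiny hand-picked sample standing in for the full list, and the function names are hypothetical.

```python
# Tiny illustrative sample; the real list would cover administrative
# divisions in Taiwan and China.
PLACENAMES = {"士林", "福建", "臺北"}


def is_placename_like(candidate, placenames=PLACENAMES):
    """True for a place name itself or for the form 'place name + 人'."""
    if candidate in placenames:
        return True
    return candidate.endswith("人") and candidate[:-1] in placenames


def drop_placenames(candidates):
    """Remove candidates that look like place names, keeping the order."""
    return [c for c in candidates if not is_placename_like(c)]


kept = drop_placenames(["士林", "福建人", "王喬峰"])
```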

3.2 Surname Prefix

Every name contains a surname, and in Chinese and Japanese names the surname comes first. We take the "Hundred Family Surnames" plus the 7,000 most frequent Japanese surnames crawled from "名字由来" as the set of valid surnames, and filter out every NER result that does not begin with a valid surname.
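
The check can be sketched as follows. Since the family name comes first in Chinese and Japanese names, the sketch tests whether a candidate begins with a known surname. The surname set is a small illustrative sample, and the function name is hypothetical.

```python
# Illustrative sample only; the real set would hold the full surname lists.
VALID_SURNAMES = {"王", "李", "歐陽", "田中"}


def has_valid_surname(candidate, surnames=VALID_SURNAMES):
    """True if the candidate begins with any known surname.

    str.startswith accepts a tuple, so surnames of one or several
    characters are all handled by a single call.
    """
    return candidate.startswith(tuple(surnames))


kept = [c for c in ["王喬峰", "田中角榮", "喬峰"] if has_valid_surname(c)]
```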

3.3 Length Constraint

Chinese and Japanese names are normally 2 to 4 characters long, so NER results outside this range are regarded as invalid and filtered out.
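
In Python this is a one-line check, sketched here with hypothetical names. Note that len() counts Unicode code points rather than bytes, so each CJK character counts as one.

```python
MIN_LEN, MAX_LEN = 2, 4  # bounds taken from the text above


def has_valid_length(candidate):
    """True if the candidate is 2 to 4 characters long (inclusive)."""
    return MIN_LEN <= len(candidate) <= MAX_LEN


kept = [c for c in ["王", "王喬峰", "於民國六十七年"] if has_valid_length(c)]
```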

3.4 No substring principle

No name should be a substring of another name. For example, the name "王喬峰" often yields two NER results, "王喬峰" and "喬峰", especially when multiple NER tools are used together. In this case we keep the longer result and discard the shorter one.
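
One simple way to apply this rule is sketched below; the function name is hypothetical. It compares every pair of candidates, which is quadratic but should be acceptable for the number of names in a single biography.

```python
def drop_substrings(candidates):
    """Discard every candidate that is a proper substring of another one."""
    unique = list(dict.fromkeys(candidates))  # de-duplicate, keep first-seen order
    return [
        c for c in unique
        if not any(c != other and c in other for other in unique)
    ]


kept = drop_substrings(["王喬峰", "喬峰", "李大同", "王喬峰"])
```

A side effect of the rule is that a genuinely different person whose short name happens to appear inside a longer name is also dropped.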

3.5 Exception Exclusion

Even after the filters above, some words are still often wrongly accepted as names, such as "伯父", "伯母" and "於民國", all of which are fairly frequent words. By explicitly excluding these cases, we can slightly raise precision.
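
This filter amounts to a small blocklist, sketched below with the three examples from the text. The variable and function names are hypothetical, and a real list would presumably be longer.

```python
# The three examples mentioned above; extend as new false positives appear.
EXCLUDED_WORDS = frozenset({"伯父", "伯母", "於民國"})


def drop_exceptions(candidates, excluded=EXCLUDED_WORDS):
    """Remove candidates that appear on the exclusion list."""
    return [c for c in candidates if c not in excluded]


kept = drop_exceptions(["伯父", "王喬峰", "於民國"])
```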

Difficult Points & Possible Solutions

On a random sample of 10 biographies, the process achieves a precision of 0.82 and a recall of 0.78. This is an effective result, but the methods above are neither complete nor perfect.
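
For reference, precision and recall over a set of extracted names can be computed as in the sketch below. The two name sets are toy data made up for illustration and are unrelated to the figures reported above.

```python
def precision_recall(predicted, gold):
    """Compute (precision, recall) of predicted names against gold names."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy data: 2 of 3 predictions are correct, 2 of 4 gold names are found.
predicted_names = {"王喬峰", "李大同", "公正"}
gold_names = {"王喬峰", "李大同", "羅尼", "陳美玉"}
precision, recall = precision_recall(predicted_names, gold_names)
```

Here "公正" counts against precision (a false positive), while "羅尼" and "陳美玉" count against recall (false negatives).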

The same name

The process does not handle two different people who share the same name, so we cannot tell such people apart.

A possible solution is to identify a person by both name and the period in which he or she lived. Biographies provide a lot of information about the biographee's era, so this looks promising.

Valid surname but not name

Most false positives are words that are not actually names but begin with a valid surname, such as "公正", which means fairness, while "公" is a valid surname.

It is almost impossible to detect this case from the word alone, because there may really be a person called "公正". We may need the surrounding context to decide whether the word is a valid name.

English names

Fortunately, people without Chinese or Japanese names usually appear under a Chinese transliteration of their name, such as "Roni" -> "羅尼". Since "羅" is a valid Chinese surname, this name is caught correctly. However, the result is a false negative whenever the transliteration does not begin with a valid surname.

To deal with this case, we could collect frequent surnames in English and other languages, but that would be heavy work and is not a very elegant solution.