A project for building Japanese knowledge graphs with a rule-based method. Image from an article by Jennifer L. Schenker.
In this small project, I built a small file that lets people produce a knowledge graph from Japanese articles. A detailed introduction to knowledge graphs can be found in the article. Basically, the process of making a knowledge graph from articles (or from any other data source, structured or unstructured) can be separated into two steps:
- Knowledge Extraction: In this step, we analyze the dependency relation of each token in the text, extract the entities and relations, and finally store them as Subject-Predicate-Object (SPO) triples. This is where NLP plays a key role in making a knowledge graph.
- Graph Construction: In this step, we store the SPO triples in a graph database and visualize them with tools such as the networkx library or neo4j.
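To make the two steps concrete, here is a minimal sketch of the graph construction side. It assumes the triples have already been extracted, and it uses a plain adjacency dictionary as a stand-in for a graph database. The `build_graph` helper and the sample triples are hypothetical and are not taken from this project's code.

```python
from collections import defaultdict

# Hypothetical SPO triples, as the knowledge extraction step might produce them.
triples = [
    ("私", "食べる", "ケーキ"),
    ("あなた", "飲む", "お茶"),
    ("私", "飲む", "お茶"),
]


def build_graph(spo_triples):
    """Store triples as a directed, edge-labelled adjacency mapping."""
    graph = defaultdict(list)
    for subject, predicate, obj in spo_triples:
        # Each subject node keeps a list of (edge label, target node) pairs.
        graph[subject].append((predicate, obj))
    return dict(graph)


graph = build_graph(triples)
for subject, edges in graph.items():
    for predicate, obj in edges:
        print(f"{subject} -[{predicate}]-> {obj}")
```

With networkx, the same triples could be loaded into a `DiGraph` by calling `add_edge(subject, obj, label=predicate)` for each triple before plotting.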
Several knowledge graph tutorials can be found online, such as a notebook on Kaggle. This project is based on that notebook but rewrites the knowledge extraction part for Japanese instead of English, since English grammar differs greatly from Japanese grammar. For example, in English, the simplest sentence containing a subject, a relation (transitive verb), and an object is written as "I eat a cake." or "You drink tea."; in Japanese, the same sentences are ordered as "I cake eat" (私がケーキを食べる。) or "You tea drink" (あなたがお茶を飲む。). The key difference is that the object comes before the verb, so the algorithm must be modified.

Furthermore, the so-called "形容動詞" (adjectival verbs, often called na-adjectives) may also be used to relate a subject to an object. For example, "I like you." in English is translated as "私は君が好きだ。" in Japanese, where the word "好き" is an adjectival verb, so in principle 君 (you) is not an object in the Japanese sentence. Instead, 君 is an obl (oblique nominal). In this situation, rather than an SPO triple, we should build a Subject-Predicate-Oblique Nominal triple. For more details about obl, please check the website.
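The rule described above can be sketched on hand-annotated tokens. This is only an illustration under stated assumptions: the `Token` type and `extract_triple` function are hypothetical, the dependency labels are written by hand to mirror the analysis in the text (a real parser may label these sentences differently), and the project's actual extraction code may work differently.

```python
from typing import NamedTuple


class Token(NamedTuple):
    text: str
    dep: str   # Universal Dependencies style relation label (hand-assigned here)
    head: int  # index of the head token; the root points to itself


def extract_triple(tokens):
    """Return (subject, predicate, object or oblique), or None if incomplete."""
    root = next((i for i, t in enumerate(tokens) if t.dep == "ROOT"), None)
    if root is None:
        return None
    subject = target = None
    for tok in tokens:
        if tok.head != root:
            continue  # only look at direct dependents of the predicate
        if tok.dep == "nsubj":
            subject = tok.text
        elif tok.dep in ("obj", "obl") and target is None:
            # Accept an oblique nominal where no direct object exists.
            target = tok.text
    if subject is None or target is None:
        return None
    return (subject, tokens[root].text, target)


# 私がケーキを食べる。 (the object precedes the verb)
eat = [
    Token("私", "nsubj", 4), Token("が", "case", 0),
    Token("ケーキ", "obj", 4), Token("を", "case", 2),
    Token("食べる", "ROOT", 4), Token("。", "punct", 4),
]

# 私は君が好きだ。 (君 is annotated as obl, following the text above)
like = [
    Token("私", "nsubj", 4), Token("は", "case", 0),
    Token("君", "obl", 4), Token("が", "case", 2),
    Token("好き", "ROOT", 4), Token("だ", "cop", 4),
    Token("。", "punct", 4),
]

print(extract_triple(eat))
print(extract_triple(like))
```

Because the function reads dependents of the predicate rather than token positions, the same rule covers both the object-before-verb order and the oblique nominal case.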
Just like compound nouns, Japanese also has compound verbs such as "書き始める", "動き出す", and "言いかける". To capture these compound verbs precisely as the relation between subject and object, I further analyze the detailed dependency relations to detect whether a sentence contains a compound verb. Nouns connected to each other by coordinating particles like "や" or "と" are treated as a single entity.
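As a rough illustration of both merging rules, the sketch below uses a surface heuristic on hand-tagged tokens instead of the dependency analysis described above, so it is a simplification and not the project's method. The `merge_spans` function, the coarse POS tags, and the internal `COORD` tag are all assumptions made for this example.

```python
COORDINATORS = {"や", "と"}


def merge_spans(tokens):
    """Merge NOUN + coordinator + NOUN runs and adjacent VERB + VERB pairs.

    tokens: list of (text, pos) pairs with hand-assigned coarse POS tags.
    """
    merged = []
    for text, pos in tokens:
        if merged:
            prev_text, prev_pos = merged[-1]
            # A verb directly after another verb: treat the pair as a compound verb.
            if pos == "VERB" and prev_pos == "VERB":
                merged[-1] = (prev_text + text, "VERB")
                continue
            # A noun after "noun + coordinator": fold all three into one entity.
            if pos == "NOUN" and prev_pos == "COORD":
                if len(merged) >= 2 and merged[-2][1] == "NOUN":
                    noun_text = merged[-2][0]
                    merged[-2:] = [(noun_text + prev_text + text, "NOUN")]
                    continue
        # Tag coordinating particles so that a following noun can attach to them.
        if pos == "ADP" and text in COORDINATORS:
            pos = "COORD"
        merged.append((text, pos))
    return merged


# 彼が本や雑誌を読み始める。 (He starts reading books and magazines.)
sentence = [
    ("彼", "PRON"), ("が", "ADP"), ("本", "NOUN"), ("や", "ADP"),
    ("雑誌", "NOUN"), ("を", "ADP"), ("読み", "VERB"), ("始める", "VERB"),
    ("。", "PUNCT"),
]

for text, pos in merge_spans(sentence):
    print(text, pos)
```

A purely positional rule like this can over-merge, for example when "と" is used as a quotation marker, which is one reason to rely on dependency relations in practice.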
Several improvements are under consideration and may be implemented in the future:
- Different plotting tools: Although the networkx library is popular and good enough for plotting a knowledge graph from scratch, there are many other options for producing knowledge graphs, such as neo4j and GRAPHLYTIC. People are encouraged to use those tools for graph visualization after they finish the knowledge extraction.
- Expansion of knowledge extraction: In theory, SPO triples can also be derived from nouns modified by transitive verbs. For instance, in the phrase "ご飯を食べる人。" (a person who eats rice), the corresponding SPO triple is (人, 食べる, ご飯). An extraction algorithm for this kind of noun may be added in the future.
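One possible shape for the planned noun-modifier extraction is sketched below. It is not implemented in this project; the `extract_from_modifier` function is hypothetical, and the dependency labels (with `acl` marking the verb that modifies the noun) are assigned by hand for the example phrase.

```python
def extract_from_modifier(tokens):
    """Derive SPO triples from nouns modified by a transitive verb.

    tokens: list of (text, dep, head_index) tuples; the root points to itself.
    """
    triples = []
    for i, (text, dep, head) in enumerate(tokens):
        if dep != "acl":
            continue  # only verbs that modify a noun are of interest
        # If the clause has its own subject, the modified noun is not the subject.
        if any(d == "nsubj" and h == i for _, d, h in tokens):
            continue
        modified_noun = tokens[head][0]
        for obj_text, obj_dep, obj_head in tokens:
            if obj_dep == "obj" and obj_head == i:
                triples.append((modified_noun, text, obj_text))
    return triples


# ご飯を食べる人。 (a person who eats rice)
phrase = [
    ("ご飯", "obj", 2), ("を", "case", 0),
    ("食べる", "acl", 3), ("人", "ROOT", 3), ("。", "punct", 3),
]

print(extract_from_modifier(phrase))
```

The subject check is there because in a phrase like "私が食べるご飯" the modified noun is the thing being eaten, not the one eating, so this sketch skips such clauses instead of producing a wrong triple.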