The code provided here allows extraction of names and relationships from notarial acts written in Dutch.

branjbar/MiSS-Information-Extraction

MiSS Information Extraction

The code provided here extracts names and husband-wife relationships from notarial acts written in Dutch.

How to use this code

Download the complete package and run the main file 'nerd_main.py' with Python:

python nerd_main.py

More details

  • nerd_main.py contains the main class Nerd(text), which can be used as
nerd = Nerd(a_piece_of_text)

Once an instance nerd has been created, the references can be extracted by

nerd.get_references()

and the relations can be extracted by

nerd.get_relations()

A highlighted HTML version of the text can be exported with

nerd.get_highlighted_text()
  • module_preprocess.py contains the code for preprocessing the text and removing or correcting bad text patterns
  • module_names.py contains the code for tagging words
  • module_refs contains the code for using the tagged words to extract references
  • module_rels contains the code for detecting husband-wife relationships
  • /db-folder contains the dictionaries required to extract names from the text:
      • first_name.txt: list of frequent Dutch first names
      • last_name_multiple.txt: list of common last names that consist of more than one word
      • starting_words.py: list of words that start a sentence and can interfere with detecting the correct name pattern
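
The dictionary-based word tagging described above can be illustrated with a small sketch. This is not the code in module_names.py: the function name tag_tokens, the tag labels FIRST, LAST and O, and the tiny in-line dictionaries are all assumptions made for illustration, standing in for the much larger lists in the db folder.

```python
import re

# Hypothetical mini-dictionaries standing in for first_name.txt and
# last_name_multiple.txt; the real files are far larger.
FIRST_NAMES = {"jan", "maria", "pieter"}
MULTIWORD_LAST_NAMES = {("van", "den", "berg"), ("de", "vries")}


def tag_tokens(text):
    """Return (token, tag) pairs with tags FIRST, LAST or O (other)."""
    tokens = re.findall(r"\w+", text)
    lowered = [t.lower() for t in tokens]
    tags = ["O"] * len(tokens)

    # Mark multi-word last names first so their parts are not re-tagged.
    for name in MULTIWORD_LAST_NAMES:
        n = len(name)
        for i in range(len(tokens) - n + 1):
            if tuple(lowered[i:i + n]) == name:
                tags[i:i + n] = ["LAST"] * n

    # Mark remaining tokens that appear in the first-name dictionary.
    for i, word in enumerate(lowered):
        if tags[i] == "O" and word in FIRST_NAMES:
            tags[i] = "FIRST"

    return list(zip(tokens, tags))


tagged = tag_tokens("Jan van den Berg en Maria de Vries")
for token, tag in tagged:
    print(f"{token}\t{tag}")
```

Matching the multi-word last names before the single-word lookup keeps particles such as "van" or "de" attached to the surname they belong to, which is presumably why the package ships a separate list for them.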

Evaluations

According to a first evaluation on 48 notarial acts containing 309 individual names, 278 names were extracted correctly and 31 names were not detected (recall: 90%, precision: 91%).
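
As a quick check, the reported recall follows directly from the counts above. The variable names below are chosen for illustration. Precision cannot be recomputed the same way, because the number of incorrectly extracted names (false positives) is not stated.

```python
# Counts taken from the evaluation above.
total_names = 309   # names present in the 48 notarial acts
extracted = 278     # names extracted correctly

missed = total_names - extracted
recall = extracted / total_names

print(f"missed = {missed}")
print(f"recall = {recall:.1%}")
```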

Terms of Use

This code is developed within the MiSS project (http://swarmlab.unimaas.nl/catch/), funded by NWO. This code is free to use. However, it will be highly appreciated if the developer gets notified in case of use (email: [email protected]).
