Named entity recognition (NER) and hyponym/hypernym relationship detection on a patent dataset

The main goals of this project are:

  • Train an NER model on a dataset of patents in a specific domain
  • Fine-tune it with Prodigy
  • Implement automatic detection of hyponyms/hypernyms with Hearst patterns
  • Validate the detection results with several methods, including Wikidata

Structure

Setup

  1. Install dependencies from requirements.txt
  2. Unpack data:
    tar -xvf G06K.txt.gz
  3. Open project.ipynb and run the first cell to check that all imports work properly

Notebook structure

Here is a brief overview of the parts of project.ipynb.

Data processing

In this section, the patent text is read and processed to extract potential named entities using a curated list of terms, manyterms.lower.txt.
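
As a rough illustration of term-list matching, the sketch below finds occurrences of known terms in a text and returns character-offset spans that could serve as NER annotations. It assumes the term list holds one lowercase term per line; the function names, the "TERM" label, and the sample terms are hypothetical and not taken from the notebook.

```python
import re

def load_terms(lines):
    """Normalize a term list: strip, lowercase, drop blanks, longest first."""
    terms = {line.strip().lower() for line in lines if line.strip()}
    return sorted(terms, key=lambda t: (-len(t), t))

def find_term_spans(text, terms, label="TERM"):
    """Return non-overlapping (start, end, label) spans for known terms."""
    if not terms:
        return []
    # Longest-first alternation so "neural network" wins over "network"
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return [(m.start(), m.end(), label) for m in pattern.finditer(text)]

# Hypothetical sample terms; in practice they would be read from the term file
terms = load_terms(["neural network", "network", "image sensor"])
text = "The image sensor feeds a neural network."
spans = find_term_spans(text, terms)
```

Sorting the terms longest first keeps a multi-word term from being split into its shorter sub-terms.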

Training NER model

Next, we train the model on the created dataset.
Additionally, if you have access to Prodigy, you can apply active learning to tune the model.
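
To give an idea of what active learning means here, the sketch below shows uncertainty sampling, one common strategy: the examples the model is least sure about are sent for annotation first. This is not Prodigy's API; the function name and the scores are hypothetical, and it assumes each candidate comes with a model confidence between 0 and 1.

```python
def select_uncertain(scored_examples, k):
    """Pick the k examples whose model score is closest to 0.5."""
    return sorted(scored_examples, key=lambda ex: abs(ex[1] - 0.5))[:k]

# Hypothetical (example_id, model_confidence) pairs
candidates = [("a", 0.98), ("b", 0.52), ("c", 0.10), ("d", 0.41)]
to_annotate = select_uncertain(candidates, k=2)
```

Annotating the borderline cases first tends to give the model more information per label than annotating examples it already scores confidently.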

Hearst patterns for hyponym detection

This section is dedicated to extracting potential entity links (such as hypernyms) using Hearst patterns.
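
A minimal sketch of the idea: Hearst patterns are lexical cues such as "X such as Y" or "Y and other X" that suggest Y is a hyponym of X. The regexes below are a simplified illustration, not the notebook's implementation. They assume single-word terms as a crude stand-in for noun phrases, and the function name is hypothetical.

```python
import re

# Separator between listed items: ", ", ", and ", ", or ", " and ", " or "
_SEP = r"(?:,\s*(?:and\s+|or\s+)?|\s+(?:and|or)\s+)"
# A list of single-word items (a crude stand-in for noun phrases)
_LIST = rf"((?:\w+{_SEP})*\w+)"

# Hypernym first: "sensors such as cameras, lidars and radars"
_FORWARD = re.compile(
    rf"\b(\w+),?\s+(?:such as|including|especially)\s+{_LIST}", re.IGNORECASE
)
# Hyponyms first: "cameras, lidars and other sensors"
_BACKWARD = re.compile(
    rf"\b{_LIST},?\s+(?:and|or)\s+other\s+(\w+)", re.IGNORECASE
)

def hearst_pairs(sentence):
    """Return (hyponym, hypernym) pairs suggested by the patterns above."""
    pairs = []
    for m in _FORWARD.finditer(sentence):
        hypernym, items = m.group(1), m.group(2)
        pairs += [(h.lower(), hypernym.lower()) for h in re.split(_SEP, items)]
    for m in _BACKWARD.finditer(sentence):
        items, hypernym = m.group(1), m.group(2)
        pairs += [(h.lower(), hypernym.lower()) for h in re.split(_SEP, items)]
    return pairs

forward = hearst_pairs("Sensors such as cameras, lidars and radars are used.")
backward = hearst_pairs("We use cameras, lidars and other sensors.")
```

A real pipeline would usually match over noun chunks from a parser rather than single words, so that multi-word terms are captured whole.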

Automatic validation of the results

After extraction, we validate the results automatically using the Wiki API, WordNet, or spaCy embeddings. Here is an example of the validation table after processing:

(Image: example validation table)
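
As a sketch of how embedding-based validation might work, the code below scores each extracted pair by the cosine similarity of its word vectors and flags pairs above a threshold. The toy two-dimensional vectors, the 0.5 threshold, and the function names are assumptions made for illustration. In practice the vectors would come from an embedding model, and similarity is only a rough proxy for a hypernym relation.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def validate_pairs(pairs, vectors, threshold=0.5):
    """Build rows of (hyponym, hypernym, similarity, accepted)."""
    rows = []
    for hyponym, hypernym in pairs:
        if hyponym in vectors and hypernym in vectors:
            score = cosine(vectors[hyponym], vectors[hypernym])
            rows.append((hyponym, hypernym, round(score, 3), score >= threshold))
        else:
            # No vector available: leave the score empty and reject the pair
            rows.append((hyponym, hypernym, None, False))
    return rows

# Hypothetical toy vectors, for illustration only
toy_vectors = {"camera": [1.0, 0.0], "sensor": [1.0, 1.0], "banana": [-1.0, 0.0]}
rows = validate_pairs(
    [("camera", "sensor"), ("banana", "sensor"), ("lidar", "sensor")],
    toy_vectors,
)
```

Pairs without a vector are kept in the table with an empty score so they can still be checked by another method, such as a WordNet or Wikidata lookup.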