Named Entity Recognition (NER) identifies and extracts the entities mentioned in a text, such as names, locations, dates, and organisations. It is especially useful for very long texts where one needs to understand the context quickly: in political articles, to get an idea of who or what is being discussed, or in research papers, to identify the known keywords.
Since NER is most helpful on large text data, which predominantly comes in the form of web pages or PDFs, this project focuses on extracting entities from PDF documents.
- Develop an OCR system for extracting text from PDFs.
- Apply a pretrained spaCy model for named entity recognition on the extracted text.
- Convert the output entities to a dataframe and display the output.
- Deploy the project.
- On the app, users should be able to see the dataframe and download it as a .csv or .txt (optional).
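The NER and dataframe steps listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names `entity_rows`, `to_dataframe`, and `run_ner`, the column names, and the choice of the `en_core_web_sm` model are all assumptions made for the example. It relies only on spaCy's documented `Doc.ents`, `Span.text`, and `Span.label_` attributes.

```python
def entity_rows(doc):
    """Collect one {"entity", "label"} row per entity found in a spaCy Doc.

    Only `doc.ents`, `ent.text`, and `ent.label_` are used, so any object
    exposing those attributes works here.
    """
    return [{"entity": ent.text, "label": ent.label_} for ent in doc.ents]


def to_dataframe(rows):
    """Turn the rows into a pandas DataFrame (column names are assumed)."""
    # Imported here so entity_rows() stays usable without pandas installed.
    import pandas as pd

    return pd.DataFrame(rows, columns=["entity", "label"])


def run_ner(text, model_name="en_core_web_sm"):
    """Run a pretrained spaCy pipeline on `text` and return a DataFrame.

    The model name is an assumption; any installed spaCy pipeline with an
    NER component could be used instead.
    """
    # Imported here so the helpers above do not require spaCy.
    import spacy

    nlp = spacy.load(model_name)
    return to_dataframe(entity_rows(nlp(text)))
```

With spaCy, pandas, and the model installed, `run_ner(text).to_csv("entities.csv", index=False)` would be one way to produce the downloadable .csv mentioned in the last objective.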
- Clone this repo.
- Create a virtual environment.
- Run `bash setup.sh`.
- Run `python pdf_ner.py --file_path <path/to/pdf>`, replacing `<path/to/pdf>` with the path to a PDF file.
- Run `python recognize.py --file_path <path/to/pdf>`, replacing `<path/to/pdf>` with the path to a PDF file.
The output is a display of the dataframe in the terminal and a .txt file containing the extracted entities.
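For readers curious how a script could accept `--file_path` and produce a .txt file, here is a hedged sketch using only the standard library. It does not reproduce the internals of `pdf_ner.py` or `recognize.py`, which are not shown here. The helper names `parse_args`, `output_path_for`, and `write_entities`, the rule of naming the output after the PDF, and the tab-separated layout are hypothetical choices for illustration.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the --file_path option used in the commands above."""
    parser = argparse.ArgumentParser(
        description="Extract named entities from a PDF."
    )
    parser.add_argument(
        "--file_path",
        required=True,
        type=Path,
        help="Path to the input PDF file.",
    )
    return parser.parse_args(argv)


def output_path_for(pdf_path):
    """Hypothetical naming rule: same location and name, .txt suffix."""
    return pdf_path.with_suffix(".txt")


def write_entities(rows, out_path):
    """Write one 'entity<TAB>label' line per row (assumed layout)."""
    with open(out_path, "w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(f"{row['entity']}\t{row['label']}\n")
```

A script built this way would call `parse_args()`, run OCR and NER on `args.file_path`, print the dataframe, and then pass the entity rows to `write_entities`.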