Named Entity Recognition in PDFs

Named Entity Recognition (NER) extracts the entities mentioned in a text, such as names, locations, dates and organisations. It is especially useful for very long texts where one needs to grasp the context quickly: in political articles, for example, to get an idea of who or what is being discussed, or in research papers, to identify known keywords.

Since NER is most helpful on large bodies of text, which predominantly come as web pages or PDFs, this project focuses on extracting entities from PDF documents.

Steps taken in the project:

Develop an OCR system for extracting text from PDFs.
Apply a spaCy pretrained model for named entity recognition on the extracted text.
Convert the output entities to a dataframe and display the output.
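
The last two steps can be sketched roughly as follows. This is a minimal illustration, not the code in pdf_ner.py: the function names, the column names and the "en_core_web_sm" model are all assumptions, since the README does not say which pretrained spaCy model is used.

```python
from typing import Callable, Dict, Iterable, List


def entities_to_rows(entities: Iterable) -> List[Dict[str, object]]:
    """Turn spaCy entity spans into plain dict rows (column names are assumed)."""
    return [
        {
            "text": ent.text,          # the entity as it appears in the text
            "label": ent.label_,       # spaCy's entity type, e.g. PERSON or GPE
            "start": ent.start_char,   # character offsets into the source text
            "end": ent.end_char,
        }
        for ent in entities
    ]


def extract_entities(text: str, nlp: Callable) -> List[Dict[str, object]]:
    """Run a spaCy pipeline over the text and return one row per entity."""
    doc = nlp(text)
    return entities_to_rows(doc.ents)


def load_pipeline(model_name: str = "en_core_web_sm"):
    """Load a pretrained spaCy pipeline. The model name is an assumption."""
    import spacy  # imported lazily so the helpers above work without spaCy

    return spacy.load(model_name)


def rows_to_dataframe(rows: List[Dict[str, object]]):
    """Build a pandas DataFrame from the entity rows."""
    import pandas as pd  # imported lazily for the same reason

    return pd.DataFrame(rows, columns=["text", "label", "start", "end"])
```

With spaCy, the chosen model and pandas installed, `rows_to_dataframe(extract_entities(text, load_pipeline()))` would give a dataframe that can be printed in the terminal. Passing the pipeline in as an argument keeps the extraction step easy to test with a stand-in.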

TODO:

Deploy the project.
On the app, users should be able to see the dataframe and download it as a .csv or .txt (optional).
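
One possible way to prepare the .csv download is sketched below. It is only an assumption about how this TODO might be approached: the function name and the "text"/"label" columns are hypothetical, and the returned string would still need to be handed to whatever app framework is eventually chosen.

```python
import csv
import io
from typing import Dict, Iterable, Sequence


def rows_to_csv(rows: Iterable[Dict[str, object]],
                fieldnames: Sequence[str] = ("text", "label")) -> str:
    """Serialise entity rows to CSV text (column names are assumed)."""
    buffer = io.StringIO(newline="")
    # extrasaction="ignore" drops any keys that are not listed in fieldnames.
    writer = csv.DictWriter(buffer, fieldnames=list(fieldnames),
                            extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

The csv module quotes values that contain commas, so entity text such as a place name with a comma in it stays in a single column.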

Run with a Linux CLI:

Clone this repo.
Create a virtual environment.
Run bash setup.sh

To extract entities from a PDF:

Run python pdf_ner.py --file_path <path/to/pdf>, replacing <path/to/pdf> with the path to a PDF file.
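
For readers curious how a script can accept the --file_path flag, here is a minimal argparse sketch. Only the flag name comes from this README; the function names and the extension check are assumptions and may differ from what pdf_ner.py actually does.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts the --file_path flag described above."""
    parser = argparse.ArgumentParser(
        description="Extract named entities from a PDF."
    )
    parser.add_argument(
        "--file_path",
        type=Path,
        required=True,
        help="Path to the input PDF file.",
    )
    return parser


def validate_pdf_path(path: Path) -> Path:
    """Reject paths that do not end in .pdf (an assumed sanity check)."""
    if path.suffix.lower() != ".pdf":
        raise ValueError(f"expected a .pdf file, got: {path}")
    return path
```

A script would typically call `build_parser().parse_args()` and pass `args.file_path` through `validate_pdf_path` before doing any OCR.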

To convert a PDF to text only:

Run python recognize.py --file_path <path/to/pdf>, replacing <path/to/pdf> with the path to a PDF file.
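
The OCR step could look something like the sketch below. This README does not name the OCR libraries used by recognize.py, so pdf2image and pytesseract are assumptions chosen for illustration, as are all the function names. Both libraries also need external tools (Poppler and Tesseract) installed on the system.

```python
from typing import Callable, Iterable


def _render_pages(pdf_path: str) -> Iterable:
    """Render each PDF page to an image (pdf2image is an assumed dependency)."""
    from pdf2image import convert_from_path

    return convert_from_path(pdf_path)


def _ocr_page(image) -> str:
    """Run OCR on one page image (pytesseract is an assumed dependency)."""
    import pytesseract

    return pytesseract.image_to_string(image)


def pdf_to_text(pdf_path: str,
                render_pages: Callable = _render_pages,
                ocr_page: Callable = _ocr_page) -> str:
    """OCR every page and join the non-empty pages with blank lines."""
    pages = (ocr_page(image).strip() for image in render_pages(pdf_path))
    return "\n\n".join(page for page in pages if page)
```

The renderer and the OCR function are parameters so that either can be swapped for another backend without touching the joining logic.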

Output

A display of the dataframe in the terminal and a .txt file containing the extracted entities.
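
As a rough picture of how such a .txt file might be produced, the sketch below formats entity rows as an aligned plain-text table and writes it to disk. The function names, the "text"/"label" columns and the layout are assumptions; the actual file written by pdf_ner.py may look different.

```python
from pathlib import Path
from typing import Dict, List, Sequence


def format_table(rows: List[Dict[str, object]],
                 columns: Sequence[str] = ("text", "label")) -> str:
    """Render rows as a left-aligned text table (column names are assumed)."""
    # Each column is as wide as its longest cell, header included.
    widths = [
        max([len(col)] + [len(str(row[col])) for row in rows])
        for col in columns
    ]

    def fmt(values) -> str:
        cells = (str(value).ljust(width) for value, width in zip(values, widths))
        return "  ".join(cells).rstrip()

    lines = [fmt(columns)]
    lines += [fmt([row[col] for col in columns]) for row in rows]
    return "\n".join(lines)


def save_entities(rows: List[Dict[str, object]], out_path) -> Path:
    """Write the formatted table to a UTF-8 text file and return its path."""
    out_path = Path(out_path)
    out_path.write_text(format_table(rows) + "\n", encoding="utf-8")
    return out_path
```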

Repository: sharonibejih/pdfs-ner
