Document Identifier is software that harnesses GCP APIs together with machine learning to sort documents into folders.
Legal documents can include passports, driving licences, etc. from any country. The machine learning component is implemented using TensorFlow.
Currently the program supports jpg images and docx files.
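To make the idea concrete, here is a minimal sketch of the kind of flow described above: OCR a scanned jpg with the GCP Vision API and route it to a folder based on the extracted text. The keyword rules and folder names below are purely illustrative assumptions; the actual project combines the Vision API with its custom-trained TensorFlow models.

```python
# Illustrative sketch only -- not the project's actual pipeline.
# OCR a scanned jpg with the GCP Vision API and file it by a naive keyword match.
import shutil
from pathlib import Path

from google.cloud import vision  # pip install google-cloud-vision


def classify_and_move(image_path: str, output_root: str = "output") -> Path:
    client = vision.ImageAnnotatorClient()  # picks up GOOGLE_APPLICATION_CREDENTIALS
    with open(image_path, "rb") as f:
        image = vision.Image(content=f.read())

    response = client.text_detection(image=image)
    text = response.full_text_annotation.text.lower()

    # Hypothetical routing rules; the real project also relies on its trained models.
    if "passport" in text:
        folder = "passport"
    elif "driving licence" in text or "driver license" in text:
        folder = "driving_licence"
    else:
        folder = "unknown"

    dest = Path(output_root) / folder
    dest.mkdir(parents=True, exist_ok=True)
    return Path(shutil.move(image_path, str(dest / Path(image_path).name)))
```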
- Clone this repo
- Save your GCP key as "key.json" in the resources folder (see the credentials note after this list)
- Clone the TensorFlow models repo from https://github.com/tensorflow/models
- Copy the models dir to the ml folder
- Download faster_rcnn_inception_v2_coco_2018_01_28 from http://download.tensorflow.org/models/object_detection/faster_rcnn_inception_v2_coco_2018_01_28.tar.gz
- Extract the faster_rcnn_inception_v2_coco_2018_01_28 folder to the ml\models\research\object_detection folder
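If the project authenticates through Google's standard Application Default Credentials mechanism (an assumption; the code may instead load the key file explicitly), pointing GOOGLE_APPLICATION_CREDENTIALS at the saved key is enough for the GCP client libraries to find it:

```python
# Sketch: let the GCP client libraries find the key saved in the resources folder.
# Assumes authentication via Application Default Credentials.
import os

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = os.path.join("resources", "key.json")
```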
Run the following:
- pip install -r requirements.txt
- python -m spacy download en_core_web_lg
- conda install -c anaconda protobuf
Change directory to the \ml\models\research directory and run the following:
- protoc --python_out=. .\object_detection\protos\anchor_generator.proto .\object_detection\protos\argmax_matcher.proto .\object_detection\protos\bipartite_matcher.proto .\object_detection\protos\box_coder.proto .\object_detection\protos\box_predictor.proto .\object_detection\protos\eval.proto .\object_detection\protos\faster_rcnn.proto .\object_detection\protos\faster_rcnn_box_coder.proto .\object_detection\protos\grid_anchor_generator.proto .\object_detection\protos\hyperparams.proto .\object_detection\protos\image_resizer.proto .\object_detection\protos\input_reader.proto .\object_detection\protos\losses.proto .\object_detection\protos\matcher.proto .\object_detection\protos\mean_stddev_box_coder.proto .\object_detection\protos\model.proto .\object_detection\protos\optimizer.proto .\object_detection\protos\pipeline.proto .\object_detection\protos\post_processing.proto .\object_detection\protos\preprocessor.proto .\object_detection\protos\region_similarity_calculator.proto .\object_detection\protos\square_box_coder.proto .\object_detection\protos\ssd.proto .\object_detection\protos\ssd_anchor_generator.proto .\object_detection\protos\string_int_label_map.proto .\object_detection\protos\train.proto .\object_detection\protos\keypoint_box_coder.proto .\object_detection\protos\multiscale_anchor_generator.proto .\object_detection\protos\graph_rewriter.proto
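On Linux/macOS (or any shell that expands wildcards) the same compilation can most likely be done in one line instead of listing every .proto file:
- protoc object_detection/protos/*.proto --python_out=.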
- Delete all checkpoints in the training folder
In case of an error while running the first command, consider running the terminal with administrator privileges.
- python main.py
  - -m : predict using ML only
  - -t : train the model
  - -i : use another dir for input (the input dir is used by default)
- After the model is trained, press Ctrl + C and run the following command from the object_detection dir (where XXXX is the highest checkpoint number):
- python export_inference_graph.py --input_type image_tensor --pipeline_config_path training/faster_rcnn_inception_v2_pets.config --trained_checkpoint_prefix training/model.ckpt-XXXX --output_directory inference_graph
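The main.py flags above (-m, -t, -i) might be wired up roughly as in the sketch below; this is a guess at the argument handling rather than the actual code, and the print statements stand in for the real train/predict entry points.

```python
# Rough sketch of main.py-style flag parsing; the real implementation may differ.
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Sort documents into folders")
    parser.add_argument("-m", action="store_true", help="predict using ML only")
    parser.add_argument("-t", action="store_true", help="train the model")
    parser.add_argument("-i", default="input", help="input directory (default: input)")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    if args.t:
        print("training the model...")            # placeholder for the training entry point
    elif args.m:
        print(f"ML-only prediction on {args.i}")  # placeholder for the prediction entry point
    else:
        print(f"running prediction on {args.i}")  # placeholder for the default pipeline
```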
- Remove the dependency on the user for running export_inference_graph
- Get keys for GCP
- Create a global config file with paths to path_labels etc. in predict.py
- Create a script to build more complex training data from the existing training data
- Fix the bug where the file already exists (in the move_doc function)
- Extract images from docx files and rewrite them into new docx files (according to their content)
- Segregate multiple docs within a single image (as the Google Vision API does a bad job at this, we will use our custom-trained models)
- Developed a fairly stable version of the document identifier
- Remove Darknet dependencies (function for converting the data to YOLO format, etc.)
- PPT stating the reasons for choosing TF or Darknet
- Fix the predictor
- Program reads all the files from the dir
- Integrate the prediction code with main.py
- Code to simplify training by removing the manual steps involved
- Integrate the training code with main.py
- Remove the dependency on the user for running set PYTHONPATH