Text Recognition with ViTSTR

Introduction

This is a mini-project implementing a text recognition task using ViTSTR (Vision Transformer for Scene Text Recognition). The method is inspired by this public repository created by roatienza, which was built on a fork of the CLOVA AI Deep Text Recognition Benchmark. The project is also based on the paper Vision Transformer for Fast and Efficient Scene Text Recognition.

ViTSTR is a simple single-stage model that uses a pre-trained Vision Transformer (ViT) to perform scene text recognition. It achieves accuracy comparable to state-of-the-art STR models while using significantly fewer parameters and FLOPS. ViTSTR is also fast thanks to the parallel computation inherent in the ViT architecture.

ViTSTR Architecture

The main advantages of using ViTSTR for text recognition are its simplicity and efficiency. Instead of the usual multi-stage pipelines (three- or four-stage designs), ViTSTR performs recognition with a single stage: a Transformer encoder. The comparison is shown in the figure below.

STR design patterns
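
For intuition, the single-stage idea can be sketched in a few lines of PyTorch using timm's ViT as the backbone. This is an illustrative sketch, not the repository's actual model code: the class name, charset size, and maximum text length below are assumptions.

```python
# Minimal sketch of the ViTSTR idea: a ViT encoder whose output tokens are
# classified into characters. Charset size and max text length are assumed.
import torch
import torch.nn as nn
import timm  # assumes a recent timm where forward_features returns all tokens


class ViTSTRSketch(nn.Module):
    def __init__(self, num_classes: int = 96, max_text_len: int = 25):
        super().__init__()
        # ViT backbone; num_classes=0 drops the classification head.
        # Set pretrained=True to load ImageNet weights as in the paper.
        self.backbone = timm.create_model(
            "vit_base_patch16_224", pretrained=False, num_classes=0
        )
        self.max_text_len = max_text_len
        # One shared linear head maps each token embedding to character logits.
        self.head = nn.Linear(self.backbone.embed_dim, num_classes)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        tokens = self.backbone.forward_features(images)  # (B, 1 + N, D)
        tokens = tokens[:, : self.max_text_len]          # first tokens -> characters
        return self.head(tokens)                         # (B, max_text_len, num_classes)


if __name__ == "__main__":
    logits = ViTSTRSketch()(torch.randn(1, 3, 224, 224))
    print(logits.shape)  # torch.Size([1, 25, 96])
```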

Tutorial

Clone the project

git clone https://github.com/zogojogo/text-recognition-wii.git

Go to the project directory

cd text-recognition-wii

Install the dependencies

pip install -r requirements.txt

Start the API service

python3 app.py

API Reference

Service: http://your-ip-address:8080

POST image

  POST /segment_lung

Content-Type: multipart/form-data

| Name  | Type | Description                                |
| ----- | ---- | ------------------------------------------ |
| image | file | Required. image/png or image/jpg MIME type |
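
The endpoint can be exercised with a short client snippet like the one below (a hypothetical example using the requests library; adjust the host, port, and image path to your deployment):

```python
# Hypothetical client for the endpoint documented above.
import requests

url = "http://localhost:8080/segment_lung"  # replace localhost with your IP address
with open("sample.png", "rb") as f:
    response = requests.post(url, files={"image": ("sample.png", f, "image/png")})

print(response.status_code)
print(response.json())
```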

Output Example

{
  "filename": "<filename>",
  "contentype": "<image type>",
  "output text": "<predicted text>",
  "inference time": "<inference time>"
}
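
A response of this shape could come from a handler along the lines of the sketch below. This is not the project's app.py: the web framework (FastAPI) and the recognize() placeholder are assumptions, and the actual ViTSTR inference and decoding are omitted.

```python
# Hypothetical FastAPI sketch producing a response with the fields above.
# recognize() stands in for the actual ViTSTR preprocessing, inference,
# and token decoding performed by the project.
import time

from fastapi import FastAPI, File, UploadFile
import uvicorn

app = FastAPI()


def recognize(image_bytes: bytes) -> str:
    # Placeholder for ViTSTR inference and character decoding.
    return "predicted text"


@app.post("/segment_lung")
async def predict(image: UploadFile = File(...)):
    data = await image.read()
    start = time.time()
    text = recognize(data)
    elapsed = time.time() - start
    return {
        "filename": image.filename,
        "contentype": image.content_type,
        "output text": text,
        "inference time": f"{elapsed:.4f} s",
    }


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8080)
```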