This mini-project implements text recognition using ViTSTR (Vision Transformer for Scene Text Recognition). The method is inspired by the public repository created by roatienza, which was built from a fork of the CLOVA AI Deep Text Recognition Benchmark. The project is also based on the paper Vision Transformer for Fast and Efficient Scene Text Recognition.
ViTSTR is a simple single-stage model that uses a pre-trained Vision Transformer (ViT) to perform Scene Text Recognition (STR). Its accuracy is comparable to that of state-of-the-art STR models, although it uses significantly fewer parameters and FLOPS. ViTSTR is also fast because of the parallel computation inherent to the ViT architecture.
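As a rough illustration of why single-stage, parallel prediction is cheap to decode, the sketch below picks the best character at every output position independently and stops at an end token. The character set, the `[GO]` and `[s]` token names, and the helper functions are assumptions made for this example; they are not taken from this project's code.

```python
# Hypothetical character set: index 0 is a start token, index 1 an end token.
CHARSET = ["[GO]", "[s]"] + list("0123456789abcdefghijklmnopqrstuvwxyz")


def greedy_decode(scores):
    """Pick the best-scoring character per position, cut at the first [s]."""
    # Each position is decoded independently; there is no recurrent step.
    indices = [max(range(len(row)), key=row.__getitem__) for row in scores]
    chars = []
    for idx in indices:
        token = CHARSET[idx]
        if token == "[s]":
            break  # everything after the end token is ignored
        if token != "[GO]":
            chars.append(token)
    return "".join(chars)


def one_hot(token):
    """Build a fake score row that strongly prefers one token (demo only)."""
    row = [0.0] * len(CHARSET)
    row[CHARSET.index(token)] = 1.0
    return row


# Fake model output for five positions; a real model would produce these scores.
scores = [one_hot(t) for t in ["c", "a", "t", "[s]", "x"]]
print(greedy_decode(scores))  # prints "cat"
```

In a real model the score rows would come from the Transformer encoder outputs rather than from `one_hot`.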
The main advantages of using ViTSTR for text recognition are simplicity and efficiency. Instead of the usual multi-stage pipeline (three or four steps), ViTSTR performs recognition with a single stage, a Transformer encoder. The comparison is shown in the figure below.
Clone the project
git clone https://github.com/zogojogo/text-recognition-wii.git
Go to the project directory
cd text-recognition-wii
Install dependencies
pip install -r requirements.txt
Start API service
python3 app.py
Service: http://your-ip-address:8080
POST /segment_lung
Content-Type: multipart/form-data
Name | Type | Description |
---|---|---|
image | file | Required. image/png or image/jpg MIME type |
Output:
{
"filename": "<filename>",
"contentype": "<image type>",
"output text": "<predicted text>",
"inference time": "<inference time>"
}
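The request described above can be sketched as a small client using only the Python standard library. This is a minimal sketch, not part of the project: the helper names `build_multipart` and `recognize` are hypothetical, the form field name `image` is taken from the table above, and the base URL is the placeholder from the Service line.

```python
import json
import mimetypes
import os
import urllib.request
import uuid


def build_multipart(field, filename, data, boundary=None):
    """Encode one file as a multipart/form-data body; return (body, headers)."""
    boundary = boundary or uuid.uuid4().hex
    # Guess the MIME type from the file extension, e.g. image/png for .png.
    content_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return head + data + tail, headers


def recognize(image_path, base_url="http://your-ip-address:8080"):
    """POST an image to the service and return the parsed JSON response."""
    with open(image_path, "rb") as f:
        body, headers = build_multipart("image", os.path.basename(image_path), f.read())
    request = urllib.request.Request(
        f"{base_url}/segment_lung", data=body, headers=headers, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the service running, `recognize("word.png")["output text"]` would return the predicted text, where `word.png` stands for any local PNG or JPG image.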