The official PyTorch implementation of VLPT-STD (CVPR 2022).
VLPT-STD is a new pre-training paradigm for scene text detection that only requires text annotations. We propose three vision-language pre-training pretext tasks: image-text contrastive learning (ITC), masked language modeling (MLM), and word-in-image prediction (WIP). Together they learn contextualized, joint representations that enhance the performance of scene text detectors. Extensive experiments on standard benchmarks demonstrate that the proposed paradigm significantly improves the performance of various representative text detectors.
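To make the ITC objective more concrete, here is a minimal sketch of a symmetric image-text contrastive loss. It is written in plain Python for readability and is not taken from this repository: the function names, the temperature value of 0.07, and the use of cosine similarity are all assumptions, and the actual implementation may differ.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def itc_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    image_embs[i] and text_embs[i] are assumed to describe the same sample.
    The temperature is a hypothetical default, not a value from the paper.
    """
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    # Cosine similarity between every image and every text, scaled by temperature.
    sim = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
           for img in images]
    sim_transposed = [list(col) for col in zip(*sim)]
    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_diagonal(sim) + cross_entropy_diagonal(sim_transposed))


matched = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
# Correctly paired embeddings should give a lower loss than mismatched ones.
print(itc_loss(matched, matched) < itc_loss(matched, swapped))  # True
```

The loss is small when each image embedding is closest to its own text embedding and large when the pairs are mismatched, which is the behavior a contrastive pretext task relies on.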
pip3 install -r requirements.txt
Download the SynthText dataset.
- The data folder should be structured as follows.
data
└── SynthText
    ├── 1
    ├── 2
    ├── 3
    ├── ...
    └── gt.mat
- Use write_synthtext_pyarrow.py to convert the dataset to the Arrow format used for pretraining.
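Before running the conversion script, it can help to confirm that the folder layout matches the tree above. The following is a minimal sketch using only the standard library. The helper name `check_synthtext_layout` and the default `data` root are hypothetical and are not part of this repository.

```python
from pathlib import Path


def check_synthtext_layout(root):
    """Return a list of layout problems under `root` (empty if none are found)."""
    synth_dir = Path(root) / "SynthText"
    if not synth_dir.is_dir():
        return [f"missing directory: {synth_dir}"]

    problems = []
    # The annotation file is expected next to the numbered image folders.
    if not (synth_dir / "gt.mat").is_file():
        problems.append(f"missing annotation file: {synth_dir / 'gt.mat'}")
    # Image folders are named 1, 2, 3, ... in the tree shown above.
    numbered = [p for p in synth_dir.iterdir() if p.is_dir() and p.name.isdigit()]
    if not numbered:
        problems.append(f"no numbered image folders found in {synth_dir}")
    return problems


# Prints an empty list when the layout looks right, otherwise the problems found.
print(check_synthtext_layout("data"))
```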
Download the pretrained ResNet-50 at this url.
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 main.py --exp_name base
The performance of EAST, DB, and PSENet (precision P, recall R, and F-measure F, in %) is summarized as follows:
Method | ICDAR2015 P | ICDAR2015 R | ICDAR2015 F | ICDAR2017 P | ICDAR2017 R | ICDAR2017 F | MSRA-TD500 P | MSRA-TD500 R | MSRA-TD500 F |
---|---|---|---|---|---|---|---|---|---|
EAST + SynthText | 89.6 | 81.5 | 85.3 | 75.1 | 61.9 | 67.9 | 86.9 | 77.6 | 82.0 |
EAST + VLPT-STD | 91.5 | 85.4 | 88.3 | 77.7 | 64.6 | 70.5 | 88.5 | 76.7 | 82.2 |

Method | ICDAR2015 P | ICDAR2015 R | ICDAR2015 F | Total-Text P | Total-Text R | Total-Text F | MSRA-TD500 P | MSRA-TD500 R | MSRA-TD500 F |
---|---|---|---|---|---|---|---|---|---|
DB + SynthText | 88.2 | 82.7 | 85.4 | 87.1 | 82.5 | 84.7 | 91.5 | 79.2 | 84.9 |
DB + VLPT-STD | 92.0 | 81.6 | 86.5 | 88.7 | 84.0 | 86.3 | 92.3 | 84.9 | 88.5 |

Method | ICDAR2015 P | ICDAR2015 R | ICDAR2015 F | Total-Text P | Total-Text R | Total-Text F | CTW1500 P | CTW1500 R | CTW1500 F |
---|---|---|---|---|---|---|---|---|---|
PSENet + SynthText | 84.3 | 78.4 | 81.3 | 89.2 | 79.2 | 83.9 | 83.6 | 79.7 | 81.6 |
PSENet + VLPT-STD | 86.0 | 82.8 | 84.3 | 90.8 | 82.0 | 86.1 | 86.3 | 80.7 | 83.3 |
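
The F columns in the tables are consistent with the usual F-measure, the harmonic mean of precision and recall. As a quick check, the sketch below recomputes two F values from the P and R entries reported above. The function name `f_measure` is chosen here for illustration and is not part of this repository.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# EAST + VLPT-STD on ICDAR2015: P = 91.5, R = 85.4
print(round(f_measure(91.5, 85.4), 1))  # 88.3
# DB + VLPT-STD on ICDAR2015: P = 92.0, R = 81.6
print(round(f_measure(92.0, 81.6), 1))  # 86.5
```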
This implementation is based on ViLT.
If you find this work useful, please cite:
@inproceedings{song2022vision,
title={Vision-Language Pre-Training for Boosting Scene Text Detectors},
author={Song, Sibo and Wan, Jianqiang and Yang, Zhibo and Tang, Jun and Cheng, Wenqing and Bai, Xiang and Yao, Cong},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={15681--15691},
year={2022}
}
VLPT-STD is released under the terms of the Apache License, Version 2.0.
VLPT-STD is an algorithm for scene text detection pretraining. The code and models herein, created by the authors from Alibaba, can only be used for research purposes.
Copyright (C) 1999-2022 Alibaba Group Holding Ltd.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.