The HWDB and ICDAR2013 #2
Comments
Thanks for your attention to our work! We initially collected HWDB2.0-2.2, ICDAR2013, and SCUT for the experiments. However, we observed domain gaps in image style among these three datasets (e.g., HWDB2.0-2.2 and ICDAR2013 have clean backgrounds, while SCUT has more complex backgrounds with uneven illumination, grids, etc.), so combining them for training is inefficient. Additionally, HWDB2.0-2.2 and ICDAR2013 have fewer samples (52,220 and 3,432) than SCUT (116,643). The community mainly uses HWDB1.0-1.2 (single Chinese character datasets) to synthesize text-line datasets for training, which is a little inconvenient. We therefore constructed the handwriting dataset based on SCUT only. Thanks for your advice. In any case, we will upload the lmdb-format HWDB2.0-2.2 and ICDAR2013 for further research.
Hello, thank you very much for your reply!
Hello! These datasets were collected from the official websites. You can manually convert them to half-width format for training.
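For readers unsure what the half-width conversion involves, here is a minimal sketch. It assumes that "half-width format" means mapping the Unicode full-width ASCII variants (U+FF01 to U+FF5E) and the ideographic space (U+3000) to their ASCII counterparts; the function name `to_half_width` is hypothetical, and the maintainers' actual preprocessing may differ.

```python
def to_half_width(text: str) -> str:
    """Map full-width ASCII variants to half-width; leave other characters alone."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:
            # Ideographic (full-width) space -> ASCII space
            out.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:
            # Full-width '!'..'~' sit exactly 0xFEE0 above their ASCII forms
            out.append(chr(code - 0xFEE0))
        else:
            # CJK punctuation such as '。' has no ASCII counterpart and is kept
            out.append(ch)
    return "".join(out)


print(to_half_width("ABC123,。"))  # prints: ABC123,。
```

Characters outside that range, such as the ideographic full stop, are left unchanged, so a few symbols will remain full-width after conversion.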
Can you provide the code for processing the DGRL-format data in HWDB, or the HWDB dataset in PNG/JPG format? I ran into a problem at this parsing step and would appreciate your suggestions.
import os
import struct

import cv2 as cv
import numpy as np

dgrl = '/home/dataset/benchmark/temp/offline_handwriting/HWDB2.0Test/006-P16.dgrl'


def le_int(byte_values):
    # Combine little-endian bytes into a Python int.
    # int() avoids uint8 overflow when shifting NumPy scalars.
    return sum(int(b) << (8 * i) for i, b in enumerate(byte_values))


def read_from_dgrl(dgrl, file):
    # `file` is an open, writable text file; it receives one
    # "image_path label" entry per text line.
    if not os.path.exists(dgrl):
        print('DGRL does not exist!')
        return

    dir_name, base_name = os.path.split(dgrl)
    label_dir = dir_name + '_label'
    image_dir = dir_name + '_images'
    os.makedirs(label_dir, exist_ok=True)
    os.makedirs(image_dir, exist_ok=True)

    with open(dgrl, 'rb') as f:
        # Read the header size
        header_size = le_int(np.fromfile(f, dtype='uint8', count=4))

        # Read the rest of the header and extract code_length
        header = np.fromfile(f, dtype='uint8', count=header_size - 4)
        code_length = le_int(header[-4:-2])

        # Read the page size and the number of text lines in the page
        image_record = np.fromfile(f, dtype='uint8', count=12)
        height = le_int(image_record[:4])
        width = le_int(image_record[4:8])
        line_num = le_int(image_record[8:])
        print('Image size:')
        print(height, width, line_num)

        # Read each text line
        for k in range(line_num):
            print(k + 1)

            # Number of characters in this line
            char_num = le_int(np.fromfile(f, dtype='uint8', count=4))
            print('Number of characters:', char_num)

            # Label of this line: char_num codes of code_length bytes each
            raw = np.fromfile(f, dtype='uint8', count=code_length * char_num)
            codes = [le_int(raw[i * code_length:(i + 1) * code_length]) for i in range(char_num)]
            label = [struct.pack('<I', c).decode('gbk', 'ignore')[0] for c in codes]
            print('Before joining:', label)
            label = ''.join(label)
            # Remove invisible \x00 characters; otherwise the saved text
            # contains characters that cannot be seen
            label = label.replace('\x00', '')
            print('After joining:', label)

            # Position and size of this line
            pos_size = np.fromfile(f, dtype='uint8', count=16)
            y = le_int(pos_size[:4])
            x = le_int(pos_size[4:8])
            h = le_int(pos_size[8:12])
            w = le_int(pos_size[12:])

            # Grayscale bitmap of this line
            bitmap = np.fromfile(f, dtype='uint8', count=h * w).reshape(h, w)

            # Save the label and the image
            label_file = os.path.join(label_dir, base_name.replace('.dgrl', '_' + str(k) + '.txt'))
            with open(label_file, 'w', encoding='utf-8') as f1:
                f1.write(label)
            bitmap_file = os.path.join(image_dir, base_name.replace('.dgrl', '_' + str(k) + '.jpg'))
            print(bitmap_file)
            cv.imwrite(bitmap_file, bitmap)
            file.write('{} {}\n'.format(bitmap_file, label.replace(' ', '')))
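For anyone who wants to check the parsing logic without the dataset or OpenCV at hand, the same byte layout can be read with the standard `struct` module. This is only a sketch under the assumption that the layout is exactly what the script above reads (4-byte little-endian fields, code length stored 4 bytes before the end of the header). The names `parse_dgrl` and `build_sample` are hypothetical, and the sample buffer is synthetic, not real HWDB data.

```python
import io
import struct


def read_u32(stream):
    """Read one little-endian unsigned 32-bit integer."""
    return struct.unpack("<I", stream.read(4))[0]


def parse_dgrl(stream):
    """Yield (label, top, left, height, width, pixels) for each text line."""
    header_size = read_u32(stream)
    header_rest = stream.read(header_size - 4)
    # Assumed layout: 2-byte code length located 4 bytes before the header end
    code_length = struct.unpack("<H", header_rest[-4:-2])[0]
    # Page height and width are read but not needed for line extraction
    _page_h, _page_w, line_count = struct.unpack("<3I", stream.read(12))
    for _ in range(line_count):
        char_count = read_u32(stream)
        raw = stream.read(code_length * char_count)
        chars = []
        for i in range(char_count):
            code = raw[i * code_length:(i + 1) * code_length]
            # Decode each GBK code; drop NUL padding and undecodable bytes
            chars.append(code.decode("gbk", "ignore").replace("\x00", ""))
        top, left, height, width = struct.unpack("<4I", stream.read(16))
        pixels = stream.read(height * width)  # one byte per pixel
        yield "".join(chars), top, left, height, width, pixels


def build_sample():
    """Build a tiny synthetic buffer with one 2x3 line labelled '你好'."""
    header = struct.pack("<I", 12) + b"\x00" * 4 + struct.pack("<HH", 2, 8)
    page = struct.pack("<3I", 2, 3, 1)
    label = "你好".encode("gbk")
    line = struct.pack("<I", 2) + label + struct.pack("<4I", 0, 0, 2, 3)
    return header + page + line + bytes(range(6))


for text, top, left, h, w, pixels in parse_dgrl(io.BytesIO(build_sample())):
    print(text, (top, left, h, w), len(pixels))
```

Reading fixed-width fields with `struct.unpack` keeps all values as plain Python integers, which sidesteps any integer-width concerns when combining bytes by hand.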
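Hello, I would like to add experiments on HWDB and ICDAR2013.
"In the original images these are generally anomalous characters (characters that are incompletely displayed or crossed out). When handling such data, should they simply be skipped, or should a new class (an anomaly class) be created for them?" "I also found that ICDAR2013 contains characters that do not appear in HWDB, and some special symbols cannot be converted to half-width. I am also unsure how many classes the character table for the HWDB data should have." I hope these replies resolve your doubts.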
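The two options raised in the quoted question (skipping anomalous characters, or giving them a class of their own) can be compared with a small sketch. It does not reflect a recommendation from the maintainers. The marker character, the token name `<anomaly>`, and the function names are all hypothetical; which marker actually appears depends on how the labels were decoded.

```python
ANOMALY_TOKEN = "<anomaly>"  # hypothetical class name for anomalous characters


def tokenize(label, marker, policy):
    """Split a label into tokens, handling the anomaly marker per policy.

    policy="skip"  drops the marker entirely;
    policy="class" maps it to a dedicated anomaly token.
    """
    if policy not in ("skip", "class"):
        raise ValueError("policy must be 'skip' or 'class'")
    tokens = []
    for ch in label:
        if ch == marker:
            if policy == "class":
                tokens.append(ANOMALY_TOKEN)
            # under "skip" the marker is simply not emitted
        else:
            tokens.append(ch)
    return tokens


def build_charset(labels, marker, policy):
    """Collect the sorted set of classes seen across all labels."""
    classes = set()
    for label in labels:
        classes.update(tokenize(label, marker, policy))
    return sorted(classes)


# '#' stands in for an anomalous character in this toy example
labels = ["你#好", "好的"]
print(len(build_charset(labels, "#", "skip")))
print(len(build_charset(labels, "#", "class")))
```

Building the character table from the labels themselves also shows how many classes a given split contains, and makes characters that appear in ICDAR2013 but not in HWDB easy to list by taking a set difference of the two charsets.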
Thank you very much for your work. Could you please add experimental results on HWDB and ICDAR2013? These two datasets are very important in Chinese handwriting recognition and have been used in a relatively large body of work, which makes it easier to compare the performance of different methods.