<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="description" content="">
<meta name="author" content="">
<title> ICDAR 2019 cTDaR | Dataset</title>
<link href="css/bootstrap.min.css" rel="stylesheet">
<link href="css/font-awesome.min.css" rel="stylesheet">
<link href="css/animate.min.css" rel="stylesheet">
<link href="css/lightbox.css" rel="stylesheet">
<link href="css/main.css" rel="stylesheet">
<link href="css/responsive.css" rel="stylesheet">
<!--[if lt IE 9]>
<script src="js/html5shiv.js"></script>
<script src="js/respond.min.js"></script>
<![endif]-->
<link rel="shortcut icon" href="images/ico/favicon.ico">
<link rel="apple-touch-icon-precomposed" sizes="144x144" href="images/ico/apple-touch-icon-144-precomposed.png">
<link rel="apple-touch-icon-precomposed" sizes="114x114" href="images/ico/apple-touch-icon-114-precomposed.png">
<link rel="apple-touch-icon-precomposed" sizes="72x72" href="images/ico/apple-touch-icon-72-precomposed.png">
<link rel="apple-touch-icon-precomposed" href="images/ico/apple-touch-icon-57-precomposed.png">
</head><!--/head-->
<body>
<header id="header">
<div class="container">
<div class="row">
<div class="col-sm-12 overflow">
</div>
</div>
</div>
<div class="navbar navbar-inverse" role="banner">
<div class="container">
<div class="navbar-header">
<button type="button" class="navbar-toggle" data-toggle="collapse" data-target=".navbar-collapse">
<!-- <span class="sr-only">Toggle navigation</span> -->
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a class="navbar-brand" href="index.html">
<h1><img src="images/logo.png" alt="logo" style="width: 45%; height: 40%; max-width: 500px; padding-left: 45px" ></h1>
</a>
</div>
<div class="collapse navbar-collapse">
<ul class="nav navbar-nav navbar-right">
<li><a href="index.html">Home</a></li>
<li class="dropdown"><a href="#">Tasks<i class="fa fa-angle-down"></i></a>
<ul role="menu" class="sub-menu">
<li><a href="tasks.html">Tasks</a></li>
<li><a href="results.html">Results</a></li>
</ul>
</li>
<li class="dropdown active"><a href="#">Dataset<i class="fa fa-angle-down"></i></a>
<ul role="menu" class="sub-menu">
<li><a href="dataset-description.html">Description</a></li>
<li><a href="dataset-training.html">Training Dataset</a></li>
<li><a href="dataset-testing.html">Test Dataset</a></li>
</ul>
</li>
<li><a href="evaluation.html">Evaluation</a></li>
<li><a href="organizers.html">Organizers</a></li>
<li><a href="faq.html">FAQ</a></li>
</ul>
</div>
<!-- <div class="search">
<form role="form">
<i class="fa fa-search"></i>
<div class="field-toggle">
<input type="text" class="search-form" autocomplete="off" placeholder="Search">
</div>
</form>
</div> -->
</div>
</div>
</header>
<!--/#header-->
<section id="page-breadcrumb">
<div class="vertical-center sun">
<div class="container">
<div class="row">
<div class="action">
<div class="col-sm-12">
<h1 class="title">Dataset Description</h1>
</div>
</div>
</div>
</div>
</div>
</section>
<!--/#page-breadcrumb-->
<section id="features">
<div class="container">
<div class="row justify-content-center">
<div class="single-features">
<div class="col-sm-10 col-sm-offset-1 wow fadeInRight" data-wow-duration="500ms" data-wow-delay="300ms">
<p>
The dataset consists of modern and archival documents in various formats, including document images and born-digital formats such as PDF. The annotations cover the table entities and cell entities in each document; nested tables are not handled. We gathered 1000 modern and 1000 archival documents as the test dataset for the table region detection task, and 80 documents as the test dataset for the table recognition task (see the example figures below).
</p>
<p>The samples can be downloaded <a href="https://github.com/cndplab-founder/ICDAR2019_cTDaR.git">here.</a></p>
</div>
</div>
<div class="single-features">
<div class="col-sm-5 col-sm-offset-1 wow fadeInRight" data-wow-duration="500ms" data-wow-delay="300ms" align="center">
<img src="images/data/sample-c.jpg" class="img-responsive" alt="">
<div class="w-100"></div>
<p> Figure (a) : modern dataset</p>
</div>
<div class="col-sm-5 wow fadeInRight" data-wow-duration="500ms" data-wow-delay="300ms" align="center">
<img src="images/data/sample-d.jpg" class="img-responsive" alt="">
<div class="w-100"></div>
<p> Figure (b) : historical dataset</p>
</div>
</div>
<!-- <div class="single-features">
<div class="col-sm-6 col-sm-offset-3 wow fadeInRight" data-wow-duration="500ms" data-wow-delay="300ms" align="center"> -->
<!-- <div class="span1" style="float: none;margin: 0 auto;"> -->
<!-- <img src="images/data/sample-a.jpg" class="img-responsive" alt="" style="width: 60%; height: 50%; max-width: 500px"> -->
<!-- <img src="images/data/sample-a.jpg" class="img-responsive" alt="">
<div class="w-100"></div>
<p> Figure (a) : closed printed table</p>
<br>
<img src="images/data/sample-b.jpg" class="img-responsive" alt="">
<div class="w-100"></div>
<p> Figure (b) : less-ruling-line printed table</p>
<br>
<img src="images/data/sample-c.jpg" class="img-responsive" alt="">
<div class="w-100"></div>
<p> Figure (c) : colored printed table</p>
<br>
<img src="images/data/sample-d.jpg" class="img-responsive" alt="">
<div class="w-100"></div>
<p> Figure (d) : handwritten table</p>
<br>
<p>
Example of dataset categories:
<br>
(a, b, c) modern dataset; (d) archival dataset.
</p>
</div>
</div> -->
<div class="single-features">
<div class="col-sm-10 col-sm-offset-1 wow fadeInRight" data-wow-duration="500ms" data-wow-delay="300ms">
<h2>
Ground Truth Format
</h2>
<p>
To annotate the dataset, we use a notation derived from the ICDAR 2013 Table Competition format, creating a single XML file to store the structures.
</p>
<p>
In the XML file, each &lt;table&gt; element corresponds to a table and contains a single &lt;Coords&gt; element whose [points] attribute indicates the coordinates of the bounding polygon with 4 vertices. A table also contains a list of &lt;cell&gt; elements. For each &lt;cell&gt; element, the attributes [start-row], [start-col], [end-row] and [end-col] denote its position in the table, and [id] is a unique numerical identifier for the cell. The &lt;Coords&gt; element of a &lt;cell&gt; element denotes the coordinates of the bounding polygon of the cell box, and &lt;content&gt; is the text within the cell (optional for submission).
</p>
<br>
<div id="annotation-example">
<p>&lt;?xml version="1.0" encoding="UTF-8"?&gt;</p>
<p>&lt;document filename="table.jpg"&gt;</p>
<div style="text-indent:30px;">
<p>&lt;table&gt;</p>
<div style="text-indent:60px;">
<p>&lt;Coords points="92,442 92,528 350,528 350,442"/&gt;</p>
<p>&lt;cell start-row="0" start-col="1" end-row="0" end-col="1"&gt;</p>
<div style="text-indent:90px;">
<p>&lt;Coords points="154,442 154,453 200,453 200,442"/&gt;</p>
<p>&lt;content&gt;IndustryA&lt;/content&gt;</p>
</div>
<p>&lt;/cell&gt;</p>
<p> ... </p>
<p>&lt;cell start-row="4" start-col="4" end-row="4" end-col="4"&gt;</p>
<div style="text-indent:90px;">
<p>&lt;Coords points="334,517 334,528 350,528 350,517"/&gt;</p>
<p>&lt;content&gt;660&lt;/content&gt;</p>
</div>
<p>&lt;/cell&gt;</p>
</div>
<p>&lt;/table&gt;</p>
<p> ... </p>
<p>&lt;table&gt;</p>
<div style="text-indent:60px;">
<p>&lt;Coords points="414,442 414,528 673,528 673,442"/&gt;</p>
<p>&lt;cell start-row="0" start-col="1" end-row="0" end-col="1"&gt;</p>
<div style="text-indent:90px;">
<p>&lt;Coords points="477,442 477,453 522,453 522,442"/&gt;</p>
<p>&lt;content&gt;IndustryB&lt;/content&gt;</p>
</div>
<p>&lt;/cell&gt;</p>
<p> ... </p>
</div>
<p>&lt;/table&gt;</p>
<p> ... </p>
</div>
<p>&lt;/document&gt;</p>
<!-- <p>< ></p> -->
</div>
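<p>
As an illustration of how the format described here might be read, the following is a minimal Python sketch that uses the standard-library <code>xml.etree.ElementTree</code> module. The names <code>parse_points</code>, <code>Cell</code>, <code>Table</code> and <code>parse_ground_truth</code> are hypothetical helpers introduced for this example and are not part of any official competition tooling. The sketch assumes integer coordinates and treats the [id] attribute and the content element as optional.
</p>
<pre>
```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


def parse_points(points):
    """Turn a 'x1,y1 x2,y2 ...' string into a list of (x, y) integer tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


@dataclass
class Cell:
    # Hypothetical container for one cell element.
    start_row: int
    start_col: int
    end_row: int
    end_col: int
    polygon: list
    cell_id: str = None   # [id] attribute; None when the attribute is absent
    content: str = None   # cell text; None when no content element is present


@dataclass
class Table:
    # Hypothetical container for one table element.
    polygon: list
    cells: list = field(default_factory=list)


def parse_ground_truth(xml_text):
    """Parse one annotation document; returns (filename, list of Table)."""
    root = ET.fromstring(xml_text)
    tables = []
    for table_el in root.findall("table"):
        # find() only looks at direct children, so this is the table's own Coords.
        table = Table(polygon=parse_points(table_el.find("Coords").get("points")))
        for cell_el in table_el.findall("cell"):
            content_el = cell_el.find("content")
            table.cells.append(Cell(
                start_row=int(cell_el.get("start-row")),
                start_col=int(cell_el.get("start-col")),
                end_row=int(cell_el.get("end-row")),
                end_col=int(cell_el.get("end-col")),
                polygon=parse_points(cell_el.find("Coords").get("points")),
                cell_id=cell_el.get("id"),
                content=content_el.text if content_el is not None else None,
            ))
        tables.append(table)
    return root.get("filename"), tables
```
</pre>
<p>
A call such as <code>parse_ground_truth(open("table.xml", "rb").read())</code>, where <code>table.xml</code> is a placeholder file name, would return the image file name together with one <code>Table</code> per table element. A cell spans more than one row or column when its end index is larger than its start index.
</p>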
<br>
<p><strong>Important Note:</strong> For the modern dataset, a cell region is described by the convex hull of the cell's content. For the historical dataset, the output region of a cell must be the cell boundary. This is necessary because handwritten text often overlaps different cells.</p>
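<p>
To make the convex hull of a cell's content concrete, here is a small Python sketch that computes the hull of a set of content points with Andrew's monotone chain algorithm and formats it as a [points] string. It assumes the content is available as a list of (x, y) integer points, for example the corners of word or character boxes; how those points are obtained, and the vertex order of the result, are assumptions of this example and are not specified here. The function names are hypothetical.
</p>
<pre>
```python
def cross(o, a, b):
    """Z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Convex hull by Andrew's monotone chain; collinear points are dropped."""
    pts = sorted(set(points))
    if len(pts) in (0, 1):
        return pts

    def half_hull(seq):
        hull = []
        for p in seq:
            # Drop the last hull point while it does not form a strict left turn with p.
            while len(hull) > 1 and not cross(hull[-2], hull[-1], p) > 0:
                hull.pop()
            hull.append(p)
        return hull

    lower = half_hull(pts)
    upper = half_hull(reversed(pts))
    # The last point of each half is the first point of the other half.
    return lower[:-1] + upper[:-1]


def to_points_attribute(polygon):
    """Format a list of (x, y) tuples as 'x1,y1 x2,y2 ...'."""
    return " ".join(f"{x},{y}" for x, y in polygon)
```
</pre>
<p>
For example, <code>to_points_attribute(convex_hull([(0, 0), (4, 0), (4, 3), (0, 3), (2, 1), (1, 2)]))</code> yields <code>"0,0 4,0 4,3 0,3"</code>: the two interior points are discarded and only the outer corners remain.
</p>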
<p><strong>Update Note:</strong> A supplementary version of the dataset is published in
<a href="https://github.com/cndplab-founder/ICDAR2019_cTDaR_dataset_supplement.git">ICDAR2019_cTDaR_dataset_supplement</a>. It is a helpful subset in terms of adjacency relations, contributed by <strong>Prof. Cheng-Lin Liu's Group, Institute of Automation, Chinese Academy of Sciences</strong>. We thank Cheng-Lin Liu's Group for their helpful contributions!
</p>
</div>
</div>
</div>
</div>
</section>
<!--/#features-->
<footer id="footer">
<div class="container">
<div class="row">
<div class="col-sm-12">
<div class="copyright-text text-center">
<p>© PKU Founder Group 2019. All Rights Reserved.</p>
<p>Designed by <a target="_blank" href="http://www.themeum.com">Themeum</a></p>
</div>
</div>
</div>
</div>
</footer>
<!--/#footer-->
<script type="text/javascript" src="js/jquery.js"></script>
<script type="text/javascript" src="js/bootstrap.min.js"></script>
<script type="text/javascript" src="js/lightbox.min.js"></script>
<script type="text/javascript" src="js/wow.min.js"></script>
<script type="text/javascript" src="js/main.js"></script>
</body>
</html>