copyright | years | lastupdated | keywords | subcollection |
---|---|---|---|---|
2018, 2020 |
2020-12-04 |
table,tables,classify tables,classifying tables,table understanding,row,rows,column,columns,cell,cells,body cell,body cells,header,headers,key,value,key value pair,context,row header,row headers,column header,column headers,context,contexts |
discovery-data |
{:shortdesc: .shortdesc} {:external: target="_blank" .external} {:tip: .tip} {:pre: .pre} {:codeblock: .codeblock} {:screen: .screen} {:note: .note} {:important: .important} {:javascript: .ph data-hd-programlang='javascript'} {:java: .ph data-hd-programlang='java'} {:python: .ph data-hd-programlang='python'} {:swift: .ph data-hd-programlang='swift'}
{: #understanding_tables}
The Table Understanding
enrichment parses tables in a detailed format.
{: shortdesc}
The Table Understanding
enrichment produces detailed output regarding tables and table-related data within documents. This enrichment can be enabled for a collection in any of the following ways:
- If your collection consists of files whose formats are supported by Smart Document Understanding, and you annotate one or more tables by using the
table
label, theTable Understanding
enrichment is automatically applied to thehtml
field of the collection. - If you are using {{site.data.keyword.discoveryshort}} for Content Intelligence, the
Table Understanding
enrichment is automatically applied to thehtml
field of the collection. - If your collection includes files whose formats allow for fields other than
html
, such as CSV or JSON, visit the Enrichments page of the collection to update theTable Understanding
enrichment to be applied to the required field of the collection. For more information, see Managing enrichments.
For information on querying tables, see Query parameters.
If you are using Smart Document Understanding and have chosen the Pre-trained-models option, you do not need to annotate tables. For more information, see Identifying fields. {: note}
{: #table-output-schema}
The output schema from the Table Understanding
enrichment is as follows.
{
"tables": [
{
"location" : {
"begin" : int,
"end" : int
},
"text": string,
"section_title": {
"text": string,
"location": {
"begin" : int,
"end" : int
}
},
"title": {
"location": {
"begin": int,
"end": int,
},
"text": string
},
"table_headers" : [
{
"cell_id" : string,
"location" : {
"begin" : int,
"end" : int
},
"text" : string,
"row_index_begin" : int,
"row_index_end" : int,
"column_index_begin" : int,
"column_index_end" : int
},
...
],
"column_headers" : [
{
"cell_id" : string,
"location" : {
"begin" : int,
"end" : int
},
"text" : string,
"text_normalized" : string,
"row_index_begin" : int,
"row_index_end" : int,
"column_index_begin" : int,
"column_index_end" : int
},
...
],
"row_headers" : [
{
"cell_id" : string,
"location" : {
"begin" : int,
"end" : int
},
"text" : string,
"text_normalized" : string,
"row_index_begin" : int,
"row_index_end" : int,
"column_index_begin" : int,
"column_index_end" : int
},
...
],
"body_cells" : [
{
"cell_id" : string,
"location" : {
"begin" : int,
"end" : int
},
"text" : string,
"row_index_begin" : int,
"row_index_end" : int,
"column_index_begin" : int,
"column_index_end" : int,
"row_header_ids": [ string ],
"row_header_texts": [ string ],
"row_header_texts_normalized": [ string ],
"column_header_ids": [ string ],
"column_header_texts": [ string ],
"column_header_texts_normalized": [ string ],
"attributes" : [
{
"type" : string,
"text" : string,
"location" : {
"begin" : int,
"end" : int
}
},
...
]
},
...
],
"key_value_pairs": [
{
"key": {
"cell_id": string,
"location": {
"begin": int,
"end": int
},
"text": string
},
"value": [{
"cell_id": string,
"location": {
"begin": int,
"end": int
},
"text": string
},
...
]
},
...
],
"contexts": [
{
"text": string,
"location": {
"begin": int,
"end": int
}
},
...
]
}
]
}
{: #table-schema-arrangement}
The schema is arranged as follows.
tables
: An array that defines the tables identified in the input document.location
: The location of the current table as defined by itsbegin
andend
indexes in the input document.text
: The textual contents of the current table from the input document without associated markup content.section_title
: If identified, the location of a section title contained in the current table. Empty if no section title is identified.text
: The text of the identified section title.location
: The location of the section title in the input document as defined by itsbegin
andend
indexes.
title
: If identified, the title or caption of the current table of the formTable x.: ...
. Empty when no title is identified. When exposed, thetitle
is also excluded from thecontexts
array of the same table.location
: The location of the title in the input document as defined by itsbegin
andend
indexes.text
: The text of the identified table title or caption.
table_headers
: An array of table-level cells applicable as headers to all the other cells of the current table. Each table header is defined as a collection of the following elements:cell_id
: The unique ID of the cell in the current table.location
: The location of the cell in the input document as defined by itsbegin
andend
indexes.text
: The textual contents of the cell from the input document without associated markup content.row_index_begin
: Thebegin
index of the cell'srow
location in the current table.row_index_end
: Theend
index of the cell'srow
location in the current table.column_index_begin
: Thebegin
index of the cell'scolumn
location in the current table.column_index_end
: Theend
index of the cell'scolumn
location in the current table.
column_headers
: An array of column-level cells, each applicable as a header to other cells in the same column as itself, of the current table. Each column header is defined as a collection of the following items:cell_id
: The unique ID of the cell in the current table.location
: The location of the cell in the input document as defined by itsbegin
andend
indexes.text
: The textual contents of the cell from the input document without associated markup content.text_normalized
: If you provide customization input, the normalized version of the cell text according to the customization; otherwise, the same value astext
.row_index_begin
: Thebegin
index of the cell'srow
location in the current table.row_index_end
: Theend
index of the cell'srow
location in the current table.column_index_begin
: Thebegin
index of the cell'scolumn
location in the current table.column_index_end
: Theend
index of the cell'scolumn
location in the current table.
row_headers
: An array of row-level cells, each applicable as a header to other cells in the same row as itself, of the current table. Each row header is defined as a collection of the following items:cell_id
: The unique ID of the cell in the current table.location
: The location of the cell in the input document as defined by itsbegin
andend
indexes.text
: The textual contents of the cell from the input document without associated markup content.text_normalized
: If you provide customization input, the normalized version of the cell text according to the customization; otherwise, the same value astext
.row_index_begin
: Thebegin
index of the cell'srow
location in the current table.row_index_end
: Theend
index of the cell'srow
location in the current table.column_index_begin
: Thebegin
index of the cell'scolumn
location in the current table.column_index_end
: Theend
index of the cell'scolumn
location in the current table.
body_cells
: An array of cells that are not table header or column header or row header cells, of the current table with corresponding row and column header associations. Each body cell is defined as a collection of the following items:cell_id
: The unique ID of the cell in the current table.location
: The location of the cell in the input document as defined by itsbegin
andend
indexes.text
: The textual contents of the cell from the input document without associated markup content.row_index_begin
: Thebegin
index of this cell'srow
location in the current table.row_index_end
: Theend
index of this cell'srow
location in the current table.column_index_begin
: Thebegin
index of this cell'scolumn
location in the current table.column_index_end
: Theend
index of this cell'scolumn
location in the current table.row_header_ids
: An array of values, each being thecell_id
value of a row header that is applicable to this body cell.row_header_texts
: An array of values, each being thetext
value of a row header that is applicable to this body cell.row_header_texts_normalized
: If you provide customization input, the normalized version of the row header texts according to the customization; otherwise, the same value asrow_header_texts
.column_header_ids
: An array of values, each being thecell_id
value of a column header that is applicable to this body cell.column_header_texts
: An array of values, each being thetext
value of a column header that is applicable to this body cell.column_header_texts_normalized
: If you provide customization input, the normalized version of the column header texts according to the customization; otherwise, the same value ascolumn_header_texts
.attributes
: An array that identifies document attributes. Each object in the array consists of three elements:type
: The type of attribute. Possible values areAddress
,Currency
,DateTime
,Duration
,Location
,Number
,Organization
,Percentage
, andPerson
.text
: The text that is associated with the attribute.location
: The location of the attribute as defined by itsbegin
andend
indexes.
key_value_pairs
: An array that specifies any key-value pairs in tables in the input document. For more information, see Understanding key-value pairs.key
: An object that specifies a key for a key-value pair.cell_id
: The unique ID of the key in the table.location
: The location of the key cell in the input document as defined by itsbegin
andend
indexes.text
: The text content of the table cell without HTML markup.
value
: An array that specifies the value or values for a key-value pair.cell_id
: The unique ID of the value in the table.location
: The location of the value cell in the input document as defined by itsbegin
andend
indexes.text
: The text content of the table cell without HTML markup.
contexts
: A list of related material that precedes and follows åthe table, excluding its section title, which is provided in thesection_title
field. Related material includes related sentences; footnotes; and sentences from other parts of the document that refer to the table. The list is represented as an array. Each object in the array consists of the following elements:text
: The text contents of a related material from the input document, without HTML markup.location
: The location of the related material in the input document as defined by itsbegin
andend
indexes.
{: #notes-on-table-schema}
- Row and column index values per cell are zero-based and so begin with
0
. - Multiple values in arrays of
row_header_ids
androw_header_texts
elements indicate a possible hierarchy of row headers. - Multiple values in arrays of
column_header_ids
andcolumn_header_texts
elements indicate a possible hierarchy of column headers.
{: #table-examples}
The following is an example table from an input document.
The table is composed as follows:
where:
- Bold text indicates a column header
- Italic text indicates a row header
- Unstyled text indicates a body cell
The output from service represents the example's first body cell (that is, the first cell in row 3 with a value of 35.0%
) as follows.
{
"tables": [ {
"location": {
"begin": 872,
"end": 5879
},
"text": "...",
"section_title": {
"text": "",
"location": {
"begin": 0,
"end": 0
}
},
"table_headers" : [ ],
"column_headers" : [ {
"cell_id" : "colHeader-1050-1082",
"location" : {
"begin" : 1050,
"end" : 1083
},
"text" : "Three months ended September 30,",
"text_normalized" : "Three months ended September 30,",
"row_index_begin" : 0,
"row_index_end" : 0,
"column_index_begin" : 1,
"column_index_end" : 2
}, {
"cell_id" : "colHeader-1270-1301",
"location" : {
"begin" : 1270,
"end" : 1302
},
"text" : "Nine months ended September 30,",
"text_normalized" : "Nine months ended September 30,",
"row_index_begin" : 0,
"row_index_end" : 0,
"column_index_begin" : 3,
"column_index_end" : 4
}, {
"cell_id" : "colHeader-1544-1548",
"location" : {
"begin" : 1544,
"end" : 1549
},
"text" : "2005",
"text_normalized" : "Year 1",
"row_index_begin" : 1,
"row_index_end" : 1,
"column_index_begin" : 1,
"column_index_end" : 1
}, {
"cell_id" : "colHeader-1712-1716",
"location" : {
"begin" : 1712,
"end" : 1717
},
"text" : "2004",
"text_normalized" : "Year 2",
"row_index_begin" : 1,
"row_index_end" : 1,
"column_index_begin" : 2,
"column_index_end" : 2
}, {
"cell_id" : "colHeader-1889-1893",
"location" : {
"begin" : 1889,
"end" : 1894
},
"text" : "2005",
"text_normalized" : "Year 1",
"row_index_begin" : 1,
"row_index_end" : 1,
"column_index_begin" : 3,
"column_index_end" : 3
}, {
"cell_id" : "colHeader-2057-2061",
"location" : {
"begin" : 2057,
"end" : 2062
},
"text" : "2004",
"text_normalized" : "Year 2",
"row_index_begin" : 1,
"row_index_end" : 1,
"column_index_begin" : 4,
"column_index_end" : 4
} ],
"row_headers" : [ {
"cell_id" : "rowHeader-2244-2262",
"location" : {
"begin" : 2244,
"end" : 2263
},
"text" : "Statutory tax rate",
"text_normalized" : "Statutory tax rate",
"row_index_begin" : 2,
"row_index_end" : 2,
"column_index_begin" : 0,
"column_index_end" : 0
}, {
"cell_id" : "rowHeader-3197-3217",
"location" : {
"begin" : 3197,
"end" : 3218
},
"text" : "IRS audit settlement",
"text_normalized" : "IRS audit settlement",
"row_index_begin" : 3,
"row_index_end" : 3,
"column_index_begin" : 0,
"column_index_end" : 0
}, {
"cell_id" : "rowHeader-4148-4176",
"location" : {
"begin" : 4148,
"end" : 4177
},
"text" : "Dividends received deduction",
"text_normalized" : "Dividends received deduction",
"row_index_begin" : 4,
"row_index_end" : 4,
"column_index_begin" : 0,
"column_index_end" : 0
}, {
"cell_id" : "rowHeader-5106-5130",
"location" : {
"begin" : 5106,
"end" : 5131
},
"text" : "Total effective tax rate",
"text_normalized" : "Total effective tax rate",
"row_index_begin" : 5,
"row_index_end" : 5,
"column_index_begin" : 0,
"column_index_end" : 0
} ],
"key_value_pairs" : [ ],
"body_cells" : [ {
"cell_id" : "bodyCell-2450-2455",
"location" : {
"begin" : 2450,
"end" : 2456
},
"text" : "35.0%",
"row_index_begin" : 2,
"row_index_end" : 2,
"column_index_begin" : 1,
"column_index_end" : 1,
"row_header_ids" : [ "rowHeader-2244-2262" ],
"row_header_texts" : [ "Statutory tax rate" ],
"row_header_texts_normalized" : [ "Statutory tax rate" ],
"column_header_ids" : [ "colHeader-1050-1082", "colHeader-1544-1548" ],
"column_header_texts" : [ "Three months ended September 30,", "2005" ],
"column_header_texts_normalized" : [ "Three months ended September 30,", "Year 1" ],
"attributes": [ ]
}, {
"cell_id" : "bodyCell-2633-2638",
"location" : {
"begin" : 2633,
"end" : 2639
},
"text" : "35.0%",
"row_index_begin" : 2,
"row_index_end" : 2,
"column_index_begin" : 2,
"column_index_end" : 2,
"row_header_ids" : [ "rowHeader-2244-2262" ],
"row_header_texts" : [ "Statutory tax rate" ],
"row_header_texts_normalized" : [ "Statutory tax rate" ],
"column_header_ids" : [ "colHeader-1050-1082", "colHeader-1712-1716" ],
"column_header_texts" : [ "Three months ended September 30,", "2004" ],
"column_header_texts_normalized" : [ "Three months ended September 30,", "Year 2" ],
"attributes": [ ]
}, {
"cell_id" : "bodyCell-2825-2830",
"location" : {
"begin" : 2825,
"end" : 2831
},
"text" : "35.0%",
"row_index_begin" : 2,
"row_index_end" : 2,
"column_index_begin" : 3,
"column_index_end" : 3,
"row_header_ids" : [ "rowHeader-2244-2262" ],
"row_header_texts" : [ "Statutory tax rate" ],
"row_header_texts_normalized" : [ "Statutory tax rate" ],
"column_header_ids" : [ "colHeader-1270-1301", "colHeader-1889-1893" ],
"column_header_texts" : [ "Nine months ended September 30,", "2005" ],
"column_header_texts_normalized" : [ "Nine months ended September 30,", "Year 1" ],
"attributes": [ ]
}, {
"cell_id" : "bodyCell-3008-3013",
"location" : {
"begin" : 3008,
"end" : 3014
},
"text" : "35.0%",
"row_index_begin" : 2,
"row_index_end" : 2,
"column_index_begin" : 4,
"column_index_end" : 4,
"row_header_ids" : [ "rowHeader-2244-2262" ],
"row_header_texts" : [ "Statutory tax rate" ],
"row_header_texts_normalized" : [ "Statutory tax rate" ],
"column_header_ids" : [ "colHeader-1270-1301", "colHeader-2057-2061" ],
"column_header_texts" : [ "Nine months ended September 30,", "2004" ],
"column_header_texts_normalized" : [ "Nine months ended September 30,", "Year 2" ],
"attributes": [ ]
},
...
],
"contexts": [ ]
}
{: #key-value-pairs}
Tables can contain key-value pairs that span multiple table cells. Table Understanding can detect the following types of tabular key-value pairs.
- Simple key-value pairs in adjacent cells, as in the following example table.
Key | Value |
---|---|
Item number | 123456789 |
Date | 1/1/2019 |
Amount | $1000 |
- Key-value pairs in the same cell, as in the following example table.
Item number: 123456789 | Address: 123 Anywhere Dr |
Date: 1/1/2019 | |
Amount: $1000 |