contributors: @GitYCC

- We introduce a simple new approach to the problem of understanding documents where non-trivial layout influences the local semantics.
- We modify the Transformer encoder architecture (RoBERTa) in a way that allows it to use layout features obtained from an OCR system, without the need to re-learn language semantics from scratch.
- We only augment the input of the model with the coordinates of token bounding boxes, avoiding the use of raw images.
- This leads to a layout-aware language model which can then be fine-tuned on downstream tasks.
- Steps:
  - inject the layout information into a pretrained instance of RoBERTa
  - training stage: the augmented model is trained on a masked language modeling objective extended with layout information, on 2M unannotated visually rich pages (see the masking sketch after this list)
  - fine-tuning stage: the augmented model is fine-tuned on a dataset consisting of documents with non-trivial layout
  - datasets: Kleister NDA, Kleister Charity, SROIE and CORD
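
A minimal sketch of one plausible data-preparation step for this pretraining, assuming BERT-style masking of token ids with the bounding boxes passed through unchanged (the 80/10/10 split and the `mask_token_id` / `vocab_size` arguments are assumptions, not details from the paper):

```python
import torch

def mask_tokens(input_ids, bboxes, mask_token_id, vocab_size, mlm_prob=0.15):
    """BERT-style masking: corrupt token ids only; the layout (bboxes) stays intact."""
    labels = input_ids.clone()
    # Select ~15% of positions for prediction.
    masked = torch.bernoulli(torch.full(labels.shape, mlm_prob)).bool()
    labels[~masked] = -100  # ignore index for the LM loss

    # 80% of the selected tokens -> [MASK]
    to_mask = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked
    input_ids[to_mask] = mask_token_id

    # 10% -> random token; the remaining 10% are left unchanged
    to_random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked & ~to_mask
    input_ids[to_random] = torch.randint(vocab_size, labels.shape)[to_random]

    # The model still receives the bounding boxes of the masked tokens.
    return input_ids, bboxes, labels
```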

Proposed Method

Semantic embedding and sequential embedding

- visually rich documents -> use an OCR system (Tesseract) to obtain tokens and their bounding boxes -> flatten the obtained tokens in document order (see the OCR sketch after this list)
- the input embeddings of RoBERTa: $x_i=s_i+p_i$
  - $s_i\in \R^n$ is the semantic embedding of the token at position $i$, taken from a trainable embedding layer
  - $p_i\in\R^n$ is a positional embedding, depending only on $i$
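
A minimal sketch of the first step, assuming Tesseract is called through the `pytesseract` wrapper (the wrapper choice and the helper name are assumptions; the paper only states that Tesseract is used):

```python
import pytesseract
from PIL import Image

def ocr_tokens_with_boxes(image_path):
    """Run Tesseract and return (token, bounding box) pairs in document order."""
    data = pytesseract.image_to_data(Image.open(image_path),
                                     output_type=pytesseract.Output.DICT)
    tokens = []
    for text, left, top, width, height in zip(
            data["text"], data["left"], data["top"], data["width"], data["height"]):
        if text.strip():  # Tesseract emits empty strings for purely structural rows
            # (x1, y1, x2, y2) box; later translated and divided by the page height
            tokens.append((text, (left, top, left + width, top + height)))
    return tokens
```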

Modification of input embeddings

- input embeddings of LAMBERT: $x_i=s_i+p_i+L(l_i)$ (see the sketch after this list)
  - $l_i\in\R^k$ stands for the layout embedding of the $i$-th token
  - $L$: a trainable linear layer $\R^k → \R^n$
    - We initialize the weight matrix of $L$ according to a normal distribution $N(0,σ^2)$, with the standard deviation $σ$ being a hyperparameter.
    - We have to choose $σ$ carefully, so that in the initial phase of training, the $L(l_i)$ term does not interfere overly with the already learned representations. We experimentally determined the value $σ = 0.02$ to be near-optimal.
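
A minimal sketch of this modification, assuming a pretrained RoBERTa embedding module is reused for $s_i + p_i$ and the layout embedding $l_i$ is computed separately (module and argument names are hypothetical):

```python
import torch.nn as nn

class LambertInputEmbeddings(nn.Module):
    """x_i = s_i + p_i + L(l_i): add a projected layout embedding to RoBERTa's input."""

    def __init__(self, roberta_embeddings, layout_dim, hidden_dim, sigma=0.02):
        super().__init__()
        self.roberta_embeddings = roberta_embeddings  # pretrained semantic + positional embeddings
        self.layout_proj = nn.Linear(layout_dim, hidden_dim, bias=False)  # L: R^k -> R^n
        # Small-variance init so L(l_i) barely perturbs the pretrained representations at first.
        nn.init.normal_(self.layout_proj.weight, mean=0.0, std=sigma)

    def forward(self, input_ids, layout_embeddings):
        x = self.roberta_embeddings(input_ids)          # s_i + p_i
        return x + self.layout_proj(layout_embeddings)  # + L(l_i)
```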

Layout embeddings

- We first normalize the bounding boxes by translating them so that the upper left corner is at $(0, 0)$, and dividing their dimensions by the page height.
- The layout embedding of a token is defined as the concatenation of four embeddings of the individual coordinates of its bounding box (see the sketch after this list).
- For an integer $d$ and a vector of scaling factors $θ\in \R^d$ (a geometric sequence interpolating between 1 and 500), we define the corresponding embedding of a single coordinate $t$ as
  $$ emb_θ(t)=(\sin(tθ);\cos(tθ)) \in \R^{2d} $$
  where $\sin$ and $\cos$ are performed element-wise, yielding two vectors in $\R^d$. The resulting concatenation of the four single-coordinate embeddings is then a vector in $\R^{8d}$.
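
A minimal sketch of $emb_θ$ and the concatenation over the four box coordinates (the exact construction of the geometric sequence θ is an assumption):

```python
import torch

def coordinate_embedding(t, d, theta_min=1.0, theta_max=500.0):
    """emb_theta(t) = (sin(t*theta); cos(t*theta)) in R^(2d) for a tensor of coordinates t."""
    # d scaling factors forming a geometric sequence interpolating between 1 and 500
    theta = theta_min * (theta_max / theta_min) ** (torch.arange(d) / (d - 1))
    angles = t.unsqueeze(-1) * theta                                   # (..., d)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)   # (..., 2d)

def layout_embedding(bboxes, d):
    """Normalized boxes (..., 4) -> concatenated coordinate embeddings (..., 8d)."""
    return torch.cat([coordinate_embedding(bboxes[..., c], d) for c in range(4)], dim=-1)
```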

Relative bias

- The raw attention scores are computed as $α_{ij} = d^{−1/2} q_i^T k_j$. Afterwards, they are normalized using softmax and used as weights in linear combinations of value vectors.
- The point of relative bias is to modify the computation of the raw attention scores by introducing a bias term: $α'_{ij} = α_{ij}+β_{ij}$.
- Relative 1D bias: $β_{ij}=W(i-j)$ (a sketch of the combined bias follows this list)
  - $W$: a trainable embedding of the relative sequential position (`nn.Embedding` in PyTorch)
  - $i$ and $j$: sequential positions
- Relative 2D bias: $β_{ij}=H(⌊ξ_i −ξ_j⌋)+V(⌊η_i −η_j⌋)$
  - $H(l)$ and $V(l)$: trainable weights (`nn.Embedding` in PyTorch) defined for every integer $l ∈ [−C, C)$
  - $(ξ_i, η_i) = (Cx_1, C(y_1 + y_2)/2)$, where $b_i = (x_1, y_1, x_2, y_2)$ is the normalized bounding box of the $i$-th token
- Combined 1D and 2D bias: $β_{ij}=W(i-j)+H(⌊ξ_i −ξ_j⌋)+V(⌊η_i −η_j⌋)$
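
A minimal sketch of the combined bias, computed per sequence and added to the raw attention scores before the softmax (a single bias head is shown; clipping the relative distances so the embedding lookups stay in range is an assumption):

```python
import torch
import torch.nn as nn

class RelativeBias(nn.Module):
    """beta_ij = W(i - j) + H(floor(xi_i - xi_j)) + V(floor(eta_i - eta_j))"""

    def __init__(self, max_seq_len, C):
        super().__init__()
        self.C = C
        self.max_seq_len = max_seq_len
        self.W = nn.Embedding(2 * max_seq_len - 1, 1)  # sequential (1D) bias
        self.H = nn.Embedding(2 * C, 1)                # horizontal (2D) bias, l in [-C, C)
        self.V = nn.Embedding(2 * C, 1)                # vertical (2D) bias, l in [-C, C)

    def forward(self, xi, eta):
        # xi, eta: (seq_len,) horizontal/vertical token positions, already scaled by C
        n = xi.size(0)
        pos = torch.arange(n)
        rel_1d = (pos[:, None] - pos[None, :]).clamp(-self.max_seq_len + 1, self.max_seq_len - 1)
        rel_h = torch.floor(xi[:, None] - xi[None, :]).long().clamp(-self.C, self.C - 1)
        rel_v = torch.floor(eta[:, None] - eta[None, :]).long().clamp(-self.C, self.C - 1)
        # Shift indices into the non-negative range expected by nn.Embedding.
        beta = (self.W(rel_1d + self.max_seq_len - 1)
                + self.H(rel_h + self.C)
                + self.V(rel_v + self.C)).squeeze(-1)
        return beta  # (seq_len, seq_len); broadcast over heads and added to alpha_ij
```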

Result