contributors: @GitYCC

- We introduce a simple new approach to the problem of understanding documents where non-trivial layout influences the local semantics.
- We modify the Transformer encoder architecture (RoBERTa) in a way that allows it to use layout features obtained from an OCR system, without the need to re-learn language semantics from scratch.
- We only augment the input of the model with the coordinates of token bounding boxes, avoiding the use of raw images.
- This leads to a layout-aware language model which can then be fine-tuned on downstream tasks.
- Steps:
  - inject the layout information into a pretrained instance of RoBERTa
  - training stage: the augmented model is trained on a masked language modeling objective extended with layout information, on 2M unannotated visually rich pages (see the masking sketch after this list)
  - fine-tuning stage: the augmented model is fine-tuned on a dataset consisting of documents with non-trivial layout
  - datasets: Kleister NDA, Kleister Charity, SROIE and CORD
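
A minimal sketch of one plausible data-preparation step for this pretraining, assuming BERT-style masking of token ids with the bounding boxes passed through unchanged (the 80/10/10 split and the `mask_token_id` / `vocab_size` arguments are assumptions, not details from the paper):

```python
import torch

def mask_tokens(input_ids, bboxes, mask_token_id, vocab_size, mlm_prob=0.15):
    """BERT-style masking: corrupt token ids only; the layout (bboxes) stays intact."""
    labels = input_ids.clone()
    # Select ~15% of positions for prediction.
    masked = torch.bernoulli(torch.full(labels.shape, mlm_prob)).bool()
    labels[~masked] = -100  # ignore index for the LM loss

    # 80% of the selected tokens -> [MASK]
    to_mask = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked
    input_ids[to_mask] = mask_token_id

    # 10% -> random token; the remaining 10% are left unchanged
    to_random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked & ~to_mask
    input_ids[to_random] = torch.randint(vocab_size, labels.shape)[to_random]

    # The model still receives the bounding boxes of the masked tokens.
    return input_ids, bboxes, labels
```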

Proposed Method

Semantic embedding and sequential embedding

- visually rich documents -> use an OCR system (Tesseract) to obtain tokens and their bounding boxes -> flatten the obtained tokens in document order (see the OCR sketch after this list)
- the input embeddings of RoBERTa: $x_i=s_i+p_i$
  - $s_i\in \R^n$ is the semantic embedding of the token at position $i$, taken from a trainable embedding layer
  - $p_i\in\R^n$ is a positional embedding, depending only on $i$
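
A minimal sketch of the first step, assuming Tesseract is called through the `pytesseract` wrapper (the wrapper choice and the helper name are assumptions; the paper only states that Tesseract is used):

```python
import pytesseract
from PIL import Image

def ocr_tokens_with_boxes(image_path):
    """Run Tesseract and return (token, bounding box) pairs in document order."""
    data = pytesseract.image_to_data(Image.open(image_path),
                                     output_type=pytesseract.Output.DICT)
    tokens = []
    for text, left, top, width, height in zip(
            data["text"], data["left"], data["top"], data["width"], data["height"]):
        if text.strip():  # Tesseract emits empty strings for purely structural rows
            # (x1, y1, x2, y2) box; later translated and divided by the page height
            tokens.append((text, (left, top, left + width, top + height)))
    return tokens
```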

Modification of input embeddings

- input embeddings of LAMBERT: $x_i=s_i+p_i+L(l_i)$ (see the sketch after this list)
  - $l_i\in\R^k$ stands for the layout embedding of the $i$-th token
  - $L$: a trainable linear layer $\R^k → \R^n$
    - We initialize the weight matrix of $L$ according to a normal distribution $N(0,σ^2)$, with the standard deviation $σ$ being a hyperparameter.
    - We have to choose $σ$ carefully, so that in the initial phase of training, the $L(l_i)$ term does not interfere overly with the already learned representations. We experimentally determined the value $σ = 0.02$ to be near-optimal.
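
A minimal sketch of this modification, assuming a pretrained RoBERTa embedding module is reused for $s_i + p_i$ and the layout embedding $l_i$ is computed separately (module and argument names are hypothetical):

```python
import torch.nn as nn

class LambertInputEmbeddings(nn.Module):
    """x_i = s_i + p_i + L(l_i): add a projected layout embedding to RoBERTa's input."""

    def __init__(self, roberta_embeddings, layout_dim, hidden_dim, sigma=0.02):
        super().__init__()
        self.roberta_embeddings = roberta_embeddings  # pretrained semantic + positional embeddings
        self.layout_proj = nn.Linear(layout_dim, hidden_dim, bias=False)  # L: R^k -> R^n
        # Small-variance init so L(l_i) barely perturbs the pretrained representations at first.
        nn.init.normal_(self.layout_proj.weight, mean=0.0, std=sigma)

    def forward(self, input_ids, layout_embeddings):
        x = self.roberta_embeddings(input_ids)          # s_i + p_i
        return x + self.layout_proj(layout_embeddings)  # + L(l_i)
```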

Layout embeddings

- We first normalize the bounding boxes by translating them so that the upper left corner is at $(0, 0)$, and dividing their dimensions by the page height.
- The layout embedding of a token is defined as the concatenation of four embeddings of the individual coordinates of its bounding box (see the sketch after this list).
- For an integer $d$ and a vector of scaling factors $θ\in \R^d$ (a geometric sequence interpolating between 1 and 500), we define the corresponding embedding of a single coordinate $t$ as
  $$ emb_θ(t)=(\sin(tθ);\cos(tθ)) \in \R^{2d} $$
  where $\sin$ and $\cos$ are performed element-wise, yielding two vectors in $\R^d$. The resulting concatenation of the four single-coordinate embeddings is then a vector in $\R^{8d}$.
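
A minimal sketch of $emb_θ$ and the concatenation over the four box coordinates (the exact construction of the geometric sequence θ is an assumption):

```python
import torch

def coordinate_embedding(t, d, theta_min=1.0, theta_max=500.0):
    """emb_theta(t) = (sin(t*theta); cos(t*theta)) in R^(2d) for a tensor of coordinates t."""
    # d scaling factors forming a geometric sequence interpolating between 1 and 500
    theta = theta_min * (theta_max / theta_min) ** (torch.arange(d) / (d - 1))
    angles = t.unsqueeze(-1) * theta                                   # (..., d)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)   # (..., 2d)

def layout_embedding(bboxes, d):
    """Normalized boxes (..., 4) -> concatenated coordinate embeddings (..., 8d)."""
    return torch.cat([coordinate_embedding(bboxes[..., c], d) for c in range(4)], dim=-1)
```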

Relative bias

- The raw attention scores are computed as $α_{ij} = d^{−1/2} q_i^T k_j$. Afterwards, they are normalized using softmax and used as weights in linear combinations of value vectors.
- The point of relative bias is to modify the computation of the raw attention scores by introducing a bias term: $α'_{ij} = α_{ij}+β_{ij}$.
- Relative 1D bias: $β_{ij}=W(i-j)$ (a sketch of the combined bias follows this list)
  - $W$: a trainable embedding of the relative sequential position (`nn.Embedding` in PyTorch)
  - $i$ and $j$: sequential positions
- Relative 2D bias: $β_{ij}=H(⌊ξ_i −ξ_j⌋)+V(⌊η_i −η_j⌋)$
  - $H(l)$ and $V(l)$: trainable weights (`nn.Embedding` in PyTorch) defined for every integer $l ∈ [−C, C)$
  - $(ξ_i, η_i) = (Cx_1, C(y_1 + y_2)/2)$, where $b_i = (x_1, y_1, x_2, y_2)$ is the normalized bounding box of the $i$-th token
- Combined 1D and 2D bias: $β_{ij}=W(i-j)+H(⌊ξ_i −ξ_j⌋)+V(⌊η_i −η_j⌋)$
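
A minimal sketch of the combined bias, computed per sequence and added to the raw attention scores before the softmax (a single bias head is shown; clipping the relative distances so the embedding lookups stay in range is an assumption):

```python
import torch
import torch.nn as nn

class RelativeBias(nn.Module):
    """beta_ij = W(i - j) + H(floor(xi_i - xi_j)) + V(floor(eta_i - eta_j))"""

    def __init__(self, max_seq_len, C):
        super().__init__()
        self.C = C
        self.max_seq_len = max_seq_len
        self.W = nn.Embedding(2 * max_seq_len - 1, 1)  # sequential (1D) bias
        self.H = nn.Embedding(2 * C, 1)                # horizontal (2D) bias, l in [-C, C)
        self.V = nn.Embedding(2 * C, 1)                # vertical (2D) bias, l in [-C, C)

    def forward(self, xi, eta):
        # xi, eta: (seq_len,) horizontal/vertical token positions, already scaled by C
        n = xi.size(0)
        pos = torch.arange(n)
        rel_1d = (pos[:, None] - pos[None, :]).clamp(-self.max_seq_len + 1, self.max_seq_len - 1)
        rel_h = torch.floor(xi[:, None] - xi[None, :]).long().clamp(-self.C, self.C - 1)
        rel_v = torch.floor(eta[:, None] - eta[None, :]).long().clamp(-self.C, self.C - 1)
        # Shift indices into the non-negative range expected by nn.Embedding.
        beta = (self.W(rel_1d + self.max_seq_len - 1)
                + self.H(rel_h + self.C)
                + self.V(rel_v + self.C)).squeeze(-1)
        return beta  # (seq_len, seq_len); broadcast over heads and added to alpha_ij
```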

Result