LAMBERT: Layout-Aware Language Modeling for Information Extraction (2021), Lukasz Garncarek et al.

contributors: @GitYCC

[paper] [code]


  • We introduce a simple new approach to the problem of understanding documents where non-trivial layout influences the local semantics.

  • We modify the Transformer encoder architecture (RoBERTa) in a way that allows it to use layout features obtained from an OCR system, without the need to re-learn language semantics from scratch.

  • We only augment the input of the model with the coordinates of token bounding boxes, avoiding the use of raw images.

  • This leads to a layout-aware language model which can then be fine-tuned on downstream tasks.

  • Steps:

    • inject the layout information into a pretrained instance of RoBERTa
    • training stage: the augmented models are trained with a masked language modeling objective, extended with layout information, on 2M unannotated visually rich pages
    • fine-tuning stage: we fine-tune the augmented model on datasets consisting of documents with non-trivial layout
      • datasets: Kleister NDA, Kleister Charity, SROIE and CORD
  • Proposed Method

    • Semantic embedding and sequential embedding

      • visually rich documents -> run an OCR system (Tesseract) to obtain tokens and their bounding boxes -> flatten the tokens in document reading order
      • the input embeddings of RoBERTa: $x_i=s_i+p_i$ (see the sketch below)
        • $s_i\in \R^n$ is the semantic embedding of the token at position $i$, taken from a trainable embedding layer
        • $p_i\in\R^n$ is a positional embedding, depending only on $i$
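
A minimal PyTorch sketch of this baseline combination, which LAMBERT then augments in the next subsection. The sizes `vocab_size`, `max_len`, and `n` below are illustrative values, not taken from the paper:

```python
import torch
import torch.nn as nn

# Baseline RoBERTa-style input embeddings: x_i = s_i + p_i
# (vocab_size, max_len, and n are illustrative values, not taken from the paper)
vocab_size, max_len, n = 50_265, 512, 768

semantic = nn.Embedding(vocab_size, n)    # s_i: trainable token (semantic) embeddings
positional = nn.Embedding(max_len, n)     # p_i: depends only on the position i

token_ids = torch.randint(0, vocab_size, (1, 128))         # dummy token sequence
positions = torch.arange(token_ids.size(1)).unsqueeze(0)   # 0, 1, ..., 127
x = semantic(token_ids) + positional(positions)            # (1, 128, n)
```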
    • Modification of input embeddings

      • input embeddings of LAMBERT: $x_i=s_i+p_i+L(l_i)$

        • $l_i\in\R^k$ is the layout embedding of the $i$-th token
        • $L$: trainable linear layer $\R^k → \R^n$
          • We initialize the weight matrix of $L$ according to a normal distribution $N(0,σ^2)$, with the standard deviation $σ$ being a hyperparameter.
          • We have to choose $σ$ carefully, so that in the initial phase of training, the $L(l_i)$ term does not overly interfere with the already learned representations. We experimentally determined the value $σ = 0.02$ to be near-optimal.
      • layout embeddings

        • We first normalize the bounding boxes by translating them so that the upper left corner is at $(0, 0)$, and dividing their dimensions by the page height.

        • The layout embedding of a token is defined as the concatenation of four embeddings of the individual coordinates of its bounding box. For an integer $d$ and a vector of scaling factors $θ\in \R^d$ (a geometric sequence interpolating between 1 and 500), we define the embedding of a single coordinate $t$ as $$ \mathrm{emb}_θ(t)=(\sin(tθ);\cos(tθ)) \in \R^{2d} $$ where $\sin$ and $\cos$ are applied element-wise, yielding two vectors in $\R^d$. The concatenation of the four coordinate embeddings is then a vector in $\R^{8d}$.
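
A hedged PyTorch sketch of the layout term $L(l_i)$ described above. The size $d = 32$, the tensor shapes, and the bias-free linear layer are illustrative assumptions; only the sinusoidal form, the 1-to-500 geometric scaling, and the $N(0, 0.02^2)$ initialization come from the summary above:

```python
import torch
import torch.nn as nn

def coord_embedding(t, d=32, max_scale=500.0):
    """emb_theta(t) = (sin(t * theta); cos(t * theta)) in R^{2d}.

    theta is a geometric sequence interpolating between 1 and max_scale (500);
    d = 32 is an illustrative choice, not a value taken from the paper.
    """
    theta = max_scale ** (torch.arange(d, dtype=torch.float32) / (d - 1))  # 1, ..., 500
    angles = t.unsqueeze(-1) * theta                                       # (..., d)
    return torch.cat([angles.sin(), angles.cos()], dim=-1)                 # (..., 2d)

def layout_embedding(bboxes, d=32):
    """Concatenate the four coordinate embeddings -> l_i in R^{8d}.

    bboxes: (batch, seq, 4) bounding boxes (x1, y1, x2, y2), already normalized
    (translated to the page origin and divided by the page height).
    """
    return torch.cat([coord_embedding(bboxes[..., c], d) for c in range(4)], dim=-1)

# Trainable linear adapter L: R^{8d} -> R^n, initialized from N(0, sigma^2) with sigma = 0.02
d, n = 32, 768
L = nn.Linear(8 * d, n, bias=False)
nn.init.normal_(L.weight, mean=0.0, std=0.02)

# The layout term that is added to the RoBERTa input embeddings: x_i = s_i + p_i + L(l_i)
bboxes = torch.rand(1, 128, 4)                # dummy normalized bounding boxes
layout_term = L(layout_embedding(bboxes, d))  # (1, 128, n)
```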

    • Relative bias

      • In the Transformer encoder, the raw attention scores are computed as $α_{ij} = d^{−1/2} q_i^T k_j$; they are then normalized with softmax and used as weights in linear combinations of the value vectors.
      • The point of relative bias is to modify the computation of the raw attention scores by introducing a bias term: $α_{ij}^′ = α_{ij}+β_{ij}$.
      • Relative 1D bias: $β_{ij}=W(i-j)$
        • $W$: trainable weights, one for each relative position $i-j$ (nn.Embedding in PyTorch)
        • $i$ and $j$: sequential (token) positions
      • Relative 2D bias: $β_{ij}=H(⌊ξ_i −ξ_j⌋)+V(⌊η_i −η_j⌋)$
        • $H(l)$ and $V(l)$ are trainable weights (nn.Embedding in PyTorch) defined for every integer $l ∈ [−C, C)$
        • $(ξ_i, η_i) = (Cx_1, C(y_1 + y_2)/2)$ if $b_i = (x_1, y_1, x_2, y_2)$ is the normalized bounding box of the $i$-th token
      • Combine 1D and 2D: $β_{ij}=W(i-j)+H(⌊ξ_i −ξ_j⌋)+V(⌊η_i −η_j⌋)$
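
A rough PyTorch sketch of the combined relative bias. The value $C = 100$, the clamping of out-of-range relative positions, and sharing a single bias value across attention heads are simplifying assumptions, not details given in the summary above:

```python
import torch
import torch.nn as nn

class RelativeBias(nn.Module):
    """beta_ij = W(i - j) + H(floor(xi_i - xi_j)) + V(floor(eta_i - eta_j))."""

    def __init__(self, max_len=512, C=100):
        super().__init__()
        self.C = C
        self.max_len = max_len
        self.W = nn.Embedding(2 * max_len - 1, 1)  # 1D bias, indexed by shifted i - j
        self.H = nn.Embedding(2 * C, 1)            # horizontal 2D bias, l in [-C, C)
        self.V = nn.Embedding(2 * C, 1)            # vertical 2D bias, l in [-C, C)

    def forward(self, scores, bboxes):
        # scores: (batch, heads, seq, seq) raw attention scores d^{-1/2} q_i^T k_j
        # bboxes: (batch, seq, 4) normalized boxes (x1, y1, x2, y2); seq <= max_len
        seq = scores.size(-1)
        pos = torch.arange(seq, device=scores.device)
        rel = pos[:, None] - pos[None, :] + (self.max_len - 1)  # shift i - j to be >= 0
        bias_1d = self.W(rel).squeeze(-1)                       # (seq, seq)

        xi = self.C * bboxes[..., 0]                            # xi_i  = C * x1
        eta = self.C * (bboxes[..., 1] + bboxes[..., 3]) / 2    # eta_i = C * (y1 + y2) / 2
        dh = (xi[:, :, None] - xi[:, None, :]).floor().long().clamp(-self.C, self.C - 1)
        dv = (eta[:, :, None] - eta[:, None, :]).floor().long().clamp(-self.C, self.C - 1)
        bias_2d = self.H(dh + self.C).squeeze(-1) + self.V(dv + self.C).squeeze(-1)

        return scores + bias_1d + bias_2d.unsqueeze(1)          # broadcast over heads
```

In use, this module would be applied to the raw scores $α_{ij}$ inside each self-attention layer, before the softmax.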
  • Result