2023-01-08: A Summary of "DocFormer: End to End Transformer for Document Understanding" (Appalaraju et al. 2021 ICCV)

Our previous blog post described the importance of document understanding for layout analysis. While layout analysis is important, many downstream tasks, including document classification, entity extraction, and sequence labeling, often require visual document understanding (VDU). VDU requires an understanding of both the structure and the layout of a document. Some VDU approaches rely on textual features alone, while others combine textual and spatial features. The best results are obtained by fusing textual, spatial, and visual features. Appalaraju et al. proposed DocFormer in "DocFormer: End-to-End Transformer for Document Understanding", presented at the IEEE/CVF International Conference on Computer Vision (ICCV) in 2021. DocFormer incorporates a novel multi-modal self-attention with shared embeddings in an encoder-only transformer architecture. It achieved state-of-the-art results on four downstream VDU tasks. The contributions of this paper are: DocFormer has ...