www.karakun.com
Leveraging pre-trained language models for document classification
Holger Keibel (Karakun)
Daniele Puccinelli (SUPSI)
AI-SDV 2021
Our presentation at AI-SDV 2020
• Beginning of a joint research project of Karakun (Basel), DSwiss (Zurich) and SUPSI (Lugano)
• Co-funded by Innosuisse
• Document classification and information extraction
hibu-platform.com dswiss.com supsi.ch
Our customers’ most frequent AI needs
Text classification
• Assign categories to texts
• Predefined set of categories

Information extraction
• Identify relevant pieces of information within a text
• Entities, keywords, values etc.

Topic identification
• Assign a label to a text, summarizing its main topic
• Labels generally use terms found in the text
Use cases: Search solutions
Use cases: Document processing
Classification:
• Document type: Invoice

Extraction:
• Amount: 171.19
• Currency: CHF
• Invoice number: 2020/AB-773
• Due date: October 25, 2020
• IBAN: xxxx xxxx xxxx xxxx x
• Recipient: Musterfirma AG
Use cases: Smart actions
Extraction:
• Amount: 171.19
• Currency: CHF
• Invoice number: 2020/AB-773
• Due date: October 25, 2020
• IBAN: xxxx xxxx xxxx xxxx x
• Recipient: Musterfirma AG

Smart actions derived from the extracted fields (see the sketch below):
• Smart Action 1: automated payment reminder before October 25, 2020
• Smart Action 2
• Smart Action 3
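A minimal sketch of how such a smart action could be derived from the extracted fields; the field names and the three-day reminder offset are illustrative assumptions, not the project's actual logic:

```python
from datetime import datetime, timedelta

# Fields as produced by the extraction step shown above.
extracted = {"Due date": "October 25, 2020", "Amount": "171.19", "Currency": "CHF"}

# Hypothetical smart action: schedule a payment reminder three days before the due date.
due = datetime.strptime(extracted["Due date"], "%B %d, %Y")
reminder = due - timedelta(days=3)
print(f"Remind on {reminder:%Y-%m-%d}: pay {extracted['Currency']} {extracted['Amount']}")
# -> Remind on 2020-10-22: pay CHF 171.19
```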
Our presentation at AI-SDV 2020
• Document classification and information extraction needed in many customer projects
• Often domain-specific and language-specific
• Scaling best done by a learning approach
• Good quality only achievable with massive training data -> costly
• Goal: Reduce required training data
• Result: EXTRA classifier
Pre-trained language models
• Best-known models: BERT and GPT-3
• Pre-trained on specific tasks with massive data
  -> learns underlying patterns of the target language(s)
  -> rich contextual word embeddings (vector representations of words based on context)
• Major improvement over standard word embeddings (GloVe and Word2Vec)
• Large pre-trained BERT models may be fine-tuned (see the sketch below)
  • for specific tasks
  • in specific domains
  • using a relatively small number of training examples
  • works best for running text (sentences)
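A minimal sketch of this fine-tuning step, using the Hugging Face transformers library with a multilingual BERT checkpoint; the label set, the toy training examples and the hyperparameters are illustrative assumptions, not the project's actual configuration:

```python
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# Multilingual BERT covers the project's target languages (de, fr, it, en).
MODEL = "bert-base-multilingual-cased"
LABELS = ["invoice", "contract", "bank_statement", "other"]  # assumed label set

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL, num_labels=len(LABELS))

# Toy training set; in practice a few hundred OCR'd documents per category.
train_ds = Dataset.from_dict({
    "text": ["Rechnung Nr. 2020/AB-773 ...", "Mietvertrag zwischen ..."],
    "label": [0, 1],
})

def encode(batch):
    # BERT accepts at most 512 subword tokens; longer documents need chunking.
    return tokenizer(batch["text"], truncation=True, padding="max_length")

train_ds = train_ds.map(encode, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="doc-classifier", num_train_epochs=3),
    train_dataset=train_ds,
)
trainer.train()
```

Only the small classification head is trained from scratch; the pre-trained encoder weights are merely adjusted, which is why comparatively few labelled documents suffice.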
Fine-tuning: much less data needed

[Chart: performance after training vs. number of training samples (log scale). Fine-tuning a pre-trained system reaches an acceptable performance level with far fewer samples than training a network from scratch.]
Cost factors and quality aspects

                                      Rule-based    Supervised learning   Pre-trained
Required data volume                  low           high                  moderate
Required data quality                 rather low    high                  moderate
Initial ramp-up costs                 rather high   rather high           moderate
Maintenance costs                     high          moderate              moderate
Costs of scaling system to new
domains, applications and languages
(-> time to market)                   high          moderate              moderate
Sensitive to context                  low           high                  high
Recall (-> false negatives)           low (1)       high                  high
Research project goals
• Create core classifiers and extractors by fine-tuning BERT
• Increase coverage of existing classifiers (e.g. document type)
• Improve performance (quality of classification/extraction results)
• Extend to new tasks, e.g.
  • Extract relevant data from inside invoices and contracts
• Reference market: D-A-CH
• Target languages: German, French, Italian, English
Document types

Document classification   Information extraction (slot filling)
Invoice                   Name of issuer, name of bill recipient, invoice number,
                          balance due, currency, IBAN / account number,
                          payment code, expiration date
Contract                  Depends on the specific subclass, e.g. employment
                          contracts, insurance contracts, telco, rental agreements,
                          general utilities and consumer contracts (electricity,
                          water, waste collection etc.)
Bank statement            Bank, account holder, IBAN, currency, amount
Other                     Not applicable

Training data:
• Classification: need 100s of documents per category (instead of 10'000s)
• Extraction: ideally more
• Most categories: difficult to collect documents for general purposes, easier in a specific customer project
Classification pipeline
• Raw text is typically extracted from scanned documents (OCR)
• Reduce OCR output noise (see the sketch below)
• Pre-trained BERT, fine-tuned on the domain-specific task and data
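The noise-reduction step is not specified in detail on the slide; a minimal sketch of what such OCR cleanup could look like, with rules that are purely illustrative assumptions:

```python
import re

def reduce_ocr_noise(text: str) -> str:
    """Illustrative OCR cleanup; the project's actual rules are not published."""
    text = text.replace("\u00ad", "")        # drop soft hyphens
    text = re.sub(r"-\s*\n\s*", "", text)    # re-join words hyphenated across lines
    text = re.sub(r"[|_]{2,}", " ", text)    # strip form-field ruling artifacts
    text = re.sub(r"\s+", " ", text)         # collapse runs of whitespace
    return text.strip()

print(reduce_ocr_noise("Rechnungs-\nnummer: 2020/AB-773 ___ Betrag:  171.19"))
# -> Rechnungsnummer: 2020/AB-773 Betrag: 171.19
```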
Classification: Training results
Cross-linguistic evaluation for invoices:

System                       Precision   Recall   F1
BERT-based                   ~ 0.8       ~ 0.8    ~ 0.8
Old rule-based (benchmark)   ~ 0.8       ~ 0.2    ~ 0.3

Invoices: better than expected, given their limited amount of running text; contracts can be expected to do even better. Overall, this confirms that comparatively few training samples are needed.
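The F1 column follows from precision and recall as their harmonic mean, F1 = 2PR / (P + R); a quick check of the table's values:

```python
def f1(p: float, r: float) -> float:
    # Harmonic mean of precision and recall.
    return 2 * p * r / (p + r)

print(round(f1(0.8, 0.8), 2))  # 0.8  -> BERT-based
print(round(f1(0.8, 0.2), 2))  # 0.32 -> rule-based: high precision, low recall
```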
Information extraction pipeline
• OCR output (hOCR) encodes text with layout information
• Cluster words into chunks of text (spatially co-located words), using image processing with OpenCV (see the sketch below)
• Pre-trained BERT, fine-tuned on domain-specific data
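A hypothetical sketch of the chunking step: grouping hOCR word boxes that lie close together into text chunks. The project does this with OpenCV image processing; here a simple greedy distance rule stands in for that logic, and all names and thresholds are assumptions:

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    x0: int   # hOCR bounding box: left
    y0: int   # top
    x1: int   # right
    y1: int   # bottom

def cluster_words(words, max_gap=20):
    """Greedy clustering: a word joins a chunk if its box lies within
    `max_gap` pixels of the chunk's bounding box."""
    chunks = []
    for w in sorted(words, key=lambda w: (w.y0, w.x0)):
        for chunk in chunks:
            x0, y0, x1, y1 = chunk["box"]
            if (w.x0 - x1 <= max_gap and x0 - w.x1 <= max_gap and
                    w.y0 - y1 <= max_gap and y0 - w.y1 <= max_gap):
                chunk["words"].append(w)
                chunk["box"] = (min(x0, w.x0), min(y0, w.y0),
                                max(x1, w.x1), max(y1, w.y1))
                break
        else:
            chunks.append({"words": [w], "box": (w.x0, w.y0, w.x1, w.y1)})
    return [" ".join(w.text for w in c["words"]) for c in chunks]

words = [Word("Invoice", 50, 40, 120, 55), Word("number:", 125, 40, 190, 55),
         Word("2020/AB-773", 195, 40, 300, 55), Word("IBAN:", 50, 400, 95, 415)]
print(cluster_words(words))  # ['Invoice number: 2020/AB-773', 'IBAN:']
```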
Challenges
• Noisy OCR output
• Still challenging to find enough training documents for sensitive data
• Risk of overfitting given small number of training samples
• Avoid using multiple documents by the same issuer (creates bias)
• Ability to decide that a given document does not fall into any of the trained categories (-> out-of-domain detection; see the sketch below)
• Classification challenges magnified for extraction task
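A common baseline for such out-of-domain detection (not necessarily the project's method) is to threshold the classifier's softmax confidence; the labels and threshold value below are assumptions:

```python
import torch

def predict_with_ood(logits: torch.Tensor, labels: list, threshold: float = 0.7):
    """Return the predicted label, or 'out-of-domain' when the highest
    softmax probability stays below the confidence threshold."""
    probs = torch.softmax(logits, dim=-1)
    conf, idx = probs.max(dim=-1)
    return labels[idx] if conf >= threshold else "out-of-domain"

labels = ["invoice", "contract", "bank_statement"]
print(predict_with_ood(torch.tensor([0.2, 0.1, 0.15]), labels))
# Nearly uniform logits -> low confidence -> "out-of-domain"
```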
People involved
Karakun: Sandro Pedrazzini, Holger Keibel, Martin Huber, Johannes Porzelt, Elisabeth Maier
SUPSI: Daniele Puccinelli, Luca Chiarabini, Fabio Landoni, Olmo Barberis, Giancarlo Corti, Reda Bousbah
DSwiss: Tobias Christen, Davide Vosti, Fabio Schiavoni
Contact: daniele.puccinelli@supsi.ch, holger.keibel@karakun.com
