www.karakun.com
Leveraging pre-trained language models for document classification
Holger Keibel (Karakun)
Daniele Puccinelli (SUPSI)
AI-SDV 2021
Our presentation at AI-SDV 2020
• Beginning of a joint research project of Karakun (Basel), DSwiss (Zurich) and SUPSI (Lugano)
• Co-funded by Innosuisse
• Document classification and information extraction
hibu-platform.com dswiss.com supsi.ch
Our customers’ most frequent AI needs
Text classification
• Assign categories to texts
• Predefined set of categories

Information extraction
• Identify relevant pieces of information within a text
• Entities, keywords, values etc.

Topic identification
• Assign a label to a text, summarizing its main topic
• Labels generally use terms found in the text
Use cases: Search solutions
Use cases: Document processing
Classification:
• Document type: Invoice

Extraction:
• Amount: 171.19
• Currency: CHF
• Invoice number: 2020/AB-773
• Due date: October 25, 2020
• IBAN: xxxx xxxx xxxx xxxx x
• Recipient: Musterfirma AG
Use cases: Smart actions
Extraction:
• Amount: 171.19
• Currency: CHF
• Invoice number: 2020/AB-773
• Due date: October 25, 2020
• IBAN: xxxx xxxx xxxx xxxx x
• Recipient: Musterfirma AG

Smart actions derived from the extracted fields (see the sketch below):
• Smart Action 1: automated payment reminder before October 25, 2020
• Smart Action 2
• Smart Action 3
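A minimal sketch of how such a smart action could be derived from the extracted fields; the field names and the three-day reminder offset are illustrative assumptions, not the project's actual logic:

```python
from datetime import datetime, timedelta

# Fields as produced by the extraction step shown above.
extracted = {"Due date": "October 25, 2020", "Amount": "171.19", "Currency": "CHF"}

# Hypothetical smart action: schedule a payment reminder three days before the due date.
due = datetime.strptime(extracted["Due date"], "%B %d, %Y")
reminder = due - timedelta(days=3)
print(f"Remind on {reminder:%Y-%m-%d}: pay {extracted['Currency']} {extracted['Amount']}")
# -> Remind on 2020-10-22: pay CHF 171.19
```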
Our presentation at AI-SDV 2020
• Document classification and information extraction needed in many customer projects
• Often domain-specific and language-specific
• Scaling best done by a learning approach
• Good quality only achievable with massive training data -> costly
• Goal: Reduce required training data
• Result: EXTRA classifier
Pre-trained language models
• Best-known models: BERT and GPT-3
• Pre-trained on specific tasks with massive data
  -> learns underlying patterns of the target language(s)
  -> rich contextual word embeddings (vector representations of words based on context)
• Major improvement over standard word embeddings (GloVe and Word2Vec)
• Large pre-trained BERT models may be fine-tuned (see the sketch below)
  • for specific tasks
  • in specific domains
  • using a relatively small number of training examples
  • works best for running text (sentences)
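A minimal sketch of this fine-tuning step, using the Hugging Face transformers library with a multilingual BERT checkpoint; the label set, the toy training examples and the hyperparameters are illustrative assumptions, not the project's actual configuration:

```python
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# Multilingual BERT covers the project's target languages (de, fr, it, en).
MODEL = "bert-base-multilingual-cased"
LABELS = ["invoice", "contract", "bank_statement", "other"]  # assumed label set

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL, num_labels=len(LABELS))

# Toy training set; in practice a few hundred OCR'd documents per category.
train_ds = Dataset.from_dict({
    "text": ["Rechnung Nr. 2020/AB-773 ...", "Mietvertrag zwischen ..."],
    "label": [0, 1],
})

def encode(batch):
    # BERT accepts at most 512 subword tokens; longer documents need chunking.
    return tokenizer(batch["text"], truncation=True, padding="max_length")

train_ds = train_ds.map(encode, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="doc-classifier", num_train_epochs=3),
    train_dataset=train_ds,
)
trainer.train()
```

Only the small classification head is trained from scratch; the pre-trained encoder weights are merely adjusted, which is why comparatively few labelled documents suffice.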
Fine-tuning: much less data needed

[Chart: performance after training vs. number of training samples (log scale). Fine-tuning a pre-trained system reaches an acceptable performance level with far fewer samples than training a network from scratch.]
Cost factors and quality aspects

                                      Rule-based    Supervised learning   Pre-trained
Required data volume                  low           high                  moderate
Required data quality                 rather low    high                  moderate
Initial ramp-up costs                 rather high   rather high           moderate
Maintenance costs                     high          moderate              moderate
Costs of scaling system to new
domains, applications and languages
(-> time to market)                   high          moderate              moderate
Sensitive to context                  low           high                  high
Recall (-> false negatives)           low (1)       high                  high
Research project goals
• Create core classifiers and extractors by fine-tuning BERT
• Increase coverage of existing classifiers (e.g. document type)
• Improve performance (quality of classification/extraction results)
• Extend to new tasks, e.g.
  • Extract relevant data from inside invoices and contracts
• Reference market: D-A-CH
• Target languages: German, French, Italian, English
Document types

Document classification   Information extraction (slot filling)
Invoice                   Name of issuer, name of bill recipient, invoice number,
                          balance due, currency, IBAN / account number,
                          payment code, expiration date
Contract                  Depends on the specific subclass, e.g. employment
                          contracts, insurance contracts, telco, rental agreements,
                          general utilities and consumer contracts (electricity,
                          water, waste collection etc.)
Bank statement            Bank, account holder, IBAN, currency, amount
Other                     Not applicable

Training data:
• Classification: need 100s of documents per category (instead of 10'000s)
• Extraction: ideally more
• Most categories: difficult to collect documents for general purposes, easier in a specific customer project
Classification pipeline
• Raw text is typically extracted from scanned documents (OCR)
• Reduce OCR output noise (see the sketch below)
• Pre-trained BERT, fine-tuned on the domain-specific task and data
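The noise-reduction step is not specified in detail on the slide; a minimal sketch of what such OCR cleanup could look like, with rules that are purely illustrative assumptions:

```python
import re

def reduce_ocr_noise(text: str) -> str:
    """Illustrative OCR cleanup; the project's actual rules are not published."""
    text = text.replace("\u00ad", "")        # drop soft hyphens
    text = re.sub(r"-\s*\n\s*", "", text)    # re-join words hyphenated across lines
    text = re.sub(r"[|_]{2,}", " ", text)    # strip form-field ruling artifacts
    text = re.sub(r"\s+", " ", text)         # collapse runs of whitespace
    return text.strip()

print(reduce_ocr_noise("Rechnungs-\nnummer: 2020/AB-773 ___ Betrag:  171.19"))
# -> Rechnungsnummer: 2020/AB-773 Betrag: 171.19
```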
Classification: Training results
Cross-linguistic evaluation for invoices:

System                       Precision   Recall   F1
BERT-based                   ~ 0.8       ~ 0.8    ~ 0.8
Old rule-based (benchmark)   ~ 0.8       ~ 0.2    ~ 0.3

Invoices: better than expected, given their limited amount of running text; contracts can be expected to do even better. Overall, this confirms that comparatively few training samples are needed.
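The F1 column follows from precision and recall as their harmonic mean, F1 = 2PR / (P + R); a quick check of the table's values:

```python
def f1(p: float, r: float) -> float:
    # Harmonic mean of precision and recall.
    return 2 * p * r / (p + r)

print(round(f1(0.8, 0.8), 2))  # 0.8  -> BERT-based
print(round(f1(0.8, 0.2), 2))  # 0.32 -> rule-based: high precision, low recall
```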
Information extraction pipeline
• OCR output (hOCR) encodes text with layout information
• Cluster words into chunks of text (spatially co-located words), using image processing with OpenCV (see the sketch below)
• Pre-trained BERT, fine-tuned on domain-specific data
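A hypothetical sketch of the chunking step: grouping hOCR word boxes that lie close together into text chunks. The project does this with OpenCV image processing; here a simple greedy distance rule stands in for that logic, and all names and thresholds are assumptions:

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    x0: int   # hOCR bounding box: left
    y0: int   # top
    x1: int   # right
    y1: int   # bottom

def cluster_words(words, max_gap=20):
    """Greedy clustering: a word joins a chunk if its box lies within
    `max_gap` pixels of the chunk's bounding box."""
    chunks = []
    for w in sorted(words, key=lambda w: (w.y0, w.x0)):
        for chunk in chunks:
            x0, y0, x1, y1 = chunk["box"]
            if (w.x0 - x1 <= max_gap and x0 - w.x1 <= max_gap and
                    w.y0 - y1 <= max_gap and y0 - w.y1 <= max_gap):
                chunk["words"].append(w)
                chunk["box"] = (min(x0, w.x0), min(y0, w.y0),
                                max(x1, w.x1), max(y1, w.y1))
                break
        else:
            chunks.append({"words": [w], "box": (w.x0, w.y0, w.x1, w.y1)})
    return [" ".join(w.text for w in c["words"]) for c in chunks]

words = [Word("Invoice", 50, 40, 120, 55), Word("number:", 125, 40, 190, 55),
         Word("2020/AB-773", 195, 40, 300, 55), Word("IBAN:", 50, 400, 95, 415)]
print(cluster_words(words))  # ['Invoice number: 2020/AB-773', 'IBAN:']
```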
Challenges
• Noisy OCR output
• Still challenging to find enough training documents for sensitive data
• Risk of overfitting given small number of training samples
• Avoid using multiple documents by the same issuer (creates bias)
• Ability to decide that a given document does not fall into any of the trained categories (-> out-of-domain detection; see the sketch below)
• Classification challenges magnified for extraction task
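A common baseline for such out-of-domain detection (not necessarily the project's method) is to threshold the classifier's softmax confidence; the labels and threshold value below are assumptions:

```python
import torch

def predict_with_ood(logits: torch.Tensor, labels: list, threshold: float = 0.7):
    """Return the predicted label, or 'out-of-domain' when the highest
    softmax probability stays below the confidence threshold."""
    probs = torch.softmax(logits, dim=-1)
    conf, idx = probs.max(dim=-1)
    return labels[idx] if conf >= threshold else "out-of-domain"

labels = ["invoice", "contract", "bank_statement"]
print(predict_with_ood(torch.tensor([0.2, 0.1, 0.15]), labels))
# Nearly uniform logits -> low confidence -> "out-of-domain"
```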
People involved
Karakun: Sandro Pedrazzini, Holger Keibel, Martin Huber, Johannes Porzelt, Elisabeth Maier
SUPSI: Daniele Puccinelli, Luca Chiarabini, Fabio Landoni, Olmo Barberis, Giancarlo Corti, Reda Bousbah
DSwiss: Tobias Christen, Davide Vosti, Fabio Schiavoni
Contact: daniele.puccinelli@supsi.ch, holger.keibel@karakun.com
