Stefan Geissler kairntech - SDC Nice Apr 2019

IC-SDV 2019
April 9-10, 2019
Nice, France
Addressing requirements for
real-world deployments of ML & NLP
Stefan Geißler, Kairntech

Agenda
Looking back: the NLP landscape has changed
dramatically
Algorithms → Data!
Support dataset creation: The Kairntech Sherpa
Kairntech? Who are we
Conclusion

Looking back : NLP landscape has changed
2000:
Very few open source components
Lexicons, Taggers, Morphology,
Parsers mostly proprietary, complex to
install and maintain, limited coverage
« Make or Buy »
High level of manual effort in
creating and maintaining lexical
knowledge bases, rule systems

Today
2019:
Sharing! (Github, …)
Lexicons, Taggers, Morphology,
Parsers often in the public domain
« Combine & Adapt »
Broad success of learning-based
approaches

2019: A tipping point in ML & NLP?
• « 2018 was the ‘ImageNet’ moment for deep learning in NLP » (S. Ruder)
  • In image processing, a deep learning network won a public contest by a large margin in 2012. In 2018 we saw exciting NLP models implementing transfer learning: ELMo, ULMFiT, BERT
• « ML Engineering in NLP will truly blossom in 2019 » (E. Ameisen)
  • Focus on tools beyond model building! Link NLP/AI to production use! What does it mean to build data-driven products and services?
• « Enough papers: Let’s build AI now! » (A. Ng, 2017)
  • « AI is the new electricity! »

Example: Named Entity Recognition
Cf. https://www.researchgate.net/publication/329933780_A_Survey_on_Deep_Learning_for_Named_Entity_Recognition/download
Many / most of these approaches are available with code

NLP: A commodity?
Named entity recognition in four steps:
$ pip install spacy
$ python -m spacy download en_core_web_sm
$ cat > testspacy.py
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Angela Merkel will meet Emmanuel Macron at the summit in Amsterdam")
for entity in doc.ents:
    print(entity.text)
CTRL-D
$ python testspacy.py
Angela Merkel
Emmanuel Macron
Amsterdam

Algorithms are a commodity
Even the top-scoring system from the list earlier is available on GitHub:
https://github.com/zalandoresearch/flair
For the record:
The survey does not list DeLFT (https://github.com/kermitt2/delft),
implemented by the Kairntech chief ML expert, which
• Also scores exactly 93.09% on CoNLL-2003
• Creates models that are very compact (~5MB vs. >150MB)
• Loads a model in ~2sec at initialization

Pain points
• Off-the-shelf NLP models often don’t work for specific needs
• Implementation is slowed down by the need to build specific training datasets
• AI/NLP services often require integration of business glossaries & knowledge graphs
• Absence of maintenance leads to quality degradation

Frequent requirements in real-world projects
• In many commercial scenarios around entity extraction, an entity not only has to be recognized but also typed
  • A DATE in a contract may be the date when the contract becomes effective, when it was signed, or when it will be terminated
  • A PERSON in a legal opinion may be the defendant, the lawyer, the judge, the witness …
  • A DISEASE in a clinical study may be the core therapeutic area or a peripheral, occasional adverse event
• This goes beyond what public named entity recognition modules offer
• Typically, no training corpora exist for these decisions. They must be established within a project.
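
The difference between recognizing an entity and typing it can be made concrete with a small sketch. It assumes an upstream NER step has already found a DATE in each sentence and then assigns a contract-specific role from cue words. The cue lists, role names and the `assign_date_role` helper are hypothetical; they stand in for a model that would be trained on project-specific annotations.

```python
# Hypothetical cue words per role. In a real project these decisions would be
# learned from annotated examples rather than written by hand.
ROLE_CUES = {
    "SIGNATURE_DATE": ("signed", "executed"),
    "EFFECTIVE_DATE": ("effective", "commence"),
    "TERMINATION_DATE": ("terminate", "expire"),
}


def assign_date_role(sentence: str) -> str:
    """Return a role for the DATE mentioned in `sentence`, or a generic fallback."""
    lowered = sentence.lower()
    for role, cues in ROLE_CUES.items():
        if any(cue in lowered for cue in cues):
            return role
    return "DATE"  # recognized, but not typed


examples = [
    "This agreement was signed on 3 March 2019.",
    "The contract becomes effective on 1 April 2019.",
    "The lease will terminate on 31 December 2021.",
    "The parties met on 5 February 2019.",
]
roles = [assign_date_role(s) for s in examples]
for sentence, role in zip(examples, roles):
    print(f"{role:17} {sentence}")
```

A public NER module would label all four dates identically as DATE; only the added role layer separates them, and that layer is what needs project-specific training data.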

You don’t have to take my word on that.
Let’s listen to what the experts say:
• Algorithms are a commodity, data is gold
Peter Norvig:
“We [at Google] don't have better algorithms than anyone else; we just have more data!”
“More data beats clever algorithms.”
Angela Merkel:
“Data is the new oil of the 21st century!”

So: We need data, not only algorithms
Charts copied from https://hackernoon.com/%EF%B8%8F-big-challenge-in-deep-learning-training-data-31a88b97b282

Requirements
What will be more important for
the success of your project?
Driving the training accuracy from, say,
92.4% to 93.6% on a pre-defined data set?
or
ML components that allow high quality with
small training sets and moderate annotation and
training time?

Example
• The CoNLL-2003 data set used in many academic NER experiments contains >100000 entities
• Assume 30sec per entity → about 100 person days of pure annotation time! (With one single annotator)
Unrealistic in most commercial project settings.
Commercial projects have requirements that are different from academic research!
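
The person-day estimate above can be checked with a few lines of arithmetic. The entity count and the 30 seconds per entity are the figures quoted on the slide; the eight-hour working day is an assumption added here.

```python
ENTITIES = 100_000        # lower bound quoted on the slide
SECONDS_PER_ENTITY = 30   # annotation time per entity assumed on the slide
HOURS_PER_DAY = 8         # assumption: one annotator, eight productive hours a day

total_hours = ENTITIES * SECONDS_PER_ENTITY / 3600
person_days = total_hours / HOURS_PER_DAY

print(f"{total_hours:.0f} hours = {person_days:.0f} person days")
```

With these inputs the result is a little over 100 person days, in line with the round figure on the slide.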

On dataset preparation: Requirements
Web-based (no install), intuitive GUI, usable by domain experts
Limit manual annotation efforts: Active Learning
Collaboration (work in teams, measure inter-annotator agreement)
Not just NER annotation: Entity typing, document categorization, …
Must facilitate deployment-to-production
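
Inter-annotator agreement, mentioned in the list above, is often reported as Cohen's kappa for two annotators. The sketch below is a plain-Python illustration with made-up labels; it is not taken from any particular annotation tool.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Share of items on which both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up annotations of eight entity mentions by two annotators.
annotator_1 = ["PER", "PER", "LOC", "ORG", "LOC", "PER", "ORG", "LOC"]
annotator_2 = ["PER", "PER", "LOC", "LOC", "LOC", "PER", "ORG", "ORG"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

Kappa corrects raw agreement for the agreement expected by chance, so a team can tell whether annotation guidelines are clear enough before a model is trained on the data.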

Why another tool?
• WebAnno:
  • Scientific focus: « Annotate corpora to allow the study of linguistic phenomena »
  • Sentence-based, losing all layout information
• spaCy/Prodi.gy:
  • Focus on local/lexical named entity recognition. By default, the underlying model considers a narrow window of n (n=4) words to the left and right.
• Brat:
  • Interface only. Integration with model building, semi-automatic suggestions, deployment?

Kairntech Sherpa
[Architecture diagram; labels:]
• Raw or preannotated corpora: text, audio, …
• Annotation environment: search, collaboration, manual & assisted annotation, quality metrics, synchronisation into ML model
• User
• Curated annotations
• ML model
• Automatic annotation suggestions
• Datasets and ML models
Active Learning?
• Reduce effort in manual annotation of data by presenting the user with data in some informed order:
  • Ask the user for feedback on the samples that promise the highest benefit: samples on which the model is least certain*
(*) Diagrams used from datacamp.com
• Active Learning applied to NLP tasks has been shown to reduce the amount of required training data dramatically
  • 7% of the samples under an AL regime yield the same quality as the full sample under naive selection (cf. Laws 2012: https://d-nb.info/1030521204/34)
  • In a project that would mean 1 day of annotation instead of 14 days
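
Selecting the "least certain" samples can be sketched as least-confidence sampling: rank unlabeled samples by the probability the model gives to its top class and hand the lowest-ranked ones to the annotator. The probabilities below are made up, and `least_confident` is a hypothetical helper, not part of any specific library.

```python
def least_confident(probabilities, k):
    """Return the indices of the k samples whose top class probability is lowest.

    `probabilities` holds one list of class probabilities per unlabeled sample.
    """
    confidence = [max(p) for p in probabilities]
    ranked = sorted(range(len(probabilities)), key=lambda i: confidence[i])
    return ranked[:k]


# Hypothetical model output for five unlabeled samples and three classes.
pool = [
    [0.98, 0.01, 0.01],
    [0.40, 0.35, 0.25],
    [0.70, 0.20, 0.10],
    [0.34, 0.33, 0.33],
    [0.90, 0.05, 0.05],
]
to_annotate = least_confident(pool, k=2)
print(to_annotate)
```

The project estimate follows directly from the 7% figure: with 14 days of annotation as the baseline, 0.07 × 14 is roughly 1 day.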

Benefits of AL?
• Accuracy on a (simple) ML task grows as the number of samples grows
• Naive selection (« Random », orange line) grows slowly
• Informed selection (QBC, « query by committee », red line) grows much faster
• AL promises to reduce the effort required for manual annotation
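
Query by committee can be illustrated with vote entropy: several models label each unlabeled sample, and the samples on which they disagree most are sent to the annotator first. The committee votes below are made up, and the function names are hypothetical.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Entropy of the label distribution among committee votes for one sample."""
    total = len(votes)
    return -sum((c / total) * math.log(c / total) for c in Counter(votes).values())


def query_by_committee(committee_votes, k):
    """Return indices of the k samples on which the committee disagrees most."""
    scores = [vote_entropy(v) for v in committee_votes]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


# Hypothetical votes of a three-model committee on four unlabeled samples.
committee_votes = [
    ["PER", "PER", "PER"],  # full agreement
    ["PER", "ORG", "LOC"],  # maximal disagreement
    ["LOC", "LOC", "ORG"],  # partial disagreement
    ["ORG", "ORG", "ORG"],  # full agreement
]
selected = query_by_committee(committee_votes, k=2)
print(selected)
```

Samples with unanimous votes carry little new information for the models, which is why random selection, which picks them just as often as contested ones, improves accuracy more slowly.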

A non-expert workflow for dataset creation
Ask the
application for
suggestions
(De-) validate
and retrain
Once satisfied,
export/deploy
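
The three steps above form a loop, which can be sketched with a toy stand-in for the ML model: a lexicon of strings it believes are entities. Everything here is hypothetical, including the function names and the simulated user; it does not describe the Sherpa API.

```python
import json


def suggest(lexicon, tokens):
    """Ask the 'model' for suggestions: tokens it currently believes are entities."""
    return {t for t in tokens if t in lexicon}


def review(suggestions, tokens, gold):
    """Simulate the user: accept correct suggestions, reject wrong ones, add misses."""
    accepted = suggestions & gold
    rejected = suggestions - gold
    added = (set(tokens) & gold) - suggestions
    return accepted | added, rejected


def retrain(lexicon, validated, rejected):
    """Fold the user's feedback back into the model."""
    return (lexicon | validated) - rejected


lexicon = {"Merkel", "summit"}            # initial model with one wrong entry
gold = {"Merkel", "Macron", "Amsterdam"}  # what the domain expert knows
tokens = "Merkel will meet Macron at the summit in Amsterdam".split()

for round_no in range(1, 4):
    suggestions = suggest(lexicon, tokens)
    validated, rejected = review(suggestions, tokens, gold)
    if not rejected and validated == suggestions:
        break                             # the user is satisfied
    lexicon = retrain(lexicon, validated, rejected)

exported = json.dumps(sorted(lexicon))    # export/deploy step
print(round_no, exported)
```

In this toy run the second round produces no corrections, so the loop stops and the model is exported; with a real model each retraining step would update its parameters instead of a lexicon.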

About Kairntech
• Kairntech: The company
  • Created in Dec 2018, 10 partners
  • France (Paris & Grenoble/Meylan), Germany (Heidelberg)
• Kairntech: The team
  • Background in software engineering, machine learning, sales, management
  • 15+ years of experience in NLP development and deployment from Xerox, IBM, TEMIS. Development of components currently in production at CERN, NASA, EPO, …

Kairntech: Our profile
• Industrialize the creation of document sets (training corpora) by offering an environment for data preparation by domain experts that is easy and efficient to use
• Transform data sets into document analysis services, adding value to enterprise knowledge repositories (e.g. knowledge graphs)
• Industrial deployment and maintenance of these services

Conclusions
• So much data!
  • But very little of it labelled and useful for supervised learning
• So many pretrained models!
  • But most of the time they do not quite do what you need in your project
• So many algorithms!
  • But a library alone will not allow you to implement the solution you need
• Kairntech is there to support you!

Thank you for your attention!
Stefan.Geissler@kairntech.com
