Natural Language Processing on Text
Devoxx, April 2019
Hyperlex
Contract Management: Analysis, Data Extraction, and Review
Machine Learning Pipeline
Documents
[Mock document illustration: headings "Title 1" and "Title 2", lorem-ipsum body text, a header, a table, and a watermark]
Pipeline stages, from raw documents to understanding:
● Document classification
● Optical Character Recognition
● Text cleaning and recomposition
● Paragraph segmentation
● Paragraph classification
● Named Entity Recognition
● Hierarchical Data Recomposition
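Purely as an illustration of the flow, the stages can be sketched as composed functions. Every function below is an invented stub standing in for a real model, not a library API:

def run_ocr(page):           return page["text"]            # stand-in for an OCR engine
def clean(text):             return " ".join(text.split())  # de-hyphenation, artifact removal
def segment(text):           return text.split("\n\n")      # paragraph segmentation
def classify_paragraph(p):   return "clause"                # stand-in for a trained classifier
def extract_entities(p):     return []                      # stand-in for an NER model

def process(page):
    text = run_ocr(page)
    paragraphs = [clean(p) for p in segment(text)]
    return [(classify_paragraph(p), extract_entities(p)) for p in paragraphs]

print(process({"text": "First clause.\n\nSecond clause."}))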
Common NLP tasks
Example sentence: "My father went to Devoxx last year when he was in France."
● Named Entity Recognition (NER): "Devoxx" → ORGANIZATION, "France" → LOCATION
● Part-of-speech tagging: "went" → VERB
● Coreference Resolution (CR): "My father" and "he" → the same PERSON
● Entity Mention Detection (EMD)
● Relation Extraction (RE)
● Language Modeling
● Question Answering
● Summarization
● Machine Translation
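As a quick illustration, several of these tasks are exposed by off-the-shelf libraries. A minimal sketch with spaCy, assuming the small English model en_core_web_sm is installed (the exact labels it assigns may differ from the ones above):

import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline: tagger, parser, NER
doc = nlp("My father went to Devoxx last year when he was in France.")

# Named Entity Recognition
print([(ent.text, ent.label_) for ent in doc.ents])

# Part-of-speech tagging
print([(token.text, token.pos_) for token in doc])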
Traditional Machine Learning
[Pipeline diagram: input words ("to", "Devoxx", "last") → Feature Representation → Learning Function → Label prediction (e.g. ORGANIZATION)]
Preprocessing:
● Stemming
● Lemmatization
● Word segmentation
● Vectorization
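A hedged sketch of these preprocessing steps with NLTK and scikit-learn (assuming the punkt and wordnet resources have been downloaded; the sample sentences are invented):

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

# Word segmentation (tokenization)
tokens = nltk.word_tokenize("My father went to Devoxx last year.")

# Stemming: crude rule-based suffix stripping
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])

# Lemmatization: dictionary-based normalization, here with a verb POS hint
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("went", pos="v"))  # 'go'

# Vectorization: map documents to count vectors
vec = CountVectorizer()
X = vec.fit_transform(["My father went to Devoxx", "He was in France"])
print(X.toarray())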
Learning Functions
● Linear Regression
● Logistic Regression
● Support-vector machine
● Perceptron
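Before the CRF example below, a minimal scikit-learn sketch of one such learning function, logistic regression over bag-of-words features (the tiny training set is invented for illustration):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented toy data: classify short snippets as 'legal' vs 'other'.
texts = ["This Agreement shall terminate", "hereby agrees to indemnify",
         "the cat sat on the mat", "my father went to Devoxx"]
labels = ["legal", "legal", "other", "other"]

model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(texts, labels)
print(model.predict(["The Issuer hereby agrees"]))  # expected: ['legal'] on this toy data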
Conditional Random Field
import nltk
import sklearn_crfsuite
from sklearn_crfsuite import metrics

def word2features(sent, i):
    # sent is a list of (word, postag, label) tuples; build features for token i.
    word = sent[i][0]
    postag = sent[i][1]
    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': postag,
        'postag[:2]': postag[:2],
    }
    if i > 0:
        # Context features from the previous token.
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
        })
    else:
        features['BOS'] = True  # beginning of sentence
    ...

# CoNLL 2002 data; sent2features/sent2labels (elided) map a sentence to its
# per-token feature dicts and labels, and the test data is built the same way.
nltk.corpus.conll2002.fileids()
X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]
...
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,   # L1 regularization
    c2=0.1,   # L2 regularization
    max_iterations=100,
    all_possible_transitions=True
)
crf.fit(X_train, y_train)
...
# Score only the entity labels (the 'O' tag is excluded); `labels`
# must be defined before the F1 call.
labels = ['B-LOC', 'B-ORG', 'B-PER', 'I-PER',
          'B-MISC', 'I-ORG', 'I-LOC', 'I-MISC']
y_pred = crf.predict(X_test)
metrics.flat_f1_score(y_test, y_pred,
                      average='weighted',
                      labels=labels)
Conditional random field
precision recall f1-score support
B-LOC 0.775 0.757 0.766 1084
I-LOC 0.601 0.631 0.616 325
B-MISC 0.698 0.499 0.582 339
I-MISC 0.644 0.567 0.603 557
B-ORG 0.795 0.801 0.798 1400
I-ORG 0.831 0.773 0.801 1104
B-PER 0.812 0.876 0.843 735
I-PER 0.873 0.931 0.901 634
avg / total 0.779 0.764 0.770 6178
Going Deeper
Going Deep
From one layer to many hidden layers
[Diagram: word vectors for "to", "Devoxx", "last" feed a stack of Learning Functions; a loss function drives Backpropagation through each layer, ending in a Label prediction such as ORGANIZATION]
Word Vectors
[Embedding-space illustration: "Confidential" and "Personal" lie close together, far from "cat"]
Source : Efficient Estimation of Word Representations in Vector Space - Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean - 2013
“The Issuer hereby agrees to hold and treat all Confidential Information”
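A hedged gensim sketch of training word vectors (gensim ≥ 4 API; the toy corpus is invented, and meaningful neighbors require millions of real sentences):

from gensim.models import Word2Vec

sentences = [["confidential", "information", "shall", "remain", "personal"],
             ["the", "cat", "sat", "on", "the", "mat"],
             ["personal", "data", "is", "confidential"]]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)
# Words sharing contexts should end up closer than unrelated ones.
print(model.wv.similarity("confidential", "personal"))
print(model.wv.similarity("confidential", "cat"))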
Word Vectors
Paragraph and document embedding
Produce a vector from a paragraph or document
Source : Distributed Representations of Sentences and Documents - Quoc Le, Tomas Mikolov
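The paragraph-vector idea of Le & Mikolov is available in gensim as Doc2Vec; a minimal sketch (gensim ≥ 4 API, invented toy data):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words=["confidential", "information", "clause"], tags=[0]),
        TaggedDocument(words=["termination", "of", "this", "agreement"], tags=[1])]

model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=40)

# Infer a vector for an unseen paragraph
vec = model.infer_vector(["hold", "and", "treat", "confidential", "information"])
print(vec.shape)  # (50,)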
Term Frequency–Inverse Document Frequency
TF(t) = (number of times term t appears in a document) / (total number of terms in the document)
IDF(t) = ln(total number of documents / number of documents containing term t)
from sklearn.feature_extraction.text import TfidfVectorizer

# df is assumed to be a pandas DataFrame with a 'Consumer_complaint_narrative' text column.
tfidf = TfidfVectorizer(encoding='latin-1', ngram_range=(1, 2),
                        stop_words='english')
features = tfidf.fit_transform(df.Consumer_complaint_narrative).toarray()
features.shape
>> (4569, 12633)
4,569 documents, each represented by 12,633 features: the tf-idf scores of the unigrams and bigrams in the vocabulary
Entity Recognition with Deep Learning
"My father went to Devoxx last year when he was in France." — every token tagged "-" (outside) except "Devoxx", tagged ORG
Recurrent Neural Network
[Diagram: tokens "My", "father", "went", ..., "Devoxx" processed sequentially, each step emitting a tag ("-" or ORG)]
Source : Understanding LSTM Networks - https://colah.github.io/posts/2015-08-Understanding-LSTMs/ - 2015
Long Short-Term Memory
[Diagram: LSTM cells over "went", "to", "Devoxx", emitting the tags "-", "-", ORG]
Source : Understanding LSTM Networks - https://colah.github.io/posts/2015-08-Understanding-LSTMs/ - 2015
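A minimal PyTorch sketch of an LSTM sequence tagger of this kind (sizes and token ids are arbitrary; a real NER model adds pretrained embeddings, training, and far more data):

import torch
import torch.nn as nn

class LSTMTagger(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_tags):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_tags)

    def forward(self, token_ids):            # (batch, seq_len)
        x = self.embed(token_ids)            # (batch, seq_len, embed_dim)
        h, _ = self.lstm(x)                  # (batch, seq_len, hidden_dim)
        return self.out(h)                   # (batch, seq_len, num_tags) tag scores

tagger = LSTMTagger(vocab_size=1000, embed_dim=32, hidden_dim=64, num_tags=3)
tokens = torch.tensor([[12, 47, 5]])         # e.g. "went to Devoxx"
print(tagger(tokens).argmax(-1))             # one tag id per token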
Deep Contextualized Word Representations
ELMo (Embeddings from Language Models)
An LSTM-based language model trained on a large corpus of text.
[Architecture diagram: input tokens ("My", "father", "went") → Word Embedding → Forward LSTM + Backward LSTM → Word Prediction]
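A hedged sketch of getting ELMo vectors with AllenNLP's ElmoEmbedder as it existed around this time (the pretrained weights download on first use; the API has since changed in later AllenNLP versions):

from allennlp.commands.elmo import ElmoEmbedder

elmo = ElmoEmbedder()
vectors = elmo.embed_sentence(["My", "father", "went"])
print(vectors.shape)  # (3 layers, 3 tokens, 1024 dims); layers are typically mixed per task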
Deep Contextualized Word Representations
ELMo captures a word's sense based on its context
Source : Deep contextualized word representations - Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee,
Luke Zettlemoyer
Deep Contextualized Word Representations
Improves results on most NLP tasks
But about an order of magnitude slower (predictions roughly 20x slower)
Source : Deep contextualized word representations - Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee,
Luke Zettlemoyer
Sequence to Sequence
Source : Jay Alammar - https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/
Sequence to Sequence
Source : Sequence to Sequence Learning with Neural Networks - Ilya Sutskever, Oriol Vinyals, Quoc V. Le - 2014
Augmented Recurrent Neural Networks with Attention
Source : Chris Olah, Shan Carter - https://distill.pub/2016/augmented-rnns/#attentional-interfaces
Encoder Decoder with Attention
Source : Jay Alammar - https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/
Attention: Transformer
Source : Transformer: A Novel Neural Network Architecture for Language Understanding -
https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html
The self-attention mechanism directly models relationships between all words in a sentence, regardless of their respective positions.
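A minimal NumPy sketch of the scaled dot-product self-attention at the heart of the Transformer (random matrices stand in for the learned W_Q, W_K, W_V projections):

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, d_k):
    rng = np.random.default_rng(0)
    Q = X @ rng.normal(size=(X.shape[1], d_k))  # queries
    K = X @ rng.normal(size=(X.shape[1], d_k))  # keys
    V = X @ rng.normal(size=(X.shape[1], d_k))  # values
    # Every word attends to every word, regardless of position.
    scores = softmax(Q @ K.T / np.sqrt(d_k))    # (seq_len, seq_len)
    return scores @ V                           # (seq_len, d_k)

X = np.random.default_rng(1).normal(size=(5, 16))  # 5 tokens, 16-dim embeddings
print(self_attention(X, d_k=8).shape)              # (5, 8)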
Attention: Transformer
Source : Attention Is All You Need - Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia
Polosukhin - 2017
Image Captioning
Source : Image Captioning - https://github.com/yunjey/pytorch-tutorial/tree/master/tutorials/03-advanced/image_captioning - 2018
BERT
“The Issuer hereby agrees to hold and treat all Confidential Information”
Masked Language Model
“The Issuer hereby agrees to [...]” || “This Agreement shall terminate [...]”
Next sentence prediction
Source : J. Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding - 2018
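A hedged sketch of the masked-language-model objective using the Hugging Face transformers library (a successor to the pytorch-pretrained-bert package of this era; weights download on first use):

import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

text = "The Issuer hereby agrees to hold and treat all [MASK] Information"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Predict the most likely token at the masked position
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0]
print(tokenizer.decode(logits[0, mask_pos].argmax(-1)))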
BERT
Source : BERT Explained: State of the art language model for NLP -
https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270 - 2018
BERT - Training cost
Dataset: BookCorpus (800M words) + English Wikipedia (2500M words)
According to the paper, the English models took 4 days to pre-train on 16 to
64 TPUs (roughly USD 500 of compute for a BERT-base model)
English + multilingual models released by Google
Source : J. Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding - 2018
BERT - NER
"My father went to Devoxx last year when he was in France." — every token tagged "-" except "Devoxx" → ORG
"It was the best conference he ever attended." — every token tagged "-"
Model stack: Embedding → BERT (Transformer encoder) → Conditional Random Field
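A hedged sketch of the BERT side of such a tagger with transformers' BertForTokenClassification (the CRF layer on top, as in the slide, would be added separately; the label count here is arbitrary):

import torch
from transformers import BertTokenizer, BertForTokenClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForTokenClassification.from_pretrained("bert-base-cased", num_labels=9)

inputs = tokenizer("My father went to Devoxx last year.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits           # (1, seq_len, num_labels)

# Untrained head: predictions are random until fine-tuned on labeled NER data.
print(logits.argmax(-1))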
BERT - Model Architecture Comparison
Source : J. Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding - 2018
Benchmark
General Language Understanding Evaluation (GLUE) benchmark
Source : J. Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding - 2018
Our feedback on BERT
● Quite fast to fine-tune from BERT-base
(minutes to hours)
● Fine-tuning on the target training corpus is
needed (compared to fine-tuning only
on a general corpus)
● Fine-tuning only the extractor is already
enough, but jointly learning BERT + the classifier
helps a little more
● More experiments should be done with
>128 tokens and with BERT-large
Multi-Task Learning
Source : A Hierarchical Multi-task Approach for Learning Embeddings from Semantic Tasks - Victor Sanh, Thomas Wolf, Sebastian Ruder
Multi-Task Learning
Source : A Hierarchical Multi-task Approach for Learning Embeddings from Semantic Tasks - Victor Sanh, Thomas Wolf, Sebastian Ruder
Chronology
Source : Unsupervised Deep Learning - https://media.neurips.cc/Conferences/NIPS2018/Slides/graves-deeplearning2.pdf
Example
LSTM Text Generation
GoT Book 6 (LSTM trained on the first five ASOIAF/GOT books)
Tyrion could hear Lord Aemon’s coughing. “I miss for it. Why did you proper?”
“I feared Master Sansa, Ser,” Ser Jaime reminded her. “She Baratheon is one of the crossing. The second
sons of your onion concubine.”
Lady Donella length of a longsword, the hair that went ready to climb side from her. And all between
them were belaquo bonebreaker and the night’s watch ride in their room. Only he could not look at
them, even others sure. “How could you leave the world?”
“Some must, for you,” a woman’s voiced up lazily. “Gods, Reek.”
She poured off two eyes and stepped down under the fire. “She will find your brother, and now I heard
her since she was standing the bowl. The night was fair and damp.
Source : https://github.com/zackthoutt/got-book-6
GPT 2
"New AI fake text generator may be too dangerous to release"
SYSTEM PROMPT (HUMAN-WRITTEN)
A train carriage containing controlled nuclear materials was stolen in Cincinnati today. Its
whereabouts are unknown.
The incident occurred on the downtown train line, which runs from Covington and Ashland stations.
In an email to Ohio news outlets, the U.S. Department of Energy said it is working with the Federal Railroad
Administration to find the thief.
“The theft of this nuclear material will have significant negative consequences on public and environmental
health, our workforce and the economy of our nation,” said Tom Hicks, the U.S. Energy Secretary, in a statement.
“Our top priority is to secure the theft and ensure it doesn’t happen again.”
The stolen material was taken from the University of Cincinnati’s Research Triangle Park nuclear research site,
according to a news release from Department officials.
The Nuclear Regulatory Commission did not immediately release any information.
According to the release, the U.S. Department of Energy’s Office of Nuclear Material Safety and Security is leading
that team’s investigation.
...
Trained on 40 gigabytes of text retrieved from sources
linked by high-ranking Reddit posts. 1.5 billion parameters.
Source : https://openai.com/blog/better-language-models/
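A hedged sketch of sampling from the released GPT-2 checkpoint via the transformers library (the "gpt2" checkpoint is the public 117M-parameter model; sampled outputs vary from run to run):

from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "A train carriage containing controlled nuclear materials was stolen"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

output = model.generate(input_ids, max_length=60, do_sample=True, top_k=40,
                        pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output[0]))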
GPT 2 - Common Sense Reasoning
The trophy doesn’t fit into the brown suitcase because it
is too large.
Correct answer: it = trophy
Model answer: it = trophy
The trophy doesn’t fit into the brown suitcase because it
is too small.
Correct answer: it = suitcase
Model answer: it = suitcase
GPT 2 - Question Answering
Who wrote the book the origin of species?
Correct answer: Charles Darwin
Model answer: Charles Darwin
What is the largest state in the U.S. by land mass?
Correct answer: Alaska
Model answer: California
GPT 2 "junior" - Repetitive speeches
Public GPT 2: the 117-million-parameter version of the model
Source : https://arstechnica.com/information-technology/2019/02/twenty-minutes-into-the-future-with-openais-deep-fake-text-ai/
The last time a Republican Governor signed a bill that would force the states to accept millions of
more immigrants than they already do, that was in 2009. That's a long time to wait to see if that
legislation would pass, but it is already happening now. It is a long time to wait to see if that legislation
passes again. And that's part of the reason why our president is so popular. He is one who says things
that are good for the country. He has said things that are good for America. As a businessman, he has
said things that are good for America. And he has said things that are good for our communities. He
has said things that are good for our families. He has said things that are good for our nation. He has
said things that are good for America. And he has said things that are good for our country. And he
has said things that are good for America. And he has said things that are good for America. And he
has said things that are good for America. And he has said things that are good for America. And he
has said things that are good for America. And he has said things that are good for America.
GPT 2 "junior" - Fake News Test
SYSTEM PROMPT (HUMAN-WRITTEN)
Iran has announced that it has successfully tested a hydrogen bomb.
As for the U.N., the council's permanent representative under President Barack Obama has
met to see to it that it follows the Paris climate agreement, which has allowed states to
move away from fossil fuels.
…
Kim Jong-un, the leader of North Korea and most closely aligned with the United States, has
warned of an imminent U.S. attack. The test of a hydrogen bomb Thursday killed 13 people
and injured several others in a Pyongyang explosion, the country's state TV station
reported.<|endoftext|>Coconut Cream
…
This is an amazing flavor that will satisfy any dessert enthusiast, but in most cases doesn't
taste sweet in moderation. It also creates a great nut-free breakfast cereal and can be
served with peanut butter or coconut milk. The taste can just as great as any dessert
topping.
Ingredients:
1/3 cup coconut milk
1/2 cup low-fat unsweetened coconut milk
...
Questions?
Hyperlex.ai
Alexis Agahi
@aagahi