Natural Language Processing
Hichem Felouat
hichemfel@gmail.com
Hichem Felouat - hichemfel@gmail.com 2
Natural Language Processing
• Natural language processing (NLP) is a subfield of artificial
intelligence concerned with the interactions between computers
and human (natural) languages, in particular how to program
computers to process and analyze large amounts of natural
language data.
• Challenges in natural language processing frequently involve
speech recognition, natural language understanding, and
natural language generation.
Hichem Felouat - hichemfel@gmail.com 3
Natural Language Processing
Hichem Felouat - hichemfel@gmail.com 4
Tokenizer API
import tensorflow as tf
from tensorflow import keras
import numpy as np
texts = ["I love Algeria", "machine learning", "Artificial intelligence","AI"]
tokenizer = keras.preprocessing.text.Tokenizer(char_level=True)
# tokenizer = keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(texts)
print("Total number of documents : n",tokenizer.document_count)
print("Number of distinct characters/words: n",len(tokenizer.word_index))
print("word_index : n",tokenizer.word_index)
print("word_counts : n",tokenizer.word_counts)
print("word_docs : n",tokenizer.word_docs)
print("texts_to_sequences : (Algeria) n",tokenizer.texts_to_sequences(["Algeria"]))
print("sequences_to_texts : n",tokenizer.sequences_to_texts([[4, 3, 7, 2, 8, 1, 4]]))
• word_counts: A dictionary of words and their counts.
• word_docs: A dictionary of words and how many
documents each appeared in.
• word_index: A dictionary of words and their uniquely
assigned integers.
• document_count: An integer count of the total number
of documents that were used to fit the Tokenizer.
Hichem Felouat - hichemfel@gmail.com 5
Total number of documents :
4
Number of distinct characters/words :
15
word_index :
{'i': 1, 'e': 2, 'a': 3, 'l': 4, 'n': 5, ' ': 6, 'g': 7, 'r': 8, 'c': 9, 't': 10, 'o': 11, 'v': 12, 'm': 13, 'h': 14, 'f': 15}
word_counts :
OrderedDict([('i', 10), (' ', 4), ('l', 6), ('o', 1), ('v', 1), ('e', 7), ('a', 7), ('g', 3), ('r', 3), ('m', 1), ('c', 3),
('h', 1), ('n', 5), ('t', 2), ('f', 1)])
word_docs :
defaultdict(<class 'int'>, {'a': 4, 'v': 1, 'r': 3, ' ': 3, 'g': 3, 'o': 1, 'l': 3, 'e': 3, 'i': 4, 'c': 2, 'h': 1, 'm': 1,
'n': 2, 'f': 1, 't': 1})
texts_to_sequences : (Algeria)
[[3, 4, 7, 2, 8, 1, 3]]
sequences_to_texts :
['a l g e r i a']
Tokenizer API - Char
Hichem Felouat - hichemfel@gmail.com 6
Tokenizer API - Words
Total number of documents :
4
Number of distinct characters/words :
8
word_index :
{'i': 1, 'love': 2, 'algeria': 3, 'machine': 4, 'learning': 5, 'artificial': 6, 'intelligence': 7, 'ai': 8}
word_counts :
OrderedDict([('i', 1), ('love', 1), ('algeria', 1), ('machine', 1), ('learning', 1), ('artificial', 1),
('intelligence', 1), ('ai', 1)])
word_docs :
defaultdict(<class 'int'>, {'i': 1, 'love': 1, 'algeria': 1, 'machine': 1, 'learning': 1, 'artificial': 1,
'intelligence': 1, 'ai': 1})
texts_to_sequences : (Algeria)
[[3]]
sequences_to_texts :
['algeria machine intelligence love ai i algeria']
Hichem Felouat - hichemfel@gmail.com 7
texts_to_sequences
# Let’s encode the full text so each character/word is represented by its ID
encoded = tokenizer.texts_to_sequences(texts)
print("Encode the full text : n",encoded)
Char :
{'i': 1, 'e': 2, 'a': 3, 'l': 4, 'n': 5, ' ': 6, 'g': 7, 'r': 8, 'c': 9, 't': 10, 'o': 11, 'v': 12, 'm': 13, 'h': 14, 'f': 15}
==> [[1, 6, 4, 11, 12, 2, 6, 3, 4, 7, 2, 8, 1, 3], [13, 3, 9, 14, 1, 5, 2, 6, 4, 2, 3, 8, 5, 1, 5, 7], [3, 8, 10, 1, 15, 1,
9, 1, 3, 4, 6, 1, 5, 10, 2, 4, 4, 1, 7, 2, 5, 9, 2], [3, 1]]
Word :
{'i': 1, 'love': 2, 'algeria': 3, 'machine': 4, 'learning': 5, 'artificial': 6, 'intelligence': 7, 'ai': 8}
==> [[1, 2, 3], [4, 5], [6, 7], [8]]
from sklearn.model_selection import train_test_split
# y: one class label per text, assumed to be defined elsewhere
X_train, X_test, y_train, y_test = train_test_split(
    encoded, y, test_size=0.20, random_state=100)
Hichem Felouat - hichemfel@gmail.com 8
texts_to_matrix
encoded_docs = tokenizer.texts_to_matrix(texts, mode="tfidf")
• binary : Whether or not each word is present in the document.
This is the default.
• count : The count of each word in the document.
• freq : The frequency of each word as a ratio of words within
each document.
• tfidf : The Term Frequency-Inverse Document Frequency (TF-IDF)
score for each word in the document.
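As a quick sketch of what these modes produce, assume the word-level tokenizer (the commented-out alternative above) was fitted on the same four texts; each document then becomes one row of length len(word_index) + 1, since index 0 is reserved:

# One row per document, one column per word ID (column 0 is never used).
encoded_binary = tokenizer.texts_to_matrix(texts, mode="binary")
encoded_tfidf = tokenizer.texts_to_matrix(texts, mode="tfidf")
print(encoded_binary.shape)  # (4, 9) for the 8-word vocabulary
print(encoded_binary[0])     # [0. 1. 1. 1. 0. 0. 0. 0. 0.] -> "i love algeria"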
Hichem Felouat - hichemfel@gmail.com 9
texts_to_matrix
Hichem Felouat - hichemfel@gmail.com 10
Sequence Padding
from tensorflow.keras.preprocessing.sequence import pad_sequences
# define sequences
sequences = [ [1, 2, 3, 4], [1, 2, 3], [1] ]
# Padding sequence data
result = pad_sequences(sequences, maxlen=4, truncating='post')
print("result : n",result)
result :
[[1 2 3 4]
[0 1 2 3]
[0 0 0 1]]
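pad_sequences pads at the start by default (padding='pre'); a quick sketch of the padding='post' variant for comparison:

# Pad at the end of each sequence instead of the beginning.
result_post = pad_sequences(sequences, maxlen=4, padding='post')
print(result_post)
# [[1 2 3 4]
#  [1 2 3 0]
#  [1 0 0 0]]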
Hichem Felouat - hichemfel@gmail.com 11
• The simplest possible RNN is composed of one neuron receiving inputs,
producing an output, and sending that output back to itself (figure, left).
• We can represent this tiny network against the time axis, as shown in the
figure on the right. This is called unrolling the network through time.
Recurrent Neural Network (RNN)
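In Keras, this single-neuron RNN takes one line (a minimal sketch; input_shape=[None, 1] means sequences of any length with one feature per time step):

model = keras.models.Sequential([
    keras.layers.SimpleRNN(1, input_shape=[None, 1])
])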
Hichem Felouat - hichemfel@gmail.com 12
Seq-to-seq (top left), seq-to-vector (top right), vector-to-seq (bottom left), and Encoder–Decoder
(bottom right) networks.
Recurrent Neural Network (RNN)
Hichem Felouat - hichemfel@gmail.com 13
Deep RNNs
model = keras.models.Sequential([
    keras.layers.SimpleRNN(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.SimpleRNN(20, return_sequences=True),
    keras.layers.SimpleRNN(1)
])
• If you only care about the last output, make sure to set
return_sequences=True for all recurrent layers except the last one.
• It might be preferable to replace the output layer with a Dense layer,
as in the variant below.
model = keras.models.Sequential([
    keras.layers.SimpleRNN(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.SimpleRNN(20),
    keras.layers.Dense(1)
])
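A hedged training sketch for this sequence-to-vector model, using made-up random series just to show the expected shapes (X is [batch, time steps, 1]; y holds one target value per series):

import numpy as np

X = np.random.rand(100, 50, 1)  # 100 series, 50 time steps, 1 feature
y = np.random.rand(100, 1)      # one target value per series

model.compile(loss="mse", optimizer="adam")
model.fit(X, y, epochs=2)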
Hichem Felouat - hichemfel@gmail.com 14
Deep RNNs
• To turn the model into a sequence-to-sequence model, we must set
return_sequences=True in all recurrent layers (even the last one), and we
must apply the output Dense layer at every time step.
• Keras offers a TimeDistributed layer for this very purpose: it wraps any
layer (e.g., a Dense layer) and applies it at every time step of its input
sequence.
model = keras.models.Sequential([
    keras.layers.SimpleRNN(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.SimpleRNN(20, return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(10))
])
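Because TimeDistributed applies the Dense layer at every step, the targets must now supply one value per time step; a sketch with random data shaped to match Dense(10):

import numpy as np

X = np.random.rand(100, 50, 1)   # input: [batch, time steps, 1]
Y = np.random.rand(100, 50, 10)  # target: one 10-vector per time step

model.compile(loss="mse", optimizer="adam")
model.fit(X, Y, epochs=2)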
Hichem Felouat - hichemfel@gmail.com 15
Long Short-Term Memory (LSTM)
LSTM cell
Hichem Felouat - hichemfel@gmail.com 16
Long Short-Term Memory (LSTM)
model = keras.models.Sequential([
    keras.layers.LSTM(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.LSTM(20, return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(10))
])
Hichem Felouat - hichemfel@gmail.com 17
Gated Recurrent Unit (GRU)
• The GRU cell is a simplified version of the LSTM cell,
and it seems to perform just as well.
• GRUs often improve performance, but not always, and
there is no clear pattern for which tasks are better off
with them: you will have to try them on your task and
see if they help.
• model.add(keras.layers.GRU(N))
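A sketch mirroring the LSTM model above, with GRU layers as drop-in replacements:

model = keras.models.Sequential([
    keras.layers.GRU(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.GRU(20, return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(10))
])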
Hichem Felouat - hichemfel@gmail.com 18
Gated Recurrent Unit (GRU)
GRU cell
Hichem Felouat - hichemfel@gmail.com 19
Reusing Pretrained Embeddings
TensorFlow 2.0 introduced Keras as the default high-level API for
building models. Combined with pretrained models from TensorFlow
Hub, it provides a dead-simple way to do transfer learning in NLP and
create good models out of the box.
import tensorflow_hub as hub

model = keras.Sequential([
    hub.KerasLayer("https://tfhub.dev/google/tf2-preview/nnlm-en-dim50/1",
                   dtype=tf.string, input_shape=[], output_shape=[50]),
    ...
])
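One way the elided part might be completed: a minimal sketch assuming a binary sentiment-classification task (the Dense layers are illustrative additions, not from the original slide):

import tensorflow as tf
from tensorflow import keras
import tensorflow_hub as hub

model = keras.Sequential([
    # Pretrained 50-dimensional sentence embeddings from TF Hub.
    hub.KerasLayer("https://tfhub.dev/google/tf2-preview/nnlm-en-dim50/1",
                   dtype=tf.string, input_shape=[], output_shape=[50]),
    keras.layers.Dense(16, activation="relu"),    # illustrative hidden layer
    keras.layers.Dense(1, activation="sigmoid"),  # binary sentiment output
])
model.compile(loss="binary_crossentropy", optimizer="adam",
              metrics=["accuracy"])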
Hichem Felouat - hichemfel@gmail.com 20
Reusing Pretrained Embeddings
BERT was developed by researchers at Google in 2018 and has
achieved state-of-the-art results on a variety of natural language
processing tasks such as text classification, text summarization,
and text generation.
Hichem Felouat - hichemfel@gmail.com 21
Reusing Pretrained Embeddings
import torch
import transformers as ppb  # pytorch transformers
# !pip install transformers

# For DistilBERT instead of BERT, use this line:
# model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, 'distilbert-base-uncased')
model_class, tokenizer_class, pretrained_weights = (ppb.BertModel, ppb.BertTokenizer, 'bert-base-uncased')

# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

print(tokenizer.encode("Natural language processing", add_special_tokens=True))
print(tokenizer.encode("arabic language", add_special_tokens=True))
print(tokenizer.encode("hello", add_special_tokens=True))
[101, 3019, 2653, 6364, 102]
[101, 5640, 2653, 102]
[101, 7592, 102]
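To turn these token IDs into embeddings, pass them through the model; a hedged sketch (outputs[0] is the last hidden state in the transformers versions this code targets; newer versions also expose it as .last_hidden_state):

# Encode one sentence and extract its contextual embeddings.
input_ids = torch.tensor([tokenizer.encode("Natural language processing",
                                           add_special_tokens=True)])
with torch.no_grad():
    outputs = model(input_ids)

last_hidden = outputs[0]              # shape: [1, sequence_length, 768]
cls_embedding = last_hidden[:, 0, :]  # [CLS] vector, a common sentence feature
print(cls_embedding.shape)            # torch.Size([1, 768])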
Hichem Felouat - hichemfel@gmail.com 22
Reusing Pretrained Embeddings
Hichem Felouat - hichemfel@gmail.com 23
Hichem Felouat - hichemfel@gmail.com 24
Sentiment Analysis:
https://github.com/hichemfelouat/my-codes-of-machine-learning/blob/master/Sentiment_Analysis.py
Sentiment Analysis with TL:
https://github.com/hichemfelouat/my-codes-of-machine-learning/blob/master/Sentiment_Analysis_TL.py
Predict next char:
https://github.com/hichemfelouat/my-codes-of-machine-learning/blob/master/Predict_next_char.py
BERT:
https://github.com/hichemfelouat/my-codes-of-machine-learning/blob/master/BERT.py
Machine Translation:
https://github.com/hichemfelouat/my-codes-of-machine-learning/blob/master/Machine_Translation.py
Hichem Felouat - hichemfel@gmail.com 25
Thanks for your attention