Text classification
Knowledge session
Agenda
● Text classification
● Sparse data
○ Dimensionality reduction / visualization sparse data
○ Classification on sparse data
● Text embedding
○ Short explanation doc2vec
○ Visualization sparse vs embedded
○ Classification sparse vs embedded
● Hands-on!
Text classification - Definition
● Text classification is the task of assigning predefined categories to free-text documents.
Example: News article classification
What is the category of this news article?
[Figure: classification of the article into the category "Sunken ships"]
Example: News article classification
[Figure: example articles for the categories "Great war" and "Sunken ships"]
Example: Every word is a feature
Feature dimension   Document 1   Document 2   Document 3   Document 4
                    (Class A)    (Class A)    (Class B)    (Class B)
1: arrived          0            1            4            5
2: received         1            2            3            5
3: gold             4            4            4            1
4: a                1            0            1            2
5: energy           5            5            5            3
Each document column is a feature vector; the feature dimensions together form the feature space.
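A minimal sketch of how word counts like these could be produced, assuming whitespace tokenization and lowercasing (the slides do not say how the text was tokenized). The two toy documents and the helper names `build_vocabulary` and `count_vector` are hypothetical.

```python
from collections import Counter

def build_vocabulary(documents):
    """Return the sorted list of distinct words across all documents."""
    return sorted({word for doc in documents for word in doc.lower().split()})

def count_vector(document, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]

# Hypothetical toy documents, not taken from the slides.
documents = [
    "gold arrived gold",
    "energy received a gold",
]
vocabulary = build_vocabulary(documents)
vectors = [count_vector(doc, vocabulary) for doc in documents]

for doc, vec in zip(documents, vectors):
    print(vec, "<-", doc)
```

Every distinct word adds one dimension, so the vector length grows with the vocabulary rather than with the length of a single document.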
Dimensionality
Features (one word per feature)
Classes
Text = high dimensional
Feature dimension   Document 1   Document 2   Document 3   Document 4
                    (Class A)    (Class A)    (Class B)    (Class B)
1: arrived          0            1            4            5
2: received         1            2            3            5
3: gold             4            4            4            1
4: a                1            0            1            2
5: energy           5            5            5            3
Text = sparse
Feature dimension   Document 1   Document 2   Document 3   Document 4   Document 5
                    (Class A)    (Class A)    (Class A)    (Class A)    (Class ?)
1: acquired         0            0            1            0            0
2: received         0            2            0            0            0
3: collected        1            0            0            0            0
4: a                0            0            0            2            0
5: energy           0            0            0            0            1
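The sparsity in this example can be made concrete with a short sketch. It uses the counts from the "Text = sparse" slide; the helper names `sparsity`, `document_vector` and `dot` are hypothetical. It measures the fraction of zero entries and shows that the unlabeled Document 5 shares no word with any labeled document, so a count-based comparison has nothing to work with.

```python
# Rows are feature dimensions, columns are Documents 1..5 (values from the slide).
counts = {
    "acquired":  [0, 0, 1, 0, 0],
    "received":  [0, 2, 0, 0, 0],
    "collected": [1, 0, 0, 0, 0],
    "a":         [0, 0, 0, 2, 0],
    "energy":    [0, 0, 0, 0, 1],
}

def sparsity(rows):
    """Fraction of entries in the matrix that are zero."""
    values = [value for row in rows for value in row]
    return sum(1 for value in values if value == 0) / len(values)

def document_vector(doc_index):
    """Feature vector (one column of the table) for a document."""
    return [row[doc_index] for row in counts.values()]

def dot(u, v):
    """Dot product; zero means the two documents share no counted word."""
    return sum(a * b for a, b in zip(u, v))

zero_fraction = sparsity(counts.values())
unknown = document_vector(4)  # Document 5, class unknown
overlaps = [dot(unknown, document_vector(i)) for i in range(4)]

print("fraction of zeros:", zero_fraction)
print("overlap of Document 5 with Documents 1-4:", overlaps)
```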
Dataset: Reuters news article dataset
● Take the 100 top words across the whole corpus (Word 1, Word 2, Word 3, ..., Word 100)
● For each document, count how often each word occurs
            Document 1   Document 2
Word 1      0            2
Word 2      3            1
Word 3      4            4
...
Word 100    1            1
● Result: a feature space of 100 dimensions containing 21578 data points
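The two steps on this slide can be sketched as follows, assuming whitespace tokenization and a tiny hypothetical corpus in place of the Reuters articles. The sketch keeps the top 2 words instead of the top 100 so the result stays readable; `top_words` and `count_matrix` are illustrative names.

```python
from collections import Counter

def top_words(documents, n):
    """Return the n most frequent words across the whole corpus."""
    totals = Counter()
    for doc in documents:
        totals.update(doc.lower().split())
    return [word for word, _ in totals.most_common(n)]

def count_matrix(documents, vocabulary):
    """One row per document, one column per vocabulary word."""
    rows = []
    for doc in documents:
        counts = Counter(doc.lower().split())
        rows.append([counts[word] for word in vocabulary])
    return rows

# Hypothetical stand-in corpus, not the Reuters data.
corpus = [
    "oil prices rose as oil supply fell",
    "gold prices fell as oil prices rose",
    "oil demand steady",
]
vocabulary = top_words(corpus, n=2)
matrix = count_matrix(corpus, vocabulary)

print(vocabulary)
for row in matrix:
    print(row)
```

Words outside the chosen vocabulary are simply ignored, which is what keeps the feature space at a fixed number of dimensions.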
Dimensionality reduction - Reuters top 100 words
Dimensionality reduction - pipeline
● Documents = text + category
● Words (100 dimensions) -> t-SNE (dimensionality reduction) -> reduced 2D vectors
● Reduced 2D vectors + categories -> visualization
Dimensionality reduction - MNIST
Dimensionality reduction - pipeline
● MNIST = picture + class
● Pixels (about 800 dimensions; 28 x 28 = 784 for MNIST) -> t-SNE (dimensionality reduction) -> reduced 2D vectors
● Reduced 2D vectors + classes -> visualization
Data cleaning
● Remove stop words:
○ a
○ the
○ or
● Stemming:
● Remove non-alphanumeric characters:
○ $%^@#
○ 😁😂
○ <html> https://
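A minimal sketch of these cleaning steps, under a few assumptions: the stop-word set contains only the three examples from the slide (real lists are much longer), the text is English so only `a-z` and `0-9` are kept, and stemming is left out because it normally relies on a dedicated stemmer. The function name `clean` and the sample sentence are hypothetical.

```python
import re

# Only the three examples from the slide; a real stop-word list is much longer.
STOP_WORDS = {"a", "the", "or"}

def clean(text):
    """Return lowercase tokens with markup, URLs, symbols and stop words removed."""
    text = re.sub(r"<[^>]+>", " ", text)               # HTML tags such as <html> or <br />
    text = re.sub(r"https?://\S*", " ", text)          # URLs
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())   # symbols, punctuation, emoji
    return [token for token in text.split() if token not in STOP_WORDS]

tokens = clean("The gold <b>arrived</b> at https://example.com or $%^@# a port 😁")
print(tokens)
```

The URL pattern runs before the symbol filter on purpose: once punctuation is stripped, a URL would fall apart into ordinary-looking tokens.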
Top 100 words - Data cleaning disabled
Top 100 words - Data cleaning enabled
Data cleaning results
Classification score with data cleaning off: 0.88
Classification score with data cleaning on: 0.90
Classification score - pipeline
● Documents = text + category
● Split words + categories: 80% training, 20% verification
● Training (80%) -> trained model
● Verification (20%) + trained model -> score
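The 80/20 scoring pipeline can be sketched in plain Python. The slides do not say which classifier or which score was used, so this sketch assumes the score is plain accuracy and uses a hypothetical one-feature threshold classifier on made-up samples, purely to show the split, train and verify steps.

```python
import random

def train_test_split(samples, train_fraction=0.8, seed=0):
    """Shuffle a copy of the samples and split it into training and verification parts."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def fit_threshold(train):
    """Toy 'training': put a threshold halfway between the two class means."""
    a = [features[0] for features, label in train if label == "A"]
    b = [features[0] for features, label in train if label == "B"]
    threshold = (sum(a) / len(a) + sum(b) / len(b)) / 2
    return lambda features: "A" if features[0] < threshold else "B"

def accuracy(model, samples):
    """Fraction of samples for which the model predicts the true label."""
    correct = sum(1 for features, label in samples if model(features) == label)
    return correct / len(samples)

# Hypothetical data: one count feature, class A for small counts, class B otherwise.
samples = [([n], "A" if n < 3 else "B") for n in range(10)]

train, test = train_test_split(samples)
model = fit_threshold(train)
score = accuracy(model, test)
print(len(train), "training samples,", len(test), "verification samples, score:", score)
```

Scoring only on the held-out 20% matters because a model can fit its own training data well without generalizing to new documents.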
Embedding - doc2vec
[Figure: network diagram with nodes Word 1 ... Word 50000 and Document 1 ... Document 10000 on one side, and Word 1 ... Word 50000 on the other]
Embedding - doc2vec - example
[Figure: Word 1, Word 345 and Word 1000 shown with Document 245 (Word 25, Word 1204) and with Document 312 (Word 45, Word 1182)]
Embedding - doc2vec - example
[Figure: network with Input, Hidden and Output layers; nodes Word1, Word2, Word3, Doc1, Doc2 and Word1 ... Word5]
Embedding - pipeline
● Documents = text + category
● Text (10000+ dimensions) -> doc2vec -> document features (100 dimensions)
● Document features + categories -> classification
Reuters - score doc2vec vs top 100 words
Word count top 100 words: 0.90
Doc2vec: 0.94
IMDB movie reviews - doc2vec vs wordcount
Class: positive
Bromwell High is nothing short of brilliant. Expertly
scripted and perfectly delivered, this searing parody of
a students and teachers at a South London Public
School leaves you literally rolling with laughter. It's
vulgar, provocative, witty and sharp. The characters
are a superbly caricatured cross section of British
society (or to be more accurate, of any society).
Following the escapades of Keisha, Latrina and
Natella, our three "protagonists" for want of a better
term, the show doesn't shy away from parodying every
imaginable subject. Political correctness flies out the
window in every episode. If you enjoy shows that
aren't afraid to poke fun of every taboo subject
imaginable, then Bromwell High will not disappoint!
Class: negative
Robert DeNiro plays the most unbelievably intelligent
illiterate of all time. This movie is so wasteful of talent,
it is truly disgusting. The script is unbelievable. The
dialog is unbelievable. Jane Fonda's character is a
caricature of herself, and not a funny one. The movie
moves at a snail's pace, is photographed in an
ill-advised manner, and is insufferably preachy. It also
plugs in every cliche in the book. Swoozie Kurtz is
excellent in a supporting role, but so what?<br /><br
/>Equally annoying is this new IMDB rule of requiring
ten lines for every review. When a movie is this
worthless, it doesn't require ten lines of text to let other
readers know that it is a waste of time and tape. Avoid
this movie.
IMDB movie reviews - doc2vec vs wordcount
Word count top 250 words: 0.72
Doc2vec: 0.83
Conclusion
● It’s all about extracting the right features from your data
● Visualize the data to get a sense of the value of your features
● You can use the same algorithms for text, image, audio and other kinds of data once it is converted to an abstract feature space
Hands-on
● Tweak the pipeline
● Doc2vec similarity
● Tweak the classification algorithm
