How Does LSA Work? 
Andrew Koo - Insight Data Science
Latent Semantic Analysis
• Split the text into sentences using a trained tokenizer model
• Build a sparse matrix that counts how often each word appears in each sentence
• Weight each count with tf-idf
• Use singular value decomposition (SVD) to map each sentence vector into a multidimensional “concept space”
• Pick the top sentences by the length (Euclidean norm) of each sentence vector in the “concept space”
1. Separate the Text into Sentences
• Apply the Tokenizer from the Python sumy library
Input: “Hi world! Hello world! This is Andrew.”
Output: [“Hi world!”, “Hello world!”, “This is Andrew.”]
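The splitting step can be illustrated with a small stand-in. This sketch does not use sumy or a trained model; it assumes a simple regular-expression rule (split after ".", "!" or "?" followed by whitespace), which is enough for the example sentence but would mishandle abbreviations such as "Dr." in real text. The helper name `split_sentences` is hypothetical.

```python
import re


def split_sentences(text):
    """Naive stand-in for a trained sentence tokenizer (assumption, not sumy)."""
    # Split on whitespace that follows sentence-ending punctuation.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


sentences = split_sentences("Hi world! Hello world! This is Andrew.")
print(sentences)
```

A trained tokenizer is preferable in practice because it learns which periods end a sentence and which do not.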
2. Build a sparse matrix of word counts per sentence
[“Hi world!”, “Hello world!”, “This is Andrew.”]
(Sen, word)   Count
(0, 2)        1
(0, 5)        1
(1, 5)        1
(1, 1)        1
(2, 4)        1
(2, 3)        1
(2, 0)        1
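The (Sen, word) entries above can be reproduced with a few lines of standard-library Python. This is a sketch under two assumptions that the slides do not state: words are lowercased, and word indices follow the alphabetically sorted vocabulary (andrew=0, hello=1, hi=2, is=3, this=4, world=5), which happens to match the indices shown.

```python
import re
from collections import Counter

sentences = ["Hi world!", "Hello world!", "This is Andrew."]

# Lowercase each sentence and keep only word characters (assumed tokenization).
tokens = [re.findall(r"\w+", s.lower()) for s in sentences]

# Assumed indexing: position in the alphabetically sorted vocabulary.
vocab = sorted({w for t in tokens for w in t})
index = {w: i for i, w in enumerate(vocab)}

# Sparse matrix as a dict: (sentence index, word index) -> count.
counts = {}
for sen, words in enumerate(tokens):
    for word, c in Counter(words).items():
        counts[(sen, index[word])] = c

for key in sorted(counts):
    print(key, counts[key])
```

Only non-zero entries are stored, which is what makes the representation sparse: 7 entries instead of the full 3 x 6 grid.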
3. Normalize each word with tf-idf
• tf: term frequency - how often a term occurs in a document (here, a sentence)
• idf: inverse document frequency - how informative a word is (down-weights terms that occur in many documents, e.g. is, does, how)
Raw counts and the resulting tf-idf weights:
(Sen, word)   Count   tf-idf
(0, 2)        1       0.796
(0, 5)        1       0.605
(1, 5)        1       0.605
(1, 1)        1       0.796
(2, 4)        1       0.577
(2, 3)        1       0.577
(2, 0)        1       0.577
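The weights 0.796, 0.605 and 0.577 can be reproduced with the sketch below. The slides do not say which tf-idf variant was used; this example assumes a smoothed idf, ln((1 + n) / (1 + df)) + 1, followed by scaling each sentence to unit Euclidean length. That choice is an assumption, made because it yields the numbers shown. Tokenization and word indexing are the same assumptions as in step 2.

```python
import math
import re

sentences = ["Hi world!", "Hello world!", "This is Andrew."]
tokens = [re.findall(r"\w+", s.lower()) for s in sentences]
vocab = sorted({w for t in tokens for w in t})
index = {w: i for i, w in enumerate(vocab)}
n = len(sentences)

# Document frequency: number of sentences that contain each word.
df = {w: sum(w in t for t in tokens) for w in vocab}


def tfidf_row(words):
    """tf-idf weights for one sentence, scaled to unit length (assumed variant)."""
    tf = {}
    for w in words:
        tf[w] = tf.get(w, 0) + 1
    # Assumed smoothed idf: ln((1 + n) / (1 + df)) + 1
    weighted = {w: c * (math.log((1 + n) / (1 + df[w])) + 1) for w, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in weighted.values()))
    return {index[w]: v / norm for w, v in weighted.items()}


weights = {
    (sen, j): v
    for sen, words in enumerate(tokens)
    for j, v in tfidf_row(words).items()
}

for key in sorted(weights):
    print(key, round(weights[key], 3))
```

"world" appears in two of the three sentences, so it receives a lower weight (0.605) than "hi" or "hello" (0.796), each of which appears in only one.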
4. Use singular value decomposition to map each sentence vector into a multidimensional “concept space”
Normalized word-sentence matrix = Transform matrix (U) × Scaling matrix (singular values λ) × Concept matrix (Vᵀ)
Multiply the normalized word-sentence matrix by Uᵀ to transform each sentence into a vector in the multidimensional concept space.
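The multiplication by Uᵀ can be written out as a short derivation. The symbols A, Σ and r are notation introduced here, not taken from the slides: A is the tf-idf matrix with words as rows and sentences as columns (the transpose of the (Sen, word) layout in steps 2 and 3), Σ = diag(λ1, ..., λr) holds the singular values, and U is assumed to have orthonormal columns.

```latex
A = U \Sigma V^{T}, \qquad U^{T} U = I
\quad\Longrightarrow\quad
U^{T} A = U^{T} U \Sigma V^{T} = \Sigma V^{T} =
\begin{pmatrix}
\lambda_1 V_1^{T} \\
\vdots \\
\lambda_r V_r^{T}
\end{pmatrix}
```

Each column of ΣVᵀ is then the concept-space vector of one sentence, with its i-th component scaled by the singular value λi.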
5. Pick top sentences based on the length of each sentence vector in the “concept space”
Concept vectors (rows): λ1V1ᵀ, λ2V2ᵀ, λ3V3ᵀ, λ4V4ᵀ, λ5V5ᵀ, λ6V6ᵀ, λ7V7ᵀ
= Sentence vectors (columns): S’0 S’1 S’2 S’3 S’4 S’5 S’6
Example sentence vector (one column): (0.400, 0.213, 0.243, 0.762, 0.145, 0.123, 0.254)
The length (Euclidean norm) of this vector is the importance score of the sentence.
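The scoring step can be sketched as follows, assuming that "absolute value" means the Euclidean length of the vector. The helper names `importance` and `top_k` are hypothetical, and returning the selected sentences in document order is a design choice made here, not something the slides specify.

```python
import math

# Components of the example sentence vector shown on the slide.
sentence_vector = [0.400, 0.213, 0.243, 0.762, 0.145, 0.123, 0.254]


def importance(vec):
    """Euclidean length of a concept-space sentence vector."""
    return math.sqrt(sum(x * x for x in vec))


def top_k(vectors, k):
    """Indices of the k highest-scoring sentences, in document order."""
    scores = [importance(v) for v in vectors]
    ranked = sorted(range(len(vectors)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


score = importance(sentence_vector)
print(round(score, 3))
```

For the example vector the score is about 0.972, dominated by the fourth component (0.762).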
