Extraction-Based Automatic Summarization
Abdelaziz Al-Rihawi
Mohammad Kher Kabbaby
Faculty of Information Technology Engineering
Damascus University
Artificial Intelligence Department
Classification of Summarization Tasks
Summary Type
Extraction-based
• extracts objects from the entire collection without modifying the objects themselves.
• for text, the goal is to select whole sentences (without modifying them).
Abstraction-based
• paraphrases the selected sentences to produce the final summary.
Use of External Resources
Knowledge-Poor
• does not use any external resources to generate the summary.
Knowledge-Rich
• may use an external corpus such as Wikipedia, or lexical resources such as WordNet or VerbOcean.
• these resources are used to uncover semantic relations between words, phrases or sentences.
Task-Specific Constraints
Query-Focused
• a query is provided to the summarizer in addition to the source documents.
• the summarizer constructs a summary that contains the information requested by the query.
Update
• the purpose of the update summary is to identify new pieces of information in the more recent articles, assuming that the user has already read the previous ones.
Guided
• a set of aspects that should be covered in the summary is provided.
Summarization Workflow
Preprocessing → Sentence Representation → Similarity Measures → Content Selection
Preprocessing
Sentence Segmentation → Short Sentence Removal → Word Segmentation → Stop Word Removal → Short Word Removal → Stemming
Sentence Segmentation
Splitting a text into sentences using Punkt, an unsupervised sentence boundary identification algorithm.
Short Sentence Removal
Sentences containing fewer than three words are considered either not informative enough or incorrectly segmented.
Word Segmentation
Splitting a sentence into words based on spaces and punctuation, using the Penn Treebank conventions to handle special cases.
Example:
“weren’t” is split into two words: “were” and “n’t”
Stop Word Removal
Stop words such as: { and, the, or, … }
A predicate function such as is_a can be defined to detect and remove stop words.
Short Word Removal
Words shorter than three letters are considered to be non-content words.
Stemming
A rule-based stemming algorithm, the Porter stemmer.
Output of Preprocessing
Example:
“Every weekend, students in Zhengzhou can take in a free concert, a traditional Chinese opera or a stage play in the city’s Youth and Children’s Palace.”
Words:
{ weekend, student, zhengzhou, take, free, concert, tradit, chines, opera, stage, play, citi, youth, children, palac }
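The preprocessing steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used in the slides: the regular-expression sentence splitter stands in for Punkt, the tokenizer is a simplification of the Penn Treebank conventions, `STOP_WORDS` is a tiny hypothetical list, and stemming is left as a pluggable `stem` function (identity by default) where a Porter stemmer would normally be supplied.

```python
import re

# Hypothetical, deliberately tiny stop-word list; real systems use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "or", "in", "of", "can"}

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace.
    A stand-in for Punkt, not an implementation of it."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def tokenize(sentence):
    """Lowercase and keep alphabetic runs (a simplification of Treebank rules)."""
    return re.findall(r"[a-z]+", sentence.lower())

def preprocess(text, stem=lambda w: w, min_sentence_words=3, min_word_len=3):
    """Return one list of content words per retained sentence."""
    result = []
    for sentence in split_sentences(text):
        words = tokenize(sentence)
        if len(words) < min_sentence_words:                    # short sentence removal
            continue
        words = [w for w in words if w not in STOP_WORDS]      # stop word removal
        words = [w for w in words if len(w) >= min_word_len]   # short word removal
        result.append([stem(w) for w in words])                # stemming (pluggable)
    return result

example = preprocess("Students can take in a free concert. Yes.")
```

Here the second sentence is dropped because it has fewer than three words, and `example` holds the content words of the first sentence only.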
Sentence Representation
Feature Selection
• Term as Feature
• Features are obtained by selecting unique words from the
preprocessed sentences.
• Each sentence is represented as a context vector
• The vectors are accumulated into a representation matrix
Feature Selection
Term Count (TC)
Represents sentences as vectors whose elements are the absolute frequencies of words in the sentence.
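A term count representation can be built as in the sketch below. The function name `term_count_matrix` and the choice of a sorted vocabulary are assumptions made for illustration; the input is a list of preprocessed sentences, each a list of words.

```python
def term_count_matrix(sentences):
    """sentences: list of token lists. Returns (vocabulary, matrix), where
    matrix[i][j] is the count of vocabulary[j] in sentence i."""
    vocabulary = sorted({w for s in sentences for w in s})   # unique words as features
    index = {w: i for i, w in enumerate(vocabulary)}
    matrix = []
    for s in sentences:
        row = [0] * len(vocabulary)                          # one context vector per sentence
        for w in s:
            row[index[w]] += 1
        matrix.append(row)
    return vocabulary, matrix

vocab, tc = term_count_matrix([["art", "children", "educ"],
                               ["art", "concert", "capit"]])
```

Each row of `tc` is the context vector of one sentence, and the rows together form the representation matrix.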
Feature Selection
Term frequency-inverse sentence frequency (TF-ISF)
The same weighting scheme as TF-IDF, but sentences are used instead of documents.
$$TF(w, d) = \frac{TC(w, d)}{|d|}$$
$$IDF(w) = \log \frac{|D|}{|D(w)| + 1}$$
w : a word
|d| : the number of words in document d
|D| : the number of documents
|D(w)| : the number of documents that contain w
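The two formulas can be applied at sentence level as in the following sketch. The function name `tf_isf`, the dictionary-per-sentence output and the natural logarithm are assumptions; the slides do not state the log base, so the numbers produced here are not expected to match the example matrix shown later.

```python
import math

def tf_isf(sentences):
    """sentences: list of token lists. Returns one {word: weight} dict per sentence,
    with weight = TF(w, s) * log(N / (SF(w) + 1)), sentences playing the role of documents."""
    n = len(sentences)
    sentence_freq = {}                          # SF(w): number of sentences containing w
    for s in sentences:
        for w in set(s):
            sentence_freq[w] = sentence_freq.get(w, 0) + 1
    weights = []
    for s in sentences:
        row = {}
        for w in set(s):
            tf = s.count(w) / len(s)            # TC(w, s) / |s|
            isf = math.log(n / (sentence_freq[w] + 1))
            row[w] = tf * isf
        weights.append(row)
    return weights

w = tf_isf([["art", "children", "educ"],
            ["art", "concert", "capit"],
            ["concert", "citi"],
            ["citi", "educ"]])
```

Note that with the +1 in the denominator, a word occurring in all or all but one of the sentences receives a weight that is negative or zero.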
Feature Selection
Latent semantic analysis (LSA):
A distributed representation model.
The matrix V_k, or the matrix product S_k · V_k, obtained from the singular value decomposition (SVD) is used as the sentence representation matrix.
Feature Selection
LSA Algorithm:
Step 1 - Creating the Count Matrix
Step 2 - Modify the Counts with TF-IDF
Step 3 - Using the Singular Value Decomposition
Step 4 - Sentence Selection for summary.
Feature Selection
Count Matrix
      art    concert   capit   citi   children   educ
S1    1      0         0       0      1          1
S2    1      1         1       0      0          0
S3    0      1         0       1      0          0
S4    0      0         0       1      0          1
Feature Selection
TF-IDF Representation Matrix
      art    concert   capit   citi   children   educ
S1    0.23   0         0       0      0.46       0.23
S2    0.23   0.23      0.46    0      0          0
S3    0      0.35      0       0.35   0          0
S4    0      0         0       0.35   0          0.35
Feature Selection
LSA Representation Matrix
      D1     D2     D3
S1    0.23   0      0
S2    0.23   0.23   0.46
S3    0      0.35   0
S4    0      0      0
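Steps 1 and 3 of the LSA algorithm can be sketched with NumPy as below. The sketch assumes a term-by-sentence orientation (the transpose of the count matrix shown in the slides), so that the rows of V_k correspond to sentences. The TF-IDF reweighting of step 2 is skipped for brevity, and the function name `lsa_sentence_vectors` is hypothetical.

```python
import numpy as np

def lsa_sentence_vectors(A, k):
    """A: term-by-sentence matrix (rows = terms, columns = sentences).
    Returns an (n_sentences, k) matrix whose rows are the sentence
    representations V_k S_k from the rank-k truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U diag(s) Vt
    Vk = Vt[:k, :].T                                   # rows of Vk correspond to sentences
    Sk = np.diag(s[:k])                                # k largest singular values
    return Vk @ Sk

# Count matrix from the slides, transposed to terms x sentences.
# Rows: art, concert, capit, citi, children, educ. Columns: S1..S4.
A = np.array([[1, 1, 0, 0],
              [0, 1, 1, 0],
              [0, 1, 0, 0],
              [0, 0, 1, 1],
              [1, 0, 0, 0],
              [1, 0, 0, 1]], dtype=float)

vectors = lsa_sentence_vectors(A, k=2)   # one 2-dimensional vector per sentence
full = lsa_sentence_vectors(A, k=4)      # no truncation
```

Without truncation, the inner products between the rows of `full` equal those between the original sentence columns of `A`; choosing a smaller k keeps only the strongest latent dimensions.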
Similarity Measures
Similarity Measures
• Corpus-based
• use term frequencies observed in a corpus to relate contexts to each other
• Knowledge-based
• use predefined semantic relations between terms obtained from lexical resources
Similarity Measures
Jaccard similarity coefficient
set-based similarity metric used for measuring similarities between
sentences represented with TC representation.
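For two sentences treated as sets of words, the Jaccard coefficient is the size of the intersection divided by the size of the union. A minimal sketch, where returning 0.0 for two empty sentences is an assumption:

```python
def jaccard(a, b):
    """Jaccard similarity between two sentences given as iterables of words."""
    sa, sb = set(a), set(b)
    union = sa | sb
    if not union:               # both sentences empty: define similarity as 0
        return 0.0
    return len(sa & sb) / len(union)

sim = jaccard(["art", "children", "educ"], ["art", "concert", "capit"])
```

The two example sentences share one word out of five distinct words, so `sim` is 1/5.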
Similarity Measures
Cosine similarity
a vector-based similarity metric used for representations with real-valued weights such as TF-ISF and LSA.
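Cosine similarity is the dot product of two vectors divided by the product of their Euclidean norms. The sketch below applies it to rows S1 and S2 of the TF-IDF matrix from the slides; returning 0.0 for a zero vector is an assumption.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:      # undefined for zero vectors; use 0 by convention
        return 0.0
    return dot / (norm_u * norm_v)

s1 = [0.23, 0, 0, 0, 0.46, 0.23]
s2 = [0.23, 0.23, 0.46, 0, 0, 0]
sim_12 = cosine(s1, s2)
```

S1 and S2 overlap only on "art", so their similarity is small but nonzero (1/6 for these weights).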
Sentence Selection
Sentence Selection
The goal of the selection procedure is to identify a set of sentences that
contain important information.
Three criteria are optimized when selecting the sentences:
1. Relevance
2. Redundancy
3. Length
Maximize the relevance while minimizing the redundancy
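One common way to trade off the three criteria is a greedy procedure in the style of maximal marginal relevance. The slides do not name a specific procedure, so the sketch below is an assumption: the function name, the `trade_off` parameter and the use of the maximum similarity to already selected sentences as the redundancy term are all illustrative choices.

```python
def select_sentences(relevance, similarity, lengths, max_length, trade_off=0.7):
    """Greedy MMR-style selection (illustrative, not the slides' method).
    relevance[i]: relevance score of sentence i
    similarity[i][j]: similarity between sentences i and j
    lengths[i]: length of sentence i in words
    max_length: length budget for the whole summary"""
    selected, total = [], 0
    remaining = set(range(len(relevance)))
    while remaining:
        best, best_score = None, None
        for i in sorted(remaining):
            if total + lengths[i] > max_length:          # length constraint
                continue
            # Redundancy: highest similarity to anything already in the summary.
            redundancy = max((similarity[i][j] for j in selected), default=0.0)
            score = trade_off * relevance[i] - (1 - trade_off) * redundancy
            if best_score is None or score > best_score:
                best, best_score = i, score
        if best is None:                                 # nothing else fits
            break
        selected.append(best)
        remaining.remove(best)
        total += lengths[best]
    return selected

relevance = [0.9, 0.85, 0.4]
similarity = [[1.0, 0.95, 0.1],
              [0.95, 1.0, 0.1],
              [0.1, 0.1, 1.0]]
chosen = select_sentences(relevance, similarity, [10, 10, 10], max_length=20, trade_off=0.5)
```

In this example sentences 0 and 1 are near-duplicates, so after picking sentence 0 the procedure prefers the less relevant but novel sentence 2.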
Sentence Selection
Selection of sentences can be handled by either:
1. Supervised Methods
2. Unsupervised Methods
Sentence Selection
Supervised Methods:
• use a classifier trained on a set of documents coupled with
corresponding extracts
• possible to label sentences with a binary value:
• 1- a sentence is included in the extract
• 0 - a sentence is not included in the extract.
• each sentence should be represented by a feature vector
• a classifier is trained on a set of feature vectors
Sentence Selection
Supervised Methods (sentence features):
1. cue phrases and topic terms
2. position of a sentence in a document
3. centrality of a sentence
• for example, the similarity between a sentence and the other sentences
4. length of a sentence
• for example, the number of open-class words (i.e. nouns, main verbs, adjectives, adverbs) in a sentence
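The features listed above can be turned into one feature vector per sentence, which a classifier would then be trained on. The sketch below is an illustration under several assumptions: `cue_terms` is a hypothetical set, centrality is the average Jaccard overlap with the other sentences, and length is the number of preprocessed tokens rather than the number of open-class words (which would need a part-of-speech tagger). Training the classifier is not shown.

```python
def jaccard(a, b):
    """Set overlap between two token lists."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def sentence_features(sentences, cue_terms=frozenset({"summary", "conclusion"})):
    """sentences: list of token lists, in document order.
    Returns [has_cue, position, centrality, length] for each sentence."""
    n = len(sentences)
    features = []
    for i, s in enumerate(sentences):
        has_cue = 1.0 if cue_terms & set(s) else 0.0     # cue phrases / topic terms
        position = i / max(n - 1, 1)                     # 0.0 = first, 1.0 = last
        others = [jaccard(s, t) for j, t in enumerate(sentences) if j != i]
        centrality = sum(others) / len(others) if others else 0.0
        features.append([has_cue, position, centrality, float(len(s))])
    return features

feats = sentence_features([["summary", "art", "concert"],
                           ["art", "concert"],
                           ["citi"]])
```

Paired with the binary labels (1 if the sentence is in the extract, 0 otherwise), vectors like `feats` form the training set for the classifier.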
Sentence Selection
Unsupervised Methods:
• unsupervised summarization algorithms are either centroid-based
or centrality-based.
Sentence Selection
Centroid-based Algorithm:
• select sentences that contain informative words
• Referred to as topic signatures
• Calculation of informativeness:
• using popular weighting schemes such as TF-IDF or the log-likelihood ratio
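A centroid-based scorer can be sketched as follows, assuming the word weights (for example TF-IDF values) have already been computed. The function name `centroid_scores`, the fixed `threshold` for choosing informative words and the sum of weights as the sentence score are assumptions made for illustration.

```python
def centroid_scores(sentences, word_weights, threshold):
    """sentences: list of token lists.
    word_weights: {word: informativeness}, e.g. TF-IDF computed elsewhere.
    Words with weight >= threshold form the topic signature; a sentence is
    scored by the total weight of the signature words it contains."""
    topic_signature = {w: x for w, x in word_weights.items() if x >= threshold}
    return [sum(topic_signature.get(w, 0.0) for w in set(s)) for s in sentences]

weights = {"concert": 0.8, "opera": 0.6, "take": 0.1}    # hypothetical weights
scores = centroid_scores([["concert", "opera", "take"], ["take"]],
                         weights, threshold=0.5)
```

The first sentence contains both signature words and scores highest; the second contains none and scores zero.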
Sentence Selection
Centrality-based Algorithm:
• When the similarity between each pair of sentences is available, a simple centrality score for a sentence S is the average of the similarities between S and all the other sentences.
• The algorithms described above rely on superficial features and ignore higher-level semantic information such as semantic relations between terms, so we use graph theory.
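The average-similarity centrality score described above can be sketched as follows, assuming a precomputed symmetric similarity matrix; the function name is hypothetical.

```python
def average_similarity_centrality(similarity):
    """similarity[i][j]: similarity between sentences i and j.
    Returns, for each sentence, its mean similarity to all other sentences."""
    n = len(similarity)
    if n < 2:                       # no other sentences to compare with
        return [0.0] * n
    return [sum(similarity[i][j] for j in range(n) if j != i) / (n - 1)
            for i in range(n)]

sim_matrix = [[1.0, 0.5, 0.1],
              [0.5, 1.0, 0.3],
              [0.1, 0.3, 1.0]]
centrality = average_similarity_centrality(sim_matrix)
```

Sentence 1 is the most similar to the rest on average, so it receives the highest score.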
Use of Graphs in Automatic
Summarization
Graph Representations
[Figures: similarity graph; event graph]
Graph Representations
• Similarity relations.
• Semantic relations such as semantic roles, cause-consequences,
specifications, time relations.
Centrality Measures
• Graph theory and network analysis provide a great number of
different methods and algorithms for working with graphs
• The lengths of edges in this graph correspond to actual distances between nodes
• The size of a node is scaled according to its centrality, calculated using a specific centrality measure, so that more central nodes are larger
Centrality Measures
• Degree-based Methods
• Path-based Methods
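As an illustration of a degree-based method, the sketch below builds an unweighted similarity graph by connecting sentences whose similarity reaches a threshold, then counts each node's neighbours. The thresholding step and the function name are assumptions; path-based measures, which rely on shortest paths between nodes, are not shown.

```python
def degree_centrality(similarity, threshold):
    """similarity[i][j]: similarity between sentences i and j.
    Two sentences are connected when their similarity is >= threshold.
    Returns the number of neighbours (degree) of each sentence node."""
    n = len(similarity)
    return [sum(1 for j in range(n) if j != i and similarity[i][j] >= threshold)
            for i in range(n)]

sim_matrix = [[1.0, 0.5, 0.1],
              [0.5, 1.0, 0.3],
              [0.1, 0.3, 1.0]]
degrees = degree_centrality(sim_matrix, threshold=0.3)
```

With this threshold, sentence 1 is linked to both other sentences and is the most central node.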
Reference
Gleb Sizov, Extraction-Based Automatic Summarization, Master of Science in Computer Science, June 2010.
Thanks!
