Argument Extraction from News, Blogs, and Social Media.
Theodosis Goudas, Christos Louizos, Georgios Petasis, Vangelis Karkaletsis.
Presented by:
Sharath T.S
Shubhangi Tandon
What is Argument Extraction?
An argument can usually be decomposed into a claim and one or more premises justifying it.
Argument extraction is the task of identifying arguments, along with their components, in text.
Even for humans, it is difficult to decide whether a part of a sentence contains an argument element or not.
Why social media?
● Most widely used and accessible platform available to seek advice or express opinion
● A storehouse of both meaningful and meaningless information about recent trends and topics.
● Almost no prior research in this field; only one publication related to product reviews on Amazon.
Why is this difficult?
● Almost no prior research in this field; only one publication related to product reviews on Amazon!
● Text from social media may not always contain arguments.
● Arguments are expressed informally and do not follow any formal guidelines or specific rules.
● There are no widely used corpora against which approaches to argument extraction can be comparably evaluated.
● Traditional research in the area concentrates mainly on law documents and scientific publications.
Existing methods
● Palau et al. [4,7]
○ Classify at the sentence level to identify possibly argumentative sentences, using Naive Bayes (NB), SVM and maximum entropy classifiers.
○ Identify groups of sentences that refer to the same argument, using semantic distance based on the relatedness of the words they contain.
○ Detect clauses of sentences with a parsing tool; the clauses are classified as argumentative or not with a maximum entropy classifier.
○ Argumentative clauses are classified into premises and claims with support vector machines.
○ Evaluated on the Araucaria corpus and the ECHR corpus [11], achieving accuracies of 73% and 80% respectively.
● A rule-based system: takes as input an argumentation scheme and an ontology concerning an object (for example, a camera and its characteristic features). Argumentation schemes are populated with discourse indicators and domain-specific features, and rules are constructed.
Proposed Method
The proposed automatic argument extraction is a two-step process:
Step A: Identification of argumentative sentences (supervised classification using standard classifiers: Logistic Regression, Random Forest, Support Vector Machines, Naive Bayes)
Step B: Extraction of claims and premises (from the output of Step A, using Conditional Random Fields)
Feature Selection for Corpus
State-of-the-art features:
● Position
● Comma token number
● Connective number
● Verb number
● Word number
● Cue words
● Number of verbs in passive voice
● Domain entities number
● Adverb number
● Word mean length
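As a rough illustration, a few of the surface features listed above could be computed as in the sketch below. It assumes whitespace tokenization and a tiny hypothetical cue-word list, neither of which comes from the paper. POS-based features (verb, adverb and passive-voice counts) are left out because they need a tagger.

```python
# Hypothetical cue words; a real list would be curated for the language and domain.
CUE_WORDS = {"because", "therefore", "since", "however"}

def surface_features(sentence, position):
    """Compute a few simple surface features for one sentence.

    `position` is the index of the sentence within its document.
    """
    tokens = sentence.split()
    # Strip surrounding punctuation and lowercase each token.
    words = [t.strip(".,;:!?").lower() for t in tokens]
    words = [w for w in words if w]
    mean_length = sum(len(w) for w in words) / len(words) if words else 0.0
    return {
        "position": position,
        "comma_number": sentence.count(","),
        "word_number": len(words),
        "word_mean_length": mean_length,
        "cue_word_number": sum(w in CUE_WORDS for w in words),
    }

features = surface_features("Wind turbines are noisy, therefore people complain.", 3)
```

Each sentence ends up as a fixed set of numeric values, which is the form a standard classifier expects.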
Feature Selection for Corpus (contd.)
New Domain Specific Features:
Adjective number: Number of adjectives in a sentence. In argumentation, opinions towards an entity or claim are usually expressed through adjectives.
Entities in previous sentences: Number of entities in the nth previous sentence. With a history of n = 5 sentences we obtain five features. Each correlates with the probability that the current sentence contains an argument element.
Cumulative number of entities in previous sentences: Total number of entities from the previous n sentences. Considering a history of n = 5, we obtain four features.
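A minimal sketch of the two entity-history features, assuming the per-sentence entity counts are already available as a list. The function name and the zero-padding at the start of a document are illustrative choices, not taken from the paper.

```python
def history_features(entity_counts, index, n=5):
    """Build entity-history features for the sentence at `index`.

    `entity_counts[i]` is the number of domain entities in sentence i.
    Sentences before the start of the document count as 0 (an assumption).
    """
    # previous[k-1] is the entity count of the k-th previous sentence.
    previous = [
        entity_counts[index - k] if index - k >= 0 else 0
        for k in range(1, n + 1)
    ]
    # Cumulative totals over the previous 2..n sentences. The total over the
    # previous 1 sentence would duplicate previous[0], which is one plausible
    # reading of why four features are reported rather than five.
    cumulative = [sum(previous[:k]) for k in range(2, n + 1)]
    return previous, cumulative

counts = [2, 0, 1, 3, 0, 1, 4]
prev, cum = history_features(counts, index=6)
# prev == [1, 0, 3, 1, 0] and cum == [1, 4, 5, 5]
```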
Ratio of distributions: Two language models are created, one from sentences that contain argument elements and one from sentences that do not. The feature is the ratio between these two distributions, based on unigrams, bigrams and trigrams of words. [Equation from the slide not recovered.]
Distributions over unigrams, bigrams and trigrams of part-of-speech (POS) tags: Identical to the ratio of distributions above ([4]), except that the unigrams, bigrams and trigrams are extracted from POS tags instead of words.
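The exact formula is not available here, so the following is only one plausible form of such a ratio: the log of the likelihood of a sentence under an "argumentative" unigram model divided by its likelihood under a "non-argumentative" one. Add-one smoothing, the restriction to unigrams, and the toy English sentences are all assumptions made for this sketch.

```python
import math
from collections import Counter

def train_unigram(sentences):
    """Count unigrams over a list of tokenized sentences."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return counts, sum(counts.values())

def log_ratio(tokens, arg_model, non_arg_model, vocab_size):
    """Log of P(tokens | argumentative) / P(tokens | non-argumentative),
    using add-one smoothed unigram models."""
    arg_counts, arg_total = arg_model
    non_counts, non_total = non_arg_model
    score = 0.0
    for tok in tokens:
        p_arg = (arg_counts[tok] + 1) / (arg_total + vocab_size)
        p_non = (non_counts[tok] + 1) / (non_total + vocab_size)
        score += math.log(p_arg / p_non)
    return score

# Toy training data (hypothetical; the real corpus is in Greek).
arg_sents = [["turbines", "cause", "noise"], ["solar", "panels", "save", "money"]]
non_sents = [["the", "meeting", "is", "on", "monday"], ["read", "more", "here"]]
vocab = {t for s in arg_sents + non_sents for t in s}
arg_model = train_unigram(arg_sents)
non_model = train_unigram(non_sents)

score_arg = log_ratio(["turbines", "cause", "noise"], arg_model, non_model, len(vocab))
score_non = log_ratio(["read", "more", "here"], arg_model, non_model, len(vocab))
```

A positive score means the sentence looks more like the argumentative training text. The same function could be fed POS-tag sequences instead of words to approximate the POS-based variant.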
Step B: Argument Extraction with CRF
Why Conditional Random Fields?
Structured prediction algorithm
Can take local context into consideration (helps maintain linguistic aspects such as the word ordering in the sentence)
Features:
The words in the argumentative sentences (the output of Step A)
Gazetteer lists of known entities for the thematic domain related to the arguments we want to extract
Gazetteer lists of cue words and indicator phrases
Lexica of verbs and adjectives, automatically acquired using Term Frequency - Inverse Document Frequency (TF-IDF) between two “documents” (with and without argumentative text from Step A)
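The lexicon acquisition step can be sketched as follows, assuming a textbook TF-IDF (term frequency times log(N/df)) over the two "documents". The weighting actually used is not specified here, and this sketch skips the restriction to verbs and adjectives, which would need a POS tagger.

```python
import math
from collections import Counter

def tfidf_lexicon(arg_tokens, non_arg_tokens, top_k=3):
    """Rank terms of the argumentative 'document' by TF-IDF computed over
    the two-document collection (argumentative vs. non-argumentative)."""
    docs = [Counter(arg_tokens), Counter(non_arg_tokens)]
    scores = {}
    for term, count in docs[0].items():
        tf = count / len(arg_tokens)
        df = sum(1 for d in docs if term in d)
        idf = math.log(len(docs) / df)  # 0 when the term occurs in both
        scores[term] = tf * idf
    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return [t for t in ranked[:top_k] if scores[t] > 0]

# Hypothetical toy data.
arg = ["noisy", "turbines", "harm", "birds", "noisy", "harm", "noisy"]
non = ["turbines", "were", "installed", "birds", "fly"]
lexicon = tfidf_lexicon(arg, non)
# lexicon == ["noisy", "harm"]
```

With only two documents, any term that appears in both gets an IDF of zero, so the lexicon keeps only terms specific to the argumentative text.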
Corpus Preparation
● 204 documents (in Greek) collected from social media
● Thematic domain: renewable energy sources
● Selected documents were manually annotated with domain entities and with text segments that correspond to argument premises.
● Claims are not represented in the documents as segments; they are implied by the author as positive or negative views.
[Figure: processing pipeline. 16000 sentences from 204 documents; 760 sentences annotated as containing arguments. Labels: Ellogon, Step A, Step B, Final Output.]
Evaluation: Base Case
Simple base case classifier:
1. Manually annotated segments (argument components) are used to form a gazetteer.
2. The gazetteer is applied to the corpus to detect all exact matches of these segments.
a. All segments identified are marked as argumentative segments.
b. All sentences that contain at least one argumentative segment identified by the gazetteer are characterised as argumentative sentences.
3. Argumentative segments and sentences are compared to their “gold” counterparts, manually annotated by humans.
a. Sentences that contain the recognized fragments are marked as argumentative for the first-step base case.
b. Segments marked as argumentative are evaluated for the second-step base case.
4. Results are obtained through 10-fold cross validation on the whole corpus (all 16k sentences).
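The matching part of this base case (steps 1 and 2) can be sketched as below. Plain substring matching stands in for "exact match", and the English gazetteer entries and sentences are hypothetical examples; the actual corpus is in Greek.

```python
def gazetteer_baseline(sentences, gazetteer):
    """Mark every exact occurrence of a gazetteer segment, and flag each
    sentence that contains at least one such occurrence."""
    results = []
    for sentence in sentences:
        matches = [seg for seg in gazetteer if seg in sentence]
        results.append({
            "sentence": sentence,
            "segments": matches,            # evaluated for the Step B base case
            "argumentative": bool(matches)  # evaluated for the Step A base case
        })
    return results

gazetteer = ["generate noise", "reduce emissions"]
sentences = [
    "Wind turbines generate noise in the summer",
    "The park opened in 2012",
]
out = gazetteer_baseline(sentences, gazetteer)
```

Under cross validation, the gazetteer for each fold would be built only from the annotated segments in the training part, so segments that never appeared in training cannot be matched.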
Evaluation: Step A
Each sentence is represented as a fixed-size vector using the features described (plus the class label, since learning is supervised).
Classifiers tested: Support Vector Machines, Naive Bayes, Random Forest and Logistic Regression.
The initial data set is heavily skewed towards non-argumentative examples. Therefore, data sampling and testing were done in two different ways:
Way #1
Sampling: Randomly ignore negative examples, so that the resulting set contains an equal number of instances from both classes.
Evaluation: 10-fold cross validation; achieved high accuracy.
Way #2
Sampling: Split the initial data set in the ratio 70:30 for testing and training.
Evaluation: Achieved 49% accuracy; discarded.
Precision, recall, F1 measure and accuracy are used for evaluation.
Logistic Regression and Naive Bayes performed best.
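To make the first sampling strategy and the evaluation measures concrete, here is a small sketch using only the standard library. The function names, the fixed random seed and the toy data are assumptions made for illustration; no classifier is trained here.

```python
import random

def undersample(examples, seed=0):
    """Randomly drop negative examples until both classes are the same size.

    `examples` is a list of (features, label) pairs with label 1 or 0.
    Assumes negatives outnumber positives, as in the corpus described above.
    """
    positives = [e for e in examples if e[1] == 1]
    negatives = [e for e in examples if e[1] == 0]
    rng = random.Random(seed)
    balanced = positives + rng.sample(negatives, len(positives))
    rng.shuffle(balanced)
    return balanced

def precision_recall_f1(gold, predicted):
    """Precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Toy data: 3 positive and 17 negative examples.
data = [([i], 1) for i in range(3)] + [([i], 0) for i in range(3, 20)]
balanced = undersample(data)
p, r, f1 = precision_recall_f1([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

On skewed data, accuracy alone can be misleading, which is one reason to report precision, recall and F1 for the argumentative class as well.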
Evaluation: Step B
To use a CRF, the sentences need BIO tagging:
B for the token starting a text segment (premise),
I for a token in a premise other than the first, and
O for all other tokens (outside of the premise segment)
Example for “Wind turbines generate noise in the summer”: [tagged example from the slide not recovered]
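The tagging scheme can be illustrated with the example sentence. The premise span chosen below (the first four tokens) is purely an assumption for demonstration, since the original annotation of this sentence is not available here.

```python
def bio_tags(tokens, premise_spans):
    """Assign B/I/O tags given premise spans as (start, end) token indices,
    with `end` exclusive."""
    tags = ["O"] * len(tokens)
    for start, end in premise_spans:
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags

tokens = "Wind turbines generate noise in the summer".split()
# Hypothetical annotation: "Wind turbines generate noise" is the premise.
tags = bio_tags(tokens, [(0, 4)])
tagged = list(zip(tokens, tags))
# tags == ["B", "I", "I", "I", "O", "O", "O"]
```

A CRF is then trained to predict one such tag per token, and contiguous B/I runs in its output are read back as premise segments.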
[Figure: Final Result after CRF]
[Figure: Baseline Results]
What did we think?
Questions / Observations / Inputs?
Appendix
Evaluation Results: Step A
Evaluation Results: Step A (contd.)



Editor's Notes

  • #11: The corpus was constructed by manually filtering a larger corpus, which was automatically collected by performing queries on popular search engines (such as Bing), Google Plus and Twitter, and by crawling sites from a list of sources relevant to the domain of renewable energy.