Aum workshop paper_presentation

Semantically Enriched Machine Learning Approach to
Filter YouTube Comments for Socially Augmented User
Models
Ahmad Ammari, Vania Dimitrova, Dimoklis
Despotakis
School of Computing, University of Leeds,
Leeds, UK

Presented By:

Ahmad Ammari
User and Community Modelling
School of Computing, University of Leeds,
UK

Outline
• The ImREAL Project
• Socially Augmented User Modelling
• Research Objective, Roadmap,
Challenges
• The Social Noise Filtering Approach
– Machine Learning – Based
– Methodology
– Comment Content Pre-Processing
– Semantic Enrichment
– Scoring and Labelling the Training Dataset
• Experimental Description / Results
• Evaluation
• Conclusions & Future Work

ImREAL: Immersive Reflective Experience-based Adaptive Learning
Specific Targeted Research Project (STReP) – FP7
Partners
• University of Leeds, UK
• Trinity College Dublin, Ireland
• Graz University of Technology, Austria
• University of Erlangen-Nuremberg, Germany
• Delft University of Technology, Netherlands
• Imaginary SRL (IMA), Italy
• Empower The User (ETU), Ireland
Problem:
Experience in a simulated world is disconnected from the ‘real-world’

[Figure: Reality–Virtuality continuum (Reality, Augmented Reality, Augmented Virtuality, Virtuality). ImREAL Approach: Augmented Simulated Experiential Learning]

[Figure: ImREAL overview connecting the Simulated Learning Environment with Real World Experience. Labels: Adaptive Simulated Experiential Learning Environment; Interactive User model; coach; Practice; Provide content; Meta-cognitive scaffolding; Augmented user modelling; Real world activity modelling; Records of Real Job-related Experiences; Other participants (e.g. customers, managers)]

Augmented User Modelling
Socially Augmented User Modelling
[Figure: User Profiles from the Simulated Environment form the Existing User Model (Limited Scope!!). Open Social Spaces (e.g. Sports, Psychology, Diseases, Politics) provide Social Profiles with Weighted Social Interests. Together they form the Socially Augmented User Model.]
Broad Research Objective
Mining social media content generated by users having awareness of and/or interest in an activity domain, to derive social profiles that augment existing user models

Research Roadmap / Challenges
• Three-phase research roadmap towards achieving the broad objective
[Figure: roadmap with Phase One, Phase Two, and Phase Three; labelled stage: Social Noise Filtration]

The Social Noise Filtering Approach
• Supervised Machine Learning Model
– Historic content with known relevance states is used for training
– The machine learning model learns the underlying rules
– The model is then used to predict the unknown relevance states of new content, with a certain prediction confidence
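
The train-then-predict loop above can be sketched with a tiny multinomial Naive Bayes classifier in plain Python. This is only an illustrative sketch, not the implementation used in the study: the function names `train_nb` and `predict_nb`, the toy comments, and the use of add-one smoothing are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Fit a multinomial Naive Bayes model on tokenised comments."""
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        term_counts[label].update(tokens)
    vocab = {t for counts in term_counts.values() for t in counts}
    return class_counts, term_counts, vocab


def predict_nb(model, tokens):
    """Return (predicted label, confidence) for one tokenised comment."""
    class_counts, term_counts, vocab = model
    total_docs = sum(class_counts.values())
    log_scores = {}
    for label, n_docs in class_counts.items():
        total_terms = sum(term_counts[label].values())
        score = math.log(n_docs / total_docs)  # class prior
        for t in tokens:
            if t in vocab:  # ignore terms never seen in training
                # Add-one (Laplace) smoothing, an assumption of this sketch
                score += math.log(
                    (term_counts[label][t] + 1) / (total_terms + len(vocab))
                )
        log_scores[label] = score
    # Normalise the log scores into posterior probabilities
    top = max(log_scores.values())
    exp_scores = {k: math.exp(v - top) for k, v in log_scores.items()}
    norm = sum(exp_scores.values())
    best = max(exp_scores, key=exp_scores.get)
    return best, exp_scores[best] / norm


# Toy training data (made up for illustration)
train_docs = [
    ["interview", "answer", "question"],
    ["job", "interview", "confident"],
    ["lol", "funny", "video"],
    ["nice", "music", "lol"],
]
train_labels = ["relevant", "relevant", "noisy", "noisy"]

model = train_nb(train_docs, train_labels)
label, confidence = predict_nb(model, ["confident", "interview", "answer"])
```

Here `confidence` is the posterior probability of the predicted class, which is one common way to express the "prediction confidence" mentioned on the slide.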

The Social Noise Filtration Service:
Methodology

CASE STUDY: Filtering YouTube Comments
• Social Media Source: YouTube
• Subject Content: Public Comments on Shared Videos
• Activity Domain: Job Interview

[Figure: methodology pipeline. Experimentally Controlled Comments are analyzed to build the Semantically Enriched Job Interview Bag of Words (JIBoW). Public Comments on YouTube are pre-processed into a Term – Comment Matrix (Training Corpus). The JIBoW is used to SCORE the training corpus, producing SCORES.]

YouTube Video Selection
• Selected as part of a research study by
[Despotakis, Lau & Dimitrova, 2011]
• Four Job Interview-related categories are
manually identified from video content
– Guides / Best Practices
– Interviewee’s Stories
– Interviewer’s Stories
– Interview Mock Examples
• Videos from all categories are selected to
retrieve the comment set for ML training

Comment Content Pre-Processing
• Objective: Deriving a dataset for classification
• Steps:
– 1. Stop Word Removal
– 2. Stemming
– 3. tfidf Weighting
– 4. Comment – Term Matrix (CTM)
• Example: "I think most Americans are like the first example" becomes: think – Americans – like – first – example
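
The pre-processing steps can be illustrated with a short Python sketch. It covers stop word removal, tf-idf weighting, and the comment-term matrix; stemming is left out to keep the example short. The stop list, the second toy comment, and the function names are assumptions for illustration, and the study itself used the Lucene API rather than this code.

```python
import math
import re
from collections import Counter

# Hypothetical stop list, just large enough for the slide's example
STOP_WORDS = {"i", "most", "are", "the", "a", "is", "to", "of", "and"}


def preprocess(comment):
    """Tokenise a comment and drop stop words (stemming omitted)."""
    tokens = re.findall(r"[A-Za-z]+", comment)
    return [t for t in tokens if t.lower() not in STOP_WORDS]


def tfidf_matrix(token_lists):
    """Build a comment-term matrix of tf-idf weights.

    Uses raw term frequency times log(N / document frequency),
    one of several common tf-idf variants.
    """
    vocab = sorted({t for tokens in token_lists for t in tokens})
    n = len(token_lists)
    doc_freq = Counter(t for tokens in token_lists for t in set(tokens))
    rows = []
    for tokens in token_lists:
        tf = Counter(tokens)
        rows.append([tf[t] * math.log(n / doc_freq[t]) for t in vocab])
    return vocab, rows


comments = [
    "I think most Americans are like the first example",
    "The first example is the best",  # made-up second comment
]
tokens = [preprocess(c) for c in comments]
vocab, ctm = tfidf_matrix(tokens)
```

Each row of `ctm` corresponds to one comment and each column to one term in `vocab`. Terms that occur in every comment, such as "first" here, receive a weight of zero under this variant.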

Semantically Enriched Job Interview
Bag of Words
• A Semantically Enriched Job Interview Bag of Words (JIBoW) is used as a novel means to score and label the training YouTube comment set
• Built from a collection of textual comments on job interview videos [*]
– Experimentally controlled
– Closed social space
• Text and semantic pre-processing phases
• Semantically expanded using the WordNet lexicon and DISCO with word synonyms, antonyms, derivations, and semantically similar words

[*] Despotakis, Lau, Dimitrova (2011): A Semantic Approach to Extract Individual Viewpoints from User Comments on an Activity. AUM Workshop, UMAP 2011, Girona, Spain
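
The semantic expansion step can be sketched as follows. The sketch assumes a small hand-written lexicon (`TOY_LEXICON`) standing in for real WordNet and DISCO lookups; the entries, the seed terms, and the function name `expand_bow` are hypothetical and only show the shape of the operation.

```python
# Hypothetical hand-written lexicon standing in for WordNet / DISCO lookups
TOY_LEXICON = {
    "job": {
        "synonyms": {"occupation", "employment"},
        "similar": {"career"},
    },
    "confident": {
        "synonyms": {"assured"},
        "antonyms": {"nervous"},
        "derivations": {"confidence"},
    },
}


def expand_bow(seed_terms, lexicon):
    """Add every related word found in the lexicon to the seed terms."""
    expanded = set(seed_terms)
    for term in seed_terms:
        for related in lexicon.get(term, {}).values():
            expanded |= related
    return expanded


# Seed terms without a lexicon entry (here "interview") are kept unchanged
jibow = expand_bow({"job", "confident", "interview"}, TOY_LEXICON)
```

The expanded set contains the seed terms plus their synonyms, antonyms, derivations, and similar words, so a comment can match the bag of words even when it uses a related word rather than the seed term itself.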

Scoring and Labelling Training Corpus
• A novel term frequency-based mathematical model
• Computes a relevance score for each observation in the training comment dataset
– Based on the intersection size between the comment BoW and the JIBoW
– The score is normalized by the average intersection size
• A threshold on the score classifies the comments for training a binary classifier
• Each observation is labelled noisy or relevant accordingly
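
One possible reading of the scoring and labelling model is sketched below. The slides do not give the exact formula or threshold, so the division by the corpus-wide mean intersection size, the threshold value of 1.0, the toy JIBoW, and the function names are all assumptions.

```python
def relevance_scores(comment_bows, jibow):
    """Score each comment by its overlap with the JIBoW, normalised by
    the average overlap across the corpus (an assumed reading)."""
    overlaps = [len(set(bow) & jibow) for bow in comment_bows]
    avg = sum(overlaps) / len(overlaps)
    if avg == 0:
        return [0.0] * len(overlaps)
    return [o / avg for o in overlaps]


def label_comments(scores, threshold=1.0):
    """Label a comment relevant if its score reaches the threshold
    (the threshold value here is hypothetical)."""
    return ["relevant" if s >= threshold else "noisy" for s in scores]


# Hypothetical JIBoW and toy comment bags of words
jibow = {"interviewee", "confident", "job", "experience", "work", "interview"}
comment_bows = [
    ["interviewee", "looks", "confident", "job", "experience", "work", "life"],
    ["lol", "funny", "video"],
    ["good", "job"],
]
scores = relevance_scores(comment_bows, jibow)
labels = label_comments(scores)
```

With these toy inputs the overlaps are 5, 0, and 1, so the average is 2 and the scores are 2.5, 0.0, and 0.5. Only the first comment reaches the assumed threshold.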

Example Scoring & Labelling
C1: “The interviewee looks confident, he should
have some job experience in his work life”

Comment BoW: interviewee, confident, job, experience, work, life
JIBoW: w10, w21, w34, w4, w57, w113, wn

Example Scored & Labelled Comments

Datasets
• YouTube API for retrieval, Lucene API for pre-processing
• Post-analysis data description: YouTube Corpus and Experimentally Controlled Corpus

• Training Corpus: 1159 instances
– Labelled by the scoring model for training C4.5 and Naïve Bayes Multinomial (NBM) classifiers
– {724 Noisy, 435 Relevant}
• Derived a Comment-Term Matrix: 1159 instances × 903 tfidf term weights + 1 discrete class column

Experimental Results
• Models for each classifier were trained and tested with three variations of the training-to-testing ratio
See: Evaluation Results, ROC Area

• Both classifiers perform well in predicting relevant and noisy comments in the testing data sets
• C4.5 is slightly better at identifying noisy comments out of the total noise in the data
• NBM carries a lower risk of misclassifying relevant comments as noise

Evaluation
A human-based evaluation experiment was conducted to measure how well the service meets two goals:
Goal 1: Considers the comments that show awareness of the application domain (Job Interviews) (see Example Question and Records)

Goal 2: Considers the comments whose authors are likely interested in the application domain (see Example Question and Records)

Evaluation Results
Number of Evaluators: 2
Number of Evaluated Comments (15% of Whole Dataset): 180
Number of Comments Scored as Relevant Comments: 90
Number of Comments Scored as Noisy Comments: 90

[Figure: pie charts of the share of Noisy, Relevant, and Doesn't know judgements for Evaluator 2 and Evaluator 1, each for Goal 2 and Goal 1. Slice values as extracted: 9%, 3%, 15%, 17%, 24%, 46%, 19%, 42%, 45%, 66%, 59%, 55%]

Metric                  Evaluator 2           Evaluator 1
                        Goal 2    Goal 1      Goal 2    Goal 1
Total Match Rate        51.1%     68.3%       32.2%     60.0%
Total Mismatch Rate     48.9%     31.7%       67.8%     40.0%
Precision (Noisy)       42.2%     76.7%       36.7%     90.6%
Precision (Relevant)    76.7%     63.3%       73.3%     44.4%
Recall (Noisy)          73.1%     67.6%       84.6%     68.2%
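
Metrics of this kind can be computed from paired labels as in the sketch below. It compares the service's labels with one evaluator's judgements. The slides do not say how "Doesn't know" judgements were handled, so this sketch simply treats them as mismatches; that choice, the toy labels, and the function name are assumptions.

```python
def evaluation_metrics(predicted, judged):
    """Compare the service's labels with one evaluator's judgements."""
    pairs = list(zip(predicted, judged))
    matches = sum(p == j for p, j in pairs)

    def precision(label):
        # Of the comments the service gave this label, how many the
        # evaluator also gave it
        flagged = [j for p, j in pairs if p == label]
        return sum(j == label for j in flagged) / len(flagged) if flagged else 0.0

    def recall(label):
        # Of the comments the evaluator gave this label, how many the
        # service also gave it
        actual = [p for p, j in pairs if j == label]
        return sum(p == label for p in actual) / len(actual) if actual else 0.0

    return {
        "match_rate": matches / len(pairs),
        "mismatch_rate": 1 - matches / len(pairs),
        "precision_noisy": precision("noisy"),
        "precision_relevant": precision("relevant"),
        "recall_noisy": recall("noisy"),
    }


# Made-up toy labels; "unknown" stands for a "Doesn't know" judgement
predicted = ["noisy", "noisy", "noisy", "relevant", "relevant", "relevant"]
judged = ["noisy", "noisy", "relevant", "relevant", "noisy", "unknown"]
metrics = evaluation_metrics(predicted, judged)
```

Under this reading the match and mismatch rates always sum to 100%, as they do in each column of the table.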

Summary
• Conclusions
– A high proportion of YouTube video comments are noisy
– ML models perform well at predicting and filtering out comments that show neither author awareness of nor interest in the activity domain
• Future Work
– Add more filters to improve the Scoring and
Labelling Mechanism based on Evaluation
Baseline
– Exploit Activity Modelling Ontology to Derive
JIBoW
– Evaluate Impact of Semantic Enrichment

YouTube-based Social Profiling Service:
Methodology
[Figure: profiling pipeline. YouTube / SM Comments pass through the Noise Filtration Service, giving Comments Predicted as Relevant (RC1 … RCn). The relevant comments and their Authors feed a Social Profiling Corpus drawn from these Profiling Sources: YT User Profiles, Uploaded YT Video meta data, Favored YT Video meta data, and Comments on the YT Videos. Associations of Frequent Characteristics yield Clusters of Social Profiles (Profile1, Profile2, … ProfileN) for the ImREAL Simulators.]

