Aggregating Multiple Dimensions
for Computing Document Relevance
Mauro Dragoni
Fondazione Bruno Kessler (FBK), Shape and Evolving Living Knowledge Unit (SHELL)
2nd KEYSTONE Summer School
Santiago de Compostela, July 21st 2016
1
How will we spend time today?
 Our Goal:
to understand how documents can be evaluated by adopting a multi-
criteria framework
2
Presentation of the theoretical framework
Case Study 1
Representing
documents through
different layers
Case Study 2
Combining user
profiles, queries, and
document content for
computing relevance
Case Study 3
Merge and explode
Case Study 1 and Case
Study 2… 
Why is this topic interesting?
 Indexing documents and querying repositories is not only a matter of
weighting terms
 At the end of this lesson you should be able to:
 consider a document from different perspectives
 understand why YOU can be part of the document score
 know how to treat different types of information content
 What might I expect from you?
 To see a paper on this topic published in the near future… 
 To get new ideas, proposed by you…
3
Some Background
 The main idea behind this topic is “multi-criteria decision making”
 What does it mean?
 Suppose we have an entity E and a set C of n criteria
 We need to evaluate, for each criterion Ci, how much E satisfies Ci
 We have to aggregate all the satisfaction degrees to evaluate E
 Some suggested papers
 Ronald R. Yager. Modeling prioritized multicriteria decision making. IEEE Trans. Systems, Man, and
Cybernetics, Part B 34(6): 2396-2404 (2004)
 Ronald R. Yager. Prioritized aggregation operators. Int. J. Approx. Reasoning 48(1): 263-274 (2008)
 Célia da Costa Pereira, Mauro Dragoni, Gabriella Pasi. Multidimensional relevance: Prioritized
aggregation in a personalized Information Retrieval setting. Inf. Process. Manage. 48(2): 340-357
(2012)
 Francesco Corcoglioniti, Mauro Dragoni, Marco Rospocher, Alessio Palmero Aprosio. Knowledge
Extraction for Information Retrieval. ESWC 2016: 317-333
4
Further Readings
 Fuzzy Logic
 Zadeh book and papers
 Knowledge Extraction
 Semantic Web (ISWC conference series, KBS and JWS journals, …)
 Knowledge Management (KR, IJCAI, AAAI, …)
 Natural Language Processing (ACL, COLING, …)
 User Modeling and Interaction
 UMAP proceedings
 HCI papers
5
Introductory Example
 John is looking for a bicycle for his little son
 John takes care of two criteria: “safety” and “inexpensiveness”
 John considers “safety” > “inexpensiveness”
 We may face two scenarios:
1. John is not able to find a “safe” bicycle that is also “cheap”.
2. John has a low budget. Thus, he has to find a trade-off between the two criteria.
6
[Diagram: entity E evaluated against criteria C1 and C2]
Problem Representation
 Components
 the set C of the n considered criteria: C = {C1, …, Cn};
 the collection D of entities (documents in the specific case of IR);
 an aggregation function F computing the score F(C1(d),…, Cn(d)) of each
document d contained in D;
 a priority model P defined by… someone (user, system maintainer, etc.);
 a weighting schema W.
7
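A minimal sketch of these components (illustrative names, not code from the cited papers): each criterion maps a document to a satisfaction degree in [0, 1], the priority model is an ordering over the criteria, the weighting schema assigns one weight per criterion, and the aggregation operator turns the weighted degrees into a single score.

```python
from typing import Callable, Dict, List, Tuple

Criterion = Callable[[dict], float]                        # C_i: document -> satisfaction degree in [0, 1]
Aggregator = Callable[[List[Tuple[float, float]]], float]  # [(weight, degree), ...] -> overall score

def score_document(doc: dict,
                   criteria: Dict[str, Criterion],
                   priority: List[str],          # criterion names, most important first
                   weights: Dict[str, float],
                   operator: Aggregator) -> float:
    """Compute F(C1(d), ..., Cn(d)) for a single document d."""
    pairs = [(weights[name], criteria[name](doc)) for name in priority]
    return operator(pairs)

def scoring(pairs: List[Tuple[float, float]]) -> float:
    """One possible operator (the "scoring" one discussed later): sum of weight * degree."""
    return sum(w * s for w, s in pairs)
```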
Weighting Schema – Expert-based choice
 Weights are arbitrarily chosen by an expert.
 No rules for computing them.
 For example:
 C1 → λ1 = 0.7
 C2 → λ2 = 0.5
 C3 → λ3 = 0.6
 C4 → λ4 = 0.3
 You need to justify the values you choose.
8
Weighting Schema – Priority-based choice
 Weights are computed “automatically” based on the priority between
criteria.
 For each document d, the weight of the most important criterion C1 is set
to 1.0 by definition.
 The weights of the other criteria are computed along the priority order, each one equal to the previous weight multiplied by the previous criterion's satisfaction degree: w2 = w1 * C1(d), w3 = w2 * C2(d), and so on (a small sketch follows).
9
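A minimal sketch of this priority-based schema, assuming the recurrence used in the worked example later in the talk (w1 = 1, and each further weight equals the previous weight times the previous criterion's satisfaction degree); names are illustrative.

```python
from typing import List

def priority_weights(degrees_in_priority_order: List[float]) -> List[float]:
    """degrees_in_priority_order[i] is the satisfaction degree of the (i+1)-th
    most important criterion for document d. Returns one weight per criterion:
    w1 = 1.0, and each further weight is the previous weight multiplied by the
    previous criterion's degree."""
    weights = [1.0]
    for previous_degree in degrees_in_priority_order[:-1]:
        weights.append(weights[-1] * previous_degree)
    return weights

# With the degrees used in the worked example (priority C1 > C2 > C3 > C4,
# C1 = 0.5, C2 = 0.8, C3 = 0.2, C4 = 0.7):
print(priority_weights([0.5, 0.8, 0.2, 0.7]))   # [1.0, 0.5, 0.4, 0.08] (up to float rounding)
```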
Weighting Schema – Considerations
 A weighting schema can be decided a-priori but…
 We can learn a new weighting schema:
 from a learning-to-rank dataset, or
 from the IR system usage.
 The choice of the weighting schema, obviously, affects the effectiveness
of your information retrieval system.
 Where can we apply such a weighting schema?
10
Three (not exhaustive) Operators
 As you can imagine… there are different ways for combining weights and
criteria
 Operator 1: “Scoring”
 weighted criteria scores are summed
 Operator 2: “Min” or “And”
 among weighted criteria scores, minimum score is selected
 Operator 3: “Max” or “Or”
 among weighted criteria scores, maximum score is selected
11
The “Scoring” Operator
 The overall document score is computed by summing the weighted
scores computed for all criteria.
 The score computed on the most important criterion drives the overall
document score.
 Less important criteria help in refining the overall document score.
12
The “And” (or “Min”) Operator
 The document score is strongly dependent on the degree of satisfaction
of the least satisfied criterion
 Very restrictive operator
 Suggestion: consider criteria that are really relevant for a user!!!
13
The “Or” (or “Max”) Operator
 Dangerous operator!
 Recommendation: criteria with a satisfaction degree of zero should not
be considered.
 It is useful only when priority between criteria is not used.
 Weighting schema is manually defined
 Weights of less important criteria are not based on the values of the most
important ones.
14
Operators’ Properties
 Boundary Conditions
 Continuity
 Monotonicity (just for Scoring)
 Absorbing Element (“0”, for Scoring and Min operators)
15
The Operators in Action
 Assume we have a document D composed as follows:
16
Title
Abstract
Introduction
Content
Title → C1
Abstract → C2
Introduction → C3
Content → C4
The Operators in Action
 Suppose we perform the following query:
 Q = {qt1, qt2, qt3}
 Assume that, for each document field, you have a normalized similarity
value:
 sim(Q, D_Title) = 0.5
 sim(Q, D_Abstract) = 0.8
 sim(Q, D_Introduction) = 0.2
 sim(Q, D_Content) = 0.7
 As you can imagine, by using different priorities and different
aggregations, the document score will be different.
17
The Operators in Action
Criteria scores: C1 = 0.5; C2 = 0.8; C3 = 0.2; C4 = 0.7
Priority schemas:
P1: C1 > C2 > C3 > C4
P2: C1 > C2 > C4 > C3
Weights (w1 = 1.0 by definition; each subsequent weight is the previous weight multiplied by the previous criterion's score, following the priority order):
for P1: w1 = 1.0; w2 = 1.0 * 0.5 = 0.5; w3 = 0.5 * 0.8 = 0.4; w4 = 0.4 * 0.2 = 0.08
for P2: w1 = 1.0; w2 = 1.0 * 0.5 = 0.5; w3 = 0.5 * 0.8 = 0.4; w4 = 0.4 * 0.7 = 0.28
18
The Operators in Action
 Document score
 “Scoring” operator:
• DP1 = (0.5 * 1.0) + (0.8 * 0.5) + (0.2 * 0.4) + (0.7 * 0.08) = 1.036
• DP2 = (0.5 * 1.0) + (0.8 * 0.5) + (0.7 * 0.4) + (0.2 * 0.28) = 1.236
 “And” operator:
• DP1 = min(0.5^1.0, 0.8^0.5, 0.2^0.4, 0.7^0.08) = min(0.5, 0.89, 0.53, 0.97) = 0.5
• DP2 = min(0.5^1.0, 0.8^0.5, 0.7^0.4, 0.2^0.28) = min(0.5, 0.89, 0.87, 0.64) = 0.5
 “Or” operator:
• DP1 = max(0.5^1.0, 0.8^0.5, 0.2^0.4, 0.7^0.08) = max(0.5, 0.89, 0.53, 0.97) = 0.97
• DP2 = max(0.5^1.0, 0.8^0.5, 0.7^0.4, 0.2^0.28) = max(0.5, 0.89, 0.87, 0.64) = 0.89
19
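The computations above can be reproduced with a short script. This is a sketch under the reading used on these slides: "Scoring" sums weight * score, while "And"/"Or" take the minimum/maximum of each score raised to the power of its weight; names are illustrative.

```python
from typing import Dict, List, Tuple

def scoring(pairs: List[Tuple[float, float]]) -> float:    # pairs = [(weight, score), ...]
    return sum(w * s for w, s in pairs)

def and_op(pairs: List[Tuple[float, float]]) -> float:
    return min(s ** w for w, s in pairs)

def or_op(pairs: List[Tuple[float, float]]) -> float:
    return max(s ** w for w, s in pairs)

scores: Dict[str, float] = {"C1": 0.5, "C2": 0.8, "C3": 0.2, "C4": 0.7}

def weighted_pairs(priority: List[str]) -> List[Tuple[float, float]]:
    """Weights follow the priority order: w1 = 1, and each next weight is the
    previous weight multiplied by the previous criterion's score."""
    w, pairs = 1.0, []
    for i, name in enumerate(priority):
        if i > 0:
            w *= scores[priority[i - 1]]
        pairs.append((w, scores[name]))
    return pairs

for label, priority in [("P1", ["C1", "C2", "C3", "C4"]),
                        ("P2", ["C1", "C2", "C4", "C3"])]:
    p = weighted_pairs(priority)
    print(label, round(scoring(p), 3), round(and_op(p), 2), round(or_op(p), 2))
# Expected output:
#   P1 1.036 0.5 0.97
#   P2 1.236 0.5 0.89
```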
Any question so far?
20
Timeout…
Case Study 1 – The Scenario
 Keyword search over a multi-layer representation of documents
 Documents and queries structure:
 Textual layer: natural language text
 Metadata layers:
• Entity Linking
• Predicates
• Roles/Types
• Timing Information
 Problems:
 How to compute the score for each layer?
 How to aggregate such scores?
 How to weight each layer?
21
Case Study 1 – The Scenario
 Natural language content is enriched with four metadata/semantic layers
 URI Layer: links to entities detected in the text and mapped to DBpedia
entities
 TYPE Layer: conceptual classification of the named entities detected in the
text, mapped to both the DBpedia and YAGO knowledge bases
 TIME Layer: metadata related to the temporal mentions found in the text by
a temporal expression recognizer (e.g. “the eighteenth century”, “2015-18-
12”, etc.)
 FRAME Layer: output of semantic role labeling techniques. Generally, this
output includes predicates and their arguments, each describing a
specific role in the context of the predicate.
Example:
“He has been influenced by Carl Gauss” →
[framebase:Subjective_influence; dbpedia:Carl_Friedrich_Gauss]
22
Case Study 1 – Example
 Text: “astronomers influenced by Gauss”
 Layers
 URI Layer: “dbpedia:Carl_Friedrich_Gauss”
 TYPE Layer: “yago:GermanMathematicians”, “yago:NumberTheorists”,
“yago:FellowsOfTheRoyalSociety”
 TIME Layer: “day:1777-04-30”, “day:1855-02-23”, “century:1700”
 FRAME Layer: “Subjective_influence.v_Carl_Friedrich_Gauss”
 Annotations provided by PIKES (https://pikes.fbk.eu)
23
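To make the layered representation concrete, here is a purely illustrative sketch of the example above as one term set per layer, with a toy per-layer overlap score. The TEXTUAL terms are assumed (the slide lists only the semantic layers), and the real annotations produced by PIKES carry weights and richer structure.

```python
# Illustrative only: the example text represented as one term set per layer.
query_layers = {
    "TEXTUAL": {"astronomer", "influence", "gauss"},        # assumed lemmatized terms
    "URI":     {"dbpedia:Carl_Friedrich_Gauss"},
    "TYPE":    {"yago:GermanMathematicians", "yago:NumberTheorists",
                "yago:FellowsOfTheRoyalSociety"},
    "TIME":    {"day:1777-04-30", "day:1855-02-23", "century:1700"},
    "FRAME":   {"Subjective_influence.v_Carl_Friedrich_Gauss"},
}

def layer_overlap(query_layer: set, doc_layer: set) -> float:
    """A toy per-layer similarity: fraction of query-layer terms also found in
    the corresponding document layer."""
    return len(query_layer & doc_layer) / len(query_layer) if query_layer else 0.0
```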
Case Study 1 - Evaluation
 331 documents, 35 queries
 Jörg Waitelonis, Claudia Exeler, Harald Sack. Enabled Generalized Vector Space Model to
Improve Document Retrieval. NLP-DBPEDIA@ISWC 2015: 33-44
 Multi-value relevance (1=irrelevant, 5=relevant)
 Diverse queries: from keyword-based search to queries requiring semantic
capabilities
24
Case Study 1 - Evaluation
 2 baselines:
 Google custom search API
 Textual layer only (~Lucene)
 Measures: Prec1,5,10, MAP, MAP10, NDCG, NDCG10
 Equal total weight for the textual layer and for the semantic layers, split uniformly among the latter:
 TEXTUAL (50%)
 URI (12.5%), TYPE (12.5%), FRAME (12.5%), TIME (12.5%)
25
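A minimal sketch of how such expert-based weights could combine per-layer similarities, assuming a plain weighted sum (the "Scoring" operator from the first part of the talk); the per-layer similarity functions themselves are outside the scope of this sketch, and the names are illustrative.

```python
LAYER_WEIGHTS = {"TEXTUAL": 0.50, "URI": 0.125, "TYPE": 0.125,
                 "FRAME": 0.125, "TIME": 0.125}

def document_score(layer_similarities: dict) -> float:
    """layer_similarities maps a layer name to a normalized query/document
    similarity in [0, 1]; layers absent from the dict simply contribute 0."""
    return sum(LAYER_WEIGHTS[layer] * sim
               for layer, sim in layer_similarities.items()
               if layer in LAYER_WEIGHTS)

# e.g. document_score({"TEXTUAL": 0.6, "URI": 0.8, "TYPE": 0.4}) ≈ 0.45
```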
Case Study 1 - Evaluation
26
Approach/System Prec1 Prec5 Prec10 NDCG NDCG10 MAP MAP10
Google 0.543 0.411 0.343 0.434 0.405 0.255 0.219
Textual 0.943 0.669 0.453 0.832 0.782 0.733 0.681
KE4IR 0.971 0.680 0.474 0.854 0.806 0.758 0.713
KE4IR vs. Textual 3.03% 1.71% 4.55% 2.64% 2.99% 3.50% 4.74%
Case Study 1 - Evaluation
27
Layers (TEXTUAL+) Prec1 Prec5 Prec10 NDCG NDCG10 MAP MAP10
URI,TYPE,FRAME,TIME 0.971 0.680 0.474 0.854 0.806 0.758 0.713
URI,TYPE,FRAME 0.971 0.680 0.474 0.853 0.804 0.757 0.712
URI,TYPE,TIME 0.971 0.680 0.474 0.851 0.802 0.757 0.712
URI,TYPE 0.971 0.680 0.474 0.849 0.801 0.755 0.710
URI,FRAME,TIME 0.971 0.674 0.465 0.844 0.796 0.750 0.702
URI,FRAME 0.971 0.674 0.465 0.842 0.795 0.749 0.702
URI,TIME 0.971 0.674 0.465 0.840 0.791 0.747 0.700
TYPE,FRAME,TIME 0.943 0.674 0.471 0.848 0.799 0.745 0.700
TYPE,TIME 0.943 0.674 0.471 0.843 0.794 0.743 0.697
TYPE,FRAME 0.943 0.674 0.468 0.847 0.797 0.743 0.695
FRAME,TIME 0.943 0.674 0.462 0.842 0.793 0.741 0.693
Case Study 1 - Evaluation
28
Case Study 1 – What We Learnt
 How the effectiveness of a system is affected when we change the weights.
 In this specific case, the use of an expert-based weighting schema helps
you balance the importance of the semantic information…
 … however, we are using learning to rank to identify potential
priorities between the layers.
 Further lessons relate more to the use of the semantic layers themselves.
 Future work: to apply the approach to larger collections.
29
Any question on
Case Study 1?
30
Timeout…
Case Study 2 – The Scenario
 Combine document information with user profiles.
 Assumption: you already have computed user profiles.
 Which information can you use?
 RELIABILITY: How much a user trusts the document source.
 COVERAGE: How strongly a user profile is represented in a document
(inclusion of the user profile in the document).
 APPROPRIATENESS: How much a document satisfies a user profile (similarity
between user profile and document).
 ABOUTNESS: Trivial criterion: how much a document matches the submitted
query.
31
Case Study 2 – Reliability
 Why do we trust information sources differently?
 How much do you trust an information source?
 you might fix such values;
 you might infer them.
32
Case Study 2 – Coverage
 The “coverage” criterion computes how strongly a user profile is
contained in the document
 Suppose we have the profile of a user interested in the following topics:
 c = {sports, economics}
 Suppose we have a document talking about the following topics:
 d = {violence, politics, economics, sports}
 c = {0, 0, 1, 1} d = {1, 1, 1, 1} → Coverage(c,d) = 1.0
33
Case Study 2 – Appropriateness
 The “appropriateness” criterion computes how much a
document satisfies a user profile
 Suppose we have the profile of a user interested in the following topics:
 c = {sports, economics}
 Suppose we have a document talking about the following topics:
 d = {violence, politics, economics, sports}
 c = {0, 0, 1, 1} d = {1, 1, 1, 1} → Appropriateness(c,d) = 0.5
34
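The two examples above can be reproduced with a crisp (0/1) sketch: coverage as the fraction of the user's topics appearing in the document, and appropriateness read here as the fraction of the document's topics belonging to the profile. This is only an illustration consistent with the slides; the IP&M 2012 paper cited earlier works with graded (fuzzy) representations.

```python
def coverage(profile: set, doc_topics: set) -> float:
    """How strongly the user profile is contained in the document."""
    return len(profile & doc_topics) / len(profile) if profile else 0.0

def appropriateness(profile: set, doc_topics: set) -> float:
    """How much the document sticks to the user's topics of interest
    (one crisp reading consistent with the slide example)."""
    return len(profile & doc_topics) / len(doc_topics) if doc_topics else 0.0

c = {"sports", "economics"}
d = {"violence", "politics", "economics", "sports"}
print(coverage(c, d), appropriateness(c, d))   # 1.0 0.5
```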
Case Study 2 – Aboutness
 The “classic” similarity between a query and documents contained in a
repository.
 Many models are available… and various adaptations based on the context.
35
Case Study 2 – Validation
 The Reuters RCV1 Collection has been used for creating user profiles and
for generating user queries.
 20 users have been involved in the evaluation campaign.
 Different aggregation schemas have been tested.
36
Case Study 2 – Validation (Ab > Ap > C > R)
37
Case Study 2 – What We Learnt
 When users are involved, it is very difficult to define an aggregation
schema.
 The same occurs for the priority between criteria.
 Creating (or learning) a user profile is already a big problem in itself.
 The quality of user profiles significantly affects the effectiveness of the
retrieval algorithm.
 If you start playing with criteria and weight schemas, you will never end!!!
38
Any question on
Case Study 2?
39
Timeout…
Case Study 3
 Let’s get back to the first simple example…
40
Title
Abstract
Introduction
Content
Title → C1
Abstract → C2
Introduction → C3
Content → C4
Case Study 3 – Suppose that…
 Each field has been annotated with different ontologies, but belonging to
the same domain
 this means that you have, for the same field, many layers with different
annotations… one for each used ontology
 Your repository contains documents coming from different sources
 is the reliability of each source the same?
 Your users have a history
 User profiles need to be updated
 this aspect is out of the scope of this talk… but you should be aware of it… 
 Any other idea?
41
Exploding Fields
42
You have something to think about… Good luck!!!
So… to conclude
 Considering retrieval as a multi-criteria decision making problem is
interesting to explore.
 There is room for investigating a lot of stuff.
 Do not be scared of using user profiles.
 I invite you to consider recent works on simulating user interactions with IR
systems
• David Maxwell, Leif Azzopardi. Simulating Interactive Information Retrieval: SimIIR: A
Framework for the Simulation of Interaction. SIGIR 2016: 1141-1144 (+ the tutorial he
gave)
 My suggestion: try to combine
 content
 semantic metadata
 user history
43
44
It’s time for questions…
Mauro Dragoni
Fondazione Bruno Kessler
https://shell.fbk.eu/index.php/Mauro_Dragoni
dragoni@fbk.eu
