03/18/22 Heiko Paulheim 1
What do
Knowledge Graph Embeddings Learn?
includes: some new Adventures in RDF2vec
Heiko Paulheim
University of Mannheim
03/18/22 Heiko Paulheim 2
Graphs vs. Vectors
• Data Science tools for prediction etc.
– Python, Weka, R, RapidMiner, …
– Algorithms that work on vectors, not graphs
• Bridges built over the past years:
– FeGeLOD (Weka, 2012), RapidMiner LOD Extension (2015), Python KG Extension (2021)
03/18/22 Heiko Paulheim 3
Graphs vs. Vectors
• Transformation strategies (aka propositionalization)
– e.g., types: type_horror_movie=true
– e.g., data values: year=2011
– e.g., aggregates: nominations=7
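The three strategies above (types, data values, aggregates) can be sketched on a toy graph. This is only an illustration: the triples, the predicate names (`rdf:type`, `year`, `nominatedFor`) and the helper `propositionalize` are made up here and are not taken from FeGeLOD or the other tools mentioned.

```python
from collections import defaultdict

# Toy triples (subject, predicate, object); all names are illustrative only.
TRIPLES = [
    ("Movie_A", "rdf:type", "horror_movie"),
    ("Movie_A", "year", 2011),
    ("Movie_A", "nominatedFor", "Award_1"),
    ("Movie_A", "nominatedFor", "Award_2"),
    ("Movie_B", "rdf:type", "comedy_movie"),
    ("Movie_B", "year", 2009),
]

def propositionalize(triples):
    """Turn triples into one flat feature dict per subject."""
    features = defaultdict(dict)
    for s, p, o in triples:
        row = features[s]
        if p == "rdf:type":
            # types -> binary features
            row[f"type_{o}"] = True
        elif isinstance(o, (int, float)):
            # data values -> numeric features
            row[p] = o
        else:
            # aggregates -> number of outgoing edges per relation
            row[f"count_{p}"] = row.get(f"count_{p}", 0) + 1
    return dict(features)

table = propositionalize(TRIPLES)
print(table["Movie_A"])
```

Each entity becomes one row of a feature table, which is the vector format that standard data science tools expect.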
03/18/22 Heiko Paulheim 4
Graphs vs. Vectors
• Observations with simple propositionalization strategies
– Even simple features (e.g., add all numbers and types) can help on many problems
– More sophisticated features often bring additional improvements
• Combinations of relations and individuals
– e.g., movies directed by Steven Spielberg
• Combinations of relations and types
– e.g., movies directed by Oscar-winning directors
• …
– But
• The search space is enormous!
• Generate first, filter later does not scale well
03/18/22 Heiko Paulheim 5
Towards RDF2vec
• Excursion: word embeddings
– word2vec proposed by Mikolov et al. (2013)
– predict a word from its context or vice versa
• Idea: similar words appear in similar contexts, like
– Jobs, Wozniak, and Wayne founded Apple Computer Company in April 1976
– Google was officially founded as a company in September 1998
– usually trained on large text corpora
• projection layer: embedding vectors
03/18/22 Heiko Paulheim 6
From Word Embeddings to Graph Embeddings
• Basic idea:
– extract random walks from an RDF graph, e.g.:
Mulholland Dr. → director → David Lynch → nationality → US
– feed walks into word2vec algorithm
• Order of magnitude (e.g., DBpedia)
– ~6M entities (“words”)
– start up to 500 random walks per entity, length up to 8
→ corpus of >20B tokens
• Result:
– entity embeddings
– most often outperform other propositionalization techniques
Ristoski and Paulheim (2016): RDF2vec: RDF graph embeddings for data mining
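The walk extraction step on this slide can be sketched as follows. This is a minimal sketch, not the RDF2vec reference implementation: the helper names (`build_index`, `random_walk`, `extract_walks`) are hypothetical, and the third triple is added only so that the walker has a choice to make.

```python
import random
from collections import defaultdict

# Toy graph; the first two triples mirror the walk on the slide,
# the third one is an assumption added for illustration.
TRIPLES = [
    ("Mulholland_Dr.", "director", "David_Lynch"),
    ("David_Lynch", "nationality", "US"),
    ("Mulholland_Dr.", "starring", "Naomi_Watts"),
]

def build_index(triples):
    """Map each subject to its outgoing (predicate, object) edges."""
    out = defaultdict(list)
    for s, p, o in triples:
        out[s].append((p, o))
    return out

def random_walk(index, start, max_hops, rng):
    """Follow up to max_hops outgoing edges, recording predicates and entities."""
    walk = [start]
    node = start
    for _ in range(max_hops):
        edges = index.get(node)
        if not edges:
            break  # dead end: no outgoing edges
        p, o = rng.choice(edges)
        walk += [p, o]
        node = o
    return walk

def extract_walks(triples, walks_per_entity, max_hops, seed=0):
    """Start a fixed number of random walks from every entity in the graph."""
    rng = random.Random(seed)
    index = build_index(triples)
    entities = sorted({s for s, _, _ in triples} | {o for _, _, o in triples})
    return [random_walk(index, e, max_hops, rng)
            for e in entities for _ in range(walks_per_entity)]

walks = extract_walks(TRIPLES, walks_per_entity=5, max_hops=2)
print(walks[:3])
```

Each walk plays the role of a sentence, so the list of walks can be handed to any word2vec implementation as its training corpus. As a rough size check with the slide's numbers: 6M entities × 500 walks gives at most 3 × 10^9 walks, so a corpus of >20B tokens means more than about 6.7 tokens per walk on average.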
03/18/22 Heiko Paulheim 7
The End of Petar’s PhD Journey…
• ...and the beginning of the RDF2vec adventure
03/18/22 Heiko Paulheim 8
Why does RDF2vec Work?
• Example: PCA plot of an excerpt of a cities classification problem
– From the cities classification task in the embedding evaluation framework by Pellegrino et al.
03/18/22 Heiko Paulheim 9
Why does RDF2vec Work?
• In downstream machine learning, we usually want class separation
– to make the life of the classifier as easy as possible
• Class separation means
– Similar entities (i.e., same class) are projected close to each other
– Dissimilar entities (i.e., different classes) are projected far away from each other
03/18/22 Heiko Paulheim 10
Why does RDF2vec Work?
• Observation: close projection of similar entities
– Usage example: content-based recommender system based on k-NN
Ristoski and Paulheim (2016): RDF2vec: RDF graph embeddings for data mining
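A k-NN recommender on top of entity embeddings can be sketched as below. The vectors and item names are hypothetical, and cosine similarity is one common choice of proximity measure; the slide does not say which measure was used.

```python
import math

# Hypothetical 3-dimensional embeddings; real RDF2vec vectors have far more dimensions.
EMBEDDINGS = {
    "Movie_A": [0.9, 0.1, 0.0],
    "Movie_B": [0.8, 0.2, 0.1],
    "Movie_C": [0.0, 0.9, 0.4],
    "Movie_D": [0.1, 0.8, 0.5],
}

def cosine(u, v):
    """Cosine similarity of two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def recommend(liked, k):
    """Return the k items closest to the liked item, excluding the item itself."""
    query = EMBEDDINGS[liked]
    ranked = sorted((name for name in EMBEDDINGS if name != liked),
                    key=lambda name: cosine(query, EMBEDDINGS[name]),
                    reverse=True)
    return ranked[:k]

print(recommend("Movie_A", k=1))
```

The recommender needs no features beyond the embedding itself, which is why close projection of similar entities matters for this use case.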
03/18/22 Heiko Paulheim 11
Embeddings for Link Prediction
• RDF2vec observations
– similar instances form clusters, direction of relation is ~stable
– link prediction by analogy reasoning (Japan – Tokyo ≈ China – Beijing)
Ristoski & Paulheim: RDF2vec: RDF Graph Embeddings for Data Mining. ISWC, 2016
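Link prediction by analogy can be sketched with plain vector arithmetic: if Japan – Tokyo ≈ China – Beijing, then Tokyo – Japan + China should land near Beijing. The 2-d vectors below are hand-made assumptions built so that the offset is roughly stable; they are not real RDF2vec vectors.

```python
import math

# Hypothetical 2-d vectors with a roughly constant country -> capital offset.
VECTORS = {
    "Japan":   [1.0, 3.0],
    "Tokyo":   [1.1, 1.0],
    "China":   [3.0, 3.1],
    "Beijing": [3.0, 1.0],
    "Germany": [5.0, 3.0],
    "Berlin":  [5.1, 1.1],
}

def predict_by_analogy(a, b, c):
    """Answer 'a is to b as c is to ?' with the vector nearest to b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(VECTORS[a], VECTORS[b], VECTORS[c])]
    candidates = [name for name in VECTORS if name not in (a, b, c)]
    return min(candidates, key=lambda name: math.dist(VECTORS[name], target))

print(predict_by_analogy("Japan", "Tokyo", "China"))
```

The prediction is only as good as the stability of the relation's direction, which is why the slide qualifies it as "~stable".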
03/18/22 Heiko Paulheim 12
Embeddings for Link Prediction
• In RDF2vec, relation preservation is a by-product
• TransE (and its descendants): direct modeling
– Formulates RDF embedding as an optimization problem
– Find a mapping of entities and relations to $\mathbb{R}^n$ so that
• across all triples $\langle s,p,o \rangle$, $\sum \lVert s + p - o \rVert$ is minimized
• try to obtain a smaller error for existing triples than for non-existing ones
Bordes et al.: Translating Embeddings for Modeling Multi-relational Data. NIPS 2013.
Fan et al.: Learning Embedding Representations for Knowledge Inference on Imperfect and Incomplete Repositories. WI 2016.
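The TransE error term and the "smaller error for existing triples" requirement can be sketched as follows. The margin-based hinge loss follows the general idea in Bordes et al.; the function names, the margin value, and the hand-picked 2-d vectors are assumptions for illustration, and no training loop is shown.

```python
import math

def score(s, p, o):
    """TransE error ||s + p - o|| (L2 norm); lower means more plausible."""
    return math.sqrt(sum((si + pi - oi) ** 2 for si, pi, oi in zip(s, p, o)))

def margin_loss(pos, neg, margin=1.0):
    """Hinge loss: zero once the true triple beats the corrupted one by `margin`."""
    return max(0.0, margin + score(*pos) - score(*neg))

# Hypothetical 2-d embeddings, chosen by hand rather than learned.
tokyo = [1.0, 1.0]
japan, china = [1.0, 3.0], [3.0, 3.0]
capital_of = [0.0, 2.0]

true_triple = (tokyo, capital_of, japan)   # <Tokyo, capitalOf, Japan>
corrupted = (tokyo, capital_of, china)     # object replaced by a wrong entity

print(score(*true_triple), score(*corrupted), margin_loss(true_triple, corrupted))
```

Training would adjust the vectors by gradient descent until this loss is small across many true and corrupted triples.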
03/18/22 Heiko Paulheim 13
Link Prediction vs. Node Embedding
• Hypothesis:
– Embeddings for link prediction also cluster similar entities
– Node embeddings can also be used for link prediction
Portisch et al. (2022): Knowledge Graph Embedding for Data Mining vs. Knowledge Graph Embedding for Link Prediction - Two Sides of the Same Coin?
03/18/22 Heiko Paulheim 14
Close Projection of Similar Entities
• What does similar mean?
03/18/22 Heiko Paulheim 15
Similarity vs. Relatedness
• Closest 10 entities to Angela Merkel in different vector spaces
Portisch et al. (2022): Knowledge Graph Embedding for Data Mining vs. Knowledge Graph Embedding for Link Prediction - Two Sides of the Same Coin?
03/18/22 Heiko Paulheim 16
Back to Class Separation
• What is a class?
– e.g., cities per se
– e.g., cities in the Netherlands
– e.g., cities in the Netherlands above 100k inhabitants
• Or something different, such as
– e.g., everything located in Amsterdam
– e.g., everything Amsterdam is known for
03/18/22 Heiko Paulheim 17
Back to Class Separation
• Observation: there are different kinds of classes:
– Classes of objects of the same category (e.g., cities)
→ those are similar
– Classes of objects of different categories (e.g., buildings, dishes, organizations, persons)
→ those are related
03/18/22 Heiko Paulheim 18
Intermediate Observation
• In most vector spaces of link prediction embeddings (TransE etc.):
proximity ~ similarity
• In RDF2vec embedding space:
proximity ~ a mix of similarity and relatedness
03/18/22 Heiko Paulheim 19
So… why does RDF2vec Work Then?
• Recap: downstream ML algorithms need class separation
– but RDF2vec groups items by similarity and relatedness
• Why is RDF2vec still so good at classification?
03/18/22 Heiko Paulheim 20
Example
• It depends on the classification problem at hand!
– Cities vs. countries
– Places in Europe vs. places in Asia
03/18/22 Heiko Paulheim 21
So… why does RDF2vec Work Then?
• Many downstream classification tasks are homogeneous
– e.g., classifying cities in different subclasses
• For homogeneous entities:
– relatedness provides finer-grained distinctions
03/18/22 Heiko Paulheim 22
Similarity vs. Relatedness
• Recap word embeddings:
– Jobs, Wozniak, and Wayne founded Apple Computer Company in April 1976
– Google was officially founded as a company in September 1998
• Graph walks:
– Hamburg → country → Germany → leader → Angela_Merkel
– Germany → leader → Angela_Merkel → birthPlace → Hamburg
– Hamburg → leader → Peter_Tschentscher → residence → Hamburg
[Figure: graph over Germany, Angela_Merkel, Hamburg, and Peter_Tschentscher with edges labeled country, leader, birthPlace, and residence]
03/18/22 Heiko Paulheim 23
Order-Aware RDF2vec
• Using an order-aware variant of word2vec
• Experimental results:
– order-aware RDF2vec most often outperforms classic RDF2vec
– a bit more computation heavy, but still scales to DBpedia etc.
Ling et al. (2015): Two/Too Simple Adaptations of Word2Vec for Syntax Problems.
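The difference between unordered and order-aware contexts can be sketched on two of the walks shown earlier. This is a simplification: Ling et al.'s structured variants use position-dependent weights inside the model, while the sketch below only tags each context token with its relative offset to show which information becomes available. The helper names are hypothetical.

```python
WALKS = [
    ["Hamburg", "country", "Germany", "leader", "Angela_Merkel"],
    ["Germany", "leader", "Angela_Merkel", "birthPlace", "Hamburg"],
]

def training_pairs(walk, window, order_aware):
    """Yield (target, context) pairs; order-aware contexts carry their offset."""
    for i, target in enumerate(walk):
        lo, hi = max(0, i - window), min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield target, ((walk[j], j - i) if order_aware else walk[j])

def contexts(entity, order_aware, window=2):
    """Collect all distinct contexts in which an entity occurs."""
    return {c for walk in WALKS
            for t, c in training_pairs(walk, window, order_aware) if t == entity}

# Contexts shared by Hamburg (a city) and Angela_Merkel (a person).
shared_classic = contexts("Hamburg", False) & contexts("Angela_Merkel", False)
shared_ordered = contexts("Hamburg", True) & contexts("Angela_Merkel", True)
print(shared_classic, shared_ordered)
```

Without order, the two entities share context tokens simply because they co-occur in the same walks; once positions are taken into account, the shared contexts disappear in this toy example.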
03/18/22 Heiko Paulheim 24
Similarity vs. Relatedness
• Exploiting different notions of proximity
– Use case: table interpretation (a special case of entity disambiguation)
[Figure: table interpretation example with annotations "related" and "similar"]
03/18/22 Heiko Paulheim 25
Similarity vs. Relatedness in Graph Walks
• Which parts of a walk denote what?
– Hamburg → country → Germany → leader → Angela_Merkel
– Germany → leader → Angela_Merkel → birthPlace → Hamburg
– Hamburg → leader → Peter_Tschentscher → residence → Hamburg
– California → leader → Gavin_Newsom → birthPlace → San_Francisco
• Common predicates (leader, birthPlace)
– Similar entities
• Common entities (Hamburg)
– Related entities
– For same-class entities: similar entities!
Portisch and Paulheim (under review): Walk this Way! Entity Walks and Property Walks for RDF2vec.
03/18/22 Heiko Paulheim 26
Similarity vs. Relatedness in Graph Walks
• Given that observation:
– Common predicates (leader, birthPlace)
• Similar classes
– Common entities (Hamburg)
• Related entities
• For same-class entities: similar entities!
• ...we should be able to learn tailored embeddings
– using walks of predicates → embedding space encodes similarity
– using walks of entities → embedding space encodes relatedness
Portisch and Paulheim (under review): Walk this Way! Entity Walks and Property Walks for RDF2vec.
03/18/22 Heiko Paulheim 27
Similarity vs. Relatedness in Graph Walks
• Classic RDF2vec walks:
– Germany → leader → Angela_Merkel → birthPlace → Hamburg
• p-walk (predicates only except for focus entity)
– country → leader → Angela_Merkel → birthPlace → mayor
• e-walk (entities only)
– Berlin → Germany → Angela_Merkel → Hamburg → Elbphilharmonie
Portisch and Paulheim (under review): Walk this Way! Entity Walks and Property Walks for RDF2vec.
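The two walk flavors can be derived from a classic walk by filtering tokens, as sketched below. The functions `e_walk` and `p_walk` are hypothetical helpers, and the long input walk is an assumption that was stitched together so that its p-walk matches the slide's example; actual implementations may generate these walks directly from the graph.

```python
def e_walk(walk):
    """Keep entities only (even positions of an entity/predicate/entity walk)."""
    return walk[0::2]

def p_walk(walk, focus):
    """Keep predicates only, except for the focus entity, which stays in place."""
    return [tok for i, tok in enumerate(walk) if i % 2 == 1 or tok == focus]

# Hypothetical longer classic walk around the focus entity Angela_Merkel;
# the first and last hops are assumptions for illustration.
walk = ["Berlin", "country", "Germany", "leader", "Angela_Merkel",
        "birthPlace", "Hamburg", "mayor", "Peter_Tschentscher"]

print(p_walk(walk, "Angela_Merkel"))
print(e_walk(walk))
```

The p-walk keeps the structural signature around the focus entity, while the e-walk keeps only which entities occur near it.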
03/18/22 Heiko Paulheim 28
The RDF2vec Zoo
• We now have an entire zoo of RDF2vec variants
– SG vs. CBOW
– Order-aware vs. unordered (“classic”)
– Classic walks vs. e-walks vs. p-walks
03/18/22 Heiko Paulheim 29
The RDF2vec Zoo – Preliminary Evaluation
• Classic is usually quite good
• oa (order-aware) variants are often superior
03/18/22 Heiko Paulheim 30
The RDF2vec Zoo – Breeding New Embeddings
• Preliminary experiments show good results when combining embeddings
– Classic random walks:
Adler_Mannheim → city → Mannheim → country → Germany
Adler_Mannheim → stadium → SAP_Arena → location → Mannheim
SAP_Arena → location → Mannheim → country → Germany
...
– p-walks:
city → Mannheim → country
stadium → location → Mannheim → country
location → Mannheim → federal_state → location
...
– e-walks:
Adler_Mannheim → Mannheim → Germany
Adler_Mannheim → SAP_Arena → Mannheim → Germany
SAP_Arena → Mannheim → Baden-Württemberg → Germany
...
– The resulting embeddings are combined into a concatenated vector, followed by a global PCA
03/18/22 Heiko Paulheim 31
The RDF2vec Zoo – Breeding New Embeddings
• Combinations can be task specific
– Based on general embeddings
– Combination can pick up task-specific signals
– Classic random walks:
Adler_Mannheim → city → Mannheim → country → Germany
Adler_Mannheim → stadium → SAP_Arena → location → Mannheim
SAP_Arena → location → Mannheim → country → Germany
...
– p-walks:
city → Mannheim → country
stadium → location → Mannheim → country
location → Mannheim → federal_state → location
...
– e-walks:
Adler_Mannheim → Mannheim → Germany
Adler_Mannheim → SAP_Arena → Mannheim → Germany
SAP_Arena → Mannheim → Baden-Württemberg → Germany
...
– The embeddings are weighted (w1, w2, w3) and combined into a concatenated vector, followed by a (weighted) local PCA; task data is an additional input
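The weighting and concatenation step can be sketched as below. The slide does not state how the weights are chosen or which weight belongs to which embedding, so the values here are arbitrary assumptions, and the PCA step is left out to keep the example dependency-free.

```python
def weighted_concat(vectors, weights):
    """Scale each embedding by its weight, then concatenate into one vector."""
    if len(vectors) != len(weights):
        raise ValueError("need exactly one weight per embedding")
    return [w * x for vec, w in zip(vectors, weights) for x in vec]

# Hypothetical 2-d embeddings of one entity from the three walk strategies.
classic = [0.2, 0.4]
p_walks = [1.0, 0.0]
e_walks = [0.5, 0.5]

# Arbitrary example weights; a task-specific setup would tune them on task data.
combined = weighted_concat([classic, p_walks, e_walks], weights=[1.0, 0.5, 2.0])
print(combined)
```

Scaling before a variance-based reduction such as PCA lets a heavily weighted embedding contribute more to the leading components.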
03/18/22 Heiko Paulheim 32
Which Classes can be Learned with RDF2vec?
• We already saw that there are different notions of classes
• Idea: compile a list of class definitions as a benchmark
– Classes are expressed as DL formulae, e.g.
– ∃r.⊤, e.g., class of persons with children
– ∃r.{e}, e.g., class of persons born in New York City
– ∃R.{e}, e.g., class of persons with any relation to New York City
– ∃r.C, e.g., class of persons playing in a basketball team
– …
• First attempt:
– Create SPARQL queries in DBpedia to get positive and negative examples
– Train binary classifier on different embeddings
03/18/22 Heiko Paulheim 33
Which Classes can be Learned with RDF2vec?
• Formulating hypotheses
– e.g., r.T, cannot be learned when using e walks
• Testing hypotheses
– using queries against DBpedia
• Seeing surprises
– e.g., models trained on e-walks can reach ~90% accuracy in that case
03/18/22 Heiko Paulheim 34
Which Classes can be Learned with RDF2vec?
• Challenge: isolating effects
– Let’s consider ∃r.⊤, e.g., ∃almaMater.⊤
– In theory, we should not be able to learn this with e-walks
– Frequent entities in the neighborhoods of positive examples:
• Politician (3k examples)
• Bachelor of Arts (3k examples)
• Harvard Law School (2k examples)
• Lawyer (2k examples)
• Northwestern University (2k examples)
• Harvard University (2k examples)
• Doctor of Philosophy (2k examples)
• …
– Those signals are visible to e-walks!
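The kind of neighborhood analysis behind this list can be sketched as a simple count. The triples below are made up: the entity names echo the slide, but the counts do not, and `neighbor_counts` is a hypothetical helper.

```python
from collections import Counter

# Made-up triples for illustration only.
TRIPLES = [
    ("Person_1", "almaMater", "Harvard_Law_School"),
    ("Person_1", "occupation", "Lawyer"),
    ("Person_2", "almaMater", "Harvard_Law_School"),
    ("Person_2", "occupation", "Politician"),
    ("Person_3", "occupation", "Politician"),
    ("Person_3", "birthPlace", "Hamburg"),
]

def neighbor_counts(triples, relation):
    """Count objects adjacent to entities that have at least one `relation` edge."""
    positives = {s for s, p, _ in triples if p == relation}
    return Counter(o for s, _, o in triples if s in positives)

counts = neighbor_counts(TRIPLES, "almaMater")
print(counts.most_common(3))
```

Entities that show up frequently next to positive examples are exactly the tokens an e-walk can see, even though the relation itself is hidden from it.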
03/18/22 Heiko Paulheim 35
Which Classes can be Learned with RDF2vec?
• Maybe DBpedia is not such a great testbed
– Hidden patterns, e.g., for relation co-occurrence
– Many inter-pattern dependencies
– Information not missing at random
• Possible solution:
– Synthetic knowledge graphs!
– First experiments show better visibility of expected effects
03/18/22 Heiko Paulheim 36
Alternatives to Understand KG Embeddings
• Approach 1: learn symbolic interpretation function for dimensions
• Each dimension of the embedding model is a target for a separate learning problem
• Learn a function to explain the dimension
• E.g.: y ≈ −|∃character.Superhero|
• Just an approximation used for explanations and justifications
[Figure: example with the labels "cartoon" and "superhero"]
03/18/22 Heiko Paulheim 37
Alternatives to Understand KG Embeddings
• Approach 2: learn symbolic substitute function for similarity function
Right hand side picture: RelFinder (https://interactivesystems.info/developments/relfinder)
03/18/22 Heiko Paulheim 38
Alternatives to Understand KG Embeddings
• Approach 3: generate symbolic interpretations for individual predictions
– Inspired by LIME:
• Generate perturbed examples
• Label them using embedding + downstream classifier
• Learn symbolic model on this labeled set
– Good news:
• RDF2vec can, in principle, create embeddings for unseen entities
• Those can be used to classify perturbed examples
https://c3.ai/glossary/data-science/lime-local-interpretable-model-agnostic-explanations/
03/18/22 Heiko Paulheim 39
Summary
• Knowledge Graph Embeddings with RDF2vec
– Encode similarity and relatedness
• Explicit trade-off is possible!
– Variations visited: walk extraction, order-awareness, materialization, ...
– Additional insights that are not explicit in the graph
• aka latent semantics
03/18/22 Heiko Paulheim 40
More on RDF2vec
• Collection of
– Implementations
– Pre-trained models
– >45 use cases in various domains
03/18/22 Heiko Paulheim 41
Thank you!
http://www.heikopaulheim.com
@heikopaulheim
03/18/22 Heiko Paulheim 42
What_do_Knowledge_Graph_Embeddings_Learn.pdf

  • 1. 03/18/22 Heiko Paulheim 1 What do Knowledge Graph Embeddings Learn? includes: some new Adventures in RDF2vec Heiko Paulheim University of Mannheim Heiko Paulheim
  • 2. 03/18/22 Heiko Paulheim 2 Graphs vs. Vectors • Data Science tools for prediction etc. – Python, Weka, R, RapidMiner, … – Algorithms that work on vectors, not graphs • Bridges built over the past years: – FeGeLOD (Weka, 2012), RapidMiner LOD Extension (2015), Python KG Extension (2021) ?
  • 3. 03/18/22 Heiko Paulheim 3 Graphs vs. Vectors • Transformation strategies (aka propositionalization) – e.g., types: type_horror_movie=true – e.g., data values: year=2011 – e.g., aggregates: nominations=7 ?
  • 4. 03/18/22 Heiko Paulheim 4 Graphs vs. Vectors • Observations with simple propositionalization strategies – Even simple features (e.g., add all numbers and types) can help on many problems – More sophisticated features often bring additional improvements • Combinations of relations and individuals – e.g., movies directed by Steven Spielberg • Combinations of relations and types – e.g., movies directed by Oscar-winning directors • … – But • The search space is enormous! • Generate first, filter later does not scale well
  • 5. 03/18/22 Heiko Paulheim 5 Towards RDF2vec • Excursion: word embeddings – word2vec proposed by Mikolov et al. (2013) – predict a word from its context or vice versa • Idea: similar words appear in similar contexts, like – Jobs, Wozniak, and Wayne founded Apple Computer Company in April 1976 – Google was officially founded as a company in January 2006 – usually trained on large text corpora • projection layer: embedding vectors
  • 6. 03/18/22 Heiko Paulheim 6 From Word Embeddings to Graph Embeddings • Basic idea: – extract random walks from an RDF graph: Mulholland Dr. David Lynch US – feed walks into word2vec algorithm • Order of magnitude (e.g., DBpedia) – ~6M entities (“words”) – start up to 500 random walks per entity, length up to 8 → corpus of >20B tokens • Result: – entity embeddings – most often outperform other propositionalization techniques director nationality Ristoski and Paulheim (2016): RDF2vec: RDF graph embeddings for data mining
  • 7. 03/18/22 Heiko Paulheim 7 The End of Petar’s PhD Journey… • ...and the beginning of the RDF2vec adventure
  • 8. 03/18/22 Heiko Paulheim 8 Why does RDF2vec Work? • Example: PCA plot of an excerpt of a cities classification problem – From cities classification task in the embedding evaluation framework by Pellegrino et al.
  • 9. 03/18/22 Heiko Paulheim 9 Why does RDF2vec Work? • In downstream machine learning, we usually want class separation – to make the life of the classifier as easy as possible • Class separation means – Similar entities (i.e., same class) are projected closely to each other – Dissimilar entities (i.e., different classes) are projected far away from each other
  • 10. 03/18/22 Heiko Paulheim 10 Why does RDF2vec Work? • Observation: close projection of similar entities – Usage example: content-based recommender system based on k-NN Ristoski and Paulheim (2016): RDF2vec: RDF graph embeddings for data mining
  • 11. 03/18/22 Heiko Paulheim 11 Embeddings for Link Prediction • RDF2vec observations – similar instances form clusters, direction of relation is ~stable – link prediction by analogy reasoning (Japan – Tokyo ≈ China – Beijing) Ristoski & Paulheim: RDF2vec: RDF Graph Embeddings for Data Mining. ISWC, 2016
  • 12. 03/18/22 Heiko Paulheim 12 Embeddings for Link Prediction • In RDF2vec, relation preservation is a by-product • TransE (and its descendants): direct modeling – Formulates RDF embedding as an optimization problem – Find mapping of entities and relations to Rn so that • across all triples <s,p,o> Σ ||s+p-o|| is minimized • try to obtain a smaller error for existing triples than for non-existing ones Bordes et al: Translating Embeddings for Modeling Multi-relational Data. NIPS 2013. Fan et al.: Learning Embedding Representations for Knowledge Inference on Imperfect and Incomplete Repositories. WI 2016
  • 13. 03/18/22 Heiko Paulheim 13 Link Prediction vs. Node Embedding • Hypothesis: – Embeddings for link prediction also cluster similar entities – Node embeddings can also be used for link prediction Portisch et al. (2022): Knowledge Graph Embedding for Data Mining vs. Knowledge Graph Embedding for Link Prediction - Two Sides of the Same Coin?
  • 14. 03/18/22 Heiko Paulheim 14 Close Projection of Similar Entities • What does similar mean?
  • 15. 03/18/22 Heiko Paulheim 15 Similarity vs. Relatedness • Closest 10 entities to Angela Merkel in different vector spaces Portisch et al. (2022): Knowledge Graph Embedding for Data Mining vs. Knowledge Graph Embedding for Link Prediction - Two Sides of the Same Coin?
  • 16. 03/18/22 Heiko Paulheim 16 Back to Class Separation • What is a class? – e.g., cities per se – e.g., cities in the Netherlands – e.g., cities in the Netherlands above 100k inhabitants • Or something different, such as – e.g., everything located in Amsterdam – e.g., everything Amsterdam is known for
  • 17. 03/18/22 Heiko Paulheim 17 Back to Class Separation • Observation: there are different kinds of classes: – Classes of objects of the same category (e.g., cities) → those are similar – Classes of objects of different categories (e.g., buildings, dishes, organizations, persons) → those are related
  • 18. 03/18/22 Heiko Paulheim 18 Intermediate Observation • In most vector spaces of link prediction embeddings (TransE etc.): proximity ~ similarity • In RDF2vec embedding space: proximity ~ a mix of similarity and relatedness
  • 19. 03/18/22 Heiko Paulheim 19 So… why does RDF2vec Work Then? • Recap: downstream ML algorithms need class separation – but RDF2vec groups items by similarity and relatedness • Why is RDF2vec still so good at classification?
  • 20. 03/18/22 Heiko Paulheim 20 Example • It depends on the classification problem at hand! – Cities vs. countries – Places in Europe vs. places in Asia
  • 21. 03/18/22 Heiko Paulheim 21 So… why does RDF2vec Work Then? • Many downstream classification tasks are homogeneous – e.g., classifying cities in different subclasses • For homogeneous entities: – relatedness provides finer-grained distinctions
  • 22. 03/18/22 Heiko Paulheim 22 Similarity vs. Relatedness • Recap word embeddings: – Jobs, Wozniak, and Wayne founded Apple Computer Company in April 1976 – Google was officially founded as a company in January 2006 • Graph walks: – Hamburg → country → Germany → leader → Angela_Merkel – Germany → leader → Angela_Merkel → birthPlace → Hamburg – Hamburg → leader → Peter_Tschentscher → residence → Hamburg [Figure: example graph with nodes Germany, Angela_Merkel, Hamburg, Peter_Tschentscher and edges country, leader, birthPlace, residence]
  • 23. 03/18/22 Heiko Paulheim 23 Order-Aware RDF2vec • Using an order-aware variant of word2vec • Experimental results: – order-aware RDF2vec most often outperforms classic RDF2vec – somewhat more computationally expensive, but still scales to DBpedia etc. Ling et al. (2015): Two/Too Simple Adaptations of Word2Vec for Syntax Problems.
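What "order-aware" changes can be illustrated at the level of training pairs. In classic skip-gram, the context of a token is an unordered bag; in the structured variant of Ling et al., each relative position gets its own output weights. The sketch below only mimics that by tagging each context token with its offset; the function name `skipgram_pairs` and the window size are assumptions, and no model is trained.

```python
def skipgram_pairs(walk, window, order_aware=False):
    """List (target, context) pairs; order-aware contexts carry their offset."""
    pairs = []
    for i, target in enumerate(walk):
        lo, hi = max(0, i - window), min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            context = (walk[j], j - i) if order_aware else walk[j]
            pairs.append((target, context))
    return pairs

walk = ["Hamburg", "country", "Germany"]
print(skipgram_pairs(walk, window=1))
print(skipgram_pairs(walk, window=1, order_aware=True))
```

With the offset attached, "Germany one step after country" and "Germany one step before leader" become different prediction targets, which matters in a directed graph walk.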
  • 24. 03/18/22 Heiko Paulheim 24 Similarity vs. Relatedness • Exploiting different notions of proximity – Use case: table interpretation (a special case of entity disambiguation) [Figure: table interpretation example with parts labeled "related" and "similar"]
  • 25. 03/18/22 Heiko Paulheim 25 Similarity vs. Relatedness in Graph Walks • Which parts of a walk denote what? – Hamburg → country → Germany → leader → Angela_Merkel – Germany → leader → Angela_Merkel → birthPlace → Hamburg – Hamburg → leader → Peter_Tschentscher → residence → Hamburg – California → leader → Gavin_Newsom → birthPlace → San_Francisco • Common predicates (leader, birthPlace) – Similar entities • Common entities (Hamburg) – Related entities – For same-class entities: similar entities! Portisch and Paulheim (under review): Walk this Way! Entity Walks and Property Walks for RDF2vec.
  • 26. 03/18/22 Heiko Paulheim 26 Similarity vs. Relatedness in Graph Walks • Given that observation: – Common predicates (leader, birthPlace) • Similar classes – Common entities (Hamburg) • Related entities • For same-class entities: similar entities! • ...we should be able to learn tailored embeddings – using walks of predicates → embedding space encodes similarity – using walks of entities → embedding space encodes relatedness Portisch and Paulheim (under review): Walk this Way! Entity Walks and Property Walks for RDF2vec.
  • 27. 03/18/22 Heiko Paulheim 27 Similarity vs. Relatedness in Graph Walks • Classic RDF2vec walks: – Germany → leader → Angela_Merkel → birthPlace → Hamburg • p-walk (predicates only except for focus entity) – country → leader → Angela_Merkel → birthPlace → mayor • e-walk (entities only) – Berlin → Germany → Angela_Merkel → Hamburg → Elbphilharmonie Portisch and Paulheim (under review): Walk this Way! Entity Walks and Property Walks for RDF2vec.
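The three walk types can be sketched on the small example graph used in these slides. This is a simplified illustration: the function names are hypothetical, and the p-walk and e-walk are derived here by filtering one classic walk, whereas the examples on the slide cover more hops for the same number of tokens.

```python
import random
from collections import defaultdict

TRIPLES = [
    ("Hamburg", "country", "Germany"),
    ("Germany", "leader", "Angela_Merkel"),
    ("Angela_Merkel", "birthPlace", "Hamburg"),
    ("Hamburg", "leader", "Peter_Tschentscher"),
    ("Peter_Tschentscher", "residence", "Hamburg"),
]

def random_walk(triples, start, hops, rng):
    """Classic walk: entity, predicate, entity, ... following outgoing edges."""
    outgoing = defaultdict(list)
    for s, p, o in triples:
        outgoing[s].append((p, o))
    walk = [start]
    for _ in range(hops):
        edges = outgoing[walk[-1]]
        if not edges:
            break
        p, o = rng.choice(edges)
        walk += [p, o]
    return walk

def e_walk(walk):
    """Entities only: the even positions of an alternating walk."""
    return walk[0::2]

def p_walk(walk, focus_index):
    """Predicates only, except for the entity at `focus_index`."""
    return [tok for i, tok in enumerate(walk) if i % 2 == 1 or i == focus_index]

classic = ["Germany", "leader", "Angela_Merkel", "birthPlace", "Hamburg"]
print(e_walk(classic))
print(p_walk(classic, focus_index=2))
print(random_walk(TRIPLES, "Hamburg", hops=2, rng=random.Random(0)))
```

Each list of tokens is then treated as one "sentence" for word2vec, so the choice of walk type decides which tokens can ever co-occur with the focus entity.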
  • 28. 03/18/22 Heiko Paulheim 28 The RDF2vec Zoo • We now have an entire zoo of RDF2vec variants – SG vs. CBOW – Order-aware vs. unordered (“classic”) – Classic walks vs. e-walks vs. p-walks
  • 29. 03/18/22 Heiko Paulheim 29 The RDF2vec Zoo – Preliminary Evaluation • Classic is usually quite good • Order-aware (oa) variants are often superior
  • 30. 03/18/22 Heiko Paulheim 30 The RDF2vec Zoo – Breeding New Embeddings • Preliminary results are promising when combining embeddings • Classic random walks: – Adler_Mannheim → city → Mannheim → country → Germany – Adler_Mannheim → stadium → SAP_Arena → location → Mannheim – SAP_Arena → location → Mannheim → country → Germany – ... • p-walks: – city → Mannheim → country – stadium → location → Mannheim → country – location → Mannheim → federal_state → location – ... • e-walks: – Adler_Mannheim → Mannheim → Germany – Adler_Mannheim → SAP_Arena → Mannheim → Germany – SAP_Arena → Mannheim → Baden-Württemberg → Germany – ... [Figure: the three embeddings form a concatenated vector, followed by a global PCA]
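Concatenation followed by PCA can be sketched with plain NumPy. The sketch assumes that "global" means the PCA is fitted once over all entities; the function name `concat_and_reduce`, the dimensions, and the random stand-in vectors are illustrative only.

```python
import numpy as np

def concat_and_reduce(spaces, entities, n_components):
    """Concatenate per-entity vectors from several spaces, then apply PCA."""
    blocks = [np.vstack([space[e] for e in entities]) for space in spaces]
    X = np.hstack(blocks)                      # shape: (entities, sum of dims)
    X = X - X.mean(axis=0)                     # center each column
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:n_components].T             # project onto the top axes

# Random stand-ins for classic, p-walk, and e-walk embeddings (4 dims each).
rng = np.random.default_rng(0)
entities = ["Adler_Mannheim", "SAP_Arena", "Mannheim",
            "Baden-Württemberg", "Germany", "Hamburg"]
spaces = [{e: rng.normal(size=4) for e in entities} for _ in range(3)]

reduced = concat_and_reduce(spaces, entities, n_components=3)
print(reduced.shape)
```

A task-specific ("local") variant would fit the projection only on the entities that occur in the task data, optionally after scaling each block by a weight.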
  • 31. 03/18/22 Heiko Paulheim 31 The RDF2vec Zoo – Breeding New Embeddings • Combinations can be task specific – Based on general embeddings – Combination can pick up task-specific signals • Classic random walks: – Adler_Mannheim → city → Mannheim → country → Germany – Adler_Mannheim → stadium → SAP_Arena → location → Mannheim – SAP_Arena → location → Mannheim → country → Germany – ... • p-walks: – city → Mannheim → country – stadium → location → Mannheim → country – location → Mannheim → federal_state → location – ... • e-walks: – Adler_Mannheim → Mannheim → Germany – Adler_Mannheim → SAP_Arena → Mannheim → Germany – SAP_Arena → Mannheim → Baden-Württemberg → Germany – ... [Figure: concatenated vector with weights w1, w2, w3; (weighted) local PCA; task data]
  • 32. 03/18/22 Heiko Paulheim 32 Which Classes can be Learned with RDF2vec? • We already saw that there are different notions of classes • Idea: compile a list of class definitions as a benchmark – Classes are expressed as DL formulae, e.g. – ∃r.⊤, e.g., the class of persons with children – ∃r.{e}, e.g., the class of persons born in New York City – ∃R.{e}, e.g., the class of persons with any relation to New York City – ∃r.C, e.g., the class of persons playing in a basketball team – … • First attempt: – Create SPARQL queries in DBpedia to get positive and negative examples – Train binary classifier on different embeddings
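The four class patterns on this slide (some value for relation r, relation r to a fixed entity e, any relation to e, relation r to an instance of class C) can be read as membership tests over a set of triples. The sketch below is an assumption-laden toy, not the benchmark itself: it runs on an in-memory list instead of SPARQL against DBpedia, it only considers outgoing relations, and all names (`Alice`, `Team_X`, the predicate `type`, the function names) are hypothetical.

```python
# Invented toy triples; not taken from DBpedia.
TRIPLES = [
    ("Alice", "child", "Carol"),
    ("Alice", "birthPlace", "New_York_City"),
    ("Bob", "residence", "New_York_City"),
    ("Dave", "team", "Team_X"),
    ("Team_X", "type", "BasketballTeam"),
]

def exists_r_top(triples, r):
    """Subjects with at least one value for relation r."""
    return {s for s, p, _ in triples if p == r}

def exists_r_e(triples, r, e):
    """Subjects linked to the fixed entity e via relation r."""
    return {s for s, p, o in triples if p == r and o == e}

def exists_any_e(triples, e):
    """Subjects linked to e via any (outgoing) relation."""
    return {s for s, _, o in triples if o == e}

def exists_r_c(triples, r, c, type_pred="type"):
    """Subjects linked via r to some instance of class c."""
    instances = exists_r_e(triples, type_pred, c)
    return {s for s, p, o in triples if p == r and o in instances}

persons = {"Alice", "Bob", "Carol", "Dave"}
positives = exists_r_top(TRIPLES, "child")
negatives = persons - positives            # candidates for negative examples
print(sorted(positives), sorted(negatives))
```

The positive and negative sets would then be looked up in each embedding space and passed to a binary classifier.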
  • 33. 03/18/22 Heiko Paulheim 33 Which Classes can be Learned with RDF2vec? • Formulating hypotheses – e.g., ∃r.⊤ cannot be learned when using e-walks • Testing hypotheses – using queries against DBpedia • Seeing surprises – e.g., models trained on e-walks can reach ~90% accuracy in that case
  • 34. 03/18/22 Heiko Paulheim 34 Which Classes can be Learned with RDF2vec? • Challenge: isolating effects – Let’s consider ∃r.⊤, e.g., ∃almaMater.⊤ – In theory, we should not be able to learn this with e-walks – Frequent entities in the neighborhoods of positive examples: • Politician (3k examples) • Bachelor of Arts (3k examples) • Harvard Law School (2k examples) • Lawyer (2k examples) • Northwestern University (2k examples) • Harvard University (2k examples) • Doctor of Philosophy (2k examples) • … – Those signals are visible to e-walks!
  • 35. 03/18/22 Heiko Paulheim 35 Which Classes can be Learned with RDF2vec? • Maybe DBpedia is not such a great testbed – Hidden patterns, e.g., for relation co-occurrence – Many inter-pattern dependencies – Information not missing at random • Possible solution: – Synthetic knowledge graphs! – First experiments show better visibility of expected effects
  • 36. 03/18/22 Heiko Paulheim 36 Alternatives to Understand KG Embeddings • Approach 1: learn symbolic interpretation function for dimensions • Each dimension of the embedding model is a target for a separate learning problem • Learn a function to explain the dimension • E.g.: y ≈ −|∃character.Superhero| • Just an approximation used for explanations and justifications [Figure labels: cartoon, superhero]
  • 37. 03/18/22 Heiko Paulheim 37 Alternatives to Understand KG Embeddings • Approach 2: learn symbolic substitute function for similarity function Right-hand side picture: RelFinder (https://interactivesystems.info/developments/relfinder)
  • 38. 03/18/22 Heiko Paulheim 38 Alternatives to Understand KG Embeddings • Approach 3: generate symbolic interpretations for individual predictions – Inspired by LIME: • Generate perturbed examples • Label them using embedding+downstream classifier • Learn symbolic model on this labeled set – Good news: • RDF2vec can, in principle, create embeddings for unseen entities • Those can be used to classify perturbed examples https://c3.ai/glossary/data-science/lime-local-interpretable-model-agnostic-explanations/
  • 39. 03/18/22 Heiko Paulheim 39 Summary • Knowledge Graph Embeddings with RDF2vec – Encode similarity and relatedness • Explicit trade-off is possible! – Variations visited: walk extraction, order-awareness, materialization, ... – Additional insights that are not explicit in the graph • aka latent semantics
  • 40. 03/18/22 Heiko Paulheim 40 More on RDF2vec • Collection of – Implementations – Pre-trained models – >45 use cases in various domains
  • 41. 03/18/22 Heiko Paulheim 41 Thank you! http://www.heikopaulheim.com @heikopaulheim