Beyond Linked Data –
Exploiting Entity-Centric Knowledge on the Web
Stefan Dietze
L3S Research Center, Hannover, Germany
- Linked Data on the Web (LDOW2017), WWW2017 -
05/04/17 1Stefan Dietze
Research areas
 Web science, Information Retrieval, Semantic Web, Social Web
Analytics, Knowledge Discovery, Human Computation
 Interdisciplinary application areas: digital humanities,
TEL/education, Web archiving, mobility, ...
Some projects
Research @ L3S
 See also: http://www.l3s.de
Acknowledgements: team
 Pavlos Fafalios (L3S)
 Besnik Fetahu (L3S)
 Elena Demidova (L3S)
 Ujwal Gadiraju (L3S)
 Eelco Herder (L3S)
 Ivana Marenzi (L3S)
 Nicolas Tempelmeier (L3S)
 Ran Yu (L3S)
 Nilamadhaba Mohapatra (L3S, IIT India)
 Bernardo Pereira Nunes (L3S, PUC Rio de Janeiro)
 Mathieu d’Aquin (The Open University, UK)
 Mohamed Ben Ellefi (LIRMM, France)
 Davide Taibi (CNR, Italy)
 Konstantin Todorov (LIRMM, France)
 ...
Back in September 2016
A new look at the semantic web. Abraham
Bernstein, James Hendler, Natalya Noy,
Communications of the ACM, Vol. 59 No. 9, Pages 35-
37, September 2016
Retrieval, Crawling and Fusion of Entity-centric Data
on the Web, Dietze, S., in: Calì A., Gorgan D.,
Ugarte M. (eds) Semantic Keyword-Based Search on
Structured Data Sources. KEYSTONE 2016. LNCS, Vol
10151. Springer, 2017.
Overview
I – Challenges
II – Enabling discovery & search in Linked Data & Knowledge Graphs
 Dataset recommendation
 Dataset profiling
 Entity retrieval
III – Beyond Linked Data – exploiting embedded Web semantics
 Web markup as emerging data source
 Case studies
 Data fusion for entity reconciliation (and retrieval)
IV – Wrap-up
Other emerging forms of
semantics/structured data on
the Web („Future“)
Dealing with heterogeneity &
shortcomings („Present“)
Data accessibility & quality?
SPARQL endpoint availability over time [Buil-Aranda et al 2013]
Accessibility of (linked) datasets?
 Less than 50% of all SPARQL endpoints are actually responsive at any given point in time [Buil-Aranda2013]
 “THE” SPARQL protocol? No, but variants, subsets and local restrictions
Semantics, links, quality?
 …data accuracy (e.g. DBpedia)? [Paulheim2013]
 …schema compliance & evolution [HoganJWS2012]
 …vocabulary reuse? [D’AquinWebSci13]
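A minimal sketch of how endpoint responsiveness could be probed, assuming the endpoint accepts a SPARQL query via HTTP GET in the `query` parameter. The function names, the probe query and the example endpoint URL are illustrative assumptions, not the setup used in [Buil-Aranda2013]; given the protocol variants and local restrictions noted above, a real survey would need more than one probe.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: a trivial ASK query is enough to tell "responsive" from "not responsive".
PROBE_QUERY = "ASK { ?s ?p ?o }"


def build_probe_url(endpoint: str) -> str:
    """Return a GET URL that sends the probe query to the given endpoint."""
    separator = "&" if "?" in endpoint else "?"
    return endpoint + separator + urlencode({"query": PROBE_QUERY})


def is_responsive(endpoint: str, timeout: float = 10.0) -> bool:
    """True if the endpoint answers the probe with HTTP 200 within the timeout."""
    try:
        request = Request(
            build_probe_url(endpoint),
            headers={"Accept": "application/sparql-results+json"},
        )
        with urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except (OSError, ValueError):
        # Network errors, HTTP errors, timeouts and malformed URLs all count as unresponsive.
        return False


# Hypothetical endpoint URL, used only to show the generated request.
probe_url = build_probe_url("http://example.org/sparql")
```

Running `is_responsive` repeatedly over a list of endpoints and over time would give an availability curve of the kind shown on this slide.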
Assessing the Educational Linked Data Landscape, D’Aquin, M., Adamou, A.,
Dietze, S., ACM Web Science 2013 (WebSci2013), Paris, France, May 2013.
Type Inference on Noisy RDF Data, Paulheim H., Bizer, C. Semantic Web – ISWC
2013, Lecture Notes in Computer Science Volume 8218, 2013, pp 510-525
An empirical survey of Linked Data conformance. Hogan, A., Umbrich, J., Harth,
A., Cyganiak, R., Polleres, A., Decker., S., Journal of Web Semantics 14, 2012
SPARQL Web-Querying Infrastructure: Ready for Action?, Carlos Buil-Aranda,
Aidan Hogan, Jürgen Umbrich, Pierre-Yves Vandenbussche, International
Semantic Web Conference 2013 (ISWC2013).
Co-occurrence of
types
(in 146 datasets:
144 vocabularies,
588 overlapping
types, 719
predicates)
Assessing the Educational Linked Data Landscape,
D’Aquin, M., Adamou, A., Dietze, S., ACM Web
Science 2013 (WebSci2013), Paris, May 2013.
po:Programme
yov:Video
?
bibo:Book
Vocabulary reuse/linking?
typeX
typeX
Co-occurrence after
mapping
(201 frequently
occurring types,
mapped into 79 types)
bibo:Film
bibo:Document
po:Programme
bibo:Book
foaf:Document
yov:Video
typeX
“Completeness” ?
 Example: varying completeness of “book” (“movie”) entity
descriptions
 Missing facts: 49.8% (37.1%) in DBpedia, 63.8% (23.3%) in
Freebase and 60.9 % (40%) in Wikidata
(varies heavily across attributes)
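As a rough illustration of how a "missing facts" percentage like the ones above can be computed, the sketch below counts empty attribute slots over a set of entity descriptions. The function name, the attribute list and the toy entities are invented for illustration; the cited studies define their own attribute sets and ground truth.

```python
def missing_fact_ratio(entities, attributes):
    """Share of (entity, attribute) slots that carry no value, in [0, 1]."""
    slots = len(entities) * len(attributes)
    if slots == 0:
        return 0.0
    missing = sum(
        1 for entity in entities for attribute in attributes if not entity.get(attribute)
    )
    return missing / slots


# Toy "book" descriptions (invented, not taken from DBpedia, Freebase or Wikidata).
books = [
    {"name": "Book A", "author": "Author A", "isbn": None},
    {"name": "Book B", "author": None, "isbn": None},
]
ratio = missing_fact_ratio(books, ["name", "author", "isbn"])
```

Here three of the six slots are empty, so half of the expected facts are missing.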
Yu, R., Fetahu, B., Gadiraju, U., Dietze, S., FuseM:
Query-Centric Data Fusion on Structured Web
Markup, ICDE2017.
Yu, R., Fetahu, B., Gadiraju, U., Lehmberg, O., Ritze,
D., Dietze, S., KnowMore - Knowledge Base
Augmentation with Structured Web Markup,
Semantic Web Journal 2017, under review.
Consistency?
Analyzing Relative Incompleteness of Movie
Descriptions in the Web of Data: A Case Study,
Yuan, W., Demidova, E., Dietze, S., Zhu, X.,
ISWC2014
Challenge for search/retrieval – heterogeneity of datasets & entities
Discovery of suitable (1) datasets & (2) entities:
 Quality? Currentness, dynamics, accessibility/reliability,
data quantity & quality?
 Topics/scope? Datasets/entities useful & trustworthy for
topic XY?
 Types? Datasets/entities about statistics, organisations,
videos, slides, publications etc?
Overview
I – Challenges
II – Enabling discovery & search in Linked Data & Knowledge Graphs
 Dataset recommendation
 Dataset profiling
 Entity retrieval
III – Beyond Linked Data – exploiting embedded Web semantics
 Web markup as emerging data source
 Case studies
 Data fusion for entity reconciliation (and retrieval)
IV – Wrap-up
Other emerging forms of
semantics/structured data on
the Web („Future“)
Dealing with heterogeneity &
shortcomings („Now“)
Dataset recommendation I
S
Linkset1
Linkset2
Approach
 Given dataset s, ranking datasets from D
according to probability score (di, t) to
contain linking candidates (entities)
 Features:
 Approach 1: vocabulary overlap
 Approach 2: existing links (SNA)
 Linking candidates likely if datasets share
common (a) schema elements, or (b) links
(friend of a friend)
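The vocabulary-overlap idea (Approach 1) can be sketched as a ranking by shared schema terms. This is a simplified stand-in, assuming Jaccard similarity over sets of schema terms; the paper's actual probability score and features may differ, and the dataset names and term sets below are invented.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two term sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_by_vocabulary_overlap(source_terms, candidates):
    """Rank candidate datasets by schema-term overlap with the source dataset."""
    scored = [(name, jaccard(source_terms, terms)) for name, terms in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Invented schema-term sets for a source dataset and three candidates.
source = {"foaf:Person", "dc:title", "bibo:Book"}
candidates = {
    "dataset_a": {"foaf:Person", "dc:title", "bibo:Book", "dc:creator"},
    "dataset_b": {"po:Programme", "dc:title"},
    "dataset_c": {"geo:lat", "geo:long"},
}
ranking = rank_by_vocabulary_overlap(source, candidates)
```

The candidate sharing most schema elements with the source ends up at rank 1, which mirrors the assumption that shared schema elements make linking candidates likely.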
Conclusions
 Roughly 50% MAP for both approaches
 Simplistic approach (!)
Lopes, G.R., Paes Leme, L. A., Nunes, B.P., Casanova,
M.A., Dietze, S., Two approaches to the dataset
interlinking recommendation problem, 15th
International Conference on Web Information System
Engineering (WISE 2014), Thessaloniki, Greece.
Rank
1 DBLP
2 ACM
3 OAI
4 CiteSeer
5 IBM
6 Roma
7 IEEE
8 Ulm
9 Pisa
Goal: finding candidate datasets, e.g. for entity retrieval
or interlinking tasks (eg enrichment)
Dataset recommendation II
Ben Ellefi, M., Bellahsene, Z., Dietze, S., Todorov, K.,
Intension-based Dataset Recommendation for Data
Linking, 13th Extended Semantic Web Conference
(ESWC2016), Heraklion, Crete, May, 2016, ESWC2016
L. Han, A. L. Kashyap, T. Finin, J. Mayfield, and J. Weese, "UMBC_EBIQUITY-CORE: Semantic Textual Similarity Systems", in Proc. of *SEM, Association for Computational Linguistics, 2013.
Preprocessing | Datasets ranking | Datasets filtering
Dataset recommendation II: results
Data & ground truth
 Experiments on (responsive) datasets
from LOD Cloud (http://datahub.io)
 Concept profiles from
http://lov.okfn.org
 Ground truth: existing links from VOID
profiles of datasets
(issue: not always representative for
actual linksets)
Results
 MAP for different similarity thresholds
from step 2 max. 54% (UMBC@0.7)
 Recall 100% below indicated similarity
(clustering) thresholds
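For readers unfamiliar with the MAP figures quoted here, the sketch below shows the standard definition of (mean) average precision over ranked recommendation lists. It assumes binary relevance against a ground-truth set; the example rankings and relevance sets are invented and do not reproduce the reported results.

```python
def average_precision(ranked, relevant):
    """Average of precision@k over the ranks k at which a relevant item appears."""
    hits, total = 0, 0.0
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """Mean of average precision over (ranked list, relevant set) pairs."""
    if not runs:
        return 0.0
    return sum(average_precision(ranked, relevant) for ranked, relevant in runs) / len(runs)


# Invented example: two source datasets with their recommended rankings and ground truth.
runs = [
    (["DBLP", "ACM", "OAI"], {"DBLP", "OAI"}),
    (["IEEE", "CiteSeer"], {"CiteSeer"}),
]
map_score = mean_average_precision(runs)
```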
Ben Ellefi, M., Bellahsene, Z., Dietze, S., Todorov, K.,
Intension-based Dataset Recommendation for Data
Linking, 13th Extended Semantic Web Conference
(ESWC2016), Heraklion, Crete, May, 2016, ESWC2016
Dataset search through dataset cataloging & profiling
Dataset
Catalog/Registry
http://data.linkededucation.org/linkedup/catalog/
 LinkedUp project (FP7 project: L3S, OU, OKFN, Elsevier, Exact Learning solutions)
 LinkedUp Catalog: largest collection of LD of educationally relevant resources (approx. 50 Datasets)
 Original datasets published with key content providers, automatically extracted metadata
LinkedUp Catalog: dataset index & registry, federated search
 “Federated queries” through schema mappings [WebSci13]
 Dataset accessibility
 Linking & topic profiling
Schema/Types
http://data.linkededucation.org/linkedup/catalog/
LinkedUp Catalog: dataset index & registry, federated search
 “Federated queries” through schema mappings [WebSci13]
 Dataset accessibility
 Linking & topic profiling [ESWC14]
Dataset topic
profiles
http://data.linkededucation.org/linkedup/catalog/
db:Biology
db:Cell biology
Dataset
Catalog/Registry
yov:Video
<yo:Video …>
<dc:title>Lecture 29 –
Stem Cells</dc:title>
…
</yo:Video…>
Yovisto Video
 Extraction of representative (DBpedia) categories („topic profile“) for arbitrary datasets ?
 Technically trivial through established NER/NED approaches, but scalability issues
(recall: LOD Cloud 1000+ datasets with <100 billion RDF statements)
 Efficient approach: sampling & ranking to balance scalability against precision/recall
Scalable profiling of datasets
A Scalable Approach for Efficiently Generating
Structured Dataset Topic Profiles, Fetahu, B.,
Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl,
W., 11th Extended Semantic Web Conference
(ESWC2014), Crete, Greece, (2014).
db:Cell
(Biology)
Efficient dataset profiling
1. Sampling of resources
(random sampling, weighted sampling, resource
centrality sampling)
2. Entity- & topic-extraction (NER via DBpedia Spotlight,
category mapping & -expansion)
3. Normalisation & ranking (graph-based models such as
PageRank with Priors, HITS with Priors & K-Step Markov)
 Result: weighted dataset-topic profile graph
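Step 3 can be illustrated with a small power-iteration sketch of PageRank with priors over a resource-topic graph. This is a simplified variant under stated assumptions: priors sit on the sampled resources, dangling nodes return their mass to the prior distribution, and the graph, node names and damping value are invented. It is not the implementation evaluated in the paper.

```python
def pagerank_with_priors(graph, priors, damping=0.85, iterations=50):
    """Power iteration where the teleport step follows the prior distribution."""
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    prior_sum = sum(priors.get(n, 0.0) for n in nodes)
    if prior_sum <= 0:
        raise ValueError("at least one node needs a positive prior")
    prior = {n: priors.get(n, 0.0) / prior_sum for n in nodes}
    rank = dict(prior)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * prior[n] for n in nodes}
        for node in nodes:
            targets = graph.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    nxt[target] += share
            else:
                # Assumption: dangling nodes hand their mass back according to the priors.
                for n in nodes:
                    nxt[n] += damping * rank[node] * prior[n]
        rank = nxt
    return rank


# Invented sample: two resources annotated with DBpedia-style categories.
graph = {"r1": ["db:Biology", "db:Cell_biology"], "r2": ["db:Cell_biology"]}
scores = pagerank_with_priors(graph, priors={"r1": 1.0, "r2": 1.0})
```

The topic that is reached from both sampled resources receives the higher weight, which is the kind of signal a weighted dataset-topic profile builds on.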
A Scalable Approach for Efficiently Generating
Structured Dataset Topic Profiles, Fetahu, B.,
Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl,
W., 11th Extended Semantic Web Conference
(ESWC2014), Crete, Greece, (2014).
Search & exploration of datasets through topic profiles
 Applied to entire LOD cloud/graph
 Visual exploration of extracted RDF dataset profiles
(datasets, topics, relationships)
 Evaluation results: K-Step Markov (10% sampling size)
outperforms baselines (LDA, tf/idf on entire datasets)
http://data-observatory.org/lod-profiles/
Search: entity retrieval on large LD crawls?
 How to efficiently retrieve (related) entities/resources for given entity-seeking (keyword) query?
 State of the art: BM25F on inverted entity index (Blanco et al, ISWC2011)
 Challenges/observations:
 Explicit entity links (owl:sameAs etc) are sparse yet important to facilitate state of the art methods
 Query type affinity?
Large dataset/crawl
e.g. LinkedUp dataset graph, BTC2014, Dynamic LD Observatory
entities related to <Tim Berners Lee>
?
BTC2014
DyLDO
Entity retrieval: approach
(I) Offline processing (clustering to address link sparsity)
1. Feature vectors (lexical and structural features)
2. Bucketing: per type (LSH algorithm)
3. Clustering: X-means & Spectral clustering per bucket
Improving Entity Retrieval on Structured Data,
Fetahu, B., Gadiraju, U., Dietze, S., 14th
International Semantic Web Conference
(ISWC2015), Bethlehem, US, (2015).
(II) Online processing (retrieval)
1. Retrieval & expansion:
a) BM25F results
b) expansion from clusters (related entities)
2. Re-Ranking
(context terms & query type affinity)
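The online phase can be sketched as follows: take the baseline (BM25F) result list, expand it with members of the same offline clusters, then re-rank. The re-ranking score used here (plain query-term overlap with an entity label) is a deliberately simple stand-in for the context-term and query-type-affinity features named above; all identifiers and data are invented.

```python
def expand_with_clusters(baseline_results, cluster_of, members_of, limit=10):
    """Append cluster co-members after each baseline hit, without duplicates."""
    expanded, seen = [], set()
    for entity in baseline_results:
        co_members = sorted(members_of.get(cluster_of.get(entity), []))
        for candidate in [entity] + co_members:
            if candidate not in seen:
                seen.add(candidate)
                expanded.append(candidate)
    return expanded[:limit]


def rerank(query, entities, labels):
    """Stand-in re-ranker: order by number of query terms found in the label."""
    terms = set(query.lower().split())

    def overlap(entity):
        return len(terms & set(labels.get(entity, "").lower().split()))

    # sorted() is stable, so ties keep their order from the expansion step.
    return sorted(entities, key=overlap, reverse=True)


# Invented toy index: e1 and e3 share a cluster, e2 is a separate baseline hit.
cluster_of = {"e1": "c1", "e2": "c2", "e3": "c1"}
members_of = {"c1": ["e1", "e3"], "c2": ["e2"]}
labels = {"e1": "Tim Berners-Lee", "e2": "Tim Example", "e3": "Web Consortium"}

expanded = expand_with_clusters(["e1", "e2"], cluster_of, members_of)
final = rerank("tim berners-lee", expanded, labels)
```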
Dataset
 BTC2014 (4 billion entities)
 92 SemSearch queries
Methods
 Our approaches: XM: Xmeans, SP: Spectral
 Baselines B: BM25F, S1: Tonon et al [SIGIR12]
Conclusions
 XM & SP outperform baselines
 Clustering to remedy link sparsity
(yet extensive offline processing required)
 Relevance to query more important than
relevance to BM25F results
Entity retrieval: evaluation
Improving Entity Retrieval on Structured Data,
Fetahu, B., Gadiraju, U., Dietze, S., 14th
International Semantic Web Conference
(ISWC2015), Bethlehem, US, (2015).
PROFILES2017 - Profiling & search of Linked Data
https://profiles2017.wordpress.com/
• Probably co-located with ISWC2017 (Vienna)
• Submissions due 21 June
Overview
I – Challenges
II – Enabling discovery & search in Linked Data & Knowledge Graphs
 Dataset recommendation
 Dataset profiling
 Entity retrieval
III – Beyond Linked Data – exploiting embedded Web semantics
 Web markup as emerging data source
 Case studies
 Data fusion for entity reconciliation (and retrieval)
IV – Wrap-up
Other emerging forms of
structured data on the Web
(„Future“)?
Dealing with heterogeneity &
shortcomings („Present“)
 Linked Data: approx.
1000+ datasets & 100 billion statements
 Open Data: XXX datasets
Web semantics & entity-centric Web data
 Web (of documents):
approx. 46.000.000.000.000 (46 trillion)
Web pages indexed by Google
 Other forms of Web semantics
and entity-centric knowledge?
 Dynamics?
 Quality?
 Accessibility?
 Scale?
 Embedded markup (RDFa, Microdata, Microformats) for
interpretation of Web documents (search, retrieval)
 Arbitrary vocabularies; schema.org used at scale:
(700 classes, 1000 predicates)
 Adoption on the Web: 26 %
(2014 Google study of 12 bn Web pages)
 “Web Data Commons” (Meusel & Paulheim [ISWC2014])
• Markup from Common Crawl (3.2 billion pages):
44 billion RDF quads (2016)
• Markup in 38% of pages in 2016
 Same order of magnitude as “the Web” (!)
Embedded Web page markup & schema.org
<div itemscope itemtype="http://schema.org/Movie">
<h1 itemprop="name">Forrest Gump</h1>
<span>Actor: <span itemprop="actor">Tom Hanks</span></span>
<span itemprop="genre">Drama</span>
...
</div>
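To show how such markup turns into statements, here is a minimal sketch that reads `itemprop` values from the snippet with Python's standard `html.parser`. It assumes a single flat item whose properties are plain text and uses a made-up node id `node1`; it is not a conforming Microdata parser and not the extraction pipeline used by Web Data Commons.

```python
from html.parser import HTMLParser


class MicrodataSketch(HTMLParser):
    """Collects (node, property, text) statements from one flat itemscope."""

    def __init__(self):
        super().__init__()
        self.item_type = None
        self.pending_property = None
        self.statements = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "itemtype" in attributes:
            self.item_type = attributes["itemtype"]
        if "itemprop" in attributes:
            self.pending_property = attributes["itemprop"]

    def handle_data(self, data):
        text = data.strip()
        if self.pending_property and text:
            # "node1" is a placeholder id for the single item in this sketch.
            self.statements.append(("node1", self.pending_property, text))
            self.pending_property = None

    def handle_endtag(self, tag):
        self.pending_property = None


SNIPPET = """
<div itemscope itemtype="http://schema.org/Movie">
  <h1 itemprop="name">Forrest Gump</h1>
  <span>Actor: <span itemprop="actor">Tom Hanks</span></span>
  <span itemprop="genre">Drama</span>
</div>
"""

parser = MicrodataSketch()
parser.feed(SNIPPET)
parser.close()
statements = parser.statements
```

Note that every value comes out as a string literal, which is one reason markup-derived descriptions tend to be flat and unlinked.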
RDF statements
node1 actor _node-x
node1 actor Robin Wright
node1 genre Comedy
node2 actor T. Hanks
node2 distributed by Paramount Pic.
node3 actor Tom Cruise
node3 distributed by Paramount Pic.
http://webdatacommons.org
 schema:Product instances in WDC2015
 Facts: 1.414.937.431
(= 302.246.120 instances, i.e. products)
 Providers (distinct Pay Level Domains, PLDs): 93.705
 Power law distribution of terms across PLDs
 Top 10 PLDs
 Top provider ? (company)
Example: embedded Web markup about „products“
PLD                          # Resources
www.crateandbarrel.com       33.517.936
www.bentgate.com             17.215.499
www.aliexpress.com            9.621.943
www.ebay.com.au               8.861.308
us.fotolia.com                7.939.982
www.ebay.co.uk                6.556.820
www.competitivecyclist.com    6.214.500
www.maxstudio.com             6.075.626
approx. 35 million resources
[Figure: # entities and # statements per PLD; x-axis: PLD (ranked), y-axis: count (log scale)]
Study on sample Web crawl (WDC2015)
 Metadata about scholarly articles (e.g.
s:ScholarlyArticle): 6.793.764 quads, 1.184.623
entities, 429 distinct predicates
(in WDC and for 1 type alone)
 Top 5 domains: Springer, MDPI, BMJ,
mendeley.com, Biodiversitylibrary.org
Domains, topics, disciplines?
 Life Sciences and Computer Science predominant
 Top-10 article titles
 Noise
Example: markup of bibliographic resources
Sahoo, P., Gadiraju, U., Yu, R., Saha, S., Dietze, S.,
Analysing Structured Scholarly Data embedded in Web
Pages, SAVE-SD2016, co-located with the WWW2016
Example: markup of learning resources on the Web
 “Learning Resources Metadata Initiative (LRMI)”:
schema.org vocabulary for annotation of learning
resources
 Developed through DCMI Task Force on LRMI
 Approx. 5000 PLDs (incl. subdomains) in CC
 LRMI adoption (WDC) [WWW17]:
 2015: 44,108,511 quads
 2014: 30,599,024 quads
 2013: 10.636873 quads
05/04/17 33
Dietze, S., Taibi, D., Yu, R., Barker, P., d’Aquin, M., Analysing and
Improving embedded Markup of Learning Resources on the
Web, 26th International World Wide Web Conference
(WWW2017), Digital Learning track, Perth, April 2017.
 Frequent errors and unintended use (e.g. porn)
7xxxtube.com
1amateurporntube.com
virtualpornstars.com
sunriseseniorliving.com
simplyfinance.co.uk
menslifestyles.com
audiobooks.com
simplypsychology.org
helles-koepfchen.de
Entity retrieval on Web markup: state of the art
 Glimmer
(http://glimmer.research.yahoo.com)
 Entity retrieval on WDC dataset
[Blanco, Mika & Vigna, ISWC2011]
 BM25F retrieval model on WDC index
Web markup: challenges
Characteristics: Example
Coreferences: 18.000 results for <„Iphone 6“, type, s:Product> (8,6 quads on average) in CommonCrawl
Redundancy: <s, schema:name, „Iphone 6“> occurring 1000 times in CC
Lack of links: Largely unlinked entity descriptions
Errors (typos & schema violations, see Meusel et al [ESWC2015]):
 Wrong namespaces, such as http://schma.org
 Undefined types & predicates: 9,7 %, less common than in LOD
 Confusion of datatype and object properties: <s1, s:publisher, „Springer“>, 24,35 % object property issues vs 8 % in LOD
 Data property range violations: e.g. literals vs numbers (12,6 % vs 4,6 % in LOD)
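One of the listed error classes, misspelled namespaces, lends itself to a simple repair heuristic. The sketch below maps a host to its closest known vocabulary host using fuzzy string matching from Python's `difflib`. The list of known hosts and the cutoff are assumptions chosen for illustration; this is not the cleansing method of Meusel et al.

```python
from difflib import get_close_matches
from urllib.parse import urlsplit

# Assumption: the vocabularies we are willing to repair towards.
KNOWN_HOSTS = ["schema.org"]


def repair_namespace(iri, known_hosts=KNOWN_HOSTS, cutoff=0.8):
    """Rewrite the host of an IRI if it is a near miss of a known vocabulary host."""
    parts = urlsplit(iri)
    if not parts.netloc or parts.netloc in known_hosts:
        return iri
    match = get_close_matches(parts.netloc, known_hosts, n=1, cutoff=cutoff)
    return parts._replace(netloc=match[0]).geturl() if match else iri


fixed = repair_namespace("http://schma.org/Product")
```

A high cutoff keeps unrelated hosts untouched, at the price of missing heavier typos.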
 Using markup as knowledge graph, similar to Linked Data?
A Survey on Challenges for Entity Retrieval in Markup
Data, Yu, R., Gadiraju, U., Fetahu, B., Dietze, S., 15th
International Semantic Web Conference (ISWC2016),
Kobe, Japan (2016).
“Strings, not things”
 Bias towards datatype properties / using any
property as such (!)
 Numbers from LRMI2015 markup corpus:
o 46 million “transversal” quads (i.e. excluding
hierarchical statements such as rdfs:typeOf)
o 64 % are actual datatype properties yet 97%
refer to literals (up from 70% in 2013)
 Challenges
o Markup data = flat entity descriptions
(=> fairly unconnected graph)
o Data reuse requires identity resolution
 Obtaining consolidated & verified entity description/facts (or
graph) for a given resource/entity from Web markup?
 Aiding tasks: such as document annotation, augmentation
or enrichment of existing data- or knowledge bases/graphs
Entity retrieval & reconciliation on markup
Query
iPhone 6, type:(Product)
Entity Description
brand Apple Inc.
weight 129
date 30.09.2015
manufacturer Foxconn
Storage 16 GB
<e1, s:name, „Iphone 6“>
<e2, s:brand, „Apple Inc.“>
<e3, s:brand, „Apple“> <e4, s:weight, 127>
<e5, s:releaseDate, „1.12.1972“>
Web (crawl)
(e.g. Common Crawl/WDC, focused crawl)
Yu, R., Fetahu, B., Gadiraju, U., Dietze, S., FuseM:
Query-Centric Data Fusion on Structured Web
Markup, ICDE2017.
Yu, R., Fetahu, B., Gadiraju, U., Lehmberg, O.,
Ritze, D., Dietze, S., KnowMore - Knowledge Base
Augmentation with Structured Web Markup,
Semantic Web Journal 2017, under review.
FuseM: query-centric data fusion on Web markup
 Entity matching: BM25 entity retrieval model on markup index (Common Crawl) & similarity-based matching
 Data fusion: ML classifier (SVM, k-NN, RandomForest), 3 feature categories (relevance, authority, clustering)
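To make the fact-selection step concrete, the sketch below fuses candidate facts with a naive majority vote per predicate. This is explicitly a swapped-in baseline, not FuseM's supervised classifier and not its relevance, authority or clustering features; the candidate facts are invented, loosely modelled on the slide's example.

```python
from collections import Counter, defaultdict


def fuse_by_majority(candidate_facts):
    """Pick, for each predicate, the value asserted by the most candidate nodes."""
    votes = defaultdict(Counter)
    for _node, predicate, value in candidate_facts:
        votes[predicate][value] += 1
    return {predicate: counts.most_common(1)[0][0] for predicate, counts in votes.items()}


# Invented candidate facts for one query entity, as (node, predicate, value).
candidate_facts = [
    ("node1", "brand", "Apple Inc."),
    ("node2", "brand", "Apple Inc."),
    ("node3", "brand", "Apple"),
    ("node1", "weight", "129"),
    ("node2", "weight", "129"),
    ("node3", "weight", "172"),
    ("node2", "manufacturer", "Foxconn"),
    ("node3", "manufacturer", "Foxconn"),
]
fused = fuse_by_majority(candidate_facts)
```

Majority voting treats every source as equally trustworthy and ignores near-duplicate values such as "Apple" vs "Apple Inc.", which is the kind of weakness a learned classifier with authority and clustering features is meant to address.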
1. Matching
2. Fact selection
New Queries
Foxconn, type:(Organization)
Cupertino, type:(City)
Apple Inc., type:(Organization)
(supervised SVM classifier)
Entity Description
brand Apple Inc.
weight 129
date 30.09.2015
manufacturer Foxconn
Storage 16 GB
Query
iPhone 6, type:(Product)
Candidate Facts
node1 brand _node-x
node1 brand Apple Inc.
node1 weight 129
node2 weight 172
node2 manufacturer Foxconn
node3 releasedate 01.12.1972
node3 manufacturer Foxconn
Web page
markup
Web (crawl)
approx. 125.000 facts for „iPhone6“
Yu, R., Fetahu, B., Gadiraju, U., Dietze, S., FuseM:
Query-Centric Data Fusion on Structured Web
Markup, ICDE2017.
FuseM classifier: features
Evaluation & results: data fusion performance
Setup
 Dataset: Products, Movies, Books
(approx. 3 billion facts) from Common
Crawl / WDC
 Baselines:
 BM25: top-k diverse facts via BM25
(Glimmer)
 CBFS: clustering-based approach
[ESWC2015]
 PreRecCorr: “Fusing data with
correlations” [Pochampally et al.,
ACM SIGMOD 2014]
 10-fold cross validation
Results
 FuseM beats baselines in both tasks
(strong variance of baselines across
tasks)
 All feature categories contribute
Query-centric data fusion (precision)
Query-independent data fusion (P/R/F1)
Results: example of fused entity description
 Data fusion result for book „Brideshead Revisited“ (20 distinct facts)
New facts (compared to DBpedia):
• 60% - 70% of all facts for books & movies
new (across all KBs)
• 100% new for products
(„long tail entities“ not existing in KBs yet)
New facts and attributes
Results: KB augmentation
 Augmentation of 15 properties of
books (& movies) in three KBs
 DB: DBpedia
 FB: Freebase
 WD: Wikidata
 Augmentation performance: % of filled
slots (or „knowledge gaps“) in KB
 Performance varies heavily (yet some
attributes completed to 100%)
KBA result for entities of type „Book“
Yu, R., Fetahu, B., Gadiraju, U., Lehmberg, O.,
Ritze, D., Dietze, S., KnowMore - Knowledge Base
Augmentation with Structured Web Markup,
Semantic Web Journal 2017, under review.
Linked Data & knowledge graphs
Conclusions & outlook
 Retrieval/search of Linked Data hindered by
heterogeneity, quality, dynamics etc
 Dealing with diversity & heterogeneity
o Profiling & recommendation: dataset search &
recommendation
o Entity retrieval & clustering: entity search
 New forms of (structured) Web data:
Web markup (schema.org et al.) & tables
o Convergence of structured and unstructured Web
(e.g. Voldemort KG, Tonon et al., ISWC2016)
o Scale and dynamics (!)
o Potential to augment existing knowledge graphs
(e.g. Google KG or Microsoft Satori)
o Potential training data for NED, entity interlinking
and other entity-centric tasks (e.g. OKE Challenge)
Entity
node1 name
Molecular structure of
nucleic acids
node1 author James D. Watson
node1 publisher Nature
node1 datePublished 1956
node1 datePublished 1953
Entity
node2 name Francis Crick
node2 name Cricks
node2 born 1916
Contact & resources
@stefandietze
http://stefandietze.net
More on Web markup: talk on
Wednesday, 11:00, WWW2017/Digital
Learning track
Embedded data/markup/tables
Unstructured (Web) data/docs
Linked Data & knowledge graphs

More Related Content

What's hot (20)

PPT
euclid_linkedup WWW tutorial (Besnik Fetahu)
Besnik Fetahu
 
PDF
DataUp at ACRL 2013
Carly Strasser
 
PPTX
Extracting Relevant Questions to an RDF Dataset Using Formal Concept Analysis
Mathieu d'Aquin
 
PDF
DMPTool: Data Management Made Easier at CNI 2012
Carly Strasser
 
PPTX
Doing Clever Things with the Semantic Web
Mathieu d'Aquin
 
PDF
Turning Data into Knowledge (KESW2014 Keynote)
Stefan Dietze
 
PDF
Semantic Web / Linked Data Technologies
Mathieu d'Aquin
 
PDF
A structured catalog of open educational datasets
Stefan Dietze
 
PPTX
What Are Links in Linked Open Data? A Characterization and Evaluation of Link...
Armin Haller
 
PPT
LinkedUp - Linked Data & Education
Stefan Dietze
 
PPTX
LUCERO - Building the Open University Web of Linked Data
Mathieu d'Aquin
 
PPTX
Alamw15 VIVO
Kristi Holmes
 
PPTX
Semantic Web, Linked Data and Education: A Perfect Fit?
Mathieu d'Aquin
 
PPTX
Sanderson Shout It Out: LOUD
National Information Standards Organization (NISO)
 
PPTX
Interpreting Data Mining Results with Linked Data for Learning Analytics
Mathieu d'Aquin
 
PPTX
Science Data, Responsibly
University of Washington
 
PPTX
Presentation of LUCERO at EURECOM
Mathieu d'Aquin
 
PPTX
Working with Social Media Data: Ethics & good practice around collecting, usi...
Nicola Osborne
 
PPTX
ESWC2015 opening ceremony
Fabien Gandon
 
PPTX
Data Science, Data Curation, and Human-Data Interaction
University of Washington
 
euclid_linkedup WWW tutorial (Besnik Fetahu)
Besnik Fetahu
 
DataUp at ACRL 2013
Carly Strasser
 
Extracting Relevant Questions to an RDF Dataset Using Formal Concept Analysis
Mathieu d'Aquin
 
DMPTool: Data Management Made Easier at CNI 2012
Carly Strasser
 
Doing Clever Things with the Semantic Web
Mathieu d'Aquin
 
Turning Data into Knowledge (KESW2014 Keynote)
Stefan Dietze
 
Semantic Web / Linked Data Technologies
Mathieu d'Aquin
 
A structured catalog of open educational datasets
Stefan Dietze
 
What Are Links in Linked Open Data? A Characterization and Evaluation of Link...
Armin Haller
 
LinkedUp - Linked Data & Education
Stefan Dietze
 
LUCERO - Building the Open University Web of Linked Data
Mathieu d'Aquin
 
Alamw15 VIVO
Kristi Holmes
 
Semantic Web, Linked Data and Education: A Perfect Fit?
Mathieu d'Aquin
 
Interpreting Data Mining Results with Linked Data for Learning Analytics
Mathieu d'Aquin
 
Science Data, Responsibly
University of Washington
 
Presentation of LUCERO at EURECOM
Mathieu d'Aquin
 
Working with Social Media Data: Ethics & good practice around collecting, usi...
Nicola Osborne
 
ESWC2015 opening ceremony
Fabien Gandon
 
Data Science, Data Curation, and Human-Data Interaction
University of Washington
 

Similar to Beyond Linked Data - Exploiting Entity-Centric Knowledge on the Web (20)

PDF
Web Science Synergies: Exploring Web Knowledge through the Semantic Web
Stefan Dietze
 
PDF
What's all the data about? - Linking and Profiling of Linked Datasets
Stefan Dietze
 
PDF
Semantic Linking & Retrieval for Digital Libraries
Stefan Dietze
 
PDF
KnowEscape workshop, OKCon 2013
Stefan Dietze
 
PDF
Open Data Dialog 2013 - Linked Data in Education
Stefan Dietze
 
PPTX
ESWC 2015 Closing and "General Chair's minute of Madness"
Fabien Gandon
 
PDF
Open Education Challenge 2014: exploiting Linked Data in Educational Applicat...
Stefan Dietze
 
PDF
Hide the Stack: Toward Usable Linked Data
aba-sah
 
PPTX
The Unreasonable Effectiveness of Metadata
James Hendler
 
PDF
2014_WWW_BTOR
Dongpo Deng
 
PDF
Interlinking educational data to Web of Data (Thesis presentation)
Enayat Rajabi
 
PDF
Introduction to Linked Data - Part 1
Itza Carbajal
 
PPTX
SWT Lecture Session 1 - Introduction
Mariano Rodriguez-Muro
 
PDF
From Data to Knowledge - Profiling & Interlinking Web Datasets
Stefan Dietze
 
PDF
Web-scale semantic search
Edgar Meij
 
PDF
Human-in-the-loop: the Web as Foundation for interdisciplinary Data Science M...
Stefan Dietze
 
PPT
Related Entity Finding on the Web
Peter Mika
 
PPTX
Semantic Linked Data
Praxitelis Nikolaos Kouroupetroglou
 
PPT
Where Does It Break?
Frank van Harmelen
 
PPTX
Linked Data past, present and futures
Pierre-Yves Vandenbussche, Ph.D.
 
Web Science Synergies: Exploring Web Knowledge through the Semantic Web
Stefan Dietze
 
What's all the data about? - Linking and Profiling of Linked Datasets
Stefan Dietze
 
Semantic Linking & Retrieval for Digital Libraries
Stefan Dietze
 
KnowEscape workshop, OKCon 2013
Stefan Dietze
 
Open Data Dialog 2013 - Linked Data in Education
Stefan Dietze
 
ESWC 2015 Closing and "General Chair's minute of Madness"
Fabien Gandon
 
Open Education Challenge 2014: exploiting Linked Data in Educational Applicat...
Stefan Dietze
 
Hide the Stack: Toward Usable Linked Data
aba-sah
 
The Unreasonable Effectiveness of Metadata
James Hendler
 
2014_WWW_BTOR
Dongpo Deng
 
Interlinking educational data to Web of Data (Thesis presentation)
Enayat Rajabi
 
Introduction to Linked Data - Part 1
Itza Carbajal
 
SWT Lecture Session 1 - Introduction
Mariano Rodriguez-Muro
 
From Data to Knowledge - Profiling & Interlinking Web Datasets
Stefan Dietze
 
Web-scale semantic search
Edgar Meij
 
Human-in-the-loop: the Web as Foundation for interdisciplinary Data Science M...
Stefan Dietze
 
Related Entity Finding on the Web
Peter Mika
 
Where Does It Break?
Frank van Harmelen
 
Linked Data past, present and futures
Pierre-Yves Vandenbussche, Ph.D.
 
Ad

More from Stefan Dietze (20)

PDF
Understanding Scientific and Societal Adoption and Impact of Science Through ...
Stefan Dietze
 
PDF
NEWORDER Project - Science in the online knowledge order
Stefan Dietze
 
PDF
Collecting & Temporal Analysis of Behavioral Web Data - Tales From The Inside
Stefan Dietze
 
PDF
AI in between online and offline discourse - and what has ChatGPT to do with ...
Stefan Dietze
 
PDF
An interdisciplinary journey with the SAL spaceship – results and challenges ...
Stefan Dietze
 
PDF
Research Knowledge Graphs at NFDI4DS & GESIS
Stefan Dietze
 
PDF

Beyond Linked Data - Exploiting Entity-Centric Knowledge on the Web

  • 1. Beyond Linked Data – Exploiting Entity-Centric Knowledge on the Web. Stefan Dietze, L3S Research Center, Hannover, Germany. Linked Data on the Web (LDOW2017), WWW2017, 05/04/17.
  • 2. Research @ L3S. Research areas: Web science, Information Retrieval, Semantic Web, Social Web Analytics, Knowledge Discovery, Human Computation. Interdisciplinary application areas: digital humanities, TEL/education, Web archiving, mobility, ... Some projects. See also: http://www.l3s.de
  • 3. Acknowledgements: team. Pavlos Fafalios (L3S); Besnik Fetahu (L3S); Elena Demidova (L3S); Ujwal Gadiraju (L3S); Eelco Herder (L3S); Ivana Marenzi (L3S); Nicolas Tempelmeier (L3S); Ran Yu (L3S); Nilamadhaba Mohapatra (L3S, IIT India); Bernardo Pereira Nunes (L3S, PUC Rio de Janeiro); Mathieu d'Aquin (The Open University, UK); Mohamed Ben Ellefi (LIRMM, France); Davide Taibi (CNR, Italy); Konstantin Todorov (LIRMM, France); ...
  • 4. Back in September 2016. A new look at the semantic web, Abraham Bernstein, James Hendler, Natalya Noy, Communications of the ACM, Vol. 59 No. 9, Pages 35-37, September 2016. Retrieval, Crawling and Fusion of Entity-centric Data on the Web, Dietze, S., in: Calì A., Gorgan D., Ugarte M. (eds) Semantic Keyword-Based Search on Structured Data Sources. KEYSTONE 2016. LNCS, Vol 10151. Springer, 2017.
  • 5. Overview. I – Challenges. II – Enabling discovery & search in Linked Data & Knowledge Graphs (dealing with heterogeneity & shortcomings, "Present"): dataset recommendation; dataset profiling; entity retrieval. III – Beyond Linked Data – exploiting embedded Web semantics (other emerging forms of semantics/structured data on the Web, "Future"): Web markup as emerging data source; case studies; data fusion for entity reconciliation (and retrieval). IV – Wrap-up.
  • 6. Data accessibility & quality? Figure: SPARQL endpoint availability over time [Buil-Aranda et al. 2013]. Accessibility of (linked) datasets? Less than 50% of all SPARQL endpoints are actually responsive at a given point in time [Buil-Aranda2013]. "THE" SPARQL protocol? No, but variants, subsets and local restrictions. Semantics, links, quality? Data accuracy (e.g. DBpedia)? [Paulheim2013]; schema compliance & evolution [HoganJWS2012]; vocabulary reuse? [D'AquinWebSci13]. Assessing the Educational Linked Data Landscape, D'Aquin, M., Adamou, A., Dietze, S., ACM Web Science 2013 (WebSci2013), Paris, France, May 2013. Type Inference on Noisy RDF Data, Paulheim, H., Bizer, C., Semantic Web – ISWC 2013, Lecture Notes in Computer Science Volume 8218, 2013, pp. 510-525. An empirical survey of Linked Data conformance, Hogan, A., Umbrich, J., Harth, A., Cyganiak, R., Polleres, A., Decker, S., Journal of Web Semantics 14, 2012. SPARQL Web-Querying Infrastructure: Ready for Action?, Carlos Buil-Aranda, Aidan Hogan, Jürgen Umbrich, Pierre-Yves Vandenbussche, International Semantic Web Conference 2013 (ISWC2013).
  • 7. Vocabulary reuse/linking? Co-occurrence of types (in 146 datasets: 144 vocabularies, 588 overlapping types, 719 predicates). Example types: po:Programme, yov:Video, bibo:Book. Assessing the Educational Linked Data Landscape, D'Aquin, M., Adamou, A., Dietze, S., ACM Web Science 2013 (WebSci2013), Paris, May 2013.
  • 8. Vocabulary reuse/linking? Co-occurrence of types (in 146 datasets: 144 vocabularies, 588 overlapping types, 719 predicates). Co-occurrence after mapping (201 frequently occurring types, mapped into 79 types). Example types: bibo:Film, bibo:Document, po:Programme, bibo:Book, foaf:Document, yov:Video, typeX. Assessing the Educational Linked Data Landscape, D'Aquin, M., Adamou, A., Dietze, S., ACM Web Science 2013 (WebSci2013), Paris, May 2013.
  • 9. "Completeness"? Example: varying completeness of "book" ("movie") entity descriptions. Missing facts: 49.8% (37.1%) in DBpedia, 63.8% (23.3%) in Freebase and 60.9% (40%) in Wikidata (varies heavily across attributes). Yu, R., Fetahu, B., Gadiraju, U., Dietze, S., FuseM: Query-Centric Data Fusion on Structured Web Markup, ICDE2017. Yu, R., Fetahu, B., Gadiraju, U., Lehmberg, O., Ritze, D., Dietze, S., KnowMore - Knowledge Base Augmentation with Structured Web Markup, Semantic Web Journal 2017, under review.
  • 10. Consistency? Analyzing Relative Incompleteness of Movie Descriptions in the Web of Data: A Case Study, Yuan, W., Demidova, E., Dietze, S., Zhu, X., ISWC2014.
  • 11. Challenge for search/retrieval – heterogeneity of datasets & entities. Discovery of suitable (1) datasets & (2) entities. Quality? Currentness, dynamics, accessibility/reliability, data quantity & quality? Topics/scope? Datasets/entities useful & trustworthy for topic XY? Types? Datasets/entities about statistics, organisations, videos, slides, publications etc.?
  • 12. Overview. I – Challenges. II – Enabling discovery & search in Linked Data & Knowledge Graphs (dealing with heterogeneity & shortcomings, "Now"): dataset recommendation; dataset profiling; entity retrieval. III – Beyond Linked Data – exploiting embedded Web semantics (other emerging forms of semantics/structured data on the Web, "Future"): Web markup as emerging data source; case studies; data fusion for entity reconciliation (and retrieval). IV – Wrap-up.
  • 13. Dataset recommendation I. Goal: finding candidate datasets, e.g. for entity retrieval or interlinking tasks (e.g. enrichment). Approach: given dataset s, rank the datasets from D according to the probability score (di, t) to contain linking candidates (entities). Features: approach 1, vocabulary overlap; approach 2, existing links (SNA). Linking candidates are likely if datasets share common (a) schema elements or (b) links (friend of a friend). Example ranking: 1 DBLP, 2 ACM, 3 OAI, 4 CiteSeer, 5 IBM, 6 Roma, 7 IEEE, 8 Ulm, 9 Pisa. Conclusions: roughly 50% MAP for both approaches; simplistic approach (!). Lopes, G.R., Paes Leme, L. A., Nunes, B.P., Casanova, M.A., Dietze, S., Two approaches to the dataset interlinking recommendation problem, 15th International Conference on Web Information System Engineering (WISE 2014), Thessaloniki, Greece.
  • 14. Dataset recommendation II. Pipeline stages shown: preprocessing, datasets filtering, datasets ranking. Ben Ellefi, M., Bellahsene, Z., Dietze, S., Todorov, K., Intension-based Dataset Recommendation for Data Linking, 13th Extended Semantic Web Conference (ESWC2016), Heraklion, Crete, May 2016. L. Han, A. L. Kashyap, T. Finin, J. Mayfield, and J. Weese, "UMBC EBIQUITY-CORE: Semantic textual similarity systems", in Proc. of the *SEM, Association for Computational Linguistics, 2013.
  • 15. Dataset recommendation II: results. Data & ground truth: experiments on (responsive) datasets from LOD Cloud (http://datahub.io); concept profiles from http://lov.okfn.org; ground truth: existing links from VOID profiles of datasets (issue: not always representative for actual linksets). Results: MAP for different similarity thresholds from step 2 max. 54% ([email protected]); recall 100% below indicated similarity (clustering) thresholds. Ben Ellefi, M., Bellahsene, Z., Dietze, S., Todorov, K., Intension-based Dataset Recommendation for Data Linking, 13th Extended Semantic Web Conference (ESWC2016), Heraklion, Crete, May 2016.
  • 16. Dataset search through dataset cataloging & profiling. Dataset Catalog/Registry: http://data.linkededucation.org/linkedup/catalog/ LinkedUp project (FP7 project: L3S, OU, OKFN, Elsevier, Exact Learning Solutions). LinkedUp Catalog: largest collection of LD of educationally relevant resources (approx. 50 datasets). Original datasets published with key content providers, automatically extracted metadata.
  • 17. LinkedUp Catalog: dataset index & registry, federated search. "Federated queries" through schema mappings [WebSci13]; dataset accessibility; linking & topic profiling. Shown: schema/types. http://data.linkededucation.org/linkedup/catalog/
  • 18. LinkedUp Catalog: dataset index & registry, federated search. "Federated queries" through schema mappings [WebSci13]; dataset accessibility; linking & topic profiling [ESWC14]. Shown: dataset topic profiles. http://data.linkededucation.org/linkedup/catalog/
  • 19. Scalable profiling of datasets. Example from the Dataset Catalog/Registry: Yovisto video (yov:Video), <yo:Video …> <dc:title>Lecture 29 – Stem Cells</dc:title> … </yo:Video…>, with topic categories db:Biology, db:Cell biology, db:Cell (Biology). Extraction of representative (DBpedia) categories ("topic profile") for arbitrary datasets? Technically trivial through established NER/NED approaches, but scalability issues (recall: LOD Cloud 1000+ datasets with <100 billion RDF statements). Efficient approach: sampling & ranking for balance between scalability and precision/recall. A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles, Fetahu, B., Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl, W., 11th Extended Semantic Web Conference (ESWC2014), Crete, Greece, (2014).
  • 20. Efficient dataset profiling. 1. Sampling of resources (random sampling, weighted sampling, resource centrality sampling). 2. Entity & topic extraction (NER via DBpedia Spotlight, category mapping & expansion). 3. Normalisation & ranking (graph-based models such as PageRank with Priors, HITS with Priors & K-Step Markov). Result: weighted dataset-topic profile graph. A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles, Fetahu, B., Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl, W., 11th Extended Semantic Web Conference (ESWC2014), Crete, Greece, (2014).
  • 21. Search & exploration of datasets through topic profiles. Applied to entire LOD cloud/graph. Visual exploration of extracted RDF dataset profiles (datasets, topics, relationships). Evaluation results: K-Step Markov (10% sampling size) outperforms baselines (LDA, tf/idf on entire datasets). http://data-observatory.org/lod-profiles/
  • 22. Search: entity retrieval on large LD crawls? How to efficiently retrieve (related) entities/resources for a given entity-seeking (keyword) query, e.g. entities related to <Tim Berners Lee>? Large dataset/crawl, e.g. LinkedUp dataset graph, BTC2014, Dynamic LD Observatory (DyLDO). State of the art: BM25F on inverted entity index (Blanco et al., ISWC2011). Challenges/observations: explicit entity links (owl:sameAs etc.) are sparse yet important to facilitate state-of-the-art methods; query type affinity?
  • 23. Entity retrieval: approach. (I) Offline processing (clustering to address link sparsity): 1. Feature vectors (lexical and structural features). 2. Bucketing: per type (LSH algorithm). 3. Clustering: X-means & spectral clustering per bucket. (II) Online processing (retrieval): 1. Retrieval & expansion: a) BM25F results, b) expansion from clusters (related entities). 2. Re-ranking (context terms & query type affinity). Improving Entity Retrieval on Structured Data, Fetahu, B., Gadiraju, U., Dietze, S., 14th International Semantic Web Conference (ISWC2015), Bethlehem, US, (2015).
  • 24. Entity retrieval: evaluation. Dataset: BTC2014 (4 billion entities); 92 SemSearch queries. Methods: our approaches, XM: X-means, SP: Spectral; baselines, B: BM25F, S1: Tonon et al. [SIGIR12]. Conclusions: XM & SP outperform baselines; clustering remedies link sparsity (yet extensive offline processing required); relevance to query more important than relevance to BM25F results. Improving Entity Retrieval on Structured Data, Fetahu, B., Gadiraju, U., Dietze, S., 14th International Semantic Web Conference (ISWC2015), Bethlehem, US, (2015).
  • 25. PROFILES2017 - Profiling & search of Linked Data. https://profiles2017.wordpress.com/ Probably co-located with ISWC2017 (Vienna); submissions due 21 June.
  • 26. Overview. I – Challenges. II – Enabling discovery & search in Linked Data & Knowledge Graphs (dealing with heterogeneity & shortcomings, "Present"): dataset recommendation; dataset profiling; entity retrieval. III – Beyond Linked Data – exploiting embedded Web semantics (other emerging forms of structured data on the Web, "Future"?): Web markup as emerging data source; case studies; data fusion for entity reconciliation (and retrieval). IV – Wrap-up.
  • 27. Web semantics & entity-centric Web data. Linked Data: approx. 1000+ datasets & 100 billion statements. Open Data: XXX datasets. Web (of documents): approx. 46,000,000,000,000 (46 trillion) Web pages indexed by Google. Other forms of Web semantics and entity-centric knowledge? Dynamics? Quality? Accessibility? Scale?
  • 28. Embedded Web page markup & schema.org. Embedded markup (RDFa, Microdata, Microformats) for interpretation of Web documents (search, retrieval). Arbitrary vocabularies; schema.org used at scale (700 classes, 1000 predicates). Adoption on the Web: 26% (2014 Google study of 12 bn Web pages). "Web Data Commons" (Meusel & Paulheim [ISWC2014]): markup from Common Crawl (3.2 billion pages): 44 billion RDF quads (2016); markup in 38% of pages in 2016. Same order of magnitude as "the Web" (!). Example markup: <div itemscope itemtype="http://schema.org/Movie"> <h1 itemprop="name">Forrest Gump</h1> <span>Actor: <span itemprop="actor">Tom Hanks</span> <span itemprop="genre">Drama</span> ... </div> Example RDF statements: node1 actor _node-x; node1 actor Robin Wright; node1 genre Comedy; node2 actor T. Hanks; node2 distributed by Paramount Pic.; node3 actor Tom Cruise; node3 distributed by Paramount Pic. http://webdatacommons.org
  • 29. Example: embedded Web markup about "products". schema:Product instances in WDC2015. Facts: 1,414,937,431 (= 302,246,120 instances, i.e. products). Providers (distinct Pay Level Domains, PLDs): 93,705. Power law distribution of terms across PLDs. Top 10 PLDs (PLD: # resources): www.crateandbarrel.com: 33,517,936; www.bentgate.com: 17,215,499; www.aliexpress.com: 9,621,943; www.ebay.com.au: 8,861,308; us.fotolia.com: 7,939,982; www.ebay.co.uk: 6,556,820; www.competitivecyclist.com: 6,214,500; www.maxstudio.com: 6,075,626. Top provider (company)? Approx. 35 million resources.
  • 30. Example: markup of bibliographic resources. Figure: number of entities and statements per PLD (ranked), count on log scale. Study on sample Web crawl (WDC2015). Metadata about scholarly articles (e.g. s:ScholarlyArticle): 6,793,764 quads, 1,184,623 entities, 429 distinct predicates (in WDC and for 1 type alone). Top 5 domains: Springer, MDPI, BMJ, mendeley.com, Biodiversitylibrary.org. Domains, topics, disciplines? Life Sciences and Computer Science predominant. Top-10 article titles. Noise. Sahoo, P., Gadiraju, U., Yu, R., Saha, S., Dietze, S., Analysing Structured Scholarly Data embedded in Web Pages, SAVE-SD2016, co-located with WWW2016.
  • 31. Example: markup of learning resources on the Web. "Learning Resources Metadata Initiative (LRMI)": schema.org vocabulary for annotation of learning resources. Developed through DCMI Task Force on LRMI. Approx. 5000 PLDs (incl. subdomains) in CC. LRMI adoption (WDC) [WWW17]: 2015: 44,108,511 quads; 2014: 30,599,024 quads; 2013: 10,636,873 quads. Dietze, S., Taibi, D., Yu, R., Barker, P., d'Aquin, M., Analysing and Improving embedded Markup of Learning Resources on the Web, 26th International World Wide Web Conference (WWW2017), Digital Learning track, Perth, April 2017.
  • 32. Example: markup of learning resources on the Web. "Learning Resources Metadata Initiative (LRMI)": schema.org vocabulary for annotation of learning resources. Developed through DCMI Task Force on LRMI. Approx. 5000 PLDs (incl. subdomains) in CC. LRMI adoption (WDC) [WWW17]: 2015: 44,108,511 quads; 2014: 30,599,024 quads; 2013: 10,636,873 quads. Frequent errors and unintended use (e.g. porn). Example PLDs: 7xxxtube.com, 1amateurporntube.com, virtualpornstars.com, sunriseseniorliving.com, simplyfinance.co.uk, menslifestyles.com, audiobooks.com, simplypsychology.org, helles-koepfchen.de. Dietze, S., Taibi, D., Yu, R., Barker, P., d'Aquin, M., Analysing and Improving embedded Markup of Learning Resources on the Web, 26th International World Wide Web Conference (WWW2017), Digital Learning track, Perth, April 2017.
  • 33. Entity retrieval on Web markup: state of the art. Glimmer (http://glimmer.research.yahoo.com): entity retrieval on WDC dataset [Blanco, Mika & Vigna, ISWC2011]; BM25F retrieval model on WDC index.
  • 34. Web markup: challenges. Characteristics and examples. Coreferences: 18,000 results for <"Iphone 6", type, s:Product> (8.6 quads on average) in Common Crawl. Redundancy: <s, schema:name, "Iphone 6"> occurring 1000 times in CC. Lack of links: largely unlinked entity descriptions. Errors (typos & schema violations, see Meusel et al. [ESWC2015]): wrong namespaces, such as http://schma.org; undefined types & predicates: 9.7%, less common than in LOD; confusion of datatype and object properties: <s1, s:publisher, "Springer">, 24.35% object property issues vs 8% in LOD; data property range violations: e.g. literals vs numbers (12.6% vs 4.6% in LOD). Using markup as knowledge graph, similar to Linked Data? "Strings, not things": bias towards datatype properties / using any property as such (!). Numbers from LRMI2015 markup corpus: 46 million "transversal" quads (i.e. excluding hierarchical statements such as rdfs:typeOf); 64% are actual datatype properties yet 97% refer to literals (up from 70% in 2013). Challenges: markup data = flat entity descriptions (=> fairly unconnected graph); data reuse requires identity resolution. A Survey on Challenges for Entity Retrieval in Markup Data, Yu, R., Gadiraju, U., Fetahu, B., Dietze, S., 15th International Semantic Web Conference (ISWC2016), Kobe, Japan (2016).
  • 35. Entity retrieval & reconciliation on markup. Obtaining consolidated & verified entity description/facts (or graph) for a given resource/entity from Web markup? Aiding tasks such as document annotation, augmentation or enrichment of existing data- or knowledge bases/graphs. Query: iPhone 6, type:(Product). Markup statements from the Web (crawl) (e.g. Common Crawl/WDC, focused crawl): <e1, s:name, "Iphone 6">; <e2, s:brand, "Apple Inc.">; <e3, s:brand, "Apple">; <e4, s:weight, 127>; <e5, s:releaseDate, "1.12.1972">. Entity description: brand: Apple Inc.; weight: 129; date: 30.09.2015; manufacturer: Foxconn; storage: 16 GB. Yu, R., Fetahu, B., Gadiraju, U., Dietze, S., FuseM: Query-Centric Data Fusion on Structured Web Markup, ICDE2017. Yu, R., Fetahu, B., Gadiraju, U., Lehmberg, O., Ritze, D., Dietze, S., KnowMore - Knowledge Base Augmentation with Structured Web Markup, Semantic Web Journal 2017, under review.
  • 36. FuseM: query-centric data fusion on Web markup. 1. Matching (entity matching): BM25 entity retrieval model on markup index (Common Crawl) & similarity-based matching. 2. Fact selection (data fusion, supervised SVM classifier): ML classifier (SVM, knn, RandomForest), 3 feature categories (relevance, authority, clustering). Query: iPhone 6, type:(Product). Candidate facts from Web page markup in the Web (crawl), approx. 125,000 facts for "iPhone6": node1 brand _node-x; node1 brand Apple Inc.; node1 weight 129; node2 weight 172; node2 manufacturer Foxconn; node3 releasedate 01.12.1972; node3 manufacturer Foxconn. Entity description: brand: Apple Inc.; weight: 129; date: 30.09.2015; manufacturer: Foxconn; storage: 16 GB. New queries: Foxconn, type:(Organization); Cupertino, type:(City); Apple Inc., type:(Organization). Yu, R., Fetahu, B., Gadiraju, U., Dietze, S., FuseM: Query-Centric Data Fusion on Structured Web Markup, ICDE2017.
  • 38. Evaluation & results: data fusion performance. Setup: dataset: Products, Movies, Books (approx. 3 billion facts) from Common Crawl / WDC. Baselines: BM25: top-k diverse facts via BM25 (Glimmer); CBFS: clustering-based approach [ESWC2015]; PreRecCorr: "Fusing data with correlations" [Pochampally et al., ACM SIGMOD 2014]. 10-fold cross validation. Results: FuseM beats baselines in both tasks (strong variance of baselines across tasks); all feature categories contribute. Figures: query-centric data fusion (precision); query-independent data fusion (P/R/F1).
  • 39. Results: example of fused entity description. Data fusion result for book "Brideshead Revisited" (20 distinct facts). New facts and attributes (compared to DBpedia): 60% - 70% of all facts for books & movies are new (across all KBs); 100% new for products ("long tail entities" not existing in KBs yet).
  • 40. Results: KB augmentation. Augmentation of 15 properties of books (& movies) in three KBs: DB: DBpedia; FB: Freebase; WD: Wikidata. Augmentation performance: % of filled slots (or "knowledge gaps") in KB. Performance varies heavily (yet some attributes completed to 100%). Figure: KBA result for entities of type "Book". Yu, R., Fetahu, B., Gadiraju, U., Lehmberg, O., Ritze, D., Dietze, S., KnowMore - Knowledge Base Augmentation with Structured Web Markup, Semantic Web Journal 2017, under review.
  • 41. Conclusions & outlook. Linked Data & knowledge graphs. Retrieval/search of Linked Data hindered by heterogeneity, quality, dynamics etc. Dealing with diversity & heterogeneity: profiling & recommendation (dataset search & recommendation); entity retrieval & clustering (entity search).
  • 42. Conclusions & outlook. Data sources shown: Linked Data & knowledge graphs; embedded data/markup/tables; unstructured (Web) data/docs. Example entities: node1 name Molecular structure of nucleic acids; node1 author James D. Watson; node1 publisher Nature; node1 datePublished 1956; node1 datePublished 1953; node2 name Francis Crick; node2 name Cricks; node2 born 1916. Retrieval/search of Linked Data hindered by heterogeneity, quality, dynamics etc. Dealing with diversity & heterogeneity: profiling & recommendation (dataset search & recommendation); entity retrieval & clustering (entity search). New forms of (structured) Web data: Web markup (schema.org et al.) & tables: convergence of structured and unstructured Web (e.g. Voldemort KG, Tonon et al., ISWC2016); scale and dynamics (!); potential to augment existing knowledge graphs (e.g. Google KG or Microsoft Satori); potential training data for NED, entity interlinking and other entity-centric tasks (e.g. OKE Challenge).
  • 43. Contact & resources. @stefandietze, http://stefandietze.net More on Web markup: talk on Wednesday, 11:00, WWW2017/Digital Learning track. Data sources shown: Linked Data & knowledge graphs; embedded data/markup/tables; unstructured (Web) data/docs. Example entities: node1 name Molecular structure of nucleic acids; node1 author James D. Watson; node1 publisher Nature; node1 datePublished 1956; node1 datePublished 1953; node2 name Francis Crick; node2 name Cricks; node2 born 1916.
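
Slide 28 shows a Microdata snippet next to the flat statements derived from it. As a rough illustration of that step, the following sketch turns such markup into (subject, property, value) triples using only Python's standard library. It is not the Web Data Commons extraction pipeline: the class and function names are hypothetical, the closing tag of the outer span is added so the sample is well-formed, and the parser assumes non-nested items whose property values are plain element text.

```python
from html.parser import HTMLParser


class FlatMicrodataParser(HTMLParser):
    """Collects (subject, property, value) statements from flat Microdata.

    Simplifications (assumptions of this sketch): items are not nested,
    property values are plain element text, and subjects are numbered
    blank nodes (node1, node2, ...) as on the slide.
    """

    def __init__(self):
        super().__init__()
        self.statements = []
        self._subject = None    # blank-node id of the current item
        self._item_count = 0
        self._open_prop = None  # itemprop whose text is being collected
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs:
            # A new item starts: mint a blank node and record its type.
            self._item_count += 1
            self._subject = f"node{self._item_count}"
            if attrs.get("itemtype"):
                self.statements.append((self._subject, "type", attrs["itemtype"]))
        elif "itemprop" in attrs and self._subject is not None:
            self._open_prop = attrs["itemprop"]
            self._text = []

    def handle_data(self, data):
        if self._open_prop is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # The first closing tag after an itemprop element ends its value.
        if self._open_prop is not None:
            value = "".join(self._text).strip()
            if value:
                self.statements.append((self._subject, self._open_prop, value))
            self._open_prop = None


def extract_statements(html):
    """Parse an HTML string and return the collected statements."""
    parser = FlatMicrodataParser()
    parser.feed(html)
    parser.close()
    return parser.statements


MOVIE_HTML = """
<div itemscope itemtype="http://schema.org/Movie">
  <h1 itemprop="name">Forrest Gump</h1>
  <span>Actor: <span itemprop="actor">Tom Hanks</span></span>
  <span itemprop="genre">Drama</span>
</div>
"""

for statement in extract_statements(MOVIE_HTML):
    print(statement)
```

The result is a flat, unlinked entity description: every value is a literal string, which is the "strings, not things" characteristic noted on slide 34.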

Editor's Notes

  • #12: Definition 2.1. ith-Order Value Inconsistency (Dx, Dy, P) between the pair of datasets Dx, Dy with respect to the ith-order single-value property P is the proportion of the equivalent entities in Dx and Dy having contradicting values in P. Definition 2.2. ith-Order Value Incompleteness (Dx, Dy, P) between the pair of datasets Dx, Dy with respect to an ith-order multi-value property P is the proportion of entities in Dx and Dy having different values in P. Issue: "different" can mean incorrectness as well as incompleteness.
  • #16: Filtering: identifying the cluster of datasets that are similar to Ds (two metrics: LSA-based, WordNet-based), with threshold theta. Ranking: cosine similarity between profiles. Experimentally, this gives better results than using the ranks from the filtering step.
  • #17: Evaluation: MAP for different similarity thresholds (theta) from the filtering step. When explaining: why is MAP decreasing with higher similarity thresholds? "For the given intervals [0, 0.7], [0, 0.8] and [0, 0.9], with respect to the used measures, we have 100% recall": all datasets considered as true are present in the recommendation list.
  • #23: Random Sampling: randomly selects resource instances from $R_i \in D_i$ for further analysis in the profiling pipeline. Weighted Sampling: weighs each resource as the ratio of the number of datatype properties used to define a resource over the maximum number of datatype properties over all resources for a specific dataset. The weight for $r_k$ is computed by $w_k = |f(r_k)| / \max\{|f(r_j)|\}$ ($r_j \in R_i \mid j = 1, \dots, n$), where $f(r_k)$ represents the datatype properties of resource $r_k$. An instance is included in a sample if, for a randomly generated number $p$ from a uniform distribution, the weight $w_k$ satisfies $w_k > (1 - p)$. Such a strategy ensures that resources that carry more information (having more literal values) have higher chances of being included earlier at low cut-offs of analysed samples. Resource Centrality Sampling: weighs each resource as the ratio of the number of resource types used to describe a particular resource ($V'_k \subseteq V_k$) divided by the total number of resource types in a dataset. The weight is defined by $c_k = |C'_k| / |C|$ with $C'_k = C \cap V'_k$. Similarly to 'weighted sampling', for a randomly generated number $p$, $r_k$ is included in the sample if $c_k > (1 - p)$. The main motivation behind computing the centrality of a resource is that important concepts in a dataset tend to be more structured and linked to other concepts. [Fig. 1. Processing pipeline for generating structured profiles of Linked Data graphs.]
  • #74: The underlying assumption is that very specific and targeted seed lists will require different crawling and relevance computation methods than very broad and unspecific seed lists. http://www.visualdataweb.org/relfinder/demo.swf?obj1=TWluaW9ucyAoZmlsbSl8aHR0cDovL2RicGVkaWEub3JnL3Jlc291cmNlL01pbmlvbnNfKGZpbG0p&amp;obj2=U2FuZHJhIEJ1bGxvY2t8aHR0cDovL2RicGVkaWEub3JnL3Jlc291cmNlL1NhbmRyYV9CdWxsb2Nr&amp;obj3=Sm9uIEhhbW18aHR0cDovL2RicGVkaWEub3JnL3Jlc291cmNlL0pvbl9IYW1t&amp;obj4=TWljaGFlbCBLZWF0b258aHR0cDovL2RicGVkaWEub3JnL3Jlc291cmNlL01pY2hhZWxfS2VhdG9u&amp;obj5=QWxsaXNvbiBKYW5uZXl8aHR0cDovL2RicGVkaWEub3JnL3Jlc291cmNlL0FsbGlzb25fSmFubmV5&amp;obj6=RGVzcGljYWJsZSBNZSAyfGh0dHA6Ly9kYnBlZGlhLm9yZy9yZXNvdXJjZS9EZXNwaWNhYmxlX01lXzI=&amp;obj7=U3RldmUgQ29vZ2FufGh0dHA6Ly9kYnBlZGlhLm9yZy9yZXNvdXJjZS9TdGV2ZV9Db29nYW4=&amp;obj8=R2VvZmZyZXkgUnVzaHxodHRwOi8vZGJwZWRpYS5vcmcvcmVzb3VyY2UvR2VvZmZyZXlfUnVzaA==&amp;name=REJwZWRpYSAobWlycm9yKQ==&amp;abbreviation=ZGJw&amp;description=TGlua2VkIERhdGEgdmVyc2lvbiBvZiBXaWtpcGVkaWEu&amp;endpointURI=aHR0cDovL2RicGVkaWEuaW50ZXJhY3RpdmVzeXN0ZW1zLmluZm8=&amp;dontAppendSPARQL=ZmFsc2U=&amp;defaultGraphURI=aHR0cDovL2RicGVkaWEub3Jn&amp;isVirtuoso=dHJ1ZQ==&amp;useProxy=ZmFsc2U=&amp;method=UE9TVA==&amp;autocompleteLanguage=ZW4=&amp;autocompleteURIs=aHR0cDovL3d3dy53My5vcmcvMjAwMC8wMS9yZGYtc2NoZW1hI2xhYmVs&amp;ignoredProperties=aHR0cDovL3d3dy53My5vcmcvMTk5OS8wMi8yMi1yZGYtc3ludGF4LW5zI3R5cGUsaHR0cDovL3d3dy53My5vcmcvMjAwNC8wMi9za29zL2NvcmUjc3ViamVjdCxodHRwOi8vZGJwZWRpYS5vcmcvcHJvcGVydHkvd2lraVBhZ2VVc2VzVGVtcGxhdGUsaHR0cDovL2RicGVkaWEub3JnL3Byb3BlcnR5L3dvcmRuZXRfdHlwZSxodHRwOi8vZGJwZWRpYS5vcmcvcHJvcGVydHkvd2lraWxpbmssaHR0cDovL2RicGVkaWEub3JnL29udG9sb2d5L3dpa2lQYWdlV2lraUxpbmssaHR0cDovL3d3dy53My5vcmcvMjAwMi8wNy9vd2wjc2FtZUFzLGh0dHA6Ly9wdXJsLm9yZy9kYy90ZXJtcy9zdWJqZWN0&amp;abstractURIs=aHR0cDovL2RicGVkaWEub3JnL29udG9sb2d5L2Fic3RyYWN0&amp;imageURIs=aHR0cDovL2RicGVkaWEub3JnL29udG9sb2d5L3RodW1ibmFpbCxodHRwOi8veG1
sbnMuY29tL2ZvYWYvMC4xL2RlcGljdGlvbg==&amp;linkURIs=aHR0cDovL3B1cmwub3JnL29udG9sb2d5L21vL3dpa2lwZWRpYSxodHRwOi8veG1sbnMuY29tL2ZvYWYvMC4xL2hvbWVwYWdlLGh0dHA6Ly94bWxucy5jb20vZm9hZi8wLjEvcGFnZQ==&amp;maxRelationLegth=Mg==
  • #75: As shown in our experimental evaluation, specific entities within a seed list strongly reflect the crawl intent. For example, in the seed list {Pulp Fiction, Film, Entertainment}, the most specific entity, 'Pulp Fiction', reflects the most specific crawl intent, whereas the entities 'Film' and 'Entertainment' provide contextual information, namely that 'Pulp Fiction' is a movie. Motivated by this, we assume that the relevance of a candidate entity depends on the seed entity it is related to. For example, candidate entities similar to 'Pulp Fiction' will be ranked higher than entities similar to the other seed entities.
  • #79: The average improvement across different NDCG levels is 1.6% at depth 2 and 4.3% at depth 3, suggesting a positive effect of the attrition factor for our seed lists. On the other hand, the coherence of the seed list appears to have no significant impact on the suitability of a particular configuration. Given the significantly increased runtime when crawling beyond hop 2, a crawl depth of 2 seems to offer the best trade-off between quality and efficiency, and crawling to a greater depth is not advisable.
  • #81: This is because high-coherence seed lists have a more specific crawl intent, leading to narrow and often small result sets and hence a limited ground truth, while low-coherence lists have a much broader crawl intent and a larger set of relevant entities. This is reflected in our ground truth: the average number of entities labeled as related (score ≥ 3) is 208 for low-coherence seed lists and 145 for high-coherence seed lists. The narrow search intent also causes more disagreement among the crowdsourcing workers who generate the ground truth, which makes the results for high-coherence seed lists less consensual. Another difficulty in evaluating the crawling task is the highly heterogeneous and varied nature of the possible result sets, which originate from a highly heterogeneous Linked Data graph.
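
The notes for slide #79 report improvements "across different NDCG levels". As a rough illustration of what such a level measures, the sketch below computes NDCG@k for one ranked list of crawled entities. It is not the evaluation code behind the slides: the function names, the linear-gain variant of DCG, the example scores and their 1-5 scale are all assumptions made for illustration, and the ideal ordering is taken over the same list rather than over a full ground truth.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k graded relevance scores.

    Uses linear gain: the score at 0-based rank r is divided by log2(r + 2).
    """
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG@k divided by the DCG@k of the same scores sorted in descending order."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical crowdsourced relevance scores, in the order a crawler ranked the entities.
ranked_scores = [5, 3, 4, 1, 2]
for k in (1, 3, 5):
    print(f"NDCG@{k} = {ndcg_at_k(ranked_scores, k):.3f}")
```

Comparing two crawl configurations then amounts to computing NDCG@k for each configuration's ranking against the same judgments and averaging over seed lists and over the chosen values of k.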