What's all the data about?
Profiling and interlinking Web datasets
Stefan Dietze
L3S Research Center
27/03/14
Recent work on Linked Data exploration/discovery/search
 Entity interlinking & dataset interlinking recommendation
 Dataset profiling
 Data consistency & conflicts
Research areas
 Web science, Information Retrieval, Semantic Web & Linked
Data, data & knowledge integration (mapping, classification,
interlinking)
 Application domains: education/TEL, Web archiving, …
Some projects
Introduction
http://www.l3s.de/
 See also: http://purl.org/dietze
…why are there so few datasets actually used?
 Data reuse and in-links are focused on trusted „reference
graphs“ such as DBpedia, Freebase etc.
 Long tail of LD datasets which are neither reused nor linked
to (LOD Cloud alone 300+ datasets, 50 bn triples)
 Explanations?
Linked Data is awesome, but...
 „HTTP-accessibility“
(SPARQL, URI-dereferencing)
 „Structure“ & „Semantics“
(=> shared/linked vocabularies)
 „Interlinked“
 „Persistent“
Hm,
really?
Linked data is more diverse than we think
SPARQL Web-Querying Infrastructure: Ready for Action?,
Carlos Buil-Aranda, Aidan Hogan, Jürgen Umbrich, Pierre-Yves
Vandenbussche, International Semantic Web Conference 2013
(ISWC2013).
SPARQL endpoint availability over time [Buil-Aranda et al 2013]
Accessibility of datasets?
 Less than 50% of all SPARQL endpoints are actually responsive
at any given point in time
 “THE” SPARQL protocol? No, but many variants & subsets
 …
Shared vocabularies & schemas, but:
 …still very heterogeneous [d’Aquin, WebSci13]
 …data partially messy and not conformant
(RDFS, schemas) [HoganJWS2012]
 …even widely used reference datasets such as
DBpedia are noisy [Paulheim2013]
Co-occurrence graph of data types in 146 datasets:
144 vocabularies, 588 highly overlapping types, 719 properties
Assessing the Educational Linked Data Landscape, D’Aquin, M.,
Adamou, A., Dietze, S., ACM Web Science 2013 (WebSci2013), Paris,
France, May 2013.
Type Inference on Noisy RDF Data, Paulheim H., Bizer, C. Semantic
Web – ISWC 2013, Lecture Notes in Computer Science Volume 8218,
2013, pp 510-525
An empirical survey of Linked Data conformance. Hogan, A., Umbrich,
J., Harth, A., Cyganiak, R., Polleres, A., Decker., S., In the Journal of Web
Semantics 14: pp. 14–44, 2012
What about data consistency?
Inconsistency and Incompleteness of Linked Datasets – a
Case Study, Yuan, W., Demidova, E., Dietze, S., Zhu, X., Web
Science 2014, WebSci14, under review.
Too many/diverse datasets, too little information
 Which datasets are useful & trustworthy for case
XY (eg „learning about the solar system“) ? Which
topics are covered?
 Types: which datasets describe statistics, videos,
slides, publications etc?
 Currentness, dynamics, accessibility/reliability,
data quantity & quality?
Data curation and dataset profiling
Dataset
Catalog/Registry
 Catalog of data: classification of
datasets according to resource
types, disciplines/topics, data
quality, accessibility, etc.
 Infrastructure for
distributed/federated querying
describes
db:Astro. Objects
Dataset profiling: what’s all the data about
Dataset
Metadata
BIBO
AIISO
FOAF
contains
Entity disambiguation &
linking [ESWC13]
Topic profile extraction
[WWW13, ESWC14]
db:Astronomy
db:Astro. Objects
Dataset
Catalog/Registry
yov:Video
po:Programme
BBC Programme
<po:Programme …>
<po:Series>Wonders of the Solar System</po:Series>
<po:Actor>Brian Cox</po:Actor>
</po:Programme…>
<yo:Video …>
<dc:title>Pluto & the
Dwarf Planets</dc:title>
…
</yo:Video…>
Yovisto Video
bibo:Film
Schema mappings
[WebSci13]
Schemas/vocabularies on the Web: XKCD 927
https://xkcd.com/927/
Schema assessment and mapping
Co-occurrence of data types
(in 146 datasets: 144 vocabularies, 588 highly overlapping types, 719 properties)
Assessing the Educational Linked Data Landscape,
D’Aquin, M., Adamou, A., Dietze, S., ACM Web Science
2013 (WebSci2013), Paris, France, May 2013.
<po:Programme …>
<po:title>Secret Universe –
The Life of the Cell</po:title>
…
</po:Programme…>
BBC Programme
<sioc:Item …>
<title>Viral diseases &
bacteria</title>
…
</sioc:Item ….>
SlideShare Set
po:Programme
sioc:Item
?
http://datahub.io/group/linked-education
Schema assessment and mapping
Co-occurrence of data types
(in 146 datasets: 144 vocabularies, 588 highly overlapping types, 719 properties)
Co-occurrence after mapping into most frequent schemas
(201 frequent types mapped into 79 classes)
bibo:Slideshow
bibo:Film
bibo:Document
<po:Programme …>
<po:title>Secret Universe –
The Life of the Cell</po:title>
…
</po:Programme…>
BBC Programme
<sioc:Item …>
<title>Viral diseases &
bacteria</title>
…
</sioc:Item ….>
SlideShare Set
po:Programme
sioc:Item
LinkedUp Data Catalog
in a nutshell http://datahub.io/group/linked-education
http://data.linkededucation.org/linkedup/catalog/
 RDF (VoID) dataset catalog: browse &
query distributed datasets
 Live information about endpoint
accessibility
 Federated queries using type mappings
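As a rough illustration of federated queries using type mappings, the sketch below expands a query for one shared class into a SPARQL 1.1 query with one SERVICE block per endpoint. The mapping table (which dataset types map to which bibo class), the function name and the example endpoint URL are assumptions made for this example; they are not taken from the LinkedUp catalog itself.

```python
BIBO = "http://purl.org/ontology/bibo/"
PO = "http://purl.org/ontology/po/"
SIOC = "http://rdfs.org/sioc/ns#"

# Hypothetical mapping from a shared class to dataset-specific types.
TYPE_MAPPINGS = {
    BIBO + "Film": [PO + "Programme"],
    BIBO + "Slideshow": [SIOC + "Item"],
}


def expand_type_query(shared_type, endpoints, mappings=TYPE_MAPPINGS):
    """Build a federated query asking every endpoint for resources typed
    with the shared class or with any dataset-specific type mapped to it."""
    types = [shared_type] + mappings.get(shared_type, [])
    values = " ".join("<%s>" % t for t in types)
    blocks = []
    for url in endpoints:
        blocks.append(
            "  { SERVICE <%s> {\n"
            "      VALUES ?type { %s }\n"
            "      ?resource a ?type .\n"
            "  } }" % (url, values)
        )
    body = "\n  UNION\n".join(blocks)
    return "SELECT ?resource ?type WHERE {\n" + body + "\n}"
```

Calling `expand_type_query(BIBO + "Film", ["http://example.org/sparql"])` returns a query string that could be sent to any SPARQL 1.1 processor supporting SERVICE; the design keeps the mapping table separate from the query template so new mappings need no code changes.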
http://datahub.io/group/linked-education
<yo:Video 8748720>
<dc:title>Pluto & the
Dwarf Planets</dc:title>
…
</yo:Video 8748720>
Video
<sioc:Item 2139393292>
<title>Planetary motion
& gravity</title>
…
</sioc:Item 2139393292>
Slideset
Topics/categories addressed?
Relatedness of resources/entities?
(types, semantics)
<po:Programme519215>
<po:Series>Wonders of the Solar
System</po:Series>
<po:Episode>Emp. of the Sun</po:Episode>
<po:Actor>Brian Cox</po:Actor>
</po:Programme519215 >
Programme
Combining a co-occurrence-based and a semantic measure
for entity linking, B. P. Nunes, S. Dietze, M.A. Casanova, R.
Kawase, B. Fetahu, and W. Nejdl., ESWC 2013 - 10th Extended
Semantic Web Conference, (May 2013).
A Scalable Approach for Efficiently Generating
Structured Dataset Topic Profiles, Fetahu, B., Dietze, S.,
Nunes, B. P., Casanova, M. A., Nejdl, W., 11th Extended
Semantic Web Conference (ESWC2014), Crete, Greece, (2014).
Challenge: semantics of resources/datasets?
<yo:Video 8748720>
<dc:title>Pluto & the
Dwarf Planets</dc:title>
…
</yo:Video 8748720>
Video
<po:Programme519215>
<po:Series>Wonders of the Solar
System</po:Series>
<po:Episode>Emp. of the Sun</po:Episode>
<po:Actor>Brian Cox</po:Actor>
</po:Programme519215 >
Programme
Data disambiguation (for linking & profiling)
Brian Cox?
Sun?
Pluto?
db:Pluto
(Dwarf
Planet)
db:Astronomical Objects
db:Sun
Data disambiguation using background knowledge
„Semantic relatedness“ of resources?
db:Astronomy
<po:Programme519215>
<po:Series>Wonders of the Solar
System</po:Series>
<po:Episode>Emp. of the Sun</po:Episode>
<po:Actor>Brian Cox</po:Actor>
</po:Programme519215 >
Programme
<sioc:Item 2139393292>
<title>Planetary motion
& gravity</title>
…
</sioc:Item 2139393292>
Slideset
<yo:Video 8748720>
<dc:title>Pluto & the
Dwarf Planets</dc:title>
…
</yo:Video 8748720>
Video
db:Pluto
(Dwarf
Planet)
db:Astronomical Objects
<yov:Lecture8748720>
<title>Pluto & the Dwarf
Planets</title>
…
</yov:Lecture8748720>
Online Lecture
db:Astronomy
 Computation of connectivity scores
between resources/entities
 Method: combination of
 (i) a semantic (graph-based) connectivity
score (SCS) with
 (ii) a Web co-occurrence-based measure
(CBM), similar to Normalized Google Distance (NGD)
 For (i): adaptation of the Katz index from social
network analysis (SNA) to (linked) data graphs
(considering the number and lengths of paths
over transversal properties)
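The two ingredients can be sketched as follows, under stated assumptions: a plain truncated Katz index that counts simple paths up to a maximum length and damps longer paths by `beta`, and the standard Normalized Google Distance computed from hit counts. The SCS and CBM in [ESWC13] are adaptations of these ideas; the parameter values, the toy graph and the function names here are illustrative only.

```python
import math


def katz_score(graph, source, target, beta=0.5, max_len=3):
    """Truncated Katz index: sum over l of beta**l * (number of simple
    paths of length l from source to target), for l = 1..max_len."""
    counts = [0] * (max_len + 1)

    def walk(node, visited, length):
        if length > 0 and node == target:
            counts[length] += 1
            return
        if length == max_len:
            return
        for nxt in graph.get(node, ()):
            if nxt not in visited:
                walk(nxt, visited | {nxt}, length + 1)

    walk(source, {source}, 0)
    return sum(beta ** l * counts[l] for l in range(1, max_len + 1))


def ngd(fx, fy, fxy, n):
    """Normalized Google Distance from hit counts: fx and fy are the counts
    for each term, fxy the joint count, n the size of the indexed corpus."""
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))


# Hypothetical toy graph (adjacency lists), not real DBpedia data.
TOY_GRAPH = {
    "Pluto": ["Dwarf_planet", "Solar_System"],
    "Sun": ["Solar_System", "Star"],
    "Solar_System": ["Pluto", "Sun"],
    "Dwarf_planet": ["Pluto"],
    "Star": ["Sun"],
}
```

On the toy graph, `katz_score(TOY_GRAPH, "Pluto", "Sun")` finds one path of length 2, so shorter and more numerous paths yield higher scores. Note that NGD is a distance (lower means more related), so it would need inverting before being combined with a connectivity score.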
db:Sun
SCS = 0.32
CBM = 0.24
http://purl.org/vol/doc/
http://purl.org/vol/ns/
Combining a co-occurrence-based and a semantic
measure for entity linking, B. P. Nunes, S. Dietze, M.A.
Casanova, R. Kawase, B. Fetahu, and W. Nejdl., ESWC 2013
- 10th Extended Semantic Web Conference, (May 2013).
Entity linking: semantic relatedness
<sioc:Item 2139393292>
<title>Planetary motion
& gravity</title>
…
</sioc:Item 2139393292>
Slideset
<po:Programme519215>
<po:Series>Wonders of the Solar
System</po:Series>
<po:Episode>Emp. of the Sun</po:Episode>
<po:Actor>Brian Cox</po:Actor>
</po:Programme519215 >
Programme
Entity linking: evaluation
 Evaluation based on USA Today news items (80,000 entity pairs)
 Manually created gold standard
(1000 entity pairs)
 Baseline: Explicit Semantic Analysis (ESA)
=> CBM/SCS: „relatedness“; ESA: „similarity“
Precision/Recall/F1 for SCS, CBM, ESA.
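For reference, precision, recall and F1 against a gold standard of entity pairs can be computed as in the sketch below. The function name and the pair representation (tuples in a set) are assumptions for illustration; the actual evaluation setup is described in the cited paper.

```python
def precision_recall_f1(predicted, gold):
    """Compare predicted entity pairs with gold-standard pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Pairs are treated as ordered tuples here; if (a, b) and (b, a) should count as the same link, they would need normalising (for example by sorting each pair) before the comparison.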
Combining a co-occurrence-based and a semantic
measure for entity linking, B. P. Nunes, S. Dietze, M.A.
Casanova, R. Kawase, B. Fetahu, and W. Nejdl., ESWC 2013
- 10th Extended Semantic Web Conference, (May 2013).
db:Astronomical Objects
db:Astronomy
db:Sun
 Extracting representative metadata („topic profile“) for each dataset
 Ranking of most representative (DBpedia) categories (= topics); applied to all responsive LOD datasets
 Scalability vs representativeness: sampling & ranking for good scalability/accuracy balance
DBpedia category graph
Dataset profiling: what's the data about?
A Scalable Approach for Efficiently Generating
Structured Dataset Topic Profiles, Fetahu, B.,
Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl, W.,
11th Extended Semantic Web Conference
(ESWC2014), Crete, Greece, (2014).
<po:Programme519215>
<po:Series>Wonders of the Solar
System</po:Series>
<po:Episode>Emp. of the Sun</po:Episode>
<po:Actor>Brian Cox</po:Actor>
</po:Programme519215 >
Programme
Dataset profiling: approach
1. Sampling of resource instances
(random sampling, weighted sampling, resource
centrality sampling)
2. Entity and topic extraction (NER via DBpedia
Spotlight, category mapping and expansion)
3. Normalisation and ranking (using graphical
models such as PageRank with Priors, HITS with
Priors and K-Step Markov)
=> Result: weighted dataset-topic profile graph
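The ranking step can be illustrated with a personalised PageRank, which is one common reading of "PageRank with Priors": the random surfer restarts according to a prior distribution instead of uniformly. This is a minimal pure-Python sketch under that assumption; the damping value, iteration count and toy node names are illustrative, and every node must appear as a key of the graph.

```python
def pagerank_with_priors(graph, priors, damping=0.85, iterations=50):
    """graph: node -> list of successor nodes (all nodes must be keys).
    priors: node -> non-negative weight, e.g. 1.0 for sampled entities."""
    nodes = list(graph)
    total = sum(priors.get(n, 0.0) for n in nodes)
    prior = {n: priors.get(n, 0.0) / total for n in nodes}
    rank = dict(prior)
    for _ in range(iterations):
        # Restart mass is distributed according to the priors.
        nxt = {n: (1 - damping) * prior[n] for n in nodes}
        for n in nodes:
            successors = graph[n]
            if successors:
                share = damping * rank[n] / len(successors)
                for m in successors:
                    nxt[m] += share
            else:
                # Dangling node: hand its mass back according to the priors.
                for m in nodes:
                    nxt[m] += damping * rank[n] * prior[m]
        rank = nxt
    return rank
```

With entities linking to their categories and priors placed on the sampled entities, categories reachable from many entities accumulate more score, which gives a weighted topic ranking of the kind the profile graph needs.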
A Scalable Approach for Efficiently Generating
Structured Dataset Topic Profiles, Fetahu, B.,
Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl, W.,
11th Extended Semantic Web Conference
(ESWC2014), Crete, Greece, (2014).
Dataset profiling: exploring LOD datasets/topics
in a nutshell http://data-observatory.org/lod-profiles/
 Automatic extraction of dataset “topics” [ESWC2014]
 Visualisation & exploration of dataset-topic graph
(datasets, topics, relationships)
 Includes all (responsive) datasets of LOD Cloud
Dataset profiling: results evaluation
NDCG (averaged over all datasets).
Datasets & Ground Truth
 Yovisto, Oxpoints, LAK Dataset, Semantic Web
Dogfood
 Crowd-sourced topic indicators from datasets
(keywords, tags)
 Manual mapping to entities & category extraction
(ranking according to frequency)
Baselines
 1) LDA, 2) tf/idf (applied to entire datasets)
 Topic extraction according to our approach,
weighting/ranking based on term weight
Measure
 NDCG @ rank l
 Performance (time/NDCG) for different sampling
strategies/sizes etc
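NDCG at rank l can be computed as sketched below, assuming graded relevance values listed in the order the system ranked the topics, and assuming the ideal ranking is obtained by sorting that same list. The function names are illustrative, and the log2 discount is the common textbook variant rather than necessarily the one used in the paper.

```python
import math


def dcg(relevances, k):
    """Discounted cumulative gain over the first k ranked items."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg(relevances, k):
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal else 0.0
```

A perfectly ordered ranking yields 1.0; averaging `ndcg` over all datasets gives a single figure per approach and sampling strategy.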
dbp:Category:Royal_Medal_winners
dbp:Category:1955_births
dbp:Category:People_from_London
dbp:Category:Buzzwords
dbp:Category:Web_Services
dbp:Category:HTTP
dbp:Category:Unitarian_Universalists
dbp:Category:World_Wide_Web
What have these categories in common?
Diversity of category profile for a single paper
Berners-Lee, Tim; Hendler, James; Lassila, Ora (2001). "The Semantic Web".
Scientific American.
person
document
dbp:Tim_Berners-Lee
dbp:Category:1955_births
dbp:Category:People_from_London
dbp:Category:Buzzwords
dbp:Semantic_Web
dbp:Category:Semantic_Web
dbp:Category:Web_Services
dbp:Category:HTTP
dbp:Category:Unitarian_Universalists
first-level categories (dcterms:subject)
dbp:Category:World_Wide_Web
dbp:Category:Royal_Medal_winners
 DBpedia category graph not an ideal “topic” vocabulary:
 Broad and noisy
 “Categories” vs “topics” (for capturing disciplines, thesauri
like UMBEL or UNESCO Thesaurus seem better suited)
 Hierarchy ?
 Filtering of certain partitions of category graph (too generic
categories etc)
 Mixing categories across resource types (document, person)
creates “perceived noise”
 But: broadness is useful as general vocabulary for
categorisation of all sorts of resource types
Dataset profiling: some lessons learned
http://data-observatory.org/led-explorer/
 Type specific views on datasets/
categories
 “Document” (foaf:Document)
 “Person” (foaf:Person)
 “Course” (aiiso:Course)
 Currently applied to datasets in
LinkedUp Catalog only (as
schema mappings already
available here)
Type-specific exploration of dataset categories
Dataset interlinking recommendation
Candidate datasets for interlinking?
t
Linkset1
Linkset2
Problem
 Given a dataset t, rank the datasets di in D
by the probability score(di, t) that they
contain linking candidates (entities)
 Features:
 Vocabulary overlap
 Existing links (SNA)
 Datasets more likely to contain linking
candidates if they (a) share common
schema elements, or (b) already link to t
or datasets t links to (friend of a friend)
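The two features can be sketched as a simple ranking function. Note the assumptions: vocabulary overlap is measured as Jaccard similarity, the link feature is a binary "friend of a friend" indicator, and both are combined with equal weights purely for illustration. The cited papers evaluate the two approaches separately, so this combination is not their method.

```python
def rank_candidates(target, vocabularies, links):
    """vocabularies: dataset -> set of schema elements it uses.
    links: dataset -> set of datasets it already links to.
    Returns candidate datasets sorted by descending score."""
    target_vocab = vocabularies[target]
    target_links = links.get(target, set())
    scores = {}
    for dataset, vocab in vocabularies.items():
        if dataset == target:
            continue
        union = target_vocab | vocab
        overlap = len(target_vocab & vocab) / len(union) if union else 0.0
        dataset_links = links.get(dataset, set())
        # (b) the candidate links to t, or to a dataset t links to.
        connected = target in dataset_links or bool(dataset_links & target_links)
        scores[dataset] = 0.5 * overlap + 0.5 * (1.0 if connected else 0.0)
    return sorted(scores, key=scores.get, reverse=True)
```

In practice the vocabulary and link sets could be taken from dataset metadata such as VoID descriptions or DataHub entries; the weights would need tuning against known linksets.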
Conclusions
 Roughly 60% MAP for both approaches
 Future work: quantity of links, more
remote links, extraction of dataset links
rather than data from DataHub
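For readers unfamiliar with the metric, mean average precision (MAP) over a set of ranked recommendation lists can be computed as in this sketch. The function names and input shapes are assumptions for illustration.

```python
def average_precision(ranked, relevant):
    """Average of precision values at each rank where a relevant item occurs."""
    hits, total = 0, 0.0
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """runs: list of (ranked list, set of relevant items), one per target dataset."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

Here each run would correspond to one target dataset t, with the relevant set being the datasets that t is actually linked to.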
Lopes, G.R., Paes Leme, L.A.P., Nunes, B.P., Casanova, M.A.,
Dietze, S., Recommending Tripleset Interlinking through a
Social Network Approach, The 14th International Conference
on Web Information System Engineering (WISE 2013),
Nanjing, China, 2013.
Paes Leme, L. A. P., Lopes, G. R., Nunes, B. P., Casanova,
M.A., Dietze, S., Identifying candidate datasets for data
interlinking, in Proceedings of the 13th International
Conference on Web Engineering, (2013).
Rank
1 DBLP
2 ACM
3 OAI
4 CiteSeer
5 IBM
6 Roma
7 IEEE
8 Ulm
9 Pisa
Success models:
data & applications
 LinkedUp Challenge
to identify innovative
tools & applications
 Evaluation methods
and approaches
“LinkedUp” – Linking Web Data (for Education)
Data linking & curation
Technology transfer
& community-building
 Collecting & exposing open
data
=> LinkedUp Data Catalog
 Profiling and linking of Web
Data for education
=> educational data graph
[ESWC2013], [ISWC2013],
 Disseminating knowledge &
building communities
(educators, computer
scientists, data engineers)
 Gathering stakeholder
feedback: use cases, and
requirements
http://linkedup-challenge.org/#usecases
http://linkedup-project.eu/events
http://www.linkedup-challenge.org/
http://data.linkededucation.org
European support action to
advance take-up of open
data & related technologies
http://www.linkedup-project.eu
Who we are
LinkedUp Network
LinkedUp Consortium
LinkedUp Advisory Board
LinkedUp Challenge: using open data (for learning)
 Open data competition to promote tools and applications that analyse/integrate (Linked)
Web data
 Organised by the LinkedUp project over 2 years (“Veni”, “Vidi”, “Vici”) with 40,000 EUR in awards
 Veni Competition: 22 submissions, 8 shortlisted for presentation at the Open Knowledge
Conference (17 September, Geneva, Switzerland)
http://linkedup-challenge.org
Veni (May – September 2013)
 Open track only
 Final events at OKCon 2013
(September 2013, Geneva)
Vidi (October 2013 – May 2014)
 Open & focused track(s)
 Final events at ESWC2014
(May, Crete)
Vici (May 2014 – October 2014)
 Open track & focused tracks
 Submission details and calls to be
released soon
 Final events at ISWC2014
(October, Riva del Garda, Italy)
The Veni shortlist & winners
DataConf.
KnowNodes
Mismuseos
ReCredible
YourHistory
PoliMedia: 1st prize (http://www.polimedia.nl/)
GlobeTown: 2nd prize (http://www.globe-town.org/)
WeShare: 3rd prize / people's choice (http://seek.cloud.gsic.tel.uva.es/weshare/)
data.l3s.de – a DataHub for the L3S
Learning Analytics & Knowledge Dataset & Challenge
Facilitating Research on Learning Analytics and EDM
in a nutshell http://lak.linkededucation.org/
LAK Dataset (450 publications in RDF/R)
 ACM International Conference on Learning Analytics and
Knowledge (LAK) (2011-13)
 International Conference on Educational Data Mining (2008-13)
 Journal of Educational Data Mining (2008-12)
LAK Data Challenge
 Analyse, explore and correlate the LAK Dataset
 At ACM LAK 2014 (April 2014, Indianapolis)
KEYSTONE COST ACTION
http://www.keystone-cost.eu/
 Research network focused on distributed search and
dataset profiling, spanning Semantic Web, databases, etc.
 Running 2013-2017
 WG1: Representation of structured data sources
 WG2: Keyword search
 WG3: User interaction and query interpretation
 WG4: Research integration, showcases,
benchmarks, and evaluations
 Open to new members (even beyond Europe)
 Joint workshops (eg PROFILES2014 @ ESWC2014)
Ongoing/future work … and some upcoming events
Linked Data evolution, preservation, consistency
 In RDF graphs (eg LOD Cloud), „all“ nodes are connected
 LD preservation: which datasets to preserve (direct links
or even more distant neighbours)?
=> semantic relatedness as guidance for scalable
preservation strategies /data enrichment
 Link correctness in evolving LD
 Investigating impact of changes on link correctness
(weekly LOD crawls over 1 year time span)
 Application: informed preservation strategies
 Conflict detection and LD quality (link quality, impact of
conflicts in distant nodes)
 PROFILES workshop @ ESWC2014
(http://keystone-cost.eu/profiles2014)
 26 May 2014, Crete, Greece
 Linking User Data 2014 at UMAP2014
(http://liud.linkededucation.org)
 Deadline: 1 April
 Online Learning & LD Tutorial at WWW2014
(http://www2014.kr/)
 07 April, Seoul
Thank you!
WWW
See also (general)
 http://linkedup-project.eu
 http://linkededucation.org
 http://data.l3s.de
http://purl.org/dietze
See also (data)
 http://data.linkededucation.org
 http://data.linkededucation.org/linkedup/catalog/
 http://lak.linkededucation.org
 Besnik Fetahu (L3S)
 Bernardo Pereira Nunes (PUC Rio)
 Marco Casanova (PUC Rio)
 Luiz Andre Paes Leme (PUC Rio)
 Giseli Lopes (PUC Rio)
 Davide Taibi (CNR, IT)
 Mathieu d’Aquin (Open University, UK)
 and many more…
Acknowledgements

More Related Content

What's hot (20)

PDF
Beyond Meta-Data: Nano-Publications Recording Scientific Endeavour
KNOWeSCAPE2014
 
PPT
Participatory Web
University of Edinburgh
 
PPT
User Engagement in Research Data Curation
University of Edinburgh
 
PPT
Collaboration to Curation: The High Rise Project meets Edinburgh DataShare
University of Edinburgh
 
PDF
Semantic Web / Linked Data Technologies
Mathieu d'Aquin
 
PDF
Geospatial Metadata and Spatial Data: It's all Greek to me!
EDINA, University of Edinburgh
 
PPT
Scott Edmunds at OASP Asia: Open (and Big) Data – the next challenge
GigaScience, BGI Hong Kong
 
PPTX
Describing Scholarly Contributions semantically with the Open Research Knowle...
Sören Auer
 
PDF
WDAqua ITN – Answering Questions using Web Data
Christoph Lange
 
PDF
Linking HPC to Data Management - EUDAT Summer School (Giuseppe Fiameni, CINECA)
EUDAT
 
PPTX
Cognitive data
Sören Auer
 
PPTX
Data Management Planning at the DCC: a human factor
Martin Donnelly
 
PPTX
EDF2014: Vedran Sabol, Head of the Knowledge Visualisation Area, Know-Center,...
European Data Forum
 
PDF
Data management plans – EUDAT Best practices and case study | www.eudat.eu
EUDAT
 
PPTX
Interpreting Data Mining Results with Linked Data for Learning Analytics
Mathieu d'Aquin
 
PPT
Geospatial Metadata Workshop
EDINA, University of Edinburgh
 
PPT
Glasgow University Geo Metadata Workshop
EDINA, University of Edinburgh
 
PPTX
Towards an Open Research Knowledge Graph
Sören Auer
 
PDF
Long-term data curation, aka data preservation - EUDAT Summer School (Marjan ...
EUDAT
 
PPTX
Frankfurt Big Data Lab & Refugee Projeect
Goethe Univeristy
 
Beyond Meta-Data: Nano-Publications Recording Scientific Endeavour
KNOWeSCAPE2014
 
Participatory Web
University of Edinburgh
 
User Engagement in Research Data Curation
University of Edinburgh
 
Collaboration to Curation: The High Rise Project meets Edinburgh DataShare
University of Edinburgh
 
Semantic Web / Linked Data Technologies
Mathieu d'Aquin
 
Geospatial Metadata and Spatial Data: It's all Greek to me!
EDINA, University of Edinburgh
 
Scott Edmunds at OASP Asia: Open (and Big) Data – the next challenge
GigaScience, BGI Hong Kong
 
Describing Scholarly Contributions semantically with the Open Research Knowle...
Sören Auer
 
WDAqua ITN – Answering Questions using Web Data
Christoph Lange
 
Linking HPC to Data Management - EUDAT Summer School (Giuseppe Fiameni, CINECA)
EUDAT
 
Cognitive data
Sören Auer
 
Data Management Planning at the DCC: a human factor
Martin Donnelly
 
EDF2014: Vedran Sabol, Head of the Knowledge Visualisation Area, Know-Center,...
European Data Forum
 
Data management plans – EUDAT Best practices and case study | www.eudat.eu
EUDAT
 
Interpreting Data Mining Results with Linked Data for Learning Analytics
Mathieu d'Aquin
 
Geospatial Metadata Workshop
EDINA, University of Edinburgh
 
Glasgow University Geo Metadata Workshop
EDINA, University of Edinburgh
 
Towards an Open Research Knowledge Graph
Sören Auer
 
Long-term data curation, aka data preservation - EUDAT Summer School (Marjan ...
EUDAT
 
Frankfurt Big Data Lab & Refugee Projeect
Goethe Univeristy
 

Viewers also liked (13)

PPTX
Presentation nokobit
netsoxx
 
PDF
A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles
Besnik Fetahu
 
PPT
DURAARK at Bibliotheksymposium Wildau
panitzm
 
PDF
DURAARK presentation at DEDICATE final seminar, October 21st 2013, Michelle L...
lindlar
 
PDF
Quality criteria for architectural 3D data in usage and preservation processes
lindlar
 
PDF
A Domain-driven Approach to Digital Curation and Preservation of 3D Architect...
lindlar
 
PDF
Presentation of the DURAARK project at Ex Libris conference, Berlin, Germany.
Lena Lindbäck
 
PPTX
DURAARK presentation CIB W78 "Applications of IT in AEC" conference Beijing 2...
Jakob Beetz
 
PDF
Towards preservation of semantically enriched architectural knowledge
Stefan Dietze
 
PPT
DURAARK at IGeLU 2014
panitzm
 
PDF
Grapp2014 presentation
netsoxx
 
PPT
DURAARK at AUdS 2015
panitzm
 
PPT
Preservation of 3 d objects of buildings
netsoxx
 
Presentation nokobit
netsoxx
 
A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles
Besnik Fetahu
 
DURAARK at Bibliotheksymposium Wildau
panitzm
 
DURAARK presentation at DEDICATE final seminar, October 21st 2013, Michelle L...
lindlar
 
Quality criteria for architectural 3D data in usage and preservation processes
lindlar
 
A Domain-driven Approach to Digital Curation and Preservation of 3D Architect...
lindlar
 
Presentation of the DURAARK project at Ex Libris conference, Berlin, Germany.
Lena Lindbäck
 
DURAARK presentation CIB W78 "Applications of IT in AEC" conference Beijing 2...
Jakob Beetz
 
Towards preservation of semantically enriched architectural knowledge
Stefan Dietze
 
DURAARK at IGeLU 2014
panitzm
 
Grapp2014 presentation
netsoxx
 
DURAARK at AUdS 2015
panitzm
 
Preservation of 3 d objects of buildings
netsoxx
 
Ad

Similar to What's all the data about? - Linking and Profiling of Linked Datasets (20)

PDF
From Data to Knowledge - Profiling & Interlinking Web Datasets
Stefan Dietze
 
PDF
Web Science Synergies: Exploring Web Knowledge through the Semantic Web
Stefan Dietze
 
PDF
Beyond Linked Data - Exploiting Entity-Centric Knowledge on the Web
Stefan Dietze
 
PDF
Open Data Dialog 2013 - Linked Data in Education
Stefan Dietze
 
PDF
Semantic Linking & Retrieval for Digital Libraries
Stefan Dietze
 
PPT
euclid_linkedup WWW tutorial (Besnik Fetahu)
Besnik Fetahu
 
PDF
Intertwingularity, Semantic Web and linked Geo data
Dan Brickley
 
PDF
Open Education Challenge 2014: exploiting Linked Data in Educational Applicat...
Stefan Dietze
 
PDF
WWW2014 Tutorial: Online Learning & Linked Data - Lessons Learned
Stefan Dietze
 
PPTX
Linked data 20171106
Synaptica, LLC
 
PDF
Interlinking educational data to Web of Data (Thesis presentation)
Enayat Rajabi
 
PDF
Hide the Stack: Toward Usable Linked Data
aba-sah
 
PPSX
Linked Data to Improve the OER Experience
The Open Education Consortium
 
PPTX
Why do they call it Linked Data when they want to say...?
Oscar Corcho
 
PPTX
Linked dataresearch
Tope Omitola
 
PPTX
Get on the Linked Data Web!
Armin Haller
 
PDF
Linked Data for Architecture, Engineering and Construction (AEC)
Stefan Dietze
 
PDF
ITWS Capstone (RPI, Fall 2013)
Rensselaer Polytechnic Institute
 
PPT
In search of lost knowledge: joining the dots with Linked Data
jonblower
 
PPTX
Intro to the semantic web (for libraries)
robin fay
 
From Data to Knowledge - Profiling & Interlinking Web Datasets
Stefan Dietze
 
Web Science Synergies: Exploring Web Knowledge through the Semantic Web
Stefan Dietze
 
Beyond Linked Data - Exploiting Entity-Centric Knowledge on the Web
Stefan Dietze
 
Open Data Dialog 2013 - Linked Data in Education
Stefan Dietze
 
Semantic Linking & Retrieval for Digital Libraries
Stefan Dietze
 
euclid_linkedup WWW tutorial (Besnik Fetahu)
Besnik Fetahu
 
Intertwingularity, Semantic Web and linked Geo data
Dan Brickley
 
Open Education Challenge 2014: exploiting Linked Data in Educational Applicat...
Stefan Dietze
 
WWW2014 Tutorial: Online Learning & Linked Data - Lessons Learned
Stefan Dietze
 
Linked data 20171106
Synaptica, LLC
 
Interlinking educational data to Web of Data (Thesis presentation)
Enayat Rajabi
 
Hide the Stack: Toward Usable Linked Data
aba-sah
 
Linked Data to Improve the OER Experience
The Open Education Consortium
 
Why do they call it Linked Data when they want to say...?
Oscar Corcho
 
Linked dataresearch
Tope Omitola
 
Get on the Linked Data Web!
Armin Haller
 
Linked Data for Architecture, Engineering and Construction (AEC)
Stefan Dietze
 
ITWS Capstone (RPI, Fall 2013)
Rensselaer Polytechnic Institute
 
In search of lost knowledge: joining the dots with Linked Data
jonblower
 
Intro to the semantic web (for libraries)
robin fay
 
Ad

More from Stefan Dietze (20)

PDF
Understanding Scientific and Societal Adoption and Impact of Science Through ...
Stefan Dietze
 
PDF
NEWORDER Project - Science in the online knowledge order
Stefan Dietze
 
PDF
Collecting & Temporal Analysis of Behavioral Web Data - Tales From The Inside
Stefan Dietze
 
PDF
AI in between online and offline discourse - and what has ChatGPT to do with ...
Stefan Dietze
 
PDF
An interdisciplinary journey with the SAL spaceship – results and challenges ...
Stefan Dietze
 
PDF
Research Knowledge Graphs at NFDI4DS & GESIS
Stefan Dietze
 
PDF
Research Knowledge Graphs at GESIS & NFDI4DataScience
Stefan Dietze
 
PDF
Human-in-the-loop: the Web as Foundation for interdisciplinary Data Science M...
Stefan Dietze
 
PDF
Human-in-the-Loop: das Web als Grundlage interdisziplinärer Data Science Meth...
Stefan Dietze
 
PDF
Towards research data knowledge graphs
Stefan Dietze
 
PDF
Beyond research data infrastructures: exploiting artificial & crowd intellige...
Stefan Dietze
 
PDF
From Web Data to Knowledge: on the Complementarity of Human and Artificial In...
Stefan Dietze
 
PDF
Using AI to understand everyday learning on the Web
Stefan Dietze
 
PDF
Analysing User Knowledge, Competence and Learning during Online Activities
Stefan Dietze
 
PDF
Analysing & Improving Learning Resources Markup on the Web
Stefan Dietze
 
PDF
Big Data in Learning Analytics - Analytics for Everyday Learning
Stefan Dietze
 
PDF
Mining and Understanding Activities and Resources on the Web
Stefan Dietze
 
PDF
Towards embedded Markup of Learning Resources on the Web
Stefan Dietze
 
PDF
Dietze linked data-vr-es
Stefan Dietze
 
PDF
LinkedUp - Linked Data Europe Workshop 2014
Stefan Dietze
 
Understanding Scientific and Societal Adoption and Impact of Science Through ...
Stefan Dietze
 
NEWORDER Project - Science in the online knowledge order
Stefan Dietze
 
Collecting & Temporal Analysis of Behavioral Web Data - Tales From The Inside
Stefan Dietze
 
AI in between online and offline discourse - and what has ChatGPT to do with ...
Stefan Dietze
 
An interdisciplinary journey with the SAL spaceship – results and challenges ...
Stefan Dietze
 
Research Knowledge Graphs at NFDI4DS & GESIS
Stefan Dietze
 
Research Knowledge Graphs at GESIS & NFDI4DataScience
Stefan Dietze
 
Human-in-the-loop: the Web as Foundation for interdisciplinary Data Science M...
Stefan Dietze
 
Human-in-the-Loop: das Web als Grundlage interdisziplinärer Data Science Meth...
Stefan Dietze
 
Towards research data knowledge graphs
Stefan Dietze
 
Beyond research data infrastructures: exploiting artificial & crowd intellige...
Stefan Dietze
 
From Web Data to Knowledge: on the Complementarity of Human and Artificial In...
Stefan Dietze
 
Using AI to understand everyday learning on the Web
Stefan Dietze
 
Analysing User Knowledge, Competence and Learning during Online Activities
Stefan Dietze
 
Analysing & Improving Learning Resources Markup on the Web
Stefan Dietze
 
Big Data in Learning Analytics - Analytics for Everyday Learning
Stefan Dietze
 
Mining and Understanding Activities and Resources on the Web
Stefan Dietze
 
Towards embedded Markup of Learning Resources on the Web
Stefan Dietze
 
Dietze linked data-vr-es
Stefan Dietze
 
LinkedUp - Linked Data Europe Workshop 2014
Stefan Dietze
 


What's all the data about? - Linking and Profiling of Linked Datasets

  • 1. What's all the data about – profiling and interlinking Web datasets. Stefan Dietze, L3S Research Center, 27/03/14
  • 2. Introduction. Recent work on Linked Data exploration/discovery/search: entity interlinking & dataset interlinking recommendation; dataset profiling; data consistency & conflicts. Research areas: Web science, Information Retrieval, Semantic Web & Linked Data, data & knowledge integration (mapping, classification, interlinking); application domains: education/TEL, Web archiving, … Some projects: http://www.l3s.de/ See also: http://purl.org/dietze
  • 3. Linked Data is awesome, but... why are there so few datasets actually used? Data reuse and in-links are focused on trusted „reference graphs“ such as DBpedia, Freebase etc.; there is a long tail of LD datasets which are neither reused nor linked to (LOD Cloud alone: 300+ datasets, 50 bn triples). Explanations? „HTTP-accessibility“ (SPARQL, URI-dereferencing); „Structure“ & „Semantics“ (=> shared/linked vocabularies); „Interlinked“; „Persistent“. Hm, really?
  • 4. Linked data is more diverse than we think. Accessibility of datasets? Less than 50% of all SPARQL endpoints are actually responsive at any given point in time (figure: SPARQL endpoint availability over time [Buil-Aranda et al 2013]); “THE” SPARQL protocol? No, but many variants & subsets; … Shared vocabularies & schemas, but: still very heterogeneous [d’Aquin, WebSci13]; data partially messy and not conformant (RDFS, schemas) [HoganJWS2012]; even widely used reference datasets such as DBpedia are noisy [Paulheim2013]. Figure: co-occurrence graph of data types in 146 datasets: 144 vocabularies, 588 highly overlapping types, 719 properties. References: SPARQL Web-Querying Infrastructure: Ready for Action?, Buil-Aranda, C., Hogan, A., Umbrich, J., Vandenbussche, P.-Y., International Semantic Web Conference 2013 (ISWC2013). Assessing the Educational Linked Data Landscape, D’Aquin, M., Adamou, A., Dietze, S., ACM Web Science 2013 (WebSci2013), Paris, France, May 2013. Type Inference on Noisy RDF Data, Paulheim, H., Bizer, C., Semantic Web – ISWC 2013, Lecture Notes in Computer Science Volume 8218, 2013, pp. 510-525. An empirical survey of Linked Data conformance, Hogan, A., Umbrich, J., Harth, A., Cyganiak, R., Polleres, A., Decker, S., Journal of Web Semantics 14: pp. 14–44, 2012.
  • 5. What about data consistency? Inconsistency and Incompleteness of Linked Datasets – a Case Study, Yuan, W., Demidova, E., Dietze, S., Zhu, X., Web Science 2014 (WebSci14), under review.
  • 6. Too many/diverse datasets, too little information. Which datasets are useful & trustworthy for case XY (e.g. „learning about the solar system“)? Which topics are covered? Types: which datasets describe statistics, videos, slides, publications etc.? Currentness, dynamics, accessibility/reliability, data quantity & quality?
  • 7. Data curation and dataset profiling. Dataset Catalog/Registry (describes the datasets): catalog of data: classification of datasets according to resource types, disciplines/topics, data quality, accessibility, etc.; infrastructure for distributed/federated querying. Which datasets are useful & trustworthy for case XY (e.g. „learning about the solar system“)? Which topics are covered? Types: which datasets describe statistics, videos, slides, publications etc.? Currentness, dynamics, accessibility/reliability, data quantity & quality?
  • 8. Dataset profiling: what’s all the data about. Dataset metadata contained in the Dataset Catalog/Registry: schema mappings [WebSci13] (diagram labels: BIBO, AAISO, FOAF; po:Programme, yov:Video, bibo:Film); entity disambiguation & linking [ESWC13]; topic profile extraction [WWW13, ESWC14] (diagram labels: db:Astronomy, db:Astro. Objects). BBC Programme: <po:Programme …> <po:Series>Wonders of the Solar System</.> <po:Actor>Brian Cox</…> </po:Programme…> Yovisto Video: <yo:Video …> <dc:title>Pluto & the Dwarf Planets</dc:title> … </yo:Video…>
  • 9. Schemas/vocabularies on the Web: XKCD 927. https://xkcd.com/927/
  • 10. Schema assessment and mapping. Co-occurrence of data types (in 146 datasets: 144 vocabularies, 588 highly overlapping types, 719 properties). po:Programme vs sioc:Item? BBC Programme: <po:Programme …> <po:title>Secret Universe – The Life of the Cell</po:title> … </po:Programme…> SlideShare Set: <sioc:Item …> <title>Viral diseases & bacteria</title> … </sioc:Item ….> http://datahub.io/group/linked-education Reference: Assessing the Educational Linked Data Landscape, D’Aquin, M., Adamou, A., Dietze, S., ACM Web Science 2013 (WebSci2013), Paris, France, May 2013.
  • 11. Schema assessment and mapping. Co-occurrence of data types (in 146 datasets: 144 vocabularies, 588 highly overlapping types, 719 properties). Co-occurrence after mapping into most frequent schemas (201 frequent types mapped into 79 classes); diagram labels: bibo:Slideshow, bibo:Film, bibo:Document, po:Programme, sioc:Item. BBC Programme: <po:Programme …> <po:title>Secret Universe – The Life of the Cell</po:title> … </po:Programme…> SlideShare Set: <sioc:Item …> <title>Viral diseases & bacteria</title> … </sioc:Item ….> Reference: Assessing the Educational Linked Data Landscape, D’Aquin, M., Adamou, A., Dietze, S., ACM Web Science 2013 (WebSci2013), Paris, France, May 2013.
  • 12. LinkedUp Data Catalog in a nutshell. http://data.linkededucation.org/linkedup/catalog/ http://datahub.io/group/linked-education RDF (VoID) dataset catalog: browse & query distributed datasets; live information about endpoint accessibility; federated queries using type mappings.
  • 13. Challenge: semantics of resources/datasets? Topics/categories addressed? Relatedness of resources/entities? (types, semantics) Video: <yo:Video 8748720> <dc:title>Pluto & the Dwarf Planets</dc:title> … </yo:Video 8748720> Slideset: <sioc:Item 2139393292> <title>Planetary motion & gravity</title> … </sioc:Item 2139393292> Programme: <po:Programme519215> <po:Series>Wonders of the Solar System</po:Series> <po:Episode>Emp. of the Sun</po:Episode> <po:Actor>Brian Cox</po:Actor> </po:Programme519215 > References: Combining a co-occurrence-based and a semantic measure for entity linking, B. P. Nunes, S. Dietze, M.A. Casanova, R. Kawase, B. Fetahu, and W. Nejdl, ESWC 2013 - 10th Extended Semantic Web Conference (May 2013). A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles, Fetahu, B., Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl, W., 11th Extended Semantic Web Conference (ESWC2014), Crete, Greece (2014).
  • 14. Data disambiguation (for linking & profiling). Brian Cox? Sun? Pluto? Video: <yo:Video 8748720> <dc:title>Pluto & the Dwarf Planets</dc:title> … </yo:Video 8748720> Programme: <po:Programme519215> <po:Series>Wonders of the Solar System</po:Series> <po:Episode>Emp. of the Sun</po:Episode> <po:Actor>Brian Cox</po:Actor> </po:Programme519215 >
  • 15. Data disambiguation using background knowledge. „Semantic relatedness“ of resources? Background entities: db:Pluto (Dwarf Planet), db:Astronomical Objects, db:Sun, db:Astronomy. Programme: <po:Programme519215> <po:Series>Wonders of the Solar System</po:Series> <po:Episode>Emp. of the Sun</po:Episode> <po:Actor>Brian Cox</po:Actor> </po:Programme519215 > Slideset: <sioc:Item 2139393292> <title>Planetary motion & gravity</title> … </sioc:Item 2139393292> Video: <yo:Video 8748720> <dc:title>Pluto & the Dwarf Planets</dc:title> … </yo:Video 8748720>
  • 16. Entity linking: semantic relatedness. Computation of connectivity scores between resources/entities. Method: combination of (i) a semantic (graph-based) connectivity score (SCS) with (ii) a Web co-occurrence-based measure (CBM) (similar to NGD, the Normalised Google Distance). For (i): adaptation of the Katz index from SNA for (linked) data graphs (considering path number and path lengths of transversal properties). Example from the diagram (db:Pluto (Dwarf Planet), db:Sun, db:Astronomical Objects, db:Astronomy): SCS = 0.32, CBM = 0.24. http://purl.org/vol/doc/ http://purl.org/vol/ns/ Online Lecture: <yov:Lecture8748720> <title>Pluto & the Dwarf Planets</title> … </yov:Lecture8748720> Slideset: <sioc:Item 2139393292> <title>Planetary motion & gravity</title> … </sioc:Item 2139393292> Programme: <po:Programme519215> <po:Series>Wonders of the Solar System</po:Series> <po:Episode>Emp. of the Sun</po:Episode> <po:Actor>Brian Cox</po:Actor> </po:Programme519215 > Reference: Combining a co-occurrence-based and a semantic measure for entity linking, B. P. Nunes, S. Dietze, M.A. Casanova, R. Kawase, B. Fetahu, and W. Nejdl, ESWC 2013 - 10th Extended Semantic Web Conference (May 2013).
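The Katz-style connectivity score mentioned on slide 16 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy graph, the damping factor `beta`, the cut-off `max_len`, and the choice to count simple paths over an undirected graph are all assumptions made for illustration. The idea it shows is that many short paths between two entities yield a higher score than a few long ones.

```python
from collections import defaultdict

# Hypothetical toy graph (undirected, stored symmetrically); not real DBpedia data.
GRAPH = {
    "Pluto": ["Astronomical_objects", "Dwarf_planets"],
    "Sun": ["Astronomical_objects", "Stars"],
    "Astronomical_objects": ["Pluto", "Sun", "Astronomy"],
    "Dwarf_planets": ["Pluto", "Astronomy"],
    "Stars": ["Sun", "Astronomy"],
    "Astronomy": ["Astronomical_objects", "Dwarf_planets", "Stars"],
}


def count_paths_by_length(graph, source, target, max_len):
    """Count simple paths from source to target, grouped by number of edges."""
    counts = defaultdict(int)

    def walk(node, visited, length):
        if length > max_len:
            return
        if node == target and length > 0:
            counts[length] += 1
            return  # do not continue through the target
        for nxt in graph.get(node, ()):
            if nxt not in visited:
                walk(nxt, visited | {nxt}, length + 1)

    walk(source, {source}, 0)
    return dict(counts)


def katz_score(graph, source, target, beta=0.5, max_len=4):
    """Katz-style score: sum over path lengths l of beta**l * (#paths of length l)."""
    paths = count_paths_by_length(graph, source, target, max_len)
    return sum(beta ** length * n for length, n in paths.items())


if __name__ == "__main__":
    print(count_paths_by_length(GRAPH, "Pluto", "Sun", 4))
    print(katz_score(GRAPH, "Pluto", "Sun"))
```

In this toy graph the single two-hop path via `Astronomical_objects` contributes more than each of the longer paths, because every additional hop is damped by `beta`.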
  • 17. Entity linking: evaluation. Evaluation based on USA Today news items (80,000 entity pairs); manually created gold standard (1,000 entity pairs); baseline: Explicit Semantic Analysis (ESA) => CBM/SCS: „relatedness“; ESA: „similarity“. Figure: Precision/Recall/F1 for SCS, CBM, ESA. Reference: Combining a co-occurrence-based and a semantic measure for entity linking, B. P. Nunes, S. Dietze, M.A. Casanova, R. Kawase, B. Fetahu, and W. Nejdl, ESWC 2013 - 10th Extended Semantic Web Conference (May 2013).
  • 18. Dataset profiling: what‘s the data about? Extracting representative metadata („topic profile“) for each dataset; ranking of most representative (DBpedia) categories (= topics), applied to all responsive LOD datasets; scalability vs representativeness: sampling & ranking for a good scalability/accuracy balance. Diagram: DBpedia category graph (db:Astronomical Objects, db:Astronomy, db:Sun). Programme: <po:Programme519215> <po:Series>Wonders of the Solar System</po:Series> <po:Episode>Emp. of the Sun</po:Episode> <po:Actor>Brian Cox</po:Actor> </po:Programme519215 > Reference: A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles, Fetahu, B., Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl, W., 11th Extended Semantic Web Conference (ESWC2014), Crete, Greece (2014).
  • 19. Dataset profiling: approach. (1) Sampling of resource instances (random sampling, weighted sampling, resource centrality sampling). (2) Entity and topic extraction (NER via DBpedia Spotlight, category mapping and expansion). (3) Normalisation and ranking (using graphical models such as PageRank with Priors, HITS with Priors and K-Step Markov). => Result: weighted dataset-topic profile graph. Reference: A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles, Fetahu, B., Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl, W., 11th Extended Semantic Web Conference (ESWC2014), Crete, Greece (2014).
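The ranking step on slide 19 names PageRank with Priors among other models. The following is a minimal sketch of a generic PageRank-with-priors power iteration over a toy dataset → entity → category graph; it is not the authors' pipeline. The edge list, node names, damping factor, iteration count, and the handling of dangling nodes (their mass is returned to the prior) are assumptions for illustration only.

```python
# Hypothetical toy profile graph: one dataset, three extracted entities, their categories.
EDGES = [
    ("dataset", "db:Pluto"),
    ("dataset", "db:Sun"),
    ("dataset", "db:Brian_Cox"),
    ("db:Pluto", "Category:Dwarf_planets"),
    ("db:Pluto", "Category:Astronomical_objects"),
    ("db:Sun", "Category:Astronomical_objects"),
    ("db:Sun", "Category:Stars"),
    ("db:Brian_Cox", "Category:Physicists"),
    ("db:Brian_Cox", "Category:Presenters"),
]


def pagerank_with_priors(edges, priors, damping=0.85, iterations=50):
    """Power iteration where random jumps follow the prior instead of a uniform vector."""
    nodes = sorted({n for edge in edges for n in edge} | set(priors))
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    total_prior = sum(priors.values())
    prior = {n: priors.get(n, 0.0) / total_prior for n in nodes}
    rank = dict(prior)

    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * prior[n] for n in nodes}
        for n in nodes:
            targets = out_links[n]
            if targets:
                share = damping * rank[n] / len(targets)
                for t in targets:
                    nxt[t] += share
            else:
                # Dangling node: hand its mass back according to the prior.
                for m in nodes:
                    nxt[m] += damping * rank[n] * prior[m]
        rank = nxt
    return rank


if __name__ == "__main__":
    ranks = pagerank_with_priors(EDGES, {"dataset": 1.0})
    categories = [n for n in ranks if n.startswith("Category:")]
    for name in sorted(categories, key=ranks.get, reverse=True):
        print(f"{name}: {ranks[name]:.4f}")
```

Placing the whole prior on the dataset node biases the ranking towards categories reachable from that dataset's entities, which is one way to read "topic profile for a dataset"; categories shared by several entities end up ranked highest.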
  • 20. Dataset profiling: exploring LOD datasets/topics in a nutshell. http://data-observatory.org/lod-profiles/ Automatic extraction of dataset “topics” [ESWC2014]; visualisation & exploration of the dataset-topic graph (datasets, topics, relationships); includes all (responsive) datasets of the LOD Cloud.
  • 21. Dataset profiling: results evaluation. Datasets & ground truth: Yovisto, Oxpoints, LAK Dataset, Semantic Web Dogfood; crowd-sourced topic indicators from datasets (keywords, tags); manual mapping to entities & category extraction (ranking according to frequency). Baselines: 1) LDA, 2) tf/idf (applied to entire datasets); topic extraction according to our approach, weighting/ranking based on term weight. Measure: NDCG @ rank l; performance (time/NDCG) for different sampling strategies/sizes etc. Figure: NDCG (averaged over all datasets).
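For readers unfamiliar with the measure on slide 21, here is a minimal sketch of NDCG at rank `k` in Python. It uses the common linear-gain form, DCG = Σ rel_i / log2(i + 1); other variants (for example with gain 2^rel − 1) exist, and the slides do not say which one was used. The ideal ranking is derived from the same relevance list, which assumes that list covers all judged items.

```python
import math


def dcg(relevances, k):
    """Discounted cumulative gain over the first k relevance grades."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg(ranked_relevances, k):
    """DCG of the given ranking divided by the DCG of the ideal (sorted) ranking."""
    ideal = dcg(sorted(ranked_relevances, reverse=True), k)
    return dcg(ranked_relevances, k) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # Relevance grades of the topics in the order a profiler ranked them (made-up values).
    print(ndcg([3, 2, 1], k=3))
    print(ndcg([1, 3, 2], k=3))
```

A ranking that already lists the most relevant topics first scores 1.0; swapping relevant topics further down lowers the score.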
  • 23. Diversity of category profile for a single paper: Berners-Lee, Tim; Hendler, James; Lassila, Ora (2001). "The Semantic Web". Scientific American Magazine. Resources: dbp:Tim_Berners-Lee (person), dbp:Semantic_Web (document). First-level categories (dcterms:subject): dbp:Category:1955_births, dbp:Category:People_from_London, dbp:Category:Buzzwords, dbp:Category:Semantic_Web, dbp:Category:Web_Services, dbp:Category:HTTP, dbp:Category:Unitarian_Universalists, dbp:Category:World_Wide_Web, dbp:Category:Royal_Medal_winners
  • 24. Dataset profiling: some lessons learned. The DBpedia category graph is not an ideal “topic” vocabulary: broad and noisy; “categories” vs “topics” (for capturing disciplines, thesauri like UMBEL or UNESCO Thesaurus seem better suited); hierarchy? Filtering of certain partitions of the category graph (too generic categories etc.). Mixing categories across resource types (document, person) creates “perceived noise”. But: broadness is useful as a general vocabulary for categorisation of all sorts of resource types.
  • 25. Type-specific exploration of dataset categories. http://data-observatory.org/led-explorer/ Type-specific views on datasets/categories: “Document” (foaf:document), “Person” (foaf:person), “Course” (aaiso:course). Currently applied to datasets in the LinkedUp Catalog only (as schema mappings are already available there).
  • 26. Dataset interlinking recommendation. Candidate datasets for interlinking? (Diagram labels: t, Linkset1, Linkset2.) Problem: given dataset t, rank datasets from D according to a probability score (di, t) of containing linking candidates (entities). Features: vocabulary overlap; existing links (SNA). Datasets are more likely to contain linking candidates if they (a) share common schema elements, or (b) already link to t or to datasets t links to (friend of a friend). Example ranking: 1 DBLP, 2 ACM, 3 OAI, 4 CiteSeer, 5 IBM, 6 Roma, 7 IEEE, 8 Ulm, 9 Pisa. Conclusions: roughly 60% MAP for both approaches; future work: quantity of links, more remote links, extraction of dataset links rather than data from DataHub. References: Lopes, G.R., Paes Leme, L.A.P., Nunes, B.P., Casanova, M.A., Dietze, S., Recommending Tripleset Interlinking through a Social Network Approach, The 14th International Conference on Web Information System Engineering (WISE 2013), Nanjing, China, 2013. Paes Leme, L. A. P., Lopes, G. R., Nunes, B. P., Casanova, M.A., Dietze, S., Identifying candidate datasets for data interlinking, in Proceedings of the 13th International Conference on Web Engineering (2013).
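The two features named on slide 26 can be sketched in a few lines of Python. This is a deliberately simple stand-in, not the probabilistic ranking of the cited papers: Jaccard overlap is assumed as the vocabulary-overlap measure, and the dataset names (`t`, `d1`, `d2`, `d3`), vocabulary sets and link sets are hypothetical.

```python
def jaccard(a, b):
    """Jaccard overlap of two vocabulary sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_candidates(target, vocabularies):
    """Feature (a): rank all other datasets by vocabulary overlap with the target."""
    target_vocab = vocabularies[target]
    scores = {
        name: jaccard(target_vocab, vocab)
        for name, vocab in vocabularies.items()
        if name != target
    }
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def shares_link_neighbourhood(target, links):
    """Feature (b): datasets that already link to the target or to a dataset it links to."""
    reference = links.get(target, set()) | {target}
    return {d for d, outgoing in links.items() if d != target and outgoing & reference}


# Hypothetical vocabularies used by each dataset and existing dataset-level links.
VOCABS = {
    "t": {"foaf", "dcterms", "bibo"},
    "d1": {"foaf", "dcterms", "bibo", "swrc"},
    "d2": {"foaf", "geo"},
    "d3": {"qb", "sdmx"},
}
LINKS = {"t": {"d2"}, "d1": {"d2"}, "d2": set(), "d3": {"d1"}}

if __name__ == "__main__":
    print(rank_candidates("t", VOCABS))
    print(shares_link_neighbourhood("t", LINKS))
```

Here `d1` ranks first on vocabulary overlap and is also the only dataset sharing a link target with `t`, so both features point to the same candidate; in practice the two signals would need to be combined and evaluated, for example with MAP as on the slide.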
  • 27. “LinkedUp” – Linking Web Data (for Education). European support action to advance take-up of open data & related technologies. Data linking & curation: collecting & exposing open data => LinkedUp Data Catalog; profiling and linking of Web Data for education => educational data graph [ESWC2013], [ISWC2013]. Success models (data & applications): LinkedUp Challenge to identify innovative tools & applications; evaluation methods and approaches. Technology transfer & community-building: disseminating knowledge & building communities (educators, computer scientists, data engineers); gathering stakeholder feedback: use cases and requirements. Links: http://www.linkedup-project.eu http://data.linkededucation.org http://www.linkedup-challenge.org/ http://linkedup-challenge.org/#usecases http://linkedup-project.eu/events
  • 28. Who we are: LinkedUp Network, LinkedUp Consortium, LinkedUp Advisory Board
  • 29. LinkedUp Challenge: using open data (for learning). Open Data Competition to promote tools and applications that analyse/integrate (Linked) Web data; organised by the LinkedUp project over 2 years (“Veni”, “Vidi”, “Vici”) with 40,000 EUR awards; Veni Competition: 22 submissions, 8 shortlisted for presentation at Open Knowledge Conference (17 September, Geneva, Switzerland). http://linkedup-challenge.org
  • 30. May – September 2013: open track only; final events at OKCon 2013 (September 2013, Geneva). October 2013 – May 2014: open & focused track(s); final events at ESWC2014 (May, Crete). May 2014 – October 2014: open track & focused tracks; submission details and calls to be released soon; final events at ISWC2014 (October, Riva del Garda, Italy).
  • 31. The Veni shortlist & winners. PoliMedia: 1st prize (http://www.polimedia.nl/); GlobeTown: 2nd prize (http://www.globe-town.org/); WeShare: 3rd prize / people‘s choice (http://seek.cloud.gsic.tel.uva.es/weshare/). Also shortlisted: DataConf., KnowNodes, Mismuseos, ReCredible, YourHistory.
  • 32. data.l3s.de – a DataHub for the L3S
  • 33. Learning Analytics & Knowledge Dataset & Challenge: facilitating research on Learning Analytics and EDM, in a nutshell. http://lak.linkededucation.org/ LAK Dataset (450 publications in RDF/R): ACM International Conference on Learning Analytics and Knowledge (LAK) (2011-13); International Conference on Educational Data Mining (2008-13); Journal of Educational Data Mining (2008-12). LAK Data Challenge: analyse, explore, correlate the LAK Dataset; at ACM LAK 2014 (April 2014, Indianapolis).
  • 34. KEYSTONE COST ACTION. http://www.keystone-cost.eu/ Research network focused on distributed search, dataset profiling, to Semantic Web, Databases, etc.; running 2013-2017. WG1: Representation of structured data sources; WG2: Keyword search; WG3: User interaction and query interpretation; WG4: Research integration, showcases, benchmarks, and evaluations. Open to new members (even beyond Europe); joint workshops (e.g. PROFILES2014 @ ESWC2014).
  • 35. Ongoing/future work … and some upcoming events. Linked Data evolution, preservation, consistency: in RDF graphs (e.g. LOD Cloud), „all“ nodes are connected; LD preservation: which datasets to preserve (direct links or even more distant neighbours)? => semantic relatedness as guidance for scalable preservation strategies/data enrichment. Link correctness in evolving LD: investigating the impact of changes on link correctness (weekly LOD crawls over a 1-year time span); application: informed preservation strategies. Conflict detection and LD quality (link quality, impact of conflicts in distant nodes). Upcoming events: PROFILES workshop @ ESWC2014 (http://keystone-cost.eu/profiles2014), 26 May 2014, Crete, Greece; Linking User Data 2014 at UMAP2014 (http://liud.linkededucation.org), deadline: 1 April; Online Learning & LD Tutorial at WWW2014 (http://www2014.kr/), 07 April, Seoul.
  • 36. Thank you! See also (general): http://linkedup-project.eu, http://linkededucation.org, http://data.l3s.de, http://purl.org/dietze. See also (data): http://data.linkededucation.org, http://data.linkededucation.org/linkedup/catalog/, http://lak.linkededucation.org. Acknowledgements: Besnik Fetahu (L3S), Bernardo Pereira Nunes (PUC Rio), Marco Casanova (PUC Rio), Luiz Andre Paes Leme (PUC Rio), Giseli Lopes (PUC Rio), Davide Taibi (CNR, IT), Mathieu d’Aquin (Open University, UK), and many more…