DB Group @ UNIMO 
Fabio Benedetti Sonia Bergamaschi Laura Po 
Department of Engineering “Enzo Ferrari” 
University of Modena & Reggio Emilia 
LD4IE 2014 – Riva Del Garda, Italy 
Online Index Extraction from Linked Open Data Sources 
Dr. Fabio Benedetti
Department of Engineering “Enzo Ferrari”, University of Modena and Reggio Emilia
• Selection of a relevant LOD source 
• Statistical indexes 
• Architecture Overview 
• Performance Evaluation 
• LODeX & Conclusions 
Schmachtenberg, Max, Christian Bizer, and Heiko Paulheim. "Adoption of the Linked Data Best Practices in 
Different Topical Domains." The Semantic Web–ISWC 2014. Springer International Publishing, 2014. 245-260. 
Domain                    2009 Number   2009 %   2014* Number   2014* %
Cross-domain                       41   13.95%             41     4.04%
Geographic                         31   10.54%             21     2.07%
Government                         49   16.67%            183    18.05%
Life sciences                      41   13.95%             83     8.19%
Media                              25    8.50%             22     2.17%
Publications                       87   29.59%             96     9.47%
Social web                          0    0.00%            520    51.28%
User-generated content             20    6.80%             48     4.73%
Total                             294                    1014
*Only 570 datasets belong to the LOD cloud; the remaining datasets do not contain ingoing/outgoing links to the LOD cloud.
1. The documentation of the dataset
– The documentation can be poor or absent
– There is no standard way to provide the documentation
– Sometimes it is provided as an RDF file in XML format
2. Search features of existing catalogs (e.g. Datahub)
– The metadata contain little information
– The search engine uses no information about the structure of the dataset
3. Manual exploration of the dataset
– It requires a good knowledge of the SPARQL language
– It is a time-consuming task
Goal: automatically extract a set of indexes that describe the structure of a LOD dataset
How to describe the dataset
LOD datasets can have different purposes and structures:
• Ontology/Vocabulary (OWL & RDFS constraints)
• Open Data (e.g. generated from an existing RDBMS)
The indexes should maximize the value of the information extracted from heterogeneous datasets
Online & automatic extraction
• It does not require any additional information from the user
• It works with SPARQL endpoints
– We have to handle the performance issues of these datasets
We can think of the entire set of RDF triples as partitioned into:
• Intensional Knowledge
• Extensional Knowledge
The Intensional knowledge
• It contains the RDFS or OWL constraints of the ontology
• It represents the T-Box components of the knowledge base
The Extensional knowledge
• It contains the real-world entities described in the dataset
• It represents the A-Box components of the knowledge base
• Its triples cover most of the dataset
Instantiated classes act as a bridge between the two types of knowledge
[Figure: example RDF graph divided into Intensional Knowledge and Extensional Knowledge, bridged by the instantiated classes.
Intensional part: nodes ex:Sector, ex:Organization, ex:sector, owl:Class, rdf:Property, owl:ObjectProperty and the label "sector"; edges rdf:type, rdfs:domain, rdfs:range and rdf:label.
Extensional part: nodes organization1, organization2, sector1 and the literal “Energy”; edges rdf:type, ex:sector and dc:name.]
The Statistical Indexes are grouped in three categories:
• Generic
• Intensional
• Extensional
Name   Description                     Structure                     Category
t      Number of triples               Integer                       Generic
c      Number of classes               Integer                       Generic
I      Number of instances             Integer                       Generic
Cl     Class list                      List(name, n. instances)      Generic
Pl     Property list                   List(name, n. occurrences)    Generic
IK     Intensional knowledge triples   List(s, p, o)                 Intensional
Sc     Subject class                   List(c, p, n. occurrences)    Extensional
SCl    Subject class to literal        List(c, p, n. occurrences)    Extensional
Oc     Object class                    List(c, p, n. occurrences)    Extensional
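As a reading aid, the index table above could be held in a single record per endpoint. The sketch below is only an illustration in Python: the class name `StatisticalIndexes` and the lowercase field names are assumptions, not names taken from the slides or from LODeX.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class StatisticalIndexes:
    """Hypothetical container mirroring the index table (field names are assumptions)."""

    # Generic indexes
    t: int = 0                                   # number of triples
    c: int = 0                                   # number of classes
    i: int = 0                                   # number of instances ("I" in the table)
    cl: List[Tuple[str, int]] = field(default_factory=list)   # (class name, n. instances)
    pl: List[Tuple[str, int]] = field(default_factory=list)   # (property name, n. occurrences)
    # Intensional index
    ik: List[Tuple[str, str, str]] = field(default_factory=list)   # (s, p, o) triples
    # Extensional indexes
    sc: List[Tuple[str, str, int]] = field(default_factory=list)   # (class, property, n. occurrences)
    scl: List[Tuple[str, str, int]] = field(default_factory=list)  # subject class to literal
    oc: List[Tuple[str, str, int]] = field(default_factory=list)   # object class


# Example record filled with the Sc/SCl/Oc values shown in the example slide.
example = StatisticalIndexes(
    sc=[("ex:Organization", "ex:sector", 2)],
    scl=[("ex:Sector", "dc:name", 1)],
    oc=[("ex:Sector", "ex:sector", 1)],
)
```

Keeping the three categories in one record makes it straightforward to serialise the result as a single document per endpoint.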
[Figure: the example graph with the three extensional patterns highlighted: Subject Class, Subject Class to literal and Object Class.]
     Sc - Subject Class   SCl - Subject Class to literal   Oc - Object Class
S    ex:Organization      ex:Sector                        ex:Sector
P    ex:sector            dc:name                          ex:sector
n    2                    1                                1
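The three extensional indexes can be illustrated on a handful of in-memory triples. The sketch below is written in Python under stated assumptions: the counting rules (one count per triple, literals marked by a leading quote) are an interpretation of the table, and the toy triples, including the untyped resource `sector2`, are hypothetical values chosen so that the counts match the example, since the exact edges of the figure are not recoverable.

```python
from collections import Counter

RDF_TYPE = "rdf:type"

# Toy triples (hypothetical). Literals are written with a leading double quote.
triples = [
    ("organization1", RDF_TYPE, "ex:Organization"),
    ("organization2", RDF_TYPE, "ex:Organization"),
    ("sector1", RDF_TYPE, "ex:Sector"),
    ("organization1", "ex:sector", "sector1"),
    ("organization2", "ex:sector", "sector2"),  # sector2 has no rdf:type (assumption)
    ("sector1", "dc:name", '"Energy"'),
]


def is_literal(term):
    """Treat any term starting with a double quote as a literal."""
    return term.startswith('"')


def extensional_indexes(triples):
    """Return the (Sc, SCl, Oc) counters keyed by (class, property)."""
    # Map every instance to the classes it is declared to belong to.
    classes_of = {}
    for s, p, o in triples:
        if p == RDF_TYPE:
            classes_of.setdefault(s, set()).add(o)

    sc, scl, oc = Counter(), Counter(), Counter()
    for s, p, o in triples:
        if p == RDF_TYPE:
            continue
        # Subject side: literal objects feed SCl, resource objects feed Sc.
        for c in classes_of.get(s, ()):
            if is_literal(o):
                scl[(c, p)] += 1
            else:
                sc[(c, p)] += 1
        # Object side: only typed resources contribute to Oc.
        if not is_literal(o):
            for c in classes_of.get(o, ()):
                oc[(c, p)] += 1
    return sc, scl, oc


sc, scl, oc = extensional_indexes(triples)
```

On a real endpoint the same counts would come from SPARQL queries rather than from a Python loop; the loop only makes the definitions concrete.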
The index extraction (IE) process takes as input a list of URLs of SPARQL endpoints
The output is a set of Statistical Indexes for each endpoint
• The IE process dynamically generates the SPARQL queries used to extract the Statistical Indexes
• It queries different datasets in parallel
• Partial results and the Statistical Indexes are stored in a NoSQL DB
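The overall flow described above (a list of endpoints in, one set of indexes per endpoint out, endpoints processed in parallel, results stored) might look roughly like the following Python sketch. Everything named here is an assumption for illustration: `run_query` is a caller-supplied function that sends a SPARQL query to an endpoint and returns rows as dictionaries, `store` is a plain dictionary standing in for the NoSQL DB, and a thread pool is used instead of separate processes.

```python
from concurrent.futures import ThreadPoolExecutor

COUNT_TRIPLES = "SELECT (COUNT(*) AS ?t) WHERE { ?s ?p ?o }"


def extract_indexes(endpoint_url, run_query):
    """Extract indexes for one endpoint (only the triple count, to keep the sketch short)."""
    rows = run_query(endpoint_url, COUNT_TRIPLES)
    return {"endpoint": endpoint_url, "t": int(rows[0]["t"])}


def extract_all(endpoint_urls, run_query, store, max_workers=9):
    """Query the endpoints in parallel and save one record per endpoint in `store`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            url: pool.submit(extract_indexes, url, run_query)
            for url in endpoint_urls
        }
        for url, future in futures.items():
            try:
                store[url] = future.result()
            except Exception as exc:
                # An unreachable or slow endpoint must not stop the other extractions.
                store[url] = {"endpoint": url, "error": str(exc)}
    return store


def fake_run_query(endpoint_url, query):
    """Stand-in for a real SPARQL client, so the sketch runs without network access."""
    if "down" in endpoint_url:
        raise TimeoutError("timeout")
    return [{"t": "42"}]


results = extract_all(
    ["http://a.example/sparql", "http://down.example/sparql"],
    fake_run_query,
    {},
)
```

Passing the query function in as a parameter keeps the orchestration independent of any particular SPARQL client library.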
General Statistics Extraction
• It uses 6 different queries to extract the indexes of this group
Intensional Knowledge Extraction
• The Intensional knowledge is extracted through an iterative algorithm
• The algorithm traverses the graph starting from the instantiated classes
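The slides do not give the details of the iterative algorithm, so the following Python sketch is only one plausible reading: a breadth-first traversal over in-memory triples that starts from the instantiated classes and skips incoming rdf:type edges, on the assumption that those edges lead to instances (extensional knowledge). The function name and the toy triples are hypothetical.

```python
from collections import deque

RDF_TYPE = "rdf:type"


def intensional_triples(triples, instantiated_classes):
    """Collect the triples reachable from the instantiated classes."""
    # Index the triples by subject and by object to walk edges in both directions.
    by_subject, by_object = {}, {}
    for t in triples:
        by_subject.setdefault(t[0], []).append(t)
        by_object.setdefault(t[2], []).append(t)

    visited = set(instantiated_classes)
    queue = deque(instantiated_classes)
    collected = set()
    while queue:
        node = queue.popleft()
        # Incoming rdf:type edges are skipped: they come from instances.
        incoming = [t for t in by_object.get(node, []) if t[1] != RDF_TYPE]
        for s, p, o in by_subject.get(node, []) + incoming:
            collected.add((s, p, o))
            for neighbour in (s, o):
                if neighbour not in visited:
                    visited.add(neighbour)
                    queue.append(neighbour)
    return collected


schema = [
    ("ex:Organization", RDF_TYPE, "owl:Class"),
    ("ex:Sector", RDF_TYPE, "owl:Class"),
    ("ex:sector", "rdfs:domain", "ex:Organization"),
    ("ex:sector", "rdfs:range", "ex:Sector"),
    ("ex:sector", RDF_TYPE, "owl:ObjectProperty"),
]
instances = [
    ("organization1", RDF_TYPE, "ex:Organization"),
    ("organization1", "ex:sector", "sector1"),
    ("sector1", RDF_TYPE, "ex:Sector"),
]

found = intensional_triples(schema + instances, ["ex:Organization", "ex:Sector"])
```

On this toy input the traversal returns the five schema triples and none of the instance triples; against a live endpoint each step would be a SPARQL query instead of a dictionary lookup.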
Extensional Schema Extraction
• It uses different SPARQL aggregation queries to extract Sc, SCl and Oc
• It uses a technique called Pattern Strategy (PS) to complete the extraction
– The technique produces a larger number of less complex SPARQL queries
– It is used when the endpoint cannot answer an aggregation query and returns a timeout error
A complete list of the 24 query patterns is available at http://dbgroup.unimo.it/lodexQueries
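The fallback idea behind the Pattern Strategy can be sketched as follows: try one aggregation query first and, if the endpoint times out, replace it with many simpler queries. The Python sketch below is illustrative only. The SPARQL texts are not the 24 published query patterns, and `run_query`, `EndpointTimeout` and `subject_class_index` are hypothetical names introduced here.

```python
# One aggregation query for the Sc index (illustrative, not an official pattern).
AGGREGATE_SC = """
SELECT ?c ?p (COUNT(*) AS ?n)
WHERE { ?s a ?c . ?s ?p ?o . FILTER(isIRI(?o)) }
GROUP BY ?c ?p
"""

# Simpler queries used as a fallback: list the properties of one class,
# then count the triples for one (class, property) pair at a time.
PROPERTIES_OF_CLASS = (
    "SELECT DISTINCT ?p WHERE {{ ?s a <{c}> . ?s ?p ?o . FILTER(isIRI(?o)) }}"
)
COUNT_FOR_PAIR = (
    "SELECT (COUNT(*) AS ?n) WHERE {{ ?s a <{c}> . ?s <{p}> ?o . FILTER(isIRI(?o)) }}"
)


class EndpointTimeout(Exception):
    """Raised by the (hypothetical) query function when the endpoint times out."""


def subject_class_index(run_query, classes):
    """Build the Sc index, falling back to many small queries on timeout."""
    try:
        rows = run_query(AGGREGATE_SC)
        return {(r["c"], r["p"]): int(r["n"]) for r in rows}
    except EndpointTimeout:
        index = {}
        for c in classes:
            for row in run_query(PROPERTIES_OF_CLASS.format(c=c)):
                p = row["p"]
                count_rows = run_query(COUNT_FOR_PAIR.format(c=c, p=p))
                index[(c, p)] = int(count_rows[0]["n"])
        return index


def fake_run_query(query):
    """Stand-in endpoint that cannot answer GROUP BY queries in time."""
    if "GROUP BY" in query:
        raise EndpointTimeout()
    if "DISTINCT" in query:
        return [{"p": "http://example.org/sector"}]
    return [{"n": "2"}]


index = subject_class_index(fake_run_query, ["http://example.org/Organization"])
```

The trade-off is the one stated on the slide: more round trips to the endpoint, but each query is cheap enough to finish before the timeout.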
The test was performed on a list of 469 datasets
Reachable datasets                     244
SPARQL 1.1 compatible                  137
Extraction completed                   107
Extraction completed without PS         33
Total triples (107 datasets)           3.45 billion
Average extraction time                6.12 min
Total time (single process)            11.15 h
Total time (9 processes)               3.35 h
• More than 90% of the datasets completed the extraction in less than 500 s
• The PS technique has proved its worth
  • completed extractions rose from 33 to 107
• The IE process is scalable
  • there is a linear correlation between the number of triples and the extraction time
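To put the evaluation counts in proportion, the short Python sketch below computes the share of datasets that survive each stage and the gain brought by the Pattern Strategy. It uses only the counts reported above; the variable and function names are illustrative.

```python
# Counts copied from the evaluation summary.
funnel = [
    ("Datasets in the test list", 469),
    ("Reachable datasets", 244),
    ("SPARQL 1.1 compatible", 137),
    ("Extraction completed", 107),
]
completed_without_ps = 33


def step_ratios(stages):
    """Return (label, count, percent of the previous stage) for each stage after the first."""
    ratios = []
    for (_, previous), (label, count) in zip(stages, stages[1:]):
        ratios.append((label, count, round(100 * count / previous, 1)))
    return ratios


ratios = step_ratios(funnel)
for label, count, percent in ratios:
    print(f"{label}: {count} ({percent}% of the previous stage)")

# How many times more extractions complete when the Pattern Strategy is enabled.
ps_gain = round(funnel[-1][1] / completed_without_ps, 2)
print(f"Completed extractions with PS vs without: x{ps_gain}")
```

Read this way, roughly half of the listed datasets were reachable, and enabling PS roughly tripled the number of completed extractions.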
LODeX is an online tool that shows a visual Schema Summary for a LOD source
• We used the statistical indexes to generate the Schema Summary.
• Users can interact with the Schema Summary and focus on the information they are most interested in.
The tool is accessible at: www.dbgroup.unimo.it/lodex
Come and attend the LODeX demo at the ISWC demo session!
F. Benedetti, S. Bergamaschi, and L. Po, “A visual summary for linked open data sources”, International Semantic Web Conference (Posters & Demos), 2014.
Conclusion
• We are able to extract valuable indexes from a LOD dataset by taking advantage of the definition of Intensional and Extensional knowledge
• The extraction process has been tested on a large number of datasets, and its efficiency and effectiveness have been demonstrated
Future Work
• Extend the VoID vocabulary with our descriptors
• Propose LODeX as an assistance tool for LOD portals
• We are extending LODeX to support automatic SPARQL query generation
Thanks for your attention! 