WebSmatch: a platform for data and metadata integration
Remi Coletta, Emmanuel Castanier, Patrick Valduriez,
Christian Frisch, DuyHoa Ngo, Zohra Bellahsene




                      Motivations
Context: open data in France
Problems
   •   High number of data sources
   •   Heterogeneous formats
   •   Poorly structured
Example (DataPublica): a web crawl for French open data
sources found 148509 Excel files and only 369 RDF files
Needs: integrate and visualize data sources to yield
high-value information






              www.data-publica.com
Business: marketplace for open data
Functions: crawl, classify, document and reference data
sources in a search engine
The data is extracted and structured in a database so that it
can be visualized and accessed through APIs
Problem: scaling to large numbers of heterogeneous, poorly
structured sources








                DataPublica Workflow

DataPublica provides more than 10,000 XLS files (from several
sources such as INSEE and various public organizations)
WebSmatch is integrated into the DataPublica workflow








                Example of input
URL: http://www.data-publica.com/publication/4736

Problem: where are the data and the metadata?
Incomplete lines, unnamed attributes

Existing tools such as OpenII or Google Refine work only on
clean files







                Example of input
URL: http://www.data-publica.com/publication/4736

Find the data table
Remove blank lines or columns
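
The cleanup step named on this slide can be sketched as follows. This is a minimal illustration, not WebSmatch's code: the sheet is assumed to be already loaded as a list of rows, and `is_blank` and `trim_blank` are hypothetical helper names. The sample values are made up.

```python
def is_blank(cell):
    """A cell counts as blank if it is None or contains only whitespace."""
    return cell is None or str(cell).strip() == ""


def trim_blank(grid):
    """Drop every row and every column that holds no non-blank cell."""
    rows = [row for row in grid if not all(is_blank(c) for c in row)]
    if not rows:
        return []
    width = max(len(r) for r in rows)
    # Pad ragged rows so that columns can be inspected uniformly.
    rows = [list(r) + [None] * (width - len(r)) for r in rows]
    keep = [j for j in range(width) if not all(is_blank(r[j]) for r in rows)]
    return [[r[j] for j in keep] for r in rows]


# Made-up sheet: a blank first row, a blank first column,
# a blank line inside the table and a blank column between attributes.
sheet = [
    [None, None, None, None],
    [None, "Region", "", "2010"],
    [None, "", "", ""],
    [None, "Bretagne", "", 3175],
]
cleaned = trim_blank(sheet)
print(cleaned)
```

Dropping blank rows before blank columns keeps the column test cheap; the order does not change the result.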








                Example of input
URL: http://www.data-publica.com/publication/4736

Find metadata such as titles
Identify collections for two-dimensional tables








                 WebSmatch workflow
Focus on the metadata extraction service
This service is not used if the input is already in a structured
format (such as RDF, RDFS, OWL...)








MetaData Extraction: XLS example



First step: table detection using vision algorithms
(dilate/erode)
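
One plausible reading of "dilate/erode" is a morphological closing (dilation followed by erosion) applied to a binary grid in which 1 marks a non-empty cell. The closing bridges single blank lines inside a table, and connected regions of the result are candidate tables. The sketch below assumes that reading; the slides do not give the actual algorithm, and `dilate`, `erode`, `find_tables` and the sample grid are hypothetical.

```python
from collections import deque


def _window(mask, i, j):
    """Values of the 3x3 neighbourhood of (i, j), clipped to the grid."""
    h, w = len(mask), len(mask[0])
    return [mask[a][b]
            for a in range(max(0, i - 1), min(h, i + 2))
            for b in range(max(0, j - 1), min(w, j + 2))]


def dilate(mask):
    """A cell becomes 1 if any cell of its neighbourhood is 1."""
    return [[int(any(_window(mask, i, j))) for j in range(len(mask[0]))]
            for i in range(len(mask))]


def erode(mask):
    """A cell stays 1 only if every cell of its neighbourhood is 1."""
    return [[int(all(_window(mask, i, j))) for j in range(len(mask[0]))]
            for i in range(len(mask))]


def find_tables(mask):
    """Close small gaps, then return one bounding box per connected region.

    Boxes are (top, left, bottom, right), in row-major discovery order.
    """
    closed = erode(dilate(mask))
    h, w = len(closed), len(closed[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for i in range(h):
        for j in range(w):
            if not closed[i][j] or seen[i][j]:
                continue
            top, left, bottom, right = i, j, i, j
            queue = deque([(i, j)])
            seen[i][j] = True
            while queue:  # breadth-first flood fill, 4-connectivity
                a, b = queue.popleft()
                top, bottom = min(top, a), max(bottom, a)
                left, right = min(left, b), max(right, b)
                for na, nb in ((a - 1, b), (a + 1, b), (a, b - 1), (a, b + 1)):
                    if (0 <= na < h and 0 <= nb < w
                            and closed[na][nb] and not seen[na][nb]):
                        seen[na][nb] = True
                        queue.append((na, nb))
            boxes.append((top, left, bottom, right))
    return boxes


# Made-up occupancy grid: a table with one blank line inside it (row 2),
# three empty rows, then a second table further right.
occupied = [
    [1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 1, 1, 1],
]
print(find_tables(occupied))
```

With a 3x3 window the closing bridges a gap of one row or column; wider gaps keep the regions apart, which is what separates the two tables here.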








MetaData Extraction: XLS example




Second step: attribute detection using machine learning
on cell content and neighbourhood
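
The slides do not say which features or which learner are used. As an illustration only, the sketch below computes a few features of the kind such a classifier might consume, mixing the cell's own content with its neighbourhood. The feature names, `is_number` and `cell_features` are assumptions, and no model is trained here.

```python
def is_number(value):
    """True if the value parses as a number (a decimal comma is accepted)."""
    try:
        float(str(value).replace(",", "."))
        return True
    except ValueError:
        return False


def cell_features(grid, i, j):
    """Describe cell (i, j) by its own content and by its neighbours."""
    def get(a, b):
        if 0 <= a < len(grid) and 0 <= b < len(grid[a]):
            return grid[a][b]
        return None  # outside the sheet counts as blank

    value = get(i, j)
    below = [get(a, j) for a in range(i + 1, len(grid))]
    filled_below = [v for v in below if v not in (None, "")]
    return {
        # Content feature: non-blank and not numeric.
        "is_text": value not in (None, "") and not is_number(value),
        # Neighbourhood feature: nothing directly above the cell.
        "above_is_blank": get(i - 1, j) in (None, ""),
        # Neighbourhood feature: share of numeric cells further down the column.
        "numeric_ratio_below": (
            sum(is_number(v) for v in filled_below) / len(filled_below)
            if filled_below else 0.0
        ),
    }


# Made-up table: the header "2010" is numeric, so content alone would
# not identify it as an attribute name; its position does.
grid = [
    ["Region", "2010", "2011"],
    ["Bretagne", 3175, 3199],
    ["Corse", 309, 314],
]
print(cell_features(grid, 0, 1))
```

The header cell "2010" shows why the neighbourhood matters: its content looks like data, but it sits on top of a purely numeric column with nothing above it.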








       MetaData Extraction: XLS example




Third step: automatic detection of concepts using YAM++
(14 matching techniques such as string matching, instance-based
matching, WordNet...)

YAM++ ranked 1st and 2nd at OAEI 2011: http://oaei.ontologymatching.org/2011/results/





                 WebSmatch Workflow
Focus on the matching service
Relies on YAM++, which combines different metrics (string,
WordNet, instance-based)
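
To make "combining different metrics" concrete, here is a small sketch that averages a string metric on attribute names with an instance-based metric on attribute values. It is not YAM++: the equal weights, the choice of `difflib` and Jaccard overlap, and every function name are assumptions, and the WordNet metric is left out to keep the example dependency-free.

```python
from difflib import SequenceMatcher


def string_similarity(a, b):
    """Similarity of two attribute names, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def instance_similarity(values_a, values_b):
    """Jaccard overlap of the values observed for two attributes."""
    a, b = set(values_a), set(values_b)
    return len(a & b) / len(a | b) if a | b else 0.0


def combined_score(attr_a, attr_b, w_string=0.5, w_instance=0.5):
    """Weighted sum of both metrics; the weights are illustrative."""
    name_a, values_a = attr_a
    name_b, values_b = attr_b
    return (w_string * string_similarity(name_a, name_b)
            + w_instance * instance_similarity(values_a, values_b))


def best_match(attribute, candidates):
    """Candidate with the highest combined score."""
    return max(candidates, key=lambda c: combined_score(attribute, c))


# Made-up attributes: (name, sample of values).
source = ("dept", ["Ain", "Aisne", "Allier"])
candidates = [
    ("departement", ["Ain", "Aisne", "Ardennes"]),
    ("annee", ["2010", "2011"]),
]
print(best_match(source, candidates)[0])
```

The instance metric helps when names are abbreviated or unnamed, which the earlier slides list as a common problem in the input files.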








                  Data Visualization
Structured export format that is easy for third parties to use: DSPL
DSPL: Dataset Publishing Language from Google Inc.; see
https://developers.google.com/public-data/
For two-dimensional tables, we need to denormalize, as DSPL
uses flat CSV files for data
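
The denormalization step can be sketched as an unpivot: each cell of the cross-tab becomes one flat record, which is then written as CSV. This shows only the reshaping; it does not produce the XML metadata that a full DSPL dataset also needs. The column names (`region`, `year`, `value`), the helper names and the figures are made up for illustration.

```python
import csv
import io


def denormalize(header, rows, row_key="region", col_key="year", value_key="value"):
    """Turn a cross-tab (one column per year) into one flat record per cell."""
    flat = []
    for row in rows:
        label, values = row[0], row[1:]
        for col_label, value in zip(header[1:], values):
            flat.append({row_key: label, col_key: col_label, value_key: value})
    return flat


def to_csv(records):
    """Serialize the flat records as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Made-up two-dimensional table: regions as rows, years as columns.
header = ["region", "2010", "2011"]
rows = [
    ["Bretagne", 3175, 3199],
    ["Corse", 309, 314],
]
flat = denormalize(header, rows)
print(to_csv(flat))
```

A table of R rows and C value columns yields R x C flat records, so each record carries both dimensions explicitly.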







          Exporting the Results: integrated metadata
How to make richer datasets: aggregation or intersection
   – using generic concepts such as time or location
   – finding a specific concept using the matching
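
As an illustration of intersection on generic concepts, the sketch below joins two flat datasets on the pair (location, year) and keeps only the records present in both. The field names, the `integrate` helper and the figures are hypothetical; the slides do not describe how WebSmatch performs this step.

```python
def integrate(left, right, keys=("location", "year")):
    """Inner join of two record lists on the shared concept columns."""
    index = {tuple(r[k] for k in keys): r for r in right}
    merged = []
    for record in left:
        match = index.get(tuple(record[k] for k in keys))
        if match is not None:
            # Combine the attributes of both sources into one richer record.
            merged.append({**record, **match})
    return merged


# Made-up datasets sharing the generic concepts "location" and "year".
population = [
    {"location": "Bretagne", "year": 2010, "population": 3199},
    {"location": "Corse", "year": 2010, "population": 309},
]
revenue = [
    {"location": "Bretagne", "year": 2010, "revenue": 21000},
    {"location": "Bretagne", "year": 2011, "revenue": 21500},
]
merged = integrate(population, revenue)
print(merged)
```

Only Bretagne in 2010 appears in both sources, so the result holds a single record carrying both measures.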








Visualizing the Results








      Visualizing the Results
http://api.data-publica.com/…/content.json?limit=10&filter={revenue_fiscal_par_foyer:{$gt:25000}}
                     • Multi-format (JSON, XML, spreadsheet, CSV)
                     • Geolocated queries
                     • Mashups
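
The query string of the example request can be assembled programmatically. The sketch below builds only the part after the `?`; the path is elided on the slide and is not reconstructed here. Serializing the filter as strict JSON is an assumption (the slide shows unquoted keys), and `build_query` is a hypothetical helper, not part of the DataPublica API.

```python
import json
from urllib.parse import parse_qs, urlencode


def build_query(limit, field, operator, threshold):
    """URL-encoded query string with a limit and a single-field filter."""
    filter_expr = {field: {operator: threshold}}
    return urlencode({
        "limit": limit,
        # Compact separators avoid spaces inside the encoded filter.
        "filter": json.dumps(filter_expr, separators=(",", ":")),
    })


query = build_query(10, "revenue_fiscal_par_foyer", "$gt", 25000)
print(query)
# Decoding the string shows the parameters a server would receive.
print(parse_qs(query))
```

Encoding the filter matters because braces, colons and `$` are not safe to send raw in a URL.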








                       Perspectives


1. Automating large-volume extraction: confidence / machine
   learning
2. Clustering documents (on specific concepts & concept
   instances)
•   Integration with other tools
     •   Google Refine
     •   RDF export







                       Conclusion


WebSmatch is a flexible environment for Open Data
integration
End-to-end process: importing, cleansing and integrating
data sources
DSPL export format for visualization
Validated on real DataPublica data sources



