Linked Data for Information Extraction 
Challenge 2014 
Tasks and Results 
Robert Meusel and Heiko Paulheim
Task
Creation of an information extraction system that scrapes structured information from HTML websites.
• The training dataset was created from HTML pages annotated using the Microformats hCard format.
• The data is a subset of the WebDataCommons Microformats Dataset.
• The original data is provided by the Common Crawl Foundation, the largest publicly available collection of web crawls.
Linked Data for Information Extraction Challenge 2014 - Tasks and Results
The Common Crawl Foundation (CC)
• Non-profit foundation dedicated to building and maintaining an open crawl of the Web
• 9 crawl corpora from 2008 till 2014 available so far
• Crawling strategies:
  • Earlier crawls used BFS (with link discovery), seeded with a large list of PageRank-ranked seeds; current crawls are gathered using a seed list of more than 6 billion URLs from the blekko search index
  • As a result, all crawls represent the popular part of the Web
• Data availability:
  • CC provides three different datasets for each crawl
  • All data can be freely downloaded from AWS S3
The WebDataCommons Project
Extraction of Structured Data from the Common Crawl Corpora
• Extracts information annotated with the markup languages Microformats, Microdata, and RDFa
• So far, three datasets have been gathered, from the crawls of 2010, 2012, and 2013
[Diagram: dataset series covering RDFa, Microdata, and Microformats]
Extracting the Data
• Webmasters mark up their information directly within the HTML page using one of the three markup languages
• Using Any23 (http://any23.apache.org/), this information is extracted as RDF triples:
1. _:node1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Product> .
2. _:node1 <http://schema.org/Product/name> "Predator Instinct FG Fußballschuh"@de .
3. _:node1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Offer> .
4. _:node1 <http://schema.org/Offer/price> "€ 219,95"@de .
5. _:node1 <http://schema.org/Offer/priceCurrency> "EUR"@de .
6. …
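To make the extraction step concrete, here is a minimal, hypothetical sketch (not the actual Any23 implementation) of how hCard class attributes in HTML can be turned into RDF-style triples, using only the Python standard library; the `VCARD` namespace and property list are illustrative assumptions:

```python
from html.parser import HTMLParser

# Illustrative namespace for hCard extractions (assumption, not Any23's exact output)
VCARD = "http://www.w3.org/2006/vcard/ns#"

class HCardExtractor(HTMLParser):
    """Collects (subject, predicate, object) triples from hCard markup."""
    def __init__(self):
        super().__init__()
        self.triples = []
        self._stack = []     # class names of currently open elements
        self._node = None    # current blank-node id
        self._count = 0      # blank-node counter

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        if "vcard" in classes:
            # Each vcard root becomes a fresh blank node with a type triple
            self._count += 1
            self._node = f"_:node{self._count}"
            self.triples.append((self._node, "rdf:type", VCARD + "VCard"))
        self._stack.append(classes)

    def handle_endtag(self, tag):
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text or self._node is None or not self._stack:
            return
        # Map a few common hCard properties to triples (illustrative subset)
        for prop in ("fn", "org", "tel", "url"):
            if prop in self._stack[-1]:
                self.triples.append((self._node, VCARD + prop, text))

html = ('<div class="vcard"><span class="fn">ACME Inc.</span>'
        '<span class="tel">+49 621 181</span></div>')
p = HCardExtractor()
p.feed(html)
for t in p.triples:
    print(t)
```

A real extractor must additionally handle nested vcards, `abbr`/`a` value patterns, and the full hCard property set, which is why the challenge relies on Any23 rather than a hand-rolled parser.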
The Original Dataset of 2013
• Over 1.7 million domains use at least one markup language
• Over 17 billion quads with over 4 billion records (typed entities)
• hCard is the most dominant format among domains
Extraction of the Challenge Dataset
• Selected a subset of over 10k web pages from the corpus, including over 450k extracted triples (annotated with MF hCard)
  • Training: 9,877 web pages / 373,501 triples
  • Test: 2,379 web pages / 85,248 triples
Creation of the Gold Standard
• Input: annotated HTML pages & triples (extracted with Any23)
• After extraction of the triples, all hCard tags are replaced
  • Replacement by randomly generated tags
  • stable per page, but different across pages
  • Comments are replaced as well, since CMS systems like to leave comments such as <!-- here is the name of the company -->
• Output
  • Training:
    • Annotated HTML page
    • Cleaned HTML page
    • Triples
  • Testing:
    • Cleaned HTML page
    • Triples (not public)
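The tag-replacement scheme above (random tags, stable within a page, different across pages) can be sketched as follows; the function name and the per-page seeding via a hash are hypothetical choices, not the actual challenge tooling:

```python
import hashlib
import random
import re

def replace_hcard_tags(html: str, page_id: str) -> str:
    """Replace hCard class names with random tokens that are stable
    per page (RNG seeded by the page id) but differ across pages."""
    hcard_classes = ["vcard", "fn", "org", "tel", "url", "adr"]
    # Seeding with the page identifier makes the mapping deterministic
    # for one page while other pages get a different mapping.
    seed = int(hashlib.md5(page_id.encode()).hexdigest(), 16)
    rng = random.Random(seed)
    mapping = {c: "x%06d" % rng.randrange(10**6) for c in hcard_classes}

    def swap(m):
        return mapping.get(m.group(0), m.group(0))

    # Note: a sketch only -- this also replaces matches outside class
    # attributes; a real tool would rewrite the parsed DOM instead.
    return re.sub(r"\b(" + "|".join(hcard_classes) + r")\b", swap, html)

page = '<div class="vcard"><span class="fn">ACME Inc.</span></div>'
cleaned_a = replace_hcard_tags(page, "page-a.html")
cleaned_b = replace_hcard_tags(page, "page-b.html")
```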
Overview: Dataset Creation and Evaluation Process
[Process diagram not included in this export]
Evaluation
• Methodology: a triple from a submission's extracted statements and a triple extracted by Any23 from the original test HTML pages are considered equal if they have the same predicate and object for the same page.
• Baseline: each page has at least one statement declaring that there is one vCard
_:1 rdf:type hcard:Vcard .
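Under this methodology, micro-averaged scores can be computed by comparing sets of (predicate, object) pairs per page; a minimal sketch with made-up example data:

```python
def evaluate(gold: dict, submitted: dict):
    """Micro-averaged precision/recall/F1 over (predicate, object)
    pairs per page, following the matching rule above."""
    tp = fp = fn = 0
    for page in gold:
        g = set(gold[page])
        s = set(submitted.get(page, []))
        tp += len(g & s)   # pairs found in both
        fp += len(s - g)   # submitted but not in the gold standard
        fn += len(g - s)   # gold pairs the submission missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# The baseline emits exactly one rdf:type statement per page
gold = {"p1": [("rdf:type", "hcard:Vcard"), ("hcard:fn", "ACME")]}
baseline = {"p1": [("rdf:type", "hcard:Vcard")]}
p, r, f = evaluate(gold, baseline)
```

On this toy page the baseline reaches perfect precision but only half the recall, which is exactly why a real submission can beat it on recall and F-measure.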
Challenge Results
• We received one submission (which you will learn about in a few minutes)
• The submission outperforms the baseline in recall and F-measure
• The gold standard is not perfect: within the data we also find names and other attributes without a type (whenever webmasters did not model one), so even a perfect extraction system would not reach a precision of 1.
Outlook: LD4IE Challenge 2015
• Include more classes (e.g. Microdata and/or RDFa)
• Add negative examples to generate a more realistic setting
  • as of today, systems can assume there is something within the test sample
  • the challenge is making sure that the negative examples contain no unmarked data
• Improve the representativeness of the challenge dataset
  • Widespread CMS systems automatically mark up articles, posts, etc.
  • Eliminate such bias, if present, for the next challenges
