Analysing and Improving embedded Markup of
Learning Resources on the Web
Stefan Dietze, Davide Taibi, Ran Yu, Phil Barker, Mathieu d’Aquin
- WWW2017, Digital Learning Track -
05/04/17, Stefan Dietze
Structured data about learning resources on the Web?
Open Data & Linked Data
Resource metadata
• Standards: LOM, ADL SCORM, IMS LD, etc.
• Repositories: Open Courseware, Merlot, ARIADNE, etc.
Educational(ly relevant) linked data
• Vocabularies: BIBO, LOM/RDF, mEducator, etc.
• Datasets: e.g. LinkedUp Catalog (approx. 50 M resources)
  http://data.linkededucation.org/linkedup/catalog/
Web: approx. 46,000,000,000,000 (46 trillion) Web pages indexed by Google
Embedded markup data & schema.org
• Embedded markup (RDFa, Microdata, Microformats) for interpretation of Web documents (search, retrieval)
• schema.org vocabulary used at scale (700 classes, 1000 predicates) and supported by Yahoo, Yandex, Bing, Google
• Adoption on the Web (2016):
  o 38% of 3.2 bn pages
  o 44 bn statements/quads
  (see “Web Data Commons”; Meusel & Paulheim [ISWC2014])
• Same order of magnitude as “the Web” (scale, dynamics)
<div itemscope itemtype="http://schema.org/Movie">
  <h1 itemprop="name">Forrest Gump</h1>
  <span>Actor: <span itemprop="actor">Tom Hanks</span></span>
  <span itemprop="genre">Drama</span>
  ...
</div>
RDF statements
node1 actor _node-x
node1 actor Robin Wright
node1 genre Comedy
node2 actor T. Hanks
node2 distributed by Paramount Pic.
node3 actor Tom Cruise
node3 distributed by Paramount Pic.
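As a rough illustration of how a Microdata snippet such as the Movie example yields statements, the sketch below walks flat markup with Python's standard html.parser. The names MicrodataSketch and extract_triples, the node1-style subject labels and the "type" predicate are assumptions made here for illustration. It ignores nested items and is not the extraction pipeline used by Web Data Commons.

```python
from html.parser import HTMLParser


class MicrodataSketch(HTMLParser):
    """Collects (subject, property, value) triples from flat Microdata."""

    def __init__(self):
        super().__init__()
        self.triples = []
        self._items = 0        # number of itemscope elements seen so far
        self._subject = None   # label of the current item, e.g. "node1"
        self._prop = None      # itemprop currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs:
            self._items += 1
            self._subject = f"node{self._items}"
            if attrs.get("itemtype"):
                self.triples.append((self._subject, "type", attrs["itemtype"]))
        elif "itemprop" in attrs and self._subject:
            self._prop = attrs["itemprop"]
            self._text = []

    def handle_data(self, data):
        if self._prop:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Simplification: a property element is assumed to contain text only,
        # so the first closing tag after it ends the property value.
        if self._prop:
            value = "".join(self._text).strip()
            self.triples.append((self._subject, self._prop, value))
            self._prop = None


def extract_triples(html):
    parser = MicrodataSketch()
    parser.feed(html)
    parser.close()
    return parser.triples


EXAMPLE = """
<div itemscope itemtype="http://schema.org/Movie">
  <h1 itemprop="name">Forrest Gump</h1>
  <span>Actor: <span itemprop="actor">Tom Hanks</span></span>
  <span itemprop="genre">Drama</span>
</div>
"""

for triple in extract_triples(EXAMPLE):
    print(triple)
```

Values come out as plain strings here, which is the same pattern the later slides describe as typical of markup in the wild.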
Learning Resource Metadata Initiative (LRMI)
• schema.org extension providing vocabulary for annotation of learning resources
• Association of resources (s:CreativeWork, e.g. books, videos, etc.) with learning-related attributes (typical age, learning resource type, educational frameworks, etc.)
• Dublin Core Metadata Initiative task force on LRMI
http://lrmi.dublincore.net/
Learning Resource Metadata Initiative: research questions
How is LRMI actually being used on the Web?
• RQ1) Adoption of LRMI terms/patterns and their evolution?
• RQ2) Distribution across the Web?
• RQ3) Quality (and how to improve/cleanse/interpret it)?
Why is it important?
• Enable data reuse (KB construction, recommenders, search)
• Inform vocabulary design (LRMI, schema.org)
Data extraction
• CC: Common Crawl, 2013-2015 (http://commoncrawl.org)
• WDC: Web Data Commons, 2013-2015: statements/quads extracted from CC (http://webdatacommons.org)
• LRMI: all quads extracted from WDC/CC which include or co-occur with an LRMI term (according to the LRMI spec)
• LRMI’: extracted from WDC/CC as above, but considering “common errors” [Meusel et al. 2015]

                 2013                  2014                  2015
Documents (CC)   2,224,829,946         2,014,175,679         1,770,525,212
URLs (WDC)       585,792,337 (26.3%)   620,151,400 (30.7%)   541,514,775 (30.5%)
Quads (WDC)      17,241,313,916        20,484,755,485        24,377,132,352
URLs (LRMI)      83,791                430,861               779,260
URLs (LRMI’)     84,098                430,895               929,573
Quads (LRMI)     9,245,793             26,256,833            44,108,511
Quads (LRMI’)    9,251,553             26,258,524            69,932,849
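The LRMI subset is defined as all quads that include or co-occur with an LRMI term. A minimal sketch of that selection follows, assuming that "co-occur" means "appear in the same document" and that quads are already parsed into (subject, predicate, object, document URL) tuples. The function name select_lrmi_quads and the short LRMI_TERMS list are illustrative. The study itself relied on the full LRMI specification.

```python
from collections import defaultdict

SCHEMA = "http://schema.org/"

# Illustrative subset of LRMI terms, not the complete specification.
LRMI_TERMS = {
    SCHEMA + name
    for name in (
        "learningResourceType",
        "educationalAlignment",
        "educationalUse",
        "typicalAgeRange",
        "AlignmentObject",
    )
}


def select_lrmi_quads(quads, lrmi_terms=LRMI_TERMS):
    """Keep every quad of each document in which an LRMI term occurs."""
    by_document = defaultdict(list)
    for quad in quads:
        by_document[quad[3]].append(quad)

    selected = []
    for document_quads in by_document.values():
        uses_lrmi = any(
            predicate in lrmi_terms or obj in lrmi_terms
            for _, predicate, obj, _ in document_quads
        )
        if uses_lrmi:
            # Co-occurring quads are kept as well, not only the LRMI ones.
            selected.extend(document_quads)
    return selected


sample = [
    ("_:a", SCHEMA + "learningResourceType", '"lecture"', "http://example.org/a"),
    ("_:a", SCHEMA + "name", '"Intro"', "http://example.org/a"),
    ("_:b", SCHEMA + "name", '"Other"', "http://example.org/b"),
]
print(len(select_lrmi_quads(sample)))
```

Grouping by document first keeps the check to a single pass per document, which matters at the scale of the WDC corpus.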
LRMI distribution across pay-level domains (PLDs)
• Power-law distribution across approx. 300 PLDs and 4000 subdomains (2015)
• Top 10% of contributors provide 98.4% of all quads (2015)
7xxxtube.com
1amateurporntube.com
virtualpornstars.com
sunriseseniorliving.com
simplyfinance.co.uk
menslifestyles.com
audiobooks.com
simplypsychology.org
helles-koepfchen.de
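A concentration figure like "top 10% of contributors provide 98.4% of all quads" can be computed from per-PLD quad counts roughly as sketched below. The function name top_share and the toy counts are assumptions for illustration and are not data from the study.

```python
import math


def top_share(quads_per_pld, fraction=0.10):
    """Share of all quads contributed by the top `fraction` of PLDs."""
    counts = sorted(quads_per_pld.values(), reverse=True)
    if not counts:
        return 0.0
    # At least one PLD is always counted as "top".
    k = max(1, math.ceil(len(counts) * fraction))
    return sum(counts[:k]) / sum(counts)


# Toy data: one dominant provider and nine small ones.
toy_counts = {"big.example": 910}
toy_counts.update({f"small{i}.example": 10 for i in range(9)})

print(f"{top_share(toy_counts):.1%}")
```

A high share suggests that focused crawling of a small number of providers would recover most of the markup.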
Markup quality (1/2): addressing schema misuse
Clustering/classification of unintended uses of LRMI terms?
• Domain blacklist: recall 96%, roughly 10% of PLDs (0.5% of documents) affected
• Clustering of PLDs/resource types (XMeans)
• Variety of features, in particular related to term adoption
Unintended schema use: term distribution as clustering feature?
[Figure: term co-occurrence within markup from top-ranked PLDs (“learning resources in the LRMI sense”)]
[Figure: term co-occurrence within markup from filtered adult content PLDs]
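To make the idea of term distribution as a clustering feature concrete, the sketch below turns the terms used by a PLD into a relative-frequency vector and compares two PLDs by cosine similarity. The helper names and sample terms are hypothetical. The slides mention XMeans for the actual clustering, which is not reproduced here.

```python
import math
from collections import Counter


def term_distribution(terms):
    """Relative frequency of each vocabulary term used by one PLD."""
    counts = Counter(terms)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def cosine(a, b):
    """Cosine similarity of two sparse term-frequency vectors."""
    dot = sum(value * b.get(term, 0.0) for term, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical term usage for three PLDs.
course_site = term_distribution(["learningResourceType", "typicalAgeRange", "name"])
other_course = term_distribution(["learningResourceType", "typicalAgeRange", "name"])
video_site = term_distribution(["thumbnailUrl", "duration", "uploadDate"])

print(cosine(course_site, other_course))
print(cosine(course_site, video_site))
```

PLDs with similar term profiles score close to 1, while PLDs that share no terms score 0. Any vector-based clustering method could consume such features.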
Markup quality (2/2): heuristics for fixing frequent errors
• Heuristics for fixing frequent errors (see Meusel et al., ESWC2015)
  o Wrong namespaces (e.g. “htp:/schema.org”): 501,530 quads in 2015
  o Undefined types and properties: 1,172,893 quads in 2015
  o Object properties misused as datatype properties: 10,288,717 quads in 2015
• Errors fixed in most PLDs and documents
• But: lower error rate in LRMI corpus than in markup in general (WDC)

Top-5 undefined types
Rank  Year  Type                  # Quads  # PLDs
1     2013  EducationalEvent      6004     1
      2014  EducationalEvent      3047     1
      2015  offer                 100516   1
2     2013  UserComment           20       1
      2014  Therapist             25       1
      2015  headline              6724     1
3     2013  CompetencyObject      4        1
      2014  UserComment           23       1
      2015  URL                   693      1
4     2013  Webpage               2        1
      2014  learningResourceType  21       1
      2015  webpage               360      1
5     2013  about                 1        1
      2014  EducationalEvent      19       1
      2015  musicrecording        296      1
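The wrong-namespace case listed on this slide (e.g. “htp:/schema.org”) lends itself to a simple normalisation step. The sketch below is one possible pattern-based repair. The regular expression and the name fix_namespace are assumptions for illustration and do not reproduce the heuristics of Meusel et al.

```python
import re

CANONICAL = "http://schema.org/"

# Assumed pattern: tolerates a misspelt scheme, missing or extra slashes,
# an "https" scheme and a leading "www." before schema.org.
_BROKEN_NAMESPACE = re.compile(
    r"^(?:h?t?tps?:?/*)?(?:www\.)?schema\.org/+", re.IGNORECASE
)


def fix_namespace(term):
    """Rewrite a malformed schema.org IRI to the canonical namespace."""
    match = _BROKEN_NAMESPACE.match(term)
    if not match:
        return term  # not a schema.org term: leave untouched
    return CANONICAL + term[match.end():]


for raw in (
    "htp:/schema.org/learningResourceType",
    "https://www.schema.org/CreativeWork",
    "http://example.org/foo",
):
    print(raw, "->", fix_namespace(raw))
```

Requiring at least one slash after the host avoids rewriting unrelated hosts that merely start with "schema.org".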
Fixed quads/documents/PLDs
           2013              2014                2015
# quads    520,815 (5.63%)   1,601,796 (6.10%)   6,179,097 (8.84%)
# docs     46,382 (55.15%)   369,772 (85.81%)    754,863 (81.21%)
# PLDs     75 (75.76%)       154 (67.54%)        291 (77.39%)

“Strings, not things”
• Numbers from 2015:
  o 46 million “transversal” quads (i.e. non-hierarchical statements)
  o 64% datatype properties, yet 97% refer to literals (up from 70% in 2013)
• Issues
  o Lack of links and controlled vocabularies
  o Data reuse requires identity resolution
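The “strings, not things” observation is a statement about how many objects are literals rather than resources. Assuming objects are available in N-Quads term syntax (literals start with a double quote, IRIs with "<", blank nodes with "_:"), the share could be measured as in this sketch. The function name literal_share and the sample objects are illustrative.

```python
def literal_share(objects):
    """Fraction of object terms that are literals (N-Quads term syntax)."""
    objects = list(objects)
    if not objects:
        return 0.0
    literals = sum(1 for term in objects if term.startswith('"'))
    return literals / len(objects)


sample_objects = [
    '"Tom Hanks"',                    # literal: a string, not a thing
    "<http://example.org/person/1>",  # IRI: a linkable resource
    "_:b0",                           # blank node
    '"Drama"@en',                     # language-tagged literal
]

print(literal_share(sample_objects))
```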
Key findings & implications
I. Significant growth, but biased term adoption.
• Growing adoption: 138 M statements in 2016, up from 48 M in 2015 (observable even in a general-purpose crawl such as CC)
• Bias towards simple datatype & generic properties
• Implications for data consumption & identity resolution
II. Power-law distribution of LRMI markup.
• Top 10% of contributors provide 98.4% of quads (2015)
• Efficient crawling/extraction of LRMI-specific data (e.g. for building an index or recommender) => focused crawling of most probable data providers
III. Frequent errors.
• Vast amounts of erroneous statements (80% of PLDs in 2015), yet fewer than in markup in general
• Steady increase (total and relative) of errors
• Need for data cleansing & fixing: heuristics and frequency-based approaches (e.g. erroneous terms usually occur in few PLDs only)
IV. Unintended use of vocabulary terms.
• Terms applied in a variety of contexts (e.g. adult content)
• Not necessarily a schema violation
• But: need for further processing (e.g. clustering/classification) when interpreting/using LRMI
Future work
Consumption, reuse & fusion of markup data
• Clustering for data cleansing and categorisation (features: e.g. term distribution, page-rank, etc.)
• Supervised data fusion for entity matching and fact verification; related work: [ICDE2017, SWJ2017]
• Augmenting knowledge bases
Vocabulary design
• Feed findings into DCMI task force on LRMI
• Bootstrap patterns and terms (from actual usage)?
• Wider schema.org question: reflecting lack of acceptance of object-object relationships in vocabularies?
Yu, R., Fetahu, B., Gadiraju, U., Dietze, S., FuseM: Query-Centric Data Fusion on Structured Web Markup, ICDE2017.
Yu, R., Fetahu, B., Gadiraju, U., Lehmberg, O., Ritze, D., Dietze, S., KnowMore - Knowledge Base Augmentation with Structured Web Markup, Semantic Web Journal 2017, under review.
Contact, data & stats
Data
http://lrmi.itd.cnr.it/
Contact
@stefandietze | http://stefandietze.net

