Analysing & Improving Learning Resources Markup on the Web

Analysing and Improving Embedded Markup of Learning Resources on the Web
Stefan Dietze, Davide Taibi, Ran Yu, Phil Barker, Mathieu d’Aquin
- WWW2017, Digital Learning Track -
05/04/17 Stefan Dietze

Open Data & Linked Data
Structured data about learning resources on the Web?
Resource metadata
• Standards: LOM, ADL SCORM, IMS LD etc.
• Repositories: Open Courseware, Merlot, ARIADNE etc.
Educational(ly relevant) linked data
• Vocabularies: BIBO, LOM/RDF, mEducator etc.
• Datasets: e.g. LinkedUp Catalog (approx. 50 M resources)
http://data.linkededucation.org/linkedup/catalog/

Structured data about learning resources on the Web?
Web: approx. 46,000,000,000,000 (46 trillion) Web pages indexed by Google

Embedded markup data & schema.org
• Embedded markup (RDFa, Microdata, Microformats) for interpretation of Web documents (search, retrieval)
• schema.org vocabulary used at scale (700 classes, 1000 predicates) and supported by Yahoo, Yandex, Bing, Google
• Adoption on the Web (2016):
o 38% out of 3.2 bn pages
o 44 bn statements/quads
(see “Web Data Commons”, see Meusel & Paulheim [ISWC2014])
• Same order of magnitude as “the Web” (scale, dynamics)
<div itemscope itemtype="http://schema.org/Movie">
<h1 itemprop="name">Forrest Gump</h1>
<span>Actor: <span itemprop="actor">Tom Hanks</span></span>
<span itemprop="genre">Drama</span>
...
</div>
RDF statements
node1 actor _node-x
node1 actor Robin Wright
node1 genre Comedy
node2 actor T. Hanks
node2 distributed by Paramount Pic.
node3 actor Tom Cruise
node3 distributed by Paramount Pic.
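How markup such as the Movie snippet turns into statements can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the extraction pipeline used by Web Data Commons. The class name MicrodataSketch and the node1, node2 identifiers are assumptions made for this example; nested items, itemref and URL-valued properties are deliberately ignored.

```python
from html.parser import HTMLParser

# Elements without a closing tag; skipped so the depth counter stays balanced.
# Side effect: <meta itemprop=...> values are ignored in this sketch.
VOID = {"br", "hr", "img", "input", "link", "meta"}


class MicrodataSketch(HTMLParser):
    """Collect (item, property, text) statements from flat Microdata markup."""

    def __init__(self):
        super().__init__()
        self.statements = []  # (item id, property name, text value)
        self.types = {}       # item id -> itemtype URL
        self._items = []      # stack of (item id, depth at which it opened)
        self._depth = 0
        self._prop = None     # (property name, depth) while capturing text
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        self._depth += 1
        attrs = dict(attrs)
        if "itemscope" in attrs:
            item_id = f"node{len(self.types) + 1}"
            self.types[item_id] = attrs.get("itemtype")
            self._items.append((item_id, self._depth))
        elif "itemprop" in attrs and self._items and self._prop is None:
            self._prop = (attrs["itemprop"], self._depth)
            self._text = []

    def handle_data(self, data):
        if self._prop is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if self._prop is not None and self._prop[1] == self._depth:
            value = "".join(self._text).strip()
            self.statements.append((self._items[-1][0], self._prop[0], value))
            self._prop = None
        if self._items and self._items[-1][1] == self._depth:
            self._items.pop()
        self._depth -= 1


html_doc = """
<div itemscope itemtype="http://schema.org/Movie">
<h1 itemprop="name">Forrest Gump</h1>
<span>Actor: <span itemprop="actor">Tom Hanks</span></span>
<span itemprop="genre">Drama</span>
</div>
"""

parser = MicrodataSketch()
parser.feed(html_doc)
for statement in parser.statements:
    print(statement)
```

Each printed tuple corresponds to one statement about the item; the page URL would be added as a fourth element to obtain a quad.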

Learning Resource Metadata Initiative (LRMI)
• schema.org extension providing vocabulary for annotation of learning resources
• Association of resources (s:CreativeWork, e.g. books, videos etc.) with learning-related attributes (typical age, learning resource type, educational frameworks etc.)
• Dublin Core Metadata Initiative task force on LRMI
http://lrmi.dublincore.net/

Learning Resources Metadata Initiative: research questions
How is LRMI actually being used on the Web?
• RQ1) Adoption of LRMI terms / patterns and its evolution?
• RQ2) Distribution across the Web?
• RQ3) Quality (and how to improve/cleanse/interpret)?
Why is it important?
• Enable data reuse (KB construction, recommenders, search)
• Inform vocabulary design (LRMI, schema.org)

Data extraction
                2013                  2014                  2015
Documents (CC)  2,224,829,946         2,014,175,679         1,770,525,212
URLs (WDC)      585,792,337 (26.3%)   620,151,400 (30.7%)   541,514,775 (30.5%)
Quads (WDC)     17,241,313,916        20,484,755,485        24,377,132,352
URLs (LRMI)     83,791                430,861               779,260
URLs (LRMI’)    84,098                430,895               929,573
Quads (LRMI)    9,245,793             26,256,833            44,108,511
Quads (LRMI’)   9,251,553             26,258,524            69,932,849
• CC: Common Crawl, 2013-2015 (http://commoncrawl.org)
• WDC: Web Data Commons, 2013-2015: statements/quads extracted from CC (http://webdatacommons.org)
• LRMI: all quads extracted from WDC/CC which include or co-occur with an LRMI term (according to LRMI spec)
• LRMI’: extracted from WDC/CC as above, but considering “common errors” [Meusel et al 2015]
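The LRMI selection rule can be sketched as a filter over N-Quads lines. This is a simplified illustration and not the extraction code behind the numbers above. Two assumptions are made here: "co-occur" is read as "appears in the same document (graph URL)", and LRMI_TERMS is only an illustrative subset of the terms in the LRMI spec. The regular expression handles simple N-Quads lines only.

```python
import re
from collections import defaultdict

# Illustrative subset only; the complete list should be taken from the LRMI spec.
LRMI_TERMS = {
    "learningResourceType", "typicalAgeRange", "educationalAlignment",
    "educationalUse", "timeRequired", "interactivityType",
    "AlignmentObject", "educationalFramework",
}

# Simplified N-Quads pattern: subject, predicate, object, graph (the page URL).
QUAD = re.compile(r"^(\S+)\s+(\S+)\s+(.+)\s+<([^>]*)>\s*\.\s*$")


def local_name(term):
    """Return the last path or fragment segment of a URI term."""
    return re.split(r"[/#]", term.strip("<>"))[-1]


def uses_lrmi(predicate, obj):
    """True if the predicate, or a URI in object position, is an LRMI term."""
    if local_name(predicate) in LRMI_TERMS:
        return True
    return obj.startswith("<") and local_name(obj) in LRMI_TERMS


def lrmi_quads(lines):
    """Keep all quads of every document that contains at least one LRMI term."""
    by_doc = defaultdict(list)
    for line in lines:
        match = QUAD.match(line)
        if match:
            by_doc[match.group(4)].append(match.groups())
    selected = []
    for quads in by_doc.values():
        if any(uses_lrmi(p, o) for _, p, o, _ in quads):
            selected.extend(quads)
    return selected


# Hypothetical input lines for illustration.
lines = [
    '_:a <http://schema.org/name> "Algebra 101" <http://example.org/course> .',
    '_:a <http://schema.org/learningResourceType> "lecture" <http://example.org/course> .',
    '_:b <http://schema.org/name> "Shop" <http://example.org/shop> .',
]
selected = lrmi_quads(lines)
```

Grouping by graph URL first means the name statement of the course page is kept even though it carries no LRMI term itself, which is the co-occurrence part of the rule.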


LRMI distribution across pay-level-domains (PLDs)
• Power law distribution across approx. 300 PLDs and 4000 subdomains (2015)
• Top 10% of contributors provide 98.4% of all quads (2015)
7xxxtube.com
1amateurporntube.com
virtualpornstars.com
sunriseseniorliving.com
simplyfinance.co.uk
menslifestyles.com
audiobooks.com
simplypsychology.org
helles-koepfchen.de
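A concentration figure such as "top 10% of contributors provide X% of all quads" can be computed from per-PLD quad counts as sketched below. The function name top_share and the counts are hypothetical and chosen for illustration only; they are not the measured LRMI data.

```python
def top_share(quads_per_pld, fraction=0.10):
    """Share of all quads contributed by the top `fraction` of PLDs."""
    counts = sorted(quads_per_pld.values(), reverse=True)
    if not counts:
        return 0.0
    # Always consider at least one PLD, even for very small inputs.
    k = max(1, int(len(counts) * fraction))
    return sum(counts[:k]) / sum(counts)


# Hypothetical counts for ten PLDs.
example_counts = {
    "pld-a.example": 900, "pld-b.example": 40, "pld-c.example": 20,
    "pld-d.example": 10, "pld-e.example": 10, "pld-f.example": 5,
    "pld-g.example": 5, "pld-h.example": 4, "pld-i.example": 3,
    "pld-j.example": 3,
}
share = top_share(example_counts)
```

In this made-up example the single largest PLD out of ten accounts for 900 of 1000 quads, so the share is 0.9.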

Markup quality (1/2): addressing schema misuse
sunriseseniorliving.com
7xxxtube.com
1amateurporntube.com
virtualpornstars.com
simplyfinance.co.uk
menslifestyles.com
audiobooks.com
simplypsychology.org
helles-koepfchen.de
Clustering/classification of unintended uses of LRMI terms?
• Domain blacklist: recall 96%, roughly 10% of PLDs (0.5% of documents) affected
• Clustering of PLDs/resource types (XMeans)
• Variety of features, in particular related to term adoption
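The blacklist step and its recall measure can be sketched as follows. This is an assumed, minimal formulation: the domain names are hypothetical placeholders, and the manually labelled set used for recall is invented for the example.

```python
from urllib.parse import urlsplit


def host(url):
    """Lower-cased host name of a URL, or an empty string if there is none."""
    return (urlsplit(url).hostname or "").lower()


def is_blacklisted(url, blacklist):
    """True if the URL's host equals a blacklisted domain or is a subdomain of it."""
    h = host(url)
    return any(h == d or h.endswith("." + d) for d in blacklist)


def recall(flagged, relevant):
    """Fraction of the relevant (truly unintended) URLs that were flagged."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(relevant & set(flagged)) / len(relevant)


# Hypothetical blacklist and crawl sample.
blacklist = {"adult-example.com"}
urls = [
    "http://videos.adult-example.com/clip/1",
    "http://adult-example.com/clip/2",
    "http://other-adult.example/clip/3",
    "http://courses.example.org/algebra",
]
flagged = [u for u in urls if is_blacklisted(u, blacklist)]
# Hypothetical manual labels: the first three URLs are unintended uses.
recall_value = recall(flagged, urls[:3])
```

Here two of the three labelled URLs are caught, which illustrates why a blacklist alone leaves a residue that clustering has to address.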

Unintended schema use: term distribution as clustering feature?
Term co-occurrence within markup from top-ranked PLDs (“learning resources in the LRMI sense”)
Term co-occurrence within markup from filtered adult content PLDs
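One way term distribution could serve as a clustering feature is to describe each PLD by its relative term frequencies and compare PLDs by cosine similarity. The sketch below is an assumption about how such features might be built, not the feature set used in the study; the term lists are hypothetical.

```python
import math
from collections import Counter


def term_profile(terms):
    """Relative frequency of each vocabulary term observed for one PLD."""
    counts = Counter(terms)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def cosine(p, q):
    """Cosine similarity between two sparse term profiles."""
    dot = sum(value * q.get(term, 0.0) for term, value in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


# Hypothetical term usage for three PLDs.
course_site = term_profile(["name", "learningResourceType", "educationalAlignment", "name"])
other_course_site = term_profile(["name", "learningResourceType", "typicalAgeRange"])
video_site = term_profile(["name", "thumbnailUrl", "duration", "typicalAgeRange"])

sim_same_kind = cosine(course_site, other_course_site)
sim_other_kind = cosine(course_site, video_site)
```

The resulting profiles can be passed to any clustering algorithm; the slides mention XMeans, which is not part of the Python standard library and is therefore not shown.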

Markup quality (2/2): heuristics for fixing frequent errors
• Heuristics for fixing frequent errors (see Meusel et al., ESWC2015)
o Wrong namespaces (e.g. “htp:/schema.org”): 501,530 quads in 2015
o Undefined types and properties: 1,172,893 quads in 2015
o Object properties misused as data type property: 10,288,717 quads in 2015
• Errors fixed in most PLDs and documents
• But: lower error rate in LRMI corpus than markup in general (WDC)
Top-5 undefined types
Rank  Year  Type                  # Quads  # PLDs
1     2013  EducationalEvent      6004     1
1     2015  offer                 100516   1
2     2013  UserComment           20       1
2     2014  Therapist             25       1
2     2015  headline              6724     1
3     2013  CompetencyObject      4        1
3     2014  UserComment           23       1
3     2015  URL                   693      1
4     2013  Webpage               2        1
4     2014  learningResourceType  21       1
4     2015  webpage               360      1
5     2013  about                 1        1
5     2015  musicrecording        296      1
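Two of the error classes on this slide, wrong namespaces and wrongly cased type names, lend themselves to simple string heuristics. The sketch below is written in the spirit of such fixes and is not the implementation from Meusel et al.; the regular expression, the function names and the KNOWN_TERMS subset are assumptions for illustration.

```python
import re

# Matches misspelled schema.org prefixes such as "htp:/schema.org" or
# "https://www.schema.org/"; an assumed pattern, not an exhaustive one.
BROKEN_NS = re.compile(r"^h?t?tps?:/*(?:www\.)?schema\.org/*", re.IGNORECASE)

# Illustrative subset of defined schema.org terms.
KNOWN_TERMS = {"WebPage", "MusicRecording", "Offer", "learningResourceType"}


def fix_namespace(term):
    """Rewrite a broken schema.org prefix to the canonical namespace."""
    return BROKEN_NS.sub("http://schema.org/", term, count=1)


def fix_case(local_name, known=KNOWN_TERMS):
    """Map a wrongly cased term to its defined spelling, if one is known."""
    by_lower = {name.lower(): name for name in known}
    return by_lower.get(local_name.lower(), local_name)


fixed_ns = fix_namespace("htp:/schema.org/learningResourceType")
fixed_type = fix_case("webpage")
```

Terms that stay undefined after both steps (for example Therapist in the table above) are returned unchanged and would need a different treatment.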
“Strings, not things”
• Numbers from 2015:
o 46 million “transversal” quads (i.e. non-hierarchical statements)
o 64% datatype properties, yet 97% refer to literals (up from 70% in 2013)
• Issues
o Lack of links and controlled vocabularies
o Data reuse requires identity resolution
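The "strings, not things" ratio can be measured by checking whether the object of each statement is a literal or a reference. A minimal sketch, assuming objects are given in N-Quads syntax; the sample values are hypothetical.

```python
def is_literal(obj):
    """In N-Quads, literals start with a double quote; URIs with '<', blank nodes with '_:'."""
    return obj.lstrip().startswith('"')


def literal_share(objects):
    """Fraction of statement objects that are literals rather than references."""
    objects = list(objects)
    if not objects:
        return 0.0
    return sum(is_literal(o) for o in objects) / len(objects)


# Hypothetical objects of four statements.
sample_objects = [
    '"Tom Hanks"',
    '"Drama"',
    "<http://example.org/person/1>",
    '"Paramount Pic."',
]
share_of_literals = literal_share(sample_objects)
```

A literal such as "Tom Hanks" names an actor without identifying one, which is why reuse of this data requires identity resolution.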
Fixed quads/documents/PLDs
          2013              2014                2015
# quads   520,815 (5.63%)   1,601,796 (6.10%)   6,179,097 (8.84%)
# docs    46,382 (55.15%)   369,772 (85.81%)    754,863 (81.21%)
# PLDs    75 (75.76%)       154 (67.54%)        291 (77.39%)

Key findings & implications
I. Significant growth, but biased term adoption.
• Growing adoption: 138 M (48 M) statements in 2016 (2015) (observable even in general-purpose crawl/CC)
• Bias towards simple data type & generic properties
• Implications for data consumption & identity resolution
II. Power-law distribution of LRMI markup.
• Top 10% contributors provide 98.4% of quads 2015
• Efficient crawling / extraction of LRMI-specific data (e.g. for building index or recommender) => focused crawling of most probable data providers
III. Frequent errors.
• Vast amounts of erroneous statements (80% of PLDs in 2015), yet fewer than in markup in general
• Steady increase (total and relative) of errors
• Need for data cleansing & fixing: heuristics and frequency-based approaches (e.g. erroneous terms usually in few PLDs only)
IV. Unintended use of vocabulary terms.
• Terms applied in variety of contexts (e.g. adult content)
• Not necessarily schema violation
• But: need for further processing (e.g. clustering/classification) when interpreting/using LRMI

Future work
Consumption, reuse & fusion of markup data
• Clustering for data cleansing and categorisation (features: e.g. term distribution, page-rank, etc.)
• Supervised data fusion for entity matching and fact verification, see related work [ICDE2017, SWJ2017]
• Augmenting knowledge bases
Vocabulary design
• Feed findings into DCMI task force on LRMI
• Bootstrap pattern and terms (from actual usage)?
• Wider schema.org question: reflecting lack of acceptance of object-object relationships in vocabularies?
Yu, R., Fetahu, B., Gadiraju, U., Dietze, S., FuseM: Query-Centric Data Fusion on Structured Web Markup, ICDE2017.
Yu, R., Fetahu, B., Gadiraju, U., Lehmberg, O., Ritze, D., Dietze, S., KnowMore - Knowledge Base Augmentation with Structured Web Markup, Semantic Web Journal 2017, under review.

Contact, data & stats
Data
http://lrmi.itd.cnr.it/
Contact
@stefandietze | http://stefandietze.net
