Backup
Beyond research data infrastructures - exploiting artificial & crowd
intelligence for building research knowledge graphs
Stefan Dietze
GESIS – Leibniz Institute for the Social Sciences &
Heinrich-Heine-Universität Düsseldorf
LWDA2019, 02 October 2019
(Buzzword) Bingo!?
research data infrastructure, data fusion, distant supervision, Web mining, distributional semantics, knowledge graph, neural entity linking, research data, machine learning, social web, artificial intelligence, semantics, claim extraction, stance detection, fact verification, crowd
Finding research data on the Web?
Finding (social sciences) research data on the Web
Traditional & novel forms of research data: the case of social sciences
 Traditional social science research data: survey & census
data, microdata, lab studies etc (lack of scale, dynamics)
 Social science vision: substituting & complementing
traditional research data through data mined from the Web
 Example: investigations into misinformation and opinion
forming (e.g. [Vosoughi et al. 2018])
 Usually aims at gaining insights while also addressing
methodological/computational challenges
 Insights, mostly (computational) social sciences, e.g.
o Spreading of claims and misinformation
o Effect of biased and fake news on public opinions
o Reinforcement of biases and echo chambers
 Methods, mostly in computer science, e.g. for
o Crawling, harvesting, scraping of data
o Extraction of structured knowledge
(entities, sentiments, stances, claims, etc)
o Claim/fact detection and verification („fake news
detection“), e.g. CLEF 2018 Fact Checking Lab
o Stance detection, e.g. Fake News Challenge (FNC)
Overview
Part I
Mining, fusing and linking research data (in particular: metadata) on the Web
Part II
Mining novel forms of research knowledge graphs from the Web
[Figure labels: Datasets, Metadata, Publications, Web pages, Opinions, Claims, Stances]
Web mining of dataset metadata (or: dataset KGs)
 Harvesting from open data portals (e.g. DCAT/VoID-
metadata on DataHub.io, DataCite etc.)
 Information extraction on long tail of Web documents?
=> dynamics & scale: approx. 50 trn (50.000.000.000.000)
Web pages indexed by Google (plus gazillion of temporal
snapshots)
 Embedded markup (RDFa, Microdata, Microformats) for
annotation of Web pages
 Supports Web search & interpretation
 Pushed by Google, Yahoo, Bing et al
(schema.org vocabulary)
 Adoption on the Web by 38% of all Web pages
(sample: Common Crawl 2016, 3.2 bn Web pages)
 Easily accessible, large-scale source of factual knowledge
(about research data & research information)
 Large-scale source of training data, e.g. manually
annotated Web pages citing datasets
Facts (“quads”)
node   property        value          source
node1  name            WB Commodity   URI-1
node1  distribution    node_xy        URI-1
node1  creator         Worldbank      URI-1
node1  dateCreated     26 April 2017  URI-1
node2  creator         World Bank     URI-2
node2  encodingFormat  text/CSV       URI-2
node3  dateCreated     26 April 2007  URI-3
node3  keywords        crude          URI-3
<div itemscope itemtype="http://schema.org/Dataset">
<h1 itemprop="name">World Bank-Commodity Prices</h1>
<span itemprop="distribution">URL-X</span>
<span itemprop="license">CC-BY</span>
...
</div>
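The step from embedded markup to quads can be sketched in a few lines. This is a minimal illustration using only Python's standard `html.parser`, not the extraction tooling used in the talk: the class name `ItempropExtractor`, the helper `extract_quads`, the `nodeN` identifiers and the extra `type` fact are assumptions, and nested items or non-text property values (e.g. `href`, `content`) are deliberately not handled.

```python
from html.parser import HTMLParser


class ItempropExtractor(HTMLParser):
    """Collect (node, property, value, source) quads from flat Microdata."""

    def __init__(self, source_url):
        super().__init__()
        self.source_url = source_url
        self.quads = []
        self.node_count = 0
        self.current_node = None  # id of the most recently opened itemscope
        self.current_prop = None  # itemprop whose text content is pending

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs:
            # Each itemscope becomes a new (hypothetical) blank node.
            self.node_count += 1
            self.current_node = f"node{self.node_count}"
            if "itemtype" in attrs:
                self.quads.append(
                    (self.current_node, "type", attrs["itemtype"], self.source_url))
        elif "itemprop" in attrs and self.current_node:
            self.current_prop = attrs["itemprop"]

    def handle_data(self, data):
        text = data.strip()
        if self.current_prop and text:
            # The first non-empty text after an itemprop tag is its value.
            self.quads.append(
                (self.current_node, self.current_prop, text, self.source_url))
            self.current_prop = None


def extract_quads(html, source_url):
    parser = ItempropExtractor(source_url)
    parser.feed(html)
    return parser.quads


page = """<div itemscope itemtype="http://schema.org/Dataset">
<h1 itemprop="name">World Bank-Commodity Prices</h1>
<span itemprop="distribution">URL-X</span>
<span itemprop="license">CC-BY</span>
</div>"""

for quad in extract_quads(page, "URI-1"):
    print(quad)
```

Keeping the source URI as the fourth element is what later allows source-level features (such as the authority of the publishing page) to be attached to each fact.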
Research dataset markup on the Web
 In Common Crawl 2017 (3.2 bn pages):
o 14.1 M statements & 3.4 M instances
related to „s:Dataset“
o Spread across 200 K pages from 2878 PLDs
(top 10% of PLDs provide 95% of data)
 Studies of scholarly articles and other types
[SAVESD16, WWW2017]: majority of major
publishers, data hosting sites, data registries,
libraries, research organisations represented
[Figure: power law distribution of dataset metadata across PLDs]
 Challenges
o Errors: factual errors, annotation errors (see
also [Meusel et al, ESWC2015])
o Ambiguity & coreferences: e.g. 18.000 entity
descriptions of “iPhone 6” in Common Crawl
2016 & ambiguous literals (e.g. “Apple”)
o Redundancies & conflicts: vast amounts of
equivalent or conflicting statements
KnowMore: data fusion on markup
 0. Noise: data cleansing (node URIs, deduplication etc)
 1.a) Scale: blocking through BM25 entity retrieval on markup index
 1.b) Relevance: supervised coreference resolution
 2.) Quality & redundancy: data fusion through supervised fact classification (SVM, kNN, RF, LR, NB), diverse
feature set (authority, relevance etc), considering source- (e.g. PageRank), entity- & fact-level features
[Figure: KnowMore pipeline]
Web page markup: Web crawl (Common Crawl, 44 bn facts)
Query: WorldBank Commodity, Prices 2019, type:(Dataset)
1. Blocking & coreference resolution
approx. 125.000 facts for query [ s:Product, „iPhone6“ ]
Candidate Facts
node1  name            WB Commodity
node1  distribution    node_xy
node1  creator         Worldbank
node1  dateReleased    26 April 2019
node2  creator         World Bank
node2  encodingFormat  text/CSV
node3  dateCreated     26 April 2007
node4  keywords        “crude”
2. Fusion / Fact selection (supervised)
Entity Description
name            “WorldBank Commodity Prices 2019”
distribution    Worldbank (node)
releaseDate     26.04.2019
keywords        “crude”, “prizes”, “market”
encodingFormat  text/CSV
New Queries
WorldBank, type:(Organization)
Washington, type:(City)
David Malpass, type:(Person)
Yu, R., [..], Dietze, S., KnowMore-Knowledge Base
Augmentation with Structured Web Markup, Semantic
Web Journal 2019 (SWJ2019)
Tempelmeier, N., Demidova, E., Dietze, S., Inferring
Missing Categorical Information in Noisy and Sparse
Web Markup, The Web Conf. 2018 (WWW2018)
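The blocking step (1.a) can be illustrated with a small BM25 ranker. This is a sketch under stated assumptions rather than the KnowMore implementation: it uses the textbook Okapi BM25 formula with a Lucene-style smoothed idf, whitespace tokenisation, and default parameters `k1=1.2`, `b=0.75`; the function names and the toy entity descriptions are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenised document against the query terms with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def block(query, descriptions, top_k=2):
    """Return indices of the top_k descriptions with a non-zero BM25 score."""
    docs = [desc.lower().split() for desc in descriptions]
    scores = bm25_scores(query.lower().split(), docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [i for i in ranked[:top_k] if scores[i] > 0]


# Toy entity descriptions, each the concatenated literals of one markup node.
descriptions = [
    "WB Commodity Worldbank",
    "World Bank text/CSV",
    "iPhone 6 Apple",
    "crude commodity prices",
]
candidates = block("WorldBank Commodity Prices", descriptions)
print(sorted(candidates))
```

Blocking only narrows the candidate set cheaply; whether a candidate really describes the queried entity is left to the supervised coreference and fusion steps.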
Fusion performance
 Experiments on books, movies, products (ongoing: datasets)
 Baselines: BM25, CBFS [ESWC2015], PreRecCorr [Pochampally
et al., ACM SIGMOD 2014]; strong variance across types
Knowledge Graph Augmentation
 On average 60% - 70% of all facts new (across DBpedia,
Wikidata, Freebase)
 Additional experiments on learning new categorical features
(e.g. product categories or movie genres) [WWW2018]
Rich Context & Coleridge Initiative
building (yet another) KG of scholarly resources & datasets
 Context/corpus: publications
(currently: social sciences, SAGE Publishing)
 Tasks:
I. Extraction/disambiguation of dataset mentions
II. Extraction/detection of research methods
III. Classification of research fields
https://coleridgeinitiative.org/richcontextcompetition
Applications: search for social science resources (and links)
https://search.gesis.org/
Disambiguation of dataset citations
Otto, W. et al., Knowledge Extraction from scholarly
publications – the GESIS contribution to the Rich Context
Competition, to appear, Sage Publishing, 2020
All these issues are addressed in the current report,
which is based on analysis of data obtained in the
National Comorbidity Survey (NCS) (15). The NCS is
a nationally representative survey of the US household
population that includes retrospective reports about the
ages at onset and lifetime occurrences of suicidal
ideation, plans, and attempts along with information
about the occurrences of mental disorders, substance
use, substance abuse, and substance dependence.
Detected dataset mentions: “National Comorbidity Survey (NCS)”, “NCS”
Challenges
 Ambiguous (incomplete) citations
 Lack of high-quality and representative training
data (usually: weak labels, domain bias)
Approaches & results
 Prior work: supervised pattern induction
[Boland et al, TPDL2012]
 Current approach:
o neural NER based on spaCy (CRF-based
approach for research method detection)
o Training (testing) on 12.000 (3.000) paragraphs
(distribution of negative/positive differs,
training batch size=25, dropout=0.4)
o Results: approx. P = .50, R = .90 (weakly labelled
test data)
o On small set of manually labelled test data:
P = .52, R = .21
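To make the gap between the two evaluations easier to compare, the reported precision and recall can be combined into F1. The slide reports only P and R; the F1 values below are derived here and are not figures from the talk.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


weak = f1_score(0.50, 0.90)    # weakly labelled test data
manual = f1_score(0.52, 0.21)  # small manually labelled test set
print(f"{weak:.2f} {manual:.2f}")  # prints "0.64 0.30"
```

Precision is almost unchanged between the two settings, so the drop in F1 comes from recall alone.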
Profiling datasets for dataset search
 Dataset metadata is crucial for search,
discovery, reuse
 But: dataset metadata is sparse,
incomplete, noisy, costly
 Profiling datasets = generating dataset
metadata from actual data(set) at
hand
 Various profiling dimensions
depending on use case (e.g. statistical
features, dynamics, topics), cf.
[SWJ18]
 Works on topic profiling [ESWC14] and
profiling of graph features [ESWC19]
Ben Ellefi, M., Bellahsene, Z., Breslin, J., Demidova, E., Dietze, S., Szymanski, J.,
Todorov, K., RDF Dataset Profiling – a Survey of Features, Methods,
Vocabularies and Applications, Semantic Web Journal, IOS Press 2018
Fetahu, B., Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl, W., A Scalable
Approach for Efficiently Generating Structured Dataset Topic Profiles,
ESWC2014
Profiling (Graph) Datasets
 Graph metrics essential to
describe/profile research datasets of
graph-based nature (social graphs,
knowledge graphs) in order to:
 Distinguish dataset categories
 Find/discover datasets with
particular features
 Generate synthetic data, e.g., for
query benchmarks
 Sample subsets while maintaining
graph topology (e.g. to investigate
network effects in social sciences
research)
 Research question: effective and
discriminative features (non-redundant,
non-noisy) for representing specific
categories of datasets
Zloch, M., Acosta, M., Hienert, D., Dietze, S., Conrad, S., A
Software Framework and Datasets for the Analysis of Graph
Measures on RDF Graphs, ESWC19, Best Student Paper
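As a rough illustration of what graph-based profile features look like, the sketch below computes a handful of basic measures from a list of triples. It is not the framework from the cited paper: the function name `graph_profile`, the selected measures and the toy triples are assumptions, subjects and objects are both treated as nodes, and parallel edges are counted individually.

```python
from collections import Counter
from statistics import mean, pstdev


def graph_profile(triples):
    """Compute a few basic graph measures from (subject, predicate, object) triples."""
    if not triples:
        raise ValueError("at least one triple is required")
    edges = [(s, o) for s, _, o in triples]
    nodes = {n for edge in edges for n in edge}
    out_deg = Counter(s for s, _ in edges)
    in_deg = Counter(o for _, o in edges)
    degrees = [out_deg[n] + in_deg[n] for n in nodes]
    n, m = len(nodes), len(edges)
    return {
        "nodes": n,
        "edges": m,
        # Density of a directed graph without self-loops.
        "density": m / (n * (n - 1)) if n > 1 else 0.0,
        "mean_degree": mean(degrees),
        "degree_stdev": pstdev(degrees),
        "max_in_degree": max(in_deg.values()),
    }


triples = [
    ("a", "knows", "b"),
    ("a", "knows", "c"),
    ("b", "knows", "c"),
    ("d", "cites", "c"),
]
profile = graph_profile(triples)
print(profile)
```

A profile like this yields one feature vector per dataset, which is the form needed to study which features are redundant and which discriminate between dataset categories.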
Profiling (Graph) Datasets
[Figure: feature correlation matrix. Noisy vs. homogenous features; lighter colour = more homogenous metric within domain]
[Figure: discriminative features. Descriptive features able to characterise specific dataset categories (= feature impact in binary classification task aimed at distinguishing each dataset category)]
 Certain kinds of datasets (categories) hard to
describe due to inherent diversity/variance of
datasets
 Selection of descriptive, non-redundant dataset
profile features varies for different dataset categories
(and use cases)
Overview
Part I
Mining, fusing and linking research data (in particular: metadata) on the Web
Part II
Mining novel forms of research data knowledge graphs from the Web
[Figure labels: Datasets, Metadata, Publications, Web pages, Opinions, Claims, Stances]
Mining opinions & interactions (the case of Twitter)
[Figure: example tweet annotation; entity http://dbpedia.org/resource/Tim_Berners-Lee, wna:positive-emotion, onyx:hasEmotionIntensity "0.75", onyx:hasEmotionIntensity "0.0"]
 Heterogeneity: multimodal, multilingual, informal,
“noisy” language
 Context dependence: interpretation of
tweets/posts (entities, sentiments) requires
consideration of context (e.g. time, linked
content), e.g. “Dusseldorf” => city or football team
 Dynamics & scale: e.g. 6000 tweets per second,
plus interactions (retweets etc) and context (e.g.
25% of tweets contain URLs)
 Evolution and temporal aspects: evolution of
interactions over time crucial for many social
sciences questions
 Representativity and bias: demographic
distributions not known a priori in archived data
collections
[Figure: example tweet annotation; entity http://dbpedia.org/resource/Solid, wna:negative-emotion]
P. Fafalios, V. Iosifidis, E. Ntoutsi, and S. Dietze, TweetsKB: A Public
and Large-Scale RDF Corpus of Annotated Tweets, ESWC'18.
TweetsKB: a knowledge graph of Web mined “opinions”
https://data.gesis.org/tweetskb/
 Harvesting & archiving of 9 Bn tweets over 6 years
(permanent collection from Twitter 1% sample since
2013)
 Information extraction pipeline to build a KG of entities,
interactions & sentiments
(distributed batch processing via Hadoop Map/Reduce)
o Entity linking with knowledge graph/DBpedia
(Yahoo’s FEL [Blanco et al. 2015])
(“president”/“potus”/“trump” =>
dbp:DonaldTrump), to disambiguate text and use
background knowledge (e.g. US politicians?
Republicans?); high precision (.85), low recall (.39)
o Sentiment analysis/annotation using SentiStrength
[Thelwall et al., 2017], F1 approx. .80
o Extraction of metadata and lifting into established
schemas (SIOC, schema.org), publication using W3C
standards (RDF/SPARQL)
Use cases
 Aggregating sentiments towards topics/entities, e.g. about
CDU vs SPD politicians in particular time period
 Twitter archives as general corpus for understanding temporal
entity relatedness (e.g. “austerity” & “Greece” 2010-2015)
 Investigating spreading & impact of fake news
(e.g. TweetsKB, ClaimsKG, stance detection)
Limitations
 Bias & representativity: demographic distributions of users
(not known a priori and not representative)
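The first use case, aggregating sentiment towards entities in a given time period, can be sketched as follows. The record layout `(entity, date, positive, negative)`, the entity identifiers and the scores are hypothetical and do not reflect the TweetsKB schema; in practice the annotations would be retrieved from the knowledge graph first.

```python
from collections import defaultdict
from datetime import date


def mean_sentiment(annotations, start, end):
    """Average (positive - negative) score per entity for tweets in [start, end]."""
    totals = defaultdict(lambda: [0.0, 0])  # entity -> [sum of scores, count]
    for entity, day, positive, negative in annotations:
        if start <= day <= end:
            totals[entity][0] += positive - negative
            totals[entity][1] += 1
    return {entity: s / n for entity, (s, n) in totals.items()}


# Hypothetical entity-level sentiment annotations.
annotations = [
    ("ex:Cologne", date(2019, 1, 5), 0.75, 0.0),
    ("ex:Cologne", date(2019, 1, 9), 0.25, 0.5),
    ("ex:Duesseldorf", date(2019, 1, 7), 0.5, 0.25),
    ("ex:Duesseldorf", date(2018, 12, 1), 0.0, 1.0),  # outside the window
]
result = mean_sentiment(annotations, date(2019, 1, 1), date(2019, 1, 31))
print(result)
```

Such averages inherit the limitation named above: they describe the users in the sample, whose demographics are unknown, not the general population.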
[Figure: bar chart comparing Cologne and Düsseldorf; value axis from -0.4 to 0.4]
Research datasets (mentions/citations) on Twitter?
https://data.gesis.org/tweetskb/
Daily mentions of datasets (DOIs, literals)
in TweetsKB
(approx. 20.000 mentions of datasets per
month)
Why?
- Discovering long tail datasets
- Dataset popularity, trends, „paradata“,
usage, context
Plot © Robert Jäschke
Mining/finding knowledge about claims and stances
[Figure: stance, claim trustworthiness?]
Detecting stances towards claims/opinions
Motivation
 Problem: detecting stance of documents (e.g. Web
pages, scientific publications) towards a given claim
(unbalanced class distribution)
 Motivation: stance of documents (in particular
disagreement) useful (a) as signal for truthfulness
(fake news detection) and (b) for document or source
classification (PLDs, publishers)
Approach
 Cascading binary classifiers: addressing individual
issues (e.g. misclassification costs) per step
 Features, e.g. textual similarity (Word2Vec etc),
sentiments, LIWC, etc.
 Best-performing models: 1) SVM with class-wise
penalty, 2) CNN, 3) SVM with class-wise penalty
 Experiments on FNC-1 dataset (and FNC baselines)
Results
 Minor overall performance improvement
 Improvement on disagree class by 27%
(but still far from robust)
A. Roy, A. Ekbal, S. Dietze, P. Fafalios, Exploiting
stance hierarchies for cost-sensitive stance detection
of Web documents, WSDM2020 under review.
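The idea of cascading binary classifiers can be sketched as a chain of three yes/no decisions that together yield one of the four FNC-1 labels (agree, disagree, discuss, unrelated). The stage order shown here is an assumption, and the keyword rules are toy stand-ins for the trained SVM/CNN models; none of the function names come from the paper.

```python
def cascade_stance(document, claim, is_related, takes_stance, agrees):
    """Run three binary decisions in sequence and return one FNC-1 label."""
    if not is_related(document, claim):
        return "unrelated"
    if not takes_stance(document, claim):
        return "discuss"
    return "agree" if agrees(document, claim) else "disagree"


# Toy stand-ins for trained classifiers (hypothetical keyword rules).
def is_related(doc, claim):
    return bool(set(doc.lower().split()) & set(claim.lower().split()))


def takes_stance(doc, claim):
    cues = ("true", "confirmed", "false", "hoax")
    return any(word in cues for word in doc.lower().split())


def agrees(doc, claim):
    return not any(word in ("false", "hoax") for word in doc.lower().split())


claim = "the bridge collapsed"
docs = [
    "officials confirmed the bridge collapsed",
    "the bridge collapsed story is a hoax",
    "reports say the bridge collapsed",
    "stock markets rallied today",
]
labels = [cascade_stance(d, claim, is_related, takes_stance, agrees) for d in docs]
print(labels)
```

Because each stage is a separate model, a stage-specific misclassification cost (for example a higher penalty for missing the rare disagree class) can be set without affecting the other stages.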
ClaimsKG: a knowledge graph of claims and claim-related metadata
Motivation
 Claims spread across various
(unstructured) fact-checking sites
 Example: finding claims about / made by
US republican politicians across the Web?
Approach
 Harvesting claims & metadata from fact-
checking sites (e.g. snopes.com,
Politifact.com etc); currently approx.
30.000 claims, plus mining of
schema.org/ClaimReview markup (>
500.000 statements in Common Crawl
2017)
 Information extraction & linking
o Linking mentioned entities to DBpedia
o Normalisation of ratings (true, false,
mixture, other); coreference resolution
of claims
o Exposing data through established
vocabulary and W3C standards
(e.g. SPARQL endpoint)
https://data.gesis.org/claimskg/
A. Tchechmedjiev, P. Fafalios, K. Boland, S. Dietze, B. Zapilko, K.
Todorov, ClaimsKG – A Live Knowledge Graph of fact-checked Claims,
ISWC2019
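Normalisation of ratings means mapping each site's own verdict labels onto the four shared classes (true, false, mixture, other). The mapping table below is purely illustrative and is not the one used by ClaimsKG; the site-specific labels and the function name are assumptions.

```python
# Hypothetical mapping from site-specific verdicts to the four shared classes.
RATING_MAP = {
    "true": "true",
    "false": "false",
    "pants on fire": "false",
    "half true": "mixture",
    "mostly false": "mixture",
    "mixture": "mixture",
}


def normalise_rating(raw):
    """Map a raw fact-checker verdict to true / false / mixture / other."""
    # Lowercase, unify hyphens and whitespace, drop trailing punctuation.
    key = " ".join(raw.lower().replace("-", " ").split()).strip("!.")
    return RATING_MAP.get(key, "other")


for verdict in ["Pants on Fire!", "Half-True", "  TRUE ", "Unproven"]:
    print(verdict.strip(), "->", normalise_rating(verdict))
```

Anything not covered by the table falls into "other", which keeps the normalised ratings a closed set that can be queried uniformly across fact-checking sites.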
Conclusions
Mining and profiling of research dataset metadata (KGs)
 Mining of unstructured Web pages and scholarly articles for research datasets &
metadata
 Profiling of research datasets for discovery, sampling, generation of synthetic data
 Plenty of related initiatives and efforts
(e.g. Rich Context, Research Graph, OpenAIRE, ORKG)
 Some challenges: generalisable/reusable methods for extraction & mining across
domains and corpora
Mining and sharing novel forms of research data (KGs)
 Mining the Web for novel forms of research data
 Examples from social sciences: opinions (sentiments on entities) and interactions
on Twitter & structured knowledge about resource relations (for instance: stances)
and claims
 Some challenges: language understanding/interpretation, representativity and
bias
Acknowledgements
Co-authors
• Maribel Acosta (KIT, Karlsruhe)
• Mohamad Ben Ellefi (LIRMM, France)
• Katarina Boland (GESIS, Germany)
• Stefan Conrad (HHU, Germany)
• Elena Demidova (L3S, Germany)
• Asif Ekbal (IIT Patna, India)
• Pavlos Fafalios (L3S, Germany)
• Ujwal Gadiraju (L3S, Germany)
• Daniel Hienert (GESIS, Germany)
• Peter Holtz (IWM, Germany)
• Eirini Ntoutsi (LUH, Germany)
• Vasilis Iosifidis (L3S, Germany)
• Markus Rokicki (L3S, Germany)
• Arjun Roy (IIT Patna, India)
• Renato Stoffalette Joao (L3S, Germany)
• Davide Taibi (CNR, ITD, Italy)
• Nicolas Tempelmeier (L3S, Germany)
• Konstantin Todorov (LIRMM, France)
• Ran Yu (GESIS, Germany)
• Benjamin Zapilko (GESIS, Germany)
• Matthäus Zloch (GESIS, Germany)
Knowledge Technologies for the Social Sciences (WTS)
https://www.gesis.org/en/institute/departments/knowledge-technologies-for-the-social-sciences/
WTS Labs
https://www.gesis.org/en/research/applied-computer-science/labs/wts-research-labs
Data & Knowledge Engineering @ HHU
https://www.cs.hhu.de/en/research-groups/data-knowledge-engineering.html
L3S
http://www.l3s.de
Personal
http://stefandietze.net

More Related Content

What's hot (20)

PPTX
Ziegler Open Data in Special Collections Libraries
National Information Standards Organization (NISO)
 
PDF
McGeary Data Curation Network: Developing and Scaling
National Information Standards Organization (NISO)
 
PPTX
Omitola birmingham cityuniv
Tope Omitola
 
PDF
Linking Knowledge Organization Systems via Wikidata (DCMI conference 2018)
Joachim Neubert
 
PPTX
bioCADDIE Webinar: The NIDDK Information Network (dkNET) - A Community Resear...
dkNET
 
PDF
Data Visualization in the Newsroom
Carl V. Lewis
 
PDF
Tuning Personalized PageRank for Semantics-aware Recommendations based on Lin...
Cataldo Musto
 
PDF
Keystone summer school_2015_miguel_antonio_ldcompression_4-joined
Joel Azzopardi
 
PPTX
Sanderson Shout It Out: LOUD
National Information Standards Organization (NISO)
 
PDF
IoTSE-based Open Database Vulnerability inspection in three Baltic Countries:...
Anastasija Nikiforova
 
PPTX
Towards Knowledge Graph based Representation, Augmentation and Exploration of...
Sören Auer
 
PDF
Exploration, visualization and querying of linked open data sources
Laura Po
 
PPTX
Washington Linked Data Authority Service at University of Houston
National Information Standards Organization (NISO)
 
PPTX
BROWN BAG TALK WITH MICAH ALTMAN, SOURCES OF BIG DATA FOR SOCIAL SCIENCES
Micah Altman
 
PDF
Linked Data
Anusuriya Devaraju
 
PPTX
Describing Scholarly Contributions semantically with the Open Research Knowle...
Sören Auer
 
PDF
Retrieval, Crawling and Fusion of Entity-centric Data on the Web
Stefan Dietze
 
PPTX
Cognitive data
Sören Auer
 
PDF
Introduction to linked data
Laura Po
 
PDF
Turning Data into Knowledge (KESW2014 Keynote)
Stefan Dietze
 
Ziegler Open Data in Special Collections Libraries
National Information Standards Organization (NISO)
 
McGeary Data Curation Network: Developing and Scaling
National Information Standards Organization (NISO)
 
Omitola birmingham cityuniv
Tope Omitola
 
Linking Knowledge Organization Systems via Wikidata (DCMI conference 2018)
Joachim Neubert
 
bioCADDIE Webinar: The NIDDK Information Network (dkNET) - A Community Resear...
dkNET
 
Data Visualization in the Newsroom
Carl V. Lewis
 
Tuning Personalized PageRank for Semantics-aware Recommendations based on Lin...
Cataldo Musto
 
Keystone summer school_2015_miguel_antonio_ldcompression_4-joined
Joel Azzopardi
 
IoTSE-based Open Database Vulnerability inspection in three Baltic Countries:...
Anastasija Nikiforova
 
Towards Knowledge Graph based Representation, Augmentation and Exploration of...
Sören Auer
 
Exploration, visualization and querying of linked open data sources
Laura Po
 
Washington Linked Data Authority Service at University of Houston
National Information Standards Organization (NISO)
 
BROWN BAG TALK WITH MICAH ALTMAN, SOURCES OF BIG DATA FOR SOCIAL SCIENCES
Micah Altman
 
Linked Data
Anusuriya Devaraju
 
Describing Scholarly Contributions semantically with the Open Research Knowle...
Sören Auer
 
Retrieval, Crawling and Fusion of Entity-centric Data on the Web
Stefan Dietze
 
Cognitive data
Sören Auer
 
Introduction to linked data
Laura Po
 
Turning Data into Knowledge (KESW2014 Keynote)
Stefan Dietze
 

Similar to Beyond research data infrastructures: exploiting artificial & crowd intelligence towards building research knowledge graphs (20)

PDF
Web-scale semantic search
Edgar Meij
 
PPT
Applications of Semantic Technology in the Real World Today
Amit Sheth
 
PDF
Beyond Linked Data - Exploiting Entity-Centric Knowledge on the Web
Stefan Dietze
 
PDF
A SEMANTIC BASED APPROACH FOR KNOWLEDGE DISCOVERY AND ACQUISITION FROM MULTIP...
IJwest
 
PPT
euclid_linkedup WWW tutorial (Besnik Fetahu)
Besnik Fetahu
 
PDF
Mining and Understanding Activities and Resources on the Web
Stefan Dietze
 
PDF
Beyond Meta-Data: Nano-Publications Recording Scientific Endeavour
KNOWeSCAPE2014
 
PDF
Discovering Related Data Sources in Data Portals
Peter Haase
 
PDF
Hide the Stack: Toward Usable Linked Data
aba-sah
 
PPTX
SWT Lecture Session 1 - Introduction
Mariano Rodriguez-Muro
 
PPTX
Linked Energy Data Generation
Filip Radulovic
 
PPT
4.5 mining the worldwideweb
Krish_ver2
 
PDF
Web Science Synergies: Exploring Web Knowledge through the Semantic Web
Stefan Dietze
 
PPTX
The Unreasonable Effectiveness of Metadata
James Hendler
 
DOC
Introduction abstract
Sanghvi Innovative Academy
 
PPT
Introduction to the Semantic Web
GIS Colorado
 
PDF
A semantic based approach for knowledge discovery and acquistion from multipl...
csandit
 
PDF
A SEMANTIC BASED APPROACH FOR KNOWLEDGE DISCOVERY AND ACQUISITION FROM MULTIP...
csandit
 
PDF
A SEMANTIC BASED APPROACH FOR KNOWLEDGE DISCOVERY AND ACQUISITION FROM MULTIP...
cscpconf
 
PDF
Data Research Vision
PlanetData Network of Excellence
 
Web-scale semantic search
Edgar Meij
 
Applications of Semantic Technology in the Real World Today
Amit Sheth
 
Beyond Linked Data - Exploiting Entity-Centric Knowledge on the Web
Stefan Dietze
 
A SEMANTIC BASED APPROACH FOR KNOWLEDGE DISCOVERY AND ACQUISITION FROM MULTIP...
IJwest
 
euclid_linkedup WWW tutorial (Besnik Fetahu)
Besnik Fetahu
 
Mining and Understanding Activities and Resources on the Web
Stefan Dietze
 
Beyond Meta-Data: Nano-Publications Recording Scientific Endeavour
KNOWeSCAPE2014
 
Discovering Related Data Sources in Data Portals
Peter Haase
 
Hide the Stack: Toward Usable Linked Data
aba-sah
 
SWT Lecture Session 1 - Introduction
Mariano Rodriguez-Muro
 
Linked Energy Data Generation
Filip Radulovic
 
4.5 mining the worldwideweb
Krish_ver2
 
Web Science Synergies: Exploring Web Knowledge through the Semantic Web
Stefan Dietze
 
The Unreasonable Effectiveness of Metadata
James Hendler
 
Introduction abstract
Sanghvi Innovative Academy
 
Introduction to the Semantic Web
GIS Colorado
 
A semantic based approach for knowledge discovery and acquistion from multipl...
csandit
 
A SEMANTIC BASED APPROACH FOR KNOWLEDGE DISCOVERY AND ACQUISITION FROM MULTIP...
csandit
 
A SEMANTIC BASED APPROACH FOR KNOWLEDGE DISCOVERY AND ACQUISITION FROM MULTIP...
cscpconf
 
Data Research Vision
PlanetData Network of Excellence
 
Ad

More from Stefan Dietze (20)

PDF
Understanding Scientific and Societal Adoption and Impact of Science Through ...
Stefan Dietze
 
PDF
NEWORDER Project - Science in the online knowledge order
Stefan Dietze
 
PDF
Collecting & Temporal Analysis of Behavioral Web Data - Tales From The Inside
Stefan Dietze
 
PDF
AI in between online and offline discourse - and what has ChatGPT to do with ...
Stefan Dietze
 
PDF
An interdisciplinary journey with the SAL spaceship – results and challenges ...
Stefan Dietze
 
PDF
Research Knowledge Graphs at NFDI4DS & GESIS
Stefan Dietze
 
PDF
Human-in-the-Loop: das Web als Grundlage interdisziplinärer Data Science Meth...
Stefan Dietze
 
PDF
From Web Data to Knowledge: on the Complementarity of Human and Artificial In...
Stefan Dietze
 
PDF
Using AI to understand everyday learning on the Web
Stefan Dietze
 
PDF
Analysing User Knowledge, Competence and Learning during Online Activities
Stefan Dietze
 
PDF
Analysing & Improving Learning Resources Markup on the Web
Stefan Dietze
 
PDF
Big Data in Learning Analytics - Analytics for Everyday Learning
Stefan Dietze
 
PDF
Towards embedded Markup of Learning Resources on the Web
Stefan Dietze
 
PDF
Semantic Linking & Retrieval for Digital Libraries
Stefan Dietze
 
PDF
Linked Data for Architecture, Engineering and Construction (AEC)
Stefan Dietze
 
PDF
Dietze linked data-vr-es
Stefan Dietze
 
PDF
Open Education Challenge 2014: exploiting Linked Data in Educational Applicat...
Stefan Dietze
 
PDF
From Data to Knowledge - Profiling & Interlinking Web Datasets
Stefan Dietze
 
PDF
WWW2014 Tutorial: Online Learning & Linked Data - Lessons Learned
Stefan Dietze
 
PDF
What's all the data about? - Linking and Profiling of Linked Datasets
Stefan Dietze
 
Understanding Scientific and Societal Adoption and Impact of Science Through ...
Stefan Dietze
 
NEWORDER Project - Science in the online knowledge order
Stefan Dietze
 
Collecting & Temporal Analysis of Behavioral Web Data - Tales From The Inside
Stefan Dietze
 
AI in between online and offline discourse - and what has ChatGPT to do with ...
Stefan Dietze
 
An interdisciplinary journey with the SAL spaceship – results and challenges ...
Stefan Dietze
 
Research Knowledge Graphs at NFDI4DS & GESIS
Stefan Dietze
 
Human-in-the-Loop: das Web als Grundlage interdisziplinärer Data Science Meth...
Stefan Dietze
 
From Web Data to Knowledge: on the Complementarity of Human and Artificial In...
Stefan Dietze
 
Using AI to understand everyday learning on the Web
Stefan Dietze
 
Analysing User Knowledge, Competence and Learning during Online Activities
Stefan Dietze
 
Analysing & Improving Learning Resources Markup on the Web
Stefan Dietze
 
Big Data in Learning Analytics - Analytics for Everyday Learning
Stefan Dietze
 
Towards embedded Markup of Learning Resources on the Web
Stefan Dietze
 
Semantic Linking & Retrieval for Digital Libraries
Stefan Dietze
 
Linked Data for Architecture, Engineering and Construction (AEC)
Stefan Dietze
 
Dietze linked data-vr-es
Stefan Dietze
 
Open Education Challenge 2014: exploiting Linked Data in Educational Applicat...
Stefan Dietze
 
From Data to Knowledge - Profiling & Interlinking Web Datasets
Stefan Dietze
 
WWW2014 Tutorial: Online Learning & Linked Data - Lessons Learned
Stefan Dietze
 
Beyond research data infrastructures: exploiting artificial & crowd intelligence towards building research knowledge graphs

  • 1. Beyond research data infrastructures - exploiting artificial & crowd intelligence for building research knowledge graphs
      - Stefan Dietze, GESIS – Leibniz Institute for the Social Sciences & Heinrich-Heine-Universität Düsseldorf
      - LWDA2019, 02 October 2019
  • 2. (Buzzword) Bingo !?
      - research data infrastructure, data fusion, distant supervision, Web mining, distributional semantics, knowledge graph, neural entity linking, research data, machine learning, social web, artificial intelligence, semantics, claim extraction, stance detection, fact verification, crowd
  • 3. Finding research data on the Web?
  • 4. Finding research data on the Web?
  • 5. Finding research data on the Web?
  • 6. Finding (social sciences) research data on the Web
  • 7. Traditional & novel forms of research data: the case of social sciences
      - Traditional social science research data: survey & census data, microdata, lab studies etc. (lacking scale and dynamics)
      - Social science vision: substituting & complementing traditional research data with data mined from the Web
      - Example: investigations into misinformation and opinion forming (e.g. [Vosoughi et al. 2018])
      - Such work usually aims at social science insights while also dealing with methodological/computational challenges
      - Insights, mostly in (computational) social sciences, e.g.
          o Spreading of claims and misinformation
          o Effect of biased and fake news on public opinion
          o Reinforcement of biases and echo chambers
      - Methods, mostly in computer science, e.g. for
          o Crawling, harvesting, scraping of data
          o Extraction of structured knowledge (entities, sentiments, stances, claims, etc.)
          o Claim/fact detection and verification ("fake news detection"), e.g. CLEF 2018 Fact Checking Lab
          o Stance detection, e.g. Fake News Challenge (FNC)
  • 8. Overview
      - Part I: Mining, fusing and linking research data (in particular: metadata) on the Web
      - Part II: Mining novel forms of research knowledge graphs from the Web
      - [Figure labels: Datasets, Metadata, Publications, Web pages, Opinions, Claims, Stances]
  • 9. Web mining of dataset metadata (or: dataset KGs)
      - Harvesting from open data portals (e.g. DCAT/VoID metadata on DataHub.io, DataCite etc.)
      - Information extraction on the long tail of Web documents? => dynamics & scale: approx. 50 trn (50,000,000,000,000) Web pages indexed by Google (plus a gazillion temporal snapshots)
      - Embedded markup (RDFa, Microdata, Microformats) for annotation of Web pages
      - Supports Web search & interpretation
      - Pushed by Google, Yahoo, Bing et al. (schema.org vocabulary)
      - Adopted by 38% of all Web pages (sample: Common Crawl 2016, 3.2 bn Web pages)
      - Easily accessible, large-scale source of factual knowledge (about research data & research information)
      - Large-scale source of training data, e.g. manually annotated Web pages citing datasets
      - Example markup:
          <div itemscope itemtype="http://schema.org/Dataset">
            <h1 itemprop="name">World Bank-Commodity Prices</h1>
            <span itemprop="distribution">URL-X</span>
            <span itemprop="license">CC-BY</span>
            ...
          </div>
      - Facts ("quads"):
          node  | property       | value         | source
          node1 | name           | WB Commodity  | URI-1
          node1 | distribution   | node_xy       | URI-1
          node1 | creator        | Worldbank     | URI-1
          node1 | dateCreated    | 26 April 2017 | URI-1
          node2 | creator        | World Bank    | URI-2
          node2 | encodingFormat | text/CSV      | URI-2
          node3 | dateCreated    | 26 April 2007 | URI-3
          node3 | keywords       | crude         | URI-3
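The step from embedded markup to quads can be sketched as follows. This is a minimal illustration using only Python's standard library, not the extraction tooling used in the talk. It assumes flat Microdata (no nested items, no tags inside a property element), and the names `MicrodataFacts`, `extract_facts`, `SAMPLE` and the source label `URI-1` are hypothetical.

```python
from html.parser import HTMLParser


class MicrodataFacts(HTMLParser):
    """Collect (node, property, value, source) quads from flat Microdata."""

    def __init__(self, source_uri):
        super().__init__()
        self.source_uri = source_uri
        self.facts = []
        self._node_count = 0
        self._node = None   # identifier of the current itemscope
        self._prop = None   # itemprop whose text is being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs:
            # A new item starts: give it a local node identifier.
            self._node_count += 1
            self._node = f"node{self._node_count}"
            if attrs.get("itemtype"):
                self.facts.append(
                    (self._node, "type", attrs["itemtype"], self.source_uri))
        elif "itemprop" in attrs and self._node is not None:
            self._prop = attrs["itemprop"]
            self._text = []

    def handle_data(self, data):
        if self._prop is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumption: property elements contain text only, so the next
        # closing tag ends the property.
        if self._prop is not None:
            value = "".join(self._text).strip()
            self.facts.append((self._node, self._prop, value, self.source_uri))
            self._prop = None


def extract_facts(html, source_uri):
    parser = MicrodataFacts(source_uri)
    parser.feed(html)
    parser.close()
    return parser.facts


SAMPLE = """
<div itemscope itemtype="http://schema.org/Dataset">
  <h1 itemprop="name">World Bank-Commodity Prices</h1>
  <span itemprop="distribution">URL-X</span>
  <span itemprop="license">CC-BY</span>
</div>
"""

facts = extract_facts(SAMPLE, "URI-1")
for fact in facts:
    print(fact)
```

Each printed tuple corresponds to one row of the quad listing: a local node identifier, a property, a literal value and the page the statement came from.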
  • 10. Research dataset markup on the Web
      - In Common Crawl 2017 (3.2 bn pages):
          o 14.1 M statements & 3.4 M instances related to "s:Dataset"
          o Spread across 200 K pages from 2878 PLDs (pay-level domains); top 10% of PLDs provide 95% of data
      - Studies of scholarly articles and other types [SAVESD16, WWW2017]: majority of major publishers, data hosting sites, data registries, libraries, research organisations represented
      - [Figure: power law distribution of dataset metadata across PLDs]
      - Challenges
          o Errors: factual errors, annotation errors (see also [Meusel et al., ESWC2015])
          o Ambiguity & coreferences: e.g. 18,000 entity descriptions of "iPhone 6" in Common Crawl 2016 & ambiguous literals (e.g. "Apple")
          o Redundancies & conflicts: vast amounts of equivalent or conflicting statements
  • 11. KnowMore: data fusion on markup
      - 0. Noise: data cleansing (node URIs, deduplication etc.)
      - 1.a) Scale: blocking through BM25 entity retrieval on markup index
      - 1.b) Relevance: supervised coreference resolution
      - 2.) Quality & redundancy: data fusion through supervised fact classification (SVM, kNN, RF, LR, NB), diverse feature set (authority, relevance etc.), considering source-level (e.g. PageRank), entity-level & fact-level features
      - Example (figure):
          o Query: WorldBank Commodity, Prices 2019, type:(Dataset)
          o Web page markup from Web crawl (Common Crawl, 44 bn facts); approx. 125,000 facts for query [ s:Product, "iPhone6" ]
          o 1. Blocking & coreference resolution, candidate facts:
              node1 name WB Commodity
              node1 distribution node_xy
              node1 creator Worldbank
              node1 dateReleased 26 April 2019
              node2 creator World Bank
              node2 encodingFormat text/CSV
              node3 dateCreated 26 April 2007
              node4 keywords "crude"
          o 2. Fusion / fact selection (supervised), entity description:
              name: "WorldBank Commodity Prices 2019"
              distribution: Worldbank (node)
              releaseDate: 26.04.2019
              keywords: "crude", "prizes", "market"
              encodingFormat: text/CSV
          o New queries: WorldBank, type:(Organization); Washington, type:(City); David Malpass, type:(Person)
      - References:
          o Yu, R., [..], Dietze, S., KnowMore-Knowledge Base Augmentation with Structured Web Markup, Semantic Web Journal 2019 (SWJ2019)
          o Tempelmeier, N., Demidova, E., Dietze, S., Inferring Missing Categorical Information in Noisy and Sparse Web Markup, The Web Conf. 2018 (WWW2018)
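Blocking with BM25 (step 1.a) can be illustrated with a small ranking function over entity descriptions. This is a sketch under assumptions: it uses the common Okapi BM25 formula with a non-negative idf variant and default parameters k1 = 1.2 and b = 0.75, none of which are stated in the slides. The function `bm25_rank` and the toy descriptions are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def bm25_rank(query, docs, k1=1.2, b=0.75, top_k=3):
    """Return the ids of the top_k descriptions with a positive BM25 score."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized.values()) / n_docs
    # Document frequency: in how many descriptions does each term occur?
    doc_freq = Counter(t for toks in tokenized.values() for t in set(toks))

    scores = {}
    for doc_id, toks in tokenized.items():
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(
                1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Toy entity descriptions, one string per markup node (illustrative only).
descriptions = {
    "node1": "WB Commodity Worldbank",
    "node2": "World Bank commodity prices text/CSV",
    "node3": "crude oil statistics",
    "node4": "iPhone 6 smartphone",
}

candidates = bm25_rank("WorldBank Commodity Prices", descriptions)
print(candidates)
```

Only descriptions sharing at least one query term survive as candidates, which is the point of blocking: the more expensive supervised coreference step (1.b) then runs on a small candidate set rather than on the whole markup index.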
  • 12. KnowMore: data fusion on markup (continued)
      - Fusion performance
          o Experiments on books, movies, products (ongoing: datasets)
          o Baselines: BM25, CBFS [ESWC2015], PreRecCorr [Pochampally et al., ACM SIGMOD 2014]; strong variance across types
      - Knowledge graph augmentation
          o On average 60% - 70% of all facts new (across DBpedia, Wikidata, Freebase)
          o Additional experiments on learning new categorical features (e.g. product categories or movie genres) [WWW2018]
  • 13. Rich Context & Coleridge Initiative: building (yet another) KG of scholarly resources & datasets
      - Context/corpus: publications (currently: social sciences, SAGE Publishing)
      - Tasks:
          I. Extraction/disambiguation of dataset mentions
          II. Extraction/detection of research methods
          III. Classification of research fields
      - https://coleridgeinitiative.org/richcontextcompetition
  • 14. Applications: search for social science resources (and links)
      - https://search.gesis.org/
  • 15. Disambiguation of dataset citations
      - Example passage: "All these issues are addressed in the current report, which is based on analysis of data obtained in the National Comorbidity Survey (NCS) (15). The NCS is a nationally representative survey of the US household population that includes retrospective reports about the ages at onset and lifetime occurrences of suicidal ideation, plans, and attempts along with information about the occurrences of mental disorders, substance use, substance abuse, and substance dependence."
      - Dataset mentions in the passage: National Comorbidity Survey (NCS); NCS
      - Challenges
          o Ambiguous (incomplete) citations
          o Lack of high-quality and representative training data (usually: weak labels, domain bias)
      - Approaches & results
          o Prior work: supervised pattern induction [Boland et al., TPDL2012]
          o Current approach: neural NER based on spaCy (CRF-based approach for research method detection)
          o Training (testing) on 12,000 (3,000) paragraphs (distribution of negative/positive differs; training batch size = 25, dropout = 0.4)
          o Results approx. P = .50, R = .90 (weakly labelled test data)
          o On a small set of manually labelled test data: P = .52, R = .21
      - Reference: Otto, W. et al., Knowledge Extraction from scholarly publications – the GESIS contribution to the Rich Context Competition, to appear, Sage Publishing, 2020
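To compare the two evaluation settings with a single number, the reported precision and recall can be combined into the harmonic mean F1 = 2PR / (P + R). The slide itself does not report F1; the values below are derived from its P and R figures only, and the helper name `f1_score` is hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# P and R as reported on the slide; F1 is derived here, not reported there.
weak_f1 = f1_score(0.50, 0.90)     # weakly labelled test data
manual_f1 = f1_score(0.52, 0.21)   # small manually labelled test set

print(f"weakly labelled:   F1 = {weak_f1:.2f}")
print(f"manually labelled: F1 = {manual_f1:.2f}")
```

The derived values (roughly 0.64 versus 0.30) make the gap between weakly and manually labelled test data easier to see: precision is almost unchanged, so the drop is driven by recall.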
  • 16. Profiling datasets for dataset search
      - Dataset metadata is crucial for search, discovery, reuse
      - But: dataset metadata is sparse, incomplete, noisy, costly
      - Profiling datasets = generating dataset metadata from the actual data(set) at hand
      - Various profiling dimensions depending on use case (e.g. statistical features, dynamics, topics), cf. [SWJ18]
      - Work on topic profiling [ESWC14] and profiling of graph features [ESWC19]
      - References:
          o Ben Ellefi, M., Bellahsene, Z., Breslin, J., Demidova, E., Dietze, S., Szymanski, J., Todorov, K., RDF Dataset Profiling – a Survey of Features, Methods, Vocabularies and Applications, Semantic Web Journal, IOS Press 2018
          o Fetahu, B., Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl, W., A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles, ESWC2014
  • 17. Profiling (Graph) Datasets
      - Graph metrics are essential to describe/profile research datasets of graph-based nature (social graphs, knowledge graphs) in order to:
          o Distinguish dataset categories
          o Find/discover datasets with particular features
          o Generate synthetic data, e.g. for query benchmarks
          o Sample subsets while maintaining graph topology (e.g. to investigate network effects in social sciences research)
      - Research question: effective and discriminative features (non-redundant, non-noisy) for representing specific categories of datasets
      - [Figure: feature correlation matrix]
      - Reference: Zloch, M., Acosta, M., Hienert, D., Dietze, S., Conrad, S., A Software Framework and Datasets for the Analysis of Graph Measures on RDF Graphs, ESWC19, Best Student Paper
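A few simple graph measures of the kind used for profiling can be computed directly from subject-predicate-object triples. The sketch below is not the framework from the cited ESWC19 paper. The measure selection, the function `graph_profile` and the toy triples are assumptions made for illustration, and repeated triples are counted as separate edges.

```python
from collections import Counter


def graph_profile(triples):
    """Compute basic structural measures for a directed graph given as triples."""
    out_deg = Counter()
    in_deg = Counter()
    nodes = set()
    for subject, _predicate, obj in triples:
        out_deg[subject] += 1
        in_deg[obj] += 1
        nodes.update((subject, obj))

    n, m = len(nodes), len(triples)
    return {
        "nodes": n,
        "edges": m,
        # Every edge contributes one out-degree and one in-degree.
        "mean_degree": 2 * m / n if n else 0.0,
        "max_in_degree": max(in_deg.values(), default=0),
        "max_out_degree": max(out_deg.values(), default=0),
        # Share of the n * (n - 1) possible directed edges that exist.
        "density": m / (n * (n - 1)) if n > 1 else 0.0,
    }


toy_triples = [
    ("a", "knows", "b"),
    ("a", "knows", "c"),
    ("b", "knows", "c"),
    ("d", "knows", "c"),
]

profile = graph_profile(toy_triples)
print(profile)
```

A profile like this yields one feature vector per dataset; the research question on the slide is then which of these features are non-redundant and discriminative for a given dataset category.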
  • 18. Profiling (Graph) Datasets
      - Noisy vs. homogeneous features [figure: lighter colour = more homogeneous metric within domain]
      - Discriminative features: descriptive features able to characterise specific dataset categories (= feature impact in a binary classification task aimed at distinguishing each dataset category)
      - Certain kinds of datasets (categories) are hard to describe due to the inherent diversity/variance of datasets
      - Selection of descriptive, non-redundant dataset profile features varies across dataset categories (and use cases)
      - Reference: Zloch, M., Acosta, M., Hienert, D., Dietze, S., Conrad, S., A Software Framework and Datasets for the Analysis of Graph Measures on RDF Graphs, ESWC19, Best Student Paper
  • 19. Overview
      - Part I: Mining, fusing and linking research data (in particular: metadata) on the Web
      - Part II: Mining novel forms of research data knowledge graphs from the Web
      - [Figure labels: Datasets, Metadata, Publications, Web pages, Opinions, Claims, Stances]
  • 20. Mining opinions & interactions (the case of Twitter)
      - Heterogeneity: multimodal, multilingual, informal, "noisy" language
      - Context dependence: interpretation of tweets/posts (entities, sentiments) requires consideration of context (e.g. time, linked content), e.g. "Dusseldorf" => city or football team
      - Dynamics & scale: e.g. 6000 tweets per second, plus interactions (retweets etc.) and context (e.g. 25% of tweets contain URLs)
      - Evolution and temporal aspects: evolution of interactions over time is crucial for many social sciences questions
      - Representativity and bias: demographic distributions not known a priori in archived data collections
      - [Figure: example annotations linking http://dbpedia.org/resource/Tim_Berners-Lee and http://dbpedia.org/resource/Solid, with wna:positive-emotion, wna:negative-emotion and onyx:hasEmotionIntensity values "0.75" and "0.0"]
      - Reference: P. Fafalios, V. Iosifidis, E. Ntoutsi, and S. Dietze, TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets, ESWC'18.
  • 21. TweetsKB: a knowledge graph of Web mined "opinions"
      - https://data.gesis.org/tweetskb/
      - Harvesting & archiving of 9 bn tweets over 6 years (permanent collection from Twitter 1% sample since 2013)
      - Information extraction pipeline to build a KG of entities, interactions & sentiments (distributed batch processing via Hadoop Map/Reduce)
          o Entity linking with knowledge graph/DBpedia (Yahoo's FEL [Blanco et al. 2015]) ("president"/"potus"/"trump" => dbp:DonaldTrump), to disambiguate text and use background knowledge (e.g. US politicians? Republicans?); high precision (.85), low recall (.39)
          o Sentiment analysis/annotation using SentiStrength [Thelwall et al., 2012], F1 approx. .80
          o Extraction of metadata and lifting into established schemas (SIOC, schema.org), publication using W3C standards (RDF/SPARQL)
      - Reference: P. Fafalios, V. Iosifidis, E. Ntoutsi, and S. Dietze, TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets, ESWC'18.
  • 22. TweetsKB: a knowledge graph of Web mined "opinions" (continued)
      - https://data.gesis.org/tweetskb/
      - Use cases
          o Aggregating sentiments towards topics/entities, e.g. about CDU vs SPD politicians in a particular time period
          o Twitter archives as general corpus for understanding temporal entity relatedness (e.g. "austerity" & "Greece" 2010-2015)
          o Investigating spreading & impact of fake news (e.g. TweetsKB, ClaimsKG, stance detection)
      - Limitations
          o Bias & representativity: demographic distributions of users (not known a priori and not representative)
      - [Figure: values for Cologne and Düsseldorf on an axis from -0.4 to 0.4]
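The first use case, aggregating sentiment towards entities per time period, can be sketched in a few lines. The record layout (entity, period, positive intensity, negative intensity), the function `mean_net_sentiment` and all values below are invented for illustration and are not taken from TweetsKB.

```python
from collections import defaultdict


def mean_net_sentiment(annotations):
    """Average (positive - negative) intensity per (entity, period)."""
    totals = defaultdict(lambda: [0.0, 0])   # key -> [sum of net scores, count]
    for entity, period, positive, negative in annotations:
        bucket = totals[(entity, period)]
        bucket[0] += positive - negative
        bucket[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Hypothetical annotated tweets: one row per entity mention.
annotations = [
    ("dbr:Cologne", "2019-09", 0.75, 0.0),
    ("dbr:Cologne", "2019-09", 0.25, 0.5),
    ("dbr:Düsseldorf", "2019-09", 0.0, 0.5),
]

scores = mean_net_sentiment(annotations)
for (entity, period), score in sorted(scores.items()):
    print(entity, period, score)
```

Because the underlying sample is not demographically representative (see Limitations), such aggregates describe the archived tweets rather than public opinion at large.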
  • 23. Research datasets (mentions/citations) on Twitter?
      - https://data.gesis.org/tweetskb/
      - [Figure: daily mentions of datasets (DOIs, literals) in TweetsKB (approx. 20,000 mentions of datasets per month); plot © Robert Jäschke]
      - Why?
          o Discovering long tail datasets
          o Dataset popularity, trends, "paradata", usage, context
  • 24. Mining/finding knowledge about claims and stances
      - [Figure labels: stance, claim trustworthiness?]
  • 25. Detecting stances towards claims/opinions
      - Motivation
          o Problem: detecting the stance of documents (e.g. Web pages, scientific publications) towards a given claim (unbalanced class distribution)
          o Motivation: stance of documents (in particular disagreement) is useful (a) as a signal for truthfulness (fake news detection) and (b) for document or source classification (PLDs, publishers)
      - Approach
          o Cascading binary classifiers: addressing individual issues (e.g. misclassification costs) per step
          o Features, e.g. textual similarity (Word2Vec etc.), sentiments, LIWC, etc.
          o Best-performing models: 1) SVM with class-wise penalty, 2) CNN, 3) SVM with class-wise penalty
          o Experiments on FNC-1 dataset (and FNC baselines)
      - Results
          o Minor overall performance improvement
          o Improvement on disagree class by 27% (but still far from robust)
      - Reference: A. Roy, A. Ekbal, S. Dietze, P. Fafalios, Exploiting stance hierarchies for cost-sensitive stance detection of Web documents, WSDM2020 under review.
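The control flow of a cascade of binary classifiers can be sketched as below. The slide names three steps but not what each step decides. The ordering used here (related vs. unrelated, then stance-taking vs. neutral, then agree vs. disagree, over the four FNC-1 labels) is an assumption, and the keyword-based `toy_*` functions are hypothetical stand-ins for the trained SVM and CNN models.

```python
def cascade_stance(claim, document, is_related, takes_stance, agrees):
    """Apply three binary decisions in sequence and return an FNC-1 label."""
    if not is_related(claim, document):
        return "unrelated"
    if not takes_stance(claim, document):
        return "discuss"
    return "agree" if agrees(claim, document) else "disagree"


# Toy stand-ins for the trained classifiers (illustrative only).
def toy_is_related(claim, document):
    # Related if claim and document share at least one token.
    return bool(set(claim.lower().split()) & set(document.lower().split()))


def toy_takes_stance(claim, document):
    return any(w in document.lower() for w in ("confirms", "false", "hoax"))


def toy_agrees(claim, document):
    return not any(w in document.lower() for w in ("false", "hoax"))


claim = "the earth is flat"
documents = [
    "Football results from Sunday",
    "Some people believe the earth is flat",
    "Scientists say the claim that the earth is flat is false",
]

labels = [
    cascade_stance(claim, doc, toy_is_related, toy_takes_stance, toy_agrees)
    for doc in documents
]
print(labels)
```

Separating the decisions means each step can use its own model and its own misclassification costs, which matters for the rare disagree class that only the last step has to handle.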
  • 26. ClaimsKG: a knowledge graph of claims and claim-related metadata
      - https://data.gesis.org/claimskg/
      - Motivation
          o Claims are spread across various (unstructured) fact-checking sites
          o Example: finding claims about / made by US Republican politicians across the Web?
      - Approach
          o Harvesting claims & metadata from fact-checking sites (e.g. snopes.com, Politifact.com etc.); currently approx. 30,000 claims (plus mining schema.org/ClaimReview markup: > 500,000 statements in Common Crawl 2017)
          o Information extraction & linking: linking mentioned entities to DBpedia
          o Normalisation of ratings (true, false, mixture, other); coreference resolution of claims
          o Exposing data through established vocabulary and W3C standards (e.g. SPARQL endpoint)
      - Reference: A. Tchechmedjiev, P. Fafalios, K. Boland, S. Dietze, B. Zapilko, K. Todorov, ClaimsKG – A Live Knowledge Graph of fact-checked Claims, ISWC2019
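Rating normalisation amounts to mapping each site's own verdict labels onto the four shared classes named above. The mapping table below is purely illustrative: the site labels and their assignments are assumptions, not the mapping used by ClaimsKG, and `normalise_rating` is a hypothetical helper.

```python
# Illustrative mapping from site-specific verdicts to the four shared classes.
NORMALISED = {
    "true": "TRUE",
    "false": "FALSE",
    "pants on fire": "FALSE",
    "mixture": "MIXTURE",
    "half true": "MIXTURE",
}


def normalise_rating(raw_label):
    """Map a raw verdict label to TRUE, FALSE, MIXTURE or OTHER."""
    return NORMALISED.get(raw_label.strip().lower(), "OTHER")


for label in ("True", " Pants on Fire ", "Half True", "Unproven"):
    print(repr(label), "->", normalise_rating(label))
```

Anything without an explicit mapping falls into OTHER, so new or unusual verdict labels never break harvesting and can be reviewed later.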
  • 27. Conclusions
      - Mining and profiling of research dataset metadata (KGs)
          o Mining of unstructured Web pages and scholarly articles for research datasets & metadata
          o Profiling of research datasets for discovery, sampling, generation of synthetic data
          o Plenty of related initiatives and efforts (e.g. Rich Context, Research Graph, OpenAIRE, ORKG)
          o Some challenges: generalisable/reusable methods for extraction & mining across domains and corpora
      - Mining and sharing novel forms of research data (KGs)
          o Mining the Web for novel forms of research data
          o Examples from social sciences: opinions (sentiments on entities) and interactions on Twitter & structured knowledge about resource relations (for instance: stances) and claims
          o Some challenges: language understanding/interpretation, representativity and bias
  • 28. Acknowledgements: co-authors
      - Maribel Acosta (KIT, Karlsruhe)
      - Mohamad Ben Ellefi (LIRMM, France)
      - Katarina Boland (GESIS, Germany)
      - Stefan Conrad (HHU, Germany)
      - Elena Demidova (L3S, Germany)
      - Asif Ekbal (IIT Patna, India)
      - Pavlos Fafalios (L3S, Germany)
      - Ujwal Gadiraju (L3S, Germany)
      - Daniel Hienert (GESIS, Germany)
      - Peter Holtz (IWM, Germany)
      - Eirini Ntoutsi (LUH, Germany)
      - Vasilis Iosifidis (L3S, Germany)
      - Markus Rokicki (L3S, Germany)
      - Arjun Roy (IIT Patna, India)
      - Renato Stoffalette Joao (L3S, Germany)
      - Davide Taibi (CNR, ITD, Italy)
      - Nicolas Tempelmeier (L3S, Germany)
      - Konstantin Todorov (LIRMM, France)
      - Ran Yu (GESIS, Germany)
      - Benjamin Zapilko (GESIS, Germany)
      - Matthäus Zloch (GESIS, Germany)
  • 29. Links
      - Knowledge Technologies for the Social Sciences (WTS): https://www.gesis.org/en/institute/departments/knowledge-technologies-for-the-social-sciences/
      - WTS Labs: https://www.gesis.org/en/research/applied-computer-science/labs/wts-research-labs
      - Data & Knowledge Engineering @ HHU: https://www.cs.hhu.de/en/research-groups/data-knowledge-engineering.html
      - L3S: http://www.l3s.de
      - Personal: http://stefandietze.net