Semantics and Machine Learning

Semantic Integration Is What You Do Before The Deep Learning
Vladimir Alexiev, Chief Data Architect, Sirma AI (Ontotext)
dev.bg Machine Learning seminar, 13 May 2019, Sofia

Outline
• Semantic Web and Linked Data
• Knowledge Graphs
• Ontotext Projects
• Ontotext Demos
• Use of Machine Learning

What is Semantic Web and Linked Open Data?
• Semantic Web and Semantic Technologies
• Exposing data and datasets to machines
• Allowing machines to "understand" a bit of the data. Not giving a "higher meaning" to data
• RDF, Ontologies, RDF Shapes
• RDF: simple graph data model: triples (S,P,O), also quads (S,P,O,C=G)
• RDFS and OWL Ontologies: describe classes, properties, subclasses, sub-properties,
description logic constructs
• RDF Shapes (Application Profiles): describe constraints on RDF data
• May use with or without schema; the schema is part of the data
• Linked Open Data
• Expose datasets globally, making each entity/data point addressable (URL)
• Use global identifiers not ambiguous names: "things not strings"
• Link entities
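
The triple and quad model above can be sketched with plain Python tuples. This is a minimal illustration, not a real RDF library: the `ex:` identifiers, the `match` helper and the `types_of` helper are all assumptions made up for the example. It also shows the point that the schema is part of the data: the `rdfs:subClassOf` statement sits in the same list as the instance data.

```python
from typing import Iterator, NamedTuple, Optional


class Quad(NamedTuple):
    s: str                    # subject
    p: str                    # predicate
    o: str                    # object
    g: Optional[str] = None   # named graph (context); None = default graph


# Hypothetical identifiers, for illustration only
data = [
    Quad("ex:TimBL", "rdf:type", "ex:Person", "ex:graph1"),
    Quad("ex:TimBL", "ex:invented", "ex:WorldWideWeb", "ex:graph1"),
    Quad("ex:Person", "rdfs:subClassOf", "ex:Agent"),  # schema is just more data
]


def match(quads, s=None, p=None, o=None, g=None) -> Iterator[Quad]:
    """Yield quads matching a pattern; None acts as a wildcard."""
    for q in quads:
        if all(want is None or want == got
               for want, got in zip((s, p, o, g), q)):
            yield q


def types_of(quads, entity):
    """Direct types of an entity plus superclasses via rdfs:subClassOf."""
    found = set()
    todo = [q.o for q in match(quads, s=entity, p="rdf:type")]
    while todo:
        cls = todo.pop()
        if cls not in found:
            found.add(cls)
            todo.extend(q.o for q in match(quads, s=cls, p="rdfs:subClassOf"))
    return found


timbl_types = types_of(data, "ex:TimBL")
```

Here `timbl_types` contains both the asserted class and the one inferred through the subclass statement, which is a very small taste of what RDFS reasoning adds.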

Web 1.0, 2.0, 3.0
• Web 1.0: linked documents (World Wide Web)
• Before it there was ftp, gopher, online library catalogs…
• Web 2.0: web applications, social web
• Has Facebook taken over the web? New "decentralization" movement
• Web 3.0: linked data (Giant Global Graph)
• Metadata about documents, but also data about real-world entities: persons,
organizations, hierarchy, projects, publications, companies, startups, transactions,
networks, servers, printers, IoT things, etc

Where did it come from?
• Tim Berners-Lee's (TimBL) CERN proposal, 1989:
• Both Web (1.0) and Semantic Web (3.0)
• "Vague but Exciting"
• Not just documents, but also real-world entities
• Why was it successful?
• Not the first nor the "best" hypertext proposal
• But simple, workable, most importantly open

LOD Cloud
WebDataCommons, Dec 2018: 30B triples

What does LOD know about TimBL?
• TimBL at Wikidata Reasonator
• Names in 50 languages
• Description is auto-generated
• Parents confirmed 3 times (with different details not shown)

What does LOD know about TimBL?
• Depth of information on TimBL
• Links to ~200 authority files
• Info about ~20 awards
• Life Timeline
• etc, etc

Everybody is Building a KG!
KG Conference, 7-8 May, Columbia, NY
• Digital Commerce
• Airbnb - Knowledge Graph at Airbnb
• Amazon - Deep Learning for Knowledge Extraction and Integration to build the Amazon Product
Graph
• Uber - Building an Enterprise Knowledge Graph at Uber: Lessons from Reality
• Pitney Bowes - Intelligent Customer Service Using Knowledge Graphs
• Financial Services
• Causality Link - A Perspective on the Reasoning Power of Knowledge Graphs
• Capital One - Knowledge Graph Pilot Provides Value
• Goldman Sachs - Pythia: the Goldman Sachs Social Graph
• TigerGraph - Analyzing Time-varying Transitive Risk in Swap Networks using Graphs
• Refinitiv Financial - Practical Use Cases and Challenges to Implement Graphs in Financial
Services: Combating Financial Crime
• Wells Fargo - Knowledge Graphs and AI: The Future of Financial Data
• Forensics
• OCCRP - Using Graphs and Data Integration to Track Organised Crime
• Enigma.io - Impact and Insights from Public Data: Fighting Money Laundering by Linking and
Resolving Entities
• Refinitiv Financial - Practical Use Cases and Challenges to Implement Graphs in Financial
Services: Combating Financial Crime
• Health Care, Government, Supply Chain, Libraries
• AstraZeneca - Fair Data Knowledge Graphs (From Theory to Practice)
• Montefiore Hospital - The Chasm of a Million Analytics, and How to
Bridge it?
• United Nations - A Graph as a Means to Store Unpredictable Knowledge
– A Practical Implementation
• JSTOR Labs - Why Wikibase? Why not?
• Eccenca - Knowledge Graph for Digital Transformation in the Supply-Chain
• German National Library of Science and Technology - Creating a
knowledge graph based Enterprise Innovation Architecture
• How To...
• Diffbot - Knowledge Graphs for AI
• Accenture Labs - Using a Domain Knowledge Graph to Manage AI at
Scale
• Capsenta - Designing and Building Enterprise Knowledge Graphs from
Relational Databases in the Real World
• Google AI - Wikidata, Knowledge Graphs, and Beyond
• IBM Research - Extending Knowledge Graphs using Distantly Supervised
Deep Nets
• Microsoft - Building a Large-scale, Accurate and Fresh Knowledge
Graph
• Neo4J - A Real-World Guide to Building Your Knowledge Graphs
• Collibra - Collibra's Context Graph
• Ontotext - How Analytics on Big Knowledge Graphs Help Data Linking:
Company Importance and Similarity Demo

KG & ML Literature & Seminars
• Knowledge Graphs: New Directions for Knowledge Representation on the Semantic Web.
• Dagstuhl Seminar 18371, Mar 2019
• Grand Challenges: structure of knowledge & data and scale
• Creation of Knowledge Graphs
• Knowledge Integration at Scale
• Knowledge Dynamics and Evolution
• Evaluation of Knowledge Graphs
• Combining Graph Queries with Graph Analysis
• (Re)Defining Knowledge Graphs
• NLP and Knowledge Graphs
• ML and Knowledge Graphs
• Human and Social Factors of Knowledge Graphs
• Applications of Knowledge Graphs
• Knowledge Graphs and the Web
• Deep Learning for the Masses (… and The Semantic Layer), Favio Vázquez, Nov 20, 2018
• Acknowledgement: my title is stolen from this blog post
• 4th Workshop on Semantic Deep Learning (SemDeep-4) at ISWC 2018
• Big Data Semantics. Journal on Data Semantics, Apr 2018. DOI: 10.1007/s13740-018-0086-2
• Forbes: Why Machine Learning Needs Semantics Not Just Statistics (Jan 2019)
• Wired: Amazon Alexa and the Search for the One Perfect Answer (Feb 2019)

Thomson Reuters permid Company Graph

Wikidata Scholia: comparing authors

Ontotext Essential Facts
• World-leading
• Semantic technology vendor established in 2000
• 65 staff: 7 PhD, 30 MS, 20 BS, 6 university
lecturers
• Over 400 person-years invested in R&D
• Part of Sirma Group: 400 persons, public
company (BSE:SKK)
• Profitable and growing
• 80% of revenue from commercial projects
• Innovator
• Attracted $15M in innovation funding
• Trendsetter
• Member of: W3C, EDMC (FIBO), ODI, LDBC,
STI, DBPedia Association
• Ontotext Innovation Awards
• Innovative Enterprise of the Year 2017
• EU Innovation Radar Prize 2016 nomination
• BAIT Business Innovation Award 2014
• Innovative Enterprise of the Year 2014
• Washington Post “Destination Innovation”
Competition 2014 Award
• Pythagoras Award 2010
• Most successful BG company in EU FP
projects

Some of Our Clients (selection)

Ontotext Approach and Applications

Ontotext GraphDB, a Leading Graph Database
• Source: db-engines.com ranking of graph databases
• GraphDB Workbench: user-friendly DB admin and querying
• REST API for database access
• Plugins / Connectors

OntoRefine: Uplift Tabular Data to LOD
• Easily clean and import tabular data
• View as RDF in real-time with virtual SPARQL endpoint
• Transform using JS & SPIN
• Import newly created RDF directly to GraphDB

Knowledge Graph Platform Use Cases
• Content enrichment
• Who: STM publishers transforming their business model from publishing to information
• Challenges: Control & generate meta-data
• Reference projects: Elsevier, Wiley, IET, BBC, Euromoney
• Semantic search of enterprise documents
• Who: Enterprises with transactional document flows lacking analytic capabilities
• Challenges: Integrate with existing CMS/DMS + security + analytics
• Reference projects: Platts, AstraZeneca, Top-5 US bank, Top-5 German bank
• Knowledge graph development and continuous updates
• Who: Innovative businesses based on knowledge intensive processes
• Challenges: Collect, integrate, and maintain complex knowledge graph, semantic
search + analytics
• Reference projects: Top Asian business information agency, Big-4 Consultant

Example KG: FactForge
• DBpedia (the English version): 496M
• Geonames (all geographic features on Earth): 150M
• owl:sameAs links between DBpedia and Geonames: 471K
• GLEI (global company register data): 3M
• Panama Papers DB (#LinkedLeaks): 20M
• Other datasets and ontologies: WordNet, WorldFacts, FIBO
• News metadata (2000 articles/day enriched by NOW): 1 023M
• Total size (2.2B explicit + 328M inferred statements): 2 522M
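
The owl:sameAs links listed above state that two identifiers denote the same entity, and the relation is symmetric and transitive. One way to picture what that means for integration is to group identifiers into equivalence classes. The sketch below does this with a small union-find; the identifiers are placeholders made up for the example, and nothing here is claimed to be how FactForge or GraphDB handles owl:sameAs internally.

```python
def same_as_clusters(links):
    """Group identifiers connected by owl:sameAs links into clusters."""
    parent = {}

    def find(x):
        # Return the representative of x's cluster, registering x if new
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a         # merge the two clusters

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    return list(clusters.values())


# Placeholder identifiers, for illustration only
links = [
    ("dbpedia:Sofia", "geonames:placeholder-1"),
    ("geonames:placeholder-1", "wikidata:placeholder-2"),
    ("dbpedia:London", "geonames:placeholder-3"),
]
clusters = same_as_clusters(links)
```

With these three links the result is two clusters: one with three identifiers for the first city and one with two for the second.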

Class Exploration
• About 1400 classes
• To cope with this, one needs specific tools
• GraphDB Workbench’s Class Hierarchy exploration tool

Visual Graph: Node details

Reference Case for GraphDB and Ontotext Platform
Big Knowledge Graph
• 1B statements of master data
• 100M entities and concepts
• Entity linking across 5 data sources
• 1M documents, 100 KG tags/doc.
Performance
• 10 transactional updates/sec on master data
• 500 updates/sec for documents and metadata
• 100 graph queries/sec/node, incl. inferred facts
• RDFS+ reasoning: instant and transparent
• 1000 full-text searches/sec across docs and data
Text & Graph Analytics
• Extract new entities and facts from text
• Retrieval of similar documents and entities
• Automatic classification and link prediction
• Relevance and importance ranking
Operations & Data Quality
• Multi-DC deployment across continents
• Worker nodes: 16 vCPU, 32GB RAM
• Daily updates from external data sources
• Maintain quality of linking and text analysis
• Metadata and instance data curation

Entity Awareness
• What does it mean to be "aware" of something?
• To have background info that allows some measure of
"intelligence"?
• We believe the numbers on the previous slide are a minimum that
can help a machine achieve "awareness"
• Let's try some games:
• Airports near London (within 50 miles)
• Airports near New York City
• Educational institutions near New York
• Educational institutions near Kaspichan
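
Queries such as "airports near London (within 50 miles)" come down to a distance filter over entity coordinates. The sketch below shows the idea in plain Python with the haversine great-circle formula. The coordinates are approximate and the `near` helper is an assumption for illustration; a production system would more likely push this filter to a geospatial index in the database.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))


# Approximate coordinates (latitude, longitude), for illustration only
london = (51.5074, -0.1278)
airports = {
    "Heathrow": (51.4700, -0.4543),
    "Gatwick": (51.1537, -0.1821),
    "Manchester": (53.3650, -2.2727),
}


def near(center, places, max_miles):
    """Names of places within max_miles of center, sorted alphabetically."""
    return sorted(name for name, (lat, lon) in places.items()
                  if haversine_miles(*center, lat, lon) <= max_miles)


nearby = near(london, airports, 50)
```

The interesting part of the "game" is not the arithmetic but the background knowledge: the graph has to know which entities are airports and where they are before a filter like this can run.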

Demo: Ontotext Rank (and Similarity)

Ontotext R&D Projects
• More EU research projects than some BG
universities combined
• Vertical domains
• Cultural heritage (Europeana Creative, Food and
Drink, EHRI2)
• Companies (euBusinessGraph, CIMA), real estate
data (PDM) (ProDataMarket)
• Media/Publishing (TrendMiner, Multisensor, Evala)
• Fact & rumour checking (Pheme, WeVerify)
• Life Science (LarKC, KHRESMOI, KConnect,
ExaMode)
• Agriculture (BigDataGrapes)
• Science/innovation (TRR, InnoRate)

Project CIMA: Company Graph
• R&D
• Data virtualization (OBDA)
• Entity Linking
• Alignment Learning
• KG Embedding and Similarity
• Company Classification
• Company Graph
• Dataset discovery and analysis, procure
datasets
• Semantic structure mapping, taxonomy
mapping
• Semantic integration pipeline, data updates
• Cognitive Entity Matching
• Data curation
• ML algorithms and training
• Integration to Ontotext Platform, Demos
• Big Data connectors (e.g. Mongo, Cassandra)
• Cloud Services
• Demo applications

Project TRR: Science KG for FP7 Projects
• Info (Wikidata): Client: EC DG RTD (ministry of
science). Budget: 4M EUR, Duration: 4y. Partners:
PPMI (LT), Ontotext (BG), Fraunhofer (DE),
Intrasoft (LU)
• Get 8000 core FP7 projects (SP1 Collaboration)
• Build KG of science (projects, participants,
researchers, contacts, subjects, etc)
• Assess outputs (publications, datasets, patents…)
• Assess outcomes (startups, collaborations,
researcher mobility…)
• Assess impact (on research policy, economic,
societal, on health…)

Machine Learning at Ontotext
We're not a ML company but use ML for some of our tasks

ML at Ontotext
• Alignment Learning for Entity Matching
• Disambiguation for Named Entity Extraction
• Relation Learning for Relation Extraction
• Word+KG Embeddings for semantic similarity (VSA, predications)
• Ranking for auto-completion, entity popularity
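
For the embedding-based similarity item above, a common building block is cosine similarity between entity vectors. The sketch below is a generic illustration with made-up three-dimensional vectors and hypothetical entity names. It is not the Ontotext or GraphDB implementation, and real word or KG embeddings have hundreds of dimensions.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy embeddings with hypothetical entity names, for illustration only
embeddings = {
    "company_a": [0.9, 0.1, 0.0],
    "company_b": [0.8, 0.2, 0.1],
    "museum_c": [0.0, 0.2, 0.9],
}


def most_similar(name, vectors, k=1):
    """Return the k entities whose vectors are closest to the named one."""
    query = vectors[name]
    scored = [(cosine(query, vec), other)
              for other, vec in vectors.items() if other != name]
    return [other for _, other in sorted(scored, reverse=True)[:k]]


closest = most_similar("company_a", embeddings)
```

Because the two company vectors point in nearly the same direction, `company_b` ranks as the closest neighbour of `company_a`, while the museum vector is almost orthogonal to both.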

GraphDB Semantic Similarity (Mar 2019)
• create hybrid similarity searches
• use pre-built text-based similarity vectors
• predication-based similarity index
• run similarity indexes in more than one iteration
• add term weights when searching text-based similarity indexes
• use analogical search for predication indexing

New Developments in Bulgaria
Collaborations Between Academia and Industry

NBU MS Data Science
• Starts Sep 2019
• Covers ML, mathematics, R, Python, distributed (Spark), cloud…
• Ontotext course: Semantic Web Proof of Concept
• IICT BAS course: Semantic Text Analysis

GATE CoE: SU FMI + Chalmers Teaming
• Host
• Teaming
• Industry Supporters

Thank you!
Contacts:
• Ontotext: Website, LinkedIn, Twitter, Rate GraphDB
• Vladimir Alexiev: Email, Publications, Homepage, Resume, LinkedIn, Twitter, GitHub
Next event:
Repeatability and reproducibility of ML research
