Rajesh Bordawekar
IBM T. J. Watson Research Center
bordaw@us.ibm.com
Cognitive Database: An Apache Spark-Based AI-Enabled Relational Database System
#AI5SAIS
Outline
• Word Embedding Overview
• Cognitive Database Design
• Cognitive Intelligence (CI) Queries
• Spark Implementation Details
• Case Study: Image and Text Database
• Summary
Word Embedding Overview
• Unsupervised, neural-network-based NLP approach that captures the
meanings of words based on their neighborhood context
– Meaning is captured as the collective contribution of the words in the
neighborhood
• Generates semantic representations of words as low-dimensional
vectors (200-300 dimensions)
• Semantic similarity is measured using a distance metric
(e.g., cosine distance) between vectors
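As a minimal sketch of the similarity measure mentioned above (not code from the talk), the following Python computes cosine similarity and cosine distance between two vectors. The function names and the zero-vector convention are assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """Distance form: 0 for vectors pointing the same way, 2 for opposite."""
    return 1.0 - cosine_similarity(u, v)
```

Vectors that point in the same direction get a distance near 0 regardless of their length, which is why the measure suits embeddings whose magnitudes vary with token frequency.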
Cognitive Database Key Ideas
• Uses a dual view of relational data: tables and
meaningful text, with all relational entities mapped to
text without loss of information
• Uses a word-embedding approach to extract latent
information from database tables
• The classical word embedding model is extended to capture
constraints of the relational model (e.g., primary keys)
• Enables relational databases to capture and exploit
semantic contextual similarities
Using Embedding Models
[Figure: embedding models in a cognitive database. Labels: Structured Data sources (Relational Tables); External Unstructured and Structured Data sources; Word Embedding Model; Pre-trained Model; Model built from data source being queried; Cognitive Intelligence Queries in Structured Query Systems; Structured Results (Relational Tables)]
Cognitive Database Features
• Enables SQL-based information retrieval based on
semantic context rather than data values
• Unlike analytics databases, does not view database tables
as feature and model repositories
• Latent features are exposed to users via standard SQL-based
Cognitive Intelligence (CI) queries
• Users can invoke standard SQL queries using typed
relational variables over a semantic model built over
untyped strings
Customer Analytics Workload

custID  Date   Merchant     State  Category       Items                     Amount
custA   9/16   Whole Foods  NJ     Fresh Produce  Bananas, Apples           200
custB   10/16  Target       NY     Stationery     Crayons, Pens, Notebooks  60
custC   10/16  Trader Joes  CT     Fresh Produce  Bananas, Oranges          80
custD   9/16   Walmart      NY     Stationery     Crayons, Folders          25

Text representation of a table row:
“custD 9/16 Walmart NY Stationery ‘Crayons, Folders’ 25”

• Words in the neighborhood contribute to the overall meaning of “custID”
• A meaning vector is generated for every token
• Words similar in meaning are closer in vector space
• For this relational view, custA is similar to custC, and custB is similar to custD
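The text representation of a row quoted on this slide can be sketched as follows. This is only an illustration of the idea, assuming that values are joined by spaces and that comma-separated lists are quoted as one unit, as in the custD example; the function name `row_to_sentence` is hypothetical, and the tokenization used by the actual system is not shown in the slides.

```python
def row_to_sentence(row, columns):
    """Render one table row as a space-separated 'sentence' of tokens."""
    parts = []
    for col in columns:
        value = str(row[col])
        # Assumption: quote list-valued cells so they stay together,
        # mirroring the quoted 'Crayons, Folders' in the slide's example.
        parts.append("'" + value + "'" if "," in value else value)
    return " ".join(parts)


columns = ["custID", "Date", "Merchant", "State", "Category", "Items", "Amount"]
row = {
    "custID": "custD",
    "Date": "9/16",
    "Merchant": "Walmart",
    "State": "NY",
    "Category": "Stationery",
    "Items": "Crayons, Folders",
    "Amount": 25,
}
sentence = row_to_sentence(row, columns)
print(sentence)
```

Each such sentence would become one line of the training text, so that tokens appearing in the same row end up in each other's neighborhood.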
Cognitive Intelligence Queries
• Semantic Similarity/Dissimilarities
• Semantic Clustering
• Cognitive OLAP queries
• Inductive Reasoning queries
• Semantic Relational Operations
Can work with externally trained models and over
multiple data types.
CI Query Example
val result_df = spark.sql(s"""
SELECT VENDOR_NAME,
proximityCust_NameUDF(VENDOR_NAME, '$v')
AS proximityValue FROM Index_view
HAVING proximityValue > 0.5
ORDER BY proximityValue DESC
""")
CI similarity Query: Find similar entities to a given entity
(VENDOR_NAME) based on transaction characteristic similarities
Cognitive UDF
• Operates on relational variables, which can be sets or sequences
• For each input variable, fetches vectors from the embedding model
• Computes semantic similarity between vectors using nearest-neighbor
approaches
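The steps a cognitive UDF performs can be sketched in plain Python. This is a hedged illustration, not the UDF from the talk: it assumes the embedding model is a dictionary from token to vector, that tokens missing from the model are skipped, and that the score is the average pairwise cosine similarity. The name `proximity_avg` and the toy vectors are made up for the example.

```python
import math


def _cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def proximity_avg(tokens_a, tokens_b, model):
    """Average pairwise cosine similarity between two token collections."""
    # Step 1: fetch a vector for every input token the model knows about.
    vecs_a = [model[t] for t in tokens_a if t in model]
    vecs_b = [model[t] for t in tokens_b if t in model]
    if not vecs_a or not vecs_b:
        return 0.0
    # Step 2: compare every vector on one side with every vector on the other.
    total = sum(_cosine(u, v) for u in vecs_a for v in vecs_b)
    return total / (len(vecs_a) * len(vecs_b))


# Toy two-dimensional model (hypothetical values, for illustration only).
toy_model = {
    "bananas": [1.0, 0.0],
    "apples": [1.0, 0.0],
    "crayons": [0.0, 1.0],
    "pens": [0.0, 1.0],
}

print(proximity_avg(["bananas", "apples"], ["bananas"], toy_model))
print(proximity_avg(["bananas", "crayons"], ["apples"], toy_model))
```

In a Spark deployment a function of this shape would be registered as a SQL UDF so that it can appear in the SELECT or WHERE clause of a CI query.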
Cognitive Database Applications
• Analysis over multi-modal data (Retail, Health,
Insurance)
• Entity similarity queries (Customer Analytics, IT
Ticket Management, Time-series)
• Cognitive OLAP (Finance, Insurance…)
• Entity Resolution (Master Data Management)
• Analysis of time-series data (IoT, Health)
Cognitive Database Stages
[Figure: pipeline with three stages (Cognitive ETL, Vector Storage, Query Execution) spanning the Relational, Text, and Vector domains. Labels: Relational Tables; External Text Sources; Tokenized Relations; Learned Vectors; Pre-computed External Learned Vectors; Relational System Tables; UDFs; CI Queries; Relations]
Training from source database
[Figure: Relational Tables, Data Cleaning, Training Text File, Word Embedding Training, Word Embedding Model]
Data cleaning by data type:
• text: create unique tokens (Python)
• numerical values: k-means clustering (Numpy/Scipy), create unique tokens (Python)
• images: get image tags (Watson VRS), create image features
Word embedding training:
• Hyperparameter Tuning*: window size, vector dimensions
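The slide lists k-means clustering as the preparation step for numerical values. A minimal sketch of that idea is shown below, assuming that each number is replaced by a token naming its cluster so that nearby values share a token. It uses a small pure-Python one-dimensional k-means rather than the Numpy/Scipy routines named on the slide, and the `Amount_<cluster>` token format is hypothetical.

```python
def kmeans_1d(values, k, iterations=20):
    """Lloyd's algorithm on scalars; returns (centers, labels)."""
    if k < 1 or k > len(values):
        raise ValueError("k must be between 1 and len(values)")
    ordered = sorted(values)
    # Deterministic initialisation: spread the centers over the sorted values.
    if k == 1:
        centers = [float(ordered[len(ordered) // 2])]
    else:
        centers = [float(ordered[i * (len(ordered) - 1) // (k - 1)]) for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value joins its nearest center.
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centers[c] = sum(members) / len(members)
    return centers, labels


# Amounts from the customer table shown earlier in the deck.
amounts = [200, 60, 80, 25]
centers, labels = kmeans_1d(amounts, k=2)
tokens = ["Amount_%d" % lab for lab in labels]
print(tokens)
```

Replacing raw numbers with cluster tokens keeps the vocabulary small, so that amounts such as 60 and 80 can contribute to the same embedding instead of each being a rare, unique word.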
Why Spark?
[Figure: Spark 2.2.0 with a DataFrames-based representation and Spark SQL-based Cognitive Intelligence queries]
• Flexibility over multiple input data formats: relational databases, CSV files, …, JSON
• Portability across multiple platforms/OS: IBM Z zOS/zLinux, IBM P Linux, AIX, x86
• Support for standardized SQL queries
• Usability across multiple user domains: database community (Spark SQL + UDFs in Scala/Python) and data science community (PySpark/Pandas APIs via Jupyter)
• Opportunities for acceleration: GPU acceleration
Cognitive Database: Spark Execution Flow
SELECT X.custID, X.custName,
proximityAvg(X.InvestType,Y.InvestType)
FROM cust X, cust Y
WHERE Y.custID='471' AND
proximityAvg(X.InvestType,Y.InvestType)
LIMIT 5
[Figure: execution flow. Labels: Source Data; Spark DF; Input Table; Specialized Word Embedding; Trained Model; SQL Query; Spark SQL; Spark SQL UDF; Similarity Computation; Nearest Neighbor; Output Table]
Invoking Cognitive Database in Jupyter
Case Study: Application Database with links to images

Picture ID  National Park  Country       Path of JPEG Image
PK_01       Corbett        India         ./Img_Folder/Img_01.JPEG
PK_05       Kruger         South Africa  ./Img_Folder/Img_05.JPEG
PK_09       Sunderbans     India         ./Img_Folder/Img_09.JPEG
PK_11       Serengeti      Tanzania      ./Img_Folder/Img_11.JPEG

Internal Training database with features extracted from linked images

Picture Id  Image Id     National Park  Country       Animal Name  Class    Dietary Habit  Color
PK_01       Img_01.JPEG  Corbett        India         Elephant     Mammal   Herbivores     Gray
PK_05       Img_05.JPEG  Kruger         South Africa  Rhinoceros   Mammal   Herbivores     Gray
PK_09       Img_09.JPEG  Sunderbans     India         Crocodile    Reptile  Carnivorous    Gray
PK_11       Img_11.JPEG  Serengeti      Tanzania      Lion         Mammal   Carnivorous    Yellow

The above merged data is used as input to train the word embedding model, which generates an embedding
for each unique token based on its neighborhood. Each row of the database is viewed as a sentence.
CI Semantic Clustering Query: Find all images whose similarity to the user-chosen images of [lion,
vulture, shark], computed using the attributeSimAvg UDF, is greater than 0.75
SELECT X.imagename, X.classA, X.classB, X.classC, X.classD
FROM ImageDataTable X
WHERE (X.imagename <> 'n01314663_7147.jpeg') AND (X.imagename <> 'n01323781_13094.jpeg') AND
(X.imagename <> 'n01314663_8531.jpeg') AND
(attributeSimAvgUDF('n01314663_7147.jpeg', 'n01323781_13094.jpeg', 'n01314663_8531.jpeg', X.imagename) > 0.75)
Output

X.Imagename      X.classB              X.classC                      X.classD
n01604330_12473  bird_of_prey, mammal  new_world_vulture, carnivore  andean_condor, condor, sloth_bear
n01316422_1684   mammal, bird_of_prey  carnivore, eagle              glutton_wolverine, piste_ski_run, downhill_skiing, ern, ski_slope
n01324431_7056   bird_of_prey, mammal  new_world_vulture, carnivore  andean_condor, tayra

[Images: n01604330_12473, n01316422_1684, n01324431_7056]
CI Analogy Query: Find all images whose classD satisfies the analogy query [reptile :
monitor_lizard :: aquatic_vertebrate : ?] using the analogyQuery UDF, with a similarity score
greater than 0.5.
SELECT X.imagename, X.classA, X.classB, X.classC, X.classD
FROM ImageDataTable X
WHERE (analogyQuery('reptile','monitor_lizard','aquatic_vertebrate',X.classD,1) > 0.5)

X.Imagename     X.classB            X.classC           X.classD
n02512053_1493  aquatic_vertebrate  spiny_finned_fish  permit, archerfish
n02512053_3292  aquatic_vertebrate  spiny_finned_fish  archerfish, mojarra
n02512053_602   aquatic_vertebrate  spiny_finned_fish  lookdown, permit
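The slides do not show how the analogyQuery UDF scores a candidate. One common way to answer an analogy a : b :: c : ? with word embeddings is the vector-offset method: form the target vector b - a + c and rank candidates by cosine similarity to it. The sketch below assumes that method; the function name `analogy_score` and the three-dimensional toy vectors are invented for illustration and are not taken from the trained model.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def analogy_score(a, b, c, candidate, model):
    """Score how well `candidate` completes the analogy a : b :: c : ?"""
    # Vector-offset target: apply the a -> b relationship to c.
    target = [vb - va + vc for va, vb, vc in zip(model[a], model[b], model[c])]
    return cosine(target, model[candidate])


# Hypothetical toy embeddings, chosen only to make the arithmetic visible.
toy_model = {
    "reptile": [1.0, 0.0, 0.0],
    "monitor_lizard": [1.0, 1.0, 0.0],
    "aquatic_vertebrate": [0.0, 0.0, 1.0],
    "archerfish": [0.0, 1.0, 1.0],
    "sloth_bear": [1.0, 0.0, 0.0],
}

for candidate in ("archerfish", "sloth_bear"):
    score = analogy_score(
        "reptile", "monitor_lizard", "aquatic_vertebrate", candidate, toy_model
    )
    print(candidate, score > 0.5)
```

With these toy vectors only archerfish passes the 0.5 threshold, which matches the shape of the filter in the WHERE clause above.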
CI Query using external knowledge base: Find all images of animals whose classD similarity score
to the concept of "Hypercarnivore" from Wikipedia, computed using the proximityAvgForExtKB UDF, is greater than
0.5. Exclude images that are already tagged as carnivore, herbivore, omnivore, or scavenger.
SELECT X.imagename,X.classA,X.classB,X.classC, X.classD
FROM ImageDataTable X
WHERE
(proximityAvgAdvForExtKB('CONCEPT_Hypercarnivore',
X.classD) > 0.5)
ORDER BY SimScore DESC
Summary
• Novel relational database system that uses a word
embedding approach to enable semantic queries in SQL
• Spark-based implementation that loads data from a
variety of sources and invokes Cognitive Intelligence
queries using Spark SQL
• Demonstration of the cognitive database capabilities
using a multi-modal (text+image) dataset
• Illustration of seamlessly integrating AI capabilities into
the relational database ecosystem
References
• Bordawekar and Shmueli, Enabling Cognitive Intelligence Queries in
Relational Databases using Low-dimensional Word Embeddings,
arXiv:1603.07185, March 2016
• Bordawekar, Bandopadhyay, and Shmueli, Cognitive Database: A Step
Towards Endowing Relational Databases with Artificial Intelligence
Capabilities, arXiv:1712.07199, December 2017