1
Ajay Shriwastava
Sachin Ghai
Impetus Technologies Inc.
Logical Data Warehouse:
Building a virtual data services layer
Hadoop Summit – San Jose – 11 June 2015
2
AGENDA
Emergence of Logical Data Warehouse
Virtualization
Offload
3
EVOLUTION OF THE DATA WAREHOUSE
4
EVOLUTION OF THE DATA WAREHOUSE
If this slide looks inverted to you, it actually is.
With the emergence of Big Data, the concepts of the data warehouse as we knew it have been inverted.
5
EVOLUTION OF THE DATA WAREHOUSE
Enterprise Data Warehouse (EDW)
• Predetermined input schema
• Extensive data governance
• ANSI SQL compliance
• IT team ownership
• Concurrent users
• Pre-canned BI
Big Data Warehouse (BDW)
• Low-cost storage/archive
• Non-SQL access
• Exploratory analysis
• Machine learning/graph
• Data discovery
• Self-service BI/analytics
6
CO-EXISTENCE OF EDW AND BDW – A REALITY
• Organizations are still in the initial phase of their Big Data journey.
• Existing ETL jobs feed the EDW.
• BI and downstream applications use ANSI SQL to query data in the EDW.
• Business use cases for big data are emerging.
7
EMERGENCE OF THE LOGICAL DATA WAREHOUSE
In response to emerging forces such as Big Data, the evolution of data warehousing practices led to the emergence of the “Logical Data Warehouse”.
Key components include:
• Repository management
• Data virtualization
• Distributed process
• Auditing statistics and performance evaluation services
• SLA management
• Taxonomy/ontology resolution
• Metadata management
First proposed in May 2009 and published in Gartner research in August 2011.
8
NEW PARADIGMS
DATA LAKE
Repositories are no longer limited to the Enterprise Data Warehouse or data marts: HDFS has emerged as the “data lake”, along with NoSQL data stores.
DISTRIBUTED PARALLEL PROCESS
Distributed processing has become synonymous with MapReduce on files or databases. With Spark, more distributed operators are becoming commonplace.
VIRTUALIZATION
Virtualization is gaining favor as an access mechanism where transient consolidation is required for a use case.
OFFLOAD
Offload to newer repositories and process engines now requires a more exact science and process.
9
VIRTUALIZATION – AND ITS MANY DELINEATIONS
Data Virtualization
• Simplified, unified, and integrated view
Data Federation
• Is a subset of data virtualization
• Enhanced with query optimization strategies for specific sources
Data Blending
• Involves actual data movement and a ‘write’ to a repository rather than just a ‘read’ for a transient use case
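The difference between a federated read and a blended write can be sketched in a few lines. This is not taken from the talk: the two in-memory SQLite databases merely stand in for separate source systems, and the table names, sample rows, and the `federated_view` and `blend_into` helpers are all hypothetical.

```python
import sqlite3


def make_source(ddl, insert_sql, rows):
    """Create an in-memory database standing in for one source system."""
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    conn.executemany(insert_sql, rows)
    return conn


# Hypothetical sources: a client system and an account system.
clients = make_source(
    "CREATE TABLE client (client_id INTEGER, name TEXT)",
    "INSERT INTO client VALUES (?, ?)",
    [(1, "Ann"), (2, "Bob")],
)
accounts = make_source(
    "CREATE TABLE account (client_id INTEGER, balance REAL)",
    "INSERT INTO account VALUES (?, ?)",
    [(1, 100.0), (1, 50.0), (2, 75.0)],
)


def federated_view():
    """Federation: read both sources at query time and join in memory.

    The aggregation is pushed down to the account source, and nothing
    is copied or persisted.
    """
    totals = dict(
        accounts.execute(
            "SELECT client_id, SUM(balance) FROM account GROUP BY client_id"
        )
    )
    for client_id, name in clients.execute(
        "SELECT client_id, name FROM client ORDER BY client_id"
    ):
        yield (client_id, name, totals.get(client_id, 0.0))


def blend_into(target):
    """Blending: the same join, but the result is written to a repository."""
    target.execute(
        "CREATE TABLE client_balance (client_id INTEGER, name TEXT, total REAL)"
    )
    target.executemany(
        "INSERT INTO client_balance VALUES (?, ?, ?)", federated_view()
    )
    target.commit()


view_rows = list(federated_view())

repository = sqlite3.connect(":memory:")
blend_into(repository)
stored_rows = repository.execute(
    "SELECT client_id, name, total FROM client_balance ORDER BY client_id"
).fetchall()
```

The federated view is recomputed on every call and always reflects the sources, while the blended table is a copy that has to be refreshed when the sources change.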
10
BRIDGING THE CHANNEL WITH VIRTUALIZATION
SOURCE SYSTEMS
• Relational: Oracle, DB2…
• NoSQL: Cassandra, MongoDB…
• File Systems: HDFS, GPFS…
• MPP: Teradata, Netezza…
• Hadoop-based Warehouse: Hive, Tajo…
VIRTUALIZATION
TARGET SYSTEMS
• Users: Enterprise, Web, Mobile
• Applications: Enterprise, ESB…
• BI: Reporting, Visualization…
• Data Science: Machine Learning, Graph…
• Data Management: MDM, Discovery…
11
WEALTH MANAGEMENT – USE CASE
Big data can transform client- and account-centric wealth management into personalized, goal-based wealth management.
Unfortunately, the information is spread across many different lines of business that use separate data sources and platforms.
12
GOAL-BASED WEALTH MANAGEMENT
• From an account- or client-centric view to a household and relationship view.
• A comprehensive approach to understanding the long-term financial goals of the client.
• Facilitate financial security during life-changing events – marriage, college, job changes, retirement, intergenerational wealth transfer.
13
BUILDING RELATIONSHIPS
Household View, Business Network View, Hierarchy View.
• Collect the data and process it on a common platform (different LOBs, IVR logs/web logs). Discover relationships.
• Unified view over existing data in client and account systems. Integrate with social data.
• Implement governance. Implement a corporate hierarchy over client data.
[Diagram: relationship graph around a root node, with example people (Jerry Mayfield, Paul Robinson, Andrew Madura, Linda Mays, Jack Kline) and relationship labels (Friends, Golf Buddy, College Alumni, Neighbors, Father, Son, Daughter-1, Daughter-2, Self, Spouse)]
14
PERSONALIZED SERVICES CROSS/UP SELL
Household View, Business Network View, Hierarchy View.
• 401 K / IRA
• 529
• College loans
• Student Credit Cards.
• Gift Cards
• Estate Planning
• Alimonies
• Company loans
• Asset standing
• Business prospects
• Corporate Accounts.
• Corporate discounts.
• Group based services.
• Prospects identification.
• Goal based services.
[Charts: panels labeled Assets, Distribution, and Liabilities; series Goals and Actuals]
15
TRADITIONAL DATA INTEGRATION
• Data is embedded in silos
• Time-consuming and resource-intensive ETL processes
• Creates data duplication
• Governance is an inhibitor, not an enabler
• Inability to handle the V’s of big data
16
LOGICAL DATA WAREHOUSE REFERENCE ARCHITECTURE
[Architecture diagram; the grouping below is inferred from the flattened labels]
• Sources: Producer A, Producer B, OLTP/RDBMS, EDW (Teradata/Netezza etc.)
• Stream ingestion: Distributed Messaging Layer (Kafka), Spark Streaming, Storm, StreamAnalytix
• Batch ingestion: Batch Data Ingestion (Sqoop/Talend etc.), EDW Migration (SQL Offload Solution)
• Hadoop Cluster / YARN: HDFS/Hive/Pig, in-memory data layer (Spark), Centralized Schema
• Analytics/Virtualization: Data Virtualization Layer, Kundera, Kyvos Engine, MLlib, Mahout, R algorithms, Data Quality
• Data access: REST API; Query API (SQL) with a custom layer for universal connectivity; Search API (ES, Solr, NLP); JDBC; proprietary connectors / ODBC drivers
• BI tools: MicroStrategy, Tableau, Kyvos…
• Governance (security, audit, lineage): Big Data Governance
• Monitoring and cluster management: Ankush, Jumbune
Legend: Impetus Offerings (details: Appendix 1), Recommended Platform, Platform Requirements
17
UNIFIED VIEW - ADVANTAGES
• Fast, real-time data integration without creating expensive copies of data.
• Significant savings in the time and resources required for ETL.
• Facility to create a final composite schema.
• Information management capability.
• Ability to meet stringent service level agreements.
18
OFFLOAD
Offload cold data and exploratory analysis workloads to a Hadoop cluster running on commodity hardware
– Save cost and resources
19
OFFLOAD CONCEPT
An Enterprise Data Warehouse (EDW) to Big Data Warehouse (BDW) offload essentially involves migrating tables and code.
[Flow diagram from EDW to BDW]
• EDW inputs: tables; SQL scripts; static SQL procedures / PL-SQL; proprietary scripts (roadmap)
• Conversion: SQL script parser
• BDW outputs: tables; Hive queries; Java code with Hive queries
• Execution: run Hive queries on Hadoop
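The table half of such a migration can be illustrated with a minimal sketch that turns source column definitions into a Hive CREATE TABLE statement. The type mapping and the `to_hive_type` and `hive_create_table` helpers are assumptions made for illustration, not the mapping used by the offload solution, and the `w_gmt_offset` column is a made-up example.

```python
import re

# Illustrative Teradata-to-Hive type mapping (an assumption, not the
# mapping used by the offload solution described in the talk).
TYPE_MAP = {
    "BYTEINT": "TINYINT",
    "SMALLINT": "SMALLINT",
    "INTEGER": "INT",
    "BIGINT": "BIGINT",
    "FLOAT": "DOUBLE",
    "DATE": "DATE",
    "TIMESTAMP": "TIMESTAMP",
}


def to_hive_type(source_type):
    """Map one source column type to a Hive type."""
    normalized = source_type.strip().upper()
    # Character types carry a length that a Hive STRING does not need.
    if re.fullmatch(r"(VAR)?CHAR\(\d+\)", normalized):
        return "STRING"
    # DECIMAL(p,s) keeps its precision and scale.
    if re.fullmatch(r"DECIMAL\(\d+\s*,\s*\d+\)", normalized):
        return normalized.replace(" ", "")
    try:
        return TYPE_MAP[normalized]
    except KeyError:
        # Fail loudly instead of guessing a type for unmapped columns.
        raise ValueError(f"no mapping for type: {source_type}") from None


def hive_create_table(table, columns):
    """Build a Hive CREATE TABLE statement from (name, source_type) pairs."""
    body = ",\n".join(f"  {name} {to_hive_type(t)}" for name, t in columns)
    return f"CREATE TABLE {table} (\n{body}\n)"


ddl = hive_create_table(
    "warehouse",
    [
        ("w_warehouse_sk", "INTEGER"),
        ("w_warehouse_name", "VARCHAR(20)"),
        ("w_gmt_offset", "DECIMAL(5, 2)"),
    ],
)
```

Raising an error for an unmapped type keeps the gaps visible, which matters when the schema migration has to be validated as complete.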
20
KEY CHALLENGES IN OFFLOAD
• Varied input sources
• Validating complete schema and data offload
• ANSI SQL incompatibility
• User-defined functions unavailable in the target system
• Lack of a unified view and UI
• Missing data quality checks
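One way to approach the "validating complete data offload" challenge is to compare a row count and an order-independent checksum between source and target. The sketch below is an assumption about how such a check could look, not a description of the solution from the talk; `table_fingerprint` and `offload_matches` are hypothetical names.

```python
import hashlib


def table_fingerprint(rows):
    """Return (row_count, checksum) for an iterable of row tuples.

    Per-row digests are summed modulo 2**64, so the checksum does not
    depend on row order, which usually differs between an EDW and a
    Hadoop target.
    """
    count = 0
    checksum = 0
    for row in rows:
        # repr() is a simplistic canonical form; a real check would
        # first normalize types such as decimals and timestamps.
        digest = hashlib.sha256(repr(tuple(row)).encode("utf-8")).digest()
        checksum = (checksum + int.from_bytes(digest[:8], "big")) % 2**64
        count += 1
    return count, checksum


def offload_matches(source_rows, target_rows):
    """True when both sides have the same row count and checksum."""
    return table_fingerprint(source_rows) == table_fingerprint(target_rows)


source = [(1, "a"), (2, "b"), (3, "c")]
same_data_other_order = [(3, "c"), (1, "a"), (2, "b")]
missing_row = [(1, "a"), (2, "b")]

order_ok = offload_matches(source, same_data_other_order)
missing_detected = not offload_matches(source, missing_row)
```

A matching fingerprint is strong evidence rather than proof of equality, so it is best used alongside schema comparison and column-level data quality checks.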
21
HOW WE BUILT THE OFFLOAD SOLUTION
Step 1: Identification
Step 2: Schema and Data Migration
Step 3: Logic Migration
Step 4: Data Quality Enhancement
Step 5: Transformed Code Execution
Reduced Time!
Reduced Risk!
Automation!
22
SAMPLE QUERY – AUTO-TRANSFORMED
Teradata Query:
INSERT INTO month_wise_ship_agg
select d_month_seq
,substr(w_warehouse_name,1,20) , TRIM(TRAILING '_' FROM sm_type)
,cc_name ,sum(case when (cs_ship_date_sk - cs_sold_date_sk <= 30 ) then 1 else 0 end) as "30 days"
,sum(case when (cs_ship_date_sk - cs_sold_date_sk > 30) and (cs_ship_date_sk - cs_sold_date_sk <= 60)
then 1 else 0 end ) as "31-60 days"
from catalog_sales ,warehouse ,ship_mode ,call_center C,date_dim
where cs_ship_date_sk = d_date_sk
and EXTRACT (YEAR FROM d_date) IN (2000,2001)
and DAYNUMBER_OF_MONTH(d_date) > 1
and TD_QUARTER_BEGIN(d_date) <= CURRENT_DATE
and cs_warehouse_sk = w_warehouse_sk
and cs_ship_mode_sk = sm_ship_mode_sk
and cs_call_center_sk = cc_call_center_sk
and (C.cust_id , C.address) LIKE ANY ( SELECT C1.cust_id, C1.address FROM Customer C1 )
group by substr(w_warehouse_name,1,20) ,sm_type ,cc_name ,d_month_seq ;
Hive Query:
INSERT INTO TABLE month_wise_ship_agg
SELECT d_month_seq,SUBSTR( w_warehouse_name , 1 , 20 ) AS auto_c01,
sm_type, UDF_TRIM('TRAILING ','_' , sm_type) ,
cc_name,SUM( CASE WHEN ( cs_ship_date_sk - cs_sold_date_sk <= 30) THEN 1 ELSE 0 END ) AS 30_days,
SUM( CASE WHEN ( cs_ship_date_sk - cs_sold_date_sk > 30) AND ( cs_ship_date_sk - cs_sold_date_sk <=
60) THEN 1 ELSE 0 END ) AS 31_60_days
FROM catalog_sales, warehouse, ship_mode, call_center C, date_dim
WHERE cs_ship_date_sk = d_date_sk
AND EXTRACT ('YEAR', d_date) IN (2000,2001)
AND DAYNUMBER_OF_MONTH(d_date) > 1
and TD_QUARTER_BEGIN(d_date) <= CURRENT_DATE()
AND cs_warehouse_sk = w_warehouse_sk
AND cs_ship_mode_sk = sm_ship_mode_sk
AND cs_call_center_sk = cc_call_center_sk
AND EXISTS ( SELECT * FROM Customer C1 where C.cust_id LIKE C1.cust_id AND C.address LIKE C1.address )
GROUP BY SUBSTR( w_warehouse_name , 1 , 20 ) ,sm_type,cc_name,d_month_seq;
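Several of the differences between the two sample queries above can be expressed as simple rewrite rules. The sketch below uses regular expressions rather than the SQL script parser the talk describes, so it is a simplified stand-in: the rule set, the `to_hive` function, and the shortened `sample` statement are illustrative assumptions, while the `UDF_TRIM` name and the `EXTRACT('YEAR', ...)` form are copied from the slide. Rewrites that need real parsing, such as turning `LIKE ANY` into `EXISTS`, are out of scope.

```python
import re


def _alias(match):
    """Turn a quoted alias such as "31-60 days" into 31_60_days."""
    return "AS " + re.sub(r"[^0-9A-Za-z]+", "_", match.group(1)).strip("_")


# (pattern, replacement) pairs applied in order. Each mirrors one
# difference visible between the Teradata and Hive sample queries.
RULES = [
    # INSERT INTO t -> INSERT INTO TABLE t
    (re.compile(r"\bINSERT\s+INTO\s+(?!TABLE\b)(?=\S)", re.I),
     "INSERT INTO TABLE "),
    # TRIM(TRAILING 'x' FROM col) -> UDF_TRIM('TRAILING', 'x', col)
    (re.compile(
        r"\bTRIM\s*\(\s*(LEADING|TRAILING|BOTH)\s+('[^']*')\s+FROM\s+([^()]+?)\s*\)",
        re.I),
     lambda m: f"UDF_TRIM('{m.group(1).upper()}', {m.group(2)}, {m.group(3)})"),
    # EXTRACT(YEAR FROM col) -> EXTRACT('YEAR', col)
    (re.compile(r"\bEXTRACT\s*\(\s*(\w+)\s+FROM\s+([^()]+?)\s*\)", re.I),
     lambda m: f"EXTRACT('{m.group(1).upper()}', {m.group(2)})"),
    # CURRENT_DATE -> CURRENT_DATE()
    (re.compile(r"\bCURRENT_DATE\b(?!\s*\()", re.I), "CURRENT_DATE()"),
    # as "31-60 days" -> AS 31_60_days
    (re.compile(r'\bAS\s+"([^"]+)"', re.I), _alias),
]


def to_hive(sql):
    """Apply every rewrite rule once, in order."""
    for pattern, replacement in RULES:
        sql = pattern.sub(replacement, sql)
    return sql


sample = (
    "INSERT INTO month_wise_ship_agg\n"
    "select TRIM(TRAILING '_' FROM sm_type)\n"
    ',sum(1) as "31-60 days"\n'
    "from date_dim\n"
    "where EXTRACT (YEAR FROM d_date) IN (2000,2001)\n"
    "and TD_QUARTER_BEGIN(d_date) <= CURRENT_DATE"
)
converted = to_hive(sample)
```

Functions with no Hive equivalent, such as `TD_QUARTER_BEGIN` here, pass through unchanged and would have to be supplied as user-defined functions in the target system.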
23
OFFLOAD - KEY ADVANTAGES
• Optimize MPP and relational database resources for workloads
• Re-use millions of lines of code and $
• Avoid the learning curve and re-code, re-test cycles
• Seamless integration of downstream/upstream apps and reports
24
SUMMARY
To summarize, the following steps are recommended for creating a logical data warehouse:
• Establish your data strategy
• Identify key components: Hadoop, MPP, Spark, NoSQL etc.
• Segregate your workloads
• Offload to low-cost Hadoop where required
• Leverage virtualization for key use cases
• Establish data quality, SLA, semantics and metadata as key supporting pillars
25
SIGN-OFF QUOTE
“In the end, there can be only 2 types of data warehouses: logical data warehouse and illogical data warehouse…”
26
Thank you.
Questions??
ajay.shriwastava@impetus.co.in
sachin.ghai@impetus.co.in
27
APPENDIX 1
Product                 URL
StreamAnalytix          http://streamanalytix.com/
Kyvos                   http://www.kyvosinsights.com/
Kundera                 http://bigdata.impetus.com/open_source_kundera
SQL Offload Solution    http://www.impetus.com/sites/impetus.com/impetus/brochures/ETL_Offloading_Datasheet.pdf
Ankush                  http://bigdata.impetus.com/ankush
Jumbune                 http://www.jumbune.org/


Editor's Notes

  • #22: Add list of implemented functions