EVENT SPEAKER
DANISH BI MEETUP, SEP’ 2016
FROM LOTS OF REPORTS (WITH SOME DATA ANALYSIS) 

TO MASSIVE DATA ANALYSIS (WITH SOME REPORTING)
MARK RITTMAN, ORACLE ACE DIRECTOR
info@rittmanmead.com www.rittmanmead.com @rittmanmead 2
•Mark Rittman, Co-Founder of Rittman Mead

‣Oracle ACE Director, specialising in Oracle BI&DW

‣14 Years Experience with Oracle Technology

‣Regular columnist for Oracle Magazine

•Author of two Oracle Press Oracle BI books

‣Oracle Business Intelligence Developer's Guide

‣Oracle Exalytics Revealed

‣Writer for Rittman Mead Blog :

http://www.rittmanmead.com/blog

•Email : mark.rittman@rittmanmead.com

•Twitter : @markrittman
About the Speaker
•Started back in 1996 on a bank Oracle DW project

•Our tools were Oracle 7.3.4, SQL*Plus, PL/SQL 

and shell scripts

•Went on to use Oracle Developer/2000 and Designer/2000

•Our initial users queried the DW using SQL*Plus

•And later on, we rolled-out Discoverer/2000 to everyone else

•And life was fun…
20 Years in Oracle BI and Data Warehousing
•Data warehouses provided a unified view of the business

‣Single place to store key data and metrics

‣Joined-up view of the business

‣Aggregates and conformed dimensions

‣ETL routines to load, cleanse and conform data

•BI tools for simple, guided access to information

‣Tabular data access using SQL-generating tools

‣Drill paths, hierarchies, facts, attributes

‣Fast access to pre-computed aggregates

‣Packaged BI for fast-start ERP analytics
Data Warehouses and Enterprise BI Tools
[Diagram: source systems (Oracle, MongoDB, Sybase, IBM DB/2, MS SQL Server) behind a core ERP platform covering Retail, Banking, Call Center, E-Commerce and CRM, loaded into a Data Warehouse with an ODS/Foundation Layer and an Access & Performance Layer, queried by Business Intelligence Tools]
•Examples were Crystal Reports, Oracle Reports, Cognos Impromptu, Business Objects

•Report written against carefully-curated BI dataset, or directly connecting to ERP/CRM

•Adding data from external sources, or other RDBMSs,

was difficult and involved IT resources

•Report-writing was a skilled job

•High ongoing cost for maintenance and changes

•Little scope for analysis, predictive modeling

•Often user frustration at the pace of delivery
Reporting Back Then…
•For example Oracle OBIEE, SAP Business Objects, IBM Cognos

•Full-featured, IT-orientated enterprise BI platforms

•Metadata layers, integrated security, web delivery

•Pre-built ERP metadata layers, dashboards + reports

•Federated queries across multiple sources

•Single version of the truth across the enterprise

•Mobile, web dashboards, alerts, published reports

•Integration with SOA and web services
Then Came Enterprise BI Tools
Traditional Three-Layer Relational Data Warehouses
[Diagram: traditional structured data sources loaded by ETL into Staging, then Foundation/ODS, then Performance/Dimensional layers; a BI tool (OBIEE) with its own metadata layer reads the warehouse directly, while OLAP/in-memory tools load data into their own databases]
Traditional Relational Data Warehouse
•Three-layer architecture - staging, foundation and access/performance
•All three layers stored in a relational database (Oracle)
•ETL used to move data from layer-to-layer
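The layer-to-layer movement above can be sketched as a toy pipeline (a minimal illustration in Python; the column names and values are hypothetical, not from a real project):

```python
# Staging: data lands as-is from the source system, untyped and untrimmed.
staging = [
    {"cust": " Anna ", "country": "dk", "amount": "120.50"},
    {"cust": "Ben",    "country": "SE", "amount": "80.00"},
]

# Foundation/ODS: cleanse and conform - standardise values, fix types.
foundation = [
    {
        "cust": row["cust"].strip().title(),
        "country": row["country"].upper(),
        "amount": float(row["amount"]),
    }
    for row in staging
]

# Access/performance layer: pre-computed aggregate for fast BI queries.
performance = {}
for row in foundation:
    performance[row["country"]] = performance.get(row["country"], 0.0) + row["amount"]
```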
And All Was Good…
(a big BI project)
Lots of reports 

(with some data analysis)
Meanwhile…
The world got digitised
and connected.
and users got impatient…
Reporting and Dashboards…
became self-service 

data discovery
Advanced analytics for everyone
Cloud and SaaS have won
BI has changed
The Gartner BI & Analytics Magic Quadrant 2016
Analytic Workflow Component    | Traditional BI Platform                                         | Modern BI Platform
Data source                    | Upfront dimensional modeling required (IT-built star schemas)   | Upfront modeling not required (flat files/flat tables)
Data ingestion and preparation | IT-produced                                                     | IT-enabled
Content authoring              | Primarily IT staff, but also some power users                   | Business users
Analysis                       | Predefined, ad hoc reporting, based on predefined model         | Free-form exploration
Insight delivery               | Distribution and notifications via scheduled reports or portal  | Sharing and collaboration, storytelling, open APIs
Gartner’s View of A “Modern BI Platform” in 2016
2007 - 2015
Died of ingratitude by business users

Just when we got the infrastructure right

Doesn’t anyone appreciate a single version of the truth?

Don’t say we didn’t warn you

No you can’t just export it to Excel

Watch out OLAP, you’re next
Analytic data platforms 

•Data now landed in Hadoop clusters, NoSQL databases and Cloud Storage

•Flexible data storage platform with cheap storage, flexible schema support + compute

•Data lands in the data lake or reservoir in raw form, then minimally processed

•Data then accessed directly by “data scientists”, or processed further into DW
Meet the New Data Warehouse : The “Data Reservoir”
[Diagram: the Data Reservoir - operational data (transactions, customer master data) and unstructured data (voice + chat transcripts) arrive through ETL-based, file-based and stream-based integration onto a Hadoop platform; raw customer data is stored in its original format (usually files, such as SS7, ASN.1, JSON etc.) alongside mapped customer data produced by mapping and transforming the raw data; a Data Factory moves data between the reservoir, Discovery & Development Labs (a safe and secure discovery and development environment holding data sets, samples, models and programs) and Business Intelligence Tools, with models, machine-learning output and segments feeding marketing/sales applications]
Hadoop is the new 

Data Warehouse
Hadoop : The Default Platform Today for Analytics
•Enterprise High-End RDBMSs such as Oracle can scale into the petabytes, using clustering

‣Sharded databases (e.g. Netezza) can scale further but with complexity / single workload trade-offs

•Hadoop was designed from the outset for massive horizontal scalability - using cheap hardware

•Anticipates hardware failure and makes multiple copies of data as protection

•The more nodes you add, the more stable it becomes

•And at a fraction of the cost of traditional

RDBMS platforms
•Data from new-world applications is not like historic data

•Typically comes in non-tabular form

•JSON, log files, key/value pairs

•Users often want it speculatively

•Haven’t thought it through

•Schema can evolve

•Or maybe there isn’t one

•But the end-users want it now

•Not when you’re ready
But Why Hadoop? Reason #1 - Flexible Storage
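The schema-on-read idea behind this flexibility can be sketched in plain Python: the raw JSON lands untouched, and a schema (with defaults for fields that only appeared in later records) is applied at read time. The event fields here are hypothetical:

```python
import json

# Hypothetical raw log lines landed as-is: the schema evolves over time,
# and older records simply lack the newer fields.
RAW_EVENTS = [
    '{"user": "alice", "action": "login"}',
    '{"user": "bob", "action": "click", "page": "/pricing"}',
    '{"user": "carol", "action": "click", "page": "/docs", "device": "mobile"}',
]

def parse_event(line):
    """Apply the schema at read time, defaulting fields older records lack."""
    record = json.loads(line)
    return {
        "user": record.get("user"),
        "action": record.get("action"),
        "page": record.get("page", "(none)"),
        "device": record.get("device", "desktop"),
    }

events = [parse_event(line) for line in RAW_EVENTS]
```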
[Diagram: Big Data Management Platform with Discovery & Development Labs (a safe and secure discovery and development environment holding data sets, samples, models and programs) building a Single Customer View - an enriched customer profile produced by correlating, modeling, machine learning, scoring and schema-on-read analysis]
But Why Hadoop? Reason #2 - Massive Scalability
•Enterprise High-End RDBMSs such as Oracle can scale

‣Clustering for single-instance DBs can scale to >PB

‣Exadata scales further by offloading queries to storage

‣Sharded databases (e.g. Netezza) can scale further

‣But cost (and complexity) become limiting factors

‣$1m/node is not uncommon
•Hadoop’s main design goal was to enable virtually-limitless horizontal scalability

•Rather than a small number of large, powerful servers, it spreads processing over

large numbers of small, cheap, redundant servers

•Processes the data where it’s stored, avoiding I/O bottlenecks

•The more nodes you add, the more stable it becomes!

•At an affordable cost - this is key

•$50k/node vs. $1m/node
•And … the Hadoop platform is a better fit for

new types of processing and analysis
[Diagram: big data platform - all running natively under Hadoop. YARN (cluster resource management) on HDFS (cluster filesystem holding raw data), supporting batch (MapReduce), interactive (Impala, Drill, Tez, Presto), streaming + in-memory (Spark, Storm) and graph + search (Solr, Giraph) workloads]
But Why Hadoop? Reason #3 - Processing Frameworks
•Hadoop started by being synonymous with MapReduce, and Java coding

•But YARN (Yet Another Resource Negotiator) broke this dependency

•Modern Hadoop platforms provide overall cluster resource management,

but support multiple processing frameworks

•General-purpose (e.g. MapReduce)

•Graph processing

•Machine Learning

•Real-Time Processing (Spark Streaming, Storm)

•Even the Hadoop resource management framework

can be swapped out

•Apache Mesos
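The general-purpose MapReduce model the other frameworks build on can be sketched in a few lines of Python (a conceptual illustration of map, shuffle and reduce, not the Hadoop API):

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the input split.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle: group intermediate pairs by key, as the framework does
    # between the map and reduce stages.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the values for each key.
    return {key: sum(values) for key, values in groups.items()}

splits = ["big data is big", "data is data"]
counts = reduce_phase(shuffle(chain.from_iterable(map_phase(s) for s in splits)))
```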
Combine With DW for Old-World/New-World Solution
•Most high-end RDBMS vendors provide connectors to load data in/out of Hadoop platforms

‣Bulk extract

‣External tables

‣Query federation

•Use high-end RDBMSs

as specialist engines

•a.k.a. "Data Marts"
But … Analytic RDBMSs Are The New Data Mart
[Diagram: the Hadoop-based big data management platform (YARN on HDFS with batch, interactive, streaming + in-memory and graph + search workloads, plus Discovery & Development Labs holding data sets, samples, models and programs) feeding a curated Data Warehouse - historical view with business-aligned access - which in turn serves Business Intelligence Tools]
BI Innovation is happening

around Hadoop
hold on though…
isn’t Hadoop Slow?
too slow

for ad-hoc querying?
welcome to 2016
(Hadoop 2.0)
Hadoop is now fast
Hadoop 2.0 Processing Frameworks + Tools
•Cloudera’s answer to Hive query response time issues

•MPP SQL query engine running on Hadoop, bypasses MapReduce for
direct data access

•Mostly in-memory, but spills to disk if required

•Uses Hive metastore to access Hive table metadata

•SQL dialect similar to Hive’s, though not as rich - no support for Hive
SerDes, storage handlers etc
Cloudera Impala - Fast, MPP-style Access to Hadoop Data
•Beginners usually store data in HDFS using text file formats (CSV) but these have limitations

•Apache AVRO often used for general-purpose processing

‣Splitability, schema evolution, in-built metadata, support for block compression

•Parquet now commonly used with Impala due to column-orientated storage

‣Mirrors work in RDBMS world around column-store

‣Only return (project) the columns you require across a wide table
Apache Parquet - Column-Orientated Storage for Analytics
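The column-projection benefit can be illustrated in plain Python: with column-orientated storage each column is held (and read) independently, so a query touching two columns of a wide table never reads the rest. This is a conceptual sketch, not Parquet's actual on-disk format:

```python
# Column-oriented layout: one list per column (a row store would instead
# keep one dict per record and read whole records for every query).
columns = {
    "id":      [1, 2, 3],
    "name":    ["anna", "ben", "cara"],
    "country": ["DK", "SE", "DK"],
    "spend":   [120.0, 80.0, 200.0],
}

def project(table, wanted):
    """Return only the requested columns - the others are never touched."""
    return {name: table[name] for name in wanted}

# An analytic query over two columns of the table: total spend for "DK".
dk_spend = sum(
    s for c, s in zip(columns["country"], columns["spend"]) if c == "DK"
)
```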
•But Parquet (and HDFS) have significant limitations for real-time analytics applications

‣Append-only orientation, focus on column-store 

makes streaming ingestion harder

•Cloudera Kudu aims to combine 

best of HDFS + HBase

‣Real-time analytics-optimised 

‣Supports updates to data

‣Fast ingestion of data

‣Accessed using SQL-style tables

and get/put/update/delete API
Cloudera Kudu - Combining Best of HBase and Column-Store
•Kudu storage used with Impala - create tables using Kudu storage handler

•Can now UPDATE, DELETE and INSERT into Hadoop tables, not just SELECT and LOAD DATA
Example Impala DDL + DML Commands with Kudu
CREATE TABLE `my_first_table` (
  `id` BIGINT,
  `name` STRING
)
TBLPROPERTIES(
  'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler',
  'kudu.table_name' = 'my_first_table',
  'kudu.master_addresses' = 'kudu-master.example.com:7051',
  'kudu.key_columns' = 'id'
);

INSERT INTO my_first_table VALUES (99, "sarah");
INSERT IGNORE INTO my_first_table VALUES (99, "sarah");
UPDATE my_first_table SET name = "bob" WHERE id = 3;
DELETE FROM my_first_table WHERE id < 3;
DELETE c FROM my_second_table c, stock_symbols s WHERE c.name = s.symbol;
and it’s now in-memory
•Another DAG execution engine running on YARN

•More mature than Tez, with a richer API and more vendor support

•Uses concept of an RDD (Resilient Distributed Dataset)

‣RDDs like tables or Pig relations, but can be cached in-memory

‣Great for in-memory transformations, or iterative/cyclic processes

•Spark jobs comprise a DAG of tasks operating on RDDs

•Access through Scala, Python or Java APIs

•Related projects include

‣Spark SQL

‣Spark Streaming
Apache Spark
•Spark SQL and DataFrames allow RDDs in Spark to be processed using SQL queries

•Bring in and federate additional data from JDBC sources

•Load, read and save data in Hive, Parquet and other structured tabular formats
Spark SQL - Adding SQL Processing to Apache Spark
val accessLogsFilteredDF = accessLogs
  .filter(r => !r.agent.matches(".*(spider|robot|bot|slurp).*"))
  .filter(r => !r.endpoint.matches(".*(wp-content|wp-admin).*"))
  .toDF()
accessLogsFilteredDF.registerTempTable("accessLogsFiltered")

val topTenPostsLast24Hour = sqlContext.sql("""
  SELECT p.POST_TITLE, p.POST_AUTHOR, COUNT(*) AS total
  FROM accessLogsFiltered a
  JOIN posts p ON a.endpoint = p.POST_SLUG
  GROUP BY p.POST_TITLE, p.POST_AUTHOR
  ORDER BY total DESC
  LIMIT 10""")

// Persist the top-ten table for this window to HDFS as a Parquet file
topTenPostsLast24Hour.save("/user/oracle/rm_logs_batch_output/topTenPostsLast24Hour.parquet",
  "parquet", SaveMode.Overwrite)
Accompanied by Innovations in Underlying Platform
Cluster Resource Management to

support multi-tenant distributed services
In-Memory Distributed Storage,

to accompany In-Memory Distributed Processing
Dataflow Pipelines 

are the new ETL
New ways to do BI
New ways to do BI
Hadoop is the new ETL Engine
“Proprietary ETL engines die circa 2015 - folded into big data”
(Oracle Open World 2015; Copyright © 2015, Oracle and/or its affiliates. All rights reserved.)
Proprietary ETL is Dead. Apache-based ETL is What’s Next
[Timeline: the eon of scripts and PL/SQL (1990s; scripted SQL, stored procs) gives way to the era of SQL E-LT/pushdown, then big data ETL in batch, then streaming ETL; the period of proprietary batch ETL engines (Informatica, Ascential/IBM, Ab Initio, Acta/SAP, SyncSort) begins around 1994; Oracle Warehouse Builder and Oracle Data Integrator evolve through ODI for Exadata, Columnar and In-Memory to ODI for Hive, Pig & Oozie, Spark and Spark Streaming]
Machine Learning & Search for 

“Automagic” Schema Discovery
New ways to do BI
•By definition there's lots of data in a big data system ... so how do you find the data you want?

•Google's own internal solution - GOODS ("Google Dataset Search")

•Uses crawler to discover new datasets

•ML classification routines to infer domain

•Data provenance and lineage

•Indexes and catalogs 26bn datasets

•Other users, vendors also have solutions

•Oracle Big Data Discovery

•Datameer

•Platfora

•Cloudera Navigator
Google GOODS - Catalog + Search At Google-Scale
A New Take on BI
•Came out of the data science movement, as a way to
"show workings"

•A set of reproducible steps that tell a story about the data

•as well as being a better command-line environment for
data analysis

•One example is Jupyter, evolution of iPython notebook

•supports pySpark, Pandas etc

•See also Apache Zeppelin
Web-Based Data Analysis Notebooks
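A notebook cell typically interleaves narrative with small, reproducible computations. A hypothetical cell, using only the Python standard library (the numbers are made up for illustration):

```python
import statistics

# Hypothetical daily page-view counts pulled from a log extract.
page_views = [1200, 1350, 990, 2100, 1875, 1420, 1600]

# Each step of the analysis is recorded, so the "workings" can be re-run.
summary = {
    "mean": statistics.mean(page_views),
    "median": statistics.median(page_views),
    "stdev": round(statistics.stdev(page_views), 1),
}
```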
Meanwhile
in the real world …
https://www.youtube.com/watch?v=h1UmdvJDEYY
And Emerging Open-Source

BI Tools and Platforms
http://larrr.com/wp-content/uploads/2016/05/paper.pdf
See an example in action:
https://speakerdeck.com/markrittman/oracle-big-data-discovery-extending-into-machine-learning-a-quantified-self-case-study
http://www.rittmanmead.com


More Related Content

What's hot (20)

PDF
OTN EMEA TOUR 2016 - OBIEE12c New Features for End-Users, Developers and Sys...
Mark Rittman
 
PDF
The Future of Analytics, Data Integration and BI on Big Data Platforms
Mark Rittman
 
PPTX
Unlock the value in your big data reservoir using oracle big data discovery a...
Mark Rittman
 
PDF
Oracle BI Hybrid BI : Mode 1 + Mode 2, Cloud + On-Premise Business Analytics
Mark Rittman
 
PDF
Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop :
Mark Rittman
 
PDF
How a Tweet Went Viral - BIWA Summit 2017
Rittman Analytics
 
PDF
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
Rittman Analytics
 
PDF
Oracle Big Data Spatial & Graph 
Social Media Analysis - Case Study
Mark Rittman
 
PDF
Big Data Computing Architecture
Gang Tao
 
PDF
Big Data Architecture and Design Patterns
John Yeung
 
PDF
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...
StampedeCon
 
PDF
Building a Data Lake on AWS
Gary Stafford
 
PDF
Innovation in the Data Warehouse - StampedeCon 2016
StampedeCon
 
PDF
Lambda architecture for real time big data
Trieu Nguyen
 
PPTX
Big Data on azure
David Giard
 
PPTX
Big Data 2.0: ETL & Analytics: Implementing a next generation platform
Caserta
 
PDF
Big Data Architecture
Guido Schmutz
 
PDF
The Hidden Value of Hadoop Migration
Databricks
 
PPTX
Big data architectures and the data lake
James Serra
 
PPTX
SQL on Hadoop for the Oracle Professional
Michael Rainey
 
OTN EMEA TOUR 2016 - OBIEE12c New Features for End-Users, Developers and Sys...
Mark Rittman
 
The Future of Analytics, Data Integration and BI on Big Data Platforms
Mark Rittman
 
Unlock the value in your big data reservoir using oracle big data discovery a...
Mark Rittman
 
Oracle BI Hybrid BI : Mode 1 + Mode 2, Cloud + On-Premise Business Analytics
Mark Rittman
 
Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop :
Mark Rittman
 
How a Tweet Went Viral - BIWA Summit 2017
Rittman Analytics
 
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
Rittman Analytics
 
Oracle Big Data Spatial & Graph 
Social Media Analysis - Case Study
Mark Rittman
 
Big Data Computing Architecture
Gang Tao
 
Big Data Architecture and Design Patterns
John Yeung
 
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...
StampedeCon
 
Building a Data Lake on AWS
Gary Stafford
 
Innovation in the Data Warehouse - StampedeCon 2016
StampedeCon
 
Lambda architecture for real time big data
Trieu Nguyen
 
Big Data on azure
David Giard
 
Big Data 2.0: ETL & Analytics: Implementing a next generation platform
Caserta
 
Big Data Architecture
Guido Schmutz
 
The Hidden Value of Hadoop Migration
Databricks
 
Big data architectures and the data lake
James Serra
 
SQL on Hadoop for the Oracle Professional
Michael Rainey
 

Viewers also liked (20)

PDF
Data Lake Architektur: Von den Anforderungen zur Technologie
Jens Albrecht
 
PDF
Bitcoin & Blockchain for Friends
Sam Wouters
 
PDF
7 Things Banks should do with Blockchain
Sam Wouters
 
PPTX
Hadoop基盤上のETL構築実践例 ~多様なデータをどう扱う?~
Sotaro Kimura
 
PDF
Application of postgre sql to large social infrastructure
NTT DATA OSS Professional Services
 
PPTX
Bb Tour ANZ 17 - X-Ray Roll Up Reports
Blackboard APAC
 
PPTX
Bb Tour ANZ 2017 - Moodlerooms & X-Ray Learning Analytics Product Updates
Blackboard APAC
 
PDF
Cloud Native Hadoop #cwt2016
Cloudera Japan
 
PDF
Organising the Data Lake - Information Management in a Big Data World
DataWorks Summit/Hadoop Summit
 
PDF
Apache Hadoop 2.8.0 の新機能 (抜粋)
NTT DATA OSS Professional Services
 
PDF
ICTSC5 大人達の戦い LT資料
Ken SASAKI
 
PDF
ICTSC5 DMM.comラボの紹介+お給料の話
Ken SASAKI
 
PPTX
Kafkaを活用するためのストリーム処理の基本
Sotaro Kimura
 
PDF
ICTSC6 ちょっとだけ数学の話
Ken SASAKI
 
PDF
大規模データに対するデータサイエンスの進め方 #CWT2016
Cloudera Japan
 
PDF
ちょっと理解に自信がないな という皆さまに贈るHadoop/Sparkのキホン (IBM Datapalooza Tokyo 2016講演資料)
hamaken
 
PDF
40分でわかるHadoop徹底入門 (Cloudera World Tokyo 2014 講演資料)
hamaken
 
PDF
AWS Lambda and Amazon API Gateway
Shinpei Ohtani
 
PDF
Amazon Aurora
Shinpei Ohtani
 
PDF
Application of postgre sql to large social infrastructure jp
NTT DATA OSS Professional Services
 
Data Lake Architektur: Von den Anforderungen zur Technologie
Jens Albrecht
 
Bitcoin & Blockchain for Friends
Sam Wouters
 
7 Things Banks should do with Blockchain
Sam Wouters
 
Hadoop基盤上のETL構築実践例 ~多様なデータをどう扱う?~
Sotaro Kimura
 
Application of postgre sql to large social infrastructure
NTT DATA OSS Professional Services
 
Bb Tour ANZ 17 - X-Ray Roll Up Reports
Blackboard APAC
 
Bb Tour ANZ 2017 - Moodlerooms & X-Ray Learning Analytics Product Updates
Blackboard APAC
 
Cloud Native Hadoop #cwt2016
Cloudera Japan
 
Organising the Data Lake - Information Management in a Big Data World
DataWorks Summit/Hadoop Summit
 
Apache Hadoop 2.8.0 の新機能 (抜粋)
NTT DATA OSS Professional Services
 
ICTSC5 大人達の戦い LT資料
Ken SASAKI
 
ICTSC5 DMM.comラボの紹介+お給料の話
Ken SASAKI
 
Kafkaを活用するためのストリーム処理の基本
Sotaro Kimura
 
ICTSC6 ちょっとだけ数学の話
Ken SASAKI
 
大規模データに対するデータサイエンスの進め方 #CWT2016
Cloudera Japan
 
ちょっと理解に自信がないな という皆さまに贈るHadoop/Sparkのキホン (IBM Datapalooza Tokyo 2016講演資料)
hamaken
 
40分でわかるHadoop徹底入門 (Cloudera World Tokyo 2014 講演資料)
hamaken
 
AWS Lambda and Amazon API Gateway
Shinpei Ohtani
 
Amazon Aurora
Shinpei Ohtani
 
Application of postgre sql to large social infrastructure jp
NTT DATA OSS Professional Services
 
Ad

Similar to From lots of reports (with some data Analysis) 
to Massive Data Analysis (With some Reporting) (20)

PDF
IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...
Mark Rittman
 
PDF
ODI12c as your Big Data Integration Hub
Mark Rittman
 
PPTX
BI.pptx
RiadHasan25
 
PDF
ADV Slides: The Evolution of the Data Platform and What It Means to Enterpris...
DATAVERSITY
 
PDF
Data Integration and Data Warehousing for Cloud, Big Data and IoT: 
What’s Ne...
Rittman Analytics
 
PPTX
Big Data Analytics for BI, BA and QA
Dmitry Tolpeko
 
PDF
Adding a Data Reservoir to your Oracle Data Warehouse for Customer 360-Degree...
Mark Rittman
 
PDF
Mighty Guides- Data Disruption
Mighty Guides, Inc.
 
PDF
Big data and you
IBM
 
PPTX
Big data unit 2
RojaT4
 
PDF
OGH 2015 - Hadoop (Oracle BDA) and Oracle Technologies on BI Projects
Mark Rittman
 
PDF
6 enriching your data warehouse with big data and hadoop
Dr. Wilfred Lin (Ph.D.)
 
PPT
13500892 data-warehousing-and-data-mining
Ngaire Taylor
 
PDF
Leveraging Hadoop with OBIEE 11g and ODI 11g - UKOUG Tech'13
Mark Rittman
 
PDF
Extending BI with Big Data Analytics
Datameer
 
PDF
Foundation for Success: How Big Data Fits in an Information Architecture
Inside Analysis
 
PPTX
From Business Intelligence to Big Data - hack/reduce Dec 2014
Adam Ferrari
 
PDF
What is Big Data Discovery, and how it complements traditional business anal...
Mark Rittman
 
PPTX
Big data analyti data analytical life cycle
NAKKAPUNEETH1
 
PDF
OBIEE, Endeca, Hadoop and ORE Development (on Exalytics) (ODTUG 2013)
Mark Rittman
 
IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...
Mark Rittman
 
ODI12c as your Big Data Integration Hub
Mark Rittman
 
BI.pptx
RiadHasan25
 
ADV Slides: The Evolution of the Data Platform and What It Means to Enterpris...
DATAVERSITY
 
Data Integration and Data Warehousing for Cloud, Big Data and IoT: 
What’s Ne...
Rittman Analytics
 
Big Data Analytics for BI, BA and QA
Dmitry Tolpeko
 
Adding a Data Reservoir to your Oracle Data Warehouse for Customer 360-Degree...
Mark Rittman
 
Mighty Guides- Data Disruption
Mighty Guides, Inc.
 
Big data and you
IBM
 
Big data unit 2
RojaT4
 
OGH 2015 - Hadoop (Oracle BDA) and Oracle Technologies on BI Projects
Mark Rittman
 
6 enriching your data warehouse with big data and hadoop
Dr. Wilfred Lin (Ph.D.)
 
13500892 data-warehousing-and-data-mining
Ngaire Taylor
 
Leveraging Hadoop with OBIEE 11g and ODI 11g - UKOUG Tech'13
Mark Rittman
 
Extending BI with Big Data Analytics
Datameer
 
Foundation for Success: How Big Data Fits in an Information Architecture
Inside Analysis
 
From Business Intelligence to Big Data - hack/reduce Dec 2014
Adam Ferrari
 
What is Big Data Discovery, and how it complements traditional business anal...
Mark Rittman
 
Big data analyti data analytical life cycle
NAKKAPUNEETH1
 
OBIEE, Endeca, Hadoop and ORE Development (on Exalytics) (ODTUG 2013)
Mark Rittman
 
Ad

More from Mark Rittman (12)

PDF
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...
Mark Rittman
 
PDF
OBIEE12c and Embedded Essbase 12c - An Initial Look at Query Acceleration Use...
Mark Rittman
 
PDF
Deploying Full BI Platforms to Oracle Cloud
Mark Rittman
 
PDF
Deploying Full Oracle BI Platforms to Oracle Cloud - OOW2015
Mark Rittman
 
PDF
Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Ar...
Mark Rittman
 
PDF
End to-end hadoop development using OBIEE, ODI, Oracle Big Data SQL and Oracl...
Mark Rittman
 
PDF
OBIEE11g Seminar by Mark Rittman for OU Expert Summit, Dubai 2015
Mark Rittman
 
PDF
BIWA2015 - Bringing Oracle Big Data SQL to OBIEE and ODI
Mark Rittman
 
PDF
UKOUG Tech'14 Super Sunday : Deep-Dive into Big Data ETL with ODI12c
Mark Rittman
 
PDF
Part 1 - Introduction to Hadoop and Big Data Technologies for Oracle BI & DW ...
Mark Rittman
 
PDF
Part 4 - Hadoop Data Output and Reporting using OBIEE11g
Mark Rittman
 
PDF
Part 2 - Hadoop Data Loading using Hadoop Tools and ODI12c
Mark Rittman
 
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...
Mark Rittman
 
OBIEE12c and Embedded Essbase 12c - An Initial Look at Query Acceleration Use...
Mark Rittman
 
Deploying Full BI Platforms to Oracle Cloud
Mark Rittman
 
Deploying Full Oracle BI Platforms to Oracle Cloud - OOW2015
Mark Rittman
 
Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Ar...
Mark Rittman
 
End to-end hadoop development using OBIEE, ODI, Oracle Big Data SQL and Oracl...
Mark Rittman
 
OBIEE11g Seminar by Mark Rittman for OU Expert Summit, Dubai 2015
Mark Rittman
 
BIWA2015 - Bringing Oracle Big Data SQL to OBIEE and ODI
Mark Rittman
 
UKOUG Tech'14 Super Sunday : Deep-Dive into Big Data ETL with ODI12c
Mark Rittman
 
Part 1 - Introduction to Hadoop and Big Data Technologies for Oracle BI & DW ...
Mark Rittman
 
Part 4 - Hadoop Data Output and Reporting using OBIEE11g
Mark Rittman
 
Part 2 - Hadoop Data Loading using Hadoop Tools and ODI12c
Mark Rittman
 

Recently uploaded (20)

PPTX
apidays Helsinki & North 2025 - APIs at Scale: Designing for Alignment, Trust...
apidays
 
PPTX
thid ppt defines the ich guridlens and gives the information about the ICH gu...
shaistabegum14
 
PPT
tuberculosiship-2106031cyyfuftufufufivifviviv
AkshaiRam
 
PDF
apidays Singapore 2025 - The API Playbook for AI by Shin Wee Chuang (PAND AI)
apidays
 
PPTX
Aict presentation on dpplppp sjdhfh.pptx
vabaso5932
 
PDF
InformaticsPractices-MS - Google Docs.pdf
seshuashwin0829
 
PPTX
SHREYAS25 INTERN-I,II,III PPT (1).pptx pre
swapnilherage
 
PDF
apidays Singapore 2025 - Building a Federated Future, Alex Szomora (GSMA)
apidays
 
PPTX
apidays Singapore 2025 - Generative AI Landscape Building a Modern Data Strat...
apidays
 
PPTX
01_Nico Vincent_Sailpeak.pptx_AI_Barometer_2025
FinTech Belgium
 
PPTX
apidays Helsinki & North 2025 - From Chaos to Clarity: Designing (AI-Ready) A...
apidays
 
PPTX
SlideEgg_501298-Agentic AI.pptx agentic ai
530BYManoj
 
PDF
The Best NVIDIA GPUs for LLM Inference in 2025.pdf
Tamanna36
 
PDF
JavaScript - Good or Bad? Tips for Google Tag Manager
📊 Markus Baersch
 
PDF
1750162332_Snapshot-of-Indias-oil-Gas-data-May-2025.pdf
sandeep718278
 

From Lots of Reports (with some Data Analysis)
to Massive Data Analysis (with some Reporting)

  • 1. EVENT SPEAKER DANISH BI MEETUP, SEP’ 2016 FROM LOTS OF REPORTS (WITH SOME DATA ANALYSIS) 
 TO MASSIVE DATA ANALYSIS (WITH SOME REPORTING) MARK RITTMAN, ORACLE ACE DIRECTOR
  • 2. info@rittmanmead.com www.rittmanmead.com @rittmanmead 2 •Mark Rittman, Co-Founder of Rittman Mead ‣Oracle ACE Director, specialising in Oracle BI&DW ‣14 Years Experience with Oracle Technology ‣Regular columnist for Oracle Magazine •Author of two Oracle Press Oracle BI books ‣Oracle Business Intelligence Developers Guide ‣Oracle Exalytics Revealed ‣Writer for Rittman Mead Blog : http://www.rittmanmead.com/blog •Email : mark.rittman@rittmanmead.com •Twitter : @markrittman About the Speaker
  • 3. •Started back in 1996 on a bank Oracle DW project •Our tools were Oracle 7.3.4, SQL*Plus, PL/SQL and shell scripts •Went on to use Oracle Developer/2000 and Designer/2000 •Our initial users queried the DW using SQL*Plus •And later on, we rolled out Discoverer/2000 to everyone else •And life was fun… 20 Years in Oracle BI and Data Warehousing
  • 4. •Data warehouses provided a unified view of the business ‣Single place to store key data and metrics ‣Joined-up view of the business ‣Aggregates and conformed dimensions ‣ETL routines to load, cleanse and conform data •BI tools for simple, guided access to information ‣Tabular data access using SQL-generating tools ‣Drill paths, hierarchies, facts, attributes ‣Fast access to pre-computed aggregates ‣Packaged BI for fast-start ERP analytics Data Warehouses and Enterprise BI Tools [Diagram: source systems (Core ERP Platform, Retail Banking, Call Center, E-Commerce, CRM on Oracle, MongoDB, Sybase, IBM DB/2, MS SQL Server) feeding an ODS/Foundation layer and an Access & Performance layer in the Data Warehouse, queried by Business Intelligence Tools]
  • 5. •Examples were Crystal Reports, Oracle Reports, Cognos Impromptu, Business Objects •Reports written against a carefully-curated BI dataset, or directly connecting to ERP/CRM •Adding data from external sources, or other RDBMSs, was difficult and involved IT resources •Report-writing was a skilled job •High ongoing cost for maintenance and changes •Little scope for analysis, predictive modeling •Often user frustration with the pace of delivery Reporting Back Then…
  • 6. •For example Oracle OBIEE, SAP Business Objects, IBM Cognos •Full-featured, IT-orientated enterprise BI platforms •Metadata layers, integrated security, web delivery •Pre-built ERP metadata layers, dashboards + reports •Federated queries across multiple sources •Single version of the truth across the enterprise •Mobile, web dashboards, alerts, published reports •Integration with SOA and web services Then Came Enterprise BI Tools
  • 7. Traditional Three-Layer Relational Data Warehouses •Three-layer architecture - staging, foundation and access/performance •All three layers stored in a relational database (Oracle) •ETL used to move data from layer to layer [Diagram: traditional structured data sources loaded via ETL through staging, foundation/ODS and performance/dimensional layers, accessed either by a BI tool (OBIEE) with a metadata layer reading directly, or by an OLAP/in-memory tool loading data into its own database]
  • 8. And All Was Good…
  • 9. (a big BI project)
  • 11. Lots of reports 
 (with some data analysis)
  • 15. and users got impatient…
  • 17. Reporting and Dashboards… became self-service 
 data discovery
  • 19. Cloud and SaaS have won
  • 22. The Gartner BI & Analytics Magic Quadrant 2016
  • 25. Gartner’s View of A “Modern BI Platform” in 2016 (Traditional BI Platform vs. Modern BI Platform, by analytic workflow component): ‣Data source: upfront dimensional modeling required (IT-built star schemas) vs. upfront modeling not required (flat files/flat tables) ‣Data ingestion and preparation: IT-produced vs. IT-enabled ‣Content authoring: primarily IT staff, but also some power users vs. business users ‣Analysis: predefined, ad hoc reporting based on a predefined model vs. free-form exploration ‣Insight delivery: distribution and notifications via scheduled reports or portal vs. sharing and collaboration, storytelling, open APIs
  • 26. 2007 - 2015 Died of ingratitude by business users Just when we got the infrastructure right Doesn’t anyone appreciate a single version of the truth? Don’t say we didn’t warn you No you can’t just export it to Excel Watch out OLAP you’re next
  • 28. Meet the New Data Warehouse : The “Data Reservoir” •Data now landed in Hadoop clusters, NoSQL databases and Cloud Storage •Flexible data storage platform with cheap storage, flexible schema support + compute •Data lands in the data lake or reservoir in raw form, then is minimally processed •Data then accessed directly by “data scientists”, or processed further into the DW [Diagram: data streams and file-, stream- and ETL-based integration feeding a Hadoop-based Data Reservoir holding raw customer data (stored in the original format, usually files, such as SS7, ASN.1, JSON) and mapped customer data (data sets produced by mapping and transforming raw data), alongside operational data (transactions, customer master data, unstructured data, voice + chat transcripts); Discovery & Development Labs provide a safe and secure environment with data sets, samples, models and programs; outputs feed Business Intelligence tools and marketing/sales applications with machine-learning models and segments]
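A minimal sketch of the “raw, then minimally processed” pattern, using Python’s json module (the record layout and field names are invented for illustration):

```python
import json

# Raw zone: events land exactly as emitted by the source, schema-on-read
raw_events = [
    '{"cust": "A42", "ts": "2016-09-01T10:00:00", "amount": "19.99"}',
    '{"cust": "B07", "ts": "2016-09-01T10:05:00", "amount": "5.00", "promo": "X1"}',
]

# Mapped zone: parse and lightly conform (cast types, default optional fields),
# deferring heavier modeling until the data is loaded into the warehouse
def map_event(line):
    e = json.loads(line)
    return {
        "customer_id": e["cust"],
        "timestamp": e["ts"],
        "amount": float(e["amount"]),
        "promo_code": e.get("promo"),  # schema evolved: absent in older events
    }

mapped = [map_event(l) for l in raw_events]
print(mapped[1]["amount"], mapped[0]["promo_code"])
```

The point of the pattern is that the raw strings are kept untouched, so the mapping can be re-run when the schema evolves, as the slide describes.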
  • 29. Hadoop is the new 
 Data Warehouse
  • 30. Hadoop : The Default Platform Today for Analytics •Enterprise high-end RDBMSs such as Oracle can scale into the petabytes, using clustering ‣Sharded databases (e.g. Netezza) can scale further but with complexity / single-workload trade-offs •Hadoop was designed from the outset for massive horizontal scalability - using cheap hardware •Anticipates hardware failure and makes multiple copies of data as protection •The more nodes you add, the more stable it becomes •And at a fraction of the cost of traditional RDBMS platforms
  • 31. But Why Hadoop? Reason #1 - Flexible Storage •Data from new-world applications is not like historic data •Typically comes in non-tabular form •JSON, log files, key/value pairs •Users often want it speculatively •Haven’t thought it through •Schema can evolve •Or maybe there isn’t one •But the end-users want it now •Not when you’re ready [Diagram: Big Data Management Platform with Discovery & Development Labs (a safe and secure environment with data sets, samples, models and programs), building an enriched single customer view through correlating, modeling, machine learning and scoring, with schema-on-read analysis]
  • 32. But Why Hadoop? Reason #2 - Massive Scalability •Enterprise high-end RDBMSs such as Oracle can scale ‣Clustering for single-instance DBs can scale to >PB ‣Exadata scales further by offloading queries to storage ‣Sharded databases (e.g. Netezza) can scale further ‣But cost (and complexity) become limiting factors ‣Typically $1m/node is not uncommon
  • 33. But Why Hadoop? Reason #2 - Massive Scalability
  • 34. But Why Hadoop? Reason #2 - Massive Scalability •Hadoop’s main design goal was to enable virtually-limitless horizontal scalability •Rather than a small number of large, powerful servers, it spreads processing over large numbers of small, cheap, redundant servers •Processes the data where it’s stored, avoiding I/O bottlenecks •The more nodes you add, the more stable it becomes! •At an affordable cost - this is key •$50k/node vs. $1m/node •And … the Hadoop platform is a better fit for new types of processing and analysis
  • 35. But Why Hadoop? Reason #3 - Processing Frameworks •Hadoop started by being synonymous with MapReduce, and Java coding •But YARN (Yet Another Resource Negotiator) broke this dependency •Modern Hadoop platforms provide overall cluster resource management, but support multiple processing frameworks •General-purpose (e.g. MapReduce) •Graph processing •Machine Learning •Real-Time Processing (Spark Streaming, Storm) •Even the Hadoop resource management framework can be swapped out •Apache Mesos [Diagram: Big Data Platform, all running natively under Hadoop - YARN (cluster resource management) over HDFS (cluster filesystem holding raw data), supporting batch (MapReduce), interactive (Impala, Drill, Tez, Presto), streaming + in-memory (Spark, Storm) and graph + search (Solr, Giraph) workloads]
  • 36. Combine With DW for Old-World/New-World Solution
  • 37. But … Analytic RDBMSs Are The New Data Mart •Most high-end RDBMS vendors provide connectors to load data in/out of Hadoop platforms ‣Bulk extract ‣External tables ‣Query federation •Use high-end RDBMSs as specialist engines •a.k.a. "Data Marts" [Diagram: the Big Data Management Platform (YARN over HDFS with batch, interactive, streaming + in-memory and graph + search frameworks) and Discovery & Development Labs feeding a Data Warehouse holding curated data - a historical view with business-aligned access - queried by Business Intelligence tools]
  • 38. BI Innovation is happening
 around Hadoop
  • 50. Hadoop 2.0 Processing Frameworks + Tools
  • 51. Cloudera Impala - Fast, MPP-style Access to Hadoop Data •Cloudera’s answer to Hive query response time issues •MPP SQL query engine running on Hadoop, bypasses MapReduce for direct data access •Mostly in-memory, but spills to disk if required •Uses the Hive metastore to access Hive table metadata •Similar SQL dialect to Hive - not as rich though, and no support for Hive SerDes, storage handlers etc.
  • 52. Apache Parquet - Column-Orientated Storage for Analytics •Beginners usually store data in HDFS using text file formats (CSV) but these have limitations •Apache Avro often used for general-purpose processing ‣Splittability, schema evolution, in-built metadata, support for block compression •Parquet now commonly used with Impala due to column-orientated storage ‣Mirrors work in the RDBMS world around column-store ‣Only return (project) the columns you require across a wide table
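The column-projection benefit can be illustrated with a toy in-memory sketch (no Parquet libraries involved; the data and byte-counting are invented for illustration): a row-oriented scan must touch every field of every row, while a columnar layout reads only the requested columns.

```python
# Toy dataset: a "wide" table with a bulky column per row
rows = [{"id": i, "name": f"cust{i}", "city": "Copenhagen",
         "notes": "x" * 100, "amount": float(i)} for i in range(1000)]

# Row-oriented scan: every field of every row is read,
# even when only one column is wanted
def row_scan_bytes(rows, wanted):
    return sum(len(str(v)) for r in rows for v in r.values())

# Column-oriented layout: values stored per column, so a scan
# reads only the projected columns
columns = {k: [r[k] for r in rows] for k in rows[0]}

def column_scan_bytes(columns, wanted):
    return sum(len(str(v)) for c in wanted for v in columns[c])

full = row_scan_bytes(rows, ["amount"])
projected = column_scan_bytes(columns, ["amount"])
print(projected < full / 10)  # columnar scan touches a fraction of the bytes
```

Real Parquet adds per-column encoding and compression on top of this layout, which widens the gap further for wide tables.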
  • 53. Cloudera Kudu - Combining Best of HBase and Column-Store •But Parquet (and HDFS) have significant limitations for real-time analytics applications ‣Append-only orientation and focus on column-store make streaming ingestion harder •Cloudera Kudu aims to combine the best of HDFS + HBase ‣Real-time analytics-optimised ‣Supports updates to data ‣Fast ingestion of data ‣Accessed using SQL-style tables and a get/put/update/delete API
  • 54. Example Impala DDL + DML Commands with Kudu •Kudu storage used with Impala - create tables using the Kudu storage handler •Can now UPDATE, DELETE and INSERT into Hadoop tables, not just SELECT and LOAD DATA

    CREATE TABLE `my_first_table` (
      `id` BIGINT,
      `name` STRING
    )
    TBLPROPERTIES(
      'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler',
      'kudu.table_name' = 'my_first_table',
      'kudu.master_addresses' = 'kudu-master.example.com:7051',
      'kudu.key_columns' = 'id'
    );

    INSERT INTO my_first_table VALUES (99, "sarah");
    INSERT IGNORE INTO my_first_table VALUES (99, "sarah");
    UPDATE my_first_table SET name="bob" WHERE id = 3;
    DELETE FROM my_first_table WHERE id < 3;
    DELETE c FROM my_second_table c, stock_symbols s WHERE c.name = s.symbol;
  • 55. and it’s now in-memory
  • 57. Apache Spark •Another DAG execution engine running on YARN •More mature than Tez, with a richer API and more vendor support •Uses the concept of an RDD (Resilient Distributed Dataset) ‣RDDs are like tables or Pig relations, but can be cached in-memory ‣Great for in-memory transformations, or iterative/cyclic processes •Spark jobs comprise a DAG of tasks operating on RDDs •Access through Scala, Python or Java APIs •Related projects include ‣Spark SQL ‣Spark Streaming
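The RDD behaviour described above (lazy transformations, caching, actions triggering execution) can be mimicked in a few lines of plain Python. This is a conceptual sketch, not PySpark: the class and its methods only imitate the shape of the RDD API.

```python
class MiniRDD:
    """Conceptual stand-in for a Spark RDD: transformations are recorded
    lazily and only run when an action (collect) is called."""
    def __init__(self, data, ops=None):
        self._data = data
        self._ops = ops or []          # pending lazy transformations
        self._cache = None

    def map(self, f):
        return MiniRDD(self._data, self._ops + [("map", f)])

    def filter(self, p):
        return MiniRDD(self._data, self._ops + [("filter", p)])

    def cache(self):
        self._cache = self.collect()   # materialise once, like RDD.cache()
        return self

    def collect(self):                 # the action that triggers execution
        if self._cache is not None:
            return self._cache
        out = list(self._data)
        for kind, f in self._ops:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

rdd = MiniRDD(range(10)).filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(rdd.collect())  # [0, 4, 16, 36, 64]
```

In real Spark the same chain would be distributed across the cluster and the DAG optimised before execution; the laziness and caching semantics are the part this sketch preserves.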
  • 58. Spark SQL - Adding SQL Processing to Apache Spark •Spark SQL, and Data Frames, allow RDDs in Spark to be processed using SQL queries •Bring in and federate additional data from JDBC sources •Load, read and save data in Hive, Parquet and other structured tabular formats

    val accessLogsFilteredDF = accessLogs
      .filter( r => ! r.agent.matches(".*(spider|robot|bot|slurp).*"))
      .filter( r => ! r.endpoint.matches(".*(wp-content|wp-admin).*")).toDF()
      .registerTempTable("accessLogsFiltered")

    val topTenPostsLast24Hour = sqlContext.sql("SELECT p.POST_TITLE, p.POST_AUTHOR, COUNT(*) as total
                                                FROM accessLogsFiltered a
                                                JOIN posts p ON a.endpoint = p.POST_SLUG
                                                GROUP BY p.POST_TITLE, p.POST_AUTHOR
                                                ORDER BY total DESC LIMIT 10 ")

    // Persist top ten table for this window to HDFS as parquet file
    topTenPostsLast24Hour.save("/user/oracle/rm_logs_batch_output/topTenPostsLast24Hour.parquet", "parquet", SaveMode.Overwrite)
  • 59. Accompanied by Innovations in Underlying Platform •Cluster resource management to support multi-tenant distributed services •In-memory distributed storage, to accompany in-memory distributed processing
  • 61. New ways to do BI
  • 63. Hadoop is the new ETL Engine
  • 64. Proprietary ETL is Dead. Apache-based ETL is What’s Next (Oracle OpenWorld 2015: “Proprietary ETL engines die circa 2015 - folded into big data”) [Timeline: 1990’s eon of scripts and PL-SQL (scripted SQL, stored procs) → period of proprietary batch ETL engines from 1994 (Informatica, Ascential/IBM, Ab Initio, Acta/SAP, SyncSort) → era of SQL E-LT/pushdown (Warehouse Builder, Oracle Data Integrator for Exadata, columnar and in-memory) → big data ETL in batch (ODI for Hive, Pig & Oozie, Spark) → streaming ETL (ODI for Spark Streaming)]
  • 65. Machine Learning & Search for 
 “Automagic” Schema Discovery
  • 66. New ways to do BI
  • 67. Google GOODS - Catalog + Search At Google-Scale •By definition there's lots of data in a big data system ... so how do you find the data you want? •Google's own internal solution - GOODS ("Google Dataset Search") •Uses a crawler to discover new datasets •ML classification routines to infer domain •Data provenance and lineage •Indexes and catalogs 26bn datasets •Other users and vendors also have solutions •Oracle Big Data Discovery •Datameer •Platfora •Cloudera Navigator
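A toy sketch of the catalog idea: crawl dataset paths, tag each with an inferred domain, and make the result keyword-searchable. All names here are invented, and the naive keyword lookup merely stands in for the ML classifiers and lineage signals GOODS actually uses.

```python
# Naive keyword-based domain inference standing in for ML classification
DOMAIN_KEYWORDS = {"sales": "commerce", "order": "commerce",
                   "click": "web", "pageview": "web"}

def infer_domain(name):
    for kw, domain in DOMAIN_KEYWORDS.items():
        if kw in name:
            return domain
    return "unknown"

class Catalog:
    def __init__(self):
        self.entries = []

    def crawl(self, paths):
        # Register each discovered dataset with its inferred metadata
        for p in paths:
            self.entries.append({"path": p, "domain": infer_domain(p)})

    def search(self, term):
        # Match either on the path itself or on the inferred domain tag
        return [e["path"] for e in self.entries
                if term in e["path"] or term == e["domain"]]

cat = Catalog()
cat.crawl(["/data/raw/sales_2016.parquet", "/data/raw/clickstream.json",
           "/data/curated/orders.parquet"])
print(cat.search("commerce"))
```

The value of the metadata layer is exactly this: a search for a business domain finds datasets whose file names never mention it.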
  • 68. A New Take on BI
  • 69. Web-Based Data Analysis Notebooks •Came out of the data science movement, as a way to "show workings" •A set of reproducible steps that tell a story about the data •as well as being a better command-line environment for data analysis •One example is Jupyter, an evolution of the IPython notebook •supports PySpark, Pandas etc. •See also Apache Zeppelin
  • 70. Meanwhile in the real world … https://www.youtube.com/watch?v=h1UmdvJDEYY
  • 72. And Emerging Open-Source
 BI Tools and Platforms http://larrr.com/wp-content/uploads/2016/05/paper.pdf
  • 76. To see an example:
  • 77. See an example in action: https://speakerdeck.com/markrittman/oracle-big-data-discovery-extending-into-machine-learning-a-quantified-self-case-study
  • 79. EVENT SPEAKER DANISH BI MEETUP, SEP’ 2016 FROM LOTS OF REPORTS (WITH SOME DATA ANALYSIS) 
 TO MASSIVE DATA ANALYSIS (WITH SOME REPORTING) MARK RITTMAN, ORACLE ACE DIRECTOR