April 10-12, Chicago, IL
Yahoo!, Big Data, and
Microsoft BI: Bigger and
Better Together
Dianne Cantwell and Denny Lee
Please silence
cell phones
3
Agenda
Yahoo! Business Case for Hadoop and BI
Big Data, Fast Queries
Big Data / BI Themes
Get the Hardware Balance Right
Partitioning, Partitioning, Partitioning
Keep it Simple
It is the order of things
4
Yahoo! manages a
powerful scalable
advertising exchange
that includes publishers
and advertisers
Yahoo! TAO Business Challenge
5
Advertisers want to get
the best bang for their
buck by reaching their
targeted audiences
effectively and efficiently
Yahoo! TAO Business Challenge
6
Yahoo! needs visibility into how consumers
are responding to ads along many
dimensions: web sites, creatives, time of
day, user segments (e.g. gender, age,
location) to make the exchange work as
efficiently and effectively as possible
Yahoo! TAO Business Challenge
7
Yahoo! TAO Technical Requirements
Visitors to Yahoo! branded sites: 680,000,000
Ad Impressions: 3,500,000,000 (per day)
Rows Loaded: 464,000,000,000 (per qtr)
Refresh Frequency: Hourly
Average Query Time: <10 seconds
8
Yahoo! TAO Platform Architecture
How did we load so much so quickly?
[Architecture diagram: Hadoop (Data Aggregation & ETL) -> File 1 … File N -> Oracle 11G RAC (Data Archive & Staging; Partition 1 … Partition N) -> SQL Server Analysis Services 2008 R2 (BI Server; Partition 1 … Partition N). Labels: 2PB cluster; 1.2TB/day; 135GB/day compressed; 24TB cube/qtr.]
9
Yahoo! TAO Platform Architecture
Queries at the “speed of thought”
[Architecture diagram: BI Query Servers (SQL Server Analysis Services 2008 R2; 24TB cube/qtr) serve Adhoc Query/Visualization (Tableau Desktop 7) and an Optimization Application (Custom J2EE App). Avg Query Time labels: 2 secs and 5 secs.]
464B rows of event level data /qtr
• Dimensions: 42
• Attributes: 296
• Measures: 278
10
Yahoo! TAO Return on Investment
For campaigns
optimized using TAO,
advertisers spent
more with Yahoo! than
before
For campaigns
optimized using TAO,
more eCPMs
(revenue)!
11
Yahoo! TAO Return on Investment
Yahoo! TAO exposed customer segment
performance to campaign managers and
advertisers for the first time! No longer
“flying audience blind”
12
Yahoo! TAO Future Direction
Increase Segments by 3x
Increase data size and cartesian
No longer doing distinct count
Built frequency reports and sampling to deliver this due to the inherent complexity!
Current Challenge
Hadoop to SSAS cube (more later)
External access to cubes
More disk due to need for more IO
13
Big Data Analytics Challenges
14
Get the data out!
15
Extracting the data
File Generation
Hadoop jobs create many files that are exported / dumped to disk in tabular format
File Staging
Files are propagated to a staging folder for relational DB access
Oracle External Tables
Generate external tables that point to the staged files
No need to import the data
Processing is slow
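The external-table step above can be sketched as a small DDL generator. This is a minimal illustration, not the deck's actual DDL: the table name, column list, directory object, file names, and comma delimiter are all hypothetical, and only the general Oracle `ORGANIZATION EXTERNAL` / `ORACLE_LOADER` shape is assumed.

```python
def external_table_ddl(table, columns, directory, files, delimiter=","):
    """Render Oracle DDL for an external table over staged flat files.

    All names passed in are illustrative; nothing here comes from the deck.
    """
    cols = ",\n  ".join(f"{name} {sqltype}" for name, sqltype in columns)
    locations = ", ".join(f"'{f}'" for f in files)
    return (
        f"CREATE TABLE {table} (\n  {cols}\n)\n"
        "ORGANIZATION EXTERNAL (\n"
        "  TYPE ORACLE_LOADER\n"
        f"  DEFAULT DIRECTORY {directory}\n"
        "  ACCESS PARAMETERS (\n"
        "    RECORDS DELIMITED BY NEWLINE\n"
        f"    FIELDS TERMINATED BY '{delimiter}'\n"
        "  )\n"
        f"  LOCATION ({locations})\n"
        ")\n"
        "REJECT LIMIT UNLIMITED"
    )


# Hypothetical staged Hadoop output files for one hour of data.
ddl = external_table_ddl(
    table="stg_fact_hour",
    columns=[("event_hour", "DATE"), ("segment_id", "NUMBER"), ("impressions", "NUMBER")],
    directory="STAGE_DIR",
    files=["part-00000", "part-00001"],
)
print(ddl)
```

Because the table only points at the files, swapping in a new hour of data means changing the `LOCATION` list rather than running an import.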
16
AS on Oracle Case
[Diagram: two load paths, Oracle -> Analysis Services and Oracle -> SQL -> Analysis Services. Throughput labels: Oracle OLEDB 10K rows/sec; SSIS Connector 20K rows/sec; 100K rows/sec.]
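To see why these throughput figures matter, here is a rough back-of-the-envelope calculation. It assumes a 91-day quarter and treats each rate as a single load stream; the slide states neither, and it does not say which path the 100K rows/sec label belongs to.

```python
ROWS_PER_QUARTER = 464_000_000_000  # "Rows Loaded" from the requirements slide
DAYS_PER_QUARTER = 91               # assumption: roughly 13 weeks

# Rates from the slide; the 100K label is not attributed to a named driver there.
RATES = {
    "Oracle OLEDB": 10_000,
    "SSIS Connector": 20_000,
    "100K rows/sec path": 100_000,
}


def hours_to_load(rows, rows_per_sec, streams=1):
    """Hours needed to load `rows` at `rows_per_sec` on each of `streams` parallel streams."""
    return rows / (rows_per_sec * streams) / 3600


rows_per_day = ROWS_PER_QUARTER / DAYS_PER_QUARTER
for name, rate in RATES.items():
    print(f"{name}: {hours_to_load(rows_per_day, rate):.1f} h per day of data (one stream)")
```

Under these assumptions a day of data is roughly 5.1 billion rows, which takes about 142 hours at 10K rows/sec and about 14 hours at 100K rows/sec on one stream. Only the fastest path fits inside a day without parallelism.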
17
Passthrough Query to Linked Server
http://msdn.microsoft.com/en-us/library/jj710329.aspx
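A passthrough query to a linked server is usually written with T-SQL `OPENQUERY`, which sends the inner statement to the remote database unchanged. The helper below is a sketch only: the linked server name `ORA_STAGE` and the remote table are hypothetical, and the slide does not show the query that was actually used.

```python
def openquery(linked_server, remote_sql, columns="*"):
    """Wrap a remote SQL statement in a T-SQL OPENQUERY passthrough call.

    Single quotes inside the remote statement are doubled, because the
    statement is passed to OPENQUERY as a T-SQL string literal.
    """
    escaped = remote_sql.replace("'", "''")
    return f"SELECT {columns} FROM OPENQUERY({linked_server}, '{escaped}')"


# Hypothetical linked server and remote table names.
print(openquery("ORA_STAGE", "SELECT * FROM fact_hour WHERE event_day = '2010-10-10'"))
```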
18
Partitioning,
Partitioning,
Partitioning
19
Partitions
Yahoo Example – “Fast” Oracle Load
• Data is streamed in to Oracle to files
• To get max processing, 30 threads are fired because all T (temp) partitions are processed concurrently
• Super fast data loads
• Problem is that it requires constant merging of partitions
[Diagram: files are streamed in as they become available; temp partitions 10/10/10 T360772, 10/10/10 T360773, … 10/10/10 T361645 in Oracle 10g map to matching SSAS partitions, which are then merged into a single 10/10/10 partition.]
20
Partitions – Directly Merging
• New model allows for set hourly partitions
• No more streaming data but with hourly partitions, cannot have as many threads for fast data loads, unless…
• Process multiple cubes or measure groups in parallel
[Diagram: hourly partitions 10/10/10 00:00, 10/10/10 01:00, … 10/10/10 23:00 in Oracle 10g map to hourly SSAS partitions in each of three measure groups: Segments, Activities, Uniques.]
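The hourly model can be sketched as follows: each hour yields one new partition per measure group, and the measure groups are processed concurrently to recover parallelism. `process_partition` is a hypothetical stand-in for whatever command actually processes an SSAS partition, and the date format is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime

MEASURE_GROUPS = ["Segments", "Activities", "Uniques"]  # names from the slide


def partition_name(hour):
    """Name an hourly partition, e.g. '2010-10-10 01:00' (format is an assumption)."""
    return hour.strftime("%Y-%m-%d %H:00")


def process_partition(measure_group, partition):
    # Hypothetical placeholder: a real system would issue a processing
    # command against the cube here and wait for it to finish.
    return f"{measure_group}/{partition}"


def process_hour(hour, measure_groups=MEASURE_GROUPS):
    """Process the same hourly partition in every measure group in parallel."""
    name = partition_name(hour)
    with ThreadPoolExecutor(max_workers=len(measure_groups)) as pool:
        # pool.map returns results in the order of measure_groups.
        return list(pool.map(lambda mg: process_partition(mg, name), measure_groups))


print(process_hour(datetime(2010, 10, 10, 1)))
```

With one partition per hour, the degree of parallelism here is bounded by the number of measure groups (or cubes), which is the trade-off the slide describes against the 30-thread streaming model.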
21
It is the order of things
22
It is the order of things
“I am a Jem'Hadar. He is a Vorta.
It is the order of things."
"Do you really want to give up
your life for the 'order of things'?"
"It is not my life to give up, Captain
– and it never was.”
Rocks and Shoals,
Deep Space Nine
Written by Ronald D. Moore
23
Segments and the importance of sort order
Data File              Sorted (bytes)   Not Sorted (bytes)   % Diff
fact.data              195,708,592      344,502,968          43.19%
agg.rigid.data         106,825,677      106,825,677           0.00%
dim1.dim2.fact.map      17,332,729       32,989,946          47.46%
dim1.dim3.fact.map      16,923,276       32,222,813          47.48%
dim1.dim4.fact.map       6,079,396       12,286,978          50.52%
dim5.dim6.fact.map       2,630,888        6,057,334          56.57%
dim1.dim7.fact.map       1,809,725        3,904,004          53.64%
dim8.dim9.fact.map       1,592,886        3,793,452          58.01%
dim1.dim10.fact.map      1,419,255        3,108,248          54.34%
dim8.dim11.fact.map      1,301,221        3,042,638          57.23%
dim1.dim12.fact.map      2,949,432        2,949,432           0.00%
dim1.dim13.fact.map      2,934,836        2,934,836           0.00%
dimA.dimA.fact.map       1,101,552        2,716,289          59.45%
dim8.dimB.fact.map         961,332        2,451,956          60.79%
dim1.dimC.fact.map       1,027,305        2,323,906          55.79%
dim8.dim8.fact.map       1,592,886        2,308,232          30.99%
dimA.dimD.fact.map         851,095        2,170,962          60.80%
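The effect in the table can be reproduced in miniature. The sketch below uses `zlib` as a stand-in compressor, which is an assumption: Analysis Services uses its own storage format, so only the direction of the effect carries over, not the exact ratios. The `pct_diff` helper matches how the % Diff column appears to be computed, (Not Sorted - Sorted) / Not Sorted.

```python
import random
import zlib


def compressed_size(keys):
    """Bytes needed to store a sequence of small integer keys after compression."""
    return len(zlib.compress(bytes(keys)))


def pct_diff(sorted_size, unsorted_size):
    """Size saved by sorting, as a percentage of the unsorted size."""
    return 100.0 * (unsorted_size - sorted_size) / unsorted_size


rng = random.Random(42)  # fixed seed so the run is repeatable
keys = [rng.randrange(50) for _ in range(100_000)]  # synthetic dimension keys

unsorted_size = compressed_size(keys)
sorted_size = compressed_size(sorted(keys))
print(f"unsorted={unsorted_size} sorted={sorted_size} "
      f"saved={pct_diff(sorted_size, unsorted_size):.2f}%")
```

Sorting groups equal keys into long runs, which a compressor can encode far more compactly than the same keys in arrival order.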
24
Across the Eighth Dimension!
How do you associate dimensions with
Star Trek Into Darkness?
Cube
25
26
Back to cube dimensions
Running ProcessUpdate
Takes a long time to run because all of the fact partitions are re-indexed!
Minimize likelihood by building SCD-2 dimensions
Composite Key based on lowest level unique values to represent row
Sometimes identity can be just as effective, though hashing requires mapping or lookup tables
Create SK to allow for SCD-2 dimensions
Key is that we keep the memory space of the SK small
Composite(Composite) or Hash(Composite) is good for dimensions loaded from fact BUT do not expect Type-2 for fact-based dimensions
Important to call out restatement based on current data (high cost associated with keeping versioned history of dimension tables)
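One way to read the surrogate key (SK) advice is sketched below: a composite of the lowest-level natural values identifies a row, and a lookup table maps it to a small integer SK. The class and the sample values are hypothetical, and the hash variant is shown only to illustrate why hashing still needs a mapping back to the original values.

```python
import hashlib


class SurrogateKeyMap:
    """Assign small, dense integer surrogate keys to composite natural keys."""

    def __init__(self):
        self._by_composite = {}

    def key_for(self, *natural_values):
        composite = tuple(natural_values)
        if composite not in self._by_composite:
            # Dense sequential keys keep the SK value range small.
            self._by_composite[composite] = len(self._by_composite) + 1
        return self._by_composite[composite]


def hash_key(*natural_values):
    """Hash(Composite): fixed-width, but not reversible without a lookup table."""
    joined = "|".join(str(v) for v in natural_values)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


sk = SurrogateKeyMap()
print(sk.key_for("site_a", "creative_1"))  # first composite seen
print(sk.key_for("site_a", "creative_2"))  # a new composite gets the next key
print(sk.key_for("site_a", "creative_1"))  # same composite, same key
print(hash_key("site_a", "creative_1"))
```

For an SCD-2 dimension, a new version of a member would be given a new SK, for example by including an effective date in the composite; that detail is an assumption and is not spelled out on the slide.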
27
Let’s aggregate it up
Thank you!
Editor's Notes

  • #5: Like the NYSE, the Yahoo! ad network behaves like an exchange for display advertising.
      Advertisers are the buyers.
      Publishers (web sites) are the sellers (Yahoo! is one of the publishers).
      Yahoo! needs to create the most efficient exchange possible.
  • #6: Performance display advertising requires that we can:
      Identify the target audience for a campaign.
      Monitor how they behave across a number of different dimensions.
  • #7: Huge opportunity for optimization, but difficult given the large number of discrete dimensions.
  • #8: The number of ad performance factors (i.e. dimensions) and the number of ad impressions per day is huge.
      Yahoo! branded sites attract 680 million unique visitors worldwide.
      3.5B performance display ad impressions are served on the Yahoo! exchange per day.
      Large many-to-many relationships (consumers can be a member of more than one segment).
      Each consumer is a member of an average of 10 segments, which explodes the data by 10x.
      161B rows per quarter for impression data.
      203B rows per quarter for segment data (compressed, but the number of rows processed is really 10x = 2 trillion).
      Given the number of permutations, query performance needs to be speed of thought or the system is useless.
      Traditional ROLAP is too slow.
      Hundreds of dimensions, attributes and metrics create complexity.
      Need integration with good visualization tools to find relevant trends and performance improvement opportunities.
      Data needs to be fresh (from ad impression to query in less than 24 hours) or opportunities are lost.
      Display ad campaigns have very short timeframes (< 2 weeks).
  • #9: Key design concepts are:
      Use standard, off-the-shelf parts.
      Loosely coupled components (using a pull architecture).
      Centralize data aggregation on the grid using Hadoop.
      Leverage Oracle’s external table feature to make data available to SSAS with minimal latency.
      One-to-one match of SSAS partitions to Oracle partitions, so no aggregation is needed and partition pruning is enabled (30+ trillion rows in Oracle tables).
      Maximize parallel loading (90+ threads loading in parallel).
      Separate cube building from cube querying.
      Improvements in HW/Design:
      9h -> 2.5h: Change in HW: IBM x3560 M3, 256GB RAM, 48 cores; EMC Clariion SAN.
      2.5h -> 1.25h: Use of Data Direct / Attunity drivers.
  • #10: Cube is complex due to the nature of the ad business.
      Need to provide an “anything by anything” query environment to find the optimization opportunities.
      If queries aren’t fast, we lose the value.
      Need to update the cube continuously given that there’s limited time to optimize a display ad campaign (data needs to be updated 4x a day at minimum).
      Used SSAS aggregations extensively, which cut down on Hadoop aggregations dramatically.
      Only 8 fact tables loaded (4 areas, 1 detail, 1 aggregate), as opposed to an existing ROLAP application at Yahoo! that requires 3,600 fact (aggregate) tables.
  • #11: Doubled the eCPM (revenue) by allowing our campaign managers to “tune” campaign targeting and creatives.
      Drove an increase in spend from advertisers since they got better performance by advertising through Yahoo!
  • #24: IMPORTANT: Sorting is required for both the source and the cube partition queries.
  • #28: Haven’t used UBO yet due to the 2005 issues.
      Creates own spreadsheet (above) to hand-make aggregations.
      Extremely difficult to make/explain aggs.
      Analysis: once you split, how long is ProcessData vs. ProcessIndexes? This determines whether aggregation creation is the issue or not.