Yahoo!, Big Data, and Microsoft BI: Bigger and Better Together

April 10-12, Chicago, IL
Dianne Cantwell and Denny Lee

Agenda
Yahoo! Business Case for Hadoop and BI
Big Data, Fast Queries
Big Data / BI Themes
Get the Hardware Balance Right
Partitioning, Partitioning, Partitioning
Keep it Simple
It is the order of things

Yahoo! TAO Business Challenge
Yahoo! manages a powerful, scalable advertising exchange that includes publishers and advertisers.

Advertisers want to get the best bang for their buck by reaching their targeted audiences effectively and efficiently.

Yahoo! needs visibility into how consumers are responding to ads along many dimensions (web sites, creatives, time of day, and user segments such as gender, age, and location) to make the exchange work as efficiently and effectively as possible.

Yahoo! TAO Technical Requirements
Visitors to Yahoo! branded sites: 680,000,000
Ad Impressions: 3,500,000,000 (per day)
Rows Loaded: 464,000,000,000 (per qtr)
Refresh Frequency: Hourly
Average Query Time: <10 seconds

Yahoo! TAO Platform Architecture
How did we load so much so quickly?
[Diagram: load pipeline]
• Data Aggregation & ETL: Hadoop (2PB cluster)
• Data Archive & Staging: Oracle 11G RAC (File 1, File 2, … File N; Partition 1, Partition 2, … Partition N)
• BI Server: SQL Server Analysis Services 2008 R2 (Partition 1, Partition 2, … Partition N; 24TB cube/qtr)
• Data volumes shown on the diagram: 1.2TB/day; 135GB/day compressed

Yahoo! TAO Platform Architecture
Queries at the “speed of thought”
[Diagram: query tier]
• BI Query Servers: SQL Server Analysis Services 2008 R2 (24TB cube/qtr)
• 464B rows of event-level data/qtr
• Dimensions: 42
• Attributes: 296
• Measures: 278
• Adhoc Query/Visualization: Tableau Desktop 7
• Optimization Application: Custom J2EE App
• Avg Query Time (two figures shown on the diagram): 2 secs and 5 secs

Yahoo! TAO Return on Investment
• For campaigns optimized using TAO, advertisers spent more with Yahoo! than before
• For campaigns optimized using TAO, more eCPMs (revenue)!

Yahoo! TAO Return on Investment
Yahoo! TAO exposed customer segment performance to campaign managers and advertisers for the first time! No longer “flying audience blind”.

Yahoo! TAO Future Direction
• Increase segments by 3x
• Increase data size and Cartesian
• No longer doing distinct count: built frequency reports and sampling to deliver this instead, due to the inherent complexity!
Current Challenge
• Hadoop to SSAS cube (more later)
• External access to cubes
• More disk due to need for more IO

Big Data Analytics Challenges

Extracting the data
File Generation
• Hadoop jobs create many files that are exported (dumped) to disk in tabular format
File Staging
• Files are propagated to a staging folder for relational database access
Oracle External Tables
• Generate external tables that point to the staged files
• No need to import the data
• Processing is slow
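
The external-table step above can be sketched as a small generator for the Oracle DDL. This is only an illustration, not the DDL Yahoo! used: the table name, columns, directory object, file names, and the comma delimiter are all hypothetical. Only the general `CREATE TABLE ... ORGANIZATION EXTERNAL` form with the `ORACLE_LOADER` access driver is standard Oracle syntax.

```python
def external_table_ddl(table, columns, directory, files, delimiter=","):
    """Build an Oracle CREATE TABLE ... ORGANIZATION EXTERNAL statement.

    All arguments are illustrative; none of these names come from the slides.
    """
    if not columns or not files:
        raise ValueError("need at least one column and one staged file")
    col_defs = ",\n  ".join(f"{name} {sql_type}" for name, sql_type in columns)
    locations = ", ".join(f"'{f}'" for f in files)
    return (
        f"CREATE TABLE {table} (\n  {col_defs}\n)\n"
        "ORGANIZATION EXTERNAL (\n"
        "  TYPE ORACLE_LOADER\n"
        f"  DEFAULT DIRECTORY {directory}\n"
        "  ACCESS PARAMETERS (\n"
        "    RECORDS DELIMITED BY NEWLINE\n"
        f"    FIELDS TERMINATED BY '{delimiter}'\n"
        "  )\n"
        f"  LOCATION ({locations})\n"
        ")\n"
        "REJECT LIMIT UNLIMITED"
    )


# Hypothetical staged Hadoop output files and a hypothetical fact layout
ddl = external_table_ddl(
    table="ext_fact_hourly",
    columns=[("event_hour", "VARCHAR2(16)"), ("segment_id", "NUMBER"), ("impressions", "NUMBER")],
    directory="staging_dir",
    files=["part-00000", "part-00001"],
)
print(ddl)
```

Because the table only points at the staged files, nothing is imported; each query re-reads the files, which is consistent with the slide's note that processing is slow.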

AS on Oracle Case
Load paths shown on the diagram:
• Oracle → Analysis Services
• Oracle → SQL → Analysis Services
Throughput figures shown on the diagram:
• Oracle OLEDB: 10K rows/sec
• SSIS Connector: 20K rows/sec
• 100K rows/sec
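
To see why these throughput figures matter for an hourly refresh, a rough back-of-the-envelope calculation helps. The sketch below assumes a 91-day quarter, rows arriving evenly across the quarter, and a single load stream; these are simplifying assumptions, not figures from the deck. The row count and the three rates are the ones quoted on the slides.

```python
ROWS_PER_QUARTER = 464_000_000_000  # "Rows Loaded" from the requirements slide
DAYS_PER_QUARTER = 91               # assumption: roughly 13 weeks

# Assumption: rows are spread evenly over every hour of the quarter
rows_per_hour = ROWS_PER_QUARTER / (DAYS_PER_QUARTER * 24)


def hours_to_load(rows, rows_per_sec):
    """Hours a single stream needs to move `rows` at `rows_per_sec`."""
    return rows / rows_per_sec / 3600


# Throughput figures quoted on the slide (rows/sec)
RATES = {"10K rows/sec": 10_000, "20K rows/sec": 20_000, "100K rows/sec": 100_000}

for label, rate in RATES.items():
    print(f"{label}: {hours_to_load(rows_per_hour, rate):.2f} hours per hourly batch")
```

Under these assumptions one hour of data is a little over 200 million rows. A single 10K rows/sec stream would need close to six hours to load it, while 100K rows/sec brings it under an hour, which is why both the faster path and parallel partition processing matter.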

Passthrough Query to Linked Server
http://msdn.microsoft.com/en-us/library/jj710329.aspx
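
In SQL Server, a pass-through query to a linked server is commonly written with T-SQL `OPENQUERY`, which sends the inner statement to the remote server as-is. The helper below is a minimal sketch of building such a statement. The linked server name `ORA_TAO` and the remote query are hypothetical, and the deck does not show the exact query that was used.

```python
import re


def openquery(linked_server, remote_sql):
    """Wrap a remote query in a T-SQL OPENQUERY pass-through statement.

    Single quotes inside the remote query are doubled, because the query is
    passed to OPENQUERY as a string literal.
    """
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", linked_server):
        raise ValueError("linked server name must be a simple identifier")
    escaped = remote_sql.replace("'", "''")
    return f"SELECT * FROM OPENQUERY({linked_server}, '{escaped}')"


# Hypothetical linked server and Oracle-side query for one hourly partition
statement = openquery(
    "ORA_TAO",
    "SELECT * FROM fact_hourly WHERE event_hour = '10/10/10 00:00'",
)
print(statement)
```

The point of the pass-through form is that the remote database executes the inner query itself, so filtering to one partition's rows happens on the Oracle side before any data crosses the wire.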

Partitioning, Partitioning, Partitioning

Partitions
Yahoo Example – “Fast” Oracle Load
• Data is streamed into Oracle from files as they become available
• To maximize processing throughput, 30 threads are fired because all T (temp) partitions are processed concurrently
• Super fast data loads
• Problem is that it requires constant merging of partitions
[Diagram: files are streamed in as they become available and land in temp partitions (10/10/10 T360772, 10/10/10 T360773, … 10/10/10 T361645) in Oracle 10g; the matching temp partitions in SSAS are then merged into the 10/10/10 partition]

Partitions – Directly Merging
• New model allows for set hourly partitions
• No more streaming data, but with hourly partitions we cannot have as many threads for fast data loads, unless…
• Process multiple cubes or measure groups in parallel
[Diagram: Oracle 10g hourly partitions (10/10/10 00:00, 10/10/10 01:00, … 10/10/10 23:00) feed matching hourly SSAS partitions for Segments, Activities, and Uniques]
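
The idea of recovering parallelism by processing several measure groups at once can be sketched as follows. The measure group names come from the slide; `process_partition` is a hypothetical placeholder for whatever command actually processes an SSAS partition, and the thread-pool approach is an illustration rather than the mechanism Yahoo! used.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timedelta

MEASURE_GROUPS = ["Segments", "Activities", "Uniques"]  # names from the slide


def hourly_partitions(day):
    """Return the 24 hourly partition labels for a day given as MM/DD/YY."""
    start = datetime.strptime(day, "%m/%d/%y")
    return [(start + timedelta(hours=h)).strftime("%m/%d/%y %H:%M") for h in range(24)]


def process_partition(measure_group, partition):
    # Hypothetical placeholder: a real job would send a process command to SSAS
    return f"{measure_group} {partition}"


def process_hour(partition):
    """Process the same hourly partition in every measure group concurrently."""
    with ThreadPoolExecutor(max_workers=len(MEASURE_GROUPS)) as pool:
        return list(pool.map(lambda mg: process_partition(mg, partition), MEASURE_GROUPS))


# Hours are handled in order; within each hour the measure groups run in parallel
processed = [result for p in hourly_partitions("10/10/10") for result in process_hour(p)]
print(len(processed), processed[0], processed[-1])
```

With one partition per hour, the degree of parallelism per load is bounded by the number of cubes or measure groups processed together, which is the trade-off the slide describes against the 30-thread temp-partition model.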

It is the order of things
“I am a Jem'Hadar. He is a Vorta. It is the order of things.”
“Do you really want to give up your life for the 'order of things'?”
“It is not my life to give up, Captain – and it never was.”
Rocks and Shoals, Deep Space Nine
Written by Ronald D. Moore

Segments and the importance of sort order
Data File                 Sorted   Not Sorted   % Diff
fact.data            195,708,592  344,502,968   43.19%
agg.rigid.data       106,825,677  106,825,677    0.00%
dim1.dim2.fact.map    17,332,729   32,989,946   47.46%
dim1.dim3.fact.map    16,923,276   32,222,813   47.48%
dim1.dim4.fact.map     6,079,396   12,286,978   50.52%
dim5.dim6.fact.map     2,630,888    6,057,334   56.57%
dim1.dim7.fact.map     1,809,725    3,904,004   53.64%
dim8.dim9.fact.map     1,592,886    3,793,452   58.01%
dim1.dim10.fact.map    1,419,255    3,108,248   54.34%
dim8.dim11.fact.map    1,301,221    3,042,638   57.23%
dim1.dim12.fact.map    2,949,432    2,949,432    0.00%
dim1.dim13.fact.map    2,934,836    2,934,836    0.00%
dimA.dimA.fact.map     1,101,552    2,716,289   59.45%
dim8.dimB.fact.map       961,332    2,451,956   60.79%
dim1.dimC.fact.map     1,027,305    2,323,906   55.79%
dim8.dim8.fact.map     1,592,886    2,308,232   30.99%
dimA.dimD.fact.map       851,095    2,170,962   60.80%
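
The % Diff column appears to be the reduction relative to the unsorted size, that is, (Not Sorted − Sorted) / Not Sorted. The short sketch below recomputes it for three rows of the table as a check. The function name is ours, and the deck does not state the unit of the size figures.

```python
def pct_diff(sorted_size, unsorted_size):
    """Reduction from sorting, as a percentage of the unsorted size."""
    if unsorted_size <= 0:
        raise ValueError("unsorted_size must be positive")
    return (unsorted_size - sorted_size) / unsorted_size * 100


# (Sorted, Not Sorted) pairs copied from the table
ROWS = {
    "fact.data": (195_708_592, 344_502_968),
    "agg.rigid.data": (106_825_677, 106_825_677),
    "dim8.dim8.fact.map": (1_592_886, 2_308_232),
}

results = {name: round(pct_diff(s, u), 2) for name, (s, u) in ROWS.items()}
for name, pct in results.items():
    print(f"{name}: {pct:.2f}%")
```

The recomputed values match the table (43.19%, 0.00%, and 30.99%), so sorting the data before processing shrinks most of these files by roughly half.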

Across the Eighth Dimension!
How do you associate dimensions with Star Trek Into Darkness?

Back to cube dimensions
Running ProcessUpdate
• Takes a long time to run because all of the fact partitions are re-indexed!
• Minimize the likelihood by building SCD-2 (slowly changing dimension, Type 2) dimensions
Composite key based on the lowest-level unique values to represent a row
• Sometimes an identity can be just as effective, though hashing requires mapping or lookup tables
Create a surrogate key (SK) to allow for SCD-2 dimensions
• Key is that we keep the memory space of the SK small
• Composite(Composite) or Hash(Composite) is good for dimensions loaded from fact, BUT do not expect Type-2 for fact-based dimensions
• Important to call out restatement based on current data (high cost associated with keeping versioned history of dimension tables)
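
The two key strategies mentioned above can be contrasted in a minimal sketch. Everything here is illustrative: the attribute values are made up, the in-memory dictionary stands in for a mapping or lookup table, and MD5 is just one possible hash. The deck does not say which hash function or key width was used.

```python
import hashlib


class SurrogateKeyMap:
    """Assign small sequential integer keys to composite natural keys.

    The dictionary plays the role of a mapping/lookup table.
    """

    def __init__(self):
        self._keys = {}

    def key_for(self, *composite):
        # First sighting of a composite gets the next integer; repeats reuse it
        if composite not in self._keys:
            self._keys[composite] = len(self._keys) + 1
        return self._keys[composite]


def hash_key(*composite):
    """Stateless alternative: derive a key by hashing the composite values."""
    joined = "\x1f".join(str(part) for part in composite)  # unit separator avoids ambiguity
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


sk = SurrogateKeyMap()
a = sk.key_for("US", "female", "25-34")       # hypothetical segment attributes
b = sk.key_for("US", "male", "25-34")
a_again = sk.key_for("US", "female", "25-34")
print(a, b, a_again, hash_key("US", "female", "25-34"))
```

The sequential key stays small but needs the lookup state to be kept and shared. The hash needs no state to compute, but it is much wider than a small integer, so a mapping back to a compact SK may still be worthwhile when memory footprint matters.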

Thank you!