Scaling self service on Hadoop
Sander Kieft
@skieft
Photo credits: niznoz - https://www.flickr.com/photos/niznoz/3116766661/
About me
@skieft
 Manager Core Services at Sanoma
 Responsible for all common services, including the Big Data platform
 Work:
– Centralized services
– Data platform
– Search
 Like:
– Work
– Water(sports)
– Whiskey
– Tinkering: Arduino, Raspberry PI, soldering stuff
27 April 2015
Sanoma, Publishing and Learning company
2 Finnish newspapers, over 100 magazines
5 TV channels in Finland and the Netherlands
200+ websites
100 mobile applications on various mobile platforms
Past
History
< 2008 2009 2010 2011 2012 2013 2014 2015
Self service
Photo credits: misternaxal - http://www.flickr.com/photos/misternaxal/2888791930/
Self service
Self service levels

Personal (full self service): information is created by end users with little or no oversight. Users are empowered to integrate different data sources and make their own calculations.

Departmental (support with publishing dashboards and data loading): information has been created by end users and is worth sharing, but has not been validated.

Corporate (full service and support on dashboards): information that has gone through a rigorous validation process can be disseminated as official data.

The spectrum runs from information workers to information consumers, from full agility to centralized development, and in focus from Excel to static reports.
Self service position – 2010 starting point

Source → Extraction → Transformation → Modeling → Load → Report / Dashboard → Insight

The data team covered source through load; analysts covered the report/dashboard and insight steps.
History
< 2008 2009 2010 2011 2012 2013 2014 2015
Glue: ETL
 EXTRACT
 TRANSFORM
 LOAD
#fail

Learnings
 Hadoop and QlikView proved to be really valuable tools
 Traditional ETL tools don’t scale for Big Data sources
 Big Data projects are not BI projects
 Doing full end-to-end integrations and dashboard development doesn’t scale
 QlikView was not good enough as the front-end to the cluster
 Hadoop requires developers, not BI consultants
History
< 2008 2009 2010 2011 2012 2013 2014 2015
Russell Jurney – Agile Data
Self service position – Agile Data

Source → Extraction → Transformation → Modeling → Load → Report / Dashboard → Insight
New glue
Photo credits: Sheng Hunglin - http://www.flickr.com/photos/shenghunglin/304959443/
ETL Tool features
 Processing
 Scheduling
 Data quality
 Data lineage
 Versioning
 Annotating
Photo credits: atomicbartbeans - http://www.flickr.com/photos/atomicbartbeans/71575328/
Processing – Jython
 No JVM startup overhead when using the Hadoop APIs
 Relatively concise syntax (Python)
 Mix the Python standard library with any Java libs
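A minimal sketch of what such a glue job can look like, assuming nothing about Sanoma's actual code: plain Python standard library for the transform, with the HDFS "upload + move in place" step expressed as the command the job would run. The names (`transform`, `hdfs_put_command`, the paths) are illustrative.

```python
# Hypothetical sketch of an extract/transform glue job: stdlib Python for
# parsing, plus the HDFS put expressed as the shell command a job would run.
import csv
import io

def transform(raw_rows):
    """Normalize raw source rows into the intermediate structure."""
    out = []
    for row in raw_rows:
        # keep only complete rows, lowercase the column names
        if all(value != "" for value in row.values()):
            out.append({key.lower(): value for key, value in row.items()})
    return out

def hdfs_put_command(filename, hdfs_dir):
    """Build the 'upload + move in place' step as a shell command."""
    incoming = hdfs_dir.rstrip("/") + "/_incoming"
    return (f"hadoop fs -put {filename} {incoming}/ && "
            f"hadoop fs -mv {incoming}/{filename} {hdfs_dir.rstrip('/')}/")

raw = list(csv.DictReader(io.StringIO("ID,Clicks\n1,10\n2,\n3,7\n")))
rows = transform(raw)
cmd = hdfs_put_command("clicks.csv", "/data/staging/clicks")
```

Under Jython the same script could call the Hadoop FileSystem API directly instead of shelling out, which is where the "no JVM startup overhead" point comes from.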
Scheduling - Jenkins
 Flexible scheduling with dependencies
 Saves output
 E-mails on errors
 Scales to multiple nodes
 REST API
 Status monitor
 Integrates with version control
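Jenkins's REST API is what makes the scheduling scriptable. A small sketch of how a job could trigger a downstream parameterized build; the base URL and job name are made up, only the `/job/<name>/buildWithParameters` endpoint shape is standard Jenkins, and an actual trigger would POST this URL with an API token.

```python
# Hypothetical sketch: compose the Jenkins REST URL that triggers a
# downstream parameterized ETL job. Job names and host are illustrative.
from urllib.parse import urlencode

def build_trigger_url(base_url, job_name, params):
    """Compose the buildWithParameters URL for a parameterized job."""
    query = urlencode(sorted(params.items()))  # sorted for a stable URL
    return f"{base_url.rstrip('/')}/job/{job_name}/buildWithParameters?{query}"

url = build_trigger_url(
    "https://jenkins.example.com/",
    "etl-load-qlikview",
    {"SOURCE": "adserving", "DATE": "2015-04-27"},
)
```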
ETL Tool features
 Processing – Bash & Jython
 Scheduling – Jenkins
 Data quality
 Data lineage
 Versioning – Mercurial (hg)
 Annotating – commenting the code 
Processes
Photo credits: Paul McGreevy - http://www.flickr.com/photos/48379763@N03/6271909867/
Independent jobs

Source (external)
→ HDFS upload + move in place →
Staging (HDFS)
→ MapReduce + HDFS move →
Hive-staging (HDFS)
→ Hive: map external table + SELECT INTO →
Hive
Typical data flow – Extract (Jenkins → Hadoop)
1. Get the job code from hg (Mercurial)
2. Get the data from S3, FTP or an API
3. Store the raw source data on HDFS
Typical data flow – Transform (Jenkins → Hadoop)
1. Get the job code from hg (Mercurial)
2. Execute a MapReduce job or Hive query
3. Transform the data to the intermediate structure
Typical data flow – Load (Jenkins → Hadoop → QlikView)
1. Get the job code from hg (Mercurial)
2. Execute a Hive query
3. Load the data into QlikView
Out of order jobs
 At any point, you don’t really know what ‘made it’ into Hive
 It will happen anyway, because some days the data delivery is going to be three hours late
 Or you get half in the morning and the other half later in the day
 How much this matters depends on what you do with the data
 This is where metrics + a fixable data store help...
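The "metrics" half of that remedy can be sketched as a simple fit-for-purpose check: compare what made it into the partition against what the delivery said it contained. The function name and the 0.95 threshold are hypothetical; as the slide says, the right threshold depends on what you do with the data.

```python
# Hypothetical sketch of a metrics check: only mark a Hive partition fit
# for purpose when enough of the delivered rows actually arrived.
def fit_for_purpose(loaded_rows, delivered_rows, threshold=0.95):
    """True when enough of the delivered rows made it into the partition."""
    if delivered_rows == 0:
        return False
    return loaded_rows / delivered_rows >= threshold

morning_ok = fit_for_purpose(50_000, 100_000)    # half the delivery arrived
afternoon_ok = fit_for_purpose(99_000, 100_000)  # the rest arrived later
```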
Fixable data store
 Using Hive partitions
 Jobs that move data from staging create partitions
 When new data / insight about the data arrives, drop the partition and re-insert
 Be careful to reset any metrics in this case
 Basically: instead of trying to make everything transactional, repair afterwards
 Use metrics to determine whether data is fit for purpose
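The "drop the partition and re-insert" repair amounts to two generated Hive statements. A sketch, with illustrative table and partition names (only the repair-instead-of-transactions pattern is from the slides):

```python
# Hypothetical sketch of the fixable-data-store repair: drop the affected
# Hive partition and re-insert it from staging when corrected data arrives.
def repair_partition(table, dt):
    drop = f"ALTER TABLE {table} DROP IF EXISTS PARTITION (dt='{dt}')"
    reinsert = (f"INSERT OVERWRITE TABLE {table} PARTITION (dt='{dt}') "
                f"SELECT * FROM {table}_staging WHERE dt='{dt}'")
    # As the slide warns: reset any derived metrics for this partition too.
    return [drop, reinsert]

stmts = repair_partition("clicks", "2015-04-27")
```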
Photo credits: DL76 - http://www.flickr.com/photos/dl76/3643359247/
Enabling self service

Source systems (FI, NL) → Extract → Staging: raw data in native format (HDFS) → Transform → Normalized data structure (Hive) → data scientists and business analysts

Reusable stem data sits alongside the sources: weather, international event calendars, international AdWords prices, exchange rates.
Enabling self service – tooling

The same Netherlands/Finland flow (staging on HDFS, normalized structure in Hive, reusable stem data), with the access tools in place: HUE & Hive for queries, QlikView for dashboards, R Studio Server for analysis, and Jenkins for scheduling.
Architecture

High Level Architecture

Sources (Back Office, Analytics, AdServing (incl. mobile, video, CPC, yield, etc.), Market, Content, Subscription) and meta data feed the platform.
Jenkins & Jython ETL runs scheduled loads into Hadoop.
Hive & HUE serve ad hoc queries; scheduled exports feed the QlikView dashboards/reporting; R Studio provides advanced analytics.
Current state – Real time
 Extending our own collection infrastructure
 Using it to drive recommendations, real-time dashboarding, and user segmentation and targeting in real time
 Using Kafka and Storm
High Level Architecture – with real time

The batch side is unchanged: sources (Back Office, Analytics, AdServing (incl. mobile, video, CPC, yield, etc.), Market, Content, Subscription) and meta data flow through Jenkins & Jython ETL as scheduled loads into Hadoop, with Hive & HUE for ad hoc queries, scheduled exports to QlikView dashboards/reporting, and R Studio for advanced analytics.
Added for real time: CLOE (real-time collecting) feeds Flume / Kafka, Storm handles stream processing, and recommendations & online learning drive the recommendations / learn-to-rank services.
Sanoma Media The Netherlands infra (DC1 / DC2)

Colocation – Big Data platform:
• No SLA (yet)
• Limited support, business days 9-5
• No full system backups
• Managed by us, with systems department help

Managed hosting – production VMs and dev, test and acceptance VMs:
• 99.98% availability
• 24/7 support
• Multi-DC
• Full system backups
• High-performance EMC SAN storage
• Managed by a dedicated systems department
Collecting, processing and storage are split into batch and real-time paths. The collection queue can buffer data for 3 days, and the recommendation service can run on stale data for 3 days.
Present
Self service position – Present

Source → Extraction → Transformation → Modeling → Load → Report / Dashboard → Insight
Present
< 2008 2009 2010 2011 2012 2013 2014 2015
YARN
Moving to YARN (CDH 4.3 → 5.1)

Photo credit: "Blasting frankfurt" by Heptagon, own work, CC BY-SA 3.0, via Wikimedia Commons – http://commons.wikimedia.org/wiki/File:Blasting_frankfurt.jpg
 Functional testing wasn’t enough
 It takes time to tune the parameters
 The defaults are NOT good enough
 Cluster deadlocks
 Grouped nodes with similar hardware requirements into node groups
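One concrete deadlock fix from the speaker notes: with a low reducer slowstart threshold, many concurrent jobs can each have all their reducers allocated while the remaining mappers still wait for slots, and the Fair Scheduler does not detect this, so everything hangs. The 0.99 value is from the notes; the snippet below is just that one property in the standard mapred-site.xml format.

```xml
<!-- mapred-site.xml: start reducers only once 99% of maps have completed,
     so waiting reducers cannot starve the remaining mappers of slots -->
<property>
  <name>mapreduce.job.reduce.slowstart.completedmaps</name>
  <value>0.99</value>
</property>
```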
CDH 4 to 5 specific
 Hive CLI -> Beeline
 Hive Locking
Current state – Use cases
 Combining various data sources (web analytics + advertising + user profiles)
 A/B testing + deeper analyses
 Ad spend ROI optimization & attribution
 Ad auction price optimization
 Ad retargeting
 Item- and user-based recommendations
 Search optimizations
– Online learn to rank
– Search suggestions
Current state – Usage
 Main use case is reporting and analytics, with increasing data science workloads
 Sanoma’s standard data platform, used in all Sanoma countries
 > 250 daily dashboard users
 40 daily users: analysts & developers
 43 source systems, with 125 different sources
 400 tables in Hive
Current state – Tech and team
 Platform Team:
– 1 product owner
– 3 developers
 Close collaborators:
– ~10 data scientists
– 1 QlikView application manager
– ½ (system) architect
 Platform:
– 50-60 nodes
– > 600TB storage
– ~3000 jobs/day
 Typical nodes
– 1-2 CPU 4-12 cores
– 2 system disks (RAID 1)
– 4 data disks (2TB, 3TB or 4TB)
– 24-32GB RAM
Challenges
Photo credits: http://kdrunlimited1.blogspot.hu/2013/01/late-edition-squirrel-appreciation-day.html
1. Security
2. Increase SLAs
3. Integration with other BI environments
4. Improve self service
5. Data quality
6. Code reuse
Future

What’s next
 Better integration of batch and real time (unified code base)
 Improve the availability and SLAs of the platform
 Optimize job scheduling (resource pools)
 Automated query/job optimization tips for analysts
 Fix Jenkins: it is a single point of failure
 Move some NLP (natural language processing) and image recognition workloads to Hadoop
Thank you!
Questions?
@skieft
sander.kieft@sanoma.com

Editor's Notes

  • #3: Poultry
  • #7: Mainly web analytics; Forrester; 1 dev + 1 BI
  • #10: Business Intelligence self service sweet spot
  • #26: ETL options considered: Pentaho, SAS DI Studio, Informatica, Oozie, other
  • #55: Cluster deadlocks can appear in moments of high utilization: a low reducer slowstart threshold can result in many jobs with all reducers already allocated, taking up the slots for the remaining mappers which still need to finish; apparently the Fair Scheduler is not smart enough to detect this, so all jobs hang (deadlock). Solution: increase the reducer slowstart threshold (mapreduce.job.reduce.slowstart.completedmaps) to 0.99.