Successes, Challenges and Pitfalls Migrating a SAAS Business to Hadoop
Shaun Klopfenstein, CTO
Eric Kienle, Chief Architect
The Vision
Requirements
Business Requirements
• Near real-time activity processing
• 1 billion activities per customer per day
• Improve cost efficiency of operations while scaling up
• Global enterprise grade security and governance
Architecture Requirements
• Maximize utilization of hardware
• Multitenancy support with fairness
• Encryption, Authorization & Authentication
• Applications must scale horizontally
Technology Bake Off
Bake Off
• Technology Selection
• Storm/Spark Streaming
• HBase/Cassandra
• Built a POC with each permutation, plus Kafka
• Load tested with one day of web traffic
The Winner Is… Our First Challenge
• We hoped to find a clear winner… we didn’t, exactly
• The truth is, all the POCs worked at the scale we tested
• It’s possible that if we had scaled up the test, we would have found more differences
How We Chose
• Community
• Features
• Team Skillset
• History
• The winners: HBase/Kafka/Spark Streaming
Architecture & Design
Marketo Lambda Architecture
[Diagram: inbound sources (Web Activity, RTP Activity, Mobile Activity, CRM Sync, Partner APIs, Other Marketing Activities) feed the Ingestion Processor (Scala/Tomcat) and the Kafka Event Stream; Spark Streaming consumers (Campaign Triggers, Solr Indexing, Email Report Loader, Web Activity Processor) read from Kafka and write to HBase/HDFS and Solr, which serve the Marketo UI (Campaign Detail, Lead Detail), CRM Sync, Revenue Cycle Analytics, APIs, and other clients.]
High Level Architecture
• Enhanced Lambda Architecture
• Inbound activities are written to the Ingestion Processor
  • HBase first, then Kafka
• High-volume (e.g. web) activities
  • First written to Kafka, then enriched
• Spark Streaming applications consume events from Kafka (see the sketch below)
  • Solr Indexing
  • Email Reports
  • Campaign Processing
• HBase is used for simple historical queries, and is the system of record
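To make the consumer side concrete, here is a minimal sketch of the pattern described above, using the Spark 1.x spark-streaming-kafka direct stream API (the integration current when this deck was written; later Spark versions use a different API). The application name, broker list, topic name, and processing body are illustrative placeholders, not Marketo's actual code.

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object ActivityConsumer {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("activity-consumer"), Seconds(5))

        // Direct (receiver-less) stream: one RDD partition per Kafka partition
        val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092") // placeholder brokers
        val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, Set("activities")) // hypothetical topic name

        stream.foreachRDD { rdd =>
          rdd.foreachPartition { events =>
            events.foreach { case (_, activityJson) =>
              // enrich the activity and write it to the serving stores (HBase, Solr, ...)
              ()
            }
          }
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }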
Build It
Implementation
Building Expertise
• We had a few people with Hadoop and Spark experience
• We decided to grow knowledge in house
• Focus on training - Hortonworks boot camp for operations
• In-house courses and tech talks for engineering/QE
Building Expertise - Successes
• Critical to kick-start the project
• Built excitement
• Created foundation for the design process
Building Expertise – Context Challenge
Challenge
• Training packed a lot of information into a short period
• Teams that didn’t leverage the training right away lost context
Recommendation
• Create environments for hands-on experience early
• Hands-on experience across all teams right after training
Building Expertise – Experience Challenge
Challenge
• Hadoop technology is like playing a piano… knowing how to read
music doesn’t mean you can play
• Many ways to design, configure, manage - Only a few right ways
and the reasons can be subtle
Recommendation
• Find your experts!
• Partner and hire
Building Our First Cluster
• Initial sizing and capacity planning of the first Hadoop clusters
• Perform load tests to get an initial capacity plan
• Decided that disk I/O and storage would be the leading indicator
• Went with industry best practice on hardware and network configuration
Building Our First Cluster – Success
• The leading indicator ended up being compute
• But cluster sizing ended up being close enough to start
• Clusters can always be expanded… so don’t get too hung up
Building Our First Cluster – ZooKeeper & VMs
Challenge
• We started with ZooKeeper virtualized
• It didn’t perform properly (we think because of disk I/O)
• Caused random outages
Recommendation
• We ended up migrating ZooKeeper to physical boxes
• Don’t use VMs for ZooKeeper!
Security
• All data at rest must be encrypted
• Applications sharing Hadoop must be isolated
from each other
• Applications must have hard quotas for both
compute and disk resources
Security - Success
• Enabled Kerberos security for the Hadoop cluster
• Kerberos allowed us to leverage HDFS native encryption
• Used encrypted disks for Kafka servers
• Created separate secure YARN queues to isolate applications
• Each application uses a separate Kerberos principal (see the login sketch below)
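As an illustration of the per-application-principal approach, here is a minimal Scala sketch using Hadoop's standard UserGroupInformation API. The principal name and keytab path are made-up examples, not Marketo's actual configuration.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.security.UserGroupInformation

    object SecureLogin {
      def login(principal: String, keytab: String): Unit = {
        val conf = new Configuration()
        conf.set("hadoop.security.authentication", "kerberos")
        UserGroupInformation.setConfiguration(conf)
        // Each application authenticates with its own principal/keytab,
        // so YARN queues and HDFS permissions can isolate it from its neighbors
        UserGroupInformation.loginUserFromKeytab(principal, keytab)
      }
    }

    // e.g. SecureLogin.login("solr-indexer@EXAMPLE.COM",
    //                        "/etc/security/keytabs/solr-indexer.keytab")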
Security – Kerberos Challenge
Challenge
• Kerberos can’t be added to a Hadoop cluster without prolonged
downtime and patches
• Needed weeks of developer time to accommodate security changes
• Added several months to the overall rollout schedule
Recommendation
• Allow extra time for Kerberos
• Educate your team beforehand, find an expert to guide you
• Be prepared for different levels of Kerberos support across the
Hadoop ecosystem
Security – Kafka and Spark Challenge
Challenge
• Kafka doesn’t support data encryption (and won’t)
• The HDP version we had didn’t fully support Kerberized Kafka and Spark clients
Recommendation
• Move Kafka and Spark out of Ambari
• Only encrypt Kafka data if you absolutely must, as it adds complexity (see the sketch below)
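If you do have to protect Kafka payloads, the usual workaround is application-level encryption, since the broker won't do it for you. A rough producer-side sketch, assuming AES keys and IVs come from an external key-management system (key handling, IV uniqueness, and the matching consumer-side decryption are deliberately omitted; the topic name is hypothetical):

    import javax.crypto.Cipher
    import javax.crypto.spec.{IvParameterSpec, SecretKeySpec}
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    object EncryptingProducer {
      // Encrypt one payload; key and IV must come from your key-management system
      def encrypt(key: Array[Byte], iv: Array[Byte], plaintext: Array[Byte]): Array[Byte] = {
        val cipher = Cipher.getInstance("AES/CBC/PKCS5Padding")
        cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv))
        cipher.doFinal(plaintext)
      }

      def send(producer: KafkaProducer[Array[Byte], Array[Byte]],
               key: Array[Byte], iv: Array[Byte], activity: Array[Byte]): Unit =
        producer.send(
          new ProducerRecord[Array[Byte], Array[Byte]]("activities", encrypt(key, iv, activity)))
    }

The decryption step on every consumer, plus key rotation, is exactly the complexity the slide warns about.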
Test It
Validation
• Changing the engines on a plane while in flight is hard
• Required all components to implement a “passive mode”
• The new code ran in the background and continuously compared results with the legacy system
• Automated functional tests kicked off from Jenkins
• Performance testing in AWS
Validation - Success
• Passive mode is one of the best moves we made!
• Allowed for testing of components with real-world data and load
• Found countless performance and logic issues with minimal operational impact
Validation – Passive Mode “Minimal Impact”
Challenge
• By design, passive mode wrote to both the legacy and Hadoop systems
• We impacted performance during an outage of our cluster
Recommendation
• Use asynchronous writes or tight timeouts in passive mode (see the sketch below)
• Monitoring for the Hadoop cluster should be in place before passive testing
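A minimal sketch of the asynchronous shadow-write idea, assuming hypothetical writeLegacy/writeHadoop persistence calls (both are stand-ins, not Marketo's real interfaces). The point is that a Hadoop outage can only cost a logged error, never customer-facing latency:

    import scala.concurrent.Future
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.util.{Failure, Success}

    object PassiveModeWriter {
      def writeLegacy(activity: String): Unit = () // stand-in: synchronous write to the system of record
      def writeHadoop(activity: String): Unit = () // stand-in: write to the new pipeline under test

      def write(activity: String): Unit = {
        writeLegacy(activity) // the customer-facing path stays unchanged

        // Shadow write runs off the request thread; failures are recorded, never propagated
        Future(writeHadoop(activity)).onComplete {
          case Failure(e) => println(s"passive-mode write failed: ${e.getMessage}") // log + metric in practice
          case Success(_) => ()
        }
      }
    }

If you must block for comparison purposes, the alternative the slide suggests is a tight timeout on the shadow write rather than a fully detached future.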
Deploying It
Migration and Management
• We are here!
• Migrate over 6,000 subscriptions with no service interruption
or data loss
• Track and monitor migration and provide management tools
for the new platform
• Achieve the end goal of removing the safety net
Migration and Management - Successes
• Created a new management console called Sirius
• Close architectural coordination of all teams during
migration
• If problems arose, we had a quick, automated fallback path to the legacy system
• Daily cross-functional standup meetings to track the
rollout
Migration and Management Challenges
Challenge
• Oozie workflows can be challenging to build and debug
• Capacity planning and resource management in the shared Hadoop
cluster is very complex
Recommendation
• Only use Oozie workflows for automating complex or long running
processes, or use a different orchestration platform
• Constantly reevaluate your capacity plan based on current deployment
Running It
Monitoring
• Needed to monitor hundreds of new Hadoop and other
infrastructure servers
• Our custom Spark Streaming applications required all
new metrics and monitors
• Capacity planning requires trend analysis of both the
infrastructure and our applications
• Don’t overwhelm our already busy Cloud Platform Team
Monitoring - Successes
• Built a custom monitoring infrastructure using OpenTSDB and Grafana (see the sketch below)
• Added business SLA metrics to our Sirius console to provide real-time alerts
• Added comprehensive Hadoop monitors into our pre-existing production monitoring system
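For flavor, here is a bare-bones metric push to OpenTSDB's HTTP /api/put endpoint. The host name, metric name, and tags are invented, and a real reporter would batch datapoints and reuse connections:

    import java.net.{HttpURLConnection, URL}
    import java.nio.charset.StandardCharsets

    object TsdbReporter {
      def put(metric: String, value: Double, tags: Map[String, String]): Unit = {
        val tagJson = tags.map { case (k, v) => "\"" + k + "\":\"" + v + "\"" }.mkString(",")
        val body = s"""{"metric":"$metric","timestamp":${System.currentTimeMillis / 1000},"value":$value,"tags":{$tagJson}}"""
        val conn = new URL("http://opentsdb.example.internal:4242/api/put")
          .openConnection().asInstanceOf[HttpURLConnection]
        conn.setRequestMethod("POST")
        conn.setRequestProperty("Content-Type", "application/json")
        conn.setDoOutput(true)
        conn.getOutputStream.write(body.getBytes(StandardCharsets.UTF_8))
        conn.getResponseCode // OpenTSDB returns 204 on success
        conn.disconnect()
      }
    }

    // e.g. TsdbReporter.put("streaming.events.processed", 1234,
    //                       Map("app" -> "solr-indexer", "host" -> "node42"))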
Monitoring - Challenges
Challenges
• Adding hundreds of servers and a dozen new applications makes for a huge monitoring task
• Nagios is a very general-purpose system and isn’t designed to monitor Hadoop out of the box
Recommendations
• Make sure that you have monitors and trend analysis in place and tested before migration
• Be prepared to constantly refine and improve your monitors and alerts
Patching and Upgrading
• We have a zero-downtime requirement for applications
• Patching and upgrading of either the infrastructure or our own
applications is problematic
• Keeping up with the community requires frequent patching
• Eventually hundreds of Spark Streaming jobs will need to be
constantly processing data with no interruption
Patching and Upgrading - Successes
• Use the Sirius console to manage Spark Streaming jobs
• Marketo’s Kafka consumer allows streaming jobs to pick up where they left off after a restart (see the sketch below)
• Integrated existing Jenkins infrastructure with the Sirius console to provide painless automated patching/upgrades
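We don't know the internals of Marketo's consumer, but the general restart-safe pattern with the Spark 1.x direct Kafka stream looks roughly like this: read the last committed offsets from your own store, start the stream from them, and persist new high-water marks after each batch. loadOffsets/saveOffsets are placeholders for whatever store you use (HBase, ZooKeeper, a database, etc.):

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.StreamingContext
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

    object ResumableStream {
      def loadOffsets(): Map[TopicAndPartition, Long] = Map.empty       // read from your offset store;
                                                                        // a first run with no saved offsets would
                                                                        // fall back to the topic-based variant
      def saveOffsets(offsets: Map[TopicAndPartition, Long]): Unit = () // persist after each batch

      def start(ssc: StreamingContext, kafkaParams: Map[String, String]): Unit = {
        val handler = (m: MessageAndMetadata[String, String]) => (m.key, m.message)
        val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder,
          (String, String)](ssc, kafkaParams, loadOffsets(), handler)

        stream.foreachRDD { rdd =>
          val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
          // ... process the batch first, then record how far we got ...
          saveOffsets(ranges.map(r => TopicAndPartition(r.topic, r.partition) -> r.untilOffset).toMap)
        }
      }
    }

Committing offsets only after the batch succeeds gives at-least-once delivery across restarts, which is why the downstream writes need to be idempotent.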
Infrastructure Patching and Upgrading - Challenges
Challenges
• Patches/upgrades managed with Ambari – not perfect!
• We almost never get through an upgrade without one or more Hadoop
components having downtime (so far)
Recommendations
• Test all infrastructure patches and upgrades in a loaded non-production
environment
• Check out the start and stop scripts from the component-specific open source communities, rather than relying on Ambari
We’re Hiring!
http://marketo.jobs
Q & A

Editor's Notes

  • #2: Eighteen months ago our team kicked off an ambitious project, which we have since named Orion. A group of us came to Hadoop Summit to learn as much as we could, and that experience is the inspiration for this talk. We want to share what we have learned over the last 18 months: what worked well, and what we would do differently.
  • #3: Although the talk isn’t about the project itself, we have a few slides up front to set the context around what we are working on. If you have been near technology at all in the last couple of years, you know that the world has become very connected. The number of connected devices blows my mind. It’s not just phones anymore: Amazon Dash buttons, coffee makers, propane tanks, garage doors. These devices are sending tens of billions of activities and user interactions every day. Orion is our platform. Our marketing platform ingests the user interactions and processes them into relevant marketing touchpoints. It enables marketers to create marketing campaigns around these activities to build relationships with their customers, and to become the fabric for marketers. It has been a great experience building this.
  • #5: Here are a few of the requirements. Near real-time processing; at least 1 billion activities per customer per day. Customer demands from ever-increasing device counts caused us to evaluate next-gen queuing and streaming. Reduction in infrastructure COGS, primarily from expensive enterprise-class filers. Reduction in people COGS through efficiency gained by trimming a tech stack that used too many similar technologies. Multitenant, of course. Secure. Customer isolation and improved resource management.
  • #6: Architecture requirements driven from the business requirements. Improve utilization over the existing system. Lots of customers in the same infrastructure, without starving any of them. Encryption from day 1 for safe data storage. Aim for horizontal scalability. Radically reduce processing latency. Eliminate backlogs. Brownout protection.
  • #7: Bake-off to decide which platform to use. Build POCs to pick the best tech stack. Researched various technologies, Hadoop and non-Hadoop.
  • #8: Decided to take a day’s worth of web traffic and build POCs: Storm/Spark as the event processing platforms, HBase/Cassandra for storage, and Kafka as the event queue.
  • #9: All combos worked, no clear winner. The amount of load generated was not enough to differentiate them.
  • #10: Community: Spark had a much more active community than Storm. Features: Spark solved batch processing, something Storm couldn’t do. Team experience: HBase let us leverage existing Hadoop expertise. History: our team had poor experiences scaling up our existing Cassandra cluster.
  • #11: A few words about the architecture. The main goal is to ingest, process, and store marketing events.
  • #12: High-level diagram of our event processor. Enhanced Lambda Architecture. Inbound activities are written to the Ingestion Processor, HBase and then Kafka. High-volume (e.g. web) activities are first written to Kafka, then enriched. Spark Streaming applications consume events from Kafka: Solr indexing, email reports, campaign processing. HBase is used for simple historical queries and is the system of record.
  • #13: Reiterates my points on the last slide; included in case you want to look at the slides later.
  • #14: The next things we are going to talk about are some key points from the implementation phase of the project. Lots of learnings around training, getting the first cluster running, and security.
  • #15: One of the first things was to build expertise and grow knowledge in house. Tech talks led by the architecture team on the new infrastructure. Online courses (Coursera) for Scala and Hadoop. Onsite training for Scala (which is the preferred language for Spark Streaming). Hortonworks boot camp to train operators.
  • #16: Training helped us kick-start the project by getting people in the right mindset. Helped people feel included in the project and the process. Got people thinking about new technologies. Created a nice foundation for the design process.
  • #17: Early training was great, but groups who didn’t use the knowledge immediately lost context. We would set up Hadoop environments early to let people get hands-on experience right away. Hands-on experience should have spanned all teams. For example, developers were developing in Spark standalone mode and made a rough transition into YARN cluster mode.
  • #18: The Hadoop ecosystem is quite complex. The design possibilities are large, and only a few are right; the difference between right and wrong can be very subtle. The best way to navigate is to find experts: hire if possible, or get expertise from a partner like Hortonworks. You need experts!
  • #19: Took a scientific approach: took the POC and did some load tests in AWS. The leading indicator was disk I/O. Our next task was to figure out how to build our first cluster, which is quite daunting. Built a scale model in AWS. Talked to HP and Hortonworks to get best practices and recommendations around hardware and server builds.
  • #20: The leading indicator was not disk; it was compute. We can add either disk-only or compute-only nodes to scale. Do the initial sizing exercise, but don’t get too hung up on the cluster composition; you will end up resizing and tuning as you scale up anyway. We may add compute-only nodes. Don’t get too stuck on initial sizing; you can always scale up later. Don’t overscale from day one.
  • #21: ZooKeeper is not in the path of direct user queries. ZooKeeper in VMs did not work well; we think it was disk I/O. Moved to physical boxes and life was much better. ZooKeeper, for those of you who are new to Hadoop, is the cluster coordination service.
  • #22: (Talk about why we discuss capacity alongside security.) From the beginning the infrastructure needed to meet enterprise security requirements. All applications are isolated. Restrict applications’ resource usage (disk I/O, etc.).
  • #23: Hadoop has support for Kerberos (some parts better than others). HDFS native disk encryption. Encrypted disks for Kafka because of its lack of native support. Isolated YARN queues.
  • #24: Kerberos is really, really hard. Allow extra time for Kerberos. Training first. Find someone who has done it before (much easier than going it alone). Kerberos support varies, and it is not so great in Kafka. We still have some bugs we are trying to work out.
  • #25: Kafka doesn’t support data encryption (and won’t, because of performance). Disk encryption ended up not being a critical performance blocker. We ended up rolling back Kerberization for Spark. Move Kafka and Spark out of Ambari and manage them yourself if you don’t need the features: more control over versions, take patches faster. Only loosely integrated for now.
  • #26: The next phase was when we were ready to validate our newly built event ingestion system.
  • #27: Wanted to validate that the new system performed as a functional superset of the old one. Doing this on a running system is extremely difficult. We decided early on to require all components to implement a silent mode, which allows us to test for correctness with real data, in the wild. We had automated CI tests in Jenkins and performance testing in AWS.
  • #28: Passive mode was one of the best moves we made; it found countless bugs and config issues. Real-world load testing. Super valuable, and worth the cost of implementation.
  • #29: By design it writes to both the legacy and new systems, which caused a performance issue due to slow writes. The cluster didn’t really go all the way down; we overloaded ZooKeeper. We recommend passive mode. Use short timeouts or write asynchronously. Make sure you have monitors in place, even for passive mode.
  • #30: After we finished proving the service in passive mode for beta customers. A massive undertaking.
  • #31: Ready to migrate 6,000 subscriptions without any service interruption and no downtime. (Maybe say customers instead of subscriptions.) Non-trivial! Marketo has a 24/7/365 commitment. Migrate customers a few subscriptions at a time. Create management and migration tools. Delete data out of the relational database.
  • #32: In order to manage the migration we created Sirius. The human factor: about 10 teams and 30 subcomponents. The whole team was closely involved with the migration. Automated fallback to the legacy system if a problem arose. Daily standup to track the rollout.
  • #33: This is a picture of our management console. All test data in this example.
  • #34: One big challenge: it is built on top of Oozie. Oozie is powerful but very complex. Capacity planning was more complex than we thought; we ended up cycling through ramping up customers -> capacity planning -> ramping up again. Only use Oozie if you have to. It is important to capacity plan in the wild; one team ended up needing 10% of their original estimate.
  • #35: We have had several learnings already running this new infrastructure. It is challenging to keep track of dozens of applications running across hundreds of servers.
  • #36: First, we needed to add monitors for all the new servers (~350). Created a bunch of Spark Streaming applications, all needing metrics to be reported and monitored. Metrics are used for capacity planning and for ensuring we are meeting the business metrics for the project. Didn’t want to overwhelm the Cloud Platform Team.
  • #37: Built a new monitoring and metrics system using OpenTSDB and Grafana, which allows us to do trend analysis on Hadoop and other infrastructure. Instrumented all of our new applications to report metrics. The Sirius console monitors the business-level metrics. In addition, we added a comprehensive set of Hadoop monitors to our pre-existing production monitoring system (Nagios) to alert our operators of infrastructure issues.
  • #38: A big challenge was creating all the monitors to make sure we knew the health of the systems. Constantly tuning monitors to make sure we aren’t over- or under-alerting: creating “Goldilocks” alerts for the operators, not too noisy, not too quiet.
  • #39: A big challenge with Spark Streaming and YARN is that there isn’t any built-in facility for patching and upgrading with zero downtime; this is really true across all Hadoop components. Eventually we will have hundreds of Spark Streaming jobs running, and we need to do it without interruption.
  • #40: Decided early on that we would build our own tooling for managing patches and upgrades. It allows us to deploy a new set of Spark Streaming applications without interruptions. Kafka consumers are coded to allow jobs to pick up where they left off. Integrated with the CI system. Sirius uses the Oozie workflow engine to manage orchestration during patches/upgrades with minimal downtime.
  • #41: One big challenge is that Ambari doesn’t always stop and start infrastructure in a way that avoids service interruption. We have been close, but not successful. Test under load! It makes a huge difference; you will hit timeouts, etc., that upset Ambari. Check out the communities’ graceful restart scripts; they seem to be further along. Hortonworks has been very good about learning from our issues and improving the upgrade process.
  • #41: One big challenge is that Ambari doesn’t always stop and start infrastructure in a way that doesn’t cause service interruption Have been close, but not successful Test under load! It makes a huge difference. You will hit timeout, etc that upset abari Check out the communities graceful restart scripts. They seem to be further along Hortonworks has been very good about learning from our issues and improving the upgrade process