Successes, Challenges and Pitfalls Migrating a SAAS Business to Hadoop
Shaun Klopfenstein, CTO
Eric Kienle, Chief Architect
The Vision
Requirements
Business Requirements
• Near real-time activity processing
• 1 billion activities per customer per day
• Improve cost efficiency of operations while scaling up
• Global enterprise grade security and governance
Architecture Requirements
• Maximize utilization of hardware
• Multitenancy support with fairness
• Encryption, Authorization & Authentication
• Applications must scale horizontally
Technology Bake Off
Bake Off
• Technology Selection
• Storm/Spark Streaming
• HBase/Cassandra
• Built a POC with each permutation, plus Kafka
• Load tested with one day of web traffic
The Winner Is… Our First Challenge
• We hoped to find a clear winner… we didn’t, exactly
• The truth is, all the POCs worked at the scale we tested
• It’s possible that if we had scaled up the test, we would have found more differences
How We Chose
• Community
• Features
• Team Skillset
• History
• The winners: HBase/Kafka/Spark Streaming
Architecture & Design
Marketo Lambda Architecture
[Diagram: inbound sources (Web Activity, RTP Activity, Mobile Activity, CRM Sync, Partner APIs, Other Marketing Activities) feed the Ingestion Processor (Scala/Tomcat) and the Kafka Event Stream; Spark Streaming consumers (Campaign Triggers, Solr Indexing, Email Report Loader, Web Activity Processor) read from Kafka and write to HBase/HDFS and Solr, which serve the Marketo UI (Campaign Detail, Lead Detail), CRM Sync, Revenue Cycle Analytics, APIs, and other clients.]
High Level Architecture
• Enhanced Lambda Architecture
• Inbound activities are written to the Ingestion Processor
  • HBase first, then Kafka
• High-volume (e.g. web) activities
  • First written to Kafka, then enriched
• Spark Streaming applications consume events from Kafka (see the sketch below)
  • Solr Indexing
  • Email Reports
  • Campaign Processing
• HBase is used for simple historical queries, and is the system of record
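To make the consumer side concrete, here is a minimal sketch of the pattern described above, using the Spark 1.x spark-streaming-kafka direct stream API (the integration current when this deck was written; later Spark versions use a different API). The application name, broker list, topic name, and processing body are illustrative placeholders, not Marketo's actual code.

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object ActivityConsumer {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("activity-consumer"), Seconds(5))

        // Direct (receiver-less) stream: one RDD partition per Kafka partition
        val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092") // placeholder brokers
        val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, Set("activities")) // hypothetical topic name

        stream.foreachRDD { rdd =>
          rdd.foreachPartition { events =>
            events.foreach { case (_, activityJson) =>
              // enrich the activity and write it to the serving stores (HBase, Solr, ...)
              ()
            }
          }
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }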
Build It
Implementation
Building Expertise
• We had a few people with Hadoop and Spark experience
• We decided to grow knowledge in house
• Focus on training - Hortonworks boot camp for operations
• In-house courses and tech talks for engineering/QE
Building Expertise - Successes
• Critical to kick-start the project
• Built excitement
• Created foundation for the design process
Building Expertise – Context Challenge
Challenge
• Training packed a lot of information into a short period
• Teams that didn’t leverage the training right away lost context
Recommendation
• Create environments for hands-on experience early
• Hands-on experience across all teams right after training
Building Expertise – Experience Challenge
Challenge
• Hadoop technology is like playing a piano… knowing how to read
music doesn’t mean you can play
• Many ways to design, configure, manage - Only a few right ways
and the reasons can be subtle
Recommendation
• Find your experts!
• Partner and hire
Building Our First Cluster
• Initial sizing and capacity planning of the first Hadoop clusters
• Perform load tests to get an initial capacity plan
• Decided that disk I/O and storage would be the leading indicator
• Went with industry best practice on hardware and network configuration
Building Our First Cluster – Success
• The leading indicator ended up being compute
• But cluster sizing ended up being close enough to start
• Clusters can always be expanded… so don’t get too hung up
Building Our First Cluster – ZooKeeper & VMs
Challenge
• We started with ZooKeeper virtualized
• It didn’t perform properly (we think because of disk I/O)
• Caused random outages
Recommendation
• We ended up migrating ZooKeeper to physical boxes
• Don’t use VMs for ZooKeeper!
Security
• All data at rest must be encrypted
• Applications sharing Hadoop must be isolated
from each other
• Applications must have hard quotas for both
compute and disk resources
Security - Success
• Enabled Kerberos security for the Hadoop cluster
• Kerberos allowed us to leverage HDFS native encryption
• Used encrypted disks for Kafka servers
• Created separate secure YARN queues to isolate applications
• Each application uses a separate Kerberos principal (see the login sketch below)
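As an illustration of the per-application-principal approach, here is a minimal Scala sketch using Hadoop's standard UserGroupInformation API. The principal name and keytab path are made-up examples, not Marketo's actual configuration.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.security.UserGroupInformation

    object SecureLogin {
      def login(principal: String, keytab: String): Unit = {
        val conf = new Configuration()
        conf.set("hadoop.security.authentication", "kerberos")
        UserGroupInformation.setConfiguration(conf)
        // Each application authenticates with its own principal/keytab,
        // so YARN queues and HDFS permissions can isolate it from its neighbors
        UserGroupInformation.loginUserFromKeytab(principal, keytab)
      }
    }

    // e.g. SecureLogin.login("solr-indexer@EXAMPLE.COM",
    //                        "/etc/security/keytabs/solr-indexer.keytab")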
Security – Kerberos Challenge
Challenge
• Kerberos can’t be added to a Hadoop cluster without prolonged
downtime and patches
• Needed weeks of developer time to accommodate security changes
• Added several months to the overall rollout schedule
Recommendation
• Allow extra time for Kerberos
• Educate your team beforehand, find an expert to guide you
• Be prepared for different levels of Kerberos support across the
Hadoop ecosystem
Security – Kafka and Spark Challenge
Challenge
• Kafka doesn’t support data encryption (and won’t)
• The HDP version we had didn’t fully support Kerberized Kafka and Spark clients
Recommendation
• Move Kafka and Spark out of Ambari
• Only encrypt Kafka data if you absolutely must, as it adds complexity (see the sketch below)
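If you do have to protect Kafka payloads, the usual workaround is application-level encryption, since the broker won't do it for you. A rough producer-side sketch, assuming AES keys and IVs come from an external key-management system (key handling, IV uniqueness, and the matching consumer-side decryption are deliberately omitted; the topic name is hypothetical):

    import javax.crypto.Cipher
    import javax.crypto.spec.{IvParameterSpec, SecretKeySpec}
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    object EncryptingProducer {
      // Encrypt one payload; key and IV must come from your key-management system
      def encrypt(key: Array[Byte], iv: Array[Byte], plaintext: Array[Byte]): Array[Byte] = {
        val cipher = Cipher.getInstance("AES/CBC/PKCS5Padding")
        cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv))
        cipher.doFinal(plaintext)
      }

      def send(producer: KafkaProducer[Array[Byte], Array[Byte]],
               key: Array[Byte], iv: Array[Byte], activity: Array[Byte]): Unit =
        producer.send(
          new ProducerRecord[Array[Byte], Array[Byte]]("activities", encrypt(key, iv, activity)))
    }

The decryption step on every consumer, plus key rotation, is exactly the complexity the slide warns about.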
Test It
Validation
• Changing the engines on a plane while in flight is hard
• Required all components to implement a “passive mode”
• The new code ran in the background and continuously compared results with the legacy system
• Automated functional tests kicked off from Jenkins
• Performance testing in AWS
Validation - Success
• Passive mode is one of the best moves we made!
• Allowed for testing of components with real-world data and load
• Found countless performance and logic issues with minimal operational impact
Validation – Passive Mode “Minimal Impact”
Challenge
• By design, passive mode wrote to both the legacy and Hadoop systems
• We impacted performance during an outage of our cluster
Recommendation
• Use asynchronous writes or tight timeouts in passive mode (see the sketch below)
• Monitoring for the Hadoop cluster should be in place before passive testing
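A minimal sketch of the asynchronous shadow-write idea, assuming hypothetical writeLegacy/writeHadoop persistence calls (both are stand-ins, not Marketo's real interfaces). The point is that a Hadoop outage can only cost a logged error, never customer-facing latency:

    import scala.concurrent.Future
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.util.{Failure, Success}

    object PassiveModeWriter {
      def writeLegacy(activity: String): Unit = () // stand-in: synchronous write to the system of record
      def writeHadoop(activity: String): Unit = () // stand-in: write to the new pipeline under test

      def write(activity: String): Unit = {
        writeLegacy(activity) // the customer-facing path stays unchanged

        // Shadow write runs off the request thread; failures are recorded, never propagated
        Future(writeHadoop(activity)).onComplete {
          case Failure(e) => println(s"passive-mode write failed: ${e.getMessage}") // log + metric in practice
          case Success(_) => ()
        }
      }
    }

If you must block for comparison purposes, the alternative the slide suggests is a tight timeout on the shadow write rather than a fully detached future.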
Deploying It
Migration and Management
• We are here!
• Migrate over 6,000 subscriptions with no service interruption
or data loss
• Track and monitor migration and provide management tools
for the new platform
• Achieve the end goal of removing the safety net
Migration and Management - Successes
• Created a new management console called Sirius
• Close architectural coordination of all teams during
migration
• If problems arose, we had a quick, automated fallback path to the legacy system
• Daily cross-functional standup meetings to track the
rollout
Migration and Management Challenges
Challenge
• Oozie workflows can be challenging to build and debug
• Capacity planning and resource management in the shared Hadoop
cluster is very complex
Recommendation
• Only use Oozie workflows for automating complex or long running
processes, or use a different orchestration platform
• Constantly reevaluate your capacity plan based on current deployment
Running It
Monitoring
• Needed to monitor hundreds of new Hadoop and other
infrastructure servers
• Our custom Spark Streaming applications required all
new metrics and monitors
• Capacity planning requires trend analysis of both the
infrastructure and our applications
• Don’t overwhelm our already busy Cloud Platform Team
Monitoring - Successes
• Built a custom monitoring infrastructure using OpenTSDB and Grafana (see the sketch below)
• Added business SLA metrics to our Sirius console to provide real-time alerts
• Added comprehensive Hadoop monitors into our pre-existing production monitoring system
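For flavor, here is a bare-bones metric push to OpenTSDB's HTTP /api/put endpoint. The host name, metric name, and tags are invented, and a real reporter would batch datapoints and reuse connections:

    import java.net.{HttpURLConnection, URL}
    import java.nio.charset.StandardCharsets

    object TsdbReporter {
      def put(metric: String, value: Double, tags: Map[String, String]): Unit = {
        val tagJson = tags.map { case (k, v) => "\"" + k + "\":\"" + v + "\"" }.mkString(",")
        val body = s"""{"metric":"$metric","timestamp":${System.currentTimeMillis / 1000},"value":$value,"tags":{$tagJson}}"""
        val conn = new URL("http://opentsdb.example.internal:4242/api/put")
          .openConnection().asInstanceOf[HttpURLConnection]
        conn.setRequestMethod("POST")
        conn.setRequestProperty("Content-Type", "application/json")
        conn.setDoOutput(true)
        conn.getOutputStream.write(body.getBytes(StandardCharsets.UTF_8))
        conn.getResponseCode // OpenTSDB returns 204 on success
        conn.disconnect()
      }
    }

    // e.g. TsdbReporter.put("streaming.events.processed", 1234,
    //                       Map("app" -> "solr-indexer", "host" -> "node42"))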
Monitoring - Challenges
Challenges
• Adding hundreds of servers and a dozen new applications makes for a huge monitoring task
• Nagios is a very general-purpose system and isn’t designed to monitor Hadoop out of the box
Recommendations
• Make sure that you have monitors and trend analysis in place and tested before migration
• Be prepared to constantly refine and improve your monitors and alerts
Patching and Upgrading
• We have a zero-downtime requirement for applications
• Patching and upgrading of either the infrastructure or our own
applications is problematic
• Keeping up with the community requires frequent patching
• Eventually hundreds of Spark Streaming jobs will need to be
constantly processing data with no interruption
Patching and Upgrading - Successes
• Use the Sirius console to manage Spark Streaming jobs
• Marketo’s Kafka consumer allows streaming jobs to pick up where they left off after a restart (see the sketch below)
• Integrated existing Jenkins infrastructure with the Sirius console to provide painless automated patching/upgrades
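We don't know the internals of Marketo's consumer, but the general restart-safe pattern with the Spark 1.x direct Kafka stream looks roughly like this: read the last committed offsets from your own store, start the stream from them, and persist new high-water marks after each batch. loadOffsets/saveOffsets are placeholders for whatever store you use (HBase, ZooKeeper, a database, etc.):

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.StreamingContext
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

    object ResumableStream {
      def loadOffsets(): Map[TopicAndPartition, Long] = Map.empty       // read from your offset store;
                                                                        // a first run with no saved offsets would
                                                                        // fall back to the topic-based variant
      def saveOffsets(offsets: Map[TopicAndPartition, Long]): Unit = () // persist after each batch

      def start(ssc: StreamingContext, kafkaParams: Map[String, String]): Unit = {
        val handler = (m: MessageAndMetadata[String, String]) => (m.key, m.message)
        val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder,
          (String, String)](ssc, kafkaParams, loadOffsets(), handler)

        stream.foreachRDD { rdd =>
          val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
          // ... process the batch first, then record how far we got ...
          saveOffsets(ranges.map(r => TopicAndPartition(r.topic, r.partition) -> r.untilOffset).toMap)
        }
      }
    }

Committing offsets only after the batch succeeds gives at-least-once delivery across restarts, which is why the downstream writes need to be idempotent.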
Infrastructure Patching and Upgrading - Challenges
Challenges
• Patches/upgrades managed with Ambari – not perfect!
• We almost never get through an upgrade without one or more Hadoop
components having downtime (so far)
Recommendations
• Test all infrastructure patches and upgrades in a loaded non-production
environment
• Check out the start and stop scripts from the component-specific open source communities, rather than relying on Ambari
We’re Hiring!
http://marketo.jobs
Q & A

Editor's Notes

  • #2: Eighteen months ago our team kicked off an ambitious project, which we have since named Orion. A group of us came to Hadoop Summit to learn as much as we could, and that experience is the inspiration for this talk. We want to share what we have learned over the last 18 months: what worked well, and what we would do differently.
  • #3: Although the talk isn’t about the project itself, we have a few slides up front to set the context around what we are working on. If you have been near technology at all in the last couple of years, you know that the world has become very connected. The number of connected devices blows my mind. It’s not just phones anymore: Amazon Dash buttons, coffee makers, propane tanks, garage doors. These devices are sending tens of billions of activities and user interactions every day. Orion is our platform. Our marketing platform ingests the user interactions and processes them into relevant marketing touchpoints. It enables marketers to create marketing campaigns around these activities to build relationships with their customers, and to become the fabric for marketers. It has been a great experience building this.
  • #5: Here are a few of the requirements. Near real-time processing; at least 1 billion activities per customer per day. Customer demands from ever-increasing device counts caused us to evaluate next-gen queuing and streaming. Reduction in infrastructure COGS, primarily from expensive enterprise-class filers. Reduction in people COGS through efficiency gained by trimming a tech stack that used too many similar technologies. Multitenant, of course. Secure. Customer isolation and improved resource management.
  • #6: Architecture requirements driven from the business requirements. Improve utilization over the existing system. Lots of customers in the same infrastructure, without starving any of them. Encryption from day 1 for safe data storage. Aim for horizontal scalability. Radically reduce processing latency. Eliminate backlogs. Brownout protection.
  • #7: Bake-off to decide which platform to use. Build POCs to pick the best tech stack. Researched various technologies, Hadoop and non-Hadoop.
  • #8: Decided to take a day’s worth of web traffic and build POCs: Storm/Spark as the event processing platforms, HBase/Cassandra for storage, and Kafka as the event queue.
  • #9: All combos worked, no clear winner. The amount of load generated was not enough to differentiate them.
  • #10: Community: Spark had a much more active community than Storm. Features: Spark solved batch processing, something Storm couldn’t do. Team experience: HBase let us leverage existing Hadoop expertise. History: our team had poor experiences scaling up our existing Cassandra cluster.
  • #11: A few words about the architecture. The main goal is to ingest, process, and store marketing events.
  • #12: High-level diagram of our event processor. Enhanced Lambda Architecture. Inbound activities are written to the Ingestion Processor, HBase and then Kafka. High-volume (e.g. web) activities are first written to Kafka, then enriched. Spark Streaming applications consume events from Kafka: Solr indexing, email reports, campaign processing. HBase is used for simple historical queries and is the system of record.
  • #13: Reiterates my points on the last slide; included in case you want to look at the slides later.
  • #14: The next things we are going to talk about are some key points from the implementation phase of the project. Lots of learnings around training, getting the first cluster running, and security.
  • #15: One of the first things was to build expertise and grow knowledge in house. Tech talks led by the architecture team on the new infrastructure. Online courses (Coursera) for Scala and Hadoop. Onsite training for Scala (which is the preferred language for Spark Streaming). Hortonworks boot camp to train operators.
  • #16: Training helped us kick-start the project by getting people in the right mindset. Helped people feel included in the project and the process. Got people thinking about new technologies. Created a nice foundation for the design process.
  • #17: Early training was great, but groups who didn’t use the knowledge immediately lost context. We would set up Hadoop environments early to let people get hands-on experience right away. Hands-on experience should have spanned all teams. For example, developers were developing in Spark standalone mode and made a rough transition into YARN cluster mode.
  • #18: The Hadoop ecosystem is quite complex. The design possibilities are large, and only a few are right; the difference between right and wrong can be very subtle. The best way to navigate is to find experts: hire if possible, or get expertise from a partner like Hortonworks. You need experts!
  • #19: Took a scientific approach: took the POC and did some load tests in AWS. The leading indicator was disk I/O. Our next task was to figure out how to build our first cluster, which is quite daunting. Built a scale model in AWS. Talked to HP and Hortonworks to get best practices and recommendations around hardware and server builds.
  • #20: The leading indicator was not disk; it was compute. We can add either disk-only or compute-only nodes to scale. Do the initial sizing exercise, but don’t get too hung up on the cluster composition; you will end up resizing and tuning as you scale up anyway. We may add compute-only nodes. Don’t get too stuck on initial sizing; you can always scale up later. Don’t overscale from day one.
  • #21: ZooKeeper is not in the path of direct user queries. ZooKeeper in VMs did not work well; we think it was disk I/O. Moved to physical boxes and life was much better. ZooKeeper, for those of you who are new to Hadoop, is the cluster coordination service.
  • #22: (Talk about why we discuss capacity alongside security.) From the beginning the infrastructure needed to meet enterprise security requirements. All applications are isolated. Restrict applications’ resource usage (disk I/O, etc.).
  • #23: Hadoop has support for Kerberos (some parts better than others). HDFS native disk encryption. Encrypted disks for Kafka because of its lack of native support. Isolated YARN queues.
  • #24: Kerberos is really, really hard. Allow extra time for Kerberos. Training first. Find someone who has done it before (much easier than going it alone). Kerberos support varies, and it is not so great in Kafka. We still have some bugs we are trying to work out.
  • #25: Kafka doesn’t support data encryption (and won’t, because of performance). Disk encryption ended up not being a critical performance blocker. We ended up rolling back Kerberization for Spark. Move Kafka and Spark out of Ambari and manage them yourself if you don’t need the features: more control over versions, take patches faster. Only loosely integrated for now.
  • #26: The next phase was when we were ready to validate our newly built event ingestion system.
  • #27: Wanted to validate that the new system performed as a functional superset of the old one. Doing this on a running system is extremely difficult. We decided early on to require all components to implement a silent mode, which allows us to test for correctness with real data, in the wild. We had automated CI tests in Jenkins and performance testing in AWS.
  • #28: Passive mode was one of the best moves we made; it found countless bugs and config issues. Real-world load testing. Super valuable, and worth the cost of implementation.
  • #29: By design it writes to both the legacy and new systems, which caused a performance issue due to slow writes. The cluster didn’t really go all the way down; we overloaded ZooKeeper. We recommend passive mode. Use short timeouts or write asynchronously. Make sure you have monitors in place, even for passive mode.
  • #30: After we finished proving the service in passive mode for beta customers. A massive undertaking.
  • #31: Ready to migrate 6,000 subscriptions without any service interruption and no downtime. (Maybe say customers instead of subscriptions.) Non-trivial! Marketo has a 24/7/365 commitment. Migrate customers a few subscriptions at a time. Create management and migration tools. Delete data out of the relational database.
  • #32: In order to manage the migration we created Sirius. The human factor: about 10 teams and 30 subcomponents. The whole team was closely involved with the migration. Automated fallback to the legacy system if a problem arose. Daily standup to track the rollout.
  • #33: This is a picture of our management console. All test data in this example.
  • #34: One big challenge: it is built on top of Oozie. Oozie is powerful but very complex. Capacity planning was more complex than we thought; we ended up cycling through ramping up customers -> capacity planning -> ramping up again. Only use Oozie if you have to. It is important to capacity plan in the wild; one team ended up needing 10% of their original estimate.
  • #35: We have had several learnings already running this new infrastructure. It is challenging to keep track of dozens of applications running across hundreds of servers.
  • #36: First, we needed to add monitors for all the new servers (~350). Created a bunch of Spark Streaming applications, all needing metrics to be reported and monitored. Metrics are used for capacity planning and for ensuring we are meeting the business metrics for the project. Didn’t want to overwhelm the Cloud Platform Team.
  • #37: Built a new monitoring and metrics system using OpenTSDB and Grafana, which allows us to do trend analysis on Hadoop and other infrastructure. Instrumented all of our new applications to report metrics. The Sirius console monitors the business-level metrics. In addition, we added a comprehensive set of Hadoop monitors to our pre-existing production monitoring system (Nagios) to alert our operators of infrastructure issues.
  • #38: A big challenge was creating all the monitors to make sure we knew the health of the systems. Constantly tuning monitors to make sure we aren’t over- or under-alerting: creating “Goldilocks” alerts for the operators, not too noisy, not too quiet.
  • #39: A big challenge with Spark Streaming and YARN is that there isn’t any built-in facility for patching and upgrading with zero downtime; this is really true across all Hadoop components. Eventually we will have hundreds of Spark Streaming jobs running, and we need to do it without interruption.
  • #40: Decided early on that we would build our own tooling for managing patches and upgrades. It allows us to deploy a new set of Spark Streaming applications without interruptions. Kafka consumers are coded to allow jobs to pick up where they left off. Integrated with the CI system. Sirius uses the Oozie workflow engine to manage orchestration during patches/upgrades with minimal downtime.
  • #41: One big challenge is that Ambari doesn’t always stop and start infrastructure in a way that avoids service interruption. We have been close, but not successful. Test under load! It makes a huge difference; you will hit timeouts, etc., that upset Ambari. Check out the communities’ graceful restart scripts; they seem to be further along. Hortonworks has been very good about learning from our issues and improving the upgrade process.
  • #41: One big challenge is that Ambari doesn’t always stop and start infrastructure in a way that doesn’t cause service interruption Have been close, but not successful Test under load! It makes a huge difference. You will hit timeout, etc that upset abari Check out the communities graceful restart scripts. They seem to be further along Hortonworks has been very good about learning from our issues and improving the upgrade process