Janos Matyas / CTO / SequenceIQ Inc.
OVERVIEW
• GOAL / MOTIVATION
• TECHNOLOGY STACK
• PROBLEM RESOLUTION / HOW IT WORKS
• RESULTS / ACHIEVEMENTS
GOAL / MOTIVATION
• Ease Hadoop provisioning – everywhere
• Automate and unify the process
• Arbitrary cluster size
• Same process through a cluster lifecycle (Dev, QA, UAT, Prod)
• (Auto) scaling Hadoop
• QoS
OUR APPROACH
• Use Docker
• Build cloud-specific ‘Dockerized’ images
• Provision the cluster
• Use Ambari
DOCKER
• Lightweight, portable
• Build once, run anywhere
• VM – without the overhead of a VM
• Isolated containers
• Automated and scripted
DOCKER – CONTAINERS vs. VMs
• Containers are isolated, but share the OS and, where appropriate, bins/libraries
APACHE AMBARI – ARCHITECTURE
• Easy Hadoop cluster provisioning
• Management and monitoring
• Key feature – blueprints
• REST API
APACHE AMBARI – CREATE CLUSTER
• Define a blueprint (POST /api/v1/blueprints)
• Create cluster (POST /api/v1/clusters/mycluster)
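The two calls above can be sketched as follows. This is a minimal illustration, not the presenter's code: the server address, blueprint name, host names and component layout are assumptions, authentication is omitted, and the blueprint is posted under a name (`/api/v1/blueprints/demo-blueprint`). The requests are only built, never sent.

```python
import json
import urllib.request

AMBARI = "http://ambari.example.com:8080"  # hypothetical server address


def ambari_post(path, payload):
    """Build (but do not send) a POST request for the Ambari REST API."""
    return urllib.request.Request(
        url=AMBARI + path,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        # Ambari expects this header on modifying requests.
        headers={"X-Requested-By": "ambari"},
    )


# A blueprint is a stack definition plus a component layout (illustrative values).
blueprint = {
    "Blueprints": {"stack_name": "HDP", "stack_version": "2.1"},
    "host_groups": [
        {
            "name": "master",
            "cardinality": "1",
            "components": [{"name": "NAMENODE"}, {"name": "RESOURCEMANAGER"}],
        },
        {
            "name": "slave",
            "cardinality": "2",
            "components": [{"name": "DATANODE"}, {"name": "NODEMANAGER"}],
        },
    ],
}

# The cluster request maps concrete hosts onto the blueprint's host groups.
cluster = {
    "blueprint": "demo-blueprint",
    "host_groups": [
        {"name": "master", "hosts": [{"fqdn": "node0.example.com"}]},
        {
            "name": "slave",
            "hosts": [{"fqdn": "node1.example.com"}, {"fqdn": "node2.example.com"}],
        },
    ],
}

calls = [
    ambari_post("/api/v1/blueprints/demo-blueprint", blueprint),
    ambari_post("/api/v1/clusters/mycluster", cluster),
]

for req in calls:
    print(req.get_method(), req.full_url)
```

Passing each request to `urllib.request.urlopen` would send it; a real deployment would also need credentials.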
HADOOP PROVISIONING ISSUES
• Each cloud provider has a proprietary API
• Create images for each provider
• Network configuration
• Service discovery
• Resize, failover, member join support
OUR APPROACH – DETAILS
• Build your Docker image
  • Install or pre-install Hadoop services with Ambari
  • Install Serf and dnsmasq
• Build your cloud image
  • Use Ansible to create an image
• Provision the cluster
BUILD DOCKER IMAGES
• Create the Dockerfile
• Have Docker.io build the image
• Optionally pre-install services
  • Use Ambari
• Push image to Docker.io
• Licensing questions
BUILD CLOUD IMAGES
• Use a Docker-ready base image
• Use Ansible to provision the image template
  • Pull the Docker images
  • Apply custom infrastructure
• Use cloud-provider-specific playbooks
  • AWS EC2
  • Azure
ANSIBLE
• Configuration as data
• Simplest way to automate IT
• Secure and agentless
• Goal oriented
• One playbook – multiple modules
• We use it to “burn” cloud images/templates
PROVISIONING – ISSUES
• FQDN
  • /etc/hosts is read-only in Docker
  • Everybody needs to know everybody
• DNS
  • Single point of failure
  • Dynamic cluster – nodes joining, leaving, failing
• Routing
  • Cloud – ability to route between containers on different hosts
  • Collision-free private IP range for Docker bridge
PROVISIONING – SOLUTION
• FQDN
  • Use the -h and --dns Docker params
• DNS
  • dnsmasq runs in each Docker container
  • Serf member-xxx events trigger dnsmasq reconfiguration
• Routing
  • Docker bridge configuration – follows a convention
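The DNS part of the solution can be illustrated with a small Serf event handler. This is a sketch under assumptions, not the presenter's handler: it relies on Serf passing the event name in the `SERF_EVENT` environment variable and one tab-separated member per line on stdin; the domain and the function names are hypothetical.

```python
import io


def parse_members(stream):
    """Parse membership event input: 'name<TAB>address<TAB>role<TAB>tags' per line."""
    members = []
    for line in stream:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[0]:
            members.append((fields[0], fields[1]))
    return members


def apply_event(hosts, event, members):
    """Update the name -> IP table for one Serf membership event."""
    for name, address in members:
        if event in ("member-join", "member-update"):
            hosts[name] = address
        elif event in ("member-leave", "member-failed", "member-reap"):
            hosts.pop(name, None)
    return hosts


def render_hosts(hosts, domain="example.com"):
    """Render the table in hosts-file format, which dnsmasq can serve."""
    return "".join(
        f"{ip} {name}.{domain} {name}\n" for name, ip in sorted(hosts.items())
    )


# Simulated events: two nodes join, then one of them fails.
hosts = {}
joined = io.StringIO("node1\t172.17.0.2\t\t\nnode2\t172.17.0.3\t\t\n")
apply_event(hosts, "member-join", parse_members(joined))
failed = io.StringIO("node2\t172.17.0.3\t\t\n")
apply_event(hosts, "member-failed", parse_members(failed))
print(render_hosts(hosts), end="")
```

In a real handler the event would come from `os.environ["SERF_EVENT"]` and the members from `sys.stdin`; the rendered file would then be written where dnsmasq reads additional hosts, and dnsmasq told to reload it.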
SERF
• Gossip based membership
• Service discovery
• Decentralized
• Lightweight, fault tolerant
• Highly available
• DevOps friendly
• Keep an eye on Consul, Open vSwitch, pipework
SERF – DECENTRALIZED SERVICE DISCOVERY
• Gossip instead of heartbeat
• LAN, WAN profiles
• Provides membership information
• Event handlers: member-join, member-leave, member-failed, member-update, member-reap, user
• Query
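To give a feel for why gossip scales, here is a toy push-gossip simulation. It is a deliberately simplified model, not Serf's actual protocol: the fanout, the seed and the function name are assumptions chosen for illustration.

```python
import random


def gossip_rounds(node_count, fanout=3, seed=0):
    """Count rounds until a rumor started at node 0 reaches every node.

    In each round, every informed node forwards the rumor to `fanout`
    peers picked at random (a peer may already know it).
    """
    rng = random.Random(seed)
    informed = {0}
    rounds = 0
    while len(informed) < node_count:
        newly_informed = set()
        for _ in informed:
            picks = rng.sample(range(node_count), k=min(fanout, node_count))
            newly_informed.update(picks)
        informed |= newly_informed
        rounds += 1
    return rounds


for size in (10, 100, 1000):
    print(size, "nodes:", gossip_rounds(size), "rounds")
```

The number of informed nodes can at most quadruple per round with a fanout of 3, so the round count grows roughly with the logarithm of the cluster size rather than linearly.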
SERF – GOSSIPING
SERF – MEMBERSHIP, EVENT HANDLERS
DNSMASQ
• Network infrastructure for small networks
• Lightweight DNS, DHCP server
• Comes with most Linux distributions
AWS EC2 – HADOOP CLUSTER
• Use EC2 REST API to provision instances (from Dockerized image)
• Start Docker containers
  • One Ambari server
  • N-1 Ambari agents connecting to server
• Connect ambari-shell to
• Define blueprint
• Provision the cluster
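The "one server, N-1 agents" layout can be sketched as a plan of `docker run` invocations. This is a hedged illustration: the image name, the domain, the host naming scheme and the `NODE_ROLE` variable are hypothetical; only the `-d`, `-h`, `--dns` and `-e` flags are standard Docker options.

```python
def docker_run_args(index, image="example/ambari", domain="example.com"):
    """Build `docker run` arguments for the container on instance `index`."""
    role = "server" if index == 0 else "agent"  # one server, N-1 agents
    hostname = f"node{index}.{domain}"
    return [
        "docker", "run", "-d",
        "-h", hostname,              # FQDN seen inside the container
        "--dns", "127.0.0.1",        # resolve via the container's own dnsmasq
        "-e", f"NODE_ROLE={role}",   # hypothetical variable read by the image
        image,
    ]


def plan_cluster(node_count):
    """Return one argument list per instance, in launch order."""
    return [docker_run_args(i) for i in range(node_count)]


for args in plan_cluster(3):
    print(" ".join(args))
```

Each list could be handed to `subprocess.run` on the matching instance; here the commands are only printed.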
AWS EC2 – NETWORK SECURITY
• Create a VPC
• Configure subnets
• Routing tables
• Security gateway
• Set ACL
• Configure VPN
AWS EC2 - CLOUDFORMATION
• Setting up a VPC manually is too complicated
• Use CloudFormation
  • Manage the stack together
  • Template-based
  • Environments under version control
  • Customizable at runtime
  • No extra charge
  "VpcId" : {
    "Type" : "String",
    "Description" : "VpcId of your existing Virtual Private Cloud (VPC)"
  },
  "SubnetId" : {
    "Type" : "String",
    "Description" : "SubnetId of an existing subnet (for the primary network) in your Virtual Private Cloud (VPC)"
  },
  "SecondaryIPAddressCount" : {
    "Type" : "Number",
    "Default" : "1",
    "MinValue" : "1",
    "MaxValue" : "5",
    "Description" : "Number of secondary IP addresses to assign to the network interface (1-5)",
    "ConstraintDescription": "must be a number from 1 to 5."
  },
  "SSHLocation" : {
    "Description" : "The IP address range that can be used to SSH to the EC2 instances",
    "Type": "String",
    "MinLength": "9",
    "MaxLength": "18",
    "Default": "0.0.0.0/0",
    "AllowedPattern": "(\\d{1,3})\\.(\\d{1,3})\\.(\\d{1,3})\\.(\\d{1,3})/(\\d{1,2})",
    "ConstraintDescription": "must be a valid IP CIDR range of the form x.x.x.x/x."
  }
},
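The constraints on the SSHLocation parameter can be checked locally before a template is submitted. The sketch below assumes that the pattern is meant to use `\d` and `\.` escapes and that it must match the whole value; the function name is hypothetical.

```python
import re

# SSHLocation pattern, written with explicit regex escapes.
ALLOWED_PATTERN = r"(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})/(\d{1,2})"
MIN_LENGTH = 9
MAX_LENGTH = 18


def is_valid_ssh_location(value):
    """Apply the MinLength, MaxLength and AllowedPattern constraints."""
    if not MIN_LENGTH <= len(value) <= MAX_LENGTH:
        return False
    return re.fullmatch(ALLOWED_PATTERN, value) is not None


for candidate in ("0.0.0.0/0", "10.0.0.0/16", "10.0.0.0", "not-a-cidr/x"):
    print(candidate, is_valid_ssh_location(candidate))
```

The pattern checks only the shape of the value: something like 999.1.1.1/99 still passes, so a stricter check (for example with Python's `ipaddress` module) may be worthwhile.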
CLOUDBREAK
Cloudbreak is a powerful left surf break over a coral reef, a mile southwest of the island of Tavarua, Fiji.
Cloudbreak is a cloud-agnostic Hadoop as a Service API. It abstracts provisioning and eases the management and monitoring of on-demand clusters.
Provisioning Hadoop has never been easier
CLOUDBREAK
• Benefits
  • Elastic
  • Scalable
  • Blueprints
  • Flexible
• Main REST resources
  • /template – specifies a cluster infrastructure
  • /stack – creates a cloud infrastructure built from a template
  • /blueprint – describes a Hadoop cluster
  • /cluster – creates a Hadoop cluster
RESULTS AND ACHIEVEMENTS
• Hadoop as a Service API
  • Available for EC2 and Azure cloud
  • OpenStack and bare metal are coming soon
• Open source under the Apache 2 licence
• Same goals as the Apache Ambari Launchpad project
• What's next?
HADOOP SERVICES - AS A SERVICE
• Leverage YARN
• Slider (Hoya) providers
  • HBase, Accumulo
  • SequenceIQ providers - Flume, Tomcat
• YARN-1964
• QoS for YARN – heuristic scheduler
• Platform as a Service API
BANZAI PIPELINE
Banzai Pipeline is a surf reef break located in Hawaii, off Ehukai Beach Park in Pupukea on O'ahu's North Shore.
Banzai Pipeline is a RESTful application development platform for building on-demand data and job pipelines running on Hadoop YARN.
Banzai Pipeline is a big data API for the REST
THANK YOU
• Get the code: https://github.com/sequenceiq
• Read about: http://blog.sequenceiq.com
• Facebook: http://facebook.com/sequenceiq
• Twitter: http://twitter.com/sequenceiq
• LinkedIn: http://linkedin.com/sequenceiq
• Contact: janos.matyas@sequenceiq.com
FEEL FREE TO CONTRIBUTE

Docker based Hadoop provisioning - Hadoop Summit 2014


Editor's Notes

  • #2: Thanks for coming. Today I will talk about Docker-based Hadoop provisioning. A quick introduction of who we are: a young startup from Budapest, Hungary. Janos Matyas: CTO, open source contributor, Hadoop YARN evangelist.
  • #4: Why did we start this at all, when there are so many options? We repeated the same steps over and over, and scripted them. Still, we felt that something was missing. See the bullet points.
  • #5: We have been through many different approaches: bare metal, cloud VMs and so on. We ended up using Docker. We tested many provisioning frameworks; Ambari is the one.
  • #6: Quick question: how many of you have used Docker before? Docker is a container-based virtualization framework. Unlike traditional virtualization, Docker is fast, lightweight and easy to use. Docker allows you to create containers holding all the dependencies for an application. Each container is kept isolated from any other, and nothing gets shared.
  • #7: I can run 5-6 containers with less overhead than one VirtualBox VM. No SOCKS proxy, etc.
  • #8: The ‘provisioning’ framework. No need to go into details; there were pretty good sessions about Ambari. Blueprints: tech preview in 1.5.1, fully supported in 1.6. Blueprint = stack definition + component layout. REST API: we have created and open sourced an Ambari client and shell (come and join the Ambari Meetup today at 3:30).
  • #10: Now, the issues. You do it again and again, for each cloud provider. Create the image: but how do you know what the requirements are, and do you build an image each and every time? Network: this is a big issue. EC2 has its API, Azure has its own. OpenStack has a network-as-a-service component, Neutron. SDN: software-defined networking! Everything is dynamic, so how do you do service discovery? Extra features: a fully dynamic Hadoop cluster.
  • #11: Will expand on these shortly. Sounds too easy, so let's get into the details.
  • #12: A Docker image is described by a Dockerfile, much as a Vagrantfile describes a VirtualBox machine. If you want a trusted build, use Docker.io. Faster provisioning: a 100+ node Hadoop cluster in less than 5 minutes? Come and join the Ambari meetup. Licensing: Ganglia or Nagios (BSD and GPL). Hortonworks Hadoop: Apache 2. Bigtop is coming…
  • #13: Amazon Linux (Red Hat based) has recently become Docker ready. OpenStack's Nova supports Docker as a hypervisor. Apply the network and other infrastructure-related pieces. Remember the licensing: use our Ansible script to build your cloud image, or modify it.
  • #14: The IT automation war: Ansible vs. Chef and Puppet. Ansible configurations are simple data descriptions of your infrastructure (both human-readable and machine-parsable). It needs only SSH.
  • #15: Dev environment: use the default Docker bridge (easy). Everything talks to everything else. DNS: heavy management overhead.
  • #16: -h sets the hostname, --dns specifies the DNS service to use. Convention: AMI launch index.
  • #17: Serf is a decentralized solution for cluster membership, failure detection and orchestration. Serf vs. ZooKeeper, etcd and doozerd: all three of those have server nodes that require a quorum of nodes to operate (strong consistency). Serf is eventually consistent. The most important thing is that it is gossip based; will expand shortly. Decentralized: all nodes are equal.
  • #18: Events are fire and forget. A query waits for an answer, with limited response collection. Custom event handlers. Tags, e.g. Ambari server, host groups, etc.
  • #20: Load increases: how does the cluster know that there is a new member?
  • #21: Runs in each Docker container; updated by Serf events.
  • #22: Amazon supports Docker natively. Start N nodes. Pass our userdata script at startup. Start the containers; they will learn about each other using Serf. Use the shell, the REST API or the Ambari UI.
  • #23: You need security: it is strongly recommended to use your own VPC instead of the default VPC. Use different availability zones for maximum uptime.
  • #24: Anyone who has set up a VPC knows it can be scripted. It is harder to decommission, change or delete components than to add them. Use CloudFormation.
  • #25: This is a very easy but still error-prone process, though it helps a lot. We built an API on top and automated the whole process. We are not a service provider; this is an API.
  • #26: Elastic: arbitrary number of nodes. Scalable: follows your workload changes. Blueprints: supports different cluster blueprints. Flexible: use your favorite cloud, bring your own Hadoop, one common API.
  • #27: One API: any size, anywhere. Why we needed Cloudbreak: this is not the end of the story.
  • #28: We wanted to have a Platform as a Service API. We are YARN evangelists and wanted to run everything on YARN. Community driven. Heuristic scheduler.
  • #29: A fully dynamic big data pipeline. Build your pipeline and run it dynamically, on demand. Everything is pre-coded: zero coding, only configuration. Data pipeline: run services on demand, short or long term; started when needed, stopped when idle. Apply ETL on demand. Job pipeline: all major ML libraries are supported (Mahout, MLlib), plus 44 other MR jobs (correlations, joins, summarizations, filtering, sort, sharding, shuffle). Streaming pipeline: Spark based. Custom SDK: abstracts the complexity behind MR and Spark.
  • #30: Subscribe to the beta test. Contribute. We have contributed to several Apache and other open source projects. It is Babel at SequenceIQ: Java and Scala are the defaults, Groovy is used very often, then Go (Docker and Serf: we had to learn Go to fix things), and Ansible for IT. We strongly suggest using Docker; we use it everywhere: CI/CD, cloud. For a demo, come and join the Ambari meetup. Thanks for coming. Q&A. Join me afterwards or follow us through one of the social media channels listed.