Kubernetes @ Squarespace
Microservices on Kubernetes in a Datacenter
Kevin Lynch
klynch@squarespace.com
Agenda
01 The problem with static infrastructure
02 Kubernetes fundamentals
03 Kubernetes networking in a datacenter
04 Adapting microservices to Kubernetes
05 Managing Kubernetes clusters
Microservices Journey: A Story of Growth
2013: small (< 50 engineers)
build product & grow customer base
whatever works
2014: medium (< 100 engineers)
we have a lot of customers now!
whatever works doesn't work anymore
2016: large (100+ engineers)
architect for scalability and reliability
organizational structures
2017: XL (200+ engineers)
Challenges with a Monolith
What were the increasingly difficult challenges with a monolith?
● Reliability
● Performance
● Engineering agility/speed, cross-team coupling
● More time spent firefighting rather than building new functionality
Challenges with a Monolith
● Minimize failure domains
● Developers are more confident in their changes
● Squarespace can move faster
Solution: Microservices!
Operational Challenges
● Engineering org grows…
● More features...
● More services…
● More infrastructure to spin up…
● Ops becomes a blocker...
Stuck in a loop
Traditional Provisioning Process
● Pick ESX with available resources
● Pick IP
● Register host to Cobbler
● Register DNS entry
● Create new VM on ESX
● PXE boot VM and install OS and base configuration
● Install system dependencies (LDAP, NTP, CollectD, Sensu…)
● Install app dependencies (Java, FluentD/Filebeat, Consul, MongoS…)
● Install the app
● App registers with discovery system and begins receiving traffic
Containerization & Kubernetes Orchestration
● Difficult to find resources
● Slow to provision and scale
● Discovery is a must
● Metrics system must support short-lived metrics
● Alerts are usually per instance
Static infrastructure and microservices do not mix!
Kubernetes Provisioning Process
● kubectl apply -f app.yaml
Kubernetes Fundamentals
● apiVersion & kind
○ type of object
● Metadata
○ Names, annotations, labels
● Spec & Status
○ What you want to happen...
○ … versus reality
apiVersion: v1
kind: Pod
metadata:
name: nginx
namespace: default
annotations:
squarespace.net/build: nginx-42
labels:
app: frontend
...
spec:
containers:
- name: nginx
image: nginx:latest
...
status:
hostIP: 10.122.1.201
podIP: 10.123.185.9
phase: Running
qosClass: BestEffort
startTime: 2017-07-31T02:08:25Z
...
Common Objects: Pods
● Basic deployable workload
● Group of 1+ containers
● Defines resource requirements
● Defines storage volumes
○ Ephemeral storage
○ Shared storage (NFS, CephFS)
○ Block storage (RBD)
○ Secrets
○ ConfigMaps
○ more...
spec:
containers:
- name: location
image: .../location:master-269
ports: ...
resources:
limits:
cpu: 2
memory: 4Gi
requests:
cpu: 2
memory: 4Gi
volumeMounts:
- name: config
mountPath: /service/config
- name: log-dir
mountPath: /data/logs
volumes:
- name: config
configMap:
name: location-config
- name: log-dir
emptyDir: {}
Common Objects: Deployments
● Declarative
● Defines a type of pod to run
● Defines desired #
● Supports basic operations
○ Can be rolled back quickly!
○ Can be scaled up/down
● Meant to be stateless apps!
kind: Deployment
spec:
replicas: 3
selector:
matchLabels:
service: location
strategy:
rollingUpdate:
maxSurge: 100%
maxUnavailable: 0
type: RollingUpdate
template:
... pod info here ...
Common Objects: Services
● Make pods addressable internally and externally
○ IP
○ DNS
○ External Load Balancer
apiVersion: v1
kind: Service
metadata:
name: location
namespace: core-services
spec:
type: ClusterIP
clusterIP: 10.123.79.211
selector:
service: location
ports:
- name: traffic
port: 8080
- name: admin
port: 8081
Kubernetes in a datacenter?
Kubernetes Networking
● Kubernetes CNI (Container Network Interface) is pluggable
● Different plugins for different network topologies
○ Flannel
○ Calico
○ Weave
○ Kubenet
○ VXLAN
Calico Networking
● No network overlay required!
○ No nasty MTU issues
○ No performance impact
● Communicates directly with existing L3 network
● BGP Peering with Top of Rack switch
Kubernetes Architecture
Calico Networking
● Engineers can think of Pod IPs as normal hosts
(they’re not)
○ Ping works
○ Consul works normally
○ Browser communication works
○ Shell sorta works (kubectl exec -it pod sh)
Spine and Leaf Layer 3 Clos Topology
● All work is performed at the leaf/ToR switch
● Each leaf switch is separate Layer 3 domain
● Each leaf is a separate BGP domain (ASN)
● No Spanning Tree Protocol issues seen in L2 networks (convergence time, loops)
(diagram: two spine switches, each connected to four leaf switches)
Spine and Leaf Layer 3 Clos Topology
● Simple to understand
● Easy to scale
● Predictable and consistent latency (hops = 2)
● Allows for Anycast IPs
Calico Networking
● Each worker announces its pod IP ranges
○ Aggregated to /26
● Each master announces an External Anycast IP
○ Used for component communication
● Each ingress tier announces the Service IP range
ip addr add 10.123.0.0/17 dev lo
etcdctl set
/calico/bgp/v1/global/custom_filters/v4/services
'if ( net = 10.123.0.0/17 ) then { accept; }'
Spine and Leaf Layer 3 Clos Topology
(diagram: hosts attached under each leaf switch; more hosts are added rack by rack)
Spine and Leaf Layer 3 Clos Topology
● Not quite that easy…
● Switches started issuing ICMP Redirects
● Eventually crashed...
● Causing routes to be dropped
(diagram: the leaf switch is the Layer 3/Layer 2 boundary, with pods on hosts below it)
Spine and Leaf Layer 3 Clos Topology
● Switches issued ICMP Redirects
○ Allows for more efficient routes
○ Each host is a router!
● Eventually routes flapped to Calico, dropping connections
○ A redirect was issued on every packet
Spine and Leaf Layer 3 Clos Topology
● Route Reflectors pass full routing table to Calico
○ Host traffic on the same switch is no longer routed
○ No ICMP Redirects!
(diagram: a Route Reflector on the leaf peers with the Calico agent on each host)
How do we run Java in a container?
Microservice Pod Definition
resources:
requests:
cpu: 2
memory: 4Gi
limits:
cpu: 2
memory: 4Gi
(diagram: Microservice Pod running the Java microservice with fluentd and consul sidecars)
Quality of Service Classes
resources:
requests:
cpu: 2
memory: 4Gi
limits:
cpu: 2
memory: 4Gi
● BestEffort
○ No resource constraints
○ First to be killed under pressure
● Guaranteed
○ Requests == Limits
○ Last to be killed under pressure
○ Easier to reason about resources
● Burstable
○ Take advantage of unused resources!
○ Can be tricky with some languages
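The three classes above follow a simple rule that can be sketched in Python (a simplification of the real kubelet logic, assuming a single-container pod; the helper name is hypothetical):

```python
def qos_class(requests: dict, limits: dict) -> str:
    """Approximate Kubernetes QoS classification for a single-container pod.

    requests/limits map resource names ("cpu", "memory") to quantities.
    """
    if not requests and not limits:
        return "BestEffort"  # no constraints at all; first to be evicted
    # Guaranteed: limits set for both cpu and memory, and requests
    # (defaulting to the limit when unset) equal the limits.
    if all(r in limits for r in ("cpu", "memory")) and \
       all(requests.get(r, limits[r]) == limits[r] for r in ("cpu", "memory")):
        return "Guaranteed"
    return "Burstable"  # anything in between

# The pod spec on this slide (requests == limits) is Guaranteed:
print(qos_class({"cpu": 2, "memory": "4Gi"}, {"cpu": 2, "memory": "4Gi"}))  # Guaranteed
```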
Microservice Pod Definition
resources:
requests:
cpu: 2
memory: 4Gi
limits:
cpu: 2
memory: 4Gi
● Kubernetes assumes no other processes are consuming significant resources
● Completely Fair Scheduler (CFS)
○ Schedules a task based on CPU Shares
○ Throttles a task once it hits CPU Quota
● OOM Killed when memory limit exceeded
Microservice Pod Definition
resources:
requests:
cpu: 2
memory: 4Gi
limits:
cpu: 2
memory: 4Gi
● Shares = CPU Request * 1024
● Total Kubernetes Shares = # Cores * 1024
● Quota = CPU Limit * 100ms
● Period = 100ms
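Plugging the pod above into those formulas (a sketch of the arithmetic only, not kubelet code; the 32-core worker is an assumed example):

```python
CFS_PERIOD_US = 100_000  # 100ms, the default CFS period

def cfs_settings(cpu_request: float, cpu_limit: float, node_cores: int):
    """CPU shares and quota as Kubernetes derives them from a pod spec."""
    shares = int(cpu_request * 1024)           # relative weight under contention
    total_shares = node_cores * 1024           # what the node has to hand out
    quota_us = int(cpu_limit * CFS_PERIOD_US)  # hard cap per 100ms period
    return shares, total_shares, quota_us

# The 2-CPU request/limit pod from the slides, on a 32-core worker:
print(cfs_settings(cpu_request=2, cpu_limit=2, node_cores=32))  # (2048, 32768, 200000)
```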
Java in a Container
● JVM is able to detect # of cores via sysconf(_SC_NPROCESSORS_ONLN)
● Many libraries rely on Runtime.getRuntime().availableProcessors()
○ Jetty
○ ForkJoinPool
○ GC Threads
○ That mystery dependency...
Java in a Container
● Provide a base container that calculates the container’s resources!
● Detect # of “cores” assigned
○ /sys/fs/cgroup/cpu/cpu.cfs_quota_us divided by /sys/fs/cgroup/cpu/cpu.cfs_period_us
● Automatically tune the JVM:
○ -XX:ParallelGCThreads=${core_limit}
○ -XX:ConcGCThreads=${core_limit}
○ -Djava.util.concurrent.ForkJoinPool.common.parallelism=${core_limit}
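The base-container logic can be sketched as follows (a Python approximation; the cgroup v1 file names are the ones listed above, and the flag list mirrors the slide):

```python
import math
import os

def detect_core_limit(cgroup_dir="/sys/fs/cgroup/cpu"):
    """Derive the container's effective core count from its CFS quota/period."""
    try:
        with open(os.path.join(cgroup_dir, "cpu.cfs_quota_us")) as f:
            quota = int(f.read())
        with open(os.path.join(cgroup_dir, "cpu.cfs_period_us")) as f:
            period = int(f.read())
    except OSError:
        return os.cpu_count()  # not in a cgroup v1 container
    if quota <= 0:             # -1 means "no limit set"
        return os.cpu_count()
    return max(1, math.ceil(quota / period))

def jvm_flags(core_limit):
    """JVM options tuned to the container, per the slide."""
    return [
        f"-XX:ParallelGCThreads={core_limit}",
        f"-XX:ConcGCThreads={core_limit}",
        f"-Djava.util.concurrent.ForkJoinPool.common.parallelism={core_limit}",
    ]

print(jvm_flags(2))
```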
Java in a Container
● Use Linux preloading to override availableProcessors()

#include <stdlib.h>
#include <unistd.h>

/* HotSpot asks this function for the core count; preloading an override
   lets CONTAINER_CORE_LIMIT dictate the answer. */
int JVM_ActiveProcessorCount(void) {
  char* val = getenv("CONTAINER_CORE_LIMIT");
  return val != NULL ? atoi(val) : sysconf(_SC_NPROCESSORS_ONLN);
}

/* Build as a shared library and preload it (example file names):
   gcc -shared -fPIC -o libnumcpus.so numcpus.c
   LD_PRELOAD=/path/to/libnumcpus.so java ... */
https://engineering.squarespace.com/blog/2017/understanding-linux-container-scheduling
How do we monitor our applications?
Traditional Monitoring & Alerting
● Graphite does not scale well with ephemeral instances
● Easy to have combinatoric explosion of metrics
● Application and system alerts are tightly coupled
● Difficult to create alerts on SLAs
● Difficult to route alerts
Traditional Monitoring & Alerting
(diagram: the application pushes metrics to a central store; per-host checks probe application health, the service, and system health)
Traditional Monitoring & Alerting
● Efficient for ephemeral instances
● Stores tagged data
● Easy to have many smaller instances (per team or complex system)
● Prometheus Operator runs everything in Kubernetes!
Kubernetes Monitoring & Alerting
● Alerts are defined with the application code!
● Easy to define SLA alerts
● Routing is still difficult
Prometheus Operator (architecture diagrams)
How do we keep everything organized?
● Namespaces
○ Isolates groups of objects
■ Developer
■ Team
■ System or Service
○ Good for permission boundaries
○ Good for network boundaries
● Most objects are namespaced
apiVersion: v1
kind: Namespace
metadata:
name: core-services
annotations:
squarespace.net/contact: |
team@squarespace.com
spec:
finalizers:
- kubernetes
status:
phase: Active
● Need to keep certain objects up to date in each namespace
● Need to keep objects synchronized across different datacenters
○ RBAC Policies
○ Prometheus instance per team
○ Keys to access Ceph
○ External Service Endpoints
○ Consul configurations and keys
● kubectl gets too complicated to manage these…
● Everything gets out of sync very quickly
● Kubernetes CustomResourceDefinitions allow us to define types
○ Deploy a service in Kubernetes to manage Kubernetes!
(diagram: a Namespace Operator watches the API Server for new team definitions and keeps each team's namespace, e.g. SRE and Core Services, in sync)
How do we handle dependencies?
Dependency Management
(diagram: Microservice Pod running the Java microservice with fluentd and consul sidecars)
● Deployments are committed alongside the service code
● Deployments also define their own dependencies...
● How do you update Consul across 1 service? 5 services? 100 services?
Dependency Management
● Kubernetes 1.7 introduces Custom Initializers
○ Register Sidecar Initializer for Deployments
○ Deploy service
○ Inject sidecar containers
○ …
○ Profit!
(diagram: the Sidecar Injector adds FluentD and Consul sidecar containers to the pod template alongside the Java microservice)
Dependency Management
apiVersion: apps/v1beta1
kind: Deployment
metadata:
name: location
namespace: core-services
annotations:
sidecar.injector.squarespace.net/consul: "true"
apiVersion: injector.squarespace.net/v1alpha1
kind: Sidecar
metadata:
name: consul
spec:
annotation: sidecar.injector.squarespace.net/consul
containers:
- name: consul
image: consul:0.8.5
...
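The injector's core merge step can be sketched in Python (a simplification; the real initializer ran server-side against the Kubernetes API, and the dict shapes mirror the YAML above):

```python
def inject_sidecars(deployment: dict, sidecar: dict) -> dict:
    """Append a Sidecar resource's containers to a Deployment's pod template
    when the Deployment opts in via the Sidecar's annotation."""
    annotations = deployment["metadata"].get("annotations", {})
    if annotations.get(sidecar["spec"]["annotation"]) != "true":
        return deployment  # not opted in; leave untouched
    pod_spec = deployment["spec"]["template"]["spec"]
    pod_spec.setdefault("containers", []).extend(sidecar["spec"]["containers"])
    return deployment

deployment = {
    "metadata": {"annotations": {"sidecar.injector.squarespace.net/consul": "true"}},
    "spec": {"template": {"spec": {"containers": [{"name": "location"}]}}},
}
sidecar = {
    "spec": {
        "annotation": "sidecar.injector.squarespace.net/consul",
        "containers": [{"name": "consul", "image": "consul:0.8.5"}],
    },
}
injected = inject_sidecars(deployment, sidecar)
print([c["name"] for c in injected["spec"]["template"]["spec"]["containers"]])
# ['location', 'consul']
```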
Conclusion
● Kubernetes and containerization are hard…
○ Don’t give up!
● Services first!
○ Monitor the service, not the instances
● The Kubernetes API model is powerful!
○ Declare what you want and write code to manage that state
QUESTIONS?
Thank you!
squarespace.com/careers
Kevin Lynch
klynch@squarespace.com
Kubernetes @ Squarespace (SRE Portland Meetup October 2017)
Editor's Notes
  • #4: Small: Whatever works Ship features Medium: running into scaling problems code complexity issues A single engineer can’t understand the entire stack Large: Millions of customers depend on us Failure is not an option How do we plan for the future today?
  • #6: Split up the code Define smaller fault domains scalable units Clearly defined scope Engineers can confidently make changes move faster
  • #7: What does this mean for operations? Last year fewer than a dozen services existed; today more than 50 are in production or actively developed.
  • #8: Typical workflow for provisioning a VM at Squarespace. Currently takes about 30 minutes to provision a VM. There are definitely some optimizations to be made here: use VM templates (hard to generalize given space constraints, but not so much of a problem for microservices); use VMware vMotion and other tools for auto-migrating and finding free resources.
  • #9: The big takeaway: requires a robust discovery mechanism for services; can’t easily get by with static names. This can be as simple as DNS or load balancers, or something more complex (ZooKeeper, etcd, Consul); each has tradeoffs. Metrics: Graphite metrics are not meant to be ephemeral: they are long-lived metrics that are expensive to create and are not efficiently aggregated (no tagging support!). Difficult to control where data is coming from and how much data is coming in; easy to blow out disk or send faulty metrics. Centralized metrics can lead to… Alerts: Sensu alerts are per instance; system…
  • #10: Declarative infrastructure: state what you want, and Kubernetes will do the rest. A bit simplified, as there are a lot of moving parts.
  • #11: API objects == YAML. All objects are represented by YAML descriptions. Each object kind/apiVersion maps directly to an API endpoint. Metadata has information about the object. The magic lies in the Spec & Status.
  • #12: Pods are the building blocks in Kubernetes. These are our modern-day VMs. Everything else in the Kubernetes API makes pods easier to work with and reason about.
  • #14: AWS and GCE integrate seamlessly with Services
  • #15: Depends on Networking
  • #17: BGP peering with ToR not required… Can do full mesh in AWS but I won’t get into it here.
  • #20: Very SIMPLE. Each leaf is a Top of Rack switch. All devices are exactly the same number of segments away.
  • #22: Calico is backed by Etcd… It’s super easy to leverage this
  • #24: Not quite that easy...
  • #25: Not quite that easy...
  • #27: Not quite that easy...
  • #28: Our services could talk to each other now, but we hit terrible performance problems.
  • #29: Now that we know what each pod looks like in Kubernetes, this is our typical pod structure at Squarespace. Most microservices are written in Java with a customized framework; Fluentd for log shipping; a Consul sidecar for discovery. Seems simple enough, but there are so many problems we didn’t realize we had.
  • #30: Kubernetes resource constraints aren’t enough. Need an understanding of cgroups.
  • #31: Kubernetes resource constraints aren’t enough. Need an understanding of cgroups.
  • #32: Kubernetes resource constraints aren’t enough
  • #34: Can’t rely on shares: difficult to predict from inside the container how many shares are allocated on the whole machine… Could do some deeper introspection… Will allow for
  • #36: Push vs. pull metrics. Same Grafana, same ELK.
  • #38: Difficult to separate the health of the system from the health of the application. Difficult to separate the health of an instance from the service. Metrics are collected per host.
  • #39: Sensu: app and system alerts are tightly coupled. Overwhelming & confusing to everyone except the guy who designed the system; does not promote a sense of ownership. Hard to get a single view: Graphite checks vs. instance checks.
  • #42: Alerts are defined with code. Alerts are service oriented: only relevant alerts are defined (active requests, error rates, response times, # of instances up). Encourages developer ownership.
  • #43: Infrastructure as code
  • #44: I mentioned earlier that all of the other objects are meant to make it easier to work with pods... Most objects are namespaced, which is used for organization, security, and access control. How do we keep them in sync?
  • #48: Difficult to keep sidecars updated and correct across many service repositories; upgrading Consul can become a nightmare with versioning. In the Ansible world this was easy… host groups and roles could be applied from a centralized configuration.
  • #51: We learned a lot running Kubernetes in our datacenters