Microservices: What’s Missing…
Adrian Cockcroft @adrianco
Technology Fellow - Battery Ventures
March 2016
What does @adrianco do?
Technology Due Diligence on Deals
Presentations at Conferences
Presentations at Companies
Technical Advice for Portfolio Companies
Program Committee for Conferences
Networking with Interesting People
Tinkering with Technologies
Maintain Relationship with Cloud Vendors
http://www.slideshare.net/adriancockcroft
What’s Missing?
Trying out new content today
Discussion/feedback
O’Reilly Software Architecture Conference
New York April 13th - for the real thing…
Discussion Points
Failure injection testing
Versioning, Routing, Protocols
Timeouts and retries
Denormalized data models
Monitoring, Tracing
Simplicity through symmetry
See www.battery.com for a list of portfolio investments
Failure Injection Testing
Netflix Chaos Monkey and Simian Army
http://techblog.netflix.com/2011/07/netflix-simian-army.html
http://techblog.netflix.com/2014/10/fit-failure-injection-testing.html
http://techblog.netflix.com/2016/01/automated-failure-testing.html
Trust with Verification
Chaos Monkey - enforcing stateless business logic
Chaos Gorilla - enforcing zone isolation/replication
Chaos Kong - enforcing region isolation/replication
Security Monkey - watching for insecure configuration settings
Latency Monkey & FIT - inject errors to enforce robust dependencies
See over 100 NetflixOSS projects at netflix.github.com
Get “Technical Indigestion” reading techblog.netflix.com
Benefits of version aware routing
Immediately and safely introduce a new version
Canary test in production
Pin clients to a version so they can’t get disrupted
Change client or dependencies but not both at once
Eventually remove old versions
Incremental or infrequent “break the build” garbage collection
Versioning, Routing
Version numbering: Interface.Feature.Bugfix
V1.2.3 to V1.2.4 - Canary test then remove old version
V1.2.x to V1.3.x - Canary test then remove or keep both
Route V1.3.x clients to new version to get new feature
Remove V1.2.x only after V1.3.x is found to work for V1.2.x clients
V1.x.x to V2.x.x - Route clients to specific versions
Remove old server version when all old clients are gone
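The routing rules above can be sketched in Go. This is a minimal sketch, not how any particular router works: the Version type, parseVersion, compatible and route are hypothetical names, and the compatibility rule (same Interface number, server Feature at least the client's) is an assumption drawn from the Interface.Feature.Bugfix scheme.

```go
package main

import "fmt"

// Version follows the Interface.Feature.Bugfix numbering from the slide.
type Version struct{ Interface, Feature, Bugfix int }

// parseVersion reads strings such as "V1.2.3".
func parseVersion(s string) (Version, error) {
	var v Version
	_, err := fmt.Sscanf(s, "V%d.%d.%d", &v.Interface, &v.Feature, &v.Bugfix)
	return v, err
}

// compatible encodes the assumed rule: the interface must match and the
// server must offer at least the feature level the client was built against.
func compatible(client, server Version) bool {
	return client.Interface == server.Interface && server.Feature >= client.Feature
}

// newer reports whether a is a later version than b.
func newer(a, b Version) bool {
	if a.Interface != b.Interface {
		return a.Interface > b.Interface
	}
	if a.Feature != b.Feature {
		return a.Feature > b.Feature
	}
	return a.Bugfix > b.Bugfix
}

// route picks the newest server version that is compatible with the client.
func route(client Version, servers []Version) (Version, bool) {
	var best Version
	found := false
	for _, s := range servers {
		if compatible(client, s) && (!found || newer(s, best)) {
			best, found = s, true
		}
	}
	return best, found
}

func main() {
	servers := []Version{{1, 2, 4}, {1, 3, 0}, {2, 0, 0}}
	client, err := parseVersion("V1.2.3")
	if err != nil {
		fmt.Println("bad version:", err)
		return
	}
	if v, ok := route(client, servers); ok {
		fmt.Printf("route V1.2.3 client to V%d.%d.%d\n", v.Interface, v.Feature, v.Bugfix)
	}
}
```

Pinning a client would bypass route and return the pinned version directly. A V2.x.x server is never chosen for a V1 client, which matches routing clients to specific versions across interface changes.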
Timeouts and Retries
Connection timeout vs. request timeout confusion
Usually set up incorrectly, with global defaults
Systems collapse with “retry storms”
Timeouts too long, too many retries
Services doing work that can never be used
Connections and Requests
TCP makes a connection, HTTP makes a request
HTTP hopefully reuses connections for several requests
Both have different timeout and retry needs!
TCP timeout is purely a property of one network latency hop
HTTP timeout depends on the service and its dependencies
Timeouts and Retries
Bad config: Every service defaults to 5 second timeout, two retries
[Diagram: a call chain of Edge Service, Good Service, Good Service. After a failure the chain reads: Edge Service not responding, overloaded service not responding, Failed Service]
If anything breaks, everything upstream stops responding
Retries add unproductive work
Timeouts and Retries
Bad config: Every service defaults to 5 second timeout, two retries
[Diagram: Edge service responds slowly, Overloaded service, Partially failed service]
First request from Edge timed out so it ignores the successful response and keeps retrying. Middle service load increases as it’s doing work that isn’t being consumed
Timeout and Retry Fixes
Cascading timeout budget
Static settings that decrease from the edge
or dynamic budget passed with request
How often do retries actually succeed?
Don’t ask the same instance the same thing
Only retry on a different connection
Timeouts and Retries
Budgeted timeout, one retry
[Diagram: Edge Service, Good Service, Failed Service; timeouts labeled 5s, 1s, 1s; fast fail response after 2s]
Upstream timeout must always be longer than total downstream timeout * retries delay
No unproductive work while fast failing
Timeouts and Retries
Budgeted timeout, failover retry
[Diagram: Edge Service, Good Service, Failed Service and a second Good Service; timeouts labeled 5s, 1s, 1s; success response delayed 1s]
For replicated services with multiple instances never retry against a failed instance
No extra retries or unproductive work
Denormalized Data Models
“The Network is Reliable” http://dl.acm.org/citation.cfm?id=2655736
Distributed systems are inconsistent by nature
Clients are inconsistent with servers
Most caches are inconsistent
Versions are inconsistent
Get over it and
Deal with it
Denormalized Data Models
Any non-trivial organization has many databases
Cross references exist, inconsistencies exist
Microservices work best with individual simple stores
Scale, operate, mutate, fail them independently
NoSQL allows flexible schema/object versions
Denormalized Data Models
Build custom cross-datasource check/repair processes
Ensure all cross references are up to date
Immutability Changes Everything
http://highscalability.com/blog/2015/1/26/paper-immutability-changes-everything-by-pat-helland.html
Memories, Guesses and Apologies
https://blogs.msdn.microsoft.com/pathelland/2007/05/15/memories-guesses-and-apologies/
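A cross-datasource check can be as simple as scanning one store's references and reporting those the owning store no longer has. The sketch below uses in-memory maps as stand-ins for two stores; danglingRefs and the order/customer names are hypothetical, and the repair step is left out.

```go
package main

import (
	"fmt"
	"sort"
)

// danglingRefs returns the references (as "from -> to") whose target is
// missing from the owning store, sorted for stable output.
func danglingRefs(refs map[string]string, owners map[string]bool) []string {
	var missing []string
	for from, to := range refs {
		if !owners[to] {
			missing = append(missing, from+" -> "+to)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	// Stand-in for an order store that references IDs held by a customer store.
	orders := map[string]string{"order-1": "cust-a", "order-2": "cust-b", "order-3": "cust-z"}
	customers := map[string]bool{"cust-a": true, "cust-b": true}
	for _, m := range danglingRefs(orders, customers) {
		fmt.Println("needs repair:", m)
	}
}
```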
Monitoring Microservices
A Possible Hierarchy / How Many?
Continents: 3 to 5
Regions: 2-4 per Continent
Zones: 1-5 per Region
Services: 100’s per Zone
Versions: Many per Service
Containers: 1000’s per Version
Instances: 10,000’s
It’s much more challenging than just a large number of machines
Flow
Some tools can show the request flow across a few services
Interesting architectures have a lot of microservices! Flow visualization is a big challenge.
See http://www.slideshare.net/LappleApple/gilt-from-monolith-ruby-app-to-micro-service-scala-service-architecture
Simulated Microservices
Model and visualize microservices
Simulate interesting architectures
Generate large scale configurations
Eventually stress test real tools
See github.com/adrianco/spigo
Simulate Protocol Interactions in Go
Visualize with D3
ELB Load Balancer
Zuul API Proxy
Karyon Business Logic
Staash Data Access Layer
Priam Cassandra Datastore
Three Availability Zones
Spigo Nanoservice Structure
func Start(listener chan gotocol.Message) {
    ...
    for {
        select {
        case msg := <-listener:
            flow.Instrument(msg, name, hist)
            switch msg.Imposition {
            case gotocol.Hello: // get named by parent
                ...
            case gotocol.NameDrop: // someone new to talk to
                ...
            case gotocol.Put: // upstream request handler
                ...
                outmsg := gotocol.Message{gotocol.Replicate, listener, time.Now(),
                    msg.Ctx.NewParent(), msg.Intention}
                flow.AnnotateSend(outmsg, name)
                outmsg.GoSend(replicas)
            }
        case <-eurekaTicker.C: // poll the service registry
            ...
        }
    }
}
Skeleton code for replicating a Put message
Instrument incoming requests
Instrument outgoing requests
update trace context
Flow Trace Recording
[Diagram: staash in us-east-1 zoneC sends a Put to riak2; Replicate messages fan out to riak3, riak4 and riak9, and from riak9 to riak10 and riak8. Span labels: s896, s908, s909, s910, s912, s913]
us-east-1.zoneC.riak2 t98p895s896 Put
us-east-1.zoneA.riak3 t98p896s908 Replicate
us-east-1.zoneB.riak4 t98p896s909 Replicate
us-west-2.zoneA.riak9 t98p896s910 Replicate
us-west-2.zoneB.riak10 t98p910s912 Replicate
us-west-2.zoneC.riak8 t98p910s913 Replicate
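The trace lines above can be read back into a tree. This is a sketch that assumes the context string is t<trace>p<parent>s<span>, which is inferred from the lines themselves; Span and parseLine are hypothetical names, not Spigo's own types.

```go
package main

import (
	"fmt"
	"strings"
)

// Span is one recorded message: where it arrived and its trace context.
type Span struct {
	Node   string
	Trace  int
	Parent int
	ID     int
	Kind   string
}

// parseLine reads a line such as "us-east-1.zoneC.riak2 t98p895s896 Put".
func parseLine(line string) (Span, error) {
	f := strings.Fields(line)
	if len(f) != 3 {
		return Span{}, fmt.Errorf("want 3 fields, got %d", len(f))
	}
	s := Span{Node: f[0], Kind: f[2]}
	_, err := fmt.Sscanf(f[1], "t%dp%ds%d", &s.Trace, &s.Parent, &s.ID)
	return s, err
}

func main() {
	lines := []string{
		"us-east-1.zoneC.riak2 t98p895s896 Put",
		"us-east-1.zoneA.riak3 t98p896s908 Replicate",
		"us-east-1.zoneB.riak4 t98p896s909 Replicate",
		"us-west-2.zoneA.riak9 t98p896s910 Replicate",
		"us-west-2.zoneB.riak10 t98p910s912 Replicate",
		"us-west-2.zoneC.riak8 t98p910s913 Replicate",
	}
	children := map[int][]string{} // parent span id -> receiving nodes
	for _, l := range lines {
		s, err := parseLine(l)
		if err != nil {
			fmt.Println("skip:", err)
			continue
		}
		children[s.Parent] = append(children[s.Parent], s.Node)
	}
	fmt.Println("sent by span 896:", children[896])
	fmt.Println("sent by span 910:", children[910])
}
```

Grouping by parent shows the two-stage fan-out: span 896 (the Put at riak2) replicates to three nodes, and span 910 (riak9 in us-west-2) replicates to the other two.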
Open Zipkin
A common format for trace annotations
A Java tool for visualizing traces
Standardization effort to fold in other formats
Driven by Adrian Cole (currently at Pivotal)
Extended to load Spigo generated trace files
Zipkin Trace Dependencies
Trace for one Spigo Flow
Definition of an architecture
{
"arch": "lamp",
"description":"Simple LAMP stack",
"version": "arch-0.0",
"victim": "webserver",
"services": [
{ "name": "rds-mysql", "package": "store", "count": 2, "regions": 1, "dependencies": [] },
{ "name": "memcache", "package": "store", "count": 1, "regions": 1, "dependencies": [] },
{ "name": "webserver", "package": "monolith", "count": 18, "regions": 1, "dependencies": ["memcache", "rds-mysql"] },
{ "name": "webserver-elb", "package": "elb", "count": 0, "regions": 1, "dependencies": ["webserver"] },
{ "name": "www", "package": "denominator", "count": 0, "regions": 0, "dependencies": ["webserver-elb"] }
]
}
Header includes chaos monkey victim
name: new tier name
package: tier package
count: node count
regions: 0 = non-regional
dependencies: list of tier dependencies
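An architecture file in this shape maps directly onto Go structs. The sketch below is an assumption about how one might load it with encoding/json; the type names and the helper dependentsOf are hypothetical and are not taken from the Spigo source. The JSON literal is a shortened copy of the LAMP example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Service is one tier in the architecture definition.
type Service struct {
	Name         string   `json:"name"`
	Package      string   `json:"package"`
	Count        int      `json:"count"`
	Regions      int      `json:"regions"` // 0 = non-regional
	Dependencies []string `json:"dependencies"`
}

// Architecture is the whole file: a header plus the list of tiers.
type Architecture struct {
	Arch        string    `json:"arch"`
	Description string    `json:"description"`
	Version     string    `json:"version"`
	Victim      string    `json:"victim"` // chaos monkey victim
	Services    []Service `json:"services"`
}

// parseArch decodes an architecture definition.
func parseArch(data []byte) (Architecture, error) {
	var a Architecture
	err := json.Unmarshal(data, &a)
	return a, err
}

// dependentsOf lists the tiers that name the given tier as a dependency.
func dependentsOf(a Architecture, name string) []string {
	var out []string
	for _, s := range a.Services {
		for _, d := range s.Dependencies {
			if d == name {
				out = append(out, s.Name)
			}
		}
	}
	return out
}

const lampJSON = `{
  "arch": "lamp", "description": "Simple LAMP stack",
  "version": "arch-0.0", "victim": "webserver",
  "services": [
    {"name": "rds-mysql", "package": "store", "count": 2, "regions": 1, "dependencies": []},
    {"name": "webserver", "package": "monolith", "count": 18, "regions": 1, "dependencies": ["rds-mysql"]},
    {"name": "webserver-elb", "package": "elb", "count": 0, "regions": 1, "dependencies": ["webserver"]}
  ]
}`

func main() {
	a, err := parseArch([]byte(lampJSON))
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println("victim:", a.Victim, "is used by", dependentsOf(a, a.Victim))
}
```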
Running Spigo
$ ./spigo -a lamp -j -d 2
2016/01/26 23:04:05 Loading architecture from json_arch/lamp_arch.json
2016/01/26 23:04:05 lamp.edda: starting
2016/01/26 23:04:05 Architecture: lamp Simple LAMP stack
2016/01/26 23:04:05 architecture: scaling to 100%
2016/01/26 23:04:05 lamp.us-east-1.zoneB.eureka01....eureka.eureka: starting
2016/01/26 23:04:05 lamp.us-east-1.zoneA.eureka00....eureka.eureka: starting
2016/01/26 23:04:05 lamp.us-east-1.zoneC.eureka02....eureka.eureka: starting
2016/01/26 23:04:05 Starting: {rds-mysql store 1 2 []}
2016/01/26 23:04:05 Starting: {memcache store 1 1 []}
2016/01/26 23:04:05 Starting: {webserver monolith 1 18 [memcache rds-mysql]}
2016/01/26 23:04:05 Starting: {webserver-elb elb 1 0 [webserver]}
2016/01/26 23:04:05 Starting: {www denominator 0 0 [webserver-elb]}
2016/01/26 23:04:05 lamp.*.*.www00....www.denominator activity rate 10ms
2016/01/26 23:04:06 chaosmonkey delete: lamp.us-east-1.zoneC.webserver02....webserver.monolith
2016/01/26 23:04:07 asgard: Shutdown
2016/01/26 23:04:07 lamp.us-east-1.zoneB.eureka01....eureka.eureka: closing
2016/01/26 23:04:07 lamp.us-east-1.zoneA.eureka00....eureka.eureka: closing
2016/01/26 23:04:07 lamp.us-east-1.zoneC.eureka02....eureka.eureka: closing
2016/01/26 23:04:07 spigo: complete
2016/01/26 23:04:07 lamp.edda: closing
-a architecture lamp
-j graph json/lamp.json
-d run for 2 seconds
Riak IoT Architecture
{
"arch": "riak",
"description":"Riak IoT ingestion example for the RICON 2015 presentation",
"version": "arch-0.0",
"victim": "",
"services": [
{ "name": "riakTS", "package": "riak", "count": 6, "regions": 1, "dependencies": ["riakTS", "eureka"]},
{ "name": "ingester", "package": "staash", "count": 6, "regions": 1, "dependencies": ["riakTS"]},
{ "name": "ingestMQ", "package": "karyon", "count": 3, "regions": 1, "dependencies": ["ingester"]},
{ "name": "riakKV", "package": "riak", "count": 3, "regions": 1, "dependencies": ["riakKV"]},
{ "name": "enricher", "package": "staash", "count": 6, "regions": 1, "dependencies": ["riakKV", "ingestMQ"]},
{ "name": "enrichMQ", "package": "karyon", "count": 3, "regions": 1, "dependencies": ["enricher"]},
{ "name": "analytics", "package": "karyon", "count": 6, "regions": 1, "dependencies": ["ingester"]},
{ "name": "analytics-elb", "package": "elb", "count": 0, "regions": 1, "dependencies": ["analytics"]},
{ "name": "analytics-api", "package": "denominator", "count": 0, "regions": 0, "dependencies": ["analytics-elb"]},
{ "name": "normalization", "package": "karyon", "count": 6, "regions": 1, "dependencies": ["enrichMQ"]},
{ "name": "iot-elb", "package": "elb", "count": 0, "regions": 1, "dependencies": ["normalization"]},
{ "name": "iot-api", "package": "denominator", "count": 0, "regions": 0, "dependencies": ["iot-elb"]},
{ "name": "stream", "package": "karyon", "count": 6, "regions": 1, "dependencies": ["ingestMQ"]},
{ "name": "stream-elb", "package": "elb", "count": 0, "regions": 1, "dependencies": ["stream"]},
{ "name": "stream-api", "package": "denominator", "count": 0, "regions": 0, "dependencies": ["stream-elb"]}
]
}
name: new tier name
package: tier package
count: node count
regions: 0 = non-regional
dependencies: list of tier dependencies
Single Region Riak IoT
[Diagram, built up over several slides]
IoT Ingestion Endpoint, Stream Endpoint, Analytics Endpoint
Load Balancer (one per endpoint)
Normalization Services, Stream Service, Analytics Service
Enrich Message Queue
Enricher Services, Riak KV
Ingest Message Queue
Ingester Service
Riak TS
Two Region Riak IoT
[Diagram: IoT Ingestion Endpoint, Stream Endpoint, Analytics Endpoint; East Region Ingestion; West Region Ingestion; Multi Region TS Analytics]
What’s the response time of the stream endpoint?
Response Times
What’s the response time of a simple service?
[Diagram: www, elb, webservers, memcached, rds-mysql, rds-mysql]
What’s the response time of an even simpler storage backed web service?
[Diagram: load generator, web service, memcached, mysql, disk volume]
Measuring Response Time With Histograms
Changes made to codahale/hdrhistogram
Changes made to go-kit/kit/metrics
Implementation in adrianco/spigo/collect
What to measure?
[Diagram: Client sends GetRequest to Server, Server returns GetResponse; timestamps marked on the Client Time and Server Time lines]
Client Send CS
Server Receive SR
Server Send SS
Client Receive CR
Response: CR-CS
Service: SS-SR
Network: SR-CS
Network: CR-SS
Net Round Trip: (SR-CS) + (CR-SS) = (CR-CS) - (SS-SR)
Go-Kit Histogram Collection
const (
    maxHistObservable = 1000000 // nanoseconds!
    sampleCount       = 500
)

func NewHist(name string) metrics.Histogram {
    var h metrics.Histogram
    if name != "" && archaius.Conf.Collect {
        // track median and 99%ile
        h = expvar.NewHistogram(name, 1000, maxHistObservable, 1, []int{50, 99}...)
        if sampleMap == nil {
            sampleMap = make(map[metrics.Histogram][]int64)
        }
        // slice for first 500 values as samples for export to Guesstimate
        sampleMap[h] = make([]int64, 0, sampleCount)
        return h
    }
    return nil
}

func Measure(h metrics.Histogram, d time.Duration) {
    if h != nil && archaius.Conf.Collect {
        if d > maxHistObservable {
            h.Observe(int64(maxHistObservable))
        } else {
            h.Observe(int64(d))
        }
        s := sampleMap[h]
        if s != nil && len(s) < sampleCount {
            sampleMap[h] = append(s, int64(d))
        }
    }
}
Spigo Histogram Results
name: storage.*.*.load00....load.denominator_resp
count: 1978
gauges: map[50:126975 99:278527]
From, To, Count, Prob, Bar
28672, 29695, 1, 0.0005, :
31744, 32767, 1, 0.0005, :
34816, 36863, 2, 0.0010, :#
36864, 38911, 8, 0.0040, |######
38912, 40959, 13, 0.0066, |##########
40960, 43007, 18, 0.0091, |##############
43008, 45055, 12, 0.0061, |#########
45056, 47103, 26, 0.0131, |####################
47104, 49151, 24, 0.0121, |##################
49152, 51199, 33, 0.0167, |#########################
51200, 53247, 29, 0.0147, |######################
53248, 55295, 35, 0.0177, |###########################
55296, 57343, 39, 0.0197, |##############################
57344, 59391, 35, 0.0177, |###########################
59392, 61439, 43, 0.0217, |#################################
61440, 63487, 31, 0.0157, |########################
63488, 65535, 39, 0.0197, |##############################
65536, 69631, 74, 0.0374, |#########################################################
69632, 73727, 65, 0.0329, |##################################################
73728, 77823, 57, 0.0288, |############################################
77824, 81919, 37, 0.0187, |############################
81920, 86015, 37, 0.0187, |############################
86016, 90111, 30, 0.0152, |#######################
90112, 94207, 39, 0.0197, |##############################
94208, 98303, 28, 0.0142, |#####################
98304, 102399, 30, 0.0152, |#######################
102400, 106495, 31, 0.0157, |########################
106496, 110591, 20, 0.0101, |###############
110592, 114687, 26, 0.0131, |####################
114688, 118783, 44, 0.0222, |##################################
118784, 122879, 41, 0.0207, |###############################
122880, 126975, 54, 0.0273, |##########################################
126976, 131071, 51, 0.0258, |#######################################
131072, 139263, 114, 0.0576, |########################################################################################
139264, 147455, 123, 0.0622, |###############################################################################################
147456, 155647, 127, 0.0642, |###################################################################################################
155648, 163839, 102, 0.0516, |###############################################################################
163840, 172031, 90, 0.0455, |######################################################################
172032, 180223, 65, 0.0329, |##################################################
180224, 188415, 43, 0.0217, |#################################
188416, 196607, 60, 0.0303, |##############################################
196608, 204799, 54, 0.0273, |##########################################
204800, 212991, 29, 0.0147, |######################
212992, 221183, 21, 0.0106, |################
221184, 229375, 25, 0.0126, |###################
229376, 237567, 18, 0.0091, |##############
237568, 245759, 15, 0.0076, |###########
245760, 253951, 9, 0.0046, |#######
253952, 262143, 8, 0.0040, |######
262144, 278527, 10, 0.0051, |#######
278528, 294911, 6, 0.0030, |####
294912, 311295, 2, 0.0010, |#
327680, 344063, 2, 0.0010, :#
344064, 360447, 1, 0.0005, |
376832, 393215, 1, 0.0005, :
Reading the output:
count, gauges: total count, median and 99th percentile values
From, To: response time distribution measured in nanoseconds using High Dynamic Range Histogram
Prob: normalized probability
:# Zero counts skipped
|# Contiguous buckets
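The Prob column and the percentile gauges can be derived from the bucket counts. The sketch below is an illustration and not the histogram library's own algorithm: Bucket, total and percentile are hypothetical names, and it reports the upper bound of the bucket where the cumulative count reaches the requested percentile. The rows are only the first five from the table, so the probabilities differ from the slide, which divides by the full count of 1978.

```go
package main

import "fmt"

// Bucket is one row of the histogram output.
type Bucket struct{ From, To, Count int64 }

// total sums the counts of all buckets.
func total(b []Bucket) int64 {
	var n int64
	for _, x := range b {
		n += x.Count
	}
	return n
}

// percentile returns the upper bound of the first bucket at which the
// cumulative count reaches p percent of the total (0 for no data).
func percentile(b []Bucket, p float64) int64 {
	want := p / 100 * float64(total(b))
	var cum int64
	for _, x := range b {
		cum += x.Count
		if float64(cum) >= want {
			return x.To
		}
	}
	return 0
}

func main() {
	rows := []Bucket{
		{28672, 29695, 1}, {31744, 32767, 1}, {34816, 36863, 2},
		{36864, 38911, 8}, {38912, 40959, 13},
	}
	n := total(rows)
	for _, r := range rows {
		// normalized probability = bucket count / total count
		fmt.Printf("%d, %d, %d, %.4f\n", r.From, r.To, r.Count, float64(r.Count)/float64(n))
	}
	fmt.Println("median bucket upper bound:", percentile(rows, 50))
}
```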
See http://www.getguesstimate.com/models/1307
[Guesstimate model: memcached hit %, memcached response, mysql response, service cpu time]
[Response time distribution modes: memcached hit mode, mysql cache hit mode, mysql disk access mode]
Hit rates: memcached 40% mysql 70%
Hit rates: memcached 60% mysql 70%
Hit rates: memcached 20% mysql 90%
Golang Guesstimate Interface
https://github.com/adrianco/goguesstimate
{
  "space": {
    "name": "gotest",
    "description": "Testing",
    "is_private": "true",
    "graph": {
      "metrics": [
        {"id": "AB", "readableId": "AB", "name": "memcached", "location": {"row": 2, "column": 4}},
        {"id": "AC", "readableId": "AC", "name": "memcached percent", "location": {"row": 2, "column": 3}},
        {"id": "AD", "readableId": "AD", "name": "staash cpu", "location": {"row": 3, "column": 3}},
        {"id": "AE", "readableId": "AE", "name": "staash", "location": {"row": 3, "column": 2}}
      ],
      "guesstimates": [
        {"metric": "AB", "input": null, "guesstimateType": "DATA", "data":
          [119958,6066,13914,9595,6773,5867,2347,1333,9900,9404,
           13518,9021,7915,3733,10244,5461,12243,7931,9044,11706,
           5706,22861,9022,48661,15158,28995,16885,9564,17915,6610,
           7080,7065,12992,35431,11910,11465,14455,25790,8339,9991]},
        {"metric": "AC", "input": "40", "guesstimateType": "POINT"},
        {"metric": "AD", "input": "[1000,4000]", "guesstimateType": "LOGNORMAL"},
        {"metric": "AE", "input": "=100+((randomInt(0,100)>AC)?AB:AD)", "guesstimateType": "FUNCTION"}
      ]
    }
  }
}
See http://www.getguesstimate.com
Simplicity through symmetry
Symmetry
Invariants
Stable assertions
No special cases
“We see the world as increasingly more complex and chaotic
because we use inadequate concepts to explain it. When we
understand something, we no longer see it as chaotic or complex.”
Jamshid Gharajedaghi - 2011
Systems Thinking: Managing Chaos and Complexity: A Platform for Designing Business Architecture
Q&A
Adrian Cockcroft @adrianco
http://slideshare.com/adriancockcroft
Technology Fellow - Battery Ventures

What's Missing? Microservices Meetup at Cisco

  • 1. Microservices: What’s Missing… Adrian Cockcroft @adrianco Technology Fellow - Battery Ventures March 2016
  • 2. What does @adrianco do? Technology Due Diligence on Deals; Presentations at Conferences; Presentations at Companies; Technical Advice for Portfolio Companies; Program Committee for Conferences; Networking with Interesting People; Tinkering with Technologies; Maintain Relationship with Cloud Vendors. http://www.slideshare.net/adriancockcroft
  • 3. What’s Missing? Trying out new content today; Discussion/feedback; O’Reilly Software Architecture Conference, New York, April 13th - for the real thing…
  • 4. Discussion Points: Failure injection testing; Versioning, Routing, Protocols; Timeouts and retries; Denormalized data models; Monitoring, Tracing; Simplicity through symmetry. See www.battery.com for a list of portfolio investments
  • 5. Failure Injection Testing: Netflix Chaos Monkey and Simian Army. http://techblog.netflix.com/2011/07/netflix-simian-army.html http://techblog.netflix.com/2014/10/fit-failure-injection-testing.html http://techblog.netflix.com/2016/01/automated-failure-testing.html
  • 6. Trust with Verification: Chaos Monkey - enforcing stateless business logic; Chaos Gorilla - enforcing zone isolation/replication; Chaos Kong - enforcing region isolation/replication; Security Monkey - watching for insecure configuration settings; Latency Monkey & FIT - inject errors to enforce robust dependencies; See over 100 NetflixOSS projects at netflix.github.com; Get “Technical Indigestion” reading techblog.netflix.com
  • 7. Benefits of version aware routing: Immediately and safely introduce a new version; Canary test in production; Pin clients to a version so they can’t get disrupted; Change client or dependencies but not both at once; Eventually remove old versions; Incremental or infrequent “break the build” garbage collection.
  • 8. Versioning, Routing. Version numbering: Interface.Feature.Bugfix. V1.2.3 to V1.2.4 - Canary test then remove old version. V1.2.x to V1.3.x - Canary test then remove or keep both; route V1.3.x clients to new version to get new feature; remove V1.2.x only after V1.3.x is found to work for V1.2.x clients. V1.x.x to V2.x.x - Route clients to specific versions; remove old server version when all old clients are gone.
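The routing rules on slide 8 can be sketched in Go. This is only an illustration under stated assumptions: the names `Version` and `Route` are hypothetical and do not come from the talk or from any Netflix library. The sketch assumes a client may only talk to a server with the same Interface number and an equal or higher Feature number, that a client stays pinned to its own Feature level while such a server exists, and that any newer Feature level offered has already been canary tested for older clients.

```go
package main

import "fmt"

// Version follows the slide's Interface.Feature.Bugfix numbering.
type Version struct {
	Interface, Feature, Bugfix int
}

// Route picks a server version for a client (hypothetical helper).
// Assumed rules: the Interface number must match, the server Feature
// number must not be lower than the client's, the closest Feature level
// wins (pinning), and within a Feature level the highest Bugfix wins.
func Route(client Version, servers []Version) (Version, bool) {
	var best Version
	found := false
	for _, s := range servers {
		if s.Interface != client.Interface || s.Feature < client.Feature {
			continue // incompatible: different interface or missing features
		}
		if !found || s.Feature < best.Feature ||
			(s.Feature == best.Feature && s.Bugfix > best.Bugfix) {
			best, found = s, true
		}
	}
	return best, found
}

func main() {
	servers := []Version{{1, 2, 4}, {1, 3, 0}, {2, 0, 0}}
	clients := []Version{{1, 2, 3}, {1, 3, 0}, {2, 0, 0}, {3, 0, 0}}
	for _, c := range clients {
		v, ok := Route(c, servers)
		fmt.Println("client", c, "routed to", v, "found:", ok)
	}
}
```

Once the V1.2.x servers are removed from the list, the same function sends V1.2.x clients to V1.3.x, which is the "remove V1.2.x only after V1.3.x is found to work" step.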
  • 9. Timeouts and Retries: Connection timeout vs. request timeout confusion; Usually set up incorrectly, with global defaults; Systems collapse with “retry storms”; Timeouts too long, too many retries; Services doing work that can never be used.
  • 10. Connections and Requests: TCP makes a connection, HTTP makes a request; HTTP hopefully reuses connections for several requests; Both have different timeout and retry needs! TCP timeout is purely a property of one network latency hop; HTTP timeout depends on the service and its dependencies.
  • 11. Timeouts and Retries. Bad config: every service defaults to a 5 second timeout and two retries. [Diagram labels: Edge Service, Good Service, Good Service, Failed Service; Edge Service not responding; Overloaded service not responding.] If anything breaks, everything upstream stops responding. Retries add unproductive work.
  • 12. Timeouts and Retries. Bad config: every service defaults to a 5 second timeout and two retries. [Diagram labels: Edge service responds slowly; Overloaded service; Partially failed service.] The first request from the Edge timed out, so it ignores the successful response and keeps retrying. Middle service load increases as it’s doing work that isn’t being consumed.
  • 13. Timeout and Retry Fixes: Cascading timeout budget; Static settings that decrease from the edge or dynamic budget passed with request; How often do retries actually succeed? Don’t ask the same instance the same thing; Only retry on a different connection.
  • 14. Timeouts and Retries: budgeted timeout, one retry. [Diagram labels: Edge Service, Good Service, Failed Service; timeouts 5s, 1s, 1s.] Fast fail response after 2s. Upstream timeout must always be longer than total downstream timeout * retries delay. No unproductive work while fast failing.
  • 15. Timeouts and Retries: budgeted timeout, failover retry. [Diagram labels: Edge Service, Good Service, Failed Service, Good Service; timeouts 5s, 1s, 1s.] For replicated services with multiple instances, never retry against a failed instance. No extra retries or unproductive work. Success response delayed 1s.
  • 16. Denormalized Data Models: “The Network is Reliable” ( http://dl.acm.org/citation.cfm?id=2655736 ); Distributed systems are inconsistent by nature; Clients are inconsistent with servers; Most caches are inconsistent; Versions are inconsistent; Get over it and deal with it.
  • 17. Denormalized Data Models: Any non-trivial organization has many databases; Cross references exist, inconsistencies exist; Microservices work best with individual simple stores; Scale, operate, mutate, fail them independently; NoSQL allows flexible schema/object versions.
  • 18. Denormalized Data Models: Build custom cross-datasource check/repair processes; Ensure all cross references are up to date. Immutability Changes Everything ( http://highscalability.com/blog/2015/1/26/paper-immutability-changes-everything-by-pat-helland.html ); Memories, Guesses and Apologies ( https://blogs.msdn.microsoft.com/pathelland/2007/05/15/memories-guesses-and-apologies/ )
  • 20. A Possible Hierarchy (level: how many?). Continents: 3 to 5; Regions: 2-4 per Continent; Zones: 1-5 per Region; Services: 100’s per Zone; Versions: Many per Service; Containers: 1000’s per Version; Instances: 10,000’s. It’s much more challenging than just a large number of machines.
  • 21. Flow
  • 22. Some tools can show the request flow across a few services
  • 23. Interesting architectures have a lot of microservices! Flow visualization is a big challenge. See http://www.slideshare.net/LappleApple/gilt-from-monolith-ruby-app-to-micro-service-scala-service-architecture
  • 24. Simulated Microservices: Model and visualize microservices; Simulate interesting architectures; Generate large scale configurations; Eventually stress test real tools. See github.com/adrianco/spigo. Simulate Protocol Interactions in Go; Visualize with D3. [Diagram labels: ELB Load Balancer; Zuul API Proxy; Karyon Business Logic; Staash Data Access Layer; Priam Cassandra Datastore; Three Availability Zones]
  • 25. Spigo Nanoservice Structure. Skeleton code for replicating a Put message:

      func Start(listener chan gotocol.Message) {
          ...
          for {
              select {
              case msg := <-listener:
                  flow.Instrument(msg, name, hist)
                  switch msg.Imposition {
                  case gotocol.Hello: // get named by parent
                      ...
                  case gotocol.NameDrop: // someone new to talk to
                      ...
                  case gotocol.Put: // upstream request handler
                      ...
                      outmsg := gotocol.Message{gotocol.Replicate, listener, time.Now(), msg.Ctx.NewParent(), msg.Intention}
                      flow.AnnotateSend(outmsg, name)
                      outmsg.GoSend(replicas)
                  }
              case <-eurekaTicker.C: // poll the service registry
                  ...
              }
          }
      }

      Callouts: Instrument incoming requests; Instrument outgoing requests; update trace context.
  • 26. Flow Trace Recording. Trace records (node, trace/parent/span ID, message):

      us-east-1.zoneC.riak2    t98p895s896   Put
      us-east-1.zoneA.riak3    t98p896s908   Replicate
      us-east-1.zoneB.riak4    t98p896s909   Replicate
      us-west-2.zoneA.riak9    t98p896s910   Replicate
      us-west-2.zoneB.riak10   t98p910s912   Replicate
      us-west-2.zoneC.riak8    t98p910s913   Replicate

      [Diagram labels: staash us-east-1 zoneC; riak2 us-east-1 zoneC; riak3 us-east-1 zoneA; riak4 us-east-1 zoneB; riak9 us-west-2 zoneA; riak10 us-west-2 zoneB; riak8 us-west-2 zoneC; Put s896; Replicate s908, s909, s910, s912, s913]
  • 27. Open Zipkin: A common format for trace annotations; A Java tool for visualizing traces; Standardization effort to fold in other formats; Driven by Adrian Cole (currently at Pivotal); Extended to load Spigo generated trace files.
  • 30. Trace for one Spigo Flow
  • 31. Definition of an architecture

      {
        "arch": "lamp",
        "description": "Simple LAMP stack",
        "version": "arch-0.0",
        "victim": "webserver",
        "services": [
          { "name": "rds-mysql", "package": "store", "count": 2, "regions": 1, "dependencies": [] },
          { "name": "memcache", "package": "store", "count": 1, "regions": 1, "dependencies": [] },
          { "name": "webserver", "package": "monolith", "count": 18, "regions": 1, "dependencies": ["memcache", "rds-mysql"] },
          { "name": "webserver-elb", "package": "elb", "count": 0, "regions": 1, "dependencies": ["webserver"] },
          { "name": "www", "package": "denominator", "count": 0, "regions": 0, "dependencies": ["webserver-elb"] }
        ]
      }

      Callouts: Header includes chaos monkey victim; New tier name; Tier package; 0 = non Regional; Node count; List of tier dependencies.
  • 32. Running Spigo

      $ ./spigo -a lamp -j -d 2
      2016/01/26 23:04:05 Loading architecture from json_arch/lamp_arch.json
      2016/01/26 23:04:05 lamp.edda: starting
      2016/01/26 23:04:05 Architecture: lamp Simple LAMP stack
      2016/01/26 23:04:05 architecture: scaling to 100%
      2016/01/26 23:04:05 lamp.us-east-1.zoneB.eureka01....eureka.eureka: starting
      2016/01/26 23:04:05 lamp.us-east-1.zoneA.eureka00....eureka.eureka: starting
      2016/01/26 23:04:05 lamp.us-east-1.zoneC.eureka02....eureka.eureka: starting
      2016/01/26 23:04:05 Starting: {rds-mysql store 1 2 []}
      2016/01/26 23:04:05 Starting: {memcache store 1 1 []}
      2016/01/26 23:04:05 Starting: {webserver monolith 1 18 [memcache rds-mysql]}
      2016/01/26 23:04:05 Starting: {webserver-elb elb 1 0 [webserver]}
      2016/01/26 23:04:05 Starting: {www denominator 0 0 [webserver-elb]}
      2016/01/26 23:04:05 lamp.*.*.www00....www.denominator activity rate 10ms
      2016/01/26 23:04:06 chaosmonkey delete: lamp.us-east-1.zoneC.webserver02....webserver.monolith
      2016/01/26 23:04:07 asgard: Shutdown
      2016/01/26 23:04:07 lamp.us-east-1.zoneB.eureka01....eureka.eureka: closing
      2016/01/26 23:04:07 lamp.us-east-1.zoneA.eureka00....eureka.eureka: closing
      2016/01/26 23:04:07 lamp.us-east-1.zoneC.eureka02....eureka.eureka: closing
      2016/01/26 23:04:07 spigo: complete
      2016/01/26 23:04:07 lamp.edda: closing

      Callouts: -a architecture lamp; -j graph json/lamp.json; -d run for 2 seconds.
  • 33. Riak IoT Architecture

      {
        "arch": "riak",
        "description": "Riak IoT ingestion example for the RICON 2015 presentation",
        "version": "arch-0.0",
        "victim": "",
        "services": [
          { "name": "riakTS", "package": "riak", "count": 6, "regions": 1, "dependencies": ["riakTS", "eureka"]},
          { "name": "ingester", "package": "staash", "count": 6, "regions": 1, "dependencies": ["riakTS"]},
          { "name": "ingestMQ", "package": "karyon", "count": 3, "regions": 1, "dependencies": ["ingester"]},
          { "name": "riakKV", "package": "riak", "count": 3, "regions": 1, "dependencies": ["riakKV"]},
          { "name": "enricher", "package": "staash", "count": 6, "regions": 1, "dependencies": ["riakKV", "ingestMQ"]},
          { "name": "enrichMQ", "package": "karyon", "count": 3, "regions": 1, "dependencies": ["enricher"]},
          { "name": "analytics", "package": "karyon", "count": 6, "regions": 1, "dependencies": ["ingester"]},
          { "name": "analytics-elb", "package": "elb", "count": 0, "regions": 1, "dependencies": ["analytics"]},
          { "name": "analytics-api", "package": "denominator", "count": 0, "regions": 0, "dependencies": ["analytics-elb"]},
          { "name": "normalization", "package": "karyon", "count": 6, "regions": 1, "dependencies": ["enrichMQ"]},
          { "name": "iot-elb", "package": "elb", "count": 0, "regions": 1, "dependencies": ["normalization"]},
          { "name": "iot-api", "package": "denominator", "count": 0, "regions": 0, "dependencies": ["iot-elb"]},
          { "name": "stream", "package": "karyon", "count": 6, "regions": 1, "dependencies": ["ingestMQ"]},
          { "name": "stream-elb", "package": "elb", "count": 0, "regions": 1, "dependencies": ["stream"]},
          { "name": "stream-api", "package": "denominator", "count": 0, "regions": 0, "dependencies": ["stream-elb"]}
        ]
      }

      Callouts: New tier name; Tier package; Node count; List of tier dependencies; 0 = non Regional.
  • 35. Single Region Riak IoT IoT Ingestion Endpoint Stream Endpoint Analytics Endpoint
  • 36. Single Region Riak IoT IoT Ingestion Endpoint Stream Endpoint Analytics Endpoint Load Balancer Load Balancer Load Balancer
  • 37. Single Region Riak IoT IoT Ingestion Endpoint Stream Endpoint Analytics Endpoint Load Balancer Normalization Services Load Balancer Load Balancer Stream Service Analytics Service
  • 38. Single Region Riak IoT IoT Ingestion Endpoint Stream Endpoint Analytics Endpoint Load Balancer Normalization Services Enrich Message Queue Riak KV Enricher Services Load Balancer Load Balancer Stream Service Analytics Service
  • 39. Single Region Riak IoT IoT Ingestion Endpoint Stream Endpoint Analytics Endpoint Load Balancer Normalization Services Enrich Message Queue Riak KV Enricher Services Ingest Message Queue Load Balancer Load Balancer Stream Service Analytics Service
  • 40. Single Region Riak IoT IoT Ingestion Endpoint Stream Endpoint Analytics Endpoint Load Balancer Normalization Services Enrich Message Queue Riak KV Enricher Services Ingest Message Queue Load Balancer Load Balancer Stream Service Riak TS Analytics Service Ingester Service
  • 41. Two Region Riak IoT IoT Ingestion Endpoint Stream Endpoint Analytics Endpoint East Region Ingestion West Region Ingestion Multi Region TS Analytics
  • 42. Two Region Riak IoT IoT Ingestion Endpoint Stream Endpoint Analytics Endpoint East Region Ingestion West Region Ingestion Multi Region TS Analytics What’s the response time of the stream endpoint?
  • 44. What’s the response time of a simple service? [Diagram labels: memcached, rds-mysql, rds-mysql, webservers, elb, www]
  • 45. What’s the response time of an even simpler storage backed web service? [Diagram labels: memcached, mysql, disk volume, web service, load generator]
  • 47. Changes made to codahale/hdrhistogram; Changes made to go-kit/kit/metrics; Implementation in adrianco/spigo/collect
  • 48. What to measure? [Diagram labels: Client, Server; GetRequest, GetResponse; Client Time, Server Time.] Timestamps: Client Send (CS); Server Receive (SR); Server Send (SS); Client Receive (CR).
  • 49. What to measure? [Diagram labels: Client, Server; GetRequest, GetResponse; Client Time, Server Time.] Timestamps: Client Send (CS); Server Receive (SR); Server Send (SS); Client Receive (CR). Response = CR - CS; Service = SS - SR; Network (request) = SR - CS; Network (response) = CR - SS; Net Round Trip = (SR - CS) + (CR - SS) = (CR - CS) - (SS - SR).
  • 50. Go-Kit Histogram Collection

      const (
          maxHistObservable = 1000000
          sampleCount       = 500
      )

      func NewHist(name string) metrics.Histogram {
          var h metrics.Histogram
          if name != "" && archaius.Conf.Collect {
              h = expvar.NewHistogram(name, 1000, maxHistObservable, 1, []int{50, 99}...)
              if sampleMap == nil {
                  sampleMap = make(map[metrics.Histogram][]int64)
              }
              sampleMap[h] = make([]int64, 0, sampleCount)
              return h
          }
          return nil
      }

      func Measure(h metrics.Histogram, d time.Duration) {
          if h != nil && archaius.Conf.Collect {
              if d > maxHistObservable {
                  h.Observe(int64(maxHistObservable))
              } else {
                  h.Observe(int64(d))
              }
              s := sampleMap[h]
              if s != nil && len(s) < sampleCount {
                  sampleMap[h] = append(s, int64(d))
              }
          }
      }

      Callouts: Nanoseconds!; Median and 99%ile; Slice for first 500 values as samples for export to Guesstimate.
  • 51. Spigo Histogram Results

      name: storage.*.*.load00....load.denominator_resp
      count: 1978
      gauges: map[50:126975 99:278527]
      From, To, Count, Prob, Bar
      28672, 29695, 1, 0.0005, :
      31744, 32767, 1, 0.0005, :
      34816, 36863, 2, 0.0010, :#
      36864, 38911, 8, 0.0040, |######
      38912, 40959, 13, 0.0066, |##########
      40960, 43007, 18, 0.0091, |##############
      43008, 45055, 12, 0.0061, |#########
      45056, 47103, 26, 0.0131, |####################
      47104, 49151, 24, 0.0121, |##################
      49152, 51199, 33, 0.0167, |#########################
      51200, 53247, 29, 0.0147, |######################
      53248, 55295, 35, 0.0177, |###########################
      55296, 57343, 39, 0.0197, |##############################
      57344, 59391, 35, 0.0177, |###########################
      59392, 61439, 43, 0.0217, |#################################
      61440, 63487, 31, 0.0157, |########################
      63488, 65535, 39, 0.0197, |##############################
      65536, 69631, 74, 0.0374, |#########################################################
      69632, 73727, 65, 0.0329, |##################################################
      73728, 77823, 57, 0.0288, |############################################
      77824, 81919, 37, 0.0187, |############################
      81920, 86015, 37, 0.0187, |############################
      86016, 90111, 30, 0.0152, |#######################
      90112, 94207, 39, 0.0197, |##############################
      94208, 98303, 28, 0.0142, |#####################
      98304, 102399, 30, 0.0152, |#######################
      102400, 106495, 31, 0.0157, |########################
      106496, 110591, 20, 0.0101, |###############
      110592, 114687, 26, 0.0131, |####################
      114688, 118783, 44, 0.0222, |##################################
      118784, 122879, 41, 0.0207, |###############################
      122880, 126975, 54, 0.0273, |##########################################
      126976, 131071, 51, 0.0258, |#######################################
      131072, 139263, 114, 0.0576, |########################################################################################
      139264, 147455, 123, 0.0622, |###############################################################################################
      147456, 155647, 127, 0.0642, |###################################################################################################
      155648, 163839, 102, 0.0516, |###############################################################################
      163840, 172031, 90, 0.0455, |######################################################################
      172032, 180223, 65, 0.0329, |##################################################
      180224, 188415, 43, 0.0217, |#################################
      188416, 196607, 60, 0.0303, |##############################################
      196608, 204799, 54, 0.0273, |##########################################
      204800, 212991, 29, 0.0147, |######################
      212992, 221183, 21, 0.0106, |################
      221184, 229375, 25, 0.0126, |###################
      229376, 237567, 18, 0.0091, |##############
      237568, 245759, 15, 0.0076, |###########
      245760, 253951, 9, 0.0046, |#######
      253952, 262143, 8, 0.0040, |######
      262144, 278527, 10, 0.0051, |#######
      278528, 294911, 6, 0.0030, |####
      294912, 311295, 2, 0.0010, |#
      327680, 344063, 2, 0.0010, :#
      344064, 360447, 1, 0.0005, |
      376832, 393215, 1, 0.0005, :

      Callouts (shown on the slide against an enlarged copy of the header and first rows): Total count, median and 99th percentile values; Normalized probability; Response time distribution measured in nanoseconds using High Dynamic Range Histogram; ":#" marks zero counts skipped; "|#" marks contiguous buckets.
  • 53. [Diagram labels: memcached hit %; memcached response; mysql response; service cpu time; memcached hit mode; mysql cache hit mode; mysql disk access mode.] Hit rates: memcached 40%, mysql 70%
  • 54. Hit rates: memcached 60% mysql 70%
  • 55. Hit rates: memcached 20% mysql 90%
  • 56. Golang Guesstimate Interface https://github.com/adrianco/goguesstimate

      {
        "space": {
          "name": "gotest",
          "description": "Testing",
          "is_private": "true",
          "graph": {
            "metrics": [
              {"id": "AB", "readableId": "AB", "name": "memcached", "location": {"row": 2, "column": 4}},
              {"id": "AC", "readableId": "AC", "name": "memcached percent", "location": {"row": 2, "column": 3}},
              {"id": "AD", "readableId": "AD", "name": "staash cpu", "location": {"row": 3, "column": 3}},
              {"id": "AE", "readableId": "AE", "name": "staash", "location": {"row": 3, "column": 2}}
            ],
            "guesstimates": [
              {"metric": "AB", "input": null, "guesstimateType": "DATA", "data": [119958,6066,13914,9595,6773,5867,2347,1333,9900,9404,13518,9021,7915,3733,10244,5461,12243,7931,9044,11706,5706,22861,9022,48661,15158,28995,16885,9564,17915,6610,7080,7065,12992,35431,11910,11465,14455,25790,8339,9991]},
              {"metric": "AC", "input": "40", "guesstimateType": "POINT"},
              {"metric": "AD", "input": "[1000,4000]", "guesstimateType": "LOGNORMAL"},
              {"metric": "AE", "input": "=100+((randomInt(0,100)>AC)?AB:AD)", "guesstimateType": "FUNCTION"}
            ]
          }
        }
      }
  • 58. Simplicity through symmetry: Symmetry; Invariants; Stable assertions; No special cases.
  • 59. “We see the world as increasingly more complex and chaotic because we use inadequate concepts to explain it. When we understand something, we no longer see it as chaotic or complex.” Jamshid Gharajedaghi - 2011, Systems Thinking: Managing Chaos and Complexity: A Platform for Designing Business Architecture
  • 60. Q&A. Adrian Cockcroft @adrianco, Technology Fellow - Battery Ventures. http://slideshare.com/adriancockcroft