Avoiding Log Data Overload in a CI/CD System: Streaming 190 Billion Events and Batch Processing 40 TB/Hour with Joshua Robinson

© 2018 PURE STORAGE INC. PURE PROPRIETARY
How to Avoid Drowning in Logs
Streaming 180 Billion Events/Day and Batching 150 TB/Hour
Joshua Robinson
Founding Engineer, FlashBlade

Log Analytics Pipeline in Numbers
✓ 2M events / second
✓ 5 seconds SLA
✓ 0.5 - 1 PB of data / day
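
As a rough sanity check, the per-second and per-day figures above can be converted to per-day and per-hour rates. This is only back-of-the-envelope arithmetic: it assumes the 2M events / second rate is sustained around the clock and that 1 PB = 1000 TB (decimal units), neither of which the slide states.

```python
# Back-of-the-envelope conversion of the headline rates.
# Assumptions (not from the slides): the event rate is sustained for a
# full 24 hours, and 1 PB = 1000 TB (decimal units).
SECONDS_PER_DAY = 24 * 60 * 60
HOURS_PER_DAY = 24
TB_PER_PB = 1000

events_per_second = 2_000_000
events_per_day = events_per_second * SECONDS_PER_DAY


def pb_per_day_to_tb_per_hour(pb_per_day: float) -> float:
    """Convert a daily volume in PB to an hourly volume in TB."""
    return pb_per_day * TB_PER_PB / HOURS_PER_DAY


low = pb_per_day_to_tb_per_hour(0.5)
high = pb_per_day_to_tb_per_hour(1.0)

print(f"{events_per_day / 1e9:.1f} billion events / day")
print(f"{low:.1f} to {high:.1f} TB / hour")
```

Under these assumptions the result is about 173 billion events / day and roughly 21 to 42 TB / hour.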

Continuous Integration & Continuous Deployment
Source → Build → Functional Test → Stress Test → Deploy

< 5
1 Test coordinator (Jenkins)
< 10
< 10
CI/CD works!
100s tests / day
< 5 failures
Email developer

Scale Problems
700 failures x 15 min
70,000+ tests / day
20 Triage Engineers
2x in the next 12 months
1500+ VMs
250+ FBs
20+ Jenkins
700+ clients
100+ Engineers
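
The triage load implied by these numbers can be worked out in a few lines. This is a sketch under stated assumptions: it reads "700 failures x 15 min" as 700 failures per day at about 15 minutes of manual triage each, and it assumes an 8-hour working day, which the slide does not give.

```python
# Rough triage-load estimate from the slide's numbers.
# Assumptions (not stated on the slide): failures are per day, each takes
# about 15 minutes to triage by hand, and an engineer works 8 hours a day.
FAILURES_PER_DAY = 700
MINUTES_PER_FAILURE = 15
TESTS_PER_DAY = 70_000
WORKDAY_HOURS = 8   # assumption
GROWTH_FACTOR = 2   # "2x in the next 12 months"

triage_hours = FAILURES_PER_DAY * MINUTES_PER_FAILURE / 60
engineers_needed = triage_hours / WORKDAY_HOURS
failure_rate = FAILURES_PER_DAY / TESTS_PER_DAY

print(f"failure rate: {failure_rate:.1%}")
print(f"triage effort: {triage_hours:.0f} engineer-hours / day")
print(f"full-time triage engineers today: {engineers_needed:.1f}")
print(f"after 2x growth: {engineers_needed * GROWTH_FACTOR:.1f}")
```

With these assumptions, manual triage costs about 175 engineer-hours a day, close to the 20 triage engineers on the slide, and the headcount would have to double along with the test volume.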

Log Analysis Dream
1. Automate triaging of failures
2. Extract performance metrics
3. Save our logs for future use
4. Do all of this in a scalable system
5. Real-time results!

Log Analysis
Volume
Value

Log Analysis v1
Volume
Value
Save
Alert / Take action

Log Analysis v2
Volume
Value
Save
ETL / Add Structure
Alert / Take action

Log Analysis v3
Volume
Value
Save
Aggregate / Search
ETL / Add Structure
Alert / Take action

Log Analysis v10
Volume
Value
Save
Aggregate / Search
ETL / Add Structure
Alert / Take action

Log Analysis Pipeline
Augment & Centralize, Log Sources, Index, Aggregate, Transform, Logic, Timeseries DB, Alert, Store, Visualize

Augment & Centralize, Log Sources, Aggregate, Transform, Logic, Timeseries DB, Alert, Store, Visualize, Index

Augment & Centralize, Log Sources, Streaming Buffer, Filter, Store, Aggregate, Transform, Logic, Timeseries DB, Alert, Visualize, Index

Augment & Centralize, Log Sources, Streaming Buffer, Filter, Store, Re-Filter, Aggregate, Transform, Logic, Timeseries DB, Alert, Visualize, Index

Augment & Centralize, Log Sources, Streaming Buffer, Filter, Store, Aggregate, Transform, Logic, Timeseries DB, Alert, Visualize, Index, Re-Filter

rsyslog, Log Sources, Streaming Buffer, Filter, Store, Re-Filter, Aggregate, Transform, Logic, Timeseries DB, Alert, Visualize, Index

rsyslog, Log Sources, Streaming Buffer, Filter, Re-Filter, Aggregate, Transform, Logic, Timeseries DB, Alert, Visualize, Index

rsyslog, Log Sources, Filter, Re-Filter, Timeseries DB, Alert, Aggregate, Transform, Logic, Visualize, Index

rsyslog, Log Sources, Timeseries DB, Alert, Aggregate, Transform, Logic, Visualize, Index

rsyslog, Log Sources, Timeseries DB, Alert, Aggregate, Transform, Logic, Visualize

rsyslog, Log Sources, Timeseries DB, Alert, Visualize

rsyslog, Log Sources

Indexing
Use filesystem directory structure to encode metadata
• Raw data: <host>/<year>/<month>/<day>/<flat files>
  • Producer: rsyslog
  • Consumer: Spark batch (re-filter or custom lookbacks)
• Indexed data: <pattern>/<year>/<month>/<day>/<hour>/<host>/<flat files>
  • Producer: Spark streaming (filter)
  • Consumer: Python services (e.g. ETL, alert, searchability)
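
The two directory layouts above can be illustrated with a small path builder. This is a minimal sketch, not the pipeline's actual code: the root directories, the pattern names and regular expressions, the host name, and the zero-padded date components are all hypothetical choices made for illustration.

```python
import re
from datetime import datetime
from pathlib import PurePosixPath

# Hypothetical filter patterns; the real pattern set is not in the slides.
PATTERNS = {
    "kernel_panic": re.compile(r"kernel panic", re.IGNORECASE),
    "test_failed": re.compile(r"\bTEST FAILED\b"),
}


def raw_dir(root: str, host: str, ts: datetime) -> PurePosixPath:
    """Raw layout: <host>/<year>/<month>/<day>/ (zero padding is an assumption)."""
    return PurePosixPath(root) / host / f"{ts:%Y}" / f"{ts:%m}" / f"{ts:%d}"


def indexed_dir(root: str, pattern: str, host: str, ts: datetime) -> PurePosixPath:
    """Indexed layout: <pattern>/<year>/<month>/<day>/<hour>/<host>/."""
    return (PurePosixPath(root) / pattern
            / f"{ts:%Y}" / f"{ts:%m}" / f"{ts:%d}" / f"{ts:%H}" / host)


def matching_patterns(line: str) -> list:
    """Names of all patterns that a log line matches (may be empty)."""
    return [name for name, rx in PATTERNS.items() if rx.search(line)]


# Example: every line is stored raw; only matching lines are also indexed.
ts = datetime(2018, 6, 5, 14, 30)
line = "Jun  5 14:30:01 vm-0042 Kernel panic - not syncing"
print(raw_dir("/logs/raw", "vm-0042", ts))
for name in matching_patterns(line):
    print(indexed_dir("/logs/indexed", name, "vm-0042", ts))
```

Because the metadata lives in the path, a consumer that wants one pattern for one hour only has to list a single directory prefix rather than query a separate index.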

Querying
Find and load data
• FlashBlade NFS protocol, < 1 ms latency
• Listing
  • "ls -alR" is still SLOW
  • NFS client in kernel sequentially discovers filesystem structure.
  • Solution: Skip the kernel. Use libnfs to create our own parallelized discovery. 1000x faster for 1M files
• Reading
  • Buffering: Create input pipeline to optimize for throughput and hide latency
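
The listing and reading ideas above can be sketched in plain Python. Note the substitution: the slide's solution uses libnfs to bypass the kernel NFS client, whereas this sketch only uses the standard library (os.scandir plus a thread pool) on a mounted filesystem, so it shows the shape of parallel discovery and prefetched reads, not the libnfs implementation or its speedup. Function names and default worker counts are hypothetical.

```python
import collections
import os
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def list_dir(path):
    """Return (subdirectories, files) found directly under path."""
    dirs, files = [], []
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                dirs.append(entry.path)
            else:
                files.append(entry.path)
    return dirs, files


def parallel_walk(root, workers=32):
    """Discover all files under root, listing many directories concurrently."""
    found = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pending = {pool.submit(list_dir, root)}
        while pending:
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                dirs, files = future.result()
                found.extend(files)
                # Each newly found directory becomes its own listing task.
                pending |= {pool.submit(list_dir, d) for d in dirs}
    return found


def prefetching_reader(paths, depth=8, workers=8):
    """Yield (path, bytes) in input order, keeping up to `depth` reads in flight."""
    def read(path):
        with open(path, "rb") as f:
            return path, f.read()

    with ThreadPoolExecutor(max_workers=workers) as pool:
        window = collections.deque()
        for path in paths:
            window.append(pool.submit(read, path))
            if len(window) >= depth:
                yield window.popleft().result()
        while window:
            yield window.popleft().result()
```

A consumer could chain the two, for example iterating over prefetching_reader(sorted(parallel_walk(some_dir))), so that the latency of each read is hidden behind the processing of earlier files.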

Full Pipeline
2,500+ VMs
300+ FBs
20+ Jenkins
1,000+ clients
120,000+ tests / day
rsyslog
[Diagram: labels 72T, 72T, 24T, 800G; groups of boxes labeled 12 or 16]
✓ Duplicate bug
✓ Infrastructure failure
✓ Performance regression

Full Pipeline
2,500+ VMs
350+ FBs
20+ Jenkins
1,000+ clients
120,000+ tests / day
rsyslog
[Diagram: labels 72T, 72T, 24T, 800G, 200T, 90G; groups of boxes labeled 12 or 16]
✓ Duplicate bug
✓ Performance regression

Full Pipeline
2,500+ VMs
350+ FBs
20+ Jenkins
1,000+ clients
120,000+ tests / day
rsyslog
[Diagram: labels 72T, 72T, 24T, 800G, 200T, 90G, 50G, 189T; groups of boxes labeled 12 or 16]
✓ Duplicate bug
✓ Performance regression
✓ Low level details
✓ Easy to read graphs

Takeaways
✓ Index only what you need, store the rest (in a storage layer that scales in throughput and to billions of files/objects)
✓ Optimize for throughput and not latency
✓ Disaggregation of compute and storage for scalability of subsystems

QUESTIONS?
