Greg Dostatni
Team Lead, Application Hosting
Splunk at the University of
Alberta
Copyright © 2015 Splunk Inc.
• At U of A since 2007
• Responsible for 10-person
team managing applications
and databases university-wide
• Splunk user since 2013
• I’ve eaten BBQ chicken
intestines on a stick. Yummy.
• splunk> take the sh out of IT
The University of Alberta
• Public research university based in
Edmonton and founded in 1908
• 39,000+ students and 18,000
employees
• 5 campuses and 18 faculties
• One of the top 100 universities
worldwide
IT at the University of Alberta
Central IT group for authentication,
wireless and core services
Independent IT groups for most
faculties and departments
University-wide initiative to
consolidate more of IT
Need to standardize IT operations
and tame diverse technology stacks
Application Hosting Objectives
• Centralize more of IT
• Build and manage shared
environments
• Develop custom services as
needed
• Roll out/upgrade applications
• Investigate performance
problems
[Diagram labels: IT; Libraries; LMS; Public website + CMS; Ticketing; Billing systems; Research group servers; Other applications and databases]
Challenges after Restructuring IT
• More interdependencies
among teams
• Massive volume of data,
housed in silos
• “Running blind” – no
understanding of the data
• Time-consuming to gather
data for incidents
Splunk Timeline
Sept. 2013
• Pilot deployed
• Splunk as syslog target
• Log aggregation test; no need for backup
March 2014
• Data loss concerns from restarting Splunk
• Management relying on Splunk reports
• Splunk not in production
Sept. 2014
• Management notification of syslog data loss
• Incidents escalated
• Splunk in production?
April 2015
• Funding to rebuild Splunk environment
• New hardware, clustering with dedicated storage
• 400 data sources
• 133 sourcetypes
Splunk at the University of Alberta
Infrastructure
Applications
(mail, authentication)
Networking
and Security
(switches, IPS)
Application
Hosting
(apps, databases)
Example: Troubleshooting Authentication Systems
Before
• 12GB/day, 20 machines
• No aggregation
• Reactive issue response
based on user feedback
• Manual investigations
• Delay in getting data
After
• Centralized data
• ½ hour to troubleshoot
• Proactive alerts for issues
• Easy access to
infrastructure data
• Real-time reporting
Example: Performance Monitoring
Track and correlate request response times to gauge user satisfaction
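The speaker notes describe breaking web traffic down by percentiles and return size. The talk does not show the query itself, so the following is only a minimal Python sketch of that idea, not the team's Splunk search. The size thresholds, bucket names, and the `(response_bytes, response_ms)` input shape are all assumptions made for illustration.

```python
import math
from collections import defaultdict

# Hypothetical size buckets in bytes; the thresholds are illustrative only.
SIZE_BUCKETS = [(10_000, "small"), (1_000_000, "medium")]


def size_bucket(num_bytes):
    """Return the label of the first bucket whose upper bound exceeds num_bytes."""
    for upper, label in SIZE_BUCKETS:
        if num_bytes < upper:
            return label
    return "large"


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def response_time_report(requests, pcts=(50, 90, 99)):
    """Summarize response times per size bucket.

    requests: iterable of (response_bytes, response_ms) tuples.
    Returns {bucket_label: {percentile: response_ms}}.
    """
    by_bucket = defaultdict(list)
    for num_bytes, ms in requests:
        by_bucket[size_bucket(num_bytes)].append(ms)

    report = {}
    for label, times in by_bucket.items():
        times.sort()
        report[label] = {p: percentile(times, p) for p in pcts}
    return report
```

Grouping by return size keeps large downloads from masking slow responses on small pages, which is one plausible reason for splitting the data this way.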
Example: First Responders App
Dashboards for initial incident review
Example: Proactive Alerts
Trigger alerts on both the count and percentage of messages
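One reading of "both the count and percentage" is that an alert fires only when errors are numerous in absolute terms and also make up a meaningful share of all messages. The sketch below assumes that reading. The function name and both default thresholds are hypothetical and do not come from the talk.

```python
def should_alert(error_count, total_count, min_errors=50, min_error_pct=5.0):
    """Return True only when errors are both numerous and a large share of traffic.

    min_errors and min_error_pct are illustrative defaults, not values from the talk.
    """
    if total_count <= 0:
        return False  # no traffic in the window, nothing to compare against
    error_pct = 100.0 * error_count / total_count
    return error_count >= min_errors and error_pct >= min_error_pct
```

Requiring both conditions avoids paging someone for 3 errors out of 10 messages on a quiet host, and also for 60 errors out of 100,000 on a busy one.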
Example: Executive Dashboards
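The speaker notes for this dashboard mention an "AppDex" score, which presumably refers to Apdex (Application Performance Index). As a rough sketch of how such a score is derived from response times, assuming the standard Apdex definition and a target threshold chosen by the reader:

```python
def apdex(response_times, threshold):
    """Apdex = (satisfied + tolerating / 2) / total samples.

    satisfied:  response time <= threshold
    tolerating: threshold < response time <= 4 * threshold
    Anything slower counts as frustrated and contributes nothing.
    """
    if not response_times:
        raise ValueError("need at least one sample")
    satisfied = sum(1 for t in response_times if t <= threshold)
    tolerating = sum(1 for t in response_times if threshold < t <= 4 * threshold)
    return (satisfied + tolerating / 2) / len(response_times)
```

The result falls between 0 and 1, which makes it easy to place on a monthly report next to other per-application figures.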
Splunk Deployment Takeaways
Successes
• Visibility cutting through team
boundaries
• More advanced initial incident
investigation
• Openness - signed standard IT
agreement for access to Splunk
data
• Management loves reports
• Defusing situations with rapid
access to facts
Challenges
• Accepting syslog data directly
• Log standardization
• Figuring out what to look at in the
logs to understand “good” system
behavior
Aha! Moments
Transactions
• End-to-end monitoring
of 4M+ email messages
per day (greylisting →
spam filtering →
Google)
• Used transactions to
combine logs across
systems into single,
message-centric log
• Ability to easily search
for anomalies
Generic Alerts
• Created alert to catch
errors across systems in
real time
• Used existing alert and
removed host
specification to create
the generic alert
• Catches errors that were
not in Splunk at the
moment the alert was
created
10-second Query
• 10-second window =
~35,000 events
• Statistics to rank likely
events triggering issues
• New Splunk window to
analyze unusual messages
• Ability to examine small
slice of time in detail
while running statistics
over longer period of time
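The "Transactions" idea above, combining logs from several systems into one message-centric log, can be illustrated outside Splunk. The following plain-Python analogy is not Splunk's own transaction feature, and the event keys (`msg_id`, `time`, `system`, `text`) and system names are hypothetical.

```python
from collections import defaultdict


def build_transactions(events):
    """Group events by message ID and order each group by timestamp.

    events: iterable of dicts with hypothetical keys
            'msg_id', 'time' (seconds), 'system', 'text'.
    Returns {msg_id: [events in time order]}.
    """
    grouped = defaultdict(list)
    for event in events:
        grouped[event["msg_id"]].append(event)
    for msg_events in grouped.values():
        msg_events.sort(key=lambda e: e["time"])
    return dict(grouped)


def incomplete_messages(transactions, expected_systems):
    """Return sorted IDs of messages that never reached every expected system."""
    expected = set(expected_systems)
    return sorted(
        msg_id
        for msg_id, msg_events in transactions.items()
        if not expected <= {e["system"] for e in msg_events}
    )
```

Once events are grouped per message, an anomaly such as a message that passed greylisting but never reached the final hop becomes a simple set comparison.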
“Splunk allows us to erase these lines and
any analyst can see all the data from
anywhere and investigate a problem from
end to end.”
Thank you


Splunk live university of alberta 2015

Editor's Notes

  • #2: Good morning everyone, and welcome to SplunkLive Calgary. Thanks so much for having me at your SplunkLive today.
  • #3: I graduated from CS at the University of Alberta in 2002, worked on various research projects, and did some contract development for a few years. I joined the university again in 2007.
  • #4: University of Alberta: a friendly neighbor about 300 km that way (for the main campus). The 18 faculties include some big ones like Science, Medicine and Engineering. The university's ranking changes every year, and different rankings get published, so "one of the top 100" seemed like a safe enough statement to make without going into half a page of small print.
  • #5: This was our starting point about 4 years ago. Over the last 4 years or so we’ve been consolidating 300+ individual IT groups into a central department. Originally this was envisioned as a 10-year plan, so we still have a ways to go. At this point we are supporting a lot of different institutional needs and lots of different technologies, and life is very exciting.
  • #6: Application Hosting. We manage the applications and databases for a lot of clients across campus. There are some big applications that are managed by others (LMS, PeopleSoft), but we make up for it with the number of different applications we support. I’ve stopped being amazed at the number of needs an institution of this size has. We have applications for printing labels to put on file folders, tracking project time, and billing and invoicing, as well as databases supporting libraries, ticketing systems, wikis, and departmental pages. Typically there is a piece of software behind a lot of business processes, and all of that needs to be monitored, patched, and upgraded. As our consolidation effort continues, we will be using Splunk to look into how an application is used in order to determine how it could be consolidated with other applications of similar function. There is an amazing amount of information about usage patterns: what gets accessed, how often, and by whom.
  • #7: We’ve reorganized ourselves along functional lines: OS Support, Networking, Application Hosting, etc. What that means is that some investigations spanning multiple teams become very time-consuming and expensive (two people looking at logs). Some of that is unavoidable, and even desirable, but for a large number of errors we’re just missing that one piece of crucial information that “solves” the problem. That could be a log line from the VM host indicating a physical hardware problem, a log from the authentication system detailing why a connection was rejected, and so on. Splunk gives us access to that information. Here is where I need to bring up a big warning flag: having access to the logs does not mean you can understand the logs. There are some errors where the team running a system is required to correctly interpret the logs, but in general having more eyes is a good thing. Some of the expertise can be developed over time, and some more through developing dashboards and applications within Splunk.
  • #8: I get a kick out of these timelines, so I wanted to add our own. I’ve seen a few at a Splunk conference, and they typically go something like this: an organization gets Splunk in a limited capacity, something happens (systems get hacked, a phishing attack, etc.), and next year they are running 10x the license. We’ve had a bit of a different experience, where the importance of Splunk snuck up on us. In September 2013 we deployed Splunk as a pilot. Some of the conversations at that point were along the lines of “backups are not important, as all the systems keep their own logs.” Splunk is just a view into the logs; there is little to no new information contained. Let’s just send logs and see what we can make of it. By March we were becoming concerned every time we needed to restart Splunk, since that meant data loss. Our installation was configured as a syslog target for networking devices and IMS logs, so if Splunk was not there to receive the events, they went nowhere. By September (1 year later) we needed to notify management of Splunk outages because of the data loss. In April / May this year we are rebuilding the environment in a clustered configuration with dedicated syslog servers. I don’t know if I can think of any one moment where Splunk suddenly became a critical production environment, but it definitely is one now.
  • #9: What follows is a gross generalization. I obviously understand the challenges within my team the best, so take this with a grain of salt. Although other groups do use and send logs to Splunk, the three main groups are Infrastructure Applications, Networking and Application Hosting. It’s interesting that we all have different needs and different use cases. Infrastructure has a small number of highly critical applications that they need to understand end to end (mail, central authentication, etc.). Networking has a few data sources that are fairly similar (switches, firewalls, etc.). Security looks at a few different data sources like VPN and IPS, as well as authentication logs from a number of systems. Application Hosting currently has a relatively small number of sources, but a lot of them are different and unique. There are probably 5 different ways Apache is configured to log access requests, and we run every major database type and version: PostgreSQL, MySQL, MSSQL, Oracle. In addition, we support the applications themselves or interact with software vendors on behalf of clients. On my last monthly report there were 374 different relational databases in environments supported by my team.
  • #10: An example from our Infrastructure Applications team. Our authentication system (approximately 180,000 accounts) generates around 12 GB of logs per day. Before, logs were stored on each individual node of a cluster in a text file. Trying to find logs related to a specific login ID required signing on to the 20 or so systems and using “grep” to identify individual log lines. That was not a quick task, and it generated I/O load on the servers. Doing anything more advanced than that was nearly impossible. After: we have summary indexes and reporting indexes on this data to quickly answer specific questions we know will be asked. We can correlate with data from other systems and alert in real time on specific events. Users are no longer our main issue detection method.
  • #11: This is something we’ve used to great effect in my team for a few performance investigations so far. We break down the web traffic by percentiles and return size. It allows us to pinpoint problems (in some cases) as well as provide an instant report to the client on how their application is really performing. This query is complex enough to be useful, yet simple enough that I can explain it to non-technical clients.
  • #12: This is a new initiative coming out of my team. It is called the First Responders App (even though it is currently a dashboard, I know it will end up an app eventually). First Responders is meant as the place to go at the start of an incident. It’s meant to put a lot of infrastructure information at an analyst’s fingertips, and it spans information from all of the operational teams. It allows an analyst to verify backup status, check logs, check tickets and the change calendar, check the monitoring system, see who last logged in to the system, etc. As we’re still rolling it out, we do not have all the information we would perhaps wish, but the reception so far has been very positive.
  • #13: Also part of First Responders, we have a holistic look at the logs. Not only do we look at the logs from the system, we also look at logs about the system. If our IPS detects activity against the host, that will appear in the window at top left. The same applies if our authentication system suddenly starts throwing a lot of errors or messages about the host. Sparklines are great for very quickly identifying patterns and seeing if something is unusual. Lastly, we have a query I’ve been playing around with for some months. This query tries to generate a statistical baseline of events from a system, and then compares the last full hour against that baseline to highlight issues. In this screenshot I relaxed my alerting thresholds, which is why the z-scores are so low, but it does illustrate what the output looks like. I’m working on a next generation of this type of query that will log all deviations from the entire environment every hour. Wish me luck.
  • #14: This was probably our management’s first introduction to Splunk: monthly reports for critical systems. We are still tweaking what goes on the reports, and will continue to do so. This particular dashboard shows our Adobe Connect service, including a histogram of meeting sizes, which did include classes in the 240–245 participant range, as well as an Apdex score. Eventually I’d like to standardize all application monthly reports and have them sent automatically to each client / department.
  • #15: Things that worked and things that did not work so well. The main successes: You can really defuse a situation by being able to rapidly provide facts. If you’re able to provide a list of users who accessed a specific file in the first 15 minutes of a breach investigation, that really brings down the stress level of everyone involved, and the situation rapidly de-escalates. The same goes for performance investigations. Everyone in the system can see all logs; we try to have the system as open as possible to all of IT. Challenges: There are so many different ways of configuring logging, which makes getting consistent reports a challenge. Knowing what is “normal”: having the ability to rapidly generate graphs of values and having data that goes back a few months (at least) is highly beneficial. Syslog: where possible, use universal forwarders; where not possible, have a syslog collector.
  • #16: Some of the “aha moments”. Transactions: using transactions to create message-centric logs was nothing short of magical, especially when compared to the good old “grep” command across multiple systems. Generic alerts: the ability to create alerts that work for systems that are not in Splunk yet. The ability to look at the entire environment as a single event stream is incredibly powerful. The 10-second query is an extension of the previous point. While investigating a hiccup of some sort, I performed a time-constrained query across the entire environment. I wanted to see whether the error was limited to this system or whether it appeared anywhere else. Using two windows, I was able to run simple queries on specific messages and determine whether each was a normal event or not. It was an amazing, and humbling, look at our environment. So many things happened within that 10-second window.
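
Note #13 describes a query that builds a statistical baseline of events and compares the last full hour against it using z-scores. The talk does not show that query, so the following is only a minimal Python sketch of the general technique. The input shapes, function names, and the default threshold of 3.0 are assumptions for illustration.

```python
from statistics import mean, stdev


def hourly_z_scores(baseline_counts, last_hour_counts):
    """Compare last hour's per-message-type counts against a baseline.

    baseline_counts:  {message_type: [count_hour_1, count_hour_2, ...]}
    last_hour_counts: {message_type: count_in_last_full_hour}
    Returns {message_type: z_score}; types without a usable baseline are skipped.
    """
    scores = {}
    for msg_type, history in baseline_counts.items():
        if len(history) < 2:
            continue  # sample standard deviation needs at least two hours
        sigma = stdev(history)
        if sigma == 0:
            continue  # flat history: the z-score is undefined
        observed = last_hour_counts.get(msg_type, 0)
        scores[msg_type] = (observed - mean(history)) / sigma
    return scores


def deviations(scores, threshold=3.0):
    """Message types whose |z| meets the threshold, largest deviation first."""
    flagged = [(t, z) for t, z in scores.items() if abs(z) >= threshold]
    return sorted(flagged, key=lambda item: abs(item[1]), reverse=True)
```

Lowering `threshold` corresponds to the relaxed alerting thresholds mentioned in the note: more message types are flagged, with smaller z-scores.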