Operational Verification
David Schmitt (aka @dev_el_ops),
Tech Lead - Content
April 15, 2021
Good afternoon, folks! The Content team at Puppet is currently working on a
new module to provide more confidence in your infrastructure's health. While I
hope I don't need to convince you that confidence in our deployments is
necessary, I want to show you today that it is possible to improve on the
situation we currently have.
1
First, a few words about words. Here's how I learned them at university an
eternity ago. Verification is process-oriented: are we doing things right, and
does each step match the requirements? Validation is outcome-oriented: are
we solving the actual problem? In refreshing my memory on these
distinctions, I found a post by a CS professor who summarizes it as
"Verification will help to determine whether the software is of high quality, but
it will not ensure that the system is useful."
https://www.easterbrook.ca/steve/2010/11/the-difference-between-verification-and-validation/
2 - @dev_el_ops
Verification: Are we building the system right?
Validation: Are we building the right system?
Terms I
This graphic is from the same blog post and shows various techniques we
can apply to ensure that a solution stays within the specification and makes
progress towards solving the customer's problem. Of note for puppet are unit
tests, which make sure that code meets specific low-level expectations, and
acceptance tests, which are responsible for proving fitness for purpose on
full systems. Keep that thought in mind!
While a test will never be able to judge the ultimate purpose of your service, I
want to show that there is a clear progression in testing scope that we can
follow to ensure that our implementation provides value. For example, a
system running apache is more useful as a webserver than a system that is
not running apache.
While you'll likely already know this ...
----
graphic from blog used with permission
https://twitter.com/dev_el_ops/status/1381940089438281728
https://www.easterbrook.ca/steve/2010/11/the-difference-between-verification-and-validation/
3 - @dev_el_ops
Terms II
From https://www.easterbrook.ca/steve/2010/11/the-difference-between-verification-and-validation/
I still want to spend a minute on idempotency. An idempotent action can be
applied any number of times, but won't change the state of the system on
subsequent applications. For puppet that's convenient: we can apply the
same catalog over and over again, and it won't change the target system if
it is already in the desired state.
Much of puppet's ecosystem today relies on a catalog's idempotency for
verification: we rely on it for impact analysis to make sense; we catch apply
errors in testing, so we can deploy with confidence; if there are unexpected
changes, we look for a security breach; if a puppet run doesn't change
anything, the system is healthy. But is it really?
4 - @dev_el_ops
Idempotent: Action can be applied multiple times without
changing the state of the system (beyond the first).
Terms III
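To make the definition concrete, here is a minimal sketch (my own illustration, not from the slides) of an idempotent puppet resource:

file { '/etc/motd':
  ensure  => file,
  content => "managed by puppet\n",
}

Applying this catalog the first time creates or rewrites /etc/motd; applying it again reports zero changes, because the file is already in the desired state. That is the property the rest of this talk leans on.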
Let's go through a short example. This configures apache to serve static
content from a directory. What common issues would make a puppet run fail
on this code? One of the ugly ones is the apache service failing to start
because of a fatal configuration error, for example if port 80 is already in use.
Thanks to recent-ish improvements to service management on linux, this has
become very easy to detect, and puppet already gives you an error at the
time the broken configuration is deployed. That nicely lines up with the
expectations I quoted earlier: we run this in a test, we inspect the error, we fix
the code.
Let's make the situation a bit more complicated... I mean... realistic.
5 - @dev_el_ops
class { '::apache': }
apache::vhost { 'basic.example.com':
  port    => 80,
  docroot => '/var/www/basic',
}
Example I
This example configures the same virtual host, but instead of serving up static
files, it proxies requests through to a backend service.
Searching for the error is not the point of the slide, so I've already highlighted
where everything started to go wrong.
These two services will never talk to each other and puppet will happily keep
it that way, with no errors reported from applying the catalog.
Or maybe the SSL certificate has expired.
Or the docker image's configuration doesn't look at the PORT variable and
defaults to something else.
Or there is a firewall configuration that blocks access to port 80 or port 4000.
I'm sure each one of you has their own example of that one time something
went badly wrong. Even before puppet, we had developed monitoring tools to
live with this and get a better understanding of our systems' current state.
For example, nagios' first release was in 2002.
6 - @dev_el_ops
class { '::apache': }
apache::vhost { 'basic.example.com':
  port       => 80,
  proxy_dest => 'localhost:4000',
}
docker::run { 'backend_service:latest':
  env => [ 'PORT=3000' ],
}
Example II
Testing, puppet idempotency, and monitoring are facets of a bigger effort to
verify our systems, and what I'm about to show you is another step in that
direction.
What if a puppet run could tell you more about your services' health than just
that a process successfully started?
----
https://puppet.com/blog/hitchhikers-guide-to-testing-infrastructure-as-and-code/
What if we could add a check resource into the catalog that, right then and
there, checks that the configured web service is alive and kicking and returns
the proper status from its health check endpoint?
This check resource makes an HTTP call to the specified URL and reports a
failure if the request doesn't return 200 or the body of the response is not the
specified JSON. It runs directly on the managed node every time puppet
runs. This won't make puppet a monitoring solution, but it will provide another
system health data point closely integrated into management workflows. It is
one step further in the direction of the testing/monitoring convergence that
Mina and I talked about previously.
For the sake of brevity, the example on the slide glosses over some details.
The resource dependencies need to be hooked up correctly so that the check
only happens once it can possibly succeed. The service might take a few
moments to start up, so the check should be configured with a retry loop and
a timeout; a rough sketch of what that could look like follows the slide code below.
This is also only the start of understanding all the ways this can be useful. For
example, what happens when this gets included in bolt plans for deployment
steering? Is it useful in your CD4PE blue/green deployment pipeline to
catch issues earlier?
7 - @dev_el_ops
class { '::apache': }
apache::vhost { 'basic.example.com':
  port       => 80,
  proxy_dest => 'localhost:4000',
}
docker::run { 'backend_service:latest':
  env => [ 'PORT=3000' ],
}
Example II - with added check
check_http { 'http://basic.example.com/health':
  http_status  => 200,
  http_content => '{"status":"ok"}',
}
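As a rough sketch of the details glossed over above (the retry and timeout parameter names here are my assumptions, not the module's confirmed API), the check could be wired up roughly like this:

check_http { 'http://basic.example.com/health':
  http_status  => 200,
  http_content => '{"status":"ok"}',
  # Run the check only after the pieces it depends on have been managed.
  require      => [
    Apache::Vhost['basic.example.com'],
    Docker::Run['backend_service:latest'],
  ],
  # Hypothetical retry/timeout parameters; the exact names are still being designed:
  # retries => 5,
  # timeout => 30,
}

The ordering metaparameter keeps the check from firing before the vhost and container exist, while the (still hypothetical) retry settings would cover a service that needs a few seconds to come up.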
To figure these things out, we've published a very basic prototype of this in
the puppetlabs/opv (oscar papa victor) repo. And we are looking for early
feedback on how this fits into your workflow and what other checks you'd like
to see (of course, PRs are especially welcome).
We've already identified more work that's necessary before OPV is ready for
general consumption. One big item, for example, is that nobody wants to see a
change notification every time the check succeeds. We're currently working
on a new feature in the Resource API to make that easily possible. Please have
a look at the tickets on the repo to see all the details of what's currently
planned.
Where do we go from here?
8 - @dev_el_ops
class { '::apache': }
apache::vhost { 'basic.example.com':
  port       => 80,
  proxy_dest => 'localhost:4000',
}
docker::run { 'backend_service:latest':
  env => [ 'PORT=3000' ],
}
check_http { 'http://basic.example.com/health':
  http_status  => 200,
  http_content => '{"status":"ok"}',
}
Example II - with added check
https://github.com/puppetlabs/opv
Clearly there's more work in front of us to make this fully usable.
Here's the initial list of checks we're looking at: http as just shown, https with
additional verification of SSL certificates, powershell and command to run
arbitrary shell checks, apt_update to check for outdated mirrors and packages,
certificate for all the X.509 validation fun, and service for system services. At
this point the list is very provisional and - again - please do post any and
all feedback in the OPV repo's discussions if you're looking for something
specific; a rough sketch of what a couple of these could look like follows the
list below. We don't have an infinite budget, and I'd much rather see us
implement something that folks are actually asking for.
The reporting fix is expected to land next week and will allow us to write
check resources that do not show up in a report when they're successful. The
only caveat here is that it's a Resource API change and will only be available
from the next puppet 6 and 7 releases.
For retrying we're looking at the usual exponential backoff and overall timeout
bits, so that shouldn't be too controversial.
The next one is interesting. Of course we want to integrate this into existing
modules. This will give us first-hand experience of how much effort it is and
give the OPV module some real-world exposure. It also means that we'll
elevate it to supported status for our commercial customers.
10 - @dev_el_ops
Next Steps
● checks: http, https, powershell, command, task,
apt_update, certificate, service
● fix reporting
● implement exponential retry params
● integrate into existing modules
● expose checks for plans
● gather feedback
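As a purely illustrative sketch of two of the planned checks (these types don't exist yet; their names and parameters are my assumptions rather than a committed design):

check_command { 'backend responds on port 4000':
  # Fail the check if the backend's health endpoint is not reachable locally.
  command => 'curl --fail --silent http://localhost:4000/health',
}
check_certificate { '/etc/ssl/certs/basic.example.com.pem':
  # Hypothetical parameter: fail if the certificate expires within 30 days.
  expires_in_more_than => '30d',
}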
While the resources as they exist can be used as part of apply blocks in bolt
plans, we also want to make them directly available as tasks and functions.
And finally - did I say this already? - please do throw any and all feedback
that you have at us. I'll be especially grateful for folks who can spare 30 or 60
minutes for an in-depth interview where we can double-check some of our
basic assumptions.
When I showed the idea around internally, I usually got asked one of two
questions: "When can I get this?" and "How does this interact with
acceptance testing?" To the first one, I can only say that we are working on it
as we speak. As for the second one, I'm glad you asked!
Let's dive into how this operational verification works out in testing. I've
already shown all the big bits on this slide earlier: configure a VHost, add a
file, check that the file is available on the web server.
This is now ALSO an acceptance test case: if the file can be downloaded
from the webserver, the configuration is acceptable. And this is a more valid
and more in-depth test implementation than most I've seen in the wild,
including in our own supported modules.
Plugging this into litmus is quite straightforward using idempotent_apply and
will yield test results with minimal faff. Since the test checks are already built
into the catalog, a successful application now really means that the service
has been configured correctly, without having to hand-code additional
serverspec checks in ruby or deal with rspec scoping rules.
On the next slide, I'm gonna be even more fast and loose with the details ...
10 - @dev_el_ops
apache::vhost { 'basic.example.com':
  port    => 80,
  docroot => '/var/www/basic',
}
file { '/var/www/basic/test.txt':
  content => 'Hello, World!',
}
check_http { 'http://basic.example.com/test.txt':
  http_status  => 200,
  http_content => 'Hello, World!',
}
Future of Acceptance Testing
… to spark your imagination while still fitting everything on a single page.
To quickly summarize what is happening here: to deploy this fictional app, the
plan first configures the database and checks that everything went well, then
configures the web server, double-checking that the database is accessible
from that machine before touching anything. At the bottom, the plan confirms
that the application is reachable from the node bolt is running on.
Surely production systems will have additional complexities, like running
database migrations, pre- and post-configuration steps to quiesce the
database or the app, draining, disconnecting and re-establishing load balancer
configurations, managing this process across more than two nodes, and so on.
By virtue of having critical checks directly where the configuration happens,
they don't get lost, don't introduce delays, and can provide immediate
in-place feedback if something goes wrong, without losing context at any
stage of development.
And that's all I have for you today. I hope I've inspired you to take a new
perspective on testing and monitoring, and that you'll go check out the OPV
repository and participate in where this journey takes us.
11 - @dev_el_ops
plan my_app::deploy (TargetSpec $db_server, TargetSpec $app_server, String $app_url) {
  $db_results = apply($db_server) {
    class { 'my_app::db':
      app_server => $app_server.name,
    }
  }
  opv::check_apply($db_results)

  $app_results = apply($app_server) {
    check_db { $db_server.name: }
    -> class { 'my_app::app':
      db_server  => $db_server.name,
      public_url => $app_url,
    }
  }
  opv::check_apply($app_results)
  opv::check_http($app_url)
}
Future of Deployment Testing
I think we still have a few minutes for Q&A; meanwhile, I'll leave the links for
the things I talked about up here. I'll also post the slides to Slack later.
12 - @dev_el_ops
Links and Resources
● OPV module: https://github.com/puppetlabs/opv (add feedback to Discussions)
● Monitoring == Testing: https://puppet.com/blog/hitchhikers-guide-to-testing-infrastructure-as-and-code
● Verification and Validation: https://www.easterbrook.ca/steve/2010/11/the-difference-between-verification-and-validation