Operational Verification
David Schmitt (aka @dev_el_ops),
Tech Lead - Content
April 15, 2021
Good afternoon, folks! The Content team at Puppet is currently working on a
new module to provide more confidence in your infrastructure's health. While I
hope I don't need to convince you that confidence in our deployments is
necessary, I want to show you today that it is possible to improve on the
situation we currently have.
1
First, a few words about words. Here's how I learned them at university an
eternity ago. Verification is process-oriented: are we doing things right, and
does each step match the requirements? Validation is outcome-oriented: are
we solving the actual problem? In refreshing my memory on these
distinctions, I found a post by a CS professor who summarizes it as
"Verification will help to determine whether the software is of high quality, but
it will not ensure that the system is useful."
https://www.easterbrook.ca/steve/2010/11/the-difference-between-verification-and-validation/
2 - @dev_el_ops
Verification: Are we building the system right?
Validation: Are we building the right system?
Terms I
This graphic is from the same blog post and shows various techniques we
can apply to ensure that a solution stays within the specification and makes
progress towards solving the customer's problem. Of note for puppet are unit
tests, which make sure that code meets specific low-level expectations, and
acceptance tests, which are responsible for proving fitness for purpose on
full systems. Keep that thought in mind!
While a test will never be able to judge the ultimate purpose of your service, I
want to show that there is a clear progression in testing scope that we can
follow to ensure that our implementation provides value. For example, a
system running apache is more useful as a webserver than a system that is
not running apache.
While you'll likely already know this ...
----
graphic from blog used with permission
https://twitter.com/dev_el_ops/status/1381940089438281728
https://www.easterbrook.ca/steve/2010/11/the-difference-between-verification-and-validation/
3 - @dev_el_ops
Terms II
From https://www.easterbrook.ca/steve/2010/11/the-difference-between-verification-and-validation/
I still want to spend a minute on idempotency. An idempotent action can be
applied any number of times, but won't change the state of the system on
subsequent applications. For puppet that's convenient: we can apply the
same catalog over and over again, and it won't change the target system if
it is already in the desired state.
Much of puppet's ecosystem today relies on a catalog's idempotency for
verification: we rely on it for impact analysis to make sense; we catch apply
errors in testing, so we can deploy with confidence; if there are unexpected
changes, we look for a security breach; if a puppet run doesn't change
anything, the system is healthy. But is it really?
4 - @dev_el_ops
Idempotent: Action can be applied multiple times without
changing the state of the system (beyond the first).
Terms III
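To make the definition concrete, here is a minimal sketch (my own illustration, not from the slides) of an idempotent puppet resource:

file { '/etc/motd':
  ensure  => file,
  content => "managed by puppet\n",
}

Applying this catalog the first time creates or rewrites /etc/motd; applying it again reports zero changes, because the file is already in the desired state. That is the property the rest of this talk leans on.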
Let's go through a short example. This configures apache to serve static
content from a directory. What common issues would make a puppet run fail
on this code? One of the ugly ones is the apache service failing to start
because of a fatal configuration error, for example if port 80 is already in use.
Thanks to recent-ish improvements to service management on linux, this has
become very easy to detect, and puppet already gives you an error at the
time the broken configuration is deployed. That nicely lines up with the
expectations I quoted earlier: we run this in a test, we inspect the error, we fix
the code.
Let's make the situation a bit more complicated... I mean... realistic.
5 - @dev_el_ops
class { '::apache': }
apache::vhost { 'basic.example.com':
  port    => 80,
  docroot => '/var/www/basic',
}
Example I
This example configures the same virtual host, but instead of serving up static
files, it proxies requests through to a backend service.
Searching for the error is not the point of the slide, so I've already highlighted
where everything started to go wrong.
These two services will never talk to each other and puppet will happily keep
it that way, with no errors reported from applying the catalog.
Or maybe the SSL certificate has expired.
Or the docker image's configuration doesn't look at the PORT variable and
defaults to something else.
Or there is a firewall configuration that blocks access to port 80 or port 4000.
I'm sure each one of you has their own example of that one time something
went badly wrong. Even before puppet, we had developed monitoring tools to
live with this and get a better understanding of our systems' current state.
For example, nagios' first release was in 2002.
6 - @dev_el_ops
class { '::apache': }
apache::vhost { 'basic.example.com':
  port       => 80,
  proxy_dest => 'localhost:4000',
}
docker::run { 'backend_service:latest':
  env => [ 'PORT=3000' ],
}
Example II
Testing, puppet idempotency, and monitoring are facets of a bigger effort to
verify our systems, and what I'm about to show you is another step in that
direction.
What if a puppet run could tell you more about your services' health than just
that a process successfully started?
----
https://puppet.com/blog/hitchhikers-guide-to-testing-infrastructure-as-and-code/
What if we could add a check resource into the catalog that, right then and
there, checks that the configured web service is alive and kicking and returns
the proper status from its health check endpoint?
This check resource makes an HTTP call to the specified URL and reports a
failure if the request doesn't return 200 or the body of the response is not the
specified JSON. It runs directly on the managed node every time puppet
runs. This won't make puppet a monitoring solution, but it will provide another
system health data point closely integrated into management workflows. It is
one step further in the direction of the testing/monitoring convergence that
Mina and I talked about previously.
For the sake of brevity, the example on the slide glosses over some details.
The resource dependencies need to be hooked up correctly so that the check
only happens once it can possibly succeed. The service might take a few
moments to start up, so the check should be configured with a retry loop and
a timeout; a rough sketch of what that could look like follows the slide code below.
This is also only the start of understanding all the ways this can be useful. For
example, what happens when this gets included in bolt plans for deployment
steering? Is it useful in your CD4PE blue/green deployment pipeline to
catch issues earlier?
7 - @dev_el_ops
class { '::apache': }
apache::vhost { 'basic.example.com':
  port       => 80,
  proxy_dest => 'localhost:4000',
}
docker::run { 'backend_service:latest':
  env => [ 'PORT=3000' ],
}
Example II - with added check
check_http { 'http://basic.example.com/health':
  http_status  => 200,
  http_content => '{"status":"ok"}',
}
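As a rough sketch of the details glossed over above (the retry and timeout parameter names here are my assumptions, not the module's confirmed API), the check could be wired up roughly like this:

check_http { 'http://basic.example.com/health':
  http_status  => 200,
  http_content => '{"status":"ok"}',
  # Run the check only after the pieces it depends on have been managed.
  require      => [
    Apache::Vhost['basic.example.com'],
    Docker::Run['backend_service:latest'],
  ],
  # Hypothetical retry/timeout parameters; the exact names are still being designed:
  # retries => 5,
  # timeout => 30,
}

The ordering metaparameter keeps the check from firing before the vhost and container exist, while the (still hypothetical) retry settings would cover a service that needs a few seconds to come up.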
To figure these things out, we've published a very basic prototype of this in
the puppetlabs/opv (oscar papa victor) repo. And we are looking for early
feedback on how this fits into your workflow and what other checks you'd like
to see (of course, PRs are especially welcome).
We've already identified more work that's necessary before OPV is ready for
general consumption. One big item, for example, is that nobody wants to see a
change notification every time the check succeeds. We're currently working
on a new feature in the Resource API to make that easily possible. Please have
a look at the tickets on the repo to see all the details of what's currently
planned.
Where do we go from here?
8 - @dev_el_ops
class { '::apache': }
apache::vhost { 'basic.example.com':
  port       => 80,
  proxy_dest => 'localhost:4000',
}
docker::run { 'backend_service:latest':
  env => [ 'PORT=3000' ],
}
check_http { 'http://basic.example.com/health':
  http_status  => 200,
  http_content => '{"status":"ok"}',
}
Example II - with added check
https://github.com/puppetlabs/opv
Clearly there's more work in front of us to make this fully usable.
Here's the initial list of checks we're looking at: http as just shown, https with
additional verification of SSL certificates, powershell and command to run
arbitrary shell checks, apt_update to check for outdated mirrors and packages,
certificate for all the X.509 validation fun, and service for system services. At
this point the list is very provisional and - again - please do post any and
all feedback in the OPV repo's discussions if you're looking for something
specific; a rough sketch of what a couple of these could look like follows the
list below. We don't have an infinite budget, and I'd much rather see us
implement something that folks are actually asking for.
The reporting fix is expected to land next week and will allow us to write
check resources that do not show up in a report when they're successful. The
only caveat here is that it's a Resource API change and will only be available
from the next puppet 6 and 7 releases.
For retrying we're looking at the usual exponential backoff and overall timeout
bits, so that shouldn't be too controversial.
The next one is interesting. Of course we want to integrate this into existing
modules. This will give us first-hand experience of how much effort it is and
give the OPV module some real-world exposure. It also means that we'll
elevate it to supported status for our commercial customers.
10 - @dev_el_ops
Next Steps
● checks: http, https, powershell, command, task,
apt_update, certificate, service
● fix reporting
● implement exponential retry params
● integrate into existing modules
● expose checks for plans
● gather feedback
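As a purely illustrative sketch of two of the planned checks (these types don't exist yet; their names and parameters are my assumptions rather than a committed design):

check_command { 'backend responds on port 4000':
  # Fail the check if the backend's health endpoint is not reachable locally.
  command => 'curl --fail --silent http://localhost:4000/health',
}
check_certificate { '/etc/ssl/certs/basic.example.com.pem':
  # Hypothetical parameter: fail if the certificate expires within 30 days.
  expires_in_more_than => '30d',
}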
While the resources as they exist can be used as part of apply blocks in bolt
plans, we also want to make them directly available as tasks and functions.
And finally - did I say this already? - please do throw any and all feedback
that you have at us. I'll be especially grateful for folks who can spare 30 or 60
minutes for an in-depth interview where we can double-check some of our
basic assumptions.
When I showed the idea around internally, I usually got asked one of two
questions: "When can I get this?" and "How does this interact with
acceptance testing?" To the first one, I can only say that we are working on it
as we speak. As for the second one, I'm glad you asked!
Let's dive into how this operational verification works out in testing. I've
already shown all the big bits on this slide earlier: configure a VHost, add a
file, check that the file is available on the web server.
This is now ALSO an acceptance test case: if the file can be downloaded
from the webserver, the configuration is acceptable. And this is a more valid
and more in-depth test implementation than most I've seen in the wild,
including in our own supported modules.
Plugging this into litmus is quite straightforward using idempotent_apply and
will yield test results with minimal faff. Since the test checks are already built
into the catalog, a successful application now really means that the service
has been configured correctly, without having to hand-code additional
serverspec checks in ruby or deal with rspec scoping rules.
On the next slide, I'm gonna be even more fast and loose with the details ...
10 - @dev_el_ops
apache::vhost { 'basic.example.com':
  port    => 80,
  docroot => '/var/www/basic',
}
file { '/var/www/basic/test.txt':
  content => 'Hello, World!',
}
check_http { 'http://basic.example.com/test.txt':
  http_status  => 200,
  http_content => 'Hello, World!',
}
Future of Acceptance Testing
… to spark your imagination while still fitting everything on a single page.
To quickly summarize what is happening here: to deploy this fictional app, the
plan first configures the database and checks that everything went well, then
configures the web server, double-checking that the database is accessible
from that machine before touching anything. At the bottom, the plan confirms
that the application is reachable from the node bolt is running on.
Surely production systems will have additional complexities, like running
database migrations, pre- and post-configuration steps to quiesce the
database or the app, draining, disconnecting and re-establishing load balancer
configurations, managing this process across more than two nodes, and so on.
By virtue of having critical checks directly where the configuration happens,
they don't get lost, don't introduce delays, and can provide immediate
in-place feedback if something goes wrong, without losing context at any
stage of development.
And that's all I have for you today. I hope I've inspired you to take a new
perspective on testing and monitoring, and that you'll go check out the OPV
repository and participate in where this journey takes us.
11 - @dev_el_ops
plan my_app::deploy (TargetSpec $db_server, TargetSpec $app_server, String $app_url) {
  $db_results = apply($db_server) {
    class { 'my_app::db':
      app_server => $app_server.name,
    }
  }
  opv::check_apply($db_results)

  $app_results = apply($app_server) {
    check_db { $db_server.name: }
    -> class { 'my_app::app':
      db_server  => $db_server.name,
      public_url => $app_url,
    }
  }
  opv::check_apply($app_results)
  opv::check_http($app_url)
}
Future of Deployment Testing
I think we still have a few minutes for Q&A; meanwhile, I'll leave the links for
the things I talked about up here. I'll also post the slides to Slack later.
12 - @dev_el_ops
Links and Resources
● OPV module: https://github.com/puppetlabs/opv (add feedback to Discussions)
● Monitoring == Testing: https://puppet.com/blog/hitchhikers-guide-to-testing-infrastructure-as-and-code
● Verification and Validation: https://www.easterbrook.ca/steve/2010/11/the-difference-between-verification-and-validation