How to measure everything: Instrumenting your infrastructure

At Krux, we are addicted to metrics. Metrics not only give us the insight to diagnose anomalies, but also help us make good decisions on things like capacity planning, performance tuning, cost optimization, and beyond. The benefits of well-instrumented infrastructure are felt far beyond the operations team; with well-defined metrics, developers, product managers, sales and support staff are all able to make sounder decisions within their field of expertise.

For collecting and storing metrics, we really like using Statsd, Collectd and Graphite. We use Collectd to gather system information like disk usage, CPU usage, and service state. Collectd relays its metrics to Graphite through Statsd, and any applications we write stream data directly to Statsd as well. I cover this topic in depth in a previous article.

[Diagram: Statsd, Collectd and Graphite pipeline]
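If you've not worked with Statsd before, its wire protocol is simple enough to speak by hand. Here's a minimal sketch, with made-up metric names, of an application emitting a counter and a timer over UDP:

import socket
import time

### Statsd listens on UDP; each message is plain text: "name:value|type".
STATSD = ('localhost', 8125)
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

### Count one event (a counter, type "c")...
sock.sendto(b'sample-app.requests:1|c', STATSD)

### ...and time an operation in milliseconds (a timer, type "ms").
start = time.time()
time.sleep(0.1)  ### stand-in for real work
elapsed_ms = int((time.time() - start) * 1000)
sock.sendto(('sample-app.run:%d|ms' % elapsed_ms).encode(), STATSD)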

The real power of this system comes from automatically instrumenting all your moving parts, so that everything emits stats as part of normal operations. In an ideal world, your metrics are so encompassing that a graph of your KPIs tells you everything you need to know about a service, and digging into logs is only needed in the most extreme circumstances.

Instrumenting your infrastructure

In a previous article, I introduced libvmod-statsd, which lets you instrument Varnish to emit realtime stats that come from your web requests. Since then, I’ve also written an accompanying module for Apache, called mod_statsd.

Between those two modules, all of our web requests are now automatically measured, irrespective of whether the application behind them sends its own measurements. In fact, you may remember this graph, which we build for every service we operate, composed from metrics coming directly from Varnish or Apache:

[Graph: per-service dashboard built from Varnish and Apache metrics]

As you can see from the graph, we track the number of good responses (HTTP 2xx/3xx) and bad responses (HTTP 4xx/5xx) to make sure our users are getting the content they ask for. On the right-hand side of the graph, we track the upper 95th & 99th percentile response times to ensure requests come back in a timely fashion. Lastly, we add a bit of bookkeeping: we keep track of a safe QPS threshold (the magenta horizontal line) to ensure there is enough service capacity for all incoming requests, and we draw vertical lines whenever a deploy happens or a Map/Reduce job starts, as those two events are leading indicators of behavior change in services.
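If you want similar annotations on your own graphs, Graphite can draw both directly. A sketch of the relevant render targets, where the threshold value and the deploy metric name are hypothetical:

&target=alias(constantLine(2000), "safe QPS")
&target=alias(drawAsInfinite(events.deploys.sample-app), "deploys")

constantLine() draws the horizontal threshold, and drawAsInfinite() turns each datapoint of an event metric into a vertical line.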

But not every system you run will be a web server; there are process managers, cron jobs, configuration management, and perhaps even cloud infrastructure, to name just a few. In the rest of this article, I'd like to highlight a few techniques, and some open source software we wrote, that you can use to instrument those types of systems as well.

Process management

Odds are you're writing (or using) quite a few daemons as part of your infrastructure, and those daemons will need to be managed somehow. On Linux, you have a host of choices like SysV init, Upstart, Systemd, daemontools, Supervisor, etc. Depending on your flavor of Linux, one of these is likely to be the default for system services already.

For our own applications & services, we settled on Supervisor; we found it to be very reliable and easy to configure in a single file. It also provides good logging support, an admin interface, and monitoring hooks. There’s even a great puppet module for it.

The one bit we found lacking is direct integration with our metrics system to tell us what happened to a service and when. For that purpose I wrote Sulphite, a Supervisor event listener; Sulphite emits a stat to Graphite every time a service changes state (for example, from running to exited). This lets us track restarts, crashes, unexpected downtime and more as part of our service dashboard.

You can get Sulphite by installing it from PyPI like this:

$ pip install sulphite

And then configure it as an Event Listener in Supervisor like this:

[eventlistener:sulphite]
; Supervisor does not run commands through a shell, so backticks won't
; expand; %(host_node_name)s is Supervisor's own hostname expansion.
command=sulphite --graphite-server=graphite.example.com --graphite-prefix=events.supervisor --graphite-suffix=%(host_node_name)s
events=PROCESS_STATE
numprocs=1
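Under the hood, a Supervisor event listener is just a long-running process that speaks a simple protocol over stdin/stdout. Here's a rough, hypothetical sketch of what Sulphite does (the stat naming is made up; see the project itself for the real implementation), using Supervisor's bundled childutils helper:

import sys
import socket
from supervisor import childutils

def main():
    statsd = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        ### Blocks until Supervisor delivers an event; the protocol
        ### handshake (READY/RESULT) is handled by childutils.
        headers, payload = childutils.listener.wait(sys.stdin, sys.stdout)
        pheaders, _ = childutils.eventdata(payload + '\n')

        ### e.g. 'events.supervisor.myapp.process_state_exited'
        stat = 'events.supervisor.%s.%s' % (
            pheaders['processname'], headers['eventname'].lower())
        statsd.sendto(('%s:1|c' % stat).encode(), ('localhost', 8125))

        ### Acknowledge the event so Supervisor sends the next one.
        childutils.listener.ok(sys.stdout)

if __name__ == '__main__':
    main()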

Configuration management

Similarly, you're probably using some sort of configuration management system like Chef or Puppet, especially if you have more than a handful of nodes. In our case, we use Puppet for all of our servers, and as part of its operations it produces a lot of valuable information in report form. By default, these reports are stored as log files on the client & server, but using the custom reports functionality, you can send this information on to Graphite as well, letting you correlate changes in service behavior with changes made by Puppet.

We open sourced the code we use for that, and you can configure Puppet to use it by following the installation instructions and creating a simple graphite.yaml file that looks something like this:

$ cat /etc/puppet/graphite.yaml
---
:host: 'graphite.example.com'
:port: 2023
:prefix: 'puppet.metrics'

And then updating your Puppet configuration file like this:

[master]
pluginsync = true
report     = true
reports    = store,graphite

[agent]
pluginsync = true
report     = true
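For the curious, a report processor like this ultimately just has to get its measurements to Carbon, whose plaintext protocol is one "metric value timestamp" line per measurement, sent to the host and port from graphite.yaml. A hypothetical sketch of that wire format (the metric name and value are made up for illustration):

import socket
import time

### Carbon's plaintext listener accepts "metric value timestamp" lines.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect(('graphite.example.com', 2023))

### Hypothetical metric: total runtime of the last Puppet run on web001.
line = 'puppet.metrics.web001.time.total 42.7 %d\n' % time.time()
sock.sendall(line.encode())
sock.close()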

Programming language support

One of the best things you can do to instrument your infrastructure is to provide a base library that includes statistics, logging & monitoring support for the language(s) your company uses. We're big users of Python and Java, so we created libraries for each. Below, I'll show you the Python library, which we've open sourced.

Our main design goal was to make it easy for our developers to do the right thing; by the very nature of using the base library, you'd get convenient access to patterns and code you wouldn't have to write again, while also giving operators all the knobs & insights needed to run it in production.

The library comes with two main entry points you can inherit from for apps you might want to write. There’s krux.cli.Application for command line tools, and krux.tornado.Application for building dynamic services. You can install it by running:

$ pip install krux-stdlib

Here’s what a basic app might look like built on top of krux-stdlib:

import krux.cli

class App(krux.cli.Application):
    def __init__(self):
        ### Call to the superclass to bootstrap.
        super(App, self).__init__(name = 'sample-app')

    def run(self):
        stats = self.stats
        log   = self.logger

        with stats.timer('run'):
            log.info('running...')
            ...

The ‘name’ argument above uniquely identifies the app across your business environment; it's used as the prefix for stats and as the identifier in log files, as well as the app's name in the usage message. Without adding any additional code (and therefore work), here's what is immediately available to operators:

$ sample-app -h
[…]

logging:
 --log-level {info,debug,critical,warning,error}
    Verbosity of logging. (default: warning)
stats:
 --stats Enable sending statistics to statsd. (default: False)
 --stats-host STATS_HOST
    Statsd host to send statistics to. (default: localhost)
 --stats-port STATS_PORT
    Statsd port to send statistics to. (default: 8125)
 --stats-environment STATS_ENVIRONMENT
    Statsd environment. (default: dev)

Now any app developed can be deployed to production, with stats & logging enabled, like this:

$ sample-app --stats --log-level warning

Both krux.cli and krux.tornado will capture metrics and log lines as part of their normal operation, so even if developers aren’t adding additional information, you’ll still get a good baseline of metrics just from using this as a base class.

AWS costs

We run most of our infrastructure inside AWS, and we pride ourselves on our cost management in such an on-demand environment; we optimize every bit of our infrastructure to eliminate waste and ensure we get the biggest bang for our buck.

Part of how we do this is to track Amazon costs as they happen, in realtime, and cross-correlate them with the services we run. Since Amazon exposes your ongoing expenses as CloudWatch metrics, you can programmatically access them and add them to your Graphite service graphs.

Start by installing the CloudWatch CLI tools and then, for every Amazon service you care about, simply run:

$ mon-get-stats EstimatedCharges \
   --namespace "AWS/Billing" \
   --statistics Sum \
   --dimensions "ServiceName=${service}" \
   --start-time $date

You can then send those numbers to Statsd using your favorite mechanism.
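As a sketch of one such mechanism, assuming mon-get-stats prints one sample per line as "date time value unit" (check the output format of your CLI version), you could forward the most recent sample as a statsd gauge like this:

import socket
import subprocess

service = 'AmazonEC2'  ### hypothetical service name

out = subprocess.check_output([
    'mon-get-stats', 'EstimatedCharges',
    '--namespace', 'AWS/Billing',
    '--statistics', 'Sum',
    '--dimensions', 'ServiceName=%s' % service,
])

### Assumption: lines look like '2013-05-01 00:00:00  42.5  None';
### take the value column of the last (most recent) sample.
value = out.decode().strip().splitlines()[-1].split()[2]

### Send it off as a statsd gauge (type 'g').
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(('aws.billing.%s:%s|g' % (service, value)).encode(),
            ('localhost', 8125))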

Further reading

The tips and techniques above are more detailed examples from a presentation I've given about measuring a million metrics per second with minimal developer overhead. If you find them interesting, there are more available in the slides from that presentation.


Realtime stats from Varnish

At my company we are big fans of Varnish. We use it to front a number of API services, which currently peak at about 40,000 requests/second on our largest cluster, and it works very well.

As part of our monitoring and instrumentation, we'd also like to see how long these requests take and what the error rates are. We use Statsd and Graphite to capture and display those stats, and I previously released a blog post on how to do this from Apache, or anything else really.

There are ways to calculate this by writing an app that processes the log lines varnishncsa can generate (in fact, that's what we used to do), but it's much nicer if it just works from Varnish directly, and doing so definitely reduces the resource use and complexity of the system.

Although there isn’t a way to do this natively in VCL, the Varnish configuration language, it’s possible to get at this data and send it off using Varnish Modules, or Vmods.

So I wrote two Vmods to do exactly this; one for capturing timing information from Varnish and one for sending those timers off to Statsd. Both Vmods have been released to the Varnish community.

We have been using both of them in production for 6 months now, serving over 3.5 billion requests a day across our clusters, at a cost of less than a 1% CPU increase in our setup. The previous setup, which tailed a varnishncsa-generated log, cost about 15% combined across the log writer & reader.

We enable request timing for every request that comes in using a snippet of VCL that we include at the top of every Varnish config we write. Our goal is to send the following stats to Statsd:

 incr:  $env.$serv.$cluster.$backend.$hit_miss.$resp_code.$host
 timer: $env.$serv.$cluster.$backend.$hit_miss.$resp_code.$host $ms

Let me walk you through how we do this (and for your convenience, here’s a gist with the entire snippet):

First, we initialize the modules in vcl_init:

#########################################
### Initialize timing/stats modules
#########################################

### https://github.com/jib/libvmod-statsd
import statsd;
### https://github.com/jib/libvmod-timers
import timers;

### If you do not end with a return() statement, all instances of 
### vcl_* will be concatenated.
sub vcl_init {
 ### Prefix all stats with the ENVIRONMENT, the TYPE OF SERVICE
 ### and the CLUSTER NAME. Then suffix with the MACHINE NAME.
 ### Optional, but distinguishes per host/cluster/env/service type.
 statsd.prefix( "prod.httpd.apiservices." );
 statsd.suffix( ".apiservices001" );

 ### Connect to statsd
 statsd.server( "localhost", "8125" );

 ### Set the timing units - seconds, milli, micro or nano.
 timers.unit( "milliseconds" );
}

If you run multiple backends behind your Varnish server, you'll want to include the following snippet. It records which backend served the request, since that information isn't in scope in vcl_deliver. If you only run a single backend, this isn't important to you.

######################
### Determine backend
######################

### Backend will have been determined in vcl_recv, but it's
### not available in vcl_deliver. So communicate via header.
### All (non-pipe) requests will go through vcl_hit or vcl_miss,
### where we have req.backend in scope. Catch a pass() as well,
### so anything that's not cacheable by design gets its backend
### assigned too.

sub vcl_miss {
 set req.http.X-Stats-Backend = req.backend;
}

sub vcl_hit {
 set req.http.X-Stats-Backend = req.backend;
}

sub vcl_pass {
 set req.http.X-Stats-Backend = req.backend;
}

Next, we determine what the status of the request was: success or failure.

######################
### Determine status
######################

### obj/beresp.status may be changed elsewhere in the
### VCL to show a different response code to the end
### user - capture it now so we are reporting on the
### actual backend response, not what the user sees
sub vcl_fetch {
 set req.http.X-Stats-Status = beresp.status;
}

sub vcl_error {
 ### An error may have occurred, and we've been sent to vcl_error.
 ### Capture the response code that the backend sent, if it wasn't
 ### captured already.
 if( !req.http.X-Stats-Status ) {
  set req.http.X-Stats-Status = obj.status;
 }
}

Now, we are ready to look up the hit/miss in vcl_deliver and send the stats off to Statsd:

######################
### Send stats
######################

sub vcl_deliver {

 ### Hit or Miss?
 if( obj.hits == 0 ) {
  set req.http.X-Stats-HitMiss = "miss";
 } else {
  set req.http.X-Stats-HitMiss = "hit";
 }

 ### So, not set in vcl_fetch or vcl_error? Then use the response code
 ### as it will be sent to the client.
 if( !req.http.X-Stats-Status ) {
  set req.http.X-Stats-Status = resp.status;
 }

 ### Which backend was used?

 ### You set one explicitly
 if( req.http.X-Request-Backend ) {
  set req.http.X-Stats-Backend = req.http.X-Request-Backend;

 ### We discovered it in vcl_hit/miss/pass
 } elsif ( req.http.X-Stats-Backend ) {
  # No op;

 ### We didn't discover it. Probably means you hit an error, or
 ### you synthesized a response. Set it to 'internal'
 } else {
  set req.http.X-Stats-Backend = "internal";
 }

 ### Key to use for statsd. Something like: config.hit.200
 set req.http.X-Stats-Key =
  req.http.X-Stats-Backend + "." +
  req.http.X-Stats-HitMiss + "." +
  req.http.X-Stats-Status;

 ### Increment the amount of requests here, and how long this 
 ### particular backend took.
 statsd.incr(   req.http.X-Stats-Key );
 statsd.timing( req.http.X-Stats-Key, timers.req_response_time() );
}
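Putting it all together with the prefix & suffix we configured in vcl_init, those two statsd calls produce packets along these lines (the 12 ms timing is just an example value):

prod.httpd.apiservices.config.hit.200.apiservices001:1|c
prod.httpd.apiservices.config.hit.200.apiservices001:12|ms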

And once you’ve included this snippet in your VCL, you’ll be able to generate graphs like the one below, directly from your graphite installation, optionally broken down by response code, backend and hit/miss:

[Graph: HTTP stats directly from Varnish]

You can get the code for the Statsd integration and Varnish timers directly from the Varnish Vmods page, and you can download the VCL snippet above as a gist.