Realtime stats from Varnish

At my company we are big fans of Varnish. We use it to front a number of API services, which currently peak at about 40,000 requests/second on our largest cluster, and it works very well.

As part of our monitoring and instrumentation, we’d also like to see how long these requests take and what the error rates are. We use Statsd and Graphite to capture and display those stats, and I previously published a blog post on how to do this from inside Apache, or anything else really.

There are ways to compute this by writing an app that parses the log lines varnishncsa generates (in fact, that’s what we used to do), but it would be much nicer if this just worked from Varnish directly; it would also reduce the resource use and complexity of the system.

Although there isn’t a way to do this natively in VCL, the Varnish configuration language, it’s possible to get at this data and send it off using Varnish Modules, or Vmods.

So I wrote two Vmods to do exactly this: one for capturing timing information from Varnish and one for sending those timers off to Statsd. Both Vmods have been released to the Varnish community.

We have been using both of them in production for 6 months now, serving over 3.5 billion requests a day across our clusters, and they have cost less than a 1% CPU increase in our setup. The previous setup, which tailed a varnishncsa-generated log, cost about 15% combined across the log writer and reader.

We enable request timing for every request that comes in using a snippet of VCL that we include at the top of every Varnish config we write. Our goal is to send the following stats to Statsd:

 incr:  $env.$serv.$cluster.$backend.$hit_miss.$resp_code.$host
 timer: $env.$serv.$cluster.$backend.$hit_miss.$resp_code.$host $ms
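On the wire, each of those stats is just a plain Statsd UDP datagram: a counter and a timer. A minimal Python sketch of what ends up being emitted per request (the key value below is illustrative):

```python
def statsd_datagrams(key, ms):
    """Build the counter and timer datagrams Statsd expects for one request."""
    return [
        "%s:1|c" % key,          # incr: bump the request counter by one
        "%s:%d|ms" % (key, ms),  # timer: record the response time in ms
    ]

grams = statsd_datagrams("prod.httpd.apiservices.default.hit.200.apiservices001", 42)
```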

Let me walk you through how we do this (and for your convenience, here’s a gist with the entire snippet):

First, we initialize the modules in vcl_init:

#########################################
### Initialize timing/stats modules
#########################################

### https://github.com/jib/libvmod-statsd
import statsd;
### https://github.com/jib/libvmod-timers
import timers;

### If you do not end with a return() statement, all instances of 
### vcl_* will be concatenated.
sub vcl_init {
 ### Prefix all stats with the ENVIRONMENT, the TYPE OF SERVICE
 ### and the CLUSTER NAME. Then suffix with the MACHINE NAME.
 ### Optional, but distinguishes per host/cluster/env/service type.
 statsd.prefix( "prod.httpd.apiservices." );
 statsd.suffix( ".apiservices001" );

 ### Connect to statsd
 statsd.server( "localhost", "8125" );

 ### Set the timing units - seconds, milli, micro or nano.
 timers.unit( "milliseconds" );
}

If you run multiple backends behind your varnish server, you’ll want to include the following snippet. It records which backend served the request, which isn’t in scope in vcl_deliver. If you only run a single backend, this isn’t important to you.

######################
### Determine backend
######################

### Backend will have been determined in vcl_recv, but it's
### not available in vcl_deliver. So communicate via header.
### All (non pipe) requests will go through vcl_hit or vcl_miss,
### where we have req.backend in scope. Catch a pass() as well,
### so anything that's not cacheable by design gets its backend
### assigned too.

sub vcl_miss {
 set req.http.X-Stats-Backend = req.backend;
}

sub vcl_hit {
 set req.http.X-Stats-Backend = req.backend;
}

sub vcl_pass {
 set req.http.X-Stats-Backend = req.backend;
}

Next, we determine whether the request was a success or a failure.

######################
### Determine status
######################

### obj/beresp.status may be changed elsewhere in the
### VCL to show a different response code to the end
### user - capture it now so we are reporting on the
### actual backend response, not what the user sees
sub vcl_fetch {
 set req.http.X-Stats-Status = beresp.status;
}

sub vcl_error {
 ### An error may have occurred and we've been sent to vcl_error;
 ### capture the response code the backend sent, if it wasn't
 ### captured already
 if( !req.http.X-Stats-Status ) {
  set req.http.X-Stats-Status = obj.status;
 }
}

Now, we are ready to look up the hit/miss in vcl_deliver and send the stats off to Statsd:

######################
### Send stats
######################

sub vcl_deliver {

 ### Hit or Miss?
 if( obj.hits == 0 ) {
  set req.http.X-Stats-HitMiss = "miss";
 } else {
  set req.http.X-Stats-HitMiss = "hit";
 }

 ### Not set in vcl_fetch or vcl_error? Then use the response code
 ### as it will be sent to the client.
 if( !req.http.X-Stats-Status ) {
  set req.http.X-Stats-Status = resp.status;
 }

 ### Which backend was used?

 ### You set one explicitly
 if( req.http.X-Request-Backend ) {
  set req.http.X-Stats-Backend = req.http.X-Request-Backend;

 ### We discovered it in vcl_hit/miss/pass
 } elsif ( req.http.X-Stats-Backend ) {
  # No op;

 ### We didn't discover it. Probably means you hit an error, or
 ### you synthesized a response. Set it to 'internal'
 } else {
  set req.http.X-Stats-Backend = "internal";
 }

 ### Key to use for statsd. Something like: config.hit.200
 set req.http.X-Stats-Key =
  req.http.X-Stats-Backend + "." +
  req.http.X-Stats-HitMiss + "." +
  req.http.X-Stats-Status;

 ### Increment the amount of requests here, and how long this 
 ### particular backend took.
 statsd.incr(   req.http.X-Stats-Key );
 statsd.timing( req.http.X-Stats-Key, timers.req_response_time() );
}

And once you’ve included this snippet in your VCL, you’ll be able to generate graphs like the one below, directly from your graphite installation, optionally broken down by response code, backend and hit/miss:

[Graph: HTTP stats directly from Varnish]

You can get the code for the Statsd integration and the Varnish timers directly from the Varnish Vmods page, and you can download the VCL snippet above as a gist.

Measure all the things!

So, we all know that monitoring & graphing sucks. The only thing that sucks more is retrofitting your existing app with monitoring & graphing. The good news is that there are a few straightforward things you can do that don’t involve rearchitecting the entire stack or other similarly painful activities. And that’s the topic of an Ignite talk I just gave at Devopsdays 2012.

First things first; you are going to need 3 tools to make your life better:

  1. Collectd – to collect system metrics like RAM & CPU, with plugins for Apache, etc.
  2. Statsd – to collect stats from your app, or any other random place.
  3. Graphite – to store, query & graph your data.

Graphite Setup

There are lots of great docs on setting up Graphite, so I’ll just highlight our choices: we run 1 Graphite server per data center, and that scales to about 300k writes/second/node on an AWS c1.xlarge. You’ll want 1 core per carbon process you run, and RAIDing the ephemeral drives will greatly improve your performance. We use c1.mediums in our smaller data centers, as they collect far less data per second and aren’t asked to produce graphs.

Using namespaces

As an aside, Graphite doesn’t actually care about the keys you use for your metrics, but it can do clever things using a dot as a delimiter and wildcards for searching/aggregating, so pick wisely. Our naming scheme is:

<environment>.<application>.<<your key>>.<hostname>

This allows for easy roll ups to the cluster level, avoids stomping on other metrics and neatly keeps the dev/stage/canary noise out of the production data.
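As a sketch of that scheme (the helper name and example values here are mine, not part of our actual tooling):

```python
def metric_name(environment, application, key, hostname):
    """Compose a metric name as <environment>.<application>.<key>.<hostname>."""
    return ".".join([environment, application, key, hostname])

# A canary host's beacon timer stays neatly out of the production rollups:
name = metric_name("canary", "httpd", "beacon.timer", "web001")
```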

We also do automated rollups per cluster with the carbon aggregator, using the rules below, which saves CPU when these numbers are displayed on graphs every 1-5 minutes.

<prefix>.<env>.<key>.total_sum (10) = sum <prefix>.(?!timers)<env>.<<key>>.<node>
<prefix>.<env>.<key>.total_avg (10) = avg <prefix>.(?!timers)<env>.<<key>>.<node>

Getting data into graphite

If you’re using Apache, you can easily get basic statistics about your system into Graphite. Just add a piped custom log to your Vhost configuration and you’re good to go. For example, if you were running a simple analytics beacon, this would define a log format that captures response times for your beacon and sends them to Statsd, making them immediately viewable in Graphite:

LogFormat "http.beacon:%D|ms" stats
CustomLog "|nc -u localhost 8125" stats

If you’re using Varnish, you can write out a log file with performance data using varnishncsa, and then use a trivial tailing script to pipe the data onto Statsd. For example, to measure the performance of a dynamic service, you could do something like this:

$ varnishncsa -a -w stats.log -F "%U %{Varnish:time_firstbyte}x %{Varnish:hitmiss}x %{X-Backend}"

$ varnishtail stats.log
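The tailing script itself only needs a few lines; here's a hypothetical Python sketch of what such a script does with each log line (field order matches the -F format above; the key layout is illustrative):

```python
import socket

def line_to_timer(line):
    """Turn one "<url> <time_firstbyte> <hitmiss> <backend>" log line into a
    Statsd timer datagram. time_firstbyte is in seconds; Statsd wants ms."""
    url, tfb, hitmiss, backend = line.split()
    key = ".".join([backend, hitmiss, url.strip("/").replace("/", ".") or "root"])
    return "%s:%d|ms" % (key, round(float(tfb) * 1000))

def tail_to_statsd(lines, addr=("localhost", 8125)):
    """Ship each parsed line to Statsd over UDP."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    for line in lines:
        sock.sendto(line_to_timer(line).encode(), addr)
```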

And if you’re using a framework for your dynamic application, like Tornado or Django, you can hook into their request/response cycle to get the data you need. For example, you can wrap the request handler in a "with stats.timer(path): ..." statement and send stats on response code/time per path straight to Statsd from the response handler.
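A minimal sketch of such a timer wrapper (this hypothetical version takes the socket as an argument so it stays testable; a real one would hold a module-level UDP socket):

```python
import time
from contextlib import contextmanager

@contextmanager
def timer(key, sock, addr=("localhost", 8125)):
    """Time the wrapped block and emit a Statsd timer when it finishes."""
    start = time.time()
    try:
        yield
    finally:
        ms = int((time.time() - start) * 1000)
        sock.sendto(("%s:%d|ms" % (key, ms)).encode(), addr)
```

In a Tornado or Django handler you would enter this around the work the request does, keyed on the request path and, on the way out, the response code.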

Transforming your data

Graphite supports lots of functions (mathematical and otherwise) to transform your data. There are a few I’d like to highlight, as they really give you a lot of bang for your buck.

Use events to graph all of your deploys, CDN reconfigurations, outages, code reviews or any other event that is important to your organization.

You can compare week-over-week (or day-over-day, etc.) data with time shift: graph some data, set your graph period to a week, and then ask for the same data one, two and three weeks ago.
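In Graphite this is the timeShift function; a target list along these lines does it (the metric name is a placeholder):

```
alias(site.pageviews, "this week")
alias(timeShift(site.pageviews, "1w"), "last week")
alias(timeShift(site.pageviews, "2w"), "two weeks ago")
alias(timeShift(site.pageviews, "3w"), "three weeks ago")
```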

Did your page views go up? How about conversions?

Your latency, pageviews, signups, queues or whatever data you track probably change with the time of day or the day of the week. So instead of alerting on absolute numbers, start using the standard deviation: calculate the mean value and the standard deviation around that mean. For example: “our servers respond in 100 milliseconds, with a standard deviation of 50ms”.

Then figure out how far each of your other servers is from that mean; you may easily find servers that are 2-4 standard deviations away. Your monitoring would have caught a 500ms upper response time, but it would not have let you do this kind of preventive care.
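A quick sketch of the arithmetic, with made-up response times:

```python
from statistics import mean, stdev

# Per-host average response times (ms) across a cluster -- illustrative numbers.
cluster = [95, 100, 105, 110, 90, 100, 95, 105]
mu, sigma = mean(cluster), stdev(cluster)

def deviations_from_mean(value):
    """How many standard deviations a host sits from the cluster mean."""
    return (value - mu) / sigma

# A host averaging 130ms is more than 4 sigma out -- worth a look long
# before it would ever trip an absolute 500ms alert.
outlier = deviations_from_mean(130)
```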

Taking this one step further, look into Holt-Winters forecasting to use your past data to model historical upper and lower bounds for your current data. Graphite even has built-in functions for it (holtWintersForecast and holtWintersConfidenceBands).

Further reading

This hopefully gives you a taste of what’s possible, and luckily there are a lot of people using these tools and sharing their insights: