Realtime stats from Varnish

At my company we are big fans of Varnish. We use it to front a number of API services, which currently peak at about 40,000 requests/second on our largest cluster, and it works very well.

As part of our monitoring and instrumentation, we’d also like to see how long these requests take and what the error rates are. We use Statsd and Graphite to capture and display those stats, and I previously published a blog post on how to do this inside Apache, or anything else really.

One way to calculate this is to write an app that parses the log lines varnishncsa can generate (in fact, that’s what we used to do), but it’s much nicer if this works directly from Varnish, and doing so significantly reduces the resource use and complexity of the system.

Although there isn’t a way to do this natively in VCL, the Varnish configuration language, it’s possible to get at this data and send it off using Varnish Modules, or Vmods.

So I wrote two Vmods to do exactly this; one for capturing timing information from Varnish and one for sending those timers off to Statsd. Both Vmods have been released to the Varnish community.

We have been using both of them in production for 6 months now, serving over 3.5 billion requests a day across our clusters, and they have cost less than a 1% CPU increase in our setup. The previous setup, which tailed a varnishncsa-generated log, cost about 15% CPU combined across the log writer and reader.

We enable request timing for every request that comes in using a snippet of VCL that we include at the top of every Varnish config we write. Our goal is to send the following stats to Statsd:

 incr:  $env.$serv.$cluster.$backend.$hit_miss.$resp_code.$host
 timer: $env.$serv.$cluster.$backend.$hit_miss.$resp_code.$host $ms
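Concretely, each request produces one counter and one timer in the statsd plain-text protocol (`key:value|c` for counters, `key:value|ms` for timers). Here is a small Python sketch of the two datagrams a single request generates; the key below is an illustrative example, not a real metric from our setup:

```python
# One request yields two statsd datagrams: a counter increment
# and a timing value, in the plain-text statsd wire protocol.
def request_datagrams(key, ms):
    return [
        "%s:1|c" % key,          # incr:  count one request
        "%s:%d|ms" % (key, ms),  # timer: response time in ms
    ]

# Example key: env.service.cluster.backend.hit_miss.status.host
key = "prod.httpd.apiservices.config.hit.200.apiservices001"
for line in request_datagrams(key, 42):
    print(line)
```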

Let me walk you through how we do this (and for your convenience, here’s a gist with the entire snippet):

First, we initialize the modules in vcl_init:

#########################################
### Initialize timing/stats modules
#########################################

### https://github.com/jib/libvmod-statsd
import statsd;
### https://github.com/jib/libvmod-timers
import timers;

### If you do not end with a return() statement, all instances of 
### vcl_* will be concatenated.
sub vcl_init {
 ### Prefix all stats with the ENVIRONMENT, the TYPE OF SERVICE
 ### and the CLUSTER NAME. Then suffix with the MACHINE NAME.
 ### Optional, but distinguishes per host/cluster/env/service type.
 statsd.prefix( "prod.httpd.apiservices." );
 statsd.suffix( ".apiservices001" );

 ### Connect to statsd
 statsd.server( "localhost", "8125" );

 ### Set the timing units - seconds, milli, micro or nano.
 timers.unit( "milliseconds" );
}
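To make the prefix/suffix behavior concrete: every key you later pass to statsd.incr() or statsd.timing() gets wrapped with the configured prefix and suffix before it goes over the wire, producing the full Graphite metric path. A quick sketch of that composition, using the values from the snippet above:

```python
# The statsd vmod wraps every key in the configured prefix and
# suffix, yielding the full Graphite metric path.
prefix = "prod.httpd.apiservices."  # env.service.cluster
suffix = ".apiservices001"          # machine name
key    = "config.hit.200"           # backend.hit_miss.status

metric = prefix + key + suffix
print(metric)
# prod.httpd.apiservices.config.hit.200.apiservices001
```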

If you run multiple backends behind your Varnish server, you’ll want to include the following snippet. It records which backend served the request, since req.backend isn’t in scope in vcl_deliver. If you only run a single backend, this isn’t important to you.

######################
### Determine backend
######################

### Backend will have been determined in vcl_recv, but it's
### not available in vcl_deliver. So communicate via a header.
### All (non-pipe) requests will go through vcl_hit or vcl_miss,
### where we have req.backend in scope. Catch a pass() as well,
### so anything that's not cacheable by design gets its backend
### assigned too.

sub vcl_miss {
 set req.http.X-Stats-Backend = req.backend;
}

sub vcl_hit {
 set req.http.X-Stats-Backend = req.backend;
}

sub vcl_pass {
 set req.http.X-Stats-Backend = req.backend;
}

Next, we determine whether the request was a success or a failure.

######################
### Determine status
######################

### obj/beresp.status may be changed elsewhere in the
### VCL to show a different response code to the end
### user - capture it now so we are reporting on the
### actual backend response, not what the user sees
sub vcl_fetch {
 set req.http.X-Stats-Status = beresp.status;
}

sub vcl_error {
 ### An error may have occurred and we've been sent to vcl_error;
 ### capture the response code that the backend sent, if it
 ### wasn't captured already
 if( !req.http.X-Stats-Status ) {
  set req.http.X-Stats-Status = obj.status;
 }
}

Now, we are ready to look up the hit/miss in vcl_deliver and send the stats off to Statsd:

######################
### Send stats
######################

sub vcl_deliver {

 ### Hit or Miss?
 if( obj.hits == 0 ) {
  set req.http.X-Stats-HitMiss = "miss";
 } else {
  set req.http.X-Stats-HitMiss = "hit";
 }

 ### Not set in vcl_fetch or vcl_error? Then use the response
 ### code as it will be sent to the client.
 if( !req.http.X-Stats-Status ) {
  set req.http.X-Stats-Status = resp.status;
 }

 ### Which backend was used?

 ### You set one explicitly
 if( req.http.X-Request-Backend ) {
  set req.http.X-Stats-Backend = req.http.X-Request-Backend;

 ### We discovered it in vcl_hit/miss/pass
 } elsif ( req.http.X-Stats-Backend ) {
  # No op;

 ### We didn't discover it. Probably means you hit an error, or
 ### you synthesized a response. Set it to 'internal'
 } else {
  set req.http.X-Stats-Backend = "internal";
 }

 ### Key to use for statsd. Something like: config.hit.200
 set req.http.X-Stats-Key =
  req.http.X-Stats-Backend + "." +
  req.http.X-Stats-HitMiss + "." +
  req.http.X-Stats-Status;

 ### Increment the amount of requests here, and how long this 
 ### particular backend took.
 statsd.incr(   req.http.X-Stats-Key );
 statsd.timing( req.http.X-Stats-Key, timers.req_response_time() );
}
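Before wiring everything up to Graphite, it can be handy to eyeball the raw datagrams Varnish emits. The statsd protocol is just plain-text UDP, so a few lines of Python (a debugging sketch of my own, not part of the vmods) are enough to listen where statsd would and print what arrives:

```python
# Tiny UDP listener to inspect the datagrams Varnish sends to
# statsd. Run it on the statsd host/port (with statsd stopped)
# and watch raw "key:value|type" lines as requests come in.
import socket

def open_listener(host="127.0.0.1", port=8125):
    """Bind a UDP socket where statsd would normally listen."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock

def recv_metric(sock):
    """Each statsd metric arrives as one plain-text UDP datagram."""
    data, _addr = sock.recvfrom(4096)
    return data.decode()

# Self-contained demo: bind an ephemeral port and send ourselves
# a metric shaped like the ones the VCL above produces.
sock = open_listener(port=0)
port = sock.getsockname()[1]
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"prod.httpd.apiservices.config.hit.200.apiservices001:1|c",
              ("127.0.0.1", port))
print(recv_metric(sock))
```

In real use you would replace the demo at the bottom with a loop that calls recv_metric() on port 8125.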

And once you’ve included this snippet in your VCL, you’ll be able to generate graphs like the one below, directly from your graphite installation, optionally broken down by response code, backend and hit/miss:

HTTP stats directly from Varnish

You can get the code for the Statsd integration and the Varnish timers directly from the Varnish Vmods page, and you can download the VCL snippet above as a gist.


