Realtime stats from Varnish

At my company we are big fans of Varnish. We use it to front a number of API services, which currently peak at about 40,000 requests/second on our largest cluster, and it works very well.

As part of our monitoring and instrumentation, we’d also like to see how long these requests take and what the error rates are. We use Statsd and Graphite to capture and display those stats, and I previously published a blog post on how to do this from inside Apache, or anything else really.

There are ways to calculate these stats by writing an app that consumes the log lines varnishncsa can generate (in fact, that’s what we used to do), but it would be much nicer if this just worked from Varnish directly; that would also reduce the resource use and complexity of the system.

Although there isn’t a way to do this natively in VCL, the Varnish configuration language, it’s possible to get at this data and send it off using Varnish Modules, or Vmods.

So I wrote two Vmods to do exactly this: one for capturing timing information from Varnish and one for sending those timers off to Statsd. Both Vmods have been released to the Varnish community.

We have been using both of them in production for 6 months now, serving over 3.5 billion requests a day across our clusters, at a cost of less than a 1% CPU increase in our setup. The previous setup, which tailed a varnishncsa-generated log, cost about 15% combined across the log writer and reader.

We enable request timing for every request that comes in using a snippet of VCL that we include at the top of every Varnish config we write. Our goal is to send the following stats to Statsd:

 incr:  $env.$serv.$cluster.$backend.$hit_miss.$resp_code.$host
 timer: $env.$serv.$cluster.$backend.$hit_miss.$resp_code.$host $ms
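Filled in with the values we use below (environment prod, service httpd, cluster apiservices, host apiservices001, and a hypothetical backend named config), a cached 200 response would produce a key like this:

 prod.httpd.apiservices.config.hit.200.apiservices001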

Let me walk you through how we do this (and for your convenience, here’s a gist with the entire snippet):

First, we initialize the modules in vcl_init:

#########################################
### Initialize timing/stats modules
#########################################

### https://github.com/jib/libvmod-statsd
import statsd;
### https://github.com/jib/libvmod-timers
import timers;

### If you do not end with a return() statement, all instances of 
### vcl_* will be concatenated.
sub vcl_init {
 ### Prefix all stats with the ENVIRONMENT, the TYPE OF SERVICE
 ### and the CLUSTER NAME. Then suffix with the MACHINE NAME.
 ### Optional, but distinguishes per host/cluster/env/service type.
 statsd.prefix( "prod.httpd.apiservices." );
 statsd.suffix( ".apiservices001" );

 ### Connect to statsd
 statsd.server( "localhost", "8125" );

 ### Set the timing units - seconds, milli, micro or nano.
 timers.unit( "milliseconds" );
}

If you run multiple backends behind your Varnish server, you’ll want to include the following snippet. It records which backend served the request, since that information isn’t in scope in vcl_deliver. If you only run a single backend, this isn’t important to you.

######################
### Determine backend
######################

### Backend will have been determined in vcl_recv, but it's
### not available in vcl_deliver. So communicate via header.
### All (non-pipe) requests will go through vcl_hit or vcl_miss,
### where we have req.backend in scope. Catch a pass() as well,
### so anything that's not cacheable by design gets its backend
### assigned too.

sub vcl_miss {
 set req.http.X-Stats-Backend = req.backend;
}

sub vcl_hit {
 set req.http.X-Stats-Backend = req.backend;
}

sub vcl_pass {
 set req.http.X-Stats-Backend = req.backend;
}

Next, we determine the status of the request: success or failure.

######################
### Determine status
######################

### obj/beresp.status may be changed elsewhere in the
### VCL to show a different response code to the end
### user - capture it now so we are reporting on the
### actual backend response, not what the user sees
sub vcl_fetch {
 set req.http.X-Stats-Status = beresp.status;
}

sub vcl_error {
 ### An error may have occurred and we've been sent to vcl_error;
 ### capture the response code that the backend sent, if it wasn't
 ### captured already
 if( !req.http.X-Stats-Status ) {
  set req.http.X-Stats-Status = obj.status;
 }
}

Now, we are ready to look up the hit/miss in vcl_deliver and send the stats off to Statsd:

######################
### Send stats
######################

sub vcl_deliver {

 ### Hit or Miss?
 if( obj.hits == 0 ) {
  set req.http.X-Stats-HitMiss = "miss";
 } else {
  set req.http.X-Stats-HitMiss = "hit";
 }

 ### Not set in vcl_fetch or vcl_error? Then use the response
 ### code as it will be sent to the client.
 if( !req.http.X-Stats-Status ) {
  set req.http.X-Stats-Status = resp.status;
 }

 ### Which backend was used?

 ### You set one explicitly
 if( req.http.X-Request-Backend ) {
  set req.http.X-Stats-Backend = req.http.X-Request-Backend;

 ### We discovered it in vcl_hit/miss/pass
 } elsif ( req.http.X-Stats-Backend ) {
  # No op;

 ### We didn't discover it. Probably means you hit an error, or
 ### you synthesized a response. Set it to 'internal'
 } else {
  set req.http.X-Stats-Backend = "internal";
 }

 ### Key to use for statsd. Something like: config.hit.200
 set req.http.X-Stats-Key =
  req.http.X-Stats-Backend + "." +
  req.http.X-Stats-HitMiss + "." +
  req.http.X-Stats-Status;

 ### Increment the amount of requests here, and how long this 
 ### particular backend took.
 statsd.incr(   req.http.X-Stats-Key );
 statsd.timing( req.http.X-Stats-Key, timers.req_response_time() );
}
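For a concrete picture of what goes over the wire: with the prefix and suffix set in vcl_init, a hypothetical backend named config and an illustrative 12 ms response time, the two calls above emit one counter and one timer in Statsd’s standard wire format:

 prod.httpd.apiservices.config.hit.200.apiservices001:1|c
 prod.httpd.apiservices.config.hit.200.apiservices001:12|ms

Statsd listens on UDP by default, so sending these stats is fire-and-forget and doesn’t block request handling.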

And once you’ve included this snippet in your VCL, you’ll be able to generate graphs like the one below directly from your Graphite installation, optionally broken down by response code, backend and hit/miss:

HTTP stats directly from Varnish

You can get the code for the Statsd integration and the Varnish timers directly from the Varnish Vmods page, and you can download the VCL snippet above as a gist.


Be “Do Not Track” compliant in 30 microseconds or less.


Last week I blogged about the state of Do Not Track on the top internet properties, advertisers and widget providers. If you haven’t read it yet, spoiler alert: the results aren’t encouraging.

In my experience, many of the top web operators are in fact concerned with your privacy, so it might be hard to understand why even they aren’t honoring your “Do Not Track” settings. I’d venture that part of it is awareness, but also that, at scale, implementing a “Do Not Track”-compliant solution isn’t a trivial matter. The latter becomes obvious when you see many people in the Web & Ad industry talking about the importance of DNT, while only a handful can actually provide a working implementation.

Do Not Track – the basics

Let me illustrate the difference between DNT and non-DNT requests by using Krux (my employer) services as an example.

Under normal circumstances, a basic request/response looks something like this: a user generates some sort of analytics data and is given a cookie on their first visit. The data is then logged somewhere and processed:

> GET http://beacon.krxd.net/pageview.gif?xxxxxxxxx
< HTTP/1.1 204 No Content
< Set-Cookie: uid=YYYYYY; expires=zz.zz.zz

Now, for a DNT-enabled request, the exchange looks a bit different. The user still generates some sort of analytics data, but the cookie value in the response is now set to ‘DNT’ (we set the value to ‘DNT’ because you can’t yet read the DNT header from JavaScript in all browsers) and the expiry is set to a fixed date in the future, so it’s impossible to distinguish one user from another based on any of these properties:

> GET http://beacon.krxd.net/pageview.gif?xxxxxxxxx
> DNT: 1
< HTTP/1.1 204 No Content
< Set-Cookie: uid=DNT; expires=Fri, 01-Jan-38 00:00:00 GMT
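The DNT-handling branch of that exchange is easy enough to express in VCL; here is a minimal sketch, in the same Varnish 3.x syntax as the snippets in the first half of this post, of how a beacon could answer a DNT request. It only illustrates the exchange above; it is not the implementation we run in production (read on for that):

sub vcl_deliver {
 if( req.http.DNT == "1" ) {
  ### DNT user: hand out the fixed, anonymous cookie from the
  ### exchange above. Same value and expiry for everyone, so it
  ### cannot be used to tell users apart.
  set resp.http.Set-Cookie =
   "uid=DNT; expires=Fri, 01-Jan-38 00:00:00 GMT";
 }
 ### The non-DNT branch still needs a unique id per user, which
 ### is where plain VCL falls short - hence the Apache module
 ### described below.
}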

Implementing Do Not Track compliance

At Krux, we provide audience analytics for top web properties, with a very strong commitment to user privacy. As part of that, we take honoring “Do Not Track” for our properties, as well as our publishers’ properties, very seriously.

Our analytics servers process billions of data points per day (or many thousands per second), and each of these requests should be handled quickly; any meaningful slowdown would mean a deteriorated user experience and provisioning many more machines to handle the traffic.

The industry standard for setting user cookies is basically Apache + mod_usertrack, which in our production environment gets response times in the 300-500 microsecond range. This gives us a good performance baseline to work from. Unfortunately, mod_usertrack isn’t DNT compliant (it will set a cookie regardless) and can’t be configured to behave differently, so I had to look for a different solution.

Writing the beacon as a service is a simple programming task, and the obvious first choice is to try a high-throughput event-driven system like Tornado or Node (both are technologies already in our stack). However, I encountered three issues with this approach that made this type of solution not viable:

  • Tornado & Node both respond in a 3-5 millisecond window, and although that’s quite fast individually, it’s an order of magnitude slower than doing it inside Apache
  • Response times degrade rapidly at higher concurrency rates, which is a very common pattern in our setup
  • These processes are single-threaded, meaning they need to sit behind a load balancer or proxy to take advantage of multiple cores, further increasing response times

Next, I tried Varnish 2.1. It handled the concurrency fine and responded in the 2 millisecond range. It also has the benefit that it can be exposed directly on port 80 to the world, rather than sitting behind a load balancer. The problem I ran into is that Varnish 2.1 does not allow access to all the HTTP request headers for logging purposes. Varnish 3.0 does have support for all headers, but it can’t read cookie values directly, and we’ve experienced some stability problems with it in other tests.

With none of these solutions being satisfactory, nor coming close to the desired response times, the only option left was to write a custom Apache module to handle DNT compliance myself. Not being much of a C programmer (my first language is Perl), I found this a fun challenge. It also gave me a chance to write mod_usertrack the way it should have been behaving all along.

Introducing mod_cookietrack

So here it is: mod_cookietrack, a drop-in replacement for mod_usertrack that addresses many outstanding issues with mod_usertrack (details below), including Do Not Track compliance.

And most importantly, it performs quite well. Below is a graph showing the performance of an Apache Benchmark (ab) test using 100,000 requests, 50 concurrent, against a standard Apache server. The blue line shows mod_usertrack in action, while the red line shows mod_cookietrack:

The graph shows that for all the extra features below, including DNT compliance, it only takes an additional 5-10 microseconds per request.

We have been using mod_cookietrack in production for quite some time now, and it is serving billions of requests per day. To give you some numbers from a real-world traffic pattern: in our production environment, with more extensive logging, much higher concurrency and all the other constraints that come with it, our mean response time has only gone up from 306 microseconds to 335 microseconds.

So what else can mod_cookietrack do? Here’s a list of the improvements you get over mod_usertrack for free with your DNT compliance (a minimal configuration sketch follows the list):

  • Rolling Expires: Rather than losing the user’s token after the cookie expires, mod_cookietrack updates the Expires value on every visit
  • Set cookie on incoming request: mod_cookietrack also sets the cookie on the incoming request, so you can correlate the user to their first visit, completely transparent to your application.
  • Support for X-Forwarded-For: mod_cookietrack understands your services may be behind a load balancer, and will honor XFF or any other header you tell it to.
  • Support for non-2XX response codes: mod_cookietrack will also set cookies for users when you redirect them, like you might with a URL shortener.
  • Support for setting an incoming & outgoing header: mod_cookietrack understands you might want to log or hash on the user id, and can set a header to be used by your application.
  • External UID generation library: mod_cookietrack lets you specify your own library for generating the UID, in case you want something fancier than ‘$ip.$timestamp’.
  • Completely configurable DNT support: Do Not Track compliant out of the box, mod_cookietrack lets you configure every aspect of DNT.

The code is freely available, Open Source, well documented and comes with extensive tests, so I encourage you to try it out, contribute features and report issues.

For now, mod_cookietrack only supports Apache, but with that we’re covering two-thirds of the server market share as measured by Netcraft. Of course, if you’d like to contribute versions for, say, Nginx or Varnish, I’d welcome your work!

Now that being DNT compliant will cost you no more than 30 microseconds per request, all you good eggs have the tools you need to be good internet citizens and respect the privacy of your users; the next steps are up to you!

Note: these tests were done on a c1.medium in AWS – your mileage may vary.


The state of “Do Not Track”

Over the last few weeks, “Do Not Track” has been getting a lot of attention; Mozilla introduced DNT into Firefox as of January 2011 and has a very good FAQ up on the subject. Last week, the official announcement came from the big browser vendors, including Microsoft and Google, that they’d start incorporating DNT as a browser feature as well, which coincided nicely with the White House announcing a privacy bill of rights.

It’s great to see online data privacy finally being taken seriously, especially after the various shades of gray we have been seeing lately, some of which are just plain scary.

But sending a “Do Not Track” header from your browser is one thing; having the server on the other side, and perhaps even more importantly, their (advertising) partners honor the request is quite another. And unfortunately, the current state of affairs isn’t great. Taken from Mozilla’s FAQ, as mentioned above:

Companies are starting to support Do Not Track, but you may not notice any changes initially. We are actively working with companies that have started to implement Do Not Track, with others who have committed to doing so soon.

Let’s take a quick look at the current cookie-setting practices of the top 500 websites, as counted by Alexa. I ran a quick scan against http://www.domain.com, once with and once without a DNT header. Of those 500 sites, 482 gave useful replies; some of the most-used domains are CDNs or don’t have top-level content, so they are excluded.

From the chart below, you can see that most sites set 1-2 cookies, and that most of those cookies are somehow related to user or session specific data.

I’d have added a third line showing the delta in cookies set when the DNT header was present, but the sad truth is that only 3 websites changed their cookie behavior based on the DNT header. Kudos to 9gag.com for not setting any cookies, and to blackhatworld.com & movie2k.com for at least dropping one of their user-specific cookies. The outlier with a whopping 18 cookies, 10 of which are personally identifiable, is walmart.com.

[Chart: cookies set by the Alexa top 500 sites, with and without the DNT header]

Now, setting a user/session cookie is not necessarily a bad thing. For one, you cannot read the DNT header from JavaScript, so if you want to be DNT compliant in JS, you have to set a DNT cookie (although not part of the public standard, some newer browsers are starting to support inspecting the DNT setting from the DOM). The industry standard is now to set a cookie matching the string “DNT” or “OPTOUT”. Again, unfortunately, none of the top 500 websites actually do this when the DNT header is set.

The other viable option is to send back the same cookie, but with the expiry time set in the past so that the browser removes it. Although this would be silly on a first request (it would be better not to set a cookie at all in that case), and it is not as useful in a JavaScript environment, it would still be an effort towards DNT compliance. Of the top 500, only forbes.com currently uses this technique.
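In wire terms, that technique looks something like this (with illustrative values); the expiry in the past tells the browser to discard the cookie:

< HTTP/1.1 204 No Content
< Set-Cookie: uid=YYYYYY; expires=Thu, 01-Jan-70 00:00:00 GMT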

As it stands, only 4 out of 482 measured top 500 sites are actively responding to the DNT header being sent.

The FTC has been calling for a “Do Not Track” implementation, and according to Mozilla, 7% of desktop Firefox users and 18% of mobile Firefox users already have DNT enabled. With such a clear call from regulators and end users, why are so few sites actually following up with a solid implementation? And what does that mean for the advertising and widget partners they use, whose whole model is based around being able to use your personal data?

Again the answer is not very encouraging. The Wall Street Journal did a great investigation into this with their “What They Know” series and have found that even websites that you trust and use every day have literally hundreds of trackers ushered in when you visit them:

(full disclosure: I work for Krux, whose findings were featured in the WSJ “What They Know” series and we published a whitepaper on the subject)

If you browse through the above charts, it becomes obvious that your personal data is flying across the web and you have very little control of who takes it, how they use it and who they might be selling it on to.

The folks at PrivacyScore even built an index to show you how much your data is at risk when visiting any particular website. Some of the scores, like the one for Target, are quite scary, as illustrated by this story about how Target found out a girl was pregnant before her dad did.

Bottom line: the worst offenders tend to be in the online merchant, advertising network or widget provider space (especially those of the ‘free’ variety, because nothing is ever really ‘free’), playing fast and loose with your personal data in order to optimize their own revenue. To illustrate the point, here’s a choice quote from the above article:

“AddThis has its own plans to sell user data, but it’s not looking to publishers as the main buyers. It will sell analytics on user data directly to ad agencies and brands themselves and will get a bigger cut by doing so.”

So why is it hard for the good eggs to do the right thing, even though it makes them look like bad eggs in the process? Part of it is awareness, I’m sure, but another part is simply the challenge of implementing a good “Do Not Track” solution. Implementing DNT at scale is actually not that easy, and we spent a fair amount of time at Krux getting it right.

To further the cause of data privacy, we’re open sourcing our solution; it will be the topic of my next blog post. My hope is that all the good eggs will at least be able to Do The Right Thing easily, making it easier for the rest of us to call the bad eggs on their behavior.

P.S. If you want to see where your personal data goes when you visit a webpage, we released a Firefox browser plugin called Krux Inspector, which you can install directly from our website. It shows you exactly who is bringing in which advertisers and partners on the webpage you’re viewing, what personal data they’re skimming, and which beacons they’re dropping.