How to measure everything: Instrumenting your infrastructure

At Krux, we are addicted to metrics. Metrics not only give us the insight to diagnose anomalies, but also the ability to make good decisions on things like capacity planning, performance tuning, cost optimization, and beyond. The benefits of a well-instrumented infrastructure are felt far beyond the operations team; with well-defined metrics, developers, product managers, sales and support staff are all able to make more sound decisions within their fields of expertise.

For collecting and storing metrics, we really like using Statsd, Collectd and Graphite. We use Collectd to gather system information like disk usage, CPU usage, and service state. Collectd relays its metrics to Graphite through Statsd, and any applications we write stream data directly to Statsd as well. I cover this topic in depth in a previous article.

[Diagram: Statsd, Collectd and Graphite]
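To give a sense of how little code that last part takes, here is a minimal sketch of an application pushing a counter and a timer straight to Statsd. It assumes the 'statsd' client from PyPI; the host, port and metric names are only illustrative:

# Minimal sketch: emit a counter and a timer to Statsd from an application.
# Assumes the 'statsd' client from PyPI; host, port and key names are examples.
import time

from statsd import StatsClient

stats = StatsClient(host='localhost', port=8125, prefix='prod.myapp')

def handle_request():
    start = time.time()
    # ... do the actual work here ...
    stats.incr('requests')                    # one more request served
    stats.timing('response_time',             # how long it took, in milliseconds
                 (time.time() - start) * 1000)

if __name__ == '__main__':
    handle_request()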

The real power of this system comes from automatically instrumenting all your moving parts, so that everything emits stats as part of normal operations. In an ideal world, your metrics are so encompassing that a graph of your KPIs tells you everything you need to know about a service, and digging into logs is only needed in the most extreme circumstances.

Instrumenting your infrastructure

In a previous article, I introduced libvmod-statsd, which lets you instrument Varnish to emit realtime stats that come from your web requests. Since then, I’ve also written an accompanying module for Apache, called mod_statsd.

Between those two modules, all of our web requests are now automatically measured, irrespective of whether the applications behind them send their own measurements. In fact, you may remember this graph, which we build for every service we operate and which is composed of metrics coming directly from Varnish or Apache:

[Graph: per-service dashboard built from Varnish/Apache metrics]

As you can see from the graph, we track the number of good responses (HTTP 2xx/3xx) and bad responses (HTTP 4xx/5xx) to make sure our users are getting the content they ask for. On the right-hand side of the graph, we track the upper 95th & 99th percentile response times to ensure requests come back in a timely fashion. Lastly, we add a bit of bookkeeping: we keep track of a safe QPS threshold (the magenta horizontal line) to ensure there is enough service capacity for all incoming requests, and we draw vertical lines whenever a deploy happens or a Map/Reduce job starts, as those two events are leading indicators of behavior change in services.

But not every system you run will be a web server; there are process managers, cron jobs, configuration management and perhaps even cloud infrastructure, to name just a few. In the rest of this article, I'd like to highlight a few techniques, and some Open Source software we wrote, that you can use to instrument those types of systems as well.

Process management

Odds are you're writing (or using) quite a few daemons as part of your infrastructure, and those daemons will need to be managed somehow. On Linux, you have a host of choices like SysV init, Upstart, Systemd, daemontools, Supervisor, etc. Depending on your flavor of Linux, one of these is likely to be the default for system services already.

For our own applications & services, we settled on Supervisor; we found it to be very reliable and easy to configure in a single file. It also provides good logging support, an admin interface, and monitoring hooks. There’s even a great puppet module for it.

The one bit we found lacking is direct integration with our metrics system to tell us what happened to a service, and when. For that purpose I wrote Sulphite, a Supervisor Event Listener that emits a stat to Graphite every time a service-state transition happens (for example, from running to exited). This lets us track restarts, crashes, unexpected downtime and more as part of our service dashboards.

You can get Sulphite by installing it from PyPI like this:

$ pip install sulphite

And then configure it as an Event Listener in Supervisor like this:

[eventlistener:sulphite]
command=sulphite --graphite-server=graphite.example.com --graphite-prefix=events.supervisor --graphite-suffix=`hostname -s`
events=PROCESS_STATE
numprocs=1
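For the curious, here is a rough sketch of the mechanism an event listener like Sulphite builds on: it speaks Supervisor's event-listener protocol on stdin/stdout and forwards every PROCESS_STATE transition to Graphite's plaintext port. This is not Sulphite's actual code; the metric naming and the Graphite address (port 2003) are assumptions for illustration.

# Rough sketch of a Supervisor event listener that forwards PROCESS_STATE
# transitions to Graphite's plaintext protocol. Not Sulphite's actual code;
# the metric naming and Graphite address are assumptions for illustration.
import socket
import sys
import time

from supervisor import childutils

GRAPHITE = ('graphite.example.com', 2003)
PREFIX   = 'events.supervisor'

def main():
    while True:
        # Block until supervisord hands us an event on stdin.
        headers, payload = childutils.listener.wait(sys.stdin, sys.stdout)
        pheaders = childutils.get_headers(payload.split('\n')[0])

        # e.g. "events.supervisor.myapp.PROCESS_STATE_EXITED 1 1361234567"
        metric = '%s.%s.%s 1 %d\n' % (
            PREFIX, pheaders['processname'], headers['eventname'], time.time())

        sock = socket.create_connection(GRAPHITE)
        sock.sendall(metric.encode('ascii'))
        sock.close()

        # Tell supervisord we are done with this event.
        childutils.listener.ok(sys.stdout)

if __name__ == '__main__':
    main()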

Configuration management

Similarly, you’re probably using some sort of configuration management system like Chef or Puppet, especially if you have more than a handful of nodes. In our case, we use Puppet for all of our servers, and as part of its operations, it produces a lot of valuable information in report form. By default, these are stored as log files on the client & server, but using the custom reports functionality, you can send this information on to Graphite as well, letting you correlate changes in service behavior to changes made by Puppet.

We open sourced the code we use for that; you can configure Puppet to use it by following the installation instructions and creating a simple graphite.yaml file that looks something like this:

$ cat /etc/puppet/graphite.yaml
---
:host: 'graphite.example.com'
:port: 2023
:prefix: 'puppet.metrics'

And then updating your Puppet configuration file like this:

[master]
pluginsync = true
report     = true
reports    = store,graphite

[agent]
pluginsync = true
report     = true

Programming language support

One of the best things you can do to instrument your infrastructure is to provide a base library that includes statistics, logging & monitoring support for the language(s) your company uses. We're big users of Python and Java, so we created libraries for each. Below, I'll show you the Python library, which we've open sourced.

Our main design goal was to make it easy for our developers to do the right thing: simply by using the base library, you'd get convenient access to patterns and code you wouldn't have to write again, while operators would get all the knobs & insights needed to run it in production.

The library comes with two main entry points you can inherit from for apps you might want to write. There’s krux.cli.Application for command line tools, and krux.tornado.Application for building dynamic services. You can install it by running:

$ pip install krux-stdlib

Here’s what a basic app might look like built on top of krux-stdlib:

import krux.cli

class App(krux.cli.Application):

    def __init__(self):
        ### Call the superclass to bootstrap.
        super(App, self).__init__(name = 'sample-app')

    def run(self):
        stats = self.stats
        log   = self.logger

        with stats.timer('run'):
            log.info('running...')
            ...

The 'name' argument above uniquely identifies the app across your business environment; it's used as the prefix for stats, as the identifier in log files, and as the name in the usage message. Without adding any additional code (and therefore work), here's what is immediately available to operators:

$ sample-app -h
[…]

logging:
 --log-level {info,debug,critical,warning,error}
    Verbosity of logging. (default: warning)
stats:
 --stats Enable sending statistics to statsd. (default: False)
 --stats-host STATS_HOST
    Statsd host to send statistics to. (default: localhost)
 --stats-port STATS_PORT
    Statsd port to send statistics to. (default: 8125)
 --stats-environment STATS_ENVIRONMENT
    Statsd environment. (default: dev)

Now any app developed can be deployed to production, with stats & logging enabled, like this:

$ sample-app --stats --log-level warning

Both krux.cli and krux.tornado will capture metrics and log lines as part of their normal operation, so even if developers aren’t adding additional information, you’ll still get a good baseline of metrics just from using this as a base class.

AWS costs

We run most of our infrastructure inside AWS, and we pride ourselves on our cost management in such an on-demand environment; we optimize every bit of our infrastructure to eliminate waste and ensure we get the biggest bang for our buck.

Part of how we do this is to track Amazon costs as they happen, in real time, and cross-correlate them with the services we run. Since Amazon exposes your ongoing expenses as CloudWatch metrics, you can programmatically access them and add them to your Graphite service graphs.

Start by installing the CloudWatch CLI tools and then, for every Amazon service you care about, simply run:

$ mon-get-stats EstimatedCharges \
    --namespace "AWS/Billing" \
    --statistics Sum \
    --dimensions "ServiceName=${service}" \
    --start-time $date

You can then send those numbers to Statsd using your favorite mechanism.
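If you would rather do that from code than from a cron'd shell script, here is a rough sketch using boto and the Statsd line protocol over UDP. The service name, metric key and Statsd address are just examples; note that AWS publishes billing metrics to CloudWatch in us-east-1.

# Sketch: pull the latest EstimatedCharges Sum from CloudWatch with boto and
# forward it to Statsd as a gauge. Service name, metric key and the Statsd
# address are examples, not our production values.
import datetime
import socket

import boto.ec2.cloudwatch

def billing_to_statsd(service='AmazonEC2', statsd=('localhost', 8125)):
    # Billing metrics live in us-east-1, regardless of where you run.
    cw  = boto.ec2.cloudwatch.connect_to_region('us-east-1')
    now = datetime.datetime.utcnow()

    points = cw.get_metric_statistics(
        period=14400,                  # billing data only updates a few times a day
        start_time=now - datetime.timedelta(days=1),
        end_time=now,
        metric_name='EstimatedCharges',
        namespace='AWS/Billing',
        statistics=['Sum'],
        dimensions={'ServiceName': service},
    )
    if not points:
        return

    latest = sorted(points, key=lambda p: p['Timestamp'])[-1]

    # Statsd line protocol: <key>:<value>|g marks a gauge.
    msg  = 'aws.billing.%s:%s|g' % (service, latest['Sum'])
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(msg.encode('ascii'), statsd)

if __name__ == '__main__':
    billing_to_statsd()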

Further reading

The tips and techniques above are more detailed examples from a presentation I've given about measuring a million metrics per second with minimal developer overhead; if you find them interesting, there are more in the slides from that talk.


Realtime stats from Varnish

At my company we are big fans of Varnish. We use it to front a number of API services that currently peak at about 40,000 requests/second on our largest cluster, and it works very well.

As part of our monitoring and instrumentation, we'd also like to see how long these requests take, and what the error rates are. We use Statsd and Graphite to capture and display those stats, and I previously published a blog post on how to do this from Apache, or anything else really.

There are ways to calculate this by writing an app that consumes the log lines varnishncsa can generate (in fact, that's what we used to do), but it'd be much nicer if this just worked from Varnish directly; it would also reduce the resource usage and complexity of the system.

Although there isn’t a way to do this natively in VCL, the Varnish configuration language, it’s possible to get at this data and send it off using Varnish Modules, or Vmods.

So I wrote two Vmods to do exactly this; one for capturing timing information from Varnish and one for sending those timers off to Statsd. Both Vmods have been released to the Varnish community.

We have been using both of them in production for 6 months now, serving over 3.5 billion requests a day across our clusters, at a cost of less than a 1% CPU increase in our setup. The previous setup, which tailed a varnishncsa-generated log, cost about 15% combined across the log writer & reader.

We enable request timing for every request that comes in using a snippet of VCL that we include at the top of every Varnish config we write. Our goal is to send the following stats to Statsd:

 incr:  $env.$serv.$cluster.$backend.$hit_miss.$resp_code.$host
 timer: $env.$serv.$cluster.$backend.$hit_miss.$resp_code.$host $ms

Let me walk you through how we do this (and for your convenience, here’s a gist with the entire snippet):

First, we initialize the modules in vcl_init:

#########################################
### Initialize timing/stats modules
#########################################

### https://github.com/jib/libvmod-statsd
import statsd;
### https://github.com/jib/libvmod-timers
import timers;

### If you do not end with a return() statement, all instances of 
### vcl_* will be concatenated.
sub vcl_init {
 ### Prefix all stats with the ENVIRONMENT, the TYPE OF SERVICE
 ### and the CLUSTER NAME. Then suffix with the MACHINE NAME.
 ### Optional, but distinguishes per host/cluster/env/service type.
 statsd.prefix( "prod.httpd.apiservices." );
 statsd.suffix( ".apiservices001" );

 ### Connect to statsd
 statsd.server( "localhost", "8125" );

 ### Set the timing units - seconds, milli, micro or nano.
 timers.unit( "milliseconds" );
}

If you run multiple backends behind your varnish server, you’ll want to include the following snippet. It records which backend served the request, which isn’t in scope in vcl_deliver. If you only run a single backend, this isn’t important to you.

######################
### Determine backend
######################

### Backend will have been determined in vcl_recv, but it's
### not available in vcl_deliver. So communicate via header.
### All (non pipe) requests will go through vcl_hit or vcl_miss,
### where we have req.backend in scope. Catch a pass() as well,
### so anything that's not cacheable by design gets its backend
### assigned too.

sub vcl_miss {
 set req.http.X-Stats-Backend = req.backend;
}

sub vcl_hit {
 set req.http.X-Stats-Backend = req.backend;
}

sub vcl_pass {
 set req.http.X-Stats-Backend = req.backend;
}

Next, we determine what the status of the request was: success or failure.

######################
### Determine status
######################

### obj/beresp.status may be changed elsewhere in the
### VCL to show a different response code to the end
### user - capture it now so we are reporting on the
### actual backend response, not what the user sees
sub vcl_fetch {
 set req.http.X-Stats-Status = beresp.status;
}

sub vcl_error {
 ### An error may have occurred, and we've been sent to vcl_error
 ### capture the response code that the backend sent if it wasn't
 ### already
 if( !req.http.X-Stats-Status ) {
  set req.http.X-Stats-Status = obj.status;
 }
}

Now, we are ready to look up the hit/miss in vcl_deliver and send the stats off to Statsd:

######################
### Send stats
######################

sub vcl_deliver {

 ### Hit or Miss?
 if( obj.hits == 0 ) {
  set req.http.X-Stats-HitMiss = "miss";
 } else {
  set req.http.X-Stats-HitMiss = "hit";
 }

 ### So, not set in vcl_fetch or vcl_error? Use the response code 
 ### as will be sent to the client then.
 if( !req.http.X-Stats-Status ) {
  set req.http.X-Stats-Status = resp.status;
 }

 ### Which backend was used?

 ### You set one explicitly
 if( req.http.X-Request-Backend ) {
  set req.http.X-Stats-Backend = req.http.X-Request-Backend;

 ### We discovered it in vcl_hit/miss/pass
 } elsif ( req.http.X-Stats-Backend ) {
  # No op;

 ### We didn't discover it. Probably means you hit an error, or
 ### you synthesized a response. Set it to 'internal'
 } else {
  set req.http.X-Stats-Backend = "internal";
 }

 ### Key to use for statsd. Something like: config.hit.200
 set req.http.X-Stats-Key =
  req.http.X-Stats-Backend + "." +
  req.http.X-Stats-HitMiss + "." +
  req.http.X-Stats-Status;

 ### Increment the amount of requests here, and how long this 
 ### particular backend took.
 statsd.incr(   req.http.X-Stats-Key );
 statsd.timing( req.http.X-Stats-Key, timers.req_response_time() );
}

And once you’ve included this snippet in your VCL, you’ll be able to generate graphs like the one below, directly from your graphite installation, optionally broken down by response code, backend and hit/miss:

[Graph: HTTP stats directly from Varnish]

You can get the code for the Statsd integration and the Varnish timers directly from the Varnish Vmods page, and you can download the VCL snippet above as a gist.

Personalized user experiences & user data privacy

As our lives become increasingly intertwined with the World Wide Web, the battle between personalized user experiences and user data privacy is more acute than ever. On the one hand, personalized content is an expectation for our online experience: The New York Times should show us news that’s relevant to our interests, Yahoo should personalize their front page with things we care about, and Facebook should show us dating ads if we, among other criteria, actively list our status as ‘single’.

On the other hand, just because we enter a Google search for an obscure medical condition, update our relationship status on Facebook, or watch Gangnam Style six times on YouTube, doesn’t necessarily mean we want web browsers, advertisers, or anybody in-between following us around the Internet.

For the end user, this creates a dilemma: ‘Given the choice between online privacy or a personalized experience, what is more important to me? If I know what is important to me, how can I signal my preference?’

Enter Do Not Track (DNT), a technology and policy proposal that allows users to decide whether their online activity should be tracked by websites and advertising networks. The measure is significant in its aim to become the first standardized, universally respected user preference tool for content personalization[*] – implemented by every browser vendor, configured by every user, and respected by every website operator. Its success rides on these three factors working seamlessly together; undoing even one piece of the puzzle will undermine the entire effort.

Do Not Track has gained considerable momentum since 2010, and many leading technology companies (browser vendors, website operators and advertisers alike) support the implementation of Do Not Track. However, Microsoft recently took its support of DNT one step too far by automatically enabling the Do Not Track header for Internet Explorer 10. On the surface, this appears to be a reasonable choice that benefits consumers:

In a world where consumers live a large part of their lives online, it is critical that we build trust that their personal information will be treated with respect, and that they will be given a choice to have their information used for unexpected purposes. While there is still work to do in agreeing on an industry-wide definition of DNT, we believe turning on Do Not Track by default in IE10 on Windows 8 is an important step in this process of establishing privacy by default, putting consumers in control and building trust online.

Yet Microsoft’s decision raises some larger questions that have serious consequences for consumers, Do Not Track, and the future of the World Wide Web:

If Microsoft considers “Do Not Track” such a pivotal part of the web industry’s future, why don’t they take it upon themselves to educate their user base about Do Not Track, and help users choose whether enabling DNT is right for them? If privacy, control, and trust are paramount, why are they controlling the decisions for consumers, rather than making decisions with consumers?

Microsoft is a member of both the US and EU IAB, the NAI and has six members on the “Do Not Track” working group (only Comcast has more members on the working group); they are part of every forum that shapes the rules, education and awareness of online privacy; they are part of every conversation and every initiative to bring personalized content to users in a responsible manner; they have the ability to influence, shape and guide the future of personalized content & privacy.

But they choose not to.

Even if these forums are not sufficient for Microsoft to deliver this important message, they could still educate the public through their own software, be that Internet Explorer, Windows, or Bing – all of which still enjoy a sizable market share.

Instead, Microsoft seems to assume that their users are, and will always be, uneducated about the matter of online privacy, despite their own ability to influence and change this fact.

As a result, Microsoft’s choice to make this decision on behalf of all their users, without any effort to educate the public and without involving any relevant forums, has caused serious upheaval in the web and advertising industry.

The IAB immediately spoke out against the proposed default:

We do not believe that default settings that automatically make choices for consumers increase transparency or consumer choice, nor do they factor in the need for digital businesses to innovate and thrive economically. Actions such as these will undermine the success of our industry’s self-regulatory program.

Yahoo deliberated on the subject and then announced that it would be ignoring the “Do Not Track” header if it was sent by IE 10:

Ultimately, we believe that DNT must map to user intent — not to the intent of one browser creator, plug-in writer, or third-party software service.

Even Mozilla, the company behind Firefox and originator of the "Do Not Track" specification, does not turn it on by default, and instead provides a single checkbox in its settings to do so easily:

DNT allows for a conversation between the person sitting behind the keyboard and the site that they want to visit. If DNT is on by default, it’s not a conversation. For DNT to be effective, it must actually represent the user’s voice.

The Apache project went as far as proposing a patch on the Web Server level to ignore the “Do Not Track” header, arguing:

The standard quite clearly states that it must be the result of an explicit user choice, not that of the browser vendor or a mega corp pushing their own Agendas. Being included as part of the “Express settings” makes it the OS providers choice, not the users – if there was a stand-alone screen with that as the only question with no default option selected, than that would classify as a user choice – it’s not, so IE 10 is ignored and further dilutes the meaning of DNT for everyone else.

And today, my company, Krux, adds itself to the list of companies expressly speaking out against Microsoft’s choice.

At Krux, we are big supporters of the "Do Not Track" initiative. We were among the first in the industry to properly support Do Not Track (and open sourced the code so others could do so easily as well); we integrate with the US IAB and the European YourPrivacyChoice; and we offer a simple OptOut page, all to protect your privacy and capture your personalization preference. We do this because we want to offer you the content you want, to make your experience better.

But that’s also the important part of this personalization; it is your choice.  We believe that, unless the signal comes directly from a user, any default expressed on the users behalf (be that OptOut OR OptIn) does more harm than good for the consumer experience and the industry as a whole.

Although we have no doubt Microsoft's intentions regarding user data privacy are good, its execution leaves much to be desired. By unilaterally making such strong-minded decisions on behalf of its users, Microsoft limits user choice, user experience, and user education; this undermines the effort of an entire ecosystem trying to come together and offer a standard, data-safe, and consumer-centric solution to personalized content.

We sincerely hope Microsoft will rejoin the “Do Not Track” discussion, and will work with the web and advertising industries to raise awareness, educate, and improve the web experience for everyone – even those not using Internet Explorer.

As for Krux? We will also be ignoring the “Do Not Track” header if it comes from IE 10, and we have updated mod_cookietrack to give you the same option if you so choose.

In the meantime, we will continue to educate, raise awareness and provide thought leadership in the conversation around personalization and privacy with both you and our colleagues in the industry.

* Until Do Not Track, the only options for consumers to opt out of tracking applied to advertising-related content. Through the Interactive Advertising Bureau (US), Your Online Choices (EU) or the Network Advertising Initiative (member-based), you can register your choice not to be targeted by advertising. However, that preference only applies to that geographic region, or to the relevant membership. For any other targeted content, the content provider may offer an Opt Out page, but even then you'd have to find it and explicitly Opt Out on every site you visit.

The dark side of the New Shiny

Why is all (new) software crap? Or at least, why do all the old-timers seem to think so, while all the hip young kids are all over it? Currently it's all about MongoDB & Node, but whatever happened to LAMP, or even Ruby on Rails?

A page from the history books

Back in the 90s, creating dynamic content on the web was the New Shiny; with a cheap dreamhost account that gave you shell access and an Apache configured to allow ExecCGI in your ~/public_html folder, you could simply drop a script in there and you’d be off to the races.

The New Shiny used for these dynamic pages was predominantly a language called Perl; it was installed on every Unix/Linux system and a simple “hello world” would take you mere minutes to set up.

Of course, for your first few more complicated scripts, you’d probably need some help, so you’d go to Yahoo or Altavista (yes, this is from a time before Google) and search for something like “perl cgi how to send email” to find your answers.

What you'd find as the top result was Matt's Script Archive: a ragtag collection of various Perl scripts for common problems that you could drop straight into your ~/public_html folder. Unfortunately, they were also poorly written, full of security holes, followed absolute worst practices and taught bad code.

Even more unfortunate; they were also very popular.

So popular, in fact, that the Perl community – which has its own repository for libraries and modules called CPAN (after which PEAR, PyPI, NPM, and so on are modeled) – started the NMS project to provide drop-in replacements for Matt's scripts that were actually of good quality, used best practices, taught sensible coding and came without all the known security holes. Eventually, even Matt himself acknowledged the problems and endorsed NMS over his own code.

But until NMS became a suitable replacement for Matt's code (and gained the required ranking in search engines), people looking for how to do things in Perl found Matt's archive. And they used it, told their friends, wrote about it on Slashdot or posted about it on the Yahoo Perl Group, which in turn further increased the search ranking of Matt's archive. So other beginners found it, and they posted about it, referred to it, and so on.

And all the while there was CPAN, a good source of code, full of community feedback, promotion of best practices, distributed automated testing and bug tracking; all the principles of good software development that we know and love as a community.

And still Perl (partially through Matt's Script Archive, though other factors certainly contributed) got a reputation – one that persists even today – of being a hard-to-read, insecure language that only script kiddies used; with so many people flocking to it, using it before understanding it, and the community failing to make best practices obvious and accessible, Perl buckled under the pressure of being the New Shiny.

The rise of the One-Eyed

As a result, people got frustrated with Perl, and there was room for a replacement. As PHP 4 rose through the ranks, unseating Perl to become the new darling of dynamic pages, the problem got worse. By now, blogs were becoming mainstream, most mailing lists had web interfaces, and Google was indexing them. The One-Eyeds leading the blind became even more influential.

Then Rails came out for Ruby. Applications were becoming more complex, and there was room for a full framework (from Database to View) to unseat PHP and become the New Shiny.

By now, the hacker culture had firmly taken hold, and it wasn't just about finding the right Ruby/Rails library to do the job you wanted. Instead, as a newcomer to Ruby on Rails, odds were you had to find the most current copy of a forked version of the original library on some guy's website, at least until that copy got forked by another guy and updated with the bug fix or feature you needed, all the while crossing your fingers that none of the code you were finding was produced by a One-Eyed.

And the rise of Git & Github of course made the ‘fork and forget’ process even easier.

The Ruby community has since rallied behind Gems and has become much more vocal about best practices. But as with Perl, it is hard to unwind past habits, and some One-Eyedness remains.

Fast forward to 2012. Node is the New Shiny, and all the newcomers are flocking to it. And once again, the One-Eyeds are leading the blind:

[embedded tweet]

The article referenced in the tweet, published on a site called "HowToNode.org" (making it at least sound a lot more authoritative than SomeGuysBlog.com), does contain some information that's fundamental to understanding how NPM works. But it sprinkles that with rants and blatantly bad advice (which breaks systems) that newcomers will take as best practice or, in the worst case, Gospel, creating even more One-Eyeds.

Those who don’t learn from history…

And that's when the jaded of the industry, those who have been through at least four incarnations of the New Shiny and can see the words of the One-Eyed being spoken as Gospel, respond like this:

[embedded tweet]

Unfortunately, the problem of the uninformed One-Eyed exists in the opposite camp as well, where people come out with ridiculous, inflammatory, non-constructive, ill-informed statements, like in the video below, and spread those as their own Gospel in turn.

[embedded video]

20 GOTO 10?

So how do we break this cycle? Where is the Wikipedia of technology that actually uses facts and intelligence, rather than hearsay, cargo-culting and gospel, to inform and drive decisions? How do we actually teach those new to our field to build better and smarter things, rather than breeding yet another generation of One-Eyeds?

Measure all the things!

So, we all know that monitoring & graphing sucks. The only thing that sucks more is retrofitting your existing app with monitoring & graphing. The good news is that there are a few straightforward things you can do that don't involve rearchitecting the entire stack or other similarly painful activities. That's the topic of an Ignite talk I just gave at Devopsdays 2012.

First things first: you are going to need three tools to make your life better:

  1. Collectd – to collect system metrics like RAM and CPU usage, with plugins for Apache and more.
  2. Statsd – to collect stats from your app, or any other random place.
  3. Graphite – to store, query & graph your data.

Graphite Setup

There are lots of great docs on setting up Graphite, so I'll just highlight our choices in the setup: we use one Graphite server per data center, and that scales to about 300k writes/second/node on an AWS c1.xlarge. You'll want one core per carbon process you run, and RAIDing the ephemeral drives will greatly improve your performance. We use c1.mediums in our smaller data centers, as they collect far less data per second and aren't asked to produce graphs.

Using namespaces

As an aside, Graphite doesn’t actually care about the keys you use for your metrics, but it can do clever things using a dot as a delimiter and wildcards for searching/aggregating, so pick wisely. Our naming scheme is:

<environment>.<application>.<<your key>>.<hostname>

This allows for easy roll-ups to the cluster level (for example, a wildcard like prod.webapp.<your key>.* selects the same metric across every host in a cluster, ready to be summed or averaged), avoids stomping on other metrics, and neatly keeps the dev/stage/canary noise out of the production data.

We also do automated roll-ups per cluster with the carbon aggregator, using the rules below; this saves CPU when displaying these numbers on graphs every 1-5 minutes.

<prefix>.<env>.<key>.total_sum (10) = sum <prefix>.(?!timers)<env>.<<key>>.<node>
<prefix>.<env>.<key>.total_avg (10) = avg <prefix>.(?!timers)<env>.<<key>>.<node>

Getting data into graphite

If you're using Apache, you can easily get basic statistics about your system into Graphite. Just add a piped custom log to your vhost configuration and you're good to go. For example, if you were running a simple analytics beacon, this would define a log format that captures response times for your beacon and sends them to Statsd, making them immediately viewable in Graphite:

LogFormat "http.beacon:%D|ms" stats

CustomLog "|nc -u localhost 8125" stats

If you're using Varnish, you can write out a log file with performance data using varnishncsa, and then use a trivial tailing script to pipe the data into Statsd. For example, to measure the performance of a dynamic service, you could do something like this:

$ varnishncsa -a -w stats.log -F "%U %{Varnish:time_firstbyte}x %{Varnish:hitmiss}x %{X-Backend} "

$ varnishtail stats.log

And if you're using a framework for your dynamic application, like Tornado or Django, you can hook into its request/response cycle to get the data you need. For example, you can wrap the request handling in a "with stats.timer(path):" block and send stats on response code and time per path straight to Statsd from the response handler.
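As a sketch of that idea for Tornado (the handler names, key scheme and Statsd client are illustrative; krux.tornado, mentioned earlier, does something equivalent for you):

# Sketch: time every Tornado request and report response code, method and
# path to Statsd. Handler names, key scheme and Statsd client are examples.
import time

import tornado.ioloop
import tornado.web
from statsd import StatsClient

stats = StatsClient('localhost', 8125, prefix='prod.myservice')

class MeasuredHandler(tornado.web.RequestHandler):
    def prepare(self):
        self._start = time.time()

    def on_finish(self):
        # e.g. key "200.GET.pageview" -> both a counter and a timer
        key = '%s.%s.%s' % (self.get_status(), self.request.method,
                            self.request.path.strip('/').replace('/', '.') or 'root')
        stats.incr(key)
        stats.timing(key, (time.time() - self._start) * 1000)

class PageView(MeasuredHandler):
    def get(self):
        self.write('ok')

if __name__ == '__main__':
    tornado.web.Application([(r'/pageview', PageView)]).listen(8080)
    tornado.ioloop.IOLoop.current().start()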

Transforming your data

Graphite supports lots of functions (mathematical and otherwise) to transform your data. There are a few I'd like to highlight, as they really give you a lot of bang for your buck.

Use events to graph all of your deploys, CDN reconfigurations, outages, code reviews or any other event that is important to your organization.

You can compare week-over-week (or day-over-day, etc.) data with Graphite's timeShift function: graph some data, set your graph period to a week, and then ask for the same data a week ago, two weeks ago and three weeks ago.

Did your page views go up? How about convergence?

Your latency, pageviews, signups, queues or whatever data you track probably changes by time of day, or day of the week. So instead of relying on absolute numbers, start using the standard deviation: calculate the mean value and the standard deviation around that mean. For example, "our servers respond in 100 milliseconds, with a standard deviation of 50ms".

Then figure out how far each of your servers is from that mean; you may easily find servers that are 2-4 standard deviations away. Your monitoring would have caught a 500ms upper response time, but it would not have let you do this kind of preventative care.
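Here is that idea in miniature, with made-up numbers; in practice you would pull the per-host response times out of Graphite's render API rather than hard-code them:

# Sketch: flag hosts whose mean response time is more than two standard
# deviations from the cluster mean. The numbers are made up for illustration.
import math

response_ms = {
    'web001': 95,  'web002': 102, 'web003': 99,  'web004': 107, 'web005': 96,
    'web006': 104, 'web007': 101, 'web008': 98,  'web009': 103, 'web010': 410,
}

values = list(response_ms.values())
mean   = sum(values) / float(len(values))
stdev  = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

for host, ms in sorted(response_ms.items()):
    deviations = abs(ms - mean) / stdev
    if deviations > 2:
        # web010 is well under a 500ms alert threshold, but clearly drifting.
        print('%s: %dms is %.1f standard deviations from the mean' %
              (host, ms, deviations))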

Taking this one step further, look into Holt-Winters forecasting to use your past data to model upper and lower bounds for your current data. Graphite even has built-in functions for it.

Further reading

This hopefully gives you a taste of what's possible; luckily, there are a lot of people using these tools and sharing their insights.

Be “Do Not Track” compliant in 30 microseconds or less.


Last week I blogged about the state of Do Not Track on the top internet properties, advertisers and widget providers. If you haven’t read it yet, **spoiler alert** the results aren’t encouraging.

In my experience, many of the top web operators are in fact concerned with your privacy, so it might be hard to understand why even they aren't honoring your "Do Not Track" settings. I'd venture that part of it is awareness, but also that, at scale, implementing a "Do Not Track"-compliant solution isn't a trivial matter. The latter becomes obvious when you see many people in the web & ad industry talking about the importance of DNT, with only a handful able to actually provide a working implementation.

Do Not Track – the basics

Let me illustrate the difference between DNT and non-DNT requests by using Krux (my employer) services as an example.

Under normal circumstances, a basic request/response looks something like this: a user generates some sort of analytics data, and is given a cookie on their first visit. The data is then logged somewhere and processed:

> GET http://beacon.krxd.net/pageview.gif?xxxxxxxxx
< HTTP/1.1 204 No Content
< Set-Cookie: uid=YYYYYY; expires=zz.zz.zz

Now, for a DNT-enabled request, the exchange looks a bit different; the user still generates some sort of analytics data, but in the response the cookie value is now set to 'DNT' (we use the value 'DNT' because you can't yet read the DNT header from JavaScript in all browsers) and the expiry is set to a fixed date in the future, so it's impossible to distinguish one user from another based on any of these properties:

> GET http://beacon.krxd.net/pageview.gif?xxxxxxxxx
> DNT: 1
< HTTP/1.1 204 No Content
< Set-Cookie: uid=DNT; expires=Fri, 01-Jan-38 00:00:00 GMT
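To make that logic explicit, here is a minimal WSGI sketch of the behaviour described above. This is not our actual beacon code; the UID generation and cookie details are placeholders:

# Sketch of the DNT-aware cookie logic described above, as a tiny WSGI app.
# Not our actual beacon; UID generation and cookie details are placeholders.
import time
import uuid
from wsgiref.simple_server import make_server

def beacon(environ, start_response):
    if environ.get('HTTP_DNT') == '1':
        # Every DNT user gets the same value and the same fixed expiry date,
        # so one user cannot be told apart from another.
        cookie = 'uid=DNT; expires=Fri, 01-Jan-38 00:00:00 GMT'
    elif 'uid=' not in environ.get('HTTP_COOKIE', ''):
        # First visit without DNT: hand out a unique id, valid for a year.
        expires = time.strftime('%a, %d-%b-%y %H:%M:%S GMT',
                                time.gmtime(time.time() + 365 * 86400))
        cookie  = 'uid=%s; expires=%s' % (uuid.uuid4().hex, expires)
    else:
        cookie = None   # returning visitor, nothing to set

    headers = [('Content-Length', '0')]
    if cookie:
        headers.append(('Set-Cookie', cookie))
    start_response('204 No Content', headers)
    return [b'']

if __name__ == '__main__':
    make_server('', 8000, beacon).serve_forever()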

Implementing Do Not Track compliance

At Krux, we provide audience analytics for top web properties, with a very strong commitment to user privacy. As part of that, we take honoring "Do Not Track" on our own properties, as well as on our publishers' properties, very seriously.

Our analytics servers process billions of data points per day (or many thousands per second), and each of these requests should be handled quickly; any meaningful slowdown would mean a deteriorated user experience and provisioning many more machines to handle the traffic.

The industry standard for setting user cookies is basically Apache + mod_usertrack, which in our production environment gives response times in the 300-500 microsecond range. That gives us a good performance baseline to work from. Unfortunately, mod_usertrack isn't DNT compliant (it will set a cookie regardless) and can't be configured to behave differently, so I had to look for a different solution.

Writing the beacon as a service is a simple programming task, and the obvious first choice is to try a high-throughput, event-driven system like Tornado or Node (both technologies already in our stack). However, I encountered three issues with this approach that made it unviable:

  • Tornado & Node both respond in a 3-5 millisecond window; although that's quite fast individually, it's an order of magnitude slower than doing it inside Apache
  • Response times degrade rapidly at higher concurrency rates, which is a very common pattern in our setup
  • These processes are single-threaded, meaning they need to sit behind a load balancer or proxy to take advantage of multiple cores, further increasing response times

Next, I tried using Varnish 2.1. It handled the concurrency fine, and was responding in the 2 millisecond range. It also has the benefit of being exposable directly on port 80 to the world, rather than having to be load balanced. The problem I ran into is that Varnish 2.1 does not allow access to all the HTTP request headers for logging purposes. Varnish 3.0 does have support for all headers, but it can't read cookie values directly, and we've experienced some stability problems with it in other tests.

With none of these solutions being satisfactory, nor coming close to the desired response times, the only option left was to write a custom Apache module to handle DNT compliance myself. Not being much of a C programmer (my first language is Perl), I found this a fun challenge. It also gave me a chance to make mod_usertrack behave the way it should have all along.

Introducing mod_cookietrack

So here it is: mod_cookietrack, a drop-in replacement for mod_usertrack that addresses many outstanding issues with mod_usertrack (details below), including Do Not Track compliance.

And most importantly, it performs quite well. Below is a graph showing the performance of an Apache Benchmark (ab) run of 100,000 requests at a concurrency of 50 against a standard Apache server. The blue line shows mod_usertrack in action, while the red line shows mod_cookietrack:

The graph shows that for all the extra features below, including DNT compliance, mod_cookietrack only takes an additional 5-10 microseconds per request.

We have been using mod_cookietrack in production for quite some time now, and it is serving billions of requests per day. To give you some numbers from a real world traffic pattern, we find that in our production environment, with more extensive logging, much higher concurrency and all other constraints that come with it, our mean response time has only gone up from 306 microseconds to 335 microseconds.

So what else can mod_cookietrack do? Here’s a list of the improvements you get over mod_usertrack for free with your DNT compliance:

  • Rolling Expires: Rather than losing the user’s token after the cookie expires, mod_cookietrack updates the Expires value on every visit
  • Set cookie on incoming request: mod_cookietrack also sets the cookie on the incoming request, so you can correlate the user to their first visit, completely transparent to your application.
  • Support for X-Forwarded-For: mod_cookietrack understands your services may be behind a load balancer, and will honor XFF or any other header you tell it to.
  • Support for non-2XX response codes: mod_cookietrack will also set cookies for users when you redirect them, like you might with a URL shortener.
  • Support for setting an incoming & outgoing header: mod_cookietrack understands you might want to log or hash on the user id, and can set a header to be used by your application.
  • External UID generation library: mod_cookietrack lets you specify your own library for generating the UID, in case you want something fancier than ‘$ip.$timestamp’.
  • Completely configurable DNT support: Do Not Track compliant out of the box, mod_cookietrack lets you configure every aspect of DNT.

The code is freely available, Open Source, well documented and comes with extensive tests, so I encourage you to try it out, contribute features and report issues.

For now, mod_cookietrack only supports Apache, but with that we're covering two-thirds of the server market share as measured by Netcraft. Of course, if you'd like to contribute versions for, say, Nginx or Varnish, I'd welcome your work!

Now that being DNT compliant will cost you no more than 30 microseconds per request, all you good eggs have the tools you need to be good internet citizens and respect the privacy of your users; the next steps are up to you!

Note: these tests were done on a c1.medium in AWS – your mileage may vary.


The state of “Do Not Track”

Over the last few weeks, "Do Not Track" has been getting a lot of attention; Mozilla introduced DNT in Firefox as of January 2011 and has a very good FAQ up on the subject. Last week, the official announcement came from the big browser vendors, including Microsoft and Google, that they'd start incorporating DNT as a browser feature as well, which coincided nicely with the White House announcing a privacy bill of rights.

It's great to see online data privacy finally being taken seriously, especially after the various shades of gray we have been seeing lately, some of which are just plain scary.

But sending a "Do Not Track" header from your browser is one thing; having the server on the other side, and perhaps more importantly its (advertising) partners, honor the request is quite another. And unfortunately, the current state of affairs isn't great; taken from Mozilla's FAQ mentioned above:

Companies are starting to support Do Not Track, but you may not notice any changes initially. We are actively working with companies that have started to implement Do Not Track, with others who have committed to doing so soon.

Let's take a quick look at the current cookie-setting practices of the top 500 websites, as counted by Alexa. I ran a quick scan against http://www.<domain> for each of them, once with and once without a DNT header. Of those 500 sites, 482 gave useful replies; some of the most-used domains are CDNs or don't have top-level content, so they are excluded.

From the chart below, you can see that most sites set 1-2 cookies, and that most of those cookies are somehow related to user or session specific data.

I'd have added a third line showing the delta in cookies set when the DNT header was present, but the sad truth is that only 3 websites changed their cookie behavior based on the DNT header: kudos to 9gag.com for not setting any cookies, and to blackhatworld.com & movie2k.com for at least dropping one of their user-specific cookies. The outlier with a whopping 18 cookies, 10 of which are personally identifiable, is walmart.com.

[Chart: cookies set by the top 500 websites, with and without DNT]

Now, setting a user/session cookie is not necessarily a bad thing; for one, you cannot read the DNT header from JavaScript, so if you want to be DNT compliant in JS, you'd have to set a DNT cookie (although not part of the public standard, some newer browsers are starting to support inspecting the DNT setting from the DOM). The industry standard is now to set a cookie matching the string "DNT" or "OPTOUT". Again, unfortunately, none of the top 500 websites actually do this when the DNT header is set.

The other viable option is to send back the same cookie, but with the expiry time set in the past so that it's removed by the browser. Although this would be silly on a first request (it would be better not to set a cookie at all in that case), and it is not as useful in a JavaScript environment, it would still be an effort towards DNT compliance. From the top 500, only forbes.com is using this technique currently.
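Continuing the request/response examples from the previous post, such an exchange might look like this (the cookie name and value are illustrative):

> GET http://www.example.com/
> DNT: 1
< HTTP/1.1 200 OK
< Set-Cookie: uid=YYYYYY; expires=Thu, 01-Jan-70 00:00:00 GMT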

As it stands, only 4 out of 482 measured top 500 sites are actively responding to the DNT header being sent.

The FTC has been calling for a "Do Not Track" implementation, and according to Mozilla, 7% of desktop Firefox users and 18% of mobile Firefox users already have DNT enabled. With such a clear call from regulators and end users, why are so few sites actually following up with a solid implementation? And what does that mean for the advertising and widget partners they use, whose whole model is based on being able to use your personal data?

Again, the answer is not very encouraging. The Wall Street Journal did a great investigation into this with their "What They Know" series, and found that even websites you trust and use every day usher in literally hundreds of trackers when you visit them:

(full disclosure: I work for Krux, whose findings were featured in the WSJ “What They Know” series and we published a whitepaper on the subject)

If you browse through the above charts, it becomes obvious that your personal data is flying across the web and you have very little control of who takes it, how they use it and who they might be selling it on to.

The folks at PrivacyScore even built an index to show you how much your data is at risk when visiting any particular website. Some of the scores, like the one for Target, are quite scary, as illustrated by this story about how Target found out a girl was pregnant before her dad did.

Bottom line: the worst offenders tend to be online merchants, advertising networks or widget providers (especially those of the 'free' variety, because nothing is ever really 'free') that play fast and loose with your personal data in order to optimize their own revenue. To illustrate the point, here's a choice quote from the above article:

“AddThis has its own plans to sell user data, but it’s not looking to publishers as the main buyers. It will sell analytics on user data directly to ad agencies and brands themselves and will get a bigger cut by doing so.”

So why is it hard for the good eggs to do the right thing, even when not doing so makes them look like bad eggs? Part of it is awareness, I'm sure, but another part is simply the challenge of implementing a good "Do Not Track" solution. Implementing DNT at scale is actually not that easy, and we spent a fair amount of time at Krux getting it right.

To further the cause of data privacy, we're open sourcing our solution; it will be the topic of my next blog post, in the hope that all the good eggs will at least be able to Do The Right Thing easily, making it easier for the rest of us to call the bad eggs on their behavior.

P.S. If you want to see where your personal data is going when you visit a webpage, we released a Firefox browser plugin called Krux Inspector, which you can install directly from our website. It shows you exactly who is bringing in which advertisers and partners on the webpage you're viewing, what personal data they're skimming, and which beacons they're dropping.