How to measure everything: Instrumenting your infrastructure

At Krux, we are addicted to metrics. Metrics not only give us the insight to diagnose an anomaly, but also help us make good decisions on things like capacity planning, performance tuning, cost optimization, and beyond. The benefits of a well-instrumented infrastructure are felt far beyond the operations team; with well-defined metrics, developers, product managers, sales and support staff are able to make more sound decisions within their fields of expertise.

For collecting and storing metrics, we really like using Statsd, Collectd and Graphite. We use Collectd to gather system information like disk usage, CPU usage, and service state. Collectd relays its metrics to Graphite through Statsd, and any applications we write stream data directly to Statsd as well. I cover this topic in depth in a previous article.

[Figure: Statsd, Collectd & Graphite pipeline]
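
To give a feel for how lightweight that stream is, here is a minimal sketch of emitting metrics to Statsd by hand; it speaks a simple plain-text protocol over UDP (host, port and metric names are illustrative):

import socket

# Statsd's wire format is 'name:value|type' -- 'c' for counters,
# 'ms' for timers, 'g' for gauges. Fire-and-forget over UDP.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto('web.requests:1|c', ('localhost', 8125))
sock.sendto('web.response_time:153|ms', ('localhost', 8125))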

The real power of this system comes from automatically instrumenting all your moving parts, so everything emits stats as a part of normal operations. In an ideal world, your metrics are so comprehensive that a graph of your KPIs tells you everything you need to know about a service, and digging into logs is only needed in the most extreme circumstances.

Instrumenting your infrastructure

In a previous article, I introduced libvmod-statsd, which lets you instrument Varnish to emit realtime stats that come from your web requests. Since then, I’ve also written an accompanying module for Apache, called mod_statsd.

Between those two modules, all of our web requests are now automatically measured, irrespective of whether the application behind them sends its own measurements. In fact, you may remember this graph, which we build for every service we operate, composed of metrics coming directly from Varnish or Apache:

[Figure: per-service dashboard graph built from Varnish/Apache metrics]

As you can see from the graph, we track the number of good responses (HTTP 2xx/3xx) and bad responses (HTTP 4xx/5xx) to make sure our users are getting the content they ask for. On the right-hand side of the graph, we track the upper 95th & 99th percentile response times to ensure requests come back in a timely fashion. Lastly, we add a bit of bookkeeping; we keep track of a safe QPS threshold (the magenta horizontal line) to ensure there is enough service capacity for all incoming requests, and we draw vertical lines whenever a deploy happens or Map/Reduce starts, as those two events are leading indicators of behavior change in services.
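
Those bookkeeping lines are ordinary Graphite targets; as a rough illustration, something along these lines draws them (the metric names and threshold value are made up):

target=alias(constantLine(1500), 'safe QPS threshold')
target=alias(drawAsInfinite(events.deploys.myservice), 'deploy markers')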

But not every system you are running will be a web server; there are process managers, cron jobs, configuration management, and perhaps even cloud infrastructure, to name just a few. In the rest of this article, I'd like to highlight a few techniques, and some Open Source Software we wrote, that you can use to instrument those types of systems as well.

Process management

Odds are you’re writing (or using) quite a few daemons as part of your infrastructure, and that those daemons will need to be managed somehow. On Linux, you have a host of choices like SysV init, Upstart, Systemd, daemontools, Supervisor, etc. Depending on your flavor of Linux, one of these is likely to be the default for system services already.

For our own applications & services, we settled on Supervisor; we found it to be very reliable and easy to configure in a single file. It also provides good logging support, an admin interface, and monitoring hooks. There’s even a great puppet module for it.
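
To give an idea of how little configuration that takes, a minimal program stanza looks something like this (paths and names are made up):

[program:sample-app]
command=/usr/bin/sample-app --stats
autostart=true
autorestart=true
stdout_logfile=/var/log/sample-app.log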

The one bit we found lacking is direct integration with our metrics system to tell us what happened to a service and when. For that purpose I wrote Sulphite, a Supervisor Event Listener that emits a stat to Graphite every time a service changes state (for example, from running to exited). This lets us track restarts, crashes, unexpected downtime and more as part of our service dashboard.

You can get Sulphite by installing it from PyPI like this:

$ pip install sulphite

And then configure it as an Event Listener in Supervisor like this:

[eventlistener:sulphite]
command=sulphite --graphite-server=graphite.example.com --graphite-prefix=events.supervisor --graphite-suffix=`hostname -s`
events=PROCESS_STATE
numprocs=1

Configuration management

Similarly, you’re probably using some sort of configuration management system like Chef or Puppet, especially if you have more than a handful of nodes. In our case, we use Puppet for all of our servers, and as part of its operations, it produces a lot of valuable information in report form. By default, these reports are stored as log files on the client & server, but using the custom reports functionality, you can send this information on to Graphite as well, letting you correlate changes in service behavior with changes made by Puppet.

We open sourced the code we use for that and you can configure Puppet to use it by following the installation instructions and creating a simple graphite.yaml file that looks something like this:

$ cat /etc/puppet/graphite.yaml
---
:host: 'graphite.example.com'
:port: 2023
:prefix: 'puppet.metrics'

And then updating your Puppet configuration file like this:

[master]
pluginsync = true
report     = true
reports    = store,graphite

[agent]
pluginsync = true
report     = true

Programming language support

One of the best things you can do to instrument your infrastructure is to provide a base library that includes statistics, logging & monitoring support for the language(s) your company uses. We’re big users of Python and Java, so we created libraries for each of those. Below, I’ll show you the Python library, which we’ve open sourced.

Our main design goal was to make it easy for our developers to do the right thing: by the very nature of using the base library, they get convenient access to patterns and code they won’t have to write again, while operators get all the knobs & insights needed to run it in production.

The library comes with two main entry points you can inherit from for apps you might want to write. There’s krux.cli.Application for command line tools, and krux.tornado.Application for building dynamic services. You can install it by running:

$ pip install krux-stdlib

Here’s what a basic app might look like built on top of krux-stdlib:

import krux.cli

class App(krux.cli.Application):
    def __init__(self):
        ### Call the superclass to bootstrap.
        super(App, self).__init__(name = 'sample-app')

    def run(self):
        stats = self.stats
        log   = self.logger

        with stats.timer('run'):
            log.info('running...')
            ...

The ‘name’ argument above uniquely identifies the app across your business environment; it’s used as the prefix for stats, as the identifier in log files, and as the app’s name in the usage message. Without adding any additional code (and therefore work), here’s what is immediately available to operators:

$ sample-app -h
[…]

logging:
 --log-level {info,debug,critical,warning,error}
    Verbosity of logging. (default: warning)
stats:
 --stats Enable sending statistics to statsd. (default: False)
 --stats-host STATS_HOST
    Statsd host to send statistics to. (default: localhost)
 --stats-port STATS_PORT
    Statsd port to send statistics to. (default: 8125)
 --stats-environment STATS_ENVIRONMENT
    Statsd environment. (default: dev)

Now any app developed can be deployed to production, with stats & logging enabled, like this:

$ sample-app --stats --log-level warning

Both krux.cli and krux.tornado will capture metrics and log lines as part of their normal operation, so even if developers aren’t adding additional information, you’ll still get a good baseline of metrics just from using this as a base class.

AWS costs

We run most of our infrastructure inside AWS, and we pride ourselves on our cost management in such an on-demand environment; we optimize every bit of our infrastructure to eliminate waste and ensure we get the biggest bang for our buck.

Part of how we do this is to track Amazon costs as they happen in realtime, and cross-correlate them to the services we run. Since Amazon exposes your ongoing expenses as CloudWatch metrics, you can programmatically access them and add them to your graphite service graphs.

Start by installing the Cloudwatch CLI tools and then, for every Amazon service you care about, simply run:

$ mon-get-stats EstimatedCharges 
   --namespace "AWS/Billing" 
   --statistics Sum 
   --dimensions "ServiceName=${service}" 
   --start-time $date

You can then send those numbers to Statsd using your favorite mechanism.
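
As a rough sketch of the glue, shelling out to the CLI and forwarding the latest datapoint as a gauge might look like this; the output parsing and the metric name are assumptions, so adjust them for your CLI version:

import socket
import subprocess

def send_billing_stat(service, statsd=('localhost', 8125)):
    # Ask CloudWatch for the estimated charges of one AWS service.
    out = subprocess.check_output([
        'mon-get-stats', 'EstimatedCharges',
        '--namespace', 'AWS/Billing',
        '--statistics', 'Sum',
        '--dimensions', 'ServiceName=%s' % service,
        '--start-time', '2013-01-01',   # illustrative start date
    ])
    lines = out.strip().splitlines()
    if not lines:
        return
    # Assumes one datapoint per line, with the Sum in a numeric column.
    for field in lines[-1].split():
        try:
            dollars = float(field)
            break
        except ValueError:
            continue
    else:
        return
    # A gauge suits a slowly-moving value like a running bill.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto('aws.billing.%s:%s|g' % (service, dollars), statsd)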

Further reading

The tips and techniques above are more detailed examples from a presentation I’ve given on measuring a million metrics per second with minimal developer overhead. If you find them interesting, there are more in the slides from that talk.

Be “Do Not Track” compliant in 30 microseconds or less.


Last week I blogged about the state of Do Not Track on the top internet properties, advertisers and widget providers. If you haven’t read it yet, **spoiler alert** the results aren’t encouraging.

In my experience, many of the top web operators are in fact concerned with your privacy, so it might be hard to understand why even they aren’t honoring your “Do Not Track” settings. I’d venture that part of it is awareness, but also that, at scale, implementing a “Do Not Track” compliant solution isn’t a trivial matter. The latter becomes obvious when you see people in the Web & Ad industry talking about the importance of DNT, while only a handful are able to actually provide a working implementation.

Do Not Track – the basics

Let me illustrate the difference between DNT and non-DNT requests by using Krux (my employer) services as an example.

Under normal circumstances, a basic request/response looks something like this: a user generates some sort of analytics data and is given a cookie on their first visit. The data is then logged somewhere and processed:

> GET http://beacon.krxd.net/pageview.gif?xxxxxxxxx
< HTTP/1.1 204 No Content
< Set-Cookie: uid=YYYYYY; expires=zz.zz.zz

Now, for a DNT-enabled request, the exchange looks a bit different. The user still generates some sort of analytics data, but the response cookie’s value is now set to ‘DNT’ (we set the value to ‘DNT’ because you can’t yet read the value of the DNT header from JavaScript in all browsers), and the expiry is set to a fixed date in the future, so it’s impossible to distinguish one user from another based on any of these properties:

> GET http://beacon.krxd.net/pageview.gif?xxxxxxxxx
> DNT: 1
< HTTP/1.1 204 No Content
< Set-Cookie: uid=DNT; expires=Fri, 01-Jan-38 00:00:00 GMT
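
In code, the decision boils down to something like this sketch; the function and names are illustrative, not the actual internals of our beacon:

import time
from wsgiref.handlers import format_date_time

def tracking_cookie(headers, remote_ip):
    # DNT users all get the same value and the same fixed expiry,
    # so no property of the cookie can tell them apart.
    if headers.get('DNT') == '1':
        return 'uid=DNT; expires=Fri, 01-Jan-38 00:00:00 GMT'
    # Everyone else gets a mod_usertrack-style '$ip.$timestamp' id
    # with a rolling two-year expiry.
    uid = '%s.%d' % (remote_ip, time.time())
    expires = format_date_time(time.time() + 2 * 365 * 86400)
    return 'uid=%s; expires=%s' % (uid, expires)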

Implementing Do Not Track compliance

At Krux, we provide audience analytics for top web properties, with a very strong commitment to user privacy. As part of that, we take honoring “Do Not Track” for our properties, as well as our publishers’ properties, very seriously.

Our analytics servers process billions of data points per day (or many thousands per second), and each of these requests should be handled quickly; any meaningful slowdown would mean a deteriorated user experience and provisioning many more machines to handle the traffic.

The industry standard for setting user cookies is basically Apache + mod_usertrack, which in our production environment gets response times in the 300-500 microsecond range. That gives us a good performance baseline to work from. Unfortunately, mod_usertrack isn’t DNT compliant (it will set a cookie regardless) and can’t be configured to behave differently, so I had to look for a different solution.
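
For reference, that baseline amounts to a handful of directives (module path and cookie name are illustrative):

LoadModule usertrack_module modules/mod_usertrack.so

CookieTracking on
CookieName     uid
CookieExpires  "2 years"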

Writing the beacon as a service is a simple programming task, and the obvious first choice is to try a high-throughput event-driven system like Tornado or Node (both are technologies already in our stack). However, I encountered three issues with this approach that made it unviable:

  • Tornado & Node both respond in a 3-5 millisecond window; although that’s quite fast individually, it’s an order of magnitude slower than doing the same work inside Apache
  • Response times degrade rapidly at higher concurrency rates, which is a very common pattern in our setup
  • These processes are single-threaded, meaning they need to sit behind a load balancer or proxy to take advantage of multiple cores, further increasing response times

Next, I tried using Varnish 2.1. It handled the concurrency fine, and was responding in the 2 millisecond range. It also has the benefit of being able to be exposed directly on port 80 to the world, rather than being load balanced. The problem I ran into is that Varnish 2.1 does not allow access to all the HTTP request headers for logging purposes. Varnish 3.0 does have support for all headers, but can’t read cookie values directly, and we’ve experienced some stability problems with it in other tests.

With none of these solutions being satisfactory, nor coming close to the desired response times, the only option left was to write a custom Apache module to handle DNT compliance myself. Not being much of a C programmer (my first language is Perl), I found this a fun challenge. It also gave me a chance to write mod_usertrack the way it should have been behaving all along.

Introducing mod_cookietrack

So here it is: mod_cookietrack, a drop-in replacement for mod_usertrack that addresses many of its outstanding issues (details below), including Do Not Track compliance.

And most importantly, it performs quite well. Below is a graph showing the performance of an Apache Benchmark (ab) test using 100,000 requests, 50 concurrent, against a standard Apache server. The blue line shows mod_usertrack in action, while the red line shows mod_cookietrack:
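
For reference, the equivalent ab invocation looks something like this (URL illustrative):

$ ab -n 100000 -c 50 http://localhost/pageview.gif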

The graph shows that for all the extra features below, including DNT compliance, it takes only an additional 5-10 microseconds per request.

We have been using mod_cookietrack in production for quite some time now, and it is serving billions of requests per day. To give you some numbers from a real world traffic pattern, we find that in our production environment, with more extensive logging, much higher concurrency and all other constraints that come with it, our mean response time has only gone up from 306 microseconds to 335 microseconds.

So what else can mod_cookietrack do? Here’s a list of the improvements you get over mod_usertrack for free with your DNT compliance:

  • Rolling Expires: Rather than losing the user’s token after the cookie expires, mod_cookietrack updates the Expires value on every visit.
  • Set cookie on incoming request: mod_cookietrack also sets the cookie on the incoming request, so you can correlate the user to their first visit, completely transparent to your application.
  • Support for X-Forwarded-For: mod_cookietrack understands your services may be behind a load balancer, and will honor XFF or any other header you tell it to.
  • Support for non-2XX response codes: mod_cookietrack will also set cookies for users when you redirect them, like you might with a URL shortener.
  • Support for setting an incoming & outgoing header: mod_cookietrack understands you might want to log or hash on the user id, and can set a header to be used by your application.
  • External UID generation library: mod_cookietrack lets you specify your own library for generating the UID, in case you want something fancier than ‘$ip.$timestamp’.
  • Completely configurable DNT support: Do Not Track compliant out of the box, mod_cookietrack lets you configure every aspect of DNT.
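
Swapping it in is meant to be as simple as loading the new module; the sketch below assumes the mod_usertrack-style directives carry over unchanged, and the DNT-specific knobs are covered in the module’s README:

LoadModule cookietrack_module modules/mod_cookietrack.so

CookieTracking on
CookieName     uid
CookieExpires  "2 years"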

The code is freely available, Open Source, well documented and comes with extensive tests, so I encourage you to try it out, contribute features, and report issues.

For now, mod_cookietrack only supports Apache, but with that we’re covering two-thirds of the server market share as measured by Netcraft. Of course, if you’d like to contribute versions for, say, Nginx or Varnish, I’d welcome your work!

Now that being DNT compliant will cost you no more than 30 microseconds per request, all you good eggs have the tools you need to be good internet citizens and respect the privacy of your users; the next steps are up to you!

Note: these tests were done on a c1.medium in AWS – your mileage may vary.

 

The state of “Do Not Track”

Over the last few weeks, “Do Not Track” has been getting a lot of attention; Mozilla introduced DNT into Firefox as of January 2011 and has a very good FAQ up on the subject. Last week, the official announcement came from the big browser vendors, including Microsoft and Google, that they’d start incorporating DNT as a browser feature as well, which coincided nicely with the White House announcing a privacy bill of rights.

It’s great to see online data privacy finally being taken seriously, especially after the various shades of gray we have been seeing lately, some of which are just plain scary.

But sending a “Do Not Track” header in your browser is one thing; having the server on the other side, and perhaps even more importantly their (advertising) partners, honor the request is quite another. And unfortunately, the current state of affairs isn’t great; taken from Mozilla’s FAQ as mentioned above:

Companies are starting to support Do Not Track, but you may not notice any changes initially. We are actively working with companies that have started to implement Do Not Track, with others who have committed to doing so soon.

Let’s take a quick look at the current cookie-setting practices of the top 500 websites, as counted by Alexa. I ran a quick scan against http://www.domain.com, once with and once without a DNT header. Of those 500 sites, 482 gave useful replies; some of the most used domains are CDNs or don’t have top-level content, so they are excluded.
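
The scan itself needs little more than this sketch (URL illustrative): fetch each page twice, once with the DNT header set, and compare the Set-Cookie headers that come back:

import urllib2

def cookies_for(url, dnt=False):
    # Fetch a page, optionally advertising DNT, and collect the
    # Set-Cookie headers from the response.
    req = urllib2.Request(url)
    if dnt:
        req.add_header('DNT', '1')
    resp = urllib2.urlopen(req, timeout=10)
    return resp.info().getheaders('Set-Cookie')

plain = cookies_for('http://www.example.com/')
with_dnt = cookies_for('http://www.example.com/', dnt=True)
print len(plain), len(with_dnt)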

From the chart below, you can see that most sites set 1-2 cookies, and that most of those cookies are somehow related to user or session specific data.

I’d have added a third line showing the delta in cookies set when the DNT header was present, but the sad truth is that only 3 websites changed their cookie behavior based on it: kudos to 9gag.com for not setting any cookies, and to blackhatworld.com & movie2k.com for at least dropping one of their user-specific cookies. The outlier with a whopping 18 cookies, 10 of which are personally identifiable, is walmart.com.

[Chart: cookies set by the top 500 websites, with and without DNT]

Now, setting a user/session cookie is not necessarily a bad thing; for one thing, you cannot read the DNT header from JavaScript, so if you want to be DNT compliant in JS, you have to set a DNT cookie (although not part of the public standard, some newer browsers are starting to support inspecting the DNT setting from the DOM). The industry standard is now to set a cookie matching the string “DNT” or “OPTOUT”. Again, unfortunately, none of the top 500 websites actually do this when the DNT header is set.

The other viable option is to send back the same cookie, but set the expiry time in the past so that it’s removed by the browser. Although this would be silly on a first request (it would be better not to set a cookie at all in that case), and is not as useful in a JavaScript environment, it would still be an effort towards DNT compliance. Of the top 500, only forbes.com currently uses this technique.

As it stands, only 4 out of 482 measured top 500 sites are actively responding to the DNT header being sent.

The FTC has been calling for a “Do Not Track” implementation and according to Mozilla, now 7% of Desktop Firefox users and 18% of Mobile Firefox users already have DNT enabled. With such a clear call from regulators and end users, why are so few sites actually following up with a solid implementation? And what does that mean for the advertising and widget partners they use, whose whole model is based around being able to use your personal data?

Again the answer is not very encouraging. The Wall Street Journal did a great investigation into this with their “What They Know” series and have found that even websites that you trust and use every day have literally hundreds of trackers ushered in when you visit them:

(full disclosure: I work for Krux, whose findings were featured in the WSJ “What They Know” series and we published a whitepaper on the subject)

If you browse through the above charts, it becomes obvious that your personal data is flying across the web, and you have very little control over who takes it, how they use it, and who they might be selling it on to.

The folks at PrivacyScore even built an index to show you how much your data is at risk when visiting any particular website. Some of the scores, like the one for Target, are quite scary, as illustrated by this story about how Target found out a girl was pregnant before her dad did.

Bottom line: the worst offenders tend to be in the online merchant, advertising network, or widget provider space (especially those of the ‘free’ variety, because nothing is ever really ‘free’), playing fast and loose with your personal data in order to optimize their own revenue. To illustrate the point, here’s a choice quote from the above article:

“AddThis has its own plans to sell user data, but it’s not looking to publishers as the main buyers. It will sell analytics on user data directly to ad agencies and brands themselves and will get a bigger cut by doing so.”

So, why is it hard for the good eggs to do the right thing, even though it makes them look like bad eggs in the process? Part of it is awareness, I’m sure, but another part is simply the challenge of implementing a good “Do Not Track” solution. Implementing DNT at scale is actually not that easy, and we spent a fair amount of time at Krux getting it right.

To further the cause of data privacy, we’re open sourcing our solution, and it will be the topic of my next blog post, in the hope that all the good eggs will at least be able to Do The Right Thing easily, making it easier for the rest of us to call the bad eggs on their behavior.

P.S. If you want to see where your personal data goes when you visit a webpage, we released a Firefox browser plugin called Krux Inspector, which you can install directly from our website. It shows you exactly who is bringing in which advertisers and partners on the page you’re viewing, what personal data they’re skimming, and the beacons they’re dropping.