How to measure everything: Instrumenting your infrastructure

At Krux, we are addicted to metrics. Metrics not only give us the insight to diagnose an anomaly, but also help us make good decisions on things like capacity planning, performance tuning, cost optimization, and beyond. The benefits of well-instrumented infrastructure are felt far beyond the operations team; with well-defined metrics, developers, product managers, sales, and support staff are able to make sounder decisions within their fields of expertise.

For collecting and storing metrics, we really like using Statsd, Collectd, and Graphite. We use Collectd to gather system information like disk usage, CPU usage, and service state. Collectd relays its metrics to Graphite via Statsd, and any applications we write stream data directly to Statsd as well. I cover this topic in depth in a previous article.
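
To make that last step concrete, here is a minimal sketch of an application streaming a counter and a timer to Statsd. It assumes the statsd client package from PyPI and a Statsd daemon on localhost:8125; the metric names are made up for the example:

import time
import statsd   # assumes the 'statsd' client package from PyPI

# Prefix every metric with the application name so it is easy to find in Graphite.
stats = statsd.StatsClient('localhost', 8125, prefix='sample-app')

stats.incr('requests')          # counter: one more request served

with stats.timer('render'):     # timer: how long this block took, in milliseconds
    time.sleep(0.05)            # stand-in for real work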

[Diagram: Statsd, Collectd and Graphite pipeline]

The real power of this system comes from automatically instrumenting all your moving parts, so everything emits stats as a part of normal operations. In an ideal world, your metrics are so encompassing that drawing a graph of your KPIs tells you everything you need to know about a service, and digging into logs is only needed in the most extreme circumstance.

Instrumenting your infrastructure

In a previous article, I introduced libvmod-statsd, which lets you instrument Varnish to emit realtime stats that come from your web requests. Since then, I’ve also written an accompanying module for Apache, called mod_statsd.

Between those two modules, all of our web requests are now automatically measured, irrespective of whether the application behind them sends its own measurements. In fact, you may remember this graph, which we build for every service we operate, composed of metrics coming directly from Varnish or Apache:

[Graph: per-service dashboard in Graphite]

As you can see from the graph, we track the number of good responses (HTTP 2xx/3xx) and bad responses (HTTP 4xx/5xx) to make sure our users are getting the content they ask for. On the right-hand side of the graph, we track the upper 95th & 99th percentile response times to ensure requests come back in a timely fashion. Lastly, we add a bit of bookkeeping: we keep track of a safe QPS threshold (the magenta horizontal line) to ensure there is enough service capacity for all incoming requests, and we draw vertical lines whenever a deploy happens or Map/Reduce starts, as those two events are leading indicators of behavior change in services.

But not every system you are running will be a web server; there are process managers, cron jobs, configuration management, and perhaps even cloud infrastructure, to name just a few. In the rest of this article, I’d like to highlight a few techniques, and some open source software we wrote, that you can use to instrument those types of systems as well.

Process management

Odds are you’re writing (or using) quite a few daemons as part of your infrastructure, and that those daemons will need to be managed somehow. On Linux, you have a host of choices like SysV init, Upstart, Systemd, daemontools, Supervisor, etc. Depending on your flavor of Linux, one of these is likely to be the default for system services already.

For our own applications & services, we settled on Supervisor; we found it to be very reliable and easy to configure in a single file. It also provides good logging support, an admin interface, and monitoring hooks. There’s even a great Puppet module for it.

The one bit we found lacking is direct integration with our metrics system to tell us what happened to a service and when. For that purpose I wrote Sulphite, which is a Supervisor Event Listener; Sulphite emits a stat to Graphite every time a service changes state (for example, from running to exited). This lets us track restarts, crashes, unexpected downtime, and more as part of our service dashboard.

You can get Sulphite by installing it from PyPI like this:

$ pip install sulphite

And then configure it as an Event Listener in Supervisor like this:

[eventlistener:sulphite]
command=sulphite --graphite-server=graphite.example.com --graphite-prefix=events.supervisor --graphite-suffix=`hostname -s`
events=PROCESS_STATE
numprocs=1
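
Under the hood, a Supervisor event listener is just a long-running process that speaks a simple protocol over stdin/stdout. The sketch below is not Sulphite’s actual code, just a hedged illustration of the same idea: acknowledge each PROCESS_STATE event and forward it to Graphite as a metric. The metric path and the Graphite host/port are assumptions for the example:

import socket
import sys
import time

GRAPHITE_ADDR = ('graphite.example.com', 2003)  # assumed plaintext Carbon port

def write_stdout(s):
    # stdout is reserved for the event listener protocol; log to stderr instead.
    sys.stdout.write(s)
    sys.stdout.flush()

def main():
    while True:
        write_stdout('READY\n')  # tell supervisord we are ready for an event
        header = dict(t.split(':', 1) for t in sys.stdin.readline().split())
        payload = sys.stdin.read(int(header['len']))
        fields = dict(t.split(':', 1) for t in payload.split())
        # e.g. events.supervisor.myapp.PROCESS_STATE_EXITED 1 1367324400
        metric = 'events.supervisor.%s.%s' % (fields['processname'],
                                              header['eventname'])
        message = '%s 1 %d\n' % (metric, int(time.time()))
        conn = socket.create_connection(GRAPHITE_ADDR)
        conn.sendall(message.encode('utf-8'))
        conn.close()
        write_stdout('RESULT 2\nOK')  # acknowledge the event

if __name__ == '__main__':
    main()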

Configuration management

Similarly, you’re probably using some sort of configuration management system like Chef or Puppet, especially if you have more than a handful of nodes. In our case, we use Puppet for all of our servers, and as part of its operations, it produces a lot of valuable information in report form. By default, these are stored as log files on the client & server, but using the custom reports functionality, you can send this information on to Graphite as well, letting you correlate changes in service behavior to changes made by Puppet.

We open-sourced the code we use for that, and you can configure Puppet to use it by following the installation instructions and creating a simple graphite.yaml file that looks something like this:

$ cat /etc/puppet/graphite.yaml
---
:host: 'graphite.example.com'
:port: 2023
:prefix: 'puppet.metrics'

And then updating your Puppet configuration file like this:

[master]
pluginsync = true
report     = true
reports    = store,graphite

[agent]
pluginsync = true
report     = true

Programming language support

One of the best things you can do to instrument your infrastructure is to provide a base library that includes statistics, logging & monitoring support for the language(s) your company uses. We’re big users of Python and Java, so we created libraries for each of those. Below, I’ll show you the Python library, which we’ve open sourced.

Our main design goal was to make it easy for our developers to do the right thing; by the very nature of using the base library, you’d get convenient access to patterns and code you wouldn’t have to write again, but it would also provide the operators with all the knobs & insights needed to run it in production.

The library comes with two main entry points you can inherit from for apps you might want to write. There’s krux.cli.Application for command line tools, and krux.tornado.Application for building dynamic services. You can install it by running:

$ pip install krux-stdlib

Here’s what a basic app might look like built on top of krux-stdlib:

import krux.cli

class App(krux.cli.Application):
    def __init__(self):
        ### Call to the superclass to bootstrap.
        super(App, self).__init__(name = 'sample-app')

    def run(self):
        stats = self.stats
        log   = self.logger

        with stats.timer('run'):
            log.info('running...')
            ...

The ‘name’ argument above uniquely identifies the app across your business environment; it’s used as the prefix for stats, as the identifier in log files, and as the app’s name in the usage message. Without adding any additional code (and therefore work), here’s what is immediately available to operators:

$ sample-app -h
[…]

logging:
 --log-level {info,debug,critical,warning,error}
    Verbosity of logging. (default: warning)
stats:
 --stats Enable sending statistics to statsd. (default: False)
 --stats-host STATS_HOST
    Statsd host to send statistics to. (default: localhost)
 --stats-port STATS_PORT
    Statsd port to send statistics to. (default: 8125)
 --stats-environment STATS_ENVIRONMENT
    Statsd environment. (default: dev)

Now any app developed can be deployed to production, with stats & logging enabled, like this:

$ sample-app --stats --log-level warning

Both krux.cli and krux.tornado will capture metrics and log lines as part of their normal operation, so even if developers aren’t adding additional information, you’ll still get a good baseline of metrics just from using this as a base class.

AWS costs

We run most of our infrastructure inside AWS, and we pride ourselves on our cost management in such an on-demand environment; we optimize every bit of our infrastructure to eliminate waste and ensure we get the biggest bang for our buck.

Part of how we do this is to track Amazon costs as they happen, in real time, and cross-correlate them with the services we run. Since Amazon exposes your ongoing expenses as CloudWatch metrics, you can programmatically access them and add them to your Graphite service graphs.

Start by installing the CloudWatch CLI tools and then, for every Amazon service you care about, simply run:

$ mon-get-stats EstimatedCharges \
   --namespace "AWS/Billing" \
   --statistics Sum \
   --dimensions "ServiceName=${service}" \
   --start-time $date

You can then send those numbers to Statsd using your favorite mechanism.
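
As one hedged example of such a mechanism, the sketch below pushes a single dollar amount to Statsd as a gauge; the metric name, the value, and the localhost:8125 endpoint are all placeholders:

import socket

STATSD_ADDR = ('localhost', 8125)   # assumed Statsd host/port

def send_gauge(metric, value):
    # A Statsd gauge ('|g') records the current value of a measurement,
    # which fits a running dollar amount better than a counter would.
    payload = '%s:%s|g' % (metric, value)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(payload.encode('utf-8'), STATSD_ADDR)

# Hypothetical value taken from the mon-get-stats output above.
send_gauge('aws.billing.AmazonEC2', 1234.56)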

Further reading

The tips and techniques above are more detailed examples from a presentation I’ve given on measuring a million metrics per second with minimal developer overhead. If you found them interesting, there are more in the slideshow below:

[Embedded slideshow]

The dark side of the New Shiny

Why is all (new) software crap? Or at least, why do all the old-timers seem to think so, while all the hip young kids are all over it? Currently it’s all about MongoDB & Node, but whatever happened to LAMP or even Ruby on Rails?

A page from the history books

Back in the ’90s, creating dynamic content on the web was the New Shiny; with a cheap DreamHost account that gave you shell access and an Apache configured to allow ExecCGI in your ~/public_html folder, you could simply drop a script in there and you’d be off to the races.

The New Shiny used for these dynamic pages was predominantly a language called Perl; it was installed on every Unix/Linux system and a simple “hello world” would take you mere minutes to set up.

Of course, for your first few more complicated scripts, you’d probably need some help, so you’d go to Yahoo or Altavista (yes, this is from a time before Google) and search for something like “perl cgi how to send email” to find your answers.

What you’d find as the top result was Matt’s Script Archive: a ragtag collection of Perl scripts for common problems that you could drop straight into your ~/public_html folder. Unfortunately, they were also poorly written, full of security holes, used absolute worst practices, and taught bad code.

Even more unfortunately, they were also very popular.

So popular, in fact, that the Perl community – which has its own repository for libraries, modules, etc. called CPAN (which PEAR, PyPI, NPM, and so on are modeled after) – started the NMS project to provide drop-in replacements for Matt’s scripts that were actually of good quality, used best practices, taught sensible coding, and came without all the known security holes. Eventually, even Matt himself acknowledged the problems and endorsed NMS over his own code.

But until NMS became a suitable replacement for Matt’s code (and gained the required ranking in search engines), people looking for how to do things in Perl found Matt’s archive. And they used it, they told their friends, wrote about it on Slashdot or posted about it on the Yahoo Perl Group, which in turn further increased the search ranking of Matt’s archive. So other beginners found it, and they posted about it, referred to it, and so on.

And all the while there was CPAN, a good source of code, full of community feedback, promotion of best practices, distributed automated testing and bug tracking; all the principles of good software development that we know and love as a community.

And still, Perl (partially through Matt’s Script Archive, though other factors contributed) got a reputation – one that persists even today – of being a hard-to-read, insecure language that only script kiddies used. With so many people flocking to it, using it before understanding it, and the community failing to make best practices obvious and accessible, Perl buckled under the pressure of being the New Shiny.

The rise of the One-Eyed

As a result, people got frustrated with Perl, and there was room for a replacement. As PHP 4 rose through the ranks, unseating Perl to become the new darling of Dynamic Pages, the problem got worse. By now, blogs were becoming mainstream, most mailing lists had web interfaces, and Google was indexing them. The One-Eyeds leading the blind became even more influential.

Then Rails came out for Ruby. Applications were becoming more complex, and there was room for a full framework (from Database to View) to unseat PHP and become the New Shiny.

By now, the hacker culture had firmly taken hold, and it wasn’t just about finding the right Ruby/Rails library to do the job you wanted. Instead, as a newcomer to Ruby on Rails, odds were you had to find the most current copy of a forked version of the original library on some guy’s website, at least until that copy got forked by another guy and updated with the bug fix or feature you needed, all the while crossing your fingers that none of the code you were finding was produced by a One-Eyed.

And the rise of Git & GitHub of course made the ‘fork and forget’ process even easier.

The Ruby community has since rallied behind Gems and has become much more vocal about best practices. But as with Perl, it is hard to unwind past habits, and some One-Eyedness remains.

Fast forward to 2012. Node is the New Shiny, and all the newcomers are flocking to it. And once again, the One-Eyeds are leading the blind:

[Embedded tweet]

In the article referenced in the tweet, written on a site called “HowToNode.org” – making it at least sound a lot more authoritative than SomeGuysBlog.com – there is some information that’s fundamental to understanding how NPM works. But it’s sprinkled with rants and blatantly bad advice (the kind that breaks systems) that will be taken by newcomers as best practice or, in the worst case, Gospel, creating even more One-Eyeds.

Those who don’t learn from history…

And that’s when the industry, the jaded, those who have been through at least four incarnations of the New Shiny and can see the words of the One-Eyed being taken as Gospel, respond like this:

[Embedded tweet]

Unfortunately, the problem of the uninformed One-Eyed exists in the opposite camp as well, where people will come out with ridiculous, inflammatory, non-constructive, ill-informed statements, like in the video below, and spread them like their own Gospel in turn.

[Embedded video]

20 GOTO 10?

So how do we break this cycle? Where is the Wikipedia of Technology that actually uses facts and intelligence rather than hearsay, cargo-culting, and Gospel to inform and drive decisions? How do we actually teach those new to our field to build better and smarter things, rather than breeding yet another generation of One-Eyeds?