Measure all the things!

So, we all know that Monitoring & Graphing sucks. The only thing that sucks more is retrofitting your existing app with monitoring & graphing. The good news is that there are a few straightforward things you can do that don’t involve rearchitecting the entire stack or other similarly painful activities. And that’s the topic of an ignite talk I just gave at Devopsdays 2012.

First things first: you are going to need three tools to make your life better:

  1. Collectd – to collect system metrics like RAM and CPU, with plugins for Apache, etc.
  2. Statsd – to collect stats from your app, or any other random place (a minimal sketch of its protocol follows below).
  3. Graphite – to store, query & graph your data.
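
Statsd’s wire protocol is simple enough to speak by hand: plain-text "key:value|type" packets over UDP, where the type is c for a counter or ms for a timer. A minimal sketch in Python (the key names and the statsd address are made up for the example):

import socket

# Statsd listens for plain-text datagrams on UDP port 8125 by default.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
statsd = ("localhost", 8125)

sock.sendto(b"myapp.signups:1|c", statsd)           # bump a counter
sock.sendto(b"myapp.response_time:320|ms", statsd)  # record a timer in ms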

Graphite Setup

There are lots of great docs on setting up Graphite, so I’ll just highlight our choices in the setup: we use one Graphite server per data center, and that scales to about 300k writes/second/node on an AWS c1.xlarge. You’ll want one core per carbon process you run, and being able to RAID the ephemeral drives will greatly improve your performance. We use c1.mediums in our smaller data centers, as they collect far less data per second and aren’t asked to produce graphs.
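
To actually use those cores, you run one carbon-cache instance per core and put a carbon-relay in front to spread the writes across them. A minimal carbon.conf sketch (the instance names and port numbers here are just an example):

[cache:a]
LINE_RECEIVER_PORT = 2013
PICKLE_RECEIVER_PORT = 2014
CACHE_QUERY_PORT = 7002

[cache:b]
LINE_RECEIVER_PORT = 2023
PICKLE_RECEIVER_PORT = 2024
CACHE_QUERY_PORT = 7102

Each instance is then started with something like "carbon-cache.py --instance=a start", and the relay’s DESTINATIONS setting lists the instances to fan out to.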

Using namespaces

As an aside, Graphite doesn’t actually care about the keys you use for your metrics, but it can do clever things using a dot as a delimiter and wildcards for searching/aggregating, so pick wisely. Our naming scheme is:

<environment>.<application>.<<your key>>.<hostname>

This allows for easy roll-ups to the cluster level, avoids stomping on other metrics, and neatly keeps the dev/stage/canary noise out of the production data.
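
For example (the application and host names are hypothetical), a response-time metric from one production web node would live at:

prod.webapp.response_time.web01

and a single wildcard target like prod.webapp.response_time.* pulls in the whole cluster at once.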

We also do automated roll-ups per cluster with the carbon aggregator, using the rules below, which saves CPU when these numbers are displayed on graphs every 1-5 minutes.

<prefix>.<env>.<key>.total_sum (10) = sum <prefix>.(?!timers)<env>.<<key>>.<node>
<prefix>.<env>.<key>.total_avg (10) = avg <prefix>.(?!timers)<env>.<<key>>.<node>

Getting data into graphite

If you’re using Apache, you can easily get basic statistics about your system into Graphite. Just add a piped custom log to your vhost configuration and you’re good to go. For example, if you were running a simple analytics beacon, this would define a log format that captures response times for your beacon and sends them to Statsd, making them immediately viewable in Graphite:

LogFormat "http.beacon:%D|ms" stats

CustomLog "|nc -u localhost 8125" stats

One caveat: Apache’s %D is the request duration in microseconds, not milliseconds, so either scale the value before it reaches Statsd or keep the unit mismatch in mind when reading the graphs.

If you’re using Varnish, you can write out a log file with performance data using varnishncsa, and then use a trivial tailing script to pipe the data onto Statsd. For example, to measure the performance of a dynamic service, you could do something like this:

$ varnishncsa -a -w stats.log -F "%U %{Varnish:time_firstbyte}x %{Varnish:hitmiss}x %{X-Backend}o"

$ varnishtail stats.log
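
The tailing script really can be trivial. Here’s a minimal sketch in Python that follows the log and speaks Statsd’s protocol directly (the field layout matches the varnishncsa format above; the statsd address and key names are made up):

import socket
import time

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
statsd = ("localhost", 8125)

# Follow stats.log like tail -f and turn each line into statsd packets.
with open("stats.log") as log:
    log.seek(0, 2)  # start at the current end of the file
    while True:
        line = log.readline()
        if not line:
            time.sleep(0.1)
            continue
        try:
            url, ttfb, hitmiss, backend = line.split()
        except ValueError:
            continue  # skip malformed lines
        ms = float(ttfb) * 1000  # time_firstbyte is reported in seconds
        sock.sendto(("varnish.ttfb:%d|ms" % ms).encode(), statsd)
        sock.sendto(("varnish.%s:1|c" % hitmiss).encode(), statsd)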

And if you’re using a framework for your dynamic application, like Tornado or Django, you can hook into their request/response cycle to get the data you need. For example, you can wrap the request handler with a “with stats.timer(path) …” statement and send stats on response code/time per path straight to Statsd from the response handler.
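
As a rough sketch of that idea (the timer helper and key names are hypothetical, not any particular library’s API), a Django-style view wrapper could look like:

import socket
import time
from contextlib import contextmanager

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
statsd = ("localhost", 8125)

@contextmanager
def timer(key):
    # Time the enclosed block and ship it to statsd as a timer.
    start = time.time()
    try:
        yield
    finally:
        ms = (time.time() - start) * 1000
        sock.sendto(("%s:%d|ms" % (key, ms)).encode(), statsd)

def instrumented(view):
    # Wrap a view: record response time and response code per path.
    def wrapper(request):
        key = "myapp." + (request.path.strip("/").replace("/", ".") or "root")
        with timer(key):
            response = view(request)
        sock.sendto(("%s.%s:1|c" % (key, response.status_code)).encode(), statsd)
        return response
    return wrapper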

Transforming your data

Graphite supports lots of functions (mathematical and otherwise) to transform your data. There are a few I’d like to highlight, as they really give you a lot of bang for your buck.

Use events to graph all of your deploys, CDN reconfigurations, outages, code reviews or any other event that is important to your organization.
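
Graphite ships with a simple events API you can use for this. A sketch (the URL, tag and release string are made up) that records a deploy and then overlays it on any graph:

import json
from urllib.request import urlopen

# POST the event to Graphite's /events/ endpoint.
event = {"what": "deploy", "tags": "deploy", "data": "release 1.2.3"}
urlopen("http://graphite.example.com/events/", json.dumps(event).encode())

A render target of drawAsInfinite(events("deploy")) then draws a vertical line on the graph at every deploy.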

You can compare week-over-week (or day-over-day, etc.) data with timeShift: graph some data, set your graph period to a week, and then ask for the same data a week ago, two weeks ago and three weeks ago.

Did your page views go up? How about conversions?
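
In target expressions, that week-over-week comparison looks something like this (the series name is hypothetical):

alias(site.pageviews, "this week")
alias(timeShift(site.pageviews, "7d"), "one week ago")
alias(timeShift(site.pageviews, "14d"), "two weeks ago")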

Your latency, pageviews, signups, queues or whatever data you track probably change by time of day or day of the week. So instead of using absolute numbers, start using the standard deviation: calculate the mean, then the standard deviation around that mean. For example, “our servers respond in 100 milliseconds, with a standard deviation of 50ms.”

Then figure out how far your other servers are from that mean; you may easily find servers that are 2-4 standard deviations away. Your monitoring would have caught a 500ms upper response time, but it would not have let you do this kind of preventative care.
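
Graphite can do this math for you. With hypothetical series names, the cluster mean and its spread are just:

averageSeries(prod.webapp.response_time.*)
stddevSeries(prod.webapp.response_time.*)

Plot the individual hosts against the mean plus a couple of standard deviations and the outliers jump out.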

Taking this one step further, look into Holt-Winters forecasting, which uses your past data to model upper and lower bounds for your current data. Graphite even has built-in functions for it.
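
For example (hypothetical series name again):

holtWintersConfidenceBands(prod.webapp.response_time.web01)
holtWintersAberration(prod.webapp.response_time.web01)

The first draws the predicted band around the series; the second graphs how far the actual values stray outside that band, which makes a handy alerting signal.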

Further reading

This hopefully gives you a taste of what’s possible, and luckily there are a lot of people using these tools and sharing their insights.
