Be “Do Not Track” compliant in 30 microseconds or less.

Fork me on GitHub

Last week I blogged about the state of Do Not Track on the top internet properties, advertisers and widget providers. If you haven’t read it yet, **spoiler alert** the results aren’t encouraging.

In my experience, many of the top web operators are in fact concerned with your privacy, so it might be hard to understand why even they aren’t honoring your “Do Not Track” settings. I’d venture that part of that is definitely awareness, but also because, at scale, implementing “Do Not Track” compliant solutions isn’t a trivial matter. The latter becomes obvious when you see people in the Web & Ad industry talking about the importance of DNT, with only a handful being able to actually provide a working implementation.

Do Not Track – the basics

Let me illustrate the difference between DNT and non-DNT requests by using Krux (my employer) services as an example.

Under normal circumstances, a basic request/response looks something like this, where a user generates some sort of analytics data, and is given a cookie on his first visit. The data is then logged somewhere and processed:

> GET http://beacon.krxd.net/pageview.gif?xxxxxxxxx
< HTTP/1.1 204 No Content
< Set-Cookie: uid=YYYYYY; expires=zz.zz.zz

Now, for a DNT enabled request, the exchange looks a bit different; the user still generates some sort of analytics data, but in the response cookie the value is now set to ‘DNT’ (we set the value to ‘DNT’ because you can’t read the value of the DNT header in JavaScript in all browsers yet) and the expiry is set to a fixed date in the future, so it’s impossible to distinguish one user from another based on any of the properties:

> GET http://beacon.krxd.net/pageview.gif?xxxxxxxxx
> DNT: 1
< HTTP/1.1 204 No Content
< Set-Cookie: uid=DNT; expires=Fri, 01-Jan-38 00:00:00 GMT

Implementing Do Not Track compliance

At Krux, we provide audience analytics for top web properties, with a very strong commitment to user privacy. As part of that, we take honoring “Do Not Track” for our properties, as well as our publishers’ properties, very serious.

Our analytics servers process billions of data points per day (or many thousands per second), and each of these requests should be handled quickly; any meaningful slowdown would mean a deteriorated user experience and provisioning many more machines to handle the traffic.

The industry standard for setting user cookies is basically Apache + mod_usertrack and in our production environment will get response times in the 300-500 microsecond range. This gives us a good performance baseline to go off. Unfortunately, mod_usertrack isn’t DNT compliant (it will set a cookie regardless) and can’t be configured to behave differently, so I had to look for a different solution.

Writing the beacon as a service is a simple programming task, and the obvious first choice is to try a high throughput event driven system like Tornado or Node (both are technologies that are already in our stack). I encountered 3 issues with this approach that made this type of solution not viable however:

  • Tornado & node both respond in a 3-5 millisecond window, and although that’s quite fast individually, it’s an order of magnitude slower than doing it inside Apache
  • Response times degrade rapidly at higher concurrency rates, which are a very common pattern in our setup
  • These processes are single thread, meaning they need to be behind a load balancer or proxy to take advantage of multiple cores, further increasing response times

Next, I tried using Varnish 2.1. It handled the concurrency fine, and was responding in the 2 millisecond range. It also has the benefit of being able to be exposed directly on port 80 to the world, rather than being load balanced. The problem I ran into is that Varnish does not allow access to all the HTTP Request headers for logging purposesVarnish 3.0 does have support for all headers, but can’t read cookie values directly and we’ve experienced some stability problems in other tests.

With none of these solutions being satisfactory, nor coming close to the desired response times, the only other option left was to write a custom Apache module to handle DNT compliance myself. And being not much of a C programmer (my first language is Perl), this was a fun challenge. It also gave me a chance to write mod_usertrack in the way it should have been behaving all along.

Introducing mod_cookietrack

So here it is, mod_cookietrack, a drop in replacement for mod_usertrack, that addresses many outstanding issues with mod_usertrack (details below), including Do Not Track compliance.

And most importantly, it performs quite well. Below is a graph that shows performance of an Apache Benchmark (ab) test using 100,000 requests, 50 concurrent to a standard Apache server. The blue line shows mod_usertrack in action, while the red line shows mod_cookietrack:

The graph shows that for all the extra features below, including DNT compliance, it only takes an additional 5-10 microseconds per request

We have been using mod_cookietrack in production for quite some time now, and it is serving billions of requests per day. To give you some numbers from a real world traffic pattern, we find that in our production environment, with more extensive logging, much higher concurrency and all other constraints that come with it, our mean response time has only gone up from 306 microseconds to 335 microseconds.

So what else can mod_cookietrack do? Here’s a list of the improvements you get over mod_usertrack for free with your DNT compliance:

  • Rolling Expires: Rather than losing the user’s token after the cookie expires, mod_cookietrack updates the Expires value on every visit
  • Set cookie on incoming request: mod_cookietrack also sets the cookie on the incoming request, so you can correlate the user to their first visit, completely transparent to your application.
  • Support for X-Forwarded-For: mod_cookietrack understands your services may be behind a load balancer, and will honor XFF or any other header you tell it to.
  • Support for non-2XX response codes: mod_cookietrack will also set cookies for users when you redirect them, like you might with a URL shortener.
  • Support for setting an incoming & outgoing header: mod_cookietrack understands you might want to log or hash on the user id, and can set a header to be used by your application.
  • External UID generation library: mod_cookietrack lets you specify your own library for generating the UID, in case you want something fancier than ‘$ip.$timestamp’.
  • Completely configurable DNT support: Do Not Track compliant out of the box, mod_cookietrack lets you configure every aspect of DNT.

The code is freely available, Open Source, well documented and comes with extensive tests, so I encourage you to try it outcontribute features and report issues.

For now, mod_cookietrack only supports Apache, but with that we’re covering two thirds of the server market share as measured by Netcraft. Of course, If you’d like to contribute versions for say, Nginx or Varnish, I’d welcome your work!

Now that being DNT compliant will cost you no more than 30 microseconds per request, all you good eggs have the tools you need to be good internet citizens and respect the privacy of your users; the next steps are up to you!

Note: these tests were done on a c1.medium in AWS – your mileage may vary.

 

The state of “Do Not Track”

Over the last few weeks, “Do Not Track” has been getting a lot of attention; Mozilla introduced DNT into FireFox as of January 2011 and has a very good FAQ up on the subject. Last week, the official announcement came from the big browser vendors, including Microsoft and Google, that they’d start incorporating DNT as a browser feature as well, which coincided nicely with the White House announcing a privacy bill of rights.

Do Not Track IndicatorIt’s great to see online data privacy finally being taken seriously, especially after the various shades of gray we have been seeing lately. Some of which are just plain scary.

But sending a “Do Not Track” header in your browser is one thing, having the server on the other side, and perhaps even more importantly, their (advertising) partners honor the request is quite another. And unfortunately, the the current state of affairs isn’t great; taken from Mozillas FAQ as mentioned above:

Companies are starting to support Do Not Track, but you may not notice any changes initially. We are actively working with companies that have started to implement Do Not Track, with others who have committed to doing so soon.

Let’s take a quick look at the current cookie setting practices from the top 500 websites, as counted by alexa. I ran a quick scan against http://www.domain.com, once with and once without a DNT header. Of those 500 sites, 482 gave useful replies; some of the most used domains are CDNs or don’t have top level content, so they are excluded.

From the chart below, you can see that most sites set 1-2 cookies, and that most of those cookies are somehow related to user or session specific data.

I’d have added a third line showing you the delta in cookies set when the DNT header was set, but the sad truth is that only 3 websites changed their cookie behavior based on the DNT header: Kudos to 9gag.com for not setting any cookies and blackhatworld.com & movie2k.com for at least dropping one of their user specific cookies. The outlier with a whopping 18 cookies, 10 of which are personally identifiable, is walmart.com.

Online Graphing

Now, setting a user/session cookie is not necessarily a bad thing; for one thing, you can not read the DNT header from JavaScript, so if you’d want to be DNT compliant in JS, you’d have to set a DNT cookie (although not part of the public standard, some newer browsers are starting to support inspecting the DNT setting from the DOM). The industry standard is now to set a cookie matching the string “DNT” or “OPTOUT”. Again, unfortunately, non of the top 500 websites actually do this when the DNT header is set.

The other viable option is to send back the same cookie, but set the expiry time in the past so that it’s removed by the browser. Although this would be silly on a first request (it would be better not to set a cookie at all in that case), and is not as useful in a JavaScript environment, it’d still be making an effort towards DNT compliance. From the top 500, only forbes.com is using this technique currently.

As it stands, only 4 out of 482 measured top 500 sites are actively responding to the DNT header being sent.

The FTC has been calling for a “Do Not Track” implementation and according to Mozilla, now 7% of Desktop Firefox users and 18% of Mobile Firefox users already have DNT enabled. With such a clear call from regulators and end users, why are so few sites actually following up with a solid implementation? And what does that mean for the advertising and widget partners they use, whose whole model is based around being able to use your personal data?

Again the answer is not very encouraging. The Wall Street Journal did a great investigation into this with their “What They Know” series and have found that even websites that you trust and use every day have literally hundreds of trackers ushered in when you visit them:

(full disclosure: I work for Krux, whose findings were featured in the WSJ “What They Know” series and we published a whitepaper on the subject)

If you browse through the above charts, it becomes obvious that your personal data is flying across the web and you have very little control of who takes it, how they use it and who they might be selling it on to.

The folks at PrivacyScore even built an index to show you how much your data is at risk when visiting any particular website. Some of the scores, like the one for Target are quite scary, and is illustrated by this story about how Target found out a girl was pregnant before her dad even did.

Bottom line, the worst offenders tend to be in the online merchant, advertising networks or widget providers space (especially those of the ‘free’ variety – because nothing is every really ‘free’) that play fast and loose with your personal data in order to optimize their own revenue. To illustrate the point, here’s a choice quote from the above article:

“AddThis has its own plans to sell user data, but it’s not looking to publishers as the main buyers. It will sell analytics on user data directly to ad agencies and brands themselves and will get a bigger cut by doing so.”

So, why is it hard for the good eggs to do the right thing, even though it’s making them look like bad eggs in the process? Part of it is awareness I’m sure, but another part of it is simply the challenge of implementing a good “Do Not Track” solution. Implementing DNT at scale is actually not that easy, and we spent a fair amount of time at Krux to get it right.

To further the cause of Data Privacy, we’re open sourcing our solution and it will be the topic of my next blogpost, in the hope that all the good eggs will at least be able to Do The Right thing easily, and making it easier for the rest of us to call the bad eggs on their behavior.

P.S, if you want to see where your personal data is going when you visit a webpage, we released a FireFox browser plugin called Krux Inspector, which you can install directly from our website. It shows you exactly who is bringing in which advertisers and partners on the webpage you’re viewing, and what personal data they’re skimming as well as the beacons they’re dropping.