Wednesday, December 10, 2014

I was on call for a week...and it didn't entirely suck!

Yup, it didn't suck. In fact it was actually pretty good -- I'd go so far as to call it kind of exciting and certainly educational. If you've been (or imagined being) on 24/7 call for a week for a high traffic internet service and think I sound insane, allow me to explain...

Earlier this year I started a new job as a software engineer at Sovrn in Boulder, CO.
“sovrn is an advocate of and partner to almost 20,000 publishers across the independent web, representing more than a million sites, who use our tools, services and analytics to grow their audience, engage their readers, and monetize their site.”
In plainer English this means we offer technology so websites can display ads and earn money. Although I personally think it would be great if more content-based websites could sustain themselves through contributions from their visitors, as Wikipedia and Brainpickings do, the reality is that advertising is a more pragmatic revenue model for many website operators. In fact, advertising revenue helps power much of the internet you and I use every day.

We at sovrn work at serious scale: according to Quantcast, in October of 2014 sovrn ranked as the 4th largest ad network in the world and the 3rd largest in the US.

This means we move billions of records of data all day every day from multiple datacenters to a central collection point for processing and analysis. This constant flow of data has to cope with the inevitable network and hardware issues that arise, and the data is ultimately transformed and loaded into warehouse and near-realtime reporting data stores.

Interruptions to any part of the system can cause a variety of issues, from delayed data to capacity challenges and loss of revenue for our customers. One facet of maximizing uptime and dealing with service interruptions is a sophisticated monitoring and alerting system that runs continually, which in turn requires on-call engineers with software development, data management and enterprise IT skill sets.

My first exposure to being on call recently ended and I really did enjoy it. Although serving adverts might sound straightforward, there's a fascinating degree of sophistication involved, and when doing it at scale the problems only get more interesting.

Up until my on-call week my understanding of the "big picture" of our operations was limited, having focused primarily on the needs of the team I'm part of. One of the great things about my on-call experience was how it gave me a much greater appreciation for how everything fits together, and some exposure to the operational aspects of the data processing pipeline and big data toolset we employ at sovrn. (For the curious, we're using Kafka, Mirrormaker, Zookeeper, Camus, Storm, Cassandra, Hadoop and more.)

Besides getting to see all of this stuff hum along in production, there's a definite air of excitement to dealing with an incident. We use a small set of tools to manage our on-call duties, including Icinga (for system monitoring), VictorOps (for managing on-call schedules and messaging on-call engineers), HipChat (we use a dedicated channel for production issues, which helps keep all interested parties informed and allows multiple participants to work an incident collaboratively), as well as a wiki for knowledge-base articles.

I've worked in jobs before where the software engineers didn't get anywhere near production -- primarily due to regulatory considerations necessitating a strong separation between development and operations. Although those separations may help address fraud and other similar concerns, they inhibit other very positive things besides the excitement and "big picture" comprehension I've already mentioned.

First, there's a definite camaraderie that emerges from trying to figure out what's going on when you're getting alert after alert one evening on a weekend and have to ask colleagues to help. This necessitates a level of communication and cooperation across teams that might not otherwise happen all that often and is definitely a very positive thing.

Secondly, seeing how your code responds in production is a phenomenal feedback loop for software engineers. You have a lot more skin in the game when you and your colleagues will be receiving alerts for failing systems. Suddenly great logging and debugging characteristics are first-class concerns. Nothing will focus the mind on the need for writing high-quality, easy-to-support code quite like this.

Hopefully that now explains my viewpoint and you no longer think I'm completely mad...

1 comment:

  1. Excellent article covering a good on call experience. This is how on call staff should feel. If they do not then either the software development standards are too low (e.g. a poor diagnosis experience) or, perhaps, the wrong type of person has been made available to be on call (nothing to be proud of).