Take This On-Call Rotation and Shove It

The familiar blue and gold intro graphic fills the screen every evening at six o’clock on the dot. The jabbing staccato string music conjures up vague secondhand memories of what a teletype machine might have sounded like. A high-angle view of the studio floor with the large Lexan-clad desk in the middle, then a cross dissolve to a two-shot of the presenters for this newscast. The music fades, each person introduces themselves, then they jump straight into the top story for the evening. It’s been this way for as long as anybody can remember. They’ve never failed to get this show on the air.

They’ve never failed.

Who you gonna call?

Everything fails all the time.
Werner Vogels

Producing any sort of live television show is a complex ballet. The studio’s cameras and microphones route their signals into video switchers and audio mixers, pre-taped packages come from the video server, field reporters are connected bidirectionally through a satellite link, and with a sprinkling of pizazz from the motion graphics machine, the final product is sent off to master control and ultimately to all the kitchen counters and family rooms across the city.

But there are ancillary systems outside of this direct pipeline. The studio lighting is quite important, as most professional broadcast cameras tend to produce underwhelming images under inadequate light. The teleprompters feeding the anchors their scripts are obviously important. The weather reporting segments use an entirely separate system of graphics rendering equipment that must be linked through a chroma keyer to place the meteorologist in front of the computer-generated forecast images. And quite obviously, this equipment requires a handful of human operators.

Studio-grade equipment is obscenely expensive, but it is also incredibly reliable. It is rare for things to outright fail, but anything can eventually wear out after enough daily use. If a camera fails, perhaps they can wheel the one from the sports desk over to cover this part of the broadcast. If the teleprompters fail, the anchors have a copy of the script at their desk that they can look down at. If one of the anchors calls out sick, they can sub in talent from the morning news team.

Each of these is an example of either a redundant backup system or spare capacity that can be reallocated if needed. The broadcast technically does not need any of these contingencies to function under normal circumstances, but in cases where things go wrong it can mean the difference between success and total failure.

Not everything can be made completely redundant. A failure in the power system for the lights will most likely plunge the entire studio into darkness, and that’s no way to run a news program. Similarly, if the $50,000 video switcher dies, it’s highly unlikely that they’ll have a spare holed up in the supply closet. To insure against every possible thing that could ever go wrong, they would have to build a second studio on a separate part of the city’s electric grid, with redundant copies of all the equipment and broadcast content, along with a full crew of understudies ready to take over at a moment’s notice. This is a degree of redundancy that can’t reasonably be achieved by any budget-conscious station.

There is a hybrid between the two options, allowing the station to only maintain a single instance of anything expensive while having some assurance that the equipment they do have will work when needed: They can find an expert of some sort who is capable of fixing anything that breaks well enough to get the broadcast out. We’ll name this person Alex. If the microphone battery dies, Alex will swap it out. If the video server acts up, Alex knows how to get it working again. If the tire pressure light in the Chevrolet Weather Beast comes on, or the studio’s air conditioning fails, or the technical director breaks both their hands and needs somebody to push the buttons on their behalf, it’s Alex’s time to shine.

Now, naturally, most of the time everything is going fine and Alex has nothing to do. So Alex has some other regular job in the studio—say running the audio mixer. In fact, the audio mixer thing is their official job title and their primary responsibility at the station; they only jump into universal-problem-solving mode when something goes wrong. As soon as the problem is resolved, it’s back to the audio mixer.

The other thing about all this is, well, it’s very difficult to find and train people like Alex. So since they are at the station all evening anyway, why not also have them stick around in case anything goes wrong during the 7:00 news, and 11:00? And if anything happens during the 4:30–7 a.m. news, the station can call Alex at home and have them bop over and fix the problem. Oh, and also the news at noon, and the 4 p.m. block. Apparently this station broadcasts six hours of live news programming most days. At least it’s only four hours on Sunday. In the station’s view, there is no need for anybody to relieve Alex because—most of the time—they never need Alex’s emergency response skills at all. There should be no need to hire and train somebody else to do this stuff because they barely use the services of the person they already have.

There is, of course, another option that the station has never seriously entertained: Don’t hold Alex to any of those responsibilities at all, and if things really go to hell they can just throw on an old The Price Is Right rerun and hope for better luck during the next scheduled newscast.

Grandpa, what’s a beeper?

1-800-759-7243
But if you ain’t got that pin number, dummy, you can’t call me
To hook up with Mix you gotta call that number
Then sit by the phone and wonder
Will he call? If you’re fine I might
If you’re a duck, good night
Sir Mix-A-Lot, “Beepers”

There was a time (not that long ago, really) when people couldn’t contact you if they didn’t know where you were. Telephones were literally screwed into the walls of houses and businesses. Portable two-way radios existed, but they were a massive pain to carry around and operate. If somebody wished to contact you, they would not call you specifically but rather your house or your workplace, places where you might or might not have been at the time. If you were not there, maybe they’d try to call your brother’s house, your favorite bar, the Kiwanis club, or another location that was significant to you. If they still couldn’t find you, eventually they’d give up. People used to be more chill in that way.

In a more structured environment, say a hospital where doctors moved from room to room but stayed inside one building, it was important to be able to get in touch with a specific person without knowing which room they were in. To accomplish this, a phone operator would page the desired person via an announcement over the building’s public address speakers: “Paging Dr. Johnson, Dr. Johnson, please call fourth floor nurse station.” (This verb form of the word “page” uses the same sense as the noun “page,” an old-timey word meaning roughly “servant boy.” I page you in the sense that I am asking Kenneth, the NBC page from 30 Rock, to send for you.) Assuming Dr. Johnson was in the building to hear this, they would find a phone and call the station as instructed.

This worked fine, but it generated a lot of “useless” noise because most of the staff were uninvolved in most of the pages they overheard. Thanks to incremental improvements in technology, the voice announcements were phased out to make way for unidirectional radio broadcasts that covered the entire building. The content of the radio message remained the same as the audible announcement: who the page was for, and who that person needed to contact in response. Each person who needed to receive pages was given a pager, a radio receiver that was pre-programmed to only activate in response to pages specifically addressed to it. Each pager contained a small numeric display where the information about who to call could be shown. These were colloquially called beepers because, well, they made a beeping sound to announce each incoming page.

To send a page, a person would pick up one of the building’s telephones and dial the number for the paging system. They would be prompted to enter the recipient’s PIN or unique identification code along with a callback number. If the sender wanted the recipient to call them directly, the callback number would be a phone that the sender was ready to pick up. It didn’t have to be, though. For example, the sender and recipient could have a prearranged system in which a code like “505” could be interpreted as the distress signal SOS with some mutually understood meaning. These codes were more common from senders that the recipient knew well, representing messages they frequently needed to exchange. To a building maintenance worker, “234” could indicate an emergency at 234 Maple Avenue while “5300” could have been 5300 Elm Street. The codes meant what the sender and recipient agreed they meant.

Technology got better. Things got smaller and faster. The unidirectional pager networks started becoming overshadowed by mobile phone networks, which soon gained the ability to send bidirectional SMS messages. Microprocessors advanced to the point where a battery-operated handheld device could serve as a phone that could also send and receive text messages. These advances made it possible to send longer messages using a more expressive character set on a device that also did other things. My very first mobile phone could run a game of Snake that objectively blew. But the capabilities were there. Phones continued to gain capabilities, the networks they ran on continued to get faster with wider coverage, but the central thread of “I need to get this message to that device” is as clear today as it was when Sir Mix-A-Lot was courting his lady friends in the 1980s.

Also, the systems described up to this point had one thing in common: The person sending the page was a human being.

Getting on the same page

Dude: They gave Dude a beeper, so whenever these guys call—
Walter: What if it’s during a game?
Dude: Oh, I told them if it was during league play…
Donny: What’s during league play?
Walter: Life does not stop and start at your convenience, you miserable piece of shit.
The Big Lebowski (1998)

Like a disheartening number of things in the tech industry, there are no real standards around what on-call responsibilities look like. Each organization (and each team within!) is free to set things up in whichever way suits their tastes, and the resulting practices vary widely. In order to ground this article in something concrete, I will describe Alex’s on-call arrangement, which seems to be typical for US companies whose business model is “Have a website and/or mobile app, and either put ads all over it or convince the users to enter their credit card information somewhere to use it.” The prevailing attitude of these organizations is that the product must work at all times, otherwise it results in failure to show an ad or collect a payment. Both of these negatively affect revenue.

Alex’s company uses the SEV system, which might mean “severity,” “site event,” “significant event,” “serious event,” or anything else you might care to contrive that matches the pattern. (Again, no standards. Somebody copied part of the philosophy from Amazon or Facebook or someplace but never bothered to codify exactly what the abbreviation meant to them.) SEVs are further divided into numbered classes depending on their impact on the product experience; a SEV 1 means that the business is currently failing to be a business because it is unable to perform its core functions and/or collect its revenue. The lesser SEV 3 might represent degraded performance on some non-critical portion of the application. An example of a SEV 3 might be a situation where users can still change their profile pictures, but those changes are not showing up promptly in the app due to some kind of processing delay. This will probably not impact the quarterly financial statement in a measurable way. An instance of a SEV 1, on the other hand, might entail the mobile app showing a perpetual loading spinner on every request to every user at once. That type of thing tends to get noticed.

Below the SEV system, there is a bubbling churn of things that are subtly broken, or are well on the way to someday being definitely broken, but are fine for the time being. A good example of this would be a disk that is 98% full. In its current state, nothing is actually wrong. But once it finally becomes 100% full and cannot accept any more data, something else in the system is going to respond poorly and this can likely cascade into some kind of SEV. Most systems in most organizations have monitoring in place for this sort of thing, and it is common for an on-call engineer to receive pages about (e.g.) high disk usage and investigate them specifically to avoid a potential SEV in the future. Practically all pages of this nature are generated and sent through automated means, and these pages can sometimes resolve themselves without outside intervention if (e.g.) the disk usage abates naturally.
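For the curious, here is roughly what that kind of automated page looks like when boiled down to a few lines of Python. This is a minimal sketch and not any real monitoring product: the `DiskAlert` class, its `observe` method, and the 90% threshold are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DiskAlert:
    """Tracks whether a hypothetical high-disk-usage alert is currently firing."""

    threshold_pct: float = 90.0  # assumed paging threshold, not from any real tool
    firing: bool = False

    def observe(self, used_pct: float) -> Optional[str]:
        """Feed in one usage sample; return an event name if the state changed."""
        if used_pct >= self.threshold_pct and not self.firing:
            self.firing = True
            return "page"      # somebody's phone goes off
        if used_pct < self.threshold_pct and self.firing:
            self.firing = False
            return "resolve"   # usage abated on its own, no human required
        return None            # nothing changed, stay quiet


alert = DiskAlert()
events = [alert.observe(pct) for pct in (85.0, 92.0, 98.0, 88.0)]
print(events)  # [None, 'page', None, 'resolve']
```

Note that the 98% sample produces no new event. The alert is already firing, so a sane system does not page a second time for the same problem.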

The on-call engineer in Alex’s department is selected out of a rotation of all the team members. The on-call shift is seven consecutive days of 24-hour support, or 168 solid hours (±1 hour depending on how daylight saving time shakes out). The on-call engineer does not need to stay awake for seven straight days; the idea is that they’re supposed to work on typical tasks during business hours and go about their non-work lives as usual, but be able to jump into handling an issue quickly after receiving any page at any time. The “quickly” part is formally defined as time to acknowledge, and durations from 5 to 30 minutes are fairly typical. Alex’s team expects pages to be acknowledged within 15 minutes.

If a page is not acknowledged by the on-call engineer, a system of escalation begins. The escalation policy usually follows one of these patterns:

- If there is only a single on-call engineer, the page may escalate to them again. This re-raises the original alert in case it was somehow missed the first time.
- In a “primary/secondary” type of arrangement, there are actually two people on-call at any given moment. All pages go to the primary, and only unacknowledged pages escalate to the secondary. If the secondary doesn’t acknowledge the page either, it may escalate further as described by the other bullet points here.
- In a “hunt group” configuration, an unacknowledged page is sent to every member of the team (none of whom are officially on-call at the moment) in the hopes that one of them is free to acknowledge and handle the issue. This arrangement has a strong tendency to break down into one of two degenerate states:
  1. One or a few people naturally become highly responsive to all pages, acknowledging them before most of their teammates have the opportunity to do so. Over time, most of the team members stop paying attention to pages and leave their highly-responsive peers to handle everything that comes in.
  2. Something very close to the bystander effect occurs, where everybody in the group assumes somebody else will acknowledge the page, and ultimately nobody steps up to do it. This deadlock is broken when somebody (perhaps a team lead or supervisor) tags a specific team member and tasks them with taking ownership of the issue.

In each of the setups described above, the team’s manager may or may not be part of the escalation chain. If they are, it adds a whole new layer to the on-call calculus: Nobody wants their unacknowledged pages to end up notifying their manager, especially outside of working hours. Alex’s team uses the “single on-call engineer” model with escalation to the manager.
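Alex’s arrangement can be sketched in a few lines of Python. The 15-minute window comes from the team’s stated expectation; the exact chain (page the engineer, page them again, then the manager) and the names `ESCALATION_CHAIN` and `who_is_paged` are my own assumptions for illustration, not the configuration of any real paging product.

```python
from typing import List, Optional

ACK_WINDOW_MIN = 15  # Alex's team expects an acknowledgement within 15 minutes

# Hypothetical chain: page the on-call engineer, page them again, then the manager.
ESCALATION_CHAIN: List[str] = ["on-call engineer", "on-call engineer", "manager"]


def who_is_paged(minutes_unacknowledged: int) -> Optional[str]:
    """Return who holds the page after it has sat unacknowledged for this long.

    Returns None once the chain is exhausted and nobody is left to notify.
    """
    if minutes_unacknowledged < 0:
        raise ValueError("minutes_unacknowledged cannot be negative")
    step = minutes_unacknowledged // ACK_WINDOW_MIN
    if step >= len(ESCALATION_CHAIN):
        return None
    return ESCALATION_CHAIN[step]


for minutes in (0, 14, 15, 30, 45):
    print(minutes, who_is_paged(minutes))
```

Under these assumptions the manager hears about it at the 30-minute mark, which is exactly the outcome everybody on the team is trying to avoid.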

On-call shifts occur one week out of every N weeks, where N is the number of people on the team. For primary/secondary arrangements, the shift frequency is two weeks out of N, even though one of those weeks will ideally see few or zero pages. Still, the secondary must remain fully available during that time. If there are fifteen people on a team, each person will barely need to cover one shift per quarter. On a team of two, each person is on-call every other week. This is a substantial source of variability, and it can change suddenly as team members go on vacation, take personal leave, or part ways with the team or company. Alex works in a department of four, resulting in an on-call shift approximately once a month.
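The arithmetic is simple enough to fit in one function. This is a back-of-the-envelope sketch that assumes a 52-week year and a perfectly even rotation with nobody on vacation; `weeks_on_call_per_year` is a name I made up.

```python
def weeks_on_call_per_year(team_size: int, people_on_call: int = 1) -> float:
    """Weeks per year each person spends carrying the pager.

    people_on_call is 1 for a single on-call engineer and 2 for a
    primary/secondary arrangement. Assumes an even rotation over 52 weeks.
    """
    if people_on_call < 1 or team_size < people_on_call:
        raise ValueError("need at least as many team members as on-call slots")
    return 52 * people_on_call / team_size


print(round(weeks_on_call_per_year(15), 2))  # 3.47, a bit under one per quarter
print(weeks_on_call_per_year(2))             # 26.0, every other week
print(weeks_on_call_per_year(4))             # 13.0, Alex's team
print(weeks_on_call_per_year(4, 2))          # 26.0, same team with primary/secondary
```

The last line is the sneaky one: adding a secondary to a team of four puts everybody on the hook for half the year.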

Sometimes life interferes with on-call scheduling, and for those times there is usually a mechanism for team members to trade partial or complete on-call shifts between themselves. If the active on-call engineer needs a few uninterrupted hours to attend a family function or unavoidable appointment, they can seek out a peer who is willing to cover the responsibility for that time. At some future date, the favor can be reciprocated when that other person is on-call and needs somebody to cover for them.

When an engineer receives a page and needs to do unplanned work in response to it, that work is called on-call load. Each organization has an expected amount of on-call load for each shift. Or rather, they’re supposed to, but it’s not surprising to find places that have never given the idea any serious thought. If an excessive number of issues occur and the load exceeds the expectations for the shift, it becomes on-call pain. True fact. Why would I make that up? Pages that occur outside of regular working hours are considered more painful than those that occur during weekdays.

As far as what the on-call engineer needs to do during incident response—the time between acknowledging a page and resolving the issue that caused it—this is another area of huge variance. Sometimes they’ll need to log into some web UI and click one button. Sometimes they’ll spend ten straight hours trying to resuscitate a completely inaccessible product. A team may experience both ends of the load spectrum from one week to another just by luck of the draw.

Occasionally the on-call engineer will be faced with a situation that is objectively unfixable. Sometimes a critical piece of AWS’s entire us-east-1 region fails, ultimately hobbling a significant chunk of the internet along with it. Sometimes 33 Whitehall loses generator power after Superstorm Sandy drowns its fuel pumps in seawater. Alex’s company has worked very hard to cut down on operational costs by farming out a bit too much of its core functionality to a third party with bad customer support turnaround times, whose outages then become Alex’s outages by proxy. In instances like these, sometimes the on-call engineer just has to throw up their hands in defeat. Other than simply waiting out the problem, the only other feasible option would be to undertake some over-ambitious migration to an entirely different provider. That’s not something that anybody can do in any kind of reasonable time frame, and doing it under the duress of a service outage would be unwise at best. At a certain point, the best Alex can do is turn on The Price Is Right and wait for things to blow over.

Now, obviously, on-call duty is by no means a job requirement that is specific to the tech industry. Doctors and surgeons can be on-call. The building superintendent for an apartment complex can be on-call. The guy who fixes air conditioners can be on-call. The difference is that the people in those industries are fairly compensated for doing it.

Wait, you guys are getting paid?

Work work work, day after day
Fifty hour week, forty hour pay
No time to get over all this overtime
Yeah I’m always runnin’, but I’m always runnin’ behind
Tracy Lawrence, “Runnin’ Behind”

There are ways of looking at employee wages in the US that are elegantly simple. An employee is hired at a rate of $X/hour, they work for Y hours in a week, and the total pay is the product of those two numbers. There is a minimum wage at the federal and possibly state level that sets the smallest legal amount for $X. The employee should work a maximum Y of 40 hours in that week, otherwise they enter an overtime situation where their hourly $X becomes $X and a half. Those highfalutin white collar workers are basically the same, except their Y is fixed at 40 hours regardless of the time actually worked so their total pay stays the same week after week. That’s how it all works, right?

This is the system laid out in the Fair Labor Standards Act of 1938 (FLSA) and its many amendments. This is the law that underpins concepts like minimum wage, overtime, the 40-hour work week, and the notion that child labor probably isn’t such a good thing to do. It also defines a set of exemptions to the rules, thus creating the concept of an exempt employee. If you are a US-based tech worker in a full-time position, I’m going to take a stab in the dark and assume that you are almost certainly classified as an exempt employee. This means that the FLSA’s protections effectively do not exist for you. You are not guaranteed overtime, and you could conceivably work so many hours over the course of a week that your effective hourly rate ends up less than minimum wage. Now I’m wondering if an employer could get away with hiring child labor by classifying them as exempt employees. I would guess not, or somebody out there would be doing it right now.
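To put numbers on the two pay models, here is a small Python sketch. The $20 hourly rate and the $104,000 salary are invented for illustration, and the function names are mine. The $7.25 figure is the current federal minimum wage; your state may set a higher one.

```python
FEDERAL_MINIMUM_WAGE = 7.25  # US federal rate in dollars per hour


def nonexempt_weekly_pay(hourly_rate: float, hours_worked: float) -> float:
    """Hourly pay with time-and-a-half for anything past 40 hours."""
    regular_hours = min(hours_worked, 40)
    overtime_hours = max(hours_worked - 40, 0)
    return hourly_rate * regular_hours + hourly_rate * 1.5 * overtime_hours


def exempt_effective_hourly(annual_salary: float, hours_worked: float) -> float:
    """What a fixed salary works out to per hour in a week of this many hours."""
    weekly_pay = annual_salary / 52
    return weekly_pay / hours_worked


# An hourly worker at a hypothetical $20/hour who puts in a 50-hour week.
print(nonexempt_weekly_pay(20, 50))                    # 1100.0

# An exempt worker on a hypothetical $104,000 salary ($2,000 per week).
print(exempt_effective_hourly(104_000, 40))            # 50.0
print(round(exempt_effective_hourly(104_000, 60), 2))  # 33.33
```

The hourly worker’s paycheck grows with every extra hour. The exempt worker’s paycheck stays put while the effective rate slides toward `FEDERAL_MINIMUM_WAGE` with each additional hour worked.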

The FLSA is designed with repetitive and predictable work in mind: Somebody who works on an assembly line, or who moves boxes around in a warehouse, drivers and couriers, et cetera. Workers in these sorts of jobs have a tendency to produce a similar and predictable amount of work in any given hour. Drop in on them during one hour of the workday and you’ll observe roughly the same level of productivity that you would find from them at any other hour.

Employees who are exempt from the FLSA tend to have variability in their workday. The original thinking was that this would apply to executives and highly-skilled professionals who performed such a wide range of tasks throughout the day that some hours were markedly more valuable than others. These sensibilities changed and eventually morphed into “white collar jobs that paid a lot.” The current regulations specifically list computer-related occupations in their list of exempted fields. And from a certain angle, if you really squint, it makes sense! Think of hours where you have pounded out hundreds of lines of code, then compare it to hours where you sat in a conference room staring at a blinking text insertion cursor instead of paying attention to the presenter. Sometimes you’ll make no progress towards a thorny challenge during the course of an entire workday, which might be completely offset by a single spark of creative inspiration while washing the dishes later that night.

All this to say, there is nothing in the regulations of the United States that can protect Alex from working more than 40 hours in a week. There is no requirement that overtime be paid to them. If the work requires more than 40 hours in a week, oh well, sucks to be Alex. This means that technically Alex could work fewer than 40 hours by applying the same logic, assuming they get all their necessary work done. They have been meaning to work up the nerve to try to pull that one day.

So. With that bit of background out of the way, it’s clear that there is no legal or regulatory requirement for an employer to pay anything for performing on-call duties as long as the responsibility is given to an exempt employee. Based on my own experiences and informal polling of others in the industry, the prevailing attitude is that on-call is part of the job description and “baked in” with the total compensation. It’s not at all unusual to find on-call shifts that receive no additional payment or consideration for carrying the pager. There is also usually nothing extra paid for responding to a page that occurs outside of regular working hours.

And again, there are no absolute rules about this. Some places actually do pay a modest honorarium for each on-call shift worked. Some will provide “unofficial” compensatory time to balance out a page handled outside of typical business hours. (And if your employer gives comp time, a small question for you: Do they also reduce the amount of sprint story points they expect you to work through when you take it?) Legends are told of organizations where the teams are staffed adequately and the systems simply don’t page. Just imagine a magical place where a person is only on-call for like three weeks a year, and who never gets paged during those times. Alex, who once spent an entire summer being on-call every other week while occasionally fielding a dozen pages in the span of a single day, cannot.

Most places won’t even provide a phone or subsidize a mobile carrier bill, nor will they provide a company-paid mobile hotspot for laptop tethering purposes. It’s just assumed that you’ll happily install PagerDuty or Opsgenie or some other hateful app that violates the sanctity of your personal device, right there on the home screen next to Okta Verify. A brief tangent: Fuck Okta Verify. Your personal phone becomes your pager, the thing that pulls you out of leisure time and into work time. After a while, you might start to notice on-call beginning to fundamentally change your relationship with the device.

The absolute largest source of variability comes from a team’s willingness to improve the on-call situation as opposed to simply accepting that things are the way they’re meant to be. Some teams view every page—no matter how trivial—as a signal that something needs to be immediately fixed to prevent that specific thing from ever happening again. Other teams view it as something that just happens as a natural consequence of supporting a product, like a smoke detector battery chirp that everybody has learned to tune out over the course of several years. It is the manifestation of technical debt that has been boiling for years, looking for a pressure relief valve to escape through, and it just happens to keep finding its release through Alex’s pager.

Perhaps unsurprisingly, the teams that are most willing to defend against recurring pages are also the most likely to actually perform in-depth postmortems so they can write and maintain their on-call runbooks. Sometimes the runbook is the only friend an on-call engineer has, and there’s nothing more disappointing than discovering that this friend can’t help fix anything.

168 long, cold, lonely hours

All I wanna do all day is spend it in bed
But that’s bad for the body and even worse for my head
So I’ll try and find a place where no one will ask me a thing
It’ll help me to forget and help me to sing
Reel Big Fish, “Drunk Again”

A page can conceivably come at any time, day or night. Alex needs to receive the alert and begin working on the issue within fifteen minutes, which means they must have a suitable work computer and sufficient internet connectivity available within that time commitment. They must remain cognizant of their phone’s signal quality and the availability of nearby Wi-Fi networks. Unless they take their bulky work laptop with them, it’s not possible to travel anywhere that takes more than a few minutes to return from. (By the way, not everybody lives in a perfectly idyllic area. There are plenty of places where computers get stolen from parked cars and bags get snatched. Carrying this stuff around is a genuine risk for people in some situations.)

Even certain household tasks, cutting the lawn for example, require special consideration. If a page arrives during that activity, Alex needs to put the mower away to a certain extent before going inside to clean up enough to do knowledge work. (In some areas, as above, an unattended mower might get swiped. In others, it could lead to an HOA fine.) It’s mentally taxing to jump from domestic labor to complex problem-solving, and it’s equally difficult to go back when the issue is finally resolved.

It turns out that there are many things in life that are technically compatible with an on-call shift, but which require such delicate planning and forethought that it sometimes ends up being easier to just not bother doing any of that stuff during an on-call week. No significant travel or long walks/drives, no excessive drinking or *ahem*, no ability to simply unplug and decompress. Even if a page never actually comes, there is always the potential for a page to come. Maybe the primary on-call turned off their phone without telling anybody to attend a screening of Oppenheimer. Actually happened. Maybe there’s time to quickly run to the grocery store and back, but it might be cutting it close. Maybe it’s better to just stay home until the end of the week. Park in front of the TV and run out the clock. But don’t watch anything too engrossing; getting paged right during the good part really sucks.

This has a tendency to happen eventually, even at organizations where the expected on-call load is near zero. It’s not possible to live life completely normally while staying prepared to handle any page at any time. It would perhaps be hyperbolic to compare the experience to that of being placed under house arrest, but it’s the closest a lot of us will ever get to experiencing that level of freedom-yet-confinement.

And, of course, when a page does come, it manages to find the most inopportune time to do so. Alex has been paged during nice dinners, in the middle of live entertainment, and at times that rightly should’ve been devoted to time with family members and friends. Not to mention that alert sound, and the notification box on the phone’s lock screen. Alex’s phone became a source of resentment and negative emotions to the point where they basically had to disable almost every other sound and all other notifications because their heart jumped every time one popped up. Alex won’t go as far as to say it caused PTSD, but it sure led to a fair number of the symptoms of PTSD.

Also, it regularly ruined my sleep. Whoops, I meant Alex’s sleep. I’m not Alex. Nope.

Sometimes pages decide to come during overnight hours. Here’s what happens when a page occurs in the middle of the night: First, if you happen to have a significant other, the alert sound invariably wakes them up before it wakes you. You get out of bed. It’s dark. It’s cold. You open your work laptop. Even at its lowest brightness setting, the 16-inch Liquid Retina XDR display lights up the room with its blinding intensity. You log into your email and Slack, open some dashboards, open Okta Verify on your phone (fuck Okta Verify), and you’ve basically done everything you usually do at 9 a.m. on a regular workday. Six hours before you’re supposed to be here, you’re here. Still half asleep (no sense having any caffeine if the intention is to try to go back to bed after this is over), this is really not the right kind of headspace to be in while poking at unfamiliar and on-fire code on production systems. And since it’s the middle of the night, nobody else is here to help diagnose or double-check anything. There would be a kind of palpable loneliness here, if you had the mental acuity to notice it. Maybe you’ll manually page somebody else to come and help. Or maybe you can’t bear the thought of being the one responsible for spreading this on-call pain onto them.

Eventually the problem gets resolved one way or another. You close the laptop and try to quietly return to bed. Your significant other (if applicable) is awoken again by this. You end up lying there for a while, unable to go to sleep due to the mental exertion, the light from the computer screen, and a fair bit of leftover adrenaline. May as well just stay awake; the issue probably isn’t actually fixed and it’ll likely page again in a few minutes anyway.

Hey, you know what this sounds like? Anxiety! On-call basically causes anxiety. And if you’re a person who already has anxiety due to some other preexisting reason, congratulations! Now you have extra anxiety. And for what? Because some Kafka broker stopped running?

We need to talk about Kafka

I thought that since Kafka was a system optimized for writing, using a writer’s name would make sense. I had taken a lot of lit classes in college and liked Franz Kafka. Plus the name sounded cool for an open source project.
Jay Kreps, Kafka: The Definitive Guide, Second Edition

Jay Kreps contributed to the technology that would eventually become Apache Kafka while he was working at LinkedIn. Very broadly, Kafka can be thought of as a message queue that accepts data from one side and sends it out to one or many interested parties on the other side. Unlike a typical queue, it also persists this stream of messages on disk so that delivery can be deferred, batched, or even repeated at some future date. At scale, it may be tasked with handling such an immense volume of data that the operation of the system becomes a major pain in the ass.
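For anybody who has never had the pleasure, the "queue that remembers" idea can be sketched in a few lines of Python. This is a toy, in-memory illustration and not Kafka's actual API: the `ToyLog` class, its method names, and the consumer names are all made up for this example, and a plain list stands in for the on-disk log.

```python
from collections import defaultdict


class ToyLog:
    """A toy append-only log. Hypothetical names; this is not Kafka's API."""

    def __init__(self):
        self._messages = []               # stands in for the persisted log
        self._offsets = defaultdict(int)  # next position to read, per consumer

    def append(self, message):
        """Producer side: store a message and return its offset."""
        self._messages.append(message)
        return len(self._messages) - 1

    def poll(self, consumer, max_messages=10):
        """Consumer side: return the next batch and advance that consumer's offset."""
        start = self._offsets[consumer]
        batch = self._messages[start:start + max_messages]
        self._offsets[consumer] = start + len(batch)
        return batch

    def seek(self, consumer, offset):
        """Move a consumer's position so older messages can be delivered again."""
        self._offsets[consumer] = offset


log = ToyLog()
for event in ["page", "ack", "resolve"]:
    log.append(event)

first = log.poll("billing")      # one interested party reads the stream
second = log.poll("analytics")   # another reads the same messages independently
log.seek("billing", 0)           # rewind: delivery repeated at some future date
replayed = log.poll("billing")
```

Nothing is removed when a consumer reads; each consumer just tracks how far along it is. That is roughly why the same stream can be deferred, batched, or replayed, and also why the disks fill up.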

Part of this operational difficulty is caused by the fact that Kafka runs on multiple discrete computers that must constantly cooperate with each other to behave as a single larger system. (Much like the Borg in Star Trek, but Google already took that name.) If any of the members of the cluster of computers become disconnected or degraded, the performance and stability of the entire group are impacted. If an organization runs Kafka in production, there is a very good chance it is routinely paging somebody due to low disk space, processing lag, or other inscrutable gremlins.

The sheer quantity of data that Kafka wants to write to its disks, as alluded to in the Kreps quote above, is what led to its name. Apache Kafka writes a lot, just like author Franz Kafka did. Surely there is no reason to think any further about this.

Franz Kafka created literary worlds in which unbearably absurd things happen for seemingly no reason and people are expected to simply endure them as if nothing out of the ordinary is going on. His environments only partially make sense, producing bureaucracies that defy any attempt at comprehension. The protagonists in his stories feel alienated and isolated. A queasy undercurrent of anxiousness and sometimes outright horror runs through his whole oeuvre. The author was likely neurotic, he destroyed approximately 90% of everything he ever wrote, then he died well ahead of when he probably should have—leaving several substantial works unfinished. In this regard, Apache Kafka shares some similarities.

That is how you justify the project’s name. Saying “I took some literature classes in college and I thought I remembered liking them” is just intellectually lazy.

Important meaningless things / Meaningful unimportant things

Jesse: Look, I like making cherry product, but let’s keep it real, alright? We make poison for people who don’t care. We probably have the most unpicky customers in the world.
Breaking Bad, “Fly” (Season 3 Episode 10)

I am going to pose what might sound like an unthinkable question: Is this important?

My question is sincere. Does this service or product fulfill a need so critical that there is a legitimate reason to always keep one or more human beings on-call for it? Will the business suffer a significant loss in sales due to an outage? Will they break a contractual service-level agreement (SLA) and expose themselves to legal liability if the outage exceeds a certain threshold? Will they lose the goodwill of customers if the product is unavailable for too long? Do the customers have other options if they get upset with the reliability of the product? Is it even feasible for them to switch to those competitors? Can an unaddressed issue lead to loss of life or property damage?

(Or my personal favorite, usually offered by engineers trying to pull one another back into the crab bucket, which goes something like: “Don’t you think you should be responsible for your own code that you have put into production?” The proper response to this, of course, is “What ‘my code’? We are a team; this is our code.” Or, probably a more healthy view, “This is the code.” The production system in question is almost certainly a schizophrenic box of compromises brought about through poor decision-making, unaddressed technical debt, design-by-committee, and impossible timelines and budgets. This is not a system that any single rational human being on the team would’ve chosen to build if permitted to do so alone. Trying to assert ownership over an environment like that is just begging to get your shit rocked.)

The answer to at least one of those questions is probably automatically “yes,” which justifies the use of any means the organization deems appropriate to avoid risks. Like an adult sternly barking “because I said so,” the conversation is supposed to end here. On-call is important because it’s important. The mere idea of questioning that axiom brings almost certain trouble, so few people dare to pry further.

But it is worth prying. If there are no firm SLAs, it’s hard to justify why the “time to acknowledge” expectations are set the way they are. How much additional customer goodwill does the organization earn by adding one more nine of availability? What is customer goodwill actually worth in the first place? Is it worth more or less than the long-term mental well-being of the engineering staff and the eventual turnover incurred by burning them all out?

Each of these perspectives boils down to the same thing: The business might lose money (either from uncaptured revenue or due to penalties) if somebody is not around all the time to handle any technical fault that may occur. It then follows that this person, this lowly on-call engineer, is like an insurance policy that can prevent a larger calamity.

But here’s the key difference: Insurance policies have premiums that cost something. The insured entity can’t just hand-wave the cost away by smearing the responsibility across a bunch of exempt employees who have the words “and other duties as assigned” at the end of their job descriptions. Handling on-call load is work. Modifying life outside of business hours to make it compatible with potential on-call load is work. On-call pain is tantamount to a large volume of work. Work should be compensated, especially if that work is such a critical part of the organization’s risk mitigation plan. If it’s not important enough to fairly compensate the people who have to shoulder the on-call load, why is it important enough to base the success of the business on?

“Importance” really is the key to thinking about all of this. Some might hold the opinion that if an engineer is not on-call as a part of their regular duties, they clearly must not be working on anything very important. I propose to look at it a different way. To understand this perspective, you’ll need to go deep into the forgotten corner of your closet and find That Cage. You know, That Cage you have worked so long and so hard to trap your imposter syndrome in. Go ahead and pull That Cage out for a minute. Lift off the bed sheet that’s been covering it up. Stare deeply into the dark, haunting eyes of that demon. Once comfortable in each other’s presence, ask your imposter syndrome a simple question: If this was actually important to the success of the organization, why did they trust us with it?

Something for the pain

Bart: Milhouse, how could you let this happen? You were supposed to be the night watchman!
Milhouse: I was watching. I saw the whole thing. First it started falling over, then it fell over.
The Simpsons, “Homer’s Enemy” (Season 8 Episode 23)

Obviously, there are ways to support a product that don’t involve putting staff on-call outside of working hours. The so-called “follow the sun” paradigm pretty accurately describes itself: the team is distributed around the world and the product is supported by whichever part of the globe is in daylight at that time. To do this perfectly fairly, there should be a minimum of three teams each separated by eight hours of timezone distance. When it’s 5 p.m. in Chicago and folks are preparing to go home, it’s 9 a.m. in Melbourne. When the Aussies are done for the day, it’s a new workday in Lviv and the sun is rising over Dublin. At any given point during the course of the day, there is some team that is already awake and already within working hours that can handle things without pulling somebody out of their slumber. (This doesn’t provide perfectly fair coverage during weekends or holidays, but see below for some ideas about that. Or, just close during those times. Banks and financial institutions all do it, and they seem to be doing fine!)
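The handoff described above is simple enough to sketch as a lookup. This is an idealized illustration, not anybody's real paging configuration: the shift start hours are assumptions picked so that three eight-hour windows tile the day in UTC, daylight saving time is ignored entirely, and real office hours in these cities will not line up this neatly.

```python
# Hypothetical shift table: (UTC hour the shift starts, team on duty).
# Assumed values; each shift runs for eight hours and DST is ignored.
SHIFTS = [(22, "Melbourne"), (6, "Lviv/Dublin"), (14, "Chicago")]
SHIFT_HOURS = 8


def team_on_duty(utc_hour):
    """Return the team whose eight-hour window contains the given UTC hour (0-23)."""
    for start, team in SHIFTS:
        # Modular arithmetic handles the shift that wraps past midnight UTC.
        if (utc_hour - start) % 24 < SHIFT_HOURS:
            return team
    raise ValueError("shift table does not cover every hour")


# Who picks up a page at each hour of the day?
coverage = [team_on_duty(hour) for hour in range(24)]
```

With three windows of eight hours, every hour of the day maps to exactly one team that is already at work, so the question "who gets woken up?" never has to be asked.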

If there is no other option than to require on-call support outside working hours, consider making it voluntary. Now I know what you’re going to say to that: If it’s voluntary, nobody will want to do it. And that is absolutely the point. Nobody wants to do it because it freakin’ sucks. It’s not a good deal. So it’s the organization’s responsibility to sweeten that deal enough for somebody to consider taking it. Pay people something for taking an on-call shift. Either a per-day or per-week stipend, or something like the equivalent of one hour’s pay for every four hours on-call. Saying “well, you’re already making plenty of money with your engineer’s salary so that should count” ain’t the way to do it at all.

On-call staff should also be paid something each time they respond to a page, especially when it occurs outside of working hours. If simply being available is worth a flat-rate stipend, actually having to jump into a firefight should be worth something even greater than that. Because if not, it implies that a span of unbearable on-call pain endured by person A is exactly equivalent to an uncharacteristically tranquil week enjoyed by person B at a different time. This is not fair to either of these people or the team as a whole.
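To put rough numbers on it, here is a minimal sketch of that two-part arrangement: a standby stipend at one hour's pay for every four hours on-call, plus a payment for each page answered. The function name, the hourly rate, and especially the per-page amount are hypothetical values chosen for illustration; the right figures are whatever gets negotiated.

```python
def on_call_pay(hourly_rate, hours_on_call, pages_answered, pay_per_page):
    """Standby stipend (one hour's pay per four hours on-call) plus per-page pay."""
    standby = hourly_rate * hours_on_call / 4
    incidents = pages_answered * pay_per_page
    return standby + incidents


# Assumed figures: a 168-hour week minus 40 working hours leaves 128 hours on-call.
OFF_HOURS_PER_WEEK = 168 - 40

quiet_week = on_call_pay(hourly_rate=60, hours_on_call=OFF_HOURS_PER_WEEK,
                         pages_answered=0, pay_per_page=100)
rough_week = on_call_pay(hourly_rate=60, hours_on_call=OFF_HOURS_PER_WEEK,
                         pages_answered=12, pay_per_page=100)
```

Under these made-up numbers the quiet week pays 1,920 and the rough week pays 3,120. Person A and person B both get something for being available, and the one who actually got dragged out of bed twelve times gets more.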

Making the business pay to page staff will certainly change the timbre of the on-call load. Nothing cleans up noisy, flapping, inactionable pager alerts quicker than making them expensive to generate.

In a distant past life, this was proposed and shot down with the following rationale, which I distinctly remember as being one of the stupidest things I have ever heard somebody in my management chain say. Paraphrasing: “If on-call engineers were to receive compensation for each incident they resolved, it would incentivize them to intentionally build systems that fail so they could increase their pay by increasing their on-call load.” My guy, that is sabotage and fraud. You are hypothesizing a scenario where your subordinates are committing actual crimes. If somebody is doing criminal acts at work, fire their ass! Not to mention that anybody who deliberately self-inflicts on-call load is a goddamn idiot and should be sacked just on that basis alone.

There is also the radical option of simply leaving certain spans of the day/week uncovered, with nobody officially on-call during those hours. If something fails, let it fail for a while and then deal with it during support hours. Sometimes a large and visceral production incident needs to bubble up to senior management’s attention in order to rally together the willpower to actually pay down some of the technical debt that led to the problem. If all the engineers know that the product is a wobbly tower of paperclips and duct tape, perhaps the people seated at the very top of the infernal structure should get to see exactly how precarious the whole thing really is from time to time. It’s rather easy to put up a fake facade of perfect customer-facing uptime, and it’s also surprisingly easy to conceal the damage done to the employees who are tasked with carrying the weight of that facade on their backs. At least until they all become disillusioned and quit, anyway.

Something to bear in mind is that you, as the employee, have a certain voice here. You can ask potential employers during the interview stage how they do on-call, and withdraw yourself from consideration if you don’t like the answer they give. You can tell them flatly that this is your reason for not wanting to work there. You can leave a job if on-call is cramping your style or ruining your life, and you can tell the exit interviewer exactly this. If you’re at a place that’s thinking about formally adopting on-call, you can dig your heels in and either demand compensation for it or refuse to do it. Will your employer respect the boundaries you’re drawing? There’s really only one way to find out.

If you find yourself negotiating a job, try to get a line in the employment agreement that specifically disallows unpaid after-hours on-call shenanigans. Remember, negotiating isn’t just about arguing over compensation numbers; you can also try to haggle over material job duties and expectations. Push back on the non-compete and non-solicitation clauses while you’re at it, and the overreaching intellectual property assignment, all that crap. Have you ever redlined a contract? It might be worth giving it a shot someday! And I won’t go as far as to say that tech workers should unionize or anything like that, but I will say that it seems like a whole lot of employers in this industry specifically do not want their employees to unionize. There’s probably a reason they’re apprehensive about it, and that reason almost certainly benefits the employers and not the rest of us.

And it’s scary to stand up for principles like this, which is likely a big part of why on-call duties get successfully foisted on so many unfortunate people. I can’t promise this won’t lead to an uncomfortable and fruitless conversation, or a burned bridge, or a pay cut, or months of unemployment. All I can say is that you—you specifically—are worth something. Your time is worth something, just like your mental health and physical well-being. Your employer spent money hiring you, and they would need to spend money to replace you with another hire. Unless you are absolutely useless or a complete dickhead, losing you would negatively affect your team’s morale and output. Your manager has to go through the performance review cycle just like you do, and losing a direct report is not a good look for them. You have a voice and you have some leverage. It’s up to you what you do with it.

Sleeping through the night

But it’s a five o’clock world when the whistle blows
No one owns a piece of my time
And there’s a five o’clock me inside my clothes
Thinking that the world looks fine, yeah
The Vogues, “Five O’Clock World”

Earlier, when I asked if this was important, I suspect that most readers answered from the perspective of the company. Of course it’s important, why wouldn’t it be? But now I’m asking you, specifically. Is this important to you?

I suspect that a fair number of readers are going to feel that what I wrote is naive, overly cynical, too idealistic, or simply incompatible with the realities of modern business expectations. Perhaps this article will be perceived as a handbook for how to become embittered and then get fired. But detached from all of that, in the innate nature of almost every human that participates in these systems, this can’t possibly sit right. Why do we accept this plainly abusive practice? Why do we go above and beyond to forfeit the enjoyment of our free time to an organization that is unwilling to reciprocate in any meaningful way? To an organization that is perhaps incapable of reciprocating?

It turns out that there are all kinds of different people out there. Some are young (or young at heart) and have nothing better to do outside of work than party and pass out drunk. For them, on-call might be almost fun, like that invigorating feeling somebody might get when they sign up for stage crew in high school and get to screw around in the building after all the other students and staff have gone home and it’s weird and empty. Others have complicated home lives with difficult caretaking situations and really do not need to be dealing with yet another source of stress and anxiety in an existence that is already a hair’s breadth from going completely off the rails. Some people simply do not care about work when they’re not physically there; they clock in, work for the day, then clock out. There is nothing inherently wrong with trying to limit the encroachment of work into life. Each of these people has different priorities, different needs, different values and principles. It is not fair to blindly shoehorn them all into the same on-call rotation and pretend they are going to respond to it the same way.

On average, most of us get around 4,000 weeks of life on this earth. If you’re exceptionally fortunate, you might make it to 5,000. How many of those weeks do you want to spend in the shadow of a pager?

When I was just shy of 2,000 weeks old, I suffered through a particularly acute week of on-call pain. At one point I was in my third or fourth video call about the same long and protracted smoldering SEV and, in a moment of frustrated weakness, I made an offhand comment about just being tired of repeatedly handling the same problem. My manager was present on the call, and my statement seemed to really set him off. I was essentially told that my feelings about the situation—perhaps the only authentic part of myself I ever expressed there—were wrong. In the days that followed I was made to feel like I was not a team player, that I was not pulling my weight, and that I was not meeting the bare minimum of what was expected of a person bearing the torch of on-call. With the utterance of a single sentence, I opened a rift in the relationship with my manager that remained until the day I left that job.

But I meant what I said. I mean it now more than ever: I have been paged enough.
