Test in Production With Charity Majors

Introduction

Adam: Welcome to CoRecursive. Where we bring you discussions with thought leaders in the world of software development. I am Adam, your host.

Charity: Metrics dashboards can all die in a fire. Yeah. And every software engineer should be on call. Boom.

Adam: Hey. Today’s interview is with Charity Majors. We talk about how to make it easier to debug production issues in today’s world of complicated distributed systems. A warning, there is some explicit language in this interview.

I originally saw Charity give a talk where she said something like, “Fuck your metrics and dashboards, you should just test in production more.” It was a pretty hyperbolic statement. But she ended up backing it up with a lot of great insights. And I think you’ll find this interview similarly insightful.

If you are a talented developer with functional programming skills, I’ve got a job offer for you. My employer Tenable is hiring. Hit me up via email to find out more. I’ll also drop a link in the show notes. Tenable is a pretty great place to work if you ask me.

Charity: Okay. I’m hearing you a little.

Adam: Can you actually hear me?

Charity: Oh! I can hear you now.

Adam: Oh really?

Charity: Yeah.

Adam: Aha. Okay. Let’s call this the beginning.

Charity: Cool.

Adam: Charity, you are the CEO of Honeycomb.io.

Charity: Accidental CEO.

Adam: Accidental CEO. Well, thanks for joining me on the podcast.

Charity: Yeah, my pleasure. It’s really nice to be here, thanks.

Debugging Is Complex

Adam: So, I used to be able to debug production issues like, something would go wrong, some ops person would come and get me and then we’d look at things and we’d find out whatever… There’s some query that’s running on this database that’s just like, causing a bunch of issues, and we’ll knock it out. Or okay, we need to turn off this feature and add some caching in front of it. Um, and you know, I always felt like a hero.

Charity: It mostly works.

Adam: Yeah, yeah. And now I’ve woken up into this dark future where… First of all, now I get paged before the ops person sometimes and then… Like things are just crazy complicated.

Charity: Yeah.

Adam: There’s like more databases than people it seems and like every product is Amazon.

Charity: Ten microservices per developer. Yeah.

Adam: Yeah. So uh that’s why I wanted to have you on because I feel like maybe I’m hoping that maybe you have an answer for all this mess.

Charity: Oh, well. I do. The answer is everything’s getting harder. (laughter) And we need to approach this not as an afterthought as an industry. But as something that we invest in, that we expect to spend money on, that we expect to spend time on. And we have to upgrade our tools.

Like the model that you described where you know, you have your trusty rusty ops buddy who kind of guides you through the subterranean [inaudible 00:03:10]. But also like our systems used to be tractable. You could have a dashboard, you could glance at it, you could pretty much know with a glance where the problem was, if not what the problem was.

And you could go fix it, right? Whether it’s like [inaudible 00:03:24], or [inaudible 00:03:25], or you know, pairing somebody with a bunch of infrastructure knowledge, query sniffing, whatever. Finding the component that was at fault was easy. And so you just needed localized knowledge in that code base or that technology or whatever.

As you mentioned, this basically doesn’t work anymore for any moderately complex system. The systems often back into themselves. Their platforms, like when you’re a platform you’re inviting all of your users’ chaos to come live on your servers with you. And you just have to make it work, right? And make sure it doesn’t hurt any of your other users.

Complex [inaudible 00:04:03] problems like that. There’s ripple effects, there’s thundering herds, there’s God knows how many programming languages and how we store it. And databases, don’t even get me started. I come from databases, right? So I am… Yeah. Anyway, the way I have been thinking of it is like we’re just kind of hitting, everyone is hitting a cliff where suddenly, and it’s pretty suddenly, all of your tools and your tactics that have gotten you to this point, they no longer work.

And so like, this was exactly what happened to us. So my co-founder Christine and I are from Parse. Which was the mobile backend as a service, acquired by Facebook. And I was there, I was the first infrastructure engineer. And Parse was a beautiful product. Like, we just made promises. You know, you can write, this is the best way to build a mobile app. You don’t need to worry about the backend, you don’t need to worry about the storage model or anything. We can make it work for you. It’s magic. Which you can translate as a lot of egregious work.

And around the time we got acquired by Facebook, I think we were serving about 60 thousand mobile developers. Which is nontrivial. And this is also when I was coming to think with dawning horror that we had built a system that was effectively undebuggable by some of the best engineers in the world. Like, both of our backend teams were spending like all of our time tracking down [inaudible 00:05:24] which is the kiss of death if you’re a platform. Somebody would be like Parse is down like every day. And I’d be like why is Parse down, like dude, behold my wall full of dashboards, they’re all green. [crosstalk 00:05:36]

Arguing with your users is always a great strategy. So I’d dispatch the engineer, I’d go try to figure out what was wrong. It could be anything. We let them write their own queries and upload them. We just had to make them work. Right? Let them write their own JavaScript and upload it, we just had to make it work. So we could spend more time than we had just tracking down these one-offs. And it was just failing. I tried to, I’ll fast forward through all the things I tried.

Facebook Scuba

One thing that finally helped us get a dent, helped us get ahead of our problems, was this janky ass unloved tool at Facebook called Scuba that they had used to keep other [inaudible 00:06:14] databases a few years ago. It’s aggressively hostile to users. It just lets you slice and dice on any dimensions in basically real time. And they can all be high cardinality fields. And this didn’t mean anything to me.

But we got a handle on our shit and then I moved on. Right? I’m on to the next fire. And it wasn’t until I was leaving Facebook that I kind of went wait a minute. I no longer know how to engineer without the stuff that we’ve built in Scuba. Why is that? How did it worm its way into my soul to the point where I’m like this is how I’m understanding my production systems and it’s like getting glasses and then being told you can’t have them anymore.

It’s just like how am I even going to know how to navigate in the world. So we’ve been thinking about this for a couple years now and I’ll pause for a breath here in a second… But I don’t want to say Honeycomb is the only way to do this. Honeycomb is the result of all of this trauma that we’ve endured like when our systems hit this cliff of complexity. And we really thought at first it was just platforms that were going to hit this. And then we realized no, everyone’s hitting it because it’s a function of the complexity of the systems. You can’t hold it in your brain anymore. You have to reason about it by putting it in a tool where you and others can navigate the information.

Adam: So how did Scuba… Is that what it was called? Scuba?

Charity: Yeah.

Adam: So what did it consume?

Charity: Structured data. It’s agnostic about it, mostly logs. But it was just the fact that it was fast, there was no how do we construct a query and walk away to get coffee and come back. Because you know when you’re debugging it has to be… You’re asking lots of small questions as quickly as you can, right? You’re following cookie crumbs instead of crafting one big question you know will give you the answer. Cause you don’t know what’s going to be the answer, you don’t even know what the question is. Right? Also high cardinality.

And when I say that I mean… Imagine you have a table with 100 million users. High cardinality fields would be any unique ID, right? Social security number. Very high cardinality would be last name and first name, very low would be gender, and species I assume would be lowest of all. So the reason I was laughing when you said fuck metrics, I’ve said that many times. The reason that I hate metrics so much, and this is what 20 years of operations software is built on, is metrics, right?
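To make the cardinality ordering concrete, here is a minimal Python sketch. The rows and field names are invented for illustration and are not from the interview; cardinality is simply the number of distinct values a field takes.

```python
# Hypothetical sample rows; a real users table would have millions of them.
users = [
    {"user_id": 1, "last_name": "Nguyen", "gender": "f", "species": "human"},
    {"user_id": 2, "last_name": "Smith", "gender": "m", "species": "human"},
    {"user_id": 3, "last_name": "Smith", "gender": "f", "species": "human"},
    {"user_id": 4, "last_name": "Garcia", "gender": "m", "species": "human"},
]


def cardinality(rows, field):
    """Number of distinct values a field takes across all rows."""
    return len({row[field] for row in rows})


cardinalities = {field: cardinality(users, field) for field in users[0]}
# Sort fields from highest to lowest cardinality.
ranked = sorted(cardinalities, key=cardinalities.get, reverse=True)
print(ranked)
```

A unique ID grows with the table, which is why it breaks tools that assume a small, fixed set of tag values.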

High Cardinality

Hating Metrics

Well, a metric is just a number. Then you can append tags to it to help you group them. You’re limited in cardinality to the number of tags you can have, generally, which is 100 to 300. So you can’t have more than 300 unique IDs to group those by which is incredibly limiting. So newer things like Prometheus [inaudible 00:08:56] a little bit better but bottom line, it’s very limited and you have to structure your question, your data, just right. Like all the advice you can get online about how to try not to have these problems, which, when you think about it, is stupid.

Because all of the interesting information that you’re ever going to care about is high cardinality. Like request ID, raw query, you know? You just need this information so desperately. And so that I think was the coolest thing for Scuba. It was the first time I’d ever gotten to work with a data system that just let you have arbitrary… So imagine… A common thing that companies will do as they grow is they have some big customers who are more important to them. So they pre-generate all the dashboards for those customers because they don’t have the ability to just break down by that one in 10 million user IDs. And then you’re not accomplishing anything else. When you can just do that, so many things get so simple.

Adam: To make sure I understand it, so if I’m using… Like with metrics I have Datadog, right? And I have Datadog metric and basically I’ll measure this request on my microservice or whatever, like how long does this normally take. Right? So it has some sort of… Just the time that it takes from start to end on that. And I can put it on a graph. Or whatever. So high cardinality as I understand it is saying let’s not just count the single number let’s count everything. What’s the user that requested it.

Metrics Are Devoid of Context

Charity: It’s more like… So every metric is a number that is devoid of context. Right?

Adam: Mm-hmm (affirmative).

Charity: It’s just a number with some tags. But the way that Scuba and Honeycomb work is we work on events. Arbitrarily wide events. You can have hundreds of dimensions; well-instrumented services usually have 300 to 500. So all of that data for the request is in one event. The way we typically will instrument is that we’ll… You know, you’ll initialize an event when the request enters the service. We pre-populate it with some useful stuff. But then while the request is being served we just toss in everything that might possibly be interesting someday.

Any IDs, any shopping cart information, raw queries, any normalized queries, any timing information, every hop to any other microservice. And then when the request is going to exit or error you just ship the event off to us or to Scuba. And then you have all this information and it’s all tied together. Right? The context is what makes this incredibly powerful. Like a metric has zero context. So like over the course of your request in your service you might fire off like 20 or 30 different metrics. Right? Counters, whatever. But those aren’t tied to each other. So you can’t reason about them as all of these things are connected to this one request.
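The instrumentation pattern described here (start an event when the request enters, keep adding fields, send exactly one event on exit or error) can be sketched roughly as follows. This is not Honeycomb's or Scuba's actual client API; `WideEvent`, `handle_request`, the field names, and the `sink` callable are all hypothetical names chosen for the example.

```python
import json
import time


class WideEvent:
    """One event per request; fields accumulate while the request is served."""

    def __init__(self, **initial_fields):
        self.fields = dict(initial_fields)
        self._start = time.monotonic()

    def add(self, **fields):
        self.fields.update(fields)

    def finish(self, sink):
        self.fields["duration_ms"] = (time.monotonic() - self._start) * 1000.0
        sink(json.dumps(self.fields, sort_keys=True))


def handle_request(request, sink):
    # Pre-populate with context known at entry (hypothetical field names).
    event = WideEvent(service="checkout", request_id=request["id"], host="web-1")
    try:
        event.add(user_id=request["user_id"], cart_items=len(request["cart"]))
        if not request["cart"]:
            raise ValueError("empty cart")
        event.add(status=200)
    except ValueError as exc:
        event.add(status=400, error=str(exc))
    finally:
        # Exactly one event leaves the service, on success or on error.
        event.finish(sink)


sent = []
handle_request({"id": "req-1", "user_id": 42, "cart": ["book"]}, sent.append)
handle_request({"id": "req-2", "user_id": 7, "cart": []}, sent.append)
```

Because every field lands in the same record, the user ID, the error, and the timing for one request stay tied together instead of being scattered across unrelated counters.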

This is so powerful because so much of debugging is looking for outliers. Right? You want to know which of your requests failed and then you want to look for what they have in common. Was it that some of the TCP statistics were overflowing? Only on those? Or was it that those are the ones that were making a particular call to a host or to a version of the software? Just being able to just slice and dice and figure that out at a glance is why our time to resolve these issues went from hours or days or God knows to seconds. Just seconds or minutes. Just repeatedly. Because you can just ask the questions.

So I would say to summarize the thing that makes it powerful is the fact that you have all that context. And you have a way to link all of these numbers together. And the fact that you can ask questions no matter how high cardinality is so you can combine them, right? You want to look at the combinations. This unique request ID issued in this query from this host at this time or whatever. It’s precision.
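As a rough illustration of slicing failed requests by different dimensions to see what they have in common, here is a small Python sketch. The events and field names (`host`, `build`, `status`) are made up for the example.

```python
from collections import Counter

# Hypothetical wide events; real ones would carry hundreds of fields.
events = [
    {"request_id": "a1", "host": "web-1", "build": "v41", "status": 200},
    {"request_id": "a2", "host": "web-2", "build": "v42", "status": 500},
    {"request_id": "a3", "host": "web-1", "build": "v42", "status": 500},
    {"request_id": "a4", "host": "web-2", "build": "v41", "status": 200},
    {"request_id": "a5", "host": "web-3", "build": "v42", "status": 500},
    {"request_id": "a6", "host": "web-3", "build": "v41", "status": 200},
]


def failure_rate_by(events, field):
    """Group events by one field and return {value: fraction that failed}."""
    total = Counter(e[field] for e in events)
    failed = Counter(e[field] for e in events if e["status"] >= 500)
    return {value: failed[value] / total[value] for value in total}


by_host = failure_rate_by(events, "host")    # failures spread evenly across hosts
by_build = failure_rate_by(events, "build")  # failures concentrated in one build
```

In this invented data, grouping by host shows nothing unusual, while grouping by build isolates the failures immediately. The point is that any field, however many distinct values it has, can be the one you group by.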

Structured Logs

Adam: It sounds like what I normally do with logs. I have them all gathered somewhere in Splunk or something and I’m searching for things.

Charity: It’s much more… Because logs are just, what, typically unstructured events. They’re just string, right?

Adam: Mm-hmm (affirmative)

Charity: And if you’re structuring your logs then you’re already way ahead of most people. If you’re structuring your logs then I would say I encourage you to structure them very widely. You know, not to issue lots of log lines per request, but to bundle all the stuff together so that you get the additional power of having it all at once. Otherwise, you kind of have to reconstitute it, like, give me all the log lines for this one request ID and you have to do stuff. If you just pack them together then it’s much more convenient. And that’s basically what Honeycomb is plus the columnar store that we wrote in order to do the exploration. You can also think of it like BI systems.
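The "reconstitute it" step that narrow logging forces on you might look something like this sketch. The log lines and field names are hypothetical; the idea is only that several lines sharing a request ID must be merged before you can reason about the request as a whole.

```python
from collections import defaultdict

# Hypothetical narrow log lines: several per request, each with a fragment.
log_lines = [
    {"request_id": "r1", "msg": "start", "path": "/cart"},
    {"request_id": "r2", "msg": "start", "path": "/login"},
    {"request_id": "r1", "msg": "db", "query_ms": 12},
    {"request_id": "r1", "msg": "done", "status": 200},
    {"request_id": "r2", "msg": "done", "status": 401},
]


def reconstitute(lines):
    """Merge every line that shares a request_id into one wide record."""
    wide = defaultdict(dict)
    for line in lines:
        fields = {k: v for k, v in line.items() if k != "msg"}
        wide[line["request_id"]].update(fields)
    return dict(wide)


wide_events = reconstitute(log_lines)
```

Emitting one wide line per request in the first place makes this merge unnecessary at query time.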

Adam: You can think of it like BI systems… I don’t…

BI for Software Systems

Charity: BI for systems. Business Intelligence.

Adam: Oh, okay.

Charity: For systems. You were talking in the beginning about debugging with the ops team in the dashboard, right?

Adam: Mm-hmm (affirmative)

Charity: The ops person was just kind of like along for the ride and filling in all of this intuition, all of this past scar tissue, all of, you know… You weren’t able to explore that information because it wasn’t in a tool, it was in someone else’s brain. This is why the best debugger on the team is always the person who has been there the longest. Right? Because they’ve amassed the most context built up in their brains. Which is… I love being that hero. I love being the person who can just gaze at a dashboard and go it’s Redis. I just feel it in my bones. But it’s not actually good for us as teams. Right? I can’t take a vacation. Nobody else can explore the data.

And I’ve now had the experience three times where the best debugger on the team was not the person who had been there the longest. This was at Parse, at Facebook, and Honeycomb because when you’ve taken all of that data about this system and put it into a place where people can explore it with a tool then the best debugger is going to be the person who’s the most curious and persistent.

Bigger Than Our Heads

Adam: I like what you said about intuition and I find that I describe that problem of debugging something and I know that there’s a person on my team, John, and I feel like he just has a really good model of how the system works in his head.

Charity: Yeah. Yeah. The problem is that systems are getting too large and too complicated and changing too quickly and they’re overflowing our heads across the board. But what you just said there is another thing that I’m so excited about. Our tools as developers, they have not treated us like human beings. They have treated us like automatons. You know. Like how many Vim sequences do you know by heart? Way too many. I know way too many. It’s been like this point of pride which is kind of stupid.

So the thing that we’re really passionate about is building for teams. Looking for ways to bring everyone up to the level of the best debugger. Or the person with the most context and most information about every corner of the systems. Right? Because if I get paged about something and I’m like ugh, shit, is this about Cassandra? I don’t give a fuck all about Cassandra. But Christine does. And didn’t we have an outage that was five or six weeks ago and I think she was on call then. I’m just going to go look at what she did. What questions did she ask? What did she think was meaningful enough to publish to Slack? What got tagged as part of a post-mortem? What comments did she leave for herself?

I learned Linux by reading other people’s bash history files and just trying all the commands. Tapping into that sense of curiosity, almost that snoopy-ness that we have when people are really good at their jobs. We just want to go see how they do things. I’m so excited about tapping into the social… Once we’ve gotten the information out of our heads then how do we help people explore it? How do we make it fun? How do we make it approachable? And how do we make it so that we forget less stuff?

Because when I go to debug a gnarly problem I’m going to do a deep dive and I’m going to know everything about it for like a week. And then it starts to decay. Right? You could ask me two or three months later and I’m just like back to zero. But if I can just have access to how I interacted with the system. What columns did I query? What notes did I leave for myself? What were the most useful things that I did? And if I and my team can access that information then we’ve forgotten a lot less. And that’s nice.

Dashboards Are Looking Backwards

Adam: I find… I have a bunch of dashboards that somebody has kindly made and painstakingly put together and they have helped me before but not that much.

Charity: Yeah and fundamentally you’re consuming very passively. You’re not actually interrogating your system. You’re not forming a hypothesis or asking a question. And the best way to actually get good at systems is to force yourself to ask some questions. You know. To predict what the answer might be. Every time that you look at someone else’s dashboard or even your own dashboard from a past outage, it’s like an artifact. Right?

Adam: Mm-hmm (affirmative)

Charity: You’re not exploring it. It’s a very passive consumption and because it’s so passive we often miss when a data source isn’t there anymore. Or when, you know, it’s like the dog that didn’t bark. I can’t even count the number of times that I’ve been… There’s a spike and I’m just looking through my dashboards looking for the root cause and realizing that oh, we forgot to set up the graphics software on that one. Or it’s stopped sending… Or something like that because you’re not actually actively asking a question you’re just kind of skimming with your eye balls. Just like scanning. Eyes getting tired.

Testing in Production

Adam: Agreed. You said something. I was saying I watched this talk of yours and you said something about how we should be doing more testing in production, or something like that. What does that mean?

Charity: I think what I’m trying to say is that we do test in production whether we want to or not. Whether we admit it or not. Every config change. Even if you devote a lot of resources to try and keep staging in sync with production, assuming it’s even possible that your security conditions and blah blah blah. It’s never exactly the same. Your config files are different, every unique combination of ED deployed, plus the software you use to deploy, plus the environment you’re deploying to, plus the code itself is unique.

There’s literally no way. As anyone who’s ever [inaudible 00:19:20] production knows. You know. There’s some small amount of it that is… It is a test. Because you’re doing it for the first time. And I feel like most teams because there’s this whole you can’t test in production, we don’t do anything in production that isn’t tested. Like, they’re just not admitting reality and that causes them to pour too many of their very scarce engineering cycles into trying to make staging perfect.

Those cycles would be better used making guard rails for production. Making… You know, investing in things like good canary deploys that automatically roll back if the canary is unhealthy, or get promoted if it is good. That part of the industry is starved for resources and I think it’s because we don’t have unlimited resources and the right place to take it from is staging. I think because staging is fragile and full of… it’s just not a good use of time. I think that, and I’m not saying we shouldn’t test before production. Obviously we should run tests.

But those are for your known unknowns. In the future, known unknowns are not really the hardest problems or even the most frequent problems that we have. It’s all about these unknown unknowns. Which is a way, I think, of talking about this cliff that we’re all going off. You know, it used to be known unknowns. You’d get paged, you’d look at it, you’d kind of know what it was. You’d go poke around and you’d solve it. Now, it’s like when you get paged you should honestly just be like uh, what is this? You know. I didn’t see this before. This is new. Or, you don’t really know where to start. Partly because of the sheer complexity and partly just because there are so many more possible outcomes or possible root causes.

You need to stress resiliency in the modern world. Not perfection. And I think that… I’m sort of joking and trying to push people’s buttons when I say I test in production. But also, sort of not. I mean, it’s for real. Like that outage that Google Cloud Platform just had last week. What did they do? A config change. Worked great in staging. They pushed it to prod. Took the whole thing down.

You can’t test everything. So you have to invest in catching things… Failure should be boring. Right? That’s why we test in prod. And you can say experiment in prod, I don’t know, whatever. But I think that for the known unknowns you should test before production. But there are so many things that you just can’t and so we should be investing more into tools that let us test.

And I think that a really key part of that has been observability. We haven’t actually… It’s easy to ship things to production. It’s much harder to tell what impact it has had. That’s why I feel that something like Honeycomb, where you can just poke around, is necessary. Like, I hope that we look back in a couple years at the bad old days when we used to just ship code and wait to get paged. Like, how fucking crazy is that? That’s insane that we used to just like wait for something bad to happen at a high enough threshold that’ll wake us up.

We should have the muscle memory as engineers that if… What causes things to break? Well, usually it’s us changing something. So whenever you change something in prod, you should have the muscle memory to just go look. Did what you expect to happen actually happen? Did anything else obvious happen at the same time? Like, there’s something so satisfying, so much dopamine you can get straight to your brain just by going and looking and finding something and fixing it before anyone notices or complains.

Recovering vs Preventing Production Issues

Adam: So if we have like… In the real world, we have a fixed amount of resources and if we’re trying to decide what percentage of effort should go towards recovering from production issues and what should go towards preventing them.

Charity: Oh, this is such a personal question. It’s based entirely on your business case, right? Like how much appetite do you have for failure. It’s going to be different for a bank than for a, you know… And how old are you? Who are your customers? You know. Startups have way more appetite for risk than companies that are serving banks. You know? It’s very very… There’s no answer that’s exactly the same for any two companies, I think.

Adam: But it sounds like what you’re saying to me is that we should put a lot of effort into recovering from production issues.

Charity: Into resiliency. Yeah. Into early detection and mitigation. Recovery is an interesting word. Often I think it’s just understanding. There’s many changes you have to make. Say you’re rolling out a version of the code that is going to increase some [inaudible 00:23:58] and it’s not a leak, you know it, but you don’t actually know how much because you run it in staging and again you’re not going to have the same kind of traffic, the same variance. So you don’t actually know. So, I’m arguing that you need to roll things out, you need to do it habitually, to make this a very mundane operation. Right? It should roll out to 10 percent, get promoted, run for a while, get promoted to 20 or 30 percent, and you should be able to watch it. So that you know if it’s about to hit out of bounds or something.
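The staged rollout described here (go to 10 percent, watch, promote, watch again, roll back if something goes out of bounds) can be sketched as a small loop. This is a simplified model under stated assumptions: `staged_rollout`, the stage percentages, the error-rate readings, and the threshold are all hypothetical, and a real deploy system would observe each stage over time rather than take a single reading.

```python
def staged_rollout(stages, error_rate_at, max_error_rate):
    """Promote through traffic percentages, rolling back on a bad reading.

    stages: increasing traffic percentages, e.g. [10, 20, 30, 100].
    error_rate_at: callable returning the observed error rate at a percentage.
    Returns (final_percentage, history); final_percentage 0 means rolled back.
    """
    history = []
    for percent in stages:
        observed = error_rate_at(percent)
        history.append((percent, observed))
        if observed > max_error_rate:
            return 0, history  # roll back to the previous version
    return stages[-1], history


# Hypothetical readings: the new version degrades once it takes 30% of traffic.
readings = {10: 0.001, 20: 0.002, 30: 0.08, 100: 0.09}
final, history = staged_rollout([10, 20, 30, 100], readings.get, max_error_rate=0.01)
```

Here the rollout stops at the 30 percent stage and returns to 0, so the bad build never reaches all traffic.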

Adam: Because I think it’s important, like actually… I think it’s just as a developer it gives confidence when you can actually just roll back but not everything can be rolled back I guess is…

Charity: Yeah, especially, like, the closer you get to laying bits down on disk, the more things are roll-forward only.

Adam: Then you start to get sweaty palms. I don’t know, depends but I’ve seen some hair raising database migrations.

Building Your Own Data Store

Charity: Oh God. I come from databases. I have done things with databases that would turn your hair white.

Adam: So you mentioned earlier that you built your own database.

Charity: Oh, no, no, no. I have spent my entire career telling people not to write a database so I’d like to be very clear on this point we have written a storage engine. I’m sticking to it.

Adam: Tell me about your storage engine.

Charity: Oh, it’s as dead simple as we could possibly make it. It’s a columnar store that is really frigging fast. We target one second for the 95th percentile of all queries.

Adam: Why did you need your own data store?

Charity: Well, that’s a great question. Believe me, we tried everything out there. So the operations engineering community for 20 years has been investing in time series databases. Built on metrics. Right?

Adam: Mm-hmm (affirmative)

Charity: And we knew that this was just not a data model that was going to enable the kind of interactive, fast exploration we wanted to support and furthermore, we knew that we wanted to have these really wide, arbitrarily wide events. And we knew that because we’re dealing with unknown unknowns, we knew that we didn’t want to have any schemas. Because anytime that you have to think about what information you might want to capture and fit it into a schema it introduces friction in a really bad way. And then you have to deal with indexes. Like one of the problems with every log tool is you have to pick which indexes you want to support. Some of them even charge by indexes, I think.

But then if you need to ask a question about something that isn’t indexed, well you’re back to like oh, I’m going to go get coffee while I’m waiting for this query to run, right? And then if you didn’t ask the right question you’ve got to go for another ride. It’s just not interactive, it’s not exploratory. So we tried everything out there. Druid came a little close but it still didn’t have that kind of richness. Yeah. We knew what we wanted and so we had to write it. We wrote it as simply as possible. We were using Golang. It is descended from Scuba at Facebook, for sure. Scuba was just 10 thousand lines of C++. It was entirely in memory cause they didn’t have SSDs when they wrote it and it shells out to rsync for replication. It’s janky as fuck.

But the architecture’s nice. It’s distributed so there’s a fan-out model where the query comes in, fans out to five nodes, does a column scan on all five, aggregates, pushes them back up, and if there’s too much to aggregate then it fans out again to another five nodes and repeats. It’s very scalable. We can handle very very high throughput, just by adding more nodes.
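The fan-out shape (split the work, scan columns at the leaves, merge partial aggregates on the way back up, fan out again if there is still too much) can be sketched in a few lines of Python. This is a single-process toy, not the actual storage engine: the function names are hypothetical, the "nodes" are just recursive calls, and a real system would run the leaves in parallel on separate machines.

```python
def scan_column(values, predicate):
    """Leaf work: scan one column shard and return a partial (count, total)."""
    matched = [v for v in values if predicate(v)]
    return len(matched), sum(matched)


def merge(partials):
    """Combine partial aggregates; the result can itself be merged again."""
    partials = list(partials)
    return sum(c for c, _ in partials), sum(t for _, t in partials)


def fan_out_query(shards, predicate, fan_out=5):
    """Split shards among up to fan_out children, recursing until one shard each."""
    if fan_out < 2:
        raise ValueError("fan_out must be at least 2")
    if not shards:
        return 0, 0
    if len(shards) == 1:
        return scan_column(shards[0], predicate)
    step = -(-len(shards) // fan_out)  # ceiling division
    groups = [shards[i:i + step] for i in range(0, len(shards), step)]
    return merge(fan_out_query(group, predicate, fan_out) for group in groups)


# Hypothetical duration_ms column split into 12 shards holding values 0..119.
shards = [[i * 10 + j for j in range(10)] for i in range(12)]
count, total = fan_out_query(shards, lambda ms: ms >= 100)
```

Counts and sums merge cleanly at every level, which is what lets the tree grow wider or deeper as nodes are added.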

Adam: So you’re saying it doesn’t have any indexes or it indexes everything?

Charity: Well, columns are effectively indexes right.

Adam: Yeah.

Charity: So everything is equally fast. Basically.

Adam: It’s sort of like index everything.

Charity: Exactly.

Adam: Because everything’s a column.

Charity: Yes. Yeah and you can have arbitrarily wide… We use a file per column basically. So it’s up to the Linux open file handle limit, which is like 32K or something. It becomes not tractable for humans long before then.
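To show why "everything's a column" behaves like "index everything", here is a minimal in-memory sketch of a column layout. It uses Python lists in place of the per-column files mentioned above, and the function names and sample events are hypothetical.

```python
def to_columns(events):
    """Pivot schemaless row events into one list per field (a column).

    Missing fields are stored as None so every column has the same length.
    """
    fields = sorted({name for event in events for name in event})
    return {name: [event.get(name) for event in events] for name in fields}


def where_equals(columns, field, value):
    """Scan a single column and return the matching row positions."""
    return [i for i, v in enumerate(columns[field]) if v == value]


events = [
    {"host": "web-1", "status": 200},
    {"host": "web-2", "status": 500, "error": "timeout"},
    {"host": "web-1", "status": 500},
]
columns = to_columns(events)
rows = where_equals(columns, "status", 500)
hosts = [columns["host"][i] for i in rows]
```

A query touches only the columns it names, so filtering on any field costs about the same, and a new field simply becomes a new column with no schema change.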

Adam: I like this idea that there is this very janky tool at Facebook that changed the world.

Charity: They can’t kill it, it’s too useful. But it has been not invested in and so it is horribly hard to understand. It’s aggressively hostile to users. It does everything it can to get you to go away but people just can’t let it go.

Adam: Do you think that more people should kind of embrace the chaos?

Build Better Tooling

Charity: Yes.

Adam: And have more of a startup focus.

Charity: Yeah, I do. Yeah. I thought you were going a different direction with that question. But yes, that too.

Adam: Which way did you think I was going?

Charity: Oh, I thought you were going to ask if more people should build tools based in events instead of metrics. And yes, I’m truly… You’re very… I’m, you know, opening the door. We’ve been in talks, and then we built our storage engine. As an industry we have to make the jump from very limited…The thing about metrics is also they are always looking for aggregate and the older they are the less fine grained they are, right? That’s what [inaudible 00:29:26], by aggregating at the right time. We drop data by sampling instead because it is really really powerful to have those raw [inaudible 00:29:34]. This is a shift that the entire industry has to make, and I almost don’t care if it’s us or not. That’s a lie, I totally care if it’s us or not.

But there needs to be more of us. Right? This needs to be a shift that the entire industry makes. Because it’s the only way to understand these systems. It’s the only way I’ve ever seen.
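The contrast between aggregating and sampling can be illustrated with a short sketch. It assumes a simple uniform 1-in-N sampler that records its rate on each kept event, which is one common approach and not necessarily how Honeycomb implements it; all names here are hypothetical.

```python
import random


def sample_events(events, sample_rate, rng):
    """Keep roughly 1 in sample_rate events, recording the rate on each one."""
    kept = []
    for event in events:
        if rng.randrange(sample_rate) == 0:
            kept.append({**event, "sample_rate": sample_rate})
    return kept


def estimated_count(sampled, predicate):
    """Each kept event stands in for sample_rate original events."""
    return sum(e["sample_rate"] for e in sampled if predicate(e))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
events = [{"request_id": i, "status": 500 if i % 10 == 0 else 200} for i in range(10_000)]
sampled = sample_events(events, sample_rate=20, rng=rng)
estimate = estimated_count(sampled, lambda e: True)
```

The kept events are still raw, with every field intact, so they can be sliced by any dimension later. Totals become estimates, which is the trade made in exchange for keeping that detail.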

We should talk about Tracing, real quick, because tracing is just a different way of visualizing events. Tracing is the only other thing that I know of that is oriented around events. Oh, what I was starting to say was that metrics are great for describing the health of the system. Right? But they don’t tell you anything about the event because they’re not fine grained and they lack the context. And as a developer we don’t care about the health of the system. If it’s up and serving my code, I don’t give a shit about it. What I care about is every request, every event and I care about all of the details from the perspective of that event.

Right? And we spend so much time trying to work backwards from these metrics to the questions that we actually want to ask and that’s that bridge right there, is what is being filled by all of this intuition. You know, and jumping around between tools like jumping from the metrics and aggregates for the system to jumping into your logs and trying to grep for the string that you think might shed some light on it.

Everything becomes so much easier when you just ask questions from constructive feedback. Tracing is interesting because tracing is just like Honeycomb except it’s depth first. Honeycomb is breadth first, where you’re slicing and dicing between events trying to isolate the ones that have characteristics that you’re looking for. And tracing’s about, okay, now I’ve found all those events, now tell me everything about it from start to finish.

And we just released our tracing product and what’s really frigging cool about it is you can go back and forth. Right? You can start with… I don’t know the question, all I have is a vague problem report so I’m going to go try and find something, find malware, find an error that matches this user ID, query, whatever. Oh, okay, cool. I found it. Now, show me a trace. Trace everything that hits this host or this query, or whatever. And then once you… and like oh, cool, here, I found one. Then you can zoom back out and go okay, now show me everyone else who was affected by this. Show me everyone else who has experienced this.

And we’ve been debugging our own storage engine this way for about three or four months now. It is mind blowing just how easy it makes problems.
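A minimal sketch of the back-and-forth workflow described above, assuming each event is a plain dictionary tagged with a `trace_id`. The field names, the sample data, and the helper functions (`find_events`, `trace`, `who_else`) are all hypothetical and are not Honeycomb's API; they only illustrate slicing across events (breadth first), following one request end to end (depth first), and then zooming back out.

```python
from collections import defaultdict

# Hypothetical events: one dict per unit of work, each tagged with a trace_id.
EVENTS = [
    {"trace_id": "t1", "span": "api", "parent": None, "user_id": 42, "status": 500},
    {"trace_id": "t1", "span": "db.query", "parent": "api", "user_id": 42, "status": 500},
    {"trace_id": "t2", "span": "api", "parent": None, "user_id": 7, "status": 200},
    {"trace_id": "t2", "span": "db.query", "parent": "api", "user_id": 7, "status": 200},
    {"trace_id": "t3", "span": "api", "parent": None, "user_id": 99, "status": 500},
    {"trace_id": "t3", "span": "db.query", "parent": "api", "user_id": 99, "status": 500},
]


def find_events(events, **filters):
    """Breadth first: slice across all events by arbitrary field values."""
    return [e for e in events if all(e.get(k) == v for k, v in filters.items())]


def trace(events, trace_id):
    """Depth first: every span of one request, root first, children after parents."""
    children = defaultdict(list)
    for e in events:
        if e["trace_id"] == trace_id:
            children[e["parent"]].append(e)
    ordered, stack = [], list(reversed(children[None]))
    while stack:
        span = stack.pop()
        ordered.append(span)
        stack.extend(reversed(children[span["span"]]))
    return ordered


def who_else(events, span, status):
    """Zoom back out: which users hit the same span with the same status?"""
    return sorted({e["user_id"] for e in events
                   if e["span"] == span and e["status"] == status})


# 1. Start from a vague report about user 42 and find a matching error.
hits = find_events(EVENTS, user_id=42, status=500)
# 2. Follow that one request from start to finish.
spans = trace(EVENTS, hits[0]["trace_id"])
# 3. Ask who else experienced the same failure.
affected = who_else(EVENTS, span="db.query", status=500)
```

The point of the sketch is that both views read the same events, so moving between them needs no change of tool.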

Adam: Yeah, that sounds powerful for sure. I guess we’re kind of getting the tools back that we lost when we split up into a million different services? In some ways.

Charity: Yeah, totally. It’s kind of like distributed GDB.

Adam: Mm-hmm (affirmative) So I don’t talk to too many ops people on this podcast. I wanted to ask you what do you think developers have to learn from an ops culture or mindset or?

Learning From Operations

Charity: Oh, that’s such a great question. First of all, I heard you say that you’re on call and you get paged before the ops people. Bless you. This is the model of the future. And I want to make it clear that I don’t want to… Ops has had a problem with masochism. Like, for as long as I’ve been alive. And the point of putting software engineers on call is not to invite them into the masochism with us. It’s to raise our standards for everyone in the amount of sleep that we should expect. I feel very strongly about this. The only way to write and support good code is by shortening those feedback loops and putting the same people who write software on call for it. It’s just necessary.

In the glorious future, which is here for many of us, we are all distributed systems engineers. And one thing about distributed systems is that they have a very high operational cost. Right? Which is why software engineers are now having to learn ops. I’ll often say that I feel like the first wave of [inaudible 00:33:29] yelling about suing [inaudible 00:33:29] and learn how to write code, you know. And we did. Cool. We did.

And I feel like now for just the last year or two the pendulum’s been kind of swinging the other way and now it’s all about, okay software engineers, you know, it’s your turn. It’s your turn to learn to build operable services, it’s your turn to learn to instrument really well, to make the systems explain themselves back to you, it’s your turn to pick up the ownership side of the software that you’ve been developing.

And I think that this is great. I think that this is better for everyone. It is not saying that everyone needs to be equally expert in every area. It’s just saying that what we have learned about shipping good software is that everyone should write code and everyone should support code. Something like 70 to 80 percent of our time is spent maintaining and extending and debugging, not greenfield development. Which is fundamentally… Software engineers do more ops than software engineering.

So I think it makes sense to acknowledge that fact. I think it makes sense to reward people for that. I am a big proponent of this: you should never promote someone to senior engineer if they don’t know how to operate their services. If they don’t show good operational hygiene. You have to show that this is what you value when you work, not what you kick around and deprioritize. And people pay attention to signals like promotions and pay grades and who thinks they’re too good for what work.

Adam: Definitely. I think there is an ops culture, maybe it’s just my perception. You mentioned, like, masochism. I don’t know which way the causation and correlation go. That there’s a… I don’t know if developers are going to become more… There’s a certain attitude sometimes, and I don’t think there’s anything wrong with this, but of, you know, like, you know, call me when something’s on fire. You know. That’s when I’m alive. When things are breaking and…

When Things Are on Fire

Charity: Yeah. Believe me, I am one of those people. I love it. I’m the person you want to call in crisis. If the database is down, we’re not sure if it’s ever coming up again, the company might be screwed, I am the person that you want at your side. I’ve spent my career working myself out of a job repeatedly. I guess I’m a startup CEO now. But that aside, you could both enjoy something and recognize that too much of it is not good for you. I enjoy drinking but I do try to be responsible about it.

Yeah, I don’t know. I think the things that you call out and praise in your culture are the things that are going to get repeated. And if you praise people for fire fighting and heroics, you’re going to get more of that. And if you treat it as an embarrassing episode, that we’re glad we got through it together, privately thank people or whatever, but you don’t call it out and praise it. You make it clear that this is not something that you value. That it was an accident and you take it seriously. You give people enough time to execute all of the tasks that came out of the post-mortem instead of having a retrospective, coming up with all this shit, and then deprioritizing. Going on a detour. That doesn’t say yes we value your time and we don’t want to see more fire fighting.

I think that these organizational things are really the responsibility of any senior management.

Adam: Mm-hmm (affirmative) It’s a tricky problem. I wanted to ask you. You are now a CEO. Do you still get to work as an individual contributor? Or do you still get to fight fires and get down in the trenches?

Contributor to Manager and Back

Charity: I am fighting fires but not of the technical variety. I wanted to be CTO, that’s what I was shooting for. But circumstances… I don’t know. I believe in this mission. I have seen it change people’s lives. I’ve seen it make healthier teams. And I am going to see it through. I really miss sitting down in front of a terminal every morning. I really really really do. But I’ve always been highly motivated by what has to be done.

You know, I don’t play with technology for fun. I get in, in the morning, and look at what needs to be done. So, I guess this is just another variation on that. This is what needs to be done. I spent a year trying to get someone else to be CEO, I can’t find someone, that’s fine. I’m in it for now. I’ll just take it as far as I can.

Adam: That’s a very pragmatic approach. I always worry if they take my text editor away from me I’ll never get it back. Just because I’ve seen it happen to other people.

Charity: For sure. I’ve written a blog post about the engineer/manager pendulum because I believe that the best technologists that I’ve ever gotten to work with were people who had gone back and forth a couple of times. Because the best tech leads are the ones who have spent time in management, they’re the ones with the empathy and the knowledge for how to motivate people and how to connect the business to technology. And explain it to people in a way that motivates them.

And the best line managers I’ve ever had were never more than two or three years removed from writing code, doing hands-on work themselves. And so, I feel it’s a real shame that it’s often a one-way path, and I think it doesn’t have to be if we’re assertive about knowing that what we want is to go back and forth. Certainly that’s what I hope for myself.

Adam: Yeah, there doesn’t seem to be a lot of precedent for switching back and forth.

Charity: There isn’t, but since I wrote that piece I still get contacted by people every day just saying thank you, this is what I wanted, I didn’t know it was possible, I’m totally going to do it now. I actually wrote it for a friend of mine at Slack who was considering going through that transition. I was just like yeah, you should do it. And I wrote the post for him and he went back to being an IC and he’s so much happier.

Adam: So he went back to being a contributor rather than in a management role?

Charity: Yeah, he was the director. And he’s having an immense amount of impact in his senior IC role because he’s been there for so long he knows everything. He can do these really great industry moving projects.

Adam: Oh, that’s awesome. How are you doing for time? Do you have to run?

Charity: I don’t know. Let me see. I can be a few minutes late.

Adam: So do you like dashboards or not?

Dashboards and Over-Paging

Charity: I think that some dashboards are inevitable. You need a couple of top level… I think that a couple are inevitable but they’re not a debugging tool. Right? They’re a state of the world tool. As soon as you have a question about it, you want to jump in and explore and ask questions and I don’t call that a dashboard. Some people do. But I think it’s too confusing. Interactive dashboards are fine. But you do have to ask that question. You need to support… And what about this? And what about for that user? What about… I don’t care what you call it as long as you can do that.

I also think that a huge pathology right now of these complex systems is that we’re over paging ourselves. And we’re over paging ourselves because we don’t actually trust our tools to let us ask any question and isolate the source of the problem more quickly. So we rely on these clusters of alerts to give us a clue as to what the source is. And if you actually have a tool with this data that you trust I think that the only pages you actually really need are requests per second, errors, latency, maybe saturation, and then you just need a dashboard that at a high level, shows you that.

And then whenever you actually want to dig in and understand something, then you jump into more of a debugging framework.
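As a rough sketch of paging on only the handful of signals named above (requests per second, errors, latency, and maybe saturation), the following assumes a one-minute summary window. The `Window` type, the `pages` function, and every threshold value are hypothetical placeholders, not recommendations from the interview.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Hypothetical summary of one time window for a service."""
    requests: int
    errors: int
    p99_latency_ms: float
    saturation: float  # fraction of capacity in use, 0.0 to 1.0
    seconds: int = 60


def pages(w, min_rps=1.0, max_error_rate=0.01,
          max_p99_ms=1000.0, max_saturation=0.9):
    """Return the list of top-level signals that should page someone.

    All thresholds are illustrative assumptions.
    """
    reasons = []
    if w.requests / w.seconds < min_rps:
        reasons.append("requests_per_second")
    error_rate = w.errors / w.requests if w.requests else 0.0
    if error_rate > max_error_rate:
        reasons.append("errors")
    if w.p99_latency_ms > max_p99_ms:
        reasons.append("latency")
    if w.saturation > max_saturation:
        reasons.append("saturation")
    return reasons


healthy = pages(Window(requests=6000, errors=10, p99_latency_ms=200.0, saturation=0.5))
degraded = pages(Window(requests=6000, errors=300, p99_latency_ms=1500.0, saturation=0.95))
```

Everything finer grained than these signals is left to interactive debugging rather than to more alerts.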

Adam: So these issues that you talked about before, like a specific user… Like you would never get paged for that. How would that come to your attention?

Alerts Won’t Cut It

Charity: That is a great question, that is a great point.

So many of the problems that show up in these systems will never show up in your alerts, or else you’re over-alerting, right? ‘Cause they’re localized. This is another thing that’s different about the systems that we have now versus the old style of systems. It used to be that everyone shared the same tools and, you know, a tier for the web, for the app, for the database. And so they all had roughly the same experience.

With these new distributed systems, you know… Say you had 99 percent reliability in your old system. Well, that meant that everyone was erroring like 1 percent of the time. On the new systems it more likely means that the system was 100 percent up for almost everyone, but everyone whose last name starts with SHA, who happens to be on this shard, you’re 100 percent down. Right?
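To make the arithmetic concrete, here is a small worked example with made-up numbers: 100 shards receive 100 requests each, and one shard (number 37, chosen arbitrarily) fails every request. The function name `error_rates` and the data layout are assumptions for illustration only.

```python
from collections import Counter


def error_rates(requests):
    """Return (overall error rate, per-shard error rate) for (shard, ok) pairs."""
    total, failed = Counter(), Counter()
    for shard, ok in requests:
        total[shard] += 1
        if not ok:
            failed[shard] += 1
    overall = sum(failed.values()) / sum(total.values())
    per_shard = {shard: failed[shard] / total[shard] for shard in total}
    return overall, per_shard


# Hypothetical traffic: 100 shards, 100 requests each, shard 37 is hard down.
requests = [(shard, shard != 37) for shard in range(100) for _ in range(100)]
overall, per_shard = error_rates(requests)

# The aggregate still reads as 99 percent reliable...
print(f"overall error rate: {overall:.2%}")
# ...while every user on shard 37 is completely down.
print(f"shard 37 error rate: {per_shard[37]:.0%}")
```

The same 1 percent aggregate would also describe a system where every user fails one request in a hundred, which is why the top-level number alone cannot distinguish the two situations.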

If you’re just going on top-level percentages, your paging alerts are not going to be reflective of the actual experience of your users. So then you’re like why even generate alerts for every single combination of blah blah blah and then you’re just like [inaudible 00:42:49]. So honestly, a lot of the problems that we are going to see are going to come to us through support. Or through users reporting problems. And over time as you interact with the system you’ll learn what the most common high signals are, maybe you’ll want to have an end-to-end check that traverses every shard, right? Hits every shard. Or something like that. It’s different for every architectural type. But I don’t remember what the question was.

Oh, I was just talking about the difference in systems. There are so many ways that systems will break, they only affect a few people though. So it makes the high cardinality questions even more important.
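The end-to-end check that hits every shard could look something like the sketch below. It assumes a caller-supplied `probe` function that exercises one shard and returns a truthy value on success; `check_all_shards`, `fake_probe`, and the shard names are hypothetical stand-ins for whatever a real architecture exposes.

```python
def check_all_shards(shards, probe):
    """Run the probe against every shard and collect the ones that fail.

    A shard counts as failed if the probe returns a falsy value or raises.
    """
    failures = {}
    for shard in shards:
        try:
            if not probe(shard):
                failures[shard] = "probe returned unhealthy"
        except Exception as exc:  # one broken shard must not stop the sweep
            failures[shard] = f"{type(exc).__name__}: {exc}"
    return failures


def fake_probe(shard):
    """Stand-in for a real request routed to a specific shard."""
    if shard == "shard-2":
        raise TimeoutError("no response")
    return True


failures = check_all_shards(["shard-1", "shard-2", "shard-3"], fake_probe)
```

Because the check reports per shard rather than as a percentage, a single fully broken shard shows up even when the aggregate looks healthy.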

Customer Support and Empathy

Adam: And you were mentioning developers should be able to operate their systems. I think actually developers should spend time doing support. It’s horrible, it’s not fun.

Charity: God, yes. No, but it really teaches you empathy for your users.

Adam: Yeah, and so the issue with, whatever you said, users with the last name SH, that’ll come in. That’ll come in as a support ticket, and if I’m busy and I’m a developer, I’ll be like, that doesn’t make sense. Are you sure? But if I’m the one who has to deal with this ticket…

Charity: No, totally. Yeah, we’re big fans of having everyone rotate through on call, rotate through support, triaging. It doesn’t even have to be that often, maybe once a quarter or so is enough to keep you very grounded.

Adam: It’s like an empathy factor, I think.

Charity: It really is. The thing that separates good senior engineers for me, is that they know how to spend their time on things that have the most impact. Right? Business impact. But what does that mean? Well often it means things that actually, materially, affect your users’ experience. And there’s no better way than just having to be on support rotation. ‘Cause if you aren’t feeding your intuition with the right inputs your sense of what has the most impact is going to be off. Right? I like to think of the intuition as being something you have to cultivate with the right experiences and the right shared experiences. You want a team that kind of has the same idea of what makes important, important.

Operations Comradery

Adam: As a team, I feel like there are healthy teams and unhealthy teams.

Charity: Mm-hmm (affirmative)

Adam: But, I mean, some teams really gel. And I always feel like the ops people tend to be more cohesive than other groups.

Charity: I think so too. A lot of it is because of… It’s like the band of brothers effect. Right? You go to war together. You have each other’s backs. You know, getting woken up in the middle of the night. There’s just a… Every place that I’ve ever worked the ops team has been the one that just has the most identity, I guess. The most character, the most identity, the most in-jokes, usually very ghoulish, you know, graveyard humor.

But I think that the impact of a good on call rotation is that there is this sense of shared sacrifice. And I would liken that to salt in food. A teaspoon makes your meal amazing. A cup of it means that you’re crying. A teaspoon of shared sacrifice really pulls a team together.

Adam: Yeah. You don’t want it to be like the person can’t sleep at night.

Charity: No. But like if one of the people on your team has a baby then everybody just like immediately volunteers because they’re not going to let them get woken up by the baby and the pager, they’re just going to fill in for them for the next year. That type of thing. Lowering the barrier, it should just be assumed that you want to have each other’s backs. That nobody should be too impacted. As an ops manager, whenever somebody got paged overnight I would encourage them not to come in or to sleep in or I would take the pager for them for the next night or something like that. Looking out for each other’s welfare and well-being is something that binds people, I think.

Adam: Definitely. Well, it’s been great to talk to you. Is there anything… I liked your controversial statements about, you know, fuck metrics. What else ya got?

Every Software Engineer Should be On-Call

Charity: Metrics, dashboards, can all die in a fire. Yeah. And every software engineer should be on call. Boom.

Adam: All right, there’s the title for the episode.

Charity: There ya go. Going to make a lot of friends here.

Adam: All right, that’s the show. Thank you for listening to the CoRecursive podcast. I’m Adam Bell, your host. If you like the show, please tell a friend.

CORECURSIVE #019
