Data and Scale

With Pat Helland

Data and Scale

Pat Helland has a wealth of knowledge on building distributed data stores. He has been working on distributed data stores since 1978, when he worked on the tandem fault-tolerant database. Since then he has been involved in many distributed database projects.

Here is the key thing, he is also a master at explaining the key ideas of distributed systems using simple language and practical everyday examples. Can you get married on the phone? How are messaging systems and idempotence like regional offices communicating via fax machine? These are the type of metaphor that Pat uses. Today, Pat sits down with me and teaches me about dealing with data in a distributed, fault tolerant, infinitely scaling world.


Note: This podcast is designed to be heard. If you are able, we strongly encourage you to listen to the audio, which includes emphasis that’s not on the page


Adam: Hello, this is Adam Gordon Bell. Join me as I learn about building software. This is CoRecursive.

Pat: You have to think about the crazy idea for sometimes years before you have it organized well enough to make a story out of it. Because if you don’t have a story, you don’t have a paper, and if you don’t have a purpose and a point to make, you don’t have a paper. And so it just takes a lot of noodling in order to make it happen.

Adam: That was Pat Helland. He’s been working with databases since 1978. Today, we talked about where we should store data. Should we store it in a relational database on a single server? Should we store it in some sort of distributed database? What are the trade-offs the various stores represent? Questions like, could we make an infinitely scalable service? How does immutable data make distributing things easier? If you liked= this episode, give us a review on iTunes or Twitter or mention on Reddit or Hacker News or wherever.

40 Years of Data Stores

Hey, I’m Adam Gordon Bell, and this is CoRecursive. Today, I’m speaking with Pat Helland, Pat, thanks for joining me.

Pat: Well, thank you for having me. I’m looking forward to our chat.

Adam: I am as well. Pat, you work at Salesforce, and I’ve written down here that you’ve basically been working on data storage and writing bits to disk since 1978.

Pat: Correct. I had hair at that time, I don’t know, but I’ve been working on databases and storage and transactions and distributed systems and built a multiprocessor and working on application platforms and stuff since 1978. It’s been a long and checkered career of building systems and then having thoughts about it, and sometimes having the opportunity to write that down and share.

Adam: You have a lot of great articles, which I think will cover some of that ground on the ACM.

Pat: It’s actually one of my favorite things to do, is to carve off of half a day or a day to work on the next paper. I’m always boring my friends with the titles and ideas for papers that are not yet written.

Adam: Yeah, you have great titles. I’m trying to think, I forget now, I can’t think of any one.

Pat: Well, just right now today on ACM Queue Online was Identity By Any Other Name, which is talking about how, when you get into a distributed system, things are knit together but identifiers. And you may have different names for the identifiers, their pointers, their IDs, their whatever, but it’s pretty much identifiers that are tying everything together once you get into a distributed system.

Adam: Yeah. Standing on the Shoulders of Distributed Giants is another one. I just like your titles, you tend to be playful with them.

Pat: Standing on the Shoulders of Distributed Giants, which plays off of the physics and physicists because it’s picking up five different physicists and giggling about the things that for which they’re famous proceeded our distributed systems ideas. It’s pedagological mechanism to explain how the heck distributed systems work. And I think that was just fun and silly, and I had a good time with it.

When Did Data Become So Complex?

Adam: Here’s why I brought you here, data stores and distributed systems. When did everything become so complex? I really love relational databases. And for a long time, I felt like that’s all I needed, but now I live in a world and there’s like Kafka and key-value stores and so on. I’m trying to decide when which is the right tool, so you’re here to be my teacher, hopefully.

Pat: Well, let me give you a perspective. I fell in love with building databases and transaction systems in ‘78 because the thing I absolutely loved is that the API was begin transaction, do some stuff in the transaction. And underneath it, we needed hundreds of thousands of lines of code, and now growing into millions of lines of code to just deal with all the goop that happens environmentally so that the user doesn’t have to think about that. And that was just a phenomenal, phenomenal advancement back in the day when were running on one computer. And it became under more pressure as we were trying to run over thousands of computers because you just can’t pull everything into one place at one time.

Between Being And End Transaction, A Singularity

Adam: So what happens in that thousands of lines of code between begin and end transaction?

Pat: Well, you have to organize the data so you can locate it on disk efficiently. For relational systems, you have to create the inverted indices so that you can look things up by different values that are kept in there, and you have to make sure it happens atomically. You have to deal with the fact that if something croaks, how are you going to put it back together if something dies? Because that’s a very important part. You want to lower the burden for the developer that’s using the system by taking those complexities on within the system. You and I’ve talked about it just a bit ago.

One of the short papers I wrote is called The Singular Success of SQL, and that’s the thing I based the column that I write papers… I write papers for ACM. There’s a column, I titled the column Escaping the Singularity, It’s Not Your Grandmother’s Database Anymore. The point I wanted to make there is that to make relational algebra work, to do relational systems, you have to make all the data’s there, all the data accessible, and all the data still because you need to be able to do these set oriented operations across all this stuff and not get tangled up in other people changing other things, and not get tangled up in the data not being here now.

It works when you bring everything together to one point in space and together at one point of time, and that’s known as a singularity when you collapse space and time. Now, transactions make it look like you’ve collapsed time, even though you might have other stuff running against the database at the same time, and that’s part of the fun of those. But the model, the abstraction is one point in space and time. Now, our systems are so darn big, so darn complex, that doesn’t work so much anymore, it only works as one subset of the solutions that it takes to compose a broader system.

At What Size Does Rdms Break Down

Adam: So at what size does that break down?

Pat: I don’t know. There’s no black and white. It works pretty well for a handful of computers that are close together or one computer, certainly, but when you start getting into the scale of big web solutions, it becomes a challenge and that becomes one of the design points we have to deal with. And then you get into people who have quite legitimately, it’s really an interesting thing, the key-value store that’s scalable, that you shard. It’s a fascinating data point. And I wrote a paper in 2007 called Life beyond Distributed Transactions: an Apostate’s Opinion, because I was a believer in distributed transactions.

I believed they solved all the problems of the free world, and then I started seeing things getting brittle around the edges and getting challenging, and availability being a challenge when one system went down, and that kind of stuff. And so people start doing other things. And in this new world, what we see people doing is sharding their data so that it can run across multiple machines, and then there are patterns by which you can have your workflow do steps one transaction at a time, step one here, step two there, step three there, and allow the system to scale. And so that’s what that paper from 2007 is about.

Adam: Sharding is basically… So you’re saying, “Oh, my data is too big now for one relational database, so I could divide it into many relational databases, partitioning it over some key?”

Pat: Maybe you can, but the question is, “Okay, how do I identify that piece?” And so if the system’s going to pick that piece up because of the system that’s holding it, the server that’s holding is getting too fat, it’s got too many pieces, it can’t run more. So now it’s going to pick some stuff up and move it from this server to that server. Okay, that’s cool. But what that means is I have to have a bite-sized piece, and I called it an entity when I was doing this in this paper, it’s just a term, it doesn’t have to be the only term.

And the idea is that you can have transactions within that entity, but I can’t have transactions across identities because if I can’t do transactions across systems and I move one of these things from this system to that, then it becomes a challenge because now I’ve got to do transactions across systems if I had a semantic in my app of working across multiple entities to do the transaction. So how do you design for that? You start out and you say, “This is all I’m ever going to use for a transaction, so if it’s working from one entity to another, and there’s another one in the same server, I’m not going to do transactions across them because one might get moved to another.”

And fundamentally, you’ll have to pick an identifier for these things, which is the boundary, it’s the boundary within which you do transactions. And so that’s a trick people use.

Sharding Example

Adam: Let me make sure I understand that. You’re saying that we have an entity and we’re limiting our transaction scope to a single entity.

Pat: Correct.

Adam: Actually, do you have an example maybe that would put some meat on this?

Pat: Say I’m working on one particular order, I’m processing the work for a workflow to ship some stuff for an order. And the order identity and the order information and the information about who’s the shipper and the information about the inventory that I’ve got from here and there, it all fits into one entity and I can put it on one box. Matter of fact, I can put hundreds of thousands of them on one server, but as time goes on, there’s going to be more of them that fit on a single computer. And so if I can’t do transactions across computers, because that’s a challenge, then how do I make sure that I stick within it? Well, I only work on one order at a time and then I commit that work, and then I work on the next order at a time.

And so, now the question is, how do I hook up the order processing to the shipper? Well, I probably have messaging that goes from this box to the computer’s or computers that hold the shipping information for the shipper, and so then I’ll send something off to there. And that composition allows me to then expand and scale the system without things breaking.

No More Transactions

Adam: Yeah. And also, the place where I get concerned is, it’s also complicated. If I had my relational database, I could update the order table and the inventory table at the same time, right?

Pat: Yup.

Adam: And now I can no longer do that. Is that right?

Pat: In many cases, for many large distributed systems, yes. And so it adds a challenge in building the application because you can no longer transparently view all these tables and just do what you want to do. You’re correct. But it also builds a system that can handle, scale and continue to function.

Adam: I have my order, say I ordered three pens, and then there’s an inventory system that tracks the pens. And so I send a message to the inventory system say, “I would like three pens”?

Pat: Something like that. Yeah. You have to design what the protocol is, yes. And you get into this world of, “How many pens do I order? How many pens do I ship? Have I promised more pens than I’ve actually got?” There’s all the overbooking challenges. Applications are fascinating and funds system. And the question is, what do you build it on top of? But there is no doubt that building a scalable distributed system has more challenges than a centralized one. But let me give you a twist on that too. I need to be having my company dealing with my suppliers and my customers, but I have to do that kind of work across trust boundaries anyway, I have to do that kind of planning for how do I do my piece, and then send a message to the partner to do their piece.

And so that kind of work is what we’ve been dealing with for decades as we build systems that connect to other companies. When I was a kid, you connected companies by sending faxes to someone who sat at a terminal who then typed it into the local computer, and the computer only dealt with your company’s stuff. And then you would print things out and then you would send them via fax or snail mail to the other company. And so that meant there were people in the loop to try to deal with some of these challenges of ordering and reordering it. Do I have enough inventory? And did I promise more than I’ve got?

And so where we’re dealing with what’s going on now as much of that stuff, we’re trying to figure out patterns for codifying it so that we can actually make it all connected from computer to computer, even across trust boundaries.

Faxes Solve Everything

Adam: This was all on the basis of you coming up with a theoretical, infinitely scalable system. And if I understand correctly, your solution involves fax machines.

Pat: No, not now. That’s what it was when I was a kid, way before this time. But that’s funny, that’s good. I used to refer to that as swivel chair integration when you’d have to look at two screens and type between them. So that’s the way it was way back in the day. Now, we’re trying to put systems and hook them together again.

Adam: It’s an interesting metaphor because in the system, so if somebody orders some pens and then I go to ship out the order, then I want to tell the warehouse that I need to have those pens. In the transactional world, like I could take things from the inventory and send out the order at the same time. But now, I have a world of potential things that can go wrong, right? Like if I ask for two of X and then I asked for two of Y and the two of that, and then there’s not enough said. I have to unroll this somehow, right?

Pat: Correct.

Adam: So how do I do that?

Pat: Carefully. No, at some level, things become not about changing the pen value from five to three, it becomes about having a thing that’s there, which is, I wish to order two pens. And so that’s an operation, and that’s what we built. When we do these loosely coupled systems, we build operations that say, “This is my intent.” And then the question is, can I apply them and unapply them? And how can I reorder that?” And so how can I say, “Oh gee, I took away two pins, but now the shipment arrived with new pens.” The ordering of it doesn’t matter as I increase and decrease the inventory as long as I don’t bottom out and run out. And even if I bottom out and run out, what am I going to do about that? Do you see what I’m saying?

Because it’s not about me viewing that inventory as a number that I can just automatically do arithmetic on, it’s about me writing down the intention to get this many things shipped. And the reordering of that is an important concept. I talk a lot about interchangeability because the thing that I’m dealing with, there’s a bunch of them that are the same, and so I can have differing orders of that. You see that when you go to a hotel. The hotel gives you a reservation for a king size non-smoking room, but they don’t give you a reservation for room 301.

The rooms are equivalent, they’re categorized as equivalent, even though one of them is next to the elevator and a bit noisier, but they’re still equivalent, so you’re selling one of the pool. That allows them to deal with people checking out, late, checking out, early, assign you any one of those, that pool, and still have a successful completion to their commitment. And the same thing happens when you have pens in inventory, you typically have a stock keeping unit, which has got a certain number of pens and each is the same as the other. It is that ability to talk about computing operations having interchangeable resources that allows the ordering to be swizzled around and still be successful.

And that’s a huge thing that we have to deal with as we’re talking about bringing systems together. And I have a partially written paper on interchangeability that I hope to get done soon.

Manual Locking in Distributed Systems

Adam: To me, I think the interchangeability makes sense, but there’s also a feel of things, as an outsider who’s never worked on databases, that I’m building my own database system. So my inventory system, it gets a message that like this person needs two pens, and you have to set those aside, it’s almost like locking on something and then waiting until the order’s confirmed and then-

Pat: It certainly is a form of locking, there’s no doubt about it. And there are actually business decisions that get involved in there. And I wrote about this in a paper called Building on Quicksand in 2009. One of the conditions, one of the things to think about is, do I want to over provision? That means I’m selling you pens, and for every pen, I’m going to actually have the pen and I can walk into the warehouse and touch it. I promised it to you, and it is there. And if things go well and assuming I don’t run over the pens in the forklift, then I will be able to fulfill every committed order. That’s one choice.

But that also means I’m spending a lot of money for inventory that isn’t being used. It happens that way. I can do is overbook. I can say, “I’m going to sell 10% more pens than I actually have.” And so that means that when things get canceled, it can work out correctly if that 10% is the right number for the prevailing business conditions and how it gets handled. You see this in the airlines, they’re always overbooking seats because we don’t show up all the time, and so filling the airplane means airplane seats are cheaper and the money, and it works out.

And that’s a business choice they make, which can be annoying when we get told we can’t get on a plane, but it’s a business choice they’re making. What I’m pointing out is, this isn’t a database, this is just not a database. In databases, we have reads and writes and we make it look like there’s a scratch board that you can write things on and do some relational algebra and get back a set of answers, and that’s fine, that’s a phenomenal building block. But then on top of it, you have to have abstractions for what makes a business tick. And those abstractions include long-running work.

If you think about it, database transactions don’t have a notion of long running. They make it look like it all happened at one point in time, you say, “Begin transaction,” you see your stuff you seek in transaction or report, whatever you’re going to do, but it’s not a long-running thing with other things getting in the way. To put other things getting in the way, which is necessary to do time-based work, you have to have this interchangeability. And even more than interchangeability, you have to have the dealing with uncertainty and the probabilities of, “Gee, I bottomed out because I ran out of pens because I over promised.”

Adam: Isn’t there uncertainty inside the database, like in the part I don’t have to deal with between begin and end transaction, it has to cover some-

Pat: We map it all and make it a… We make you obviously a committed or reported transaction. Yeah, there’s uncertainty, things go wrong, but you don’t see that as things go wrong when you’re building an app on top of it. It gets collapsed into a simple answer, yes or no.

Adam: Yeah. And now in a distributed system, these concerns are now mine.

Pat: Correct. And what I believe we have not done a good enough job in doing is figuring out how to make that easier for the programmer. We’ve done some, but not all the way that we need to do to talk about, what does it mean to have interchangeable things? What are the business rules for overbooking versus over provisioning? That class of stuff. And so there’s room for us to do those things over time as an industry.

Trying To Force A Database Metaphor

m Adam: Yeah. Maybe this connection isn’t as there as much as they want it, but like in a database, let’s say I write something, I send some sort of write to the database and it commits it, but it like crashes before it updates all the values. Is that possible?

Pat: Only if we have a bug. That’s our job, is to not have that show up to you. I’m not saying it’s never broken, databases sometimes have bugs, but they usually don’t, and that’s our job to never have that happen. But to do that, we narrow the abstraction of what you can do to have records in a database and you’re updating a set of them and we know how to deal with making sure that all the things you updated are all or nothing.

Adam: Yeah. I was trying to draw this metaphor between the one system sending a message to the other as something like writing something to the transaction log. And once it’s there, it’s committed, but maybe not everything has been updated to reflect the data that’s in the transaction log.

Pat: Right. In databases, you get into the visibility of that effect, where you’re correct, it’s in the log, but we have to make sure as an implementer of a database, that you can’t see the value before the logged entry. So that’s part of the job. But you’re correct, if I’m building a loosely coupled system where the app is doing some work in generating a message and the message is going to rattle over to the other machine, and then it’s going to do some stuff, there’s a window of visibility where you can look between the machines and you don’t see the change on both of them.

And to make you see the change on both of them, then we have to hold something still. We could hold the source still so that you can’t see that their transaction admitted where we’re trying to get anything to the second guy, or we could try to make the second guy still. The model isn’t quite right when you’re saying, “Oh, the app can see the two databases in between.” And so we deal with that. Now, what that means is the messaging system has to guarantee the delivery of the message over time, and it has to then make sure that the message is applied exactly once and isn’t applied twice because you don’t want to pull two orders of two pens out.


So there’s a whole set of things. Idempotency is a fascinating area. Idempotency is a fancy word, it’s a mathematical word that we use in computing, which idem means, I said it before, and potent means still good. So idem, I-D-E-M, is like when you see ID at the bottom of a footnote, which says, “Yeah, yeah, yeah, that previous footnote’s got the stuff that I’m talking about.”

Adam: I never thought of that connection.

Pat: Yeah. That’s what the words comes from. And so idem, I said it before, and potent, it’s still good. So, it’s okay to say it many, many times, and I’d like to say sweeping the floors is idempotent, it’s naturally idempotent. Reading a record given that any value from a record over time is going to be within that window of time is okay, it’s idempotent. Read it a second time, you get a good record. And I have a paper called Idempotence Is Not a Medical Condition. That was fun getting that title through.

It’s a super and powerful thing. And knowing whether you’ve applied it before… The paper that just hits the web today, Identity By Any Other Name, talks about the fact that in a banking system with paper checks, you buy these printed paper checks and they have the routing number, which says what bank you’re going to, it has the account number which says which account within the bank, and then it has a check number. So that is a unique ID for that operation. So you put the piece of paper and you fill out, “I want to send 100 bucks to Adam.” Great. So it’s going to go send 100 bucks to Adam, it goes into your account, and then it hits my account.

Well, when it hits my account, the bank writes down all that stuff, including the check number, it says, “I’ve cleared this check for 100 bucks.” Fine. So now if that check comes around a second time, no, it’s already cleared, we don’t have to send another 100 bucks. And furthermore, banks usually put a time limit of a year on the check. So the check has a date on it. And so if that check shows up two years from now where you would potentially pay the 100 bucks a second time because you’re not remembering things two years out, but the check is rejected because the date, it’s more than a year after the date of the check.

So the combination of the date on the check and this rule means that the bank doesn’t have to remember every check forever, which is important for them to implement idempotence of clearing the check.

Adam: That makes sense. So in the microservices world, I have one service that’s sending messages to another, let’s say via Kafka or something, I’m just recounting to see if I’m understanding this correctly, and say each message that it receives has some ID. So in order to make sure that it doesn’t replay messages, it would need to track what IDs it’s processed.

Pat: Yeah. But that’s really hard in a microservice world. When you say microservice, I think dozens of equivalent services on different computers and a load balancer in front of it. And when I send a second message, most of the time, it will get to the same actual microservice server as the first message, but there are many reasons why it doesn’t always, the server can fail, things are busy, it gets rerouted, so it goes to a different one of the dozens of servers that now has to figure out what to do with it.

So if you want to make something which is stateful be idempotent when you’re writing down, whether you did or did not do it, you have to write it down in a way where each of the sibling microservices can see it in order to enforce that. That can be done, you pick this thing up and you run into a key value store based upon a key, which has to do with me sending a message to the Adam service. And then it goes and it looks up and it says, “Oh yeah, I’ve already seen that,” and that could be done, but you have to think carefully about what happens because microservices means goes away sometimes.

And the value of microservices in a scalable cloud environment is huge. There are way easier to operate. Nowadays, we try to make developments be small teams working on small microservices and small deployments, that’s huge for the nimbleness of a system, but it carries the additional burden of figuring out how to deal with state and that figuring out how to deal with state includes, how do you make sure that idempotent work that’s already happened doesn’t happen a second time? Unless it doesn’t matter, which is possible.

Tracking Idempotence

Adam: I guess what I’m getting at is, if I understand what you’re saying, you’re saying the expiry of the check is important. So if I have a service, whatever, it’s stateless, but it writes to some store and it receives messages and it may receive one more than once. So it needs to keep track somehow of whether it’s applied one or not. However, the expiry date limits how many it has to keep track of.

Pat: That’s one thing that can occur. In the bank clearing the check, they notice if they send the 100 bucks a second time. So there’s a problem if you actually do the work a second time. For many things, you do many things you do with microservices also, it really doesn’t matter if you do the work a second time, it’s not a big deal. So the real question is, if you want to promise idempotent behavior to the caller, you have to take apart, what did the operation do? Did it matter if I did it more than once? And if so, how did I make sure that the part that matters is protected from doing it more than once? That then gets more challenging.

Adam: What’s an example where it’s innately replaceable?

Pat: Oh, heck, go get me the product catalog information for product X, Y, Z out of the cash for my retail site.

Adam: Reads specifically?

Pat: That’s right. Reads are a fine way to do it. They’re just naturally idempotent. And there are layers of abstraction in the system, the read and the processing of the read request is probably monitored and logged just for tracking the correctness of the service, and that monitoring logging doesn’t matter because if you process the read twice and it’s logged twice, that’s fine, but it didn’t do any harm to do that. So you’re always trying to take apart, when does it do harm and when doesn’t it do harm? If you’re also calling something to add something to the shopping cart and that service then goes and opens a value in a key value store and reads it and it adds something to the shopping cart and writes it back.

Well, that’s fine if you do that twice with the same item, as long as the act of doing it has some unique ID which allows you to say, “Oh, geez, I already put this in the shopping cart. No problem.”

Adam: That makes sense.

Pat: Yeah. So people, they designed these things for, how do you make it be okay to do the requests idempotently? If you don’t make it okay to do them idempotent, you’re going to have a problem. Now, these are not challenges that you deal with in a classic two-tier surrounding a database, because there, you’re much less thinking about a failure restart of the work. And if you do, you’re typically making the restart of the work idempotent, because the way you go and update the database means that it’s okay to do the second transaction, and it bumps into the first transaction and it says, “Oh, gee, I already get that.”

Distributed Transactions

Adam: And at some point, you had worked on distributed transaction systems, which propose a different solution to this problem, right?

Pat: Mm-hmm (affirmative). For me, the concern that I found when I first did distributed transactions, it was back in Tandem, and I’m super proud of the time I had at Tandem in the 1980s. And there you had two to 16 processors, and any processor could fail and it kept on going and the system was super resilient. And so if you were doing a distributed transaction across multiple clusters, multiple of these clusters across a wide area network, it was okay because those guys didn’t really go away for extended periods of time.

Then when you started applying the distributed transactions to single servers spread across a distance, typically, but the server goes down and the resolution of the transaction is stuck, it’s just stuck behind that, then that means the locks in the database on the other nodes are stuck behind it, and so the entire app gets to be more brittle. There are many solutions that I’m interested in where you’re keeping the state of the transaction in a cluster and it’s more vibrant. So I don’t love two-phase commit, but I actually think it’s okay if the participants don’t go away for long periods of time, and there’s a whole bunch of tricks to make that happen.

And so I stand by my life being on distributed transactions as an application pattern that is super important for people to look at, but I also believe there’s still more innovations that are going on Spanner that are very interesting, for example, lots of things are very interesting and I don’t think we’re done seeing growth in that area. By the way, I have a model for people who are not super familiar with distributed transactions and what it’s like. The way I characterize two-phase commit is as follows. Imagine you’re getting married on the telephone, and your spouse and the minister are somewhere else.

Two Phase Commit on the Telephone

And so you’re on the phone and you say, “I do.” And then you hear a dial tone. The real burning question is, are you allowed to date? And that’s the fundamental algorithm behind two-phase commit. And so you don’t want to get locked up. And another thing that I’ve said is that people are pretty comfortable with the CAP conjecture, which is the consistency, availability, partition tolerance, CAP. And the conjecture in theorem is pick two out of the three because you can’t have three out of the three. You can have consistency and availability as long as things don’t break, but things break, so that one isn’t useful.

You can have consistency and partition tolerance, but you’re going to lock up if things go bad, that kind of thing. And so one of the quotes I had from the ’80s is that two-phase commit is the anti-availability protocol. It picks perfect consistency over availability. That is its job, that is called success. And I’m not saying that’s necessarily bad, I’m saying it fits into a landscape of this complex and challenge.

Adam: That helps explain why distributed transactions aren’t used that much because you’re giving up availability. So if something gets blocked anywhere, it blocks everything.

Pat: And as the number of participating nodes goes up, the probability that one or more of them are unavailable goes up.

Adam: So if my data is distributed among eight servers and my transaction crosses them, then anyone can stop the whole thing?

Pat: Correct. And that’s not good when each of these things goes away for a while. Now, we are getting better about making them not go away for a while, but that’s not been what we’ve dealt with in the last 10, 20 years on a whole. There’s exceptions, but on a whole.

Adam: I think it explains how, then as the amount of data I have in my distributed data store grows, the more distributed transaction approaches is likely to stop.

Pat: Correct. Now, what’s helping with this, which is really interesting is that in cloud computing, you’re seeing a separation from compute and storage. So I have a compute box, which is like a database or something and is writing to a store, but the store is actually not on that physical server. It’s off on three different physical servers that are spread around the data center. Now, if this compute box goes away, you can pop a new one up and within minutes or something like that, it can be running, and now finishing off what was partially done, because all of the knowledge, the log, the database, all that stuff is out on three other random servers, and you can make sure that at least one of them you can get to.

Adam: So we just make each instance like a cluster?

Pat: Again, separate storage from compute. So the storage is just hundreds, thousands, tens of thousands of servers that are writing things out there and everything they write is on three different copies. So I can do the math on the failure of the servers in the data center, and three copies ends up being more likely to lose the data center than all three copies by losing servers by a lot. The data center we’ll get a meteorite that hits it or an earthquake that hits it or something, because when one of those servers goes away, you find an empty server or some other server, and you say, “Go from two copies back to three copies,” and now you’ve got three. So how quickly does that go down?

And the math on that is something that we all do when we think about how to build a distributed system. So the storage is spread around. So fine, now we’ve got a database, the database is writing logs and it’s reading from the storage. Great. Where are the logs going? They’re going spread around. And where’s the database going? Spread around. So if that server goes away, you just scratch your head and look around, and this is automatic, by the way, it’s not a human. Scratch your head, look around and say, “Okay, let’s pop up this empty server over here and make it run.” And then it reads the data and it puts it all back together.

And so the time to then come back to life is now not gated on getting a repairman to fix the one server in the previous design. Now, that means that two-phase commit is getting less painful, it’s still challenging, and it’s got times when it’s challenging and we could talk more about that for a great period of time, but it’s getting less challenging when the individual participants are more available.

Consistency of Reads and Linearizability

Adam: One thing a system like that would allow is consistency of reads. You mentioned before with the relational database, I can read the data immediately after I update it. I can even read it in the same transaction, and it should be-

Pat: Mm-hmm (affirmative). You must be as per the rules.

Adam: But if I’m using a key-value store that’s distributed, I could write somewhere and possibly read from somewhere else and the changes haven’t cascaded out?

Pat: Well, it depends on the store. Most of the key-value stores offer what is called linearizability. That means there’s a linear sequence of changes. And another more approachable way to say that is read your writes. If I write something and I go to read something, it’s what I just wrote. If I write something and somebody else goes to read something, it’s what I just wrote. The fancy word for that is linearizability, but it’s really just read your writes. It turns out that’s what most systems do because most applications go completely haywire if they don’t have that characteristic, but a fact of life in a distributed system, which is what you get in a data center, and nowadays, almost everything is distributed systems.

A fact of life is that it means you’re writing it in three locations, and that means that I can’t just say, “Oh, I got two copies, that’s good enough.” Because then if you happen to read that third one that I didn’t change, then things are tangled up. So what you got to do is you got to say, “I’m writing this one, I’m writing that one, I’m writing the next one. Okay. We got them all, we’re done. We’re happy.” Now, remember, every server that we have out there is usually pretty fast and it has a pretty bounded latency, except when things are tangled up and the latency gets very large for a period of time.

And that can be garbage collection, it can be a whole bunch of other things that can go wrong, it could just be storms in the network, which doesn’t happen too much in data centers nowadays, where you can’t actually talk to the guy. He’s there and he’s happy, but you can’t get the message through, but it looks to you like you’re slow. So now when I’m trying to do a change, if I have to change three copies, then that means that I have to make sure that I wait until all three copies are there, or I officially give up on one of the copies. Now, to officially give up on one of the copies, we have to shun that copy, “Oh, we’re not talking to him again until we decide to bring him back in.”

Then we bring them back in, there’s going to be all this work to do. Actually, taking and shunning, which is just my word, it’s not the official word, but to push it out of that membership of three means you have to decide that server’s bad, which takes a minute in most systems. So now you’re writing something that was taking you 10 milliseconds, and now it’s taking you a minute. And so that’s a trade off associated with being able to read what you wrote, because if you don’t do that, there might be old copies fooling around, and sometimes you’ll bump into the old copies.

Adam: So you had described this as predictable answers results in unpredictable latencies?

Pat: Yeah. And so in the paper Heisenberg Was on… which is standing on the distributed shoulders of giants, there’s a section called Heisenberg Was on the Write Track, which talks about one technique to avoid it, but it’s only useful when you structure the data in certain ways.

Adam: Does that mean we should be giving up on reading our writes in certain cases?

Pat: No. Well, it does in certain cases, it depends what’s more important. Is it more important to have a quick answer or is it more important to have a perfect answer? And the tagline I have for that is, do you want it right, or do you want it right now? If you have a shopping experience for users that are buying junk and you’re showing the product catalog, or you’re showing the shopping cart or you’re showing stuff, measurements show that if you wait too long to give them their answer, they go do the dishes and they don’t buy anything.

Measurements also show that if you show them the slightly stale version of the product catalog out of your caches because the cash is being updated irregularly, at first you see the new one, then you see the old one, then you see the new one, then nobody gets too excited about it because it works, and it’s okay. So it depends upon the business. If the business wants the latency bounding of making sure that the user is getting the answer quickly, even if it’s not the latest and greatest answer, that’s fine. It’s a business requirement, but you have to be thoughtful about it.

Adam: So the business requirements dictate our storage strategy?

Pat: I’m thinking that they should, and that we don’t pay enough attention to it.

Adam: If I understand, like rewinding, if we have data that’s too big to fit on a single machine, we have to start distributing it. So if we use distributed transactions, there’s some potential availability issues, so then we need to decide, assuming we didn’t decide on Spanner or something, we need to decide whether we want it write or write now, whether we want it first to write down or accurate to read?

Right or Right Now

Pat: But at some level, this is the same discussion of whether the app sees a ginormous database running across many, many servers and distributed transactions, or whether they actually do the kinds of things I talked about with entities and workflows and messaging in the app, which will also be more resilient and more rapid in its behavior, but it doesn’t get tangled up when things break. And the write versus write now is less about distributed transactions and more about any individual store in a key-value store. That was how I was applying it.

Adam: Yeah, yeah. I was thinking once you’re in this land of a key-value store that is distributed, you need to decide whether you want it to wait for everything to be written or whether you want it to just write what it can and then go.

Pat: Yes. But usually, people want it to read your writes because they don’t know how to think about what the app would do otherwise.

Adam: And are you saying that your entity model overcomes this?

Pat: No. I’m saying that it’s at a different granularity, we’re conflating two things that we shouldn’t conflate and I’m sorry. The entity model doesn’t overcome it, the entity model gives you a mechanism for scale that tolerates systems going out and the system can continue to work.

Adam: One interesting tidbit I picked up from your paper was this concept of getting the answer right now, one way it could overcome is if your data’s actually immutable, how would that help you?

Immutable Magic

Pat: Wonderful thing about immutable data, one of the wonderful things is that I don’t have to worry about writing two versus three replicas because and getting a stale version of it, because there is no stale version, the data associated with what I’m writing has only one value across all space and time. The unique ID that I have is only going to be for that stuff. So I don’t have to get excited about, did I write two versus three copies from the standpoint of something out, an old version, I want to get excited about how many copies I have so that if something fails, I don’t lose the data, but that’s a different discussion, than do I have correct updates to it.

And so I’m writing it out, I’m not replacing a previous value because there was no previous value. And so now, I can write it anywhere, I can write it on any three machines. And if one of the first three is stuck, I’ll just pick the fourth of the fifth and go put it back later on when things aren’t stuck.

Adam: It kind of sounds like a cheat, like a trick. You’re like, “Okay, updating things is hard, so we just won’t update.”

Pat: Yeah. But when you build systems out of stuff that doesn’t update the whole distributed system thought process, inverts in a positive way. Now, you have an obligation, how do I deal with not updating things? And that ends up giving you some additional challenges. You have to think about a database very much like an accountant’s log of changes. And there’s some well-established techniques for this, where you write the changes and then you continue to merge and coalesce them, so you can actually find the latest thing for that key. This is called a log-structured merge system.

But in all cases, you’re only writing out brand new stuff with a brand new identifier. And the storage system is far easier because you don’t have these corner cases when systems get wedged. And when you look up and down our computing stack, everything from how do we do the hardware inside of an SSD up to how we’re building the databases are inverting to the point where we’re only writing data that you never update. You only logically update it by writing new things into the journal describing what should be different.

Adam: I think an example will help. So let’s say, you mentioned accounting. What about a bank account? Making changes to a bank account, how would I represent that?

Pat: The way we do it at the bank. You have debits and credits and you get the monthly statement, it says, “This changed, and that changed, and this changed, and that changed. And we started from this, and now we have these, and then we have that.” And that’s the bank account. And then you roll them up and then you have a yearly summary, and all this stuff is based upon just adding new artifacts of knowledge. In the middle of the month, a check clears, it’s added to the list, it’s added to your bank account. At the end of the month, you do the analysis of what the monthly statement is, and some stuff’s added to your bank account.

You just keep adding to your bank account. Eventually, when year it gets wrapped up, you peel it off into archive. You can go back and see it, but it’s not as easy to see it, you have to look a little harder. And so you’re constantly adding information and you’re never deleting it, you’re never changing anything. You’re adding knowledge.

Adam: So somewhere in the system, I still would like to know my balance. So something has to roll these up, I assume.

Pat: Correct. But sometimes to read your balance online in the middle of the month, you have to have an algorithm that comes in and looks at the balance as of the beginning of the month and apply the debits and credits as you’re trying to figure out right here and now what’s the balance. You apply the debits and the credits and say, “Oh, okay.” And now you have the summary. But the thing that’s at the beginning of the month, which is your bank balance, it motivates for you, it bounds how much you have to do that because this rolled everything up to the beginning of the month, but you’re actually calculating in situ to give an answer in the middle of the month and then you can get the balance right now.

Adam: So any given lookup both the balance, we’ll start at the last snapshot and move forward?

Pat: Yeah. And apply the changes. And organizing the system so that that’s reasonably efficient is a part of the game. But the facts are in accounting, if you use an eraser, you go to jail. And so one of the titles of one of the sections in one of my papers is Accountants Don’t Use Erasers. And so much of this is described in the paper, Immutability Changes Everything. Thinking about distributed systems data as a collection of immutable artifacts, absolutely inverts the way you think about the system, and it’s cool.

Adam: How so?

Pat: I just have these things that somebody wrote somehow and I’m looking at them, and so they’re artifacts and they’re there and they’re going to be there until maybe they’re deleted, but they’re never updated. So the question is, how do I interpret the bank balance out of that? And again, even when you’re in the middle of the implementation of an SSD chip and they’re copying forward the pages that you’re updating and applying the write to a word that you’re doing inside of a page, they’re doing it by creating immutable artifacts and copying them forward.

And so that’s just one example of what you’re seeing up and down the stack in the way we deal with data, is increasingly about writing something down, never changing it, and then doing something to semantically show the change.

Log Structured Merge Tree

Adam: How do we semantically show the change you mentioned, log something?

Pat: Log-structured merge system is a technique that was first documented in ‘94 by some friends of mine. And it says, “Look, I’m going to have keys, and I’m going to have values. In the semantics I’m going to offer is I’m going to update the values associated with the keys or add new keys or delete keys, and I’m just going to give you a system that lets you modify the key-values.” Now, that’s cool. As I’m doing that though, what I’m going to do is you make a change and I’m going to put it in a transaction log, just like a database, or I tell you the transactions committed to make those changes, then I’ll make sure that that database is written down, that log is written down in a place where I can find it.

Great. So now I have all this stuff in memory, the new values. Okay, fine. So I’m going to write it out into a file, which is structured by the keys and the values. And it’s got the changes that happen, say in the last five minutes. And so I keep doing that, but now I have a stack of changes representing five minute intervals. And after a year, that’s a real drag to deal with. So what you do is whenever you write a new one out, you know how below that a tree of things, where at the next level down, it’s maybe got 10 times as many files and they’re organized by key. So there is at least one-tenth of the keys and the second one-tenth of the keys and the third, and so forth and so on.

And then I’m going to take that new one that came out representing five minutes and I’m going to sort merge it with the ones below and create go from 10 to 11 files because I’ve mixed the keys within that. And then I’m going to pick one of those and I’m going to go down to the next level, which has got 100 keys. And that the 10 keys I have are the one-tenth of the key range covers one-tenth of the key range when there’s 100 of them, and I’m going to merge it with 10 files below it because it had 100 files, so it’s roughly that way. And so now I’m going to merge it down. And as you’re constantly creating data, you’re merging it by key, down into what becomes a tree that gets fatter and fatter and fatter by key-value.

So now when I want to look something up by key, I have a reasonably small number of places to read, it depends on how 10th to the fifth or 10 to the sixth is like five or six places to read. And so I go looking for the key and those five or six places, and it’s reasonably bounded

Adam: The old values are still found, they’re just further down in the tree?

Pat: Correct. But there’s a whole bunch of sub details in this, like if I have a different value for the same key, do I destroy the old one as I merge it in? And usually you do, so you only have the latest version there, but that’s an artifact of your specific implementation, but yes, you can discard the old values when you merge them together.

Adam: So at a higher level than an SSD?

Pat: By the way, this is the plumbing underneath HBase, this is the plumbing underneath the MaxDB, LevelDB. There’s a set of technologies that are pretty well established for the last handful of years that are out using this technique, which is called a log-structured merged system. And again, these systems, these key-value stores run on top of the storage where they give the storage immutable files. They come up with a completely long unique number for the file and they write this file out and they read it as long as they need to, and then they go delete it. They never update it.

Adam: If I understand, one of the advantages of the immutable data is what you called having data on the outside of your system?

Pat: That’s a different discussion. I know where you’re going, but it’s a different discussion. It’s okay. Let’s move on from immutable. It’s at a higher level of semantic is why I said that. And that was the paper I wrote at 2005, it is related, I’m retracting my… Let’s talk about it. I was becoming aware in 2004 and ‘03 and ‘04, and ‘05, that there was data that was outside of databases that we weren’t talking about. People were squirting files around and there were documents on the web and there was all this stuff and messages and files. And we were in the database community, not looking at that so much.

And so my nomenclature, which did not stick, nobody else uses it, it’s just from this one paper, is that data from a database is inside the database. It’s inside and it’s relational, and it’s got all these really cool properties, but when you take data and you send it out on a wire and you push it out, or you write it to a file system where you send it across the web or whatever you’re going to do with it, that data tends to emerge as being semi-structured. And at the time, the leading semi-structured thing was XML. And I think XML is fine, I think JSON is fine. I have no problem with either of them, but it’s very nice to have this semi-structured glop.

And that thing has got an ID and maybe a version and you write it and you never change it. And so it is immutable, but at a level of discussion. It does fit into the immutability description, I’m sorry. And so you’re writing that stuff, and now the question is, what do you do with it? One of the really interesting things is that when you write it out, the metadata you put with it is just describing what you meant at the time. You were going to say something.

Adam: Yeah. In my mind, it all fits together because when I think of this, it makes me think of… When I hear immutable data, I think of like messages or like Kafka topics. And you mentioned RocksDB, I’m going to pull this all together because that’s often used along with Kafka as a way of summarizing this immutable data, like a service consumes and rolls things up somehow into RocksDB.

Immutable Topics

Pat: Some fascinating products out there. There’s really cool stuff.

Adam: I still have this feeling that somehow the distributed systems become giant databases where these immutable messages are like the transaction log and then individual services, maybe, making some materialized view out of them by processing them.

Pat: Well, there’s a perspective, which is that I’m sitting where I am in the universe, and I’ve had an event horizon of knowledge arriving to me. And that knowledge can be archived in my brain and a store, which has, I got all this stuff. Now, these things are going to have a regular metadata, irregular shape and irregular form, and so the question is, can I, could I, should I, put a projection on the shape and the form of what I’ve received into a place where I want to do queries and analysis on that? And so I tend to think of every message that comes in as having its descriptive metadata that describes what’s in that message.

I tend to think of being able, when you want to do a relational system where you want to be able to do queries against it, I think of that as needing a prescriptive database, which is thou shalt fit into this shape. So one of the questions becomes, how do I take these things which were from a different origin at a different metadata and they’re different, their description of their shape is not what I want. How do I shoehorn it in? And I just finished writing a paper that’s not even edited yet called Extract Shoehorn and Load. And it’s all about that whole notion of everything that transforms is typically a lossy transformation from the descriptive metadata into the prescriptive shape of where I want to do the analysis

Adam: The descriptive, prescriptive dichotomy, it’s like if I write old school database, I have some insert X, but in the event system, I might have just something happened and the system reacts to it. So that’s-

Pat: Each event is going to have its descriptive metadata, each event permanent, it’s like, “Oh, here’s my events ID, here’s the time at which was issued, here is the alarm that went off. Here’s what was going on. And here’s what I mean by this stuff.” And it’s immutable and you can’t go back and revisit what the metadata is. It is what it was and it was what it is. And so that’s fine. Those things are little artifacts of knowledge that have their own description of the metadata for them. The question is, how do you shoehorn that into a pool of stuff where there’s a consistent enough interpretation of that that you can do a query?

Adam: Yeah. Which sounds hard.

Pat: Well, yeah, but people are doing it. When I talk about this stuff, it’s just, how can we give better tools to talk about it and maybe make what we do better, but even still, understand it.

Writing Down Ideas and Idemptotence

Adam: Yeah. It seems like a lot of your writing maybe has to do with cataloging some of these patterns that you see.

Pat: [inaudible 00:51:10] think of, man, I’m staring at the ceiling all the time and annoying my wife. It’s like, “Where the hell are you?” “I’m thinking about some crazy idea.” And then I have to try to figure out how to organize it. And you have to think about the crazy idea for sometimes years before you have it organized well enough to make a story out of it. Because if you don’t have a story, you don’t have a paper. And if you don’t have a purpose and a point to make, you don’t have a paper. And so it just takes a lot of noodling in order to make it happen.

Adam: So you’re saying you’re writing down ideas, not so much observing what’s out there, but trying to think about how things could be?

Pat: No. Well it could be and/or. Just try to explain the behavior of things, if you understand. The paper that I just got put up on the web called Identity by Any Other Name, is the observation that we are using identifiers for all the junk to hook these systems together, and then walking from example to example, to example of how identifiers are the backbone that makes it work. And I even was able to get more crystallized in my mind when I was writing it, that immutability is a relationship between an identifier and stuff that doesn’t change. And that was not something I thought of saying that way when I wrote the paper, Immutability Changes Everything, but it is a relationship. I got this identifier and here’s the stuff.

And by the way, things are sometimes immutable at one perspective and changing at others. If I pick a byte stream up and I move it from one underlying storage to another, is it the same? Is the King James Bible immutable with different traditions, with different fonts, with different pictures next to it, but the same sequence of letters because it’s the King James Bible? There’s like 6,000 Kings James Bibles you can order on Amazon. They’re all different, but they’re all the same. So what’s immutable? And so this one paper I wrote called Side Effects, Front and Center, it was trying to articulate the fact that when you want to ask these questions about immutability or the certain behaviors, you’re doing it from the perspective of an abstraction layer and not from a layer below.

If I do an insert into a database and abort the transaction and it causes the underlying block to split in the database, but the record is removed., so now when I read the set of records, it’s the same. Did I change anything? Well, the answer is yes and no. I changed things from the level of the blocks, I did not change things from the level of the records, right?

Adam: Yeah. It depends on what level you’re paying attention to or care about.

Pat: Right. And that’s true for even idempotents. I want to do the thing exactly once, but do I really care if I did it twice and created two logging records, the thing was processed twice? No. Do I care that when I hit this server the second time it caused my memory manager to allocate some stuff and do a [inaudible 00:53:32] and things were messed up in the heap? No.

Adam: Yeah. You’ve been working with basically data for a long time, where do you think somebody should start today who’s building a new system? I think you’re going to say it depends on your business requirements.

What Data Store To Use

Pat: Well, yeah. One of two things is happening, either you’re in school and you have a harebrained idea that you’re trying to flush out, and that’s all cool. I’m all in favor of that. Or you walk into work and your boss says, “This is something we need.” And so now you have to try to figure out how you’re going to practically do that. And in a world of just ever increasing resources, both in terms of software packages we can get and cloud computing and all these things, you want to figure out an effective way to accomplish what the business needs, leveraging technology that’s around.

And so there is no one answer. Of course, for each of us as engineers, we need to know that we’re learning and growing and joining our team and enjoying our job while we continue to contribute to what our company needs us to do. That’s how it rolls. But that was a non-answer, I think.

Adam: Well, I don’t think there is a right answer. I think that’s one thing I’ve learned from your discussions that the business requirements should dictate how we store things, the constraints we put on it, how we interact with it.

Pat: Yeah. I completely agree. And that’s not just a one-size-fits-all because some of these things you have to make a choice, and the choice has this benefit and this drawback. And the other choice has got different benefit and the different drawback. And that’s fine. It’s just about a matter of educating ourselves and figuring out how to apply those things.

Adam: If you had to choose between the model you outlined in your theoretical infinitely scaling system or something like the Google Spanner, where would you choose?

Pat: I think Google’s Spanner is fascinating, but I don’t think it quite works as in terms of across trust boundaries. That’s one issue. So if one of the participants is nefarious, then how much does that get you tangled up? But it’s way the heck easier for a lot of uses. And so I have a ton of respect for it. It’s really interesting. No one answer.

Adam: Well, this has been a lot of fun, Pat. Thanks so much for the time. I will put links to your ACM articles in the show notes.

Pat: Thank you. And thank you very much for your time, and I appreciate the interview. It’s very nice to meet you virtually, and I hope to meet you physically, Adam. Thanks.

Adam: All right. That was the show. I hope you enjoyed it. I’m trying to make all these interviews evergreen, and hoping that people in the future will still find them useful. If you are listening to this sometime after March 2019, hello from the past. I hope you liked the episode and let me know what you thought. Shoot me an email or whatever, mention me on Twitter or DM me, or however people in the future communicate. I have no idea. Until next time. Take care.

Support CoRecursive

I make CoRecursive because I love it when someone shares the details behind some project, some bug, or some incident with me.

No other podcast was telling stories quite like I wanted to hear.

Right now this is all done by just me and I love doing it, but it's also exhausting.

Recommending the show to others and contributing to this patreon are the biggest things you can do to help out.

Whatever you can do to help, I truly appreciate it!

Thanks! Adam Gordon Bell

Audio Player
back 15
forward 60s

Data and Scale