CoRecursive #014

Microservices Architecture

With Jan Machacek

I don’t know a lot about microservices. Like how to design them and what the various caveats and anti-patterns are. I’m currently working on a project that involves decomposing a monolithic application into separate parts, integrated together using Kafka and http.

Today I talk to coauthor of upcoming book, Reactive Systems Architecture : Designing and Implementing an Entire Distributed System. If you want to learn some of the hows and whys of building a distributed system, I think you’ll really enjoy this interview. The insights from this conversation are already helping me.

Transcript

Why Break Up a Working Monolith?

Adam: Welcome to code recursive, where we bring you discussions with thought leaders in the world of software development. I am Adam, your host.

Hey, here is my confession, I do not know a lot about microservice architectures. I’m currently working on a project that involves decomposing monolithic application into separate parts and integrating those parts using Kafka and HTTP. Today I talk to the co-author of the upcoming O’Reilly book Reactive Systems Architecture, subtitled, Designing and Implementing an Entire Distributed System. If you want to learn some of the hows and whys of building a distributed system, I think you’ll enjoy this interview, the insights from this conversation are already helping me. Jan [inaudible 00:00:53], oh actually, is that how you pronounce your name?

Jan: It’s [inaudible 00:00:57]. Sorry I’ll shut up.

Adam: Jan [inaudible 00:01:05] is the CTO of Cake Solutions, and a distributed systems expert. How do you feel about being called an expert?

Jan: It’s false advertising entirely. No idea if I would call myself expert. I’ve had my fingers burned quite a few times. So I’m not sure if that qualifies as the answer.

Adam: I think that qualifies. From where I’m sitting, somebody who’s made some mistakes, is in a great place to stop me from making them.

Jan: Okay, I’ll do my best and make more mistakes.

Adam: So recently I’ve been trying to get up to speed on how to split up a monolithic app. And also just about how you might design a distributed system in general. And my former colleague Peter recommended I check out your work. So why would I build a distributed system?

Jan: Now, that’s actually a tricky question. So, typical answer is that people say, “Oh, well we don’t like this monolithic application, because it’s monolithic.” That’s the only reason. And by monolithic, I mean, a thing where all the modules, all the code that makes up the application runs in one container. I’m using the word container in this most generic way. So, maybe a Docker container like the JVM, or something. Everything runs in this one blob of code. And that doesn’t actually mean that it doesn’t scale, right? It’s quite possible to think of multiple instances of this monolithic application. Monolithic in a sense database contains the thing that deals with orders and the theme that deals with generating emails and what have you. And it scales out right it runs on multiple computers, and it talks, again a scaled out database.

Now, that might be okay for business, but clearly it’s not really entirely satisfying for development and soon enough that too will meet and business will say things like, “Well, guys, we need you to update the blog component, the theme that generates the emails that come out to our customers.” And the development team says, “Yeah, sure we can do that.” And then they say, “Yeah, that’s all done.” So, when can we redeploy this entire thing? [inaudible 00:03:34]. Now, that’s where it’s starting to get a bit ropey. Then the product owners, the business will have questions like, “Okay, yeah. Okay. Are you sure you haven’t… Everything else is faulty, right?” And then we the programmers say, “Oh, yeah, yeah we haven’t changed a single line of code in this other stuff. We’ve only changed the lines of code that make up the email sending service.” And we deploy it, and then something breaks.

Now, that’s actually really implicit. So we say, “We don’t want that.” We realize that the entire system is made up of smaller components and they have different life cycle. So we want to break the up. So we want to have a service that doesn’t even [inaudible 00:04:13] to have a service that does the product management, and some sort of eCommerce component that takes payment. And so we break it up. Now, when we do, we gain the flexibility of deploying these smaller components independently, but there’s a price to pay. Like the complexity of the application remains the same. And the fact that we’ve broken it off and we say, “Okay, this is now simpler.” While the complexity hasn’t gone away, that still stays. And we just pushed it aside to the way in which these components talk to each other. So this brings me to the first monumental mistake, and yeah I’ve [inaudible 00:04:57] made it.

The Distributed Monolith Trap

We said, “Oh well what we actually have now is a distributed monolith. What I mean by that is application that runs in multiple processes, logical containers if you like, but its components need to talk to each other. So to display a web page, I need to talk to the product service, but I also need to talk to the discount, service and the image service. And I need to get all three results before I can render out the page. And if one of these components fails, well that stuff doesn’t work. So, that is a particularly terrible example of a distributed monolith. Essentially what you’ve replaced is the main process method calls and function calls with RBC.

Now, even when I say it out loud, it sounds pretty unpleasant, who would want to do that. Well, clearly, this might not be the right thing to do. You have a monolithic application, you decide to break it out, but it needs… We need to take a step back. The designers need to take a step back and say, “The fact that we’ll now do a distributed application, which means that there are… Time is not a constant, right? So we need to worry about time. We need to worry about unreliable network. We need to worry about network as such, these components need to be talking to each other.” And because these components, each of them will get their own life., we need to worry about versioning all of a sudden. So all that brings up this vast sort of blob of complexity that we need to solve. And you might have heard about this reactive application development thing, have you?

Adam: Yeah. I feel… I’m not clear whether it’s just a buzzword to be honest, like a marketing term.

Jan: I hear you.

Adam: What is reactive?

Jan: So, it has these four key components, and it centers around being message-driven, so that the components when they interact with each other, it’s not a request followed by a fully baked response. It’s a world where one of the components, which is the message, and it says… Here’s my favorite example. So we’re building these big eCommerce websites. And when we’re designing the checkout process, we have this entire thing, and you’ve integrated with payments, this all works. But each service when it runs, it should emit messages. Now, this is all sort of hand wavy. You’re not seeing me live, but I’m waving hands. So you can imagine the picture and all its horror, but you publish your message onto some message bus. So think [inaudible 00:07:52], that sort of thing, a durable message, we’ll get to that. So hang on to the web durable message. Now each service as it operates, it should emit as much information about its work as possible onto this message topic now.

Events That Outlive the Services

You’ve successfully implemented this. This first version of the eCommerce website works. Now, because the services were emitting these events as they went along, it is now possible retrospectively, this is the [inaudible 00:08:21] thing, it’s now possible for someone to come along and say, “You know what, I’m going to build a blackmailing service.” The blackmailing service is going to go through all the historical messages in this persistent message [inaudible 00:08:34], and it’s going to pick out the embarrassing orders that I’ve made. You wouldn’t believe the stuff I buy on Amazon. Now, if we design a microservice architecture that way, so we are event driven, each service publishes an event, the events are durable, it’s possible later, to construct another service that consumes these past and future events, and does more work. This is how we extend. This is the promise. Surely, we can add another service to our system, and it now does more things. If you think about it, that can only be achieved if we have this historical record of stuff that happened.

Not just here’s an entry in the database. That’s like a snapshot in time. You have stream of messages going from message number one, zero, the beginning of time, all the way to today. Now, if you were to replay all the messages and write them to a database table, you’d get a snapshot in history, you’d get a snapshot of when I replayed all the messages, here’s the state of the system. But as more messages come in, of course, this snapshot keeps changing. And so, it’s actually sometimes really useful to think of even a database table or some sort of persistence store as a snapshot of this system at a particular points in time, for the life of it is expressed through these messages.

Adam: Just so I’m clear, you’re saying… So you designed the eCommerce system as a standalone monolithic app. However, you make sure that you’re emitting [inaudible 00:10:06]admitting everything you’re doing to some…

Jan: You know what? That could be a good start. If you have a standard, if you have a monolithic application. You think, “I want to do micro services and niche stuff first, and build the new stuff as a new microservice.” There’s a whole bunch of value in that versus the usual, dare I say, enterprise initiative that says, “Let’s take what we have and rewrite it.” I have yet to see one successful rewriting from scratch. It just never works. There is so much value and so much experience baked in existing code, that rewriting it is almost always a disaster.

But extending it is a good idea. Now, extending it, you don’t want to add another monolithic bit on to it. So adding messaging, asynchronous messaging is another bit, another asynchronous non-blocking, another [inaudible 00:11:03] another reactive buzzword. You don’t want to, as part of processing of a request from the user. The last thing that the app should do is to wait and block on some sort of IO. Because we are distributed after all, and this file could be network. Now, who knows how long that might take? Sometimes even forever, sometimes there’s no results. And these TCP sockets timeout after those 60 seconds.

60 seconds, there’s no way on earth that a thread this heavy weight thing should be blocked for 60 seconds. So it needs to be asynchronous. Now, modern operating systems are… Kernels of modern operating systems have all these asynchronous primitives. So it’s possible to say things like, begin reading from a socket, and continue running this thread. And then the OS will wake up some other thread, and will say, “Okay, yes, I now have the data. Here’s your callback. I’ve read five kilobytes from the socket, here, you deal with it.” And then it’s up to the application frameworks to manage it in some reasonable way for the application programmers.

I do a lot of Scala, so there’s a whole bunch of convenient asynchronous libraries like Akka, basically goes in a very similar way.

Adam: I mean, I guess asynchronous could have a couple meanings, I guess. Like I could have a… Like is it fire and forget? Or is it I’m just not blocking, but then the scheduler is going to return something? Like is every request like two [inaudible 00:12:46] or four? Like I say, save user, and they’re like, “Okay, we heard you.”

Jan: That’s also a really, really, really good question. If you can get away with fire and forget, your system will be so much quicker and so much easier to rate. Now, a lot of our systems, the users wouldn’t accept fire and forget. So again, think about the things of typical messaging, send, write to a socket to write to disk, or write to this persistence [inaudible 00:13:20]journal. Now, if you accept at most, once delivery, so fire and forget, what these components are telling you is, “We’re going to do our best. We’ll try not to lose your message.”

Why Exactly Once Is a Pipe Dream

And that’s probably okay. For statistics. It’s probably perfectly fine for health checks, monitoring. But it’s not okay, for, I don’t know, payments. Basically, it gets a little more difficult. So, if a component has taken money from card, the thing that receives it really should receive it. Now, this is where it gets really complicated, because distributed system, right? Okay. So you and I need to exchange information in a guaranteed way. What I can do is say, “Hey, Adam, I have the payment for you,” and I can hang up or disconnect my computer.

Now, from where I’m sitting. I’ll hear nothing from you. Okay, okay, okay. But from where I’m sitting, I don’t know if the message that it was going to you got lost and you actually received it and you have it. Or if you actually received the message, you began processing it, but then you crashed. So I should send it again in both cases. Or is it the case that I sent a message successfully go to you, you processed it. But then just as you were replying, my network went down, which is you have it, but I don’t know that. So I’m going to send the message again to you.

Now that means that you might get duplication. So you might hear the same message again. There’s nothing I can do about it because from my perspective, I haven’t heard a confirmation from you. So I need to send it to you again. You now need to do extra work to de-duplicate. Now that’s problematic. How do you do deduplication? Wow, okay. [inaudible 00:15:16] the message. So you compute some shots [inaudible 00:15:19]256 of it, or you keep it in memory.

Yeah, how much memory do you have? Because this could go on for a really long time. Now, this is when we say pragmatic things. Like well, in reality, it’s we have a system, we know that between you and I, we exchange like one message per second. So you think, “Well, much memory do I really need to remember this stuff from beyond? We’re going to need the last 10 statements.” So you submit, make sure you have a memory for the 10 hashes of the last 10 sentences. And that’s good enough for you.

And the key thing is to measure the system, and know how much is going through it. And then you can tune these duplicating strategies and that you can keep things in check. But it’s a probability game. At some point, something’s going to go wrong.

Adam: If you phone me and say, “Hey, I got your payment,” then that’s like, at most 1s, [inaudible 00:16:18] because you hang up, you haven’t heard an acknowledgement from me.

Jan: Exactly. [inaudible 00:16:23].

Adam: And if you wait for me to acknowledge back, then we have this problem where that’s at least once because maybe the line comes out, well, I’m telling you [crosstalk 00:16:36] then you have to [crosstalk 00:16:36]. So I get it twice. I mean, I think what everybody really wants is exactly once.

Jan: I know. I [inaudible 00:16:44] two. I’d also like a wristwatch with a fountain. Never going to happen.

Adam: So it’s exactly [inaudible 00:16:51]what’s possible?

Jan: Okay, yes, if you reduce the context. So if you remove a lot of IO, or if you are prepared to say, trade off something else, so think I want exactly once, okay, that’s fine. Which now means that you need to say, “I need an extra coordinator. And this extra service, extra coordinator can now tell me, have I seen this message? Yes or no?” And if the answer is yes, I’ve already seen this message then, reject it, don’t even attempt to deliver it.

And that’s okay. But you’ve sacrificed availability. But if this coordinator goes down, your system is saying, “Well, no, I’m no longer sending anything.” So it is a trade off. Throughout this thing, it’s a trade off.

Adam: So in the phone example, I guess, if we had some third person who tracked whether… Like we both acknowledge to the third party, is that the idea?

Jan: That’s the idea. You’d have something else that listens and says, “[inaudible 00:17:57] through.”

When Failure Becomes a Feature

Adam: Let’s rewind. You said earlier that… I think that you said one of the main reasons for wanting a distributed system was deployment. It seems like a small reason to me just want to deploy components separately. Is that the only reason?

Jan: Okay, so the big reason, the favorite reason is we say scale. That’s the chicken [inaudible 00:18:27]. And we always imagined the system that goes, you know what? I’ll be Amazon. This will be so cool. 100,000 requests per second. Maybe even and [inaudible 00:18:39] Lindsey. Okay, well, it’s unreasonable for a monolithic application to be able to handle 100,000 requests per second, when the distribution of work is actually 90,000 of browsing, people just look for stuff. And only 10,000 is people buying.

I think I’ve made those numbers out. I think if that’s the case, then you know, Mr. Bezos is [inaudible 00:19:00]. You get like 90-10, 9 to one browsing to actually buying. Oh, wow, that would probably be pretty good. So I think the numbers are different, but doesn’t matter. So yes, it’s scale. And of course, if the system is divided into different services, each service can be scaled differently.

What’s actually even more encouraging about it, is that when the system is broken up into services, we can now read and think about failure scenarios, and what do we do if something is broken? So, again, hypothetical e-commerce website. Let’s say that the e-commerce website is really keen that people get whatever goes into your shopping basket, people have to be able to get. We’re just going to believe that. So if I put [inaudible 00:19:53], that’s one set of micro services that can do the search and the image delivery and all that personalization.

As I mentioned, [inaudible 00:20:01] basket, and I go to the payment stage, and the payment service goes down. Now, one of the options that we have, which will be sort of silly in this particular example, but nevertheless reasonable to think about it, let’s say we’re really good and we say, “We want our customers to be super happy. [inaudible 00:20:20]They spend all this time putting their stuff in the shopping basket.”

Regardless of whether the payment service works or not, we’re going to send this stuff. And so they find me because… Okay, I know silly, right? But if you make a request to this payment service, and you say, “Hey, I want to now pay for this. I put a few DVDs in my [inaudible 00:20:40] years ago, right? I put a couple of DVDs in my shopping basket and paid for it. At the check out, the service that ran the website, made a request to the payment service that was down. And so it’s said never mind, let’s just deliver this. Ridiculous, you see. Of course. Think though about maybe you’re subscribing to online music, you go to Spotify.

Starbucks Gives Away Free Coffee

Adam: I’ve actually heard of an example. So Starbucks, they have their member… It’s like a gift card or whatever. But it’s also their their membership card, and you can put coffees on it. So I guess they have problems with their system going down. And when down, they just give free coffees.

Jan: Absolutely right. Absolutely right. Now, you can extend it to say a music service. So suppose you want to listen to something. Open on your phone, say, “I’m going to subscribe using Apple Pay.” And so the payment has gone through the Apple systems. And then the receipt gets sent to this music service. And it says, “Hey guys, so I have a receipt number 47 from Apple.” And the music service now says, “Okay, well, I’m going to go to Apple and just double check, really, that is the receipt number 47 has that been paid.” Apple service goes, “503, over capacity.”

Now at that point, it’s really reasonable for the music service to say, “Never mind, I’m going to let the user listen to stuff. For the next 15 minutes, I’m going to handout the authentication token that allows access to all this music. But 15 minutes later, the user has to check back because by then I will have checked for sure with Apple and I will know.” So this is actually a really reasonable thing to do.

Now, again, it sort of makes, I guess, the commercial people’s blood go stone cold. But what’s the alternative? If we didn’t do this, then the alternative would have been really annoyed customers. They would have used something, everything would have worked [inaudible 00:22:34], and just because one of our services is down, they don’t get to do what they want it to do. So they pick up the phone or write a tweet. It’s actually better, in a lot of cases, to just trust people, and maybe allow them access for a short duration. [inaudible 00:22:52] industry [inaudible 00:22:52] overly generalizing [inaudible 00:22:55].

But with distributed system, this is now possible. If you were in a monolith, if the payment service went down, because it’s part of the monolith, that’s it, you’re done, nothing works. If it’s a distributed system, you have a better chance of defining some sort of failure strategy. And because the payment system publishes events about what happens to it, you as developers will have a chance of going through what happened, and doing this sort of sequence of events might lead to failure. Which you can use to improve your code, or you can write a predictive mechanism that tells the operations team or tells whoever happens to be on pager duty that says, “Heads up, this is not going to be pretty. We’re still fine, but something’s coming.” And that is fantastic. That is the one thing that that keeps services running, keeps the entire system’s running.

Adam: It’s interesting, because it brings to mind to me even if you have a monolithic app, likely, you already have a distributed system, right? Because you’re calling out to some payment processor, you’re calling out to some database. Even though your application is fully formed, you’re not… I guess what I’m hearing you saying is that you need to model these systems and what will happen when you can’t reach them. I just always assume I can reach the database, if I can’t, whatever.

Jan: Absolutely right. Absolutely right. And the worst thing about it is that at small scale… I don’t mean it in a bad way. But at low scale, it’s quite possible to go to AWS and provision a MySQL instance. And that thing will run forever. [inaudible 00:24:47]. And because it’s just one computer, but as the number of computers increases, the chance of probability goes… I mean the chance of failure goes right [inaudible 00:24:58] something [inaudible 00:24:59] to fail.

I used to say this. Well, we need to build these distributed systems, and we need to treat failures as first class citizens. This is really important. Then I said [inaudible 00:25:13] in the back of my mind [inaudible 00:25:15]. We now happen to have a system that handles significant load, and it happened every day. Every day, some service goes down. Couple of nodes of a database go down, couple of Kafka brokers go down, they get restarted, and it’s fine.

The key thing is because we can reason about this failure, and because we’re prepared for it, it’s fine. We survive it, our infrastructure keeps running, our application keeps running. The users are getting their responses. But internally, of course, we can detect errors and recover from them. So that was [inaudible 00:26:01], truly.

Build the Monolith First?

Adam: What do you think about this argument? I haven’t heard it in a while, but it was very popular, about build the monolith first, and then once you reach these problems, then you start splitting.

Jan: I don’t have a general answer to that. I think a lot of the times you can know what the expected load is going to be if you know… As in you have an existing customer base, and you know that you have a million customers. And so the new service, when it goes live, it will get that load. Now, I think it would be silly in that scenario to say, “No, no, we’re just going to start with this thing that we know isn’t right. And then we’ll see how it goes, how it crashes.”

Adam: Feel free to say it’s a horrible idea. I think it was Martin Fowler, who wrote about this probably several years ago. And I think his argument was, if you try to build this new application as a distributed system, but once it starts interacting with customers, your requirements start changing drastically. And if you don’t know where the change is going to be, like maybe you’ve cut things up in a way, that doesn’t make sense.

Jan: And that was going to be my follow on. If it’s a new application, [inaudible 00:27:31]. Dare I say startup, then it makes total sense to experiment first. But there was another, I think that this was Googlers [inaudible 00:27:43] who wrote it. And it goes along the lines of design data systems and you say, “Okay, I need to be able to handle 10 concurrent users.” Design it for 100, but once you get to 1000, it’s a new system, right? Once you get to 10x the original design, it’s just not going to work. I know 10 is a silly example.

Say you have 100,000 users, design a system that will fold up to say a million. But once your usage grows out to be 10 million, the thing isn’t suitable, it’s the wrong thing. Now, I think this is where Martin Fowler was getting as well. The design choices for something that handles 10 or 100 users might be completely different to the thing that handles 1000. And if the starting position is that you have zero users, well, you’re right, design something, anything. Because as you point out, who knows what these silly users will say when they finally log in and say, “Well, I don’t like this.” And you might cut [inaudible 00:28:48] your system the wrong way you [inaudible 00:28:51] define these consistency boundaries.

That’s the big word. [inaudible 00:28:55] I haven’t come up with a big word for a long time. So consistency value or this context boundary, because we’re in a distributed system, we have… I’m sure people know this [inaudible 00:29:09] cap theory, which says, well, you have consistency, availability, or partition tolerance. You’d like all three, but you can only have two. And as long as one of them is partition tolerance, because you have a distributed system, right?

Pick Two: The CAP Theorem Trade-Off

So partition tolerance. Now pick two, consistency or availability. And each really depends on what you’re building. Now, the choice isn’t usually quite… It’s not binary. It’s not like consistency 1,0. This is stable. But what you’re ultimately dealing with is physics. Maybe you have a data center. Say there’s a data center in Dublin, and there’s a data center in East Coast, US East, Amazon, and EUS [inaudible 00:29:53] in Dublin. It takes time for this electricity nonsense to get from America to Ireland, it just does and there’s nothing you can do about it.

And so if you write a thing to a system like a database in Dublin, it’s going to take, like 100 milliseconds before the signal gets to US East. And there’s nothing you can do about it. That’s just life. So what can you do? You can say well, so if someone else is reading the same row from US East, they’ll see old data, because it’s just not there yet. So it gives okay, okay, okay. So you hate the idea. Absolutely hate the idea. It must be consistent. When I look at this database row, I must see. Wherever I look in the world, I must see the same value.

All right, so you better have this other component, this coordinating guy that adds a tink [inaudible 00:30:52], and says “Yes, this is now…” I’ve heard acts from data sent to every replica that they now have it. And now it’s good. Now I can release these regional locks that people are waiting for. Okay, it’s consistent. What if one of these networks connection… What if this coordinating component goes down? Oh, there you go. You can’t be consistent. So the only thing that the system can do is now provider of service.

Adam: I think even in monolithic applications, I think people don’t think about the fact that maybe you’re consistent within the database. But that doesn’t necessarily. I imagine a scenario, you have a bunch of app servers and a database behind it like Postgres, and everything’s [inaudible 00:31:40] acid, but one user reads a record, and displays it somewhere, and then they change some things. And then your little magic ORM software saves the record, like on Mars back[inaudible 00:31:50]. And somebody else could have the same record up at the same time. So consistency is lost. Sometimes people don’t even realize it, I guess.

Jan: Oh, absolutely. Oh, absolutely. So the more a distributed system, the more you have to deal with these scenarios, the more you have to deal with the possibility that something will be inconsistent. And it can actually be a really… I’ve had a whole bunch of discussions with typically e-commerce people who find it absurd. “What do you mean that I don’t know how much stuff I have in stock?” Well, you do, roughly, but it’s not down to like one unit. So they say, “Oh, no, how about we reserve items?” You can. “How about we lock items?” Yes, you can.

But as the number of users grows, the number of locks or these reservations grows. And so it’s really easy to get to the point where everything’s locked. And no one can actually do anything because you want to be consistent. Because you want to be consistent. You’ve sacrificed availability. And so people say, “No, no, no, no, no, no, no. You can’t have that. I want to be consistent and available.” At which point you say, “Okay, fine. Buy one big, huge machine. And run everything on this one machine.” But even that’s ridiculous, that machine will have multiple CPUs, it will have wires going through it and they will break. So it’s a-

Adam: And the availability is lost if that machine goes down.

Jan: Oh, absolutely.

Verifying a Passport in Ten Seconds

Adam: It’s not pretty. So you had an example about an image processing system you designed?

Jan: I did, yes.

Adam: I’m wondering if you could describe that a little bit.

Jan: Yeah, of course. So this was a really interesting thing. Now [inaudible 00:33:39] little bit more worried about their private information. But nevertheless, this was in the good old days, where people uploaded pictures of their passports freely onto any service they could find. Now, what we did is, it was obviously doing capital processing steps with these identity documents.

Adam: So what was the goal of the system?

Jan: It was to essentially provide biometric verification of the user. So you want to open a bank account, and you don’t want to go to the branch, and the bank doesn’t really want to open a branch for you, because it costs a lot of money. What they would like to do instead is to have the users use their phones to… Log into their phones and take a picture of their driving license or something. And for this system to say, “This is [inaudible 00:34:30] of the driving licenses, this is the real deal. It’s not been tampered with. And the thing that staring into the camera, that’s the same face. It’s alive.”

So that was that. Now, of course, the users, they want this thing to run beautifully, smoothly. And they will give you 10 seconds of attention, maybe. So if the processing really isn’t done within say 10 seconds, maybe 15, if we invent some cool animation, then [inaudible 00:35:03]this would have been a total failure. So this needed to be obviously scalable, needed to be parallel, literally concurrent. Many of these steps needed to be executed concurrently. And a lot of these steps were actually pretty difficult as you can imagine. There is a whole bunch of machine learning involved.

And so [inaudible 00:35:25] the PDF that you might have read or the book that you might have [inaudible 00:35:31] essentially [inaudible 00:35:31] thing that there’s a front service that ingests a couple of images, as in high resolution pictures of documents or whatever else, and it ingests a video of the person’s staring into the phone. And then the downstream services then computes the results. So it needs to be OCR’d. Their face needs to be extracted from it.

We need to check that [inaudible 00:36:00] picture on me. When you’re [inaudible 00:36:05]. We used to do this. I [inaudible 00:36:05] speakers from Eastern Europe, so we used to [inaudible 00:36:08] all the time. I keep hearing that elsewhere in the world, and think the [inaudible 00:36:12] instances, in order to prove that they are allowed to have a drink. And so they take someone else’s driving license, put that picture on it. So we need to be able to detect those scenarios, obviously.

And then combine the result. And then at the end, there is a state machine then react to all these messages that are coming from these upstream components. And as soon as it’s ready, it needs to emit a result. So imagine that you’re at the bank, you want to open a bank account and you say… A question there, you say, “How much [inaudible 00:36:47] or a betting company?” You’re someone who needs to verify identity of a person. And so if I’m a betting company, and I say, okay, if I’m going to play on the Grand National, which is a UK horse race [inaudible 00:37:00] bet before. So I need to register, I need to prove that I’m over 18, so I [inaudible 00:37:06] process. And then in the app I say, “How much do you want to bet?” I say [inaudible 00:37:11] two pounds.

Well. So, the betting company then can have a ruling that says, as soon as I hear from this system, as soon as it emits the message that says, “You know what? [inaudible 00:37:26] and this driving license, it looks like the UK driving license.” At that point, it might say that’s good enough, it’s fine, allow that to go through. If on the other hand, I would say, well, look, I’m going to bet three thousand pounds. They would have to wait for all these messages to arrive. That’s is to say it’s OCI and we checked the driving license and register to verify that it’s the real thing. And we also compared the picture on the driving license with a face that was staring into the phone and yes, it’s the right person.

Now, that is only possible if your system in [inaudible 00:38:09]. It would have been impossible to condense down and to start processing and wait until we have everything this entire process [inaudible 00:38:24] and then send one poll, one result to the gambling company.

Emit What You Know When You Know It

Adam: That makes sense because if we were doing back to our start up generating something quick, I might have some Web server that kind of throws these images into whatever Kafka topic and then just something else that just does everything right, pulls it out and goes through it. So you’re saying the problem with that is the upstream consumer needs to know about specific… It wants to know before it’s done, I guess.

Jan: Yes. And the main challenge is that you don’t know what it wants to know. [inaudible 00:39:04]. And so you say to your customers, look, here is what we can do. You don’t get the feedback on what they might say is, well, that’s nice, but couldn’t you just send us online gambling? Could you just send us the initial processing? That’s good enough for us. Oh, you have to give if client equals equals five that do this and then some other time comes along where [inaudible 00:39:32] but if the betting amount is more than 200, then we end that way. So it’s much better to just admit everything that you know, as soon as you know it from your services.

Now, if [inaudible 00:39:50] we decide together that we’re going to be so good to our customers that we’re going to implement this workflow engine. We can as a separate service.but as a separate service. That’s the [inaudible 00:40:03].

Adam: So we can defer that decision.

Jan: Absolutely.

Adam: So, how does the client? Okay. The client is confusing term here. So there’s a user with a phone, but then there’s your actual client like the bank.

Jan: Right.

Adam: So how are they consuming these results?

Jan: Now, there are multiple choices. The crudest one is we just push to them over a connection. So we say to our client speaks the system that should receive the messages from us. We say, okay, guys, why don’t you have up a publicly accessible endpoint and we’re going to post stuff to you. That’s a horrible way of doing it, don’t ever do that. First of all, your customers will hate it because they’ll say, no. Come on. We don’t want to stand down whatsoever in order to consume your service. That’s a stupid idea, personally. And secondly. It pushes the responsibility of making sure that you’re talking to the right end point [inaudible 00:41:09].

It would be really bad if, say, you had this complicated image processing service and it posted data of private data, biometric data to some other URL through misconfiguration. [inaudible 00:41:23]. So how about we do it? So we say okay, you don’t see lines over a long post. Request and [inaudible 00:41:37], and we will send you chunks of messages as soon as we receive something. Now, that’s a much nicer way of doing it because it just allows the client to control where it’s connecting to. It also means that the client has to check the HTTP on the connection. It has to know and it has to be sure that it’s talking to the right service so we can wash our hands of this whole horrible business.

And it also means that people… The client is in control of its consumption. Now, that’s kind of nice. What’s slightly problematic about it is, [inaudible 00:42:22]. So if it’s the case of like one image, [inaudible 00:42:25]. If it were [inaudible 00:42:29] thousand images per second. That’s a bit tougher. You would want multiple of these connections to be opened and so now we need to think about a way to which connection gets which results, which images. And they need to be looked at carefully. And you also need to think about what you need to be nice to your clients and say, we understand that the connection [inaudible 00:42:55] and you might have to make new connections. Okay, but you need to be able to say things like, I’m making a connection, but I want to start consuming from the [inaudible 00:43:05] 75 instead of the last one. But that can be [inaudible 00:43:07].

So, that’s a neat way of bridging the gap between someone who says we’re [inaudible 00:43:23] says no I’m dot.net, [inaudible 00:43:26]. In which case they sort of [inaudible 00:43:30] connection works pretty well. Of course ideal scenario is you can say, well, we would like your system to expose the [inaudible 00:43:43] interface and we can just mirror that. So, that’s one of the other options that we offer. Now, in the end, this particular system, we ended up just with these HTTP and points.

Don’t Expose Your Kafka to Clients

Adam: So that’s something you do. You do integration with Kafka, like as you’re?

Jan: We also do that because these [inaudible 00:43:59] that looks like it was going to be large enough to warrant that. But in the end, we dropped it.

Adam: Interesting. I can see why it would be easy from your side.

Jan: Oh, yeah, exactly. That’s what we said. We can do this for you, but it’s… I didn’t like it in the end because it exposed too much of our entire workings. It felt like integration through database. Wasn’t quite right. Now you can restrict the topics that are implicated. You can maybe transform them [inaudible 00:44:35] still felt a little bit crummy. So I think it’s better to have [inaudible 00:44:41] completely disconnected interface. And even if you’re talking, say you’re [inaudible 00:44:46] and you publish your messages to [inaudible 00:44:50] and then you would say to your client, oh yes, I see you have [inaudible 00:44:53]. Why don’t we push stuff to your [inaudible 00:44:56]. Again, it feels like you’re letting your dirty laundry out for everyone to see. So there should be an integration service that, it might well consume from [inaudible 00:45:10], [inaudible 00:45:11] and publish through [inaudible 00:45:14].

That’s all good. It might even be a [inaudible 00:45:15] for all I care. But it shouldn’t be a direct connection.

Adam: So that’s interfacing with the externally, so inside of your your micro services world, how do you agree on these formats?

Jan: Formats being [inaudible 00:45:35]?

Adam: Yeah. How do you agree on how you communicate between the various micro services?

Jan: So a lot of it is driven by the environment. So we have one system that’s divided and part of it lives in a data center and part of it lives in AWS. Now, that drives the choice because there’s no [inaudible 00:45:58] in a data center, we actually use in that particular case. And then we publish messages out to Kinesis and the AWS components [inaudible 00:46:08]. What’s actually more important is to think about the properties of this messaging thing and think about the payload that goes on this state.

So, in a distributed system [inaudible 00:46:26] things to have these ephemeral messages, so if the integration between our components, our services is like an HTTP request, there’s no way to get back to it. If once the request is made, it’s made. It’s gone. There’s no way. No record of it anywhere. So what we really prefer is persistent messaging and you can then think of these message brokers as both messaging delivery mechanism as well as journal in the [inaudible 00:47:03] sets so that message boards contains all messages for duration of time, but they’re not lost. So if I publish it, [inaudible 00:47:20] gets the confirmation that, yes, the sufficient number of nodes [inaudible 00:47:26]. It’s there. It sits somewhere. It can be re-read again, it’s persistent.

Now, actually, terms and conditions apply. That’s not exactly what’s happening with Kapka because what happens is if you by default, if you cluster of [inaudible 00:47:44] number of nodes and you can say I want to receive confirmation when the record is published on for a number of nodes. Okay. That still doesn’t mean that the messages written to [inaudible 00:47:55], it just means that for a number of nodes have the message in memory. Now, [inaudible 00:48:01] you’d have to be really unlucky that all of these nodes would fail before they push their memory to desk. And of course, you can say, [inaudible 00:48:13] but then the message rates goes right down, [inaudible 00:48:16].

So, [inaudible 00:48:23], which is a distributed network, file system painted over a number of SSDs. And so we have a cluster of Kafkas that use this [inaudible 00:48:37] which we’re not entirely happy about but we’re not also entirely unhappy about it because it helps. [inaudible 00:48:43]. It wouldn’t be fun.

One Repo to Rule All Schemas

Adam: So, just to make sure that I’m understanding. Your image service has a bunch of micro services inside of it, and they are communicating with each other using Kafka topics. And we kind of assume that that is persistent, even if some conditions apply. So how do you organize it in terms of, message schemas and topics?

Jan: That’s a good question. So the message is we’ve chosen [inaudible 00:49:22] as the format, description or messages. So we’re very, very, very strict about it. And we actually have, this is where people are going to get really angry because one of the things about micro services is that they are independent, each micro service is its own world. It shouldn’t have any shared code and share dependencies. Well, I mean, true, but then reality kicks in. Now what we’ve done? This is where people will get angry. We actually created one separate [inaudible 00:49:53] repository that contains code definitions for all the services. [inaudible 00:49:59] weird. But this one thing about us humans to spot when someone is making a change in protocol.

So if there’s a pull request to this one central repository that contains all the protocol definitions for the services, the entire team is on it and it goes, wait a minute, if we merge this, is it going to break any of our stuff? Because remember, these micro services have independent lifecycle. And so it’s quite possible to have version one point two point three of service A running alongside version one four seven version of Service B. And following the usual semantic versioning rules, these should be compatible. Well, except version is only as good as the programmers who make it. And so this is why we have this repository and we have humans [inaudible 00:50:52], is this going to work? Yes or no?

When we build the micro services, they all build protocol with limitations using this shared definitions repository. So if there’s a C++ version, it makes the purchases between C++ versions of [inaudible 00:51:17]. But then it’s all quote unquote, guaranteed to work together when it’s [crosstalk 00:51:24].

Adam: Is [inaudible 00:51:26] the best solution or what do you think of?

Jan: Now, to me, the best solution is language that has good language tool. All right, so you could say, oh well, we should have used [inaudible 00:51:43] more efficient, it has supporting Kafka, [inaudible 00:51:50] or something else. But what if we were developing the system [inaudible 00:51:56] mature tooling for the languages that we used. [inaudible 00:52:03]. There was good tooling [inaudible 00:52:06] for C++ and there was good tooling of [inaudible 00:52:09] and that made a difference. So, I guess what I’m saying is you’re free to choose which protocol you like [inaudible 00:52:19] as long as the tooling matches your development lifecycle and as long as the protocol is rich enough to allow you to express all the messages that you’re sending [crosstalk 00:52:32].

Adam: That makes sense. So how do you decide about topics? Is each micro service pushing to a singular topic or what?

Jan: Yes. So that’s a good question. Typically each micro service is pushing through multiple topics. [inaudible 00:52:54]. So there is that the basic result topic but then it might have partial results, it might have some statistics. So these are definitely topics that we [inaudible 00:53:10]. I can’t say that one service only ever consumes from one topic and publishes to one topic. Typical service that the system will consume from two topics and publish to maybe three. It is worth mentioning though, particularly with versioning is the way we’ve gone about it as topics for major versions. So there’s like a V1 of images and obviously V1 will be forwards and backwards compatible in all the major versions within the way to make minor versions within the one major version. And if we ever decide to deploy [inaudible 00:53:51] which presumably is completely different because it’s just not compatible, then we would create image/V2 topic for it.

Adam: I mean, it seems to be a theme you’re saying is to make these things explicit?

Jan: Mm-hmm (affirmative). Absolutely right. You have to be able to talk about these things and accept reality. What always scares me the most is when people describe the system, they say we can break the monolith or if we are designing micro services based architecture [inaudible 00:54:24]. We’re going to do this on the other. My first question, all those scenarios is what if it’s down? What if it’s not available? And, so this is very scary, but let’s go back to our [inaudible 00:54:41]. And you have a log in page and you have like a music shop or something, and you would be okay. [inaudible 00:54:51] what you do when your user database is down and someone wants to login?

Adam: Freak out. Get [inaudible 00:54:56].

Jan: Right. There you go. Now, a lot of people will say, well, what can we possibly do? The database is down. But what you can do, option A is [inaudible 00:55:09], option B is okay fine you’re logged in. I don’t care what user name and password you typed in. We’ll take your word for it. We’re going to issue you a token that’s good for the next two minutes. [inaudible 00:55:21]. And then it’s this. Right? And then you say to your [inaudible 00:55:25] and they say, are you ridiculous? We can’t allow that to happen. And then you say, okay fine I hear you. So option B is if the data base goes down, and it will go down. Then your service is down. We hate that as well. And, we do. I know. I hate the database to go down and I’m sure [inaudible 00:55:48] hates the idea of just letting people in, but we have to make a choice. That’s reality.

Adam: It makes sense for a music service, for a bank I think that…

Jan: Absolutely. Absolutely. And this is where the dial is [inaudible 00:56:05]. Right. It’s not a one zero. So, of course, if I can’t verify my online banking identity, I can’t let you in. What if though, let’s say there are a number of banks in the U.K., one is actually [inaudible 00:56:18] massive IT failure last week. Anyway, besides the point, suppose you’re running a bank and you have 2005 code and you do a two factor authentication by text messages. Okay. I am going to log in my username and password or whatever digital of your security code was fine, but the SMS sending service is down. And now it’s a reasonable choice, maybe a possibly reasonable choice, it’s to allow really access.

Adam: Makes sense.

Jan: [crosstalk 00:56:54].

Adam: [crosstalk 00:56:55] that could be a choice you can make.

Jan: Absolutely.

Adam: That’s interesting.

Jan: So you’d have to think about it and actually designing these systems [inaudible 00:57:08]. As we design them, we always say, what if it’s not there? What if it’s unreliable? What if it’s slow? And not treat it as either an academic discussion as in what if the data center is offline.no. No. No. I mean, these mundane, boring things [inaudible 00:57:33]. Okay. Well, someone [inaudible 00:57:35]. The database isn’t accessible. My SMS sending service is down. [inaudible 00:57:41] server is down. That happens all the time. And so that’s the first layer of data and the answer shouldn’t be, oh, well. There has to be something, it has to be a decision. And it’s the implicit “oh well” that causes the most problems.

Adam: “Oh, well” could be a decision I guess. Or I mean, I guess it’s implied you’re saying we need an explicit decision to say [crosstalk 00:58:06].

Jan: That’s exactly [inaudible 00:58:07]. It never be we assuming that. And it takes so much of mental practice because we are used to things [inaudible 00:58:18] and we say select start from users and then if TCP connection denied then most people, myself included, will say, [inaudible 00:58:30]. Seriously, what do you want? But if we don’t say it out loud, if we don’t say that if the database isn’t there, I’m not doing anything, then other people wouldn’t know that we have made this decision. It’s this hidden decision that’s going to come back and bite us.

Adam: I assume also the cutting things up this way into micro services allows us to be kind of more explicit about the various parts failing while others remain. In a monolith, it’s hard to make these.

Jan: Yes. Absolutely.

Adam: [crosstalk 00:59:03] the lines aren’t clear, I guess, between…

Jan: Yeah. Absolutely, it’s people want… [inaudible 00:59:12]. You go with Netflix and you watch stuff and you expect it to happen because after all, you pay $10. [inaudible 00:59:21] world class service. I want files to be ready on [inaudible 00:59:26], ready for me to start watching within a second of pressing play. Because come on, I paid my $10. We are guilty of that expectation. And so I find it bizarre that we would then be at work and say things like, [inaudible 00:59:42] we can do and we’re building a system for a bank, or for a retailer, or for something that actually gets more money than the $10 per month of Netflix.

Why a CTO Should Still Write Code

Adam: So Jan, you’re the CTO of Cake Solutions. What’s that like? It doesn’t sound to me like you’re doing CTO things. It sounds like you’re… Maybe I’m in the weeds of designing distributed systems or.

Jan: Right. I mean, I guess I hate the idea of not understanding what my teams are doing. That just frightens me. And same goes for coding architects and that sort of thing. I think to actually make a useful decision, one really has to know what’s involved. And I think these are complicated decisions. And so it really annoy me if I didn’t know, if I didn’t understand, and if I had to go to meetings where I would hear things that people talk about [inaudible 01:00:49], sounds complicated, I suppose they’re right. That would really frighten me. So I think it’s actually super important for [inaudible 01:01:02] to be within big teams and big companies.

Actually, there was an interesting article I read a while ago that said that [inaudible 01:01:16]. It talks about his time at Microsoft. He said that Bill Gates, that back then, CEO, [inaudible 01:01:26] I don’t know, C-something. Chief. Very important person. Yeah, he said he would dig into technical [inaudible 01:01:33]. He says he remembers the time when Bill Gates was insisting that there is a reusable text editor component. Now, that can only come from really understanding what the hell’s going on in technology. Without it, how would a CTO who doesn’t code, how would a CTO who doesn’t understand any of this insist on having a reusable rich text component? So, yeah, that’s my take on it anyway.

Adam: I mean, I think it’s great. I think that all CTOs should take that on. I think I guess people just get buried by the ins and outs of the job and sometimes…

Jan: And that’s fair. It’s fair and it’s tempting to achieve the optimum density [inaudible 01:02:23]. It’s tempting, but no resist. Honestly, if anyone is listening, don’t do it. It’s horrible. Be interested. There’s so much interesting stuff going on and that goes down to teams that are [inaudible 01:02:40] micro services. Some of the stuff is on the outside boring. Oh, goodness. Who would like to make a PDF report, really? Just the results of it. But what I’m always reminded of that, again, I’m terrible with names. No idea who said this, but when one looks closely enough, everything is interesting. And I think that’s really the case. Even PDF reports can be made more interesting.

I mean, in the darkest of days, we used [inaudible 01:03:13] transformers to make PDF reports. Some PDF report thing is still boring, but we could use this other stuff, which was really interesting. I guess my point is this is the time to be alive. There’s so much technology available to us that I really don’t think that anything can be boring.

Streaming Live Sports at Scale

Adam: Well, that’s that’s a great sentiment. Do you want to say a little bit about Cake Solutions before we wrap up?

Jan: Well, sure. So, [inaudible 01:03:43] and we still may distribute systems but of a year ago we were working for our clients. And then about a year ago, we were acquired by Bantech and the Bantech were subsequently acquired by Disney. And so we now distributed systems. It’s the same stuff, right? We [inaudible 01:04:11] a delivery. So if anyone is a sports fan and if you guys are watching ESPN plus-

Adam: Awesome.

Jan: … And you go that’s the stuff, those are the distributed systems. [inaudible 01:04:25] in 2019, there will be a thing that that will just be the best thing in existence, like sliced bread, nothing.

Adam: Wow, that sounds exciting. I can understand where you’re coming up from scale and if you’re working on ESPN.

Jan: Oh God. Yeah, absolutely. This was the volume [inaudible 01:04:47]. And I was trying to be completely specific about what we do. But you can tell delivering that sort of experience is super important. It’s super important for these systems to deliver content to our users.

Now, I’m going to back in the days when we were making these, quote unquote, ordinary distributed systems for banks and corporates. Well, if e-commerce is not available, it’s annoying. [inaudible 01:05:23]. Imagine, though, there’s a baseball game or football. You’re a fan, you really want to see this, right? And 500, no. Well, first of all, it’s live. So, it cannot happen. You cannot have 500 because where it’s in e-commerce, users were annoyed if the thing went down, now they’re angry, deeply angry.

That was quite an eye opener. But, having this distributed system comprising of these micro services allowed us to think about what will happen. What do we do if something breaks? What’s the mitigation? Because ultimately the motivation is to deliver content. Our users have paid for this and they’re fans. This is their passion. They want to see this video. So, there you have it.

Adam: Awesome. So, thanks so much for your time Jan. It’s been great.

Jan: It was an absolutely pleasure.

Support CoRecursive

Hello,
I make CoRecursive because I love it when someone shares the details behind some project, some bug, or some incident with me.

No other podcast was telling stories quite like I wanted to hear.

Right now this is all done by just me and I love doing it, but it's also exhausting.

Recommending the show to others and contributing to this patreon are the biggest things you can do to help out.

Whatever you can do to help, I truly appreciate it!

Thanks! Adam Gordon Bell

Support The Podcast