Big Ball Of Mud Architecture and Services with Wade Waldron

Introduction

Adam: Welcome to CoRecursive, where we bring you discussions with thought leaders in the world of software development. I am Adam, your host. If you’re anything like me, then learning how to build software in a sustainable way, a way where you don’t continually build up technical debt, and have development slow down as the project gets more complex, has been a career long struggle.

Big Ball of Mud is the title of a paper presented in 1997 at PLoP, the Pattern Languages of Programs conference, and I think it’s super interesting. The researchers went out into the field to see what software architectures were being used in industry. And big ball of mud is what they found, along with six other patterns with names like sweeping under the rug, and reconstruction, which is the throw it all away and build it again and hope it’s better next time pattern.

Anyhow, I think this is a hard problem, evolving software under constrained resources is always going to be a challenge. And we kid ourselves when we don’t admit that it’s hard. Today, I talked to Wade Waldron about how to avoid this situation or how to recover from it.

If you like the show, spread the word, tell a friend, leave an iTunes review or follow us on Twitter. If you’re listening in your web browser, on the website, subscribe to the podcast for a much better experience. Wade, thank you for joining me on the podcast.

Wade: I’m glad to be here.

Adam: So if you were at a dinner party, what would you tell somebody you did for a living?

Wade: I’d probably try to avoid that question. I think it’s usually a little awkward to explain that. But I guess when I do get asked that question, I usually tell people I’m a software consultant. I’m not sure that necessarily explains things very well but I guess that’s the answer they get.

Adam: And I actually got the request to interview you from a listener. And I started to dig into these courses you have on the reactive architecture. And I found them to be very interesting. So I’d like to start with this question. What’s a big ball of mud?

What Is a Big Ball Of Mud?

Wade: It’s an interesting question. One of the challenges I think with that particular question, and one of the challenges with that term, is sometimes it’ll get people’s backs up, I guess, because they hear me talk about a big ball of mud. And they say, “No, no, no, that’s not how I build a system.” I think it’s important when I talk about a big ball of mud like that, that I establish first off that this is a… I consider this a worst case scenario, I do not by any means consider this to be the general case.

So when I talk about a big ball of mud, what I’m talking about is usually a system that has been built in a monolithic way. So it’s been built as a single application, rather than a series of microservices, for example. But I’m not talking about every monolith, not every monolith is a big ball of mud. There are plenty of monoliths that are probably extremely well designed, and extremely resilient and extremely robust.

However, I have worked with monoliths specifically, where they were not really well designed, and were not very robust. And so in those cases, those monoliths those particular systems were built in such a way that there was no clear separation between dependencies. So you had, every piece of the system was depending on every other piece of the system, either directly through the code or sometimes through the database.

And it’s very difficult in those cases to sort of unravel where the clear system boundaries are. Instead, what you end up with is this big ball of mud, where it’s all sort of one cohesive blob, and you can’t really separate it into smaller pieces.

Improving Your Ball of Mud

Adam: If I have a big ball of mud, what should I do next?

Wade: Well, some of that depends on what your goals are, and what that big ball of mud is doing for you. What I would suggest not doing is throwing away the big ball of mud and starting over. That’s something that I think is often tempting. You look at this big ball of mud and you don’t know where to start and you say, “Wow, let’s just build a new thing and replace the old thing.” And then usually two or three years into that project, you realize that you didn’t really understand the old thing to begin with, and what you’ve got is a big ball of smaller balls of mud or something like that.

What I would usually recommend in that case is to start looking for pieces of that big ball of mud that might be easier to isolate. So find a section of the code that is maybe not quite as intertwined as other sections, and start looking for ways that you can disentangle that piece of code. What I would usually do is start by extracting that out into sort of a separate library or something like that, so that you can then basically remove or control the references to that library a little bit better.

Once you have things sort of separated like that, then you can start looking at, “How can I take this thing and potentially move it out of the ball of mud completely? How do I move it into a separate microservice or something like that?” But first, you have to start by finding that piece that isn’t so tightly coupled to everything that the moment you try to move it, it’s going to break. And that can be a challenge, that’s not always an easy thing to do. But you got to start somewhere.

The other thing I would highly recommend is, if you are in the situation where you have that big ball of mud, don’t make it bigger. So when somebody comes in and says, “Hey, we need this new thing,” don’t just jump in and start putting it into the big ball of mud, look at the possibility of, “This is a new thing. Can we build this new thing separate from the ball of mud, so that at the very least we’re not making the problem any bigger? We’re maintaining the existing problem, but moving things in a way that encapsulates them and isolates them better?”

Adam: Yeah, that’s a great suggestion. I think that in my experience, the tricky part with that is sort of what if the piece isn’t that cohesive, like they want this new thing, but it kind of relates to what’s there? Is there a way to… I don’t know, is there a way to do this without just making that new thing, like actually dependent on the monolith, but maybe across process boundaries, but it’s still tied to it?

Wade: There are, and it’s something I have done in the past, depending on the I guess maturity of the development team, how familiar they are with different techniques and things like that, it may not be a problem. Also depends a little on what kind of infrastructure you have in place. But for example, if I use a concrete case that I did, we had an application, which was a big ball of mud. And we wanted to introduce, in this case, it was a rewards program for the company that I was working with at the time.

And with that rewards program, we needed a lot of information that was potentially contained within that big ball of mud. And so what we did, in order to make that a little easier, is we built that rewards system as a separate component. And then what we did is we went into the existing monolithic application. And we found the places where the information we needed was being recorded, we kind of located and isolated those particular pieces of the application.

And then we made the application essentially emit an event. And this is something that comes from event-driven architecture, which obviously we haven’t talked about today, but it’s something that reactive systems quite often focus on. And what you do is you make that piece of the monolith emit an event, which can then be consumed later on by your new piece of the system.

And so that event can be consumed, you can create your own sort of internal representation of the data as necessary, based on what’s coming out of that event. And so now, you don’t have to talk directly to the monolith. Instead, you indirectly consume these events. Those events are probably broadcast through a tool like Kafka or Kinesis or RabbitMQ or something like that. And so you consume those events, and build up your own model based on that. And that’s one way that you can separate yourself from the monolith.
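As a rough illustration of that flow, here is a minimal Python sketch. It is not the system Wade built: the in-memory queue only stands in for a broker such as Kafka, Kinesis or RabbitMQ, and the names `monolith_update_email`, `CustomerEmailChanged` and `RewardsCustomerView` are hypothetical.

```python
import json
import queue

# In-memory stand-in for a broker topic (assumption; a real system would
# publish to Kafka, Kinesis, RabbitMQ or similar).
customer_events = queue.Queue()


def monolith_update_email(customer_id, email):
    """Hypothetical monolith code path with one added step: emit an event."""
    # ... the monolith's existing database write would stay here ...
    event = {"type": "CustomerEmailChanged",
             "customer_id": customer_id,
             "email": email}
    customer_events.put(json.dumps(event))


class RewardsCustomerView:
    """The new component's own copy of the data, built only from events."""

    def __init__(self):
        self.emails = {}

    def consume_pending(self, topic):
        """Drain whatever is waiting on the topic and update the local model."""
        handled = 0
        while True:
            try:
                raw = topic.get_nowait()
            except queue.Empty:
                return handled
            event = json.loads(raw)
            if event["type"] == "CustomerEmailChanged":
                self.emails[event["customer_id"]] = event["email"]
            handled += 1


monolith_update_email("c1", "old@example.com")
monolith_update_email("c1", "new@example.com")

view = RewardsCustomerView()
handled = view.consume_pending(customer_events)
```

The new component never queries the monolith or its tables; it only reads what arrives on the topic and keeps the latest value it has seen.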

Adam: And what’s the advantage of emitting events as opposed to, I don’t know, REST calls or something like that?

Why Use Events?

Wade: So emitting events is using asynchronous messaging generally. And that tends to be a little bit more robust for a few different reasons. One of the reasons that it tends to be more robust, is because if you emit an event, an event is something that you consume asynchronously, and as a result of that it means that you don’t necessarily require all pieces of your system to be active and functional at the same time.

So for example, if in the rewards example that I was giving, if the reward service was down, there’s nothing stopping me from emitting that event anyway. So I emit the event, even though the reward system is down. Now when the reward system comes back up, I can just consume that event, even though I wasn’t alive when the event was emitted. On the flip side, if the system that is emitting that event goes down that doesn’t stop the reward system from continuing to serve any requests that it has to serve. It also doesn’t prevent it from consuming any events that were already emitted.

So that’s one aspect of it that I think is beneficial, it actually allows for more flexibility in terms of what’s active and what isn’t active. There’s other things too, again, because it’s asynchronous, it means you become more decoupled in time. And so that has its own set of advantages. You’re not expecting something to happen right now, you’re expecting it to happen eventually. If you expect something to happen right now, then again, we go back into the failure scenario, where if something fails, for some reason, and you need it right now, you have no choice but to fail the whole process.

On the other hand, if something fails, and you need it eventually, well now you have other options that you can take, you can do retries, you can wait for the information to become available, you don’t have to deal with that problem right now. So again, I think what it does is it allows the system to become more robust and less brittle over time.
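The decoupling in time described here can be sketched with an append-only log and a consumer that remembers how far it has read. This is a simplified model under stated assumptions, loosely in the spirit of how log-based brokers track consumer positions; `EventLog` and `RewardConsumer` are hypothetical names, and nothing is persisted to disk.

```python
class EventLog:
    """Append-only list of events; producers never wait for consumers."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)

    def read_from(self, offset):
        """Return every event at or after the given position."""
        return self._events[offset:]


class RewardConsumer:
    """Tracks its own read position so it can catch up after downtime."""

    def __init__(self, log):
        self.log = log
        self.offset = 0
        self.processed = []

    def poll(self):
        new_events = self.log.read_from(self.offset)
        self.processed.extend(new_events)
        self.offset += len(new_events)
        return len(new_events)


log = EventLog()
consumer = RewardConsumer(log)

log.append("purchase-1")
consumer.poll()

# The consumer is "down" from here on; the producer keeps emitting anyway.
log.append("purchase-2")
log.append("purchase-3")

# The consumer comes back and picks up what it missed.
caught_up = consumer.poll()
```

Because the producer only appends and the consumer only reads from its own offset, neither side needs the other to be running at the same moment.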

Adam: So that is sort of isolating one service from the other one in time, I think, because the event could be sitting in a buffer or sitting in Kafka. Are there other ways that we should isolate services from each other?

Isolating Services

Wade: There’s a ton of different ways to isolate services, I kind of feel like they boil down to a few specific ones. So specifically, I think isolation in time I think is very important. Isolation of state I think is equally important, especially when you build microservices. And so when I talk about isolation of state, what I mean is, microservices shouldn’t share a database.

Now, I want to clarify that statement a bit and say, they shouldn’t share tables and things like that in the database. They may actually all be operating within the same SQL database or Cassandra database or whatever, it doesn’t matter. But they don’t have access to each other’s data through the database. If you want access to each other’s data, then you do that by communicating through the API that that service presents. And that helps to decouple services, which again, can make them more flexible, and that makes it easier for services to evolve. I’ve been in situations, for example, where you lack that isolation and everything depends on the database.

And then you get into this awkward situation where you go, the structure of this particular table is actually kind of awkward, and I want to change it. But these 17 dozen other locations in the application all depend on that table. So if I make a change here, I got to change all those other things. And that’s the kind of situation we want to avoid, we’d like to be able to have the freedom to evolve our database as necessary for that particular microservice, for example.
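A small Python sketch of isolation of state, assuming two hypothetical services, `CustomerService` and `RewardService`, running in one process with SQLite standing in for the real database. The point is only that the table is private to its owner and everyone else goes through the owner's methods.

```python
import sqlite3


class CustomerService:
    """Owns its table. Other services see only the public methods below."""

    def __init__(self):
        self._db = sqlite3.connect(":memory:")
        self._db.execute(
            "CREATE TABLE customers (id TEXT PRIMARY KEY, full_name TEXT)"
        )

    def add_customer(self, customer_id, name):
        self._db.execute(
            "INSERT INTO customers VALUES (?, ?)", (customer_id, name)
        )

    def get_name(self, customer_id):
        row = self._db.execute(
            "SELECT full_name FROM customers WHERE id = ?", (customer_id,)
        ).fetchone()
        return row[0] if row else None


class RewardService:
    """Depends on CustomerService's API, never on its tables."""

    def __init__(self, customers):
        self._customers = customers

    def greeting(self, customer_id):
        name = self._customers.get_name(customer_id)
        return f"Thanks for shopping, {name}!" if name else "Thanks for shopping!"


customers = CustomerService()
customers.add_customer("c1", "Ada")
rewards = RewardService(customers)
message = rewards.greeting("c1")
```

If the `customers` table later needs a different shape, only `CustomerService` changes, as long as `get_name` keeps returning the same thing.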

So, that’s isolation of state. There’s also isolation of space. So in that case, what we’re saying is, microservices shouldn’t depend on the location of another microservice. So this is different from a monolith where, if you have a monolith, essentially everything is deployed as a single unit. And so because of that, you might have your reward service, and your customer service and whatever other services you have. Those are all deployed in the same location. And a monolith is largely dependent on the fact that those are deployed in the same location.

Now, you might have multiple copies of that deployed in different locations. But those copies usually don’t know about each other, any communication happens through the database. So within the individual application, everything is deployed in the same place. With microservices, that’s different, we shouldn’t necessarily require that a microservice be deployed on the same machine as another microservice, for example. We should be able to have the flexibility to deploy them across many machines. And what that gives us is the ability to scale and to be elastic.

So now if we need, maybe our customer service needs 10 copies, and our reward service only needs three copies, that’s fine. We can deploy as many copies of our customer service as we want, because it’s not directly tied to the location of the reward service or anything like that. So that’s isolation of space. And-

Adam: To make sure I follow, so you were talking about this rewards example. We can change examples if you want. We have a rewards service now, we’ve broken it out. And so it has its own database. And then when a user shops or something that would cause rewards to happen, then we would emit events and then… Am I on the right track here?

Wade: Potentially, yeah. So you might have for example, when a customer buys something, you would emit an event, which indicates customer bought something for X amount of dollars, the reward service would receive that. And then the reward service would potentially know that, that amount of money translates into this kind of reward, whatever that happens to be. So it receives an event, which is something like, I don’t know, item purchased or something like that, it receives that and then rewards the events, or sorry, applies the appropriate rewards based on that event.
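A minimal sketch of that handler in Python. The `ItemPurchased` event shape and the one-point-per-dollar rule are assumptions made for illustration; the conversation does not say how the reward is actually calculated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ItemPurchased:
    """Hypothetical event: a customer bought something for some amount."""
    customer_id: str
    amount_dollars: float


# Assumed reward rule, purely for illustration.
POINTS_PER_DOLLAR = 1


class RewardService:
    def __init__(self):
        self.points = {}

    def handle(self, event):
        """Translate the purchase amount into points and record them."""
        earned = int(event.amount_dollars * POINTS_PER_DOLLAR)
        current = self.points.get(event.customer_id, 0)
        self.points[event.customer_id] = current + earned
        return earned


service = RewardService()
first = service.handle(ItemPurchased("c1", 19.99))
second = service.handle(ItemPurchased("c1", 5.50))
```

The purchasing side only states what happened; the knowledge of what a purchase is worth lives entirely inside the reward service.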

Adam: Makes sense. I interrupted you. You were going to another type of isolation, I believe.

Isolation of Failure

Wade: Yeah. So far, we’ve got isolation of space, isolation of state and isolation of time. And then the other one we’ve kind of already talked about a little bit, and that’s isolation of failure. And so that’s essentially just making sure that you have your system isolated in such a way that if one piece of the system fails, it doesn’t bring down the whole system.

So for example, if our reward service fails, well, people should still be able to buy stuff. We don’t want to have a situation where our reward system fails, and therefore we have to say, “Nobody can buy anything anymore.” So we want to isolate those failures, such that a failure in one piece of the system… It might have an impact on another piece of the system, but it doesn’t bring down the whole system.
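One way to picture this in code, as a sketch only: the purchase is completed first, and the call to the reward service is wrapped so that its failure is recorded for later instead of failing the checkout. The functions `checkout` and `notify_rewards` and the `rewards_up` flag are hypothetical and exist just to simulate an outage.

```python
class RewardServiceDown(Exception):
    """Raised by the simulated reward call when the service is unavailable."""


def notify_rewards(order, rewards_up):
    # Simulated remote call; a real one would go over the network.
    if not rewards_up:
        raise RewardServiceDown("reward service unavailable")


def checkout(order, rewards_up, pending_rewards):
    # The core purchase succeeds regardless of the reward service.
    receipt = {"order_id": order["id"], "status": "paid"}
    try:
        notify_rewards(order, rewards_up)
        receipt["rewards"] = "applied"
    except RewardServiceDown:
        # Contain the failure: remember the order so rewards can be applied later.
        pending_rewards.append(order["id"])
        receipt["rewards"] = "pending"
    return receipt


pending = []
ok_receipt = checkout({"id": "o1"}, rewards_up=True, pending_rewards=pending)
degraded_receipt = checkout({"id": "o2"}, rewards_up=False, pending_rewards=pending)
```

The failure still has an effect, since the second customer's rewards are delayed, but it stays inside the rewards feature.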

I think Netflix actually does this really well. And what they have is… I’m kind of making some assumptions here, based on my experience with Netflix, I’ve never worked for them or anything, so I don’t really know. But based on my assumption, and my experiences with them, I don’t know if you’ve ever noticed. But sometimes, if you look at the My List feature, in Netflix, sometimes that disappears. I’ve seen that happen a couple of times in Netflix, where I go to look for My List, and it’s just gone.

And I think what is happening in that case is the microservice or the service that supports the My List feature has disappeared. It’s failed for some reason, or maybe it’s being redeployed, or whatever the case may be, it’s gone, the rest of the application still works fine.

In fact, if I wasn’t specifically looking for the My List feature, I wouldn’t even know it was gone. Because there’s no error message or anything like that. Everything just continues to work, I can still watch videos, I just can’t access the list. And so I think that’s a really good example of how you can isolate failures in your application.

Adam: I heard another example from Netflix, which I’m probably going to get wrong. But when you first go into Netflix, they have like some personalized recommendations. And I guess there’s a service that generates that. But it’s an expensive thing to generate, so it kind of is cached. So when you go in, it’ll start generating, and it will get shown from the cache.

However, like lots of times, they just don’t have that. You haven’t viewed it before, it hasn’t been kicked off. It’s not in the cache. So they just have a default, they just show here’s what we think everybody would like. Like everybody likes the movie Ghostbusters. I don’t know what they put in the default recommendation. But it’s not per se a failure, I guess, it’s an explicit fallback: we’re not assuming this is always here.

Wade: I think that is a representation of a failure of some kind because potentially they could have a situation where, for example, maybe that service is actually unavailable, and they fall back to the defaults. And again, that’s a great way to hide or to isolate a particular kind of failure. So rather than failing completely and saying, “Look, you’re looking for this kind of information, I can’t give it to you.”

What we do instead is we say, “Well, normally we’d give you this really rich, detailed information, but we lack that right now. So here’s the next best thing we could give you,” and give you that instead. Again, I’m not sure if that’s an actual implementation detail that Netflix uses. But it wouldn’t surprise me. And I think it would be a really good example as well of isolating failures.
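The fallback idea can be sketched in a few lines of Python. This says nothing about how Netflix actually does it, which neither speaker claims to know; `fetch_personalized`, `DEFAULT_RECOMMENDATIONS` and the `service_up` flag are hypothetical.

```python
# Assumed generic list shown when nothing personalized is available.
DEFAULT_RECOMMENDATIONS = ["Popular title A", "Popular title B"]


class RecommendationUnavailable(Exception):
    """The personalized result is missing or its service cannot be reached."""


def fetch_personalized(user_id, service_up):
    # Simulated call to a recommendation service.
    if not service_up:
        raise RecommendationUnavailable(user_id)
    return [f"Pick for {user_id}"]


def recommendations_for(user_id, service_up=True):
    """Prefer the rich answer; fall back to the next best thing."""
    try:
        return fetch_personalized(user_id, service_up)
    except RecommendationUnavailable:
        return DEFAULT_RECOMMENDATIONS


rich = recommendations_for("u1", service_up=True)
fallback = recommendations_for("u1", service_up=False)
```

The caller gets a usable list either way, so the page renders whether or not the personalized service is healthy.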

Making Failure Explicit

Adam: Yeah, I had a previous interview with, how do you say his name [inaudible 00:19:09]. And he was saying just making these decisions explicitly is like a big change. Like when we have a monolithic app and like the user service, there’s no user service, the user part is just embedded in the application. So you never have to make an assumption about what should you do if there’s no ability to look up users. But all of a sudden, when you split these things up, you can make these explicit decisions and say, “Maybe we can have a read-only mode if we can’t authenticate this user or what have you.”

Wade: Yeah, and I think that’s definitely true that it’s not always obvious when you’re building existing systems, monolithic systems, things like that. One of the things… I teach a lot of courses as part of my job. And one of the things that I teach in one of my courses is I go through the exercise of breaking out a system into separate microservices, and then with the students, I’ll actually sit down and kind of talk to them and say, “Okay, so we’ve got this series of microservices right? Now, what happens if this microservice fails?”

And sometimes the immediate reaction is sort of like, “Well, I don’t know, I guess nothing works.” But what if we wanted it to work? What would we do if this service failed, in order to allow it to continue to work? I find that’s a really interesting process going through with the students and talking to them about how could we change this system in some way, so that we can tolerate a failure here.

And so then we start to look at things like, what if, instead of making a direct REST call, we emitted events, consumed those events, and created our own internal view of that data? If we do that, then when that external service fails, it doesn’t matter, because we have our own internal copy of that data that we can fall back on.

Yes, the data might not be 100% up to date. But in a lot of cases, that doesn’t really matter, in a lot of cases, mostly up to date is probably good enough. And in most cases, I would say, mostly up to date is better than, “I can’t help you right now. I’m just going to explode.” I think going through that exercise of pointing at different parts of the system and saying, “Okay, now imagine that fails. What are you going to do? Now imagine that fails. What are you going to do?” I think that’s a really good exercise to go through with any system really.

Circuit Breakers

Adam: One potential way to deal with this that I’ve seen go badly is this service sometimes is busy. So if it doesn’t respond in a certain amount of time, I’m just going to retry it. And then we have multiple services, and they start basically knocking something over, it starts to get slow so you ask it again. Have you encountered this problem before?

Wade: I have, I’ve built a system like that, to my own shame, I guess. I built a system years ago, where it would attempt to send messages onto another aspect, another area of the system. And then when that area of the system got busy, we’d end up getting timeouts. And so we’d retry and send more messages to an already busy system, and things would just get busier and busier and busier.

And if you do that enough, then what ends up happening is the busy part of your system just collapses under the weight and everything falls over. So, yes, I’ve definitely encountered this scenario, there is a solution to that. And I think that solution is to be a little more polite, I guess, on the sending end. So rather than retrying over and over and over again until you kill that busy system, what we can do is we can use techniques like a circuit breaker.

And what a circuit breaker does is essentially, any requests go through the circuit breaker whether they’re successful or not. But what happens is, as soon as a request fails, for some reason, it trips that circuit breaker. And so once that circuit breaker gets tripped, what happens is now any requests that come through that circuit breaker just immediately fail and they fail with a message, something like, circuit breaker is open or something like that and so you get this rapid failure.

As a result you could retry as much as you want, but you’re not actually putting any load on the external system. Because the circuit breaker is basically just preventing you from sending those messages on. Then eventually after some predefined period of time, the circuit breaker flips over into what we call a half open state.

And in the half open state, what it does is it allows one request through, just one, not a whole bunch. But what it does is it checks for that one request. And if that one request succeeds, then we go back to normal operation. On the other hand, if that one request fails, then we assume the external service is still unavailable for some reason. And we go back to just blocking all of the external calls.
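The three states described above (closed, open, half open) fit in a short Python sketch. This is a toy illustration under stated assumptions, not the API of Akka or any other library: it trips on the first failure, as in the description, it is not thread-safe, and the `CircuitBreaker` class, its `reset_timeout` parameter and the injectable `clock` are all hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected without contacting the external service."""


class CircuitBreaker:
    def __init__(self, reset_timeout=30.0, clock=time.monotonic):
        self.reset_timeout = reset_timeout  # seconds to wait before a trial call
        self.clock = clock                  # injectable so the demo can fake time
        self.state = "closed"
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast: no load reaches the struggling service.
                raise CircuitOpenError("circuit breaker is open")
            # Timeout elapsed: let exactly this one request through.
            self.state = "half_open"
        try:
            result = func(*args, **kwargs)
        except Exception:
            # Any failure, including the half-open trial, (re)opens the breaker.
            self.state = "open"
            self.opened_at = self.clock()
            raise
        # Success in closed or half-open state means normal operation.
        self.state = "closed"
        return result


def failing_call():
    raise RuntimeError("external service is down")


def healthy_call():
    return "ok"


now = [0.0]  # fake clock so the demo does not need to sleep
breaker = CircuitBreaker(reset_timeout=10.0, clock=lambda: now[0])

try:
    breaker.call(failing_call)
except RuntimeError:
    pass
state_after_failure = breaker.state

try:
    breaker.call(healthy_call)
    rejected = False
except CircuitOpenError:
    rejected = True

now[0] = 11.0  # the predefined period has passed
trial_result = breaker.call(healthy_call)
state_after_trial = breaker.state
```

Callers can retry as often as they like while the breaker is open; each attempt fails immediately inside the caller's own process. Production libraries typically add failure thresholds, call timeouts and concurrency safety on top of this basic state machine.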

What this does is it allows your system to be I guess more polite again to that external system so that you don’t just drive it into the ground. The circuit breakers are something that are… They’re implemented in various libraries. The libraries I work with are things like Akka and [inaudible 00:24:26]. Both of them have built in circuit breakers that you can use out of the box. There’s other libraries that implement them too, though, those are certainly not the only ones.

Adam: You shouldn’t be rolling it yourself at this point, I guess is the…

Wade: There’s no reason to build this yourself. There’s plenty of options out there for leveraging circuit breakers.

Autonomous Services

Adam: So you have a lot of great principles here about making things work over time, making sure state’s not shared. It sounds like or to steal your terminology from your course that the goal is to make these services autonomous, like able to stand on their own, is that a correct characterization?

Wade: That’s definitely one of the major goals, yeah. Autonomy is a tricky thing, I think, because it’s a really nice goal, but not one that we can ever necessarily reach completely. A fully autonomous system would be a very rare system, I think. But the further we can move along that path, the better. The closer we can get to a fully autonomous system the better because that allows for all sorts of really interesting things.

I guess, to provide a bit of a definition, when we talk about an autonomous system, what we’re talking about is a system that doesn’t depend on anything, right? It depends only on itself and nothing else. If you had a system like that, you could deploy as many copies of that as you wanted and there would be nothing preventing it. You would never reach a point where there’s a bottleneck or something that says you can’t deploy any more of these.

That means you could essentially scale forever, it also means you’d be totally resilient to any kind of failure, because again, you can deploy as many copies of these as you want. And if one of them fails, no big deal, you have 50 other copies. So that allows you a lot of flexibility in terms of being… In terms of building a very robust system. Like I said, it’s usually not easy or even necessarily possible to get to that point. So it’s more about moving along that path and going as far along that path as possible.

Adam: I think it must be completely impossible. You have to have user input, for example, I guess we’re excluding user interaction.

Wade: Yes or no. The trick and the reason why I say it’s generally impossible to have this is because in order for you to have user interaction, you need some way for the user to know where all of the copies of the server are, right? And so in order for the user to know where all the copies of the server are, you need some sort of load balancer or something like that in between the user and all of the many different copies.

The moment you introduce that load balancer, it’s not a fully autonomous system anymore, you now have a dependency where the load balancer depends on all the services and the input or the users depend on the existence of that load balancer in order for this to work. I think that’s an example of where I typically say that it’s probably going to be impossible to build a fully autonomous system, because at some point, you’re going to have a load balancer or something interfacing with the user. I can’t think of a system off the top of my head where that wouldn’t ever be the case.

Adam: It’s an interesting game to play. If I were going to make a service and so it has persistent data but I want it to be totally autonomous. So I guess I would just emit things that would get stored in the database by something else, but at the same time, keep everything cached locally within that service. It sounds like a horrible idea, I guess. But it would mean if the database were down, I could just keep emitting these events and use my local cache but yeah.

Wade: I would argue that if you have a database, you’re not a fully autonomous system. And again, that’s where I say that fully autonomous systems are very, very difficult, if not impossible to build. Because if you have a database, well, now you have a dependency. Your microservice, or whatever it is, depends on the presence of that database.

Now you can improve autonomy there. So what you could do, and again, this is something I’ve actually done in the past is I’ve actually built a system at one point where every instance of my microservice had its own database. And so that improves autonomy, because now I don’t have a shared database, I have independent databases for each microservice, and each instance of the microservice. And each one has its own copy of the data and everything else that improves autonomy, but that’s also really expensive.

And so that’s another thing that you have to consider when you do this is the further you move along this path to autonomy, oftentimes things get more expensive. And so there has to be a real value delivered in order to make this worthwhile. I would not, for example, recommend that everybody go out and build systems that create fully independent copies of databases for every unique instance of a microservice, that’s probably more expensive than what most people need.

But it’s something where if you reached a scalability limit, and you realized that you couldn’t go any further because you have this shared database, well, then that might be a place where you could say, “Well, how can we break that coupling? How can we isolate ourselves even more so that we don’t have a shared database? And what benefit would that give us and is it worth it?” But is it worth it is always the key question, I think.

How Big Are Micro-Services?

Adam: So you hit a big question that I have in this area, which is how micro? How monolithic? Are there guidelines that can be used to decide how many services are needed to serve this customer function? Or how do we decide how to cut these things up?

Wade: Part of that for me is the principles of isolation that we talked about, the goal of microservices isn’t necessarily, in my opinion, based on size. I don’t want to be one of those people who says that a microservice should only be 100 lines of code or something like that. I would prefer to say that a microservice should be as small as it needs to be in order to get the job done and remain isolated.

So there’s no… I mean, that’s kind of a wishy washy answer in the sense that I have not given you a concrete answer. Only that I would say, you have to look at your unique use case and say, “Okay, we can make this thing more isolated by doing X, Y, and Z, whatever that happens to be. But is that costing us more than the value it’s delivering?” And so that’s kind of the key thing. We could build our applications so that everything shares a single database, but they all have isolated tables that the others don’t access. That’s better than having shared tables that everybody accesses; that’s more isolated.

So we isolate by creating those separate tables. That’s kind of the first step. The second step might be okay, are there other options within our database? So can we have different schemas, for example, if you’re using like a SQL database, or if you’re using a Mongo database, MongoDB has the concept of databases within your MongoDB. And then Cassandra, I think they call them Keyspaces.

So different databases have additional isolation techniques, and so that’s better. Again, there’s a little bit more of an overhead when you do that. And then probably for most use cases, that’s going to do the job. But then there might be those rare use cases where you really need to scale beyond the ability of one database instance to handle, and then you start looking at, “What if I create a whole nother instance of Cassandra, that’s just there to handle this service, or whatever the case may be.” And at that point, you really have to ask yourself whether the cost of maintaining that new thing justifies the benefit that you get out of it.

Scale vs Complexity

Adam: Is the important distinction for microservices like the complexity of the business requirements and how they interact or the actual scale of deployment and the usage?

Wade: I think it’s both in a lot of ways. So one of the things that I think microservices do really well, is they allow you to isolate complex business logic in one area, rather than having it trapped in your database in a series of stored procedures or something like that, that are used by multiple parts of the application. You can have a single microservice that just deals with this piece of business logic, however complex that business logic happens to be.

What I like about that approach is it allows me to go into that microservice, and sort of forget about all the other things for a while and just focus on that microservice, and that piece of the business. And so I can sort of keep that all in my head without getting lost in the details of what everything else is doing with that.

So I think it is good for isolating business logic and I think that’s very important. I think actually that’s one of the primary things we talked about, how big should a microservice be? I think one of the primary things is to look at specific isolated pieces of business logic. In DDD, domain driven design, we use the term bounded context, look for those bounded contexts. And that’s kind of where you start building a microservice, because it allows you to isolate that business logic.

So I think that’s definitely a very important thing. However, the fact that you’ve created this isolation, and then potentially introduced a new level of autonomy that you didn’t have before, then enables you to scale in ways that you didn’t previously have the ability to scale. So in that respect too, this is something that enables scalability. I think it’s a little bit of both. It’s both business logic and scalability that drives us.

Service Oriented Architecture

Adam: I think the term microservices became pretty hip maybe around 2015 or something. That’s my recollection. And maybe there’s a bit of a backlash now, but a long time ago, like maybe 2005, I remember people talking about service-oriented architecture. Is there a difference? Is it rebranding or?

Wade: I think to some extent, it’s a rebranding, but I would argue it’s not completely rebranding. I would argue that microservices are a subset of service oriented architecture. So service oriented architecture would be kind of an umbrella term that covers microservices but it also covers other things.

One of the problems, I think, that happened with service-oriented architecture is over time, they started building infrastructure around doing service-oriented architecture. So you started getting like these enterprise message buses and things like that. Enterprise service buses, and these would have a lot of functionality built into them. They would do message passing between different parts of the system, they would do message versioning, they would do API versioning, all sorts of different things.

And I think that sort of muddied the water a little bit because people got so focused on these enterprise service buses, which really wasn’t necessarily the original purpose of service-oriented architecture. I think service oriented architecture, the original purpose was really about isolation, which is, again, kind of what I’ve been talking about. But on top of that some people build applications that they would call service oriented architecture, but they build them in a monolithic style.

So what they do is they build a single deployable unit but within that single deployable unit, there are multiple services, essentially. And those services communicate with each other through a single or rather through discrete APIs. So each service presents an API, when another service needs data, it talks to that API, it doesn’t go directly to the database. So they have isolation of state in that respect.

What they didn’t do necessarily is require that those individual services be deployed independently. And I think that’s where microservices are different. Microservices take all of those ideas of isolation of state and providing that API and communicating only through that API. But then they also add the additional requirement that says, and these microservices have to be deployed independently. They’re not deployed as a single unit and I think that’s where the difference is.
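A minimal sketch of the monolithic style of service-oriented architecture described here: two services in one deployable unit, each owning its state, with the reward service reaching customer data only through the customer service's API. The class and method names are hypothetical, chosen for illustration.

```python
class CustomerService:
    """Owns customer data; nothing outside this class touches _customers."""

    def __init__(self) -> None:
        self._customers = {1: "Ada"}  # private state (stand-in for its tables)

    def customer_name(self, customer_id: int) -> str:
        return self._customers[customer_id]


class RewardService:
    """Owns reward data and reaches customer data only via CustomerService."""

    def __init__(self, customers: CustomerService) -> None:
        self._points = {1: 250}  # private state (stand-in for its tables)
        self._customers = customers

    def statement(self, customer_id: int) -> str:
        # Goes through the other service's API, never its storage.
        name = self._customers.customer_name(customer_id)
        return f"{name} has {self._points.get(customer_id, 0)} points"


def build_monolith():
    """Both services are wired together and shipped as one deployable unit."""
    customers = CustomerService()
    return customers, RewardService(customers)
```

State is isolated here, but both services still start, stop, and scale together, which is the part microservices change.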

SOA Prevents Big Balls of Mud

Adam: And it seems like an all right solution, I guess, like deploying these things as separate services isn’t necessary to overcome the things you were talking about earlier. Like having clear dependencies, it sounds like by the services talking to each other through their external APIs they’ve covered that intertwined dependency risk.

Wade: Yeah, definitely. I think the thing to keep in mind is that again, going back to the principles of isolation that I talked about, they cover off the isolation of state fairly well, and so that’s a really nice thing. I think just by using service-oriented architecture, you tick that box, at least to some extent, maybe there’s ways that you don’t. But I think generally speaking, service-oriented architecture does a really good job of ticking the isolation of state box.

Where it falls down a little bit is not service-oriented architecture but that sort of monolithic deployment style of service-oriented architecture, where it falls down is isolation of space. So again, because each individual service is packaged up into a single deployable unit, so all of your services get packaged up into this one unit that you deploy, that means you don’t have isolation of space, you are basically requiring that you have exactly the same number of copies of every service.

And that limits your scalability, and it limits your ability to handle failures. Because you can’t, for example, say, “Well, I want 10 copies of my customer service, and only three copies of my reward service,” you lose that ability, because it’s all packaged up into one deployable unit.
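The arithmetic behind that point can be sketched in a few lines. The function names are hypothetical, and the demand figures are the ones from the example above (10 copies of the customer service, three of the reward service).

```python
from typing import Dict


def monolith_replicas(demand: Dict[str, int]) -> Dict[str, int]:
    """One deployable unit: every service gets the same number of copies,
    which has to be the largest number any single service needs."""
    copies = max(demand.values())
    return {service: copies for service in demand}


def independent_replicas(demand: Dict[str, int]) -> Dict[str, int]:
    """Independently deployed services: each runs exactly what it needs."""
    return dict(demand)


demand = {"customer": 10, "reward": 3}
```

With these numbers the single unit runs 20 service copies in total, while independent deployment runs 13.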

Adam: That makes sense. And there’s a continuous delivery problem, I’m thinking, where if you have these four services all wrapped up in one deployment, and you want to roll out a new version of one of them, you have to switch them all at once. So you can’t have the old user service still there and the new version of the reward service talk to it.

Wade: Yeah, and this is one of the things that I think is really beneficial from a development perspective, is when you are working with that monolithic deployment style, if you’ve ever worked in an application that does that, oftentimes you get into this situation where you get deploy day, “Okay, everybody, we’re deploying today, which means nobody change any of the code because we got to make sure that nothing moves between now and when we deploy.”

And then you’ve also got this problem where people are talking to each other and saying, “Okay, I got my stuff in, did you get your stuff in?” We got to sync up everything before we deploy and that all gets very expensive. But then the other thing too, is that you get into situations where, I need to make a change, and it’s a very small change. It’s a hot fix for a bug that I guess got deployed or whatever.

I want to make that change and I want to deploy that but now we’ve got this problem that maybe my change is small, and we want to deploy just that change but we can’t. We have to deploy all this other stuff as well. And so there’s ways that you can work around that to some extent with branching and things like that. But it starts to get awkward and the maintenance burden of that gets harder. What I like about working in microservices, is it allows you to say, “I want to make that hotfix to that bug and deploy this service.” I don’t actually depend on anything else so that’s fine. I can make that change to just that one service, deploy it and I’m not going to have to worry about what everybody else changed in the meantime.

Breaking Changes

Adam: The one thing that I think maybe can be worse is when you want to change a service and the things that utilize that service, like when it was all a monolithic application, it could be a single commit, I guess the rollout is a bit trickier, perhaps but now when it’s multiple things, if you need to make a breaking change, how would you handle that?

Wade: I think you’re right, I think the rollout of changes that affect multiple services is arguably harder in a microservice based approach than it was in a monolithic approach. I don’t think there’s too many people that would argue that.

So yes, I absolutely agree, I think that is harder. Again, there are certain deployment techniques that can mitigate that to some extent, like you can do blue-green deploys, for example, where you still deploy each service individually, and you deploy each service as some number of copies, but you kind of deploy them all into an inactive cluster. And then you flip over from the active cluster to the inactive cluster.
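The flip from the active cluster to the inactive one can be sketched as a tiny router. This is an illustration only: the `BlueGreenRouter` class is hypothetical, and each "cluster" is modelled as a plain function rather than a real set of deployed services.

```python
from typing import Callable, Dict


class BlueGreenRouter:
    """Sends all traffic to one cluster and can flip to the other in one step."""

    def __init__(self, blue: Callable[[str], str], green: Callable[[str], str]) -> None:
        self._clusters: Dict[str, Callable[[str], str]] = {"blue": blue, "green": green}
        self.active = "blue"

    def handle(self, request: str) -> str:
        return self._clusters[self.active](request)

    def flip(self) -> None:
        """Make the previously inactive cluster the active one."""
        self.active = "green" if self.active == "blue" else "blue"


def old_cluster(request: str) -> str:
    return f"v1:{request}"


def new_cluster(request: str) -> str:
    return f"v2:{request}"
```

The new versions are deployed into the inactive side while the old side keeps serving, and a second `flip()` acts as the rollback.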

There’s ways to sort of mitigate that but it is more complicated, I think, is what it boils down to. I guess the way that I would suggest you mitigate that problem and then the way that I have done this is you support the old API. So if you need to make a change to an API of one service, and you have another service that depends on that API, when you make the change to the API, you have to support the old API as well, that is harder than it was with a monolith. With a monolith, you would just make the change to the API, deploy everything at once, and you wouldn’t have a problem.
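One way to picture supporting the old API is to keep both versions live and have the old one delegate to the new one. Everything in this sketch is assumed for illustration: the routes, the response shapes, and the tier rule are not from the conversation.

```python
from typing import Callable, Dict

POINTS = {1: 250}  # stand-in for the reward service's own storage


def get_points_v2(customer_id: int) -> dict:
    """New API shape: a richer response (the tier rule is made up)."""
    points = POINTS.get(customer_id, 0)
    tier = "gold" if points >= 200 else "basic"
    return {"customer_id": customer_id, "points": points, "tier": tier}


def get_points_v1(customer_id: int) -> int:
    """Old API shape, kept for callers that have not migrated yet.
    It delegates to v2 so there is only one implementation to maintain."""
    return get_points_v2(customer_id)["points"]


# Both versions stay routable until every dependent service has moved over.
ROUTES: Dict[str, Callable] = {
    "/v1/points": get_points_v1,
    "/v2/points": get_points_v2,
}
```

Dependent services can then migrate on their own schedule, and the v1 route is removed only once nothing calls it.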

So this is one of the things where, when you make the move from monoliths to microservices, you’re going to get a lot of benefits, you’re going to get some disadvantages as well. And so it’s a matter of figuring out for your particular use case, do the advantages outweigh the disadvantages? I think when you’re starting to talk about things like scalability, resiliency, things like that, if you’ve got a system that has to deal with millions of users or terabytes of data all the time, then we start to get into the situation where yes, we probably do want to make that sacrifice.

On the other hand, if you’ve got a system that’s dealing with 15 users an hour or something like that, and very small amounts of data, this might be a bit much, you might not need this kind of resiliency and this kind of scalability.

Adam: I think that’s why sometimes you were mentioning before, like people having pushback when you say something about their monolithic app, because it’s probably started small and delivered a lot of value. And then grew, and grew and delivered more and more value, and along the way, they’re always taking these little steps to make it better.

Startups and Unknown Requirements

Wade: Yeah, I think so. In a lot of cases, you get people who jump in, and they’re a startup initially, right? And when you’re a startup, you don’t have any users. But then over time that user base grows, but maybe you get a few hundred users, a few thousand users, up to a few million users, whatever. At some point, your application starts to break down because you didn’t build it under the understanding that you would have that number of users.

I think that does happen. And I think that is one of the things where you start to get pushback is when a startup is first jumping in and they’ve got no users or a very small number of users, doing a lot of this kind of stuff might be really expensive, and really time consuming, and not worth it to be perfectly honest.

Now, the thing that they have to consider obviously is, so we don’t have any users right now but where do we want to be in a year? If our goal in a year is to be at 10,000 users or whatever it is, are we going to be able to support that given the infrastructure that we built? If our goal is to be at 10 million users are we going to be able to support that given the infrastructure that we built?

And so I mean, obviously, I guess everybody wants to be at 10 million users, but being realistic about it, is that a likely scenario? And so it’s about figuring out again, how much is worthwhile right now, is it worth going through all the effort right now so that we can be prepared in a year for when we get to the scale that we want to be at.

Adam: And the tricky thing I think in that startup mode is maybe you don’t really know what the future is going to hold, because you need to get input from customers. So this reward example, you may have an idea that a reward system is a good idea, but you may actually build it and nobody uses it, and then want to remove it. So I think that’s why sometimes, “We’ll just add it to the existing code base because we don’t know if it’s a thing yet,” like this is just a proof of concept to see if users engage with this feature.

Wade: Yeah, and I think that’s okay, depending on how you do it. Again, if you’re in a situation where you’ve got this big ball of mud style of architecture, then I think at that point, you really have to be more careful, you shouldn’t just make the ball of mud bigger. That’s not to say necessarily that you can’t add the new functionality into your existing monolith, maybe just to save on deployment hassles and things like that.

But what you should do in that case, if you’re going to add it to that existing monolith, you should add it to the monolith in an isolated way. And so that means kind of talking along the lines of the service-oriented architecture style of monolith where you create the reward system inside your monolith but you provide an API, and every other part of the monolith that needs access to the data goes through that API, they don’t go directly to the database.

So then your reward system has its own isolated section of the database that it’s fully in control of, and nobody gets to talk to that database, they just go through the API. What you’ve done now, though, is you’ve put yourself in a position where if it turns out this reward thing does turn into a big deal, now what you can do is you can say, “Well, we’ve already got the tables and everything isolated, nobody’s accessing those tables, except through the API, the API is already defined, it’s clear, it’s consistent, whatever, let’s just pull that API out into a microservice.”

And now we can do that, we can pull it out into a microservice without a whole lot of hassle. And now we can start playing with the scaling options that we’ve talked about already. And so that gives you the flexibility to do that. The key is, again, don’t make the existing problem worse, always look for ways to make it better than it was before.
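A sketch of why that extraction stays cheap: callers depend only on the reward API, so an in-process implementation can later be swapped for a remote client without touching them. All names are hypothetical, and the network call is modelled as an injected `send` function rather than a real HTTP library.

```python
import json
from typing import Callable, Protocol


class RewardApi(Protocol):
    """The only thing the rest of the monolith is allowed to depend on."""

    def points_for(self, customer_id: int) -> int: ...


class InProcessRewards:
    """Reward system living inside the monolith, owning its own data."""

    def __init__(self) -> None:
        self._points = {1: 250}

    def points_for(self, customer_id: int) -> int:
        return self._points.get(customer_id, 0)


class RemoteRewards:
    """Same API after extraction. `send` is a hypothetical transport that
    takes a JSON request string and returns a JSON reply string."""

    def __init__(self, send: Callable[[str], str]) -> None:
        self._send = send

    def points_for(self, customer_id: int) -> int:
        reply = self._send(json.dumps({"customer_id": customer_id}))
        return json.loads(reply)["points"]


def checkout_banner(rewards: RewardApi, customer_id: int) -> str:
    """Caller code: unchanged whether rewards is in-process or remote."""
    return f"You have {rewards.points_for(customer_id)} points"
```

`checkout_banner` never learns which implementation it was given, which is what keeps the move to a microservice from rippling through the callers.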

Adam: And I think that gives a great segue to your opinions on this hexagonal architecture. So if we’re building a single app, like how do we build it in such a way that the dependencies are not tangled?

Hexagonal Architecture

Wade: Yeah, so hexagonal architecture I think is a really interesting thing, something that I use heavily when I build my own applications. And what hexagonal architecture does is it sort of divides your application along clear boundaries. And so you have kind of at the center of the application, you have your domain, and your domain is like, basically, your business logic, it’s all the things that are critical to the operation of your business, the rules that are associated with that business, the decisions that you have to make, all of that kind of stuff falls into your domain.

At the outside, the very outside edge of that, the very outside edge of your system, you have all the infrastructure, you need to make the system work. And so that’s things like your database, your user interface, if you’re using any kind of messaging platforms, your messaging platforms will be out there. Any of the technology that enables you to make your application work, those kind of fall into the infrastructure category.

And what hexagonal architecture does for me, is it allows me to make very clear distinctions between what is domain and what is infrastructure. And so, essentially, what you do is you say, “Okay, within the domain, I’m not allowed to have any dependencies on infrastructure.” So my domain doesn’t know what kind of database I use, it doesn’t know I’m using SQL, doesn’t know I’m using Cassandra, it doesn’t know whether I have a REST API or user interface based on a website or something like that. It doesn’t know those things, those are all infrastructure.

All it knows is things like when I get a request to reward a customer, because they purchase something, this is how many reward points I will give, based on the amount of money that they spent. That’s a business rule. So what hexagonal architecture does is by forcing you to say your domain can’t depend on your infrastructure, it forces you to introduce layers of isolation, that then enable you to make interesting decisions later on.

So for example, you need stuff out of a database, I mean, that’s going to happen at some point, but you don’t need to know what kind of database it is or what that database looks like. You just know, for example, in the reward system you need reward points. So you know that there is an API that you can call that gives you reward points, you build an interface or something in your application that does that, then you have an implementation of that within the infrastructure that says, “This happens to talk to SQL,” or it happens to talk to Cassandra or whatever.

Now that you’ve done that, you’ve created that separation between the domain which just says, “I need a way to get reward points,” and the infrastructure that says, “I get reward points out of the database.” Now that you have that separation, you can start doing interesting things like saying, “Okay, I realize that the database representation that I used here was actually very inefficient.” So I’m going to rewrite that database representation. None of my domain code changes, because your domain code is still just getting reward points. You’re only changing that infrastructure layer and so that allows for a lot of flexibility.
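The separation described here can be sketched as a domain rule that depends only on an interface, with two interchangeable infrastructure implementations behind it. The names and the one-point-per-whole-dollar rule are assumptions made for illustration, not details from the conversation.

```python
import sqlite3
from typing import Dict, Protocol


class RewardPointsRepository(Protocol):
    """What the domain needs: a way to read and write reward points."""

    def get_points(self, customer_id: int) -> int: ...

    def set_points(self, customer_id: int, points: int) -> None: ...


def reward_purchase(repo: RewardPointsRepository, customer_id: int, amount_cents: int) -> int:
    """Domain rule (hypothetical): one point per whole dollar spent.
    Nothing here knows what kind of storage sits behind the repository."""
    total = repo.get_points(customer_id) + amount_cents // 100
    repo.set_points(customer_id, total)
    return total


class InMemoryRewardPoints:
    """Infrastructure: keeps points in a dictionary."""

    def __init__(self) -> None:
        self._points: Dict[int, int] = {}

    def get_points(self, customer_id: int) -> int:
        return self._points.get(customer_id, 0)

    def set_points(self, customer_id: int, points: int) -> None:
        self._points[customer_id] = points


class SqliteRewardPoints:
    """Infrastructure: keeps points in a SQL table."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS reward_points "
            "(customer_id INTEGER PRIMARY KEY, points INTEGER NOT NULL)"
        )

    def get_points(self, customer_id: int) -> int:
        row = self._conn.execute(
            "SELECT points FROM reward_points WHERE customer_id = ?", (customer_id,)
        ).fetchone()
        return row[0] if row else 0

    def set_points(self, customer_id: int, points: int) -> None:
        self._conn.execute(
            "INSERT OR REPLACE INTO reward_points (customer_id, points) VALUES (?, ?)",
            (customer_id, points),
        )
```

Swapping `InMemoryRewardPoints` for `SqliteRewardPoints`, or changing the table layout inside the latter, leaves `reward_purchase` untouched.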

I’ve done systems that use hexagonal architecture, where, for example, I have changed the underlying table structure of something in order to make it more efficient, by basically just rewriting one class. And that’s the class that’s accessing the database, what we would call a repository in domain-driven design terms. I’ve changed the implementation of the repository, the domain code stayed exactly the same.

I’ve also changed to a totally separate database. So I’ve gone for example, from MongoDB, to a SQL database. And again, the domain code didn’t have to change, nobody using that service had to know that that change was made. No other services had to change because everything’s isolated in state.

I’ve gone further than that, though because on the flip side of that, if your infrastructure layer says I am operating through a REST API, and I’m making calls into that domain, that domain presents sort of a clear API that says, this is how you talk to that domain. Now, what I can do is I can do things like say, “Okay, well, originally, I had a REST API, and it made these calls into this domain.” But now I don’t want a REST API. Now I want an event driven system, well, it just makes the same calls into the same domain.

So you can add additional endpoints maybe a REST endpoint and then an event-based endpoint and then maybe later on a user interface based endpoint, they’re all talking to the same thing. They’re all talking to the same domain. And so you can make those kinds of changes, you could potentially do things like rewrite the entire domain. And as long as that interface that you’ve provided, as long as that API to the domain remains consistent, you don’t have to change anything on the infrastructure level. So there’s lots of flexibility that comes when you do this properly.
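The inbound side can be sketched the same way: two different entry points, one REST-style and one event-style, both calling the same domain API. The request and event shapes below are invented for illustration, and no real web framework or message broker is involved.

```python
from typing import Dict, Tuple


class RewardDomain:
    """The domain API that every entry point talks to."""

    def __init__(self) -> None:
        self._points: Dict[int, int] = {}

    def reward_purchase(self, customer_id: int, amount_cents: int) -> int:
        # Hypothetical rule: one point per whole dollar spent.
        total = self._points.get(customer_id, 0) + amount_cents // 100
        self._points[customer_id] = total
        return total

    def points_for(self, customer_id: int) -> int:
        return self._points.get(customer_id, 0)


def rest_adapter(domain: RewardDomain, method: str, path: str, body: dict) -> Tuple[int, dict]:
    """REST-style entry point: turns a request into a domain call."""
    if method == "POST" and path == "/purchases":
        total = domain.reward_purchase(body["customer_id"], body["amount_cents"])
        return 200, {"points": total}
    return 404, {}


def event_adapter(domain: RewardDomain, event: dict) -> None:
    """Event-driven entry point: the same domain call, a different trigger."""
    if event.get("type") == "PurchaseCompleted":
        domain.reward_purchase(event["customer_id"], event["amount_cents"])
```

Adding the event adapter required no change to `RewardDomain`, and either adapter could be removed without the domain noticing.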

Adam: It’s very interesting. And seems to have a lot of principles that are great for keeping the dependencies from being too coupled to each other. One thing I didn’t understand about it is, I don’t understand why it’s a hexagon. I saw a picture of it, there’s a hexagon in the middle, it says domain, but at the six sides, I don’t really understand where the six sides come from.

Wade: Yeah, to be honest, I’m not [inaudible 00:53:15] question either. When I first started learning hexagonal architecture, I was introduced to it with three different names. So I was introduced to it with the name hexagonal architecture, which I found very confusing for the same kind of reason that you expressed, why is it a hexagon? I was also introduced to the concept of ports and adapters, which is another name for it. And then I was introduced to it as onion architecture as well.

In some ways, I think onion architecture represents my understanding of it better, which is you have these different layers you have, so the inner layer is the domain. Outside of that you have what you would call the API layer and then outside of that you have the infrastructure layer. And the dependencies in these layers go from the outside in. So infrastructure depends on API, API depends on domain, but never the other way around. And I think logically in my head, that makes sense. I’m not really sure why the original hexagon.

Adam: I’ll figure it out. And I’ll put a link somewhere. But yeah, I think what you said makes sense where it makes it easy to put in different implementations, which could be various sides, perhaps.

Wade: Yep.

Adam: I saw here on your Twitter, it says that you’re a science fiction author.

Science Fiction Author

Wade: I would say I’m a wannabe author to some extent of a little bit of science fiction fantasy. So yeah, I do a bit of writing on the side. Nothing published. But I’ve written one novel, which I’m kind of in the final stages of polishing up before I maybe start farming it out to publishers and working on other projects here and there.

Adam: That’s awesome. Who’s your favorite author right now?

Wade: Favorite author right now is Brandon Sanderson definitely. He’s written a number of books. I think my favorite by him is the Mistborn trilogy, which is absolutely a fantastic series of books, which I would highly recommend to anybody, if you’re interested in fantasy at all.

Adam: I’ve never heard of it. I read a lot of science fiction, but not fantasy as much. I’ll check it out, though.

Wade: Brandon Sanderson dabbles in a little bit of science fiction. He’s primarily a fantasy author, but he’s had some short stories and things like that, that are more science fiction oriented, I think. I would say I probably read more fantasy, but I do read a little bit of science fiction here and there as well. I think my favorite science fiction book, actually that could be a tough one. It probably is between Frank Herbert’s Dune and Orson Scott Card’s Ender’s Game would be kind of my top ones.

Adam: Yeah, those are both great books. At some point, I read all the Frank Herbert books. Herbert, Herbert I don’t know. And I love them. They’re great. So much detail in his world that he created.

Wade: Yeah, to me, Dune is kind of the science fiction equivalent of the Lord of the Rings [inaudible 00:56:31]. That intense world building?

Adam: Yeah, that’s definitely true. Well, before we wrap up our talk, I wanted to say that your course that you’re building is really great. I went through quite a bit of it. And I liked the structure. I love watching tech talks. But the thing I liked about your structure you have is there’s a talk portion, there’s questions, there’s answers. Makes it a little bit more engaged than just watching like a several hour talk. I thought it was great.

Reactive Architecture Courses

Wade: Yeah, I think that was one of the things that we really focused on when we were building the course was a couple of things. One is, everybody learns differently. So some people learn by watching, some people learn by listening, some people learn by reading, some people learn by answering questions and things like that. So we wanted to sort of hit as many of those different learning approaches as possible with the course.

But the other thing too, is I didn’t want the course to be something where you can just sort of like, put it on in the background and tune out and not really pay any attention to it. I do that all the time, I’ll start listening to something and I sort of wander off and don’t pay attention to it. I wanted this to be something where you come out the other end, and you have actually absorbed the information. And so that sort of necessitated the introduction of the questions. We also try to find ways to use the questions as a bit of a learning experience as well.

Adam: Well, the thing I liked was it takes a case study approach somewhat with this reactive barbecue.

Wade: Yep.

Adam: It just made me want barbecue, to be honest.

Wade: We do in-person training of this same course, it’s not exactly the same. But we have an in-person version of it. The exercises are all very different, much more interactive, obviously. But one of the things that I did during one of the teaches, I think about a year ago, is I spoke to the organizers of the conference where I was teaching the course. It was the Reactive Summit and I spoke to the organizers and specifically said, “Hey, can we organize some sort of barbecue meal during the course at some point,” particularly because we were teaching in Texas.

So it was sort of like, okay, we’re teaching the reactive barbecue in Texas. I mean, come on we have to have barbecue at some point. So they came through and we indeed actually had a nice barbecue meal the one day, so it was really nice for that.

Adam: It would have been funny if actually the barbecue ordering site went down during the process because it fits right into your case study.

Wade: Yeah, absolutely.

Adam: Maybe not funny when you’re hungry. All right, Wade thank you so much for your time. It’s been a lot of fun.

Wade: No, it’s been great. And you mentioned the course, I think at this point, we have three pieces of the course out but we’ve got another bunch coming. So keep your eyes out, I guess for the rest.

Adam: Yeah. And actually, let’s just touch on that. What are the three courses you have so far?

Wade: So the three courses are basically the first one is kind of an introduction to reactive architecture. The second one is domain-driven design. And then the third one is all about building reactive microservices.

Adam: Awesome.

Wade: And so that’s part of one training path on the IBM Cognitive Class. So the training path is the Lightbend Reactive Architecture Foundations. So we’re going to be launching another training path shortly, which will include another three courses.

Adam: That’s great. I’ll put a link in the show notes for this episode.

Wade: Yeah, great. That would be awesome.

Adam: Well, that is the show. I would like to thank everyone who helped share the last episode with Philip Wadler. It got some great attention on Reddit’s r/programming, where there were lots of interesting comments and critiques. If you made it this far, you must have enjoyed the show. So tell a friend about it, mention it online somewhere, whatever you can do, it helps grow the show. Talk to you next time.

CORECURSIVE #022