CORECURSIVE #030

Rethinking Databases

Noria With Jon Gjengset

Rethinking Databases

Jon Gjengset

Can we make databases faster and remove the need for caching reads in an external cache? Can we make a distributed SQL based relational database that outperforms memcached? Jon Gjengset and the PDOS team at MIT CSAIL have done just that with Noria.

Today I talk to Jon about Noria, about building a database in Rust and his efforts to teach people intermediate Rust via live coding sessions.

Jon was great to talk to. He really was able to explain to me how Noria is able to do what it does and where it is in terms of maturity. The key, besides rust and evmap, is that Noria uses materialized views to do query optimization ahead of time, on write. The devil is in the details though, of course. And the details, in this case, are turning declarative SQL into a dataflow program that handles cache updates on new writes.

Transcript

Note: This podcast is designed to be heard. If you are able, we strongly encourage you to listen to the audio, which includes emphasis that’s not on the page

Introduction

Adam:Hello. This is Adam Gordon Bell. Join me as I learn about building software. This is CoRecursive.

Jon:I think it has also made me a better programmer because in some sense, it’s almost like pair programming, except with a hundred other people, and that is really fun, but you are right, it is also intimidating. I think what’s needed is to accept the fact that you’re going to get stuck and you’re going to make mistakes, and then realize that the people watching, what they learn the most from is watching how you deal with problems in your code.

Adam:That was Jon Gjengset. He is one of the leads behind an exciting new database. He and the other members of the MIT CSAIL team have built something pretty cool. Externally, it looks like a database with some extensions for materialized views, but inside, they have rethought how a database works leading to some pretty exciting performance numbers. The whole thing is built on Rust, and we’re going to talk a little bit about that towards the end of the podcast.

If you enjoy the podcast, I would love it if you would spread the word to your colleagues or friends who you think might enjoy it. Also, we have a Slack channel you can join. Recently, there was some interesting discussions of the J programming language. User KR007 had some interesting thoughts about it, and we also have been talking a lot about Unison web after interview number 27.

So, Jon, thank you for coming on to the CoRecursive Podcast.

Jon:Well, thanks. I’m happy to be here.

What Is Noria?

Adam:So, I have actually a printout because I’m a bit of a luddite here of a paper that you contributed to or lead on. I’m not really sure about Noria. Could you tell me what Noria is?

Jon:Sure. So, noria came about as we observed that databases are a little bit weird for a lot of the applications that people use them for. Usually, web applications and the like are very read-heavy, and databases do a lot of work on read. They had to execute all these select statements and queries, but when the workload is primarily reads, you want the reads to be fast and that’s not what databases today are currently written for, and so people end up sticking a cache in front of the database and we figured, “Why not have that cache be maintained by the database instead, which knows about how your queries work?”

What Is A Data Flow Engine?

Adam:Interesting. Interesting perspective. So, in the paper, it says that it’s a data flow engine, I believe. What is a data flow engine?

Jon:Yeah. So, we had to play a few tricks to get the database to keep your cache up-to-date, right? The idea is that whenever new rights come in, they cause the state that’s been cached in the past to change in some way. So, we wanted a system that can capture that, capture the effect that rights have on your cache.

Data flow is a nice conceptual model to use for this, where you think of data entering the system at the top of a graph, if you will, near the root, and then the changes that that data cause as a result of whatever operators are in your query propagate through the graph all the way down to the caches that are maintained at the leaves.

So, data flow lets us capture that notion of rights are propagated through a bunch of operators, and in the end, they have some effect on the cached state that you keep in your cache.

Adam:So, it’s like the old joke of about naming and cache validation, being the two hard problems in computer science. So, what you’ve built is an inference thing for finding out when you have to invalidate a piece of cache.

Jon:In some sense, yeah. What the system does is you name your queries in advance, you give the database, “These are the queries that my application cares about,” and then the database uses those queries to figure out how the cache has to change and respond to different types of rights.

So, it constructs, essentially, a data flow program that knows what to do with any single right and how that affects your cache rather than you having to manually right up the code for that in your application.

Adam:So, it has a declarative model. Actually, using SQL has the declarative model of the state of flow.

Jon:Yeah. That’s entirely right. So, the Noria takes normal SQL queries, prepared statements, and turns them into these data flow programs. So, it basically, as you say, it infers how to compute changes to your cache using the information from your SQL queries. It basically understands how your reads are related to your writes.

Adam:So, this isn’t the type new database that has its own language that I need to use instead of SQL?

Jon:No, not at all. In fact, we very explicitly wanted this to feel very much like interacting with a regular database because one of the problems you have with all these research databases that exist today is they require that you change your application pretty substantially in order to make use of them, but people have lots of application code written already, and it would be really nice if they could take that existing code and just remove all the cache maintenance stuff and just have the database do that for them, but apart from that, it should feel like interacting with a regular database.

Cloning Hacker News

Adam:You have this really neat example about, I guess, making a clone of a hacker news website. I wonder if you can describe how the database helps in that case.

Jon:Sure. So, we’ve taken this website called Lobsters, which is basically hacker news, except that the source code is open source. We also managed to get a bunch of analytics about the workload that that site sees. This let us construct a benchmark where we both have the correct access patterns, but also know the real schema and the real queries that that application issues because we can look at the underlying source and actually run the real application.

This let us construct the benchmark that contains real queries, issues that’s the real rates that a normal application would see, and then benchmarked Noria using that workload, which is essentially a way to measure what a real application that use Noria instead of what it was previously using, what kind of speed up would that see if they were to use Noria instead.

Caching

Adam:So, if I understand the example, it’s like you have a table of news articles, people can submit to it, and you have a table of users, and people vote on the stories. So, where does the caching come in?

Jon:Yeah. That’s right. So, it’s similar to a Reddit hacker news, all those sites. Basically, they use pretty much the schema you would have guessed at if you had to guess, and the caching comes in, actually, at a number of layers. So, Noria caches not just the final results, but also intermediate values, and it does so cleverly. It caches things that are useful to cache.

For example, it would cache the current vote count for every article, and then it would also cache the current front page based on what those vote counts are. Then if a new vote comes in for an article, it will update the vote count for that article and also update the front page if necessary.

Adam:So, if I just had my SQL database, I would have these two tables, and then I guess I have to aggregate all the votes to count things. I’m just making sure I understand this. So, the optimization I might want to do because that’s a join and then a sum and it probably ends up being a bottleneck. Is that I just every once in a while get these totals and actually cache them and then actually show the first page off that is how I’m thinking how I would make it without your database, right?

Jon:Yeah. That’s exactly right. So, that is, in fact, what Lobsters does, except they take it slightly further and they keep a column in the stories table that is the number of votes, and whenever someone votes for a story, they do an insert into the votes table, and then they also do an update to the stories table.

So, that way, getting the front page is you select from stories, order by votes descending, and then limit. Except, it turns out that the computation of how popular a story is is not just the number of votes, it’s also computed from the number of votes on comments on that post, their penalties apply to certain tags that stories can have, then the number of comments is important. So, the computation of the hotness of a story is actually fairly involved.

The way they capture this is they do a transaction on the database whenever anyone votes for or comments on the story, where they do this full querying all the things that are relevant, computing the updated score, doing a transaction that updates the story’s hotness, which turns out to make rights really slow. Anytime you vote, you have to do all this work.

Adam:That’s fine because most people are reading. I assume that’s the trade off they’re making.

Jon:Exactly. So, they have, in some sense, made the same observation that Noria makes, which is you should do this work on write rather than on read because the reads are more common.

Adam:So, what’s your solution to the problem?

Jon:So, in Noria, your select would be exactly that thought you had initially of doing a join and a sum. That is the query you would write. There’s no other columns you add. You don’t think about caching at all. You just write the straightforward join and sum query, and behind the scenes, Noria will recognize that, “Oh, I should cache the sum and I shall also cache the front page view,” and it will incrementally keep those up-to-date whenever a vote comes in or if a comment comes in.

So, you describe the full read you want to do, including how to compute this hotness in your SQL queries, and then Noria will construct the appropriate data flow, and that will, whenever write comes into any relevant table, it will figure out how that write, whether it’s a new comment, a new vote, whatever, it will figure out how that affects the current score, update the score in place, and then update the front page if necessary. So, all of that happens automatically internally in Noria.

Adam:Do I have to tell it ahead of time what this front page query looks like?

Jon:Yeah. So, Noria does rely on knowing the queries in advance. However, you can always give it a new query or a new set of queries, and then it will dynamically adopt the data flow to also take those queries into account. So, it’s not as though the queries have to be static. It’s just that you do need to declare a query before you run it. The easiest way to think about this is probably something like prepared statements, where you prepare a statement and then you execute it a bunch of times.

Adam:It sounds a bit like a materialized view.

Jon:Yes. In fact, it basically is a materialized view. Well, Noria, actually, does implement materialized views. That is a technically accurate statement. Where it gets tricky is that materialized views have a bunch of restrictions as they exist in many systems today that Noria tries to address. The first of which is that materialized views. You want them to be incrementally maintained. You don’t want it to be so that whenever any right comes in, you throw away the entire materialized views and compute it again, which is what you would get with something like create materialized view in Postgres or something that’s trigger-based.

Refreshing the Cache

Adam:Yeah. You have to refresh it in some frequency, and it throws it all away.

Jon:Exactly. Noria does not do that. It does incremental view maintenance. Now, incremental view maintenance is also known in the database literally, where Noria innovates here, is that Noria also has the ability to make state partial. So, what this means is if you imagine a website such as Lobsters, you don’t want to keep the vote counts for every story ever around all the time. That will be pretty inefficient. You want stories that are years old. You don’t really need to remember anymore. If someone asks for them, then you can compute it at that time, but you don’t really need to keep them all around all the time.

So, what Noria lets you do is it only keeps in memory the things that have been asked for recently, and then it has the ability to evict from the materialized views. This is not something that any existing materialized view system that we know about can do. They are all full materialized views. They are never only keeping some of the state or the state that the application cares about.

Adam:How does it decides? So, it sounds like an LRU cache mixed with the materialized view, I guess. How does it decide what’s in there to start with?

Jon:Yeah. That’s a good analogy. It pretty much tries to apply some kind of caching strategy. Currently, it uses randomized eviction, which is a halfway decent approximation of LRU, but over time, one of the things we want to look at is how to incorporate more advanced caching strategies that are not just randomized eviction.

In theory, there’s nothing preventing us from doing pretty severe or pretty significant analysis of the access pattern of the application to figure out what things to throw away and what thing to keep.

Dealing With Large Views

Adam:One potential problem with materialized views that it sounds like maybe you found a solution for is they could just be really large. If I’m joining a number of tables and I have this fully expanded result, it’s just too big to store or it just becomes expensive to read because it’s so large that it’s on disc.

Jon:Yeah. Exactly. This is one of the reasons why the ability to have partial materializations is so important because you just can’t realistically store everything, and especially when you have things like joins that might cause some kind of a combinatorial explosion in the amount of state necessary. So, this is why Noria will only keep the state that you have asked for recently, and then the other state it will just get rid off.

The way many existing systems try to deal with this is they provide windowed operators, like windowed joins or windowed state usually based on time. While those are fine in some context for analytics, it’s not great if you have a web application where you really want your reads to represent all of the data.

Windows, Indexed Views and SQL SERVER

Adam:Definitely. Are you familiar with the SQL server index views?

Jon:Yeah. So, SQL server index views is something we’ve looked at a little and they’re a little interesting because they do try to provide something similar to what Noria does. They do have some limitations, though. So, first of all, they are full materializations. They are not partial. Second, from doing some performance analysis, they seem to be relatively slow and you have many rights. So, if you have rights to different keys, then they still effectively end up, we’re not sure whether it’s locking or recomputing the entire materialized view, but it does not scale well in writes.

It also seems as though it does not scale well in reads on those tables, and especially when they collide with writes. It’s almost as though everything gets sequentialized in some way.

The other restriction that materialized views or index views in SQL server have is that there are number restrictions on what you’re allowed to do in a query that you create an index view over. They have this whole documentation webpage on if you want an index view, here are all the things your query has to do. It includes pretty severe restrictions like you cannot use other views inside of a view.

Adam:Yeah. So, it’s been a long time, but at some point years ago, I tried to use these index views as a solution for something, and I found that those restrictions you’re describing, they made it not a feature that I could use. It’s something like after looking at your work, it seems like just they’re not able to reason beyond certain types of structures to know how to update this index view.

Jon:That’s exactly right. We’ve tried to figure out why that is, but, unfortunately, the internal design of SQL server isn’t something we really have access to. We hypothesize that either they’re doing essentially some trigger magic behind the scenes in which case they would effectively be recomputing the materialized view every time or maybe they’re doing some perky triggers to make it slightly more efficient. It could also be that they’re using something the research literature calls delta queries. They’re basically a way to refresh a view more efficiently. So, it’s a way to query for only the things that have changed since last time you queried, but in order to compute these delta queries, you essentially have to compute derivatives over SQL queries, which is pretty hard and you can only do that analysis for certain queries.

Whereas for us, we’re not really doing that. We’re not trying to figure out how to compute only the changes since last time we asked. Instead, we’re doing just a regular forward computation for every write, which just turns out to be easier to construct, at least that’s been our experience.

The Delorean and Back to the Future

Adam:Yeah. It sounds like the DeLorean or something, right? What you’ve built looks like a database on the outside, but, in fact, it’s something different on the inside. I don’t know if that metaphor makes any sense, but-

Jon:Yeah. I think it actually makes a lot of sense. This is something we’ve struggled a little bit with when trying to present this work is people will either say, “Look, this is just a database,” when we say, “No. Really, it’s not,” but it looks like a database. In fact, Noria has this MySQL compatibility layer. That means that you can have an application that just uses normal MySQL client libraries, whatever off the shelf thing exists for your language, and you can just hook it up to Noria and it will work without you making any other modifications. So, it really just looks like a database, but, of course, as you observe, it isn’t.

Then we have the other side of the coin, which is people who say, “This is really just a data flow system,” where, sure, it really is a data flow system under the scenes, but in contrast to basically all existing data flow systems, it is specifically designed for the caching database use case, where you care about state being partial, you care about not doing windowing, you care about being able to shard efficiently, you care about supporting SQL. In fact, you also care about having something that can change so you can add new queries without having to start the whole thing again, which is generally what you have to do with existing solutions.

Adam:So, what is an existing data flow solution, an example?

Jon:So, there aren’t any that are directly trying to solve the same thing Noria has. There are a number of systems that are similar, though. So, Naiad is the first one that comes to mind. Naiad is a data flow system, where the user constructs the data flow by essentially writing a program that then gets compiled into something that performs some computation, and while it’s generally geared towards things like graph computation, where you have a lot of cycles, it works for any kind of data flow program you want to make.

Naiad makes a number of assumptions about how you’re going to use it. So, for example, it assumes that the computation isn’t going to change while you’re running. It assumes you don’t really have reads. There isn’t a good way to expose a state that is cached, that someone external to the program can read. Also, Naiad tries to provide very strong consistency guarantees, and that comes at a cost, especially when you look at more distributed settings.

Whereas Noria says, “We’re going to provide the same consistency as a normal cache, which means that sometimes your read will be a little stale, but we assume that that’s okay for applications.”

Adam:Consistency is an interesting one. So, does that mean, is any read, like if I’m not using your materialized view, if I just write two table and read to it, can I read my writes immediately?

Consistency - Read Your Writes

Jon:So, in Noria currently, there is no synchronization between reads and writes. So, if you do a write and then you subsequently do a read, you may or may not observe that read. What we guarantee, though, is that every write will be applied exactly ones. So, you might read and then not see your result, and then you read, and then you see your result. Your read will not then at some later point go away. That cannot happen. You will also never see it applied twice.

Adam:That makes sense.

Jon:Now, that said, we are looking at ways in which we can give things like read your own write, so slightly stronger consistency guarantees, and we have a scheme that we don’t have fully implemented yet, but that is in the works that provides that stronger consistency for users and queries that need that.

Adam:So, if I have my derived or my materialized view of these votes that we were talking about, what type of lag is there between writes and reads, writes to underline votes, let’s say, and this summation table?

Jon:In general, it should be very fast. It should just be however long it takes to write the write over the network and then updating an in-memory counter and then passing it along. This is because Noria doesn’t try to provide strong consistency guarantees. It also has to do very little internal coordination. So, processing writes happens in a very streaming fashion, where there’s a little bit of batching internally, but you should observe the read within millisecond scale.

Adam:Oh, wow!

Jon:Of course, this depends a little bit on how large your graph is. So, if you have a query that is very complicated, that has lots of nested joins and aggregations, that, of course, it will take longer for the write to propagate through those operators. So, you will get the read whenever all of the operators have finished computing.

In some sense, you can compute the delay between you doing a write and that being represented in your reads by just adding together the computation costs of the operators that the write has to go through in order to reach the view you’re reading from. So, in fact, if you think about it, you might have two views that both will eventually represent your write, and it might be that the write is visible in one way sooner than it’s visible on the other because one was a much simpler view to compute.

Views of Views of Views

Adam:Can we have views that are derived from other materialized views?

Jon:Absolutely. Essentially, you just write create view SQL statements and Noria will create that view for you. It will decide whether or not it thinks that you should be materialized, but that’s not really a decision that the user has to think about. They can just write their queries and Noria will choose whatever materializations are appropriate for that query, such that the reads will be fast.

Propogating Updates

Adam:I don’t know much about database internals, but I’m imagining I write something and there is a materialized view and because of the join behavior, that change needs to be in many rows of the materialized view. So, is it possible for me to read it and see that change only partway through?

Jon:No. You will actually always see any given write either fully reflected or not at all reflected. The way this works is that when a write enters Noria, so think of this as an insert into the base table. That is logically one update. That update is then propagated down the operator data flow graph. So, imagine if we just join. That join is then going to do a lookup for the key and that update that you wrote on the other side of the join. That gives back a number of records that happen to match that join key. All of those records are going to be used to produce a single new update that has all of the outbound changes. So, if you add a vote for, let’s see, what is a good example of this?

Adam:Maybe if the username was on the final result and I changed my username, but I had multiple stories submitted.

Jon:Sure. Okay. So, let’s take that example. So, you have a stories table that has a join with users on the author and you want to display the username in the final stories. In that case, what you would do is if you change your username, what will enter the base table is you will do an update on users, set name equals your new username, where ID is whatever your user ID is. That will enter the graph as a single update that contains two logical updates. Removal of your old username and the insertion of your new one, which is basically replace.

Those will then flow to the join as a single update. That join is then going to do the lookup into these stories table based on the user ID. It will find any stories that you have written. So, then it takes all the stories that it founds and performs essentially the join with the old records, so your old username and your new username, and it emits a single update that has negative records or revocation records for all of the stories that you have written with your old username, and positive or insertions for all the stories with your new username.

Adam:Oh, it’s like a diff.

Jon:Yeah. Exactly. It’s exactly like a diff, and that diff is then propagated as a single update down to the cache at the bottom, which will then be applied to that cache in a single atomic operation. So, it will remove all the old records, and add all the new ones, and when you do a read, you will either before that update is applied or after.

Adam:So, it sounds like I would need to not be able to read while that was going on is my thoughts.

Jon:Yeah. Exactly. In fact, we’ve put a lot of work into trying to make the views that the user can read from in such a way that writes and reads can happen concurrently while we still preserve this guarantee of atomic application of updates. The way we do that is using this neat data structure that essentially keeps two maps and all the writes go to one and all the reads go to the other, and then the writer swaps the two maps whenever it’s done applying an update.

Reading Snapshots - EV Maps

Adam:So, the reader is always reading off a snapshot of a couple writes back? Is it conceptually?

Jon:Yeah. That’s a good way to think about it. In fact, you can give basically perfect consistency by saying that the writer is going to swap the maps after every update. So, if you do that, then the reads will always see the latest state or, of course, you can trade this off for performance by saying, “We’re going to swap only every five updates,” and now you’re amortizing the cost of these swaps. So, there’s some atomics involved there, and the readers will now see slightly more stale reads, but you have amortized the cost of the swap.

Adam:So, this is a map with a tunable consistency level, sort of.

Jon:Exactly.

Adam:What’s it called?

Jon:This data structure is called evmap for, eventually, consistent map. That is what we use internally Noria for the leaf views precisely as you say so that you can choose what consistency guarantees you want and you can trade off performance for consistency.

Adam:That’s very cool. Do you expose it off at the database level that this is tunable or not yet?

Jon:Currently, we don’t. So, currently, we always refresh the map or swap the map after every update, so the reads get the strongest consistency we can give them. There are stills of the propagation delay for writes, but beyond that, we try to expose the latest updates or the latest state at all time, but it is something we’re looking into where you could imagine that the user can choose how fresh they want this view to be, and then it should be pretty easy for us to incorporate that into Noria to then say, “Okay. We will only then swap this often.”

Adam:So, is it because there could be a queue of writes? So, it’s not immediately consistent because there could be this backlog of writes even though you’re syncing it after every write.

Jon:Yeah, sort of.

Adam:It’s okay if I’m wrong.

Jon:No, no, no. So, you’re on the right track. The way to think about it is you do a write and then that write has to be processed, right? The write has to go through some count or a join to produce the diffs we talked about earlier. That takes a bit of time. When you do a write to Noria, though, we reply to you saying the write has finished the moment we have stored it on disc and sent it into the data flow.

So, there’s this time period between when the write is accepted by Noria and when it’s reflected in your reads. So, that is that propagation time. Of course, that means that it is, eventually, consistent even if once the write reaches the leaf view, it is immediately exposed.

10x Speed Up!

Adam:What are we talking about in terms of performance numbers here as compared to MySQL in this example we were discussing?

Jon:It really depends on the workload. In the paper, we both measure a simplified version of Lobsters where all you have are stories and votes and nothing else. This is basically so that we can have a micro benchmark, where we can test lots of different approaches and see how Noria compares to each of them. So, here, we compare against materialized view in a commercial system against MySQL, against Memcached and a bunch of similar systems. Then we also compare the full Lobsters example against just MySQL. In the Lobsters case, what we see is about a 6x to 7x speed up using Noria.

Adam:Oh, wow!

Jon:In the micro benchmark, we see a speed up of … Oh, I don’t even know if I have the number in my head, but it’s more than 10x-

Adam:Oh, wow!

Jon:… compared to both the commercial system and MySQL. In fact, because of this evmap, we end up outperforming Memcached for reads, even though Memcached doesn’t do any … That’s just key value, right? It doesn’t execute any queries. It doesn’t maintain the cache for you. It doesn’t persist anything. Noria does all of those things and still outperforms Memcached.

Beating MemCached

Adam:Yeah. That doesn’t sound right. I mean, Memcached, isn’t it just some memory map to some value and you’re reading and writing from it? How do you beat it? I don’t know.

Jon:So, the way we beat it is actually purely because of evmap. Memcached has a bunch of internal locking to manage their memory, basically to manage writes and reads for the same key. That ends up costing them whenever you have very high amounts of load. Whereas in evmap, the reads never have to take a lock. The reads always is read. They do an atomic read, an atomic increment, and then they go ahead and do the read. They never have to wait.

Adam:Wow! That’s impressive. Do you think … Is this a new data structure and known data structure? Will other databases be taking this idea up, you think?

Jon:It’s a little hard to say. I haven’t looked enough at the literature to say that this data structure is new, but it is an interesting trade off because it is really targeted for this, eventually, consistent case. It also comes at a bit of a cost in terms of memory use because you now need to keep essentially two indices over your data rather than just one, right? You keep two maps. Now, you can duplicate the actual values, but the keys you still have to duplicate, so you are duplicating the index.

CQRS and Kafka

Adam:I see. Well, there’s a couple of things. You’re presenting this as an improvement to having a MySQL database and then some cache. Another thing it sounds like maybe on a bigger scale, something like CQRS or even something like people use Kafka in a certain way in which they … I’ve heard it called turning the database inside out, where all your writes go into some queue, and then each materialized view, in fact, is a service that receives those and then actually writes out the database structure that is needed.

Jon:Yeah. That is a very similar way to think about the problem that Noria takes, specifically. So, we think about it as turning the database upside down, where instead of doing the work on read, you do the work on write. These Kafka-based systems do essentially the same thing. They think of writes as just being insertions or a queue, if you will, of writes, and then they think of some queries instead of reads. So, the queries are operations that are executed by some service, and that service might choose to process the writes a bunch before you get to do that read. That is exactly the same bottle Noria follows.

Adam:It’s crazy to think because you’re doing it just based on declarative statements, where with the Kafka system, you have to build a service and case streams and whatever. You guys are inferring that from a declarative statement. It’s impressive to me.

Jon:Oh, yeah. A lot of pain went into trying to get from the declarative queries we get from SQL to a data flow progesterone that actually faithfully executes that SQL statement or, in fact, sequence of SQL statements. It turns out that we’re essentially doing the same thing a database query planner is doing, except that we are doing it for the long-term. So, rather than doing it for a query when you’re about to read, we do it across the set of all queries that we know about.

So, this is also a research problem that has been studied in the past called multi-query optimization, and we’re doing the same thing. We’re taking all of the queries that we know the application cares about and we’re trying to construct a single program that jointly satisfies and serves all those queries.

Adam:You have more information as well because that’s the advantage knowing these things ahead of time, you can do this.

Jon:Absolutely. We take full advantage of that. This is how we are able to do things like infer what indices to add because we know that the user is going to query by. We know what the free parameters of the parameterized SQL query is. So, we can infer what things will be looked up and when those things are looked up, what other things have to be looked up through join keys and the like. We can use the queries to infer what kind of sharding we should do, what internal intermediate materializations we should add. So, we have a lot of information based on having the queries in advance.

Scaling to Multiple Nodes

Adam:So, you mentioned sharding. Does this scale beyond a single server?

Jon:Yes. So, Noria, actually, works in a distributed setting in the same way that data flow systems generally work across settings. So, Noria works in a distributed setting, but it does so at a somewhat naïve way currently. It basically follows the same strategy that some existing data flow systems do where anytime there’s an edge between two operators implying that the output of one operator goes as input to the other operator. That edge doesn’t have to be on the same machine. That edge can be a TCP connection instead. So, you can partition the operator graph any way you want.

In addition, Noria has the ability to infer how to shard a given operator or a clique of operators and run shards of that clique of operators on different machines, and then doing shuffles before and after that stage if, for example, you want to aggregate by some other column that doesn’t match the sharding key.

Adam:The first part of that, does that mean each materialized view would live on a single server? Is that the unit of division?

Jon:Yeah. So, if you partition the graph, then you would end up with different materialized views on different servers. If you, in addition, shard those operators, then you would end up with one shard per machine. That said, because Noria can choose for any given shard or any given clique of operators, where to place them? It can also place them on different cores on the same machine. So, this is how Noria gets multi-core scalability, too, is that you can’t execute a single operator on multiple cores, but you can shard that operator and run it across multiple cores or multiple machines, whichever you prefer. Similarly, you can partition the graph and run the partitions in parallel either on different machines or on different cores.

How Big Is This Database

Adam:Wow! So, this thing does a lot. How long did it take for this to be constructed? Let’s start there.

Jon:Yeah. It’s definitely a bit of a biz at this point. We started this project in the end of 2015. So, it’s been going for a while now. It’s now up to, I think, last time I counted we were at around 70,000 lines of code. Although that’s including documentation, testing, SQL parser, MySQL front end, so a bunch of other somewhat auxiliary things, but I think the core of Noria is 40,000 to 50,000 lines of code.

Is Noria Ready for Primetime

Adam:Yeah. So, it’s not a small research toy. It’s a real database. Would you encourage people to use it, to actually use what you’re saying as a drop in replacement for some MySQL system?

Jon:So, I think those are two very different questions. For the former, I think I would say yes. It is a much more real system, I think, than many research systems tend to be in the sense that we strive really hard to be able to execute real application queries. We want the MySQL shim, for example, the thing that lets you use existing MySQL client libraries means that in theory at least, you can just run a normal application and it should just work, and that is something we strive to provide.

In terms of using this in practice, though, I think there are a number of things that you care about in production that we haven’t cared about in the research. So, examples of this are things like being able to choose what you use as your underlying persistence layer.

So, currently, we use RocksDB on a single disc, but you could totally imagine that someone wants to integrate this with their existing MySQL database or run it on GFS or something like that.

Similarly, we don’t have very good support for some more esoteric SQL constructions, like if you have joins where the join condition is not a quality or if you have things that require range indices, we don’t currently support or even more. So, there are things like the SOUNDEX SQL operator, which checks whether two strings sound the same when pronounced in English. We just haven’t added support for that, but it’s also not particularly interesting from a research perspective.

Adam:So, what programming language is Noria implemented in?

Jon:So, before we go to that, I’d like to add one more limitation that I think is important. So, there’s one other thing that I think prevents you from using Noria in production today, at least if you want to run it on multiple machines. That is the fact that it currently doesn’t have a good model for fault tolerance.

Currently, if one machine goes down, we essentially throw away anything that is running in the data flow graph at the failed operators or below in the graph, and that includes any materialized state. You can always recompute that state from the base tables, and that would be equivalent to running the read queries again in a database setting, but that might take a bunch of time.

Similarly, if you have any other failure, even if it’s controlled, like you want to restart your server, all the materialized views start out empty. So, we are working on ways to add more dynamic fault tolerance to Noria, but that is not something we currently support.

Adam:So, how did you implement it?

Jon:Noria is implemented in Rust, the new, well, I guess it’s not new at this point, but Mozilla’s “systems programming language”. When we started working on Noria back in 2015, Rust was relatively new. This was not too long after the 1.0 release of Rust.

Working in Rust

Adam:Do you regret starting so early on Rust?

Jon:Actually, quite to the contrary. I am very happy that we chose Rust over pretty much any of the alternatives that were available to us. I think if we had started writing it in C++, I think we would not be able to have a code base of 70,000 lines of code today because we would be too bogged down in just getting stuff to work in the first place. This is a highly concurrent system, and writing those in C++ is pretty painful.

We could have chosen to write it in Go, but with Go, you have some other problems like debugging concerns. Debugging concurrency is pretty hard, even though concurrency itself is easy. We also want pretty low level control of memory, which Go, you can make Go let you do that, but Rust is better tailored for implementing your own data structures at a lower level.

Adam:I’m familiar Rust has the borrow checker and the ownership model. Were those helpful in building this?

Jon:Yeah. I think there’s certainly been a love/hate relationship with the compiler, which I think a lot of people develop with Rust when they first start out. I think it takes a little while to get used to the borrow checker because it’s a part of the programming process that people aren’t used to dealing with or they’re used to dealing with it after the fact. They realize later on that they were trying to modify a value they weren’t allowed to. They’re not used to the compiler yelling at them when they try to do it in the first place. So, that definitely took some getting used to, but I think, subsequently, we found it to really help us avoid a number of bugs that would have been very nasty to track down.

Unsafe & Borrow Checker

Adam:I had a previous interview with Jim Blandy, who wrote a great book about Rust, and I think one of the things he said was if you’re using Unsafe, which is the way to turn off the borrow checker, you’re probably doing it wrong. So, is it all Unsafe everywhere or what’s your strategy around when Rust gets in the way and when you can ignore its concerns?

Jon:So, I think people overestimate how bad Unsafe is, and I think they also overestimate how often you need it. In Noria, we have very little unsafe code. I would have to go back and check, but from memory, it’s on the order of tens of lines.

Adam:Nice.

Jon:What we do have is we depend on some libraries like evmap, which do have some Unsafe internally, but even the evmap, that only has, I think, seven Unsafe lines of Rust. So, you only really need it when you’re doing something where you know about an invariant that the compiler can’t check. That is really what it comes down to. If you have pointer to some piece of memory and you know right now because of some other bookkeeping you’re doing that no one else is currently accessing that memory, the Rust compiler has no way of checking that it’s safe for you to mutate through that pointer.

So, when you’re using Unsafe, you’re basically telling the compiler, “Look, I have checked this manually.” It’s pretty rare that you have to do that. I think in the beginning when you don’t really know why the borrow checker is yelling at you, it’s tempting to use Unsafe to circumvent what it’s trying to make you do, but, usually, that’s not the way to go about it. Usually, there’s a safe way you can use instead, but learning what those safe ways are might take a little while.

Adam:So, you’re like, “Don’t use Unsafe, but I am far enough long, I understand the rules in this particular narrow case. I can do it without exposing a memory safety issue.”

Jon:Sort of. I mean, I’m also very hesitant about using Unsafe because I know that that is where you can shoot yourself in the foot. However, writing Unsafe is basically like writing any C code, right? So, it still is an overall win. The way I think about it is if I think I have to use Unsafe, I first try to figure out a way in which I could do it slightly less efficiently or maybe not even less efficiently at all without using Unsafe. Only when I’ve convinced myself that I really need Unsafe, then I sit down and go, “Okay. If this is in fact going to be Unsafe, I’m going to have to convince myself that I’m really not doing anything Unsafe here,” and that usually leads to a page full of comments that explains exactly why this one line of Unsafe is okay.

Rust and Distributed Databases

Adam:So, you are a part of a research group that’s building distributed systems, if I understand correctly. In the past, there’s been distributed systems built like, obviously, C++, but some more recent ones like Cassandra is it’s the JVM, ZooKeeper is JVM-based, ETCD is Go, I think. So, where does Rust fit in the future of building distributed systems, distributed databases?

Jon:I actually think Rust fits really well into that ecosystem. There are a couple of reasons why. The first one is that what a lot of these distributed systems end up with in terms of time spent during development is in trying to debug concurrency issues. Rust just fixes that for you. Well, it is true that you can still have bugs in Rust software, absolutely, and you can still have raises if you use Unsafe or you use some library incorrectly, they’re just much rarer, at least empirically I found when you use Rust.

The second one is Rust has this nice intermediate where you get low level control over memory if you need it, but the language still feels pretty high level when you work with it. You get things like generics. You get enums, algebraic data types, which are very nice, gives you options and results. Error management is quite nice. The asynchronous I/O story for us I think is getting really good now. So, I think the language has a lot to offer in that space.

In fact, to add to that, as you mentioned, I’m in the parallel and distributed operating systems group at MIT, and slowly but surely, I’m getting the other people on my floor to give Rust a shot, and it’s now starting to take over in a number of the projects there, not all. There’s still like if your writing a Kernel, you probably still, to some extent, want a lower level language, at least if you want to modify existing Kernels, but Rust is now being used for things like writing low level networking stuff. It’s being used for code generation. They’re considering using it for porting a RPC, a new research RPC system. So, it really is starting to get some traction in that world.

Adam:It sounds like because of you. Do you see yourself as an advocate for Rust?

Jon:Well, I suppose to some extent that’s true, although I think the way it came about for me was more that Rust is the first language in a long time where I find myself enjoying the language. I have found that I actively want to take old projects I’ve written and rewrite them in Rust, which is a weird desire to have.

So, I really as though Rust has a lot to offer, and that means that I end up talking about it, but it’s not really to evangelize in any way. It’s just my experience is that it works well for these problems.

Live Streaming Rust

Adam:Yeah. I totally get that. You do live streaming of yourself doing Rust coding.

Jon:Yeah. That’s right. So, I started this last year. It came about because some of the Rust core team ran the Rust survey that they do every year. In it, a number of people said that what they really felt was lacking in the Rust ecosystem was resources for intermediate level developers, people who either come have a lot of experience for languages like Go or C or C++, and want to see what Rust can do for them or for programmers who had started out with Rust and now felt like they were ready to level up.

I figured that one of the best intermediate resources would be to see real software being built. So, I sent out some fillers on what would people like to see, and the overwhelming response I got was, “We want to see people program. We don’t want to read blog posts. We want to actually watch people get stuck and people who know what they’re doing so they can then get themselves unstuck.”

So, I started doing these live coding streams where I just build real software in Rust and people watch and comment and correct me all the times I’m wrong, and then post them online afterwards.

Fears of Live Coding

Adam:It sounds a little intimidating to me to do live coding in front of an audience. How do you find it?

Jon:Oh, yeah. It was absolutely terrifying. No doubt about it. I remember that very first stream, I had all sorts of worries like I would start the stream and no one would watch. I would start the stream, people would watch, and I would get stuck and not be able to get unstuck or I would just make massive mistakes that people would comment out and laugh at me. What I found, though, it’s actually quite the opposite. I started programming. I tried to solicit ideas on what people would like to see. I started programming all of them and the people watching were enthusiastic, helpful. They found the material useful and interesting, and it was a really supportive experience. I found that overtime I think it has also made me a better programmer because in some sense, it’s almost like pair programming, except with a hundred other people. That is really fun, but you are right. It is also intimidating.

I think what’s needed is to accept the fact that you’re going to get stuck and you’re going to make mistakes and then realize that the people watching, what they learn the most from is watching how you deal with problems in your code, right? How do you debug? How do you figure out what went wrong? How do you think about how to structure your program in the first place? That is what people learn from. Therefore, it’s good when you get stuck.

Adam:Yeah. Do you feel like there’s any performance art aspect to it that you’re performing for a crowd or do you look at it at a different way?

Jon:That’s a good question. I don’t think of it so much as performance art as I think of it as it teaches me to articulate my thoughts, right? I think a live programming stream is pretty much useful if the people watching don’t hear what you’re thinking because that’s what they’re going to learn from. So, you need to just start talking and keep talking and making your reasoning out loud, which feels somewhat unnatural to begin with. So, in that sense, you do have to get used to the performing in the sense of making your thoughts audible, but I don’t think there’s that much performance beyond that.

In fact, if anything, it feels more like a conversation with the viewers, right? They are observing you programming and will say, “Why were you doing that?” or “Can you tell me more about why you made that particular design choice?” or in the process of explaining some design choice, I realized that it’s not the right one and I point out why and then we collectively come up with a better design.

Adam:I was thinking of they have those contests where they build a game in 48 hours and a lot of people livestream it, and I get the sense, some people, they built a similar game engine many times. So, you sit down, you can almost watch them with popcorn, right? They know exactly what they’re doing and it’s almost like, as an outsider, it feels a little bit to me like this is a practice routine and they’re building it up. I guess you’re doing a more exploratory approach, right?

Jon:Yeah. So, I very intentionally didn’t want it to be like that. So, I try to choose problems that I have some past exposure to in the sense that I know roughly what the problem is like, but I don’t do anything where I’ve already written this code before. I also don’t do any planning ahead of the stream. So, I don’t design the software, the APIs. I don’t read the RFCs. I don’t look at the APIs that we’re going to interact with at all before I start the stream because I really want it to be a resource where people can watch and learn how someone who’s experienced with the language and experienced with software development in general, how they would approach solving this problem, and that includes the process of how do you get to the point where you’re writing code.

So, I think it’s important, actually, to include that aspect and not just do a show off “Look, I can write code that I know how to write because I’ve done it before.” I don’t think that’s interesting or useful to anyone.

On Becoming a Live Coder

Adam:Yeah. No, that’s true. You’re brave, I think.

Jon:It’s certainly been scary and intimidating. There’s no doubt about that.

Adam:If people wanted to become live coders themselves, where would they get started?

Jon:So, I think you need to first figure out why you’re doing it. What is the goal? Is the goal for you to develop interesting stuff and you are just happy for people to watch? Do you want to teach something specifically, do a video tutorial or a class or are you trying to just teach them programming in general? Because what your intended outcome is is going to affect how you do the stream. It’s going to change whether you do planning or not.

So, for example, there are a number of Rust podcasts and live coding and video streams where people have different goals and they approach it differently. Some of them they have fully written out scripts for the episode. Some of them the code is available before they even start. Some of them they just read the code, and that can be useful, too. So, I think having a good idea of what it is you’re trying to do is important.

Then I think the second element of that is you want to make sure that the viewers or listeners have something to listen to, right? It’s not going to be useful if they’re just watching you type stuff out because they’re not going to learn anything from it. It might be interesting, like you mentioned, and game jams or something like that, but it’s not a learning experience, unless you are also telling them why you’re writing that particular code.

Adam:Yeah. That must be a challenging part.

Jon:It’s true. One of the streams I did that was really helpful for me and I think for other people, too, in figuring out what the stream was even going to be was I did an open source contribution stream, where I had people on Twitter send me suggestions for open source Rust projects that were still relatively new, where we would just go to that, get the page, and I had only seen the ReadMe before, and then we were going to make a contribution to that project. That could be filing a pull request. It could be reading through the code and trying to understand how it works and file an issue. It could be taking an existing pull request and doing a review on it, just something where we’re learning some other code base, we’re learning how to reason about existing code, and we’re learning how to contribute to the ecosystem.

Adam:That’s amazing. Yeah. I can see why that would be really interesting to watch. I’m thinking like, “Hey, I want to contribute to an open source Rust project, and now I can watch Jon and then see his approach.”

Jon:Yeah, and that’s exactly how it works. I think another reason why that stream went so well was also because people felt as though they had a stake in it because they could suggest projects that we were going to do. It also made them feel as though it was more real, I think, precisely because they knew that I hadn’t seen these projects before. Even though I say that in every stream, I think that adds to the genuine feeling of it. We ended up doing I think three different projects in that one stream. So, the streams end up pretty long. They’re usually about five hours long.

Adam:Oh, wow!

Jon:Over five hours, you can cover a lot of material. I’ve had people ask me whether the stream should be shorter, whether I should do shorter videos and then rather do more of them, but I think for programming, it’s valuable to run longer sessions because otherwise, there’s just so much context you have to keep track of in your head and having to put away the project and then pick it up a week later, I think it’s actually pretty hard and requires that you spend a bunch of time just reiterating what you did last time. It’s not how programming usually works.

Adam:Yeah. So, I think that we covered Noria in a lot of depth. I do compare it to a number of different databases. Then we talked Rust and live streaming. Is there anything else you think we should have covered?

Jon:No. I think that’s a pretty good summary of the kind of things I’ve done recently. I think the one thing I would add is we really want to hear other use cases for Noria that people think would be interesting, whether that is queries that they think would be hard or would be interesting, workload patterns that they care about or even just from the business use case side of things, what are use cases that you have found to be hard to cache, hard to write queries for, how to write a Kafka pipeline for. I think it will be useful for the research to learn more about what the “real problems” are.

Adam:Yeah. That’s great. So, I will definitely put some links in the show notes to all of these various projects. I mean, yeah, hopefully, some people have feedback on this database, which seems quite amazing. So, thanks for your time, Jon. This has been a lot of fun.

Jon:Absolutely. It was a lot of fun, and it’s always interesting to talk about these things. So, I’m happy to connect to anyone after the fact, too, if they want to chat more.

Adam:Great.

That was the interview. I hope you enjoyed it as much as I did. If you have any feedback, let me know via email or Twitter or just join the Slack channel. Thanks to Liff Batterman, and Rodrigo Rorich, and Tim Heaney and probably many more people who recently said nice things about the show. Thank you very much, guys. I appreciate that a lot. I will talk to you next time.

Support CoRecursive

Hello,
I make CoRecursive because I love it when someone shares the details behind some project, some bug, or some incident with me. No other podcast was telling stories quite like I wanted to hear. Right now this is all done by just me and I love doing it, but it's also exhausting. Recommending the show to others and contributing to this patreon are the biggest things you can do to help out.
Whatever you can do to help, I truly appreciate it!

- Adam Gordon Bell

Audio Player
back 15
forward 60s
00:00
00:00
58:00

Rethinking Databases