Frontiers of Performance With Daniel Lemire

Intro

Adam: Hello and welcome to CoRecursive, the people and the stories behind the code. I’m Adam Gordon Bell. Did you ever meet somebody who seemed a little bit different than the rest of the world? Maybe they questioned things that others wouldn’t question or said things that others would never say. Meet Daniel Lemire.

Daniel: You were asking whether I was entirely sane and I like to think that I’m a little crazy. By nature, I will obsess over things that people would rather not think too much about. I think it’s a personal trait.

Adam: Daniel is a world-renowned expert on software performance and one of the most popular open source developers if you measure by GitHub followers. Today, he’s going to share his story. It involves time at a research lab, teaching students in a new way. It will also involve upending people’s assumptions about I/O performance. And Elon Musk and Julia Roberts will come up a little bit more than you might expect.

The story starts as Daniel is doing his PhD at the University of Toronto. He gets thrown a problem and the way he solves it sets his career on a different trajectory. It starts when a couple of geologists come to him with a data set that they have generated in what seems to me a very unique fashion.

The University of Toronto And Geologists

Daniel: Basically they were using helicopters and tied to the helicopter you’ve got a balloon of some kind. Between the two you’ve got this ring. This ring throws out EM waves into the ground. This is fairly standard stuff. They captured the EM waves and they know what to do with them if the signal is perfectly clean.

The way it’s supposed to work is that you shoot this wave and it comes back. It’s supposed to come back as an exponentially decreasing curve. So theory tells you exactly what you should be getting. But what you got in practice was massive garbage. It’s stuff that you cannot feed into any computer, so you need to clean it.

The way you want to clean it is that you want to build some model for what the noise is. As a young PhD student they asked, “Well, can you help us clean up the data?” And I did but it was quite a process because they had these CD-ROMs at the time that would have hundreds of megabytes of data on them. I would sit down and design an algorithm and then I would implement it and try it out and it would be spinning forever. Just trying to test it out was taking way too much time.

Adam: Were the geology guys gathered round and you were like, “I’m going to try out this program,” and then it just spins and spins? Or-

Daniel: Right. So you have this idea. You think it’s going to solve their problem and you try it out. Okay, if it takes hours for you to find it out then it’s annoying because of course it slows you down. But it goes further than that, is that if you want to give them the algorithm and it takes them hours to check that it works. Right? They may not do it.

That’s actually what happened in my case. It was too painful for them to try things out. So in my case they really put the stuff on their desk and they’d say, “Well, when we have time we’ll check it out.” I just said, “Okay, fine.” I wasn’t waiting for them. Then I get a call months later saying, “Well, we finally got around to it. It was painful but yeah, it really solves our problems. Where can we go with that?”

Basically slow computing can introduce friction. It can make things that are possible practically very difficult and I had this experience over and over and over and over again until I decided, “Okay. I’m going to turn my life around. Instead of solving [inaudible 00:04:12] design stuff, I’m going to go down and I’m going to try to solve this annoying problem of trying an idea and then having to wait forever for it to pan out, instead of the higher-level problems work that I can leave to other people.”

Adam: Was this geology time when you decided that a focus on performance, focus on computer science was important?

Daniel: That’s where I was headed. Basically being able to run code quickly is a huge enabler. We can go into why is deep learning taking off right now. Well, it’s a complex topic and there are lots of reasons but certainly one of the reasons for it has to do with system performance. If it did exactly what it does now but it was 10 times or 100 times slower, we might not even know about it because it would be too expensive to experiment with it and you wouldn’t have all these applications coming out because it would be too expensive to develop.

Adam: It’s like, I think it’s the quote from Joseph Stalin so maybe it’s not good to use. He said, “Quantity has a quality-

Daniel: Right.

Adam: … all its own.” If you have enough computing power it can be a whole different game. Right?

Daniel: Right. Software that is just a little bit too slow to use can seem unbearable, but if you make it really, really fast then all of a sudden it’s much more fun.

NRC and Recommender Systems

Adam: So with this realization that performance can be a great enabler, he finishes up his PhD and he joins a research lab.

Daniel: In Canada, we have this research institution, it’s called NRC. It’s this research-only government lab, basically. At the time they were creating this e-business initiative. My academic career, I would say, really started there because it was really this unique environment to all have these really, really smart people put together in the same building. And they all have different ideas. Because it’s brand new, you don’t have two old guys in the corner who run the whole show and tell everyone what to do because nobody knows what to do.

So it’s basically if you’re young and you have ideas, they say, “Well, go. Do it. We don’t know what to do, so do something.” This was a lot of fun for me.

Basically, we could do anything we wanted. We were free to build the research program we wanted. I really got to try things. I worked a little bit on recommender systems. At the time Greg Linden had come up with the recommender system that Amazon uses and I thought that was really, really cool. This inspired me to work a bit on this problem.

Adam: Daniel’s work on recommender systems led to the creation of the Slope One family of algorithms. According to Wikipedia they are, “The simplest and most performant collaborative filtering algorithms.” While at NRC, Daniel has another big turn in his career happen.

Daniel: So I was this young researcher typing at my desk. There’s this guy that comes in. He looks like a homeless person. He’s got this long hair and he’s swearing a lot about not being able to find a place to sit. And I’m a little bit scared because you’re there and you’ve got this person that looks totally out of place and you’re wondering, “Are they going to sleep on the floor or something?”

But it turns out that it was this really, really, really brilliant guy that could never get a corporate job because he’s really too strange but is excessively smart. Very, very smart. We start talking and he’s telling me these stories about his vision and he’s saying, “Well, you know soon you’ll have all these people, thousands, maybe millions of people taking these classes online and it’s all going to be free.” He’s a little bit on the left side of the [inaudible 00:08:57] spectrum.

And, “It’s all going to be free.” I started listening to him. This was very inspiring. He was one of the guys who really shared my vision of the world because he was very… I think he was slightly prescient. He really did predict a few things that did happen. He did foresee a few things. At the time, he was very preoccupied with the costs of higher education, for example.

Which, as you may be aware, only got worse over time. He thought, “Well, okay. We need to fix this problem. So we need to get all of these fancy profs to go online where anyone, no matter how poor they are can listen to them and learn from them.” This was very inspiring, I thought.

His name is Stephen Downes. Now, you probably don’t know him but he’s the inventor of MOOCs, Massive Open Online Courses. If you go on Wikipedia they credit him with this invention.

The Online Professor

Adam: Is he what led you to go and try to become a teacher, to become a professor?

Daniel: Right. Yeah, so I became a professor and I started to build online courses. My first online course, I think, was in 2005. And for credit, not like you build a PowerPoint and you post it online. Actual, for-credit courses. Except for graduate work, which is different, I started basically teaching exclusively online at the time. I’ve been doing so for a long time now. For example, I’ve got this Introduction to Programming class where I have, I don’t know, something like 250 students a year but it’s all online. It’s actually a lot of fun.

Adam: Nice.

Daniel: It’s extremely cost-effective because there’s only one of me and there are 250 students but it still works.

Adam: Did you find it hard to get into a role like that?

Daniel: I grew into it, I think. Now I’m enjoying myself a lot but it was very uneasy at first. I think academia is very conservative in a strange way. I mean, we like to think about universities as being progressive and in some way they are, like nobody cares if you’re transgender. In that sense where it’s very socially progressive.

But there are ways in which it’s extremely conservative. For example, there’s a tool that is perfectly fine that’s called MATLAB. It’s a programming language assistant that, to my knowledge, is very rarely used outside of a campus. Certainly if you go to a data science conference and people will be using Python or R or something. They probably won’t be using MATLAB. But if you go on campus, everyone is using MATLAB because well, to the best of my own knowledge the reason is that their classes were in MATLAB.

So then, they’re going to teach the way they were taught. You reproduce these things. When you try to challenge these ideas, academia can resist you quite a bit. One of the things that I wrote maybe 10 years ago, or maybe slightly longer on my blog at some point, I pointed out that there was a big problem with the big academic conferences. They’re very selective. Basically nobody from outside of academia ever attends. Right?

Adam: Mm-hmm (affirmative).

Daniel: They’re like bubbles and everyone is chasing what is hot. If you look at just the papers, you know that, “Okay. This was the year of XML.” It’s all about XML. And I say, well, these actually play an [inaudible 00:13:33] role because they actually… If you want to do something original, you’re probably not going to be meant for these conferences. The people building the real systems don’t show up. It’s a little bit challenging to be a contrarian in academia.

The Semantic Web

Adam: Can you think of a specific example of when you maybe had some head-butting with maybe a department head or somebody because of your different take on things?

Daniel: Right. There was something emerging, it was called the Semantic Web. I don’t know if people still use the term or if it’s completely gone now. But anyhow. It came out of expert systems and classical AI. At the time, for all sorts of reasons, I got into this project with colleagues. What they were trying to do, they were trying to leverage this Semantic Web that did not yet exist. But they thought, “If Tim Berners-Lee says it’s going to happen, it will.” Well, it didn’t.

They were saying, “Okay. The way we should be building online classes is through these things called learning objects.” These learning objects are like objects in object-oriented programming, so they have this metadata and they all come together magically and they’re [inaudible 00:15:12]. At first, I actually thought this all made sense. Then I started asking questions. Then I started writing my friend Stephen Downes and asking, “Okay, but can you tell me what exactly is a learning object? This is too abstract.”

Then he said, “Well, it can be anything.” Okay, so we’re working on anything. So I started telling people, “This is not a good direction.” The irony is that you can go on Google and can find a name. There’s a book called Canadian Semantic Web with my name on it. I was the editor.

I started to have real doubts and so I wrote a few things about this not being a good idea. I prepared a presentation about it and so forth. This was very controversial. I got emails like, “Why are you doing this?” And I said, “Well, we shouldn’t go there.” This was very unpopular. Some people said, “Well, okay. You don’t have tenure yet,” so at the time I didn’t have tenure. “So maybe you should be quiet a little bit and not voice your opinion too much.” But I felt really strongly that this was wrong, so because of who I am I couldn’t resist speaking up.

I think one lesson I learned from this, it’s hard to think in the abstract. So I always ask people to give me examples, to be concrete. Right? Software is abstract. Someone could tell you, “Well, what’s the best way to do X?” They think it’s a very well-defined problem. And so, “Okay. Well, give me an example.” You know?

Adam: Mm-hmm (affirmative).

Daniel: “How much data do you have? What’s your workflow? Be precise. Tell me. Then you can be smart about it.” But if the problem is too abstract, if you’re thinking in really general terms I think that most people, me included, are not smart enough to think in these abstract terms. We need to bring it down a little bit and to really take the thing down and really think in concrete terms what does it mean?

That’s why, for example, you’ve got this focus on software performance that is basically all about taking concrete systems and getting hard numbers out of them. I would say it’s easy to be smart once you do that because then you can say, “Okay. I’ve got this hard number. I know it’s probably not lying to me. I know the problem and then I can reason where this should go.”

Adam: To me, this is a really big insight from Daniel. It’s easy to be smart when you can be concrete and precise and it’s really hard to be smart when you’re dealing with abstractions. Let’s dig into performance, though. Daniel has started to question some of the underlying best practices about performance.

IO and File Processing Performance

Daniel: A long time ago, when I was doing more mundane database research, one of the problems that I was dealing with, it was just not a research question. It was just a practical problem. It was that you’ve got, for example, these text files, say a CSV file. You know? That maybe-

Adam: Mm-hmm (affirmative).

Daniel: … you exported from Excel or whatever, and you wanted to load them up and include them in your program or do some processing on them. I remember being really annoyed at the fact that it was slow. So I looked into the best people who were doing it. It turns out that the best people were using multi-threaded parsers. They were using several threads to read a CSV file.

That felt strange to me because everyone had been telling me the following: People were telling me that the bottleneck was the disk, so you couldn’t go faster than the disk, which makes sense. Because you were hitting the speed of the disk, the efficiency of your code didn’t matter. And so I thought, “Well, okay. I’m stuck because of my disk.” It was really, really annoyingly slow. I don’t remember the exact numbers, but reading a gigabyte of data was taking forever and it was slowing me down and slowing down the experiments and so forth and I was annoyed.

Then I started thinking about that and chatting, and then a fellow who was very good at this stuff was kind enough to exchange emails with me. I said, “Well, don’t you think it implies that we’re not disk-bound?” And he said, “Of course we’re not disk-bound. It’s the software. Really, we’re processor-bound.” But this was very unpopular. People would not normally say that.

Adam: Okay, we have the problem. It seems like this stuff might be CPU-bound. Then what? That sounds like a hard problem. Lots of people have built file processing stuff before.

Daniel: Yes.

Adam: It’s not a novel area.

Daniel: No. No, it’s not. So I was telling you that there’s not enough Elon Musks in the world. One of the things that Elon Musk does, if you listen to him when he’s thinking through, is he says, “Okay. So we have this problem here. How good could a solution be?” He’s trying to do this back of the envelope thing. Right? So, “How much would it cost to send someone to Mars? Let’s not go ask consultants about it. Let’s try to figure it out from first principles.” Right?

What programmers don’t do, typically, is they don’t do that. They don’t ask. They’ll figure out, “This is slow and this is annoying,” but they’ll never ask the reverse question, which is, “Okay. How fast could it be?” Now you sit down, you say, “Okay, I’ve got so many bytes.” Blah, blah, blah. Blah, blah, blah. Blah, blah. Then when you start asking this question, your thinking switches over because then it’s an engineering constraint. Right?

I mean, the bill comes back from Amazon and it’s whatever it is. Oh, well. People don’t ask, “Okay. How low could it be?” Now, the important thing about this question is that you don’t need to make it that low. Right? But it gives you a range. So if you know you are a hundred times higher than you could go, then it gives you room. You know you could adapt it.

Adam: In thinking about this problem, getting CSV files parsed faster, Daniel has another light bulb moment. It turns out there’s another file parsing task that’s chewing up computer cycles the world over. Something that’s a bottleneck, whether people know it or not.

JSON Parsing

Daniel: I was reading about really a lot of data science and NoSQL benchmarks involved a lot of JSON. You would attend talks where really, really smart people, people who have a lot of followers were saying, “Well, avoid JSON. It’s too slow.” So I said, “Okay, okay. Let’s benchmark it,” and then I figured out. I see this isn’t done. Well, this is amazingly slow. This is truly slow.

I asked a friend of mine, Geoff Langdale who had done a lot of work where he was working on building really fast regular expression parsers, I asked him, “Do you think we could do better?” Because I looked at the numbers and say this is terrible. Okay. But how good could it be? In that particular case I did not have a lot of experience parsing, so I turned around to someone who does. Right? Well, okay. He does exactly as I would expect. He goes into this Elon Musk mode and he tries to figure it out. You know?

Adam: Mm-hmm (affirmative).

Daniel: “It should be about that much.” I took what was reported by several people as being the fastest library available at the time, RapidJSON from Tencent, Chinese folks, and I was getting on a typical 300 megabytes per second or something like that. Which sounds fast until you reason about the fact that… I’m hopefully going to get that PlayStation 5, so a game console, this week or soon. I don’t know. It has a disk that exceeds five gigabytes per second, you know?

Adam: Yeah.

Daniel: In reading speed. If you’re processing JSON at 300 megabytes per second, there’s quite a range. There’s more than 10 X difference between the two. Of course networks are faster. Really fast networks can be much faster than five gigabytes per second, so this means that you’ve got this huge gap.
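To put numbers on the gap described here, the following is a minimal back-of-the-envelope sketch in C++. The two throughput figures are the ones quoted in the conversation; the function names and the use of decimal megabytes are assumptions made for illustration.

```cpp
#include <cassert>

// Figures quoted in the conversation, in megabytes per second.
// Decimal units (1 GB = 1000 MB) are an assumption made here.
const double kParserMBps = 300.0;   // roughly what the JSON parser achieved
const double kDiskMBps   = 5000.0;  // "exceeds five gigabytes per second"

// How many times faster the disk delivers bytes than the parser consumes them.
// (Hypothetical helper, named for this sketch only.)
double speed_gap(double parser_mbps, double disk_mbps) {
    assert(parser_mbps > 0.0);
    return disk_mbps / parser_mbps;
}

// Seconds needed to get through `megabytes` of input at a given throughput.
double seconds_for(double megabytes, double mbps) {
    assert(mbps > 0.0);
    return megabytes / mbps;
}
```

With these inputs, `speed_gap(kParserMBps, kDiskMBps)` comes out a little under 17, which is consistent with the "more than 10 X" remark, and a one-gigabyte file takes a bit over three seconds to parse against a fifth of a second to read.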

Then, the next experiment I like to do is I just take C++. C++ is not a slow language, right? It goes really, really fast. I just used the standard library and I just called the getline function, it’s the function that takes the current line in a text file and returns it as a string. And I just iterate through the file like that. I don’t remember the exact numbers I get but it’s something between 500 megabytes and 900 megabytes but it’s well under a gigabyte per second.
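For readers who want to try a similar experiment, here is a minimal sketch of a `std::getline` loop with a timer around it. It is a guess at the general shape of such a test, not the code that was actually measured; the function names are hypothetical, and the result will depend heavily on the machine, the compiler flags, and whether the file is already cached in memory.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <fstream>
#include <string>

// Reads the file line by line with std::getline and returns the number of
// bytes seen, not counting the newline characters that getline strips.
// Returns 0 if the file cannot be opened.
std::size_t count_line_bytes(const std::string& path) {
    std::ifstream in(path);
    std::string line;
    std::size_t bytes = 0;
    while (std::getline(in, line)) {
        bytes += line.size();
    }
    return bytes;
}

// Times one pass over the file and reports megabytes per second
// (decimal megabytes, newlines excluded from the byte count).
double megabytes_per_second(const std::string& path) {
    assert(!path.empty());
    const auto start = std::chrono::steady_clock::now();
    const std::size_t bytes = count_line_bytes(path);
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count() > 0.0 ? (bytes / 1e6) / elapsed.count() : 0.0;
}
```

Calling `megabytes_per_second` on a large text file, ideally several times so the file is served from the operating system's cache, gives a rough figure to compare against the speed of the disk.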

Adam: Let’s pause to absorb this. Right? The standard logic is that disks are a bottleneck, I/O is slow. But just calling getline from a file is maxing out one CPU core, you’re only getting one-tenth of the speed of the disk. So obviously some of the standard programming performance dogma must be wrong. But also, and here’s where Daniel lost me. He thinks that based on him and Geoff’s back of the envelope, Elon Musk inspired calculations, that they can parse JSON at disk speed. That just seems unreasonably optimistic to me.

JSON parsing involves infinitely [inaudible 00:25:33] numbers. You need to reject things that don’t match the spec. You need to understand Unicode. Doing all that at over 10 times the speed that C++ can read a line, it just sounds impossible.

Allocating Memory

Daniel: So when you look at that, you will think, “We’re dead. There’s no way we can parse a JSON file at anything close to the disk. We’re dead. There’s no way to do it.” Right? But if you look at the architecture of my last little test, what it does, it creates a new little object string that [inaudible 00:26:10]. It does an allocation, it creates a little object, it populates it, then it throws it away. It’s extremely wasteful. Even though it’s like three lines of code and it looks efficient, it’s terrible. Right?

There are a few rules that people who focus on efficiency learn and that they all share. This is not my finding, this is… So basically, you try to avoid allocation. You need memory at some point, but then you do it in big chunks. You don’t go through a document and then, “Oh! I’ve got this little string with the word name in it. Let’s allocate this little string there and let’s put that there.” This is terribly slow. You don’t want to be doing this.
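The allocation advice can be illustrated with a small C++ sketch. It splits a delimited line in two ways: one copies every field into its own string, the other only records where each field sits in the original buffer and reuses one output vector across calls. All names here (`split_allocating`, `split_in_place`, `FieldRef`, `field_text`) are hypothetical and chosen for illustration; this is not how any particular parser is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Wasteful version: every field is copied into its own std::string, which
// may mean one heap allocation per field (short fields can escape this on
// implementations with a small-string optimization).
std::vector<std::string> split_allocating(const std::string& line, char sep) {
    std::vector<std::string> fields;
    std::size_t start = 0;
    for (std::size_t i = 0; i <= line.size(); ++i) {
        if (i == line.size() || line[i] == sep) {
            fields.push_back(line.substr(start, i - start));
            start = i + 1;
        }
    }
    return fields;
}

// A field described by its position inside the original line; no copy made.
struct FieldRef {
    std::size_t offset;
    std::size_t length;
};

// Frugal version: `out` is cleared and refilled, so once it has grown its
// storage is recycled from one line to the next instead of reallocated.
void split_in_place(const std::string& line, char sep,
                    std::vector<FieldRef>& out) {
    out.clear();
    std::size_t start = 0;
    for (std::size_t i = 0; i <= line.size(); ++i) {
        if (i == line.size() || line[i] == sep) {
            out.push_back(FieldRef{start, i - start});
            start = i + 1;
        }
    }
}

// Materializes a field as a string only when the caller really needs one.
std::string field_text(const std::string& line, FieldRef f) {
    assert(f.offset + f.length <= line.size());
    return line.substr(f.offset, f.length);
}
```

The second version trades convenience for fewer allocations: the caller has to keep the original line alive and ask for the text of a field explicitly.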

Adam: So that’s the first trick in Daniel’s toolbox. Don’t allocate memory unless you really have to. And when you do, allocate a big chunk.

Daniel: A common pattern that people use is that they have this data structure there and then they build something like an iterator. Right? They access it through some high-level API and they say, “Well, this is nice because it’s really abstract and then it’s going to make my code very beautiful.” But this is like basically drinking beer from a straw. Which is fine because that iterator is the straw but you’re never going to win any beer-drinking contest if you’re out with your friends at a bar. You’re just not going to drink many beers at this rate. But the straw [inaudible 00:27:54] this interaction is really, really elegant but at the same time it’s going to block you all the time.

Adam: This is the second trick that Daniel has: Don’t use too many unnecessary abstractions. Stay low level so that you get the full performance. The next trick is the one I think I’m least familiar with, and this one is about parallelism.

Parallelism

Daniel: When people think about parallelism, doing things that are parallel, they always think, “He means several cores.” But actually, with a single modern core, you’ve got plenty of parallelism. First of all, in real code, you can execute at least three instructions per cycle and you can reach higher. Right? This is one instance of parallelism that they can do. There’s other levels of parallelism.

For example, there’s memory-level parallelism where you can… So, you may have this mental model where your processor requests a byte of memory somewhere and then it gets it back, and then it requests another byte of memory and gets it back. But of course, it doesn’t work that way at all.

Actually, the way processors work is that they can issue multiple memory requests at a time. Easily 10, but we’ve benchmarked much wider than that, like 25 or something. Something like the Apple processors, they’re incredibly wide. What you should derive from this is that if you can tell your processor what to do in such a way that it can just go and do it all without having to wait for results so there’s no data dependency. It doesn’t have to wait for this part to be done before doing this part.
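A common way to picture "no data dependency" is summing an array. In the sketch below, the first function forms one long chain where each addition waits on the previous one, while the second keeps four independent running totals that a wide core can, in principle, advance side by side. The function names are hypothetical, and an optimizing compiler may already transform the simple loop on its own, so any speed difference is something to measure rather than assume.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One accumulator: every addition depends on the result of the previous one.
std::uint64_t sum_single_chain(const std::vector<std::uint32_t>& v) {
    std::uint64_t total = 0;
    for (std::uint32_t x : v) {
        total += x;
    }
    return total;
}

// Four accumulators: the four additions in the loop body do not depend on
// each other, so the processor is free to overlap them.
std::uint64_t sum_four_chains(const std::vector<std::uint32_t>& v) {
    std::uint64_t a = 0, b = 0, c = 0, d = 0;
    const std::size_t n = v.size();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a += v[i];
        b += v[i + 1];
        c += v[i + 2];
        d += v[i + 3];
    }
    for (; i < n; ++i) {  // leftover elements when n is not a multiple of 4
        a += v[i];
    }
    assert(i == n);
    return a + b + c + d;
}
```

Both functions return the same total for the same input; only the shape of the dependency chain differs.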

So, if you can avoid these data dependencies and if you can avoid bad branches, then you can go really, really fast. There are ways to break data dependencies and there are ways to break the branches. The branches are bad because the way modern processors work is that they have all this amazing parallelism but then when they get to a branch they don’t know which way to go. They don’t know whether it’s left or right, and so they’re going to guess. Most of the time they’re right but when they’re wrong then they have to undo all of the work they’ve been doing. Right?

Adam: Yeah.

Daniel: And come back. So the cost can be enormous if it’s done poorly. You have to engineer your code so there are as few branches as possible. You basically want to write your code having a mental model of the machine. You see this line of code here and this line of code here and you want as much as possible for the processor to be able to run both of them at the same time. If you think this way, then a lot of code can become really, really much faster.
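One well-known way to remove a hard-to-predict branch is to do the work unconditionally and let the comparison result decide whether it counts. The sketch below copies the values under a limit from one vector to another, first with an `if` and then without one. The function names are hypothetical, the two vectors are assumed to be distinct objects, and whether the second form is actually faster depends on the data and the compiler, so it should be benchmarked rather than taken on faith.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Branchy version: the processor must guess the outcome of `x < limit`
// for every element. `in` and `out` must be different vectors.
std::size_t copy_below_branchy(const std::vector<int>& in, int limit,
                               std::vector<int>& out) {
    out.resize(in.size());
    std::size_t n = 0;
    for (int x : in) {
        if (x < limit) {
            out[n] = x;
            ++n;
        }
    }
    out.resize(n);
    return n;
}

// Branch-free loop body: always write the value, then advance the output
// position by 1 or 0 depending on the comparison. A rejected value is
// simply overwritten by the next write or trimmed off at the end.
std::size_t copy_below_branchless(const std::vector<int>& in, int limit,
                                  std::vector<int>& out) {
    out.resize(in.size());
    std::size_t n = 0;
    for (int x : in) {
        out[n] = x;
        n += static_cast<std::size_t>(x < limit);
    }
    assert(n <= in.size());
    out.resize(n);
    return n;
}
```

The design choice is to accept a few wasted writes in exchange for a loop whose control flow no longer depends on the data.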

Adam: Oh, wow. What was the end result once you applied all of this?

Daniel: The story is that we reached two, three, and in some cases four gigabytes per second.

Adam: Oh, wow.

Daniel: We’re not yet at the disk, but here’s the-

Adam: Getting close.

Daniel: … fun part. I think we can reach the disk, given enough clever work. Writing good code, it takes time. And I don’t know if I’m going to be the one breaking the five gigabytes per second barrier. Well, it would never be me alone in any case, but what I’m saying is that I think people will. If not this year, and if it’s not me, if not this year, next year or in two years, we’re going to see parsing probably five gigabytes per second in the future.

And I gave you the strongest competitor, which was RapidJSON. Now there are much faster alternatives. After [inaudible 00:32:00] the JSON came along then some other people learned, I guess, a bit from us and they go faster than RapidJSON. But at the time this was the fastest competitor that there was, really, that was correct. It was parsing everything without breaking any rules. It was really, really fast. It was much, much, much faster than some popular alternatives.

This means that the gap, we’re talking about 20 times, 30 times faster than some other options. It’s really interesting to think that as we’re sitting on all this software architecture, we think because we’re working with this old thing that they must be as fast as they can be. But they’re probably not. It would be a bit like being in 1980 and driving a car and thinking, “Well, my car cannot get much more fuel efficient. We’ve been working on engines for a century. This is as well-tuned as it will be.” But of course now, our cars are much more fuel efficient than they were.

And so, the same is true with software. There are hard limits but we’re very often quite far from the hard limits and so software is like that. There’s lots of things that we accept that are actually atrociously inefficient.

Adam: So Daniel questioning assumptions about disk I/O led him to create the fastest JSON parsing library in the world. It was 20 to 30 times faster than some popularly used libraries. But that’s not all. His work on bitmap indexes is being used in much open source software including Git, Spark, and Elasticsearch. He created a hashing algorithm that’s in TensorFlow. But always questioning assumptions and not being afraid to ignore the rules has not always made life easy for Daniel. Let’s go back to when he was in kindergarten.

Back to School

Daniel: So they expect kids to learn to count up to some number, say one, two, three, four, five, six, seven, 10 or something. And see, I got this wrong, I think. They ask you to memorize your phone number and you have to tie your shoelaces. So, these are cognitive tests, tests you have to pass to be considered a normal human being. Of course I did not memorize my phone number and to this day if you ask me my phone numbers, I’m quite poor at it. I certainly don’t know my office phone number nor my cell phone number.

Then, as far as counting goes, I figure I was five years old and so I could count until five and that was good enough. Well, to this day, and this is a true story. People will see me walking downtown Montreal and they’ll say, “Well, your shoelaces are undone.” I’ll say, “Oh,” and then I’ll go and try to do something about it. The story is that they decided I wasn’t very smart, so they put me into the special ed class.

Adam: Did your parents sit you down and say you’re going to be switched classes? Or do you remember the experience?

Daniel: Well, yeah. My mother was a teacher. Now she’s retired. And so, this was very embarrassing to her because obviously when you’re a teacher you want your kids to do really well. If you are a primary school teacher then you want your kids to do really well in primary school. And I did do well, by the way. Right? In the end my grades were good but this was a little bit of a struggle with my mother, who…. Well, you know how parents are sometimes. They want you to succeed, so basically they want you to say, “Well, shh. Stop asking all the questions and just do what you’re told.”

Adam: Did they think that you had a learning disability?

Daniel: Okay, so that’s interesting because yeah, yeah, yeah. They definitely thought that I had a learning disability. It was the ’70s and so it wasn’t at the level… Like now basically, at least in Montreal, you have something like 20% of the kids or more who have a label as having some kind of disability. But it wasn’t like that at all in the ’70s.

So at the time schools, at least where I lived, schools had easy access to kid psychologists and so forth. Which I’m told now is much more difficult, but at the time… So I would see this nice lady who would run tests by me and so forth. They did consider that I had a learning disability.

Adam: Whether or not the school gives him a label, a five-year-old who refuses to learn to count past five because he doesn’t see the point of it is unlikely to follow a conventional path in life. One thing that’s unconventional about Daniel is when he’s writing code he tries to think of what communities might use it. He writes code thinking about adoption first.

Shipping Code

Daniel: The same way if I want to go to China and reach out to people I’ve got to speak their language, I think it’s the same approach with software, is that if you want to reach out to Java programmers you might have the nicest Rust program or nicest Rust library you want, they won’t pay attention because you’re not speaking their language. Right? You have to reach out to people and you have to write in their language. That’s why actually, I tried to learn and use the most popular languages. So I’ve taught myself, of course, JavaScript, Java, Python, C, C++.

I’ve done less Rust because until recently, Rust was low in the popularity [inaudible 00:38:23] but of course now it’s becoming more popular so my stance has changed on it. Now I’m happy to do Rust when needed. So, yeah. It’s just a matter of reaching out to people.

Adam: When did you decide that shipping code was important?

Daniel: Well, this relates to another good friend of mine that I met at NRC, his name is Martin Brooks. Martin Brooks gave this talk at NRC at one time. He said, “Well, okay. We’re discovering a lot and we’re doing research for the world [inaudible 00:39:02] and public and so forth. That’s our mission. We’re trying to make the world better. We have this model where we do this research then we do some kind of prototype, maybe.” And then he said, “And then we throw it over the wall.” There’s this wall, right? “And you throw it over and you hope someone is there to catch it and run with it. But actually if you go and you tilt your head and you look behind the wall, you see there’s nobody there catching anything. Nobody cares.” Right?

He says, “Well, this is broken.” You know what happened when he was giving this talk is that I was sitting there and I thought, “Oh. This is really smart.” I was thinking that it was and people were leaving.

Adam: Oh, really?

Daniel: One by one. Yes. Because this was very upsetting. This was very upsetting to people, being told that their model of research does not work. Don’t get me wrong, I’m not against publishing papers. Quite the opposite. I think more people should be. All sorts of people should be writing research papers. This is very important.

Apparently even Elon Musk wrote a research paper a few years ago. It’s a true story. But more people should be writing papers, but you shouldn’t just write a paper. Especially with the style that we have now in computer science in 2020 where papers are hard to read for all sorts of complicated reasons. If you go back to [inaudible 00:40:38] in the ’50s or even the beginning of computer science, the ’70s, you can pick up these papers today and they’re quite readable. But now, they’re often very, very hard to read.

If you hit the right topic and you’re somewhat famous or something or you know people who are famous, your paper might get cited a lot but that by itself does not mean you’ve achieved anything. Being cited is like having stars on GitHub or something, or having followers on Twitter. It’s not by itself an accomplishment. It’s not. This is just vanity stuff. It doesn’t change the world. It doesn’t really matter.

Maybe Twitter terminates your account and all of the followers are gone. I don’t know. But it’s really virtual [crosstalk 00:41:39]. Right?

Adam: Mm-hmm (affirmative).

Daniel: It doesn’t really matter. So if you want to really have an impact on the world you have to reach out to people, the practitioners.

GitHub Collaboration

Adam: The way Daniel reaches out to practitioners is centered around collaborating with people on GitHub.

Daniel: It really transformed the way I do research. Because now I can write code, I can interact with really, really, really smart people that I would never have access to. Just this morning I was interacting with Russian programmers. They’d look at an algorithm that I wrote and say, “Well, it’s really nice but we have this focus on this other aspect of the problem and we think it could be improved if you did this instead.” And I’m like, “Okay, yeah.”

So it’s super interesting. This interaction just wasn’t possible before. The way I do research, I think it’s a successful model but it’s not a model that people can readily adapt because it really fits what I do very specifically. Now, for the people who do Semantic Web and so forth, they’ve been doing open source software and so forth because there are still people working on this Semantic Web and they probably don’t like me very much if they’re listening to you right now.

But very often there’s this fake open source thing. Even large companies have been guilty of it, where you take this thing that you’ve built and you just dump it on the internet with its source code and say, “There. It’s open source.” I think Microsoft now understands but I think at some point they were doing things like that, that they would call open source but really they were missing the social component, which is the most important part because open source is really not about the code. It’s really about the interaction with the people. It’s really a social thing.

Adam: This is why Daniel is known for his code, because he embraces the social nature of open source. His JSON parsing library isn’t really his. He’s the top contributor but he has 68 other people working with him on GitHub. He embraced the radical ideas of Martin Brooks, that people in academia should collaborate with people outside of it. Actually, he also ran with the ideas of Stephen Downes, embracing remote computer science education back in 2005.

Reflection

Adam: There’s one story I want to revisit though. I don’t know why I keep going back to this early schooldays story but it stuck in my head. When I feel like somebody mistreated me or misjudged me or something, right? I think of Pretty Woman. Do you know this movie?

Daniel: Mm-hmm (affirmative). Of course.

Adam: They don’t let her shop at that store.

Daniel: Yes. Yes [crosstalk 00:44:31].

Adam: Have you ever wanted to run into your grade four teacher while you’re accepting an award and be like-

Daniel: No. No, no. I mean, it makes for a great movie scene but I think it’s not quite healthy. You know?

Adam: Mm-hmm (affirmative).

Daniel: I think Paul Graham had an essay recently about it. I think he called it The Privilege of Orthodoxy, or something like that. And so, his shtick is basically if you tend to easily think like most people in a group then you have this thing that he calls a privilege because you’re never going to be challenged very much and people are going to say, “Oh, you’re fine. You’re one of us and you’ll be fine.”

If you’re by nature a little bit more prone to ask more questions and to be less quick to adopt the majority opinion, then I think you’re going to be always flagged as someone who is a little bit strange. In schools, being strange is not always a good thing, obviously. People, they like to believe simple things that are being given to them and I think that goes contrary to what, for example, science is.

So you need to be bold to go against the grain, at least selectively. I don’t recommend marching in the street or refusing to wear a mask at Walmart or something. That’s not what I mean. I mean it in a more intellectual manner. Where you’re willing, in a company, not necessarily to challenge your boss but to ask questions. Like, “Should we be doing this? Why do we do this?”

The scientific paradigm is about always asking another question. No matter where you are you always want to be challenging the state of knowledge. You always want to find where the frontier is.

Adam: So that was the show. I hope you found Daniel as fascinating as I did. I think he’s quite a character. If you liked this episode, do me a huge favor and just tell somebody else about it, who you think might like it. Just ping them on Slack or WhatsApp or however people communicate these days.

This is Adam Gordon Bell. Until next time, thank you so much for listening.

CORECURSIVE #059
