S06E09 - From Code Generation to Revolutionary RavenDB: Unveiling the Database Secrets with Oren Eini
Support for this episode of The Modern .NET Show comes from the following sponsors. Please take a moment to learn more about their products and services:
- RJJ Software’s Podcasting Services, where your podcast becomes extraordinary.
Please also see the full sponsor message(s) in the episode transcription for more details of their products and services, and offers exclusive to listeners of The Modern .NET Show.
Thank you to the sponsors for supporting the show.
The .NET Core Podcast
In this episode of The Modern .NET Show podcast, Oren Eini, a seasoned developer with over 20 years of experience in the .NET field, discussed the evolution of the .NET framework and the complexities that come with it. Eini highlighted the rapid pace of change in the language, from the introduction of generics at version 2.0 to switch expressions and pattern matching in the latest versions. While these new features allow for more concise code, Eini acknowledged that they also increase the scope and complexity of learning C# from scratch.
One topic of discussion was code generation techniques, such as T4 templates and Source Generators in Roslyn. These techniques have been commonly used to automate the creation of data models and data access layers based on database structures. Eini reflected on the challenges of working with generated code in the past and how these techniques have improved over time, simplifying tasks in modern .NET development.
The conversation also delved into Eini’s work on RavenDB, a non-relational database that he has been developing for the past 15 years. RavenDB has undergone significant changes, including transitioning from .NET Framework to .NET Core and running on Linux. Eini built RavenDB out of frustration with the limitations and complexities of existing databases, aiming to create a system that worked more intuitively and efficiently.
RavenDB differentiates itself through its automatic indexing capabilities, which eliminate the need for developers to manually create and optimize indexes. The latest version of RavenDB features a new indexing engine called Corax, offering improved performance for indexing and queries. Eini also discussed sharding, a technique for distributing data across multiple machines so that a database can store data sets that exceed the capacity of a single machine, and the challenges of implementing it.
Welcome to The Modern .NET Show! Formerly known as The .NET Core Podcast, we are the go-to podcast for all .NET developers worldwide and I am your host Jamie “GaProgMan” Taylor.
In this episode, I spoke with Oren Eini about RavenDB. He shared some practical tips for databases (it’s not just a case of “index all the things”, who knew?), and we talked about the speed at which Modern .NET is evolving and how that could possibly put new developers off. Oren has a unique perspective on Modern .NET’s innovation speed, as he’s been around since the beginning:
I can tell you something really frightening. I started using .NET when it was before the 1.0 release, which was C# 1.0, which didn’t have generics. And then we got generics at 2.0 and LINQ at 3.5 and async I think in 5.0 or something like that. And when you realize it, the pace of change is amazing. Some of the things that I’m looking at right now, we have switch expressions now and pattern matching. They allow you to write very succinct code. But I think to myself, if I was trying to learn C# right now from scratch, the scope that I would have to deal with is far larger and some of those things are really complicated.
So let’s sit back, open up a terminal, type in dotnet new podcast and we’ll dive into the core of Modern .NET.
Jamie : So Oren, thank you very much for joining us on the show again. For those who don’t know, Oren has been on the show in the past and if you check the show notes, there will be a link to the episode that Oren was on previously. But you’re going to want to stick around in this one because this one’s going to be quite exciting as well. So how are you, Oren? All right?
Oren : I’m doing very well, thank you for having me again.
Jamie : Hey, no worries, no worries. So I guess for the folks who haven’t heard your previous appearance on the show, I was wondering, would you mind giving us a bit of a sort of like an elevator pitch, a little bit about you maybe how you got started in development, that kind of thing.
Oren : Sure. So my name is Oren Eini. I’ve been programming professionally since - dear god - 1997/8, something like that. That’s the time I got the first checks for this, at least. I’ve been working primarily in the .NET field since the 1.0 alpha of C#. That’s roughly 23 years. I think it’s scary when you start putting numbers like that for the past 15 years or so. Yeah, it’s utterly ridiculous. I can say that I’ve been working on .NET for 20 years and I’m short-changing myself in this case.
I’ve been working on RavenDB, which is a non-relational database, for the past 15 years. It’s actually written in C#, which is an interesting experience, to write system software in a managed language, and it’s fascinating to see the changes over time. 15 years ago I was writing RavenDB and it was basically an ASP.NET MVC application and it accepted REST requests from the network and then it store[d] and processed data. And 15 years later we are just about to release the 6.0 version. We are actually releasing that next week. Probably by the time people hear this episode it’s already going to be out. And the architecture and the way that we are working are vastly different.
But we still have a non-trivial amount of code from 15 years ago that is basically unchanged. And it’s amazing to see the difference that this amount of time made both for RavenDB itself and the environment in which it operates. So we used to be a Windows-only system, now we are running on Linux. We used to run on .NET Framework, now we run on .NET Core, and now the normal .NET I guess it’s called. And the capabilities and abilities that we have are just amazing when you realize what environment we are operating on and in general.
Beyond that, I’ve written a bunch of books both on writing DSLs in .NET, how to use RavenDB properly, [I have a] half unfinished book about database development from scratch, those sort of things.
Jamie : Cool. Lots of stuff going on. There’s a lot to be said for folks who have been there from effectively the beginning, because whilst I wasn’t there from the beginning, you kind of were. And you can reflect upon where we’ve come from and where we’re going to, right? Whereas I can only reflect on what I’ve seen.
Oren : I can tell you something really frightening: I started using .NET when it was before the 1.0 release, which was C# 1.0, which didn’t have generics. And then we got generics at 2.0 and LINQ at 3.5 and async I think in 5.0 or something like that. And when you realize it, the pace of change is amazing. Some of the things that I’m looking at right now, we have switch expressions now and pattern matching. They allow you to write very succinct code. But I think to myself, "if I was trying to learn C# right now from scratch, the scope that I would have to deal with is far larger and some of those things are really complicated." I will give you a simple example.
Well, in 2005 and 2006, I was working on a project called Rhino Mocks, which was a mocking library for .NET. And in order to do that, I needed to generate a class at runtime and implement that, which meant that I had to learn how to emit IL, the assembly of .NET, at runtime, build, compile, all those sort of things, which was an amazingly effective thing to do. And like five years later async/await came out and I had to figure out how the async machinery is working. But I already had some basis in understanding the internals of .NET and generators and stuff like that. So I had much better grounding to understand that. And when you start looking at everything from scratch, the scope that you have to deal with is huge and I honestly don’t know how people are able to grasp all of that.
Then again, when I started learning to program, the challenge was to understand C++. And I would still say that C++ is more complicated than .NET today, by far.
Jamie : Totally. I appreciate what you’re saying there and I agree completely.
Getting started today, just because there is so much, must feel like an insurmountable task. I remember you mentioned about using .NET before generics. One of the first professional code bases I looked at, so like once I graduated and once I got my first programming job, it used the ArrayList type, which is a holdover from before generics. And so the way that that would work is, let’s say in modern .NET you may say, "I want a list of strings," right? And if you try to put an integer into that list, .NET will say, "hey, no, this is specifically for strings, you can’t do that." And every time you say, "give me an item from that list," you can guarantee that it will be a string because there’s type safety in place.
Problem is that ArrayList didn’t have that and you had to keep track of the order in which you had placed things into the array list. So you could place a string, then an integer, then a complex object, and then some other type, maybe a double or something. So when you were pushing them into the array list, or when you were popping them back out, you had to remember, "wait, was the first item a string or was it an integer?" Because when you pulled it out, it would come out as object and then you had to cast it back to whatever it was that you placed into the array list.
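The difference Jamie describes can be sketched in a few lines of C#. This is only an illustration: the values and variable names are made up for the example, and it assumes a modern .NET SDK where top-level statements are available.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Pre-generics style: ArrayList stores every item as object.
var untyped = new ArrayList();
untyped.Add("Alice");   // a string
untyped.Add(42);        // an int, boxed to object
untyped.Add(3.14);      // a double, also boxed

// The caller has to remember what went in at each position and cast it back.
string first = (string)untyped[0];
int second = (int)untyped[1];

// Remember it wrong and the mistake only shows up at run time.
bool wrongCastThrew = false;
try
{
    string oops = (string)untyped[1]; // position 1 actually holds an int
}
catch (InvalidCastException)
{
    wrongCastThrew = true;
}

// Generic style: the element type is part of the list's type.
var names = new List<string> { "Alice", "Bob" };
// names.Add(42);       // would not compile: an int is not a string
string name = names[0]; // no cast needed

Console.WriteLine($"{first}, {second}, wrong cast threw: {wrongCastThrew}, {name}");
```

With the generic list the compiler rejects the bad insert up front, whereas the ArrayList version compiles fine and fails only when the wrong cast executes.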
So yeah, you gave me a bit of a dreadful memory there, Oren.
Oren : The interesting thing about that is that for most code bases, it was fairly easy to understand that, "this is a list of strings," because it was customer names or something like that that you had. But from the perspective of the JIT and optimizations, you wanted to have the strong type safety, not just because of the compiler correcting you, but because that would generate much better code. And then you realized that there was a set of templates. Dear God, it was called T4. You had strongly typed collection generation at build time and something like that, which we now have using Source Generators in Roslyn, which is a whole different topic. And the cycle goes again.
Jamie : Yes. I’m lucky enough in that I’ve never touched any of the T4 stuff because it felt a bit silly. Or rather... okay, let me rephrase that. Not that it felt silly, it just felt like it was doing a little bit too much. Because I totally understand why folks may want to generate some code, but I had never actually hit upon a... maybe it was because I was mostly doing CRUD apps, but I never actually hit that point where I needed to generate code at build time. For me, I was just happy typing it all in.
Oren : That was super common, and it’s still common today. You just call it by a different name. If you look at tables for Entity Framework, for example, this is exactly what it does, because it used to be that you would have a template that would look at your database structure and generate the data model and the data access layer and all those sort of things, and you would start writing the code. And on day one of your application, you had 25,000 lines of code that no one would read because it was generated. And then you had to do nasty things to the code base to make it work. And that was an insanely common thing to want to do that. Yeah, you’re taking me way back.
Jamie : No, I’m super glad that things have gotten simpler and easier. Like you said, they’ve added to the complexity of the language. But I feel like everything’s become easier. Like, for instance, and this is the example I use, it’s a useless example. Pointless example. But if you look at Hello World in .NET 5 onwards, it’s one line of code and a comment. If you do dotnet new console, it gives you Hello World as one line, technically. Whereas prior to that, if I’m teaching someone how to do C#, I would then have to teach them what all of the eleven lines around that are: what a using statement is, what a namespace is, what a class is. What’s a private static void Main? What is a string, square brackets, args? What’s all that? And then, finally, the stuff that you need to run. That’s it. Ignore that. Focus on what is in the Main method. Afterward, we’ll talk about this later.
Oren : And yeah, absolutely. Going down to that level is, wow. Especially if you’re dealing with a lot of new projects all the time, because the level of ceremony that they have to deal with is so much lower. Then again, I think that one really good thing that happened with .NET is that all of this complexity isn’t something that we have to deal with on a day-to-day basis. And the perfect example here is LINQ, where for the most part, users of LINQ don’t need to deal with how it is implemented or the underlying complexity behind that. At the same time, writing a LINQ provider is a huge undertaking on the level… I remember the first time that I wrote RavenDB, the first version took me three months. And that’s, by the way, writing both client and server. And the LINQ provider for RavenDB took me more than that. So it was literally that it’s easier to build a database than write a LINQ provider for this database.
It’s that insane level of complexity.
Jamie : Sorry.
Oren : No, go ahead, please.
Jamie : No, you please.
Oren : It’s funny because a lot of those things that you see, I remember really struggling with the complexity of LINQ when it came out, and even today code bases such as Roslyn absolutely forbid using LINQ because of the performance implications. But from a convenience factor, from the ability to succinctly state what I want to say, it is amazing. And then you realize that, oh, as a language, and that comes back to the level of complexity that you have to deal with, I get to choose at what level I want to express myself, and I can express myself with very high-level concepts and get a lot of things done very nicely in an organized fashion.
Or I can drop down to the bits and bytes, and pointers, and unsafe code, and how it is represented, at the level that I’m writing C# code. Then I’m looking at the generated assembly from the JIT in order to optimize that, and then I realize that, hey, this is something that you don’t generally do in a managed language, but you can do that in C#.
Now, I had a case recently where I had a list, and I needed to filter out all of the negative values from this list. And this is a one liner in LINQ, .Where(x => x > 0), and that’s it. And this turned out to be a week of really hard work and minute optimizations to try to optimize what I’m doing, and it ended up being, "oh, I can do 20% faster than the LINQ version because I’m actually bounded by the memory bandwidth that I have available at the hardware level." But I actually have the ability to go from this amazingly expressive single statement all the way to manually working on the assembly in C#. And it’s the same language, same environment, which means that I can actually do a lot of interesting things with, oh, take a junior person, have them write the code as is, and then I can say, oh, this is slow, we can make this better, and I can optimize that particular piece without going insane.
If you’re familiar with how you would optimize a Python system, for example, the typical way of doing that is, okay, you write the code in Python and then you find the hotspot and you write it in C. And you can do the same thing in C#, but you also write the hotspot in C#. Except that, to be fair, the level of code that you write when you’re trying to do optimization at the high end is not something that most C# developers would even have a clue how to understand without, "I need a week to meditate over this piece of code."
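To make the two ends of that spectrum concrete, here is a minimal sketch that puts the LINQ one-liner next to a plain hand-written loop. It is not the optimization Oren describes (the transcript does not show that code); the KeepPositive helper and the sample values are assumptions made for illustration. Note that the predicate x > 0, as quoted, drops zeros as well as negative values.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical hand-written equivalent of the filter: a single pass over the list,
// with the result pre-sized so it never has to grow.
static List<int> KeepPositive(List<int> source)
{
    var result = new List<int>(source.Count);
    foreach (int value in source)
    {
        if (value > 0)
        {
            result.Add(value);
        }
    }
    return result;
}

var values = new List<int> { 5, -3, 0, 12, -7, 8 };

// The expressive one-liner from the conversation.
List<int> viaLinq = values.Where(x => x > 0).ToList();

// The lower-level version of the same filter.
List<int> viaLoop = KeepPositive(values);

Console.WriteLine(string.Join(", ", viaLinq));
Console.WriteLine(viaLinq.SequenceEqual(viaLoop));
```

Both versions produce the same result, so a hot path can start as the one-liner and be swapped for a hand-tuned version later without changing what callers see.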
Jamie : Sure. Or indeed a 60-year long career of poking around in CPUs to address individual registers and stuff. Right.
Oren : Yeah. It’s insane. Absolutely.
Jamie : Yeah, 100%. I really like the tooling around .NET and C#. It just feels like it’s been created by devs for devs, if that makes sense. Right. Like, let’s say I’m in Visual Studio and I’m in that situation you mentioned, right, where maybe I’m working with someone who’s written some code and it’s not the most optimized code ever. I don’t have to leave Visual Studio to do the optimization that you were talking about. I don’t really have to go anywhere. And that, I think, is the most important thing, is when we context switch. And you may not think it’s a huge thing, but when you change UI, that’s a huge thing. You have to rethink the paradigm behind it. Right. So if one developer is using, say, Visual Studio, and another developer is using Rider and another developer is using Visual Studio Code, moving between those machines, it might be a non-trivial amount of time, but there’s going to be an amount of time where you switch your thinking paradigm from, "I was looking at Visual Studio. Now I’m looking at Rider". Right.
Oren : It’s beyond that. So when I’m pairing with people, one of the things that is killing me, absolutely killing my productivity, is keyboard shortcuts. Like, I sit in front of the computer and I code and I don’t consciously, in my head, go, "okay, extract method, jump back, jump forward." I don’t really think about what I’m doing, but if I’m switching machines, then the keyboard shortcuts are different, which is so jarring. So even if I’m using Visual Studio here and Visual Studio there, I’m typically using the JetBrains keyboard bindings and they’re using the Visual Studio keyboard bindings, which are different. And it’s like a major shock each and every time. And don’t get me started on the types of keyboards. And now I’m starting to sound like an old man shouting at clouds; that was not my intent.
Jamie : No, I think it’s an important thing to talk about, even just briefly like this, because it even extends all the way down to operating systems. So I run a Linux-on-the-desktop desktop on my main desktop. How many times can I say desktop then? And in the Linux world, we don’t call it the Start button, we call it the super button. I have super and F, or Start and F, mapped to, "show me my file browser." Then I jump over to Windows 11, where I’m doing some work for some people, and I go, "oh cool, I need to get to Explorer," Windows and F, and it brings up the Windows feedback menu and I’m like, "so close. It’s like one key over."
But you’re right, it’s that muscle memory, right? It’s one of the reasons why cars have standardized on a certain user interface. And some people may be listening going, "what user interface is a car?" There’s a steering wheel; you’re interfacing with the car, right? There is a steering wheel in a specific place. If you drive manual, there is a gear stick in a specific place, right? You can guarantee that when you sit in a car, all of the controls that should be there are there in the right place. So then you can focus on the task of driving. But we haven’t gotten to that point with software development where we have a common overarching set of keyboard shortcuts. Yeah, we’ve got copy, cut and paste. And yeah, you could argue that you could do most of your work doing those.
Oren : But if I’m on a Mac versus Linux or Windows, it’s killing me. Especially if I’m doing a Zoom call and we try to share the keyboard on something on a Mac, then I’m doing a Control C, and on the other end it does nothing and, are you kidding me? And it’s also about the power of defaults. One of the problems as you move between environments, and if you’re pairing you’re doing that a lot, is that you’re typically not on your machine, and maybe you’re on a new machine or on a common machine, and they tend to use the default versions. So changing the default, even if you have this option, like in the key mapping options, is actually a huge hassle, to the point where I actually memorized both key sets and I can more or less switch between them. But it’s still really jarring.
RJJ Software’s Podcasting Services
Welcome to “RJJ Software’s Podcasting Services,” where your podcast becomes extraordinary. We take a different approach here, just like we do with our agile software projects. You see, when it comes to your podcast, we’re not just your editors; we’re your collaborators. We work with you to iterate toward your vision, just like we do in software development.
We’ve partnered with clients like Andrew Dickinson and Steve Worthy, turning their podcasts into something truly special. Take, for example, the “Dreamcast Years” podcast’s memorable “goodbye” episode. We mastered it and even authored it into CDs and MiniDiscs, creating a unique physical release that left fans delighted.
Steve Worthy, the mind behind “Retail Leadership with Steve Worthy” and “Podcasters live,” believes that we’ve been instrumental in refining his podcast ideas.
At RJJ Software, agility is at the core of our approach. It’s about customer collaboration and responding to change. We find these principles most important when working on your podcasts. Flexibility in responding to changing ideas and vision is vital when crafting engaging content.
Our services are tailored to your needs. From professional editing and mastering to full consultation on improving the quality and productivity of your podcast, we have you covered. We’ll help you plan your show, suggest the best workflows, equipment, and techniques, and even provide a clear cost breakdown. Our podcast creation consultation service ensures you’re well-prepared to present your ideas to decision-makers.
If you’re ready to take your podcast to the next level, don’t hesitate. Contact us at RJJ Software to explore how we can help you create the best possible podcast experience for your audience, elevate your brand, and unlock the vast potential in podcasting.
Jamie : I have… okay, so let’s go one step further. Right. Let’s say you’re working with someone who’s using Rider and then you have to do some Android development. So you switch to Android Studio, both made by the same company, both completely different key mappings.
Oren : No, leave that aside. Think about this: Rider and Visual Studio still have the same concepts. So you have solutions and projects and assemblies, etc. Move to Java, where you have something really different with the packaging model and how you deploy and debug and all sorts of stuff like that. It’s actually really funny when you think about everything beyond the code. So I can read Java code, I can more or less write Java code. I cannot debug Java code. I get the Java guy to sit with me so he can do the magic incantation that is needed to run something and understand that, which is hilarious. But it would take me a couple of weeks to actually learn the concepts and I never had the time to actually invest that much time in Java. On the other hand, make and C is, okay, I get that. I got that when I started working and it sort of makes sense to me.
Don’t get me started on Python and Go for deployment. The notions there, just everything is weird.
Jamie : It’s like they’re different languages.
Oren : No, I’m fine with the languages. It’s the environment that is killing me that it’s not doing the right thing.
Jamie : So let’s talk about RavenDB. Right. That feels like not a very good segue, but let’s talk about RavenDB. And we wanted to talk about RavenDB anyway because you’ve got a new version coming out, I’ve heard. And there’s some new features which sound pretty cool, but I was wondering for folks who maybe haven’t come across RavenDB before or haven’t had a chance to listen to the previous episode yet. Could you give us a brief description of what it is? And does it differ from other database provider systems? Is it a database? What does it do and how does it work?
Oren : So let me try to give you the basic context. That conversation we just had about jarring things and things not working, and you want everything to work just the way you want it. That’s how I felt when I built RavenDB, because I was so tired and frustrated from having to fight a database yet another day. And I used to be a database performance consultant and I did that without being a DBA. And the way that I did that was that I would go into applications and figure out what they were doing wrong, and some applications did things so wrong. The highlight of my day was figuring out that the application that would massively - this was in an Aptec system, and they were used to having lots of processing power dedicated to the database. But one of their applications was actually hitting 100% CPU utilisation on the database side and they couldn’t figure out what was going on because everything was supposed to be simple. So they called me and I started looking into that. And then I came to them and told them, "do you know that rendering a single page costs you 17,000 queries?" And they had no idea this was the case. Yeah, because their DBA, he was a very talented DBA, he looked at the queries and he optimized the queries. Okay, that’s great. But he didn’t look at what the application was doing. That is the record for what I’ve seen.
But I’ve seen so many places where rendering a single page means talking to the database dozens or hundreds of times. And the way that we do that is so inefficient. And it was really annoying, like super annoying, to see that over and over again. And I started having this vision in my head that I wanted to have a database that would work with you and wouldn’t give you those pitfalls that you have to be so careful about all the time. And eventually that became RavenDB. RavenDB stores data as documents, so as JSON documents, which means that you have the ability to express arbitrarily complex data very easily. And if we talk about, for example, a car lease scenario, so you want to lease a car. So let’s think about what you need from a data perspective. I have the car, I have the customer, I have the rental agreement and that’s about it. But then you realize that, "wait a second, in the car I have to select which car, which model, what features. I want a baby car seat, I need a young driver/rider on the insurance policy. I want to have this fuel return option and so much other stuff like that. For payments we are paying on three different cards, different percentage each, and a bunch of other stuff like that." And suddenly what seems to be a really simple system is now dozens if not hundreds of tables.
And now you have a very simple problem: show me the cars that I have out, so that I have rented them out. I need to know which are expected back today, which seems like something that you would reasonably want to do. But the problem here is that, okay, I want to see them, I want to see what options there are and what mileage I need from them and all of those sort of things. And suddenly, oh, I have to have an expert in databases craft this query, because otherwise what’s supposed to be a really simple system is going to start hammering the database with tons of queries. If you store the data structures as data in documents, then all of that can reside in one location, which makes the job a lot easier for the database.
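As a rough picture of what "all of that in one location" can look like, the sketch below builds a single lease as one JSON document using System.Text.Json. Every field name and value here is invented for illustration; the transcript does not show an actual RavenDB document schema, and this does not use the RavenDB client at all.

```csharp
using System;
using System.Text.Json;

// Hypothetical lease document: the car, the customer, the drivers, the extras and
// the payment split all live inside one object instead of separate tables.
var lease = new
{
    Id = "leases/1",
    Customer = new { Name = "Jamie" },
    Car = new { Make = "Toyota", Model = "Corolla", Extras = new[] { "Baby car seat" } },
    Drivers = new[]
    {
        new { Name = "Jamie", YoungDriver = false },
        new { Name = "Sam", YoungDriver = true },
    },
    Payments = new[]
    {
        new { Card = "card-1", Percentage = 50 },
        new { Card = "card-2", Percentage = 30 },
        new { Card = "card-3", Percentage = 20 },
    },
    ExpectedReturn = "2023-10-20",
};

// Serialize the whole lease as a single JSON document.
string json = JsonSerializer.Serialize(lease, new JsonSerializerOptions { WriteIndented = true });
Console.WriteLine(json);
```

Reading this lease back means loading one document, rather than joining a car table, a customer table, a drivers table and a payments table.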
And then I started looking into other aspects where you can improve things: how you manage relationships between documents, how you avoid Cartesian products on joins, those sort of things. One of the primary features that I also wanted to do is how you deal with working in high availability mode with data that is distributed across geographical locations, those sorts of things. RavenDB, as I mentioned, has been around for a while and got quite a lot of results. I’m really proud, however, that the core features remain mostly stable in terms of the basic offer. It is the database that you use and you don’t really need to think about all the time.
We are now about to go to the sixth version and this one has about three major things that are happening. One of them is about integration. So RavenDB can now read and write data from Kafka or RabbitMQ, other systems like that. And the basic idea here is that you throw the data into RavenDB and, "oh, data shows up in RavenDB and then RavenDB itself is going to push that to Kafka or the other way around." You have some part of your system that puts the data into Kafka and it’s automatically being pulled into RavenDB and you can start to query over that, process that, those sorts of things.
Other things that are new, which I’m really excited about: we have a couple of multi-year projects we actually started. One of them is called Corax, which is a new indexing engine, and we have been working on that for close to a decade. The first commit that I found for Corax is from 2014. And an indexing engine, yeah, it sits at the heart of what a database does, because you want to throw data into a database and then you want to pull it out and you need that to be efficient and easy. And the current indexing engine that we use is called Lucene, which is pretty much industry standard, and it took us a lot of time to replace that because it is a really good engine. But what we have done is basically take advantage of all of the new things that you can do in C#, whether this is hardware intrinsics or vector instructions, take all of the knowledge that we have about what our users are actually doing with the database, and produce something that on average is about ten times faster than Lucene could be. Which means that for indexing and queries, we’re able to deliver much lower latency than what we used to, and far better than most of the competition has on the market right now.
And it’s really interesting because one of the challenges that we had is that, oh, we already have an existing user base and features that we have to support. So we couldn’t just build the baseline, we had to build the complete package, whether this is, "oh, I need to be able to support full text search or geospatial queries" or any of those sorts of things on top of the new indexing engine. And we have done that and I’m super happy about that. Which means that you upgrade to the new version, you switch the indexing engine for the particular indexes that you want and you basically just get better performance, basically for free. You don’t need to change your code, you don’t need to change your system.
And that’s something that we have strived for for a very long time and it has been an incredibly complicated and challenging task. Specifically because one of the things that we do in RavenDB is give you a lot of freedom in how you design your indexes. So for example, I mentioned the car lease agreement. Try to imagine that you have multiple riders on the lease, multiple drivers on the lease. So I want to find all of the leases that have a driver named Jamie. So in a relational database, that would be, okay, let’s jump to two or three association tables to find that. But in a document database, I can query that directly, but now I have to query an array over a document. And another aspect that is really interesting is that when you start allowing the queries from the users, you have to realize that the typical user of a database doesn’t want to use a database, they want to be in an application. So the database, at least the way that we design RavenDB, is to assume that the people who are using us don’t want to care, they just want everything to work. And you only learn about the database internals for one of two reasons: because you think it’s interesting (I think it’s fascinating, but that has been my life for the past 15 years), or you run into a problem and you have to understand what’s going on in order for you to be able to achieve your goals.
And with RavenDB, one of the primary driving forces was, I want to avoid that. One of the things that we have really tried to do is anticipate what you want. So let’s talk about something fairly simple. You make a query over some documents, a collection or something like that, and you get some results back. And as you’re doing that, you’re probably running on your developer instance, which has a minimal amount of data, and you’re running on a developer machine which is really powerful, and you have one user of the system. Everyone is happy. The system is humming along and it’s fast and everything, and then you throw it in production, and all of those things that used to be true are no longer true. So you’re running on a data set that is much larger, especially if you have a lot of historical data. You have many users using the system at the same time. And funnily enough, you’re probably running a system that is much weaker than your developer instance or developer machine, because it used to be the case that you would have a good machine as a developer, but for production, "oh, this is a production machine. It’s much bigger, has more memory, better disk," all of those. And then the cloud came and you’re trying to optimize by putting your database on the cheapest machine that would bear its weight. And you combine all of them together and you throw that in production and it screeches to a halt. And then you need to understand how the database works, what’s going on, lots of things.
So one of the things that we did with RavenDB is trying to alleviate that. For example, if you run a query on RavenDB, then it’s going to give you the results. But think about it, how are you actually going to find results in a database?
So try to imagine a stack of documents on your table, all of the lease documents, rental documents, that you ever had. And now you try to say, "okay, give me all of the leases for Toyota Corolla" or something like that. Okay, you have to go through each one of them independently and that is going to cost you more the more leases you have. So the answer to that is that you have to index them. You have to create some sort of way to very speedily get to the right location. I used to use the idea of a card catalog at the library, but then I realized no one understands that any more. So maybe some people remember physical phone books and how you find people there because they are sorted based on the surname and then the first name. So you can very quickly find people there. Indexes work exactly the same way. You sort the data based on some field and then you’re able to query over that field to quickly find the items that match the particular term that you want. Now, the way that we handle queries is actually really interesting because you issue a query and we try to find the relevant index to execute this query so we can optimize. What happens if this is the first time you query over this field and there is no index that covers that? Well, we have a problem, but we also have a really good solution.
What are we going to do? We’re going to say, "okay, I have to scan through all of those documents anyway to find the data that you want. While I’m scanning them, I can go ahead and say, oh, you know what, I want to build an index for this. I want to make sure that the next time I’m querying over this data, over this field, I don’t have to pay the cost of traversing the entire set of documents." This is a relatively simple concept when I’m describing it like that. But it has some profound implications when you think about the overall design of the system. To start with, it means that you don’t really need to think about indexing from the get go. In fact, if you don’t think about any indexing at all, RavenDB is still going to produce the set of indexes that you want, or at least the set of indexes that you need in order to do whatever you want right now. And if you change your system, one of the challenging tasks that you have to deal with is, "oh, I change the behaviour of my system, sometimes very subtly, and that changes indexing behaviour." With RavenDB, you don’t need to do that because it would adjust to those changes automatically.
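As a rough illustration of the idea described above (building the index as a side effect of the first full scan), here is a minimal Python sketch. It is a toy in-memory model, not RavenDB's implementation: `TinyStore`, its `query` method and the sample lease documents are all made up for this example.

```python
from collections import defaultdict


class TinyStore:
    """Toy document store that indexes a field the first time it is queried."""

    def __init__(self, docs):
        self.docs = list(docs)
        self.indexes = {}      # field name -> {value -> [document positions]}
        self.full_scans = 0    # how many times every document had to be visited

    def query(self, field, value):
        index = self.indexes.get(field)
        if index is None:
            # No index covers this field yet: walk every document once and
            # build the index as a side effect of that same scan.
            self.full_scans += 1
            index = defaultdict(list)
            for pos, doc in enumerate(self.docs):
                index[doc.get(field)].append(pos)
            self.indexes[field] = index
        # From here on the answer is a dictionary lookup, not a scan.
        return [self.docs[pos] for pos in index.get(value, [])]


# Hypothetical sample data, loosely based on the lease example in the episode.
leases = [
    {"id": "leases/1", "car": "Toyota Corolla", "driver": "Jamie"},
    {"id": "leases/2", "car": "Honda Civic", "driver": "Oren"},
    {"id": "leases/3", "car": "Toyota Corolla", "driver": "Oren"},
]
store = TinyStore(leases)
first = store.query("car", "Toyota Corolla")   # full scan, index gets created
second = store.query("car", "Toyota Corolla")  # served from the index
```

In this sketch the second query never touches the documents that do not match; the cost of the scan is paid once per field rather than once per query.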
It also means some interesting aspects about the internal architecture of RavenDB, because typically when you tell someone, "oh, we’re going to create an index at production time," they are scared. This tends to be the open heart surgery for databases. Seriously. In many cases you are consuming a tremendous amount of resources, both compute and storage. In some cases you have to lock the particular table that you operate on, both for reads and writes. So this is a decidedly non-trivial operation, but with RavenDB, because this is something that we always wanted to have, the entire system is actually designed to allow us to do that. So that means that I have the ability to create an index in production and maintain operation while the index build is going on. Which means in turn that I can afford to do those sorts of things in production. So the next time that you go and query on a field that I don’t have, I can generate a better index. And then I have the old index and the new index running in parallel until they are in sync and then I can drop the old index. In effect, the more queries you make against the database, the more information you give the database engine, and then it allows you to basically produce the optimal set of indexes for you.
We also allow you to do the same thing when you want to define your own indexes, so not ones that the database manager defined for you. So let’s say that you want to define an index over some complicated part of the leasing agreement. Okay? For example, "compute the end date of the lease, which requires some business analysis, whatever business logic, as part of the indexing process." You can do that and then you need to change that because you realize, oh, I didn’t account for, I don’t know, some new rules that came out that give them another three days grace period, something like that. Okay, this is the sort of thing where you typically say, "okay, I have to drop this index, I have to create a new index. This means downtime for however much longer it’s going to take me to do that," and things of this nature. With RavenDB, you’re going to say, no, I’m just going to run them side by side. So again, we have the ability to create the new index and run it in a reduced resource mode until it catches up, because we want to ensure that we aren’t starving the system of the resources it needs to do its normal operations. And then, once it has caught up, I can atomically replace them.
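The side-by-side replacement described above can be sketched in a few lines of Python. This is a single-threaded toy under stated assumptions, not how RavenDB does it internally: `IndexSlot`, `start_replacement`, `step` and the `start_day` / `length_days` fields are hypothetical names, and the small `batch_size` stands in for the "reduced resource mode".

```python
class IndexSlot:
    """Serves queries from a live index while a replacement is built in small batches."""

    def __init__(self, docs, key_fn):
        self.docs = docs
        self.live = {}
        for pos, doc in enumerate(docs):
            self.live.setdefault(key_fn(doc), []).append(pos)
        self.pending = None    # (key_fn, partial index, next position to index)

    def start_replacement(self, key_fn):
        self.pending = (key_fn, {}, 0)

    def step(self, batch_size=2):
        """Index one small batch for the replacement. Returns True while work remains."""
        if self.pending is None:
            return False
        key_fn, index, pos = self.pending
        end = min(pos + batch_size, len(self.docs))
        for i in range(pos, end):
            index.setdefault(key_fn(self.docs[i]), []).append(i)
        if end == len(self.docs):
            self.live = index      # rebinding the reference is the swap
            self.pending = None
            return False
        self.pending = (key_fn, index, end)
        return True

    def query(self, key):
        # Queries only ever see the live index, never the half-built one.
        return [self.docs[i] for i in self.live.get(key, [])]


leases = [{"id": f"leases/{n}", "start_day": n, "length_days": 10} for n in range(1, 6)]

# Old definition: end date = start + length.
slot = IndexSlot(leases, lambda d: d["start_day"] + d["length_days"])
# New definition: the same, plus a three day grace period.
slot.start_replacement(lambda d: d["start_day"] + d["length_days"] + 3)

slot.step()                   # replacement has indexed 2 of 5 documents
during = slot.query(11)       # the old index keeps answering in the meantime
while slot.step():
    pass
after = slot.query(14)        # the new definition is live after the swap
```

The point of the design is that queries never wait for the rebuild: the old index answers until the new one has caught up, and the switch itself is a single reference change.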
So this is one feature that came from understanding that developers don’t actually care about the database, they just want it to go away. Store my data, don’t lose anything and give it to me when I’m asking you for it. I don’t want to know anything else. Just make it work. That’s the request that we keep hearing, and we started building that, and there were a lot of implications of that, and features that we had to do for this feature that were then really valuable for the more advanced scenarios where you actually do know what you’re doing and you want to do more interesting things.
A Request To You All
If you’re enjoying this show, would you mind sharing it with a colleague? Check your podcatcher for a link to show notes, which has an embedded player within it and a transcription and all that stuff, and share that link with them. I’d really appreciate it if you could indeed share the show.
But if you’d like other ways to support it, you could:
- Leave a rating or review on your podcatcher of choice
  - Head over to dotnetcore.show/review for ways to do that
- Consider buying the show a coffee
  - The BuyMeACoffee link is available on each episode’s show notes page
  - This is a one-off financial support option
- Become a patron
  - This is a monthly subscription-based financial support option
  - And a link to that is included on each episode’s show notes page as well
I would love it if you would share the show with a friend or colleague or leave a rating or review. The other options are completely up to you, and are not required at all to continue enjoying the show.
Anyway, let’s get back to it.
So I have a couple of thoughts and I’m going to throw them at you all at once and then see what you think about them. The first one, I guess, is, well then why don’t we just index everything? Right? I’ll let you answer that one in a moment. Because to put some context behind that, I used to work with a .NET dev who was like, "the database is slow. It’s definitely not my code. So I will put indexes on every single property, on every single field on every single table in the database and that’ll make it faster." And I’m like, "well, not really."
So there’s that, which I’d love to get your input on. And I guess what you’re talking about with this automatic indexing of, "first we need to index on this thing. I am RavenDB. I’ve detected we need to index on this field. And I’m also RavenDB and I’ve realized actually it’s this other field I need to index on. I’ll index on both until the first one is no longer needed, then silently switch over." Sounds a little bit like how the C# runtime JIT engine works. So for folks that don’t know what that means: just-in-time compiler that re-JITs your code as it runs. Right? So it’s sounding to me a little bit like that. It’s almost like trying to optimize itself as it goes along. Is that the case?
I have never thought about it in this manner, but yeah, there’s a lot of similarities around that.
So to give the context about the JIT, one of the things that the JIT can do is profile your application while it is running. For example, let’s say that you have a piece of code that has an if statement in it and that if statement is true 95% of the time. In some cases, the JIT can use the information "this is true 95% of the time" to massively optimize your system by shuffling code around, by doing other things that we don’t really care about usually. For RavenDB, we do pretty much the same thing. The more queries you give me, the more information I have about your system and I’m able to change the internal structure of how the database lays out the data on disk in order to optimize it. And it’s funny because in the JIT this is usually a one time deal, at least for the .NET JIT, I don’t think that it does deteriorate. So it doesn’t say, "oh, this information is now out of date, and now I have to select another to regenerate this method again with this new information," which I’m happy it doesn’t do, because already the tiering system and how it works makes my life as someone who wants to achieve the best performance a lot harder because I have to do more things in order to get the JIT to understand, "oh, this is exactly what I want to do."
But going back to your first scenario about why don’t I just index everything? In RavenDB, you can write a three-line index that basically says index across all documents, across all fields, make it happen. And it does that. And that is generally a really bad idea. Why is that? Well, to start with, indexing isn’t free.
Indexing is going to cost you in both CPU time and storage cost. And in the case of storage cost, we are actually talking about two separate things. One of them is the amount of storage that you consume on the disk. And the other thing is the bandwidth to the storage, which is often ridiculously low, especially if you run on the cloud. Consider the fact that you may have, "oh, I’m now indexing a leasing document, which is 10-20 kilobytes. And if I’m indexing all of the fields, I may be indexing 20 or 30 KB" versus "I’m indexing the field I actually care about, which is a hundred bytes or something like that", and multiply that by hundreds of thousands of operations at the same time, and it’s really easy to hit the limits of the underlying systems, and it’s a waste regardless. It also means that you’re going to take more memory. It means that you’re going to consume all of the resources from your instances. You have to move to the next level of instance, usually for nothing.
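To put rough numbers on the comparison above, here is a back-of-the-envelope calculation in Python. The per-document sizes come from the figures quoted in the conversation; the write count of 200,000 is an assumption standing in for "hundreds of thousands of operations", and the variable names are made up for this sketch.

```python
# Sizes quoted in the conversation.
all_fields_bytes = 25 * 1024   # midpoint of "20 or 30 KB" when every field is indexed
one_field_bytes = 100          # "hundred bytes or something like that"

# Assumption: 200,000 stands in for "hundreds of thousands of operations".
writes = 200_000

total_all_fields = all_fields_bytes * writes   # bytes pushed through the indexing path
total_one_field = one_field_bytes * writes
ratio = all_fields_bytes / one_field_bytes

print(f"index everything: {total_all_fields / 1024**3:.2f} GiB")
print(f"index one field:  {total_one_field / 1024**2:.2f} MiB")
print(f"ratio: {ratio:.0f}x")
```

Under these assumptions the "index everything" approach moves a few gibibytes through storage where the targeted index moves a few tens of mebibytes, which is the kind of gap that pushes you to a bigger instance for no benefit.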
A great example of where it bites you is actually not with RavenDB, but with CosmosDB. By default, you start running on CosmosDB and you can query and it’s working great, except that behind the scenes, by default, it would literally do that. It would index all of the fields and you’re paying for that. CosmosDB’s cost model is based around something called WCU and RCU. I don’t remember - read consumption unit and write consumption unit. And you write a document that is 1 KB in size. Okay, that’s great. And then you pay for one write consumption unit, but then you index all of the fields and a 1 KB write will turn into a 10-20 kilobyte write and the CPU associated with that and all of those sorts of things. And it’s very easy to have 10-20 times higher cost just because of this single default. Now, the reason that CosmosDB does that is that indexing for them is a fairly complicated and very costly scenario. So you either have to define what you want upfront and realize that, "oh, if I need to change that, this is a big issue that we have to deal with, including looking at the monthly expense for Azure, it’s going to jump by two, three times." Or we just have to accept from the get go that we’ll index everything: "We have more flexibility, but we’re paying a lot more from the get go."
Yeah, and there are other things like that, which relate to, can I actually make use of the index in the way that I would like? So the way that Raven handle[s] indexing is really quite different. So consider the case of, again, going back to the leasing agreement, maybe I want to search by the license plate or by the customer ID or by the date. So let’s say that I want to ask a question such as, "okay, give me all of the rental agreements for car number one, two, three for the past year," and I have those three fields indexed, but in a relational database, in MongoDB, in CosmosDB, I cannot use those separate fields to optimize this query. I may use something called a compound index. So I index both the license plate and the date as one field, basically. But in general, this is a very advanced scenario. You have to deal with a combinatorial explosion of all of the options that you have. The way that Raven and Corax do that is really different because we index each field independently, but we are actually able to optimize across the board, across all of them, without having to deal with, "oh, I have to have one index for license plates by ID, and another for license plate by date, and another one for customer by date." With RavenDB, you define one index for all three of them and it covers all of your scenarios.
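One way to picture "index each field independently, then combine them at query time" is as set intersection over per-field lookups. The Python sketch below shows only that general idea; it makes no claim about how Corax actually does it, and `build_field_index`, `query` and the sample rentals are hypothetical.

```python
from collections import defaultdict


def build_field_index(docs, field):
    """Map each value of one field to the set of document positions holding it."""
    index = defaultdict(set)
    for pos, doc in enumerate(docs):
        index[doc[field]].add(pos)
    return index


# Hypothetical rental agreements.
rentals = [
    {"plate": "123", "customer": "customers/1", "year": 2023},
    {"plate": "123", "customer": "customers/2", "year": 2021},
    {"plate": "456", "customer": "customers/1", "year": 2023},
    {"plate": "123", "customer": "customers/1", "year": 2022},
]

# One independent index per field, no compound (plate, year) index anywhere.
indexes = {f: build_field_index(rentals, f) for f in ("plate", "customer", "year")}


def query(**criteria):
    """Answer any combination of field filters by intersecting per-field matches."""
    matches = None
    for field, value in criteria.items():
        hits = indexes[field].get(value, set())
        matches = hits if matches is None else matches & hits
    return sorted(matches or [])


by_plate_and_year = query(plate="123", year=2023)
by_customer_and_year = query(customer="customers/1", year=2023)
by_plate_and_customer = query(plate="123", customer="customers/1")
```

Three single-field indexes answer all three pairings here, whereas the compound-index approach would need a separate index for each combination you expect to query.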
So what I’m taking mostly from this conversation so far is that perhaps a lot of devs don’t quite fully understand how a database works, and perhaps we should learn a little bit more about how databases work.
To an extent, because I mentioned this a little bit about C#, but think about how a standard developer today has to deal with a whole bunch of crap, whatever this is, "okay? I have to deal with responsive designs and I have to understand accessibility issues, and I have to understand, I don’t know, GDPR rules or any bazillion things that you have to deal with" just to get the okay to release software. And the idea here is that if you also require them to understand database internals, that’s a big ask.
Yeah, so that’s one reason RavenDB even exists, because I thought this is ridiculous that you have to become a database expert in order to build what’s going to be a business application. Focus on the business value, not on where the data lies on disk. That’s not something that you want or need to care about.
One of the things that we actually made in 6.0 is another feature along the same lines. So typically when you think about databases, you think about, "oh, I have a database that sits on a server." But you have to realize, okay, what happens if that server is down? So you need a high availability scenario, which means multiple servers. With Raven, we had that from the get go, and we have a really good system where you spin [up] three servers and they manage themselves, do failover the entire time. But that still means that you’re limited to the amount of data that you can put on a single machine because they are basically replicas of one another. In 6.0, we enabled again a feature that we used to have and we dropped, which is called Sharding, which allows you to store more data than you can fit into a single machine. And the reason that we dropped it in the past was that it was too complicated to operate in production for most people and we would rather not have the feature than have something that was half-assed or didn’t mesh with the usual way that we did things.
So we actually spent, we started working on that even before Corona, COVID. And so this is now three years in development for this one feature. And the whole point of this feature is that you can run your system on data sets that are larger than you can have on one machine. So talking about 5-10+ terabytes of data, and you will not notice that there is any difference, which is ridiculous because we spent an enormous amount of time trying to make sure that you wouldn’t notice that we spent an enormous amount of time on this feature. Yeah, it’s really funny, but that’s my pet peeve. I want you to really just use the database and forget that you’re using it. Because I have seen so many basically crimes against code and maybe even humanity at some point. The sort of queries that I had to read is like, "okay, you’re going there, but you’re going to London via Las Vegas on a bicycle. It’s not really going to end up well for you, nor for the fishes." It’s utterly ridiculous.
And I really want the scenario to be something that… let’s talk for a second about Sharding, specifically. One of the key problems with Sharding is that, okay, you can run your system on a larger than one machine setup. This is not a unique feature. CosmosDB has it, MongoDB has it, many other databases have it. But the difference from my perspective is that when you start using Sharding you have to select something called the Sharding strategy or the partitioning function, there are all sorts of names for that. And the basic idea is, how do I determine where a particular piece of data is meant to go? Is it going to be on node A, node B, node C, etc.?
So you define some sort of partitioning scheme that manages that. The issue here is that you have to define this function very early on in the process when you don’t know much about your system and then you realize that, "oh, I made a mistake and the partition function that I chose was based on per customer data." Which makes sense because I have customers, right? That’s supposed to be a good partitioning scheme. What’s the problem? Well, your customers aren’t nicely distribut[ed]. You have 2 million customers who have two, three items each and you have a hundred customers, all of them have a million items each. And your biggest customer is now going to hit the partition size limit and this is something that can happen to you very easily. With DynamoDB, with CosmosDB, with MongoDB, you hit the size limit and the database starts rejecting writes. And the size limit isn’t big. In some cases it’s about a maximum of ten gigabytes; and consider, "oh, I have a very big customer or very big set of customers. They are probably the customers that I make the most money on and at some point I’m going to hit the limit and going to start rejecting the writes." That is an [unknown words].
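The skew described above is easy to see with a little arithmetic. In the Python sketch below, the customer counts and the ten gigabyte limit are taken from the conversation; the item size of roughly 20 KB is an assumption (borrowed from the leasing document size mentioned earlier), and the names are invented for the example.

```python
ITEM_BYTES = 20 * 1000            # assumption: roughly 20 KB per item
PARTITION_LIMIT = 10 * 1000**3    # "about maximum of ten gigabyte" per partition


def partition_bytes(items_per_customer):
    """Size of one customer's partition when the partition key is the customer."""
    return items_per_customer * ITEM_BYTES


small_customer = partition_bytes(3)          # one of the 2 million small customers
big_customer = partition_bytes(1_000_000)    # one of the hundred big customers

# How many items a single customer can store before writes would be rejected.
items_until_full = PARTITION_LIMIT // ITEM_BYTES

print(f"small customer partition: {small_customer / 1000:.0f} KB")
print(f"big customer partition:   {big_customer / 1000**3:.0f} GB")
print(f"limit reached after {items_until_full:,} items")
```

With these assumed sizes a small customer uses a vanishing fraction of a partition, while each of the big customers passes the limit halfway to their million items, so the customers that matter most are the first to have writes rejected.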
With the way that we design Sharding for RavenDB, you define a Sharding function and it’s going to do its best. And if you didn’t do something properly? Well, you have to wait until one partition is across the terabyte range before it becomes, "oh, we need to do something about it at some point," but not, "oh my, we have to have it shipped." In a month, in the next release, next year, we will deal with that, and in the meantime everything is still going to work, because the underlying assumption that we have for RavenDB is that you’re uninterested in the database and you just want it to work. "Shut up, keep my data, give it to me when I’m asking, otherwise I don’t want to hear from you." That’s the behaviour that users want from a database.
Jamie : Sure, I mean that makes sense, right? Because as a person who drives a car, I don’t need to know how the internal combustion engine works, I just need to know that if I’m in gear, if the engine is switched on and I push on the accelerator it should move forward, right?
Oren : And go back 30, 50 years. It used to be that in order to be a responsible car owner, you were expected to do a weekly oil check and every two weeks you had to kick the tires, make sure that there were no ongoing maintenance concerns for the car. If you did not do that, the car would break in the most awful possible manner. And today, for a modern car, take it to the garage once every 10,000 km, 15,000 km. Don’t worry about it otherwise. And you know what, if something is wrong, we’ll pop a light in the dashboard to tell you, "hey, take this to the garage before it becomes a big issue." And the experience around that, not having to do routine maintenance on the car, unless you really want to, means that the car went from something that you deal with to something almost invisible. I sit in the car, I move forward, I go backward; like with the keyboard shortcuts. I don’t need to think about that. I don’t need to worry about it. It’s just there and it does what it needs to do.
Jamie : I like it. So how do folks go about so someone’s listening to this going, "oh my goodness, I got to go to RavenDB and move all my data." How are they going to go about that? Do they just need to go to the website and read a bunch of documentation? Is that the best way to learn about it? Are there like YouTube videos or what?
Oren : There are YouTube videos and guidance around that. There is a book I wrote that is available on the website for free. There is also a guide that walks you through basically a boot camp of how you go through everything in the process. You can go to ravendb.net/download and by the time you hear that, we’ll have the 6.0 version out and ready and you will be able to basically just download and run it. There is a setup wizard that takes you through the process and gives you everything. You can also go to demo.ravendb.net which is going to give you examples of how to do all sorts of really interesting things inside of RavenDB and expose how things are operating. And beyond that, go to the website. There is a learn section with a huge amount of materials that will teach you how to use RavenDB effectively.
Jamie : Excellent. Okay, so what about just real quick thinking ahead, someone asking this question, "can I talk to it using Entity Framework or do I need to get like a special… is there an Entity Framework provider for it or do I need to download some other code and wire that up in my app?"
Oren : So the way that you typically use RavenDB from a .NET application is through the NuGet package RavenDB Client or RavenDB.Client and it has the same feel as the Entity Framework client. So you’re working with entities, C# entities, no need for attributes or something like that. The difference from Entity Framework is that you don’t need to think about, "oh, I have to define the tables up front, I have to define the relationship, everything." You write the code and you just persist them and they show up in the database almost magically, especially if you’re used to the drudgery of working with databases. This is so much easier. And the client has all of the features that you’re familiar with, which is whatever this is, the LINQ provider, or unit of work, change tracking, all of those sorts of things.
Jamie : Cool. Okay, so go get the RavenDB Client NuGet package, wire some things up, maybe a connection string or two, get some data into your RavenDB instance, and you’re off to the races, as it were.
Oren : Absolutely.
Jamie : Fantastic. Okay, what about interacting with you, Oren? I know that at the time of recording there’s things happening on Twitter. Maybe people are walking away from it, there’s Mastodon, BlueSky and all that kind of stuff. Are you on those or are you like, "no, just go through the website"?
Oren : So you can find me on Twitter @ayende A-Y-E-N-D-E. Feel free to send me messages there or you can find me over email. oren
Jamie : Fantastic. Fantastic. Okay, cool. Well, I mean, I’ve really enjoyed chatting with you about not just RavenDB, but about database things and our conversation there, about keyboard shortcuts and things. I really liked how we were able to sort of bookend the entire conversation, right. It’s all related, absolutely everything we said. So I really enjoyed that.
So what I want to say, Oren, thank you ever so much for spending some time with me and the listeners today talking about these kinds of things because I think it’s really quite important.
Oren : Thank you for having me. It’s been wonderful.
Thank you for listening to this episode of The Modern .NET Show with me, Jamie Taylor. I’d like to thank this episode’s guest, Oren Eini, for graciously sharing his time, expertise, and knowledge.
Be sure to check out the show notes for a bunch of links to some of the stuff that we covered, and full transcription of the interview. The show notes, as always, can be found at the podcast's website, and there will be a link directly to them in your podcatcher.
And don’t forget to spread the word, leave a rating or review on your podcatcher of choice - head over to dotnetcore.show/review for ways to do that - reach out via our contact page, or join our discord server at dotnetcore.show/discord - all of which are linked in the show notes.
But above all, I hope you have a fantastic rest of your day, and I hope that I’ll see you again, next time for more .NET goodness.
I will see you again real soon. See you later folks.
- Rhino Mocks
- Episode 111 - RavenDB with Oren Eini
- RavenDB’s search engine: Corax
- Apache Lucene
- Oren on Twitter
- Oren’s blog
- Supporting the show:
- Getting in touch:
- Music created by Mono Memory Music, licensed to RJJ Software for use in The Modern .NET Show