S06E14 - From .NET to DuckDB: Unleashing the Database Evolution with Giorgi Dalakishvili

Embedded Player

The .NET Core Podcast

S06E14 - From .NET to DuckDB: Unleashing the Database Evolution with Giorgi Dalakishvili

Supporting The Show

If this episode was interesting or useful to you, please consider supporting the show with one of the above options.

Episode Summary

In this episode of The Modern .NET Show podcast, Giorgi Dalakishvili, a seasoned .NET developer with 15 years of experience, shared his insights on various topics related to database technologies and development frameworks. As an experienced developer, Dalakishvili has worked with different versions of .NET and frameworks such as Windows forms, ASP.NET web forms, ASP.NET MVC, and ASP.NET Core. He has also delved into technologies like GraphQL, SOAP services with WCF, and mobile apps with Xamarin and .NET Maui.

Postgresql presents certain features that are hard to implement in SQL Server. For instance, it supports an array data type, allowing developers to store a list of strings associated with a record without the need for creating separate tables. Additionally, Postgresql offers range types, which can be useful for ensuring uniqueness in scenarios such as appointment scheduling or price ranges for products. These unique features make Postgresql a compelling option for developers looking to optimize their database solutions.

Furthermore, Dalakishvili discussed the benefits of adding extra validation and constraints directly in the database. Postgresql enables developers to implement an additional layer of data integrity by defining constraints and validations within the database itself. This approach can be particularly useful in scenarios where specific conditions need to be met before inserting data into the database, such as preventing overlapping events or ensuring non-overlapping price ranges. By leveraging these capabilities, developers can enhance data reliability and minimize potential errors.

During the episode, Dalakishvili also introduced an open-source library called “Entity Framework Exceptions.” This library simplifies the process of handling specific database exceptions, such as unique constraint violations or data length errors, in a more database-agnostic way. By utilizing this library, developers can streamline their error handling code and ensure compatibility with all major databases without the need to modify error codes or make changes to their existing codebase.

Beyond Postgresql, the episode also discussed DuckDB, a columnar database that excels in OLAP queries and analytical processing. DuckDB’s column-by-column data storage approach enhances query performance, particularly for complex analytics tasks. Notably, DuckDB supports querying Parquet files and JSON files. It can even run queries on files stored in S3 buckets on AWS through HTTP, dynamically fetching only the required portions of the file based on the query conditions. Additionally, DuckDB is compatible with Dapper, a mini ORM for .NET. This compatibility allows developers familiar with Dapper to seamlessly work with DuckDB, providing a familiar and efficient database interaction experience.

Throughout the episode, Dalakishvili emphasized the importance of focusing on business logic while minimizing the time spent on configuring and installing tools and libraries. He believes that developers should prioritize producing high-quality business logic and reduce the time spent on the underlying infrastructure. By leveraging powerful and efficient tools like Postgresql and DuckDB, developers can streamline their development process and optimize their productivity.

Episode Transcription

On Apple Podcasts

Just a quick note for those of you are iOS users (around 40% of the community of listeners):

When iOS 17.4 drops, you'll be able to read transcriptions for the show in Apple Podcasts. Problem is, I have no way to test whether these transcripts are correct or not, and given the technical nature of the show (and my own experience of creating the transcripts for the website) I doubt that they'll be correct.

I'm working with the podcast hosting company that I use for the show to figure out the best way to provide in-app transcripts, but until then I can't guarantee the quality of the Apple provided transcripts.

The best place for a transcription will always be the show's website, and there will be a link to this episode's transcript in the show notes.

Yeah. So what I was thinking the other day is that what we want is to concentrate on the business logic that we need to implement and spend as small as little time as possible configuring, installing and figuring out the tools and libraries that we are using for this specific task. Like our mission is to produce the business logic and we should try to minimize the time that we spend on the tools and libraries that enable us to build the software.

- Giorgi Dalakishvili

Welcome to The Modern .NET Show! Formerly known as The .NET Core Podcast, we are the go-to podcast for all .NET developers worldwide and I am your host Jamie “GaProgMan” Taylor.

In this episode, I spoke with Giorgi Dalakishvili about Postgresql, DuckDB, and where you might use either of them in your applications. As Giorgi points out, .NET has support for SQL Server baked in, but there’s also support for other database technologies too:

Yes, there are many database technologies and just like you, for me, SQL Server was the default go to database for quite a long time because it’s from Microsoft. All the frameworks and libraries work with SQL Server out of the box, and have usually better support for SQL Server than for other databases.

But recently I have been diving into Postgresql, which is a free database and I discovered that it has many interesting features and I think that many .NET developers will be quite excited about these features. The are very useful in some very specific scenarios. And it also has a very good support for .NET. Nowadays there is a .NET driver for Postgres, there is a .NET driver for Entity Framework core. So I would say it’s not behind SQL server in terms of .NET support or feature wise.

- Giorgi Dalakishvili

He also points out that our specialist skill as developers is not to focus on the tools, libraries, and frameworks, but to use what we have in our collective toolboxes to build the business logic that our customers, clients, and users desire of us. And along the way, he drops some knowledge on an essential NuGet package for those of us who are using Entity Framework.

So let’s sit back, open up a terminal, type in dotnet new podcast and we’ll dive into the core of Modern .NET.

Jamie : So thanks for being on the show, Giorgi. I hope I’m pronouncing that right. I’m probably not, because I’m really bad at pronouncing words and names and things, but thank you very much for being on the show.

Giorgi : Thank you for inviting me to the show and having me.

Jamie : No worries, my friend, no worries. So I was wondering, before we get started, started with talking about what we’re going to talk about today. So if you could give the listeners a quick elevator pitch about sort of who you are and your experience with .NET and things like that.

Giorgi : Yeah, sure. My name is Giorgi Dalakishvili. I’m a .NET developer for 15 years. I started working with .NET when .NET [Framework] 2.0 was the current version. So I have been going .NET for quite a long time and I have seen lots of developments and changes in the .NET world. I started my career building desktop applications with Windows forms. Later I moved to web development with ASP .NET web forms, ASP .NET MVC, and after that with ASP .NET core.

Nowadays I mostly build backend services with .NET Core, mostly RESTful services. But I have worked with GraphQL, and SOAP services with WCF as well. And I build mobile apps with Xamarin and lately with .NET MAUI. So basically I have worked with every major .NET framework that Microsoft has released during the past 15 years, apart from a couple of ones that I have not touched.

That’s a short introduction to my experience. I also have some small open source projects on GitHub which I work on on my free time, so I will be happy if you check out those projects. I hope you find them interesting.

Jamie : Excellent, excellent. Well, what I’m going to do is I think we’ve already got some links anyway, but I’m going to put some links to your GitHub and your open source stuff. Specifically, I’ll link to the DuckDB .NET driver that we’re going to talk about in a little bit. But I thought that it would be useful if we talked a little bit before we get to the DuckDB .NET driver. I thought perhaps we could talk about DuckDB, but then in order to talk about that, we may have to talk about databases a little bit. So that’s kind of the plan for everyone listening in: we’re going to talk about databases. So if you don’t know much about databases, hey, you’re going to learn some stuff. I certainly don’t know a huge amount, but there you go.

So my working database knowledge is that with .NET I’ve used both SQL and SQLite databases. But I do know that there are other database “technologies” I want to say. Is that the right word?

Giorgi : Yes, there are many database technologies and just like you, for me, SQL Server was the default go to database for quite a long time because it’s from Microsoft. All the frameworks and libraries work with SQL Server out of the box, and have usually better support for SQL Server than for other databases.

But recently I have been diving into Postgresql, which is a free database and I discovered that it has many interesting features and I think that many .NET developers will be quite excited about these features. The are very useful in some very specific scenarios. And it also has a very good support for .NET. Nowadays there is a .NET driver for Postgres, there is a .NET driver for Entity Framework Core. So I would say it’s not behind SQL server in terms of .NET support or feature wise

Jamie : Excellent, because that was going to be the first thing that I was going to talk about was like: if there are this many database technologies, then why do they all exist? And I appreciate you may not be able to answer this, but you did say earlier on that in a subset of situations, Postgres might be a better database technology to use. And I was wondering, do you have any examples of those situations, as it were, where a Postgres database may be more suitable than say a SQL or SQLite?

Giorgi : Yes. So Postgres has some features that are hard to implement yourself in SQL Server.

Some scenarios, for example, one of them is that Postgres supports an array data type. So sometimes you have a record and you want to associate some for example, you have movies and you want text associated with that, movies or with a blog post or something like that. And with SQL Server, usually you create a second table or maybe even a third table if you want many-to-many relationship and record which movie is associated with which text. And if you have only a few texts, like most posts, for example, usually have two or three texts associated with them. Having a separate table may not be the best. I mean it’s not a bad thing, but it’s an extra overhead to create the table, manage the table and so on. So with Postgres array type, you can just have a column, which is a string array and keep like a list of the strings for that specific record.

Another case that I have is the range types. With Postgres you can have a data range associated data range column. And what this means is that for example, imagine you are building an appointment type application, for example like meeting appointment similar to calendar application. And you want to ensure that there are no two events for the same person or for the same meeting rooms that are overlapping because you can’t have two meetings in the same meeting room, right? So you have an event which starts at some time and ends at some time and you want to make sure that there are no two overlapping events for some specific resource, either a person or a meeting room or some other resource.

With SQL Server, there is no easy way to ensure that uniqueness. So with Postgresql range, you can keep the start date and end date in the range field. And what postgres… and if you create a unique constraint on that column, Postgres will ensure that there are no overlapping items to two different rows which overlap with the range column. So this gives you a unique constraint. You can’t insert a second event if the start time and end time of this event overlaps with start time and end time of another event already in the database. And you will get a unique exception which your application will need to handle.

Jamie : I see, I see.

It’s more like, and I’m inferring from what you’ve said, it’s more like I can have the database set. I don’t think validation is the correct term to use, but it’s a little more smart when it’s inserting data into the database. Right. I can say, like you said, "I have a bunch of events and when I add a new event to the database, I don’t want any of them, their start and end times to overlap." So rather than having to write that, I mean, I would still write that validation in my server side code anyway before it goes into the database. But I can have that extra sort of belt-and-braces check. Right?

Giorgi : Right.

Also, if you are selling some products on your website or you have a CMS for the products that you are selling and you are doing price ranges, for example, if you buy between one and five number of items it will cost this specific amount. If you buy between six and 20 it will cost this amount and so on. And you want the users to be able to introduce, to edit these ranges of quantities which have different prices per item. And you want to ensure that these ranges that the user creates for prices and quantities, depending on the quantity you don’t want there are any overlaps; like you don’t want to have between 10 and 20 and 15 and 50 because then that case you can’t really tell how much you are going to pay. So it also is useful in that situation. You create a unique constraint on the quantity and Postgresql will ensure that you can’t insert two overlapping items.

Jamie : Right, I see. So if I want to, I can put more logic into the database so that, then there’s that little bit of extra stuff. That’s pretty cool. Yeah, I like that.

Okay, so then does that mean then that I need to rethink the way that I’m storing stuff in the database? So, for instance, in a SQL situation, I’m working with tables. I’ve got relational data. I may have a product table and an order table, and an order may have several products in there. So there’s some kind of referential integrity. I’m simplifying way too much here, but that kind of thing. And I realize we’re going to be talking about DuckDB in a moment, but because we’ve talked about Postgres, you may not know this. I don’t know, am I storing still things in tables or is it a document thing? Or I’ve got something here about storing things in vectors. And maybe it’s that, I’m not sure.

Giorgi : Yes. Before we move to vectors and DuckDB, first let’s plug in another [of] my open source project, the most popular one that I have on GitHub. The library is called EntityFramework.Exceptions. And I mentioned handling unique constraint, which will be thrown from Postgres if you try to insert overlapping records with ranges. So you may have other tables that may throw unique constraints. For example, you have a user table and the email column is unique and you want to handle that situation or you get another database error. For example, if you try to insert data that is too long for the column, you will get a database error.

So one of the situations that I found useful, that I found common in my experience, is that you want to know if the. So if you are using Entity Framework Core and you get a DbUpdateException, you may want to know, "is this exception because of a unique constraint?" or, "is this exception because of I’m trying to insert too much data for this column?" and without this library, you have to dig into the inner exception. In the nested inner exceptions of the DbUpdateException that Entity Framework gives you; you have to check the error number or number or whatever property the database specific library has, and you have to compare it to the error code of the error that you are interested in. And the error codes are different from one database to another database. So unique constraint error will have a different error number for SQL server, different error number for Postgres, different one for MySQL, and so on.

So if you are using Entity Framework core, usually you don’t want database specific error codes in your code base. And what this library does is that once you add the library to your project and configure it in the Entity Framework configuration, instead of getting DbUpdateException , you will get the exceptions defined in this library that are specific for the error. For example, if you try to insert a user with the same email and you have a unique constraint on the email column, you will get a UniqueConstraintException in your code instead of generic DbUpdateException . And the same way you will get MaxLengthExceededException if you are trying to insert column value which is too big for that column.

And the good thing is that this library supports all the major databases, so you don’t have to think about the error codes. And your error handling code does not need to know which database your code is working with. So in theoretical situation, if you switch to a different database, you just change the package for that database and you will still get the same exception. So you won’t have to make any changes to your error handling code. That’s what the library does. It recently crossed the 1000 stars mark on GitHub and it has a couple of million downloads on NuGet. So I think it will be interesting for the listeners.

HTTP timeouts. Database deadlocks. In software, it’s not a matter of if things fail, it’s a matter of when. One mishap like this and some valuable data is lost forever. And these failures occur - all - the - time. But it doesn’t have to be this way.

Introducing NServiceBus, the ultimate tool to build robust and reliable systems that can handle failures gracefully, maintain high availability, and scale to meet growing demand. For more than 15 years, NServiceBus has been trusted to run mission-critical systems that must not go down or lose any data, EVER.

And now, you can try it for yourself. NServiceBus integrates seamlessly with your .NET applications and can be hosted on-premises or in the cloud. Say goodbye to lost data and system failures, and say hello to a better, more reliable way of building distributed systems. Try NServiceBus today by heading over to https://go.particular.net/themoderndotnetshow and start building better systems with asynchronous messaging using NServiceBus.

Jamie : Wow. And just remind us which library is this again? Because obviously you’ve got a couple on your GitHub.

Giorgi : It is EntityFramework.Exceptions.

Jamie : Yes, okay, I can see it now, sorry. I’m also browsing through your GitHub whilst we’re talking. Excellent. What I’ll do is I’ll make sure to put a link to that in the show notes. That seems like it would be very useful. Excellent. Yeah, I’m forever getting exceptions in Entity Framework just because I do it wrong. But that’s okay.

Giorgi : Yeah, you can’t avoid exceptions. Like for example for this situation when you are inserting data that is too big, you can check the length, but for unique constraints you can’t really avoid the exception because sometimes, even if you check before inserting, because of race condition, you may try to insert the same record. So there is sometimes no way to avoid the exception. This is where this library helps.

Jamie : Excellent. I do like having libraries that help me out. The big thing with me is I want code to make my life easier as a developer, right? And that’s what we all want, isn’t it?

Giorgi : Yeah. So what I was thinking the other day is that what we want is to concentrate on the business logic that we need to implement and spend as small as little time as possible configuring, installing and figuring out the tools and libraries that we are using for this specific task. Like our mission is to produce the business logic and we should try to minimize the time that we spend on the tools and libraries that enable us to build the software.

Jamie : I agree 100%. I often tell juniors and people who are beginning their journey or asking me for any kind of mentorship, I often say to them that the person who has tasked you with building this stuff usually doesn’t care about how you’ve done it, just that it is done. And so, like you said, it’s all about that business logic. If you are talking to a customer and this can be someone internal to a company that you work for or external from you, maybe it’s someone who’s buying or paying for the software that you’re building. They don’t care. All they care is that, "when I push this button, the thing happens," and that it gives them value. It doesn’t give very many people value to store records in a database. The value comes from acting on those records, right?

Giorgi : Right. Yeah. That’s why I usually try to minimize the boilerplate code that I have to write to support the main features of the application.

Jamie : Sure.

We were going to talk a little bit about how Postgresql stores - whether it stores data differently, because it may not to say SQL server. And we were also going to talk about vectors and things like that. So I know that database and vectors are a thing, but almost all of my experience with databases has been, "here are some tables," or, "here is effectively a JSON blob," a lump of document data. I know the document data isn’t always stored in JSON, but that’s the way that I’ve conceptualized it.

So is that how perhaps Postgres or DuckDB work?

Giorgi : Yeah. So Postgres has a very good support for storing data in a non relational way. And by non relational I mean that it has a JsonB database type. And in the JsonB column you can store any valid JSON. So the question that listeners might have is like, "I can store my JSON in any text column, why do I need a specific data type for storing the JSON?" And the answer is that with the JsonB data type, Postgres gives you a lot of helper functions that you process the JSON that’s inside this column. Like you can extract elements, you can manipulate the JSON in any way you can think of that you would need to process the JSON.

SQL Server also supports storing the data in JSON. And if we go back to the example of array types that Postgres has if you are using SQL server and you don’t want to create a separate table, one possible solution in SQL Server is to store the text that I mentioned in a JSON column in the same table if you want to avoid creating another table. The difference between SQL server and Postgres is that in SQL Server there is no separate data type for JSON, you just store it in a virtual column and use JSON functions to process the content of the column.

So when it comes to the inner working of the databases, PostgreSQL and SQL Server both work in the same way. They are not document databases, but they allow storing data in a non relational way with JSON columns.

Jamie : Sure. Okay. And that makes sense, right, because it’s non-trivial if you have a lump of JSON to read through, right? Because you need to get the record out of the data store still within your database application. I suppose inside of your database server you need to retrieve all of the records and any indexes on them and then search through essentially a string, right?

Giorgi : Yes.

Well, with the JSON columns you can create indexes on the content of the JSON [that] sits inside the database and you can actually filter the JSON on the database side, so you don’t have to pull the whole table and do the filtering on your application code. For example, if you have JSON which has property age, you can do the filtering in the database if you put the filter in the where clause of your SQL query. So in that case it will use the index if it’s available on the JSON column and it will only return the rows that match your filter condition. So you don’t really have to pull all the filtering on the JSON in your application code.

Jamie : Sure. Okay.

So then I guess does this differ from vectors from vector embeddings or is this the same thing or just so we can set a bit of a background for folks. Right.

Giorgi : Vector embeddings are relatively new thing, I think. They are useful for finding similar items in your database. So the way they work are very interesting to me.

Imagine you have a black box where you feed some text or some images or audio and it gives you back a list of numbers. So it kind of converts the input that you give you to this black box to an array of numbers. And we don’t know how this black box works, but it gives you some list of numbers as an output. And if you have a way to search for similarity; so the way this black box works is that if you give similar inputs to this black box, it will return you the list of numbers that are near to each other. For example, if you give it like a dog and some specific type of a dog, like a [German] Shepherd, for example, it will give you the list of numbers that are similar or close to each other. So how do you define similarity between a list of numbers? Right, because you can’t say that. It’s not easy to say if two lists of numbers are near to each other.

And the way it works is that there are several functions that define distance between these vectors. So vector is the output of this black box. So we can treat this list of numbers as a vector. And one of the metrics is the cosine similarity, which calculates the distance between these two vectors.

And if we have a way to store these vectors in the database, if we have this black box which can turn the inputs into vectors, and if the database can search for - when the database is given an input vector, if it can find the other vectors that are near to this vector. This means that we can turn every text document, for example, or images or audio files that we have, turn these documents into vectors using this black box, then store the output vectors in the database. And when we want to search for user inputs, when we want to search for items that are similar to the user input, we turn the user input into a vector again and use the similarity metric and ask the database to give us all the records that have the vectors that are similar to this input vector.

And this means that we can, for example, index audio files and then when the user enters some keywords that they are looking for, turns those keywords into vectors and use the database vector search to get back all the images or audio files that contain the keywords. And the way this black box works, it’s basically behind ChatGPT. That’s what I think ChatGPT does behind the scenes. So we can use OpenAI APIs and use that to convert the input to the vectors. And there are other APIs available as well that can turn the documents into vectors.

Jamie : Right, I see.

Yeah, I was going to say it sounds a little bit like my understanding of how the OpenAI stuff works. And I guess it’s come from that sort of machine learning background. Right. If you need to convert some information into some digital format and you’re going to do loads and loads of searching and indexing on, it kind of makes sense that you would use something similar to a vector. That makes sense.

Giorgi : Yeah. The actual databases that can store the vectors: there are many of them. There are some that are specialized only for storing the vector data. They are called vector databases and there are some pretty popular ones.

For example, one of them is Chroma. There is also qdrant. These are all vector databases. They are specialized databases for storing the vector embeddings and searching inside the vectors. The interesting thing is that for Postgres there is a database extension. So Postgres has these cool features that anyone can - well, if they have enough knowledge of low level programming languages - they can develop a Postgres extension that you can load into Postgres and it gives you new features, new data types. So you can extend Postgres with your own data types if you want to. And there is an extension for postgres. It’s called pgvector and this extension gives you vector type for storing in Postgres and also implements the similarity search for the vectors. So if you don’t want to set up an external database, separate database for storing the vector embeddings and searching inside that, and probably synchronizing data between your main database and the vector database, you can use the pgvector extension for Postgres and store all the vector embeddings in Postgres and do the search there. And as far as I know, there are libraries for every mainstream programming languages for pgvector. So you can definitely use pgvector from .NET with the pgvector .NET library.

Jamie : So obviously if folks want to try vectorization out on their database, they can just pull in a number of extensions, especially for their own Postgres. So folks, go and try stuff out in Postgres. Excellent.

So then I guess my question would be, "how does this relate to DuckDB?" Because I know that we talked about DuckDB when we were putting together this episode. And for the folks listening, we originally got together and started chatting about this on October 10th of 2023. So my background information and investigation into DuckDB is based on what I found then. So my understanding will be not as complete as yours, Giorgi. So I was just wondering, so how does all of this that we’ve said so far fit in with DuckDB?

Giorgi : Is a completely different type of database.

So all the databases that we have discussed so far, like Postgres and SQL server the store data in a row by row, in a row by row manner; like the CSV files or like Excel file. When you imagine the database, you are probably thinking of rows which has columns. And the way DuckDB works is that it stores data in a… it’s a columnar database. If you could look at the inner representation of the data, you would see that it has data for this first column first, then all the data for the second column, then all the data for the third column and so on. So it’s kind of inverted compared to the row databases, to the databases that store data in a row by row way.

RJJ Software's Podcasting Services

Announcer : Welcome to "RJJ Software's Podcasting Services," where your podcast becomes extraordinary. We take a different approach here, just like we do with our agile software projects. You see, when it comes to your podcast, we're not just your editors; we're your collaborators. We work with you to iterate toward your vision, just like we do in software development.

We've partnered with clients like Andrew Dickinson and Steve Worthy, turning their podcasts into something truly special. Take, for example, the "Dreamcast Years" podcast's memorable "goodbye" episode. We mastered it and even authored it into CDs and MiniDiscs, creating a unique physical release that left fans delighted.

Steve Worthy, the mind behind "Retail Leadership with Steve Worthy" and "Podcasters live," believes that we've been instrumental in refining his podcast ideas.

At RJJ Software, agility is at the core of our approach. It's about customer collaboration and responding to change. We find these principles most important when working on your podcasts. Flexibility in responding to changing ideas and vision is vital when crafting engaging content.

Our services are tailored to your needs. From professional editing and mastering to full consultation on improving the quality and productivity of your podcast, we have you covered. We'll help you plan your show, suggest the best workflows, equipment, and techniques, and even provide a clear cost breakdown. Our podcast creation consultation service ensures you're well-prepared to present your ideas to decision-makers.

If you're ready to take your podcast to the next level, don't hesitate. Contact us at RJJ Software to explore how we can help you create the best possible podcast experience for your audience, elevate your brand, and unlock the vast potential in podcasting.

Giorgi : For example, with the traditional databases, they store row for every column, then the second row for every column and so on. And the difference that it makes is that this makes it very fast for OLAP queries, for analytical processing queries. Because for example, if you need to find a max value in specific column in the traditional databases, you have to go from the start of your table to the end of your table, and go through every row until you physically reach the end of the table. Because you need to check the data in every row. Yes, to find the maximum for some specific column. So you have to seek through the end and skip some of the columns that you need. With the columnar database, the data, the DuckDB engine properly stores the offset for every column where the column data starts and where it ends. So it can use it to read the data for only that specific column. So it can run the queries very fast compared to the traditional databases that store data in row by row manner.

Jamie : So does that mean that there are, again, a subset of requests that will run differently? Or am I needing to think about the different ways in which I will store and retrieve data in order to use DuckDB? So what I’m getting at here is that I’m not just going to mass migrate all of my apps from SQL Server to DuckDB and then say, "well, why aren’t I getting a performance boost? What’s happening?"

Giorgi : Yeah, as I said, [DuckDB is] for analytical queries. So if you are doing heavy reporting then you can speed up those queries. Also, the difference from SQL Server and Postgres is that it’s not a client-server type of a database like SQLite. It’s an embedded database, so it runs in the same process as your application. So it’s kind of similar to SQLite in that way.

The main scenarios when you should consider using DuckDB is if you are for example using SQLite to do some local processing of data before you feed this data to a reporting database engine like Apache Spark for example, before you move the data there and you want to clean up your data, DuckDB can give you very nice speed up.

DuckDB also supports querying not only databases, not only say a database, but parquet files. And parquet files are basically very roughly speaking, they are same as CSV files, but they are stored in columnar way. So in the same way as I described the columnar database. So instead of storing row by row information, parquet files store data in columnar way. And parquet files are like a de-facto way of storing data for analytical processing. So if you already have data stored in parquet files and you use some existing tools to query those files, you can use DuckDB without converting data to DuckDB specific formats, and you will still get quite a nice improvement in processing those parquet files.

Another feature that DuckDB has is that it can also query JSON files. If you have data stored in JSON files on disk, it can run queries on top of them. And also a nice feature is that if you have data stored in S3 buckets on AWS in parquet files, for example, you can run the queries on those files through HTTP. And DuckDB is clever enough to only pull the portion of the file that matches your where condition using the metadata contained in the parquet file. So it won’t pull the whole 1tb file if it only needs some little portion of that file to return the result of your query.

Jamie : Okay, so I really like that.

Let’s say I’m doing lots of maybe data sciency stuff, maybe. Oh, actually let’s take a step back. Maybe I’m doing a lot of telemetry stuff. If I have my telemetry data stored in a DuckDB compatible way, then like you said, when I’m generating a report, or perhaps if I’m showing a dashboard, I could perhaps use DuckDB to export that data, then pass it to - that’s my extract, right. And then I can do a transform to show it on screen, show it in the report, whatever. And then presumably that would be faster. Yeah, that would be a little bit faster. I don’t know how much "a little bit" is, but it would be faster for large data sets than, say, if I stored it all in a SQL database and was using SELECT * FROM table WHERE [DateTime] is @today, right?

Giorgi : Yeah, that can definitely be faster for analytical queries. I think in some cases it can be up to ten times faster than SQLite. I have encountered some posts which compare performance of SQLite and DuckDB.

Also, another point with DuckDB is that all the big data that people work with, with the improvements and advances in the hardware, all the data that we assumed was big data, and we needed a bunch of servers to process them and run queries on top of them. A major portion of that data now is it can sit in the laptop of an engineer, for example, and you don’t really need to run the special machines or a special software that will distribute the queries between this machine and to the aggregation and so on. And DuckDB can utilize all the cpus and threads available on the computer. So it can help you to avoid all the overhead of managing the distribute it databases and servers and clusters and so on. So it can really make it easier for data scientists and everyone who knows SQL to run the analytical queries on their own machines without setting up clusters and jobs and managing all that stuff.

Jamie : Okay, so does that mean then, let’s say I’ve got a big data set and it’s in a format that cloud D, sorry, not cloud DB that DuckDB is happy with. It can communicate with. It’s stored in a DuckDB compatible database, in a compatible way. When I’m interacting with DuckDB to extract this data, am I still using standard SQL commands or is it better to use some kind of client like an in-software client? We’ll talk about it in a minute. But you’ve got a DuckDB, I want to say "driver" - I don’t think that’s the right word, but like a database provider for .NET, are we using one of those or am I going through some other layer to communicate with the database, and am I using SQL?

Giorgi : Yeah. So one of the advantages, well, for me, it’s an advantage of DuckDB, is that you don’t need to learn any new library or any new language to talk with DuckDB. The way you talk with DuckDB is with SQL. So if you know SQL, you can use your existing SQL knowledge to run queries with DuckDB. And when it comes to .NET, the DuckDB .NET library that I have built, it’s an ADO .NET provider. So it follows the same interface. Like you have a connection, you create a command with that connection, assign the text, and get the reader back. So those who have used ADO .NET back in the days when Entity Framework was not available, using the DuckDB .NET library should not be a new thing because it’s a standard ADO .NET library that you can use with your SQL knowledge to communicate with DuckDB and run queries on top of it.

Jamie : Right. Okay.

So I guess then will I need to have a little bit of experience using ADO .NET in order to use the library that you’ve built or what I’m getting at? Is there a lot of documentation around just in case? Let’s say I’ve never used. I have, but let’s pretend that I haven’t, that I’ve never used ADO .NET. Let’s pretend I’ve always been Entity Framework. In fact, during my career, Entity Framework was a massive revelation to me. I was like, "wow, I can actually pull that data from the database as an object." Anyway, I’m getting beside the point here, but let’s say I’ve never used ADO .NET, do you have or documentation for the DuckDB provider that will help me to learn that? Or is it a case of, "no, you have to bring a little bit of knowledge with you."

Giorgi : So for the documentation, the readme file that I have, it assumes a little bit that you have ADO .NET knowledge. If you don’t have it, then you can use any tutorial for ADO .NET. Because even if it’s for SQL server, the way ADO .NET works is same for every database. And you can read ADO .NET examples or tutorials or documentation for any database and you just replace SQL server or Postgres specific classes with DuckDB specific classes and the flow stays the same.

Another thing is that with DuckDB .NET it’s compatible with Dapper, the mini ORM of .NET. So if you have experience with Dapper, then you can also use it on top of DuckDB .NET to talk to the DuckDB database. And the DuckDB database itself has a wonderful documentation. It has some small extensions on top of the standard SQL syntax. So you can also use it to make your queries shorter in some specific cases and more developer friendly queries.

Jamie : Right, I see.

So just for the folks listening in, that was dapper, the ORM. So that’s D-A-P-P-E-R I think. Not Dapr. The, no, hang on, because there’s two. It’s really confusing. Right. Naming things is difficult. Let me look it up. Because there’s Dapper from Stack Overflow, right.

And there’s Dapr from Microsoft. The ORM is Dapper and I believe it’s Dapr, which is the "Distributed Application Runtime" from, I believe it’s a Microsoft product. But yeah, like I said, naming things is hard, right. So just in case folks get confused, it’s not Dapr, it’s Dapper.

Giorgi : Yeah, Microsoft likes to release things with confusing naming.

Jamie : They really do. They really do.

Excellent. Okay, so let’s say I want to get started with DuckDB and the DuckDB .NET, I’m going to call it a database provider. I’m not sure the correct word to use, but we’ll call it that. And if I’ve got it wrong, then that’s on me. You said there’s some documentation on the DuckDB .NET Repo and I’ve had a really quick look around on the DuckDB website for some documentation there. So there’s definitely some documentation around so definitely go check that out.

Do you have like an example getting started repo that I can maybe just pull and say, "oh cool. So I put the connection string in here and I pull that information from there and this is what a query looks like," or am I looking through the documentation for both your repo and the database to figure out how to do getting started?

Giorgi : The DuckDB .NET project has a sample project inside which has a couple of examples inside it. It’s not an extensive set of examples, but it can be enough just to see what DuckDB looks like.

Another thing that I want to mention is that DuckDB has a special edition compatible with WebAssembly, and they have a website and you can run DuckDB queries inside the browser and with WebAssembly it can access your local files I think. So if you have some parquet files lying around you can start writing DuckDB queries directly through this website and you don’t even need to install any special software or things like that. If you just want to play with DuckDB without involving .NET from the start and see what the database is capable of, you could also use that.

Jamie : That’s really cool.

Yeah, being able to do that in the browser without making any changes is a fantastic idea, and I feel like more libraries could do that with the invention of WebAssembly. And the fact that it will run in your browser as well, that means that - just off the top of my head, I’m not saying that this is something you can do, I’m not going to hold you to this at all either. But the fact that I could, if I wanted, connect to my database in a WebAssembly context means that I don’t even need a native app. I can just do it all inside the browser and .NET have to worry about, "how am I connecting this? Do I need to build all of this architecture around a .NET or Python or whatever app or an R app or whatever?" I can just throw it into a browser context and just run it. That’s the future of development, right?

Giorgi : Yeah, probably with all the IDEs and tools moving to the web based environments with GitHub Codespaces. And I think there are also a bunch of other similar tools available. We may just build software directly inside the browser in future.

Jamie : Sure. Awesome.

Well, what I’ll say is we’re running a little bit short on time. So I was just wondering, are there any ways for people to get in touch with you? So like, let’s say folks have listened in and said, "this is brilliant. I want to pick Giorgi’s brain here and just ask a few questions about how this works." Is Twitter or LinkedIn or something like that a way that folks - maybe Mastadon, I don’t know, on the Fediverse? Is there a way for folks to get in touch? Or is it a case of, "just head over to GitHub and browse around in the repos, and maybe raise an issue to ask a question?"

Giorgi : You can reach out via Twitter or LinkedIn if you prefer it. And I also have a contact me form on my personal website at giorgi.dev. So whichever way you prefer, I’m open to your questions and ideas.

Jamie : Excellent. Excellent.

Well, I’ve said it wrong every single time and I do apologize, but I’ll try and get it right now: Giorgi, thank you ever so much for being with us today and talking to us about some of the different types of databases, Postgres SQL, vectorization, DuckDB, and indeed your database adapter, database provider, database driver. I’m sure that someone’s listening in and screaming into their phone, "it’s the wrong word. Jamie! You’re using the wrong word!" and that’s totally fine. I’m using the wrong word. Please get in touch with me and let me know what the correct word is. I’d really like to know because I’m always saying the wrong word.

Anyway, I’ve gone way off topic there. Giorgi, thank you very much for being on the show. I really appreciate it and I hope you have a wonderful rest of your day.

Giorgi : It was a pleasure to be a guest on your show and thanks for all the interesting episodes that you have done, and I look forward to other episodes as well in the future.

Jamie : Thank you very much. Very kind of you.

Wrapping Up

Thank you for listening to this episode of The Modern .NET Show with me, Jamie Taylor. I’d like to thank this episode’s guest, Giorgi Dalakishvili, for graciously sharing his time, expertise, and knowledge.

Be sure to check out the show notes for a bunch of links to some of the stuff that we covered, and full transcription of the interview. The show notes, as always, can be found at the podcast's website, and there will be a link directly to them in your podcatcher.

And don’t forget to spread the word, leave a rating or review on your podcatcher of choice - head over to dotnetcore.show/review for ways to do that - reach out via our contact page, or join out discord server at dotnetcore.show/discord - all of which are linked in the show notes.

But above all, I hope you have a fantastic rest of your day, and I hope that I’ll see you again, next time for more .NET goodness.

I will see you again real soon. See you later folks.

Useful Links

Giorgi’s GitHub
DuckDB .NET Driver
Postgres Array data type
Postgres Range data type
DuckDB
DbUpdateException
EntityFramework.Exceptions
JsonB data type
Vector embeddings
Cosine similarity
Vector databases:
- Chroma
- qdrant
- pgvector
- pgvector .NET library
OLAP queries
parquet files
Dapper
DuckDB documentation
Dapr
DuckDB Wasm; run DuckDB in your browser
GitHub Codespaces
Connect with Giorgi:
- on Twitter
- on LinkedIn
- on his website
Supporting the show:
- Leave a rating or review
- Buy the show a coffee
- Become a patron
Getting in touch:
- via the contact page
- joining the Discord
Music created by Mono Memory Music, licensed to RJJ Software for use in The Modern .NET Show

S06E14 - From .NET to DuckDB: Unleashing the Database Evolution with Giorgi Dalakishvili

Sponsors

Embedded Player

The .NET Core Podcast

S06E14 - From .NET to DuckDB: Unleashing the Database Evolution with Giorgi Dalakishvili

Supporting The Show

Episode Summary

Episode Transcription

On Apple Podcasts

Sponsor Message

RJJ Software's Podcasting Services

Wrapping Up

Useful Links