The Modern .NET Show

S06E07 - From Atari to Sep: Unleashing the Power of Performance in Programming with Niels Rasmussen


Support for this episode of The Modern .NET Show comes from the following sponsors. Please take a moment to learn more about their products and services:

Please also see the full sponsor message(s) in the episode transcription for more details of their products and services, and offers exclusive to listeners of The Modern .NET Show.

Thank you to the sponsors for supporting the show.


Supporting The Show

If this episode was interesting or useful to you, please consider supporting the show via one of the options listed on the show’s website.

Episode Summary

In this thought-provoking episode of The Modern .NET Show podcast, CTO Niels Rasmussen, with 20 years of professional experience in programming, shares his expertise in performance optimization and mechanical sympathy. We dive into the intricacies of software development and shed light on Niels’ most recent project, the lightning-fast CSV parsing library called Sep. The conversation not only delves into the technical aspects of Sep but also explores the importance of understanding different programming paradigms and the value of simplicity in application development.

Niels emphasizes the importance of simplicity in application development, particularly when dealing with systems that must run continuously in factories worldwide. His belief is that developers armed with a deep understanding of how computers work can find innovative solutions and explore possibilities that might otherwise be overlooked. This approach allows for efficient and effective problem-solving, especially in high-stakes industrial AI environments.

The importance of understanding different design patterns, frameworks, and tools is discussed extensively. Niels and Jamie stress the need for developers to identify the most suitable approach for each project based on specific context and requirements. This includes knowing when to apply abstractions and patterns and when to deviate from them. By considering these factors, developers can optimize performance and create efficient and robust systems.

The podcast also touches on the legendary Doom source code, written by John Carmack, which is considered a well-designed system due to its modular nature. The open-source code’s ability to run on various devices, thanks to its easily interchangeable components, such as the graphics and sound engines, makes it a great resource for developers looking to enhance their skills and gain insights into good coding practices.

Episode Transcription

Welcome to The Modern .NET Show! Formerly known as The .NET Core Podcast, we are the go-to podcast for all .NET developers worldwide and I am your host Jamie “GaProgMan” Taylor.

In this episode, I spoke with Niels Rasmussen about a CSV parser he wrote called Sep - one of the fastest CSV parsers in .NET - and the mysteries of performance optimization and mechanical sympathy.

And I just got hooked by it. It has to be faster. It has to be faster than the fastest known to man. So that’s what I worked on a lot and that’s what I find fun. I’m very passionate about performance, mechanical sympathy, all that. That’s really what I dig, things I read about and stuff like that.

- Niels Rasmussen

Along the way we discuss the power of simplicity, the importance of understanding hardware intricacies, and the birth of Niels’ lightning-fast CSV parsing library, Sep. From exploring different programming paradigms to dissecting the legendary Doom source code, this podcast is a must-listen for developers seeking to enhance their skills and unravel the secrets of software development.

In preparation for this episode, Niels actually provided a veritable cornucopia of performance-related stuff - from important points to links to blog posts and other resources. There was no way that we could include them all in this episode, so I have gotten his permission and have been able to supply them as a PDF, linked at the end of the show notes page on the website. How cool is that!

So let’s sit back, open up a terminal, type in dotnet new podcast and we’ll dive into the core of Modern .NET.

Jamie : So, Niels, thank you ever so much for spending some time with us today. In this episode, we’re recording this way ahead of time in July, and the planned release for this is October, which is after the great rebrand of the show. So this is all exciting to me. So welcome to the show.

Niels : Thank you. Thanks for having me, and thanks for inviting me.

Jamie : You’re very welcome. You’re very welcome. I’m always interested to talk to very interesting people, and so that’s why I invited you along, because we’ve got this thing I want to talk to you about, but I’ll tease it a little bit. I’d love to talk to you about Sep, but let’s talk a little bit about you first, if you don’t mind. Let’s talk about, like, do you have an introduction that you can give us, or maybe an elevator pitch?

Niels : So, I’m the CTO at a small industrial AI company, a computer vision company, and we actually do most of our vision system applications in .NET and C#. I have about 20 years of professional experience in programming and systems development. 15 of those are in C#/.NET. So I wasn’t early to the party; I think I started around .NET 2 or 3, something like that. So, luckily, I have only seen C# with generics, which is awesome. So that’s great. I actually studied a semester in Sheffield, England. I’m from Denmark, and I live in Copenhagen right now.

Jamie : Interesting that you did a semester in Sheffield. Without doxing myself, that’s not too far away from where I am at the moment, so that’s really quite cool. Yeah. I didn’t end up going to Sheffield for university. I don’t know why. I think I had two universities I wanted to go to. I think Manchester was one of them. And I ended up going to Hull. I don’t know why Sheffield didn’t enter my list of… to the people who are listening from the University of Sheffield: I’m sorry you weren’t on my list of places.

Niels : But I had a great time. The reason I went there was because there was a Professor Alan Watt there, who wrote a book about 3D games. He was quite well known at the time, and probably still is, maybe. So I had to go there because I needed to take some computer graphics in the semester that I was there for. Had a great time there.

Of course, I did go there thinking I’d end up getting a great British accent, right? Like, really. But I mostly made friends with Americans, so now I just speak like most people. More American, sadly.

Jamie : Excellent. It says here so, peeking behind the curtains, it says here that you started out with Atari hardware back in the day.

Niels : The first computer we ever had in the house was an Atari console. I can’t really remember the specific model. I can just remember there was a leaflet book with it, and the way you had to play games was to actually type in the program before you could play. There were no cartridges, no tapes, no discs, no whatever. So whenever we wanted to play, you had to actually type it in. And if somebody got mad (I have a big brother), you could just turn off the computer, and you couldn’t start again immediately because the program was gone. There was no non-volatile memory in the system. So that was pretty cool.

Later, of course, I had a Commodore 64, and then from there on, it went to PC, and now I’m a PC guy, basically.

Jamie : Awesome. Yeah, I had a similar start, although it was apparently a very British computer that didn’t sell that well outside of the UK. My brother and I had an Amstrad CPC 464. And that was a similar situation: you had to type the program in. If you wanted to play, it did have a tape deck, like a cassette tape deck, and you could load the games from there, but the memory was volatile. So if you spent 2 hours waiting for the tape to load a game, then you switched the machine off, you’d lost your game. Right.

Niels : All right, good times.

Jamie : Oh, absolutely. I have been tempted to see if I could track down one of the old computers from the pre-PC days and see if I could boot it up and actually do something with it. But I don’t think I’ve got any CRT screens in my house, so I don’t think they would work.

Niels : You want the full experience, of course. Yeah. I still have the Commodore 64 in my attic, but I also have the Raspberry Pi, so I can do like, emulated games instead. That’s also fun.

Jamie : Excellent. Yeah, I’ve got a couple of Raspberry Pis laying around. I got one of those Raspberry Pi 400s, which is similar to the ZX Spectrum and things like that: it’s inside of the keyboard, which is pretty cool. I just need to find something to do with it. That’s the problem. I have all these Raspberry Pis. I mean, we’re not talking .NET right now, but I’m happy for us to talk about this. This is cool.

Niels : No problem.

Jamie : I’ve got one Raspberry Pi that is sitting as an ad blocker using pi-hole. Got a separate Raspberry Pi running a Jellyfin audio server. So I have my own sort of personal Spotify throughout the house. I got another one that is using local playback only video for, like, a home theatre system, and I got another two that are just sitting in a drawer. So I need to find something to do with them.

Niels : You’re sitting on a lot of money there. During the Corona pandemic, nobody could get a Raspberry Pi. You could have sold them for a lot, you know.

Jamie : Absolutely. Excellent.

Okay, so let’s talk about… so we’re here today to talk about this library called Sep. Now, I found out about Sep because I saw something pop up on Twitter a couple of months back. It will be a couple of months back by the time people listen. I think it was a blog post about “this is the fastest CSV parser in the Wild West”.

Niels : And I was like, go bold or don’t go.

Jamie : Yeah.

Niels : I hope I backed it up with some good performance numbers. But yeah, so I definitely wanted to go bold and I spent a lot of time, I’ve been working on it for one and a half years, right? And I just got hooked by it. It has to be faster. It has to be faster than the fastest known to man. So that’s what I worked on a lot and that’s what I find fun. I’m very passionate about performance, mechanical sympathy, all that. That’s really what I dig, things I read about and stuff like that.

Jamie : So let’s talk about that real quick then, because I don’t think I’ve heard this phrase, “mechanical sympathy.” So what’s that like being sympathetic for the hardware, or is that what that is?

Niels : I’d say sympathy comes from understanding, right? So understanding the hardware you’re running on or how a processor works or CPU works means you have a better skill set regarding how to optimize, for example. So how does a cache hierarchy work? What about registers? And then there’s the whole vectorization and SIMD that we might talk about a little later here, because that’s a key part of how Sep is fast. It’s also a key part of why .NET is becoming faster and faster, right? So with every new iteration they use more SIMD everywhere, basically.
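To make “vectorization and SIMD” a little more concrete, here is a small, hedged sketch - this is generic illustration code, not anything from Sep’s actual source - using System.Numerics.Vector&lt;T&gt;, which .NET maps onto SIMD registers when the hardware supports them:

```csharp
using System;
using System.Numerics;

static class VectorSumExample
{
    // Sums the array several elements at a time using SIMD lanes,
    // then handles the leftover elements with a plain scalar loop.
    public static int Sum(int[] values)
    {
        var acc = Vector<int>.Zero;
        int lanes = Vector<int>.Count; // e.g. 8 ints per add with AVX2
        int i = 0;
        for (; i <= values.Length - lanes; i += lanes)
        {
            acc += new Vector<int>(values, i); // one add across all lanes
        }
        int sum = Vector.Sum(acc); // horizontal add of the lanes (.NET 6+)
        for (; i < values.Length; i++)
        {
            sum += values[i]; // scalar tail
        }
        return sum;
    }
}
```

The mechanical-sympathy point is that the same logical loop, written with the hardware in mind, can process 4, 8, or 16 values per instruction depending on the CPU it runs on.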

Jamie : Right, okay.

Because I think that’s something that a lot of developers who are working in these higher level languages, we don’t tend to need to think about, “what is it that my software is running on?” Right? I feel like, and I may be wrong about this - listeners do reach out and tell me that I’m wrong, I’m happy to be wrong about this - for the large percentage of people who are running .NET or Java or whatever, like I said, these higher level language applications, we don’t have to worry about, well, “is the CPU I’m running on a 64 bit Arm or is it x64? Hey, doesn’t matter. It can be either,” right? I don’t have to worry about that because .NET is taking care of that minutiae for me. Whereas maybe if I was writing something in C or Python or maybe Go or something like that, it would be vitally important for performance reasons to know precisely the architecture I’m running on, so that I could eke out just a little bit more performance. Because maybe the majority of our apps are running on web servers that are maybe Azure or AWS or GCP or Linode or wherever. So quite literally, by the time it gets to that server, it’s already abstracted away, because it’s on a virtual machine that’s pretending to be a 64 bit machine, but it may actually be a range of Arm controllers, or it may be some GPUs or some FPGAs or…

Listeners if you don’t know what those acronyms mean, that’s totally fine. I’ll put a little glossary in the show notes, or maybe you can Google it yourself to find out. But these are different types of hardware, so maybe that’s it. Is that the opposite of how you’re feeling? Like, I need to know more about the hardware.

Niels : That’s definitely the opposite. I need to know everything from the bottom up, basically. So that’s how I work. That’s how I understand it. It’s also how I’ve been educated, I guess. I actually know everything down to a PNP transistor on a processor. Not that it’s my specific area of expertise, but I’ve been taught it, and I have a course book on it here.

But really, I think also in my career, it’s always been about performance being an aspect of what I’ve been doing, even in my thesis at university. That was also about latency and things like that. I did that in C++. And I’ve always been fond of C++. And if you work in C++, you get a tight relationship with how software works, pointers, all that stuff. Right. And I definitely understand that it’s very different if you’re doing, like, line of business applications or web stuff; then you worry more about the network or stuff like that, databases. I can’t remember the last time I used a database in any of our applications. We try to do it as simply as possible. The applications we develop have to run 24/7 at factories around the world. The systems we develop are in every country in the world, basically. So it’s a different kind of requirements you have. Maybe that’s what influences what I think about.

I do have maybe a slight tendency to think that developers who actually know about how computers work are usually also maybe a bit better at those things. Right. So they understand what’s going on, and that helps. It kind of helps you broaden your horizon to what the possibilities are and what potential solutions are there, rather than just focusing on maybe the API as such. Because, you know, everything is just bits, and you can do whatever you actually want to if you just go low enough.

Jamie : Sure, I like that because there’s something… I feel like maybe people who are new to the industry shouldn’t focus on, “oh my goodness, I need to know about transistors and all that kind of stuff.” That’s totally fine if you want to learn that, but you don’t need to at the beginning. But there’s something that Scott Hanselman says a lot, and he says, “just learn a little bit about the next layer down.” Right. So if you’re in .NET, learn a little bit about how the intermediate language works, or just that the intermediate language exists: that if you’re typing C# or F# and you do a dotnet run, it’s actually going to compile your language, whatever you’ve typed in, down to another language, which is then compiled again at runtime. Right. So just knowing that that exists can help. I feel like I’m cheating a little bit because I did electronics at sort of 16 to 18 year old level and then went straight into Assembler. So I’m cheating because I’m coming at it from the other side. Right?

Niels : That’s a good experience. I do sometimes wonder if it’s easier to come from the bottom up than it is to come from the top down. Right. Because if you come from the bottom up, you know, “oh, well, it’s assembly. It’s, like, instructions. It’s bytes just sitting in memory that are interpreted by a CPU, and they’re running what you say you’re running.” But if you mess that up, it will be whatever it will be. It will do whatever it thinks it should do. While if you’re at the top, you may be used to being completely restricted by what the language tells you you can do, but really you can do whatever you want.

So it opens up the world for you if you know that, if you go down, there’s an opportunity to do something else. Right. Because really, languages are about both restricting and having possibilities. They want to have restrictions because you want type safety, for example. That’s why I love C#, because it’s a type safe language, all those constructs about that. I’m not a dynamic language kind of man - JavaScript or Python really isn’t my game. I really love C# because it takes a lot from C++ too, by the way.

Jamie : Absolutely, yeah.

So this is my own personal experience. I feel like forcing people to go bottom up from day one may be a little bit gatekeepery. I feel like if you go top down from a high level language and learn the next level down and the next level down, you’re going from something that’s almost human language to something with fewer abstractions involved - that’s the phrase I was looking for. And so it’s easier for the majority of people, I think, to actually remove those abstractions as you move your way down the list.

Whereas if you go, “right, okay, we’re going to talk about voltages,” and then we work up from voltages: from zero to 1.5 volts is usually accepted as a zero, and 3.75 to 5 volts or something like that is usually accepted as a one. Bang, there’s our voltages, now we’ve learned binary. And now we’re going to learn about an individual resistor and a capacitor in series, and then learn about resistors, capacitors, and transistors and how they can create a logic gate. I feel like sometimes going from the bottom up can be quite daunting, because there’s, like, loads of information you need to know before you get to, “I’ve written some code and the computer is doing what I’m saying.” Whereas I can load up a browser, hit F12, and just type in some JavaScript code, alert('Hello, Jamie');, and effectively I’ve got a computer program. Right.

Niels : Yeah. I find it easy because, in my education, I’ve actually been taught both bottom-up and top-down at the same time. So both from the lower levels, like even how a transistor works, and at the top level, like Java or Pascal or something like that. Right. So I’ve been attacked from both sides, and I’m sure that has shaped me in many ways. Right. I’m very fond of my education. I think I had a great education. I spent six years at university, so I should say that; otherwise it was a waste of time. Right.

Jamie : I do think that there is a benefit to both learning top down and bottom up, other than just so that you know what the next level down is, so you can understand it a little bit better. I’m one of these people who’s like, “I’d like to know things just because I want to know them.” And I feel like just the more information you have, that may make you a more well rounded person. Right.

Niels : I think just knowing that there is a level down, or that you could go a level down, not necessarily having to remember anything. Like, I don’t remember every single SIMD instruction in the x86 instruction set or anything like that. I know there are some; I know kind of the principles of how they work. I can then look up what instructions there are, so, something that I can do this with or that with, and from there I can then pick.

So it gives you a broader set of tools to look at and choose from, compared to if you just knew, like, C# and LINQ, for example. Let’s say you only programmed with LINQ or something, and you never did a foreach loop or whatever, then you would never know you could do that. You kind of have to know both things. So LINQ is great; I use LINQ all the time. But then you hit certain other areas where the context shifts - it all depends on context, right? Then I definitely don’t want anyone using LINQ there, or whatever you want to use. I want a nice for loop or something like that, because this is where performance matters, or whatever the reason could be.
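To make the trade-off concrete, here is a small illustration (not from any real codebase): both versions compute the same result, but the plain loop avoids LINQ’s enumerator and delegate overhead, which can matter in a hot path.

```csharp
using System;
using System.Linq;

class LinqVersusLoop
{
    static void Main()
    {
        int[] values = { 3, 1, 4, 1, 5, 9 };

        // Concise, readable, and perfectly fine in most code paths.
        int linqSum = values.Where(v => v > 2).Sum();

        // Equivalent plain loop: no enumerators, no delegate calls,
        // no allocations - the kind of code you want in a hot path.
        int loopSum = 0;
        for (int i = 0; i < values.Length; i++)
        {
            if (values[i] > 2)
            {
                loopSum += values[i];
            }
        }

        Console.WriteLine(linqSum == loopSum); // True: both are 21
    }
}
```

The point is not that LINQ is bad, but that knowing both forms lets you pick the right one for the context.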

Jamie : Right, sure. Absolutely. 100%. And I think you’re hitting on an important thing there about any kind of development, doesn’t matter what language you are in, I think, and that is knowing when to apply those abstractions and those patterns and the different tools and libraries. Right. There was someone I was talking to on LinkedIn, of all things, earlier today. And the chap was like, “I don’t get domain driven design. Why would I do it if…”

Niels : I read the book.

Jamie : Yeah, right. This person was like, “why should I do it?” And I said, “well, look, just learn a little bit about it and just know that it exists because there may come a time when domain driven design may fit the project you’re working on a little bit better than a different design pattern.” The whole point with design patterns and languages and frameworks and tools and libraries is to know enough of them so that you can see where one fits and one doesn’t.

Niels : Absolutely. I also think I read a lot of those things, and I found a lot of inspiration in them. And I’m not using it as it was intended, kind of, in the applications we do. Because they are different from, like, a normal web application where you might have a business logic domain inside of it and stuff like that. But the principles around isolating externally facing code and stuff like that, those are the things that matter. Right. And they are pretty much the same for every system you build. You kind of have to ask, “how do I take complex logic and test it?” Stuff like that, but without actually having external facing code. Because the systems we build, they have cameras, they have digital I/O cards. They have a lot of things, like maybe talking to a PLC, maybe there’s a robot we have to control, stuff like that. Right. They’re very complicated systems already, but we want the complicated logic to be kind of isolated from that, so we can test that as best as possible.

So I use that. So the domain driven things, I’ve used that as a way to kind of structure how we work around this. It may not be perfect. I’m sure if some of my colleagues hear this, they’ll [complain] about the things they’re in, but they are there for a reason. And sometimes you have to accept that certain kinds of architectures or principles may get in the way of how you want to do something, but they are there as a way for us to kind of isolate certain things and maybe also abstract away those things. Right. Which is kind of a little bit against how Sep is, of course, because Sep is about going to the low level in some things. Right?

Jamie : Yeah.

I just want to bring something up before we go back to Sep, and that is related to what you were saying about abstracting away the complexity and have that complex code over there where it can stay, and my boundary is here, and I can say, “hey, robot, go do that thing.” I don’t have to say, “robot dot leg dot lift,” that kind of thing. Right. And I think that one of the often overlooked examples of something that is a wonderfully designed system is the original source code for DOOM, mostly written by John Carmack, completely open source.

If you can track down the original source code, which I think will be on the id Software GitHub page, if you can read through that, it is in C and lots of it is in Assembler. But if you can abstract away from that, just ignore those bits and read through what it’s doing. The reason why DOOM is consistently one of those things that, oh, “it runs on a pregnancy test,” “it runs on a lamp,” is because what Carmack did was he separated all of those complex parts out. Everything was designed so that he could swap out the graphics engine, so that he could swap out the sound engine, so that he could swap out all the different things and make it so that they were completely independent of each other. And I feel like he had to, because he was writing for a 386 and compatibles. Right. Because this was back in the days where your computer may have a sound card that is Sound Blaster, or it might be AdLib, or it might be a generic one. Yeah. Right. And you had to set the individual IRQ lines and things like that. Right. And so the game had to support that. And so the code behind it is really, really well written. And I tell people all the time, “if you want to be a better developer, go read the DOOM source code."

And if you’re not 100% okay with C, there’s a book that has done it all for you, and it’s by Fabien Sanglard, and it’s called the Game Engine Black Book: DOOM Edition. And he just breaks down the most important parts of the DOOM source code and talks about them in a historical context: this is why it’s written this way; this is why it’s written that way. And so when people come to me and say, “oh my God, they’ve got DOOM running on a light switch!” I’m like, “yes, I know, I know. They can do that because the code is glorious.”

Niels : Yeah, Carmack is definitely a huge idol for me, too. Like, I read the books about him and all that stuff back in the day; I thought they were very funny. Yeah, I wanted to program games, too; that’s why I did computer graphics. But I ended up looking at chicken eggs instead. That’s one of the problems we work on: inspection of chicken eggs.

Jamie : Right. I did a games dev course where we went all in on taking game engines apart and all that kind of stuff. And I do “basic” web apps, the standard sort of CRUD stuff, but hey, it pays the bills, right?

Niels : Yeah. I have a great time doing the stuff I do. I think there are a lot of challenges and performance has definitely been an aspect of that job for a very long time.

Jamie : So let’s loop back and talk about that then, right? The performance aspects of Sep and writing a CSV parser and things like that. So let’s talk about what Sep is first, right?

Niels : So it’s a CSV parser, or rather a library, a modern library. I call it modern because it’s only for .NET 7 or later. So it’s a fresh take. I wanted to take the latest bits and the latest APIs available in .NET.

So it’s a modern and minimal library for reading and writing CSV files, or separated values. And I had this idea because we have a library internally at work that is ten-plus years old or something like that for CSV, which has a probably unique, idiomatic kind of API that some developers at work complain about - even myself, and I designed it. I think it has a lot of great features; of course, it’s been built up over ten years, stuff like that. But I thought that maybe with the new features like generic math and static abstract interfaces, you could do a new CSV library relatively easily, with an API that is more straightforward and easier for new developers to onboard onto. And I did look at the existing landscape. There are a lot of CSV libraries. One of the benchmarks that I used to compare Sep to all the other ones is called NCsvPerf by Joel Verhagen. He works on NuGet at Microsoft, I think. So his benchmark is naturally related to NuGet packages, so it’s kind of package assets. So it’s names of .NET packages, versions, IDs, that sort of thing. So that benchmark relates to just parsing that and loading it up into memory as a class, what’s called package assets, for example.

So we have this benchmark, right? But for work, the needs that we have, for example, are different than many of the existing libraries out there. At least, I looked at them and I thought, “this isn’t really what I want. It doesn’t have this feature or that feature. I want something tailored to our specific needs.” So Sep is built for those specific needs. It’s not intended as a general CSV library that’s going to conquer the world and replace CsvHelper. CsvHelper is great. It has a lot of features. It’s at version 30 now, I think, or something, and has a staggering 116,000,000 downloads, something like that, which is really impressive, right? Like, it’s pretty amazing. But I think most people probably use it with the reflection kind of API. So you define a type or class with some properties on it, and it will kind of automatically load into that, right? That was not a need that I have, or that we have at work. That’s not how we use CSV files.

So my need was more about: we have to load a feature vector, like for machine learning purposes, as an array of floats or something - like a set of columns as floats in the CSV file. And there are also some particular needs around how to write CSV and stuff like that. So I wanted that to be easy, but also fast. And the parsing of text to floats, for example, is one of the things that generic math and static abstract interfaces have a new solution for, because there’s a static abstract generic interface called ISpanParsable, right? So you can write code generically that says, “I want to parse this string to this T,” and this T could be a float, a double, an int, a long, or whatever you want. And I don’t have to write any specific code for that. If you wanted to do that in the past, you kind of had to write specific code for all of those types, right? So a lot of code just goes away. You just have to write a single function, take a span, give it to Parse, and parse whatever you want. You can parse a Guid if you like, stuff like that. That’s amazing. That’s the promise, and I guess also what static abstract interfaces have delivered, right? It’s out there now.

It’s out, and it’s great. It’s really awesome.
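A minimal sketch of the generic parsing being described here (the interface is ISpanParsable&lt;TSelf&gt;, available from .NET 7; the helper method name is made up for illustration):

```csharp
using System;
using System.Globalization;

static class GenericParseExample
{
    // One generic method covers float, double, int, long, Guid, and any
    // other type that implements ISpanParsable<T> (.NET 7+), instead of
    // one hand-written overload per type.
    public static T ParseSpan<T>(ReadOnlySpan<char> span) where T : ISpanParsable<T>
        => T.Parse(span, CultureInfo.InvariantCulture);

    static void Main()
    {
        float f = ParseSpan<float>("3.14");
        Guid g = ParseSpan<Guid>("0e984725-c51c-4bf4-9960-e1c80e27aba0");
        Console.WriteLine($"{f} {g}");
    }
}
```

The static abstract member T.Parse is what generic math made possible: the call dispatches to float.Parse, Guid.Parse, and so on, at the type level, with no reflection and no boxing.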

RJJ Software’s Podcasting Services

Announcer : Welcome to “RJJ Software’s Podcasting Services,” where your podcast becomes extraordinary. We take a different approach here, just like we do with our agile software projects. You see, when it comes to your podcast, we’re not just your editors; we’re your collaborators. We work with you to iterate toward your vision, just like we do in software development.

We’ve partnered with clients like Andrew Dickinson and Steve Worthy, turning their podcasts into something truly special. Take, for example, the “Dreamcast Years” podcast’s memorable “goodbye” episode. We mastered it and even authored it into CDs and MiniDiscs, creating a unique physical release that left fans delighted.

Steve Worthy, the mind behind “Retail Leadership with Steve Worthy” and “Podcasters live,” believes that we’ve been instrumental in refining his podcast ideas.

At RJJ Software, agility is at the core of our approach. It’s about customer collaboration and responding to change. We find these principles most important when working on your podcasts. Flexibility in responding to changing ideas and vision is vital when crafting engaging content.

Our services are tailored to your needs. From professional editing and mastering to full consultation on improving the quality and productivity of your podcast, we have you covered. We’ll help you plan your show, suggest the best workflows, equipment, and techniques, and even provide a clear cost breakdown. Our podcast creation consultation service ensures you’re well-prepared to present your ideas to decision-makers.

If you’re ready to take your podcast to the next level, don’t hesitate. Contact us at RJJ Software to explore how we can help you create the best possible podcast experience for your audience, elevate your brand, and unlock the vast potential in podcasting.

Jamie : I guess before we talk about measuring things and getting to the next step: was the shortcut to that just, quite literally, Span everything? Because I work a lot in web, right, and I often get juniors come up to me and say, “I’ll just async/await everything and it’ll just be faster!”

Niels : Span is an integral part of the API design for Sep, for sure. It’s the externally facing API for Sep. So if you go to a row, you can get a span for that row, which means it points to a segment of memory internal to the Sep reader, but internally it’s just an array, like an array of chars, right? And then if you want a column, a specific column, you get a span for that, stuff like that. So span is ultimately at the core here, and that means you can access all those things without actually allocating strings or anything. You just get a span that points to internal memory. And span is nice because it both limits the scope, but also is very lean, right? So you just point to a segment of already existing memory. So that’s an essential part of it.

There are already CSV libraries that have that, but CsvHelper does not, as far as I know. It always returns strings, I should say. If you want the chars, you have to ask for a string. But you can definitely use Sep without allocating a string if you want to convert something to something, right? So span is definitely a big part of it. It doesn’t occupy a lot of the code inside of it, though. That’s a lot of refs, refs everywhere.

If anybody knows what that is: it’s a managed kind of pointer. It’s not like a C++ reference or anything, it’s the keyword ref, a managed pointer. And that probably leads to the other thing that’s used a lot, and that’s Unsafe, the Unsafe class. Not the unsafe keyword, but the Unsafe class. There actually isn’t a lot of unsafe code per se in Sep, because I don’t do native pointers, nothing like that. It’s all managed, all managed pointers. There’s no fixed; I don’t pin any memory or stuff like that. I try to play nice with the garbage collector and everything.

Everything is, currently at least, managed memory. I allocate via the array pool, and I use the array pool to avoid having repeated allocations. So the internal buffer is rented from the array pool. ArrayPool is a type, I don’t know when it was introduced in .NET, right? But you can ask for a certain array size and it will give back an array, typically with a size that’s the closest power of two above, right? So you get 2048 if you ask for 2000 or something like that length. And I use that for almost everything inside of Sep, right: for the buffer with the chars, for other internal structures, stuff like that.
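The power-of-two bucketing he describes (ask for 2000, get 2048) boils down to a classic bit trick. Here is a small C sketch of the sizing idea only, not ArrayPool’s actual implementation; the function name is made up for illustration:

```c
#include <stdint.h>

/* ArrayPool-style sizing sketch: round a requested length up to the
   next power of two (e.g. 2000 -> 2048), so rented buffers fall into
   a small number of reusable size buckets. Illustrative only. */
uint32_t round_up_pow2(uint32_t n)
{
    if (n <= 1) return 1;
    n--;                 /* handle exact powers of two */
    n |= n >> 1;         /* smear the highest set bit downwards... */
    n |= n >> 2;
    n |= n >> 4;
    n |= n >> 8;
    n |= n >> 16;
    return n + 1;        /* ...then add one to reach the next power */
}
```

Pooled buffers sized this way can be handed back and rented again, which is what avoids the repeated allocations he mentions.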

But it’s also an integral part of what we talked about before, because I wanted to have this feature where I can just say, “I have this set of column names. I want, not an array, but a span of floats for that set of column names.” So there’s an internal way of handling that, and Sep does that very nicely without repeated allocations, right? So you can just ask for that, and you can ask for it for every row, but you’re reusing memory again and again and again, right? So that’s a key feature of what I thought would be nice with Sep, and what I haven’t seen in other libraries. Usually there are APIs for single column access, stuff like that, but not for multiple columns like that.

A Request To You All

If you’re enjoying this show, would you mind sharing it with a colleague? Check your podcatcher for a link to show notes, which has an embedded player within it and a transcription and all that stuff, and share that link with them. I’d really appreciate it if you could indeed share the show.

But if you’d like other ways to support it, you could:

I would love it if you would share the show with a friend or colleague or leave a rating or review. The other options are completely up to you, and are not required at all to continue enjoying the show.

Anyway, let’s get back to it.

Jamie : So then how did you go about doing, we’re talking about how span is wonderful for direct memory access without having to worry about bounds. And I’m greatly reducing what you’re saying. But let’s say I’m making some library that’s dealing with number crunching and I want to deal with fast operations. I throw span in there and then I’m like, “well, that’s the magic plaster that I’ve just thrown onto the code base. It’s going to be faster from now on,” right? How do I measure that?

Niels : Right. So with everything, you need to measure, right? And the same goes for Sep. So in Sep’s case, it’s a new library, so there’s no use case before, so I had to find a use case. That’s this benchmark by Joel Verhagen, the package assets benchmark. So I basically just looked at that and then I said, “okay, let’s try Sep with this, and how does it run?” And I was like, “okay, this is actually faster than CsvHelper already. What’s going on here? Maybe I can…” But Sylvan is the one to beat. Sylvan is developed by Mark Pflug. I hope I pronounce his name right. In general, any name I say, I hope I pronounce correctly.
Josh Close is the author of CsvHelper.

But Sylvan is really fast and has been the one that was marked as the fastest before Sep, right? So that was the one to beat. So I put Sep in this benchmark and I could see, “okay, I’m not as fast as Sylvan, but I might not be that far off.” Sometimes you’re further than you think you are, because it’s always the last 10% that are the hardest, of course, always, right? And basically from there, it’s just hard work. You measure, you run a benchmark, and then usually what I do is use the Visual Studio profiler. I think the Visual Studio profiler is great, and it has a lot of different kinds of things you can profile with: CPU usage is one of them. You can also look at allocations if you like, stuff like that. I have so much experience that I know where allocations are when I see .NET code or C# code. It’s like: allocation there, allocation there, a bunch of red flags for me, like, “oh no.” So for me, that was no problem, right? But you want to know where the time is spent. Of course, with a CSV parser, you already kind of know where the time is spent.

You have to find where the row ends, where the columns are, where the separator is. So for Sep, I wanted to support the four special characters that I talked about. So you have the separator, or delimiter as it’s also called by some; it can be a comma. Usually that’s a comma in CSV, like comma-separated values, right? And there’s carriage return, there’s line feed, and then there’s quote. And of course, line endings are [tough] no matter what you do, because every operating system, of course, has its own set of line endings: carriage return plus line feed is what you use on Windows, typically. On Linux, I think it’s line feed only. On the latest macOS, I think it’s line feed, too; maybe carriage return was used on the old macOS systems, I guess. And actually not all CSV parsers support carriage return anymore, but Sep does. So I try to support whatever .NET supports. And .NET supports those three, right?

So for a CSV parser, basically, you want to find those four special characters. You can do that one at a time, like just a loop, and just check: is it this, this, or this? That’s quite slow. I didn’t start Sep with that. I actually used a built-in function for this called IndexOfAny. IndexOfAny is actually already quite fast. It’s just not fast enough. It uses vectorization and SIMD internally, so it has an approach that’s kind of similar to what Sep ends up doing, but there’s an overhead in repeated calls to it because you only get one index at a time, right? So you call IndexOfAny, then you get one index back, then you have to call it again for the next index, and again and again. So I used that as the starting point for Sep, and as I said, it was actually quite fast, it was quite good, but not fast enough.
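The “one at a time” loop he contrasts against might look like this. Sep itself is C#, so this is just a C sketch of the scalar baseline; the function name and the idea of passing the separator in are mine:

```c
#include <stddef.h>

/* Scalar baseline sketch: scan one char at a time for any of the
   four special CSV characters (separator, quote, CR, LF), returning
   the index of the first hit, like one IndexOfAny-style call does.
   Illustrative only, not Sep's actual code. */
ptrdiff_t index_of_any_special(const char *buf, size_t len, char sep)
{
    for (size_t i = 0; i < len; i++) {
        char c = buf[i];
        if (c == sep || c == '"' || c == '\r' || c == '\n')
            return (ptrdiff_t)i;   /* first special character found */
    }
    return -1;                     /* none in this buffer */
}
```

To find every special character this way, the caller has to invoke it again starting just past each hit, which is exactly the per-call overhead he describes.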

So I knew then I had to do my own vectorization. So that’s when we hit the real fun stuff, right? And I don’t know how much listeners know about it, but SIMD stands for single instruction, multiple data, right? So you have a single instruction that can operate on multiple data. That multiple data is in a register in the computer, which is usually called a vector or something like that, because it’s multiple values, right? So on my computer I have a Zen 3 computer, AMD Zen 3. It supports AVX and AVX2. It has 256-bit registers, and it also supports 128-bit. So I use that instruction set directly to optimize this finding, and .NET has support for that now with what they call hardware intrinsics, right?

So intrinsics are also what you usually use if you code this in C; you don’t usually type directly in assembly anymore, you use intrinsics. With intrinsics, you then almost translate directly into an instruction, right? It doesn’t have to be one-to-one, but usually it is, and that’s what you can do with .NET too. So there’s a namespace for x86, and there’s also one for Arm. I don’t have any Arm-specific code in there, because .NET also has cross-platform vector primitives. So you have Vector256&lt;T&gt;, and it has a set of methods that are cross-platform, defined in certain ways. Sometimes there’s a little variance between how it actually works on x86 or Arm; for Sep, most of them are quite similar.

So I implemented a number of different methods. I am working on a new release that will be even faster, and we can talk about that a little later, but in the version that’s out there, I had this idea that maybe I could do the vectorized part as kind of an index builder: find the characters and store the positions they are at in the char buffer, so that’s the output of this char finder. So that’s the approach that I used for this 0.1 release. I then have kind of an abstraction over that, so I can have different kinds of finders based on what instruction set is available. Because I wanted to support Arm; Arm isn’t going away, x86 isn’t going to go away, so I would like to support both, and it’s not hard to support in .NET nowadays, because there are these great thin abstractions over vectorized code. So you don’t actually have to know Arm instructions or anything. You can actually use this generic kind of API. So that’s what I used.

And to get a little bit deeper into that, maybe: so instead of looking at one character at a time, you have this 256-bit register. A char in .NET is 16 bits, right? So you can have 16 characters in one vector, in one register; you can look at 16 characters at a time. And that’s the normal approach. What you would then normally do is check, for each of these 16: is it any of the four special characters that we talked about, right? A simple approach to that is just to load one of those special characters into a vector, so it’s the same value for all 16 positions. Like, I think line feed is ten, just the decimal value ten. So you have ten in each of those 16 positions in the vector. And then you do a compare, and it will tell you in which positions the input has that value. So you get 255, or in this case it’s 16-bit elements, so you get 65,535: all bits set for that element, to say, “here I found a position that matches this ten.” You can do that for all four, and then you can combine them. And then you can quickly scan ahead and say, “okay, if this is zero, there are no special characters,” and you can just keep running, right? That’s very fast. You can do that almost at memory speed, like 20GB a second, something like that, right?
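The broadcast-compare-combine step can be sketched with x86 intrinsics in C, which map closely to the .NET hardware intrinsics he mentions. This uses a 16-byte SSE2 register and byte-sized characters for simplicity (Sep works on 16-bit chars in 256-bit registers); the function name is illustrative:

```c
#include <immintrin.h>
#include <stdint.h>

/* Sketch of the compare-and-combine step: broadcast each special
   character, compare against a 16-byte block of input, OR the four
   results together, then collapse the byte mask to a bit mask.
   A zero result means "no special characters here, keep running". */
uint32_t special_char_mask(const char *block, char sep)
{
    __m128i v  = _mm_loadu_si128((const __m128i *)block);
    __m128i eq = _mm_cmpeq_epi8(v, _mm_set1_epi8(sep));          /* separator */
    eq = _mm_or_si128(eq, _mm_cmpeq_epi8(v, _mm_set1_epi8('"'))); /* quote    */
    eq = _mm_or_si128(eq, _mm_cmpeq_epi8(v, _mm_set1_epi8('\r')));/* CR       */
    eq = _mm_or_si128(eq, _mm_cmpeq_epi8(v, _mm_set1_epi8('\n')));/* LF       */
    /* movemask: one bit per byte position with a match */
    return (uint32_t)_mm_movemask_epi8(eq);
}
```

Each set bit in the returned mask is the position of a special character inside the 16-byte block, which sets up the extraction step discussed next.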

But then, when you find a special character, you have to take it out, and that’s when things get a little bit hairy, because there are then special instructions for “how do I get this out?” Because looking at the vector one element at a time isn’t actually faster than looking at one char at a time. But there are instructions that can map a mask, a byte mask, in this case a char mask, to a bit mask, and then you can iterate over that bit mask, also with special instructions.

But maybe before we go to that, I should take a step back, because I said we’re taking chars, 16 bits, or 16 of those at a time, right? Maybe we could do better, I thought. Because all those special characters that we are looking for are never above the value 255, so they never go above a byte, right? I thought, “what if I can take two vectors of 16 characters and pack them together?” It’s also called narrowing sometimes. But I can’t just truncate the char, the 16 bits, to a byte, because then whatever is 256 plus ten might end up looking like it was a line feed. I don’t want that.

So there’s a special instruction called Pack With Unsigned Saturation. I call it an instruction; it’s actually the method name. The instruction is called something completely different. Sometimes it’s a little hard to map between the two, so if you Google something you will get the intrinsic name, which is PACKUSWB, a very short name, and then you have to kind of figure out what the actual .NET method is called. There’s a tool that helps with that. We’ll come back to that maybe, because it’s one of the tools I used during the development, but let’s table that for now.
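In C, the same instruction is exposed as the `_mm_packus_epi16` intrinsic. Here is a small sketch of the narrowing step showing why saturation, rather than truncation, matters; SSE2 registers and a made-up function name, not Sep’s 256-bit .NET code:

```c
#include <immintrin.h>
#include <stdint.h>

/* Narrow two vectors of eight 16-bit chars into one vector of 16
   bytes with PACKUSWB. Saturation clamps anything above 255 to 255,
   so a value like 266 (256 + 10) can never masquerade as a line
   feed (10) the way plain truncation would let it. */
void pack_chars(const uint16_t lo[8], const uint16_t hi[8],
                uint8_t out[16])
{
    __m128i a = _mm_loadu_si128((const __m128i *)lo);
    __m128i b = _mm_loadu_si128((const __m128i *)hi);
    _mm_storeu_si128((__m128i *)out, _mm_packus_epi16(a, b));
}
```

After packing, 32 characters fit in one 256-bit register (or 16 in the 128-bit sketch here), doubling how much input one compare examines.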

So I use this instruction to say, “I have two sets of 16 characters. I pack them together in one register, then I can look at 32 characters at a time.” That’s twice as many as before. That’s really nice. It’s not twice as fast, but it’s twice as many, because there’s still some balance in that. And once you have that, it also gets easier afterwards, because then you can take that 32-element vector, you find the byte mask for that, and you can then “move mask”, as it’s called in the instruction set, that to a 32-bit integer. Each bit then tells you: is there a special character at any of those positions? .NET actually has support for that; in the cross-platform API it’s called ExtractMostSignificantBit, because that’s actually what it does. Move mask is just the instruction name that Intel gave it, I guess, I don’t know.

Based on that, you can then look at it. If that mask, that integer, is zero, there are no special characters. If there are some, you then have to get those out. And again, there are special instructions you have to know about if you want to do this in assembly. In .NET you have this very nice class that’s been added to over time. It’s called BitOperations. It has a lot of really low-level methods for operating with bits and stuff like that, and one of them is called TrailingZeroCount. So if you call TrailingZeroCount on an integer and it has a bit at position three, it will give you back three as the index of the first set bit in that integer. So that’s what we then use to get all those characters out.
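The loop over the mask’s set bits looks roughly like this in C, using the GCC/Clang `__builtin_ctz` builtin in place of BitOperations.TrailingZeroCount; the function name is illustrative:

```c
#include <stdint.h>

/* Walk the set bits of a 32-bit mask: each set bit is the position
   of a special character within the block. Trailing-zero-count
   finds the lowest set bit; clearing it with mask & (mask - 1)
   advances to the next one. Sketch of the technique only. */
int extract_positions(uint32_t mask, int positions[32])
{
    int n = 0;
    while (mask != 0) {
        positions[n++] = __builtin_ctz(mask); /* index of lowest set bit */
        mask &= mask - 1;                     /* clear that bit */
    }
    return n;                                 /* number of hits found */
}
```

Because the common case is a mask of zero (no special characters in the block), that branch stays cheap and the scan keeps moving at near memory speed.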

And then there are a lot of special tricks here, because usually a CSV file is dominated by separators. You just have some text, and then there are separators once in a while, and only at the end of a row do you have line feeds. So most of the time, you’re only finding separators, like delimiters, a comma, so you can special-case for that. And you can special-case for other situations, like when there’s a line ending, for example. But then you have quotes; you have to handle that too, and you have a special case for that.

In the initial version of Sep, I didn’t just build, like I said, a plain index. Instead, I had kind of a packed representation of “what is the special character I found” as a byte, because everything is below 255, and then the position as 24 bits, so packed together in 32 bits. I had eight bits for the char, 24 bits for the position in the buffer. So a row could never get above 16 megabytes, for example. But it’s a nice packed representation. And then I can just run through this and build it up for whatever characters I have in the buffer. Say I had 16K characters in a buffer; I read that in at a time, then I parse that into this index, and after that, I find where the separators are, where the line feeds are, and give you a row with all the columns that are in that CSV file. And that SIMD parsing is, like, really fast. That’s the core of why Sep is fast. There’s a lot of work around it, though, to keep the overhead minimal in calling the API, getting out a span; just bounds checks alone are a problem.
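That packed entry is plain bit twiddling. The exact layout below (character in the low 8 bits, position in the upper 24) is my reading of the description, not necessarily Sep’s actual bit order:

```c
#include <stdint.h>

/* Packed index entry sketch: one special character (8 bits) plus
   its buffer position (24 bits, so positions top out at 16 MB)
   squeezed into a single 32-bit value. Layout is illustrative. */
uint32_t pack_entry(uint8_t ch, uint32_t pos)
{
    return ((pos & 0xFFFFFFu) << 8) | ch;
}

uint8_t entry_char(uint32_t e) { return (uint8_t)(e & 0xFFu); }

uint32_t entry_pos(uint32_t e) { return e >> 8; }
```

Storing the whole index as an array of such 32-bit entries keeps it compact and cache-friendly, which matters when the SIMD pass emits entries at high speed.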

So maybe to put it a little into perspective, just napkin math off the top of my head here. So I think at the very low level, when I benchmark, I kind of benchmark three levels for package assets. So package assets is actually a rather involved benchmark, where most of the time is not spent on parsing CSV if your CSV parser is pretty fast. For Sylvan and Sep, it’s like a very small part of the actual total runtime. Most of it is spent on just creating the package asset type, setting the fields in it, and accumulating it in a list, for example. So I think for Sep, maybe less than 10% is the actual CSV parsing. So I wanted to look at more than that. So I kind of divided the benchmarking up into three levels. One of them is rows: that’s just basically iterating over each row I find, not doing anything with it. So it’s basically the core of the parsing of the CSV. You don’t do anything; you just run over the bytes, and then next row, next row, next row. And then at the next level, you actually go and look at each column. And that differs, of course, from library to library.

For Sep, you can get a span. We just talked about span; that’s so nice. You don’t do any allocation. You can get a column, and then you can actually look at it as a span if you want to do that, or stuff like that. Sylvan has the same; CsvHelper does not. It returns a string for whatever you want to do there.

For the rows benchmark, for package assets, Sep can iterate over a row in about 80 nanoseconds per row. A row has, in this example, 25 columns, so you can do the math here: that’s about 3.2 nanoseconds per column, right? That’s not a lot. This machine I have runs at about 5 GHz, so you can multiply that by five. That’s about 16 instructions for each column. That’s not a lot. That’s like really, really fast. But when I then released Sep 0.1, I of course tried to tag Mark Pflug and Josh Close, the authors of Sylvan and CsvHelper, because they have already been doing so much work around this. So I thought it’d be nice to tease them a little bit. And Mark Pflug then quickly saw what I did, especially this Pack With Unsigned Saturation trick, and said, “oh, wait, this is nice, I can use that.” So he used that and got a really nice speed-up in Sylvan. I think it was like 40% or something. So right now he’s actually a tiny little bit faster than Sep at this benchmark. And of course, I cannot have that. I have to be faster. So I have a faster version now, and it will hopefully soon be released as a 0.2. And in that release, it takes 60 nanoseconds. So I went from a little bit over 80 to 60 nanoseconds. That’s 2.4 nanoseconds per column.

That’s almost 10 GB/second if you count each char as two bytes. 10 GB/second. That’s fast, I think.

Jamie : It really is.

Niels : It could probably be faster.

Jamie : There’s a couple of things that I’d like to sort of circle back to based on what you were saying. And one of them goes right the way back to almost the beginning of what you were saying about line endings. At the time that we’re recording this, Scott Hanselman put out an episode of his show, Hanselminutes. If you’re listening to this one and you don’t listen to his, you should definitely listen to his.

Niels : Oh, I love it.

Jamie : Oh, totally. Yeah.

And he was saying how he was helping someone who was brand new, I believe at Microsoft, I’m getting some of the details wrong, but he was saying, “I’m helping a brand new developer who’s just done like git pull and everything’s broken.” And this person’s like, “well, why is everything broken?” And apparently what he likes to do is ask permission before going on that journey to figure everything out. So he said, “it’s to do with the way that line endings work. Would you like me to tell you why, or should we just fix it?” And the person went, “tell me why.” And he then took 25 minutes in the show, because it takes a long time to explain why different line endings exist, and how it’s all related to how typewriters, or rather teletypes, worked. It’s still a problem now in 2023, almost 100 years later. It’s just like, we need to standardize, right?

Niels : Yeah, it’s never going to happen. It’s been [tough to implement in] Sep for sure. With the Windows line endings, you have to kind of always look forward, or look backward, or whatever your approach is. Because if you encounter a carriage return, you have to see: is the next one a line feed? Then it’s one line; otherwise it might be two lines. And that’s kind of annoying. It doesn’t really impact performance a lot, but you have to have some complicated code around it. It’s just annoying. And I understand: carriage return means the typewriter carriage moves back, line feed is the next line. That’s why it’s there. That’s why we used to send byte streams to these printers, just sending a stream of bytes to them. And then they had to have these characters to actually directly control the printer, right?

Jamie : And that’s the problem. Right. They didn’t abstract away like I was saying earlier on, right. They didn’t abstract away the complication of dealing with the hardware out to the hardware. They did it internally in quite literally the instruction set for sending over, or rather the bytes that they were sending over the wire. You were saying earlier on about, “we’re controlling a robot.” Well, if you were doing it the same way that these folks back in the day, they didn’t know that computers would become a thing. They didn’t know that electronics would get as far as it did. And so they programmed things, they built it into the language that they would send over the wire as bytes to the machine. That would be like you and the robots that you’re controlling, sending individual commands to move individual motors as part of your higher level instructions. Right? It’s understandable.

Niels : Robots haven’t moved far, though. They are very basic in many ways, still very hard to control. Not that it’s my specific expertise; we have a lot more talented people working on that in my company. But back in the day, they had so little of everything: CPU power, memory, and all that stuff. So it’s basically just, how do we control it? And maybe they thought, maybe we can do a line feed and then print backwards, and then we don’t need a carriage return, or whatever. Good reasons for everything, in some ways. Keeps options open, too, so we shouldn’t blame them.

Jamie : Oh, absolutely. It’s the same with, like, when you’re reading through some code and you get angry about, “why does it work this way?” Well, it doesn’t matter, right? The people who wrote that code, like you were saying, the people who wrote those systems, were working under an assumption, working with a system that worked a certain way. And like you said, maybe they did just want to line feed and run the carriage backwards the other way. So they would type the sentence out backwards, but it would read forwards, because it was faster to go one way, line feed, back the other way.

Niels : I thought, hey, that would speed up, right?

Jamie : Yeah, right.

Niels : Printing speeds up, twice as fast printing. You don’t need to carriage return, just print it.

Jamie : Absolutely. But there was something else you said about how you showed this off to the Sylvan developer, Mark Pflug. Again, I don’t know if I’m pronouncing that correctly. And then they took your idea and made their code better. And I think that’s one of the great things about open source is that you can actually say, “hey, how does this thing work? And can I adopt that into my thing?” Right.

Niels : Yeah, it’s definitely been a win-win, right? And then he did something, and then I thought, “okay.” Because in the latest version I’ve been working on, I had to kind of abandon the approach that I started out with, this whole index where I have the char and the position, because there’s some overhead in it. I went with that idea because I was hoping I could find a kind of branch-free way of doing it. Branches can be expensive, right? So I was thinking maybe there was a way that we could find all these special characters in a branch-free way and just plow ahead and build this index up. There is a way; it’s just not faster. It’s faster for the worst case, but for the normal case, it’s just not at all faster. So I’ve abandoned that in this version I’ve been working on internally. And it’s faster. That’s great.

But that’s with everything you do in software: you go down a path with an idea, and you have to always evaluate, was that the best way to do it? Could it be done in a different way? Refactor, do whatever you need to do. And the same goes for performance, right? You have to refactor, you have to reiterate, you have to consider: was this actually the best way? Measure, profile, what takes time. And I could easily see when I looked at this afterwards: “okay, I’m spending so much time there compared to there, I have to do something here. I have to do it in a different way, otherwise I won’t get to be faster than Sylvan.” So that was the goal. And it is.

Jamie : Sure.

Niels : And of course, you’ve hit on something important here. There’s an important disclaimer here, because there’s a lot of talk about Sep being the fastest, right? But there’s a different feature set for each library, right? I call Sep minimal. All the other libraries have different feature sets. CsvHelper has a lot of features, so in many ways it’s harder for CsvHelper to go as fast as Sep, because it has a different feature set, right? You just have to support more things or do more things. One thing that Sep doesn’t do, for example, is unescape quotes. Unescaping quotes is like: if you read a column with some quotes in it, you would remove the quotes, right? There’s no automatic way of doing that in Sep. I don’t know if I ever want to add it, because it’s not a need that we have. I wanted to support quotes at the very minimum, because that’s a basic feature you want to have.

Other libraries have this support. And of course, there’s a cost to every feature you have, right?

Jamie : Absolutely, 100%. And I think you’ve said all of these important things, and I’m just going to reiterate them, because then it makes me sound clever as well. But I’m not clever. The importance of knowing what your use case is and optimizing for that, if you’re going to optimize. One of my lecturers at university used to use this rather grisly example, and he’d say, “imagine a chainsaw, right? I can build a chainsaw with all of the safety features ever on it. It could have a laser to detect if your fingers go near the blades. It could have guards over the…”

Niels : We have one of them on our robots. The robot that we make is very deadly, like you wouldn’t believe. It’s incredibly deadly. It has, not a chainsaw, but a blade on it, right? So don’t go near that.

Jamie : So the idea with this metaphor that he used to use was like, “I can build this thing that has a million and one safety features on it, but because the safety features are present, it will not be as fast at chopping down a tree as a chainsaw without all of them, right? And that’s because in one instance, as designers of this chainsaw, we have optimized for safety. And for the other chainsaw, we’ve optimized for speed, right, speed of use. And we’ve had to jettison a bunch of stuff to get to that point.”

I suppose it’s maybe like electric cars: they’ve designed for environmental sustainability, and they’ve had to jettison a whole bunch of things related to how cars with internal combustion engines work. So one thing that I know about electric cars is they have to put a fake sound on them, because obviously blind people and people with sight issues are listening for the sound of an internal combustion engine making its way down the street. And because there isn’t something that sounds like an internal combustion engine, they may just walk out into the street and, unfortunately, something could happen, right? I’m using a lot of grisly examples today. I do apologize.

But I think you’ve hit on a really important thing there, and I think it’s worth stating again: know what it is you’re building for, and know what your benchmarks and your feature set are. Because it is pointless trying to optimize something, trying to beat some other library, some other software, if your software doesn’t do what that software does, right?

Niels : Right. So in many ways, that’s what I’ve done for this package assets benchmark. I don’t actually need it, and I had to add things for it. Actually, the reason why it is fast is not related as much to the CSV parsing. It kind of loads the rows into objects where a lot of the fields are strings, and if you would just allocate a new string for each of those, for every row, it’s not particularly fast. So you have to employ string pooling, or string caching, whatever you want to call it. And as part of that, you have to calculate a hash. I use a hash map that is kind of the same approach; I adopted this from Sylvan. So I basically copied his code and then optimized it and made some changes to it. So thanks, Mark, for that. I hope you don’t mind.
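A string pool of the kind described can be sketched as a small hash-indexed cache. This toy C version (fixed slot count, FNV-1a hash, names mine) just shows the idea that repeated column values come back as the same already-allocated string; the real Sylvan and Sep implementations are considerably more elaborate:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Toy string pool sketch: hash the column's bytes into a slot; if
   the slot already holds the same bytes, return the cached string
   (no allocation). Otherwise allocate once and cache it. */
#define POOL_SLOTS 256

static char *pool[POOL_SLOTS];

const char *pool_get(const char *s, size_t len)
{
    uint32_t h = 2166136261u;            /* FNV-1a hash of the bytes */
    for (size_t i = 0; i < len; i++) {
        h ^= (uint8_t)s[i];
        h *= 16777619u;
    }
    size_t slot = h % POOL_SLOTS;
    if (pool[slot] != NULL && strlen(pool[slot]) == len &&
        memcmp(pool[slot], s, len) == 0)
        return pool[slot];               /* cache hit: reuse string */
    free(pool[slot]);                    /* cache miss: replace slot */
    pool[slot] = malloc(len + 1);
    memcpy(pool[slot], s, len);
    pool[slot][len] = '\0';
    return pool[slot];
}
```

Since CSV columns like package IDs repeat constantly across rows, most lookups hit the cache, turning per-row string allocation into a hash plus a compare.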

And then the reason why Sep is fast for that is because I actually spent some time optimizing it at a very low level. And when we go to the actual very low level, we get to this tool I wanted to talk about before, a tool, an extension, you can use in Visual Studio called Disasmo. It’s authored by, and I’m sorry again for the pronunciation here, Egor Bogatov. He works on the JIT compiler for .NET at Microsoft.

With this tool, and with .NET 7, it’s very easy to use. Before that, you actually had to have the .NET runtime locally and compile it from source. But with .NET 7 there’s a built-in kind of disassembler, or exporter, in the .NET runtime directly. With this, you press Ctrl+Alt+Shift+D in a method that’s not generic, and you will get the actual assembly code right next to it in Visual Studio. So you can look at the instructions directly and see, “oh, okay.” And I’ve spent a lot of time on that in Sep, I can say, just trying to shave off one instruction here and there.

That’s the thing about JITs, or compilers in general: sometimes they do weird things, so you have to kind of massage the code, and then you hit those constraints, right? You’re in a language, you’re in C#; it has some constraints. You can’t just do whatever you want. You’re not writing assembly code. So you have to find out, “how do I tweak this and that to get it to put this in a register, don’t load it from memory every time,” stuff like that. Yeah, so that’s what I’ve been doing. That’s the fun part, I think; that’s what I like working with. It’s a spare-time project, I can do whatever I like, right? I want to spend time on the things that I like, at least before I get some adoption at my company for this library.

Jamie : Okay. We talked a lot about all of the different techniques that you’ve used to make Sep faster in the direction in which you were going with it, right, in that specific feature set. And we talked a little bit about how you went away and benchmarked it and thought, “right, how do I speed this up? How do I speed that up?” And you talked about Disasmo. What would be your advice if someone came to you and said, “hey, Niels, I have this app. It’s not very well optimized; something that should take 30 seconds is taking two, three, four minutes. Or maybe something that should take nanoseconds is taking seconds. It’s using half a terabyte of RAM to load a three megabyte JPEG into memory.” What would be your first steps for that person to figure out where those “problem,” I’m putting quotes there, right, those problem areas exist?

Niels : We call them hot paths also. Right? So we’re looking for the hot path. Right?

So measure, measure, measure. Whatever you do, you should measure. A test is a measurement, by the way; it’s just a different kind of measurement. And of course, it depends on context here. Sometimes maybe you have something you can run offline; you have a test that shows this is slow. Maybe you have to actually get data from a production machine.

We’re in Visual Studio. We have the profiler. We’ve got like a diag session, as it’s called nowadays; that’s the file extension name. It’s very long. This will kind of grab events from whatever you’re recording. Like, if you use the CPU usage tool, by default it uses a sampling profiler. So, say, once every millisecond, it will kind of ask, “where are you in the program? What’s the stack at this point?” And then you can see, based on those samples, where the time is spent, what percentage of time is spent where in the program. I think that’s very easy to use and very easy to get started with to find where the problem is.

Once you have that, you usually want to isolate it, because sometimes you can’t wait three minutes every time you want to optimize something. That’s just not… everything in development is about getting the best kind of development loop, the fastest kind of feedback. We write unit tests to get quick feedback. I don’t want to go to a machine in the States, for example, to test out my new software for something. I want to be able to run tests here; I want it now. And the same goes for profiling and optimization. So you usually then look at isolating the method, or whatever code is responsible for taking up the most time, right?

Maybe there are multiple parts you want to do that with, but you find one method; maybe it accounts for 20% of the CPU usage. Then look at that. If you’re in a greenfield project you have unit tests for it; if not, write them. Then you make a benchmark for that method, or something like that: something that can give you quick feedback on how long it takes for that method to do whatever you want it to do. Then you can start optimizing.
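As a rough illustration of that quick-feedback loop, here is a deliberately crude Stopwatch-based harness; the `AverageMs` helper is hypothetical, and for real work BenchmarkDotNet (which comes up later in the conversation) handles warm-up, outliers, and statistics properly.

```csharp
using System;
using System.Diagnostics;

// crude micro-benchmark helper: one warm-up call, then average over N runs.
// this is only for quick iteration; use BenchmarkDotNet for sound numbers.
static double AverageMs(Action body, int iterations)
{
    body(); // warm-up call so JIT compilation isn't part of the measurement
    var sw = Stopwatch.StartNew();
    for (int i = 0; i < iterations; i++) body();
    sw.Stop();
    return sw.Elapsed.TotalMilliseconds / iterations;
}

// usage: time the method the profiler flagged, in isolation
var data = new byte[1_000_000];
new Random(1).NextBytes(data);
double avg = AverageMs(() => Array.Sort((byte[])data.Clone()), iterations: 20);
Console.WriteLine($"{avg:F3} ms per call");
```

Cloning the input per iteration keeps each run working on the same unsorted data, so the measurement doesn’t degrade into timing an already-sorted array.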

You can also, and I sometimes do that, keep the old one around: just copy it, call it naive or slow or whatever you want, and then use that for testing the other one. So you can throw data at it and run test cases, so you always know they are in sync. You can move that to the test project later if you want to.
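That naive-as-oracle pattern might look something like this sketch, with a placeholder pair of counting functions standing in for the real slow and optimized implementations:

```csharp
using System;

// keep the original implementation around as an oracle for the optimized one
static int CountSpecialNaive(ReadOnlySpan<byte> data)
{
    int count = 0;
    foreach (var b in data)
        if (b == (byte)',' || b == (byte)'\n' || b == (byte)'"') count++;
    return count;
}

// stand-in for the "fast" version; in a real project this would be the
// vectorized rewrite under test
static int CountSpecialFast(ReadOnlySpan<byte> data)
{
    int count = 0;
    for (int i = 0; i < data.Length; i++)
    {
        byte b = data[i];
        if (b == (byte)',' || b == (byte)'\n' || b == (byte)'"') count++;
    }
    return count;
}

// throw random inputs at both and demand identical answers
var rng = new Random(42); // fixed seed so any failure is reproducible
for (int t = 0; t < 1000; t++)
{
    var buf = new byte[rng.Next(0, 128)];
    rng.NextBytes(buf);
    if (CountSpecialNaive(buf) != CountSpecialFast(buf))
        throw new Exception("optimized version disagrees with naive oracle");
}
Console.WriteLine("naive and fast agree on 1000 random inputs");
```

Because the oracle is the old, trusted code, any divergence after an optimization points straight at the new version.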

Then you start optimizing it, and that’s basically… again, you could then also do profiling of just that method if you want to look deeper inside it. You can get per-line kind of statistics. They are not always great, because real release assembly code doesn’t map directly to every single line: one line might be multiple instructions, or, vice versa, might be folded into something else. And then it’s just hard work from there, just trying to optimize.

But then you can use Disasmo if you know, or if you want, you can learn to read assembly code, and then you can understand what is actually going on in this method on the actual CPU: what are the instructions running? And if you want to learn even more, then you kind of have to know how long an instruction takes: what’s the latency, what’s the throughput? And those are actually numbers you can look up on the internet; people have test programs, so you can get the numbers. There are different websites for it. It would have been great if we could get one tool where everything is in; like, Disasmo could have shown this. It’s typically specific to processors, so you would have to say, “select Zen 3” or something, and then it would tell you, “for this instruction the latency is two cycles and the throughput is” something. Right? I only use that as a guide. I’m not that much of an expert in this area; I have enough knowledge, and it’s completely self-taught. Like, I can read assembly, but if there’s an instruction I don’t know, which there often might be, I have to look it up. I actually had a case here for Sep where I saw in the assembly, like, why is it doing that? Every single time it emits this instruction it puts an extra instruction just before it, and there was something called a false dependency, apparently. I didn’t know about this for this instruction. So luckily, someone on Twitter told me this, and I said, oh, okay: if you don’t put that there, then there’s kind of a dependence on the previous value, and it makes everything slower, because there’s also a lot of things related to, you know.

The level you have to go down to, to get really good at optimizing fully for a CPU, is very deep. There’s a lot of things to consider. A modern processor is extremely complicated. It has a number of ports that can do different things; certain instructions only run on certain ports, stuff like that. The more ports it has, the more it can do at the same time. Lots of things. And I follow various people on Twitter on that and read blogs about it. And I don’t know if you’ve ever been to Stack Overflow, but if you search for something related to SIMD in x86, there’s a 99.9% chance that Peter Cordes has answered it. It’s amazing. Like, he’s answered every question everywhere about that. I’ve just never seen anything like it. It’s really great; he’s had a lot of good replies on that.

And actually for Sep, and again, here’s a name I don’t actually know if I pronounce correctly, but Wojciech Muła, his Twitter handle is pshufb - so the shuffle instruction - he had an article that basically lays out the algorithms you could use to find these special characters. There are different approaches you could use; I tried most of them that fit. The one I started with early was based on shuffle. Actually, it turned out not to be the fastest; it was slightly slower than the one I use now. But there’s an article called “SIMDized check which bytes are in a set,” and that’s basically what we’re doing: we’re checking for the set of these four special characters, right? I hope maybe we’ll link to it in the show notes. I have tried to make a list of the links we can put in the show notes for the people listening.
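A hedged sketch of the “which bytes are in a set” idea using .NET’s cross-platform Vector128 API; this illustrates the general technique, not Sep’s actual code, and the separator set here is just an example:

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

// compare 16 bytes at a time against four special characters and turn the
// per-lane matches into a bitmask; the lowest set bit is the first hit
static int IndexOfSpecial(ReadOnlySpan<byte> data)
{
    var comma = Vector128.Create((byte)',');
    var quote = Vector128.Create((byte)'"');
    var cr    = Vector128.Create((byte)'\r');
    var lf    = Vector128.Create((byte)'\n');
    int i = 0;
    ref byte src = ref MemoryMarshal.GetReference(data);
    for (; i + Vector128<byte>.Count <= data.Length; i += Vector128<byte>.Count)
    {
        var v = Vector128.LoadUnsafe(ref src, (nuint)i);
        var hits = Vector128.Equals(v, comma) | Vector128.Equals(v, quote)
                 | Vector128.Equals(v, cr)    | Vector128.Equals(v, lf);
        if (hits != Vector128<byte>.Zero)
        {
            uint mask = hits.ExtractMostSignificantBits(); // one bit per lane
            return i + BitOperations.TrailingZeroCount(mask);
        }
    }
    for (; i < data.Length; i++) // scalar tail for the last few bytes
    {
        byte b = data[i];
        if (b == (byte)',' || b == (byte)'"' || b == (byte)'\r' || b == (byte)'\n')
            return i;
    }
    return -1;
}

var row = "abc;def,ghi\n"u8;
Console.WriteLine(IndexOfSpecial(row)); // prints 7 (the comma)
```

Muła’s article covers faster tricks for arbitrary sets (including the shuffle-based lookup the handle refers to); four explicit compares is simply the easiest variant to read.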

Jamie : Yeah, everything that we’re talking about is going to be listed in the show notes. That’s not a problem.

I do have a feeling that, because you’ve been so wonderful in preparing a bunch of notes for me to read through as we do this, Niels, hope you don’t mind, but I’m actually going to supply these notes as part of the show notes. So if you’re listening along going, “wait, go back a second, just talk about that again,” it’s all in the show notes, folks. You can just click a button. There’ll be a PDF that has all of the notes that Niels has very kindly put together for you, so you can read through all of the things that we’re talking about. Everything that he’s linked to, I’ll make sure it’s in there too, with the individual people to go look at and poke on Mastodon and Twitter and Stack Overflow and things. So all of that will definitely be there, for sure. Good.

Niels : Yeah. Because there is an introductory article out by, I think it was Adam Sitnik, who works on the .NET team, too. He’s also, I think, one of the people who work on BenchmarkDotNet, which is one of the things we haven’t mentioned yet.

But BenchmarkDotNet is the library I use, and also the one Joel Verhagen used for NCsvPerf, for actually doing the benchmark. And it has a lot of features related to doing this in a statistically sound way. All of those things are handled, and that’s what I run all the time: just run this, run this. It’s a great library. But he has an article out with an introduction to the vectorization API in .NET: how to use it, the concepts behind it, how do you do this? There are a lot of subtle things you have to take care of around it, and they are covered in that.

Also recently an online friend of mine, Alexandre Mutel, I hope he doesn’t mind I call him a friend, came up with a blog post actually looking at vectorizing some code that’s basically similar to Sep: you have to look for a specific integer value in an array of integers. And he then optimized that using similar code to what I used in Sep. So I also think I have left a link for that in the notes.
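The integer-search problem from that post can be sketched with the portable `Vector<T>` API; this is an assumption-laden illustration of the general approach, not the code from either the blog post or Sep:

```csharp
using System;
using System.Numerics;

// find the index of `value` in `data`, comparing a whole vector of lanes per
// iteration, with a scalar loop to pinpoint the match and handle the tail
static int IndexOf(ReadOnlySpan<int> data, int value)
{
    int i = 0;
    if (Vector.IsHardwareAccelerated)
    {
        var target = new Vector<int>(value);
        for (; i + Vector<int>.Count <= data.Length; i += Vector<int>.Count)
        {
            var block = new Vector<int>(data.Slice(i, Vector<int>.Count));
            if (Vector.EqualsAny(block, target))
                break; // a lane matched; fall through to pinpoint it below
        }
    }
    for (; i < data.Length; i++) // scalar scan: matching block or the tail
        if (data[i] == value) return i;
    return -1;
}

var values = new int[1000];
values[777] = 42;
Console.WriteLine(IndexOf(values, 42)); // prints 777
```

`Vector<T>` picks the widest vector the hardware supports at runtime, which is why the loop step is `Vector<int>.Count` rather than a hard-coded lane count.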

Jamie : There’s a link that I went to.

Niels : And I commented on his code and told him it wasn’t fast, he should do something else. He was kind enough to change it based on my suggestion.

Jamie : I think there’s a very important thing that you kind of glossed over that I’d just like to circle back to before we end the show, because we’ve been going a little long and I feel like we’re frying people’s brains. Oh, I’m not frying people’s brains. You’re frying people’s brains.

Niels : I can keep talking forever about this.

Jamie : I think the most important thing that you mentioned, and we kind of glossed over really quickly, is that optimization is a very expensive thing to do from a person perspective. Right. From an engineering perspective. You were saying about how all the different CPUs have their own feature sets and their own vectorization stuff, and SIMD feels like it’s a very… SIMD vectorization is a rabbit hole you can fall down. You can spend a week just reading blog posts and just get confused, right?

Niels : Easily.

Jamie : And I feel like that goes back to your thing about measuring and isolating and things like that: you need to know when to stop. Right. At what point is it more expensive in engineering time for me to spend three days to eke out a one millisecond performance tweak versus just releasing it? Because a millisecond per request, if we’re getting four requests a minute, there’s no gain there, right?

Niels : But I do think there is. I do want to argue a little bit against this whole, “premature optimization is the root of all evil,” because I kind of hate that.

Because I totally agree that you should probably not go to the level I’ve done for Sep in your everyday job or whatever you’re doing. It’s very unlikely you need to do that. .NET already has a lot of good functionality that’s highly optimized for different things; you can use that. You don’t need to go to the level of actually writing vectorized code yourself. If you have to, it’s really nice that it’s there. We have this feature now, it’s very powerful, and we can use it in .NET in a cross-platform way, no less. That’s really important. I think that’s a key feature in the new .NET, and it required a lot of work. And they are soon going to come with AVX-512 in .NET 8, which of course is great too.

But I think you can get a long way, maybe I would say two times faster, with just 2% extra work or something, if you just unlearn some habits that people maybe have. I’ve seen some examples on Twitter and I’ve seen them myself. Things like: people tend to count lines when they write code. They want fewer lines. You should not count lines. Well, if you can express the same code as readably, as correctly of course, and as performantly as the version with more lines, then you should do fewer lines, right? Of course you should try to be as succinct as you can, but don’t count lines. If you’re doing like a foreach, a foreach by definition in C# is usually four lines, for example. Don’t fret about it. Two of those lines completely don’t matter; it’s just the braces, right? Those lines don’t count. Don’t count them. And the foreach is nice to have on its own line usually, and then you have the meat of the body. That’s a nice line too; that actually helps reading it, right? You can immediately see there’s a loop, you can immediately see what you’re iterating over, you can immediately see what the body does.

Sometimes you want to use LINQ, of course. And LINQ is great for a certain number of things, but if you’re writing something that just requires doing something per element, just use a foreach. Because the example, I think it was David Fowler that also referred to that, and I’ve seen this myself in some code, is you have some kind of array or something and then you call ToList and then ForEach, because List has a ForEach method. There’s no ForEach in LINQ, and there is no ForEach for a reason: they don’t want you to use it on an IEnumerable or whatever. They want you to use the most optimal thing for whatever sequence you have, and that’s the foreach statement, right? Or a for loop in many cases. If you have like an IReadOnlyList, I prefer using a for statement, because then you don’t have the allocation of an enumerator. And those things, if you just unlearn them, don’t take extra time to do. We have a great IDE that just spools out code for you now. There are AIs that just churn out code for you. Don’t count lines. Readability is, of course, the primary thing. I think sometimes it helps to have for loops or whatever; they can help readability too. Debuggability is also one thing, right?
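The three variants Niels contrasts, sketched side by side; all compute the same result, they just differ in allocations and indirection:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var numbers = Enumerable.Range(1, 100).ToArray();

// the pattern to unlearn: materializing a List just to call its ForEach method
int a = 0;
numbers.ToList().ForEach(n => a += n); // extra list allocation + a delegate call per item

// the foreach statement: no intermediate list, no delegate
int b = 0;
foreach (var n in numbers) b += n;

// a for statement over an IReadOnlyList avoids the boxed enumerator that a
// foreach over the interface would allocate
IReadOnlyList<int> list = numbers;
int c = 0;
for (int i = 0; i < list.Count; i++) c += list[i];

Console.WriteLine(a == b && b == c); // prints True
```

Note the nuance: foreach over a concrete `List<T>` or array is already allocation-free (struct enumerator or compiler-lowered indexing); it is iterating through the interface that boxes the enumerator, which is where the for loop pays off in a hot path.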

Just a small thing: sometimes if you get an exception, what kind of info do you get for that exception, for example? That’s a question for you then, Jamie: what kind of information do you get if you get an exception?

Jamie : So you usually get like a stack trace and a couple of messages and maybe some frames as well. There’s a whole bunch of stuff that comes with an exception that most people don’t read. I have a feeling I know where you’re going, but I’m going to let you keep asking me questions.

Niels : Some of it will be actually the source code and the line number, right? If you have debug symbols or whatever in it.

So the line number tells you something. If you put everything in one single line, and you’re debugging something, you don’t know what part of that single line did it, right? So it can help with that too. Now, in a fully compiled-away release build it doesn’t always know how to map to the right line, of course, but in many cases it will, and you will get that information. So that helps. That helps.

So don’t count lines, just write your code, be succinct if you can. And of course if you’re in a hot path, don’t use LINQ. I think a lot of people will say that too.

Did that answer all of it?

Jamie : Yeah, no. I like it.

Because there’s something that you said a few times about being succinct and readable; it’s in Code Complete: the thing that you should be writing code for is other humans, not the compiler. Because the compiler is - and this isn’t meant to be a flex against anyone or anything like that - compilers are way smarter than most of us are, right? So it will be able to take your code and do stuff to it that you probably wouldn’t even think about.

But if you write your code like that, I then have to be able to parse it in my own head before I can figure out what’s going on, before I can make a change. So my thing has always been, “I need to write code that other people can read before I can write code that is optimized,” right? And I feel like a lot of people take the “premature optimization is the root of all evil” quote that you mentioned earlier on, the Donald Knuth one, and it means different things to different people. To me, it means that the code is not as readable to another engineer. And another engineer could be me tomorrow. Because tomorrow I don’t have the context that I have today.

Niels : I also think readability is probably also a little bit different from people to people. Right.

I very much argue that in many cases it’s good to keep performance in mind in whatever you do. It’s just one of the things you have to keep in mind when you write code; it’s kind of a checklist you go over. So the process I have, even for stuff I did for Sep, it’s just me writing the code. Right. But I do pull requests for Sep, because I use the pull request as a way of reviewing the stuff I do on my own. Right.

I think it helps change context. So if you’re just in Visual Studio and you’re just looking in Visual Studio all the time, the context is “I’m writing the code and I’ve already written this,” and it doesn’t give you, I don’t know, a fresh perspective. For me at least, it’s probably different from people to people, but for me at least, it helps change context.

And when you’re in, like, typically we use Azure DevOps, or if you use GitHub, you also kind of get all the changes in a long doom-scroll kind of way. Right. So instead of having to jump through files, or use Git compare or whatever you can use inside Visual Studio, you can scroll through it and you can look: “okay, wait, there’s this thing I forgot?” or, “why is that there?"

And you get a bunch of red flags, as I always say. And you have to change those flags until they get green. So red, green, and then blue for refactor, or blue for optimization. Those are the things you always iterate around, depending on what you’re working on.

Jamie : Excellent. Excellent. I think what we’ll do, Niels, is we’ll probably leave it there. I am happy to come back again and talk more optimization stuff if you’re willing to share and if the listeners are interested in hearing it. I’m sure they will be. But in the intervening time, where can folks go to learn a little bit more about the work that you’re doing and all that kind of stuff? You said that you follow a bunch of people on Twitter. Are you active on Twitter? I mean, I know that you are, but for the listeners, they may not know.

Niels : Right? So first of all, I have a small blog where I kind of blog. It’s not a lot, but a little. It’s… well, that’s basically my name squished together. I wanted it to be N-I-T-R-A, but that was taken. So, with my imagination, I just took two letters more and put them on. So don’t judge me for my handle; it’s not very imaginative, it’s just my name. And I was on GitHub, and I then used kind of the same for Twitter, but Twitter put a one on the end of it. I don’t know why; it just did. I’m also on Mastodon, and it’s also @nietras, and it’s at Mastodon Social, of course, because Twitter is dying. We all know it.

So go to my blog. There are also some links there; there’s also an email you can use, if you like, for that too. And of course, I introduced Sep in a blog post there as well. It’s basically a dump of the README, but with some extra stuff added at the end, with some assembly code. So if you like to read assembly, go there.

Jamie : Excellent, excellent.

Well, Niels, I just want to thank you for spending your evening talking to me and the listeners about all of this stuff. I know that there is a shed load of stuff we talked about today where I’m like, “I don’t know what we’re talking about here. I’ve got to go look this up.” So I know I’ve gone and learned a whole bunch of stuff, and I’m going to be listening to this at least two times more before the episode comes out for quality assurance stuff. So I’m going to learn some more stuff.

Niels : Maybe I’ll learn some stuff when I listen to it.

Jamie : What I’m saying, listeners, is: you may have to give this one multiple listens. And I’m not saying that so that we boost the stats. Only download it once, you know; you may have to give this one multiple listens. And I will take the notes that Niels has prepared and take anything out that I think might be a bit personal. I don’t think there’s anything in there that is, but just in case there is, I’ll take things out, and I will share them as-is, so you can actually have a look at what we were talking about. And there are notes in there, there are links in there, there are things in there we didn’t even talk about. Right. So I’m giving you a whole bunch of this stuff for free.

So thank you very much for that, Niels.

Niels : I’m giving you it for free.

Jamie : Yeah. Excellent. Well, like I said, Niels, thank you ever so much for being on the show. I’ve really appreciated it.

Niels : Yeah, me too. It was great. I had a great time. It was fun talking to you. Always up for talking about performance. Very, very nice.

Jamie : Thank you very much.

Wrapping Up

Thank you for listening to this episode of The Modern .NET Show with me, Jamie Taylor. I’d like to thank this episode’s guest, Niels Rasmussen, for graciously sharing his time, expertise, and knowledge. Make sure to check the full show notes for a collection of links and lots of background information that Niels graciously provided to all of you, as a place to start your journey in learning about performance-based programming.

Be sure to check out the show notes for a bunch of links to some of the stuff that we covered, and full transcription of the interview. The show notes, as always, can be found at the podcast's website, and there will be a link directly to them in your podcatcher.

And don’t forget to spread the word: leave a rating or review on your podcatcher of choice - head over to the podcast’s website for ways to do that - reach out via our contact page, or join our Discord server - all of which are linked in the show notes.

But above all, I hope you have a fantastic rest of your day, and I hope that I’ll see you again, next time for more .NET goodness.

I will see you again real soon. See you later folks.

Follow the show

You can find the show on any of these places