S07E06 - Cleipnir and Beyond: On Resilient Development Practices with Thomas Sylvest

The Modern .NET Show

Episode Summary

In this interview, Thomas, a software developer with over 12 years of experience, shares his insights on resilient programming and introduces his framework, Cleipnir .NET.

The conversation starts with the core concepts of resilient programming. Thomas explains that resilient programming aims to provide developers with a user-friendly abstraction for implementing distributed systems while ensuring security and consistency. He highlights the importance of resilience against various issues, including downtime from crashes and even complications arising during software deployments. Using the example of an e-commerce platform, he describes how complex processes—like ordering items and handling payments—can become problematic when state management across microservices is inconsistent, especially during failures or restarts.

Thomas proceeds to elucidate what making an application resilient entails. He explains that unlike traditional transactions within a single database, microservices lack uniform transactional states, often leading to inconsistent states if not managed properly. He utilizes the analogy of an order workflow that can falter if a server crashes or restarts, resulting in half-processed states where certain transactions could be duplicated or left incomplete. Resilient programming seeks to unify this process, ensuring that operations can either be completed fully or not at all.

Transitioning to his framework, Cleipnir .NET, Thomas explains how it simplifies the workflow for developers. He introduces the concept of “effects,” which are a way to encapsulate potentially non-deterministic operations, such as generating transaction IDs or making network calls. By using these effects within workflows, Cleipnir enables the application to recognize previous actions and effectively continue from where an operation left off after a system restart.

The discussion touches upon the challenges of managing message-driven architectures. Thomas contrasts a traditional service bus setup, where every message type needs its own handler, with Cleipnir’s approach to parsing messages. He emphasizes that Cleipnir allows developers to directly connect messages with workflows, simplifying coding and improving clarity.

Thomas also underlines the significance of idempotency in distributed systems—ensuring that repeating an action will yield the same outcome. By linking this with Cleipnir’s mechanisms, he illustrates how developers can confidently manage communications with external services without the risk of unintentional duplications or erroneous states.

Overall, the interview sheds light on the complexities of resilient programming in modern software design, highlighting how frameworks like Cleipnir can help streamline processes, manage state effectively, and foster a more intuitive development experience. Thomas’ insights provide a valuable understanding of the balance between technical rigour and user-friendly implementation in the realm of distributed systems.

Episode Transcription

So part of what Resilient Programming is about and what the framework does is that it kind of like tries to provide a nice abstraction, a developer-friendly abstraction for implementing distributed systems.

Welcome friends to The Modern .NET Show; the premier .NET podcast, focusing entirely on the knowledge, tools, and frameworks that all .NET developers should have in their toolbox. We are the go-to podcast for .NET developers worldwide, and I am your host: Jamie “GaProgMan” Taylor.

In this episode, Thomas Sylvest joined us to talk about both Resilient Programming and Cleipnir .NET - a framework that Thomas worked on to implement the concepts of Resilient Programming in .NET applications. Cleipnir, and Resilient Programming, are fantastic for supporting message-driven architectures; whether you’ve built a monolith, series of microservices, or anything in between.

But the idea is the same: you try and remember the result of actions that you’ve done, in a way that if you then start again, you’ll check in your little notebook whether you already performed this action. If you did, then you’ll just return the result of the previous execution. If you look in your notebook and you can see, "okay, actually I haven’t done this before," you will then perform the action.

- Thomas Sylvest

Anyway, without further ado, let’s sit back, open up a terminal, type in dotnet new podcast and we’ll dive into the core of Modern .NET.

Jamie : [0:00] So, Thomas, welcome to the show. We’re going to talk about a whole bunch of stuff today, mostly stuff that I have very little experience with, but I always appreciate people taking the time to be on the show, so thank you very much.

Thomas : [0:13] Yeah, thank you for having me. Looking forward to it.

Jamie : [0:16] No worries, no worries. So, Thomas, I wonder, could you give us a brief introduction to yourself before we talk about things like resilient programming and stuff like that?

Thomas : [0:26] Yeah, sure. Definitely. So I’m a software developer. I have like 12 years of experience now as a back-end .NET developer.

Thomas : [0:39] So I’ve also, I mean, I did my master’s degree also in computer science, and I worked for a few years after that.

Thomas : [0:49] So I worked at a big Danish bank called Danske Bank, and we did like an event sourcing framework and nice things like that. And as part of that I started wondering about something called consensus algorithms. I don’t know if you know Paxos and Raft, but they kind of underlie a lot of the major databases in the world, and it’s kind of like the fundamental distributed systems concept that you use there. And as part of this, I thought I came up with my own consensus algorithm, and I tried looking around for different universities where I could talk to them about this algorithm and maybe do a PhD in it. So it turned out that there was a university in Norway. I’m from Denmark, so Norway is not too far away. And they said, "yeah, too many papers on consensus algorithms, but you can come and join us as a PhD student." And as part of that, I started looking into how you could make these distributed algorithms easier to implement. And that’s actually what got me, I guess, to where I am today.

Thomas : [2:09] Where I’ve created a framework called Cleipnir .NET. I’m back as being a back-end developer, but yeah, I’m working on this framework on the side.

Jamie : [2:21] Nice.

Jamie : [2:23] So before we can talk about Cleipnir, let’s talk about the kind of underlying principles behind what it allows folks to do, right?

Thomas : [2:33] Yeah. Yeah, sure. So part of what Resilient Programming is about and what the framework does is that it kind of like tries to provide a nice abstraction, a developer-friendly abstraction for implementing distributed systems. So, I mean, these kind of distributed systems, they’re all over, right? Every single system, in principle, is one: as long as you have a browser and you have a database, you have a distributed system with multiple entities in it, you could say. And what the framework is about, and what resilient programming is about, is trying to provide some nice abstractions that developers can use to implement these systems in a more, I guess, secure and consistent way than we’re able to today.

Jamie : [3:31] Okay. So I have an app and I am wanting to make it resilient. What am I making it resilient against? Is it like downtime? Is it DDoS? Like, what are we actually doing?

Thomas : [3:45] Yeah. That’s a good question. So it’s similar, I guess, to how you have transactions in a database, which is also kind of like an abstraction we can use so that we don’t have these half-committed states that we would have to try and handle on a subsequent restart, right? So resilient programming in this aspect is the same. But the issue we have nowadays in microservice architectures, in cloud services and such, is that we don’t have a transaction that spans our different services, if that makes sense. So we’re kind of like back in the world before we had database transactions, if that makes sense.

Jamie : [4:31] Yeah. So like, let’s say I had, for instance, an e-commerce platform, right? That may or may not be similar to Amazon, right? I’m a user, I’m going to make an order, right?

Jamie : [4:45] So I need perhaps a checkout, or maybe before that I need a basket. So I need some kind of way of storing the contents of my basket. That could be perhaps a service, a microservice or whatever. And then when I get to the checkout, I need to get the state of the basket in order to make the checkout, right?

Jamie : [5:03] And then perhaps I need to be able to get the user information, perhaps address details, perhaps billing details, in order to place that order. So I collate all of that information, and that goes through the order system. That means the order has been placed, but perhaps I’m waiting on some async thing to happen to take payment: maybe it goes out to Stripe, maybe it goes out to a credit card company. So there’s some other state happening over there. And then only after that has taken place, and that billing information has come back and said, "yes, the payment has been authorized. Here’s the token," or whatever, am I able to move on to, say, fulfilling or picking or whatever the word is for getting the products off of the shelf and putting them into some kind of package, right?

Jamie : [5:48] So I think what you’re saying is, and please correct me if I’m wrong, that we have multiple steps along this journey and there is, like you said, no one unifying database transaction, no one unifying overview of the system state. And so my basket can be in one state and my order can be perhaps in another state, and, you know, if my server suddenly shuts down and restarts, I want to be able to perhaps deal with that. Is that what we’re doing? We’re helping to deal with that power loss, but also persisting that state? Are those two separate things?

Thomas : [6:25] Yeah, so you’re spot on in that we have these flows and we have them all over, right? Spanning different external systems, right? So we have our own service and it communicates with other external services. It might be other microservices, but it could be, like you said, Stripe is a payment provider. And what it’s about is trying to… So power loss and this kind of fatal crash is one aspect, of course, that I think most people probably are not too concerned with, but we should be, I guess, more than we are now. But we also have a situation where, for instance, when doing a deployment of our software (these days we can deploy several times a day, right?), if you have a system where a lot of data is going through, you might actually end up inducing this issue yourself, because the service will restart. And the problem is, like, let’s say that we have a simple flow, right, where we need to reserve funds from a payment provider, and then we ask some shipping service, a logistics service, to actually ship the products, and then we’re going to capture the funds and maybe send an email at the end.

Thomas : [7:38] Let’s say we are able to do the first two steps: we reserve funds and we ship the products, right? And then we do a restart. If we don’t do anything, then the user is going to get free products, because we only made a reservation on the credit card but we actually shipped the products. And there are systems that will allow us to restart this flow after the service went down, right? But if we just restart it naively, then depending on how the code is made, we might actually end up doing another reservation on the customer’s credit card and shipping the products twice. Does that make sense?

Jamie : [8:26] Yeah. So we’re after sort of protecting against everyone… we’re protecting to make sure the happy path exists and always works right?

Thomas : [8:38] Yeah. And it could also be, I mean, the exception path. So if something goes wrong, we maybe need to do some kind of clean-up. So let’s say that the shipping service comes back and says, "I don’t have any of these products in stock. So I can’t fulfil or ship the order," then we should probably cancel the reservation on the credit card. And what resilient programming is about, and the framework that I’ve created, is trying to make this developer experience nice. And there’s, I mean, several ways that we do it now.

Thomas : [9:15] And I think that with the ways that we program these solutions, we kind of go away from normal programming. We end up with, let’s say, a handler per message type, and we have to kind of implement a state machine by hand, which, yeah, it doesn’t have to be so complex, I guess that’s my point.

Jamie : [9:43] Okay. So, you know, traditionally, without using any kind of framework, I’ve got to write a whole bunch of code that maybe sets up, spins up, and watches my microservices, right? So let’s say we’re using that order flow that you talked about there, right? We’ve got this order flow, and my application is written, and maybe I’ve split it into microservices. Let’s say I have, for the sake of argument, right? Split it up into microservices, I’ve got all my different things happening. I then need some kind of observer or watcher or maybe daemon that is watching my app, the whole thing, from the top level perhaps, and making sure (I guess this is what Kubernetes does, right?) that things are still running, right? Maybe health checks or some other thing. And all of that sort of wiring up and creation of all of that kind of stuff, that feels really complex to do. So is this where Cleipnir comes in?

Thomas : [10:43] Yeah, so it’s true, yeah, Kubernetes does this, and that’s kind of like on the process level, you could say. It ensures that the process, let’s say it’s an ASP.NET application, will come up again if it goes down. But even if that application comes up again, there’s nothing restarting the flow necessarily.

Thomas : [11:06] So a solution that you could use is Hangfire, which has these background jobs, or Quartz, I guess it’s also called in .NET. And they will do this. They will take the job and they will restart it. But they don’t have any way of handling these kinds of retries. So let’s say I just did like before, where I create a new, let’s say, transaction ID on each restart, which I send to the payment provider. They will just naively restart the code from the top again. And so it’s kind of like we need some abstractions for actually handling that. It could also be that we have a Hangfire application and an external service is down, and we’re using something like Polly, doing exponential backoffs, waiting for the external service to come up again. But it could also be that after a while, we don’t really want requests to keep coming in and keep on creating more and more workflows, right? We might want to suspend some of them and then, I mean, load them up again after 10 minutes. And that’s also not something that those frameworks provide, the ability to say, "I want to suspend the invocation now and then start up in a while", at least not as far as I know.

Jamie : [12:29] So what I’m thinking then is that, since what you’re saying is Kubernetes is more, "let’s look after the processes that are running." So I’ve split my app into microservices perhaps, and one of my microservices goes down. Kubernetes is going to jump in and start that back up, but there’s going to be no restoration of state, right? It’s going to be as if I’ve started that microservice up brand new. Is that right?

Thomas : [12:56] Yeah, yeah. So for instance, if you have a browser and it did a request, I mean, waiting for the order to complete, that will just have been terminated in the middle, right? So the user and the browser will get some kind of error saying that the connection was broken. And the service, when it starts up again, might have a database with state, but there’s nothing indicating to it that there was this, I mean, request in progress and that it should pick it up again.

Jamie : [13:27] Right. So in our example, my order is gone. As a user, my order is gone, and I have to start again, right?

Thomas : [13:35] Yeah. Yeah, or worse, it might be that your order is half-processed. And what resilient programming and the framework are about is trying to provide you with some tools to try and handle these situations. Because you almost always want these things to happen, I mean, either all of it or none of it, if that makes sense, right? If you started processing an order for a user and your service restarts, you want to, I mean, complete the order. Or perhaps you want to clean up and then, I mean, not do the order. But I guess if it’s money involved, the company will want the order to be fulfilled in the background.

Jamie : [14:14] Of course.

Jamie : [14:16] Yeah, okay. So that’s where resilient programming comes in. And it’s this ability to sort of restore state, look at what’s happened, look at what happened just before everything broke, and figure out what to do next.

Jamie : [14:29] Like you said, it’s a, "do all the things or do none of the things," right? Not get in this weird state where you’ve half processed an order or half taken payment or, you know, you’ve sent something off to be processed asynchronously and then system goes down, but it’s still being processed. Maybe it’s, you know, the money has been taken or an email has gone out, that kind of thing.

Thomas : [14:51] Exactly, yeah. So, yes.

Thomas : [14:55] So, we can use Hangfire and Quartz. It’s kind of like for the simple case where we have like HTTP communication. But the other big player is a message-driven application, right, where you have a message queue. And that will, I mean, keep on re-delivering a message until it’s handled, right? But even then, if we have like a half-completed flow, an order flow, and we just, I mean, restart and then we start processing that message again, I mean, we have to do something in order to avoid, I mean, deducting funds from the customer’s credit card twice or shipping the products twice. So essentially it’s trying to allow the developer to write, I mean, code almost as normal code, right? But you get this ability that even if your code is restarted from the top, it will kind of like go to the point where you were before the crash and then it will just pick up from there.

Jamie : [16:00] Okay. So before we start talking about things like message queues and all of the other different solutions that are available. One of the things that I think is kind of important to point out is that we’ve so far talked, I mean, I did talk about a payment system, but so far I’ve talked about all of the services that I’ve written that are sort of internal to the app, right? What if, you know, the system that I’m calling to, this external system for payment or for shipping or maybe an email sending system, what if that fails, right? Can I just get away with retrying again with Polly every couple of minutes until it goes through? Or, like, what are the problems with that, I guess?

Thomas : [16:45] Yeah. So, I mean, Polly, I think it’s a nice framework, and it kind of encapsulates the way you do backoff strategies, instead of just, I mean, doing a denial of service, essentially, on an external service. So in that sense it makes a lot of sense. But the issue can be that even if I have Polly, and an external system is down, and the exponential backoff has now got to the point that I’m going to wait for 15 minutes, there’s kind of like two issues with that. One is that it might be that while I’m waiting for this external system, I’m restarting, let’s say a deployment went through, right? Then there’s nothing ensuring that Polly will pick it up after the restart. The other thing is that if you have a service and an external service is down, and you keep on accepting new requests, and it just keeps on piling up these flows that cannot communicate with the external service, you might actually bring down the service itself, if that makes sense, because it will run out of resources at some point.

Thomas : [17:53] So, a way to handle this is using a framework like this, where you have the ability to say, "okay, this external system is down, I’m now waiting for 15 minutes, which is a bit too much in memory, so I want to suspend the invocation, and then maybe I’ll come up again in 15 minutes or half an hour." I guess it can go quickly if you have a system with a lot of data going through and you all of a sudden have an external system which is down and you have a lot of input coming into the service. So you have to be able to kind of like do something to save resources on that server or process.
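
As an illustration of the trade-off Thomas describes, here is a minimal Python sketch (not Cleipnir's or Polly's API, and not code from the episode). It assumes a hypothetical policy: short backoff delays are waited out in memory, while long ones cause the flow to be suspended with a stored wake-up time. The names `backoff_delay`, `next_step`, and `max_in_memory_wait` are invented for this example.

```python
def backoff_delay(attempt, base_seconds=1.0, cap_seconds=1800.0):
    """Exponential backoff: 1s, 2s, 4s, ... capped at 30 minutes."""
    return min(cap_seconds, base_seconds * 2 ** attempt)


def next_step(attempt, now, max_in_memory_wait=60.0):
    """Decide whether to wait in memory or suspend the flow.

    Hypothetical policy: a wait longer than max_in_memory_wait is not worth
    holding resources for, so the flow would be persisted with a wake-up time
    and its memory released.
    """
    delay = backoff_delay(attempt)
    if delay <= max_in_memory_wait:
        return ("retry_in_memory", delay)
    return ("suspend_until", now + delay)


# Attempts 0 and 3 are short waits; attempts 6 and 10 exceed the threshold.
decisions = [next_step(attempt, now=0.0) for attempt in (0, 3, 6, 10)]
for decision in decisions:
    print(decision)
```

In a real system the "suspend" branch would write the wake-up time to durable storage, so that the flow can be resumed even if the process restarts in the meantime.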

Jamie : [18:37] Yeah, I like the point that you’ve raised there about, "why don’t we just let the system continue running and just build up a big queue of stuff to do?" I think that a lot of devs, especially those of us who are, shall we say, long in the tooth, we remember programming for systems that didn’t have an infinite amount of RAM. We built systems that weren’t cloud-based, so we had a finite amount of storage, a finite amount of processing time because we were probably on a server with 12 other applications, right? Whereas technically, I mean technically in the cloud you are on a server with a bunch of other applications, but you don’t feel it because you feel like you’re isolated and you’re set up in that way. You know, you don’t have to worry about the person running the other app knocking on your cubicle door and saying, "hey, your system’s brought down my system. Can you go fix it?" So I really appreciate the point that you made about, you know, even in the cloud, you still have this finite system. So just letting those requests pile up is maybe not the correct answer as well.

Jamie : [19:52] So we’ve said that there are message queues and Service Bus and Hangfire and things like that that are out there that allow us to sort of put a message or a thing to be processed onto a queue and we can pop that back off of a queue, process it, and maybe send something back the other way that says, "hey, I’ve processed it," or make a note somewhere to say that I’ve processed it. But my feeling with those systems is: do they not still have the same memory issues? Like if I have a message system and I send a million messages to it or a billion messages or a trillion messages, but I don’t let it process anything, eventually that’s going to run out of memory as well, right?

Thomas : [20:38] Yeah, true. Yeah, good point. So at least from when I was working, I guess, how long is it now? Seven, eight years? I used to work a lot with RabbitMQ. And what we were taught then and agreed on as developers was this thing that we shouldn’t use the message queue as a database. That was kind of like an anti-pattern. So you had to be aware not just to keep on piling data into the message queue and then leaving it there if the system taking off those messages was down. So that’s also a thing to be aware of, which I think people are probably not too aware of today, but I still think we shouldn’t use the message queue as a database.

Thomas : [21:32] And also this thing that when we take a message off a message queue and we restart halfway through, we’re also going to reprocess that message. And I think, at least from what I’ve seen, because there’s a lot of patterns out there, we have the outbox pattern, and there’s also something called the inbox pattern. And all of these patterns kind of assume that we won’t restart during processing of the message. And that, I think, is a fallacy, because it will happen, at least from where the software industry is going now, right? Where we do deployments, a lot of deployments. We will end up in systems where we have to reprocess messages. And it’s kind of like, how can we make that kind of scenario nicer for developers?

Jamie : [22:29] Absolutely. I mean, all of my questions and examples so far have been, "what if the server falls over?" But like you pointed out earlier on, and I didn’t jump on it until just now, "hey, we’re pushing to production, you know, on a semi-regular basis. So when I push to production, at some point, that server has to be, or that application has to be, shut down and restarted. Or maybe a second instance of the application needs to be spun up so that we can actually run the new code," right? And I think a lot of folks kind of forget the complexity involved with that, or maybe they’re unaware of the complexity involved with that. And so it isn’t just a case of, "the server fell over. We had a DDoS attack, or some other thing." It could also be, "hey, you know, we pushed to production today, that’s why we needed to reprocess things."

Thomas : [23:21] Yeah, yeah, exactly. And I think these distributed systems issues are similar to, I guess, concurrency issues or threading issues that we face, in that the IDEs won’t, I mean, warn us. We see them sometimes when we get weird behaviour of a system, but there’s nothing really helping us out in detecting these things. And a framework, I guess, like Cleipnir, is kind of like trying to provide you with the abstractions, the tools to simplify this, so you don’t have to think about it. Similar to how you can use a lock or a semaphore in .NET, right? It provides you with the tools to make this simpler, right? Or to actually be able to solve it.

Jamie : [24:10] Sure. I guess that’s a good segue into, "let’s talk about Cleipnir and how that helps," right? Because you’ve mentioned a few things there about how it adds the abstractions that maybe we’re missing to allow us to deal with this. So first off, I guess, before we get to Cleipnir, imagine you were writing a library or something like this. How do you go about fixing this problem, right? Rather than just, you know, we’ll come on to Cleipnir in a minute. I want people to check that out. But like, what is the algorithmic or metaphorical solution to this, right? And then we’ll talk about how Cleipnir does that for people.

Thomas : [24:50] Yeah, sure. Yeah. Yeah, so if you were to do this yourself, I mean, the first issue is that you would need the code that was executing, for instance, the order flow, to be restarted in case it didn’t complete, right? And you could use something like Hangfire, or, if you don’t ACK the message off the message queue until it’s been processed in its entirety, right, you can also use that as kind of like a restart mechanism. So that’s the first thing that you have to, I mean, realize somehow. The next thing is ensuring that the code, when it actually is executed, is, I guess, it’s called idempotent or deterministic, right? In the sense that even if it’s, I mean, executed again, it won’t, I mean, do the thing twice or it won’t get into some kind of weird state. It will kind of like pick up from where it got to and then continue from that point.
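
The "don't ACK until fully processed" restart mechanism can be sketched as follows. This is a minimal in-memory Python illustration, not a real message broker client: `run_consumer` and `flaky_handler` are hypothetical names, and the `deque` stands in for a queue that redelivers unacknowledged messages.

```python
from collections import deque


def run_consumer(queue, handler, max_deliveries=5):
    """Deliver the head message until the handler succeeds, then ack it."""
    deliveries = 0
    while queue and deliveries < max_deliveries:
        message = queue[0]      # peek: the message stays queued until acked
        deliveries += 1
        try:
            handler(message)
        except Exception:
            continue            # no ack, so the same message is delivered again
        queue.popleft()         # ack only after the handler ran to completion
    return deliveries


attempts = []


def flaky_handler(message):
    """Fails on the first delivery to simulate a restart mid-processing."""
    attempts.append(message)
    if len(attempts) == 1:
        raise RuntimeError("simulated crash mid-processing")


queue = deque(["order-1"])
deliveries = run_consumer(queue, flaky_handler)
print(deliveries, attempts, len(queue))
```

Note that the handler sees `order-1` twice. Redelivery only solves the restart problem; the handler still has to cope with running again, which is the idempotency point Thomas raises next.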

Thomas : [25:54] And I mean, so a simple situation like we had with the order flow: if the first line of your order flow is Guid.NewGuid(), in order to create a transaction ID that you’re going to send to Stripe, that’s not idempotent, right? Because if you restart, it will then, I mean, create a new transaction ID, and it won’t be visible to Stripe that you’re actually meaning to talk about the same transaction, if that makes sense.
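
To make the transaction ID problem concrete, here is a minimal Python sketch, using `uuid.uuid4()` as the analogue of `Guid.NewGuid()`. Everything else is assumed for illustration: `FakePaymentProvider` is a hypothetical provider that deduplicates by transaction ID, and the `durable_store` dict stands in for storage that survives a restart.

```python
import uuid


class FakePaymentProvider:
    """Hypothetical provider that deduplicates reservations by transaction ID."""

    def __init__(self):
        self.reservations = {}  # transaction_id -> amount

    def reserve(self, transaction_id, amount):
        # Same ID twice is a no-op; a new ID always creates a new reservation.
        self.reservations.setdefault(transaction_id, amount)


def naive_order_flow(provider, amount):
    transaction_id = str(uuid.uuid4())  # fresh ID on every (re)start
    provider.reserve(transaction_id, amount)


def resilient_order_flow(provider, store, order_id, amount):
    # 'store' stands in for durable storage that survives a restart.
    key = f"{order_id}:transaction_id"
    if key not in store:
        store[key] = str(uuid.uuid4())  # generated once, then remembered
    provider.reserve(store[key], amount)


naive = FakePaymentProvider()
naive_order_flow(naive, 100)
naive_order_flow(naive, 100)  # simulated restart from the top

resilient = FakePaymentProvider()
durable_store = {}
resilient_order_flow(resilient, durable_store, "order-1", 100)
resilient_order_flow(resilient, durable_store, "order-1", 100)  # simulated restart

print(len(naive.reservations), len(resilient.reservations))
```

The naive flow ends up with two reservations because the provider has no way to tell that both calls refer to the same order; the second flow reuses the remembered ID, so the provider can deduplicate.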

Jamie : [26:22] Yeah, I hadn’t even thought about that bit. We’ll talk about that in a moment. But just like for the folks, because idempotency is a very computer science-y term. And I know there’s a lot of folks in our industry who didn’t do the computer science-y path. So just real quick, my personal description of idempotency is: if I send a request to a server to create me a new customer, for instance, right? I’m on the sign-up screen. I do the create customer and I send that off. No matter how many times I send that off, it will only create the one customer, because all of the data is the same, right? My request will always produce that same object, right? It doesn’t have to go to the database and create it; it can just say, "hey, cool, we created the Jamie Taylor object." And then you send again and it goes, "cool, the Jamie Taylor object already exists, so I’m not going to bother creating it. Here’s the ID." You send it again [and] it goes, "hey, like, you’ve already said this twice. It already exists. Here’s the ID."

Thomas : [27:26] Please, please stop. Yeah, no, you’re right, you’re right. Yeah, exactly. So that’s what idempotency is: it’s just a fancy word for being able to handle that things are duplicated. And in a distributed system, because the network is unreliable, as we call it, right, it might lose messages. So the only way you can handle that is: if we don’t hear anything for a while, we’re going to resend the message, right? That’s kind of like how the network works at the lowest level.

Thomas : [27:59] And so it’s kind of like we often end up in just resending things. But we don’t want these effects or these kind of like actions to be executed multiple times, right? You don’t want your user to be created with different IDs for the same customer, right? We don’t want the transaction on the credit card to be multiplied and executed several times. So it’s kind of like just trying to ensure that these extra, I guess, actions won’t be taken: there’ll only be one action, no matter how many times we tell different systems to perform an action.
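
Jamie's create-customer example can be sketched in a few lines of Python. This is an illustrative in-memory version under stated assumptions: `CustomerService` is a hypothetical name, and the email address is assumed to be the natural key that identifies "the same" request.

```python
import itertools


class CustomerService:
    """Hypothetical in-memory service; a real one would rely on a database unique key."""

    def __init__(self):
        self._customers_by_email = {}       # email -> (customer_id, name)
        self._next_id = itertools.count(1)  # simple ID generator for the sketch

    def create_customer(self, email, name):
        existing = self._customers_by_email.get(email)
        if existing is not None:
            return existing[0]              # already created: return the same ID
        customer_id = next(self._next_id)
        self._customers_by_email[email] = (customer_id, name)
        return customer_id

    def customer_count(self):
        return len(self._customers_by_email)


service = CustomerService()
# The same request is sent three times, for example because of network retries.
ids = [service.create_customer("jamie@example.com", "Jamie Taylor") for _ in range(3)]
print(ids, service.customer_count())
```

Every call returns the same ID and only one customer exists, which is the "no matter how many times I send that off" behaviour described above.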

Jamie : [28:39] Sure, sure. So take an example from my own personal life, right? If I say to one of my kids, "hey, I’d like you to go down to the store and buy me a tin of beans," right? He’s going to go down to the store, buy a tin of beans, and come back. And then I say, "hey, I need to have one tin of beans in the house. Go to the store and buy me another tin of beans," he’s going to actually turn around to me, quite rightly, and say, "hey, we already have a tin of beans. I’m not going to go to the store." And then I’m gonna say, "hey, I needed one tin of beans in the house. Go buy me a tin of beans." And then he turns around to me again and says, "hey, we already have the tin of beans right now."

Jamie : [29:17] I guess from a programming perspective, how do we get to that point, right? I realize that it’s kind of a woolly question, and you can totally bring Cleipnir into the answer. But, like, how do we make sure... Your example was brilliant: I create a GUID so that I can send off a request to Stripe. It fails, maybe my code needs to reboot, and then when it goes to deal with that request again, it creates a brand new GUID, which is not the same as the previous GUID, and sends the request off to Stripe, right? Is it just a case of literally logging everything and then figuring out where we got to and trying to recover from there? Like, how do I do it?

Thomas : [30:01] Yeah, that’s a good question. And it was actually, I guess, these thoughts that ended up in me trying to implement the framework. So, I mean, yeah, so we don’t really want to get into a situation where we kind of like have to look through logs and try and find out, because it might be, I guess, very hard to figure out what went wrong. And if we just naively retry from the top and get into some weird situation, I mean, it might just be that we end up taking a payment from a customer several times, shipping the product several times, in which case the customer gets angry.

Thomas : [30:43] So kind of like the idea, and also you talked to John about Temporal the other day, in a previous show. And Cleipnir and Temporal are slightly different frameworks, of course. But the idea is the same: you try and remember the result of actions that you’ve done in a way that if you then start again, you won’t repeat them. You kind of like you’ll check in your little notebook if you already performed this action. If you did then you’ll just return the result of the previous, um, I mean execution. If you look in your notebook and you can see, "okay actually I haven’t done this before," you will then perform the action. If that makes sense. So that’s kind of like the trick that you’re using in order to realize this.

Jamie : [31:36] Right right okay. And just so folks know, in case they’re coming at this episode brand new, you talked about when I talked to John Kattenhorn about Temporal.io. And that’s, um, season six episode 19, released on May 31st of 2024 right. So just in case folks are listening in and going, "hey that sounds like a great conversation. I want to get some more background information," that’s that episode. I’ll put a link in the show notes.

Jamie : [32:05] So from my understanding what we’re doing is, just like we would with a human person right: I’ve been given a task to do, I’m going to write down where I get up to so I can refer back, like you said, to that notebook and figure out where I am. So what does that look like for my code then? So let’s say we bring in Cleipnir. What’s Cleipnir doing and how does it help me to implement that kind of pattern of knowing where we were when the system fell over?

Thomas : [32:37] Yeah, so that’s kind of like where I guess it also takes a different approach from Temporal and also like Azure Durable Functions is also like a player in this field. But the idea is that you can just provide a lambda to something called an effect, which is just a method. So when you implement the flow, you’re just inheriting from a base class. And that has kind of like all the different, I guess, abstractions that you need. And one of these is called an effect.

Thomas : [33:13] And when you have some code and you want this code to be, I mean, able to handle a restart, or it’s kind of like non-deterministic, as we also say in a fancy word, for instance, creating the GUID, we can just wrap that, kind of like provide a lambda to this method that you have on the flow. And the framework underneath will then say, "okay, have you done this before? If you have, I’ll just return the value that I created previously for you. Otherwise, I’ll generate a new one, save it to the database and return it to you." So this kind of like effect idea or abstraction works for GUIDs, kind of like internal things that you want to ensure are the same. But it can also work if you do network calls. Let’s say you reserve funds with Stripe: you can also wrap that, uh, in a lambda right. So you do the invocation to Stripe, you wrap that into an effect, uh, and that will ensure that it will only happen once.
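
The "have you done this before?" check Thomas describes can be sketched as follows. This is a conceptual Python illustration under stated assumptions, not Cleipnir's C# API: the `Flow` class, the `effect` method signature, and the dictionary standing in for the database are all hypothetical.

```python
import uuid


class Flow:
    """Hypothetical flow base class; `store` stands in for the database."""

    def __init__(self, store: dict):
        self._store = store

    def effect(self, effect_id: str, work):
        # Seen before: return the remembered result without running `work`.
        if effect_id in self._store:
            return self._store[effect_id]
        # First time: run the lambda, save its result, then return it.
        result = work()
        self._store[effect_id] = result
        return result


store = {}  # survives the "restart" below, as a database would

first_run = Flow(store).effect("transaction-id", lambda: str(uuid.uuid4()))
# Simulated restart: a new Flow instance over the same stored state.
second_run = Flow(store).effect("transaction-id", lambda: str(uuid.uuid4()))
```

Both runs see the same GUID, because the second lambda is never invoked.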

Jamie : [34:20] Right. Okay and like you’re saying this is different from how, uh, Durable Functions are working right?

Thomas : [34:28] Yeah, yeah. So Durable Functions and Temporal have these concepts of activities and workflows, and you have to, I mean, divide them. Cleipnir is kind of like trying to do, I guess, a bit like you also had a talk a previous year about minimal APIs, right? And it’s kind of like similar that you can just provide the kind of like code inline that you want. You don’t have to implement lots of different functions, lots of different types and stuff. It’s basically just providing, um, I mean code to a lambda, to a method that you already have, um, on your type.

Jamie : [35:08] Right. Okay. So I go ahead and bring Cleipnir in, and say, "hey here is this code. I’m going to give you a lambda that either itself is the entire action I want you to take or maybe it is a call to a function that I want you to take. And then some kind of magic happens." The process time happens, um, we have to call that method, we call that lambda, we get part way through, the system crashes and restarts. What happens then?

Thomas : [35:43] Yeah yeah. That’s a good question. So how you make a flow for instance in the order flow is that you would use an effect just for the GUID, then you would use an effect just for communicating with Stripe, and then you would use an effect just for communicating with the order or the logistics service, and the same right so it’s not that you would put the entire flow into an effect but you would kind of like have four different effects. And the workflow will just start from the top if it’s restarted and then we’d say to the first effect, "we already have a result here," so yes. Then it will return the same GUID that it created in the previous invocation. Then it will go down to the next line. And let’s say that we already did a reservation to Stripe. Then it will just, I mean, jump over that. And then let’s say it crashed just before we actually did ask the logistic service to send the product. Then it would then actually detect that and say, "oh, okay, I’ll continue from there." So it’s kind of like it’s just jumping over and returning the same results as previously when it’s detected that it’s already executed before. So conceptually, it’s actually fairly simple.
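
The restart behaviour in the order flow can be sketched like this. It is a Python illustration under assumptions, not Cleipnir's API: the `effect` helper, the step names, and the fourth step (`capture_funds`) are hypothetical, and the crash is simulated with an exception just before the logistics call.

```python
import uuid

calls = []  # every external call that was actually made


def effect(store, effect_id, work):
    """Hypothetical helper: run `work` once per effect_id, remember the result."""
    if effect_id in store:
        return store[effect_id]
    result = work()
    store[effect_id] = result
    return result


def reserve_funds(transaction_id):
    calls.append("reserve")
    return "reserved"


def ship_products(transaction_id, fail):
    if fail:
        raise RuntimeError("simulated crash before shipping")
    calls.append("ship")
    return "shipped"


def capture_funds(transaction_id):
    calls.append("capture")
    return "captured"


def order_flow(store, crash_before_shipping=False):
    # One effect per step; the flow always starts from the top.
    transaction_id = effect(store, "transaction_id", lambda: str(uuid.uuid4()))
    effect(store, "reserve_funds", lambda: reserve_funds(transaction_id))
    effect(store, "ship_products",
           lambda: ship_products(transaction_id, crash_before_shipping))
    effect(store, "capture_funds", lambda: capture_funds(transaction_id))
    return transaction_id


store = {}
try:
    order_flow(store, crash_before_shipping=True)
except RuntimeError:
    pass  # the process "went down" here

id_after_crash = store["transaction_id"]
id_after_restart = order_flow(store)  # restart: runs from the top again
```

On the second run the first two effects return their remembered results, so funds are reserved only once and the same transaction ID is reused. A crash between performing an action and saving its result would still re-run that action in this sketch, which is why reusing the same ID with an idempotent receiver matters.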

Jamie : [36:51] Right. Okay. That makes sense, right? It’s like our detective, right? Or our person who’s doing some work. They’re making notes of what they’ve done. I’ve got a to-do list, right? And I’ve checked off the things on my to-do list. That way, if I need to be rebooted, that’s a strange way to put it, but maybe I have a nap or I have to go do something else. I can come back to my to-do list and go, "right. Okay. I’ve done that, done that, done that. Cool. I’ve got these steps to go." So it’s kind of like that, I guess.

Thomas : [37:21] Yeah. Yeah. So at the kind of like fundamental level, the simplest level, that’s just what it is. But just having that allows us to do like, I mean, loops and if statements and everything. So it’s kind of like if you just sprinkle in these effects, you can just code like normal, really. So you don’t have to worry about being restarted, which is quite nice. So that’s kind of like the first, I guess, major concept in Cleipnir. The other one is messages, which we can also talk about.

Jamie : [37:56] Sure sure, let’s do that. Um, just real quickly before we do, so that my understanding is correct: I create some kind of lambda that does some long process that does maybe five different steps. Um, my process crashes halfway through, or, you know, I push a new version to production. We were on step three of five. The app comes back up, some kind of magic happens with Cleipnir, and it figures out, "hey we were on step three and this was the state." It restores that for me and we can carry on from there. Is that what it’s doing?

Thomas : [38:30] Yeah. So actually yeah, it just remembers the different effects that you have. And then it’s able to see, similar to how Hangfire works, it’s able to see that this flow didn’t actually complete previously. So it’ll just say, "okay it didn’t complete," then it will start it again. And then because the effects will ensure that the flow gets the same results, it won’t perform these actions again. It will just, I mean, detect that they already occurred. Does that make sense?

Jamie : [39:00] Yeah yeah no. That makes sense. I just wanted to make sure that we covered that off, in case someone’s listening in and going, "wait a minute. That’s magic. How is that working?" right? So people can understand it a little bit.

Thomas : [39:13] Yeah, that’s good. Yeah, I agree. Yeah, so it’s actually, it’s also a part of what I think is quite nice with this thing: from quite simple ideas you get a very powerful kind of like benefit when you implement these things. So if you think it’s fairly simple, then I mean, you’re on the right track.

Jamie : [39:35] Yeah, yeah, yeah. Because, like, my thing is, when I’m learning some kind of new technology, I want to be able to get started really quickly. So, you know, the the hello world type demo or whatever. But then I also want to understand… get an understanding of what’s happening behind the scenes, so that then the magic is slightly less magical. And that’s not to take away from the the wonder of wow it does this thing, it’s more a case of, "if I make it break, how can I fix it?" that kind of thing.

Thomas : [40:10] Yeah. I think that’s a very good, I mean, attitude definitely. I mean as soon as you get an understanding of how it works, I mean, it also makes it a lot easier to kind of like reason about, I mean, how many resources does it need, or why I’m seeing this weird behaviour. So definitely, I think an important part of using any framework should be that you should be able to understand it at least at a conceptual level, definitely.

Jamie : [40:41] Absolutely. I think it’s Scott Hanselman who says you should always aim to understand a little bit about the next layer down, right? So he often talks about C# and .NET, and he talks about how you should have at least a vague understanding of how IL works, or how the CLR works. Because then you’ve got a better appreciation for what’s actually happening, you should then be able to maybe create more efficient work or better written programs and things like that. And I tend to agree with him, right. If you know that another layer exists below you, then when that layer breaks, if you know a little bit about it, you’re able to actually… you’re able to fix those esoteric issues, or you’re able to then provide better debugging information so that then the person who needs to fix them can fix them, right.

Thomas : [41:34] Yeah definitely. No I agree, I agree 100%. I mean just using like garbage collection right, it’s kind of like invisible to us but it’s not, right? We’re being affected by it, I mean, we can be affected by it in running applications a lot.

Jamie : [41:52] Yeah yeah. And just like maybe just understanding the basics of what the abstraction is rather than how it is implemented right. You know, like knowing that… this is a really old one from when I was at university, it gives you an idea of how long ago I went to university: these newfangled mp3 players, when I push the play button, I don’t need to understand that it sets up a buffer and reads from the internal storage, sends that over to something to be decoded, and then sends a buffer… raises a buffer and and all that kind of thing in the sound card, and sends all that down the cable to your head. All I need to know is that when I push play it loads a file into memory and plays it right. That’s as far as my abstraction needs to be; it doesn’t have to be, you know, it’s one step above, "push button; play music." It’s, "push button; find the file; play the file," right. And then it’s, "push button, load the file into into chunks, decrypt," decrypt is probably not the right word but, "convert," I guess, "from the format into whatever the sound card uses." So you’re slowly building, you’re slowly reversing those abstractions I guess right.

Thomas : [43:10] Definitely. Yeah, no. It’s completely true. And I also think like part of this is also that, I mean, the reason that I ended up, I mean, implementing a framework was just that I could see that things like Hangfire and the kind of like systems built with message queues, they didn’t actually have this simple effect, and also the ability just to suspend an invocation. And also I guess like we’ll talk about in a minute, messages, kind of like how do you handle messages coming in. So that was kind of like why I ended up implementing it. It’s fun to a degree to implement a framework like this, but you’re also doing a lot of fairly basic things that, I mean, it’s kind of like the end product, the end goal that’s nice to reach. Not necessarily, I mean, doing the same as Hangfire is already doing, which I guess Cleipnir is basically doing, and then a bit more.

Jamie : [44:14] Sure, sure. So why don’t we talk about that then? Let’s talk about the the messages, right. So the the reason I wanted to step back and have that quick chat about abstractions is because, you know, the way that Cleipnir works and its framework works is by using abstractions to abstract over this restoring of state. Those are my words, not not your words. So, you know, but being able to get back to where we were and continue processing, and I was just worried that listeners might—at least listeners early in their career—might stick around with that abstraction and just not bother thinking about, "what’s the next level down?"

Jamie : [44:50] But let’s talk about the messages. You know, I have in my notes here that it’s a messages abstraction. So it’s not messages, but it is messages? Like, what’s that and how does that work?

Thomas : [45:02] Yeah, yeah. So I mean I agree, I think it’s really important; and I remember from the beginning of my career also I was so curious about how do things actually work. And you have to, kind of like, I mean, yeah how long have I thought about these things right? You only get to the place where you are by actually thinking about these things. I think that’s awesome, and I hope the listeners, I mean, listening to the show will kind of like try and take the things apart. And, I mean, try to think about, "how do these things actually work?" And yeah if I’m saying anything wrong they can reach out. It would be awesome to, I mean, have other people to talk to about these kind of like esoteric distributed systems concepts.

Thomas : [45:49] But yeah so, messages is kind of like the last thing that I also saw was missing in this kind of like handling, or being able to implement, I mean, these workflows in a nice way. So when you have message-driven systems you normally have something called sagas, right? So there are the big service buses: I mean, Mass Transit, there’s something called Rebus, and there’s also NServiceBus.

Thomas : [46:22] And they normally have this pattern where you kind of like have to define, I mean, a handler per message type. And if you ever try to implement something with a bit complex logic, so it’s not just a flow that goes from one step and then to the next, but if you have kind of like, I mean, if statements and loops and so on, and if you have timeouts also coming in, it can actually be quite difficult to implement a flow using this kind of like approach. And the last kind of like abstraction in Cleipnir is this messages. And that, I mean, it’s an abstraction that basically just, I mean, holds the messages that have been delivered to this flow. But what it allows you to do is that you can take the kind of like same example that we talked about with the order flow. You can just almost one-to-one translate that to a message-driven flow where you get messages from a message queue, where you also just say, "okay, the first step is I’ll send a command to the bus, right, to actually, I mean, to reserve the funds. And then I’ll wait for the reply to come in," kind of like, I mean, confirming that the reservation has been placed. And then after that, I can do another line that actually ships the products. And then I can wait for that. So even though it’s message-driven, you still get this kind of like just simple coding from top to bottom, if that makes sense.
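
The top-to-bottom, message-driven flow described above can be sketched as follows. This is a Python illustration under assumptions, not Cleipnir's messages API: the `OrderFlow` class, the `Suspend` exception, and the message and command names are all hypothetical. The flow keeps the messages delivered to it; when a reply it needs has not arrived, it suspends, and each new delivery re-runs it from the top.

```python
class Suspend(Exception):
    """Raised when the flow needs a message that has not arrived yet."""


class OrderFlow:
    def __init__(self):
        self.effects = {}   # remembered "already sent" markers
        self.messages = []  # messages delivered to this flow so far
        self.sent = []      # commands put on the (pretend) bus

    def send_once(self, command):
        # Same trick as an effect: a re-run must not resend the command.
        if command not in self.effects:
            self.sent.append(command)
            self.effects[command] = True

    def first_of_type(self, message_type):
        for message in self.messages:
            if message["type"] == message_type:
                return message
        raise Suspend(message_type)

    def run(self):
        # Reads top to bottom even though replies arrive asynchronously.
        self.send_once("ReserveFunds")
        self.first_of_type("FundsReserved")
        self.send_once("ShipProducts")
        self.first_of_type("ProductsShipped")

    def resume(self):
        try:
            self.run()
            return "completed"
        except Suspend:
            return "suspended"  # nothing is kept running while waiting

    def deliver(self, message):
        self.messages.append(message)
        return self.resume()


flow = OrderFlow()
state = flow.resume()
after_first_reply = flow.deliver({"type": "FundsReserved"})
after_second_reply = flow.deliver({"type": "ProductsShipped"})
```

Compared with one handler per message type, the ordering of steps lives in a single `run` method rather than being spread across handlers.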

A Request To You All

If you're enjoying this show, would you mind sharing it with a colleague? Check your podcatcher for a link to show notes, which has an embedded player within it and a transcription and all that stuff, and share that link with them. I'd really appreciate it if you could indeed share the show.

But if you'd like other ways to support it, you could:

Leave a rating or review on your podcatcher of choice
- Head over to dotnetcore.show/review for ways to do that
Consider buying the show a coffee
- The BuyMeACoffee link is available on each episode's show notes page
- This is a one-off financial support option
Become a patron
- This is a monthly subscription-based financial support option
- And a link to that is included on each episode's show notes page as well

I would love it if you would share the show with a friend or colleague or leave a rating or review. The other options are completely up to you, and are not required at all to continue enjoying the show.

Anyway, let's get back to it.

Jamie : [48:01] Yeah, absolutely. What I like is that you’re taking a message, perhaps off of the queue or off of whatever abstraction we’re using, and you’re sending it off to be dealt with, but you’re actually waiting for the response before saying, "that has been dealt with," and moving on to the next message you know. That’s… that I think is, I know others do that: they wait for the ACK but, you know, I feel like it’s something that people who implement their own messaging system, their own sort of resilient system sometimes forget about.

Thomas : [48:37] Yeah definitely. So I mean, so that’s… Also I didn’t actually mention, but while you are waiting for the message you can also suspend the flow right. So you don’t get into this denial of service or DDoSing yourself by having too many things in memory at the same time.

Thomas : [48:57] But yeah, so Cleipnir is basically these two main things, right? The effects and the messages, and trying to allow you to write, I mean, code like you don’t really mind that it’s in a message-driven system using a message queue. Or if you have just a normal, I mean, HTTP communication, or you can even, I mean, do like a hybrid where you do a bit of HTTP communication and also have message queues involved. And I think that’s kind of like the benefit that you get, and I guess I appreciate that it’s a bit abstract, at least it seems so a lot when I talk to people about it, but the idea is that even though you have these kind of complicated architectures in play, you’re still able to write code almost like you would normally. You use effects sometimes just to allow the workflow to restart.

Thomas : [50:03] And then you can use messages, which is kind of like, it has a LINQ-like syntax, right? So you can say, "I want two of these messages. So I want this, I mean, payment reserved, or I want a timeout for, let’s say, 30 seconds. I want to wait for that." So the idea is that you end up in a situation where you’re very close to just ordinary code, ordinary C# code. Which I think is actually a lot stronger because it’s what we normally implement everything in. And I think that the reason why C# ended up the way it looks is because it’s kind of like a nice way for us as developers to express our intent, both technical but also business-wise.

Jamie : [50:56] Yeah, I agree with your point there about C# being a great way to implement what we want, the intent, as you put it, in a human-ish language that is also adhering to coding standards.

Jamie : [51:15] So you said that it has like a LINQ-like API, I guess, for these messages. So top of my head, I’m just literally picking some random LINQ things. I can say like messages, assuming that my message queue is called messages, messages.First. And then I can say where perhaps, you know, I don’t know, Priority == 1, right? So that I can actually work through my message queue in perhaps the order that I want as well. Like, let’s say a priority one message comes through and it’s like, this person needs to have this thing happen. Maybe this is a retry or whatever, and it needs to be done now. And then I can say, "hey, go do that thing." In human-like language, is that kind of the case?

Thomas : [52:05] Yeah, yeah. So the idea is that you could, for instance, say, I have one example where you do like a fraud detection in a bank where you send out, let’s say, three requests to different fraud detectors and you want to wait. Let’s say it could be fraud detection on a loan application, and you want to wait for at least two or all three of them, wait for the result that they get to. But you also want to have a timeout, so you don’t want to wait forever, right? So you’ll have a timeout. And then if you receive at least two within that time-frame, you’ll then look through them and see if they all want to approve. Or just if one of them wants to reject, then you’re going to reject the application or detect the fraud. And I think trying to implement this in the old style is not a very nice experience. So yeah, it is kind of like this way that you don’t really care about the order in which the messages come in. You’re just saying that, "I want this message." And then, yeah. Sorry, it gets a bit difficult to explain in English.
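
The fraud-detection rule can be written as a small function over the messages delivered so far. This is a Python sketch under assumptions, not Cleipnir's LINQ-like syntax: the `decide` function and the message shape are hypothetical, and rejecting when the timeout fires with fewer than two replies is an assumed policy that the conversation does not specify.

```python
def decide(messages, timed_out, required=2):
    """Return 'approved', 'rejected', or 'waiting' for a loan application.

    `messages` is everything delivered to the flow so far, in any order.
    """
    verdicts = [m["approved"] for m in messages
                if m["type"] == "FraudCheckResult"]
    if len(verdicts) < required:
        # Assumed policy: too few replies at the timeout means rejection.
        return "rejected" if timed_out else "waiting"
    # Enough replies: approve only if every detector approved.
    return "approved" if all(verdicts) else "rejected"


def reply(approved):
    return {"type": "FraudCheckResult", "approved": approved}


still_waiting = decide([reply(True)], timed_out=False)
too_few = decide([reply(True)], timed_out=True)
two_approvals = decide([reply(True), reply(True)], timed_out=False)
one_rejection = decide([reply(True), reply(False), reply(True)], timed_out=False)
```

Because the function filters the collected messages instead of reacting to each one in a handler, arrival order does not affect the outcome.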

Jamie : [53:26] No, no, I understand that. I really like the idea of being, just as a side note, I really like the idea of being able to sort of suspend the queue as well. So presumably nothing else can be added to it whilst you’re processing. Let’s say you’re doing a big processing task and you want to actually hold off adding anything else to the end of the queue before this process has finished. I like that because then that shows that you are thinking about, you know, "how do we deal with these massive processes?" which is really cool. I really like that. Just wanted to say that that’s nice.

Thomas : [54:10] Yeah, no. I was just saying thank you.

Jamie : [54:16] So I guess then does that mean that you move from a sort of, I’m wary of saying "move away from an object-oriented or functional-oriented paradigm," but what I mean is like an architecture or a workflow? Do you then move away from, you know, "request comes in, I deal with it, response goes out," to more like a message-driven workflow for your apps? Is that the case?

Thomas : [54:42] Yeah, so actually you can decide yourself. So I guess for longer running flows, it’s normally that you would kind of like do it asynchronously. At least you would only do part of the flow and then you would respond back to the browser, to the user waiting, saying it’s been accepted, your order looks good. And then you will send an email or somehow otherwise notify the user that the order has been processed. But actually, so you can decide on whichever paradigm really you like. It’s more about trying to allow you to write, I mean, the code in the style you like. And being close to, I guess it’s called idiomatic, C#. So kind of like the way that you would normally code C#.

Jamie : [55:35] Right, okay. So what I’m thinking is we’re coming towards the end of our time together. So I wonder if folks wanted to get in touch with you and learn a little bit more about Cleipnir. I know that there’s the GitHub, but is there on the GitHub page, is there maybe a getting started guide? I know you said earlier on you’ve got some tutorials and some sort of code snippets that show off some of the processes. So my questions are, how do folks learn more about Cleipnir and sort of get started using it? And is there a way for folks to get in touch with you if they get stuck, or is that not something that you’re interested in?

Thomas : [56:15] It’s definitely something I’m interested in. So, yeah, people do reach out. I guess you can send me messages in GitHub, I guess, but also I’m on LinkedIn. So I guess as a link to this podcast, you can connect to me. So that would be fine. I mean, I would be very happy to help anybody trying it out and also hearing feedback from people. That would be really nice, definitely. And also yeah, sorry, and also your other point Jamie about how to get started: yeah I think the easiest way is to go to the GitHub repo and there are some different kind of like getting started guides, and it talks a bit about some of the different concepts. There’s also a, I guess a YouTube video from Microsoft Open that also shows a bit of kind of like me coding along. It’s only 20 minutes so it’s not as deep a dive as this, but it’s also I guess a good way of getting started.

Jamie : [57:24] Sure, sure. What I’ll do is I’ll make a point of putting all of those into the comments. So I’ll put all of those into the comments and we’ll get that sorted. And then that way folks aren’t looking over there, they’re not driving along and diving over and hitting buttons on their phone to make notes or calling out to their personal assistants on their phones and say, "hey, make a note to go to check this out." They can just actually pull the show notes up and it’s all there. I really appreciate this conversation. I’ve certainly learned a lot about the sort of message-driven… and workflows as well. And especially resilient programming, I hadn’t even thought about like I said, my idea of the system crashing was: the system crashes because it’s in some exceptional state or because the server falls over. It wasn’t, like, the thought that me pushing to production is technically a system crash because it takes the app down and starts it back up, which means there will be some process which hasn’t completed yet. That’s, yeah, that’s an eye-opener for me.

Thomas : [58:39] Nice, good to hear, Jamie. I also think that actually part of this, right, it’s a bit of an unknown or hidden thing for most people, right? And it’s also, I mean, yeah, you only see, I mean, it’s difficult when you just get weird data in a database, right? It’s kind of like, "how did that happen?" I mean, it’s probably not the first thing that people would think of, right? It might just be, "my code is buggy," or, I mean, maybe it was a weird day of the week or something. It’s extremely difficult, right, to get back to that point where you say, "oh, okay, it’s because we have rolling release in our deployment pipeline."

Jamie : [59:20] Absolutely. Well, like I said, Thomas, this has been a fantastic conversation. I’m walking away from this with loads more, actually, a few more questions, but we’ve run out of time. Maybe I can send you an email or something and we can figure them all out.

Thomas : [59:36] Yeah, definitely, Jamie. Yeah, definitely.

Jamie : [59:40] Thank you for being on the show. I really appreciate it.

Wrapping Up

Thank you for listening to this episode of The Modern .NET Show with me, Jamie Taylor. I’d like to thank this episode’s guest for graciously sharing their time, expertise, and knowledge.

Be sure to check out the show notes for a bunch of links to some of the stuff that we covered, and full transcription of the interview. The show notes, as always, can be found at the podcast's website, and there will be a link directly to them in your podcatcher.

And don’t forget to spread the word, leave a rating or review on your podcatcher of choice—head over to dotnetcore.show/review for ways to do that—reach out via our contact page, or join our discord server at dotnetcore.show/discord—all of which are linked in the show notes.

But above all, I hope you have a fantastic rest of your day, and I hope that I’ll see you again, next time for more .NET goodness.

I will see you again real soon. See you later folks.

Useful Links

Paxos
Raft
Polly .NET
Hangfire
Quartz
Inbox and outbox pattern
Idempotence
Azure Durable Functions
Mass Transit
Rebus
NServiceBus
Thomas on LinkedIn
Microsoft Open: Introduction to Cleipnir.Flows a tool to get resilient code
Supporting the show:
Getting in touch:
- via the contact page
- joining the Discord
Music created by Mono Memory Music, licensed to RJJ Software for use in The Modern .NET Show
