Engineering for System Uptime | Azure DevOps Podcast Episode 387

I recently joined Jeffrey Palermo on the Azure DevOps Podcast to talk about a topic near and dear to my heart: engineering for system uptime.

Over 20+ years of building software in the .NET ecosystem, I’ve worked across some pretty demanding industries - military, healthcare, and ticket brokering - where downtime isn’t just inconvenient, it’s costly. In this episode, we dig into the practical strategies and engineering practices that keep systems running reliably.

If you’ve ever had that sinking feeling when a production system goes down at 2 AM, this one’s for you.

Listen to Episode 387: Engineering for System Uptime

What We Covered

Lessons learned from building and maintaining systems that can’t go down
How different industries approach uptime and reliability differently
Practical engineering practices that go beyond just “add more servers”
My experience as CTO at Shows On Sale and what uptime means in the ticketing world
Upcoming events: Hampton Roads DevFest 2026 and Stir Trek (May 1st in Columbus, OH)

Upcoming Events

If you want to catch me in person, I’ll be at a couple of events coming up:

Hampton Roads DevFest 2026 in Norfolk / Virginia Beach
Stir Trek Conference on May 1st in Columbus, Ohio

I’d love to see you there!

If system reliability is on your mind, you might also enjoy my post on monitoring ASP.NET Core application health with health checks or my guide to building Windows services in .NET. And if you want a cautionary tale about cloud architecture decisions, check out my $8,000 serverless mistake.

Azure DevOps Podcast Transcript

This transcript was generated using AI-based transcription. While we’ve done our best to clean it up, there may be errors or inaccuracies. Any mistakes are unintentional.

Introduction and Welcome

Jeffrey Palermo: Welcome to the show. I’m Jeffrey Palermo, your host for helping you and your teams move fast, deliver quality, and run your software with confidence in Azure, all while using everything that the .NET ecosystem has to offer. Quick announcement before we jump in. I have another edition of my advanced .NET software engineer and architect boot camp. That’s for March of this year, 2026, so you can go to clearmeasure.com and look for the schedule, and I’m really excited to do another training course. This is for people who have already pretty much mastered .NET and C#, the programming language, but want to move forward. So I’m excited to do a three-day immersive in-person course in Austin, Texas, and I’m looking forward to teaching that. My guest today on the show is Kevin Griffin. He has over 20 years of software development experience and counting. He’s a passionate and versatile leader, trainer, and consultant in the .NET ecosystem. He’s worked with various industries, from US military to health care to ticket brokering, always looking to deliver high quality solutions and empower his clients and teams to succeed. In his day job, he’s the CTO at Shows On Sale, where he oversees the technical strategy and direction of the company. He’s also served as the president of the .NET Foundation for the last term, and Microsoft has conferred the Microsoft MVP award to him at least 16 times and counting. He speaks at tons of conferences and is a board member of the Stir Trek conference series as well. But Kevin, welcome to the show. How are you, sir?

Kevin Griffin: I’m great, Jeffrey. I need to bring you as my hype man everywhere I go. You’ve done a far better job of explaining me than I think I can explain me. I appreciate that.

Origin Story: From DOS for Dummies to a Career in Software

Jeffrey Palermo: Well, you know, with every passing year, you do more and more. It’s like, OK, there’s a lot of stuff. It’s nonstop, isn’t it? And we’ve known each other for a while. For the audio-only listeners, you’re listening to a very experienced expert in our industry. I want to ask you about some of those things. But before I do, I wonder if we can go back in time to a little bit earlier in your career. Are there any key points that were really foundational in solidifying your desire and passion to effectively be a lifelong programmer?

Kevin Griffin: Yeah. I mean, my origin story, if I go back to before I was even a teenager, I think I was maybe 10 years old. I’m that weird millennial, that elder millennial, who remembers life before the Internet. And I remember being nine, 10 years old, getting my first PC, or I’m sorry, the family PC. We went to Circuit City and my dad spent fifteen hundred dollars on this big box and we brought it home. It was the family computer. And I wasn’t allowed to touch it unless my dad was in the room because it was a very expensive toy. My dad was very adamant, like, don’t touch it, you don’t know what you’re doing. I’m like, OK. But I’m a Navy brat. My dad was deployed often. And I remember pretty distinctly the first deployment he was on after we had the computer. Back then you had to go to libraries if you wanted to learn things. I went and I checked out books, and I had this book, and it was DOS for Dummies. And 10-year-old Kevin is just flipping through this book like, oh, that’s cool, I don’t know what this is, and just being very curious about what this computer could do. I got to a section on QBasic, and it was not a very long section, but it was just introducing the idea that, hey, your computer comes with a programming language. And I’m like, well, what’s that? I built a very simple app where I could tell the computer what to do. And that was the initial fire for, oh, I want to do more of this, and going back to the library and checking out books specifically on BASIC. It was building text adventure games. The public school system back then didn’t know how to help a kid like me who was just very interested in programming. So a lot of it was self-learning, getting books at the library, and that fueled me for the rest of my school career. And I decided, all right, I think this is what I want to do, and went to college for it. 
And the problem in college was that for the first half of college, I already knew everything. So it was just showing up to the classes and trying not to be an entitled jerk to my professors like every teenager is. You just think you know more than the professor does. After that I graduated and got my first job at Symantec on their security team, which blew my mind as a twenty-something. And yeah, I had that job for exactly three months before they decided to lay off our entire team. So that fueled just wanting to get involved in community and keep learning and, more importantly, teach everything I know. I’ve been blessed immensely by this community. Every person I meet, every person I have a relationship with, I’ve learned something from. And I really try to reciprocate that. So if I learn something new, I want to tell everyone about it, because I think in software development in general, we’re in a unique position where we’re all competing against each other kind of indirectly, but we’re all just here to help each other. A rising tide raises all ships. And that’s just what I try to be a part of.

Jeffrey Palermo: Yeah, it’s so fragmented. I mean, it’s almost like people who work at a Taco Bell in one state don’t look down on the Taco Bell in another state, or a Burger King, just because they’re competitors. It’s like, look, your customers are not going to come way over here just to go to that one. Exactly. So yeah. And aren’t we in a perpetual professional software engineer shortage worldwide, not just in the US?

Kevin Griffin: So I hear all different sides of this story. At one point, there’s not enough. At another point, there’s too many and AI is going to take all of our jobs. I don’t think anyone can make up their minds.

The .NET Foundation and Supporting Open Source

Jeffrey Palermo: I think it’s a massive deficit. I mean, the demand for custom software is exploding. And it seems like the AI realm is not saying, oh, professional developers are going to stop doing that. It’s like, no, no, no. All these other people now are going to be able to create certain classes of software when they otherwise wouldn’t. Websites being the simplest: you need a simple marketing website or some little tool. Boom, done. So, well, let me ask you about the .NET Foundation, because that’s something really unique. You’ve been the president of the .NET Foundation. I think there are a lot of listeners who have heard of it before. We’ve had some other people talk about it. But I think that there are a lot of listeners that still haven’t joined or don’t have any involvement. So, quick pitch. Should they care about the .NET Foundation?

Kevin Griffin: I think all .NET developers should care about the .NET Foundation in some way, shape or form. Specifically, what the foundation is here for is to support a vibrant .NET open source ecosystem. A lot of us out there, if we’re developing in .NET, are very likely building with a package that’s open source. That means a person or persons have contributed their time to solving a very specific problem that you would otherwise have to solve yourself, and it could take weeks, months, or years to develop a similar solution. Someone’s just done that work for you. There are a lot of examples out there. If you’ve used Json.NET, that was an open source project, because there used to be a time in .NET where JSON wasn’t very well supported, or supported at all. So the package came along to help developers parse, serialize, and deserialize JSON. Awesome. Now it’s just kind of baked into the framework. But there are a lot of these packages, because the .NET framework can’t do it all, or at least not in the way that we want it to be done. So there are a lot of these packages out there that solve very specific problems for us. And as a maintainer, it’s very easy to get overwhelmed by just the sheer amount of stuff that goes into managing an open source project. What the foundation tries to do is provide various resources that maintainers might not necessarily think about. So things like code signing, if you’re deploying a package and basically putting your code out there into the world. A lot of enterprises and dev shops won’t use it unless it’s signed; they have a requirement that it has to be a signed package. Well, if you’ve ever tried to get code signing certificates, it’s a lot of effort. You have to go through an identification process. There’s a little bit of legalese, and then you have to pay for the certificate. That’s only good for a certain number of months or years. 
And then you have to go through the process again. Now, if you’re open source, that’s money and time out of your resources. The foundation does that for you if you’re a foundation project. We have some projects that have had legal needs outside of code signing. We have an attorney that we work with who knows open source and has worked with our projects in the past to solve various legal issues that might come up. We have the community of fellow maintainers. So if you’re a new maintainer and you’re running into some problems, the odds are that other maintainers have run into these same problems. It’s very nice to be able to go have a kumbaya moment with other maintainers who can give you advice and mentoring. A lot of these projects have been around for 20-some years, as long as .NET’s been around. So they know what some of these newer maintainers are going through. And I think one of the bigger things we do is marketing. A lot of projects exist and people don’t know they exist. So what the foundation really tries to do is tell people, hey, here are all these projects that exist to solve specific problems. We’ve really worked on trying to build systems where we’re actively talking about the projects under the foundation. We even talk about projects that aren’t in the foundation yet or don’t want to be in the foundation. We just want to talk about .NET open source and tell everyone there’s a lot of good work being done out there by a lot of talented people. It exists, and we just try to provide these resources. And we have a lot of great partners that help us do that.

Jeffrey Palermo: Nice, nice. It seems like your teenager self would have loved to meet a Kevin Griffin of the future or somebody experienced like that just to be able to kind of guide and lead you into all these things. Because all these people doing open source projects, some of them might feel alone. And it seems like you’re doing a lot of connecting.

Kevin Griffin: We try to do a lot of connecting. Our projects committee in the foundation is probably our most active out of all of them. And the foundation does a lot of other stuff. We pay for meetups on meetup.com, so if you run a .NET-oriented meetup, we’ll pay for it. And we have resources for education needs and stuff. There was a point where I think the foundation was doing too much. Since I’ve come on to the board, we’ve really tried to pull back and ask, what’s the actual mission of the foundation? And it’s supporting the projects and also the people in the .NET industry.

Stir Trek and Hampton Roads DevFest

Jeffrey Palermo: Yeah, yeah. Well, you’re a leader in a lot of different places. And the Stir Trek local conference in Ohio is another one of those places. And people who haven’t traveled to Ohio, maybe they don’t know about that. But can you tell us just a little bit about that and why folks might want to check it out?

Kevin Griffin: Yeah, Stir Trek is a fun event. They take over a movie theater in Columbus, Ohio, and for the day, it’s nothing but technical talks. So if you’ve ever wanted to sit in a nice comfy chair in a movie theater and watch a technical talk, that’s Stir Trek. We’re talking about 1,300 to 1,500 people all in one space, all day, technical talks. And then at the end of the day, there’s a movie. I’m actually not quite sure what the movie is this year. Last year, it was the Marvel Thunderbolts movie. It had just come out and there was a private showing for all Stir Trek attendees. It’s just included in your ticket price. And the year before that, I think it was the Star Wars Episode I re-release. Every year there’s some sort of nerdy, geeky theme for the conference. This year, there isn’t a movie that aligns, so we might have a special showing of an old geeky movie or two or three or four. I actually don’t know. That hasn’t been decided yet. But yeah, it’s a great time. If you’re within driving distance of Columbus, Ohio, and you want to spend the day just learning inside of a movie theater, Stir Trek is a great opportunity. And it’s Friday, May 1st, 2026.

Jeffrey Palermo: Fantastic. And we’re continuing, keeping going. DevFest, yet another conference that you’re an instigator in. What’s that all about?

Kevin Griffin: So Hampton Roads DevFest is an event that’s specifically for the local Norfolk / Virginia Beach community in southeast Virginia. I’ve always kind of said I can go out to the world, and I can go to dozens and dozens of conferences, and I can impact other communities. But I really want to make sure I’m having an impact on my local community, because we have a lot of software developers here. We have a lot of meetups that want to get going. They want support. How can I support my local community? A long, long time ago, we started Hampton Roads DevFest as just an opportunity to highlight the developer talent that we have here in town. So a couple of rules with Hampton Roads DevFest: the speaker has to be from Hampton Roads. Hampton Roads is the Norfolk / Virginia Beach metro area. So it has to be a local speaker. Anyone can come. It’s cost effective. Our initial tickets were only thirty dollars to get in the door. So you see the sessions, and we still feed you. You hang out and network. We’ve helped a lot of people connect with their new employer at these events, because our partners are also local companies. So everyone’s networking, and really we’re helping show that our metro area is filled with developer talent. We’ve had cases in the past where people come here and get educated and then they’ll leave. They’ll go somewhere else. They’ll go to a larger metro area. And we really would love to keep our developer talent here. It’s better for our area. So we really try to celebrate the folks that we have here in town. This is our sixth year doing it. And every year, I think, is better than the previous.

Building Systems That Don’t Crash

Jeffrey Palermo: Nice. Nice. You do so much speaking and leading, and you were a keynote speaker at the Tech Fest conference in Pennsylvania just in the fall. You also gave some technical talks on dumb mistakes with relational databases and so many other topics. What are you thinking about these days? What part of this technology landscape really captures your mind when you have a free block of time? What is sucking you in?

Kevin Griffin: I’m very boring. I have kind of pushed back on the bleeding edge tech, because I work with clients who can’t necessarily afford, or don’t necessarily need, to go into the bleeding edge tech. They have solutions, or they have problems with the solutions, and those problems have existed forever. I tend to talk about or get excited about just, how do you build a system that doesn’t crash? How do you build a system that you’re not getting a 2 AM wake-up call about? I’ve spent a large amount of time architecting around solving that problem: how can I build something reliable that isn’t going to cause a 2 AM wake-up call? Because I think all of us have had that 2 AM wake-up call of, hey, the system’s down, and every time the system’s down, the company’s losing money in some way, shape or form. So I talk a lot about that stuff and just building robust production systems. I’m on some of the bleeding edge trains. I think you’re doing yourself a disservice if you’re not at least waist deep in AI right now, consuming it in some way, shape or form. We’re definitely not going in the direction of, how do we design a problem for AI to solve? But we’re looking for natural problems where AI could be a solution. And we’re telling our entire team to use it as part of your workflow, and ultimately you have to remember you’re still responsible for the code that goes up for review. So don’t just let the AI do all the work and then indirectly take credit for it. We’re doing heavy, heavy stuff on that. I’m an Azure guy through and through. So it’s looking at the offerings there and how I can improve our workflows with the offerings, or maybe how I can lower cost with the offerings, because if you’re in Azure deep enough, it can get pricey. That’s where I spend a lot of my time. I wish I was deeper in it than I am. I used to be. 
I remember being in my mid 20s, early career, and going, oh man, new .NET stuff. Let me read everything. Let me become an expert. And now I look at the releases and go, that sounds fun. And I wait six months. Usually what happens is six months later, you see what people are actually using in all these different updates. You see what’s going to pick up steam. You also see what Microsoft is really passionate about. Aspire is a good example. Aspire released and I just kind of went, huh, that looks fun. And then I went off and did something else. And now you can see there’s a lot of resources put towards Aspire, and I go, OK, well, maybe this is something actually worth looking into now. So yeah, I know it’s such a boring answer to the question. And I wish I was more exciting.

Jeffrey Palermo: That sounds exciting. I mean, OK, you’ve got multiple decades of experience now. And so, yeah, this new language feature or this new library, it’s like, oh, that’s interesting. But you have the perspective to realize, wait a minute, some massive problems still haven’t been solved once and for all. We are still getting systems in production that go down. People are still getting woken up at night, and businesses are surprised to be losing money because they’re down and we didn’t see it coming. And now everybody’s scrambling. That’s a big industry problem that still isn’t solved. And this new language feature isn’t going to solve that problem. And so you’re putting your brain towards that. I mean, that’s massive. Everybody has to do something to try to keep those “surprise, you’re down; surprise, you’re losing money” moments from happening. So I don’t want to skirt by that too quick, because everybody has to do that. When you go from system to system or client to client, are there things that you’ve figured out where, OK, of course, we’re going to do this? We have to have some of this in place, regardless of what tool we use here, or we don’t have any hope of preventing these outages or destabilizing events.

Logging, Metrics, and Empowering Your Team

Kevin Griffin: Yeah, absolutely. For us, it’s one hundred percent making sure we have a good system for pulling out logs and metrics. So, are the systems running? That’s the easiest. Is everything up and running? If it’s not up and running, why not? Are we throwing errors in our logs? Are our logs readable? We spend a lot of time putting structured logging in place, and it’s not just a random text file. It’s going into a logging system, and anyone on the team has access to it and can go look and see, oh, we’re having a problem here and here and here. There are certain errors that will just get thrown and you’re expecting them to happen, but then there’s stuff that you don’t expect to happen. And if too many issues happen within a certain time period, based off whatever heuristics we come up with, everyone gets an alert. We’re also lucky enough to have a somewhat distributed team where if I’m sleeping, someone else is awake. Everyone’s empowered to fix the problem. I’ve been on some teams where only one person has the keys to fix the problem. And I learned very early on, I never want to be that person. So every time we bring a new person onto our team, we won’t turn the water hose on them right away. It will be more gradual. But the goal is to get that team member to a point where any issue that comes up, they are able to potentially resolve. If I have to get involved, that was a failure on my part. I shouldn’t have to get involved in any problem. And if I am getting involved, it’s because I just happen to be the person there seeing the issue happening. That’s always been the big thing. We have systems in place for making sure that we’re doing graceful deployments. We always have rollback strategies, and we always ask: is this a big release or a small release? And we plan accordingly. Our customers typically have a very defined time frame that they use our system. 
So we’re lucky we don’t have to really set maintenance hours. We could say, all right, after three p.m. Eastern time, we can do this low effort deployment. It’s not going to really affect anyone. So we have systems and processes in place and everyone understands those systems and processes. And that has, knock on wood, served us very well up to this point.
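To make the alerting idea concrete, here is a minimal sketch in Python (not the stack discussed in the episode) of "too many unexpected errors within a time window triggers an alert." Everything in it is an assumption for illustration: the class name `ErrorRateAlert`, the `expected` field on a log event, and the thresholds are hypothetical, and a real team would normally configure this inside its logging platform rather than hand-roll it.

```python
import time
from collections import deque


def log_event(level, message, **fields):
    """Build a structured log event (a dict of named fields) instead of a free-text line."""
    return {"level": level, "message": message, **fields}


class ErrorRateAlert:
    """Signal an alert when more than max_errors unexpected errors fall inside the window."""

    def __init__(self, max_errors, window_seconds):
        self.max_errors = max_errors
        self.window_seconds = window_seconds
        self._timestamps = deque()  # times of recent unexpected errors

    def record(self, event, now=None):
        """Record one event; return True if the alert threshold is exceeded."""
        now = time.time() if now is None else now
        # Non-errors and errors marked as expected never count toward the alert.
        if event.get("level") != "error" or event.get("expected", False):
            return False
        self._timestamps.append(now)
        # Drop errors that have aged out of the sliding window.
        while self._timestamps and now - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) > self.max_errors
```

The design point is the split between expected and unexpected errors: only the second kind moves the counter, so routine noise does not page anyone.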

Monitoring What Matters: Testing Real User Transactions

Jeffrey Palermo: Yeah. Yeah. So you mentioned always knowing that the system is up. And so what is it just? Oh, let me connect application insights. Or are there some patterns that you need to have in the system to really get a faithful answer?

Kevin Griffin: The big thing we look for is we know what our customers commonly do. We have periodic tests that perform those actions, and we know what the expected results of those actions are. So if we have a case where that doesn’t work, we can alert on that. We also run these various tests geographically distributed. This is a nice thing that Application Insights gives you right out of the box. You could say, run this series of tests from various data centers, and you can compare the results. There are cases where there’s an issue, but the issue is confined to an EU data center. So it’s not necessarily even our problem; it’s just a data center problem. We see these issues, we triage them, and we determine: is this an us problem or a them problem? That’s kind of where we start. And a lot of it is much simpler than I think people make it out to be. A lot of folks want to go in and create very complicated setups. And we’re just very, very basic about the telemetry that we do.
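The triage step described here can be sketched in a few lines of Python. This is only an illustration of the logic, under stated assumptions: the function names `run_probe` and `triage`, the region labels, and the three verdict strings are all hypothetical, and in practice a hosted service such as Application Insights runs the probes from its own locations.

```python
def run_probe(action, expected):
    """Run one scripted user action and report whether it produced the expected result."""
    try:
        return action() == expected
    except Exception:
        # Any exception counts as a failed probe; the caller only needs pass/fail.
        return False


def triage(results_by_region):
    """Classify probe results keyed by region name (True = probe passed).

    All regions failing points at our own system; only some failing points at
    a regional or data center issue.
    """
    failed = sorted(region for region, ok in results_by_region.items() if not ok)
    if not failed:
        return "healthy", failed
    if len(failed) == len(results_by_region):
        return "likely-our-problem", failed
    return "likely-regional-problem", failed
```

A verdict like this is a starting hint for the human doing triage, not a final answer: a partial failure can still be caused by something on our side, such as a region-specific configuration.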

Jeffrey Palermo: So, running common transactions and making sure they work. That sounds like the tracer bullet pattern that’s been published, or the spirit of it. Do you all think about it that way?

Kevin Griffin: You know, I’ve never really related it to that. I’m sure there’s a published pattern out there and I’m just emulating what someone else has already figured out 20 years ago, 30 years ago, 40 years ago. It sounds very similar. Go back to like 20 years ago. Everyone was talking about unit testing as the brand new thing. And what we saw with a lot of people talking about unit testing is that they were building tests around code that would never run in production. It’s like, all right, awesome. You built a test, but it never actually tests the thing that I’m worried about. Our tests test the system, and they do it as a privileged or nonprivileged user, depending on what it’s testing. It just sees, can I do the thing that 95 percent of customers do every day? And if it fails, that’s a red flag. We need to go take a look at it.

Jeffrey Palermo: I think that’s a big idea. I’m glad you said that the listeners need to hear that. That’s a big idea because there’s so many tools that you can connect that say, well, the server’s running or that the Web server returned a ping of HTTP 200. So green. Good. Thumbs up. But that’s not the whole story. There could be there could be some dependency of some vendor that you use that has an expired token or something. And now when you get through the path and that fails, guess what? The customer, whatever was behind the button that that customer clicked doesn’t happen.

Dependencies, SLAs, and the Four Nines

Kevin Griffin: Yeah, we do take advantage of health checks, and we health check all of our dependencies, because you’re absolutely right. Nine times out of 10, it’s not our system that fails. It’s one of our dependencies that fails. It’s integration. And you just have to expect that it’s going to fail. And does it fail gracefully? I think that’s just a hard thing to determine. All right, I’m pressing the button, and what does the button do? The button goes to a SQL Server database. Cool. All right, what happens if that SQL Server is down for 20 milliseconds? Because it’s going to be my luck that that’s when the customer hits that database. Now the database is going to throw an error or my timeout code is going to fire. Something’s going to happen, and I can’t gracefully execute whatever action the customer wants to do. How do we handle that? Or if you’re talking to a third party service: we talk to Google APIs often. I mean, they work 99.99 percent of the time. But then there’s that fourth nine where it doesn’t. What happens when that occurs? And the people running these systems, not just our system, but all of these, they’re human. Well, to a degree, right? They’re people putting stuff into action, and you hope you have test coverage for every possible thing that can go wrong. But, you know, we’re going to miss stuff. What you do is you see the issue, you triage the problem, and you fix it, and then maybe write a test for it saying, all right, we don’t want this to happen again. Now we have coverage for that. It’s why I don’t like doing greenfield projects. Everyone loves the idea of a greenfield, but no one actually wants to write a greenfield project. It’s always the, what problems do we not know about yet? Our product is very brownfield at this point. It’s been around for 10 years. 
And what I love about it is we have seen most of the issues. So anything new that pops up is usually something that we just recently introduced.
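As a rough illustration of the two ideas in this answer, checking every dependency and failing gracefully when one blips for a moment, here is a minimal Python sketch. It is not the ASP.NET Core health check API; the helper names `check_dependencies` and `call_with_retry`, the retry count, and the fallback behavior are hypothetical choices made for the example.

```python
import time


def check_dependencies(checks):
    """Run each named dependency check (a callable that raises on failure) and report status."""
    report = {}
    for name, check in checks.items():
        try:
            check()
            report[name] = "healthy"
        except Exception as error:
            report[name] = f"unhealthy: {error}"
    return report


def call_with_retry(operation, attempts=3, delay_seconds=0.05, fallback=None):
    """Retry a briefly failing operation; use the fallback instead of crashing if it never succeeds."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return operation()
        except Exception as error:
            last_error = error
            # Pause before the next try, but not after the final one.
            if attempt < attempts - 1:
                time.sleep(delay_seconds)
    if fallback is not None:
        # Graceful path: for example, queue the work or show a friendly message.
        return fallback(last_error)
    raise last_error
```

A short retry covers the "database is down for 20 milliseconds" case, while the fallback decides what the customer sees when the dependency stays down. Retrying is only safe for operations that can be repeated without side effects, which is an assumption this sketch does not check.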

Jeffrey Palermo: Yeah, you mentioned the four nines. A lot of services publish how many nines, but four nines is fifty-two minutes of downtime per year. So I’m thinking, well, what is it? Twenty-five minutes of downtime one month, and then another month it’s another twenty-five minutes of downtime. I mean, if it happens at the wrong time, that’s a bad thing.

Kevin Griffin: Exactly. For us, it would be horrible if that was eleven o’clock in the morning Eastern time. That’s just when most of our users are online and they’re doing their work. And if our system goes down, that impacts all their businesses. That’s an issue. And I just did a quick lookup of the Google auth APIs, because so many people log in with Google and whatnot. Their uptime SLA allows them twenty-two minutes of downtime every month. Every month. Wow. That’s four and a half hours a year. So it’s going to happen, and they’ll charge you for all that time and they’ll be like, whatever.
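The downtime figures in this exchange are simple arithmetic, and a short Python sketch makes them easy to rerun for any SLA. The function names are hypothetical, and the 30-day month is an assumption made for the calculation; actual SLA terms define their own measurement period.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(availability_percent, period_days):
    """Minutes of downtime an availability target permits over a period."""
    return (100 - availability_percent) / 100 * period_days * MINUTES_PER_DAY


def implied_availability_percent(downtime_minutes, period_days):
    """Availability percentage implied by a downtime allowance over a period."""
    return (1 - downtime_minutes / (period_days * MINUTES_PER_DAY)) * 100


# Four nines over a 365-day year.
print(f"{allowed_downtime_minutes(99.99, 365):.2f} minutes per year")

# Twenty-two minutes per 30-day month, expressed as a percentage and as hours per year.
print(f"{implied_availability_percent(22, 30):.3f} percent")
print(f"{22 * 12 / 60:.1f} hours per year")
```

Four nines works out to about 52.6 minutes a year, matching the figure quoted above, and twenty-two minutes a month adds up to 4.4 hours a year, which is the "four and a half hours" in the conversation. Twenty-two minutes in a 30-day month corresponds to roughly 99.95 percent availability.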

Jeffrey Palermo: Yeah. Yeah. So, with the question of, is it working? Like your story, I remember being a young programmer, 18 years old, and putting my first system in production where people were going to use it to do their job. It wasn’t complicated, but they were using it to do their job. And if nobody was using the application, it’s like, is it going to be good when they do use the application? If somebody was constantly doing a transaction, like buying a ticket or whatever the transaction is, well, that person would report something if there was a problem. It’s almost like you want to design some robot to be this phantom user that’s just constantly running some dummy transaction, where if ever something in that line breaks, you automatically find out about it instead of an actual user.

Using AI to Fill Test Coverage Gaps

Kevin Griffin: This is a place where I think AI is going to be key, at least for us. We’re using it heavily to try to figure out these gaps in our coverage. Like, I’ll ask Claude Code, I need to do an analysis on this specific view. We have some coverage already, but what coverage are we lacking? And with the right skills and agents, it can go in and in 10 minutes tell you, here’s everything you screwed up on. You’re missing coverage on this, this, this, this. And I mean, the reason we have never hit these issues is because it’s a one percent chance of someone ever typing the correct bad response in there, or the combination of bad responses. But you know what? We should have coverage for this. So we ask AI to create the test coverage for us so then we can go fix the bugs. And then, just like you said, in the future, if we make a change and it’s something that we’re not paying attention to, the test suite should, knock on wood, catch it.

Jeffrey Palermo: Yeah, that’s a great point that you just mentioned about filling in missing test coverage. I think everybody has gaps in test coverage. And of course, we don’t have time to just sit there and fill the gaps. What’s the value for that? The value is hidden. But I think the linchpin is if we can design a category of tests. It’s like, OK, what is the pattern to properly isolate this part of the system from other dependencies and test this pathway? If we can figure out the pattern by doing one of that type of test, yeah, then do an AI prompt and say, give me the combinatorial: how many factors, how many combinations of stuff, use this pattern, and give it to me. The generated tests will be good enough.
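
The “one pattern, many combinations” idea can be sketched in Python as a single hand-written check applied to every combination of input factors. Everything here is illustrative: `validate_order` is a hypothetical function under test, and the factor values are invented.

```python
import itertools


def validate_order(quantity, payment_method, promo_code):
    """Hypothetical function under test; stands in for real application code."""
    if not isinstance(quantity, int) or quantity < 1:
        return "rejected"
    if payment_method not in {"card", "invoice"}:
        return "rejected"
    if promo_code is not None and not promo_code.strip():
        return "rejected"
    return "accepted"


# Each factor lists representative good and bad values (all illustrative).
FACTORS = {
    "quantity": [0, 1, 8],
    "payment_method": ["card", "invoice", "unknown"],
    "promo_code": [None, "", "SPRING"],
}


def all_combinations(factors):
    """Yield one keyword-argument dict per combination of factor values."""
    names = list(factors)
    for values in itertools.product(*(factors[name] for name in names)):
        yield dict(zip(names, values))


def check_case(case):
    """The single hand-written pattern, applied to every generated case."""
    result = validate_order(**case)
    assert result in {"accepted", "rejected"}, case
    if case["quantity"] < 1:
        assert result == "rejected", case


def run_all():
    cases = list(all_combinations(FACTORS))
    for case in cases:
        check_case(case)
    return len(cases)


print(run_all(), "combinations checked")
```

Three factors with three values each produce 27 cases. The number grows multiplicatively with every factor added, which is why generating the cases is more practical than writing them by hand.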

Kevin Griffin: Yeah. Yeah. And that’s what we’re seeing. We have sandboxed environments, and we’re getting to the point where we’re going to release our test suite, with user accounts, against our test environment. So we’re not risking ever screwing up customer data. But I could say, you know, go test my login process, and I can give it fake user information. Here’s a username and password for a user on the sandbox. And it will go log in as that user and go do user things and come back and say, hey, we ran into this issue at a very specific point. It knows what it’s looking for. And if it doesn’t see it, it’s going to throw an error. And that’s something that a human has to come look at. We were very bad about that before AI. It was always an afterthought. We should really have a test for this. And you go, all right, well, next sprint, we’ll write the test for that case. And then that never happens. But now with AI, it’s very easy for me to just write the prompt, go get a cup of coffee or go work on something else while it’s figuring out some of the details for me. So it’s really helping us fill the gaps that we just naturally have had in our process.

Health Checks Beyond “Is the Server Running?”

Jeffrey Palermo: Yeah. Yeah. Awesome. As we kind of get to the end of our time here, one of the things that you mentioned that I think is really interesting to listeners is that you use health checks, or use the concept of a health check. Now .NET has a library for it, but I think there’s a lot of .NET developers that don’t have that in their vocabulary: do I have a health check? What the heck is this? Why would I need it? Where would I put it? How do I know if it’s enough? How do you think about that concept?

Kevin Griffin: We look at our dependencies. So the best example I can give you is we depend on SQL Server. And we have a series of tables that we hit often. And what we did with our health check for our SQL Server, it’s not a matter of, can I just connect to the SQL Server? Like, that’s a part of it. But I need to be able to connect to the SQL Server, run a very specific query on that SQL Server, and I expect that query to come back in a certain amount of time. There’s a threshold. So the health check for SQL Server isn’t just connect to the server. It’s connect to the server, run this query, and check whether that query takes more than, I don’t remember what our threshold is, let’s just say 50 milliseconds. If it takes more time than that, because it’s supposed to be a quick query, there’s a potential issue with the server. And we run into deadlocks and stuff often. Sometimes someone will accidentally push a process. I say someone, but really I mean me. I have pushed a process that accidentally, you know, just DDoSes the entire SQL Server. Yeah. All right. Well, when that happens, our health check fails because now that health check is fighting for server resources that are tied up doing something else. And what should be 50 milliseconds might take 10, 15 seconds, or worst case, it times out. If it times out, that fails the health check and someone gets an alert that, hey, there’s an issue. But we do it on all of our external services. So SQL Server, Redis, we have a couple of our own internal services that we rely on. We just make sure they’re up and that they’re talking the way they’re supposed to be talking. And what’s nice with health checks is you can go check all these independently. And then if they fail, it kind of rolls up and you have a failure or a degradation. As we’re pushing more and more services into containers and taking advantage of Azure’s container services, what I love about health checks there is if something’s failing, it can just restart itself.
And we’re looking at our applications and services as, well, they’re all throwaway. So we can stand one up. It could run for a day. It could run for weeks. But at any time that service can get closed, destroyed, and restarted. And usually, as we’re architecting some of these things, if something fails, sometimes the easiest thing to do is just restart the process. And so that’s what we do. We do the health check. If the health check fails, shut it down, restart it. Boom. It all works again. And it’s just kind of changing some of the architecture for the applications and services we’re building.
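
The check described above (connect, run a known-cheap query, compare the elapsed time to a threshold, then roll the individual results up) can be sketched in a few lines. This is a Python illustration, not the .NET health check library and not the actual implementation discussed. It makes three assumptions: an in-memory `sqlite3` database stands in for SQL Server so the sketch runs anywhere, a slow query maps to “degraded” while an error or timeout maps to “unhealthy”, and the 50 ms threshold is the placeholder number from the conversation.

```python
import sqlite3
import time
from enum import IntEnum


class Status(IntEnum):
    HEALTHY = 0
    DEGRADED = 1
    UNHEALTHY = 2


def check_query(connect, query, threshold_ms):
    """Connect, run one known-cheap query, and grade the elapsed time."""
    start = time.perf_counter()
    try:
        connection = connect()
        try:
            connection.execute(query).fetchall()
        finally:
            connection.close()
    except Exception:
        # Connection errors and timeouts both count as a failed check here.
        return Status.UNHEALTHY
    elapsed_ms = (time.perf_counter() - start) * 1000
    return Status.HEALTHY if elapsed_ms <= threshold_ms else Status.DEGRADED


def roll_up(results):
    """Overall status is the worst individual status."""
    return max(results.values(), default=Status.HEALTHY)


def connect_stand_in():
    # sqlite3 in memory stands in for SQL Server so the sketch runs anywhere.
    return sqlite3.connect(":memory:")


results = {
    "database": check_query(connect_stand_in, "SELECT 1", threshold_ms=50),
    "cache": Status.HEALTHY,  # placeholder for a Redis check
}
overall = roll_up(results)
print(overall.name)
if overall is Status.UNHEALTHY:
    print("a container orchestrator would restart this instance")
```

Keeping each dependency as its own entry in `results` is what makes the roll-up useful: the overall status says whether to act, and the individual entries say where to look.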

Jeffrey Palermo: So it seems like if I’m analyzing a particular software system, then I’d want a health check just to come back and say, oh, yeah, the main process here is running. And if the system is, like, five running processes, OK, one for each of the five: these five are running. And then with dependencies, it seems like from every process that’s connected to something else, you want a health check to test: is that connection good and healthy? Is it operating like it should be? And it doesn’t need to be complicated. It just needs to be: yep, is this URL responding to a request? And we don’t actually, you know, I’m thinking about it now. We could go so much deeper with our health checks than what we’re currently doing. I mean, we health check six or seven dependencies. But we really probably have two dozen dependencies that we should be health checking. And we just haven’t gotten there yet. Maybe in the future.

Kevin Griffin: Yeah, it’s almost like over time, the bad outages that have happened cause a reaction: wait a minute, we were out because that connection was down, or that thing was. OK, let’s put a health check around that so that never happens again. And then you have another outage. Well, let’s put a health check around that so it never happens again. It seems like we could all do an inventory and just say, how many running processes do we have? Great, a health check for each. And of all the running processes, how many connections from every process? OK, a health check for each connection. That’s our inventory. Now we know we have our basic health check hygiene. Exactly. And then make sure that data goes somewhere. So if the health check fails, all right, what’s detecting that it fails? Who’s it telling? We’re very heavy in Slack and we have Slack notifications for everything because that’s where the team is. There are some very critical services that send me and another person on the team a text message if something goes down. And we try to make it to the point where if I see that text message, that’s something I can’t ignore. I have to check in with the team and go, someone needs to look at this now because a critical system’s not operating. But nine times out of ten, Slack does the job.
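
The inventory and the two-tier alerting described above can be sketched together. This Python outline is an assumption-heavy illustration: the process and dependency names are hypothetical, and `send_chat` and `send_text` are injected callbacks standing in for a chat webhook and an SMS service. No real API is called.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    name: str
    critical: bool = False


def build_inventory(processes, critical=frozenset()):
    """One check per process, plus one per connection that process makes."""
    checks = []
    for process, dependencies in processes.items():
        checks.append(Check(process, process in critical))
        for dependency in dependencies:
            name = f"{process}->{dependency}"
            checks.append(Check(name, dependency in critical))
    return checks


def route_failure(check, send_chat, send_text):
    """Every failure goes to team chat; critical ones also page by text."""
    message = f"Health check failed: {check.name}"
    send_chat(message)
    channels = ["chat"]
    if check.critical:
        send_text(message)
        channels.append("text")
    return channels


# Hypothetical system: two processes and the things they connect to.
PROCESSES = {"web": ["sql", "redis"], "worker": ["sql"]}
inventory = build_inventory(PROCESSES, critical={"sql"})
for check in inventory:
    print(check.name, "critical" if check.critical else "normal")
```

Keeping the list of text-message checks short is the point of the `critical` flag: a text should stay rare enough that it cannot be ignored.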

Structured Logging and Correlation IDs

Jeffrey Palermo: That’s a good point. And you’d mentioned capturing logs and getting them off somewhere as well. Man, now you’re making me realize that I’ve got unfinished work: besides just the health check completing, it’s super simple to measure how many milliseconds that health check took to complete. Put it in some structured logging, and then over time I can say, wait a minute, that particular health check that represents this connection, is it getting slower over time? How fast did it used to happen? And how fast is it now? It’s changing. Why? Yeah.
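
Recording the duration of each health check as a structured event, then comparing recent timings with older ones, might look like the following Python sketch. The event field names, the window of five samples, and the “twice the baseline” rule are all arbitrary assumptions chosen for illustration.

```python
import json
import statistics
import time


def log_health_check(name, elapsed_ms, sink):
    """Append one structured (JSON) event describing a completed check."""
    event = {
        "event": "health_check",
        "check": name,
        "elapsed_ms": round(elapsed_ms, 2),
        "timestamp": time.time(),
    }
    sink.append(json.dumps(event))


def is_slowing_down(durations_ms, window=5, factor=2.0):
    """Compare the median of the newest samples with the median of the oldest."""
    if len(durations_ms) < 2 * window:
        return False  # not enough history to say anything
    baseline = statistics.median(durations_ms[:window])
    recent = statistics.median(durations_ms[-window:])
    return recent > baseline * factor


sink = []
history = [12, 11, 13, 12, 12, 30, 34, 31, 33, 35]  # invented timings
for elapsed in history:
    log_health_check("sql", elapsed, sink)

durations = [json.loads(line)["elapsed_ms"] for line in sink]
print(is_slowing_down(durations))
```

Medians are used instead of averages so that a single slow run does not trigger the comparison on its own.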

Kevin Griffin: More information is always better. That’s always been our standpoint. I do a talk on logging, and it essentially starts with log everything. And because you have different levels of logging you can do, you can send all the logs to a service. So we use Seq for all of our structured logs. And what’s nice is I can just say, right now, ignore anything that’s not warning or higher. So the logs go in, they just don’t get processed. But if we start seeing issues, all right, maybe I go in and I drop that down to informational or debug level logs. Ideally, you don’t want to change your app. You don’t have to restart your app or do anything. The app just does what it’s going to do. Your logging endpoint is the one that goes, all right, I only really care about warnings and errors right now. And our structured logs are broken up into individual applications and services. So I can see that, all right, this service made a call to that service and this other service is the one that errored. OK, and then funnel down and we can track all that. We’re starting to get more sophisticated where we have actual telemetry IDs between the different services. I’m sorry, correlation IDs. So I know, all right, the request from this user called this, it called that, and they all have the same ID. So I can track the user’s journey through our systems and go, all right, it was this input that caused this issue that did something else. And it’s a lot of data, and we have to have some robust retention scripts as well. As the data comes in, it’s been very easy for us to fill up a couple terabytes of just log data in a day. It’s just because systems do system things, and we have to clean a lot of that out or we have to be selective about what logs we actually take in at any one time.
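
Two ideas from this answer can be sketched with Python’s standard `logging` and `contextvars` modules: a correlation ID shared by every log line in one request, and a level threshold that lives at the log sink instead of in the application. This is an illustration only and does not use Seq or any .NET logging library. `ListSink`, the service names `orders` and `payments`, and `handle_request` are hypothetical.

```python
import contextvars
import logging
import uuid

correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationFilter(logging.Filter):
    """Stamp every record with the correlation ID of the current request."""

    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True


class ListSink(logging.Handler):
    """In-memory stand-in for a central log server with its own level."""

    def __init__(self, level=logging.NOTSET):
        super().__init__(level)
        self.lines = []
        self.addFilter(CorrelationFilter())
        self.setFormatter(logging.Formatter(
            "%(levelname)s %(name)s cid=%(correlation_id)s %(message)s"))

    def emit(self, record):
        self.lines.append(self.format(record))


def build_logger(name, sink):
    """The application logs everything; the sink decides what to keep."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    logger.addHandler(sink)
    return logger


def handle_request(orders, payments):
    """One request touching two services, sharing a single correlation ID."""
    request_id = uuid.uuid4().hex
    token = correlation_id.set(request_id)
    try:
        orders.info("order received")
        payments.warning("payment retried")
    finally:
        correlation_id.reset(token)
    return request_id


sink = ListSink(level=logging.WARNING)
orders = build_logger("orders", sink)
payments = build_logger("payments", sink)
handle_request(orders, payments)   # only the warning is kept
sink.setLevel(logging.DEBUG)       # lower the threshold at the sink, not in the app
handle_request(orders, payments)   # both lines are kept
print("\n".join(sink.lines))
```

Searching the collected lines for one `cid=` value returns every step of that request across both services, which is the “user’s journey” described above.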

Jeffrey Palermo: Yeah, yeah, for sure. Wow. This is good stuff. Well, Kevin, thanks for coming on the podcast and sharing, kind of giving us an update on your industry dealings, and then really good information about keeping systems online and running and knowing that they’re online and running. Knowing is half the battle.

Kevin Griffin: Exactly. There we go. Awesome. Appreciate it. Thanks so much. My pleasure, Jeffrey. Take care.

Jeffrey Palermo: All right. And until next time, dear listener, keep shipping.