Bas Franken

The Art of Troubleshooting Is Slowly Disappearing

There’s a skill that’s quietly dying in our industry, and nobody’s talking about it because we’re all too busy pasting error messages into Claude.

It’s troubleshooting. Real troubleshooting. The kind where you actually understand what went wrong instead of trying random fixes until something sticks.

The Old Way (That Actually Worked)

When I started in IT, debugging looked like this:

  1. Read the error message. Like, actually read it.
  2. Think about what changed recently.
  3. Form a hypothesis.
  4. Test it.
  5. Repeat until fixed.

Simple? Yes. Boring? Also yes. But it built something important: a mental model of how systems actually work.

When you’ve manually traced a network packet through iptables rules at 2am because production is down and your manager is breathing down your neck, you understand networking. Not “I watched a YouTube video about OSI layers” understand. Actually understand.

The New Way (That Worries Me)

Now debugging looks like this:

  1. Copy error message.
  2. Paste into AI.
  3. Apply suggested fix.
  4. If it doesn’t work, paste the new error.
  5. Repeat until something works or you give up and restart the pod.

I’ve watched junior engineers do this. I’ve watched senior engineers do this. And the fix works! Most of the time. But ask them why it worked and you get a blank stare.

Why This Actually Matters

“But Bas, if the fix works, who cares?”

You will. At 3am. When the AI suggests the same three fixes that don’t apply to your situation because your environment has a quirk that isn’t in any training data.

Here’s a real example: we had an Azure DevOps pipeline that kept timing out, but only on Tuesdays. Only. On. Tuesdays. An AI would tell you to increase the timeout, check the agent pool, or look at resource constraints. All reasonable. All wrong.

The actual cause? A scheduled compliance scan ran every Tuesday morning and consumed all the bandwidth on the self-hosted agent’s subnet. You’d only figure that out by understanding the infrastructure, talking to the security team, and correlating timestamps like a detective in a crime show.
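The timestamp correlation itself is mostly mechanical, and it’s worth knowing how little code it takes. A minimal sketch in Python (the timestamps are made up for illustration; in practice you’d export them from your CI system’s run history): bucket the failure times by weekday and see whether one day dominates.

```python
from collections import Counter
from datetime import datetime

# Hypothetical pipeline-failure timestamps (ISO 8601), stand-ins for
# whatever your CI run history actually gives you.
failures = [
    "2024-03-05T06:12:00",
    "2024-03-12T06:20:00",
    "2024-03-19T06:15:00",
    "2024-03-26T06:18:00",
]

by_weekday = Counter(
    datetime.fromisoformat(ts).strftime("%A") for ts in failures
)
print(by_weekday.most_common(1))  # → [('Tuesday', 4)]
```

That tells you *when*, not *why*; the “why” still requires knowing your infrastructure and talking to the humans who run the scan.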

No AI is going to ask your security team about their scan schedule.

The Espresso Machine Analogy

I used to run a sandwich shop. When the espresso machine broke, I didn’t Google “espresso machine not working.” I knew that machine. I knew its sounds, its quirks, the way the pressure gauge dropped slightly before the heating element was about to fail.

That knowledge came from cleaning it every day, fixing small issues, understanding how steam, pressure, and water flow worked together.

Infrastructure is the same. You need to have broken things to understand how they break. You need to have fixed things the hard way to know when the easy way is wrong.

What I’m Not Saying

I’m not saying “throw away your AI tools.” I use Claude Code every day and I’ll fight anyone who tries to take it away from me. It’s incredible for the 80% of problems that are well documented and solvable by pattern matching.

But I am saying: if you can’t explain why the fix works, you haven’t fixed anything. You’ve just postponed the next outage.

The Fix (Ironically, Not AI-Generated)

If you’re a junior engineer, do this: next time something breaks, don’t paste the error anywhere. Spend 30 minutes actually reading logs, checking recent changes, and forming your own hypothesis. Even if you’re wrong, you’ll learn more from that process than from 100 correct AI suggestions.
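If you want a concrete starting point for that half hour, here’s a sketch of the first step: pull the error lines from the window leading up to the failure so you read them in context instead of pasting the first one you see. The log format, the function name, and the 30-minute window are all assumptions; adapt them to your stack.

```python
from datetime import datetime, timedelta

def errors_before(log_lines, failed_at, window_minutes=30):
    """Collect ERROR-level lines from the window leading up to a failure.

    Assumes '<ISO timestamp> <LEVEL> <message>' lines; adjust the
    parsing to whatever your logs actually look like.
    """
    cutoff = failed_at - timedelta(minutes=window_minutes)
    hits = []
    for line in log_lines:
        ts_str, level, message = line.split(" ", 2)
        ts = datetime.fromisoformat(ts_str)
        if level == "ERROR" and cutoff <= ts <= failed_at:
            hits.append((ts, message))
    return hits

# Hypothetical excerpt from a failed run:
log = [
    "2024-03-19T05:40:00 INFO agent heartbeat ok",
    "2024-03-19T06:05:00 ERROR upload timed out after 300s",
    "2024-03-19T06:14:00 ERROR job cancelled",
]
recent_errors = errors_before(log, datetime.fromisoformat("2024-03-19T06:15:00"))
for ts, message in recent_errors:
    print(ts, message)
```

The point isn’t the script. The point is that *you* read the output, notice the upload timed out before the job was cancelled, and form a hypothesis about why.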

If you’re a senior engineer, do this: stop letting your team skip the debugging process. Code review catches bad code. But nobody reviews bad troubleshooting, and that’s where the real damage happens.

The tools are getting better. The engineers need to keep up. And “keeping up” doesn’t mean using AI faster. It means understanding enough to know when AI is wrong.

Which, for the record, is more often than the marketing pages suggest.