Bas Franken

The Art of Troubleshooting Is Slowly Disappearing

There’s a skill that’s quietly dying in our industry, and nobody’s talking about it because we’re all too busy pasting error messages into Claude.

It’s troubleshooting. Real troubleshooting. The kind where you actually understand what went wrong instead of trying random fixes until something sticks.

The Old Way (That Actually Worked)

When I started in IT, debugging looked like this:

  1. Read the error message. Like, actually read it.
  2. Think about what changed recently.
  3. Form a hypothesis.
  4. Test it.
  5. Repeat until fixed.

Simple? Yes. Boring? Also yes. But it built something important: a mental model of how systems actually work.

When you’ve manually traced a network packet through iptables rules at 2am because production is down and your manager is breathing down your neck, you understand networking. Not “I watched a YouTube video about OSI layers” understand. Actually understand.

The New Way (That Worries Me)

Now debugging looks like this:

  1. Copy error message.
  2. Paste into AI.
  3. Apply suggested fix.
  4. If it doesn’t work, paste the new error.
  5. Repeat until something works or you give up and restart the pod.

I’ve watched junior engineers do this. I’ve watched senior engineers do this. And the fix works! Most of the time. But ask them why it worked and you get a blank stare.

Why This Actually Matters

“But Bas, if the fix works, who cares?”

You will. At 3am. When the AI suggests the same three fixes that don’t apply to your situation because your environment has a quirk that isn’t in any training data.

Here’s a real example: we had an Azure DevOps pipeline that kept timing out, but only on Tuesdays. Only. On. Tuesdays. An AI would tell you to increase the timeout, check the agent pool, or look at resource constraints. All reasonable. All wrong.

The actual cause? A scheduled compliance scan ran every Tuesday morning and consumed all the bandwidth on the self-hosted agent’s subnet. You’d only figure that out by understanding the infrastructure, talking to the security team, and correlating timestamps like a detective in a crime show.
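The timestamp correlation itself is mostly mechanical, and it’s worth knowing how little code it takes. A minimal sketch in Python (the timestamps are made up for illustration; in practice you’d export them from your CI system’s run history): bucket the failure times by weekday and see whether one day dominates.

```python
from collections import Counter
from datetime import datetime

# Hypothetical pipeline-failure timestamps (ISO 8601), stand-ins for
# whatever your CI run history actually gives you.
failures = [
    "2024-03-05T06:12:00",
    "2024-03-12T06:20:00",
    "2024-03-19T06:15:00",
    "2024-03-26T06:18:00",
]

by_weekday = Counter(
    datetime.fromisoformat(ts).strftime("%A") for ts in failures
)
print(by_weekday.most_common(1))  # → [('Tuesday', 4)]
```

That tells you *when*, not *why*; the “why” still requires knowing your infrastructure and talking to the humans who run the scan.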

No AI is going to ask your security team about their scan schedule.

The Espresso Machine Analogy

I used to run a sandwich shop. When the espresso machine broke, I didn’t Google “espresso machine not working.” I knew that machine. I knew its sounds, its quirks, the way the pressure gauge dropped slightly before the heating element was about to fail.

That knowledge came from cleaning it every day, fixing small issues, understanding how steam, pressure, and water flow worked together.

Infrastructure is the same. You need to have broken things to understand how they break. You need to have fixed things the hard way to know when the easy way is wrong.

What I’m Not Saying

I’m not saying “throw away your AI tools.” I use Claude Code every day and I’ll fight anyone who tries to take it away from me. It’s incredible for the 80% of problems that are well documented and solvable by pattern matching.

But I am saying: if you can’t explain why the fix works, you haven’t fixed anything. You’ve just postponed the next outage.

The Fix (Ironically, Not AI-Generated)

If you’re a junior engineer, do this: next time something breaks, don’t paste the error anywhere. Spend 30 minutes actually reading logs, checking recent changes, and forming your own hypothesis. Even if you’re wrong, you’ll learn more from that process than from 100 correct AI suggestions.
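If you want a concrete starting point for that half hour, here’s a sketch of the first step: pull the error lines from the window leading up to the failure so you read them in context instead of pasting the first one you see. The log format, the function name, and the 30-minute window are all assumptions; adapt them to your stack.

```python
from datetime import datetime, timedelta

def errors_before(log_lines, failed_at, window_minutes=30):
    """Collect ERROR-level lines from the window leading up to a failure.

    Assumes '<ISO timestamp> <LEVEL> <message>' lines; adjust the
    parsing to whatever your logs actually look like.
    """
    cutoff = failed_at - timedelta(minutes=window_minutes)
    hits = []
    for line in log_lines:
        ts_str, level, message = line.split(" ", 2)
        ts = datetime.fromisoformat(ts_str)
        if level == "ERROR" and cutoff <= ts <= failed_at:
            hits.append((ts, message))
    return hits

# Hypothetical excerpt from a failed run:
log = [
    "2024-03-19T05:40:00 INFO agent heartbeat ok",
    "2024-03-19T06:05:00 ERROR upload timed out after 300s",
    "2024-03-19T06:14:00 ERROR job cancelled",
]
recent_errors = errors_before(log, datetime.fromisoformat("2024-03-19T06:15:00"))
for ts, message in recent_errors:
    print(ts, message)
```

The point isn’t the script. The point is that *you* read the output, notice the upload timed out before the job was cancelled, and form a hypothesis about why.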

If you’re a senior engineer, do this: stop letting your team skip the debugging process. Code review catches bad code. But nobody reviews bad troubleshooting, and that’s where the real damage happens.

The tools are getting better. The engineers need to keep up. And “keeping up” doesn’t mean using AI faster. It means understanding enough to know when AI is wrong.

Which, for the record, is more often than the marketing pages suggest.