The AI Slop Problem Nobody Wants to Admit

2026-03-26

Here is the uncomfortable thing about AI-generated code that I do not see enough people saying out loud. The problem with it is not that it is worse than what a junior engineer would write. The problem is that it is better-looking than what a junior engineer would write, and that is what makes it dangerous.

Bad human code announces itself. The naming is inconsistent, the structure is tangled, the imports are disorganized, the tests are missing, and any experienced reviewer spots it in ten seconds and sends it back. You do not need to read the logic, because the shape of the code already tells you to slow down. Your reviewer instincts kick in before your conscious reading does.

AI-generated code does not announce itself. It follows every style guide it has ever seen. It has consistent naming, tidy structure, full docstrings, reasonable tests, and the kind of polish that signals “a careful engineer wrote this”. Those signals are real, but they are no longer reliable, because the thing that used to earn them (thinking hard about the problem) is now decoupled from the thing that displays them (looking like someone thought hard about the problem). A bot can ship the visible layer of care without any of the underlying care.

That decoupling is the whole problem. The industry is calling it AI slop, and I want to explain three things in order. First, what the visible layer of care actually hides. Second, why volume turns a small review gap into a structural one. Third, what this means for the job of code review itself, because the instincts that made you a good reviewer before are the exact instincts AI output is trained to bypass.

The Decoupling, Specifically

When I say the appearance of care is decoupled from actual care, I mean something concrete. An LLM trained on millions of public repositories has learned what careful code looks like on the surface. It has not learned why any of that code is shaped that way. The difference only matters when the answer to “why” is load-bearing, which in real software is most of the time.

Here is a Terraform example I deal with weekly. You ask a language model for a module that provisions an Azure web app with a SQL backend. You get back a module with variables, descriptions, outputs, a README, and a test file. It deploys. It passes validation. It looks like a competent engineer took the time to structure things properly.

Then you look closer. The SKU is hardcoded to a premium tier that costs three times what you need, because the training data skewed toward production examples. The SQL server has public network access enabled, because most tutorials showed it that way. It uses count where for_each would have kept Terraform from modifying or destroying the wrong resources when an item is removed from the middle of a list, because count appears more often in training data than for_each. There is no lifecycle block anywhere, because lifecycle blocks vary by organization and the average case does not include them. And the README confidently describes this as “production-ready with CIS compliance considerations”, because that phrasing is what the training data associated with high-quality modules.

None of those problems are bugs. The module does what it says.

The problem is that every decision was made by averaging across a corpus instead of reasoning about the specific workload, the specific compliance requirements, and the specific cost profile. Averaging across a corpus gives you the most common answer. The most common answer is almost never the right answer for your situation, because your situation has constraints that are not in the corpus.

This is the core of the slop. It is not broken code. It is code where every decision was made without regard to whether the decision should have been made at all. The output looks like a thoughtful engineer wrote it, because the output was trained on thoughtful engineers writing it. The thinking is the part that did not come along for the ride.
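To make the count versus for_each point from the Terraform example concrete, here is a toy Python model. It is a simplified sketch, not Terraform's actual planner: the function names, the "app" resource name, and the three workload names are all hypothetical. The only thing it assumes is that count addresses instances by list position while for_each addresses them by a stable key.

```python
def count_addresses(names):
    """Model `count`: each instance is addressed by its list position."""
    return {f"app[{i}]": name for i, name in enumerate(names)}


def for_each_addresses(names):
    """Model `for_each`: each instance is addressed by a stable key."""
    return {f'app["{name}"]': name for name in names}


def plan(before, after):
    """Compare two address maps and report what would be touched."""
    # Same address, different workload behind it: the instance gets rewritten.
    changed = sorted(a for a in before if a in after and before[a] != after[a])
    # Address no longer present: the instance gets destroyed.
    destroyed = sorted(a for a in before if a not in after)
    return {"changed": changed, "destroyed": destroyed}


# Hypothetical workloads; "web" is removed from the middle of the list.
old = ["api", "web", "worker"]
new = ["api", "worker"]

count_plan = plan(count_addresses(old), count_addresses(new))
for_each_plan = plan(for_each_addresses(old), for_each_addresses(new))

print("count:   ", count_plan)
print("for_each:", for_each_plan)
```

In this model the count version rewrites the instance at index 1 and destroys the one at index 2, even though only "web" was removed. The for_each version touches only the "web" instance. Nothing about the generated module's surface tells a reviewer which of those two behaviours they are signing off on.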

Volume Changes the Math

If the only problem were that AI output is subtly wrong in the way I just described, careful teams could catch it in review. The review would take longer than usual, because the reviewer would have to read the logic instead of relying on shape, but teams could adapt. That is not what is happening.

What is happening is that volume is exploding at the same time. Before language models, an engineer might produce fifty to a hundred lines of code a day in a focused session. That is the natural ceiling for code a human writes by thinking about each line. With AI in the loop, the same engineer produces five hundred or a thousand lines in the same day. The extra lines are not all bad. Some are genuinely useful. But the rate of problems per line has not dropped to compensate for the new volume, so the absolute number of problems per engineer per day has multiplied roughly in step with the line count.

Review capacity did not scale up with it. A reviewer can still only read so much code per hour before their attention degrades, and that limit did not move just because the author got faster. Teams that adopted AI tooling in 2025 and early 2026 ended up with review queues growing faster than reviewers could drain them. The only way to keep the queue moving was to spend less time per pull request, and that is exactly what happened.

So each pull request now contains code that is subtly wrong in ways that require reading logic to catch, and each reviewer has less time per pull request than before. The gap between what a careful review would catch and what the review actually catches is the largest it has been in the time I have worked in this industry. It is widening every quarter.
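The arithmetic in this section can be sketched in a few lines of Python. The line counts are the upper ends of the ranges quoted above. The problem rate and the daily review budget are made-up placeholders, not measurements, chosen only so the ratios are easy to see.

```python
def problems_per_day(lines_per_day, problems_per_1000_lines):
    """Absolute problems introduced per day at a fixed per-line rate."""
    return lines_per_day * problems_per_1000_lines / 1000


def review_seconds_per_line(review_seconds_per_day, lines_per_day):
    """Scrutiny available per line when the review budget is fixed."""
    return review_seconds_per_day / lines_per_day


# Hypothetical placeholders, not measured values.
PROBLEMS_PER_1000_LINES = 10
REVIEW_SECONDS_PER_DAY = 3600

# Upper ends of the ranges given in the text.
LINES_BEFORE = 100
LINES_AFTER = 1000

problems_before = problems_per_day(LINES_BEFORE, PROBLEMS_PER_1000_LINES)
problems_after = problems_per_day(LINES_AFTER, PROBLEMS_PER_1000_LINES)
scrutiny_before = review_seconds_per_line(REVIEW_SECONDS_PER_DAY, LINES_BEFORE)
scrutiny_after = review_seconds_per_line(REVIEW_SECONDS_PER_DAY, LINES_AFTER)

print("problems per day:", problems_before, "->", problems_after)
print("review seconds per line:", scrutiny_before, "->", scrutiny_after)
```

Whatever values you plug in for the two placeholders, the ratios stay the same: ten times the lines at a constant rate means ten times the problems, reviewed with a tenth of the attention per line.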

Review Is Now the Product

Put the pieces together. AI-generated code looks polished even when it is wrong, because it mimics the visible signals of careful work without the underlying thinking. Volume multiplies the output without multiplying the reviewer’s time, so per-PR scrutiny has dropped exactly when it needed to rise. The defense mechanism that used to keep your codebase healthy, experienced humans noticing that something feels off, does not fire anymore. Nothing looks off. Everything looks textbook-perfect, because it was copied from textbooks.

That changes what code review is for. It used to be a secondary quality gate. A reviewer caught the occasional bug, enforced style, pushed back on bad design choices. The primary quality came from the person writing the code, who thought carefully because thinking carefully was the only way to produce code at all. You reviewed other people’s work because two sets of eyes are better than one, not because the first set of eyes was unreliable.

Now the first set of eyes might not exist. The author might be a language model, the prompt might have been ten words, the decision about whether the output was right might have been “looks reasonable, ship it”. When that is the case, review is no longer secondary. Review is the entire mechanism. You are not double-checking someone’s careful work. You are the only careful step in the pipeline.

That is a much bigger job than code review used to be, and most teams have not adjusted. They treat AI output like a senior engineer’s work when they should treat it like a stranger’s work. They spend the same three minutes they used to spend on a human pull request, and they are surprised when production breaks in ways that pattern-matching a diff could never have caught.

The engineers who will do well in this environment are not the ones who generate the most code. They are the ones who can look at generated code and say with confidence why it is wrong, what it is missing, and what the author should have done instead. That is not a skill you pick up by using AI more. It is a skill you pick up by understanding the systems you work with deeply enough that you can notice when polished output is quietly off. Which means the best way to stay valuable as an engineer in 2026 is paradoxically the least AI-native one: build things by hand, trace problems through layers, and grow the kind of instinct that AI output cannot fake because it does not come from pattern matching.

I used to run a sandwich shop before I worked in tech, and there is a rule in food service that applies here. You do not serve food you have not tasted, and you definitely do not serve food from a line you have not watched. Not because the cook is untrustworthy, but because you are the last person between the kitchen and the customer, and that is not a role you can delegate. Quality used to be the cook’s job. Now it is the line supervisor’s. The work is the same, but the stakes moved and almost nobody is talking about it.

Your code deserves someone who tastes it. In 2026 that someone has to be you, because the cook is a machine that has never eaten.
