Azure Infrastructure: Lessons from the Field That Microsoft Doesn't Teach You

2026-03-31

Microsoft’s documentation is excellent at telling you what Azure services do. It’s terrible at telling you what to actually do with them. There’s a canyon between “this is what Azure Policy is” and “this is how you implement CIS compliance baselines across 50 subscriptions without breaking everything.”

I’ve spent the last few years building Azure infrastructure for real organizations. Not in labs. Not for certifications. For companies that need their stuff to work on Monday morning. Here’s what I’ve learned that no exam will teach you.

Landing Zones: The Foundation Everyone Skips

Every Azure architecture starts with a landing zone. Or it should. In practice, most organizations skip this step entirely and start deploying resources directly into a single subscription. Then, 18 months later, they hire someone like me to untangle the mess.

A landing zone is your subscription topology, your network architecture, your identity model, your security baseline, and your governance framework. It’s not exciting. It doesn’t demo well. It’s also the single most important decision you’ll make because every resource you deploy inherits its constraints.

Azure Verified Modules

If you’re building landing zones in 2026, you should be using Azure Verified Modules. Not because Microsoft says so, but because writing your own modules from scratch is reinventing the wheel with a higher chance of getting the security defaults wrong.

I wrote about enforcing CIS compliance with AVM in detail. The key insight is that AVM modules come with security defaults baked in. You’re not adding compliance as an afterthought. You’re starting with it.

The catch? AVM modules are opinionated. They make decisions for you. Sometimes those decisions don’t match your requirements. Knowing when to use AVM as-is and when to wrap it with your own module is a judgment call that requires understanding both the module and your organization’s needs.

Terraform vs Bicep: The Honest Truth

Every Azure engineer eventually faces this question. Here’s my answer after using both in production:

Terraform wins if you’re multi-cloud or if your team already knows it. The state management is annoying but well-understood. The provider ecosystem is massive. The community is larger.

Bicep wins if you’re Azure-only and want the closest-to-native experience. No state file to manage. Deployment is a first-class Azure operation. What-if is built in.

The honest truth: the tool matters less than the practices around it. A well-structured Bicep deployment with proper CI/CD, testing, and review processes beats a Terraform setup where people run terraform apply from their laptops.

I’ve seen teams spend months debating Terraform vs Bicep while their infrastructure has zero version control, zero testing, and zero review process. Fix the process first. Pick the tool second.

Security Baselines: CIS Is the Starting Point, Not the Goal

CIS benchmarks are great. They give you a concrete, measurable security baseline. They’re what auditors ask for. They’re what compliance frameworks reference.

They’re also generic by design. CIS tells you to disable public blob access. That’s good. But CIS doesn’t know that your application needs a public blob container for user-uploaded images with a CDN in front of it. That’s where your own security decisions start.

The Compliance Pipeline

The pattern that works:

Start with CIS as your default deny. Everything is locked down.
Document exceptions with business justification. Not “we need it” but “application X requires public endpoint Y because Z, approved by security team on this date.”
Enforce with Azure Policy in audit mode first, then deny mode. Never go straight to deny unless you enjoy emergency calls.
Test with pre-deployment checks. Your CI/CD pipeline should catch policy violations before they hit production, not after.
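
The four steps above can be sketched as a pre-deployment check. This is a minimal illustration, not an Azure Policy implementation: the resource dictionaries, the `allow_blob_public_access` property, the rule name, and the `PolicyException` record are all hypothetical stand-ins for whatever your pipeline extracts from a plan or what-if output.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PolicyException:
    """A documented exception: which resource, which rule, why, who approved, when."""
    resource_name: str
    rule: str
    justification: str
    approved_by: str
    approved_on: date


def rule_no_public_blob_access(resource: dict) -> bool:
    """Hypothetical baseline rule: storage accounts must not allow public blob access."""
    if resource.get("type") != "storage_account":
        return True
    return resource.get("allow_blob_public_access", False) is False


# Default deny: every rule applies to every resource unless an exception is on file.
RULES = {"no-public-blob-access": rule_no_public_blob_access}


def evaluate(resources, exceptions, mode="audit"):
    """Return (violations, blocked). Audit mode only reports; deny mode blocks on any violation."""
    excepted = {(e.resource_name, e.rule) for e in exceptions}
    violations = []
    for resource in resources:
        for rule_name, rule in RULES.items():
            if rule(resource):
                continue
            if (resource["name"], rule_name) in excepted:
                continue
            violations.append((resource["name"], rule_name))
    blocked = mode == "deny" and bool(violations)
    return violations, blocked


# Illustrative data only: names, approver, and date are made up.
planned = [
    {"name": "stlogs", "type": "storage_account", "allow_blob_public_access": False},
    {"name": "stuploads", "type": "storage_account", "allow_blob_public_access": True},
    {"name": "stscratch", "type": "storage_account", "allow_blob_public_access": True},
]
exceptions = [
    PolicyException(
        resource_name="stuploads",
        rule="no-public-blob-access",
        justification="User-uploaded images served through a CDN",
        approved_by="security team",
        approved_on=date(2026, 1, 15),
    )
]
```

Running `evaluate(planned, exceptions, mode="audit")` reports the undocumented public container without failing the pipeline. Switching to `mode="deny"` fails it. That mirrors the audit-first rollout: you see what would break before you break it.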

The tools for this exist: Azure Policy, Defender for Cloud, Sentinel. The challenge isn’t tooling. It’s discipline. It’s the difference between “we have policies” and “we enforce policies and review exceptions quarterly.”

Cost Optimization: The Conversation Nobody Wants to Have

Azure costs are predictable until they aren’t. I’ve seen monthly bills double overnight because someone deployed a D-series VM for testing and forgot to shut it down. I’ve seen organizations pay for premium storage on development databases that nobody uses on weekends.

What Actually Reduces Costs

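Auto-shutdown on non-production resources. This is the single highest-impact, lowest-effort optimization. A schedule that deallocates dev/test VMs at 19:00 and starts them again at 08:00 cuts their compute cost by roughly half from the first day. Azure Policy can enforce that the schedule exists, but the stopping and starting is done by a scheduler such as Azure Automation, not by Policy itself. Disks keep billing while the VM is deallocated.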
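
The roughly 50% figure falls straight out of the schedule. Here is the arithmetic as a small sketch, assuming compute is billed per hour of runtime and the VM is fully deallocated outside the window. The function names are mine, and storage costs are ignored.

```python
HOURS_PER_WEEK = 7 * 24  # 168


def running_hours_per_week(start_hour, stop_hour, days_per_week=7):
    """Hours a VM runs per week when started at start_hour and stopped at stop_hour the same day."""
    if not 0 <= start_hour < stop_hour <= 24:
        raise ValueError("expected 0 <= start_hour < stop_hour <= 24")
    if not 0 <= days_per_week <= 7:
        raise ValueError("expected 0 <= days_per_week <= 7")
    return (stop_hour - start_hour) * days_per_week


def compute_savings_fraction(start_hour, stop_hour, days_per_week=7):
    """Fraction of always-on compute hours avoided by the schedule."""
    running = running_hours_per_week(start_hour, stop_hour, days_per_week)
    return 1 - running / HOURS_PER_WEEK


every_day = compute_savings_fraction(8, 19)                       # 77 of 168 hours running
weekdays_only = compute_savings_fraction(8, 19, days_per_week=5)  # 55 of 168 hours running
print(f"{every_day:.1%} {weekdays_only:.1%}")  # 54.2% 67.3%
```

Running 08:00 to 19:00 every day avoids about 54% of compute hours. Keeping the VMs off on weekends as well pushes that to about 67%.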

Right-sizing. Most VMs are oversized because someone picked “Standard_D4s_v3” during initial deployment and never revisited it. Azure Advisor tells you this for free. Listen to it.

Reserved Instances for production. If it’s running 24/7 and you’re paying on-demand, you’re overpaying by 30-60%. This is not a technical decision. This is a finance decision that IT needs to drive.

Tagging discipline. You can’t optimize what you can’t attribute. Every resource needs an owner, an environment tag, and a cost center. No exceptions. Enforce this with Azure Policy on day one, not after you get the first surprise bill.
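
A tag check is easy to run in a pipeline before anything deploys. This is a minimal sketch: the tag keys `owner`, `environment`, and `cost_center` are my assumed spelling of the three tags above, and the input is a plain dictionary rather than a real Azure API response.

```python
# Assumed tag keys; adjust to whatever your organization standardizes on.
REQUIRED_TAGS = ("owner", "environment", "cost_center")


def missing_tags(resource_tags):
    """Return the required tags that are absent or blank, in REQUIRED_TAGS order."""
    # Keys are compared case-insensitively; a tag with an empty value counts as missing.
    present = {
        key.lower()
        for key, value in resource_tags.items()
        if str(value).strip()
    }
    return [tag for tag in REQUIRED_TAGS if tag not in present]


print(missing_tags({"Owner": "platform-team", "environment": "dev"}))  # ['cost_center']
```

Treating a blank value as missing matters in practice. A tag that exists but says nothing is no better for cost attribution than no tag at all.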

I built an AI agent that finds Azure cost waste automatically - not because the analysis is hard, but because nobody does it manually with enough consistency.

The Patterns That Keep Breaking

After years of building Azure infrastructure, I see the same mistakes over and over:

“We’ll add security later”

No you won’t. Security debt compounds faster than technical debt. If you don’t have NSG rules on your subnets in dev, you won’t have them in production either. Build it in from day one or accept that you’re building on a foundation of hope.

“We don’t need IaC for this small project”

Every small project becomes a big project. The portal deployment that “only takes a minute” becomes the undocumented snowflake that nobody can recreate when it breaks. Start with IaC. Always. Even if it feels like overkill.

“Our naming convention is flexible”

Flexible naming conventions aren’t naming conventions. Pick a standard, enforce it with policy, and accept that it won’t be perfect. An imperfect standard that everyone follows beats a perfect standard that nobody follows.
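
A convention is only enforceable if a machine can check it. As a sketch, assume a hypothetical pattern of `<type>-<workload>-<env>-<region>-<instance>`, such as `vm-billing-prod-weu-001`. The pattern, the allowed environments, and the field lengths are illustrative choices, not a Microsoft standard.

```python
import re

# Hypothetical convention: type-workload-env-region-instance, all lowercase.
NAME_PATTERN = re.compile(
    r"(?P<type>[a-z]{2,5})-"
    r"(?P<workload>[a-z0-9]{2,12})-"
    r"(?P<env>dev|test|prod)-"
    r"(?P<region>[a-z]{3,4})-"
    r"(?P<instance>\d{3})"
)


def parse_name(name):
    """Return the name's components as a dict, or None if it breaks the convention."""
    match = NAME_PATTERN.fullmatch(name)
    return match.groupdict() if match else None


print(parse_name("vm-billing-prod-weu-001"))
print(parse_name("BillingVM1"))  # None
```

This is also where “it won’t be perfect” shows up. Storage account names cannot contain hyphens, so a pattern like this needs a documented variant for them. That is fine. One documented variant is still a standard.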

“We manage access manually”

RBAC managed through portal clicks is RBAC that nobody can audit. Define your role assignments in code. Review them quarterly. Remove access you can’t justify. This is boring work that prevents interesting incidents.
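
Once role assignments live in code, the quarterly review becomes a diff. A minimal sketch, assuming you can export the actual assignments into the same shape as the declared ones. The `RoleAssignment` tuple, the principals, and the scope strings are hypothetical.

```python
from typing import NamedTuple


class RoleAssignment(NamedTuple):
    principal: str
    role: str
    scope: str


def diff_assignments(declared, actual):
    """Compare assignments declared in code with those found in the environment."""
    declared_set, actual_set = set(declared), set(actual)
    return {
        # Exists in Azure but not in code: remove it or justify and declare it.
        "unjustified": sorted(actual_set - declared_set),
        # Declared in code but not present: drifted or never applied.
        "missing": sorted(declared_set - actual_set),
    }


declared = [
    RoleAssignment("grp-platform", "Contributor", "/subscriptions/prod"),
    RoleAssignment("grp-auditors", "Reader", "/subscriptions/prod"),
]
actual = [
    RoleAssignment("grp-platform", "Contributor", "/subscriptions/prod"),
    RoleAssignment("alice@example.com", "Owner", "/subscriptions/prod"),
]
report = diff_assignments(declared, actual)
```

Here `report["unjustified"]` holds the Owner assignment someone clicked together in the portal, and `report["missing"]` holds the auditors' Reader role that never got applied. Both lists should be empty after each review.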

What I Wish Someone Had Told Me

Start with governance, not services. The first thing you deploy should be your management group hierarchy, your policies, and your naming conventions. Not your first VM.

Read the Azure Architecture Center. It’s free, it’s comprehensive, and it’s written by people who’ve seen more failures than you have. The Cloud Adoption Framework is genuinely useful, not just marketing.

Talk to your security team early. Not after you’ve built the architecture. Before. Their requirements will change your design. Better to discover that in week one than week twelve.

Budget for operations, not just deployment. Building the infrastructure is 30% of the work. Running it, monitoring it, patching it, optimizing it - that’s the other 70%. Plan for it.

Accept that you’ll get things wrong. The goal isn’t a perfect architecture on day one. It’s an architecture that can evolve without a complete rebuild. Design for change, not for perfection.

The Bottom Line

Azure infrastructure is not complicated because Azure is complicated. It’s complicated because organizations are complicated. The technology is well-documented. The organizational challenges (governance, security, cost, compliance, team skills) are not.

The engineers who succeed at this aren’t the ones who know every Azure service. They’re the ones who understand how to make technology decisions in the context of business constraints. That requires wearing multiple hats, understanding security, and being honest about what you don’t know.

Certifications help with the vocabulary. Experience helps with the judgment. You need both, but if I had to pick one, I’d pick the judgment every time.
