How We Cut $350K From Cloud Spend in 6 Months (And What I'd Do Differently)

Migrating an AI-first product from GCP to Azure cut $350K from infrastructure spend over six months. The negotiation that mattered more than the architecture, the $50K we accidentally gave back along the way, and the four arguments for migrating that I'd refuse today.

The migration that saves $350K is the same migration that costs $250K if you sequence it wrong.

At Lavender, we moved an AI-first product from Google Cloud Platform to Microsoft Azure over a six-month window and netted $350K in infrastructure savings against the previous run rate. The number is real. The story is more interesting than the number, because the savings came from somewhere most cloud-cost write-ups don't talk about, and we made $50K worth of mistakes along the way that I'd avoid if I were running it again today.

What follows is the playbook, the wins we didn't expect, the mistakes that bit us, and a section at the end on when I'd refuse to do this migration at all.

The trigger

We did not move clouds because GCP was bad. GCP is excellent. We moved for two reasons that compounded.

First, the product had become AI-heavy and our largest cost line had shifted from compute and storage to inference. Microsoft's commitment-based pricing on Azure OpenAI was meaningfully more flexible than GCP's equivalent terms at the time, especially for a startup our size. The negotiation room was real and large.

Second, our customer base was tilting toward enterprises that already had Microsoft Enterprise Agreements. A non-trivial subset wanted us deployed on Azure for procurement reasons. We weren't going to win those deals as a GCP-only vendor without major friction.

One of those reasons would not have been enough. Both together made the math work.

The rule: lift-and-shift first, optimize after

The single most important architectural decision we made: do not redesign during the migration.

The temptation is enormous. You're already touching everything. Why not refactor the messy parts? Why not move from VMs to managed services while you're at it? Why not adopt that new pattern you read about?

Because every redesign would have multiplied the migration's risk and timeline. We chose to:

  1. Lift services from GCE/GKE to equivalent Azure VMs/AKS, preserving topology where possible
  2. Re-point DNS and verify functional equivalence under load
  3. Decommission GCP
  4. Then, and only then, start optimizing for Azure-native primitives

The lift-and-shift took roughly three months. The optimization phase ran for the next three. Both phases produced savings, but only the second phase produced large savings. Trying to combine them would have produced no migration and a lot of side projects.

Where the money actually came from

I expected the savings to come from architecture. They didn't, mostly. Here's the breakdown of the $350K we actually saved, in approximate order of contribution.

Contract negotiation: ~$180K

Microsoft's BD team was hungry for an AI-first startup that would publicly use Azure OpenAI. We came in with a credible threat of staying on GCP, six months of usage data, and a willingness to commit to a multi-year reserved spend.

The discount we negotiated, both on Azure compute and on AI inference, was meaningfully better than what GCP had offered for an equivalent commitment. This single negotiation accounted for about half the total savings.

The lesson here is uncomfortable for engineers: cloud cost is a sales negotiation, not an architecture problem. If your engineering team has not been told the actual rate card you're paying, they cannot evaluate whether to optimize or to renegotiate. Treat your cloud bill as a contract that gets renegotiated, not a fixed input.

Instance right-sizing on the new platform: ~$80K

The lift-and-shift produced an opportunity. As we moved each service, we sized it from its measured CPU/memory/IO profile under real production load on Azure, rather than carrying over the allocation we'd over-provisioned on GCP three years prior.

About 60% of our compute footprint was provisioned at roughly twice what its workload actually required. Azure's instance taxonomy is slightly different from GCP's, so we couldn't just port the same SKUs; we had to think about each one. Thinking about it surfaced the over-provisioning. Right-sizing during the move saved roughly $80K annualized.

If we'd just lifted the SKUs over and kept GCP, we'd have left this on the table. The migration was the forcing function that made us look at sizes again.
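
If you want to rerun this exercise on your own footprint, the core of it is a filter over observed utilization. A minimal sketch, assuming you've already exported per-service allocations and 30-day p95 usage; the field names, headroom factor, and 2x threshold are illustrative, not from our tooling:

```python
# Hypothetical right-sizing pass: flag services whose allocation sits far
# above observed peak usage. All field names and thresholds are illustrative.
from dataclasses import dataclass


@dataclass
class ServiceProfile:
    name: str
    vcpus_allocated: float
    vcpus_p95: float         # 95th-percentile CPU observed over ~30 days
    mem_gb_allocated: float
    mem_gb_p95: float


HEADROOM = 1.3  # keep ~30% above observed peak


def rightsize(profiles: list[ServiceProfile]) -> list[tuple[str, float, float]]:
    """Return (service, suggested vCPUs, suggested memory GB) for services
    provisioned at 2x or more of what their workload needs."""
    suggestions = []
    for p in profiles:
        want_cpu = p.vcpus_p95 * HEADROOM
        want_mem = p.mem_gb_p95 * HEADROOM
        if p.vcpus_allocated >= 2 * want_cpu or p.mem_gb_allocated >= 2 * want_mem:
            suggestions.append((p.name, want_cpu, want_mem))
    return suggestions
```

The 2x threshold is deliberately conservative: anything it flags is unambiguous, which matters when you're asking service owners to accept smaller machines mid-migration.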

Storage tier rationalization: ~$40K

We had years of data sitting in GCS Standard that was almost never read but was paying Standard rates. Moving to Azure forced us to inventory it. We moved cold archives to Azure Blob Storage's Archive tier and warm-but-rarely-read data to the Cool tier; the active set stayed on Hot.

This had nothing to do with the move except that the move made us look at the data. We'd been "going to get to that" on GCS for two years and never had.
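
The sweep itself is mundane once the inventory is done. A sketch of the shape of it with the azure-storage-blob SDK, using last-modified age as a stand-in for last access (true last-access tracking has to be enabled on the storage account first); the container name and age cutoffs are illustrative:

```python
# Illustrative tiering sweep. Moves blobs to colder access tiers by age,
# using last_modified as a rough proxy for "last touched."
from datetime import datetime, timedelta, timezone

from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")
container = service.get_container_client("archives")  # hypothetical container

now = datetime.now(timezone.utc)

for blob in container.list_blobs():
    age = now - blob.last_modified
    client = container.get_blob_client(blob.name)
    if age > timedelta(days=365):
        client.set_standard_blob_tier("Archive")  # cold: slow rehydration is fine
    elif age > timedelta(days=90):
        client.set_standard_blob_tier("Cool")     # warm-but-rarely-read
    # else: leave the active set on Hot
```

In practice you'd also set lifecycle management policies so the tiering keeps happening after the one-time sweep, instead of becoming the next thing you're "going to get to."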

Inference spend on managed endpoints: ~$30K

The Azure OpenAI committed-tier pricing kicked in once we had three months of stable usage data, and the negotiated rate meaningfully beat our prior on-demand inference cost. The number is smaller than I'd hoped because most of the inference negotiation was already counted in the contract line above.

Kubernetes consolidation: ~$20K

On GCP we had run two GKE clusters (production and staging) for historical reasons that no longer applied. On Azure we collapsed them into a single AKS cluster with namespace isolation. Smaller line item but pure profit going forward.
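
"Namespace isolation" is doing real work in that sentence. A minimal sketch with the official Python kubernetes client of the kind of guardrail that makes a shared cluster tolerable; the namespace name and quota numbers are illustrative, and traffic isolation between the two environments would additionally want a NetworkPolicy:

```python
# Illustrative guardrail for a shared cluster: give staging its own
# namespace and cap its resource requests so it can't starve production.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

# An isolated namespace for staging inside the shared AKS cluster.
core.create_namespace(
    client.V1Namespace(metadata=client.V1ObjectMeta(name="staging"))
)

# Hard cap on what staging workloads can request in aggregate.
core.create_namespaced_resource_quota(
    namespace="staging",
    body=client.V1ResourceQuota(
        metadata=client.V1ObjectMeta(name="staging-quota"),
        spec=client.V1ResourceQuotaSpec(
            hard={"requests.cpu": "16", "requests.memory": "64Gi", "pods": "100"}
        ),
    ),
)
```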

The $50K we cost ourselves back

The mistake column. There were three.

Egress costs we didn't model. During the migration we ran services on both clouds simultaneously to validate parity. Cross-cloud calls produced GCP egress charges we hadn't fully forecast: about $25K of unplanned spend over the three-month overlap. We could have minimized this by snapshotting test data into Azure once instead of having Azure services pull live from GCP databases. Lesson: model the migration overlap period as its own line item before kickoff.

Snapshot strategies in the cross-cloud period are also worth investing in: data versioning, write-through caching, and a clear policy on which cloud is the source of truth at each stage.
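
For scale, this is the back-of-the-envelope model we should have built before kickoff. The egress rate and volumes below are placeholders picked to land near our actual $25K, not numbers from our bill:

```python
# Placeholder overlap-period model; substitute your own rate card and volumes.
GCP_EGRESS_PER_GB = 0.12            # $/GB, illustrative internet egress rate
CROSS_CLOUD_GB_PER_MONTH = 70_000   # live reads pulled from GCP by Azure services
OVERLAP_MONTHS = 3

live_pull_cost = GCP_EGRESS_PER_GB * CROSS_CLOUD_GB_PER_MONTH * OVERLAP_MONTHS

# The alternative we skipped: snapshot the test data into Azure once,
# then let Azure services read locally.
SNAPSHOT_GB = 70_000
one_time_snapshot_cost = GCP_EGRESS_PER_GB * SNAPSHOT_GB

print(f"live cross-cloud pulls over the overlap: ${live_pull_cost:,.0f}")
print(f"one-time snapshot instead:               ${one_time_snapshot_cost:,.0f}")
```

Even with placeholder numbers, the shape is the point: paying egress once beats paying it every month of the overlap.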

Reserved-instance lock-in we couldn't unwind. We had committed-use discounts on GCP that didn't fully expire until two months after the migration completed, so we paid for compute we no longer used. About $20K. We should have audited the commitment schedule up front and timed the migration to the commit cycle.

Observability tooling we double-paid for. Our APM vendor billed by host count. During overlap we doubled host count and got the bill. Roughly $5K. Trivial in the bigger picture, but irritating.

Total given back: about $50K out of $400K of theoretical savings. Net: $350K.

The six-month timeline

For anyone planning a similar migration, the rough cadence:

  • Month -2 to 0: Negotiation. Stand up an Azure tenant. Run pilot workloads. Get pricing in writing. This is where the contract savings come from. Do not skip this.
  • Month 1: Move stateless services. Logging, metrics, web tier. Validate Azure parity for everything you don't store state in.
  • Month 2: Move stateful services with active replication strategies. Databases, caches, queues. This is the dangerous month.
  • Month 3: Cut over DNS. Run on Azure as primary. Keep GCP warm for rollback for two weeks.
  • Month 4: Decommission GCP. End commitments. Final billing reconciliation.
  • Month 5–6: Optimize. Right-size, consolidate, rationalize storage. This is where ongoing savings get baked in.

The first two months are where most teams blow the timeline. They underestimate the negotiation phase and how long it takes to validate stateful service parity. Both are worth getting right; both reward patience.

When I would refuse to migrate today

Cloud migrations are massively over-prescribed. About four times out of five, when a founder or CTO asks me whether they should migrate clouds, the right answer is no.

Things that make me push back:

  • "We'd save 20% by switching." 20% is not enough to justify a six-month migration. The opportunity cost of the engineering team is higher than 20% of cloud spend at any company under $50M ARR. Renegotiate your existing contract first.
  • "Their managed service is better." Maybe. But the cost of porting all your code, dependencies, and operational muscle memory to a new managed service is large and rarely accounted for. Use better managed services where you are unless the gap is enormous.
  • "We're getting a free credit grant." Free credits are a customer-acquisition cost the cloud is paying. If your decision is dominated by 12 months of free credits, you're optimizing for the wrong year.
  • "We want to be multi-cloud for resilience." Multi-cloud for HA is one of the most expensive forms of theater in software engineering. Single-region failures are rare; cross-region within one cloud handles most of what people imagine multi-cloud handles. Unless your customers contractually require multi-cloud, run on one cloud and run it well.

The migrations that make sense are the ones where there's a structural reason — major customer demand, contract leverage, a real product capability difference, or a regulatory requirement. "We can probably save some money" is not enough.

Treat the cloud bill as a living document

The biggest enduring lesson from this migration was not about migrations. It was about how to relate to your cloud bill in general.

For the three years before the move, our cloud bill was something the finance team looked at and the engineering team didn't. The infrastructure had been "right-sized" once, four years prior, and never revisited. The contract had been negotiated once, two years prior, and never revisited. Both were wildly suboptimal by the time we looked.

The fix going forward — and the practice I now insist on at every company — is treating the cloud bill as a living document with two scheduled review cadences:

  • Monthly: a 30-minute review by the engineering and finance leads of the top 10 line items, week-over-week deltas, and any anomalies. The goal is to spot $5K-a-month leaks before they become $60K-a-year leaks. (A minimal sketch of this check follows the list.)
  • Annually: a contract review three months before any major commit expires. Renegotiate from a position of leverage, with usage data in hand and a credible alternative quoted.
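
The mechanics of the monthly check fit in a dozen lines. A sketch, assuming a cost export CSV; the column names and the 20% alert threshold are illustrative:

```python
# Monthly review sketch: top 10 line items and week-over-week deltas.
# Assumes a cost export CSV with columns: date, line_item, cost.
import pandas as pd

df = pd.read_csv("cost_export.csv", parse_dates=["date"])
df["week"] = df["date"].dt.to_period("W")

weekly = df.groupby(["line_item", "week"])["cost"].sum().unstack("week")
latest, prior = weekly.columns[-1], weekly.columns[-2]

top10 = weekly[latest].nlargest(10).to_frame("this_week")
top10["last_week"] = weekly.loc[top10.index, prior]
top10["delta_pct"] = 100 * (top10["this_week"] / top10["last_week"] - 1)

# Anything moving more than 20% week-over-week gets a human look.
print(top10[top10["delta_pct"].abs() > 20])
```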

Neither of these is hard. Both are skipped at most companies because no one owns them. Assign an owner and run the cadence. The savings will be quiet but persistent.

The takeaway

If the math works, sequence it like this: negotiate first, lift-and-shift second, optimize third. Do not redesign during the migration. Model the overlap period. Time the move to your existing commit cycle. Expect about half the savings to come from the contract renegotiation and the other half from the operational hygiene the migration forces on you.

And if the math doesn't work, do not migrate. Renegotiate where you are. Optimize what you have. The cloud you're already on is almost always the right cloud, until it isn't, and the math tells you when "until" arrives.