The Ruby to Elixir Migration That Cut Our Service Footprint From Ten to Six
We started with ten Ruby and Elixir services serving real-time messaging for 450K students across 900+ universities. Two years later we had six, all but one of them Elixir, and on-call alerts had halved. This is the migration order, the patterns we leaned on, and what I'd do differently today.
We had ten microservices that were 60% Ruby and 40% Elixir. Two years later we had six, all but one of them Elixir, and our on-call alert volume had halved.
The migration was less about the language and more about what running real-time messaging for 450,000 active students across 900 partner universities forced us to think about. Memory pressure. Long-running connections. Concurrency that didn't tip over. Operational ergonomics that made on-call survivable. Ruby could do all of these things, but every solution required a layer of accidental complexity that Elixir's runtime gave us for free.
What follows is the migration playbook from InsideTrack — the order that worked, the patterns we leaned on, the unexpected wins, and the parts I'd do differently with what I know now.
The stack we started with
The platform served two-way messaging between coaches and students. Mostly SMS, some email, increasing volume of in-app chat. The architecture, when I joined:
- Three Rails monolith services (web, API, admin)
- Two Sidekiq workers (one for messaging dispatch, one for analytics ingestion)
- Three small Sinatra services (one webhook receiver, one cron scheduler, one feature-flag service)
- Two early Phoenix services (a real-time inbox and a notification dispatcher) — both written by the previous team in a "let's try Elixir" experiment
Total: 10 services, 6 Ruby, 4 Elixir. Combined, the team operated 60+ background workers and a Postgres cluster handling several thousand writes per second at peak.
The motivation to consolidate wasn't ideological. It was operational. The Ruby services were memory-hungry, the Sidekiq workers had to be horizontally scaled aggressively to keep up with peak load, and the on-call rotation was getting paged 8–12 times per night during exam season because of the cumulative weight of running too many services.
The trigger to start moving
Two specific events forced the decision.
First, we lost a contract with a large university because our P99 messaging dispatch latency spiked above the contractual threshold during an exam-season peak. The latency wasn't a code bug; it was Sidekiq queue depth backing up because the worker fleet couldn't scale fast enough. We could have thrown more Sidekiq workers at it, but the marginal cost was high enough that we'd have eaten the contract margin.
Second, our on-call engineer quit. The exit interview was honest: too many services, too much ambient alert noise, no clear ownership boundaries. The hit to team morale was as expensive as the lost contract.
The combined message, both customer-facing and internal, was that the architecture was the bottleneck. Not the team's effort, not their skill, not the underlying tech of any single service. The number of services was the problem, and the runtime characteristics of Ruby + Sidekiq made consolidation in Ruby genuinely hard. The BEAM, the Erlang virtual machine that Elixir runs on, offered a runtime that could handle the same workload with one or two services instead of seven.
The right migration order
The first lesson I learned was that migrations work backwards. You don't migrate the easy thing first; you migrate the thing that's most painful to keep on the old stack.
Our order, in retrospect:
- The messaging dispatcher. The most painful service. The one driving the on-call alerts. Migrating it first meant on-call ergonomics improved within the first quarter and the team had visceral evidence the migration was paying off.
- The analytics ingestion worker. Second-most painful. Sidekiq queue depth here was a chronic capacity issue. Re-implementing it as a GenStage pipeline in Elixir cut memory usage by roughly 70%.
- The webhook receiver and cron scheduler. Smaller services we consolidated into a single Phoenix app with multiple endpoints and a Quantum scheduler. Saved two services in one move.
- The feature-flag service. Replaced wholesale with a managed service (LaunchDarkly). Not strictly an Elixir migration — but the Ruby-to-Elixir framing forced us to evaluate "is this our problem to host at all?" and the answer was no.
- The Rails admin service. Migrated to Phoenix LiveView. Surprised us by being one of the easier moves once we got over the learning curve.
- The Rails API service. Migrated last and most carefully. This was the customer-facing surface; we ran a dual-deploy period for two months with traffic mirrored to both stacks for parity testing.
- The Rails web monolith. Stayed Ruby. We never migrated it. Too much business logic, too low a marginal benefit. Lesson: not everything needs to move.
Final state: six services, all Elixir except the Rails web monolith. One major Phoenix app handling messaging dispatch, ingestion, webhooks, scheduling, and admin. Three smaller Phoenix apps for the inbox, notifications, and a public API. Plus the Rails web monolith. Down from ten.
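The parity testing during the API service's dual-deploy period boils down to comparing what both stacks return for the same mirrored request. A minimal sketch of that comparison follows, not our production code: the module name `ParityCheck`, the response shape (`status` plus a decoded `body`), and the volatile keys `"request_id"` and `"served_at"` are all hypothetical.

```elixir
defmodule ParityCheck do
  # Hypothetical keys that legitimately differ between stacks and
  # should not count as a parity failure.
  @volatile_keys ["request_id", "served_at"]

  # Compares the responses both stacks produced for one mirrored request.
  def compare(rails, phoenix) do
    cond do
      rails.status != phoenix.status -> {:mismatch, :status}
      normalize(rails.body) != normalize(phoenix.body) -> {:mismatch, :body}
      true -> :match
    end
  end

  # Drops volatile keys at every nesting level before comparing.
  defp normalize(body) when is_map(body) do
    body
    |> Map.drop(@volatile_keys)
    |> Map.new(fn {key, value} -> {key, normalize(value)} end)
  end

  defp normalize(body) when is_list(body), do: Enum.map(body, &normalize/1)
  defp normalize(other), do: other
end
```

Counting the `{:mismatch, _}` results per endpoint gives a rough signal of which routes are safe to cut over and which still diverge.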
The wrong order I tried first
My initial plan, before reality course-corrected it, was to start with the API service. Reasoning: it's the most visible, it has the most code, getting it migrated first proves the platform.
That plan was wrong. The API service was the riskiest single move and had the lowest operational pain associated with it. We would have spent six months on a high-risk migration that wouldn't have meaningfully reduced on-call burden, while the messaging dispatcher kept paging us. The team would have lost faith in the migration before we got to the actually painful services.
The corrected ordering — pain first, value-prove second, polish last — is the framework I'd use again. Migrate the service that's hurting you most, even if it's not the most strategic one. The early operational win pays for the political capital you'll spend later on the harder migrations.
The Elixir patterns we leaned on
Three primitives from OTP and the Elixir standard library did the bulk of the work.
GenServer for stateful work. The messaging dispatcher's previous architecture was Sidekiq + Postgres rows for state. Re-implementing as GenServers per active conversation eliminated the database churn for state machine transitions and let us hold conversation state in memory cheaply. The supervision tree handled crashes per-conversation without taking down the whole dispatcher.
Registry for routing. Looking up "which GenServer handles conversation 42" is a few microseconds with Registry. We used it everywhere — for active conversations, for active user sessions, for active webhook subscriptions. Dead simple, fast, and it eliminated a class of "where does this message go" problems that had been complex in the Ruby version.
Supervision trees for failure isolation. The single most important property of the Elixir runtime, for us, was that one bad message crashes only the process handling it, not the service. A ten-thousand-conversation dispatcher might have one or two crashing GenServers at any given moment; they get restarted in milliseconds and the other 9,998 conversations don't notice. Sidekiq could not give us this without significant infrastructure investment.
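To make the three primitives concrete, here is a minimal sketch of how they fit together, using only the standard library. It is an illustration under assumptions, not our production dispatcher: the module names (`Dispatcher`, `Dispatcher.Conversation`), the state fields, and the `deliver/2` function are hypothetical.

```elixir
defmodule Dispatcher.Conversation do
  # One process per active conversation; state lives in memory.
  # :transient means the process is restarted only after an abnormal exit.
  use GenServer, restart: :transient

  def start_link(conversation_id) do
    GenServer.start_link(__MODULE__, conversation_id, name: via(conversation_id))
  end

  # Hypothetical API: record an outbound message for a conversation.
  def deliver(conversation_id, message) do
    GenServer.call(via(conversation_id), {:deliver, message})
  end

  # Registry answers "which process handles conversation N".
  defp via(conversation_id) do
    {:via, Registry, {Dispatcher.Registry, conversation_id}}
  end

  @impl true
  def init(conversation_id) do
    {:ok, %{id: conversation_id, status: :idle, sent: 0}}
  end

  @impl true
  def handle_call({:deliver, message}, _from, state) when is_binary(message) do
    new_state = %{state | status: :awaiting_reply, sent: state.sent + 1}
    {:reply, {:ok, new_state.sent}, new_state}
  end
end

defmodule Dispatcher do
  # Starts the Registry and the supervisor that owns conversation processes.
  def start do
    children = [
      {Registry, keys: :unique, name: Dispatcher.Registry},
      {DynamicSupervisor, name: Dispatcher.ConversationSupervisor, strategy: :one_for_one}
    ]

    # If the Registry dies, the processes registered in it restart too.
    Supervisor.start_link(children, strategy: :rest_for_one)
  end

  # Starts a supervised process for one conversation.
  def open(conversation_id) do
    DynamicSupervisor.start_child(
      Dispatcher.ConversationSupervisor,
      {Dispatcher.Conversation, conversation_id}
    )
  end
end
```

After `Dispatcher.start()` and `Dispatcher.open(42)`, a call to `Dispatcher.Conversation.deliver(42, "hello")` is routed through the Registry to the right process. If that process crashes, the DynamicSupervisor restarts it under the same key and no other conversation is affected. One caveat the sketch makes visible: a restarted process begins with fresh state, so anything that must survive a crash still has to be persisted somewhere.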
The fourth pattern, less universal but useful: GenStage for backpressure-aware pipelines. The analytics ingestion worker was a GenStage pipeline with explicit demand-driven flow control. Made the queue-depth-spike pattern that had been killing us in Sidekiq simply not exist as a category.
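Demand-driven flow control is easier to see in code than in prose. The sketch below is a toy two-stage pipeline, assuming the `gen_stage` package from Hex (it is not part of the standard library). The module names, the in-memory list standing in for the real event source, and the `sink` process are hypothetical.

```elixir
# Standalone script; assumes the gen_stage package is available from Hex.
Mix.install([{:gen_stage, "~> 1.2"}])

defmodule Ingest.Producer do
  use GenStage

  def start_link(events) do
    GenStage.start_link(__MODULE__, events, name: __MODULE__)
  end

  @impl true
  def init(events), do: {:producer, events}

  # Runs only when a consumer asks for more work, and emits at most
  # `demand` events. The producer never pushes unasked.
  @impl true
  def handle_demand(demand, pending) when demand > 0 do
    {batch, rest} = Enum.split(pending, demand)
    {:noreply, batch, rest}
  end
end

defmodule Ingest.Consumer do
  use GenStage

  def start_link(sink), do: GenStage.start_link(__MODULE__, sink)

  @impl true
  def init(sink) do
    # max_demand caps how many events can be in flight to this consumer.
    {:consumer, sink, subscribe_to: [{Ingest.Producer, max_demand: 10, min_demand: 5}]}
  end

  @impl true
  def handle_events(events, _from, sink) do
    # Stand-in for the real write (for example, a batched insert).
    send(sink, {:ingested, events})
    {:noreply, [], sink}
  end
end

{:ok, _producer} = Ingest.Producer.start_link(Enum.to_list(1..25))
{:ok, _consumer} = Ingest.Consumer.start_link(self())
```

Because the consumer only ever asks for what it can handle, a slow consumer slows the producer down instead of letting a queue grow without bound. A real producer would also need to remember unmet demand and satisfy it when new events arrive; this sketch skips that because its source is a fixed list.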
The unexpected wins
Halved on-call alerts. By far the biggest morale and retention win. The team that had been getting paged 8–12 times a night dropped to 3–4. Not because the services were doing less work, but because they handled load shedding, partial failures, and self-healing without paging humans.
Better dev ergonomics for the kind of work we did. Pattern matching against incoming messages made the dispatcher code dramatically clearer than the Ruby case statements it replaced. iex with remote shell into a running production node was an operational superpower.
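As an illustration of the pattern-matching point, here is a small sketch of routing inbound messages with function heads. The payload shapes, keys, and return tuples are hypothetical, not our actual message format.

```elixir
defmodule Dispatcher.Inbound do
  # Each clause matches one message shape. Clauses are tried top to
  # bottom, so the opt-out clause must come before the general SMS one.
  def route(%{"channel" => "sms", "body" => body}) when body in ["STOP", "UNSUBSCRIBE"] do
    :opt_out
  end

  def route(%{"channel" => "sms", "from" => from, "body" => body}) do
    {:sms, from, body}
  end

  def route(%{"channel" => "email", "from" => from, "subject" => subject}) do
    {:email, from, subject}
  end

  def route(%{"channel" => "chat", "conversation_id" => id, "body" => body}) do
    {:chat, id, body}
  end

  # Anything that matches no known shape is surfaced explicitly.
  def route(other), do: {:unroutable, other}
end
```

The destructuring and the branching happen in the same place, so adding a new channel means adding a clause rather than threading another condition through an existing case statement.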
Hiring quality went up. This surprised me. The Elixir candidate pool is smaller, but the candidates who self-select into Elixir tend to be more curious and more rigorous than the Ruby candidate average. We hired better engineers per interview hour after the migration than before.
The unexpected losses
Gem ecosystem. I missed Devise. I missed ActiveAdmin. I missed Sidekiq Pro's UI. There were Elixir-equivalent libraries for most of these, but the Elixir ecosystem in 2018–2019 was visibly less mature, and rolling our own auth or admin UI cost more time than the migration math accounted for.
Hiring pool narrower. Yes, the candidates who came through were better. But the funnel was smaller. We'd see 30 Ruby applicants for every 5 Elixir applicants. For a small team this didn't matter. For a team scaling fast, it would have been a constraint.
Internal training cost. Engineers coming from Ruby need 2–3 months to be productive in Elixir. We absorbed that cost but it was real and it slowed the migration. Account for it explicitly in your timeline.
When NOT to migrate to Elixir today
The math has shifted somewhat since 2019. I would not unconditionally recommend a Ruby-to-Elixir migration today. The cases where I'd push back:
- You're not running real-time, long-lived connections. The killer features of the BEAM are concurrency and supervision. If your workload is short, request-response, and stateless, Ruby/Rails on a modern hosting platform is genuinely fine.
- Your team has zero Elixir experience and you're already understaffed. The 2–3 month productivity dip per engineer is real. If you can't afford it, don't start.
- Your product is dominated by AI features, not real-time messaging. The AI ecosystem in Python is significantly stronger than in Elixir. Most AI startups today should be in Python or Go for the AI portion, regardless of what the rest of the stack runs.
- Ruby 3 + YJIT is meeting your needs. YJIT, which shipped in Ruby 3.1 and was declared production-ready in 3.2, narrowed the raw performance gap between modern Ruby and Elixir considerably. If your Ruby services aren't hurting you, leave them alone.
The right reason to migrate is operational pain that's expensive to solve in your current runtime. The wrong reason is novelty.
What I'd do differently
If I were running this migration again today:
- I'd budget the per-engineer onboarding cost explicitly. 60 days off the keyboard for the first migration project, then ramp. We crashed into this; it should have been planned.
- I'd build the dual-stack observability layer first. Migrating with consistent metrics across both stacks would have made the parity testing meaningfully easier. We bolted this on.
- I'd skip the LiveView migration of admin and use a managed admin tool. LiveView is great. The admin we built was fine. But the time we spent on it would have been better spent on the API migration.
- I'd still leave the Rails web monolith alone. That is the one call I wouldn't change.
The takeaway
Migrations are paid for by operational pain reduction, not by language preferences. The Ruby-to-Elixir move at InsideTrack worked because real-time messaging is exactly the workload the BEAM is built for, and the operational pain we were running into was specifically the kind that the BEAM eliminates.
For other workloads, the calculation may go the other way. The disciplined version of the question — "what's hurting us today, would moving runtimes solve it cheaply, and can we afford the transition cost" — is a much better framing than "what's the right tech stack for our company in 2026." The right answer to that latter question is almost always "the one you already have, optimized harder."