RAG in Phoenix: Hand-Rolled pgvector or Arcana?

You don't need a vector database — the Postgres you already run does RAG fine. The hand-rolled pgvector path in Phoenix, and when Arcana earns its place.

TL;DR: RAG needs exactly three things: a place to put vectors, a way to fill it, and one query that returns nearest neighbors. The Postgres you are already running does all three through pgvector, and the hand-rolled Phoenix version is one migration, one schema change, and one Ecto query — all of it below. The new Elixir RAG libraries (Arcana, rag_ex) are real and worth your attention, but they solve a different problem than the one people reach for them to solve: they own the pipeline — chunking, query rewriting, reranking, evals — not the storage. Figure out which problem you actually have and the decision makes itself. Here’s the honest version of both paths, including where Postgres itself is the wrong call.

You don’t have a vector database problem

The first decision most teams get wrong about RAG happens before any code gets written: they go shopping for a vector database. It’s an understandable reflex — nearly every RAG tutorial is written in Python, and the Python ecosystem’s default posture is “stand up Pinecone, Qdrant, or Chroma first, ask questions later.” So the Phoenix developer dutifully provisions a second stateful service before writing line one.

Stop and count what that second service actually costs you. It’s another thing to deploy, monitor, back up, and upgrade. It’s a second source of truth, which means a synchronization problem: when a document is updated or deleted in Postgres, something has to guarantee its embeddings change in the other store, and that something is now code you own, with failure modes you own. And it’s a network hop in the middle of your hottest read path.

Now count what it buys you at your scale. If you’re doing what most product teams are actually doing — Q&A over your docs, semantic search across support tickets, “find related items” over a catalog — your corpus is thousands to low millions of chunks. Postgres with the pgvector extension handles that range comfortably, with real indexes (HNSW), real distance operators, and one property no bolt-on vector store can offer: your embeddings live in the same database as your data, deleted in the same DELETE, written in the same transaction. The sync problem doesn’t get solved. It stops existing.
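To put a rough number on “comfortably”: pgvector’s documentation gives the raw size of a vector value as 4 bytes per dimension plus 8 bytes. The sketch below is back-of-envelope arithmetic only, assuming 1536-dimensional embeddings (the default output size of OpenAI’s text-embedding-3-small). The `Sizing` module name is made up for illustration, and the figures ignore the HNSW index, the chunk text, and ordinary row overhead.

```elixir
defmodule Sizing do
  # pgvector stores a `vector` value as 4 bytes per dimension plus 8 bytes.
  def bytes_per_vector(dims), do: 4 * dims + 8

  # Raw embedding bytes for a corpus. Excludes the index, chunk text,
  # and per-row overhead, so treat it as a lower bound.
  def corpus_bytes(chunks, dims), do: chunks * bytes_per_vector(dims)

  def gib(bytes), do: Float.round(bytes / (1024 * 1024 * 1024), 2)
end

IO.inspect(Sizing.bytes_per_vector(1536), label: "bytes per vector")
IO.inspect(Sizing.gib(Sizing.corpus_bytes(1_000_000, 1536)), label: "GiB per million chunks")
```

At a million chunks that is roughly 5.7 GiB of raw vectors, which is ordinary Postgres territory.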

This is the same argument I made about cutting a service footprint from ten to six, pointed at a different layer: every piece of infrastructure you don’t run is operational surface you don’t pay for. The BEAM lets you collapse app servers; pgvector lets you collapse the data layer back to one database. Take the win.

The hand-rolled path: a migration, a schema, a query

Here is the entire storage layer, end to end. I want you to see how little there is, because the size of this section is itself the argument.

Add the dependency:

# mix.exs
{:pgvector, "~> 0.3"}

Teach Postgrex about the vector type, and point your Repo at it:

# lib/my_app/postgrex_types.ex
Postgrex.Types.define(
  MyApp.PostgrexTypes,
  Pgvector.extensions() ++ Ecto.Adapters.Postgres.extensions(),
  []
)

# config/config.exs
config :my_app, MyApp.Repo, types: MyApp.PostgrexTypes

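One migration enables the extension, creates a chunks table, and adds the index. Use HNSW unless you have a measured reason not to: it builds more slowly and uses more memory than IVFFlat, but it has no training step, so you can create it on an empty table and it doesn’t need retraining as data grows. It also gives the better speed-recall trade-off at query time, and query speed is what you’ll feel: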

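defmodule MyApp.Repo.Migrations.CreateChunks do
  use Ecto.Migration

  def up do
    # Requires the pgvector extension to be installed on the Postgres server.
    execute "CREATE EXTENSION IF NOT EXISTS vector"

    create table(:chunks) do
      add :document_id, references(:documents, on_delete: :delete_all), null: false
      add :body, :text, null: false
      # 1536 matches text-embedding-3-small's default output size.
      add :embedding, :vector, size: 1536
      timestamps()
    end

    execute """
    CREATE INDEX chunks_embedding_idx ON chunks
    USING hnsw (embedding vector_cosine_ops)
    """
  end

  def down do
    # Dropping the table drops its index; the extension is left in place.
    drop table(:chunks)
  end
end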

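Note the on_delete: :delete_all. That single option is the delete half of the “keep the vector store in sync” subsystem you’d otherwise be writing. Updates still mean re-embedding, but the old chunks can be replaced in the same transaction as the document change.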

The schema:

defmodule MyApp.RAG.Chunk do
  use Ecto.Schema

  schema "chunks" do
    field :body, :string
    field :embedding, Pgvector.Ecto.Vector
    belongs_to :document, MyApp.RAG.Document
    timestamps()
  end
end

Ingestion is the part with actual decisions in it, but the mechanics are small: split the document, embed each piece, insert. text-embedding-3-small is the boring, correct default — cheap enough that re-embedding your whole corpus when you change chunking strategy (you will) costs pocket change:

defp embed!(texts) when is_list(texts) do
  Req.post!("https://api.openai.com/v1/embeddings",
    auth: {:bearer, System.fetch_env!("OPENAI_API_KEY")},
    json: %{model: "text-embedding-3-small", input: texts}
  ).body["data"]
  |> Enum.map(& &1["embedding"])
end

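def ingest!(document) do
  chunks = split(document.body, max_tokens: 500, overlap: 50)
  # insert_all/3 does not autogenerate timestamps, so set them explicitly.
  now = NaiveDateTime.utc_now() |> NaiveDateTime.truncate(:second)

  chunks
  |> embed!()
  |> Enum.zip(chunks)
  |> Enum.map(fn {embedding, body} ->
    %{document_id: document.id, body: body, embedding: Pgvector.new(embedding),
      inserted_at: now, updated_at: now}
  end)
  |> then(&MyApp.Repo.insert_all(MyApp.RAG.Chunk, &1))
end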
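The `split/2` call in `ingest!` is the chunker, which isn’t shown. Here is a minimal sketch of the naive fixed-size version, under one loud assumption: “tokens” are approximated by whitespace-separated words, which is not how the embedding model tokenizes, so treat `max_tokens` and `overlap` as rough budgets. The module name `MyApp.RAG.Chunker` is hypothetical.

```elixir
defmodule MyApp.RAG.Chunker do
  # Naive fixed-size splitter with overlap. Assumption: one "token" is one
  # whitespace-separated word, which only approximates real tokenization.
  def split(text, opts \\ []) do
    max_tokens = Keyword.get(opts, :max_tokens, 500)
    overlap = Keyword.get(opts, :overlap, 50)

    if overlap >= max_tokens do
      raise ArgumentError, "overlap must be smaller than max_tokens"
    end

    text
    |> String.split()
    # Step by (max_tokens - overlap) so consecutive windows share `overlap` words.
    |> Enum.chunk_every(max_tokens, max_tokens - overlap, [])
    |> drop_redundant_tail(overlap)
    |> Enum.map(&Enum.join(&1, " "))
  end

  # chunk_every/4 emits a final partial window even when it only repeats
  # words the previous window already covered; drop that one.
  defp drop_redundant_tail(windows, overlap) do
    case Enum.split(windows, -1) do
      {[_ | _] = init, [last]} when length(last) <= overlap -> init
      _ -> windows
    end
  end
end

IO.inspect(MyApp.RAG.Chunker.split("a b c d e f", max_tokens: 4, overlap: 1))
```

To use it from `ingest!` as written, import the module or call it fully qualified. It ignores headings and paragraphs entirely, which is fine for a first pass and is the first thing you’ll want to replace.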

Run that in an Oban job, not in the request — embedding calls are exactly the kind of flaky, retryable external work job queues exist for.

And retrieval is one query:

import Ecto.Query
import Pgvector.Ecto.Query

def retrieve(question, k \\ 5) do
  [query_embedding] = embed!([question])

  from(c in MyApp.RAG.Chunk,
    order_by: cosine_distance(c.embedding, ^Pgvector.new(query_embedding)),
    limit: ^k
  )
  |> MyApp.Repo.all()
end
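For intuition about what that ORDER BY computes, here is the same ranking done by brute force in plain Elixir. It is an illustration, not something to run in production: cosine distance is 1 minus cosine similarity, and the sketch sorts every row by it. `CosineSketch` and the row shape are made up for the example, embeddings are assumed to be plain lists of floats, and zero vectors are not handled. Keep in mind that with the HNSW index in place, Postgres returns approximate nearest neighbors rather than this exact ordering.

```elixir
defmodule CosineSketch do
  # Cosine distance: 1 - (a . b) / (|a| * |b|). Assumes non-zero vectors.
  def cosine_distance(a, b) do
    dot = a |> Enum.zip(b) |> Enum.reduce(0.0, fn {x, y}, acc -> acc + x * y end)
    1.0 - dot / (norm(a) * norm(b))
  end

  defp norm(v), do: :math.sqrt(Enum.reduce(v, 0.0, fn x, acc -> acc + x * x end))

  # Exact top-k by scanning every row: the unindexed meaning of
  # "ORDER BY distance LIMIT k".
  def nearest(rows, query, k) do
    rows
    |> Enum.sort_by(fn %{embedding: e} -> cosine_distance(e, query) end)
    |> Enum.take(k)
  end
end

rows = [
  %{id: 1, embedding: [1.0, 0.0]},
  %{id: 2, embedding: [0.0, 1.0]},
  %{id: 3, embedding: [1.0, 1.0]}
]

rows |> CosineSketch.nearest([1.0, 0.1], 2) |> Enum.map(& &1.id) |> IO.inspect()
```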

That’s it. Interpolate the top-k chunk bodies into your prompt, send it to the model, you have RAG. The whole thing is maybe 150 lines including the chunker, it’s all code a mid-level Elixir developer can read in one sitting, and — this is the underrated part — it’s Ecto. Want only chunks from documents the current user can see? Add a join and a where. That composability is something the dedicated vector stores make you reimplement through their metadata-filter DSLs.
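The “interpolate into your prompt” step can be sketched like this. The module name, the numbering scheme, and the instruction wording are all assumptions for illustration; the only thing taken from the code above is that each retrieved chunk has a `body`.

```elixir
defmodule MyApp.RAG.Prompt do
  # Hypothetical helper: turns retrieved chunks plus the user's question
  # into one prompt string. Chunks are numbered so the model can cite them.
  def build(question, chunks) do
    context =
      chunks
      |> Enum.with_index(1)
      |> Enum.map_join("\n\n", fn {%{body: body}, i} -> "[#{i}] #{body}" end)

    """
    Answer the question using only the numbered context below.
    If the context does not contain the answer, say so.

    Context:
    #{context}

    Question: #{question}
    """
  end
end

"why is it slow?"
|> MyApp.RAG.Prompt.build([%{body: "alpha"}, %{body: "beta"}])
|> IO.puts()
```

Telling the model to admit when the context is insufficient is a design choice, not a requirement, but it makes bad retrieval visible instead of papering over it.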

Where the hand-rolled version stops

Now the honest part. What you built above is retrieval. The gap between retrieval and a RAG system your users describe as “good” lives almost entirely outside that Ecto query, and it’s worth naming the pieces, because each one looks like an afternoon and the sum is a quarter:

  • Chunking strategy. Naive fixed-size splitting is why most RAG demos disappoint. Respecting document structure — headings, paragraphs, code blocks — matters more to answer quality than any index tuning you will ever do.
  • Query rewriting. Users ask “why is it slow?”; the chunk that answers says “latency regression in the connection pool.” Embedding the raw question and hoping is the weakest link in the naive pipeline. Good systems rewrite, expand with synonyms, and split multi-part questions into focused sub-queries.
  • Reranking and filtering. Top-5 by cosine distance includes near-duplicates and confidently-irrelevant chunks. A scoring pass that filters them is the difference between a model that answers and a model that hedges.
  • Evals. Without a question set scored against expected sources, every one of the above is vibes-driven development. This is the same discipline argument as instrumenting your AI product before reaching for a better model — you cannot tune what you don’t measure.
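To make the evals bullet concrete, here is a minimal sketch of the smallest useful harness: a hit rate at k over a hand-written question set. Everything named here is an assumption: the `MyApp.RAG.Eval` module, the shape of a test case, and the idea that a “hit” means at least one retrieved chunk comes from an expected document. The retriever is passed in as a function so the sketch runs without a database.

```elixir
defmodule MyApp.RAG.Eval do
  # cases: [%{question: String.t(), expected_document_ids: [term()]}]
  # retrieve_fun: fn question, k -> [%{document_id: term()}] end
  #
  # Returns the fraction of cases where at least one of the top-k chunks
  # belongs to an expected document.
  def hit_rate(cases, retrieve_fun, k \\ 5) do
    hits =
      Enum.count(cases, fn %{question: q, expected_document_ids: expected} ->
        retrieved = q |> retrieve_fun.(k) |> MapSet.new(& &1.document_id)
        Enum.any?(expected, &MapSet.member?(retrieved, &1))
      end)

    # Guard against an empty case list instead of dividing by zero.
    hits / max(length(cases), 1)
  end
end

# A stand-in retriever so the example is self-contained.
fake_retrieve = fn
  "q1", _k -> [%{document_id: 1}, %{document_id: 2}]
  _other, _k -> [%{document_id: 9}]
end

cases = [
  %{question: "q1", expected_document_ids: [2]},
  %{question: "q2", expected_document_ids: [3]}
]

IO.inspect(MyApp.RAG.Eval.hit_rate(cases, fake_retrieve, 5))
```

In a real app you would pass the retrieval function in place of the fake and rerun this every time you change chunking or `k`.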

You can absolutely build all of this yourself. I have. But recognize that the moment you start, you’ve left “an afternoon with pgvector” and entered pipeline engineering — and that’s the actual decision point, not SQL versus vector DB.
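As a taste of what that pipeline engineering looks like, here is a sketch of the cheapest possible filtering pass: dropping near-duplicate chunks from an already-ranked result list. This is not a reranker in the cross-encoder sense, and it is not how any particular library does it. The module name, the 0.95 threshold, and the assumption that embeddings arrive as plain lists of floats are all mine.

```elixir
defmodule MyApp.RAG.Dedupe do
  # Walks the ranked list in order and keeps a chunk only if it is not
  # too similar to something already kept. The threshold is an untuned guess.
  def drop_near_duplicates(ranked, threshold \\ 0.95) do
    ranked
    |> Enum.reduce([], fn chunk, kept ->
      duplicate? =
        Enum.any?(kept, fn k -> similarity(k.embedding, chunk.embedding) >= threshold end)

      if duplicate?, do: kept, else: [chunk | kept]
    end)
    |> Enum.reverse()
  end

  # Cosine similarity; assumes non-zero vectors of equal length.
  defp similarity(a, b) do
    dot = a |> Enum.zip(b) |> Enum.reduce(0.0, fn {x, y}, acc -> acc + x * y end)
    dot / (norm(a) * norm(b))
  end

  defp norm(v), do: :math.sqrt(Enum.reduce(v, 0.0, fn x, acc -> acc + x * x end))
end

ranked = [
  %{id: 1, embedding: [1.0, 0.0]},
  %{id: 2, embedding: [0.999, 0.01]},
  %{id: 3, embedding: [0.0, 1.0]}
]

ranked |> MyApp.RAG.Dedupe.drop_near_duplicates() |> Enum.map(& &1.id) |> IO.inspect()
```

It is quadratic in the number of candidates, which is fine for a top-20 list and a reason not to run it over anything larger.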

What Arcana and rag_ex actually buy you

This is where the new Elixir libraries come in, and the first thing worth noticing is what they didn’t build: storage. Arcana is explicitly embeddable — it plugs into the Ecto Repo and Postgres you already have, pgvector underneath, no separate service, no indexing daemon. rag_ex (a fork of bitcrowd’s rag) takes the same posture with pluggable vector stores. The ecosystem looked at the Python default of “stand up another database” and declined, which tells you the argument in the first section isn’t just mine.

What they did build is the pipeline. Arcana ships the agentic retrieval loop — query rewriting, sub-query splitting, per-chunk relevance scoring and filtering, answer generation — plus ingestion for real-world document formats and a dashboard to watch it work. rag_ex adds multi-LLM routing and GraphRAG-style knowledge-graph retrieval if your corpus has structure worth exploiting. Every bullet in the previous section, somebody already wrote, in Elixir, against the database you already run.

The trade is the usual one with frameworks: you adopt their opinions. Their chunking, their pipeline shape, their schema in your database. When your retrieval problem fits those opinions, that’s months of pipeline engineering for free. When it doesn’t, you’ll be reading library source to find the extension point — both projects are young, and young libraries’ extension points are where the sharp edges live.

The decision

Hand-roll when the corpus is one kind of thing, the questions are direct, and answer quality at “good top-k retrieval” is good enough — docs Q&A, related-content, semantic search where the user sees a result list rather than a synthesized answer. You’ll own 150 transparent lines, you’ll understand every moving part, and you’ll have no dependency on a pre-1.0 library’s roadmap. This is also unambiguously the right first move if your team hasn’t built RAG before: you can’t evaluate what a pipeline library is doing for you until you’ve felt the failure modes it exists to fix.

Reach for Arcana when the pipeline gap is the product gap — when users ask multi-part questions, when ingestion means PDFs and wikis and tickets rather than clean markdown, when you’re about to spend a sprint on reranking and evals you could adopt instead of write. The embeddable design means the migration path is gentle: it’s the same Postgres, so trying it isn’t a re-platforming.

The quiet good news is that this isn’t a one-way door. Both paths put your vectors in the same database, behind the same Repo. Moving from hand-rolled to Arcana — or back — is a refactor, not a data migration. That is not true of the Pinecone path, which is precisely why I keep insisting the storage decision is the one to get right first.

Where Postgres is the wrong call

Fairness requires the boundary. pgvector stops being the answer when vector search is the product at scale: tens of millions of embeddings, high-QPS approximate-nearest-neighbor with heavy metadata filtering, recall targets you tune weekly, sharding across nodes. The dedicated stores earn their operational cost there — that workload is what they’re for. If you’re building semantic search as the core of the business rather than a feature of it, hire for that infrastructure honestly.

But be suspicious of arriving at that conclusion early. “We might need to scale” is how teams end up operating three databases for a corpus that fits in Postgres’s shared buffers. Measure first; the migration is real but it’s a known road, and you’ll travel it with evals and production query logs you didn’t have on day one.

The BEAM was already good at this

One closing observation. A RAG request is an IO-bound fan-out: an embedding call, a database query, a model call, sometimes several of each in parallel under one user interaction that lives for seconds. That shape — many concurrent, slow, failure-prone external calls per user, isolated from every other user — is the workload the BEAM’s concurrency model was built for, the same reason it’s the runtime that fits agents. Python teams buy infrastructure to get isolation and concurrency around their pipelines. In Phoenix, the pipeline runs inside the same runtime properties, against the same database, in one deployable.
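Here is roughly what that fan-out looks like with nothing but the standard library. It is a sketch under assumptions: `MyApp.RAG.FanOut` is a made-up module, the sub-queries are assumed to exist already, and the retriever is passed in as a one-argument function so the example runs without a database or an API key.

```elixir
defmodule MyApp.RAG.FanOut do
  # Runs one retrieval per sub-query concurrently and merges the results.
  # A sub-query that exceeds `timeout` is killed and contributes nothing.
  # Note: only timeouts are absorbed here. Tasks are linked to the caller,
  # so an exception inside retrieve_fun still crashes the calling process;
  # Task.Supervisor.async_stream_nolink/4 is the tool if you need that isolated.
  def retrieve_all(sub_queries, retrieve_fun, timeout \\ 5_000) do
    sub_queries
    |> Task.async_stream(retrieve_fun,
      max_concurrency: 4,
      timeout: timeout,
      on_timeout: :kill_task
    )
    |> Enum.flat_map(fn
      {:ok, chunks} -> chunks
      {:exit, _reason} -> []
    end)
  end
end

# Stand-in retriever: one sub-query is deliberately slower than the timeout.
fake_retrieve = fn
  "slow" ->
    Process.sleep(200)
    ["never returned"]

  query ->
    [query]
end

IO.inspect(MyApp.RAG.FanOut.retrieve_all(["a", "slow", "b"], fake_retrieve, 50))
```

Dropping a slow sub-query instead of failing the whole request is a judgment call; for some products a partial answer is worse than an error.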

The stack you already run was the right one. Add a migration and find out.