I Built a RAG Chat Agent on a Confluence Wiki. Here’s What Actually Broke.


Everyone demos RAG on a clean PDF. Nicely formatted, flat structure, no surprises. It works great. The demo looks impressive. Then you hand it a real enterprise knowledge base and the wheels fall off.

I recently built a chat agent on top of a Confluence wiki for a SaaS client. The goal was simple: instant, accurate answers grounded in their actual docs.

The LLM side of that project? Honestly, the easy part. The ingestion pipeline — getting pages of real-world Confluence content into a state where retrieval actually works — that’s where I lived for the majority of the project.

Here’s what broke, and how I fixed it.


The Confluence Format Problem Nobody Warns You About

Confluence doesn’t export clean HTML. It exports Atlassian’s own storage format: XHTML with XML-flavoured macro tags like <ac:structured-macro> and <ac:parameter> nested inside it. A standard HTML parser reads those tags without complaint but has no idea what they mean, and they are absolutely toxic to your chunk quality if you don’t deal with them.

Run a standard BeautifulSoup pass over a Confluence export and you’ll think you’ve got clean text. You haven’t. You’ve got fragments of macro names, parameter values, and serialised XML quietly mixed into your content. That noise ends up in your embeddings, and it degrades retrieval quality before the LLM ever sees a single token.

The fix is a hand-rolled stripping pass specifically targeting Atlassian macros before any chunking happens. It’s not glamorous work, but skipping it means you’re building on a dirty foundation.
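
A minimal sketch of what such a stripping pass could look like. This is an illustration, not the production code: it uses Python's standard-library html.parser rather than BeautifulSoup, and the choice to drop only <ac:parameter> elements (keeping the prose inside macro bodies) is an assumption. The names DROP_WITH_CONTENT, MacroStripper and strip_macros are hypothetical, and CDATA bodies such as code macros are not handled here.

```python
from html.parser import HTMLParser

# Assumption: parameter values are macro configuration, not prose,
# so they are dropped together with everything nested inside them.
DROP_WITH_CONTENT = {"ac:parameter"}


class MacroStripper(HTMLParser):
    """Collects text nodes, skipping anything inside a dropped element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.parts.append(text)


def strip_macros(storage_xhtml):
    """Return the readable text of a page body, one text node per line."""
    parser = MacroStripper()
    parser.feed(storage_xhtml)
    parser.close()
    return "\n".join(parser.parts)
```

Run on an info panel, this keeps the sentence inside the macro body and discards the parameter value that a plain tag-stripping pass would have mixed into the chunk.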


Tables Are the Silent Killer

Confluence pages love tables. Release notes, feature comparisons, config options, pricing tiers — all tables. And naive HTML-to-text stripping absolutely destroys them.

Strip the tags without handling structure and a five-column table becomes a meaningless string of values with no relationship to their headers. Feed that to an embedding model and you’ve created a retrieval black hole — the content is technically there, it just produces bad answers every time it gets retrieved.

I solved this by using Docling to convert table content to Markdown before chunking. Markdown tables keep the row/column relationships in plain text, in a form the embedding model can actually make sense of. It adds a step to the pipeline, but the difference in answer quality on table-heavy pages is significant.
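
To make the target format concrete, here is a small sketch that turns a simple HTML table into a Markdown table. It is not Docling's API. It is a hand-written stand-in built on Python's standard library, and it assumes the first row is the header and that there are no merged cells. The names TableToMarkdown and table_to_markdown are hypothetical.

```python
from html.parser import HTMLParser


class TableToMarkdown(HTMLParser):
    """Collects cell text row by row from a simple, unmerged table."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []
        self.row = None    # cells of the row being read, or None
        self.cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag in ("th", "td") and self.row is not None:
            self.cell = []

    def handle_data(self, data):
        if self.cell is not None:
            self.cell.append(data.strip())

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self.cell is not None:
            text = " ".join(part for part in self.cell if part)
            # Escape pipes so cell text cannot break the Markdown layout.
            self.row.append(text.replace("|", "\\|"))
            self.cell = None
        elif tag == "tr" and self.row is not None:
            self.rows.append(self.row)
            self.row = None
            self.cell = None


def table_to_markdown(html):
    """Render the rows as a Markdown table, treating row one as the header."""
    parser = TableToMarkdown()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return ""
    header, *body = parser.rows
    lines = ["| " + " | ".join(header) + " |"]
    lines.append("| " + " | ".join("---" for _ in header) + " |")
    lines.extend("| " + " | ".join(row) + " |" for row in body)
    return "\n".join(lines)
```

Every value now sits on the same line as its row and under a named column, which is the relationship that naive tag stripping throws away.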


Upsert by Natural Key. Not Autoincrement.

This one bites people the first time they re-sync.

If you’re using autoincrement IDs for your vector store records, every time you re-ingest a page — because the content was updated, because you’re refreshing on a schedule, because someone fixed a typo — you create a duplicate. Now you’ve got two versions of the same chunk in your vector table. Retrieval starts returning stale answers alongside current ones.

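The fix is simple but has to be deliberate: upsert by natural key, specifically the confluence_id. Every Confluence page has a stable ID regardless of edits. Use that as your upsert key and your ingestion becomes idempotent: re-sync as many times as you like, and it updates what’s changed without duplicating anything. If a page produces several chunks, the key also needs a per-chunk part, such as the chunk’s position on the page, because the page ID alone can only identify one row.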
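
A sketch of the idea, under some loudly stated assumptions. It uses SQLite from the standard library so it runs anywhere (Postgres accepts the same INSERT ... ON CONFLICT ... DO UPDATE form), the table and column names are hypothetical, and the key is extended to (confluence_id, chunk_index) because one page usually yields several chunks. The trailing DELETE is also an assumption: it removes leftover chunks when a page gets shorter.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS chunks (
    confluence_id TEXT    NOT NULL,
    chunk_index   INTEGER NOT NULL,
    content       TEXT    NOT NULL,
    PRIMARY KEY (confluence_id, chunk_index)
)
"""


def open_store(path=":memory:"):
    """Open (or create) the chunk store with the natural-key schema."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def sync_page(conn, confluence_id, chunks):
    """Idempotently write one page's chunks: update in place, never duplicate."""
    with conn:  # one transaction per page
        conn.executemany(
            """
            INSERT INTO chunks (confluence_id, chunk_index, content)
            VALUES (?, ?, ?)
            ON CONFLICT (confluence_id, chunk_index)
            DO UPDATE SET content = excluded.content
            """,
            [(confluence_id, i, text) for i, text in enumerate(chunks)],
        )
        # Drop chunks left over from a longer, older version of the page.
        conn.execute(
            "DELETE FROM chunks WHERE confluence_id = ? AND chunk_index >= ?",
            (confluence_id, len(chunks)),
        )
```

Calling sync_page twice with the same input leaves the row count unchanged, which is the property you want from a scheduled re-sync.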


Your Embedding Model Choice is a Schema Decision

This is the one that feels like a configuration detail until it isn’t.

I used Amazon Titan Embeddings v2 at 1024 dimensions. That dimension count is baked into your vector table schema from the moment you create it. If you decide halfway through the project to switch models — different provider, different dimension count — you’re not just swapping a config value. You’re rebuilding the entire vector table and re-embedding every chunk from scratch.

Choose your embedding model before you write your schema. Treat it like a database design decision, because that’s exactly what it is.
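
One way to make that coupling explicit is to keep the dimension in a single constant that both the table definition and the ingestion code depend on. The sketch below is illustrative only: the table layout and the helper names chunks_table_ddl and check_embedding are hypothetical, and the only parts taken from the setup described here are pgvector's vector(n) column type and the 1024 dimensions.

```python
EMBEDDING_DIM = 1024  # Titan Embeddings v2 at 1024 dimensions, as used here


def chunks_table_ddl(dim=EMBEDDING_DIM):
    """Build the pgvector table definition; the dimension is part of the type."""
    return (
        "CREATE TABLE chunks (\n"
        "    confluence_id text    NOT NULL,\n"
        "    chunk_index   integer NOT NULL,\n"
        "    content       text    NOT NULL,\n"
        f"    embedding     vector({dim}) NOT NULL,\n"
        "    PRIMARY KEY (confluence_id, chunk_index)\n"
        ");"
    )


def check_embedding(vector, dim=EMBEDDING_DIM):
    """Fail fast in the pipeline instead of at insert time in the database."""
    if len(vector) != dim:
        raise ValueError(
            f"embedding has {len(vector)} dimensions, schema expects {dim}"
        )
    return vector
```

Changing EMBEDDING_DIM changes the generated DDL, which makes it hard to forget that a model swap means a new table and a full re-embed.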


How the Stack Fit Together

For anyone curious about the full setup:

  • Postgres with pgvector for semantic search and tsvector for full-text — both in the same database, which lets you run hybrid retrieval and weight the results
  • Separate logical databases for raw document storage vs. processed chunks — keeps your pipeline clean and makes debugging significantly easier
  • Docling for Markdown conversion, particularly for tables and structured content
  • Amazon Titan Embeddings v2 at 1024 dimensions for the vector layer

The retrieval layer then combines semantic and full-text scores, re-ranks, and passes the top chunks to the LLM with a tightly constrained prompt. The constraint matters — without it, the model will confidently answer questions using knowledge outside the documentation, which defeats the entire purpose.
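
The score combination can be sketched roughly as follows. This is one common way to do it, not necessarily what ran in this project: each score set is min-max normalised so the two scales are comparable, then blended with a weight. The function names, the 0.7 default weight and the top_k default are all assumptions for illustration.

```python
def min_max(scores):
    """Rescale a {chunk_id: score} mapping to the 0..1 range."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {chunk_id: 1.0 for chunk_id in scores}
    return {cid: (s - low) / (high - low) for cid, s in scores.items()}


def hybrid_rank(semantic, fulltext, semantic_weight=0.7, top_k=5):
    """Blend semantic and full-text scores and return the best chunks first.

    A chunk missing from one result set contributes 0 for that signal.
    """
    sem, txt = min_max(semantic), min_max(fulltext)
    combined = {
        cid: semantic_weight * sem.get(cid, 0.0)
        + (1 - semantic_weight) * txt.get(cid, 0.0)
        for cid in sem.keys() | txt.keys()
    }
    ranked = sorted(combined.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]
```

Here semantic would hold similarity scores from the pgvector query and fulltext the rank values from the tsvector query, both keyed by chunk ID. The weight is something to tune against real questions rather than a value to copy.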


The Honest Takeaway

If you’re planning a RAG project on enterprise documentation, budget accordingly: the pipeline before the LLM is where you’ll spend 60% of your time. The model is the last 10 minutes of the problem. The ingestion, cleaning, chunking, and retrieval architecture is everything else.

The good news is that getting it right produces something genuinely useful — an agent that gives accurate, grounded answers from your actual documentation instead of hallucinating plausible-sounding nonsense.

That’s the difference between a demo and a tool people actually use.

