product
Why we ship refusal as a feature
Every chatbot product fights its model when the answer isn't there. We treat that as a positive signal and let the bot refuse — with the closest related pages. The number that surprised us most: 71%.
Every other chatbot product I’ve evaluated treats refusal as a failure mode. The system prompt begs the model to say “I don’t know.” Escalation flows catch the cases where it doesn’t. Teams spend weeks tuning thresholds. And most of the time, when the bot gets asked something its docs don’t cover, it says something anyway — because that’s what models do.
We took the opposite stance. Refusal is a feature, not a fallback. When the docs don’t cover the question, the visitor gets a clear “I can’t answer that from your docs” with the closest related pages we did find, rendered in a distinct purple so it’s visually different from a “the bot broke” error. The visitor learns something they couldn’t have learned from a fluent hallucination: the answer isn’t here, look in those three places.
What we expected to happen, vs what happened
We shipped refusal expecting a CSAT trade-off. The pitch was: “we’ll cap deflection lower than the competition, but the answers we do give will be trustworthy.” We thought we were buying quality at the cost of volume.
The data didn’t match. Two things showed up in user research that we didn’t predict:
- The user trusts the next answer more. A bot that’s willing to refuse has cheap signal: when it does answer, it has good reason. Every answer the bot gives after a refusal rates higher in our eval pool than the same answer given by a bot that never refuses anything.
- Support tickets go down, not up. The common worry: “if the bot can’t answer, won’t the user just open a ticket?” Our pilot data says no. The bot’s “closest related sources” links get clicked 71% of the time, and the docs page solves the question. Visitors who would have opened a ticket open a docs page instead. Ticket volume dropped on refusal-heavy weeks, not just resolution-heavy weeks.
That 71% click-through is now the number we point to most often when buyers ask why refusal is good for them, not just for our marketing copy. The refusal screen, with its related-page links, is the most-clicked surface in the entire bot.
The numbers we publish
Across our eval suite, 38% of out-of-corpus queries that other RAG defaults answer confidently are refused by FlowChat instead. Of the remaining 62%, most are answered correctly with citations; the rest fail our verification gate after generation and ship as partial answers (uncited claims stripped).
The user-facing consequence: the “0 hallucinations escaped to users” line on our homepage is earned by these gates, every day. If a confidently wrong answer ever ships, the eval runner catches it post-hoc and pages us. So far this year: zero pages.
Want the engineering? (the three gates, technically)
For teams who want to know how the refusal machinery actually works, there are three gates. Any one of them can stop generation:
Gate 1 — retrieval confidence. After hybrid retrieval (Vectorize for dense + D1 FTS5 for BM25, fused with Reciprocal Rank Fusion) and a Cohere Rerank 3.5 cross-encoder pass, we look at the rerank score on the top chunk. Below threshold, generation never runs. No tokens are spent. The widget returns the refusal directly with the closest related pages.
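For the curious, Gate 1 can be sketched in a few lines of Python. The function names, the RRF constant, and the threshold value here are illustrative stand-ins, not our production code:

```python
# Sketch of Gate 1: fuse dense and BM25 rankings with Reciprocal Rank
# Fusion, then gate on the top chunk's rerank score. All names and
# thresholds are hypothetical.

RRF_K = 60               # conventional RRF constant
RERANK_THRESHOLD = 0.45  # illustrative cutoff, not our real value

def rrf_fuse(dense_ids, bm25_ids, k=RRF_K):
    """Combine two ranked lists of chunk ids into one fused ranking."""
    scores = {}
    for ranking in (dense_ids, bm25_ids):
        for rank, chunk_id in enumerate(ranking):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

def gate1(dense_ids, bm25_ids, rerank_score):
    """Return fused candidates if the top rerank score clears the bar,
    else None: refuse before a single generation token is spent."""
    fused = rrf_fuse(dense_ids, bm25_ids)
    if not fused or rerank_score(fused[0]) < RERANK_THRESHOLD:
        return None  # widget shows the refusal + closest related pages
    return fused
```

The point of the early return is the cost profile: a below-threshold query never touches the LLM at all.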
Gate 2 — token-level citations at generation time. When generation does run, the prompt requires every factual claim to include a [#N] token referencing one of the retrieved chunks. The parser rejects any sentence without one. Output that doesn’t conform is regenerated up to a retry budget; if it still doesn’t conform after retries, the answer is downgraded to a partial.
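The parser side of Gate 2 is simple enough to sketch. This is a minimal, assumed version (naive sentence splitting, a hypothetical `strip_uncited` helper), not our actual grammar:

```python
import re

# A [#N] citation token, e.g. [#0], [#3]
CITATION = re.compile(r"\[#(\d+)\]")

def split_sentences(text):
    """Naive sentence splitter, good enough for the sketch."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def conforms(answer, num_chunks):
    """True iff every sentence carries at least one [#N] token that
    points at a chunk that was actually retrieved."""
    for sentence in split_sentences(answer):
        tokens = CITATION.findall(sentence)
        if not tokens:
            return False  # uncited sentence: reject the whole output
        if any(int(t) >= num_chunks for t in tokens):
            return False  # citation points outside the retrieved set
    return True

def strip_uncited(answer):
    """Downgrade to a partial: keep only the cited sentences."""
    return " ".join(s for s in split_sentences(answer) if CITATION.search(s))
```

A rejected output triggers regeneration; once the retry budget is exhausted, `strip_uncited` produces the partial answer described above.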
Gate 3 — NLI verification post-generation. After the LLM answers, we run each cited claim through a RAG-fine-tuned Natural Language Inference model: does the cited chunk actually entail this claim? If the entailment score falls below threshold, the claim is stripped from the answer post-hoc, before the user sees it. This is the gate that catches the LLM citing a chunk that’s adjacent-but-not-supportive — common in long answers where the model reaches for a citation to satisfy the format constraint.
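Gate 3 reduces to a loop over (claim, cited chunk) pairs. In this sketch, `nli_entailment` stands in for the fine-tuned NLI model and the threshold is an assumed value:

```python
# Sketch of Gate 3: keep a claim only if the chunk it cites actually
# entails it. The model callable and threshold are hypothetical.

ENTAILMENT_THRESHOLD = 0.8  # illustrative, not our production value

def verify_answer(claims, chunks, nli_entailment,
                  threshold=ENTAILMENT_THRESHOLD):
    """claims: list of (sentence, cited_chunk_index) pairs.
    Returns the sentences whose cited chunk entails them; the rest
    are stripped before the user ever sees the answer."""
    kept = []
    for sentence, idx in claims:
        score = nli_entailment(premise=chunks[idx], hypothesis=sentence)
        if score >= threshold:
            kept.append(sentence)
    return kept
```

This is exactly the shape of check that catches an adjacent-but-not-supportive citation: the chunk is topically related, so retrieval and the format parser both pass it, but the entailment score stays low.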
The architecture diagram and per-gate thresholds are on /security and /developers if you want to go deeper.
The bigger lesson
We shipped refusal because it was the right thing to do. We expected buyers to thank us for the honesty and the safety. What we didn’t expect was for refusal to raise the metric we thought it would lower. The instinct in this space — and in a lot of AI product work right now — is that more answers means better outcomes. The opposite turned out to be true. Fewer, more-trustworthy answers built more user trust, which led to more user engagement, which led to fewer support tickets.
If this is the engineering philosophy you want shipping on your own site, start a 14-day trial — first crawl in under an hour. Or click around the refusal demo on /product/ground; it walks through three preset paths and shows the gates firing in real time.