ClarifAI | James Mitchell

The Problem

Vague feedback tells designers nothing.

Asynchronous Online Focus Groups (AOFGs) are widely used to collect feedback at scale while overcoming geographic barriers and participant time constraints. However, the feedback they collect is often too vague or ambiguous to act on.

"Q: Was this helpful? A: Kind of" gives a designer nothing to act on.

ClarifAI intercepts vague feedback before it becomes unusable. A four-module LLM pipeline filters out irrelevant responses, flags vague or ambiguous comments, and conducts a targeted follow-up dialogue to turn a thin comment into richer, more granular data for design and evaluation teams.

ClarifAI Feedback Dialogue

You rated the discharge-summary tool as "somewhat useful." What made it only somewhat useful?

Human

It helped a bit, but I still had to open the chart.

What information did you need from the chart that was missing from the summary?

Human

Medication changes. I needed to see which meds were stopped, which were new, and why.

Thanks. I captured that the summary was useful for orientation, but not granular enough for medication reconciliation because it omitted medication-change rationale. ✓ Saved

Problem Clarified

Raw response: "somewhat useful"
Granular data captured: useful for orientation, but missing stopped/new medications and rationale needed for medication reconciliation.

Methodology

Engineering vagueness into structure.

ClarifAI's four-module pipeline is the result of an iterative design and validation programme combining contextual inquiry, multi-stage prompt engineering, and controlled empirical testing. Each module operates as a specialised agent with a distinct task, scoring rubric, and prompt template, developed, tested, and refined independently before being chained into the full pipeline.

A working multi-agent LLM system for live feedback workflows.

ClarifAI is designed to operate inside an AOFG-based feedback workflow. Per-module evaluation is built into the pipeline, so individual agents can be re-tested, swapped, or extended without rewriting the orchestrator. New modules can be added by authoring a single prompt template alongside an existing one.

Engineering skills

Prompt engineering, structured-output scoring rubrics, tool-use orchestration, model routing.

Research skills

Contextual inquiry, validation experiment design, ground-truth labelling protocols.

Evaluation skills

Per-module accuracy, composite pipeline scoring, held-out validation sets.

Product skills

Information architecture, prompt-as-interface design, integration into the AOFG flow.

At a glance

From feedback collection to actionable insight, in one continuous workflow.

Each agent is specialised, evaluated, and replaceable. The orchestrator routes between modules based on intermediate results so only feedback that needs clarification is escalated.
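The routing described above could be sketched roughly as follows. This is an illustration under stated assumptions, not ClarifAI's actual orchestrator: `Result`, `route`, and the four callables standing in for the modules are hypothetical names, and each callable would wrap an LLM call in a real system. Passing a clear response through unchanged is also an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Result:
    """Outcome of routing one response (hypothetical shape)."""
    stage_reached: str             # last module that handled the response
    insight: Optional[str] = None  # None when the response was filtered out


def route(
    question: str,
    response: str,
    is_relevant: Callable[[str, str], bool],         # stands in for Telemetry
    vague_topics: Callable[[str, str], List[str]],   # stands in for Flight
    clarify: Callable[[str, List[str]], List[str]],  # stands in for CapCom
    refactor: Callable[[str, List[str]], str],       # stands in for Payload
) -> Result:
    # Telemetry: drop off-topic responses before spending further LLM calls.
    if not is_relevant(question, response):
        return Result("telemetry")
    # Flight: clear responses skip the dialogue (assumed to pass through as-is).
    topics = vague_topics(question, response)
    if not topics:
        return Result("flight", insight=response)
    # CapCom then Payload: only vague responses are escalated.
    transcript = clarify(response, topics)
    return Result("payload", insight=refactor(response, transcript))
```

Because each module is passed in as a plain callable, any one of them can be swapped or re-tested without touching `route` itself.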

Agents · AOFG · Framework · Tool-use · Orchestration

01 Observe

Contextual inquiry

Observed how research teams process AOFG transcripts at scale, identifying the moments where vague responses bottleneck downstream analysis.

Relevance · Vagueness · Dialogue · Refactor

02 Specify

Module boundaries

Decomposed the disambiguation task into four discrete agent responsibilities, each with a single owning prompt and a clear input/output contract.

03 Prompt

Iterative engineering

Each module's prompt was refined against a held-out validation set, with structured scoring rubrics replacing free-text output to guarantee machine-readable consistency between agents.
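One way to enforce a structured rubric between agents is to parse each module's output as JSON and reject anything that falls outside the rubric. The sketch below assumes a rubric with `label`, `score`, and `rationale` fields and three allowed labels; these names and values are hypothetical and are not taken from ClarifAI's prompts.

```python
import json
from dataclasses import dataclass

ALLOWED_LABELS = {"clear", "vague", "ambiguous"}  # hypothetical rubric labels


@dataclass(frozen=True)
class RubricScore:
    label: str
    score: float     # assumed to lie in [0, 1]
    rationale: str


def parse_rubric(raw: str) -> RubricScore:
    """Parse one module's JSON output, rejecting anything off-rubric."""
    data = json.loads(raw)  # free text raises json.JSONDecodeError (a ValueError)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    label = data.get("label")
    score = data.get("score")
    if not isinstance(label, str) or label not in ALLOWED_LABELS:
        raise ValueError(f"unknown label: {label!r}")
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("score must be a number")
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be in [0, 1]")
    return RubricScore(label, float(score), str(data.get("rationale", "")))
```

A downstream agent then only ever receives a `RubricScore`, never raw model text.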

Validation

04 Validate

Empirical testing

Pipeline evaluated end-to-end against ground-truth labels from clinical researchers, with per-module accuracy reported alongside a cross-pipeline composite score.
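Per-module accuracy and a composite score could be computed along these lines. The source does not define how the composite is calculated, so the unweighted mean used here is purely an assumption, as are the function names and label strings.

```python
from statistics import mean
from typing import Dict, List, Tuple


def accuracy(pairs: List[Tuple[str, str]]) -> float:
    """Fraction of (predicted, ground_truth) pairs that match."""
    if not pairs:
        raise ValueError("no labelled examples")
    return sum(pred == truth for pred, truth in pairs) / len(pairs)


def evaluate(per_module: Dict[str, List[Tuple[str, str]]]) -> Dict[str, float]:
    """Score every module, then add a composite across modules."""
    scores = {name: accuracy(pairs) for name, pairs in per_module.items()}
    # Assumption: composite = unweighted mean of the module accuracies.
    scores["composite"] = mean(scores.values())
    return scores


report = evaluate({
    "telemetry": [("relevant", "relevant"), ("off-topic", "relevant")],
    "flight": [("vague", "vague")],
})
```

Reporting the per-module numbers next to the composite makes it visible when one weak module is hidden by a strong average.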

Live pipeline

05 Integrate

Inside the workflow

Designed to surface clarification dialogues to participants in real time inside an AOFG flow and write structured results back to the research dataset.

The Pipeline

Four modules. Four agents. One workflow.

A · Telemetry

Filter out the irrelevant.

The Telemetry module classifies whether each piece of feedback (e.g., an individual answer to a feedback question or prompt) addresses the informational intent of that question or prompt. Off-topic or tangential responses are filtered out before they reach the rest of the pipeline.

Doing this classification first means downstream modules only see feedback that genuinely belongs to the question being asked, reducing both LLM cost and false-positive escalations into the clarification dialogue.

Contextual relevance · Cost control · Structured output
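If the classifier returns a relevance score, the filtering step reduces to a threshold check. The sketch below assumes a score in [0, 1] and a cut-off of 0.5; the real threshold and the way the score is produced are not given in the source, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

RELEVANCE_THRESHOLD = 0.5  # hypothetical cut-off; the real value is not stated


@dataclass(frozen=True)
class RelevanceDecision:
    score: float
    relevant: bool


def decide_relevance(score: float, threshold: float = RELEVANCE_THRESHOLD) -> RelevanceDecision:
    """Turn a relevance score in [0, 1] into a keep/drop decision."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be in [0, 1]")
    return RelevanceDecision(score, score >= threshold)


def filter_relevant(
    scored: Iterable[Tuple[str, float]],
    threshold: float = RELEVANCE_THRESHOLD,
) -> List[str]:
    """Keep only the responses whose score clears the threshold."""
    return [text for text, score in scored if decide_relevance(score, threshold).relevant]
```

Only the responses returned by `filter_relevant` would go on to the next module, which is where the cost saving comes from.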

A · Telemetry

Input feedback

"The discharge summary tool helped me a little but I miss home cooking."

18%

Decision

Off-topic

Relevant

B · Flight

Relevant feedback

"It helped a bit, but I still had to open the chart."

Detected signals

Vague: "a bit"

Implicit: what was missing?

Decision

Escalate to CapCom

B · Flight

Detect what needs clarifying.

The Flight module screens relevant feedback for responses that are not specific enough to act on, i.e., responses that contain vagueness or ambiguity. Only those responses are escalated to the clarification dialogue, which keeps the experience lightweight for users who already gave clear and relevant responses.

This routing decision is the load-bearing piece of the pipeline. Over-escalation creates participant fatigue; under-escalation leaves vague responses unclarified. Flight's prompt was iteratively tuned to balance the two.

Vagueness detection · Routing logic · Workload calibration
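The trade-off between over- and under-escalation can be made measurable with two error rates computed against labelled examples. This is a sketch with hypothetical names, and it is not necessarily how Flight's prompt was actually tuned.

```python
from typing import Iterable, Tuple


def escalation_rates(pairs: Iterable[Tuple[bool, bool]]) -> Tuple[float, float]:
    """Return (over_escalation, under_escalation).

    Each pair is (escalated, truly_vague). Over-escalation is the share of
    clear responses that were escalated anyway; under-escalation is the share
    of vague responses that were not escalated.
    """
    clear = clear_escalated = vague = vague_missed = 0
    for escalated, truly_vague in pairs:
        if truly_vague:
            vague += 1
            vague_missed += not escalated
        else:
            clear += 1
            clear_escalated += escalated
    over = clear_escalated / clear if clear else 0.0
    under = vague_missed / vague if vague else 0.0
    return over, under
```

Tracking both numbers across prompt revisions shows whether a change reduced participant fatigue at the cost of leaving more vague responses unclarified, or the reverse.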

C · CapCom

Ask only what's needed.

The CapCom module engages the user in a short, targeted follow-up conversation for each vague or ambiguous topic identified by Flight. The LLM interviewer asks the questions needed to resolve the vagueness or ambiguity, then sends the transcript of the conversation to the Payload module for final processing.

The dialogue is bounded — CapCom stops as soon as the original vagueness signal has been resolved — so participants experience a focused exchange rather than a generic AI chat.

Targeted dialogue · Bounded interaction · Stop conditions
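A bounded dialogue of this kind can be sketched as a loop with two exits: the stop condition and a hard cap on turns. The cap of three turns and every function name below are assumptions; the source only states that the dialogue ends once the vagueness is resolved.

```python
from typing import Callable, List, Tuple

Transcript = List[Tuple[str, str]]  # (question, answer) pairs

MAX_TURNS = 3  # hypothetical cap; the source only says the dialogue is bounded


def run_dialogue(
    first_question: str,
    get_answer: Callable[[str], str],            # returns the participant's reply
    is_resolved: Callable[[Transcript], bool],   # stop-condition check
    next_question: Callable[[Transcript], str],  # follow-up generator
    max_turns: int = MAX_TURNS,
) -> Transcript:
    """Ask follow-ups until the vagueness is resolved or the cap is hit."""
    transcript: Transcript = []
    question = first_question
    for turn in range(max_turns):
        transcript.append((question, get_answer(question)))
        if is_resolved(transcript) or turn == max_turns - 1:
            break
        question = next_question(transcript)
    return transcript
```

The returned transcript is what would be handed to the Payload module for final processing.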

C · CapCom

What information did you need from the chart that was missing from the summary?

Medication changes — stopped, new, and why.

Captured. ✓ Done

Stop condition met

D · Payload

Original response

"It helped a bit, but I still had to open the chart."

After refactor

Useful for orientation but not granular enough for medication reconciliation — omitted stopped, new, and rationale-for-change medication details, forcing the user to consult the chart.

domain: meds · task: reconciliation · gap: rationale

D · Payload

Refactor into something actionable.

The Payload module extracts the granular information that CapCom elicited from the user and merges it into the original feedback response. The result is a structured, machine-readable, and specific insight that designers can act on directly.

The output is structured rather than free-text: it preserves the original phrasing while attaching the elicited specifics, so designers and downstream analysis tools can both read the human version and query the structured fields.

Summarisation · Structured output · Designer-ready
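A record that keeps the original wording alongside queryable fields might look like the sketch below. The `Insight` class and its attribute names are hypothetical; the `domain`, `task`, and `gap` keys follow the example card above, and the refactored text is shortened for brevity.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict


@dataclass
class Insight:
    original: str    # participant's wording, kept verbatim
    refactored: str  # rewritten, specific version for designers
    fields: Dict[str, str] = field(default_factory=dict)  # queryable tags

    def to_json(self) -> str:
        """Serialise the record for downstream analysis tools."""
        return json.dumps(asdict(self), ensure_ascii=False)


insight = Insight(
    original="It helped a bit, but I still had to open the chart.",
    refactored="Useful for orientation but not granular enough for medication reconciliation.",
    fields={"domain": "meds", "task": "reconciliation", "gap": "rationale"},
)
```

Designers can read `refactored`, while an analysis tool can filter on `fields`, for example to select every record where `gap` is `rationale`.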

System Architecture

From feedback to insight.

The complete ClarifAI pipeline, from AOFG collection through four LLM modules to structured, actionable output.

AOFG Platform

Stage 1: Prerequisite Task (Consent, Demographics)

Stage 2: Project & Tasks (PI Usability Questions)

Stage 3: Discussion Board (Per Question Discussion)

LLM-assisted Pipeline

A) Telemetry: Contextual Relevance

B) Flight: Requires Clarification

C) CapCom: Clarification Dialogue

D) Payload: Summarisation & Refactor