Iris | James Mitchell

UI Overview

Evaluation interface.

The Iris interface shows the input controls, framework progress, composite score, and criterion-level matrix.

Parallel Agents

Ten agents.

Framework results stream back as each evaluator finishes.

Composite View

Framework summary.

Scores, issue counts, criteria counts, and grades are shown in a single table.

Criterion Detail

Criterion detail.

Each framework opens into criterion-level findings with severity, rationale, strengths, and recommendations.

The Problem

Multi-framework UX evaluation.

HCI evaluation often spans usability, accessibility, cognitive load, visual design, and behavioural psychology. Iris was built to run multiple framework-based evaluations from the same interface capture.

Iris uses 10 evaluator agents, each grounded in a distinct HCI framework, and returns colour-coded criterion matrices, severity scores, and a cross-framework composite score.

At a glance

Specialist frameworks 10

Total criteria evaluated 84

HTML-sensitive criteria 38

Time to full audit <90s

Methodology

Ten HCI frameworks.

Iris is a multi-agent system. An orchestrator agent dispatches 10 evaluator agents in parallel, each grounded in a distinct peer-reviewed HCI framework. The interface presents both the individual framework outputs and the cross-framework summary.

Markdown-defined evaluators.

Each framework is defined in a single markdown file containing YAML frontmatter and a prompt template; the file specifies the framework's name, criteria, and scoring rubric. The orchestrator reads the agents directory at startup, so adding a new framework only requires authoring one file.

The interaction evidence pipeline captures live CSS, focus states, and the rendered DOM via Playwright, so evaluators reason about real interactive behaviour rather than a static screenshot.

Agentic AI skills

Multi-agent orchestration, tool-use, parallel dispatch, model routing.

HCI skills

Framework selection, criterion-level scoring, cross-framework validation.

Engineering skills

Playwright automation, interaction evidence capture, NDJSON streaming.

Evaluation skills

Cross-framework normalisation, severity composite scoring, validity studies.

At a glance

From URL to criterion matrix.

Iris sends the same UI evidence bundle to 10 agents, normalises their scoring scales, and aggregates the result into a per-criterion severity matrix.

Agents 10

Criteria 84

Audit time <90s

URL · HTML · IMG

01 Input

URL, HTML, or image

Three input modes. For URLs and HTML, Playwright captures a full-page screenshot plus interactive CSS state and a keyboard focus screenshot.

02 Dispatch

Orchestrator agent

A Claude Haiku orchestrator dispatches 10 evaluator agents in parallel via tool-use, with a concurrency semaphore guarding the API.

03 Score

Per-criterion severity

Each evaluator returns per-criterion scores with severity ratings (None to Critical). Five heterogeneous scoring scales are normalised to a common 0–100 range.

04 Composite

Cross-framework grade

Per-framework means are averaged into a composite score that maps to a letter grade from A+ to F; cross-framework agreement is shown in the summary view.

name: Nielsen
scale: 0-4
criteria:
  - visibility
  - match
  - control

05 Extend

Author one file

Markdown-driven agent specification. To add a new framework, drop a .md file with YAML frontmatter and a prompt template into the agents directory. No orchestrator changes required.
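As a rough illustration of the "author one file" idea, the sketch below shows how an orchestrator might discover framework files at startup. It is not Iris's actual loader: `parse_frontmatter`, `load_frameworks`, and the `prompt_template` and `source` keys are hypothetical names, and the parser is a deliberately tiny stdlib-only stand-in that handles just flat `key: value` pairs and `- item` lists like the Nielsen example. A real implementation would more likely use a full YAML library.

```python
from pathlib import Path


def parse_frontmatter(text: str) -> tuple[dict, str]:
    """Split a markdown document into (frontmatter, body).

    Handles only flat `key: value` pairs and `- item` lists;
    this is not a general YAML parser.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    try:
        end = next(i for i in range(1, len(lines)) if lines[i].strip() == "---")
    except StopIteration:
        return {}, text  # no closing delimiter, treat as plain markdown

    meta: dict = {}
    current_list = None  # key of the list currently being filled, if any
    for raw in lines[1:end]:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("- ") and current_list is not None:
            meta[current_list].append(line[2:].strip())
        elif ":" in line:
            key, _, value = line.partition(":")
            key, value = key.strip(), value.strip()
            # A key with no inline value starts a list.
            meta[key] = value if value else []
            current_list = None if value else key
    body = "\n".join(lines[end + 1:]).strip()
    return meta, body


def load_frameworks(directory: str) -> list[dict]:
    """Read every .md file in `directory` (hypothetical agents folder)."""
    frameworks = []
    for path in sorted(Path(directory).glob("*.md")):
        meta, body = parse_frontmatter(path.read_text(encoding="utf-8"))
        if meta:  # skip files without frontmatter
            frameworks.append({**meta, "prompt_template": body, "source": path.name})
    return frameworks
```

Because discovery is a directory scan, a new framework becomes available without touching the dispatch code, which is the property the card describes.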

The System

Three stages.

01 · Input

URL, screenshot, or raw HTML.

Paste a live URL, upload a screenshot, or drop in HTML source. For URLs and HTML, Iris launches a headless Playwright browser session that captures the full-page screenshot, extracts every interactive CSS pseudo-class rule, takes a keyboard-focus screenshot, and returns the fully rendered DOM.

The input bundle is the contract between the browser layer and the evaluators. Every agent gets the same structured payload, regardless of whether the original input was a URL, a screenshot, or raw markup.
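A minimal sketch of what the capture step could look like with Playwright's Python sync API. The `EvidenceBundle` fields, the `capture_url` and `interactive_rules` names, the 900px viewport height, and the single Tab press used to reach a focus state are all assumptions made for illustration; only the 1440px width and the list of captured artefacts come from this page. Rules nested inside `@media` blocks are not collected in this simplified version.

```python
import re
from dataclasses import dataclass, field
from typing import Optional

# Pseudo-classes treated as "interactive" (also matches :focus-visible etc.).
INTERACTIVE = re.compile(r":(hover|focus|active)\b")

# Runs inside the page: collect selector and cssText for each readable style rule.
COLLECT_RULES_JS = """
() => {
  const out = [];
  for (const sheet of document.styleSheets) {
    let rules;
    try { rules = sheet.cssRules; } catch (e) { continue; }  // cross-origin sheet
    for (const rule of rules) {
      if (rule.selectorText) out.push({selector: rule.selectorText, css: rule.cssText});
    }
  }
  return out;
}
"""


@dataclass
class EvidenceBundle:
    """Hypothetical payload shared by every evaluator."""
    screenshot: bytes
    focus_screenshot: Optional[bytes] = None
    interactive_css: list = field(default_factory=list)
    dom: Optional[str] = None


def interactive_rules(rules: list) -> list:
    """Keep only rules whose selector uses an interactive pseudo-class."""
    return [r["css"] for r in rules if INTERACTIVE.search(r["selector"])]


def capture_url(url: str) -> EvidenceBundle:
    # Imported lazily so the helpers above work without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()  # headless by default
        page = browser.new_page(viewport={"width": 1440, "height": 900})
        page.goto(url, wait_until="networkidle")
        screenshot = page.screenshot(full_page=True)
        css = interactive_rules(page.evaluate(COLLECT_RULES_JS))
        page.keyboard.press("Tab")  # move focus to the first focusable element
        focus_screenshot = page.screenshot(full_page=True)
        dom = page.content()  # serialised DOM after scripts have run
        browser.close()
    return EvidenceBundle(screenshot, focus_screenshot, css, dom)
```

For an uploaded image, the same bundle type could be built with only `screenshot` populated, so evaluators see one payload shape whatever the input mode.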

Playwright · DOM + CSS extraction · Focus-state capture

Input your UI

https://example.com

Captured

Full-page screenshot (1440px)

:hover, :focus, :active CSS

Keyboard focus screenshot

Rendered DOM with ARIA

Orchestrator · Dispatch

Claude Haiku

orchestrating 10 agents

Nielsen

WCAG 2.1

Shneiderman…

Norman…

Gestalt…

Fogg BM…

Cognitive Load…

HEART…

Honeycomb…

Ability Heuristics

Semaphore max 4 in-flight

Stream SSE / NDJSON

02 · Parallel evaluation

Ten agents.

An orchestrator agent dispatches the evaluator agents in parallel. A concurrency semaphore caps in-flight calls at four, and failed evaluations are retried with exponential backoff.

Each agent receives the same input bundle and evaluates independently. Results stream back to the client via SSE/NDJSON as each framework completes.
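The dispatch pattern described here can be sketched with `asyncio`. This is an illustration under assumptions rather than Iris's code: `run_audit`, `with_retry`, the three-attempt limit, the 0.5s base delay, and the shape of each JSON line are hypothetical, and `evaluate` stands in for whatever coroutine actually calls the model. The cap of four in-flight calls is the one stated above.

```python
import asyncio
import json

MAX_IN_FLIGHT = 4  # concurrency cap stated in the text


async def with_retry(call, attempts=3, base_delay=0.5):
    """Await call(); on failure wait base_delay * 2**attempt and retry."""
    for attempt in range(attempts):
        try:
            return await call()
        except Exception:
            if attempt == attempts - 1:
                raise
            await asyncio.sleep(base_delay * 2 ** attempt)


async def run_audit(frameworks, evaluate, bundle, base_delay=0.5):
    """Yield one NDJSON line per framework as soon as it finishes."""
    semaphore = asyncio.Semaphore(MAX_IN_FLIGHT)

    async def run_one(name):
        async def guarded():
            # The slot is held per attempt, so it is free during backoff.
            async with semaphore:
                return await evaluate(name, bundle)

        try:
            result = await with_retry(guarded, base_delay=base_delay)
            return {"framework": name, "status": "ok", "result": result}
        except Exception as exc:
            return {"framework": name, "status": "error", "error": str(exc)}

    tasks = [asyncio.create_task(run_one(name)) for name in frameworks]
    for finished in asyncio.as_completed(tasks):
        yield json.dumps(await finished) + "\n"
```

A server handler could write each yielded line straight to the response, which is what lets the client render a framework's results before the slower evaluators have returned.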

Tool-use orchestration · Concurrency semaphore · Streaming results

03 · Composite grade

Composite scoring.

Five heterogeneous scoring scales (0–4↓ severity, 1–10↑, 1–10↓, pass/fail, and 0–4↓ ability) are normalised to a common 0–100 range. Per-criterion severity labels (None / Low / Medium / High / Critical) are assigned by each evaluator.
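The page does not give the normalisation formula, so the following is one plausible reading: a linear rescale of each scale's range onto 0–100, inverted for the scales marked ↓ (lower is better), with pass mapped to 100 and fail to 0. The `Scale` class, the scale keys, and `normalise` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scale:
    low: float
    high: float
    higher_is_better: bool


# The five scales listed above; the dictionary keys are illustrative.
SCALES = {
    "severity_0_4": Scale(0, 4, higher_is_better=False),
    "rating_1_10_up": Scale(1, 10, higher_is_better=True),
    "rating_1_10_down": Scale(1, 10, higher_is_better=False),
    "pass_fail": Scale(0, 1, higher_is_better=True),
    "ability_0_4": Scale(0, 4, higher_is_better=False),
}


def normalise(raw, scale_name: str) -> float:
    """Map a raw criterion score onto 0-100, where 100 is always best."""
    scale = SCALES[scale_name]
    if isinstance(raw, bool):  # pass/fail may arrive as True/False
        raw = 1.0 if raw else 0.0
    if not scale.low <= raw <= scale.high:
        raise ValueError(f"{raw} is outside the {scale_name} range")
    fraction = (raw - scale.low) / (scale.high - scale.low)
    if not scale.higher_is_better:
        fraction = 1 - fraction
    return 100 * fraction


print(normalise(1, "severity_0_4"))      # 75.0
print(normalise(5.5, "rating_1_10_up"))  # 50.0
print(normalise(True, "pass_fail"))      # 100.0
```

Under this assumption a Nielsen severity of 1 and a 1–10↑ rating of 7.75 both land on 75, which is what makes the per-framework means comparable.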

Framework means are averaged into a composite score that maps to a letter grade: A+ (≥90), A (≥80), B (≥70), C (≥60), D (≥50), F (<50). The per-framework breakdown remains visible below the composite score.
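The grade bands translate directly into a lookup. In the sketch below the thresholds are the ones listed above; treating the composite as an unweighted mean of framework means is an assumption, and `composite` and `letter_grade` are illustrative names.

```python
from statistics import mean

# (threshold, grade) pairs from the text, highest first.
GRADE_BANDS = [(90, "A+"), (80, "A"), (70, "B"), (60, "C"), (50, "D")]


def composite(framework_means: dict) -> float:
    """Unweighted average of per-framework means (assumed weighting)."""
    return mean(framework_means.values())


def letter_grade(score: float) -> str:
    """Return the first band whose threshold the score meets, else F."""
    for threshold, grade in GRADE_BANDS:
        if score >= threshold:
            return grade
    return "F"


print(letter_grade(82.4))  # A
print(letter_grade(66.2))  # C
print(letter_grade(11.9))  # F
```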

Cross-scale normalisation · Letter grade · Per-criterion severity

Iris audit · Composite

82.4

Composite score

Nielsen 88

WCAG 2.1 84

Norman 76

Cognitive Load 81

Ability Heuristics 68

+ 5 more frameworks

Frameworks

Ten peer-reviewed frameworks.

Iris keeps each framework result separate, then provides a composite score for comparison across evaluations.

Nielsen's 10 Usability Heuristics

10 criteria · 0–4 severity scale

Shneiderman's 8 Golden Rules

8 criteria · 1–10 scale

WCAG 2.1 (Level AA)

15 criteria · pass/fail

Cognitive Load Assessment

8 criteria · 1–10 scale

Visual Design Principles

10 criteria · 1–10 scale

Google HEART Framework

5 criteria · 1–10 scale

UX Honeycomb

7 criteria · 1–10 scale

Don Norman's Design Principles

7 criteria · 1–10 scale

Fogg Behavior Model

5 criteria · 1–10 scale

Ability Heuristics

9 criteria · 0–4 severity scale

Construct Validity

Measuring design quality.

We built four versions of a single webpage with systematically varying design quality, from catastrophic (every criterion deliberately violated) to Grade A, and confirmed that Iris scores increase monotonically with design quality across all 10 frameworks.

11.9

V1 Composite

Catastrophic design. Every criterion violated.

66.2

V2 Composite

Structural fix. Semantic HTML, WCAG-compliant.

74.7

V3 Composite

Full polish. Design tokens, hero, social proof.

80.4

V4 Composite

Iterative refinement guided by V3 Iris output.
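The monotonicity claim can be checked mechanically. The sketch below applies a strict-increase test to the four composite scores reported above; the per-framework scores are not listed on this page, so the same check across all 10 frameworks is only described, not reproduced. `is_strictly_increasing` is an illustrative helper name.

```python
# Composite scores for V1..V4 as reported above.
COMPOSITES = {"V1": 11.9, "V2": 66.2, "V3": 74.7, "V4": 80.4}


def is_strictly_increasing(values: list) -> bool:
    """True if every value is greater than the one before it."""
    return all(a < b for a, b in zip(values, values[1:]))


print(is_strictly_increasing(list(COMPOSITES.values())))  # True
```

Running the same function over each framework's V1 to V4 scores would be the per-framework version of the check.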