UI evaluation across 10 HCI frameworks.
10 agents · 84 criteria · Criterion-level scoring
UI Overview
The Iris interface shows the input controls, framework progress, composite score, and criterion-level matrix.
Parallel Agents
Framework results stream back as each evaluator finishes.
Composite View
Scores, issue counts, criteria counts, and grades are shown in a single table.
Criterion Detail
Each framework opens into criterion-level findings with severity, rationale, strengths, and recommendations.
The Problem
HCI evaluation often spans usability, accessibility, cognitive load, visual design, and behavioural psychology. Iris was built to run multiple framework-based evaluations from the same interface capture.
Iris uses 10 evaluator agents, each grounded in a distinct HCI framework, and returns colour-coded criterion matrices, severity scores, and a cross-framework composite score.
Methodology
Iris is a multi-agent system. An orchestrator agent dispatches 10 evaluator agents in parallel, each grounded in a distinct peer-reviewed HCI framework. The interface presents both the individual framework outputs and the cross-framework summary.
Each framework is defined in a single markdown file containing YAML frontmatter (name, criteria, and scoring rubric) and a prompt template. The orchestrator reads the agents directory at startup, so adding a new framework requires authoring only one file. The interaction evidence pipeline captures live CSS, focus states, and the rendered DOM via Playwright, so evaluators reason about real interactive behaviour rather than a static screenshot.
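As a rough illustration of the one-file-per-framework idea, the sketch below splits a markdown file into its frontmatter and prompt template. It is not Iris's actual loader: the field names (`name`, `scale`), the `FrameworkSpec` type, and the function names are hypothetical, and only flat `key: value` pairs are parsed, where a real loader would presumably hand the frontmatter block to a YAML parser.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class FrameworkSpec:
    """One evaluator definition: frontmatter metadata plus a prompt template."""
    meta: dict = field(default_factory=dict)
    prompt_template: str = ""


def parse_framework_file(text: str) -> FrameworkSpec:
    """Split a markdown document into frontmatter and body.

    Assumption: the frontmatter is delimited by '---' lines and contains only
    flat `key: value` pairs. Nested YAML (e.g. a criteria list) is out of scope.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("missing opening '---' frontmatter delimiter")
    try:
        end = next(i for i, line in enumerate(lines[1:], start=1)
                   if line.strip() == "---")
    except StopIteration:
        raise ValueError("missing closing '---' frontmatter delimiter") from None

    meta = {}
    for line in lines[1:end]:
        if not line.strip():
            continue
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"not a 'key: value' line: {line!r}")
        meta[key.strip()] = value.strip()

    body = "\n".join(lines[end + 1:]).strip()
    return FrameworkSpec(meta=meta, prompt_template=body)


def load_frameworks(directory: str) -> list:
    """Read every .md file in a directory (hypothetical layout) at startup."""
    return [parse_framework_file(p.read_text(encoding="utf-8"))
            for p in sorted(Path(directory).glob("*.md"))]


# Hypothetical example file content.
EXAMPLE = """---
name: Fogg Behavior Model
scale: 1-10
---
Evaluate the interface against each criterion.
"""

spec = parse_framework_file(EXAMPLE)
print(spec.meta["name"], "|", spec.prompt_template)
```

Because the loader only globs a directory, adding an evaluator in this sketch means adding a file and nothing else.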
Agentic AI skills
Multi-agent orchestration, tool-use, parallel dispatch, model routing.
HCI skills
Framework selection, criterion-level scoring, cross-framework validation.
Engineering skills
Playwright automation, interaction evidence capture, NDJSON streaming.
Evaluation skills
Cross-framework normalisation, severity composite scoring, validity studies.
At a glance
Iris sends the same UI evidence bundle to 10 agents, normalises their scoring scales, and aggregates the result into a per-criterion severity matrix.
10
Agents
84
Criteria
<90s
Audit time
01 Input
Three input modes: live URL, uploaded screenshot, or HTML source. For URLs and HTML, Playwright captures a full-page screenshot plus interactive CSS state and a keyboard focus screenshot.
02 Dispatch
A Claude Haiku orchestrator dispatches 10 evaluator agents in parallel via tool-use, with a concurrency semaphore guarding the API.
03 Score
Each evaluator returns per-criterion scores with severity ratings (None to Critical). Five heterogeneous scoring scales are normalised to a common 0–100 range.
04 Composite
Per-framework means are averaged into a composite score mapping to a letter grade A+ to F, with cross-framework agreement shown in the summary view.
05 Extend
Markdown-driven agent specification. To add a new framework, drop a .md file with YAML frontmatter and a prompt template into the agents directory. No orchestrator changes required.
The System
01 · Input
Paste a live URL, upload a screenshot, or drop in HTML source. For URLs and HTML, Iris launches a headless Playwright browser session that captures the full-page screenshot, extracts every interactive CSS pseudo-class rule, takes a keyboard-focus screenshot, and returns the fully rendered DOM.
The input bundle is the contract between the browser layer and the evaluators. Every agent gets the same structured payload, regardless of whether the original input was a URL, a screenshot, or raw markup.
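One way to picture that contract is a single immutable record that every evaluator receives. The sketch below is an assumption about its shape, not the actual Iris schema: the class name, field names, and the idea that screenshot-only inputs leave the interaction fields empty are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

VALID_KINDS = ("url", "screenshot", "html")


@dataclass(frozen=True)
class InputBundle:
    """Hypothetical evidence payload shared by all evaluator agents."""
    source_kind: str                              # one of VALID_KINDS
    screenshot_png: bytes                         # full-page or uploaded screenshot
    focus_screenshot_png: Optional[bytes] = None  # keyboard-focus capture, if any
    pseudo_class_rules: tuple = ()                # e.g. ":hover" / ":focus" CSS rules
    rendered_dom: Optional[str] = None            # DOM after rendering, if captured

    def __post_init__(self):
        if self.source_kind not in VALID_KINDS:
            raise ValueError(f"unknown source_kind: {self.source_kind!r}")

    @property
    def has_interaction_evidence(self) -> bool:
        """True when a browser session supplied more than a static image."""
        return self.rendered_dom is not None or bool(self.pseudo_class_rules)


# A screenshot upload carries only the image; a URL capture carries more.
upload = InputBundle("screenshot", b"\x89PNG")
capture = InputBundle(
    "url",
    b"\x89PNG",
    focus_screenshot_png=b"\x89PNG",
    pseudo_class_rules=("a:hover { color: red; }",),
    rendered_dom="<html></html>",
)
print(upload.has_interaction_evidence, capture.has_interaction_evidence)
```

Freezing the record reflects the requirement that every agent sees the same payload: no evaluator can mutate the evidence another one reads.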
02 · Parallel evaluation
An orchestrator agent dispatches the evaluator agents in parallel. A concurrency semaphore caps in-flight calls at four, and failed evaluations are retried with exponential backoff.
Each agent receives the same input bundle and evaluates independently. Results stream back to the client via SSE/NDJSON as each framework completes.
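The dispatch pattern described above can be sketched with `asyncio`. This is an illustrative sketch rather than the Iris orchestrator: the cap of four comes from the text, while the retry count, base delay, function names, and the shape of each NDJSON record are assumptions, and `fake_evaluate` stands in for a real model call.

```python
import asyncio
import json

MAX_IN_FLIGHT = 4     # cap stated in the text
MAX_ATTEMPTS = 3      # assumption
BASE_DELAY_S = 0.01   # assumption; kept tiny so the demo finishes quickly


async def evaluate_with_retry(name, evaluate, bundle, semaphore):
    """Run one evaluator under the semaphore, retrying with exponential backoff."""
    for attempt in range(MAX_ATTEMPTS):
        try:
            async with semaphore:          # at most MAX_IN_FLIGHT calls at once
                result = await evaluate(name, bundle)
            return {"framework": name, "status": "ok", "result": result}
        except Exception as exc:
            if attempt == MAX_ATTEMPTS - 1:
                return {"framework": name, "status": "error", "error": str(exc)}
            # Back off outside the semaphore so a waiting retry holds no slot.
            await asyncio.sleep(BASE_DELAY_S * 2 ** attempt)


async def stream_ndjson(names, evaluate, bundle):
    """Yield one JSON line per framework, in completion order."""
    semaphore = asyncio.Semaphore(MAX_IN_FLIGHT)
    tasks = [
        asyncio.create_task(evaluate_with_retry(n, evaluate, bundle, semaphore))
        for n in names
    ]
    for finished in asyncio.as_completed(tasks):
        yield json.dumps(await finished) + "\n"


async def fake_evaluate(name, bundle):
    """Stand-in for an evaluator agent call."""
    await asyncio.sleep(0)
    return {"score": 80}


async def collect(names):
    return [line async for line in stream_ndjson(names, fake_evaluate, bundle={})]


lines = asyncio.run(collect(["nielsen", "wcag"]))
print("".join(lines), end="")
```

Each yielded line is a complete JSON document, so a client can parse and render a framework's result as soon as it arrives without waiting for the rest.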
03 · Composite grade
Five heterogeneous scoring scales (0–4↓ severity, 1–10↑, 1–10↓, pass/fail, and 0–4↓ ability) are normalised to a common 0–100 range. Per-criterion severity labels (None / Low / Medium / High / Critical) are assigned by each evaluator.
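A minimal sketch of how such scales might be brought onto 0–100 follows. It assumes a simple linear mapping, reads ↓ as "lower raw value is better" and ↑ as "higher is better", and scores pass/fail as 100/0; the scale keys and function name are hypothetical, and the actual Iris normalisation may differ.

```python
# (low, high, higher_is_better) for each numeric scale; keys are hypothetical.
SCALES = {
    "severity_0_4": (0, 4, False),      # 0 = no problem, 4 = worst
    "rating_1_10_up": (1, 10, True),    # 10 = best
    "rating_1_10_down": (1, 10, False), # 1 = best
    "ability_0_4": (0, 4, False),       # 0 = no barrier, 4 = worst
}


def normalise(raw, scale):
    """Map a raw criterion score onto 0-100, where 100 is always best."""
    if scale == "pass_fail":
        return 100.0 if raw else 0.0
    low, high, higher_is_better = SCALES[scale]
    if not low <= raw <= high:
        raise ValueError(f"{raw} is outside the {scale} range {low}-{high}")
    fraction = (raw - low) / (high - low)
    return 100.0 * (fraction if higher_is_better else 1.0 - fraction)


print(normalise(2, "severity_0_4"))       # midpoint of an inverted scale
print(normalise(10, "rating_1_10_up"))    # top of an ascending scale
print(normalise(False, "pass_fail"))      # a failed check
```

Putting the direction flag in the table keeps the inversion logic in one place, so a new scale only needs a new entry.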
Framework means are averaged into a composite score mapping to a letter grade: A+ (≥90), A (≥80), B (≥70), C (≥60), D (≥50), F (<50). The per-framework breakdown remains visible below the composite score.
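The grade bands above translate directly into a lookup. In the sketch below the thresholds are taken from the text; the function names, the input shape (a dict of normalised criterion scores per framework), and the unweighted mean of framework means are assumptions about how the averaging is done.

```python
from statistics import mean

# Thresholds as listed in the text, highest first.
GRADE_BANDS = [(90, "A+"), (80, "A"), (70, "B"), (60, "C"), (50, "D")]


def letter_grade(composite):
    """Return the letter grade for a 0-100 composite score."""
    for threshold, grade in GRADE_BANDS:
        if composite >= threshold:
            return grade
    return "F"


def composite_score(framework_scores):
    """Average each framework's criterion scores, then average those means.

    framework_scores: dict mapping framework name -> list of 0-100 scores.
    Each framework counts equally, regardless of how many criteria it has.
    """
    framework_means = [mean(scores) for scores in framework_scores.values()]
    return mean(framework_means)


# Hypothetical two-framework example.
example = {"framework_a": [100, 50], "framework_b": [80]}
score = composite_score(example)
print(score, letter_grade(score))
```

Averaging means rather than pooling all criteria prevents a framework with many criteria (15 for WCAG, say) from outweighing one with five.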
Frameworks
Iris keeps each framework result separate, then provides a composite score for comparison across evaluations.
Nielsen's 10 Usability Heuristics
10 criteria · 0–4 severity scale
Shneiderman's 8 Golden Rules
8 criteria · 1–10 scale
WCAG 2.1 (Level AA)
15 criteria · pass/fail
Cognitive Load Assessment
8 criteria · 1–10 scale
Visual Design Principles
10 criteria · 1–10 scale
Google HEART Framework
5 criteria · 1–10 scale
UX Honeycomb
7 criteria · 1–10 scale
Don Norman's Design Principles
7 criteria · 1–10 scale
Fogg Behavior Model
5 criteria · 1–10 scale
Ability Heuristics
9 criteria · 0–4 severity scale
Construct Validity
We built four versions of a single webpage with systematically varying design quality, from catastrophic (every criterion deliberately violated) to Grade A, and confirmed that Iris scores increase monotonically with design quality across all 10 frameworks.
11.9
V1 Composite
Catastrophic design. Every criterion violated.
66.2
V2 Composite
Structural fix. Semantic HTML, WCAG-compliant.
74.7
V3 Composite
Full polish. Design tokens, hero, social proof.
80.4
V4 Composite
Iterative refinement guided by V3 Iris output.
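The monotonicity claim can be checked against the four composite scores reported above. The snippet below covers only those composites, since per-framework scores are not listed here; the variable names are illustrative.

```python
# Composite scores as reported for the four page versions.
composites = {"V1": 11.9, "V2": 66.2, "V3": 74.7, "V4": 80.4}

versions = list(composites)
values = list(composites.values())

# Strictly increasing from each version to the next.
is_monotonic = all(a < b for a, b in zip(values, values[1:]))

# Gain in composite points between consecutive versions.
gains = {
    f"{prev}->{curr}": round(composites[curr] - composites[prev], 1)
    for prev, curr in zip(versions, versions[1:])
}

print(is_monotonic)
print(gains)
```

The largest gain comes from the structural fix between V1 and V2, with smaller gains from the later polish and refinement steps.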