Patent pending

Iris

UI evaluation across 10 HCI frameworks.

10 agents  ·  84 criteria  ·  Criterion-level scoring

UI Overview

Evaluation interface.

The Iris interface shows the input controls, framework progress, composite score, and criterion-level matrix.

Figure: Iris evaluation interface showing source input, target page preview, composite score, and cross-framework overview table.
Figure: Iris showing parallel agent evaluation progress.

Parallel Agents

Ten agents.

Framework results stream back as each evaluator finishes.

Figure: Iris cross-framework overview table with scores, issue counts, criteria, and grades.

Composite View

Framework summary.

Scores, issue counts, criteria counts, and grades are shown in a single table.

Figure: Iris Nielsen usability heuristics detail view with criterion-level severity rows.

Criterion Detail

Criterion detail.

Each framework opens into criterion-level findings with severity, rationale, strengths, and recommendations.

The Problem

Multi-framework UX evaluation.

HCI evaluation often spans usability, accessibility, cognitive load, visual design, and behavioural psychology. Iris was built to run multiple framework-based evaluations from the same interface capture.

Iris uses 10 evaluator agents, each grounded in a distinct HCI framework, and returns colour-coded criterion matrices, severity scores, and a cross-framework composite score.

At a glance

Specialist frameworks: 10
Total criteria evaluated: 84
HTML-sensitive criteria: 38
Time to full audit: <90s

Methodology

Ten HCI frameworks.

Iris is a multi-agent system. An orchestrator agent dispatches 10 evaluator agents in parallel, each grounded in a distinct peer-reviewed HCI framework. The interface presents both the individual framework outputs and the cross-framework summary.

Markdown-defined evaluators.

Each framework is defined in a single markdown file: YAML frontmatter (name, criteria, scoring rubric) followed by a prompt template. The orchestrator reads the agents directory at startup, so adding a new framework requires authoring only one file. The interaction evidence pipeline captures live CSS, focus states, and the rendered DOM via Playwright, so evaluators reason about real interactive behaviour rather than a static screenshot.
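As a rough illustration of how markdown-defined evaluators could be discovered at startup, the sketch below splits each file into frontmatter and prompt body. The function names, the `prompt_template` key, and the tiny stdlib-only parser are assumptions made for this example; a real loader would more likely use a full YAML library, and Iris's actual file schema is not shown in this document.

```python
from pathlib import Path


def split_frontmatter(text: str) -> tuple[str, str]:
    """Split a markdown document into (frontmatter, body).

    Assumes the file opens with a '---' line and the frontmatter
    ends at the next '---' line.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("missing YAML frontmatter")
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            return "\n".join(lines[1:i]), "\n".join(lines[i + 1:]).strip()
    raise ValueError("unterminated YAML frontmatter")


def parse_simple_yaml(block: str) -> dict:
    """Parse only 'key: value' pairs and flat '- item' lists.

    A stand-in for a real YAML parser, kept stdlib-only for the sketch.
    """
    data: dict = {}
    current_key = None
    for raw in block.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("- "):
            if not isinstance(data.get(current_key), list):
                raise ValueError("list item without a list-valued key")
            data[current_key].append(line[2:].strip())
        else:
            key, _, value = line.partition(":")
            current_key = key.strip()
            # A key with no inline value starts a list.
            data[current_key] = value.strip() if value.strip() else []
    return data


def load_frameworks(directory: Path) -> dict[str, dict]:
    """Read every .md file in the directory into a framework spec."""
    frameworks = {}
    for path in sorted(directory.glob("*.md")):
        front, body = split_frontmatter(path.read_text(encoding="utf-8"))
        spec = parse_simple_yaml(front)
        spec["prompt_template"] = body  # hypothetical key name
        frameworks[spec["name"]] = spec
    return frameworks
```

Because discovery is a directory scan, a new framework file is picked up on the next startup without touching the loader.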

Agentic AI skills

Multi-agent orchestration, tool-use, parallel dispatch, model routing.

HCI skills

Framework selection, criterion-level scoring, cross-framework validation.

Engineering skills

Playwright automation, interaction evidence capture, NDJSON streaming.

Evaluation skills

Cross-framework normalisation, severity composite scoring, validity studies.

At a glance

From URL to criterion matrix.

Iris sends the same UI evidence bundle to 10 agents, normalises their scoring scales, and aggregates the result into a per-criterion severity matrix.

10 agents  ·  84 criteria  ·  <90s audit time

01 Input

URL, HTML, or image

Three input modes. For URLs and HTML, Playwright captures a full-page screenshot plus interactive CSS state and a keyboard focus screenshot.

02 Dispatch

Orchestrator agent

A Claude Haiku orchestrator dispatches 10 evaluator agents in parallel via tool-use, with a concurrency semaphore guarding the API.

03 Score

Per-criterion severity

Each evaluator returns per-criterion scores with severity ratings (None to Critical). Five heterogeneous scoring scales are normalised to a common 0–100 range.

04 Composite

Cross-framework grade

Per-framework means are averaged into a composite score that maps to a letter grade from A+ to F, with cross-framework agreement shown in the summary view.

name: Nielsen
scale: 0-4
criteria:
- visibility
- match
- control

05 Extend

Author one file

Markdown-driven agent specification. To add a new framework, drop a .md file with YAML frontmatter and a prompt template into the agents directory. No orchestrator changes required.

The System

Three stages.

01 · Input

URL, screenshot, or raw HTML.

Paste a live URL, upload a screenshot, or drop in HTML source. For URLs and HTML, Iris launches a headless Playwright browser session that captures the full-page screenshot, extracts every interactive CSS pseudo-class rule, takes a keyboard-focus screenshot, and returns the fully rendered DOM.

The input bundle is the contract between the browser layer and the evaluators. Every agent gets the same structured payload, regardless of whether the original input was a URL, a screenshot, or raw markup.
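One way to picture the bundle and the browser capture is sketched below using Playwright's Python sync API. The `EvidenceBundle` fields, the function names, the 900px viewport height, and the choice to press Tab once for the focus screenshot are assumptions for illustration, not Iris's actual implementation. The in-page script only inspects top-level rules, so rules nested inside `@media` blocks are not collected in this simplified version.

```python
from dataclasses import dataclass, field
from typing import Optional

# JavaScript run in the page: collect CSS rules whose selector uses an
# interactive pseudo-class. Reading cssRules on a cross-origin
# stylesheet throws, so those sheets are skipped.
PSEUDO_RULES_JS = """
() => {
  const wanted = [':hover', ':focus', ':active'];
  const found = [];
  for (const sheet of document.styleSheets) {
    let rules;
    try { rules = sheet.cssRules; } catch (e) { continue; }
    for (const rule of rules) {
      const sel = rule.selectorText || '';
      if (wanted.some(p => sel.includes(p))) found.push(rule.cssText);
    }
  }
  return found;
}
"""


@dataclass
class EvidenceBundle:
    """Hypothetical payload shape shared by every evaluator."""
    source_kind: str                      # "url", "html", or "image"
    screenshot_png: bytes
    focus_screenshot_png: Optional[bytes] = None
    interactive_css: list[str] = field(default_factory=list)
    rendered_dom: Optional[str] = None


def capture(source_kind: str, source: str) -> EvidenceBundle:
    """Capture a URL or raw HTML string with headless Chromium."""
    # Imported lazily so image-only use does not require Playwright.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1440, "height": 900})
        if source_kind == "url":
            page.goto(source, wait_until="networkidle")
        else:
            page.set_content(source)
        shot = page.screenshot(full_page=True)
        css = page.evaluate(PSEUDO_RULES_JS.strip())
        dom = page.content()
        page.keyboard.press("Tab")        # move focus to the first focusable element
        focus_shot = page.screenshot(full_page=True)
        browser.close()
    return EvidenceBundle(source_kind, shot, focus_shot, css, dom)


def from_image(png: bytes) -> EvidenceBundle:
    """Screenshot uploads carry no CSS, focus, or DOM evidence."""
    return EvidenceBundle("image", png)
```

Keeping the browser-only fields optional lets an uploaded image travel through the same payload type as a URL or HTML capture.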

Playwright · DOM + CSS extraction · Focus-state capture

Input your UI: https://example.com
Captured:
Full-page screenshot (1440px)
:hover, :focus, :active CSS
Keyboard focus screenshot
Rendered DOM with ARIA

Orchestrator · Dispatch
Claude Haiku orchestrating 10 agents: Nielsen, WCAG 2.1, Shneiderman…, Norman…, Gestalt…, Fogg BM…, Cognitive Load…, HEART…, Honeycomb…, Ability Heuristics
Semaphore: max 4 in-flight
Stream: SSE / NDJSON

02 · Parallel evaluation

Ten agents.

An orchestrator agent dispatches the evaluator agents in parallel. A concurrency semaphore caps in-flight calls at four, and failed evaluations are retried with exponential backoff.

Each agent receives the same input bundle and evaluates independently. Results stream back to the client via SSE/NDJSON as each framework completes.
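The dispatch pattern described here (a cap of four in-flight calls, retries with exponential backoff, results emitted as each framework finishes) can be sketched with `asyncio`. Only the cap of four comes from the text; the function names, the retry count, the base delay, and the shape of each NDJSON record are assumptions, and the evaluators are plain async callables standing in for model calls.

```python
import asyncio
import json
from typing import AsyncIterator, Awaitable, Callable

Evaluator = Callable[[dict], Awaitable[dict]]


async def run_with_retry(
    name: str,
    evaluate: Evaluator,
    bundle: dict,
    sem: asyncio.Semaphore,
    retries: int = 3,          # assumed value
    base_delay: float = 0.5,   # assumed value, in seconds
) -> dict:
    """Run one evaluator, retrying failures with exponential backoff."""
    for attempt in range(retries + 1):
        try:
            async with sem:                # limits concurrent calls
                result = await evaluate(bundle)
            return {"framework": name, "status": "ok", "result": result}
        except Exception as exc:
            if attempt == retries:
                return {"framework": name, "status": "error", "error": str(exc)}
            # Sleep outside the semaphore so a waiting retry
            # does not occupy an in-flight slot.
            await asyncio.sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")


async def dispatch(
    evaluators: dict[str, Evaluator],
    bundle: dict,
    max_in_flight: int = 4,
) -> AsyncIterator[str]:
    """Yield one NDJSON line per framework, in completion order."""
    sem = asyncio.Semaphore(max_in_flight)
    tasks = [
        asyncio.create_task(run_with_retry(name, fn, bundle, sem))
        for name, fn in evaluators.items()
    ]
    for finished in asyncio.as_completed(tasks):
        yield json.dumps(await finished) + "\n"
```

All ten tasks are created up front, but the semaphore means only four evaluator calls run at once; the rest wait for a slot.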

Tool-use orchestration · Concurrency semaphore · Streaming results

03 · Composite grade

Composite scoring.

Five heterogeneous scoring scales (0–4↓ severity, 1–10↑, 1–10↓, pass/fail, and 0–4↓ ability) are normalised to a common 0–100 range. Per-criterion severity labels (None / Low / Medium / High / Critical) are assigned by each evaluator.
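A minimal sketch of cross-scale normalisation follows. It assumes a linear min-max mapping, reads ↓ as "lower raw value is better" (so the scale is inverted), and treats pass/fail as 1/0. The document does not state the exact formula Iris uses, so the `Scale` type, the scale keys, and the one-decimal rounding are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scale:
    low: float
    high: float
    higher_is_better: bool


# Hypothetical keys for the five scales named in the text.
SCALES = {
    "severity_0_4": Scale(0, 4, higher_is_better=False),
    "rating_1_10_up": Scale(1, 10, higher_is_better=True),
    "rating_1_10_down": Scale(1, 10, higher_is_better=False),
    "pass_fail": Scale(0, 1, higher_is_better=True),   # fail=0, pass=1
    "ability_0_4": Scale(0, 4, higher_is_better=False),
}


def normalise(raw: float, scale: Scale) -> float:
    """Map a raw score onto 0-100, where 100 is always best."""
    if not scale.low <= raw <= scale.high:
        raise ValueError(f"{raw} is outside [{scale.low}, {scale.high}]")
    fraction = (raw - scale.low) / (scale.high - scale.low)
    if not scale.higher_is_better:
        fraction = 1.0 - fraction
    return round(100.0 * fraction, 1)
```

Under these assumptions a severity of 1 on the 0-4 scale becomes 75.0, and a 1 on a lower-is-better 1-10 scale becomes 100.0.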

Framework means are averaged into a composite score that maps to a letter grade: A+ (≥90), A (≥80), B (≥70), C (≥60), D (≥50), F (<50). The per-framework breakdown remains visible below the composite score.
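The grade bands translate directly into a threshold lookup. The sketch below assumes an unweighted mean of the per-framework scores and grades the unrounded value, rounding only for display; the function names are illustrative.

```python
from statistics import mean

# Lower bounds taken from the grade bands in the text, highest first.
GRADE_BANDS = [(90, "A+"), (80, "A"), (70, "B"), (60, "C"), (50, "D")]


def letter_grade(score: float) -> str:
    """Return the first band whose lower bound the score meets."""
    for threshold, grade in GRADE_BANDS:
        if score >= threshold:
            return grade
    return "F"


def composite(framework_means: dict[str, float]) -> tuple[float, str]:
    """Average normalised framework means and attach a letter grade."""
    score = mean(framework_means.values())
    return round(score, 1), letter_grade(score)
```

For instance, `letter_grade(82.4)` falls in the A band and `letter_grade(66.2)` in the C band.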

Cross-scale normalisation · Letter grade · Per-criterion severity

Iris audit · Composite
Composite score: 82.4 (grade A)
Nielsen: 88
WCAG 2.1: 84
Norman: 76
Cognitive Load: 81
Ability Heuristics: 68
+ 5 more frameworks

Frameworks

Ten peer-reviewed frameworks.

Iris keeps each framework result separate, then provides a composite score for comparison across evaluations.

01 · Nielsen's 10 Usability Heuristics · 10 criteria · 0–4 severity scale
02 · Shneiderman's 8 Golden Rules · 8 criteria · 1–10 scale
03 · WCAG 2.1 (Level AA) · 15 criteria · pass/fail
04 · Cognitive Load Assessment · 8 criteria · 1–10 scale
05 · Visual Design Principles · 10 criteria · 1–10 scale
06 · Google HEART Framework · 5 criteria · 1–10 scale
07 · UX Honeycomb · 7 criteria · 1–10 scale
08 · Don Norman's Design Principles · 7 criteria · 1–10 scale
09 · Fogg Behavior Model · 5 criteria · 1–10 scale
10 · Ability Heuristics · 9 criteria · 0–4 severity scale

Construct Validity

Measuring design quality.

We built four versions of a single webpage with systematically varying design quality, from catastrophic (every criterion deliberately violated) to Grade A, and confirmed that Iris scores increase monotonically with design quality across all 10 frameworks.

V1 Composite: 11.9 (F). Catastrophic design. Every criterion violated.
V2 Composite: 66.2 (C). Structural fix. Semantic HTML, WCAG-compliant.
V3 Composite: 74.7 (B). Full polish. Design tokens, hero, social proof.
V4 Composite: 80.4 (A). Iterative refinement guided by V3 Iris output.
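The monotonicity check behind this study can be expressed in a few lines. The sketch below uses the four composite scores reported here; the per-framework scores are not published in this document, so `non_monotonic_frameworks` is a hypothetical helper that shows the shape of the same check run over each framework's V1 to V4 scores.

```python
def is_strictly_increasing(scores: list[float]) -> bool:
    """True when every score exceeds the one before it."""
    return all(a < b for a, b in zip(scores, scores[1:]))


def non_monotonic_frameworks(by_framework: dict[str, list[float]]) -> list[str]:
    """Names of frameworks whose scores fail to rise version over version."""
    return [
        name for name, scores in by_framework.items()
        if not is_strictly_increasing(scores)
    ]


# Composite scores for V1 to V4 as reported in the study.
COMPOSITES = [11.9, 66.2, 74.7, 80.4]
assert is_strictly_increasing(COMPOSITES)
```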