Documentation
Design
Architecture, source layout, lifecycles, safety boundaries, and testing model.
a11y-catscan -- design
On this page
a11y-catscan is a multi-engine WCAG accessibility scanner. It crawls HTML pages with Playwright/Chromium, runs one or more accessibility engines against each page, normalizes their findings into one shared EARL-shaped result model, and streams reports to disk as it goes.
The central design target is practical audit work on large, real sites:
- A scan can run for thousands of pages without holding all findings in memory.
- Partial results survive process termination.
- A failed browser, engine, or auth session is visible instead of quietly producing a clean-looking report.
- The same page-scan code path is reusable from the CLI crawl loop, the MCP server, tests, and small one-off verification scripts.
The scanner is not a replacement for manual WCAG judgment. It is an
evidence collector: automated failures are prioritized, cantTell
items are surfaced for review, and engine attribution is preserved so
auditors can decide whether a finding is broadly confirmed or tool
specific.
Non-goals
- Full WCAG certification. Automated engines cannot verify every WCAG requirement and cannot reason about intent.
- Scanning every resource type. Scope is HTML/XHTML pages rendered in a browser. PDFs, Office documents, videos, audio, and other media are explicitly out of scope.
- A general browser automation framework. The scanner exposes enough hooks for login and session checking, but page interaction beyond authentication belongs in site-specific plugins or pre-scan setup.
- A long-running MCP crawl service. Full crawls write files, handle signals, and may run for hours; they stay in the CLI. MCP tools are quick, bounded analysis and one-page scan helpers.
Execution surfaces
There are three supported ways into the codebase:
| Surface | Entry point | Responsibility |
|---|---|---|
| CLI | ./a11y-catscan.py |
Parse operator flags, load YAML config, select scan/analysis mode, write final summaries, register completed scans |
| Library | scanner.Scanner |
Start browser + engines, scan one page, normalize selectors, manage auth context |
| MCP | mcp_server.py or a11y-catscan.py --mcp |
Expose bounded tools for one-page scanning, report analysis, registry lookup, and WCAG metadata |
The CLI imports the same public functions used by tests and external
integrations. Some names are intentionally re-exported from
a11y-catscan.py so older scripts that imported the single-file tool
continue to work after the module split.
Source layout
| File | Responsibility |
|---|---|
a11y-catscan.py |
CLI entry point. Defines flags, loads config, dispatches analysis modes, resolves crawl settings, invokes crawl_and_scan, prints final summaries, registers scans, and starts MCP mode when --mcp is present |
scanner.py |
Reusable async scanner. Owns Playwright lifecycle, engine construction/start/stop, authenticated browser context, login plugin loading, saved storage state, per-page navigation, element resolution, browser restart, and session checks |
crawl.py |
Crawl engine. Owns URL frontier, visited set, worker sliding window, shared rate limiter, robots integration, incremental JSONL writes, report flushing, signal handlers, resume state, browser restart cadence, and auth recovery cycle |
crawl_utils.py |
Browser-free crawl helpers: URL normalization, strip-query rules, same-origin checks, robots.txt loading, should_scan, HEAD/GET HTTP probe, cookie loading, process cleanup, numeric parsing |
results.py |
Page-result helpers: node counting, RunningTotals, and cross-engine deduplication by (selector, primary_tag, outcome) |
allowlist.py |
YAML allowlist loader and indexed matcher. Filters known-acceptable findings by normalized rule id, URL substring, target substring, engine, and outcome. Also classifies deduped page totals |
engine_mappings.py |
WCAG Success Criterion metadata, SC aliases, engine rule mappings, best-practice categories, ARIA categories, HTMLCS rule-code parsing, and shared EARL constants |
engines/base.py |
Engine interface and normalized result schema documentation |
engines/axe.py |
axe-core adapter: loads axe.min.js, injects it into the page, runs axe.run, normalizes tags and best-practice / ARIA categories |
engines/alfa.py |
Python side of Siteimprove Alfa integration. Starts alfa-engine.mjs, connects to its browser server, serializes scans through an async lock, forwards cookies, and normalizes Alfa results |
alfa-engine.mjs |
Node.js Alfa subprocess. Launches Chromium through Playwright launchServer, filters Alfa ACT rules by WCAG level/version, scans pages in isolated contexts, resolves selectors in the DOM, and returns newline-delimited JSON |
engines/ibm.py |
IBM Equal Access adapter: injects ace.js, runs IBM's checker, maps policy/confidence pairs to EARL outcomes and normalized tags |
engines/htmlcs.py |
HTML_CodeSniffer adapter: injects HTMLCS.js, runs a WCAG standard, maps ERROR/WARNING messages to failed/cantTell |
engines/__init__.py |
Engine registry and make_engine factory with engine-specific kwargs |
report_io.py |
Streaming JSON / JSONL readers plus iter_deduped and URL extraction helpers |
report_html.py |
Streaming human-readable HTML report generator. Aggregates stats in one pass, renders per-page details in a second pass |
report_llm.py |
Token-efficient Markdown report for LLM-assisted remediation. Groups violations and incompletes, includes representative HTML snippets, and points to full reports |
report_group.py |
--group-by terminal summary printer |
report_diff.py |
CLI text diff between two JSONL scans |
registry.py |
Named scan registry plus search, page status, and structured diff utilities used by CLI and MCP |
cli_modes.py |
No-scan CLI modes: cleanup, list scans, page status, search, audit help |
mcp_server.py |
FastMCP tool definitions and MCP-specific safety checks |
login-hubzero.py |
Example login plugin implementing the scanner auth contract |
Result model
Every engine returns a list of normalized findings:
{
"id": "color-contrast",
"engine": "axe",
"outcome": "failed",
"description": "...",
"help": "...",
"helpUrl": "...",
"impact": "serious",
"tags": ["sc-1.4.3"],
"nodes": [{
"target": ["#main"],
"html": "<main>...</main>",
"any": [{"message": "..."}],
}],
}
Outcomes use W3C EARL names:
| Outcome | Meaning |
|---|---|
failed |
Automated check found a definite failure |
cantTell |
Automated check is inconclusive; manual review needed |
passed |
Rule passed |
inapplicable |
Rule does not apply |
Tags are normalized across engines:
| Prefix | Meaning | Example |
|---|---|---|
sc- |
WCAG Success Criterion | sc-1.4.3 |
aria- |
ARIA-specific category | aria-valid-attrs |
bp- |
Best-practice category | bp-landmarks |
The sc- prefix is deliberate. Native engine tags such as
wcag2aa, wcag21aa, and wcag143 mix WCAG version, level, and
criterion identifiers. The normalized sc-* vocabulary names only
Success Criteria, which keeps reports and search filters unambiguous.
Engine model
All engines share the same Playwright browser when possible, but they do not all run the same way.
| Engine | Mechanism | Notes |
|---|---|---|
| axe-core | Browser injection | Loads node_modules/axe-core/axe.min.js, calls axe.run(document, opts) |
| IBM Equal Access | Browser injection | Loads accessibility-checker-engine/ace.js, calls new ace.Checker().check |
| HTML_CodeSniffer | Browser injection | Loads html_codesniffer/build/HTMLCS.js, calls HTMLCS.process |
| Siteimprove Alfa | Node.js subprocess | alfa-engine.mjs launches a Playwright browser server and runs Alfa ACT rules |
Alfa is special because it is a TypeScript ecosystem. Instead of
opening an unauthenticated debug port, the subprocess launches Chromium
with Playwright launchServer() and sends Python the GUID-bearing
WebSocket endpoint over stdout. The GUID acts as a bearer token.
Python connects to that endpoint and uses the same browser server for
its own page work. Alfa scans are serialized by an asyncio.Lock
because the subprocess protocol is one request/response stream.
If every selected engine fails to start, scanning fails closed. This matters for audit integrity: missing dependencies, broken browser binaries, and bad engine configuration should be setup errors, not "clean" reports.
Page scan lifecycle
Scanner.scan_page(url, extract_links=False, dedup=True) performs the
unit of work the rest of the system builds on:
- Require
Scanner.start()to have launched Playwright and started at least one engine. - Open a fresh page from the authenticated context when auth is configured; otherwise open a new browser page.
- Navigate with the configured
wait_untilstrategy and timeout. - Reject browser responses that are missing, HTTP 4xx/5xx, non-HTML, too small, or DOMs whose rendered document is not HTML.
- Run each started engine against the page.
- Normalize engine-native element references through a live-DOM resolver. CSS selectors, IBM XPath, and Alfa tag/attribute hints are converted into deterministic CSS selectors when possible.
- Optionally extract HTTP links for the crawl frontier.
- Attach
session_activewhen a login plugin providesis_logged_in(page). - Close the page in a
finallyblock. - Return a page-result dict, or a structured skipped result with a reason.
Engine scan failures are intentionally local to that engine's result list; a single noisy engine should not abort a page when other engines still produced useful data. Startup failure of all engines is different and is treated as fatal.
Crawl lifecycle
crawl_and_scan() owns long-running site traversal. Its data structures
are deliberately plain:
queue: URLs waiting to be scanned.visited: normalized URLs already claimed by the frontier._logout_urls: URLs proven to trigger logout and banned from future visits.pending: active asyncio tasks in the worker sliding window.running_totals: summary counters updated as pages are written.
The main loop keeps at most workers scan tasks in flight. Worker tasks
never write report files directly; they return complete page results.
The main loop consumes finished tasks, updates counters, writes one
JSONL line, and enqueues discovered links. That single-writer shape is
important enough to be documented at the write site because consumers
assume a line is one complete page and never half of two interleaved
pages.
URL filtering
URLs are normalized before they enter the frontier:
- Fragment removed.
- Path trailing slash normalized.
- Configured query parameters stripped globally or by path regex.
should_scan() then applies:
- Same-origin check.
- Static non-HTML extension skip list.
- Include path prefixes.
- Exclude path prefixes.
- Exclude regex list.
- Known non-HTML query patterns.
- robots.txt, unless disabled.
The HTTP HEAD/GET probe in crawl_utils.http_status() is a cheap
pre-browser filter for obvious non-HTML or error responses. It sends
configured auth cookies when available so authenticated pages do not
look like redirects to login.
Streaming persistence
JSONL is the primary write format:
{"https://example.test/page": {"url": "...", "failed": [], "...": "..."}}
The crawler opens the .jsonl once and appends one line per scanned
page. Each line is flushed immediately so signal-triggered snapshots
and concurrent readers see complete records.
Periodic and final flushes convert JSONL into:
.json: one large JSON object for compatibility and inspection..html: human-readable report..md: optional LLM summary..state.json: crawl queue / visited / logout ban state for resume.
The JSON and HTML products are derived artifacts. If a process dies mid-run, JSONL plus state are the recovery surface.
Signals
The crawl loop installs:
SIGTERM/SIGINT: set an interrupted flag, flush reports, save state, then let the async loop wind down.SIGUSR1: flush and save a snapshot without stopping.
Signal handlers avoid directly exiting the interpreter. They set state that the async loop observes; this keeps Playwright cleanup predictable and avoids orphaned browser processes where possible.
Authentication model
Authentication is plugin-based. A login script is a Python module with:
async def login(context, config) -> bool: ...
Optional hooks:
async def is_logged_in(page) -> bool: ...
async def init_from_context(context) -> None: ...
exclude_paths = ["/logout"]
The scanner stores Playwright storage_state in .auth-state.json.
Startup first tries saved state, verifies it by loading the configured
start URL, and falls back to login() when the state is missing or no
longer valid.
During a crawl, is_logged_in(page) is checked after successful page
scans. If the session is lost:
- The suspect URL is recorded.
- Recovery mode is entered.
- In-flight workers drain.
- The scanner re-authenticates.
- Suspects are tested serially.
- URLs that immediately break the session are added to
_logout_urls. - Safe suspects are requeued for a real scan.
If re-login fails, recovery is disabled for the rest of the run. That is intentionally imperfect but bounded: a broken auth plugin should not trap the crawler in an infinite recovery loop.
Report consumers
All report consumers read JSONL through report_io:
iter_jsonl()tolerates blank and corrupt lines, warning to stderr.iter_report()auto-detects JSON vs JSONL.iter_deduped()applies cross-engine dedup per page.
Dedup is done on read rather than at write time. Raw JSONL preserves
engine-native findings and keeps the on-disk format stable for existing
tools. Consumers that need presentation-level findings use
iter_deduped().
HTML report
report_html.generate_html_report() makes two streaming passes:
- Aggregate totals, impact counts, WCAG criteria, rule summary, and incomplete summary.
- Render per-page details and clean page list.
All page-sourced strings are escaped before entering HTML. Impact values are whitelisted before being used in CSS class names.
LLM report
report_llm.generate_llm_report() is intentionally compact. It groups
violations by normalized rule id and incompletes by axe messageKey
when available. Deduped findings may no longer have messageKey, so
they fall into an (unknown) bucket rather than disappearing.
HTML snippets from scanned pages are emitted inside fenced code blocks, with fence delimiters escaped. This is not a complete prompt-injection solution, but it prevents the most direct "break out of the code block" failure mode when the report is pasted into an LLM.
MCP safety model
The MCP server is a local tool surface that may be driven by an LLM client. It is therefore treated as a security boundary even though it runs on the user's machine.
scan_page validates:
- URL is a non-empty string.
- Scheme is
httporhttps. - Host is present.
- Host does not resolve to private, loopback, link-local, reserved, multicast, or unspecified IP space.
The private-address restriction can be disabled with
A11Y_CATSCAN_MCP_ALLOW_PRIVATE=1, intended for trusted local tests or
explicit operator use.
Report-analysis tools resolve direct paths only when they look like
report files (.json / .jsonl), or through the scan registry. This
keeps prompt-injected report requests from becoming arbitrary local
file reads.
Security notes
cookies.json,.auth-state.json, local config, reports, virtualenvs, and tool caches are gitignored. Tracked credentials are a release blocker; if a cookie file ever lands in history, revoke the sessions as well as removing the file from the current tree.yaml.safe_load()is used for config and allowlist files.- Subprocess execution is list-based, not shell-based.
- Browser cleanup tracks launched browser PIDs and child Chromium
processes. Manual
--cleanupis intentionally narrowed to Playwright-looking Chromium processes. - Output basenames are validated so
--namecannot write outside the configured output directory. - MCP
scan_pagereturns scanner setup errors as errors, not as clean skipped pages.
Testing model
The suite is split between pure helpers and browser-backed tests:
- Pure tests cover URL normalization, allowlist matching, report IO, registry/search/diff, grouped summaries, CLI argument behavior, and MCP tool JSON shapes.
- Browser tests use Playwright and local HTTP fixture servers to verify real engine execution, link extraction, auth plugin behavior, browser restart, and one-page MCP scans.
- Signal tests exercise flush/state behavior under SIGTERM and SIGUSR1.
The project-local command is:
.venv/bin/pytest -q
In sandboxed environments, browser and local-socket tests may require
running outside the filesystem/network sandbox because they bind
127.0.0.1 and launch Chromium.
Style
Python source follows the existing local style rather than an external formatter profile:
- Python 3.12 syntax is allowed and expected.
- Keep modules focused and browser-free where possible.
- Prefer small helper functions over inline ad hoc parsing when the behavior is shared by CLI, MCP, and tests.
- Comments explain invariants and failure-mode choices, not obvious assignments.
- Streaming readers/writers are preferred for scan-size-dependent data.
- Error handling should make audit integrity explicit: a failed setup is not a clean page; a page-specific scan failure is a skipped page with a reason.
The useful local checks are:
python3.12 -m compileall -q .
.venv/bin/pytest -q
npm audit --omit=dev