How do the 58 safety gates work in Mrs. Kitty AI Controller?

Gates are PreToolUse hooks written in Python that intercept every tool call Claude Code makes BEFORE it executes. They run inside pretooluse.py (4,877 lines) as a sequential pipeline. Each gate evaluates the proposed action and either allows it, blocks it, or requires additional steps. If any gate crashes, all edits stop (fail-closed design).
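As a rough illustration of the fail-closed behavior described above, here is a minimal sketch of a sequential gate pipeline. The names `Verdict`, `Gate`, and `run_pipeline` are hypothetical and are not taken from pretooluse.py; the sketch only assumes that each gate is a callable returning an allow/block verdict.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Verdict:
    allowed: bool
    reason: str = ""


# Hypothetical gate signature: takes the proposed tool call, returns a Verdict.
Gate = Callable[[dict], Verdict]


def run_pipeline(gates: Iterable[Gate], tool_call: dict) -> Verdict:
    """Run gates in order and stop at the first block. A crash counts as a block."""
    for gate in gates:
        try:
            verdict = gate(tool_call)
        except Exception as exc:  # fail-closed: a broken gate never allows an edit
            name = getattr(gate, "__name__", "gate")
            return Verdict(False, f"{name} crashed: {exc}")
        if not verdict.allowed:
            return verdict
    return Verdict(True)
```

The important design choice is the `except` branch: an exception is converted into a block rather than being skipped, so a bug in one gate cannot silently open the pipeline.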

What is the 100% confidence engine?

The Thinking Partner gate (CHECK 02) uses a 5-factor weighted scoring system with a Reflexion loop (Shinn et al. 2023) where the AI evaluates its own reasoning quality. The AI must reach 100% confidence before any code edit is allowed. Factors include task understanding, research completeness, solution specificity, risk assessment, and implementation readiness.
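A minimal sketch of how a five-factor score of this kind could be combined, assuming each factor is rated between 0 and 1 and carries an equal 20-point weight. The factor keys mirror the list above; the function names and the 0 to 1 rating scale are assumptions made for illustration, not the controller's actual API.

```python
# Factor names follow the list above; the equal 20-point weights are an assumption.
WEIGHTS = {
    "task_understanding": 20,
    "research_completeness": 20,
    "solution_specificity": 20,
    "risk_assessment": 20,
    "implementation_readiness": 20,
}


def confidence_score(ratings: dict[str, float]) -> float:
    """Combine per-factor ratings in [0, 1] into a 0-100 score."""
    total = 0.0
    for factor, weight in WEIGHTS.items():
        rating = min(max(ratings.get(factor, 0.0), 0.0), 1.0)  # a missing factor scores 0
        total += weight * rating
    return total


def gaps(ratings: dict[str, float]) -> list[str]:
    """Factors that are not fully satisfied, i.e. what still needs research."""
    return [factor for factor in WEIGHTS if ratings.get(factor, 0.0) < 1.0]


def edit_allowed(ratings: dict[str, float]) -> bool:
    """Edits are allowed only at a full score of 100."""
    return confidence_score(ratings) >= 100.0
```

With this shape, the list returned by `gaps` is what a block message could report back to the AI as the areas to research before retrying.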

How does the TDD enforcement loop work?

The gate system enforces Test-Driven Development: write a failing test first, implement the minimum code to pass it, then refactor. Gate 10 (Test-Before-Deploy) blocks deployment until all tests pass. Gate 11 (PASS Lock) was designed to use SHA-256 hashes so that any file change after tests pass forces a complete re-test; it has been disabled since February 2026, and Gate 10 and Gate 13 now handle its responsibilities. Gate 13 (Completion Blocker) checks actual filesystem state, not AI claims.

system_architecture.md

Every edit passes through 58 sequential gates. Not suggestions. Not lint warnings.
Infrastructure-level blocks the AI cannot see, cannot reason about, and cannot bypass.

the pipeline

pretooluse.py — 5,874 lines of enforcement

24,088 lines · 100% confidence · <10ms per gate

gate deep-dives

M Maintenance Mode Emergency Override ▶

Sometimes you need the AI to just do what you say — skip the verification gates and work at full speed. Maintenance mode bypasses all PreToolUse gates, but critical safety mechanisms still run.

// Activation — AskUserQuestion with "maintenance mode":
USER: "Enable maintenance mode"
→ MAINTENANCE_MODE file created (PostToolUse auto-creates)
→ All 30 PreToolUse gates bypassed

// What STILL runs in maintenance mode:
→ Backup before every edit (version_manager)
→ All PostToolUse hooks (journal, file tracking, learning)
→ Stop hook BLOCKS until maintenance disabled
→ Automation Enforcer approval checks
→ Auto-expire TTL: 1 hour safety net

// Deactivation (MANDATORY before stopping):
CLAUDE: rm ~/.ai-controller/MAINTENANCE_MODE
→ All gates re-engaged

Why This Exists

Batch edits across 10 files would trigger gate checks on every single edit — making it impossibly slow. Maintenance mode lets you do bulk work in one burst, then re-engage all safety checks when you're done. Critical insight: backups still run, the stop hook still blocks, and PostToolUse hooks still fire. You skip the verification gates, but your files are always safe to roll back, and the AI cannot silently finish while maintenance mode is active.

0.5 Plan Mode Required Hard Block ▶

Before any code file (.py, .js, .ts, .cpp, etc.) can be edited, the AI must have entered plan mode and received explicit approval.

IF is_code_file(target):
  FOR path IN [
    1. plan-approved.json exists?
    2. pending-implementation.json claimable?
    3. implementation-state phase=implementing?
    4. CWD-based implementation state?
    5. File mentioned in approved plan text?
  ]:
    IF any path TRUE → CONTINUE
  ALL FALSE → BLOCK: "Call EnterPlanMode"

Design Innovation

Five independent approval paths means the system is resilient to state file corruption — any single proof of planning suffices. The AI can't accidentally bypass by deleting one marker file.
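The "any single proof suffices" logic can be sketched as follows. This is an illustration under stated assumptions: it covers four of the five paths (the CWD-based check is omitted), and the JSON field names `claimable` and `phase`, the file name `approved-plan.md`, and the `.json` suffix on `implementation-state` are hypothetical.

```python
import json
from pathlib import Path

# Subset of the extensions named in the text; the full list is not given there.
CODE_SUFFIXES = {".py", ".js", ".ts", ".cpp"}


def _load(path: Path) -> dict:
    """Read a JSON state file. A missing or corrupt file proves nothing."""
    try:
        data = json.loads(path.read_text())
    except (OSError, ValueError):
        return {}
    return data if isinstance(data, dict) else {}


def plan_gate(target: str, state_dir: Path) -> str:
    """Allow a code edit if ANY independent proof of planning holds."""
    if Path(target).suffix not in CODE_SUFFIXES:
        return "CONTINUE"
    try:
        plan_text = (state_dir / "approved-plan.md").read_text()  # hypothetical name
    except (OSError, ValueError):
        plan_text = ""
    proofs = (
        (state_dir / "plan-approved.json").exists(),
        bool(_load(state_dir / "pending-implementation.json").get("claimable")),
        _load(state_dir / "implementation-state.json").get("phase") == "implementing",
        Path(target).name in plan_text,
    )
    return "CONTINUE" if any(proofs) else 'BLOCK: "Call EnterPlanMode"'
```

Because every proof is evaluated independently and unreadable files simply count as "no proof", a corrupt state file can only remove one path to approval; it cannot raise an error that lets the edit through.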

0.5b Plan File Enforcement Hard Block ▶

Plan markdown files (in /plans/) can only be edited while plan mode is active. This prevents the AI from silently modifying approved plans after the fact.

IF target is plan file (.md in /plans/):
  IF plan mode NOT active (no plan-entered marker):
    → BLOCK: "Plan files can only be edited in plan mode"
  ELSE → CONTINUE

Why This Exists

Plan files ARE the artifact the user approves. If the AI could edit them outside plan mode, it could retroactively change an approved plan to match what it actually built — defeating the purpose of plan approval entirely.

01 Project Analyzer Always On ▶

Forces the AI to read and understand the entire project structure before touching anything. Always runs for code files — only maintenance mode bypasses this gate.

ALWAYS runs for code files // BUG-10 fix: never skipped
SCAN transcript for "I understand:" patterns
IF found → CONTINUE
ELSE → BLOCK: "Explore codebase first"

1.5 Memory Recall Self-Consuming ▶

Searches 826KB+ of accumulated learnings. If this problem was solved before, surfaces the solution automatically using TF-IDF scored retrieval.

IF pending-memory-ack-{session}.txt exists:
  Read {count, summary} from marker file
  DELETE marker file // self-consuming!
  → BLOCK once: "Review {count} past learnings"
ELSE → CONTINUE

Design Innovation

Self-consuming marker: the gate reads the file, then deletes it. Next attempt passes because the marker no longer exists. One-shot enforcement that destroys its own evidence after acknowledgment.
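A minimal sketch of the self-consuming marker pattern, assuming the marker file holds JSON with `count` and `summary` keys. The on-disk format and the function name are assumptions; only the marker file name comes from the pseudo-code above.

```python
import json
from pathlib import Path


def memory_recall_gate(state_dir: Path, session: str) -> str:
    """Block exactly once per marker: read it, delete it, then report."""
    marker = state_dir / f"pending-memory-ack-{session}.txt"
    try:
        data = json.loads(marker.read_text())
    except FileNotFoundError:
        return "CONTINUE"  # no marker: nothing to acknowledge
    except ValueError:
        data = {}  # unreadable marker: still consume it so the gate cannot block forever
    if not isinstance(data, dict):
        data = {}
    marker.unlink(missing_ok=True)  # self-consuming: the next call passes
    count = data.get("count", "?")
    return f'BLOCK: "Review {count} past learnings"'
```

Calling the function twice in a row returns a block and then `CONTINUE`, which is the one-shot behavior the gate relies on.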

1.7 Multi-Task TODO Acknowledgment Self-Consuming ▶

When the user provides multiple tasks (via TodoWrite or numbered lists), this gate blocks implementation until the AI explicitly acknowledges the full task list. Prevents partial execution where the AI fixates on one task and forgets the rest.

IF pending-todo-ack-{session}.txt exists:
  Read {task_count, task_summary} from marker file
  DELETE marker file // self-consuming!
  → BLOCK once: "Acknowledge {task_count} tasks before proceeding"
ELSE → CONTINUE

Design Innovation

Same self-consuming marker pattern as Gate 1.5. The marker is written by the PostToolUse hook when it detects multi-task creation, then this gate reads and destroys it on acknowledgment. One-shot enforcement — zero false positives on subsequent tool calls.

02 Thinking Partner Reflexion ▶

Implements the Reflexion architecture (Shinn et al., 2023). The AI self-reflects on assumptions, asks informed questions, and scores confidence on a weighted 100-point scale.

IF risk == NONE/LOW OR plan_approved → SKIP
IF reflexion_triggered (2 failures):
→ BLOCK: "Reflect before retrying"
IF confidence < 100%:
→ BLOCK: "WebSearch for gaps: {gaps}"
ELSE → CONTINUE

Design Innovation

Based on the Reflexion paper (Shinn et al., 2023). When the AI fails twice, a feedback loop forces it to explicitly self-reflect before retrying — preventing blind repetition.

2b Post-Failure Research Hard Block ▶

After any failure, the AI cannot retry the same approach. It must do new research first.

IF failure_count >= 1 AND no research since failure:
→ BLOCK: "WebSearch before retrying"

2c Read-Before-Edit for Config Files Hard Block ▶

Blocks Edit/Write operations on configuration files unless the AI has Read the file first in the current session. Prevents blind overwrites of critical configs like .env, docker-compose.yml, tsconfig.json, and hook Python files.

IF tool == Edit/Write AND is_config_file(path):
  IF path NOT in session_read_files:
    → BLOCK: "Read {file} before modifying"
  ELSE → CONTINUE

Why This Matters

AI agents frequently overwrite config files based on assumptions rather than reading the current state. This gate forces "look before you leap" behavior specifically for files where a wrong edit can break the entire environment.

2.5 Hedging Scanner AI Self-Check ▶

Scans the AI's own output for uncertainty language. If the AI isn't sure, it can't proceed.

SCAN last 8KB of transcript after last WebSearch:
  Pattern: "I'm not sure"
  Pattern: "I don't know if"
  Pattern: "N% confident" for any N from 10 to 79
IF hedge_found:
  → BLOCK: "WebSearch to verify uncertain claim"

Design Innovation

The AI literally blocks itself. It can't hedge and proceed — the system reads the AI's own text and forces research when it detects uncertainty. The AI has no idea this is happening.
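One way the scan could be expressed in Python is sketched below. The confidence pattern is written as a real numeric range (a tens digit from 1 to 7 followed by any digit). Locating "the last WebSearch" by searching for the literal string `WebSearch` and measuring the window in characters rather than bytes are simplifying assumptions.

```python
import re

HEDGE_PATTERNS = [
    re.compile(r"\bI'm not sure\b", re.IGNORECASE),
    re.compile(r"\bI don't know if\b", re.IGNORECASE),
    # 10-79 as a numeric range: tens digit 1-7, then any digit, not preceded by a digit.
    re.compile(r"(?<!\d)[1-7]\d% confident", re.IGNORECASE),
]
WINDOW_CHARS = 8 * 1024  # approximates the 8KB window with a character count


def find_hedges(transcript: str, search_marker: str = "WebSearch") -> list[str]:
    """Return hedging phrases found after the last search, within the last 8K characters."""
    start = transcript.rfind(search_marker)
    tail = transcript[start:] if start != -1 else transcript
    tail = tail[-WINDOW_CHARS:]
    return [m.group(0) for p in HEDGE_PATTERNS for m in p.finditer(tail)]
```

A non-empty result would correspond to the `hedge_found` branch in the pseudo-code above.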

2.5b Bug Resolution Tracker Always On ▶

Tracks active bug investigations. If the AI is debugging a problem, this gate prevents it from abandoning the investigation to start new work.

ALWAYS runs // cheap gate, never skipped
IF active_bug_resolution AND attempting unrelated edit:
→ BLOCK: "Finish debugging before starting new work"
IF bug marked resolved AND no verification test:
→ BLOCK: "Prove the fix works — run tests"

Why This Matters

AI agents love to context-switch away from hard bugs. They'll abandon a half-debugged problem and start on something easier. This gate forces them to finish what they started — or explicitly escalate to you.

03 Research Saturation Risk-Routed ▶

Tracks research actions during the session. The AI can't start coding until it's done enough homework — proportional to task complexity.

// Thresholds by complexity:
TRIVIAL → 3 research points required
LOW → 3 points
MEDIUM → 3 points (min 1 WebSearch required)
HIGH → 6 points

IF action_count < threshold:
→ BLOCK: "Need {N} more WebSearch calls"
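The threshold table above maps naturally onto a lookup. The sketch below is illustrative only: the function name, the return strings, and treating the MEDIUM WebSearch minimum as a separate check are assumptions.

```python
# Research points required per complexity level, as listed above.
THRESHOLDS = {"TRIVIAL": 3, "LOW": 3, "MEDIUM": 3, "HIGH": 6}
# Only MEDIUM lists an explicit WebSearch minimum in the table.
MIN_WEB_SEARCHES = {"MEDIUM": 1}


def research_gate(complexity: str, points: int, web_searches: int) -> str:
    """Block until the session has enough research points for the task's complexity."""
    missing = THRESHOLDS[complexity] - points
    if missing > 0:
        return f'BLOCK: "Need {missing} more research points"'
    if web_searches < MIN_WEB_SEARCHES.get(complexity, 0):
        return 'BLOCK: "At least one WebSearch required"'
    return "CONTINUE"
```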

0.4 Universal Plan Lock Hard Block ▶

The main research enforcement gate. Uses a complexity classifier to determine the current task's difficulty, then requires proportional research effort before allowing implementation. Caches classifications per-session to avoid redundant LLM calls.

// Step 1: Classify task complexity (cached per session)
complexity = classify(prompt) // TRIVIAL, LOW, MEDIUM, HIGH

IF risk == NONE/LOW → SKIP // skip_behavioral
IF plan_approved → SKIP

// Step 2: Check research requirements
IF tool == Write/Edit AND research_count < required:
→ BLOCK: "Do {required} research actions first"
IF tool == Write/Edit AND no WebSearch/WebFetch in session:
→ BLOCK: "At least one WebSearch required"

Design Innovation

The complexity classifier runs once and is cached for the entire session, avoiding repeated LLM inference. The cache key includes the user's original prompt hash, so follow-up messages within the same task don't reclassify. Research actions include WebSearch, WebFetch, Read (external docs), and Grep across multiple files.

05 Research Persistence Soft Block ▶

Forces the AI to share its research findings with you before it starts coding. Nothing learned gets lost when context cycles.

IF complexity in [MEDIUM, HIGH]
AND research done BUT not shared with user:
→ BLOCK: "Share research findings (2-3 sentences)"

06 User Approval Hard Block ▶

The full plan is presented in plain English. You read it, you approve it, or nothing happens. No surprises.

IF plan_approval_required AND NOT plan_approved:
→ BLOCK: "Call ExitPlanMode for approval"

07 Project Snapshot Backup ▶

Creates a full session-level backup of all source files before any changes begin. Complete rollback point.

IF no snapshot for this session yet:
Create full backup to .ai-project/snapshots/{session_id}/
→ CONTINUE

08 Per-File Checkpoint 4 Strategies ▶

Backs up each individual file RIGHT BEFORE it's edited. Strategy selected automatically by file type.

// 4 backup strategies:
.py, .js, .css → GIT_SURGICAL (per-project git repo)
.png, .pdf → BLOB_ONLY (SHA256 content-addressed, dedup)
.log, node_modules → SKIP (not worth backing up)
User-excluded → IGNORE

Design Innovation

Binary files use content-addressed SHA256 hashing with automatic deduplication. Same image stored once, referenced everywhere. Zero context window overhead — fully programmatic, no conversation tokens wasted.
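Content-addressed storage with deduplication can be sketched in a few lines. This is a generic illustration of the technique, not the version_manager code; `store_blob` and the flat blob directory layout are assumptions.

```python
import hashlib
import shutil
from pathlib import Path


def store_blob(source: Path, blob_dir: Path) -> str:
    """Copy `source` into `blob_dir` under its SHA-256 digest and return the digest."""
    digest = hashlib.sha256()
    with source.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):  # hash in 1 MiB chunks
            digest.update(chunk)
    name = digest.hexdigest()
    blob_dir.mkdir(parents=True, exist_ok=True)
    target = blob_dir / name
    if not target.exists():  # identical content is already stored: deduplicated
        shutil.copyfile(source, target)
    return name
```

Because the file name is derived from the content, backing up the same image from two locations produces one stored blob and two references to the same digest.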

09 Simplification Guard Hard Block ▶

Blocks removal of 20%+ of code without explicit justification. Prevents the AI from "simplifying" your work into oblivion.

IF edit removes functions, imports, or try/catch blocks:
→ BLOCK: "Show test plan proving removal is safe"

10 Test-Before-Deploy Hard Block ▶

Blocks deployment commands (docker compose up, git push, etc.) until all tests pass.

IF is_deployment_command AND dirty_state == "DIRTY":
→ BLOCK: "{N} untested edits — run tests first"

FC File Claims Gate Hard Block ▶

Cross-session edit conflict prevention via coordination.db.

IF another session claims this file:
→ BLOCK: "Terminal {N} is editing {file}"
// SQLite WAL, TOCTOU prevention, 10min heartbeat TTL

SV Screenshot Viewing Gate Hard Block ▶

Blocks edits until pending screenshots are actually viewed.

IF pending screenshot not Read:
→ BLOCK: "Read the screenshot first"

LG Learning Gate Soft Block ▶

Forces knowledge capture after error resolutions before moving on.

IF error resolved, no learning stored:
→ BLOCK: "Store what you learned"
// Feeds memory_log.jsonl (1.3GB+ append-only)

TT Task Tool Gate Hard Block ▶

Plan mode enforcement for agent/task launches.

IF TaskCreate/Agent without approved plan:
→ BLOCK: "Enter plan mode first"

0.3 Test Integrity Hard Block ▶

Blocks test file edits when tests are failing. Prevents "fixing" tests by changing assertions.

IF target is test file AND tests failing:
→ BLOCK: "Fix the source code, not the tests"

0.6 Canonical Source Registry Hard Block ▶

Validates edits against registered canonical sources. Prevents editing generated files directly.

IF file in canonical registry:
→ BLOCK: "Edit {canonical_path} instead"

0.7 Plan File Research Gate Conditional ▶

Requires research before editing plan files. No plans based on assumptions.

IF editing plans/*.md, research below threshold:
→ BLOCK: "Read relevant files first"

1.1 Dependency Map Gate Hard Block ▶

Ensures imports/exports are traced before editing files with downstream dependents.

IF file has dependents, dep map not built:
→ BLOCK: "{N} files import this"

2.6 Dependency Read Verification Hard Block ▶

Verifies both upstream and downstream dependencies have been Read before editing.

IF deps not Read:
→ BLOCK: "Read {missing} first"
// Powered by access_tracker.py + cascade_detector.py

B Bash Sub-Gates (7 gates) Mixed ▶

Seven sub-gates for Bash calls, catching operations that bypass Edit/Write.

Maintenance Perm — Blocks direct MAINTENANCE_MODE creation
Kill Perm — Blocks kill/pkill of system processes
ChromeCDP — Verifies Chrome debug port ready
2b-Bash — Post-failure research for retries
0.5-PRE — Plan lock pre-check for writes
0.5-Bash — Catches echo/cat/tee bypassing Edit
10-Bash — Deploy gate for scp/rsync/docker/git push

E ExitPlanMode Sub-Gates (8) Hard Block ▶

Eight sub-gates when exiting plan mode. A separate AI subprocess reviews adversarially.

E1 — Thinking Partner questions verified
E1b — Exhaustive dependency verification
E1c — Behavioral diff per file
E2a — Optimality assessment (3-layer)
E2a-ii — Adversarial review subprocess
E2a-verify — Review claim verification
E0.5b — Plan file integrity
EC — Confidence (research saturation)

P Parallel Hook Gates (5 files) Mixed ▶

Five hook files running in parallel alongside pretooluse.py.

repo_boundary_gate.py — Prevents cross-repo edits
read_gate.py — Validates reads in plan mode
search_strategy_gate.py — Forces parallel agent dispatch
browser_swarm_gate.py — Blocks sequential browsing
research_gate_hook.py — Research enforcement

O Offload Enforcer Sub-Gates (6) Mixed ▶

Config-driven deployment safety from deploy_safety.json.

Deploy Safety — Server/command validation
Remote Command — SSH command checks
Script Execution — Script safety
Git Operations — Dangerous git detection
File Deletion — rm/unlink enforcement
Process Mgmt — kill/pkill checks

agent swarm architecture

Complexity-driven parallel dispatch — the controller forces multi-agent investigation for any non-trivial task

Why Parallel Dispatch?

Instead of Claude searching sequentially (read file, grep, read another file...), the controller blocks all manual searches on MEDIUM/HIGH complexity tasks until a minimum number of agents are dispatched. Each agent gets a unique investigation angle via MECE decomposition (WHAT / HOW / RISKS).

Complexity-Driven Scaling

TRIVIAL: 0 agents, fast path. LOW: 1-2 agents, plan mode required. MEDIUM: 3-6 agents, research required. HIGH: 6-10 agents per task, thousands across parallel sessions. Complexity never downgrades within a session.

Auto-Scaling Tiers

Agent count scales automatically: Tier 1 (0s): 3 agents immediately. Tier 2 (4s): 6 agents. Tier 3 (8s): 10 agents. Tier 4 (15s): 10 agents ceiling. The search strategy gate enforces this — Glob/Grep/Bash searches are blocked until the minimum is met.
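Reading the tier times as seconds elapsed since the task started (an assumption; the text does not say what the timer measures), the required minimum can be looked up with a sorted-boundary search. `min_agents` is a hypothetical name.

```python
import bisect

# Tier start times in seconds and the agent minimum from each tier onward.
TIER_STARTS = [0, 4, 8, 15]
TIER_AGENTS = [3, 6, 10, 10]


def min_agents(elapsed_seconds: float) -> int:
    """Return the minimum number of agents required at the given elapsed time."""
    index = bisect.bisect_right(TIER_STARTS, elapsed_seconds) - 1
    return TIER_AGENTS[max(index, 0)]
```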

Execution Models

Subagents (default): independent parallel agents, each working autonomously. Teams: used for HIGH complexity + LARGE codebases + 3 agents — provides TeamCreate + TaskCreate with dependency chains and a coordinator pattern.

Cross-Session Coordination

SQLite coordination.db (WAL mode) prevents conflicts: File Claims with TOCTOU prevention (BEGIN EXCLUSIVE), heartbeat TTL (10 min) with /proc/{pid} fallback, stale session cleanup that auto-releases dead claims. No external dependencies — just SQLite.
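The claim-with-heartbeat idea can be sketched with Python's standard `sqlite3` module. The table schema and function names below are assumptions; the sketch shows only the core technique of checking and writing a claim inside a single `BEGIN EXCLUSIVE` transaction so that no other session can slip in between the check and the write. The /proc/{pid} fallback and stale-session cleanup are not covered.

```python
import sqlite3
import time

HEARTBEAT_TTL = 600  # seconds; the 10-minute TTL from the text


def open_db(path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(path, isolation_level=None)  # manage transactions by hand
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS claims ("
        "file TEXT PRIMARY KEY, session TEXT NOT NULL, heartbeat REAL NOT NULL)"
    )
    return conn


def claim_file(conn: sqlite3.Connection, file: str, session: str, now=None) -> bool:
    """Claim `file` for `session` unless another live session already holds it."""
    now = time.time() if now is None else now
    conn.execute("BEGIN EXCLUSIVE")  # check and write under one lock: no TOCTOU gap
    try:
        row = conn.execute(
            "SELECT session, heartbeat FROM claims WHERE file = ?", (file,)
        ).fetchone()
        held_by_other = row is not None and row[0] != session
        if held_by_other and now - row[1] < HEARTBEAT_TTL:
            conn.execute("ROLLBACK")
            return False
        # Either unclaimed, our own claim (refresh), or a stale claim (take over).
        conn.execute(
            "INSERT OR REPLACE INTO claims VALUES (?, ?, ?)", (file, session, now)
        )
        conn.execute("COMMIT")
        return True
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

A claim whose heartbeat is older than the TTL is treated as dead and can be taken over, which is how a crashed session's files become editable again.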

MECE Investigation Axes

Each agent gets a unique angle: Cluster axes (group files by directory, 1 agent per cluster), Dependency axes (root-cause + hotspot analysis), MECE axes (WHAT/HOW/RISKS investigations), Debugging axes (if failure count ≥ 2). Deduplication merges overlapping axes (Jaccard > 0.6).
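The deduplication step can be illustrated with a small Jaccard-similarity merge. Representing an axis as a set of file paths and merging greedily in input order are assumptions made for this sketch; only the 0.6 threshold comes from the text.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by size of the union (0.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def dedupe_axes(axes: list[set[str]], threshold: float = 0.6) -> list[set[str]]:
    """Merge each axis into the first kept axis it overlaps with by more than `threshold`."""
    kept: list[set[str]] = []
    for axis in axes:
        for i, existing in enumerate(kept):
            if jaccard(axis, existing) > threshold:
                kept[i] = existing | axis
                break
        else:  # no sufficiently similar axis found: keep this one separately
            kept.append(set(axis))
    return kept
```

For example, two axes covering `{a.py, b.py, c.py}` and `{a.py, b.py, c.py, d.py}` have a similarity of 0.75 and would be merged into a single agent assignment.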

specialist agent types

Seven named agent types, each optimized for a specific role in the investigation swarm.

scout — Codebase exploration (Glob, Grep, Read)
oracle — External research (WebSearch, WebFetch)
kraken — Implementation (Edit, Write, Bash)
spark — Quick fixes, single-file changes
arbiter — Test execution and verification
debug-agent — Systematic debugging with root-cause analysis
phoenix — Recovery from failed approaches

// User intent extraction (user_intent.py, 12K):
Extracts goals from prompts for complexity classification
// Fast path detection (fast_path_detector.py, 6.3K):
TRIVIAL tasks skip research/planning gates entirely

11 PASS Lock Disabled ▶

Disabled since February 2026. The PASS Lock was designed as an Azure DevOps–style exclusive lock — once tests pass, a SHA-256 hash locks the project state and blocks further edits without re-testing. In practice, it created too much friction for iterative development workflows and was disabled while evaluating a lighter-weight alternative.

// Original design (currently inactive):
ON tests_pass:
  state_hash = SHA256(all_edited_files)
  write PASS-{task_id}.json with hash
  LOCK: no edits allowed without re-test

// Currently: Gate 10 (Test-Before-Deploy) handles deployment gating
// Gate 13 (Stop Hook) handles session-end test enforcement

12 Bash Loop Detector Self-Heal ▶

Prevents brute-force retry loops. AI agents love to retry failed commands 10 times instead of investigating. This gate makes that impossible.

Layer 1 — Immediate:
IF last_bash_failed AND next_tool == Bash:
  HARD BLOCK: "Investigate first — use Read, Grep, or WebSearch"

Layer 2 — Accumulating:
IF 4 consecutive Bash-only:
  SOFT NUDGE: "You're looping — try a different approach"
IF 6 consecutive + 2 failures:
  HARD BLOCK: "Stuck in retry loop"

Reset: WebSearch or WebFetch fully clears state
// Researching = evidence of investigation, not brute force

Why This Matters

Without this, the AI burns through your compute budget retrying the same broken command. Error fingerprinting detects failures even when exit code is 0 (stderr patterns: "error:", "fatal:", "connection refused").
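A minimal state-machine sketch of the two layers and the error fingerprinting described above. How the layers interact is not fully specified in the text, so the following are assumptions: any non-Bash tool ends the consecutive-Bash run and clears the "last command failed" flag, while the failure count persists until a WebSearch or WebFetch. The class and function names are hypothetical.

```python
import re
from dataclasses import dataclass

# stderr patterns that mark a failure even when the exit code is 0
ERROR_TEXT = re.compile(r"error:|fatal:|connection refused", re.IGNORECASE)
RESET_TOOLS = {"WebSearch", "WebFetch"}


def bash_failed(exit_code: int, stderr: str) -> bool:
    """Error fingerprinting: a zero exit code with error text still counts as a failure."""
    return exit_code != 0 or ERROR_TEXT.search(stderr) is not None


@dataclass
class LoopState:
    consecutive_bash: int = 0
    failures: int = 0
    last_bash_failed: bool = False

    def check(self, next_tool: str) -> str:
        """Decide, before the tool runs, whether it may proceed."""
        if next_tool != "Bash":
            return "CONTINUE"
        if self.last_bash_failed:  # Layer 1: immediate
            return "HARD BLOCK: investigate first"
        if self.consecutive_bash >= 6 and self.failures >= 2:  # Layer 2: accumulating
            return "HARD BLOCK: stuck in retry loop"
        if self.consecutive_bash >= 4:
            return "SOFT NUDGE: try a different approach"
        return "CONTINUE"

    def record(self, tool: str, failed: bool = False) -> None:
        """Update state after a tool has run."""
        if tool in RESET_TOOLS:  # research fully clears state
            self.consecutive_bash = self.failures = 0
            self.last_bash_failed = False
        elif tool == "Bash":
            self.consecutive_bash += 1
            self.failures += int(failed)
            self.last_bash_failed = failed
        else:  # Read, Grep, etc. count as investigating
            self.consecutive_bash = 0
            self.last_bash_failed = False
```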

13 Unbypassable Completion Blocker Hard Block ▶

The final gate. Fires when the AI tries to end a session. It checks TestingGate.can_complete() mechanically — not by reading the AI's text. The AI cannot claim "done" with words.

// This is NOT text analysis. It reads state files.
ON session_end_attempt:
  state = TestingGate.load_state()

  IF state.dirty_state == "DIRTY":
    IF state.session_own_edits > 0:
      BLOCK: "Untested code changes exist"

  // On EVERY stop, write continuity handoff:
  extract: active_task, key_decisions, files_modified
  write: handoff-{terminal_id}.md
  Next session picks up exactly where this one left off

Why This Matters

Other AI tools let the agent say "all done!" while tests are failing. Mrs. Kitty checks the actual state files on disk. The AI's opinion about whether it's done is irrelevant. The filesystem is the source of truth.

testing architecture

LLM-classified tests, tamper-evident receipts, cross-session debt enforcement

T1 The Testing Pipeline Enforced ▶

Every edit triggers dirty state tracking. The AI physically cannot claim "done" without test proof.

Edit File → Dirty State → Test Planner → Execute Tests → LLM Classify → Receipt

ON Edit/Write tool:
  dirty_state = "DIRTY"
  Track: file path, timestamp, edit type

ON session end (stop.py):
  IF dirty_state == "DIRTY":
    BLOCK: "{N} files edited without test verification"
    Force test execution before session can end

ON deployment command detected:
  IF dirty_state == "DIRTY":
    BLOCK: "Run tests before deploying"

Design Innovation

The AI doesn't decide when testing is done — the infrastructure does. stop.py blocks session termination until every edited file has a corresponding test receipt. The AI can't rationalize its way past a filesystem check.

T2 LLM-Based Test Classification 6 Parsers ▶

Test output from any runner is parsed by specialized pattern matchers. When patterns fail, the raw output is sent to the LLM for semantic classification — the system understands test results, not just regex-matches them.

pytest

=== N passed in 0.42s ===

Python standard. Handles truncated output, fixture errors, and collection failures.

Jest

Tests: 2 failed, 5 passed

JavaScript/TypeScript. Supports JSON reporter mode and watch-mode output.

Browser Tests

2 passed (1.2s)

Visual test results from agent-browser verification runs. Parses pass/fail counts.

Cargo (Rust)

test result: ok. 5 passed

Rust test harness. Handles compilation errors vs runtime failures.

Go test

ok package 0.003s

Go native testing. Distinguishes ok from FAIL package lines.

Vitest

Tests 5 passed (3)

Vite-native runner. Jest-compatible patterns plus native format.

// Classification cascade:
TRY structured parser (regex per runner) → deterministic result
TRY generic fallback (line-by-line PASS/FAIL) → best-effort
FALLBACK LLM semantic analysis → contextual understanding

// Custom patterns (user-defined):
test_patterns.json → project-specific regex overrides

Design Innovation

Most CI systems parse one runner. This system parses any runner in any language — and when all parsers fail, an LLM reads the raw output and classifies it semantically. It literally understands what "2 scenarios (1 failed, 1 passed)" means even for frameworks it's never seen.
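The cascade can be sketched as a list of parsers tried in order, with the LLM as a last resort. Only two structured parsers are shown, the regular expressions are simplified approximations of pytest and Jest summary lines, and `llm` is a stand-in callable supplied by the caller rather than a real model API.

```python
import re
from typing import Callable, Optional


def _counts(text: str) -> Optional[dict]:
    """Extract pass/fail counts from a summary fragment, or None if neither appears."""
    passed = re.search(r"(\d+) passed", text)
    failed = re.search(r"(\d+) failed", text)
    if not passed and not failed:
        return None
    p = int(passed.group(1)) if passed else 0
    f = int(failed.group(1)) if failed else 0
    return {"passed": p, "failed": f, "result": "FAIL" if f else "PASS"}


def parse_pytest(output: str) -> Optional[dict]:
    summary = re.search(r"^=+ (.*) in [\d.]+s", output, re.MULTILINE)
    return _counts(summary.group(1)) if summary else None


def parse_jest(output: str) -> Optional[dict]:
    summary = re.search(r"^Tests:\s+(.*)$", output, re.MULTILINE)
    return _counts(summary.group(1)) if summary else None


def parse_generic(output: str) -> Optional[dict]:
    """Best-effort fallback: count lines that start with PASS or FAIL."""
    p = len(re.findall(r"^\s*PASS\b", output, re.MULTILINE))
    f = len(re.findall(r"^\s*FAIL\b", output, re.MULTILINE))
    if p == 0 and f == 0:
        return None
    return {"passed": p, "failed": f, "result": "FAIL" if f else "PASS"}


def classify(output: str, llm: Optional[Callable[[str], dict]] = None) -> dict:
    """Try structured parsers, then the generic fallback, then the LLM stand-in."""
    for parser in (parse_pytest, parse_jest, parse_generic):
        result = parser(output)
        if result is not None:
            return {**result, "source": parser.__name__}
    if llm is not None:
        return {**llm(output), "source": "llm"}
    return {"result": "UNKNOWN", "source": "none"}
```

Recording which stage produced the result (`source`) makes it possible to treat deterministic parser output and LLM output with different levels of trust downstream.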

T3 Tamper-Evident Receipts Anti-Forgery ▶

The AI literally cannot forge a passing test result. Only the posttooluse hook — which observes actual Bash output — can write a receipt.

{
  "attested_by": "posttooluse_hook",
  "command": "pytest tests/ -v",
  "command_hash": "sha256:a3f8c1...",
  "output_hash": "sha256:7d2e9b...",
  "result": "PASS",
  "passed": 12,
  "failed": 0,
  "timestamp": "2026-02-24T14:32:01Z",
  "ttl_minutes": 30,
  "version": 3
}

// Anti-forgery layers:
1. attested_by must be "posttooluse_hook" // AI can't write this
2. command_hash = SHA256(actual command) // proves what ran
3. output_hash = SHA256(actual output) // proves what returned
4. ttl_minutes: 30 // stale receipts rejected
5. version: 3 // old format receipts rejected

Design Innovation

SHA256 hashes of both the command and its output create a cryptographic proof chain. If the AI could somehow write a receipt file (it can't), the hashes wouldn't match the actual test run. It's not just access control — it's tamper evidence.
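A sketch of how a receipt with these fields could be produced and checked. The field names follow the receipt shown above; the function names, the exact hashing input (UTF-8 text of the command and of the output), and the idea that the verifier recomputes both hashes from what the hook itself observed are assumptions.

```python
import hashlib
from datetime import datetime, timedelta, timezone

RECEIPT_VERSION = 3
TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def _sha256(text: str) -> str:
    return "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest()


def make_receipt(command: str, output: str, passed: int, failed: int, now=None) -> dict:
    """Build a receipt from the command and output the hook actually observed."""
    now = now or datetime.now(timezone.utc)
    return {
        "attested_by": "posttooluse_hook",
        "command": command,
        "command_hash": _sha256(command),
        "output_hash": _sha256(output),
        "result": "PASS" if failed == 0 else "FAIL",
        "passed": passed,
        "failed": failed,
        "timestamp": now.strftime(TIME_FORMAT),
        "ttl_minutes": 30,
        "version": RECEIPT_VERSION,
    }


def receipt_valid(receipt: dict, command: str, output: str, now=None) -> bool:
    """Check attestation, format version, both hashes, and freshness."""
    now = now or datetime.now(timezone.utc)
    try:
        issued = datetime.strptime(receipt["timestamp"], TIME_FORMAT)
        issued = issued.replace(tzinfo=timezone.utc)
        fresh = now - issued <= timedelta(minutes=receipt["ttl_minutes"])
        return (
            receipt["attested_by"] == "posttooluse_hook"
            and receipt["version"] == RECEIPT_VERSION
            and receipt["command_hash"] == _sha256(command)
            and receipt["output_hash"] == _sha256(output)
            and fresh
        )
    except (KeyError, ValueError, TypeError):
        return False  # malformed receipts are rejected, never trusted
```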

T4 Cross-Session Testing Debt Persistent ▶

Close the terminal with untested edits? The debt follows you. Next session starts with a testing obligation you can't dismiss.

ON session end with dirty_state:
  Export debt file: testing-debt-{project}.json
  Contains: file list, edit timestamps, project type

ON next session start:
  IF debt file exists for this project:
    Inject: "You have {N} files with untested changes"
    Gate blocks implementation until debt cleared

// Safety mechanisms:
Terminal-isolated: Terminal 2 can't clear Terminal 1's debt
4-hour TTL: prevents zombie debt from abandoned sessions
Project-type filtering: web debt doesn't block Python work

T5 Intelligent Test Planner Goal-Aware ▶

Not every file needs the same tests. The planner classifies edits into test types and selects the right verification strategy — browser-first for web, VM for installers, unit tests for libraries.

// TestType enum — what kind of verification?
VISUAL_BROWSER → Web edits tested in real browser (agent-browser CLI)
INSTALLER_VM → Installer tested in fresh VM
DATABASE_FLOW → Migration tested with real DB
API_ENDPOINT → Endpoint tested with curl/httpx
UNIT_TEST → Standard test runner (pytest, jest, etc.)
SKIP → Infrastructure files (CI configs, dockerfiles, docs)

// Dependency maps:
FILE components/Login.tsx → FEATURES [auth, session] → PAGES [/login, /signup]
FILE styles/theme.css → FEATURES [layout] → PAGES [all]

// Auto-detect routes:
Next.js app router, Flask routes, static HTML → auto-discovered

Design Innovation

Most test systems run pytest and call it done. This system knows that editing a .jsx file means you need a browser test, not a unit test. It traces the dependency from file → feature → affected pages, and tests what the user would actually see.

autonomous browser testing

Agentic browser automation — 141 commands, your real Chrome, completely undetectable

B1 Agent-Browser CLI Agentic ▶

A complete Playwright replacement that uses your real Chrome. Completely undetectable — no navigator.webdriver flag. 141 commands for navigation, interaction, state management, network control, and more. Open source.

# Your real Chrome. Not simulated. Not detectable.
agent-browser open "https://amazon.com"
agent-browser snapshot # reads page structure — 200 tokens vs 13,700 for Playwright
agent-browser click "@e3" # stable element refs, not fragile CSS selectors
agent-browser fill "@e7" "search query"
agent-browser screenshot "evidence.png"
agent-browser state_save "logged-in" # save session state, restore anytime
agent-browser errors # console errors + network failures

// Why Playwright is old news:
~200 tokens/page vs 13,700 for Playwright MCP → 98.5% reduction
10-step workflow: 7,000 tokens vs 114,000 with Playwright → 5.7x more test cycles
Your real Chrome profile → persistent cookies, real fingerprint, saved logins
Direct WebSocket CDP → no Node.js intermediary, no framework overhead
Open source on GitHub (Apache-2.0)

B2 Structure-First Verification Tiered ▶

Reads page structure instead of taking screenshots — 93% less context. Structure-first verification catches broken elements, missing content, and layout issues without burning tokens on pixel analysis.

Tier 1: Console Health

Check for JS errors, uncaught exceptions, failed network requests. Cheapest — no rendering needed.

Tier 2: Snapshot (Structure)

Reads the page's accessibility tree — buttons, headings, forms, links — in ~200 tokens. Replaces screenshot-based verification for most checks.

Tier 3: Screenshot (Visual)

PNG capture for CSS, colors, spacing, layout. Used only when structure alone isn't enough — design validation, pixel-level regression testing.

Tier 4: Responsive Check

Resize to mobile viewport (375px). Re-run tiers 1-3. Catches mobile-specific breakage.

Tier 5: Security Scan

SAST analysis via semgrep. XSS, injection, hardcoded secrets — caught before deployment.

Design Innovation

Structure-first verification uses ~200 tokens per page compared to ~13,700 for screenshot-based tools. A 10-page verification costs 2,000 tokens instead of 137,000. The AI reads the page like a screen reader — fast, accurate, and token-efficient.

B3 Agentic Automation Flow Agentic ▶

Not just verification — full browser automation. Navigate complex UIs, fill forms, interact with dynamic content, message suppliers, find products, manage workflows. The AI operates your browser like you would.

1. AI receives a browser task — testing, research, or interaction
2. Claude Code skill researches the UI, plans the smartest path
3. Agent-browser navigates, clicks, fills, scrolls — using stable @e references
4. Parallel sessions handle multiple pages or workflows simultaneously
5. Snapshot verification confirms each step succeeded — ~200 tokens per check
6. Task completes — results, evidence, and verification collected
7. Tamper-evident receipt generated with SHA256 evidence chain

B4 Deployment Auto-Verification Post-Deploy ▶

Tests don't stop at the local dev server. After git push or docker compose up, the system navigates to the live URL and verifies production.

// 23 deployment command patterns detected:
git push, docker compose up, npm run deploy,
vercel --prod, fly deploy, scp ... server:, ...

ON deployment detected:
  1. Extract live URL from output or project config
  2. Wait for deployment propagation
  3. agent-browser open {live_url}
  4. Run structure-first verification against production
  5. IF console errors OR visual regression:
    → ALERT: "Production issue detected post-deploy"

Design Innovation

The AI doesn't just deploy and hope. It navigates to your live production URL, checks for console errors, takes a screenshot, and tells you if something broke. All automatically, seconds after the deploy command finishes. Structure-first snapshots mean verification costs ~200 tokens per page, not thousands.

B5 VM Installer Testing Golden Snapshot ▶

Installer builds are tested in an isolated Windows VM with golden snapshot restore. Every test starts from a known-clean state — no contamination from previous runs.

// test-installer skill workflow:
1. Restore VM from golden snapshot (Docker volume)
2. Compile installer via InnoSetup (ISCC.exe)
3. Copy .exe to VM shared folder (\\host.lan\Data)
4. Execute installer silently inside VM
5. Verify installation via noVNC + agent-browser
6. Check: files exist, services running, config correct
7. Restore golden snapshot (clean for next run)

// Access:
noVNC: http://localhost:8006
Shared folder: tests/layer4-e2e/shared/ → \\host.lan\Data

Why This Matters

Most installer testing is manual: build, run on your machine, check if it works, uninstall, repeat. This automates the entire cycle with a fresh VM every time. Golden snapshot restore guarantees the VM is identical across runs — no "it worked on my machine" because the machine is literally reset to factory state between tests.

B5 Visual GUI Testing — How the AI Sees Pages Animated ▶

The AI doesn't use CSS selectors or XPath. It sees pages through snapshots (accessibility trees with element refs) and screenshots (pixel-perfect images it can read). The orient→identify→act→re-orient loop ensures refs are always fresh.

// The orient → identify → act → re-orient loop:
1. snapshot -i → get @e1, @e2, @e3 refs (orient)
2. Read refs → find target element (identify)
3. click @e1 → interact (act)
4. snapshot -i → refs INVALIDATE after DOM change (re-orient)
5. Repeat until at target section
6. screenshot → Read image → describe what you see (evidence)

// Tiered verification (fastest first):
Tier 1: snapshot only — fast, text-based, checks structure/content
Tier 2: screenshot + Read — checks CSS, layout, colors, spacing
Tier 3: both snapshot + screenshot — comprehensive

// Scoped snapshots for large pages:
snapshot -i -s "#navigation" → limits to ~10 elements vs 100

B7 Visual QA System (56 checks) Automated ▶

56 automated visual checks powered by Playwright, covering layout, colors, typography, and interactive elements.

visual_qa_checks.js (43K) — Full page QA suite
visual_qa_checks_sections.js (20K) — Per-section checks
visual_qa_runner.py (6.7K) — Python test runner
visual_testing.py (7.3K) — Test framework
visual_test_config.py — Configuration

B6 Browser Swarm — Parallel Visual Testing at Scale Gate-Enforced ▶

When 3 or more URLs need verification, the browser swarm gate blocks sequential agent-browser open commands and forces parallel orchestration. All agents launch in a single message with run_in_background: true.

// 4-Phase Browser Swarm:

PHASE 0: DISCOVER (optional)
  3-5 WebSearches → collect 20-50 URLs → deduplicate

PHASE A: SETUP
  ab-swarm-setup s1=URL1 s2=URL2 ... sN=URLN
  → Opens ALL tabs simultaneously in background

PHASE B: DISPATCH
  N agents launched in ONE message (all run_in_background: true)
  Each agent: snapshot → interact → screenshot → report

PHASE C: AGGREGATE
  Coordinator collects all reports → synthesizes findings

// Decomposition rules:
ITEM-PARALLEL: N URLs → N agents
ASPECT-PARALLEL: N items × M aspects → N×M agents
SECTION-PARALLEL: N page sections → N agents

// Scale:
10-20 agents (~200MB Chrome) • 20-50 agents (~500MB) • 50-100 agents (batch waves of 25-30) • thousands across parallel sessions
Two-tier research: fast scan ALL URLs → deep dive top 10%
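
The decomposition and batching rules above can be sketched in a few lines. This is an illustration only, not the controller's code: agent_count and plan_waves are hypothetical names, and the default wave size of 25 is taken from the low end of the 25-30 range quoted above.

```python
def agent_count(items, aspects=1):
    """ITEM-PARALLEL when aspects == 1, ASPECT-PARALLEL otherwise (N x M agents)."""
    return items * aspects


def plan_waves(total_agents, wave_size=25):
    """Split a swarm into waves; returns the number of agents in each wave."""
    if total_agents <= 0:
        return []
    full, rest = divmod(total_agents, wave_size)
    return [wave_size] * full + ([rest] if rest else [])
```

Under these assumptions, 60 agents would run as three waves, and a swarm of 20 fits in a single wave.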

Why This Matters

Sequential browser verification of 10 pages takes 10× as long as parallel. The browser swarm gate forces parallel orchestration by blocking after 3 sequential opens. Combined with the agent swarm architecture, this means a 20-page research task launches 20 browser agents simultaneously, each operating its own Chrome tab.

confidence engine

Weighted 100-point algorithm with reflexion loops — won't write a single line below 100

100% confidence scoring

Every tool call is scored on a weighted 100-point scale across five factors. Each factor contributes up to 20 points. The system won't write a single line of code until the total score reaches 100.

Factor	What It Measures	Points
Plan Quality	Is the implementation plan specific and actionable?	20
File Understanding	Has the AI read and understood all relevant files?	20
Dependency Analysis	Are imports, packages, and side effects mapped?	20
Web Research	Has external research been performed?	20
Context Coverage	Does the plan account for edge cases and tests?	20

Web research gets 0 if not done. Not 15. Not 10. Zero. Your training data cutoff was months ago. The library you're about to use might have a breaking change.

Plan +20 • Files +20 • Deps +20 • Web +20 • Ctx +20

// Threshold bands:
0–49: Heavy research required — major knowledge gaps
50–79: More context needed — read more files, check deps
80–99: Targeted verification — one or two gaps remain
100: Proceed to implementation

Based on the Reflexion architecture (Shinn et al., 2023). When the score falls below 100%, a feedback loop identifies the specific gap, forces research to fill it, then re-scores. This loop runs up to 3 times before hard-blocking.
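
A minimal sketch of the score-and-retry loop described here, assuming the five factors from the table and a caller-supplied research step. The names total_score, weakest_factor, and confidence_gate are hypothetical, and the research callback stands in for whatever the real gate does to fill a gap.

```python
FACTORS = ("plan", "files", "deps", "web", "context")
MAX_PER_FACTOR = 20
MAX_RESCORES = 3  # "runs up to 3 times before hard-blocking"


def total_score(scores):
    """Sum the five factors, clamping each one to the 0..20 range."""
    return sum(max(0, min(MAX_PER_FACTOR, scores.get(f, 0))) for f in FACTORS)


def weakest_factor(scores):
    """Pick the lowest-scoring factor as the gap to research next."""
    return min(FACTORS, key=lambda f: scores.get(f, 0))


def confidence_gate(scores, research):
    """research(gap, scores) -> updated scores (hypothetical callback)."""
    for _ in range(MAX_RESCORES):
        if total_score(scores) >= 100:
            return "PROCEED"
        scores = research(weakest_factor(scores), scores)
    return "PROCEED" if total_score(scores) >= 100 else "BLOCK"
```

With web research missing, the total is 80 and the gap is "web"; if three research rounds do not close it, the sketch returns BLOCK.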

Confidence Scoring Pipeline

Tool Call → Score 5 Factors → ≥100%? → PROCEED
↓ No
Identify Gap → Research → Re-score (max 3x) → <100% after 3x → BLOCK

complexity classifier

Semantic risk scoring. Not line count — meaning. The classifier analyzes the task description and flags high-risk patterns that require additional research and verification.

"delete users table" → Destructive(+40) + Critical(+60) = 100 → HIGH
"fix typo in readme" → 0 signals = TRIVIAL
Unknown task → MEDIUM // fail-safe: unknown = needs research

19 high-risk patterns detected instantly:

Authentication / authorization changes
Payment / billing code
Database migrations / schema changes
SSH / key management
File deletion / destructive operations
API key / credential handling
Deployment scripts
User data handling (PII)
Security configuration
Infrastructure changes
// + 9 more patterns...

Each detected pattern adds risk points. Multiple patterns compound. The total determines which gates are enforced and how much research is required.
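
A rough sketch of additive risk scoring. Only the two weights (Destructive +40, Critical +60) and the three example outcomes come from the text above; the regular expressions, the LOW/MEDIUM cut-offs, and the choice to treat an empty description as the "unknown task" case are assumptions made for illustration.

```python
import re

# Hypothetical signal table: (pattern, risk points).
SIGNALS = {
    "destructive": (re.compile(r"\b(delete|drop|truncate|rm)\b", re.I), 40),
    "critical": (re.compile(r"\b(users?|payments?|credentials?|auth)\b", re.I), 60),
}


def classify(task):
    """Map a task description to TRIVIAL / LOW / MEDIUM / HIGH."""
    if not task or not task.strip():
        return "MEDIUM"  # fail-safe: unknown = needs research
    points = sum(weight for rx, weight in SIGNALS.values() if rx.search(task))
    if points == 0:
        return "TRIVIAL"
    if points >= 100:
        return "HIGH"
    return "MEDIUM" if points >= 50 else "LOW"  # assumed bands
```

Because points are summed, a task that matches several patterns climbs toward HIGH, which is the compounding behavior described above.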

exitplanmode 3-layer gate

The plan approval process has its own triple-verification. A plan doesn't just need to exist — it needs to prove it was built on solid research.

Layer 1 — Research Saturation:
  Has enough external research been done for the task complexity?
  TRIVIAL = 0 points, LOW = 1, MEDIUM = 3 (min 1 WebSearch), HIGH = 5

Layer 2 — Confidence ≥ 100%:
  The 5-factor confidence score must meet the threshold
  Plan evidence boost: auto-detects URLs in plan text (+points for citations)

Layer 3 — Subprocess Review:
  A SECOND Claude instance reviews the plan
  Checks for: unverified claims, hallucinations, logical gaps
  The planner literally cannot grade its own homework
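
The three layers can be read as a short-circuiting check. The sketch below assumes the research requirements listed under Layer 1; exit_plan_gate and the reviewer_approves callback (standing in for the second Claude instance) are hypothetical names.

```python
REQUIRED_RESEARCH = {"TRIVIAL": 0, "LOW": 1, "MEDIUM": 3, "HIGH": 5}


def exit_plan_gate(complexity, research_points, web_searches, confidence, reviewer_approves):
    # Layer 1: research saturation for the task's complexity
    if research_points < REQUIRED_RESEARCH[complexity]:
        return "BLOCK: research saturation"
    if complexity == "MEDIUM" and web_searches < 1:  # "min 1 WebSearch" as stated for MEDIUM
        return "BLOCK: research saturation"
    # Layer 2: confidence threshold
    if confidence < 100:
        return "BLOCK: confidence"
    # Layer 3: independent review, only reached when the cheaper layers pass
    if not reviewer_approves():
        return "BLOCK: subprocess review"
    return "ALLOW"
```

Ordering the layers from cheapest to most expensive means the subprocess review is spawned only for plans that already passed the first two checks.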

Design Innovation

A separate AI instance reviews the first AI's plan for unverified claims. The planner cannot review its own work — a second opinion is architecturally enforced, not optional.

self-healing cascade

When something breaks, 9 specialized agents investigate before bothering you.

Phase 1: direct-fix
Cycle 1: error-trace, similar-patterns, docs-and-deps
Cycle 2: alt-approach, regression-check, env-check
Cycle 3: broader-context, minimal-repro, expert-consult
Escalate: Ask human (with 9 research reports)

memory system

826KB+ of accumulated learnings — TF-IDF scored retrieval with 3-tier injection

permanent memory

826KB+ append-only JSONL archive. Every decision, fix, and pattern stored permanently. The system doesn't just remember — it remembers the right things at the right time.

8 Learning Types:

ARCHITECTURAL_DECISION — "We chose JWT over sessions because..."
WORKING_SOLUTION — "Fixed by adding CORS middleware"
CODEBASE_PATTERN — "This project uses CSS Modules, not Tailwind"
FAILED_APPROACH — "Don't use library X, it breaks with Y"
error_resolution — Auto-captured error→fix pairs
test_result — Test pass/fail context
USER_PREFERENCE — "Always use bun, never npm"
in_session — General session learnings

Scoring Formula:

Every learning is scored against the current task using a multi-factor relevance algorithm:

// Relevance scoring formula:
base = (exact_match + 0.7 × stem_match + 0.5 × substring_match) / token_count
score = base × type_boost × tag_boost × exp(-0.03 × days_old)

// Boost factors:
type_boost = 1.5× for corrections and user preferences
tag_boost = 2.0× when learning tags overlap with current task keywords
recency_decay = exponential decay at 0.03/day // ~50% weight at 23 days

Stem matching catches morphological variants (e.g., "testing" matches "test"), while substring matching catches partial overlaps. Combining the three match types is intended to reduce both false positives and false negatives relative to pure TF-IDF.
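
The formula translates directly into code. This sketch assumes the match counts are already computed; relevance and its parameter names are illustrative, not the module's actual API.

```python
import math


def relevance(exact, stem, substring, token_count, days_old,
              type_boost=1.0, tag_boost=1.0, decay_rate=0.03):
    """score = base * type_boost * tag_boost * exp(-decay_rate * days_old)."""
    if token_count <= 0:
        return 0.0
    base = (exact + 0.7 * stem + 0.5 * substring) / token_count
    return base * type_boost * tag_boost * math.exp(-decay_rate * days_old)


# Half-life implied by the decay rate: ln(2) / 0.03, roughly 23.1 days.
half_life_days = math.log(2) / 0.03
```

For a 4-token query with 2 exact matches and 1 stem match, base = (2 + 0.7) / 4 = 0.675. A 10-day-old user preference with overlapping tags then scores 0.675 × 1.5 × 2.0 × exp(-0.3), about 1.50.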

3-Tier Injection:

HOT (>0.3): Full content injected + acknowledgment gate forces Claude to read it
WARM (0.1–0.3): Preview shown, Claude can pull more if needed
COLD (<0.1): Not shown, but core MEMORY.md always injected

The HOT tier doesn't just show memories — it forces acknowledgment through a self-consuming gate. The AI must explicitly process the recalled learnings before proceeding. This prevents relevant past solutions from being ignored.
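
The tier boundaries reduce to two comparisons. In this sketch (injection_tier is a hypothetical name) a score of exactly 0.3 lands in WARM, following the ">0.3" wording for HOT.

```python
def injection_tier(score):
    """Map a relevance score to an injection tier."""
    if score > 0.3:
        return "HOT"   # full content + acknowledgment gate
    if score >= 0.1:
        return "WARM"  # preview only
    return "COLD"      # not shown; core MEMORY.md is injected regardless
```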

auto-capture pipeline

Memory capture happens at two levels — real-time during the session and comprehensively when it ends.

Real-Time (PostToolUse Hook):

The PostToolUse hook watches every tool result. When it detects an error followed by a successful resolution, it captures the pair automatically. No manual action needed — the system learns from every fix in real time.

Session-End Extraction:

When a session ends, the stop hook scans the full conversation transcript and extracts:

• Architectural decisions — "We chose JWT over sessions because..."
• Working solutions — "Fixed by adding CORS middleware"
• Failed approaches — "Don't use library X, it breaks with Y"
• User preferences — "Always use bun, never npm"
• Error→resolution pairs — Auto-captured during PostToolUse

Distillation:

The JSONL archive grows over time. Periodic consolidation into MEMORY.md keeps core knowledge lean — extracting the most important patterns and decisions into a concise document that's always injected at session start.

Memory Injection Pipeline

User Prompt → TF-IDF Search → Score? → HOT: Full inject + gate
↓
WARM: Preview → COLD: Core only

context compaction resilience

Claude has a ~200K token context window. When it fills up, the system compresses prior messages into a summary. Critically, compaction does NOT change the session_id — this was a misdiagnosis that was corrected in February 2026. The session_id persists across compaction events.

// How compaction resilience works:
Session abc123 → markers: confidence-cleared-abc123.txt
[context compacts — old messages compressed]
Session abc123 // same session_id preserved!
→ Markers still valid — gates pass correctly

Belt-and-suspenders: TTL-based expiration provides additional resilience. Memory markers have 10-minute TTL. Bug resolution markers have 1-hour TTL. Even if a marker lookup fails for any reason, fresh markers are accepted regardless of session key.
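
A sketch of the two acceptance paths: session match first, TTL freshness as the fallback. The marker layout (a dict with session_id and created_at) and the function name marker_valid are assumptions; only the two TTL values come from the text.

```python
import time

TTL_SECONDS = {"memory": 10 * 60, "bug_resolution": 60 * 60}


def marker_valid(marker, session_id, kind, now=None):
    """Accept a marker if it belongs to this session, or if it is still fresh."""
    now = time.time() if now is None else now
    if marker.get("session_id") == session_id:
        return True  # primary path: session_id is stable across compaction
    age = now - marker.get("created_at", 0)
    return 0 <= age < TTL_SECONDS[kind]  # fallback: fresh markers pass regardless of session key
```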

Engineering Honesty

The original design assumed compaction would create new session_ids, leading to orphaned markers. Investigation proved this wrong — session_id is stable. TTL-based fallbacks remain as defense-in-depth, not as a primary mitigation. We document what we learned, including our mistakes.

lesson extraction system

Separate from real-time memory: structured lesson extraction with a queryable archive.

lesson_ingest.py (11K) — Extracts lessons from transcripts
  Categorizes: architectural decisions, working solutions, failed approaches

lesson_query.py (9.1K) — Queryable lesson library
  TF-IDF search across all historical lessons

stop_learning_extractor.py (24K) — Session-end extraction
  Full transcript analysis for permanent learning capture

// Powered by: certainty_tracker.py (28K), plan_continuity.py (12K)

scope detection

3-signal monitoring with error fingerprinting — knows when to persist and when to pivot

scope change detection

3-signal monitoring. If the problem shifts mid-build, it halts and re-plans automatically. The system doesn't just detect failures — it detects when the nature of the failure changes.

Signal A (HIGH): Tests that WERE passing now fail with a new error fingerprint — the problem shifted. You were fixing auth, but now you've broken the database layer.

Signal B (MED): Same file fails 3 consecutive edit attempts — stuck in a loop. The AI is trying variations of the same broken approach.

Signal C (MED): 2 unique error fingerprints since plan was approved — scope is drifting. The original plan no longer matches reality.

→ Trigger: set phase=needs_replan, write plan-required marker, force the AI back to planning mode
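
The three signals can be expressed as independent checks over a small amount of tracked state. This is a sketch under assumptions: the argument names are hypothetical, and it assumes any single signal is enough to trigger a re-plan, which the text implies but does not state outright.

```python
def detect_scope_change(was_passing, fingerprint, seen, same_file_failures, since_plan):
    """Return the list of signals (A, B, C) raised by the latest test failure."""
    signals = []
    if was_passing and fingerprint not in seen:
        signals.append("A")  # previously passing tests fail with a new fingerprint
    if same_file_failures >= 3:
        signals.append("B")  # same file failed 3 consecutive edit attempts
    if len(set(since_plan) | {fingerprint}) >= 2:
        signals.append("C")  # 2 unique fingerprints since the plan was approved
    return signals


def next_phase(signals, current="IMPLEMENTING"):
    """Any raised signal sends the workflow back to planning (assumed policy)."""
    return "needs_replan" if signals else current
```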

Scope Change Detection

Test Fails → Normalize Error → Generate Fingerprint → New fingerprint? → REPLAN
↓ Same
Continue fix attempts

error fingerprinting

When tests fail, the system doesn't just see "test failed." It normalizes the stack trace into a fingerprint — stripping noise to detect whether this is a truly new error or the same one repeating.

// Raw error:
File "app.py", line 42, in process: KeyError: 'user_id'

// Normalization strips:
• Line numbers (change with every edit)
• Timestamps (always different)
• Hex addresses (memory-dependent)
• PIDs (process-dependent)
• Variable values (instance-specific)

// Keeps:
• Error type (KeyError)
• Location (file app.py, function process)
• Message core (the structural pattern)

// Fingerprint:
app.py:process:KeyError

Same fingerprint = same bug, keep trying the current approach. New fingerprint = scope changed, trigger re-plan. This is what makes the TDD loop intelligent.
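
A minimal sketch of the normalization step for the single-line error format shown above. The regular expression and the fallback behavior are assumptions; a real normalizer would need to handle multi-line tracebacks and other languages.

```python
import re

# Matches: File "app.py", line 42, in process: KeyError: 'user_id'
TRACE_RE = re.compile(
    r'File "(?P<file>[^"]+)", line \d+, in (?P<func>\w+): (?P<etype>\w+)'
)


def fingerprint(error):
    """Reduce an error line to file:function:ErrorType, dropping volatile details."""
    match = TRACE_RE.search(error)
    if match:
        return f"{match['file']}:{match['func']}:{match['etype']}"
    # Fallback (assumed): strip hex addresses and digit runs, keep the message shape.
    return re.sub(r"0x[0-9a-fA-F]+|\d+", "", error).strip()
```

Two failures in the same function with different line numbers and different key names produce the same fingerprint, so they count as the same bug.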

failure intelligence

The system learns from every failure. Error fingerprinting creates a taxonomy (syntax, runtime, test, deployment, integration). Fix success rates are tracked across sessions.

// Error taxonomy:
SyntaxError in auth.py → fingerprint: syntax:auth:import
Same fingerprint 3x → Pattern detected
→ Suggest fix from database (73% success rate)

// Semantic Fix Deduplication:
Intent-based hashing of fixes
If a fix is semantically equivalent to a previous failed fix → BLOCKED
Prevents burning money retrying the same broken approach with different syntax

After 3 identical failures, the system stops retrying and suggests proven fixes from its history. No more burning $50 on the same error.

failure intelligence

failure_intelligence.py (14K) maintains an error taxonomy and tracks fix success rates across sessions.

// Error taxonomy (5 categories):
syntax — Parse errors, missing brackets
runtime — KeyError, TypeError, null refs
test — Assertion failures, timeout
deployment — Docker, scp, server errors
integration — API mismatches, schema drift

// Cross-session learning:
Fix success rates tracked per error type
After 3 identical failures → suggest proven fixes
Intent-based deduplication prevents retrying same approach

architecture

Six-phase hook pipeline, invisible state markers, and bash evasion detection

six-phase hook pipeline

Every interaction passes through six interception points. All invisible to the AI. The hook system is the foundation everything else is built on.

SessionStart — Fires when session begins (5 hooks)
  Bootstrap, continuity restore, session register, memory load, plugin patches

UserPromptSubmit — Fires on every user message (4 hooks)
  Complexity classification, memory injection, correction detection, media detect

PreToolUse — Fires BEFORE every tool call (58 gates)
  Can: BLOCK, ALLOW, or inject context

PostToolUse — Fires AFTER every tool result (5 hooks)
  Scope change detection, test tracking, auto-learning

Stop — Fires at session end (2 hooks)
  Blocks if unverified edits exist, extracts final learnings

PreCompact — Fires before context compaction (2 hooks)
  Saves session state, extracts learnings before memory is compressed

Every interaction passes through six phases. The AI sends a tool call, the hook intercepts it, runs 58 gates, and returns a verdict. The AI receives only the result — it has no idea 24,088 lines of hook code just evaluated its request.

Six-Phase Hook Pipeline

SessionStart → UserPrompt → PreToolUse → 58 Gates → ALLOW / BLOCK → Tool Executes → PostToolUse → Auto-learn → Stop → PreCompact

4-level risk classification

Every tool call is classified into one of four risk levels before the gate pipeline runs. The risk level determines which gates are evaluated — cheap gates always run, expensive gates are skipped for low-risk operations.

NONE — Read-only operations (Read, Glob, Grep, WebSearch)
  skip_behavioral = true → skip expensive gates

LOW — Small edits, single-file changes, safe Bash
  skip_behavioral = true → skip expensive gates

MEDIUM — Multi-file edits, config changes, agent spawning
  All gates run (confidence, research, saturation)

HIGH — Destructive commands, deployment, git push
  All gates run + additional confirmation checks

// Gates that ALWAYS run (regardless of risk):
CHECK 1 (Analyzer), CHECK 1.5 (Memory), CHECK 2.5 (Hedging), CHECK 2.5b (Bug Resolution)

// Gates that skip on NONE/LOW risk:
CHECK 2 (Confidence), CHECK 3 (Saturation), CHECK 4 (Research),
CHECK 5 (Persistence), CHECK 6 (Final Approval)

Design Innovation

Risk routing prevents gate fatigue. A simple Read call skips the expensive gates and returns almost immediately, while a git push gets the full 58-gate treatment. The AI sees the same interface either way; low-risk operations simply complete faster.
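
Risk routing can be sketched as a classifier plus a gate selector. The gate names and the read-only tool list come from the text above; classify_risk, its arguments, and the use of "Task" as the agent-spawning tool are assumptions. The extra confirmation checks for HIGH are not modeled.

```python
ALWAYS_RUN = ["CHECK 1", "CHECK 1.5", "CHECK 2.5", "CHECK 2.5b"]
BEHAVIORAL = ["CHECK 2", "CHECK 3", "CHECK 4", "CHECK 5", "CHECK 6"]
READ_ONLY_TOOLS = {"Read", "Glob", "Grep", "WebSearch"}


def classify_risk(tool, files_touched=1, destructive=False):
    """Assign NONE / LOW / MEDIUM / HIGH to a proposed tool call."""
    if tool in READ_ONLY_TOOLS:
        return "NONE"
    if destructive:
        return "HIGH"
    if files_touched > 1 or tool == "Task":
        return "MEDIUM"
    return "LOW"


def gates_for(risk):
    """Cheap gates always run; behavioral gates are skipped on NONE/LOW."""
    skip_behavioral = risk in ("NONE", "LOW")
    return list(ALWAYS_RUN) if skip_behavioral else ALWAYS_RUN + BEHAVIORAL
```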

9-phase workflow orchestrator

The system tracks the AI's progress through a 9-phase state machine. Each phase has required actions before the AI can advance. The orchestrator prevents skipping steps — no jumping straight to implementation without research.

1. IDLE — Waiting for user input
2. EXPLORING — Reading files, understanding codebase
3. RESEARCHING — WebSearch, external docs, prior art
4. PLANNING — Writing implementation plan
5. CHECKPOINT_1 — Plan review / user approval gate
6. CHECKPOINT_2 — Pre-implementation verification
7. IMPLEMENTING — Writing code, editing files
8. TESTING — Running tests, verifying changes
9. VERIFIED — All tests pass, session can end

Why This Matters

Without phase tracking, the AI will skip directly from "user asked a question" to "writing code." The orchestrator enforces the research → plan → approve → build → test loop that professional engineers follow. Phase transitions are tracked in state files, surviving context compaction.
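
One way to model the nine phases is an ordered enum with a guarded transition. This sketch assumes strictly one-step forward transitions and a single boolean for "required actions done"; the real orchestrator also moves backward (for example on needs_replan), which is not shown.

```python
from enum import IntEnum


class Phase(IntEnum):
    IDLE = 1
    EXPLORING = 2
    RESEARCHING = 3
    PLANNING = 4
    CHECKPOINT_1 = 5
    CHECKPOINT_2 = 6
    IMPLEMENTING = 7
    TESTING = 8
    VERIFIED = 9


def advance(current, target, requirements_met):
    """Allow a single forward step, and only when the current phase's requirements are met."""
    if target != current + 1 or not requirements_met:
        return current  # stay put: skipped step or unmet requirement
    return Phase(target)
```

A request to jump from IDLE straight to IMPLEMENTING leaves the phase unchanged.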

invisible state markers

The system persists state through filesystem marker files the AI cannot see. Every gate decision is recorded. Every approval is saved. The AI only sees "BLOCK" or "CONTINUE" — never the mechanism.

// Marker files (invisible to AI):
plan-approved.json → Plan was approved by user
needs_replan.marker → Scope change detected, must re-plan
analysis-state-{session}.json → Project understanding cached
confidence-cleared-{session}.txt → Thinking partner passed
pending-memory-ack-{session}.txt → Memory gate one-shot trigger
thinking-state-{session}.json → Reflexion loop state
MAINTENANCE_MODE → Emergency bypass — all PreToolUse gates skip

When the AI calls EnterPlanMode, a hook silently writes plan-approved.json. When it tries to edit code, another hook reads that file to verify approval. The AI never knows these files exist.

bash write detection

The AI can bypass the Edit tool by writing files through Bash — echo > file, sed -i, python3 -c "open(...).write(...)". The system detects 20 evasion patterns and forces proper tool use.

// Detected patterns:
echo "..." > file / echo "..." >> file
cat << EOF > file / tee file
sed -i 's/.../.../' file
python3 -c "open('f','w').write(...)"
node -e "fs.writeFileSync(...)"

IF bash_write_detected:
→ BLOCK: "Use Edit/Write tools — Bash writes bypass tracking"

Without this, every gate in the pipeline could be bypassed. The AI could write any file through Bash without triggering dirty state, backups, or simplification guards. This gate closes that escape hatch.
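
A naive sketch of pattern-based write detection covering the five pattern families listed above. The regular expressions are assumptions and deliberately simple: they do not parse shell quoting, so a redirect character inside a quoted string would be flagged as well.

```python
import re

WRITE_PATTERNS = [
    re.compile(r"(?<![0-9&])>>?\s*[^\s&>]"),             # echo ... > file, >> file, heredoc > file
    re.compile(r"\btee\b"),                               # ... | tee file
    re.compile(r"\bsed\s+(-\w+\s+)*-i\b"),                # sed -i in-place edit
    re.compile(r"\bpython3?\s+-c\b.*\bopen\(.*['\"][wa]"),  # python -c "open(..., 'w')"
    re.compile(r"\bnode\s+-e\b.*writeFileSync"),          # node -e "fs.writeFileSync(...)"
]


def is_bash_write(command):
    """True if the command looks like it writes a file through the shell."""
    return any(p.search(command) for p in WRITE_PATTERNS)


def bash_write_gate(command):
    if is_bash_write(command):
        return {"decision": "deny",
                "reason": "Use Edit/Write tools; Bash writes bypass tracking"}
    return {"decision": "allow"}
```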

hook daemon architecture

A long-running Unix socket server pre-loads all 12 hook handlers at startup. The thin client connects with a 2-second timeout and falls back to direct subprocess execution if the daemon is down.

hook-daemon.py (7.8K) — Unix socket at ~/.ai-controller/hook-daemon.sock
  Pre-loads 12 handlers, dispatches via length-prefixed JSON

hook-client.py (6.3K) — Thin client, 2s timeout
  Falls back to subprocess if socket unavailable

claude-supervisor.sh (5.5K) — Daemon lifecycle
  Auto-restarts on crash, health monitoring

// Fork-bomb prevention: env var guard blocks recursive subprocess spawning
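
Length-prefixed JSON framing, as mentioned for the daemon, can be sketched as follows. The 4-byte big-endian length header is an assumption; the text does not say how the prefix is encoded, and encode_frame / decode_frame are illustrative names.

```python
import json
import struct


def encode_frame(message):
    """Serialize a dict as: 4-byte big-endian length + UTF-8 JSON payload."""
    payload = json.dumps(message).encode("utf-8")
    return struct.pack(">I", len(payload)) + payload


def decode_frame(buffer):
    """Return (message, remaining_bytes), or (None, buffer) if the frame is incomplete."""
    if len(buffer) < 4:
        return None, buffer
    (length,) = struct.unpack(">I", buffer[:4])
    if len(buffer) < 4 + length:
        return None, buffer
    body = buffer[4:4 + length]
    return json.loads(body.decode("utf-8")), buffer[4 + length:]
```

The prefix lets the receiver know exactly how many bytes to read from the socket before parsing, so partial reads are simply retried with more data.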

state & persistence layer

16 state directories persist gate decisions, research progress, and session context across context compactions and session restarts.

// Intelligence state:
certainty-state/ — Confidence scores per decision (2.1MB)
phase-state/ — 9-phase execution state (548K)
research-state/ — Research findings & tracking (2.5MB)
thinking-state/ — Thinking partner analysis (636K)
prompt-clarity/ — Prompt analysis clarity (756K)

// Execution state:
bash-loop-state/ — Bash command loop detection
browser-flow-state/ — Browser automation flow
browser-swarm-state/ — Swarm orchestration state
scope-change-state/ — Scope change tracking
why-state/ — WHY comment enforcement
search-strategy/ — Agent dispatch tracking

// Verification:
dep-scan-state/ — Dependency scan results
enforcer-state/ — Enforcement tracking
verification-results/ — Verification artifacts
session-state-db/ — Session state persistence

// Bulk storage:
handoffs/ — 19,288 agent handoff files (77MB)
changes/ — 844 timestamped .diff files (4.1MB)
logs/ — Per-gate execution logs (134MB)
plans/ — 4,002 stored plans (69MB)
memory/ — Long-term memory (1.3GB)

configuration registry

Config-driven behavior via 8 JSON config files. No hardcoded constants — every deployment rule, test pattern, and project registration lives in config.

deploy_safety.json — Server allowlists, dangerous command patterns
test_patterns.json — Project-specific test regex overrides
test_environments.json — Test environment configuration
scope-change.json — Scope change detection rules
projects.json — Managed project registry
testing_gate.json — Testing gate configuration
active-plan-sessions.json — Plan session tracking
config.json — Gemini API + system configuration

context management

Two modules manage Claude's finite context window (~200K tokens), detecting pressure and cycling state to prevent information loss.

context_cycle.py (14K) — Context window cycling
Manages what stays in context vs. what gets persisted to disk

context_pressure.py (18K) — Overflow detection
Triggers PreCompact hooks before context fills up

// PreCompact hooks (fires before compaction):
session_start_continuity.py — Saves session state
precompact_learning_extractor.py — Extracts learnings
// State survives context death via filesystem markers

alerting & observability

Monitoring is system-wide: alerting.py (11K) generates alerts for system events, and observability.py provides hook execution metrics.

// Alert types:
Gate timeout alerts (hooks exceeding 55s)
Daemon health alerts (socket connection failures)
Memory pressure alerts (context overflow)
Coordination alerts (stale session cleanup)

permission_guard.py (10K) — Permission enforcement
Layer beyond gates for permission boundary checks

coordination

Cross-terminal awareness, blast radius analysis, and content-addressed versioning

cross-terminal coordination

Multiple Claude sessions can run simultaneously on the same codebase. A SQLite database coordinates them.

// Tables:
sessions → Active session registry + heartbeats
file_claims → File-level locking + conflict detection

// Flow:
Terminal 1: claims app/auth.py
Terminal 2: tries to edit app/auth.py
→ WARNING: "Terminal 1 is editing this file"

No merge conflicts. No lost work. Two AI sessions can work on different parts of your project at the same time without stepping on each other.
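
A sketch of file-level claims on SQLite. The file_claims table name comes from the text; the column layout, the function names, and the use of a primary-key conflict to detect a competing claim are assumptions.

```python
import sqlite3


def open_db(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS file_claims ("
        "path TEXT PRIMARY KEY, session_id TEXT NOT NULL)"
    )
    return db


def claim_file(db, path, session_id):
    """Return None if the claim succeeds, else the session_id of the current holder."""
    try:
        with db:  # commits on success, rolls back on error
            db.execute(
                "INSERT INTO file_claims (path, session_id) VALUES (?, ?)",
                (path, session_id),
            )
        return None
    except sqlite3.IntegrityError:
        row = db.execute(
            "SELECT session_id FROM file_claims WHERE path = ?", (path,)
        ).fetchone()
        return None if row[0] == session_id else row[0]
```

In the flow above, Terminal 2's call would return Terminal 1's session id, which the hook can turn into the warning message.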

blast radius analysis

Before any edit, the cascade detector builds a full dependency graph to calculate blast radius. Import/require parsing traces every file that depends on your change.

// You edit auth/login.py:
Direct: auth/login.py
Cascade level 1: api/routes.py, middleware/auth.py
Cascade level 2: tests/test_api.py, app/main.py

BACKUP: All 5 files (not just the one you edited)
TEST: auth + api + middleware // the full blast radius

Always-backup patterns: *.config, *.json, *.env, *.yaml

One-line auth change? The system knows it affects 5 files, backs up all 5, and tests all 5. No surprises.
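
Blast radius is a breadth-first walk over a reverse dependency graph. The sketch below assumes the graph has already been built from import parsing; DEPENDENTS mirrors the example above and blast_radius is a hypothetical name.

```python
from collections import deque

# file -> files that import it (reverse dependency edges from the example above)
DEPENDENTS = {
    "auth/login.py": ["api/routes.py", "middleware/auth.py"],
    "api/routes.py": ["tests/test_api.py", "app/main.py"],
}


def blast_radius(changed, dependents):
    """Return {file: cascade level} for every file reachable from the changed file."""
    levels = {changed: 0}
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for dep in dependents.get(current, ()):
            if dep not in levels:  # first visit gives the shortest cascade level
                levels[dep] = levels[current] + 1
                queue.append(dep)
    return levels
```

Every key in the result is a backup candidate, and the set of affected packages gives the test scope.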

content-addressed versioning

Every file edit creates a git commit — before AND after. Binary files use SHA-256 content-addressed blob storage with automatic deduplication.

// On every Edit/Write:
1. git commit — pre-edit snapshot
2. Apply edit
3. git commit — post-edit snapshot
4. Tag with test status (pending/pass/fail)

// Binary files (images, PDFs):
SHA-256 hash → store in blob/a3/f2b1...
Identical content = same hash = stored once

git bisect → find exactly which edit broke it

AST-aware diffs for code files — not line-by-line, but structural. "Function foo changed parameters" vs "line 47 changed."
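
Content-addressed blob storage fits in a few lines. This sketch assumes the two-character directory fan-out suggested by the blob/a3/f2b1... path above; store_blob and the root argument are illustrative.

```python
import hashlib
from pathlib import Path


def store_blob(data, root):
    """Write bytes under root/<first 2 hex chars>/<remaining hex chars>; return the path."""
    digest = hashlib.sha256(data).hexdigest()
    target = Path(root) / digest[:2] / digest[2:]
    if not target.exists():  # identical content hashes to the same path: stored once
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)
    return target
```

Storing the same image twice returns the same path and writes nothing the second time, which is where the deduplication comes from.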

agent handoff system

19,288 handoff files (77MB) enable seamless agent-to-agent context passing across sessions.

// Each handoff contains:
Task context — what was the agent working on
Current state — where did it stop
Progress artifacts — what was found/built
Recommendations — what should happen next

// Change tracking:
changes/ — 844 timestamped .diff files
Format: YYYYMMDD-HHMMSS-filename.diff
// Tracks edits to the hook system itself

design philosophy

The principles that make engineers stop and think

fail-closed, not fail-open

try:
    check_condition()
except Exception:
    return BLOCK  # Better to incorrectly block than allow data loss

Most systems fail-open: if the safety check crashes, let the action through. Every gate in this system does the opposite. A broken gate blocks, not permits. Because a false block costs you 30 seconds. A false permit can cost you your codebase.
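
One way to apply the pattern uniformly is a decorator that turns any exception into a block. This is a sketch, not the system's code: fail_closed, plan_gate, and the ALLOW/BLOCK string constants are assumptions for illustration.

```python
import functools

ALLOW, BLOCK = "ALLOW", "BLOCK"


def fail_closed(gate):
    """Wrap a gate so that any crash inside it is treated as a block."""
    @functools.wraps(gate)
    def wrapper(*args, **kwargs):
        try:
            return gate(*args, **kwargs)
        except Exception:
            return BLOCK  # a broken gate blocks rather than permits
    return wrapper


@fail_closed
def plan_gate(state):
    # Raises KeyError or TypeError on malformed state; the wrapper converts that to BLOCK.
    return ALLOW if state["plan_approved"] else BLOCK
```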

the ai cannot see the gates

What the AI sees

A block message appears: "Call EnterPlanMode before editing code files."

The AI has no knowledge of the pipeline, its structure, or its logic.

What actually happens

58 gates evaluate in sequence. Decision trees branch. Markers are checked. Confidence is scored. The AI receives only the final verdict.

The hook layer is invisible. The AI can't reason about what it can't perceive — and therefore can't circumvent it.

self-consuming markers

The memory acknowledgment gate writes a marker file. When the gate fires, it reads the file's contents... then deletes the file. The next attempt passes because the marker no longer exists.

marker = read("pending-memory-ack-{session}.txt")
show_learnings(marker.count, marker.summary)
delete("pending-memory-ack-{session}.txt") # one-shot

One-shot enforcement that destroys its own evidence. Acknowledged once, never blocks again.
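
A runnable sketch of the one-shot behavior, assuming the marker is a plain text file in a state directory and that the gate answers with the same decision dictionaries used elsewhere on this page. memory_ack_gate and state_dir are hypothetical names.

```python
from pathlib import Path


def memory_ack_gate(state_dir, session):
    """Block once while the marker exists, consuming it in the process."""
    marker = Path(state_dir) / f"pending-memory-ack-{session}.txt"
    try:
        summary = marker.read_text(encoding="utf-8")
    except FileNotFoundError:
        return {"decision": "allow"}  # no pending acknowledgment
    marker.unlink(missing_ok=True)    # one-shot: the marker is gone after this call
    return {"decision": "deny",
            "reason": f"Acknowledge recalled learnings first:\n{summary}"}
```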

the ai blocks itself on uncertainty

The hedging scanner searches the AI's own output for uncertainty language:

patterns = [
  r"I'm not sure",
  r"I don't know if",
  r"I don't know whether",
  r"\b[1-7][0-9]% confident",  # 10–79% confident
]

If the AI hedges, it blocks itself. The system forces a WebSearch before proceeding. The AI has no idea this scan is happening.
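
A minimal scanner for the phrases listed above, written as a single regular expression. Note that a numeric range such as 10 to 79 has to be spelled as two digit classes, since a bracket expression like [10-79] is a character class rather than a range. HEDGING and hedges are illustrative names, and curly apostrophes are not handled.

```python
import re

HEDGING = re.compile(
    r"I'm not sure"
    r"|I don't know (?:if|whether)"
    r"|\b[1-7][0-9]% confident",  # two-digit percentages from 10 to 79
    re.IGNORECASE,
)


def hedges(text):
    """True if the text contains any of the uncertainty patterns."""
    return HEDGING.search(text) is not None
```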

subprocess plan review

AI₁ writes plan → ExitPlanMode → spawn AI₂ → AI₂ reviews plan → approve/reject

A separate AI instance reviews the first AI's plan for unverified claims. The planner literally cannot grade its own homework.

infrastructure, not prompts

System Prompt

"Please be careful when editing files and make sure to back them up first."

Hook Infrastructure

if not plan_approved:
    return {"decision": "deny"}

Prompts are suggestions. The AI can ignore them, forget them, or rationalize around them. Hooks are law — mechanically enforced at the infrastructure level.

research is mandatory, not optional

Web research scores 0 points if not performed. Not 15 (too generous). Not 10 (still lets you squeak by). Zero. Because your training data cutoff was months ago, and the library you're about to use might have a breaking change.

# Confidence scoring
web_research_score = 0 if not did_web_search else 20

state survives context death

Claude's context window is finite (~200K tokens). When it fills up, old messages are compressed into a summary. Most AI tools lose everything at this point. Mrs. Kitty doesn't.

Without Controller

Context compacts. The AI forgets the plan, the approval, the research it did, and the bugs it already fixed. You start explaining everything again from scratch.

With Mrs. Kitty

State markers persist on the filesystem. plan-approved.json still exists. Memory is re-injected from 826KB of learnings. The session handoff document captures what was in progress. The AI picks up where it left off.

# Compaction event:
Session: abc123 → markers on disk
[context compresses]
Session: abc123 # session_id preserved!
  → markers still valid, gates pass correctly

# Belt-and-suspenders TTL fallback:
  if marker.age < TTL: accept regardless
  # Memory: 10min TTL, Bug resolution: 1hr TTL

Session_id is stable across compaction — markers survive naturally. TTL-based fallbacks provide defense-in-depth: even if a marker lookup fails for any reason, fresh markers are accepted. Better to have two safety nets than one.

by the numbers

Python Modules • Lines of Hooks • Safety Gates • Event Hooks • Specialist Agents • Backup Strategies • Test Parsers • Verification Tiers • State Directories • Handoff Files • Config Files • Hook Phases