system_architecture.md

Every edit passes through 33+ sequential gates. Not suggestions. Not lint warnings.
Infrastructure-level blocks the AI cannot see, cannot reason about, and cannot bypass.


the pipeline

pretooluse.py — 4,503 lines of enforcement
Gate Pipeline Flowchart: 33+ sequential safety gates that every edit passes through before approval.

EDIT / WRITE ARRIVES
→ Maintenance Mode? → BACKUP + BYPASS
→ Plan Mode Required? → BLOCK: EnterPlanMode (5 approval paths, OR gate)
→ 01 PROJECT ANALYZER (always runs)
→ 1.5 MEMORY RECALL (always runs)
→ 02 THINKING PARTNER (Reflexion loop; BLOCK below 97%; skipped for NONE/LOW risk)
→ 2b POST-FAILURE RESEARCH (hard block)
→ 2.5 HEDGING SCANNER (BLOCK on "not sure"; always runs)
→ 03 RESEARCH SATURATION (skipped for NONE/LOW risk; BLOCK: need research)
→ 05 RESEARCH PERSISTENCE
→ 06 USER APPROVAL (BLOCK: need approval)
→ 07 PROJECT SNAPSHOT (full backup)
→ 08 PER-FILE CHECKPOINT (.py/.js → GIT, .png → BLOB via SHA256, .log → SKIP, excluded → IGNORE)
→ 09 SIMPLIFICATION GUARD (BLOCK: 20%+ removed)
→ 10 TEST-BEFORE-DEPLOY (BLOCK: untested edits)
→ EDIT APPROVED → RUN TESTS → tests pass? YES → SHIP IT; NO → TDD LOOP (re-plan on scope change, research, implement fix, test, retry)

gate deep-dives

The actual decision logic behind each gate, in sequence.

Sometimes you need the AI to just do what you say — no checkpoints, no gates, no questions. Maintenance mode bypasses every single gate in the pipeline. One word activates it.

// Activation — just type in your terminal:
USER: "maintenance"
→ MAINTENANCE_MODE file created
→ ALL 33+ gates bypassed instantly
→ Backups still run (safety net)

// Deactivation — just type:
USER: "done" or "exit maintenance"
→ MAINTENANCE_MODE file removed
→ All gates re-engaged
Why This Exists

Batch edits across 10+ files would trigger gate checks on every single edit — making it impossibly slow. Maintenance mode lets you do bulk work in one burst, then re-engage all safety checks when you're done. The key insight: backups still run even in maintenance mode. You skip the verification gates, but your files are always safe to roll back.
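
A minimal sketch of how the hook can implement this bypass; the state directory, helper names, and verdict shape here are illustrative, not the actual implementation:

from pathlib import Path

STATE_DIR = Path(".ai-project")  # hypothetical location for state markers

def run_gate_pipeline(tool_call, gates, backup):
    # Maintenance mode: skip every gate, but the backup still runs.
    if (STATE_DIR / "MAINTENANCE_MODE").exists():
        backup(tool_call)
        return {"decision": "allow", "reason": "maintenance mode"}
    for gate in gates:                      # otherwise evaluate all 33+ gates in order
        verdict = gate(tool_call)
        if verdict.get("decision") == "deny":
            return verdict                  # fail-closed: first block wins
    backup(tool_call)
    return {"decision": "allow"}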

Before any code file (.py, .js, .ts, .cpp, etc.) can be edited, the AI must have entered plan mode and received explicit approval.

IF is_code_file(target):
  FOR path IN [
    1. plan-approved.json exists?
    2. pending-implementation.json claimable?
    3. implementation-state phase=implementing?
    4. CWD-based implementation state?
    5. File mentioned in approved plan text?
  ]:
    IF any path TRUE → CONTINUE
  ALL FALSE → BLOCK: "Call EnterPlanMode"
Design Innovation

Five independent approval paths means the system is resilient to state file corruption — any single proof of planning suffices. The AI can't accidentally bypass by deleting one marker file.
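
A rough sketch of the OR gate, assuming an illustrative `state` helper for the five lookups:

def plan_approval_satisfied(target_path, state):
    # Any single proof of planning suffices (OR gate); all five must fail to block.
    checks = (
        state.file_exists("plan-approved.json"),
        state.pending_implementation_claimable(),
        state.implementation_phase() == "implementing",
        state.cwd_implementation_active(),
        target_path in state.approved_plan_text(),
    )
    return any(checks)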

Plan markdown files (in /plans/) can only be edited while plan mode is active. This prevents the AI from silently modifying approved plans after the fact.

IF target is plan file (.md in /plans/):
  IF plan mode NOT active (no plan-entered marker):
    → BLOCK: "Plan files can only be edited in plan mode"
ELSE → CONTINUE
Why This Exists

Plan files ARE the artifact the user approves. If the AI could edit them outside plan mode, it could retroactively change an approved plan to match what it actually built — defeating the purpose of plan approval entirely.

Forces the AI to read and understand the entire project structure before touching anything. Always runs for code files — only maintenance mode bypasses this gate.

ALWAYS runs for code files // BUG-10 fix: never skipped
SCAN transcript for "I understand:" patterns
IF found → CONTINUE
ELSE → BLOCK: "Explore codebase first"

Searches 826KB+ of accumulated learnings. If this problem was solved before, surfaces the solution automatically using TF-IDF scored retrieval.

IF pending-memory-ack-{session}.txt exists:
  Read {count, summary} from marker file
  DELETE marker file // self-consuming!
  → BLOCK once: "Review {count} past learnings"
ELSE → CONTINUE
Design Innovation

Self-consuming marker: the gate reads the file, then deletes it. Next attempt passes because the marker no longer exists. One-shot enforcement that destroys its own evidence after acknowledgment.

When the user provides multiple tasks (via TodoWrite or numbered lists), this gate blocks implementation until the AI explicitly acknowledges the full task list. Prevents partial execution where the AI fixates on one task and forgets the rest.

IF pending-todo-ack-{session}.txt exists:
  Read {task_count, task_summary} from marker file
  DELETE marker file // self-consuming!
  → BLOCK once: "Acknowledge {N} tasks before proceeding"
ELSE → CONTINUE
Design Innovation

Same self-consuming marker pattern as Gate 1.5. The marker is written by the PostToolUse hook when it detects multi-task creation, then this gate reads and destroys it on acknowledgment. One-shot enforcement — zero false positives on subsequent tool calls.

Implements the Reflexion architecture (Shinn et al., 2023). The AI self-reflects on assumptions, asks informed questions, and scores confidence on a weighted 100-point scale.

IF risk == NONE/LOW OR plan_approved → SKIP
IF reflexion_triggered (2+ failures):
  → BLOCK: "Reflect before retrying"
IF confidence < 97%:
  → BLOCK: "WebSearch for gaps: {gaps}"
ELSE → CONTINUE
Design Innovation

Based on the Reflexion paper (Shinn et al., 2023). When the AI fails twice, a feedback loop forces it to explicitly self-reflect before retrying — preventing blind repetition.

After any failure, the AI cannot retry the same approach. It must do new research first.

IF failure_count >= 1 AND no research since failure:
  → BLOCK: "WebSearch before retrying"

Blocks Edit/Write operations on configuration files unless the AI has Read the file first in the current session. Prevents blind overwrites of critical configs like .env, docker-compose.yml, tsconfig.json, and hook Python files.

IF tool == Edit/Write AND is_config_file(path):
  IF path NOT in session_read_files:
    → BLOCK: "Read {file} before modifying"
  ELSE → CONTINUE
Why This Matters

AI agents frequently overwrite config files based on assumptions rather than reading the current state. This gate forces "look before you leap" behavior specifically for files where a wrong edit can break the entire environment.
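
As a sketch, the check can be as simple as a filename test against the session's read log (the config list and verdict shape here are assumptions, not the real gate code):

from pathlib import Path

CONFIG_NAMES = {".env", "docker-compose.yml", "tsconfig.json"}   # named in the gate above
CONFIG_SUFFIXES = {".yml", ".yaml", ".toml", ".ini"}             # illustrative extras

def config_read_first_gate(tool, path, session_read_files):
    p = Path(path)
    is_config = p.name in CONFIG_NAMES or p.suffix in CONFIG_SUFFIXES
    if tool in ("Edit", "Write") and is_config and path not in session_read_files:
        return {"decision": "deny", "reason": f"Read {path} before modifying"}
    return {"decision": "allow"}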

Scans the AI's own output for uncertainty language. If the AI isn't sure, it can't proceed.

SCAN last 8KB of transcript after last WebSearch:
  Pattern: "I'm not sure"
  Pattern: "I don't know if"
  Pattern: "[10-79]% confident"
IF hedge_found:
  → BLOCK: "WebSearch to verify uncertain claim"
Design Innovation

The AI literally blocks itself. It can't hedge and proceed — the system reads the AI's own text and forces research when it detects uncertainty. The AI has no idea this is happening.
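
A minimal sketch of the scan, using the patterns listed in the gate logic above (the exact regexes in the real hook may differ):

import re

HEDGE_PATTERNS = [
    re.compile(r"I'm not sure", re.IGNORECASE),
    re.compile(r"I don't know (?:if|whether)", re.IGNORECASE),
    re.compile(r"\b[1-7][0-9]% confident\b"),     # 10-79% confident
]

def find_hedge(transcript_tail):
    # Scan the last chunk of the AI's own output for uncertainty language.
    for pattern in HEDGE_PATTERNS:
        match = pattern.search(transcript_tail)
        if match:
            return match.group(0)
    return None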

Tracks active bug investigations. If the AI is debugging a problem, this gate prevents it from abandoning the investigation to start new work.

ALWAYS runs // cheap gate, never skipped
IF active_bug_resolution AND attempting unrelated edit:
  → BLOCK: "Finish debugging before starting new work"
IF bug marked resolved AND no verification test:
  → BLOCK: "Prove the fix works — run tests"
Why This Matters

AI agents love to context-switch away from hard bugs. They'll abandon a half-debugged problem and start on something easier. This gate forces them to finish what they started — or explicitly escalate to you.

Tracks research actions during the session. The AI can't start coding until it's done enough homework — proportional to task complexity.

// Thresholds by complexity:
TRIVIAL → 3 research points required
LOW → 3 points
MEDIUM → 3 points (min 1 WebSearch required)
HIGH → 6 points

IF action_count < threshold:
  → BLOCK: "Need {N} more WebSearch calls"

The main research enforcement gate. Uses a complexity classifier to determine the current task's difficulty, then requires proportional research effort before allowing implementation. Caches classifications per-session to avoid redundant LLM calls.

// Step 1: Classify task complexity (cached per session)
complexity = classify(prompt) // TRIVIAL, LOW, MEDIUM, HIGH

IF risk == NONE/LOW → SKIP // skip_behavioral
IF plan_approved → SKIP

// Step 2: Check research requirements
IF tool == Write/Edit AND research_count < required:
  → BLOCK: "Do {required} research actions first"
IF tool == Write/Edit AND no WebSearch/WebFetch in session:
  → BLOCK: "At least one WebSearch required"
Design Innovation

The complexity classifier runs once and is cached for the entire session, avoiding repeated LLM inference. The cache key includes the user's original prompt hash, so follow-up messages within the same task don't reclassify. Research actions include WebSearch, WebFetch, Read (external docs), and Grep across multiple files.
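
A sketch of the caching idea, with illustrative file names and an injected `classify_fn` standing in for the real classifier:

import hashlib
import json
from pathlib import Path

def classify_complexity(prompt, session_id, state_dir, classify_fn):
    # One classification per (session, prompt); cached on disk so
    # follow-up tool calls skip the extra LLM round-trip.
    key = hashlib.sha256(prompt.encode()).hexdigest()[:16]
    cache = Path(state_dir) / f"complexity-{session_id}-{key}.json"
    if cache.exists():
        return json.loads(cache.read_text())["complexity"]
    complexity = classify_fn(prompt)        # returns TRIVIAL / LOW / MEDIUM / HIGH
    cache.write_text(json.dumps({"complexity": complexity}))
    return complexity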

Forces the AI to share its research findings with you before it starts coding. Nothing learned gets lost when context cycles.

IF complexity in [MEDIUM, HIGH]
  AND research done BUT not shared with user:
  → BLOCK: "Share research findings (2-3 sentences)"

The full plan is presented in plain English. You read it, you approve it, or nothing happens. No surprises.

IF plan_approval_required AND NOT plan_approved:
  → BLOCK: "Call ExitPlanMode for approval"

Creates a full session-level backup of all source files before any changes begin. Complete rollback point.

IF no snapshot for this session yet:
  Create full backup to .ai-project/snapshots/{session_id}/
CONTINUE

Backs up each individual file RIGHT BEFORE it's edited. Strategy selected automatically by file type.

// 4 backup strategies:
.py, .js, .css → GIT_SURGICAL (per-project git repo)
.png, .pdf → BLOB_ONLY (SHA256 content-addressed, dedup)
.log, node_modules → SKIP (not worth backing up)
User-excluded → IGNORE
Design Innovation

Binary files use content-addressed SHA256 hashing with automatic deduplication. Same image stored once, referenced everywhere. Zero context window overhead — fully programmatic, no conversation tokens wasted.
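
A sketch of content-addressed blob backup with dedup (paths and helper names are illustrative):

import hashlib
import shutil
from pathlib import Path

def blob_backup(file_path, blob_root):
    # Content-addressed: identical bytes hash to the same blob and are stored once.
    digest = hashlib.sha256(Path(file_path).read_bytes()).hexdigest()
    dest = Path(blob_root) / digest[:2] / digest        # e.g. blob/a3/f2b1...
    if not dest.exists():                               # dedup: already stored
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(file_path, dest)
    return digest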

Blocks removal of 20%+ of code without explicit justification. Prevents the AI from "simplifying" your work into oblivion.

IF edit removes functions, imports, or try/catch blocks:
  → BLOCK: "Show test plan proving removal is safe"

Blocks deployment commands (docker compose up, git push, etc.) until all tests pass.

IF is_deployment_command AND dirty_state == "DIRTY":
  → BLOCK: "{N} untested edits — run tests first"

Disabled since February 2026. The PASS Lock was designed as an Azure DevOps–style exclusive lock — once tests pass, a SHA-256 hash locks the project state and blocks further edits without re-testing. In practice, it created too much friction for iterative development workflows and was disabled while evaluating a lighter-weight alternative.

// Original design (currently inactive):
ON tests_pass:
  state_hash = SHA256(all_edited_files)
  write PASS-{task_id}.json with hash
  LOCK: no edits allowed without re-test

// Currently: Gate 10 (Test-Before-Deploy) handles deployment gating
// Gate 13 (Stop Hook) handles session-end test enforcement

Prevents brute-force retry loops. AI agents love to retry failed commands 10 times instead of investigating. This gate makes that impossible.

Layer 1 — Immediate:
IF last_bash_failed AND next_tool == Bash:
  HARD BLOCK: "Investigate first — use Read, Grep, or WebSearch"

Layer 2 — Accumulating:
IF 4+ consecutive Bash-only:
  SOFT NUDGE: "You're looping — try a different approach"
IF 6+ consecutive + 2 failures:
  HARD BLOCK: "Stuck in retry loop"

Reset: WebSearch or WebFetch fully clears state
// Researching = evidence of investigation, not brute force
Why This Matters

Without this, the AI burns through your compute budget retrying the same broken command. Error fingerprinting detects failures even when exit code is 0 (stderr patterns: "error:", "fatal:", "connection refused").
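
A minimal sketch of failure detection that goes beyond the exit code, using the stderr patterns quoted above (the real list is longer):

FAILURE_MARKERS = ("error:", "fatal:", "connection refused")   # patterns from above

def bash_failed(exit_code, stderr):
    # A command can "succeed" with exit code 0 and still have failed.
    if exit_code != 0:
        return True
    lowered = stderr.lower()
    return any(marker in lowered for marker in FAILURE_MARKERS)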

The final gate. Fires when the AI tries to end a session. It checks TestingGate.can_complete() mechanically — not by reading the AI's text. The AI cannot claim "done" with words.

// This is NOT text analysis. It reads state files.
ON session_end_attempt:
  state = TestingGate.load_state()

  IF state.dirty_state == "DIRTY":
    IF state.session_own_edits > 0:
      BLOCK: "Untested code changes exist"

  // On EVERY stop, write continuity handoff:
  extract: active_task, key_decisions, files_modified
  write: handoff-{terminal_id}.md
  Next session picks up exactly where this one left off
Why This Matters

Other AI tools let the agent say "all done!" while tests are failing. Mrs. Kitty checks the actual state files on disk. The AI's opinion about whether it's done is irrelevant. The filesystem is the source of truth.

testing architecture

LLM-classified tests, tamper-evident receipts, cross-session debt enforcement

Every edit triggers dirty state tracking. The AI physically cannot claim "done" without test proof.

Edit File → Dirty State → Test Planner → Execute Tests → LLM Classify → Receipt
ON Edit/Write tool:
  dirty_state = "DIRTY"
  Track: file path, timestamp, edit type

ON session end (stop.py):
  IF dirty_state == "DIRTY":
    BLOCK: "{N} files edited without test verification"
    Force test execution before session can end

ON deployment command detected:
  IF dirty_state == "DIRTY":
    BLOCK: "Run tests before deploying"
Design Innovation

The AI doesn't decide when testing is done — the infrastructure does. stop.py blocks session termination until every edited file has a corresponding test receipt. The AI can't rationalize its way past a filesystem check.

Test output from any runner is parsed by specialized pattern matchers. When patterns fail, the raw output is sent to the LLM for semantic classification — the system understands test results, not just regex-matches them.

pytest
=== N passed in 0.42s ===

Python standard. Handles truncated output, fixture errors, and collection failures.

Jest
Tests: 2 failed, 5 passed

JavaScript/TypeScript. Supports JSON reporter mode and watch-mode output.

Browser Tests
2 passed (1.2s)

Visual test results from agent-browser verification runs. Parses pass/fail counts.

Cargo (Rust)
test result: ok. 5 passed

Rust test harness. Handles compilation errors vs runtime failures.

Go test
ok  package  0.003s

Go native testing. Distinguishes ok from FAIL package lines.

Vitest
Tests  5 passed (3)

Vite-native runner. Jest-compatible patterns plus native format.

// Classification cascade:
TRY structured parser (regex per runner) → deterministic result
TRY generic fallback (line-by-line PASS/FAIL) → best-effort
FALLBACK LLM semantic analysis → contextual understanding

// Custom patterns (user-defined):
test_patterns.json → project-specific regex overrides
Design Innovation

Most CI systems parse one runner. This system parses any runner in any language — and when all parsers fail, an LLM reads the raw output and classifies it semantically. It literally understands what "2 scenarios (1 failed, 1 passed)" means even for frameworks it's never seen.
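
A sketch of the cascade, with the parser list and LLM classifier passed in as callables (names are illustrative):

def classify_test_output(output, parsers, llm_classify):
    # 1) structured parsers (pytest, jest, cargo, ...) -> deterministic
    for parse in parsers:
        result = parse(output)
        if result is not None:
            return result
    # 2) generic line-by-line PASS/FAIL scan -> best effort
    lines = output.lower().splitlines()
    passed = sum("pass" in line for line in lines)
    failed = sum("fail" in line for line in lines)
    if passed or failed:
        return {"passed": passed, "failed": failed, "source": "generic"}
    # 3) LLM semantic analysis -> contextual understanding
    return llm_classify(output)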

The AI literally cannot forge a passing test result. Only the posttooluse hook — which observes actual Bash output — can write a receipt.

"attested_by": "posttooluse_hook"
"command": "pytest tests/ -v"
"command_hash": "sha256:a3f8c1..."
"output_hash": "sha256:7d2e9b..."
"result": "PASS"
"passed": 12, "failed": 0
"timestamp": "2026-02-24T14:32:01Z"
"ttl_minutes": 30
"version": 3
// Anti-forgery layers:
1. attested_by must be "posttooluse_hook" // AI can't write this
2. command_hash = SHA256(actual command) // proves what ran
3. output_hash = SHA256(actual output) // proves what returned
4. ttl_minutes: 30 // stale receipts rejected
5. version: 3 // old format receipts rejected
Design Innovation

SHA256 hashes of both the command and its output create a cryptographic proof chain. If the AI could somehow write a receipt file (it can't), the hashes wouldn't match the actual test run. It's not just access control — it's tamper evidence.
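
A sketch of what receipt verification can look like, inferred from the fields shown above; the actual hook logic may differ:

import hashlib
import json
from datetime import datetime, timezone

def receipt_valid(receipt_path, observed_command, observed_output):
    # Verify a receipt against what the hook actually observed.
    r = json.load(open(receipt_path))
    if r.get("attested_by") != "posttooluse_hook":
        return False
    if r.get("version", 0) < 3:                            # old formats rejected
        return False
    cmd_hash = "sha256:" + hashlib.sha256(observed_command.encode()).hexdigest()
    out_hash = "sha256:" + hashlib.sha256(observed_output.encode()).hexdigest()
    if r.get("command_hash") != cmd_hash or r.get("output_hash") != out_hash:
        return False
    created = datetime.fromisoformat(r["timestamp"].replace("Z", "+00:00"))
    age_minutes = (datetime.now(timezone.utc) - created).total_seconds() / 60
    return age_minutes <= r.get("ttl_minutes", 30)          # stale receipts rejected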

Close the terminal with untested edits? The debt follows you. Next session starts with a testing obligation you can't dismiss.

ON session end with dirty_state:
  Export debt file: testing-debt-{project}.json
  Contains: file list, edit timestamps, project type

ON next session start:
  IF debt file exists for this project:
    Inject: "You have {N} files with untested changes"
    Gate blocks implementation until debt cleared

// Safety mechanisms:
Terminal-isolated: Terminal 2 can't clear Terminal 1's debt
4-hour TTL: prevents zombie debt from abandoned sessions
Project-type filtering: web debt doesn't block Python work

Not every file needs the same tests. The planner classifies edits into test types and selects the right verification strategy — browser-first for web, VM for installers, unit tests for libraries.

// TestType enum — what kind of verification?
VISUAL_BROWSER → Web edits tested in real browser (agent-browser CLI)
INSTALLER_VM → Installer tested in fresh VM
DATABASE_FLOW → Migration tested with real DB
API_ENDPOINT → Endpoint tested with curl/httpx
UNIT_TEST → Standard test runner (pytest, jest, etc.)
SKIP → Infrastructure files (CI configs, dockerfiles, docs)

// Dependency maps:
FILE components/Login.tsx → FEATURES [auth, session] → PAGES [/login, /signup]
FILE styles/theme.css → FEATURES [layout] → PAGES [all]

// Auto-detect routes:
Next.js app router, Flask routes, static HTML → auto-discovered
Design Innovation

Most test systems run pytest and call it done. This system knows that editing a .jsx file means you need a browser test, not a unit test. It traces the dependency from file → feature → affected pages, and tests what the user would actually see.

autonomous browser testing

Agentic browser automation — 140+ commands, your real Chrome, completely undetectable

A complete Playwright replacement that uses your real Chrome. Completely undetectable — no navigator.webdriver flag. 140+ commands for navigation, interaction, state management, network control, and more. Open source.

# Your real Chrome. Not simulated. Not detectable.
agent-browser open "https://amazon.com"
agent-browser snapshot # reads page structure — 200 tokens vs 13,700 for Playwright
agent-browser click "@e3" # stable element refs, not fragile CSS selectors
agent-browser fill "@e7" "search query"
agent-browser screenshot "evidence.png"
agent-browser state_save "logged-in" # save session state, restore anytime
agent-browser errors # console errors + network failures
// Why Playwright is old news:
~200 tokens/page vs 13,700 for Playwright MCP → 98.5% reduction
10-step workflow: 7,000 tokens vs 114,000 with Playwright → 5.7x more test cycles
Your real Chrome profile → persistent cookies, real fingerprint, saved logins
Direct WebSocket CDP → no Node.js intermediary, no framework overhead
Open source on GitHub (Apache-2.0)

Reads page structure instead of taking screenshots — 93% less context. Structure-first verification catches broken elements, missing content, and layout issues without burning tokens on pixel analysis.

1. Console Health: Check for JS errors, uncaught exceptions, failed network requests. Cheapest — no rendering needed.
2. Snapshot (Structure): Reads the page's accessibility tree — buttons, headings, forms, links — in ~200 tokens. Replaces screenshot-based verification for most checks.
3. Screenshot (Visual): PNG capture for CSS, colors, spacing, layout. Used only when structure alone isn't enough — design validation, pixel-level regression testing.
4. Responsive Check: Resize to mobile viewport (375px). Re-run tiers 1-3. Catches mobile-specific breakage.
+ Security Scan: SAST analysis via semgrep. XSS, injection, hardcoded secrets — caught before deployment.
Design Innovation

Structure-first verification uses ~200 tokens per page compared to ~13,700 for screenshot-based tools. A 10-page verification costs 2,000 tokens instead of 137,000. The AI reads the page like a screen reader — fast, accurate, and token-efficient.

Not just verification — full browser automation. Navigate complex UIs, fill forms, interact with dynamic content, message suppliers, find products, manage workflows. The AI operates your browser like you would.

1. AI receives a browser task — testing, research, or interaction
2. Claude Code skill researches the UI, plans the smartest path
3. Agent-browser navigates, clicks, fills, scrolls — using stable @e references
4. Parallel sessions handle multiple pages or workflows simultaneously
5. Snapshot verification confirms each step succeeded — ~200 tokens per check
6. Task completes — results, evidence, and verification collected
7. Tamper-evident receipt generated with SHA256 evidence chain

Tests don't stop at the local dev server. After git push or docker compose up, the system navigates to the live URL and verifies production.

// 20+ deployment command patterns detected:
git push, docker compose up, npm run deploy,
vercel --prod, fly deploy, scp ... server:, ...

ON deployment detected:
  1. Extract live URL from output or project config
  2. Wait for deployment propagation
  3. agent-browser open {live_url}
  4. Run structure-first verification against production
  5. IF console errors OR visual regression:
    → ALERT: "Production issue detected post-deploy"
Design Innovation

The AI doesn't just deploy and hope. It navigates to your live production URL, checks for console errors, takes a screenshot, and tells you if something broke. All automatically, seconds after the deploy command finishes. Structure-first snapshots mean verification costs ~200 tokens per page, not thousands.

Installer builds are tested in an isolated Windows VM with golden snapshot restore. Every test starts from a known-clean state — no contamination from previous runs.

// test-installer skill workflow:
1. Restore VM from golden snapshot (Docker volume)
2. Compile installer via InnoSetup (ISCC.exe)
3. Copy .exe to VM shared folder (\\host.lan\Data)
4. Execute installer silently inside VM
5. Verify installation via noVNC + agent-browser
6. Check: files exist, services running, config correct
7. Restore golden snapshot (clean for next run)

// Access:
noVNC: http://localhost:8006
Shared folder: tests/layer4-e2e/shared/ → \\host.lan\Data
Why This Matters

Most installer testing is manual: build, run on your machine, check if it works, uninstall, repeat. This automates the entire cycle with a fresh VM every time. Golden snapshot restore guarantees the VM is identical across runs — no "it worked on my machine" because the machine is literally reset to factory state between tests.

confidence engine

Weighted 100-point algorithm with reflexion loops — won't write a single line below 97
97% confidence scoring

Every tool call is scored on a weighted 100-point scale across five factors. Each factor contributes exactly 20 points. The system won't write a single line of code until the total score reaches 97 or higher.

| Factor | What It Measures | Points |
| --- | --- | --- |
| Plan Quality | Is the implementation plan specific and actionable? | 20 |
| File Understanding | Has the AI read and understood all relevant files? | 20 |
| Dependency Analysis | Are imports, packages, and side effects mapped? | 20 |
| Web Research | Has external research been performed? | 20 |
| Context Coverage | Does the plan account for edge cases and tests? | 20 |

Web research gets 0 if not done. Not 15. Not 10. Zero. Your training data cutoff was months ago. The library you're about to use might have a breaking change.

Plan +20 · Files +20 · Deps +20 · Web +20 · Ctx +20
// Threshold bands:
0–50: Heavy research required — major knowledge gaps
50–80: More context needed — read more files, check deps
80–97: Targeted verification — one or two gaps remain
97+: Proceed to implementation
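
A sketch of the scoring and the threshold bands, with illustrative field names on the evidence object:

def confidence_score(evidence):
    # Five factors, 20 points each. Web research is all-or-nothing.
    score = 0
    score += 20 if evidence.plan_is_specific else 0          # Plan Quality
    score += 20 if evidence.relevant_files_read else 0       # File Understanding
    score += 20 if evidence.dependencies_mapped else 0       # Dependency Analysis
    score += 20 if evidence.did_web_search else 0            # Web Research: 0 or 20
    score += 20 if evidence.edge_cases_covered else 0        # Context Coverage
    return score

def band(score):
    if score >= 97:
        return "PROCEED"
    if score >= 80:
        return "TARGETED_VERIFICATION"
    if score >= 50:
        return "MORE_CONTEXT_NEEDED"
    return "HEAVY_RESEARCH_REQUIRED"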

Based on the Reflexion architecture (Shinn et al., 2023). When the score falls below 97%, a feedback loop identifies the specific gap, forces research to fill it, then re-scores. This loop runs up to 3 times before hard-blocking.

Confidence Scoring Pipeline: Tool Call → Score 5 Factors → ≥97%? → PROCEED
If below 97%: Identify Gap → Research → Re-score (max 3x) → still <97% after 3 loops → BLOCK
complexity classifier

Semantic risk scoring. Not line count — meaning. The classifier analyzes the task description and flags high-risk patterns that require additional research and verification.

"delete users table" → Destructive(+40) + Critical(+60) = 100 → HIGH
"fix typo in readme" → 0 signals = TRIVIAL
Unknown task → MEDIUM // fail-safe: unknown = needs research

19 high-risk patterns detected instantly:

Authentication / authorization changes
Payment / billing code
Database migrations / schema changes
SSH / key management
File deletion / destructive operations
API key / credential handling
Deployment scripts
User data handling (PII)
Security configuration
Infrastructure changes
// + 9 more patterns...

Each detected pattern adds risk points. Multiple patterns compound. The total determines which gates are enforced and how much research is required.
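
A sketch of pattern-based risk scoring, using an illustrative subset of the 19 patterns:

import re

RISK_PATTERNS = {
    "destructive": (r"\b(delete|drop|rm -rf|truncate)\b", 40),
    "critical":    (r"\b(users table|production|credential|secret)\b", 60),
    "payments":    (r"\b(payment|billing)\b", 50),
    "migrations":  (r"\b(migration|schema change)\b", 50),
}

def risk_points(task):
    # Each detected pattern adds points; multiple patterns compound.
    return sum(points for pattern, points in RISK_PATTERNS.values()
               if re.search(pattern, task, re.IGNORECASE))

# "delete users table" → destructive(+40) + critical(+60) = 100 → HIGH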

exitplanmode 3-layer gate

The plan approval process has its own triple-verification. A plan doesn't just need to exist — it needs to prove it was built on solid research.

Layer 1 — Research Saturation:
  Has enough external research been done for the task complexity?
  TRIVIAL = 0 points, LOW = 1, MEDIUM = 3 (min 1 WebSearch), HIGH = 5

Layer 2 — Confidence ≥ 97%:
  The 5-factor confidence score must meet the threshold
  Plan evidence boost: auto-detects URLs in plan text (+points for citations)

Layer 3 — Subprocess Review:
  A SECOND Claude instance reviews the plan
  Checks for: unverified claims, hallucinations, logical gaps
  The planner literally cannot grade its own homework
Design Innovation

A separate AI instance reviews the first AI's plan for unverified claims. The planner cannot review its own work — a second opinion is architecturally enforced, not optional.

self-healing cascade

When something breaks, 9 specialized agents investigate before bothering you.

Phase 1: direct-fix
Cycle 1: error-trace, similar-patterns, docs-and-deps
Cycle 2: alt-approach, regression-check, env-check
Cycle 3: broader-context, minimal-repro, expert-consult
Escalate: ask human (with 9 research reports)

memory system

826KB+ of accumulated learnings — TF-IDF scored retrieval with 3-tier injection
permanent memory

826KB+ append-only JSONL archive. Every decision, fix, and pattern stored permanently. The system doesn't just remember — it remembers the right things at the right time.

8 Learning Types:

  • ARCHITECTURAL_DECISION — "We chose JWT over sessions because..."
  • WORKING_SOLUTION — "Fixed by adding CORS middleware"
  • CODEBASE_PATTERN — "This project uses CSS Modules, not Tailwind"
  • FAILED_APPROACH — "Don't use library X, it breaks with Y"
  • error_resolution — Auto-captured error→fix pairs
  • test_result — Test pass/fail context
  • USER_PREFERENCE — "Always use bun, never npm"
  • in_session — General session learnings

Scoring Formula:

Every learning is scored against the current task using a multi-factor relevance algorithm:

// Relevance scoring formula:
base = (exact_match + stem_match * 0.7 + substring_match * 0.5) / token_count
score = base × type_boost × tag_boost × e^(-0.03 × days_old)

// Boost factors:
type_boost = 1.5× for corrections and user preferences
tag_boost = 2.0× when learning tags overlap with current task keywords
recency_decay = exponential decay at 0.03/day // ~50% weight at 23 days

Stem matching catches morphological variants (e.g., "testing" matches "test"), while substring matching catches partial overlaps. The combined scoring avoids both false positives and false negatives better than pure TF-IDF.
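
The formula above, sketched in code (field names and the pre-computed match counts are illustrative):

import math

def relevance(learning, task_tokens, days_old, exact, stem, substr):
    # exact/stem/substr are pre-computed match counts for this learning.
    base = (exact + 0.7 * stem + 0.5 * substr) / max(len(task_tokens), 1)
    type_boost = 1.5 if learning["type"] in ("USER_PREFERENCE", "correction") else 1.0
    tag_boost = 2.0 if set(learning.get("tags", [])) & set(task_tokens) else 1.0
    return base * type_boost * tag_boost * math.exp(-0.03 * days_old)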

3-Tier Injection:

HOT (>0.3): Full content injected + acknowledgment gate forces Claude to read it
WARM (0.1–0.3): Preview shown, Claude can pull more if needed
COLD (<0.1): Not shown, but core MEMORY.md always injected

The HOT tier doesn't just show memories — it forces acknowledgment through a self-consuming gate. The AI must explicitly process the recalled learnings before proceeding. This prevents relevant past solutions from being ignored.

auto-capture pipeline

Memory capture happens at two levels — real-time during the session and comprehensively when it ends.

Real-Time (PostToolUse Hook):

The PostToolUse hook watches every tool result. When it detects an error followed by a successful resolution, it captures the pair automatically. No manual action needed — the system learns from every fix in real time.

Session-End Extraction:

When a session ends, the stop hook scans the full conversation transcript and extracts:

Architectural decisions — "We chose JWT over sessions because..."
Working solutions — "Fixed by adding CORS middleware"
Failed approaches — "Don't use library X, it breaks with Y"
User preferences — "Always use bun, never npm"
Error→resolution pairs — Auto-captured during PostToolUse

Distillation:

The JSONL archive grows over time. Periodic consolidation into MEMORY.md keeps core knowledge lean — extracting the most important patterns and decisions into a concise document that's always injected at session start.

Memory Injection Pipeline: User Prompt → TF-IDF Search → Score? → HOT: full inject + gate / WARM: preview / COLD: core only
context compaction resilience

Claude has a ~200K token context window. When it fills up, the system compresses prior messages into a summary. Critically, compaction does NOT change the session_id — this was a misdiagnosis that was corrected in February 2026. The session_id persists across compaction events.

// How compaction resilience works:
Session abc123 → markers: confidence-cleared-abc123.txt
[context compacts — old messages compressed]
Session abc123 // same session_id preserved!
  → Markers still valid — gates pass correctly

Belt-and-suspenders: TTL-based expiration provides additional resilience. Memory markers have 10-minute TTL. Bug resolution markers have 1-hour TTL. Even if a marker lookup fails for any reason, fresh markers are accepted regardless of session key.
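
A sketch of the marker check with the TTL fallback (marker naming and the TTL table are illustrative):

import time
from pathlib import Path

MARKER_TTL_SECONDS = {"memory": 10 * 60, "bug_resolution": 60 * 60}   # 10 min / 1 hour

def marker_is_valid(marker, kind, session_id):
    # Normal path: the marker belongs to this session.
    # Fallback: a fresh marker is accepted even if the session key doesn't match.
    marker = Path(marker)
    if not marker.exists():
        return False
    if session_id in marker.name:
        return True
    age = time.time() - marker.stat().st_mtime
    return age < MARKER_TTL_SECONDS.get(kind, 600)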

Engineering Honesty

The original design assumed compaction would create new session_ids, leading to orphaned markers. Investigation proved this wrong — session_id is stable. TTL-based fallbacks remain as defense-in-depth, not as a primary mitigation. We document what we learned, including our mistakes.

scope detection

3-signal monitoring with error fingerprinting — knows when to persist and when to pivot
scope change detection

3-signal monitoring. If the problem shifts mid-build, it halts and re-plans automatically. The system doesn't just detect failures — it detects when the nature of the failure changes.

Signal A (HIGH): Tests that WERE passing now fail with a new error fingerprint — the problem shifted. You were fixing auth, but now you've broken the database layer.

Signal B (MED): Same file fails 3+ consecutive edit attempts — stuck in a loop. The AI is trying variations of the same broken approach.

Signal C (MED): 2+ unique error fingerprints since plan was approved — scope is drifting. The original plan no longer matches reality.

→ Trigger: set phase=needs_replan, write plan-required marker, force the AI back to planning mode
Scope Change Detection: Test Fails → Normalize Error → Generate Fingerprint → new fingerprint? → REPLAN
Same fingerprint → continue fix attempts
error fingerprinting

When tests fail, the system doesn't just see "test failed." It normalizes the stack trace into a fingerprint — stripping noise to detect whether this is a truly new error or the same one repeating.

// Raw error:
File "app.py", line 42, in process: KeyError: 'user_id'

// Normalization strips:
• Line numbers (change with every edit)
• Timestamps (always different)
• Hex addresses (memory-dependent)
• PIDs (process-dependent)
• Variable values (instance-specific)

// Keeps:
• Error type (KeyError)
• Module name (app.py, process)
• Message core (the structural pattern)

// Fingerprint:
app.py:process:KeyError

Same fingerprint = same bug, keep trying the current approach. New fingerprint = scope changed, trigger re-plan. This is what makes the TDD loop intelligent.
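
A sketch of the normalization, reproducing the example above (the real normalizer strips more noise, such as hex addresses and PIDs):

import re

def fingerprint(error_line):
    # Keep the structure (module, function, error type); drop the noise.
    module = re.search(r'File "([^"]+)"', error_line)
    func = re.search(r"in (\w+)", error_line)
    err = re.search(r"(\w+Error)", error_line)
    return ":".join(m.group(1) for m in (module, func, err) if m)

# fingerprint('File "app.py", line 42, in process: KeyError: \'user_id\'')
# → "app.py:process:KeyError"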

failure intelligence

The system learns from every failure. Error fingerprinting creates a taxonomy (syntax, runtime, test, deployment, integration). Fix success rates are tracked across sessions.

// Error taxonomy:
SyntaxError in auth.py → fingerprint: syntax:auth:import
Same fingerprint 3x → Pattern detected
  → Suggest fix from database (73% success rate)

// Semantic Fix Deduplication:
Intent-based hashing of fixes
If a fix is semantically equivalent to a previous failed fix → BLOCKED
Prevents burning money retrying the same broken approach with different syntax

After 3+ identical failures, the system stops retrying and suggests proven fixes from its history. No more burning $50 on the same error.

architecture

Three-layer hook pipeline, invisible state markers, and bash evasion detection
three-layer hook pipeline

Every interaction passes through three interception points. All invisible to the AI. The hook system is the foundation everything else is built on.

PreToolUse — Fires BEFORE every tool call
  33+ gates evaluate in sequence
  Can: BLOCK, ALLOW, or inject context

PostToolUse — Fires AFTER every tool result
  Scope change detection, test tracking
  Auto-learning: captures error→resolution pairs

UserPromptSubmit — Fires on every user message
  Memory injection, session registration
  Cross-terminal coordination heartbeat

The AI sends a tool call. The hook intercepts it, runs 33+ gates, and returns a verdict. The AI receives only the result — it has no idea 4,503 lines of Python just evaluated its request.

Three-Layer Hook Pipeline: PreToolUse → 33+ Gates → ALLOW / BLOCK → Tool Executes → PostToolUse → Auto-learn
4-level risk classification

Every tool call is classified into one of four risk levels before the gate pipeline runs. The risk level determines which gates are evaluated — cheap gates always run, expensive gates are skipped for low-risk operations.

NONE — Read-only operations (Read, Glob, Grep, WebSearch)
  skip_behavioral = true → skip expensive gates

LOW — Small edits, single-file changes, safe Bash
  skip_behavioral = true → skip expensive gates

MEDIUM — Multi-file edits, config changes, agent spawning
  All gates run (confidence, research, saturation)

HIGH — Destructive commands, deployment, git push
  All gates run + additional confirmation checks
// Gates that ALWAYS run (regardless of risk):
CHECK 1 (Analyzer), CHECK 1.5 (Memory), CHECK 2.5 (Hedging), CHECK 2.5b (Bug Resolution)

// Gates that skip on NONE/LOW risk:
CHECK 2 (Confidence), CHECK 3 (Saturation), CHECK 4 (Research),
CHECK 5 (Persistence), CHECK 6 (Final Approval)
Design Innovation

Risk routing prevents gate fatigue. A simple Read call passes through in microseconds. A git push gets the full 33-gate treatment. The AI doesn't experience different behavior — it just notices that low-risk operations are faster.

9-phase workflow orchestrator

The system tracks the AI's progress through a 9-phase state machine. Each phase has required actions before the AI can advance. The orchestrator prevents skipping steps — no jumping straight to implementation without research.

1. IDLE — Waiting for user input
2. EXPLORING — Reading files, understanding codebase
3. RESEARCHING — WebSearch, external docs, prior art
4. PLANNING — Writing implementation plan
5. CHECKPOINT_1 — Plan review / user approval gate
6. CHECKPOINT_2 — Pre-implementation verification
7. IMPLEMENTING — Writing code, editing files
8. TESTING — Running tests, verifying changes
9. VERIFIED — All tests pass, session can end
Why This Matters

Without phase tracking, the AI will skip directly from "user asked a question" to "writing code." The orchestrator enforces the research → plan → approve → build → test loop that professional engineers follow. Phase transitions are tracked in state files, surviving context compaction.

invisible state markers

The system persists state through filesystem marker files the AI cannot see. Every gate decision is recorded. Every approval is saved. The AI only sees "BLOCK" or "CONTINUE" — never the mechanism.

// Marker files (invisible to AI):
plan-approved.json → Plan was approved by user
needs_replan.marker → Scope change detected, must re-plan
analysis-state-{session}.json → Project understanding cached
confidence-cleared-{session}.txt → Thinking partner passed
pending-memory-ack-{session}.txt → Memory gate one-shot trigger
thinking-state-{session}.json → Reflexion loop state
MAINTENANCE_MODE → Emergency bypass — all gates skip

When the user approves a plan through ExitPlanMode, a hook silently writes plan-approved.json. When the AI later tries to edit code, another hook reads that file to verify approval. The AI never knows these files exist.

bash write detection

The AI can bypass the Edit tool by writing files through Bash — echo > file, sed -i, python3 -c "open(...).write(...)". The system detects 30+ evasion patterns and forces proper tool use.

// Detected patterns:
echo "..." > file / echo "..." >> file
cat << EOF > file / tee file
sed -i 's/.../.../' file
python3 -c "open('f','w').write(...)"
node -e "fs.writeFileSync(...)"

IF bash_write_detected:
  → BLOCK: "Use Edit/Write tools — Bash writes bypass tracking"

Without this, every gate in the pipeline could be bypassed. The AI could write any file through Bash without triggering dirty state, backups, or simplification guards. This gate closes that escape hatch.
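
A crude sketch of the detection; the real hook uses 30+ patterns, this just shows the shape:

BASH_WRITE_MARKERS = (                      # illustrative subset of the 30+ patterns
    " > ", " >> ", "tee ", "sed -i",
    "cat <<", ".write(", "writeFileSync(",
)

def is_bash_write(command):
    # Cheap substring scan; the real detector uses proper parsing and regexes.
    return any(marker in command for marker in BASH_WRITE_MARKERS)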

coordination

Cross-terminal awareness, blast radius analysis, and content-addressed versioning
cross-terminal coordination

Multiple Claude sessions can run simultaneously on the same codebase. A PostgreSQL database coordinates them.

// Tables:
sessions → Active session registry + heartbeats
file_claims → File-level locking + conflict detection

// Flow:
Terminal 1: claims app/auth.py
Terminal 2: tries to edit app/auth.py
  → WARNING: "Terminal 1 is editing this file"

No merge conflicts. No lost work. Two AI sessions can work on different parts of your project at the same time without stepping on each other.

blast radius analysis

Before any edit, the cascade detector builds a full dependency graph to calculate blast radius. Import/require parsing traces every file that depends on your change.

// You edit auth/login.py:
Direct: auth/login.py
Cascade level 1: api/routes.py, middleware/auth.py
Cascade level 2: tests/test_api.py, app/main.py

BACKUP: All 5 files (not just the one you edited)
TEST: auth + api + middleware // the full blast radius

Always-backup patterns: *.config, *.json, *.env, *.yaml

One-line auth change? The system knows it affects 5 files, backs up all 5, and tests all 5. No surprises.
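
A simplified, Python-only sketch of building the reverse-import graph and walking the blast radius (module resolution here is deliberately naive):

import re
from collections import deque
from pathlib import Path

def reverse_import_graph(root):
    # Map module name -> files that import it.
    graph = {}
    for source in Path(root).rglob("*.py"):
        for match in re.finditer(r"^\s*(?:from|import)\s+([\w.]+)",
                                 source.read_text(errors="ignore"), re.MULTILINE):
            module = match.group(1).split(".")[-1]      # auth.login -> login
            graph.setdefault(module, set()).add(source)
    return graph

def blast_radius(edited, graph):
    # Breadth-first walk over reverse imports: everything that depends on the edit.
    affected, queue = {edited}, deque([edited])
    while queue:
        current = queue.popleft()
        for dependent in graph.get(current.stem, set()):
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected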

content-addressed versioning

Every file edit creates a git commit — before AND after. Binary files use SHA-256 content-addressed blob storage with automatic deduplication.

// On every Edit/Write:
1. git commit — pre-edit snapshot
2. Apply edit
3. git commit — post-edit snapshot
4. Tag with test status (pending/pass/fail)

// Binary files (images, PDFs):
SHA-256 hash → store in blob/a3/f2b1...
Identical content = same hash = stored once

git bisect → find exactly which edit broke it

AST-aware diffs for code files — not line-by-line, but structural. "Function foo changed parameters" vs "line 47 changed."

design philosophy

The principles that make engineers stop and think
fail-closed, not fail-open
try:
    check_condition()
except Exception:
    return BLOCK  # Better to incorrectly block than allow data loss

Most systems fail-open: if the safety check crashes, let the action through. Every gate in this system does the opposite. A broken gate blocks, not permits. Because a false block costs you 30 seconds. A false permit can cost you your codebase.

the ai cannot see the gates
What the AI sees

A block message appears: "Call EnterPlanMode before editing code files."

The AI has no knowledge of the pipeline, its structure, or its logic.

What actually happens

33+ gates evaluate in sequence. Decision trees branch. Markers are checked. Confidence is scored. The AI receives only the final verdict.

The hook layer is invisible. The AI can't reason about what it can't perceive — and therefore can't circumvent it.

self-consuming markers

The memory acknowledgment gate writes a marker file. When the gate fires, it reads the file's contents... then deletes the file. The next attempt passes because the marker no longer exists.

marker = read("pending-memory-ack-{session}.txt")
show_learnings(marker.count, marker.summary)
delete("pending-memory-ack-{session}.txt")  # one-shot

One-shot enforcement that destroys its own evidence. Acknowledged once, never blocks again.

the ai blocks itself on uncertainty

The hedging scanner searches the AI's own output for uncertainty language:

patterns = [
  "I'm not sure",
  "I don't know if",
  "I don't know whether",
  "[10-79]% confident",
]

If the AI hedges, it blocks itself. The system forces a WebSearch before proceeding. The AI has no idea this scan is happening.

subprocess plan review
AI1 writes plan → ExitPlanMode → spawn AI2 → AI2 reviews plan → approve/reject

A separate AI instance reviews the first AI's plan for unverified claims. The planner literally cannot grade its own homework.

infrastructure, not prompts
System Prompt

"Please be careful when editing files and make sure to back them up first."

Hook Infrastructure
if not plan_approved:
  return {"decision": "deny"}

Prompts are suggestions. The AI can ignore them, forget them, or rationalize around them. Hooks are law — mechanically enforced at the infrastructure level.

research is mandatory, not optional

Web research scores 0 points if not performed. Not 15 (too generous). Not 10 (still lets you squeak by). Zero. Because your training data cutoff was months ago, and the library you're about to use might have a breaking change.

# Confidence scoring
web_research_score = 0 if not did_web_search else 20
state survives context death

Claude's context window is finite (~200K tokens). When it fills up, old messages are compressed into a summary. Most AI tools lose everything at this point. Mrs. Kitty doesn't.

Without Controller

Context compacts. The AI forgets the plan, the approval, the research it did, and the bugs it already fixed. You start explaining everything again from scratch.

With Mrs. Kitty

State markers persist on the filesystem. plan-approved.json still exists. Memory is re-injected from 826KB of learnings. The session handoff document captures what was in progress. The AI picks up where it left off.

# Compaction event:
Session: abc123 → markers on disk
[context compresses]
Session: abc123 # session_id preserved!
  → markers still valid, gates pass correctly

# Belt-and-suspenders TTL fallback:
  if marker.age < TTL: accept regardless
  # Memory: 10min TTL, Bug resolution: 1hr TTL

Session_id is stable across compaction — markers survive naturally. TTL-based fallbacks provide defense-in-depth: even if a marker lookup fails for any reason, fresh markers are accepted. Better to have two safety nets than one.

by the numbers

33+ Safety Gates
4,503+ Lines of Hooks
9 Specialist Agents
4 Backup Strategies
6 Test Parsers (plus generic and LLM fallbacks)
4 Verification Tiers (plus security scan)