system_architecture.md
Every edit passes through 33+ sequential gates. Not suggestions. Not lint warnings.
Infrastructure-level blocks the AI cannot see, cannot reason about, and cannot bypass.
the pipeline
gate deep-dives
Sometimes you need the AI to just do what you say — no checkpoints, no gates, no questions. Maintenance mode bypasses every single gate in the pipeline. One word activates it.
USER: "maintenance"
→ MAINTENANCE_MODE file created
→ ALL 33+ gates bypassed instantly
→ Backups still run (safety net)
// Deactivation — just type:
USER: "done" or "exit maintenance"
→ MAINTENANCE_MODE file removed
→ All gates re-engaged
Batch edits across 10+ files would trigger gate checks on every single edit — making it impossibly slow. Maintenance mode lets you do bulk work in one burst, then re-engage all safety checks when you're done. The key insight: backups still run even in maintenance mode. You skip the verification gates, but your files are always safe to roll back.
Before any code file (.py, .js, .ts, .cpp, etc.) can be edited, the AI must have entered plan mode and received explicit approval.
FOR path IN [
1. plan-approved.json exists?
2. pending-implementation.json claimable?
3. implementation-state phase=implementing?
4. CWD-based implementation state?
5. File mentioned in approved plan text?
]:
IF any path TRUE → CONTINUE
ALL FALSE → BLOCK: "Call EnterPlanMode"
Five independent approval paths mean the system is resilient to state file corruption: any single proof of planning suffices. Deleting one marker file never opens the gate, and losing one never closes it while another proof remains.
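The any-of logic above can be sketched in a few lines. This is a minimal illustration, not the actual hook code: `plan_gate` and the check callables are hypothetical names, and treating a crashing check as "no proof" is an assumption.

```python
from typing import Callable, Iterable

Check = Callable[[str], bool]  # hypothetical: one callable per approval path


def plan_gate(path: str, checks: Iterable[Check]) -> str:
    """Return CONTINUE if any approval path passes, else a BLOCK message."""
    for check in checks:
        try:
            if check(path):
                return "CONTINUE"
        except Exception:
            # Assumption: a corrupt state file counts as "no proof", never as approval.
            continue
    return 'BLOCK: "Call EnterPlanMode"'
```

A corrupt marker only removes one path; the remaining checks still get their turn.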
Plan markdown files (in /plans/) can only be edited while plan mode is active. This prevents the AI from silently modifying approved plans after the fact.
IF plan mode NOT active (no plan-entered marker):
→ BLOCK: "Plan files can only be edited in plan mode"
ELSE → CONTINUE
Plan files ARE the artifact the user approves. If the AI could edit them outside plan mode, it could retroactively change an approved plan to match what it actually built — defeating the purpose of plan approval entirely.
Forces the AI to read and understand the entire project structure before touching anything. Always runs for code files — only maintenance mode bypasses this gate.
SCAN transcript for "I understand:" patterns
IF found → CONTINUE
ELSE → BLOCK: "Explore codebase first"
Searches 826KB+ of accumulated learnings. If this problem was solved before, surfaces the solution automatically using TF-IDF scored retrieval.
IF marker file exists:
Read {count, summary} from marker file
DELETE marker file // self-consuming!
→ BLOCK once: "Review {count} past learnings"
ELSE → CONTINUE
Self-consuming marker: the gate reads the file, then deletes it. Next attempt passes because the marker no longer exists. One-shot enforcement that destroys its own evidence after acknowledgment.
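A minimal sketch of the self-consuming marker, assuming the marker is a small JSON file with `count` and `summary` keys (the real on-disk format is not shown here, and `memory_gate` is a hypothetical name):

```python
import json
from pathlib import Path
from typing import Optional


def memory_gate(marker: Path) -> Optional[str]:
    """Return a block message once; return None (CONTINUE) on later calls."""
    if not marker.exists():
        return None  # nothing pending
    data = json.loads(marker.read_text())
    marker.unlink()  # self-consuming: the next call finds no marker
    return f"Review {data['count']} past learnings: {data['summary']}"
```

The first call blocks and deletes the file; the second call passes because there is nothing left to read.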
When the user provides multiple tasks (via TodoWrite or numbered lists), this gate blocks implementation until the AI explicitly acknowledges the full task list. Prevents partial execution where the AI fixates on one task and forgets the rest.
IF marker file exists:
Read {task_count, task_summary} from marker file
DELETE marker file // self-consuming!
→ BLOCK once: "Acknowledge {task_count} tasks before proceeding"
ELSE → CONTINUE
Same self-consuming marker pattern as Gate 1.5. The marker is written by the PostToolUse hook when it detects multi-task creation, then this gate reads and destroys it on acknowledgment. One-shot enforcement — zero false positives on subsequent tool calls.
Implements the Reflexion architecture (Shinn et al., 2023). The AI self-reflects on assumptions, asks informed questions, and scores confidence on a weighted 100-point scale.
IF reflexion_triggered (2+ failures):
→ BLOCK: "Reflect before retrying"
IF confidence < 97%:
→ BLOCK: "WebSearch for gaps: {gaps}"
ELSE → CONTINUE
Based on the Reflexion paper (Shinn et al., 2023). When the AI fails twice, a feedback loop forces it to explicitly self-reflect before retrying — preventing blind repetition.
After any failure, the AI cannot retry the same approach. It must do new research first.
→ BLOCK: "WebSearch before retrying"
Blocks Edit/Write operations on configuration files unless the AI has Read the file first in the current session. Prevents blind overwrites of critical configs like .env, docker-compose.yml, tsconfig.json, and hook Python files.
IF path NOT in session_read_files:
→ BLOCK: "Read {file} before modifying"
ELSE → CONTINUE
AI agents frequently overwrite config files based on assumptions rather than reading the current state. This gate forces "look before you leap" behavior specifically for files where a wrong edit can break the entire environment.
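The read-before-edit check could look roughly like this. The file-name set is an assumption built from the examples named above, and `config_gate` is a hypothetical name:

```python
from pathlib import PurePosixPath

# Assumed list, taken from the examples in the text; the real list is likely longer.
CONFIG_NAMES = {".env", "docker-compose.yml", "tsconfig.json"}


def config_gate(path: str, session_read_files: set[str]) -> str:
    """Block edits to known config files that were not read this session."""
    name = PurePosixPath(path).name
    if name in CONFIG_NAMES and path not in session_read_files:
        return f'BLOCK: "Read {path} before modifying"'
    return "CONTINUE"
```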
Scans the AI's own output for uncertainty language. If the AI isn't sure, it can't proceed.
Pattern: "I'm not sure"
Pattern: "I don't know if"
Pattern: "[10-79]% confident"
IF hedge_found:
→ BLOCK: "WebSearch to verify uncertain claim"
The AI literally blocks itself. It can't hedge and proceed — the system reads the AI's own text and forces research when it detects uncertainty. The AI has no idea this is happening.
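As a sketch, the scan is a handful of regular expressions run over the AI's output. The pattern list below covers only the three examples shown, and reading "[10-79]%" as "any two-digit value from 10 to 79" is an interpretation, not a confirmed spec:

```python
import re
from typing import Optional

HEDGE_PATTERNS = [
    re.compile(r"\bI'm not sure\b", re.IGNORECASE),
    re.compile(r"\bI don't know if\b", re.IGNORECASE),
    # Two-digit percentages from 10 to 79; the lookbehind rejects "165%".
    re.compile(r"(?<!\d)[1-7]\d% confident", re.IGNORECASE),
]


def find_hedge(text: str) -> Optional[str]:
    """Return the first hedging phrase found in the text, or None."""
    for pattern in HEDGE_PATTERNS:
        match = pattern.search(text)
        if match:
            return match.group(0)
    return None
```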
Tracks active bug investigations. If the AI is debugging a problem, this gate prevents it from abandoning the investigation to start new work.
IF active_bug_resolution AND attempting unrelated edit:
→ BLOCK: "Finish debugging before starting new work"
IF bug marked resolved AND no verification test:
→ BLOCK: "Prove the fix works — run tests"
AI agents love to context-switch away from hard bugs. They'll abandon a half-debugged problem and start on something easier. This gate forces them to finish what they started — or explicitly escalate to you.
Tracks research actions during the session. The AI can't start coding until it's done enough homework — proportional to task complexity.
TRIVIAL → 3 research points required
LOW → 3 points
MEDIUM → 3 points (min 1 WebSearch required)
HIGH → 6 points
IF action_count < threshold:
→ BLOCK: "Need {N} more WebSearch calls"
The main research enforcement gate. Uses a complexity classifier to determine the current task's difficulty, then requires proportional research effort before allowing implementation. Caches classifications per-session to avoid redundant LLM calls.
complexity = classify(prompt) // TRIVIAL, LOW, MEDIUM, HIGH
IF risk == NONE/LOW → SKIP // skip_behavioral
IF plan_approved → SKIP
// Step 2: Check research requirements
IF tool == Write/Edit AND research_count < required:
→ BLOCK: "Do {required} research actions first"
IF tool == Write/Edit AND no WebSearch/WebFetch in session:
→ BLOCK: "At least one WebSearch required"
The complexity classifier runs once and is cached for the entire session, avoiding repeated LLM inference. The cache key includes the user's original prompt hash, so follow-up messages within the same task don't reclassify. Research actions include WebSearch, WebFetch, Read (external docs), and Grep across multiple files.
Forces the AI to share its research findings with you before it starts coding. Nothing learned gets lost when context cycles.
AND research done BUT not shared with user:
→ BLOCK: "Share research findings (2-3 sentences)"
The full plan is presented in plain English. You read it, you approve it, or nothing happens. No surprises.
→ BLOCK: "Call ExitPlanMode for approval"
Creates a full session-level backup of all source files before any changes begin. Complete rollback point.
Create full backup to .ai-project/snapshots/{session_id}/
→ CONTINUE
Backs up each individual file RIGHT BEFORE it's edited. Strategy selected automatically by file type.
.py, .js, .css → GIT_SURGICAL (per-project git repo)
.png, .pdf → BLOB_ONLY (SHA256 content-addressed, dedup)
.log, node_modules → SKIP (not worth backing up)
User-excluded → IGNORE
Binary files use content-addressed SHA256 hashing with automatic deduplication. Same image stored once, referenced everywhere. Zero context window overhead — fully programmatic, no conversation tokens wasted.
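Content-addressed storage is simple to sketch with the standard library. The two-character directory fan-out and the `store_blob` name are assumptions for illustration:

```python
import hashlib
from pathlib import Path


def store_blob(data: bytes, blob_root: Path) -> Path:
    """Store bytes under their SHA-256 digest; identical content is stored once."""
    digest = hashlib.sha256(data).hexdigest()
    target = blob_root / digest[:2] / digest[2:]  # assumed layout: blob/ab/cdef...
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)
    return target
```

Because the path is derived from the content, backing up the same image twice returns the same path and writes nothing new.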
Blocks removal of 20%+ of code without explicit justification. Prevents the AI from "simplifying" your work into oblivion.
→ BLOCK: "Show test plan proving removal is safe"
Blocks deployment commands (docker compose up, git push, etc.) until all tests pass.
→ BLOCK: "{N} untested edits — run tests first"
Disabled since February 2026. The PASS Lock was designed as an Azure DevOps–style exclusive lock — once tests pass, a SHA-256 hash locks the project state and blocks further edits without re-testing. In practice, it created too much friction for iterative development workflows and was disabled while evaluating a lighter-weight alternative.
ON tests_pass:
state_hash = SHA256(all_edited_files)
write PASS-{task_id}.json with hash
LOCK: no edits allowed without re-test
// Currently: Gate 10 (Test-Before-Deploy) handles deployment gating
// Gate 13 (Stop Hook) handles session-end test enforcement
Prevents brute-force retry loops. AI agents love to retry failed commands 10 times instead of investigating. This gate makes that impossible.
IF last_bash_failed AND next_tool == Bash:
HARD BLOCK: "Investigate first — use Read, Grep, or WebSearch"
Layer 2 — Accumulating:
IF 4+ consecutive Bash-only:
SOFT NUDGE: "You're looping — try a different approach"
IF 6+ consecutive + 2 failures:
HARD BLOCK: "Stuck in retry loop"
Reset: WebSearch or WebFetch fully clears state
// Researching = evidence of investigation, not brute force
Without this, the AI burns through your compute budget retrying the same broken command. Error fingerprinting detects failures even when exit code is 0 (stderr patterns: "error:", "fatal:", "connection refused").
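The two layers can be modeled as a small state machine. This is a sketch under stated assumptions: `RetryGuard` is a hypothetical name, Read and Grep are assumed to clear only the "last command failed" flag, and the stderr patterns are the three quoted above.

```python
import re
from dataclasses import dataclass

ERROR_RE = re.compile(r"error:|fatal:|connection refused", re.IGNORECASE)


def bash_failed(exit_code: int, stderr: str) -> bool:
    """A run fails on a non-zero exit code or a known stderr pattern."""
    return exit_code != 0 or bool(ERROR_RE.search(stderr))


@dataclass
class RetryGuard:
    last_failed: bool = False
    consecutive_bash: int = 0
    failures: int = 0

    def before_tool(self, tool: str) -> str:
        if tool in ("WebSearch", "WebFetch"):
            # Research fully clears the state.
            self.last_failed, self.consecutive_bash, self.failures = False, 0, 0
            return "ALLOW"
        if tool != "Bash":
            self.last_failed = False  # assumed: Read/Grep count as investigation
            return "ALLOW"
        if self.last_failed:
            return "HARD BLOCK: investigate first"
        if self.consecutive_bash >= 6 and self.failures >= 2:
            return "HARD BLOCK: stuck in retry loop"
        if self.consecutive_bash >= 4:
            return "SOFT NUDGE: try a different approach"
        return "ALLOW"

    def after_bash(self, exit_code: int, stderr: str = "") -> None:
        self.consecutive_bash += 1
        if bash_failed(exit_code, stderr):
            self.last_failed = True
            self.failures += 1
```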
The final gate. Fires when the AI tries to end a session. It checks TestingGate.can_complete() mechanically — not by reading the AI's text. The AI cannot claim "done" with words.
ON session_end_attempt:
state = TestingGate.load_state()
IF state.dirty_state == "DIRTY":
IF state.session_own_edits > 0:
BLOCK: "Untested code changes exist"
// On EVERY stop, write continuity handoff:
extract: active_task, key_decisions, files_modified
write: handoff-{terminal_id}.md
Next session picks up exactly where this one left off
Other AI tools let the agent say "all done!" while tests are failing. Mrs. Kitty checks the actual state files on disk. The AI's opinion about whether it's done is irrelevant. The filesystem is the source of truth.
testing architecture
Every edit triggers dirty state tracking. The AI physically cannot claim "done" without test proof.
dirty_state = "DIRTY"
Track: file path, timestamp, edit type
ON session end (stop.py):
IF dirty_state == "DIRTY":
BLOCK: "{N} files edited without test verification"
Force test execution before session can end
ON deployment command detected:
IF dirty_state == "DIRTY":
BLOCK: "Run tests before deploying"
The AI doesn't decide when testing is done — the infrastructure does. stop.py blocks session termination until every edited file has a corresponding test receipt. The AI can't rationalize its way past a filesystem check.
Test output from any runner is parsed by specialized pattern matchers. When patterns fail, the raw output is sent to the LLM for semantic classification — the system understands test results, not just regex-matches them.
Python standard. Handles truncated output, fixture errors, and collection failures.
JavaScript/TypeScript. Supports JSON reporter mode and watch-mode output.
Visual test results from agent-browser verification runs. Parses pass/fail counts.
Rust test harness. Handles compilation errors vs runtime failures.
Go native testing. Distinguishes ok from FAIL package lines.
Vite-native runner. Jest-compatible patterns plus native format.
TRY structured parser (regex per runner) → deterministic result
TRY generic fallback (line-by-line PASS/FAIL) → best-effort
FALLBACK LLM semantic analysis → contextual understanding
// Custom patterns (user-defined):
test_patterns.json → project-specific regex overrides
Most CI systems parse one runner. This system parses any runner in any language — and when all parsers fail, an LLM reads the raw output and classifies it semantically. It literally understands what "2 scenarios (1 failed, 1 passed)" means even for frameworks it's never seen.
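The three-step fallback can be sketched as a chain of parsers that return `None` when they cannot decide. The parsers here are heavily simplified, and the `llm` argument is a hypothetical callable standing in for the semantic-analysis step:

```python
import re
from typing import Callable, Optional


def parse_pytest(output: str) -> Optional[dict]:
    """Structured parser for a pytest-style summary line (simplified)."""
    passed = re.search(r"(\d+) passed", output)
    failed = re.search(r"(\d+) failed", output)
    if not (passed or failed):
        return None
    return {
        "passed": int(passed.group(1)) if passed else 0,
        "failed": int(failed.group(1)) if failed else 0,
        "source": "structured",
    }


def parse_generic(output: str) -> Optional[dict]:
    """Best-effort fallback: count lines that contain PASS or FAIL."""
    lines = output.splitlines()
    passed = sum(1 for line in lines if re.search(r"\bPASS(ED)?\b", line))
    failed = sum(1 for line in lines if re.search(r"\bFAIL(ED)?\b", line))
    if passed + failed == 0:
        return None
    return {"passed": passed, "failed": failed, "source": "generic"}


def classify_output(output: str, llm: Callable[[str], dict]) -> dict:
    """Try deterministic parsers first; hand the raw output to the LLM last."""
    for parser in (parse_pytest, parse_generic):
        result = parser(output)
        if result is not None:
            return result
    return llm(output)
```

Ordering matters: the deterministic parsers are cheap and repeatable, so the LLM only sees output nothing else could read.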
The AI cannot forge a passing test result. Only the PostToolUse hook, which observes actual Bash output, can write a receipt.
"command": "pytest tests/ -v"
"command_hash": "sha256:a3f8c1..."
"output_hash": "sha256:7d2e9b..."
"result": "PASS"
"passed": 12, "failed": 0
"timestamp": "2026-02-24T14:32:01Z"
"ttl_minutes": 30
"version": 3
1. attested_by must be "posttooluse_hook" // AI can't write this
2. command_hash = SHA256(actual command) // proves what ran
3. output_hash = SHA256(actual output) // proves what returned
4. ttl_minutes: 30 // stale receipts rejected
5. version: 3 // old format receipts rejected
SHA256 hashes of both the command and its output create a cryptographic proof chain. If the AI could somehow write a receipt file (it can't), the hashes wouldn't match the actual test run. It's not just access control — it's tamper evidence.
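The five checks can be sketched as a single validation function. The field names follow the receipt above; the full-length `sha256:` tag format and the `receipt_is_valid` name are assumptions, since the example hashes are truncated:

```python
import hashlib
from datetime import datetime, timedelta


def sha256_tag(text: str) -> str:
    """Hash a string and prefix it the way the receipt example does (assumed format)."""
    return "sha256:" + hashlib.sha256(text.encode()).hexdigest()


def receipt_is_valid(receipt: dict, command: str, output: str, now: datetime) -> bool:
    """Apply the five receipt checks against the command and output actually observed."""
    issued = datetime.fromisoformat(receipt["timestamp"].replace("Z", "+00:00"))
    return (
        receipt.get("attested_by") == "posttooluse_hook"
        and receipt.get("command_hash") == sha256_tag(command)
        and receipt.get("output_hash") == sha256_tag(output)
        and now - issued <= timedelta(minutes=receipt.get("ttl_minutes", 0))
        and receipt.get("version") == 3
    )
```

A receipt with the right attestation but the wrong output hash still fails, which is the tamper-evidence property described above.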
Close the terminal with untested edits? The debt follows you. Next session starts with a testing obligation you can't dismiss.
Export debt file: testing-debt-{project}.json
Contains: file list, edit timestamps, project type
ON next session start:
IF debt file exists for this project:
Inject: "You have {N} files with untested changes"
Gate blocks implementation until debt cleared
// Safety mechanisms:
Terminal-isolated: Terminal 2 can't clear Terminal 1's debt
4-hour TTL: prevents zombie debt from abandoned sessions
Project-type filtering: web debt doesn't block Python work
Not every file needs the same tests. The planner classifies edits into test types and selects the right verification strategy — browser-first for web, VM for installers, unit tests for libraries.
VISUAL_BROWSER → Web edits tested in real browser (agent-browser CLI)
INSTALLER_VM → Installer tested in fresh VM
DATABASE_FLOW → Migration tested with real DB
API_ENDPOINT → Endpoint tested with curl/httpx
UNIT_TEST → Standard test runner (pytest, jest, etc.)
SKIP → Infrastructure files (CI configs, dockerfiles, docs)
// Dependency maps:
FILE components/Login.tsx → FEATURES [auth, session] → PAGES [/login, /signup]
FILE styles/theme.css → FEATURES [layout] → PAGES [all]
// Auto-detect routes:
Next.js app router, Flask routes, static HTML → auto-discovered
Most test systems run pytest and call it done. This system knows that editing a .jsx file means you need a browser test, not a unit test. It traces the dependency from file → feature → affected pages, and tests what the user would actually see.
autonomous browser testing
A complete Playwright replacement that uses your real Chrome. Completely undetectable — no navigator.webdriver flag. 140+ commands for navigation, interaction, state management, network control, and more. Open source.
agent-browser open "https://amazon.com"
agent-browser snapshot # reads page structure — 200 tokens vs 13,700 for Playwright
agent-browser click "@e3" # stable element refs, not fragile CSS selectors
agent-browser fill "@e7" "search query"
agent-browser screenshot "evidence.png"
agent-browser state_save "logged-in" # save session state, restore anytime
agent-browser errors # console errors + network failures
~200 tokens/page vs 13,700 for Playwright MCP → 98.5% reduction
10-step workflow: 7,000 tokens vs 114,000 with Playwright → 5.7x more test cycles
Your real Chrome profile → persistent cookies, real fingerprint, saved logins
Direct WebSocket CDP → no Node.js intermediary, no framework overhead
Open source on GitHub (Apache-2.0)
Reads page structure instead of taking screenshots — 93% less context. Structure-first verification catches broken elements, missing content, and layout issues without burning tokens on pixel analysis.
Structure-first verification uses ~200 tokens per page compared to ~13,700 for screenshot-based tools. A 10-page verification costs 2,000 tokens instead of 137,000. The AI reads the page like a screen reader — fast, accurate, and token-efficient.
Not just verification — full browser automation. Navigate complex UIs, fill forms, interact with dynamic content, message suppliers, find products, manage workflows. The AI operates your browser like you would.
Tests don't stop at the local dev server. After git push or docker compose up, the system navigates to the live URL and verifies production.
git push, docker compose up, npm run deploy,
vercel --prod, fly deploy, scp ... server:, ...
ON deployment detected:
1. Extract live URL from output or project config
2. Wait for deployment propagation
3. agent-browser open {live_url}
4. Run structure-first verification against production
5. IF console errors OR visual regression:
→ ALERT: "Production issue detected post-deploy"
The AI doesn't just deploy and hope. It navigates to your live production URL, checks for console errors, takes a screenshot, and tells you if something broke. All automatically, seconds after the deploy command finishes. Structure-first snapshots mean verification costs ~200 tokens per page, not thousands.
Installer builds are tested in an isolated Windows VM with golden snapshot restore. Every test starts from a known-clean state — no contamination from previous runs.
1. Restore VM from golden snapshot (Docker volume)
2. Compile installer via InnoSetup (ISCC.exe)
3. Copy .exe to VM shared folder (\\host.lan\Data)
4. Execute installer silently inside VM
5. Verify installation via noVNC + agent-browser
6. Check: files exist, services running, config correct
7. Restore golden snapshot (clean for next run)
// Access:
noVNC: http://localhost:8006
Shared folder: tests/layer4-e2e/shared/ → \\host.lan\Data
Most installer testing is manual: build, run on your machine, check if it works, uninstall, repeat. This automates the entire cycle with a fresh VM every time. Golden snapshot restore guarantees the VM is identical across runs — no "it worked on my machine" because the machine is literally reset to factory state between tests.
confidence engine
Every tool call is scored on a weighted 100-point scale across five factors. Each factor contributes exactly 20 points. The system won't write a single line of code until the total score reaches 97 or higher.
| Factor | What It Measures | Points |
|---|---|---|
| Plan Quality | Is the implementation plan specific and actionable? | 20 |
| File Understanding | Has the AI read and understood all relevant files? | 20 |
| Dependency Analysis | Are imports, packages, and side effects mapped? | 20 |
| Web Research | Has external research been performed? | 20 |
| Context Coverage | Does the plan account for edge cases and tests? | 20 |
Web research gets 0 if not done. Not 15. Not 10. Zero. Your training data cutoff was months ago. The library you're about to use might have a breaking change.
0–50: Heavy research required — major knowledge gaps
50–80: More context needed — read more files, check deps
80–97: Targeted verification — one or two gaps remain
97+: Proceed to implementation
Based on the Reflexion architecture (Shinn et al., 2023). When the score falls below 97%, a feedback loop identifies the specific gap, forces research to fill it, then re-scores. This loop runs up to 3 times before hard-blocking.
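A minimal sketch of the score-and-retry loop, assuming each factor is an integer from 0 to 20. The factor keys mirror the table above; `research` is a hypothetical callable that re-scores one factor after filling its gap:

```python
from typing import Callable

FACTORS = (
    "plan_quality",
    "file_understanding",
    "dependency_analysis",
    "web_research",
    "context_coverage",
)
THRESHOLD = 97
MAX_LOOPS = 3


def total_score(scores: dict) -> int:
    """Sum the five factors, clamping each to the 0-20 range."""
    return sum(min(20, max(0, scores.get(f, 0))) for f in FACTORS)


def weakest_factor(scores: dict) -> str:
    """Pick the factor with the lowest score as the gap to research."""
    return min(FACTORS, key=lambda f: scores.get(f, 0))


def reflexion_loop(scores: dict, research: Callable[[str], int]) -> str:
    """Re-score up to MAX_LOOPS times, then hard-block if still below threshold."""
    scores = dict(scores)
    for _ in range(MAX_LOOPS):
        if total_score(scores) >= THRESHOLD:
            return "CONTINUE"
        gap = weakest_factor(scores)
        scores[gap] = research(gap)
    return "CONTINUE" if total_score(scores) >= THRESHOLD else "BLOCK"
```

With five 20-point factors, a threshold of 97 leaves room for at most three lost points in total, so a zero on web research can never pass.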
Semantic risk scoring. Not line count — meaning. The classifier analyzes the task description and flags high-risk patterns that require additional research and verification.
"fix typo in readme" → 0 signals = TRIVIAL
Unknown task → MEDIUM // fail-safe: unknown = needs research
19 high-risk patterns detected instantly:
Payment / billing code
Database migrations / schema changes
SSH / key management
File deletion / destructive operations
API key / credential handling
Deployment scripts
User data handling (PII)
Security configuration
Infrastructure changes
// + 9 more patterns...
Each detected pattern adds risk points. Multiple patterns compound. The total determines which gates are enforced and how much research is required.
The plan approval process has its own triple-verification. A plan doesn't just need to exist — it needs to prove it was built on solid research.
Has enough external research been done for the task complexity?
TRIVIAL = 0 points, LOW = 1, MEDIUM = 3 (min 1 WebSearch), HIGH = 5
Layer 2 — Confidence ≥ 97%:
The 5-factor confidence score must meet the threshold
Plan evidence boost: auto-detects URLs in plan text (+points for citations)
Layer 3 — Subprocess Review:
A SECOND Claude instance reviews the plan
Checks for: unverified claims, hallucinations, logical gaps
The planner literally cannot grade its own homework
A separate AI instance reviews the first AI's plan for unverified claims. The planner cannot review its own work — a second opinion is architecturally enforced, not optional.
When something breaks, 9 specialized agents investigate before bothering you.
memory system
826KB+ append-only JSONL archive. Every decision, fix, and pattern stored permanently. The system doesn't just remember — it remembers the right things at the right time.
8 Learning Types:
- ARCHITECTURAL_DECISION — "We chose JWT over sessions because..."
- WORKING_SOLUTION — "Fixed by adding CORS middleware"
- CODEBASE_PATTERN — "This project uses CSS Modules, not Tailwind"
- FAILED_APPROACH — "Don't use library X, it breaks with Y"
- error_resolution — Auto-captured error→fix pairs
- test_result — Test pass/fail context
- USER_PREFERENCE — "Always use bun, never npm"
- in_session — General session learnings
Scoring Formula:
Every learning is scored against the current task using a multi-factor relevance algorithm:
base = (exact_match + stem_match * 0.7 + substring_match * 0.5) / token_count
score = base × type_boost × tag_boost × e^(-0.03 × days_old)
// Boost factors:
type_boost = 1.5× for corrections and user preferences
tag_boost = 2.0× when learning tags overlap with current task keywords
recency_decay = exponential decay at 0.03/day // ~50% weight at 23 days
Stem matching catches morphological variants (e.g., "testing" matches "test"), while substring matching catches partial overlaps. The combined scoring avoids both false positives and false negatives better than pure TF-IDF.
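The formula translates directly into code. This sketch takes the match counts as inputs rather than computing them, and the zero-token guard is an added assumption:

```python
import math


def relevance(
    exact: int,
    stem: int,
    substring: int,
    token_count: int,
    days_old: float,
    type_boost: float = 1.0,
    tag_boost: float = 1.0,
) -> float:
    """Score one learning against the current task using the formula above."""
    if token_count == 0:
        return 0.0  # assumed guard: an empty query matches nothing
    base = (exact + 0.7 * stem + 0.5 * substring) / token_count
    return base * type_boost * tag_boost * math.exp(-0.03 * days_old)
```

The decay constant checks out: `math.exp(-0.03 * 23)` is about 0.50, matching the "~50% weight at 23 days" note.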
3-Tier Injection:
WARM (0.1–0.3): Preview shown, Claude can pull more if needed
COLD (<0.1): Not shown, but core MEMORY.md always injected
The HOT tier doesn't just show memories — it forces acknowledgment through a self-consuming gate. The AI must explicitly process the recalled learnings before proceeding. This prevents relevant past solutions from being ignored.
Memory capture happens at two levels — real-time during the session and comprehensively when it ends.
Real-Time (PostToolUse Hook):
The PostToolUse hook watches every tool result. When it detects an error followed by a successful resolution, it captures the pair automatically. No manual action needed — the system learns from every fix in real time.
Session-End Extraction:
When a session ends, the stop hook scans the full conversation transcript and extracts:
• Working solutions — "Fixed by adding CORS middleware"
• Failed approaches — "Don't use library X, it breaks with Y"
• User preferences — "Always use bun, never npm"
• Error→resolution pairs — Auto-captured during PostToolUse
Distillation:
The JSONL archive grows over time. Periodic consolidation into MEMORY.md keeps core knowledge lean — extracting the most important patterns and decisions into a concise document that's always injected at session start.
Claude has a ~200K token context window. When it fills up, the system compresses prior messages into a summary. Critically, compaction does NOT change the session_id. The earlier assumption that it did was a misdiagnosis, corrected in February 2026. The session_id persists across compaction events.
Session abc123 → markers: confidence-cleared-abc123.txt
[context compacts — old messages compressed]
Session abc123 // same session_id preserved!
→ Markers still valid — gates pass correctly
Belt-and-suspenders: TTL-based expiration provides additional resilience. Memory markers have 10-minute TTL. Bug resolution markers have 1-hour TTL. Even if a marker lookup fails for any reason, fresh markers are accepted regardless of session key.
The original design assumed compaction would create new session_ids, leading to orphaned markers. Investigation proved this wrong — session_id is stable. TTL-based fallbacks remain as defense-in-depth, not as a primary mitigation. We document what we learned, including our mistakes.
scope detection
3-signal monitoring. If the problem shifts mid-build, it halts and re-plans automatically. The system doesn't just detect failures — it detects when the nature of the failure changes.
Signal B (MED): Same file fails 3+ consecutive edit attempts — stuck in a loop. The AI is trying variations of the same broken approach.
Signal C (MED): 2+ unique error fingerprints since plan was approved — scope is drifting. The original plan no longer matches reality.
→ Trigger: set phase=needs_replan, write plan-required marker, force the AI back to planning mode
When tests fail, the system doesn't just see "test failed." It normalizes the stack trace into a fingerprint — stripping noise to detect whether this is a truly new error or the same one repeating.
File "app.py", line 42, in process: KeyError: 'user_id'
// Normalization strips:
• Line numbers (change with every edit)
• Timestamps (always different)
• Hex addresses (memory-dependent)
• PIDs (process-dependent)
• Variable values (instance-specific)
// Keeps:
• Error type (KeyError)
• Module name (app.py, process)
• Message core (the structural pattern)
// Fingerprint:
app.py:process:KeyError
Same fingerprint = same bug, keep trying the current approach. New fingerprint = scope changed, trigger re-plan. This is what makes the TDD loop intelligent.
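A sketch of the normalization for Python-style traces. The two regular expressions are assumptions that handle the example format above; other languages would need their own patterns:

```python
import re
from typing import Optional

FRAME_RE = re.compile(r'File "([^"]+)", line \d+, in (\w+)')
ERROR_RE = re.compile(r"\b(\w+(?:Error|Exception))\b")


def fingerprint(trace: str) -> Optional[str]:
    """Reduce a trace to file:function:ErrorType, dropping line numbers and values."""
    frames = FRAME_RE.findall(trace)
    errors = ERROR_RE.findall(trace)
    if not frames or not errors:
        return None
    file, func = frames[-1]  # innermost frame
    return f"{file}:{func}:{errors[-1]}"
```

Two traces that differ only in line number and key name collapse to the same fingerprint, so an edit that merely shifts lines is not mistaken for a new bug.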
The system learns from every failure. Error fingerprinting creates a taxonomy (syntax, runtime, test, deployment, integration). Fix success rates are tracked across sessions.
SyntaxError in auth.py → fingerprint: syntax:auth:import
Same fingerprint 3x → Pattern detected
→ Suggest fix from database (73% success rate)
// Semantic Fix Deduplication:
Intent-based hashing of fixes
If a fix is semantically equivalent to a previous failed fix → BLOCKED
Prevents burning money retrying the same broken approach with different syntax
After 3+ identical failures, the system stops retrying and suggests proven fixes from its history. No more burning $50 on the same error.
architecture
Every interaction passes through three interception points. All invisible to the AI. The hook system is the foundation everything else is built on.
33+ gates evaluate in sequence
Can: BLOCK, ALLOW, or inject context
PostToolUse — Fires AFTER every tool result
Scope change detection, test tracking
Auto-learning: captures error→resolution pairs
UserPromptSubmit — Fires on every user message
Memory injection, session registration
Cross-terminal coordination heartbeat
The AI sends a tool call. The hook intercepts it, runs 33+ gates, and returns a verdict. The AI receives only the result — it has no idea 4,503 lines of Python just evaluated its request.
Every tool call is classified into one of four risk levels before the gate pipeline runs. The risk level determines which gates are evaluated — cheap gates always run, expensive gates are skipped for low-risk operations.
skip_behavioral = true → skip expensive gates
LOW — Small edits, single-file changes, safe Bash
skip_behavioral = true → skip expensive gates
MEDIUM — Multi-file edits, config changes, agent spawning
All gates run (confidence, research, saturation)
HIGH — Destructive commands, deployment, git push
All gates run + additional confirmation checks
CHECK 1 (Analyzer), CHECK 1.5 (Memory), CHECK 2.5 (Hedging), CHECK 2.5b (Bug Resolution)
// Gates that skip on NONE/LOW risk:
CHECK 2 (Confidence), CHECK 3 (Saturation), CHECK 4 (Research),
CHECK 5 (Persistence), CHECK 6 (Final Approval)
Risk routing prevents gate fatigue. A simple Read call passes through in microseconds. A git push gets the full 33-gate treatment. The AI doesn't experience different behavior — it just notices that low-risk operations are faster.
The system tracks the AI's progress through a 9-phase state machine. Each phase has required actions before the AI can advance. The orchestrator prevents skipping steps — no jumping straight to implementation without research.
2. EXPLORING — Reading files, understanding codebase
3. RESEARCHING — WebSearch, external docs, prior art
4. PLANNING — Writing implementation plan
5. CHECKPOINT_1 — Plan review / user approval gate
6. CHECKPOINT_2 — Pre-implementation verification
7. IMPLEMENTING — Writing code, editing files
8. TESTING — Running tests, verifying changes
9. VERIFIED — All tests pass, session can end
Without phase tracking, the AI will skip directly from "user asked a question" to "writing code." The orchestrator enforces the research → plan → approve → build → test loop that professional engineers follow. Phase transitions are tracked in state files, surviving context compaction.
The system persists state through filesystem marker files the AI cannot see. Every gate decision is recorded. Every approval is saved. The AI only sees "BLOCK" or "CONTINUE" — never the mechanism.
plan-approved.json → Plan was approved by user
needs_replan.marker → Scope change detected, must re-plan
analysis-state-{session}.json → Project understanding cached
confidence-cleared-{session}.txt → Thinking partner passed
pending-memory-ack-{session}.txt → Memory gate one-shot trigger
thinking-state-{session}.json → Reflexion loop state
MAINTENANCE_MODE → Emergency bypass — all gates skip
When the AI calls EnterPlanMode, a hook silently writes plan-approved.json. When it tries to edit code, another hook reads that file to verify approval. The AI never knows these files exist.
The AI can bypass the Edit tool by writing files through Bash — echo > file, sed -i, python3 -c "open(...).write(...)". The system detects 30+ evasion patterns and forces proper tool use.
echo "..." > file / echo "..." >> file
cat << EOF > file / tee file
sed -i 's/.../.../' file
python3 -c "open('f','w').write(...)"
node -e "fs.writeFileSync(...)"
IF bash_write_detected:
→ BLOCK: "Use Edit/Write tools — Bash writes bypass tracking"
Without this, every gate in the pipeline could be bypassed. The AI could write any file through Bash without triggering dirty state, backups, or simplification guards. This gate closes that escape hatch.
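A rough sketch of the detection, covering only the five pattern families listed above. These regular expressions are illustrative assumptions, not the real 30+ pattern set, and they will misfire on edge cases such as a `>` inside a quoted string:

```python
import re

WRITE_PATTERNS = [
    # Output redirection (> or >>), ignoring fd duplication such as 2>&1.
    re.compile(r"(?<![0-9&])>>?\s*[\w./~$-]+"),
    re.compile(r"\btee\s+\S"),
    re.compile(r"\bsed\b.*\s-i\b"),
    re.compile(r"\bpython3?\s+-c\b.*\.write\("),
    re.compile(r"\bnode\s+-e\b.*writeFileSync"),
]


def detect_bash_write(command: str) -> bool:
    """Return True if the command looks like it writes a file through the shell."""
    return any(pattern.search(command) for pattern in WRITE_PATTERNS)
```

Since gates fail closed, a false positive here costs a retry with the Edit tool, while a false negative would skip every downstream gate.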
coordination
Multiple Claude sessions can run simultaneously on the same codebase. A PostgreSQL database coordinates them.
sessions → Active session registry + heartbeats
file_claims → File-level locking + conflict detection
// Flow:
Terminal 1: claims app/auth.py
Terminal 2: tries to edit app/auth.py
→ WARNING: "Terminal 1 is editing this file"
No merge conflicts. No lost work. Two AI sessions can work on different parts of your project at the same time without stepping on each other.
Before any edit, the cascade detector builds a full dependency graph to calculate blast radius. Import/require parsing traces every file that depends on your change.
Direct: auth/login.py
Cascade level 1: api/routes.py, middleware/auth.py
Cascade level 2: tests/test_api.py, app/main.py
BACKUP: All 5 files (not just the one you edited)
TEST: auth + api + middleware // the full blast radius
Always-backup patterns: *.config, *.json, *.env, *.yaml
One-line auth change? The system knows it affects 5 files, backs up all 5, and tests all 5. No surprises.
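Blast radius is a breadth-first walk over a reverse dependency graph. In this sketch the `dependents` mapping (file to the files that import it) is assumed to be built already by the import parser:

```python
from collections import deque


def blast_radius(changed: str, dependents: dict[str, list[str]]) -> dict[str, int]:
    """Map each affected file to its cascade level (0 is the edited file)."""
    levels = {changed: 0}
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for dep in dependents.get(current, []):
            if dep not in levels:  # also guards against import cycles
                levels[dep] = levels[current] + 1
                queue.append(dep)
    return levels
```

Every key in the result gets backed up, and the level tells you how far the change had to travel to reach it.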
Every file edit creates a git commit — before AND after. Binary files use SHA-256 content-addressed blob storage with automatic deduplication.
1. git commit — pre-edit snapshot
2. Apply edit
3. git commit — post-edit snapshot
4. Tag with test status (pending/pass/fail)
// Binary files (images, PDFs):
SHA-256 hash → store in blob/a3/f2b1...
Identical content = same hash = stored once
git bisect → find exactly which edit broke it
AST-aware diffs for code files — not line-by-line, but structural. "Function foo changed parameters" vs "line 47 changed."
design philosophy
try:
    check_condition()
except Exception:
    return BLOCK # Better to incorrectly block than allow data loss
Most systems fail-open: if the safety check crashes, let the action through. Every gate in this system does the opposite. A broken gate blocks, not permits. Because a false block costs you 30 seconds. A false permit can cost you your codebase.
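One way to apply fail-closed behavior uniformly is a decorator wrapped around every gate function. This is a sketch; `fail_closed` and `plan_exists_gate` are hypothetical names:

```python
import functools

BLOCK, ALLOW = "BLOCK", "ALLOW"


def fail_closed(gate):
    """Wrap a gate so that any crash inside it returns BLOCK."""
    @functools.wraps(gate)
    def wrapper(*args, **kwargs):
        try:
            return gate(*args, **kwargs)
        except Exception:
            return BLOCK  # a broken gate blocks rather than permits
    return wrapper


@fail_closed
def plan_exists_gate(state: dict) -> str:
    # Raises KeyError when the state is malformed; the wrapper turns that into BLOCK.
    return ALLOW if state["plan_approved"] else BLOCK
```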
A block message appears: "Call EnterPlanMode before editing code files."
The AI has no knowledge of the pipeline, its structure, or its logic.
33+ gates evaluate in sequence. Decision trees branch. Markers are checked. Confidence is scored. The AI receives only the final verdict.
The hook layer is invisible. The AI can't reason about what it can't perceive — and therefore can't circumvent it.
The memory acknowledgment gate writes a marker file. When the gate fires, it reads the file's contents... then deletes the file. The next attempt passes because the marker no longer exists.
show_learnings(marker.count, marker.summary)
delete("pending-memory-ack-{session}.txt") # one-shot
One-shot enforcement that destroys its own evidence. Acknowledged once, never blocks again.
The hedging scanner searches the AI's own output for uncertainty language:
"I'm not sure",
"I don't know if",
"I don't know whether",
"[10-79]% confident",
]
If the AI hedges, it blocks itself. The system forces a WebSearch before proceeding. The AI has no idea this scan is happening.
A separate AI instance reviews the first AI's plan for unverified claims. The planner literally cannot grade its own homework.
"Please be careful when editing files and make sure to back them up first."
return {"decision": "deny"}
Prompts are suggestions. The AI can ignore them, forget them, or rationalize around them. Hooks are law — mechanically enforced at the infrastructure level.
Web research scores 0 points if not performed. Not 15 (too generous). Not 10 (still lets you squeak by). Zero. Because your training data cutoff was months ago, and the library you're about to use might have a breaking change.
web_research_score = 0 if not did_web_search else 20
Claude's context window is finite (~200K tokens). When it fills up, old messages are compressed into a summary. Most AI tools lose everything at this point. Mrs. Kitty doesn't.
Context compacts. The AI forgets the plan, the approval, the research it did, and the bugs it already fixed. You start explaining everything again from scratch.
State markers persist on the filesystem. plan-approved.json still exists. Memory is re-injected from 826KB of learnings. The session handoff document captures what was in progress. The AI picks up where it left off.
Session: abc123 → markers on disk
[context compresses]
Session: abc123 # session_id preserved!
→ markers still valid, gates pass correctly
# Belt-and-suspenders TTL fallback:
if marker.age < TTL: accept regardless
# Memory: 10min TTL, Bug resolution: 1hr TTL
Session_id is stable across compaction — markers survive naturally. TTL-based fallbacks provide defense-in-depth: even if a marker lookup fails for any reason, fresh markers are accepted. Better to have two safety nets than one.
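The two safety nets can be sketched as one acceptance check. The TTL values come from the text; the function name, the marker kinds, and the choice that a session match is accepted without an age check are assumptions:

```python
import time
from typing import Optional

TTL_SECONDS = {"memory": 10 * 60, "bug_resolution": 60 * 60}


def marker_accepted(
    kind: str,
    marker_session: str,
    current_session: str,
    written_at: float,
    now: Optional[float] = None,
) -> bool:
    """Accept a marker on session match, or on freshness if the lookup missed."""
    now = time.time() if now is None else now
    if marker_session == current_session:
        return True  # primary path: session_id is stable across compaction
    return (now - written_at) < TTL_SECONDS[kind]  # fallback: fresh enough
```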