Multi-layer safety-gate playbook#
Origin: Originally written in Hungarian as part of MyForge Vault 11.11 — Superintelligent Vault project. Source: multi-layer-safety-gate (Hungarian version).
For high-risk features (RSI, auto-prompt-evolution, code self-modification, auto-promotion) you need 4 independent defense layers. No single layer is enough: each one catches failures that slip past the others ("defense in depth").
The 4 layers#
1. ENV-flag default-disabled#
The feature does NOT run by default. It only activates with an explicit <FEATURE>_MODE=enabled env-var.
```bash
# Default: every RSI script exits at the safety gate
vault-skill-suggest  # ⚠️ safety-gate exit
# Explicit enable:
RSI_MODE=enabled vault-skill-suggest --analyze-last 30
```
Why: Activation requires operator friction. Nobody accidentally runs with RSI_MODE=enabled — it must be intentional.
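The `<FEATURE>_MODE=enabled` convention can be checked with one small helper. This is a minimal sketch: `feature_enabled` is a hypothetical name, and the strict comparison (only the exact string `enabled` counts) is an assumption that follows from the default-disabled rule.

```python
import os
from typing import Mapping, Optional


def feature_enabled(feature: str, env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only if <FEATURE>_MODE is exactly "enabled".

    Hypothetical helper: any other value (unset, empty, "1", "true",
    "Enabled") is treated as disabled, so a typo fails closed.
    """
    env = os.environ if env is None else env
    return env.get(f"{feature.upper()}_MODE", "disabled") == "enabled"
```

Passing `env` explicitly keeps the check testable without touching the real process environment.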
2. Script-level safety gate (first line)#
Every script's first action is to check the flag and abort if the feature is disabled. This is NOT middleware or a later check; it is the first instruction:
```python
import os
import sys


def safety_gate():
    """Exit with status 2 unless RSI_MODE is explicitly set to "enabled"."""
    if os.environ.get("RSI_MODE", "disabled") != "enabled":
        print("⚠️ RSI safety-gate: disabled by default.", file=sys.stderr)
        print("   Enable with: RSI_MODE=enabled <script> ...", file=sys.stderr)
        print("   PRECONDITION: <list of stability checks>", file=sys.stderr)
        sys.exit(2)


def main():
    safety_gate()  # ← FIRST call of every script
    # ... actual logic
```
Why: Because the gate runs before any other logic, a disabled feature exits immediately; import errors or unexpected state further down can never cause the logic to run.
3. Git pre-commit hook — forbidden-target block#
The high-risk feature's output (code mutations, prompt rewrites, skill modifications) is only allowed on sandbox branches. On any other branch, including main, commits that touch a forbidden target are BLOCKED.
```bash
#!/usr/bin/env bash
# .git/hooks/pre-commit (symlinked from .vault-<feature>/safety/)
# Paths that high-risk output must never touch outside a sandbox branch.
FORBIDDEN_PATTERNS=("AGENTS.md" "00-Meta/" ".vault-<feature>/" "11.11")
BRANCH=$(git rev-parse --abbrev-ref HEAD)
if [[ "$BRANCH" =~ ^rsi-(sandbox|pareto)- ]]; then
  exit 0  # sandbox branch: OK
fi
STAGED=$(git diff --cached --name-only)
for pattern in "${FORBIDDEN_PATTERNS[@]}"; do
  # -F matches the pattern literally, so "." in "11.11" is not a regex wildcard.
  if echo "$STAGED" | grep -qF -- "$pattern"; then
    echo "🚫 BLOCKED: forbidden-target on non-sandbox branch" >&2
    exit 1
  fi
done
```
Emergency override only: SKIP_RSI_HOOK=1 git commit ... (explicit and logged).
Why: Even if someone bypasses the script-level gate, the git-level hook catches the side effect before it enters vault history.
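The hook's decision can also be expressed as a pure function, which makes it easy to unit-test outside git. A minimal sketch: the pattern list and branch regex are copied from the hook, while `blocked_paths` is a hypothetical name and matching patterns as literal substrings is an assumption about the intended semantics.

```python
import re
from typing import Iterable, List

# Values copied from the hook; "<feature>" is the playbook's placeholder.
FORBIDDEN_PATTERNS = ("AGENTS.md", "00-Meta/", ".vault-<feature>/", "11.11")
SANDBOX_BRANCH = re.compile(r"^rsi-(sandbox|pareto)-")


def blocked_paths(branch: str, staged: Iterable[str]) -> List[str]:
    """Return the staged paths the hook would reject (empty list = commit allowed)."""
    if SANDBOX_BRANCH.match(branch):
        return []  # sandbox branches may touch any path
    # Assumption: a path is forbidden if it contains a pattern as a literal substring.
    return [p for p in staged if any(pat in p for pat in FORBIDDEN_PATTERNS)]
```

Returning the offending paths, rather than a bare boolean, lets the caller report exactly which files caused the block.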
4. Critic-review-mandatory#
Every RSI/high-risk output passes through a Critic-agent review (Constitutional AI 2 pattern) BEFORE the MCP mutation call. If confidence is below 0.85, the change is not applied directly; it goes to the user as a batch preview.
```yaml
mcp_server:
  write_tools:
    - vault.add_skill
    - vault.update_wiki_section
    - vault.add_decision
    - vault.crystallize_learning
  critic_review_required: write_tools
```
Why: Even if the RSI mutation is well-formed, the Critic catches semantic-level issues (PII leak, ADR conflicts, regression tendency).
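The routing rule can be sketched as a single decision function. The tool names and the 0.85 threshold come from this playbook; `CriticVerdict`, `route_mutation`, the three action strings, and the reading of "confidence" as the Critic's own confidence are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

WRITE_TOOLS = frozenset({
    "vault.add_skill",
    "vault.update_wiki_section",
    "vault.add_decision",
    "vault.crystallize_learning",
})
CONFIDENCE_THRESHOLD = 0.85


@dataclass(frozen=True)
class CriticVerdict:
    approved: bool
    confidence: float  # assumed: the Critic's confidence in its verdict, 0.0 to 1.0


def route_mutation(tool: str, verdict: Optional[CriticVerdict]) -> str:
    """Return "apply", "batch_preview", or "reject" for a proposed MCP call."""
    if tool not in WRITE_TOOLS:
        return "apply"  # not a write tool, so no Critic review is required
    if verdict is None or not verdict.approved:
        return "reject"  # unreviewed or rejected mutations never reach the vault
    if verdict.confidence < CONFIDENCE_THRESHOLD:
        return "batch_preview"  # low confidence: the user decides
    return "apply"
```

Treating a missing verdict as a rejection keeps the gate fail-closed: a write tool called without review is never applied.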
When to use#
| Feature type | All 4 gates required? |
|---|---|
| RSI (code/prompt/skill auto-mutation) | ✅ ALL 4 |
| Auto-promotion (auto-reflections → wiki) | ✅ ALL 4 |
| Code self-modification (Gödel Agent) | ✅ ALL 4 — plus multi-pass Critic |
| Auto-skill-suggest without user confirmation | ✅ ALL 4 |
| New skill registration | ❌ 1+2 enough (gate + script check) |
| Auto-wiki-edit feature | ✅ 1+2+4 (Critic required) |
| Read-only auto-eval | ❌ None needed — no side-effects |
Rule of thumb: if the feature mutates vault state and would run without human review, ALL 4 layers are mandatory.
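The rule of thumb can be written as a small predicate. A minimal sketch with a hypothetical `required_layers` function: it encodes only the two cases the playbook states outright (read-only features need nothing, unreviewed mutations need everything) and returns `None` for everything else, where the table has to be consulted.

```python
from typing import FrozenSet, Optional

ALL_LAYERS: FrozenSet[int] = frozenset({1, 2, 3, 4})


def required_layers(mutates_vault_state: bool, human_reviewed: bool) -> Optional[FrozenSet[int]]:
    """Return the mandatory layer numbers, or None if the table must decide."""
    if not mutates_vault_state:
        return frozenset()  # read-only: no side effects, no gates needed
    if not human_reviewed:
        return ALL_LAYERS  # mutation without human review: all 4 layers
    return None  # human-reviewed mutation: depends on the feature type
```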
Backout — auto-disable triggers#
On top of the 4 layers: every high-risk feature needs a passive backout trigger that auto-disables it if something degrades:
```yaml
auto_disable_triggers:
  - "vault_corruption_detected"          # vault-cleanup audit drop >20% in 1 day
  - "critic_reject_rate_above_30pct"     # Critic rejection rate 30%+ → bad output
  - "user_manual_disable_request"        # emergency stop
  - "eval_quality_drop_below_threshold"  # Pass-rate <70% in 1 week
```
Once a trigger fires, the feature stays disabled. It is NOT re-enabled automatically when the metrics recover; manual user action is required to reactivate it.
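The one-way behaviour can be modelled as a latch. This is a sketch under stated assumptions: the trigger names and thresholds come from the list above, while `Metrics`, `fired_triggers`, `FeatureLatch`, and the metric field names are hypothetical, and "30%+" is read as "30% or more".

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Metrics:
    audit_drop_1d: float       # fractional drop in the vault-cleanup audit over 1 day
    critic_reject_rate: float  # fraction of outputs the Critic rejected
    eval_pass_rate_1w: float   # eval pass rate over the last week
    manual_disable_requested: bool = False


def fired_triggers(m: Metrics) -> List[str]:
    """Return the names of all auto-disable triggers that currently fire."""
    fired = []
    if m.audit_drop_1d > 0.20:
        fired.append("vault_corruption_detected")
    if m.critic_reject_rate >= 0.30:
        fired.append("critic_reject_rate_above_30pct")
    if m.manual_disable_requested:
        fired.append("user_manual_disable_request")
    if m.eval_pass_rate_1w < 0.70:
        fired.append("eval_quality_drop_below_threshold")
    return fired


@dataclass
class FeatureLatch:
    enabled: bool = True
    disabled_by: List[str] = field(default_factory=list)

    def check(self, m: Metrics) -> bool:
        """Disable on any fired trigger; healthy metrics never re-enable."""
        fired = fired_triggers(m)
        if fired:
            self.enabled = False
            self.disabled_by = fired
        return self.enabled

    def manual_reenable(self) -> None:
        """The only path back to enabled: an explicit user action."""
        self.enabled = True
        self.disabled_by = []
```

Recording `disabled_by` gives the operator the reason for the shutdown before they decide to call `manual_reenable()`.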
Live example — RSI Day 0#
The first project to combine all 4 layers:
| Layer | Where implemented | What it protects against |
|---|---|---|
| 1. ENV flag | RSI_MODE=disabled default | Activation without explicit opt-in |
| 2. Script gate | .vault-rsi/scripts/*.py first line safety_gate() | Accidental execution |
| 3. Git hook | .vault-rsi/safety/git-pre-commit-hook.sh | Forbidden-target commit |
| 4. Critic review | .vault-rsi/config/rsi-config.yml: critic_review_required | Semantically bad output |
Plus 4 auto-disable triggers.
Audio overview#
- EN narration (Charon voice):
  [[.vault-nb/audio-overviews/multi-layer-safety-gate.en.mp3]]
- HU narration (Kore voice):
  [[.vault-nb/audio-overviews/multi-layer-safety-gate.hu.mp3]]
Generated via Gemini 3.1 Flash TTS preview. ~1-2 minutes each. See gemini-3-1-flash-tts-pipeline for the pipeline.
Related#
- sprint-day-0-skeleton-first — Day 0 playbook
- claude-code-subagent-fanout — bulk-mutation engine that often needs the 4 gates
- verification-step-before-claim — verification as a complementary defense layer