2026-01-27 · AI & Agents
Self-Healing Production - Autonomous Exception Resolution
Date: 2026-01-26
Status: Product Concept
Complexity: Medium - uses existing agent infrastructure + error capture
Core Concept
Production environments that automatically detect, diagnose, fix, and deploy exception resolutions without human intervention.
Flow:
- Exceptions occur (browser console, Elixir logs)
- Captured and queued
- Debounced aggregation (1-10 minutes)
- LLM deduplicates and analyzes
- Creates tickets for unique issues
- Spawns coding agent per ticket
- Agent fixes, tests, deploys
- Self-healing complete
System Architecture
1. Exception Capture Layer
Browser-side (Chrome/Firefox):
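// Intercept uncaught errors (the 'error' event does not fire for console.error calls).
// captureException is assumed to be defined elsewhere and to send the payload to the backend.
window.addEventListener('error', (event) => {
  captureException({
    type: 'javascript',
    message: event.message,
    stack: event.error?.stack,
    url: window.location.href,
    timestamp: Date.now(),
    userAgent: navigator.userAgent
  });
});

// Unhandled promise rejections
window.addEventListener('unhandledrejection', (event) => {
  captureException({
    type: 'promise_rejection',
    // reason can be any value, so stringify it before serializing
    reason: String(event.reason),
    stack: event.reason?.stack,
    timestamp: Date.now()
  });
});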
Server-side (Elixir):
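# :gen_event Logger backend that queues error-level events
defmodule Orca.SelfHealing.LoggerBackend do
  @behaviour :gen_event

  @impl true
  def init(_args), do: {:ok, %{}}

  @impl true
  def handle_event({:error, _gl, {Logger, msg, _ts, metadata}}, state) do
    exception_data = %{
      type: "elixir_error",
      # msg is chardata; convert it so it can be JSON-encoded later
      message: IO.chardata_to_string(msg),
      # crash_reason is an {exception, stacktrace} tuple, which Jason cannot encode directly
      stacktrace: inspect(metadata[:crash_reason]),
      module: metadata[:module],
      function: metadata[:function],
      timestamp: System.system_time(:millisecond)
    }

    Orca.SelfHealing.ExceptionQueue.enqueue(exception_data)
    {:ok, state}
  end

  # Ignore every other level and event (for example :flush) instead of crashing
  def handle_event(_event, state), do: {:ok, state}

  @impl true
  def handle_call(_request, state), do: {:ok, :ok, state}
end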
2. Exception Queue with Debouncing
defmodule Orca.SelfHealing.ExceptionQueue do
  use GenServer

  @debounce_ms 60_000 # 1 minute default (configurable)

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end

  def enqueue(exception) do
    GenServer.cast(__MODULE__, {:enqueue, exception})
  end

  @impl true
  def init(_opts), do: {:ok, %{queue: [], timer_ref: nil}}

  @impl true
  def handle_cast({:enqueue, exception}, state) do
    updated_queue = [exception | state.queue]

    # Reset debounce timer. Note that a steady stream of errors keeps postponing the batch.
    if state.timer_ref, do: Process.cancel_timer(state.timer_ref)
    timer_ref = Process.send_after(self(), :process_batch, @debounce_ms)

    {:noreply, %{state | queue: updated_queue, timer_ref: timer_ref}}
  end

  @impl true
  def handle_info(:process_batch, state) do
    if state.queue != [] do
      # The queue is built by prepending, so reverse it to restore arrival order
      Orca.SelfHealing.Processor.process_batch(Enum.reverse(state.queue))
    end

    {:noreply, %{state | queue: [], timer_ref: nil}}
  end
end
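Because the timer resets on every exception, a continuous error stream could delay the batch indefinitely. One way to bound that is sketched below, assuming the "1-10 minutes" range in the flow means a 1-minute quiet period with a 10-minute ceiling. The module `DebouncePolicy` and its function are hypothetical names for illustration, not existing Orca code.

```elixir
defmodule DebouncePolicy do
  @moduledoc """
  Hypothetical helper: decides how long to wait before flushing a batch.
  """

  # Quiet period after the latest exception (assumed: 1 minute)
  @debounce_ms 60_000
  # Hard ceiling measured from the first queued exception (assumed: 10 minutes)
  @max_wait_ms 600_000

  @doc """
  Returns the delay in milliseconds for the next :process_batch timer.
  first_enqueued_at and now are millisecond timestamps.
  """
  def next_delay(first_enqueued_at, now, debounce_ms \\ @debounce_ms, max_wait_ms \\ @max_wait_ms) do
    remaining = first_enqueued_at + max_wait_ms - now

    # Wait the normal debounce period unless the ceiling is closer; never go negative
    remaining |> min(debounce_ms) |> max(0)
  end
end
```

The queue would need to remember when the first exception of the current batch arrived and pass the result of `DebouncePolicy.next_delay/2` to `Process.send_after/3` in place of the fixed `@debounce_ms`.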
3. LLM-Powered Deduplication & Analysis
defmodule Orca.SelfHealing.Processor do
  require Logger

  def process_batch(exceptions) do
    # Send to LLM for analysis
    prompt = """
    Analyze these production exceptions and deduplicate them.
    Group by root cause, not by specific occurrence.

    Exceptions:
    #{Jason.encode!(exceptions, pretty: true)}

    Return a JSON array of unique issues. "affected_exceptions" holds the
    zero-based indices of the related exceptions in the list above:
    [
      {
        "root_cause": "Brief description of underlying issue",
        "severity": "critical|high|medium|low",
        "affected_exceptions": [0, 2, 5],
        "suggested_fix": "What needs to change",
        "files_likely_involved": ["lib/orca_web/live/chat_live.ex"]
      }
    ]
    """

    # call_llm/1 is assumed to return {:ok, json_string} or {:error, reason}
    with {:ok, analysis} <- call_llm(prompt),
         {:ok, issues} when is_list(issues) <- Jason.decode(analysis) do
      Enum.each(issues, &create_ticket_and_spawn_agent/1)
    else
      other ->
        # inspect/1 because the failure value may not implement String.Chars
        Logger.error("Self-healing analysis failed: #{inspect(other)}")
    end
  end

  defp create_ticket_and_spawn_agent(issue) do
    # Create ticket (could be Plane, GitHub Issues, internal DB)
    ticket = Orca.Tickets.create(%{
      title: issue["root_cause"],
      severity: issue["severity"],
      auto_generated: true,
      metadata: issue
    })

    # Spawn coding agent to fix it
    Orca.Agents.CodingAgent.spawn(%{
      task: "fix_exception",
      ticket_id: ticket.id,
      context: issue,
      auto_deploy: issue["severity"] in ["critical", "high"]
    })
  end
end
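Since the analysis comes back from an LLM, it may be worth checking each issue before a ticket is created or an agent is spawned. The following is a minimal sketch under the assumption that the response has already been decoded into a list of maps with the keys requested in the prompt. `IssueValidator` is a hypothetical module, not part of the design above.

```elixir
defmodule IssueValidator do
  @moduledoc "Hypothetical validator for the decoded LLM analysis."

  @severities ~w(critical high medium low)

  @doc """
  Splits decoded issues into {valid, rejected}.
  batch_size is the number of exceptions that were sent in the prompt.
  """
  def partition(issues, batch_size) when is_list(issues) do
    Enum.split_with(issues, &valid?(&1, batch_size))
  end

  defp valid?(%{"root_cause" => cause, "severity" => sev, "affected_exceptions" => idx}, batch_size)
       when is_binary(cause) and is_list(idx) do
    # Severity must be one of the four known levels, and every index must
    # point at an exception that was actually in the batch
    sev in @severities and idx != [] and
      Enum.all?(idx, &(is_integer(&1) and &1 >= 0 and &1 < batch_size))
  end

  # Anything that is not a map with the expected keys is rejected
  defp valid?(_other, _batch_size), do: false
end
```

Rejected entries could be logged or escalated rather than silently dropped, so that deduplication accuracy stays measurable.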
4. Autonomous Coding Agent
defmodule Orca.Agents.CodingAgent do
  def spawn(task_config) do
    Task.Supervisor.start_child(Orca.TaskSupervisor, fn ->
      execute_fix_cycle(task_config)
    end)
  end

  defp execute_fix_cycle(config) do
    # 1. Gather context
    codebase_context = gather_relevant_files(config.context["files_likely_involved"])
    recent_changes = get_recent_git_commits(hours: 24)

    # 2. Generate fix
    fix_prompt = """
    You are a coding agent fixing a production exception.

    Issue: #{config.context["root_cause"]}
    Suggested fix: #{config.context["suggested_fix"]}

    Relevant code:
    #{codebase_context}

    Recent changes (might have introduced bug):
    #{recent_changes}

    Generate a fix. Return:
    1. Changed files with full content
    2. Explanation of what you changed
    3. Test cases to verify fix
    """

    case call_llm_with_tools(fix_prompt) do
      {:ok, fix_result} ->
        # 3. Apply fix
        apply_changes(fix_result.files)

        # 4. Run tests
        case run_tests() do
          {:ok, _} ->
            # 5. Deploy if configured
            if config.auto_deploy do
              deploy_fix(config.ticket_id)
            else
              request_human_approval(config.ticket_id)
            end

          {:error, test_failures} ->
            # Revert and try again or escalate
            revert_changes()
            escalate_to_human(config.ticket_id, test_failures)
        end

      {:error, reason} ->
        escalate_to_human(config.ticket_id, reason)
    end
  end
end
5. Safety & Deployment
Safety rails:
- Only auto-deploy for severity: critical/high
- Run full test suite before deploy
- Automatic rollback if deployment fails
- Human approval required for medium/low severity
- Rate limiting: max N auto-deploys per hour
- Canary deployment for auto-fixes
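The "max N auto-deploys per hour" rail can be expressed as a sliding-window check. This is a minimal sketch that keeps the history as a plain list of timestamps. `DeployRateLimiter` and its return shapes are assumptions for illustration; a real version would likely keep this state in a GenServer or ETS table.

```elixir
defmodule DeployRateLimiter do
  @moduledoc "Hypothetical sliding-window limiter for auto-deploys."

  # One hour in milliseconds
  @window_ms 3_600_000

  @doc """
  history is a list of millisecond timestamps of earlier auto-deploys.
  Returns {:ok, new_history} when a deploy is allowed at `now`,
  or {:error, :rate_limited} when the hourly budget is used up.
  """
  def check(history, now, max_per_hour) do
    # Keep only deploys that happened within the last hour
    recent = Enum.filter(history, fn ts -> now - ts < @window_ms end)

    if length(recent) < max_per_hour do
      {:ok, [now | recent]}
    else
      {:error, :rate_limited}
    end
  end
end
```

A rate-limited fix would fall back to the human approval path instead of being discarded.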
Deployment flow:
defp deploy_fix(ticket_id) do
  # Create branch for fix
  git_branch = "auto-fix/ticket-#{ticket_id}"
  Git.create_branch(git_branch)
  # Only ticket_id is in scope here, so the commit message references the ticket by id
  Git.commit("Auto-fix: ticket #{ticket_id}\n\nGenerated by self-healing system")

  # Deploy via existing infrastructure
  case Mix.Tasks.Deploy.run(["--fast"]) do
    :ok ->
      notify_humans(%{
        type: "auto_deploy_success",
        ticket: ticket_id,
        changes: get_diff()
      })

    {:error, reason} ->
      Git.rollback()
      escalate_to_human(ticket_id, "Deploy failed: #{inspect(reason)}")
  end
end
Configuration Options
config :orca, :self_healing,
  enabled: true,
  debounce_ms: 60_000, # 1 minute
  auto_deploy_severity: [:critical, :high], # Which severities auto-deploy
  max_auto_deploys_per_hour: 5,
  require_tests_pass: true,
  notification_channels: [:slack, :email],
  escalation_timeout_ms: 300_000 # 5 minutes - escalate if agent stuck
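The config lists severities as atoms, while the LLM analysis reports them as strings, so something has to bridge the two. A minimal sketch of reading these options at runtime follows. `SelfHealingConfig` is a hypothetical accessor, and defaulting `enabled` to `false` when nothing is configured is an assumption made here, not something the concept specifies.

```elixir
defmodule SelfHealingConfig do
  @moduledoc "Hypothetical accessor for the :self_healing config."

  # Fallbacks used when a key is not configured (assumed values)
  @defaults [enabled: false, debounce_ms: 60_000, auto_deploy_severity: [:critical, :high]]

  @doc "Reads one option from config :orca, :self_healing, falling back to the defaults."
  def get(key) do
    :orca
    |> Application.get_env(:self_healing, [])
    |> Keyword.get(key, Keyword.get(@defaults, key))
  end

  @doc "True when the LLM-reported severity string is configured for auto-deploy."
  def auto_deploy?(severity) when is_binary(severity) do
    allowed = Enum.map(get(:auto_deploy_severity), &Atom.to_string/1)
    get(:enabled) and severity in allowed
  end
end
```

With a helper like this, the hard-coded `["critical", "high"]` check and the `@debounce_ms` attribute could both follow the config instead of duplicating its values.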
Integration with Existing Orca Systems
Leverages existing infrastructure:
- Orca.Agents - autonomous agent framework already exists
- Orca.Chat.InferenceEngine - LLM tool calling
- Orca.Tools.Shell - for git operations, running tests
- Existing deploy system (make deploy-fast)
- Notification system (if exists, or add)
New components needed:
- Exception queue GenServer
- Logger backend for Elixir errors
- Browser error capture JS snippet
- Deduplication/analysis module
- Coding agent specialization
- Ticket system integration (or simple DB table)
Phased Rollout
Phase 1: Observe & Learn
- Capture exceptions to queue
- LLM analysis and deduplication
- Create tickets automatically
- Human reviews and fixes (traditional flow)
- Build confidence in analysis quality
Phase 2: Agent Assistance
- Agent generates suggested fixes
- Human reviews before applying
- Agent can run tests
- Learn from human edits to suggestions
Phase 3: Supervised Auto-Fix
- Agent fixes low-severity issues automatically
- Human approval required before deploy
- Monitor success rate
Phase 4: Full Autonomy
- Auto-fix and auto-deploy for critical/high
- Human in loop only for escalations
- Self-healing production achieved
Metrics & Monitoring
Track:
- Exception volume over time (should decrease)
- Deduplication accuracy (human validation)
- Fix success rate (% of agent fixes that work)
- Time to resolution (exception → deployed fix)
- Auto-deploy vs human intervention ratio
- False positive rate (tickets for non-issues)
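Two of these metrics, fix success rate and time to resolution, reduce to simple arithmetic over ticket records. The sketch below assumes each ticket carries `:detected_at`, `:deployed_at`, and `:fix_worked` fields; those field names and the `HealingMetrics` module are hypothetical, since the ticket schema is still an open question.

```elixir
defmodule HealingMetrics do
  @moduledoc "Hypothetical metrics over self-healing tickets."

  @doc """
  tickets is a list of maps with :detected_at and :deployed_at (millisecond
  timestamps, :deployed_at is nil when no fix shipped) and :fix_worked (boolean).
  """
  def summarize(tickets) do
    deployed = Enum.filter(tickets, & &1.deployed_at)
    worked = Enum.count(deployed, & &1.fix_worked)
    total_ms = deployed |> Enum.map(&(&1.deployed_at - &1.detected_at)) |> Enum.sum()

    %{
      fix_success_rate: ratio(worked, length(deployed)),
      mean_time_to_resolution_ms: ratio(total_ms, length(deployed))
    }
  end

  # Avoid dividing by zero when nothing has been deployed yet
  defp ratio(_num, 0), do: nil
  defp ratio(num, den), do: num / den
end
```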
Dashboard should show:
- Recent exceptions (grouped by root cause)
- Active coding agents and their status
- Recent auto-deploys with diffs
- Escalations requiring human attention
Edge Cases & Failure Modes
What could go wrong:
- Agent introduces new bugs: Rollback mechanism, comprehensive tests
- Infinite loop of fixes: Rate limiting, circuit breaker
- LLM hallucinates root cause: Human review for medium+ severity
- Deploy breaks production: Canary deploys, health checks, auto-rollback
- Cost explosion: Budget caps on LLM calls, queue size limits
- Security vulnerability introduced: Code review by security-focused agent first
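The circuit breaker mentioned for the "infinite loop of fixes" case could be as small as a per-issue attempt counter. This is a sketch under the assumption that issues can be keyed by something stable such as the root cause string. `FixCircuitBreaker` and the threshold of 3 are hypothetical.

```elixir
defmodule FixCircuitBreaker do
  @moduledoc "Hypothetical per-issue circuit breaker for repeated fix attempts."

  # Assumed threshold: stop spawning agents after this many attempts on one issue
  @max_attempts 3

  @doc "attempts maps an issue key (for example the root cause string) to its attempt count."
  def allow?(attempts, issue_key), do: Map.get(attempts, issue_key, 0) < @max_attempts

  @doc "Records one more fix attempt for the issue."
  def record_attempt(attempts, issue_key), do: Map.update(attempts, issue_key, 1, &(&1 + 1))

  @doc "Clears the counter once a fix is confirmed to have worked."
  def reset(attempts, issue_key), do: Map.delete(attempts, issue_key)
end
```

When `allow?/2` returns `false`, the issue would be escalated to a human rather than handed to another agent.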
Mitigations:
- Gradual rollout (phases above)
- Kill switch to disable system
- Comprehensive test coverage requirement
- Human oversight for critical systems
- Audit log of all auto-changes
Related Concepts
Connection to other Orca work:
- Agents with autonomy and goal-seeking (A)
- Knowledge graph could track exception patterns over time (B)
- Emergence: system learns to heal itself, potentially identifies systemic issues (C)
Connection to Loop Survival Game:
- Both explore autonomous code generation under constraints
- Production survival = "keep the loop running"
- Self-modification in constrained environment
Connection to Analysis Pipeline:
- Exception patterns could be analyzed for root cause insights
- Knowledge graph of "what breaks together"
- Agents learn from accumulated exception semantics
Open Questions
- Ticket system: Plane integration? Internal DB? GitHub Issues?
- Testing requirements: How comprehensive must tests be for auto-deploy?
- Rollback strategy: Git revert? Blue-green deployment?
- Cost analysis: LLM API costs for continuous monitoring?
- Multi-service: How does this work with microservices? Distributed traces?
- Privacy: What exception data gets sent to LLM providers?
- Agent memory: Should agents remember past fixes for similar issues?
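On the privacy question, one possible partial answer is a redaction pass over exception messages before they are placed in a prompt. The sketch below is illustrative only: `ExceptionScrubber` is a hypothetical module, and the two patterns (email addresses and bearer tokens) are examples, not a complete list of what would need scrubbing.

```elixir
defmodule ExceptionScrubber do
  @moduledoc "Hypothetical redaction pass applied before exceptions reach an LLM provider."

  @doc "Redacts known sensitive patterns in the :message field, if there is one."
  def scrub(%{message: message} = exception) when is_binary(message) do
    %{exception | message: scrub_text(message)}
  end

  # Exceptions without a string message are passed through unchanged
  def scrub(exception), do: exception

  defp scrub_text(text) do
    text
    # Email addresses
    |> String.replace(~r/[\w.+-]+@[\w-]+(\.[\w-]+)+/, "[email]")
    # Bearer tokens as they appear in Authorization headers
    |> String.replace(~r{Bearer\s+[A-Za-z0-9._~+/-]+=*}, "Bearer [token]")
  end
end
```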
Prior Art & Research
Similar concepts:
- Meta's Sapienz and SapFix (automated Android testing, with SapFix generating fixes for the crashes Sapienz finds)
- Google's AutoML crash analysis
- GitHub Copilot Workspace (end-to-end task automation)
- Conventional error monitoring (Sentry, Rollbar) - but no auto-fix
Novel aspects:
- Full cycle: detect → analyze → fix → test → deploy
- LLM-powered deduplication and root cause analysis
- Autonomous coding agent with tool use
- Integration into existing deployment pipeline
- Configurable autonomy levels
Status: Ready for feasibility analysis and phased implementation planning
Next steps:
- Assess LLM cost for typical exception volume
- Design ticket schema (or pick integration)
- Prototype exception queue + debouncing
- Test LLM deduplication accuracy on real exceptions
- Define success criteria for Phase 1 rollout