2026-01-27 · AI & Agents

Self-Healing Production - Autonomous Exception Resolution

Date: 2026-01-26
Status: Product Concept
Complexity: Medium - uses existing agent infrastructure + error capture

Core Concept

A production environment that automatically detects and diagnoses exceptions, then fixes them, tests the fix, and deploys it without human intervention.

Flow:

Exceptions occur (browser console, Elixir logs)
Captured and queued
Debounced aggregation (1-10 minutes)
LLM deduplicates and analyzes
Creates tickets for unique issues
Spawns coding agent per ticket
Agent fixes, tests, deploys
Self-healing complete

System Architecture

1. Exception Capture Layer

Browser-side (Chrome/Firefox):

// Uncaught runtime errors (captureException is assumed to be defined elsewhere
// and to send the payload to the server)
window.addEventListener('error', (event) => {
  captureException({
    type: 'javascript',
    message: event.message,
    stack: event.error?.stack,
    url: window.location.href,
    timestamp: Date.now(),
    userAgent: navigator.userAgent
  });
});

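// Unhandled promise rejections
window.addEventListener('unhandledrejection', (event) => {
  captureException({
    type: 'promise_rejection',
    // event.reason can be any value; Error objects serialize to {} in JSON
    message: event.reason?.message ?? String(event.reason),
    stack: event.reason?.stack,
    url: window.location.href,
    timestamp: Date.now()
  });
});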

Server-side (Elixir):

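# Logger backend that queues error-level log events
defmodule Orca.SelfHealing.LoggerBackend do
  @behaviour :gen_event

  @impl true
  def init(_args), do: {:ok, %{}}

  @impl true
  def handle_event({:error, _gl, {Logger, msg, _ts, metadata}}, state) do
    exception_data = %{
      type: "elixir_error",
      # msg is chardata; flatten it so it can be JSON-encoded later
      message: IO.chardata_to_string(msg),
      stacktrace: format_crash_reason(metadata[:crash_reason]),
      module: metadata[:module],
      function: metadata[:function],
      timestamp: System.system_time(:millisecond)
    }

    Orca.SelfHealing.ExceptionQueue.enqueue(exception_data)
    {:ok, state}
  end

  # Ignore other levels and :flush so an unmatched event does not crash the backend
  def handle_event(_event, state), do: {:ok, state}

  @impl true
  def handle_call(_request, state), do: {:ok, :ok, state}

  # :crash_reason is a {reason, stacktrace} tuple, which is not JSON-encodable
  defp format_crash_reason({reason, stacktrace}) when is_list(stacktrace),
    do: Exception.format(:error, reason, stacktrace)

  defp format_crash_reason(_), do: nil
end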

2. Exception Queue with Debouncing

defmodule Orca.SelfHealing.ExceptionQueue do
  use GenServer

  @debounce_ms 60_000  # 1 minute default (configurable)

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end

  def enqueue(exception) do
    GenServer.cast(__MODULE__, {:enqueue, exception})
  end

  @impl true
  def init(_opts), do: {:ok, %{queue: [], timer_ref: nil}}

  @impl true
  def handle_cast({:enqueue, exception}, state) do
    updated_queue = [exception | state.queue]

    # Reset debounce timer: the batch fires only after @debounce_ms of quiet
    if state.timer_ref, do: Process.cancel_timer(state.timer_ref)
    timer_ref = Process.send_after(self(), :process_batch, @debounce_ms)

    {:noreply, %{state | queue: updated_queue, timer_ref: timer_ref}}
  end

  @impl true
  def handle_info(:process_batch, state) do
    if state.queue != [] do
      # The queue is built newest-first; restore arrival order before analysis
      Orca.SelfHealing.Processor.process_batch(Enum.reverse(state.queue))
    end

    {:noreply, %{state | queue: [], timer_ref: nil}}
  end
end
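
One caveat with a debounce that resets on every exception: a steady stream of errors postpones the batch indefinitely. The sketch below shows one possible way to bound the wait, assuming a hypothetical max_wait_ms cap that is not part of the design above. The module name and function signature are illustrative only.

```elixir
defmodule DebounceWindow do
  @moduledoc """
  Hypothetical helper: decides how long to wait before flushing a batch.
  Each new exception restarts the quiet period, but the batch is never
  held longer than `max_wait_ms` after its first exception.
  """

  @doc """
  Returns the delay in ms to pass to `Process.send_after/3`.

  `first_enqueued_at` is the time (ms) at which the oldest queued
  exception arrived, or `nil` when the queue is empty.
  """
  def next_delay(_now, nil, debounce_ms, max_wait_ms) do
    min(debounce_ms, max_wait_ms)
  end

  def next_delay(now, first_enqueued_at, debounce_ms, max_wait_ms) do
    # Time left before the cap is hit; never negative, never above the debounce
    remaining = max_wait_ms - (now - first_enqueued_at)
    remaining |> min(debounce_ms) |> max(0)
  end
end
```

With a 1 minute debounce and a 10 minute cap, an exception arriving 9.5 minutes after the first one would schedule the batch 30 seconds out instead of a full minute. The queue would need to store the arrival time of its oldest entry for this to work.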

3. LLM-Powered Deduplication & Analysis

defmodule Orca.SelfHealing.Processor do
  require Logger

  def process_batch(exceptions) do
    # Send to LLM for analysis
    prompt = """
    Analyze these production exceptions and deduplicate them.
    Group by root cause, not by specific occurrence.
    
    Exceptions:
    #{Jason.encode!(exceptions, pretty: true)}
    
    Return only a JSON array of unique issues. "affected_exceptions" holds
    zero-based indices into the list above:
    [
      {
        "root_cause": "Brief description of underlying issue",
        "severity": "critical|high|medium|low",
        "affected_exceptions": [0, 2, 5],
        "suggested_fix": "What needs to change",
        "files_likely_involved": ["lib/orca_web/live/chat_live.ex"]
      }
    ]
    """
    
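    case call_llm(prompt) do
      {:ok, analysis} ->
        # LLM output is untrusted: decode without raising on malformed JSON
        case Jason.decode(analysis) do
          {:ok, issues} when is_list(issues) ->
            Enum.each(issues, &create_ticket_and_spawn_agent/1)

          _ ->
            Logger.warning("Self-healing analysis returned invalid JSON")
        end

      {:error, reason} ->
        # Logged below :error level so the logger backend does not re-enqueue it
        Logger.warning("Self-healing analysis failed: #{inspect(reason)}")
    end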
  end
  
  defp create_ticket_and_spawn_agent(issue) do
    # Create ticket (could be Plane, GitHub Issues, internal DB)
    ticket = Orca.Tickets.create(%{
      title: issue["root_cause"],
      severity: issue["severity"],
      auto_generated: true,
      metadata: issue
    })
    
    # Spawn coding agent to fix it
    Orca.Agents.CodingAgent.spawn(%{
      task: "fix_exception",
      ticket_id: ticket.id,
      context: issue,
      auto_deploy: issue["severity"] in ["critical", "high"]
    })
  end
end
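
Because the analysis comes from an LLM, it may help to check each decoded issue against the schema requested in the prompt before creating tickets. A minimal sketch, assuming the issues have already been decoded into maps with string keys; the AnalysisValidator module is hypothetical and not part of Orca.

```elixir
defmodule AnalysisValidator do
  @moduledoc """
  Hypothetical guard for decoded LLM output. It keeps only issues whose
  fields match the schema requested in the prompt.
  """

  @severities ~w(critical high medium low)

  @doc "Splits decoded issues into {valid, rejected} for a batch of `batch_size` exceptions."
  def partition(issues, batch_size) when is_list(issues) do
    Enum.split_with(issues, &valid?(&1, batch_size))
  end

  defp valid?(%{"root_cause" => cause, "severity" => sev, "affected_exceptions" => idxs}, size)
       when is_binary(cause) and is_list(idxs) do
    # Severity must be one of the four known values and every index
    # must point at an exception that was actually in the batch
    sev in @severities and idxs != [] and
      Enum.all?(idxs, &(is_integer(&1) and &1 >= 0 and &1 < size))
  end

  defp valid?(_other, _size), do: false
end
```

Rejected entries could be logged or attached to a single catch-all ticket for human review rather than silently dropped.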

4. Autonomous Coding Agent

defmodule Orca.Agents.CodingAgent do
  def spawn(task_config) do
    Task.Supervisor.start_child(Orca.TaskSupervisor, fn ->
      execute_fix_cycle(task_config)
    end)
  end
  
  defp execute_fix_cycle(config) do
    # 1. Gather context
    codebase_context = gather_relevant_files(config.context["files_likely_involved"])
    recent_changes = get_recent_git_commits(hours: 24)
    
    # 2. Generate fix
    fix_prompt = """
    You are a coding agent fixing a production exception.
    
    Issue: #{config.context["root_cause"]}
    Suggested fix: #{config.context["suggested_fix"]}
    
    Relevant code:
    #{codebase_context}
    
    Recent changes (might have introduced bug):
    #{recent_changes}
    
    Generate a fix. Return:
    1. Changed files with full content
    2. Explanation of what you changed
    3. Test cases to verify fix
    """
    
    case call_llm_with_tools(fix_prompt) do
      {:ok, fix_result} ->
        # 3. Apply fix
        apply_changes(fix_result.files)
        
        # 4. Run tests
        case run_tests() do
          {:ok, _} ->
            # 5. Deploy if configured
            if config.auto_deploy do
              deploy_fix(config.ticket_id)
            else
              request_human_approval(config.ticket_id)
            end
            
          {:error, test_failures} ->
            # Revert and try again or escalate
            revert_changes()
            escalate_to_human(config.ticket_id, test_failures)
        end
        
      {:error, reason} ->
        escalate_to_human(config.ticket_id, reason)
    end
  end
end

5. Safety & Deployment

Safety rails:

Only auto-deploy for severity: critical/high
Run full test suite before deploy
Automatic rollback if deployment fails
Human approval required for medium/low severity
Rate limiting: max N auto-deploys per hour
Canary deployment for auto-fixes
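
The rate-limiting rail can be sketched as a sliding one-hour window over recent deploy timestamps. This is one possible approach under stated assumptions: the DeployRateLimiter module is hypothetical, and the state is kept as a plain list so it could live in any GenServer or Agent.

```elixir
defmodule DeployRateLimiter do
  @moduledoc """
  Hypothetical sliding-window limiter for auto-deploys. State is a list
  of recent deploy timestamps in milliseconds, newest first.
  """

  @window_ms 3_600_000  # one hour

  @doc """
  Returns `{:ok, new_timestamps}` if a deploy is allowed at `now`, or
  `{:error, :rate_limited}` once `max_per_hour` deploys fall in the window.
  """
  def check(timestamps, now, max_per_hour) do
    # Drop deploys that happened more than an hour before `now`
    recent = Enum.filter(timestamps, &(now - &1 < @window_ms))

    if length(recent) < max_per_hour do
      {:ok, [now | recent]}
    else
      {:error, :rate_limited}
    end
  end
end
```

A rate-limited fix would presumably fall back to the human approval path rather than being discarded.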

Deployment flow:

defp deploy_fix(ticket_id) do
  # Create branch for fix
  git_branch = "auto-fix/ticket-#{ticket_id}"
  Git.create_branch(git_branch)
  # Only the ticket id is in scope here, so the commit message references it
  Git.commit("Auto-fix for ticket #{ticket_id}\n\nGenerated by self-healing system")

  # Deploy via existing infrastructure
  case Mix.Tasks.Deploy.run(["--fast"]) do
    :ok ->
      notify_humans(%{
        type: "auto_deploy_success",
        ticket: ticket_id,
        changes: get_diff()
      })

    {:error, reason} ->
      Git.rollback()
      escalate_to_human(ticket_id, "Deploy failed: #{inspect(reason)}")
  end
end

Configuration Options

config :orca, :self_healing,
  enabled: true,
  debounce_ms: 60_000,  # 1 minute
  auto_deploy_severity: [:critical, :high],  # Which severities auto-deploy
  max_auto_deploys_per_hour: 5,
  require_tests_pass: true,
  notification_channels: [:slack, :email],
  escalation_timeout_ms: 300_000  # 5 minutes - escalate if agent stuck
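
The configuration lists severities as atoms, while the LLM returns them as strings. A minimal sketch of a policy check that bridges the two, assuming the keyword list above is passed in directly; AutoDeployPolicy is a hypothetical module name.

```elixir
defmodule AutoDeployPolicy do
  @moduledoc """
  Hypothetical policy check driven by the :self_healing configuration.
  """

  @doc """
  Decides whether an issue may auto-deploy. `config` is the keyword list
  stored under `config :orca, :self_healing`; `severity` is the string
  returned by the LLM (for example "high").
  """
  def auto_deploy?(config, severity) when is_binary(severity) do
    # Convert the configured atoms to strings instead of turning
    # untrusted LLM output into atoms
    allowed =
      config
      |> Keyword.get(:auto_deploy_severity, [])
      |> Enum.map(&Atom.to_string/1)

    Keyword.get(config, :enabled, false) and severity in allowed
  end
end
```

At runtime the keyword list could be read with Application.get_env(:orca, :self_healing, []), which would let the enabled flag double as the kill switch.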

Integration with Existing Orca Systems

Leverages existing infrastructure:

Orca.Agents - autonomous agent framework already exists
Orca.Chat.InferenceEngine - LLM tool calling
Orca.Tools.Shell - for git operations, running tests
Existing deploy system (make deploy-fast)
Notification system (if one exists; otherwise it needs to be added)

New components needed:

Exception queue GenServer
Logger backend for Elixir errors
Browser error capture JS snippet
Deduplication/analysis module
Coding agent specialization
Ticket system integration (or simple DB table)

Phased Rollout

Phase 1: Observe & Learn

Capture exceptions to queue
LLM analysis and deduplication
Create tickets automatically
Human reviews and fixes (traditional flow)
Build confidence in analysis quality

Phase 2: Agent Assistance

Agent generates suggested fixes
Human reviews before applying
Agent can run tests
Learn from human edits to suggestions

Phase 3: Supervised Auto-Fix

Agent fixes low-severity issues automatically
Human approval required before deploy
Monitor success rate

Phase 4: Full Autonomy

Auto-fix and auto-deploy for critical/high
Human in loop only for escalations
Self-healing production achieved

Metrics & Monitoring

Track:

Exception volume over time (should decrease)
Deduplication accuracy (human validation)
Fix success rate (% of agent fixes that work)
Time to resolution (exception → deployed fix)
Auto-deploy vs human intervention ratio
False positive rate (tickets for non-issues)
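
Two of these metrics reduce to simple arithmetic over ticket records. The sketch below assumes each ticket is a map with :outcome, :detected_at, and :resolved_at fields; that shape and the HealingMetrics module are hypothetical, since the ticket schema is still an open question.

```elixir
defmodule HealingMetrics do
  @moduledoc """
  Hypothetical metric helpers. Each ticket is assumed to be a map with
  :outcome (:fixed | :escalated), :detected_at and :resolved_at (ms).
  """

  @doc "Share of tickets the agent fixed without escalation, from 0.0 to 1.0."
  def fix_success_rate([]), do: 0.0

  def fix_success_rate(tickets) do
    fixed = Enum.count(tickets, &(&1.outcome == :fixed))
    fixed / length(tickets)
  end

  @doc "Mean time in ms from detection to deployed fix, over fixed tickets only."
  def mean_time_to_resolution(tickets) do
    durations =
      for %{outcome: :fixed, detected_at: d, resolved_at: r} <- tickets, do: r - d

    case durations do
      [] -> nil
      _ -> Enum.sum(durations) / length(durations)
    end
  end
end
```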

Dashboard should show:

Recent exceptions (grouped by root cause)
Active coding agents and their status
Recent auto-deploys with diffs
Escalations requiring human attention

Edge Cases & Failure Modes

What could go wrong:

Agent introduces new bugs: Rollback mechanism, comprehensive tests
Infinite loop of fixes: Rate limiting, circuit breaker
LLM hallucinates root cause: Human review for medium/low severity; tests and canary deploys gate critical/high auto-fixes
Deploy breaks production: Canary deploys, health checks, auto-rollback
Cost explosion: Budget caps on LLM calls, queue size limits
Security vulnerability introduced: Code review by security-focused agent first
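
The circuit breaker for fix loops could be as simple as counting attempts per root cause. A minimal sketch, assuming the root-cause string is stable enough to use as a key (in practice a normalized fingerprint may be needed, since LLM wording can vary between batches); FixCircuitBreaker is a hypothetical module.

```elixir
defmodule FixCircuitBreaker do
  @moduledoc """
  Hypothetical circuit breaker keyed by root cause. After `max_attempts`
  automatic fix attempts for the same root cause, further attempts are
  refused so the issue can be escalated to a human instead.
  """

  @doc "Starts with no recorded attempts."
  def new, do: %{}

  @doc """
  Records an attempt. Returns `{:ok, breaker}` while attempts remain and
  `{:open, breaker}` once the limit for that root cause is reached.
  """
  def attempt(breaker, root_cause, max_attempts) do
    count = Map.get(breaker, root_cause, 0)

    if count < max_attempts do
      {:ok, Map.put(breaker, root_cause, count + 1)}
    else
      {:open, breaker}
    end
  end
end
```

An :open result would route the ticket to escalation instead of spawning another coding agent.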

Mitigations:

Gradual rollout (phases above)
Kill switch to disable system
Comprehensive test coverage requirement
Human oversight for critical systems
Audit log of all auto-changes

Related Concepts

Connection to other Orca work:

Agents with autonomy and goal-seeking (A)
Knowledge graph could track exception patterns over time (B)
Emergence: system learns to heal itself, potentially identifies systemic issues (C)

Connection to Loop Survival Game:

Both explore autonomous code generation under constraints
Production survival = "keep the loop running"
Self-modification in constrained environment

Connection to Analysis Pipeline:

Exception patterns could be analyzed for root cause insights
Knowledge graph of "what breaks together"
Agents learn from accumulated exception semantics

Open Questions

Ticket system: Plane integration? Internal DB? GitHub Issues?
Testing requirements: How comprehensive must tests be for auto-deploy?
Rollback strategy: Git revert? Blue-green deployment?
Cost analysis: LLM API costs for continuous monitoring?
Multi-service: How does this work with microservices? Distributed traces?
Privacy: What exception data gets sent to LLM providers?
Agent memory: Should agents remember past fixes for similar issues?
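
On the privacy question, one option is a redaction pass over exception messages before they are sent to an LLM provider. The sketch below is illustrative only: the ExceptionScrubber module is hypothetical, and the two patterns (email addresses and bearer tokens) are examples rather than a complete list of sensitive data.

```elixir
defmodule ExceptionScrubber do
  @moduledoc """
  Hypothetical redaction pass run before exceptions leave the system.
  The patterns are illustrative, not an exhaustive list of sensitive data.
  """

  @doc "Replaces email addresses and bearer tokens in the :message field."
  def scrub(%{message: message} = exception) when is_binary(message) do
    scrubbed =
      Enum.reduce(patterns(), message, fn {regex, replacement}, acc ->
        Regex.replace(regex, acc, replacement)
      end)

    %{exception | message: scrubbed}
  end

  # Exceptions without a string message pass through unchanged
  def scrub(exception), do: exception

  defp patterns do
    [
      {~r/[\w.+-]+@[\w-]+(\.[\w-]+)+/, "[email]"},
      {~r/Bearer\s+[A-Za-z0-9._~+\/-]+=*/, "Bearer [token]"}
    ]
  end
end
```

Stack traces and URLs can carry sensitive values too, so a real pass would likely need to cover more fields than :message.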

Prior Art & Research

Similar concepts:

Meta's Sapienz and SapFix (automated Android testing, with SapFix generating candidate fixes)
Google's AutoML crash analysis
GitHub Copilot Workspace (end-to-end task automation)
Conventional error monitoring (Sentry, Rollbar) - but no auto-fix

Novel aspects:

Full cycle: detect → analyze → fix → test → deploy
LLM-powered deduplication and root cause analysis
Autonomous coding agent with tool use
Integration into existing deployment pipeline
Configurable autonomy levels

Status: Ready for feasibility analysis and phased implementation planning

Next steps:

Assess LLM cost for typical exception volume
Design ticket schema (or pick integration)
Prototype exception queue + debouncing
Test LLM deduplication accuracy on real exceptions
Define success criteria for Phase 1 rollout