2026-01-27 · AI & Agents

Self-Healing Production - Autonomous Exception Resolution

Date: 2026-01-26
Status: Product Concept
Complexity: Medium - uses existing agent infrastructure + error capture


Core Concept

Production environments that automatically detect, diagnose, and fix exceptions, then deploy the fixes without human intervention.

Flow:

  1. Exceptions occur (browser console, Elixir logs)
  2. Captured and queued
  3. Debounced aggregation (1-10 minutes)
  4. LLM deduplicates and analyzes
  5. Creates tickets for unique issues
  6. Spawns coding agent per ticket
  7. Agent fixes, tests, deploys
  8. Self-healing complete

System Architecture

1. Exception Capture Layer

Browser-side (Chrome/Firefox):

// Intercept console errors
window.addEventListener('error', (event) => {
  captureException({
    type: 'javascript',
    message: event.message,
    stack: event.error?.stack,
    url: window.location.href,
    timestamp: Date.now(),
    userAgent: navigator.userAgent
  });
});

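// Unhandled promise rejections
window.addEventListener('unhandledrejection', (event) => {
  const reason = event.reason;
  captureException({
    type: 'promise_rejection',
    // reason can be any value; an Error serializes to {} in JSON, so extract strings
    reason: reason instanceof Error ? reason.message : String(reason),
    stack: reason instanceof Error ? reason.stack : undefined,
    timestamp: Date.now()
  });
});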
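
The handlers above call captureException, which the snippet leaves undefined. A minimal sketch of one way it could work, assuming illustrative batch limits and an injected transport (createExceptionBuffer, maxBatch, and flushMs are hypothetical names, not part of the original design): buffer events in memory and send them in batches, so an error storm does not produce one request per exception.

```javascript
// Hypothetical client-side buffer for captured exceptions.
// `send` is injected so the transport (sendBeacon, fetch, ...) can be swapped.
function createExceptionBuffer({ send, maxBatch = 20, flushMs = 5000 }) {
  let buffer = [];
  let timer = null;

  function flush() {
    if (timer !== null) {
      clearTimeout(timer);
      timer = null;
    }
    if (buffer.length === 0) return;
    const batch = buffer;
    buffer = [];
    send(batch);
  }

  function captureException(exception) {
    buffer.push(exception);
    if (buffer.length >= maxBatch) {
      flush(); // size limit reached: send immediately
    } else if (timer === null) {
      timer = setTimeout(flush, flushMs); // otherwise send after a short delay
    }
  }

  return { captureException, flush };
}
```

In the browser, send could be something like (batch) => navigator.sendBeacon(url, JSON.stringify(batch)), where url is whatever ingestion endpoint the server exposes; the design does not name one.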

Server-side (Elixir):

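# Logger backend that queues exceptions
# (legacy :gen_event backend API; on Elixir 1.15+ this lives in the :logger_backends package)
defmodule Orca.SelfHealing.LoggerBackend do
  @behaviour :gen_event

  @impl true
  def init(_args), do: {:ok, %{}}

  @impl true
  def handle_event({:error, _gl, {Logger, msg, _ts, metadata}}, state) do
    exception_data = %{
      type: "elixir_error",
      message: IO.chardata_to_string(msg),
      stacktrace: format_crash_reason(metadata[:crash_reason]),
      module: metadata[:module],
      function: metadata[:function],
      timestamp: System.system_time(:millisecond)
    }

    Orca.SelfHealing.ExceptionQueue.enqueue(exception_data)
    {:ok, state}
  end

  # Ignore other levels and :flush so an unmatched event does not crash the handler
  def handle_event(_event, state), do: {:ok, state}

  @impl true
  def handle_call(_request, state), do: {:ok, :ok, state}

  # :crash_reason, when present, is a {reason, stacktrace} tuple
  defp format_crash_reason({_reason, stack}) when is_list(stack),
    do: Exception.format_stacktrace(stack)

  defp format_crash_reason(_), do: nil
end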

2. Exception Queue with Debouncing

defmodule Orca.SelfHealing.ExceptionQueue do
  use GenServer

  @debounce_ms 60_000  # 1 minute default (configurable)

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end

  def enqueue(exception) do
    GenServer.cast(__MODULE__, {:enqueue, exception})
  end

  @impl true
  def init(_opts), do: {:ok, %{queue: [], timer_ref: nil}}

  @impl true
  def handle_cast({:enqueue, exception}, state) do
    updated_queue = [exception | state.queue]

    # Reset debounce timer
    if state.timer_ref, do: Process.cancel_timer(state.timer_ref)
    timer_ref = Process.send_after(self(), :process_batch, @debounce_ms)

    {:noreply, %{state | queue: updated_queue, timer_ref: timer_ref}}
  end

  @impl true
  def handle_info(:process_batch, state) do
    # The queue is built by prepending, so reverse it to restore arrival order
    if state.queue != [] do
      Orca.SelfHealing.Processor.process_batch(Enum.reverse(state.queue))
    end

    {:noreply, %{state | queue: [], timer_ref: nil}}
  end
end
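
One property of a timer that resets on every enqueue: if exceptions keep arriving less than the debounce interval apart, the batch is never flushed. A common remedy is to cap the total wait. Below is a minimal sketch of that scheduling rule, written in JavaScript like the browser snippet; flushDeadline and maxWaitMs are assumptions for illustration, not part of the original design.

```javascript
// Decide when a debounced batch should flush.
// firstAt: time the oldest queued exception arrived (ms)
// lastAt:  time the newest queued exception arrived (ms)
function flushDeadline(firstAt, lastAt, debounceMs, maxWaitMs) {
  const quietDeadline = lastAt + debounceMs; // plain debounce: wait for a quiet period
  const hardDeadline = firstAt + maxWaitMs;  // cap: never hold a batch longer than this
  return Math.min(quietDeadline, hardDeadline);
}

// Example: 1-minute debounce, 10-minute cap, exceptions arriving every 30 s.
const debounceMs = 60000;
const maxWaitMs = 600000;
const arrivals = Array.from({ length: 30 }, (_, i) => i * 30000); // 0 s .. 870 s

let flushAt = null;
for (const t of arrivals) {
  if (flushAt !== null && t >= flushAt) break; // batch flushed before this arrival
  flushAt = flushDeadline(arrivals[0], t, debounceMs, maxWaitMs);
}
console.log(flushAt); // 600000: the cap fires even though the stream never goes quiet
```

The same rule ports to the GenServer by storing the arrival time of the first queued exception in state and scheduling the timer for the earlier of the two deadlines.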

3. LLM-Powered Deduplication & Analysis

defmodule Orca.SelfHealing.Processor do
  require Logger

  def process_batch(exceptions) do
    # Send to LLM for analysis
    prompt = """
    Analyze these production exceptions and deduplicate them.
    Group by root cause, not by specific occurrence.
    
    Exceptions:
    #{Jason.encode!(exceptions, pretty: true)}
    
    Return a JSON array of unique issues, where "affected_exceptions" holds
    the zero-based indices of the related exceptions in the list above:
    [
      {
        "root_cause": "Brief description of underlying issue",
        "severity": "critical|high|medium|low",
        "affected_exceptions": [0, 2, 5],
        "suggested_fix": "What needs to change",
        "files_likely_involved": ["lib/orca_web/live/chat_live.ex"]
      }
    ]
    """
    
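    # LLM output is untrusted: decode without raising and check the shape.
    # Failures are logged at warning level so the error-level logger backend
    # does not feed them back into the queue.
    with {:ok, analysis} <- call_llm(prompt),
         {:ok, issues} when is_list(issues) <- Jason.decode(analysis) do
      Enum.each(issues, &create_ticket_and_spawn_agent/1)
    else
      error ->
        Logger.warning("Self-healing analysis failed: #{inspect(error)}")
    end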
  end
  
  defp create_ticket_and_spawn_agent(issue) do
    # Create ticket (could be Plane, GitHub Issues, internal DB)
    ticket = Orca.Tickets.create(%{
      title: issue["root_cause"],
      severity: issue["severity"],
      auto_generated: true,
      metadata: issue
    })
    
    # Spawn coding agent to fix it
    Orca.Agents.CodingAgent.spawn(%{
      task: "fix_exception",
      ticket_id: ticket.id,
      context: issue,
      auto_deploy: issue["severity"] in ["critical", "high"]
    })
  end
end
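
Because the analysis comes back from a model, it is worth checking more than whether it parses. The sketch below validates the fields the prompt asks for before any ticket or agent is created. It is written in JavaScript like the browser snippet; parseAnalysis and its return shape are hypothetical, and the field names are taken from the prompt above.

```javascript
const SEVERITIES = new Set(['critical', 'high', 'medium', 'low']);

// Parse and validate the model's reply. Returns { ok: true, issues } or
// { ok: false, error } instead of throwing, so the caller can escalate.
function parseAnalysis(rawReply, exceptionCount) {
  let parsed;
  try {
    parsed = JSON.parse(rawReply);
  } catch (err) {
    return { ok: false, error: `invalid JSON: ${err.message}` };
  }
  if (!Array.isArray(parsed)) {
    return { ok: false, error: 'expected a JSON array of issues' };
  }
  for (const [i, issue] of parsed.entries()) {
    if (issue === null || typeof issue !== 'object') {
      return { ok: false, error: `issue ${i}: not an object` };
    }
    if (typeof issue.root_cause !== 'string' || issue.root_cause.trim() === '') {
      return { ok: false, error: `issue ${i}: missing root_cause` };
    }
    if (!SEVERITIES.has(issue.severity)) {
      return { ok: false, error: `issue ${i}: unknown severity` };
    }
    // Every index must point at an exception that was actually in the batch
    const indices = issue.affected_exceptions;
    const validIndices =
      Array.isArray(indices) &&
      indices.every((n) => Number.isInteger(n) && n >= 0 && n < exceptionCount);
    if (!validIndices) {
      return { ok: false, error: `issue ${i}: affected_exceptions out of range` };
    }
  }
  return { ok: true, issues: parsed };
}
```

Rejecting the whole reply on the first bad issue is a deliberately conservative choice; a more lenient variant could drop invalid issues and keep the rest.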

4. Autonomous Coding Agent

defmodule Orca.Agents.CodingAgent do
  def spawn(task_config) do
    Task.Supervisor.start_child(Orca.TaskSupervisor, fn ->
      execute_fix_cycle(task_config)
    end)
  end
  
  defp execute_fix_cycle(config) do
    # 1. Gather context
    codebase_context = gather_relevant_files(config.context["files_likely_involved"])
    recent_changes = get_recent_git_commits(hours: 24)
    
    # 2. Generate fix
    fix_prompt = """
    You are a coding agent fixing a production exception.
    
    Issue: #{config.context["root_cause"]}
    Suggested fix: #{config.context["suggested_fix"]}
    
    Relevant code:
    #{codebase_context}
    
    Recent changes (might have introduced bug):
    #{recent_changes}
    
    Generate a fix. Return:
    1. Changed files with full content
    2. Explanation of what you changed
    3. Test cases to verify fix
    """
    
    case call_llm_with_tools(fix_prompt) do
      {:ok, fix_result} ->
        # 3. Apply fix
        apply_changes(fix_result.files)
        
        # 4. Run tests
        case run_tests() do
          {:ok, _} ->
            # 5. Deploy if configured
            if config.auto_deploy do
              deploy_fix(config.ticket_id)
            else
              request_human_approval(config.ticket_id)
            end
            
          {:error, test_failures} ->
            # Revert and escalate (single attempt, no automatic retry)
            revert_changes()
            escalate_to_human(config.ticket_id, test_failures)
        end
        
      {:error, reason} ->
        escalate_to_human(config.ticket_id, reason)
    end
  end
end
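
The cycle above makes a single attempt and escalates as soon as the tests fail. If retries are wanted, a bounded loop keeps the agent from cycling indefinitely and lets each attempt see the previous failures. This is a sketch in JavaScript, like the browser snippet. The step functions (generateFix, applyChanges, runTests, revertChanges) and the maxAttempts default are hypothetical stand-ins for the helpers the Elixir module assumes.

```javascript
// Run generate -> apply -> test up to maxAttempts times, reverting after each
// failed test run and feeding the failures into the next attempt.
async function runFixCycle(steps, maxAttempts = 3) {
  let previousFailures = null;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const fix = await steps.generateFix(previousFailures);
    await steps.applyChanges(fix);
    const result = await steps.runTests();
    if (result.passed) {
      return { status: 'fixed', attempts: attempt, fix };
    }
    await steps.revertChanges();
    previousFailures = result.failures;
  }
  // Out of attempts: hand over to a human with the last set of failures
  return { status: 'escalate', attempts: maxAttempts, failures: previousFailures };
}
```

The working tree is always reverted before the next attempt, so a failed fix never becomes the base for the following one.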

5. Safety & Deployment

Safety rails:

Deployment flow:

defp deploy_fix(ticket_id) do
  # Create branch for fix
  git_branch = "auto-fix/ticket-#{ticket_id}"
  Git.create_branch(git_branch)
  # Only the ticket id is in scope here, so reference it rather than a title
  Git.commit("Auto-fix for ticket #{ticket_id}\n\nGenerated by self-healing system")

  # Deploy via existing infrastructure
  # (Mix is not available inside a release, so this assumes a Mix-based deploy host)
  case Mix.Tasks.Deploy.run(["--fast"]) do
    :ok ->
      notify_humans(%{
        type: "auto_deploy_success",
        ticket: ticket_id,
        changes: get_diff()
      })

    {:error, reason} ->
      Git.rollback()
      escalate_to_human(ticket_id, "Deploy failed: #{inspect(reason)}")
  end
end
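
The flow above treats a successful deploy command as the end of the job. A deploy can succeed and still leave the service unhealthy, so one option is to gate success on a few health probes and roll back on the first failure. Here is a minimal sketch in JavaScript, like the browser snippet. The deploy, checkHealth, and rollback callbacks and the probe counts are assumptions, not part of the original design.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// After deploying, poll a health probe; roll back on the first failed probe.
async function deployWithHealthGate({ deploy, checkHealth, rollback, probes = 5, intervalMs = 2000 }) {
  await deploy();
  for (let i = 0; i < probes; i++) {
    let healthy = false;
    try {
      healthy = await checkHealth();
    } catch {
      healthy = false; // a probe that throws counts as unhealthy
    }
    if (!healthy) {
      await rollback();
      return { status: 'rolled_back', failedProbe: i + 1 };
    }
    if (i < probes - 1) await sleep(intervalMs);
  }
  return { status: 'healthy', probes };
}
```

Only a 'healthy' result would then trigger the success notification; a 'rolled_back' result would go to escalation instead.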

Configuration Options

config :orca, :self_healing,
  enabled: true,
  debounce_ms: 60_000,  # 1 minute
  auto_deploy_severity: [:critical, :high],  # Which severities auto-deploy
  max_auto_deploys_per_hour: 5,
  require_tests_pass: true,
  notification_channels: [:slack, :email],
  escalation_timeout_ms: 300_000  # 5 minutes - escalate if agent stuck
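
To illustrate how enabled, auto_deploy_severity, and max_auto_deploys_per_hour might combine into a single decision, here is a sketch in JavaScript, like the browser snippet. DeployLimiter and mayAutoDeploy are hypothetical names, and the sliding-window interpretation of "per hour" is an assumption; the config itself does not say how the limit is measured.

```javascript
// Sliding-window limiter: allow at most `limit` events per `windowMs`.
class DeployLimiter {
  constructor(limit, windowMs = 3600000) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.timestamps = [];
  }

  // Returns true and records the deploy if under the limit, otherwise false.
  tryAcquire(now = Date.now()) {
    this.timestamps = this.timestamps.filter((t) => now - t < this.windowMs);
    if (this.timestamps.length >= this.limit) return false;
    this.timestamps.push(now);
    return true;
  }
}

// config mirrors the keys above: enabled, auto_deploy_severity (as autoDeploySeverity)
function mayAutoDeploy(severity, config, limiter, now = Date.now()) {
  if (!config.enabled) return false;
  if (!config.autoDeploySeverity.includes(severity)) return false;
  return limiter.tryAcquire(now); // only eligible severities consume the budget
}
```

A fix that is refused here would fall back to the human-approval path rather than being dropped.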

Integration with Existing Orca Systems

Leverages existing infrastructure:

New components needed:

  1. Exception queue GenServer
  2. Logger backend for Elixir errors
  3. Browser error capture JS snippet
  4. Deduplication/analysis module
  5. Coding agent specialization
  6. Ticket system integration (or simple DB table)

Phased Rollout

Phase 1: Observe & Learn

Phase 2: Agent Assistance

Phase 3: Supervised Auto-Fix

Phase 4: Full Autonomy


Metrics & Monitoring

Track:

Dashboard should show:


Edge Cases & Failure Modes

What could go wrong:

  1. Agent introduces new bugs: Rollback mechanism, comprehensive tests
  2. Infinite loop of fixes: Rate limiting, circuit breaker
  3. LLM hallucinates root cause: Human review for medium+ severity
  4. Deploy breaks production: Canary deploys, health checks, auto-rollback
  5. Cost explosion: Budget caps on LLM calls, queue size limits
  6. Security vulnerability introduced: Code review by security-focused agent first
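
For item 2, one simple form of circuit breaker is to count automated fix attempts per recurring error and stop once a limit is reached, so that a fix which does not work (or which causes the same error again) cannot trigger endless new attempts. This is a sketch in JavaScript, like the browser snippet. The fingerprint normalization rules, FixCircuitBreaker, and the default of two attempts are all assumptions for illustration.

```javascript
// Collapse volatile details (hex ids, numbers) so recurrences of the same
// error map to the same key.
function fingerprint(exception) {
  const message = String(exception.message || '')
    .replace(/0x[0-9a-f]+/gi, '<hex>')
    .replace(/\d+/g, '<n>');
  return `${exception.type}|${message}`;
}

class FixCircuitBreaker {
  constructor(maxAttempts = 2) {
    this.maxAttempts = maxAttempts;
    this.attempts = new Map(); // fingerprint -> number of automated fix attempts
  }

  // Returns true if another automated fix may be attempted for this exception.
  allow(exception) {
    const key = fingerprint(exception);
    const count = this.attempts.get(key) || 0;
    if (count >= this.maxAttempts) return false; // circuit open: escalate to a human
    this.attempts.set(key, count + 1);
    return true;
  }
}
```

The same fingerprint could also be used to collapse exact repeats before the batch reaches the LLM, which bears on the cost concern in item 5.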

Mitigations:


Related Concepts

Connection to other Orca work:

Connection to Loop Survival Game:

Connection to Analysis Pipeline:


Open Questions

  1. Ticket system: Plane integration? Internal DB? GitHub Issues?
  2. Testing requirements: How comprehensive must tests be for auto-deploy?
  3. Rollback strategy: Git revert? Blue-green deployment?
  4. Cost analysis: LLM API costs for continuous monitoring?
  5. Multi-service: How does this work with microservices? Distributed traces?
  6. Privacy: What exception data gets sent to LLM providers?
  7. Agent memory: Should agents remember past fixes for similar issues?
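
On question 6, one partial answer is to scrub obvious identifiers from each exception before it is included in a prompt. The sketch below is in JavaScript, like the browser snippet. The patterns and the redactException name are illustrative assumptions and cover only emails, bearer tokens, and URL query strings; a real deployment would need a reviewed list.

```javascript
const EMAIL = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
const BEARER = /Bearer\s+[A-Za-z0-9._~+\/-]+=*/g;

function redactText(text) {
  return text.replace(EMAIL, '<email>').replace(BEARER, 'Bearer <token>');
}

// Returns a copy of the exception with sensitive-looking values masked.
function redactException(exception) {
  const redacted = { ...exception };
  for (const field of ['message', 'stack', 'stacktrace', 'reason']) {
    if (typeof redacted[field] === 'string') {
      redacted[field] = redactText(redacted[field]);
    }
  }
  if (typeof redacted.url === 'string') {
    // Drop the query string and fragment, which often carry tokens or ids
    redacted.url = redacted.url.split(/[?#]/)[0];
  }
  return redacted;
}
```

Pattern-based redaction will miss things, so it works best alongside a decision about which fields are sent at all.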

Prior Art & Research

Similar concepts:

Novel aspects:


Status: Ready for feasibility analysis and phased implementation planning

Next steps: