Error Handling and Monitoring
When automation fails
Agents running unattended in CI/CD pipelines encounter failure modes that interactive sessions rarely see. Nobody's watching. Nobody can intervene. When something breaks, the agent either handles it or everything stops.
Three problems need solving: recognizing failures, recovering from them, and monitoring what's happening across many concurrent workflows.
How automated agents break
Interactive agent errors (covered in Module 5) look different from automated failures. Unattended execution has its own pathologies.
Cascading failures:
One early mistake compounds through every subsequent decision. An agent misidentifies a file structure, then makes changes based on that wrong model. Each action builds on the error. By the time something external stops it—a test failure, a syntax error, resource exhaustion—the damage has spread.
Developers catch these early in interactive sessions. Automated pipelines let the cascade run.
Non-deterministic behavior:
Same prompt, different results. Traditional retry logic assumes failures are transient: wait a bit, try again, problem goes away. LLM-based agents break this assumption. They fail differently each time. Simple retry accomplishes nothing.
Worse: intermittent success masks the real problem. A workflow that passes sometimes is harder to debug than one that fails consistently. It erodes trust in automation while hiding underlying issues.
Confabulated solutions:
Agents generate fixes that look right but don't actually work. The code compiles. Tests pass. But the original bug remains.
Interactive sessions catch this—developers verify fixes solve the problem. Automated workflows need explicit, programmatic verification. Otherwise pseudo-fixes merge and accumulate.
Never assume an agent-generated fix addresses the root cause. Automated workflows need verification steps that confirm the original problem is actually resolved.
Retry strategies that work
Exponential backoff handles infrastructure problems: network timeouts, rate limits, service hiccups. For API calls inside agent workflows, the standard pattern applies unchanged.
For infrastructure failures:
# Exponential backoff for API calls
- uses: openai/codex-action@v1
with:
openai-api-key: ${{ secrets.OPENAI_API_KEY }}
prompt: "Fix the failing test"
env:
OPENAI_MAX_RETRIES: 3
OPENAI_RETRY_BACKOFF_MS: 1000 # Doubles each retry
This catches transient API errors. It does nothing for semantic failures.
Retry doesn't fix bad prompts:
When an agent produces wrong code, running the same prompt again produces different wrong code. The problem isn't transient. The prompt doesn't match the task.
Recovery means changing the approach:
jobs:
attempt-fix:
runs-on: ubuntu-latest
outputs:
success: ${{ steps.verify.outputs.result }}
steps:
- uses: actions/checkout@v4
- uses: openai/codex-action@v1
with:
prompt: "Fix the failing test by examining the error message"
- id: verify
run: npm test && echo "result=true" >> $GITHUB_OUTPUT || echo "result=false" >> $GITHUB_OUTPUT
fallback-analysis:
needs: attempt-fix
if: needs.attempt-fix.outputs.success == 'false'
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: openai/codex-action@v1
with:
prompt: "The previous fix attempt failed. Analyze the test file and implementation to understand why. Then propose a different approach."The fallback job gets different instructions. This mirrors what developers do: when a fix fails, reconsider the approach instead of trying the same thing again.
Circuit breakers
Repeated failures waste resources and cause damage. Circuit breakers halt execution when patterns emerge.
Using workflow state:
name: Agent with Circuit Breaker
on:
workflow_dispatch:
schedule:
- cron: '0 * * * *' # Hourly
jobs:
check-circuit:
runs-on: ubuntu-latest
outputs:
should-run: ${{ steps.check.outputs.result }}
steps:
- uses: actions/cache@v4
id: failure-cache
with:
path: .circuit-state
key: circuit-breaker-${{ github.repository }}
- id: check
run: |
if [[ -f .circuit-state ]]; then
failures=$(cat .circuit-state | jq '.consecutive_failures')
last_failure=$(cat .circuit-state | jq -r '.last_failure')
cooldown_end=$(date -d "$last_failure + 1 hour" +%s)
now=$(date +%s)
if [[ $failures -ge 3 && $now -lt $cooldown_end ]]; then
echo "Circuit open - skipping execution"
echo "result=false" >> $GITHUB_OUTPUT
exit 0
fi
fi
echo "result=true" >> $GITHUB_OUTPUT
run-agent:
needs: check-circuit
if: needs.check-circuit.outputs.should-run == 'true'
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: openai/codex-action@v1
id: codex
with:
prompt: "Perform scheduled maintenance task"
- name: Update circuit state
if: failure()
run: |
echo '{"consecutive_failures": 1, "last_failure": "'$(date -Iseconds)'"}' > .circuit-state
# Increment if the file exists; reset on success (see the sketch below)
Three consecutive failures open the circuit for an hour. Operators get time to investigate without the workflow piling up failures.
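One way to flesh out that elided step: restore the cached state in the run-agent job, update the counter from the job status, and save it back. This is a sketch; because GitHub cache entries are immutable, the save key appends the run ID, and the restore step in check-circuit would need a matching restore-keys prefix to pick up the latest entry.
- uses: actions/cache/restore@v4
  with:
    path: .circuit-state
    key: circuit-breaker-${{ github.repository }}
    restore-keys: circuit-breaker-${{ github.repository }}
- name: Update circuit state
  if: always()
  run: |
    # Reset the counter on success; otherwise increment it.
    if [[ "${{ job.status }}" == "success" ]]; then
      echo '{"consecutive_failures": 0, "last_failure": null}' > .circuit-state
    else
      prev=0
      if [[ -f .circuit-state ]]; then
        prev=$(jq '.consecutive_failures' .circuit-state)
      fi
      echo "{\"consecutive_failures\": $((prev + 1)), \"last_failure\": \"$(date -Iseconds)\"}" > .circuit-state
    fi
- uses: actions/cache/save@v4
  if: always()
  with:
    path: .circuit-state
    key: circuit-breaker-${{ github.repository }}-${{ github.run_id }}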
Human escalation
Some failures need human judgment. Good automation includes clear escalation paths.
Creating actionable issues:
- name: Escalate to humans
if: failure()
uses: actions/github-script@v7
with:
script: |
const runUrl = `${process.env.GITHUB_SERVER_URL}/${process.env.GITHUB_REPOSITORY}/actions/runs/${process.env.GITHUB_RUN_ID}`;
await github.rest.issues.create({
owner: context.repo.owner,
repo: context.repo.repo,
title: 'Agent workflow requires attention',
body: `## Automated Agent Failure
The scheduled maintenance workflow failed after automated recovery attempts.
**Run URL:** ${runUrl}
**Failure time:** ${new Date().toISOString()}
**Attempts:** 3
### Required action
Please review the workflow logs and either:
1. Fix the underlying issue and re-run the workflow
2. Disable the workflow if the task is no longer needed
/cc @platform-team`,
labels: ['agent-failure', 'needs-triage']
});
The issue contains what failed, when, how many attempts were made, and what to do next. The explicit cc ensures someone owns resolution.
Tiered by severity:
Not every failure warrants the same response. Documentation sync failures are less urgent than security scan failures.
| Severity | Condition | Response |
|---|---|---|
| Low | Non-critical task fails | Log for weekly review |
| Medium | Repeated failures | Create issue, assign team |
| High | Security-related failure | Create issue, page on-call |
| Critical | Production-affecting | Immediate page, disable workflow |
Route by severity in the escalation logic:
- name: Determine severity and escalate
  id: severity
run: |
if [[ "${{ github.workflow }}" == *"security"* ]]; then
severity="high"
elif [[ "${{ env.FAILURE_COUNT }}" -gt 5 ]]; then
severity="medium"
else
severity="low"
fi
echo "severity=$severity" >> $GITHUB_OUTPUTIf agents escalate constantly, the automation isn't working. Target 10-15% escalation rates for mature workflows. The other 85-90% should complete without humans.
Cost monitoring
Automated agents accumulate API costs without anyone noticing. Interactive sessions provide natural cost awareness—you see tokens consumed. Automation runs blind.
Track per workflow, not per organization:
- name: Record cost metrics
run: |
echo "workflow=${{ github.workflow }}" >> metrics.txt
echo "run_id=${{ github.run_id }}" >> metrics.txt
echo "tokens_used=${{ steps.codex.outputs.total-tokens }}" >> metrics.txt
echo "timestamp=$(date -Iseconds)" >> metrics.txtOnce you have baselines, alert on anomalies. A documentation sync using 50,000 tokens should not suddenly consume 500,000.
Hard limits prevent surprises:
- uses: openai/codex-action@v1
with:
codex-args: '["--config", "max_cost_dollars=2"]'
Set limits appropriate to each task. Failing fast beats accumulating unexpected charges.
Weekly review checklist:
- Compare total spend to previous week
- Identify workflows where cost increased >20%
- Review highest-cost workflows for optimization
- Verify budget limits remain appropriate
- Check for orphaned workflows still running
Quality gates for automated commits
Module 8 covered quality gates for AI code. In automated contexts, gates are the only defense before human review.
Minimum gate set:
- name: Quality gates
run: |
# Syntax validation
npm run lint || exit 1
# Test suite
npm test || exit 1
# Security scan
npm audit --audit-level=high || exit 1
# Verify original issue is fixed
npm run verify-fix || exit 1
That last step is critical. It confirms the original problem is resolved. Without it, agents produce code that passes static checks but doesn't fix anything.
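What verify-fix runs is project-specific. One minimal sketch is a script that re-runs only the regression test added for the original report; the tests/regressions path, the issue-number convention, and the use of Jest here are all assumptions.
#!/usr/bin/env bash
# verify-fix: confirm the originally reported problem no longer reproduces.
set -euo pipefail

# Assumes the workflow exports the triggering issue number and that a
# regression test was added at tests/regressions/issue-<number>.test.js.
ISSUE_NUMBER="${ISSUE_NUMBER:?set ISSUE_NUMBER to the triggering issue number}"
npx jest "tests/regressions/issue-${ISSUE_NUMBER}.test.js"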
PR creation only after gates pass:
- uses: peter-evans/create-pull-request@v6
if: success() # Only create PR if all gates pass
with:
title: "Auto-fix: ${{ github.event.issue.title }}"
body: |
## Automated fix
**Quality gates passed:**
- [x] Lint
- [x] Tests
- [x] Security scan
- [x] Original issue verification
**Manual review required for:**
- [ ] Logic correctness
- [ ] Architectural fit
- [ ] Performance implications
The PR documents what was verified automatically and what still needs human assessment.
Observability at scale
Monitoring individual workflows breaks down as automation scales. Centralized observability provides the aggregate view.
Metrics worth tracking:
| Metric | Description | Alert threshold |
|---|---|---|
| Success rate | Runs completing without error | Below 85% |
| Mean time to resolution | Duration from trigger to completion | Over 2x baseline |
| Escalation rate | Requiring human intervention | Over 15% |
| Cost per run | API spend per execution | Over 2x baseline |
| Quality gate failure rate | Failing automated checks | Over 20% |
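The success-rate row can be computed without extra infrastructure using the GitHub CLI available on hosted runners. A sketch; agent-maintenance.yml is a placeholder for the workflow being monitored.
- name: Compute recent success rate
  env:
    GH_TOKEN: ${{ github.token }}
  run: |
    # Success rate over the last 50 runs of the monitored workflow.
    total=$(gh run list --repo ${{ github.repository }} --workflow agent-maintenance.yml \
      --limit 50 --json conclusion --jq 'length')
    passed=$(gh run list --repo ${{ github.repository }} --workflow agent-maintenance.yml \
      --limit 50 --json conclusion --jq '[.[] | select(.conclusion == "success")] | length')
    if (( total == 0 )); then
      echo "No runs found"; exit 0
    fi
    rate=$(( passed * 100 / total ))
    echo "Success rate over last $total runs: ${rate}%"
    if (( rate < 85 )); then
      echo "::warning::Success rate ${rate}% is below the 85% alert threshold"
    fi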
OpenTelemetry integration:
The OpenTelemetry semantic conventions for generative AI provide standardized attributes:
gen_ai.agent.id: workflow-123
gen_ai.agent.name: daily-docs-sync
gen_ai.conversation.id: run-456
gen_ai.provider.name: openai
Exporting to an observability platform enables cross-workflow analysis, trend detection, and correlation with system metrics.
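Whether a given agent action emits telemetry depends on its own instrumentation, but the standard OpenTelemetry environment variables give a tool-agnostic way to supply these attributes and point an exporter at your collector. A sketch with placeholder values, assuming the tool or a wrapper script honors the standard variables.
- uses: openai/codex-action@v1
  with:
    prompt: "Sync documentation"
  env:
    OTEL_EXPORTER_OTLP_ENDPOINT: https://otel-collector.internal:4317
    OTEL_RESOURCE_ATTRIBUTES: gen_ai.agent.name=daily-docs-sync,gen_ai.agent.id=workflow-123,gen_ai.provider.name=openai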
Dashboard basics:
- An overview panel showing success rate, active runs, and recent failures
- A cost tracker with daily/weekly spend, budget utilization, and anomalies
- Failure analysis showing common patterns and resolution rates
- Per-workflow health views with degradation trends
You should be able to answer "are our automated agents working?" at a glance. Drill-down capability handles investigation when the answer is no.
Making automation reliable
Agent automation fails. The question is whether failures are handled gracefully or become cascading disasters.
Verify, don't assume. Automated checks confirm agents did what they claimed.
Fail fast. Stop early when problems surface rather than proceeding with bad state.
Escalate clearly. When automation cannot continue, humans receive actionable context—not just "something broke."
Monitor continuously. Aggregate metrics reveal patterns invisible in individual runs.
Budget defensively. Cost limits prevent stuck workflows from draining API credits.
These apply regardless of tools or platforms. Implementations differ, but the underlying approach is constant: expect failure, detect it early, recover gracefully, and maintain visibility throughout.