Error Handling and Monitoring
When automation fails
Agents running unattended in CI/CD pipelines encounter failure modes that interactive sessions rarely see. Nobody's watching. Nobody can intervene. When something breaks, the agent either handles it or everything stops.
Three problems need solving: recognizing failures, recovering from them, and monitoring what's happening across many concurrent workflows.
How automated agents break
Interactive agent errors (covered in Module 5) look different from automated failures. Unattended execution has its own pathologies.
Cascading failures:
One early mistake compounds through every subsequent decision. An agent misidentifies a file structure, then makes changes based on that wrong model. Each action builds on the error. By the time something external stops it—a test failure, a syntax error, resource exhaustion—the damage has spread.
Developers catch these early in interactive sessions. Automated pipelines let the cascade run.
Non-deterministic behavior:
Same prompt, different results. Traditional retry logic assumes failures are transient: wait a bit, try again, problem goes away. LLM-based agents break this assumption. They fail differently each time. Simple retry accomplishes nothing.
Worse: intermittent success masks the real problem. A workflow that passes sometimes is harder to debug than one that fails consistently. It erodes trust in automation while hiding underlying issues.
Confabulated solutions:
Agents generate fixes that look right but don't actually work. The code compiles. Tests pass. But the original bug remains.
Interactive sessions catch this—developers verify fixes solve the problem. Automated workflows need explicit, programmatic verification. Otherwise pseudo-fixes merge and accumulate.
Never assume an agent-generated fix addresses the root cause. Automated workflows need verification steps that confirm the original problem is actually resolved.
Retry strategies that work
Exponential backoff handles infrastructure problems: network timeouts, rate limits, service hiccups. For API calls inside agent workflows, the standard pattern applies unchanged.
For infrastructure failures:
# Exponential backoff for API calls
- uses: openai/codex-action@v1
with:
openai-api-key: ${{ secrets.OPENAI_API_KEY }}
prompt: "Fix the failing test"
env:
OPENAI_MAX_RETRIES: 3
OPENAI_RETRY_BACKOFF_MS: 1000 # Doubles each retry
This catches transient API errors. It does nothing for semantic failures.
Retry doesn't fix bad prompts:
When an agent produces wrong code, running the same prompt again produces different wrong code. The problem isn't transient. The prompt doesn't match the task.
Recovery means changing the approach:
jobs:
attempt-fix:
runs-on: ubuntu-latest
outputs:
success: ${{ steps.verify.outputs.result }}
steps:
- uses: actions/checkout@v4
- uses: openai/codex-action@v1
with:
prompt: "Fix the failing test by examining the error message"
- id: verify
run: npm test && echo "result=true" >> $GITHUB_OUTPUT || echo "result=false" >> $GITHUB_OUTPUT
fallback-analysis:
needs: attempt-fix
if: needs.attempt-fix.outputs.success == 'false'
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: openai/codex-action@v1
with:
prompt: "The previous fix attempt failed. Analyze the test file and implementation to understand why. Then propose a different approach."The fallback job gets different instructions. This mirrors what developers do: when a fix fails, reconsider the approach instead of trying the same thing again.
Circuit breakers
Repeated failures waste resources and cause damage. Circuit breakers halt execution when patterns emerge.
Using workflow state:
name: Agent with Circuit Breaker
on:
workflow_dispatch:
schedule:
- cron: '0 * * * *' # Hourly
jobs:
check-circuit:
runs-on: ubuntu-latest
outputs:
should-run: ${{ steps.check.outputs.result }}
steps:
- uses: actions/cache@v4
id: failure-cache
with:
path: .circuit-state
key: circuit-breaker-${{ github.repository }}
- id: check
run: |
if [[ -f .circuit-state ]]; then
failures=$(cat .circuit-state | jq '.consecutive_failures')
last_failure=$(cat .circuit-state | jq -r '.last_failure')
cooldown_end=$(date -d "$last_failure + 1 hour" +%s)
now=$(date +%s)
if [[ $failures -ge 3 && $now -lt $cooldown_end ]]; then
echo "Circuit open - skipping execution"
echo "result=false" >> $GITHUB_OUTPUT
exit 0
fi
fi
echo "result=true" >> $GITHUB_OUTPUT
run-agent:
needs: check-circuit
if: needs.check-circuit.outputs.should-run == 'true'
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: openai/codex-action@v1
id: codex
with:
prompt: "Perform scheduled maintenance task"
- name: Update circuit state
if: failure()
run: |
echo '{"consecutive_failures": 1, "last_failure": "'$(date -Iseconds)'"}' > .circuit-state
# Increment if the file exists; reset on success (see the sketch below)
Three consecutive failures open the circuit for an hour. Operators get time to investigate without the workflow piling up failures.
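One way to flesh out that elided step: restore the cached state in the run-agent job, update the counter from the job status, and save it back. This is a sketch; because GitHub cache entries are immutable, the save key appends the run ID, and the restore step in check-circuit would need a matching restore-keys prefix to pick up the latest entry.
- uses: actions/cache/restore@v4
  with:
    path: .circuit-state
    key: circuit-breaker-${{ github.repository }}
    restore-keys: circuit-breaker-${{ github.repository }}
- name: Update circuit state
  if: always()
  run: |
    # Reset the counter on success; otherwise increment it.
    if [[ "${{ job.status }}" == "success" ]]; then
      echo '{"consecutive_failures": 0, "last_failure": null}' > .circuit-state
    else
      prev=0
      if [[ -f .circuit-state ]]; then
        prev=$(jq '.consecutive_failures' .circuit-state)
      fi
      echo "{\"consecutive_failures\": $((prev + 1)), \"last_failure\": \"$(date -Iseconds)\"}" > .circuit-state
    fi
- uses: actions/cache/save@v4
  if: always()
  with:
    path: .circuit-state
    key: circuit-breaker-${{ github.repository }}-${{ github.run_id }}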
Human escalation
Some failures need human judgment. Good automation includes clear escalation paths.
Creating actionable issues:
- name: Escalate to humans
if: failure()
uses: actions/github-script@v7
with:
script: |
const runUrl = `${process.env.GITHUB_SERVER_URL}/${process.env.GITHUB_REPOSITORY}/actions/runs/${process.env.GITHUB_RUN_ID}`;
await github.rest.issues.create({
owner: context.repo.owner,
repo: context.repo.repo,
title: 'Agent workflow requires attention',
body: `## Automated Agent Failure
The scheduled maintenance workflow failed after automated recovery attempts.
**Run URL:** ${runUrl}
**Failure time:** ${new Date().toISOString()}
**Attempts:** 3
### Required action
Please review the workflow logs and either:
1. Fix the underlying issue and re-run the workflow
2. Disable the workflow if the task is no longer needed
/cc @platform-team`,
labels: ['agent-failure', 'needs-triage']
});
The issue contains what failed, when, how many attempts were made, and what to do next. The explicit cc ensures someone owns resolution.
Tiered by severity:
Not every failure warrants the same response. Documentation sync failures are less urgent than security scan failures.
| Severity | Condition | Response |
|---|---|---|
| Low | Non-critical task fails | Log for weekly review |
| Medium | Repeated failures | Create issue, assign team |
| High | Security-related failure | Create issue, page on-call |
| Critical | Production-affecting | Immediate page, disable workflow |
Route by severity in the escalation logic:
- name: Determine severity and escalate
  id: severity
run: |
if [[ "${{ github.workflow }}" == *"security"* ]]; then
severity="high"
elif [[ "${{ env.FAILURE_COUNT }}" -gt 5 ]]; then
severity="medium"
else
severity="low"
fi
echo "severity=$severity" >> $GITHUB_OUTPUTIf agents escalate constantly, the automation isn't working. Target 10-15% escalation rates for mature workflows. The other 85-90% should complete without humans.
Cost monitoring
Automated agents accumulate API costs without anyone noticing. Interactive sessions provide natural cost awareness—you see tokens consumed. Automation runs blind.
Track per workflow, not per organization:
- name: Record cost metrics
run: |
echo "workflow=${{ github.workflow }}" >> metrics.txt
echo "run_id=${{ github.run_id }}" >> metrics.txt
echo "tokens_used=${{ steps.codex.outputs.total-tokens }}" >> metrics.txt
echo "timestamp=$(date -Iseconds)" >> metrics.txtOnce you have baselines, alert on anomalies. A documentation sync using 50,000 tokens should not suddenly consume 500,000.
Hard limits prevent surprises:
- uses: openai/codex-action@v1
with:
codex-args: '["--config", "max_cost_dollars=2"]'
Set limits appropriate to each task. Failing fast beats accumulating unexpected charges.
Weekly review checklist:
- Compare total spend to previous week
- Identify workflows where cost increased >20%
- Review highest-cost workflows for optimization
- Verify budget limits remain appropriate
- Check for orphaned workflows still running
Quality gates for automated commits
Module 8 covered quality gates for AI code. In automated contexts, gates are the only defense before human review.
Minimum gate set:
- name: Quality gates
run: |
# Syntax validation
npm run lint || exit 1
# Test suite
npm test || exit 1
# Security scan
npm audit --audit-level=high || exit 1
# Verify original issue is fixed
npm run verify-fix || exit 1
That last step is critical. It confirms the original problem is resolved. Without it, agents produce code that passes static checks but doesn't fix anything.
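What verify-fix runs is project-specific. One minimal sketch is a script that re-runs only the regression test added for the original report; the tests/regressions path, the issue-number convention, and the use of Jest here are all assumptions.
#!/usr/bin/env bash
# verify-fix: confirm the originally reported problem no longer reproduces.
set -euo pipefail

# Assumes the workflow exports the triggering issue number and that a
# regression test was added at tests/regressions/issue-<number>.test.js.
ISSUE_NUMBER="${ISSUE_NUMBER:?set ISSUE_NUMBER to the triggering issue number}"
npx jest "tests/regressions/issue-${ISSUE_NUMBER}.test.js"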
PR creation only after gates pass:
- uses: peter-evans/create-pull-request@v6
if: success() # Only create PR if all gates pass
with:
title: "Auto-fix: ${{ github.event.issue.title }}"
body: |
## Automated fix
**Quality gates passed:**
- [x] Lint
- [x] Tests
- [x] Security scan
- [x] Original issue verification
**Manual review required for:**
- [ ] Logic correctness
- [ ] Architectural fit
- [ ] Performance implications
The PR documents what was verified automatically and what still needs human assessment.
Observability at scale
Monitoring individual workflows breaks down as automation scales. Centralized observability provides the aggregate view.
Metrics worth tracking:
| Metric | Description | Alert threshold |
|---|---|---|
| Success rate | Runs completing without error | Below 85% |
| Mean time to resolution | Duration from trigger to completion | Over 2x baseline |
| Escalation rate | Requiring human intervention | Over 15% |
| Cost per run | API spend per execution | Over 2x baseline |
| Quality gate failure rate | Failing automated checks | Over 20% |
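The success-rate row can be computed without extra infrastructure using the GitHub CLI available on hosted runners. A sketch; agent-maintenance.yml is a placeholder for the workflow being monitored.
- name: Compute recent success rate
  env:
    GH_TOKEN: ${{ github.token }}
  run: |
    # Success rate over the last 50 runs of the monitored workflow.
    total=$(gh run list --repo ${{ github.repository }} --workflow agent-maintenance.yml \
      --limit 50 --json conclusion --jq 'length')
    passed=$(gh run list --repo ${{ github.repository }} --workflow agent-maintenance.yml \
      --limit 50 --json conclusion --jq '[.[] | select(.conclusion == "success")] | length')
    if (( total == 0 )); then
      echo "No runs found"; exit 0
    fi
    rate=$(( passed * 100 / total ))
    echo "Success rate over last $total runs: ${rate}%"
    if (( rate < 85 )); then
      echo "::warning::Success rate ${rate}% is below the 85% alert threshold"
    fi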
OpenTelemetry integration:
The OpenTelemetry semantic conventions for generative AI provide standardized attributes:
gen_ai.agent.id: workflow-123
gen_ai.agent.name: daily-docs-sync
gen_ai.conversation.id: run-456
gen_ai.provider.name: openai
Exporting to an observability platform enables cross-workflow analysis, trend detection, and correlation with system metrics.
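Whether a given agent action emits telemetry depends on its own instrumentation, but the standard OpenTelemetry environment variables give a tool-agnostic way to supply these attributes and point an exporter at your collector. A sketch with placeholder values, assuming the tool or a wrapper script honors the standard variables.
- uses: openai/codex-action@v1
  with:
    prompt: "Sync documentation"
  env:
    OTEL_EXPORTER_OTLP_ENDPOINT: https://otel-collector.internal:4317
    OTEL_RESOURCE_ATTRIBUTES: gen_ai.agent.name=daily-docs-sync,gen_ai.agent.id=workflow-123,gen_ai.provider.name=openai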
Dashboard basics:
- An overview panel showing success rate, active runs, and recent failures
- A cost tracker with daily/weekly spend, budget utilization, and anomalies
- Failure analysis showing common patterns and resolution rates
- Per-workflow health views with degradation trends
You should be able to answer "are our automated agents working?" at a glance. Drill-down capability handles investigation when the answer is no.
Making automation reliable
Agent automation fails. The question is whether failures are handled gracefully or become cascading disasters.
Verify, don't assume. Automated checks confirm agents did what they claimed.
Fail fast. Stop early when problems surface rather than proceeding with bad state.
Escalate clearly. When automation cannot continue, humans receive actionable context—not just "something broke."
Monitor continuously. Aggregate metrics reveal patterns invisible in individual runs.
Budget defensively. Cost limits prevent stuck workflows from draining API credits.
These apply regardless of tools or platforms. Implementations differ, but the underlying approach is constant: expect failure, detect it early, recover gracefully, and maintain visibility throughout.