CI/CD Integration Patterns
Quality gates for AI-generated code
Traditional quality gates check test coverage, linting errors, and security vulnerabilities. AI-generated code requires additional gates targeting the specific failure modes agents introduce.
SonarQube's "Sonar way for AI Code" quality gate adds six conditions beyond the standard gates:
- Zero new issues: any new issue blocks merge
- 80% test coverage (higher than the typical 70%)
- 3% duplication maximum: addresses the 4x duplication increase seen with AI
- Security rating A: zero tolerance for security degradation
- Reliability rating C or better: catches logic errors before production
- 100% security hotspot review: all flagged areas require human evaluation
These stricter thresholds exist because AI-generated PRs contain 1.7x more issues than human PRs. Security failure rates reach 45% without additional gates. Higher bars compensate for higher risk.
Configuring AI-specific gates
Quality gate configuration varies by tool, but the pattern stays consistent.
SonarQube API:

```bash
# Mark the project as containing AI code
curl -XPOST -H 'Authorization: Bearer <TOKEN>' \
  '<SONARQUBE_URL>/api/projects/set_contains_ai_code?contains_ai_code=true&project=<PROJECT_KEY>'

# Assign the "Sonar way for AI Code" quality gate
curl -XPOST -H 'Authorization: Bearer <TOKEN>' \
  '<SONARQUBE_URL>/api/qualitygates/select?gateName=Sonar%20way%20for%20AI%20Code&projectKey=<PROJECT_KEY>'
```

Custom YAML configuration:
```yaml
quality_gates:
  - name: AI Code Quality Gate
    conditions:
      - metric: test_coverage
        operator: GT
        value: 80
      - metric: duplicated_lines_density
        operator: LT
        value: 3
      - metric: security_rating
        operator: EQ
        value: 1 # A rating
      - metric: new_violations
        operator: EQ
        value: 0
```

AI code needs stricter gates, not looser ones. Faster generation creates more opportunities for defects. Quality gates provide the counterpressure.
Pipeline stage configuration
Effective pipelines layer verification across stages, each catching different failure types. A five-layer model works well for AI-generated code.
Layer 1: Pre-commit hooks
Module 6 covered pre-commit hooks for formatting and linting. These run before code enters version control. Linters catch syntactic issues in milliseconds. Formatters enforce style deterministically.
The rule from earlier in this module: never send an agent to do a linter's job. Pre-commit hooks handle what linters handle best.
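For an npm/TypeScript project, the hook layer might look like the following sketch using the pre-commit framework. The hook selection and versions are illustrative assumptions, not a prescribed setup; teams standardized on Husky with lint-staged can wire up the same formatters and linters there.

```yaml
# .pre-commit-config.yaml — illustrative hooks for a TypeScript project (versions are examples)
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.6.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-merge-conflict
  - repo: https://github.com/pre-commit/mirrors-prettier
    rev: v3.1.0
    hooks:
      - id: prettier        # deterministic formatting
  - repo: https://github.com/pre-commit/mirrors-eslint
    rev: v9.0.0
    hooks:
      - id: eslint          # fast syntactic linting before commit
        files: \.[jt]sx?$
```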
Layer 2: CI static analysis
The build stage runs heavier analysis that cannot run locally on every commit.
```yaml
name: CI Static Analysis

on:
  pull_request:
    types: [opened, synchronize]

jobs:
  analysis:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v5
        with:
          fetch-depth: 0
      - name: Type checking
        run: npx tsc --noEmit
      - name: Security scan
        uses: SonarSource/sonarqube-scan-action@master
        env:
          SONAR_TOKEN: ${{ secrets.SONAR_TOKEN }}
      - name: Dependency audit
        run: npm audit --audit-level=high
```

Type checking catches errors agents introduce through incorrect assumptions about types. Security scanning identifies vulnerabilities the agent did not consider. Dependency audits flag problematic packages, including the 19.7% of AI-suggested packages that do not exist.
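The audit step assumes the dependency tree installs cleanly; a package an agent invented fails earlier, at npm install, with a generic registry error. To surface hallucinated names as an explicit, named failure, a step like the following sketch (standard npm and jq behavior only, no extra tooling assumed) can be added to the analysis job:

```yaml
# Sketch: report declared dependencies that do not exist on the npm registry.
# Relies on `npm view <pkg> version` exiting non-zero for unknown packages.
- name: Check for nonexistent dependencies
  run: |
    missing=0
    for pkg in $(jq -r '(.dependencies // {}) + (.devDependencies // {}) | keys[]' package.json); do
      if ! npm view "$pkg" version > /dev/null 2>&1; then
        echo "Dependency not found on registry: $pkg" >&2
        missing=1
      fi
    done
    exit "$missing"
```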
Layer 3: AI pre-review
Agent-based review runs after static analysis passes. This positioning matters: AI review costs tokens, so failed linting or type checks should block before AI review runs.
```yaml
ai-review:
  needs: analysis
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v5
    - uses: anthropics/claude-code-action@v1
      with:
        anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
        prompt: |
          Review this PR for:
          - Logic errors static analysis cannot catch
          - Context alignment with team conventions
          - Architecture fit with existing patterns
          Skip style issues; linters handle those.
```

The prompt focuses AI review on what AI does well: understanding intent and catching conceptual errors. Static tools handle mechanical checks.
Layer 4: Human review
Human reviewers see AI pre-review results alongside the code. Their focus narrows to what AI flagged plus what AI cannot evaluate: business logic correctness, architectural decisions, and intent alignment.
Layer 5: Post-merge validation
Some checks run after merge to main. Integration tests, smoke tests, and deployment verification catch issues that only surface in production-like environments.
```yaml
post-merge:
  if: github.event_name == 'push' && github.ref == 'refs/heads/main'
  runs-on: ubuntu-latest
  steps:
    - name: Integration tests
      run: npm run test:integration
    - name: Smoke tests
      run: npm run test:smoke
```

AI-powered SAST tools
Traditional Static Application Security Testing (SAST) tools suffer from false positives. Industry false-positive rates run 68-78%, and some teams report 95% on production code. Each finding requires 15-30 minutes to triage. Teams disable scanners rather than wade through the noise, which defeats the purpose.
AI-powered SAST tools use contextual understanding alongside pattern matching.
False positive reduction rates
Current tools report substantial improvements:
| Tool | False Positive Reduction | Approach |
|---|---|---|
| Endor Labs AI SAST | 95% eliminated | Multi-agent triage |
| Semgrep Assistant | 60% auto-triaged, 96% agreement | AI-powered Memories |
| Snyk Code | 92% vs baseline | Hybrid symbolic + generative AI |
| Cycode | AI-powered | Risk Intelligence Graph |
Endor Labs uses a multi-agent architecture: detection agents scan code for vulnerabilities, triage agents filter false positives by analyzing syntax, dataflow, and intent, and remediation agents suggest context-aware fixes.
Semgrep's "Memories" feature stores contextual information about the codebase. A Fortune 500 customer achieved 2.8x improvement in false positive detection after adding two memories. Security researchers agree with Semgrep AI 96% of the time.
Integration patterns
AI-powered SAST tools integrate like traditional scanners:
```yaml
security-scan:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v5
    - name: Semgrep scan
      uses: semgrep/semgrep-action@v1
      with:
        config: auto
      env:
        SEMGREP_APP_TOKEN: ${{ secrets.SEMGREP_TOKEN }}
    - name: Snyk security scan
      uses: snyk/actions/node@master
      with:
        args: --severity-threshold=high
      env:
        SNYK_TOKEN: ${{ secrets.SNYK_TOKEN }}
```

The difference is what happens with the results. Teams actually read findings when 90%+ are actionable. Security scanning becomes part of the workflow instead of a checkbox nobody checks.
Gating strategies
Not all code requires the same scrutiny. Gating strategies match verification intensity to risk.
Severity-based thresholds
Configure different responses for different finding severities:
```yaml
severity_thresholds:
  critical: 0   # Zero tolerance, block merge
  high: 5       # Allow a few, require acknowledgment
  medium: 20    # Warning only
  low: 100      # Informational
```

Critical findings block merge outright. High severity findings are tolerated only in small numbers and require reviewer acknowledgment. Medium severity generates warnings but allows override. Low severity appears in reports but does not gate.
One team limited cyclomatic complexity alerts to the top 10% of new functions. Daily false-positive violations fell 65% after threshold tuning. Security rules stayed strict: critical vulnerabilities still block builds.
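How these thresholds get enforced depends on the scanner, but the shape is usually a report-parsing step that fails the build when counts exceed the configured limits. A sketch, assuming a hypothetical report.json in which each finding carries a severity field:

```yaml
# Sketch: block the build when critical/high finding counts exceed thresholds.
# report.json and its {"findings": [{"severity": "critical", ...}]} shape are hypothetical.
- name: Enforce severity thresholds
  run: |
    critical=$(jq '[.findings[] | select(.severity == "critical")] | length' report.json)
    high=$(jq '[.findings[] | select(.severity == "high")] | length' report.json)
    echo "critical=$critical high=$high"
    if [ "$critical" -gt 0 ] || [ "$high" -gt 5 ]; then
      echo "Severity thresholds exceeded: blocking merge." >&2
      exit 1
    fi
```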
Path-based rules
Different code areas warrant different scrutiny. Authentication, payment processing, and data handling need stricter gates.
```yaml
review_gate_rules:
  block_merge_if:
    - high_severity_security_issue
    - failing_tests
    - missing_migrations_for_schema_changes
  warn_if:
    - pr_exceeds_500_lines
    - function_complexity_spikes
    - test_coverage_reduced_beyond_1_percent
  raise_severity_in:
    - auth/**
    - payments/**
    - secrets/**
  ignore_by_default:
    - vendor/**
    - migrations/**
```

Privilege escalation paths jumped 322% in AI code according to Apiiro research. Architectural design flaws spiked 153%. Path-based rules concentrate scrutiny where AI failure modes cluster.
Human-in-the-loop gates
Some decisions require human judgment regardless of automated results.
```yaml
required_human_approval:
  - path: "*/auth/*"
    reviewers: ["security-team"]
  - path: "*/payments/*"
    reviewers: ["payments-team", "security-team"]
  - change_type: "schema_migration"
    reviewers: ["dba-team"]
```

Microsoft's AI-powered code review system maintained human-in-the-loop flows. Across 5,000 repositories, they observed 10-20% improvements in median PR completion time. The gains came from AI handling routine review, not from removing humans.
AWS's AI-DLC framework embeds transparent checkpoints with human approvals at every decision gate. AI generates plans. Stakeholders review and validate. The workflow records every human action and approval.
Phased enforcement
Rolling out strict gates causes friction. Phased enforcement reduces disruption:
- Deploy as informational scanning first
- Establish false positive thresholds with the team
- Activate quality gates to fail builds
- Tighten thresholds as signal improves
Teams that jump straight to blocking gates generate pushback. Teams that start informational and tighten gradually build acceptance.
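In GitHub Actions, one way to stage this rollout (a sketch, not the only approach) is to mark the gate step continue-on-error during the informational phase, then remove the flag once thresholds are agreed. The quality:gate script name here is a hypothetical placeholder for whatever evaluates the gate.

```yaml
# Phases 1-2: continue-on-error keeps the gate informational — it reports but never fails the build.
# Phases 3-4: remove continue-on-error so a failing gate blocks the merge, then tighten thresholds.
quality-gate:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v5
    - name: Evaluate quality gate
      run: npm run quality:gate     # hypothetical script that checks gate conditions
      continue-on-error: true       # informational phase only
```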
GitHub Actions integration examples
Claude Code automatic review
```yaml
name: Claude Code Review

on:
  pull_request:
    types: [opened, synchronize, ready_for_review]

jobs:
  review:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write
    steps:
      - uses: actions/checkout@v5
        with:
          fetch-depth: 1
      - uses: anthropics/claude-code-action@v1
        with:
          anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
          track_progress: true
          prompt: |
            REPO: ${{ github.repository }}
            PR NUMBER: ${{ github.event.pull_request.number }}
            Review this PR focusing on:
            - Logic errors and potential bugs
            - Security implications
            - Architecture alignment
            Linting and formatting are handled by CI.
            Skip style comments.
```

Codex auto-fix on CI failure
```yaml
name: Codex Auto-Fix

on:
  workflow_run:
    workflows: ["CI"]
    types: [completed]

jobs:
  auto-fix:
    if: ${{ github.event.workflow_run.conclusion == 'failure' }}
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v5
        with:
          ref: ${{ github.event.workflow_run.head_sha }}
          fetch-depth: 0
      - uses: openai/codex-action@v1
        with:
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}
          prompt: |
            CI failed. Read the logs, identify the minimal fix,
            implement it, and verify tests pass.
      - uses: peter-evans/create-pull-request@v6
        if: success()
        with:
          commit-message: "fix(ci): auto-fix failing tests via Codex"
          branch: codex/auto-fix-${{ github.event.workflow_run.run_id }}
          title: "Auto-fix failing CI via Codex"
```

GitHub Copilot automatic review
GitHub Copilot code review integrates through repository settings rather than explicit workflow configuration.
- Navigate to Repository > Settings > Rules > Rulesets
- Create new branch ruleset
- Enable "Automatically request Copilot code review"
- Optional: Enable "Review new pushes" and "Review draft pull requests"
Custom instructions live in .github/copilot-instructions.md:
```markdown
# Code Review Guidelines

## Security
- Flag hardcoded credentials
- Check for SQL injection vulnerabilities
- Verify input validation on all user-facing endpoints

## Quality
- Functions exceeding 50 lines need justification
- All async operations require error handling
- New code needs 80% test coverage
```

The five-layer pipeline
Combining these patterns produces a complete pipeline:
```
Commit → Pre-commit hooks (linters, formatters)
       → CI static analysis (type check, security scan, dependency audit)
       → AI pre-review (logic, context, architecture)
       → Human review (business logic, intent, decisions)
       → Post-merge validation (integration tests, smoke tests)
```

Each layer catches what the previous layer cannot. Linters catch syntax. Static analysis catches type errors and known vulnerability patterns. AI catches conceptual issues. Humans catch intent misalignment. Post-merge catches integration issues.
DORA 2025 research found that AI amplifies both strengths and dysfunctions. Teams with strong quality gates see AI accelerate their velocity. Teams without gates see AI accelerate their defect rates.
Quality gates turn AI velocity into AI productivity. Without them, faster generation just means faster failure.