Applied Intelligence
Module 8: Code Review and Testing

CI/CD Integration Patterns

Quality gates for AI-generated code

Traditional quality gates check test coverage, linting errors, and security vulnerabilities. AI-generated code requires additional gates targeting the specific failure modes agents introduce.

SonarQube's "Sonar way for AI Code" quality gate adds stricter conditions beyond the standard gate, including:

  • Zero new issues: any new issue blocks merge
  • 80% test coverage (higher than the typical 70%)
  • 3% duplication maximum, addressing the 4x duplication increase with AI
  • Security rating A: zero tolerance for security degradation
  • Reliability rating C or better, catching logic errors before production
  • 100% security hotspot review: all flagged areas require human evaluation

These stricter thresholds exist because AI-generated PRs contain 1.7x more issues than human PRs. Security failure rates reach 45% without additional gates. Higher bars compensate for higher risk.

Configuring AI-specific gates

Quality gate configuration varies by tool, but the pattern stays consistent.

SonarQube API:

# Mark project as containing AI code
curl -XPOST -H 'Authorization: Bearer <TOKEN>' \
  '<SONARQUBE_URL>/api/projects/set_contains_ai_code?contains_ai_code=true&project=<PROJECT_KEY>'

# Assign the "Sonar way for AI Code" quality gate
curl -XPOST -H 'Authorization: Bearer <TOKEN>' \
  '<SONARQUBE_URL>/api/qualitygates/select?gateName=Sonar%20way%20for%20AI%20Code&projectKey=<PROJECT_KEY>'

Custom YAML configuration:

quality_gates:
  - name: AI Code Quality Gate
    conditions:
      - metric: test_coverage
        operator: GT
        value: 80
      - metric: duplicated_lines_density
        operator: LT
        value: 3
      - metric: security_rating
        operator: EQ
        value: 1  # A rating
      - metric: new_violations
        operator: EQ
        value: 0

AI code needs stricter gates, not looser ones. Faster generation creates more opportunities for defects. Quality gates provide the counterpressure.

Pipeline stage configuration

Effective pipelines layer verification across stages, each catching different failure types. A five-layer model works well for AI-generated code.

Layer 1: Pre-commit hooks

Module 6 covered pre-commit hooks for formatting and linting. These run before code enters version control. Linters catch syntactic issues in milliseconds. Formatters enforce style deterministically.

The rule from earlier in this module: never send an agent to do a linter's job. Pre-commit hooks handle what linters handle best.
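
A minimal sketch of such hooks, assuming a Node/TypeScript project with ESLint and Prettier already installed as dev dependencies (the hook names and commands are illustrative, not a prescribed setup):

# .pre-commit-config.yaml -- illustrative local hooks for formatting and linting
repos:
  - repo: local
    hooks:
      - id: prettier
        name: prettier (format staged files)
        entry: npx prettier --write
        language: system
        types_or: [javascript, ts, tsx, json]
      - id: eslint
        name: eslint (lint staged files)
        entry: npx eslint --max-warnings=0   # treat warnings as failures
        language: system
        types_or: [javascript, ts, tsx]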

Layer 2: CI static analysis

The build stage runs heavier analysis that cannot run locally on every commit.

name: CI Static Analysis
on:
  pull_request:
    types: [opened, synchronize]

jobs:
  analysis:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v5
        with:
          fetch-depth: 0

      - name: Type checking
        run: npx tsc --noEmit

      - name: Security scan
        uses: SonarSource/sonarqube-scan-action@master
        env:
          SONAR_TOKEN: ${{ secrets.SONAR_TOKEN }}

      - name: Dependency audit
        run: npm audit --audit-level=high

Type checking catches errors agents introduce through incorrect assumptions about types. Security scanning identifies vulnerabilities the agent did not consider. Dependency audits flag problematic packages, including the 19.7% of AI-suggested packages that do not exist.

Layer 3: AI pre-review

Agent-based review runs after static analysis passes. This positioning matters: AI review costs tokens, so failed linting or type checks should block before AI review runs.

  ai-review:
    needs: analysis
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v5

      - uses: anthropics/claude-code-action@v1
        with:
          anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
          prompt: |
            Review this PR for:
            - Logic errors static analysis cannot catch
            - Context alignment with team conventions
            - Architecture fit with existing patterns

            Skip style issues; linters handle those.

The prompt focuses AI review on what AI does well: understanding intent and catching conceptual errors. Static tools handle mechanical checks.

Layer 4: Human review

Human reviewers see AI pre-review results alongside the code. Their focus narrows to what AI flagged plus what AI cannot evaluate: business logic correctness, architectural decisions, and intent alignment.

Layer 5: Post-merge validation

Some checks run after merge to main. Integration tests, smoke tests, and deployment verification catch issues that only surface in production-like environments.

  post-merge:
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - name: Integration tests
        run: npm run test:integration

      - name: Smoke tests
        run: npm run test:smoke

AI-powered SAST tools

Traditional Static Application Security Testing (SAST) tools suffer from false positives. Industry rates run 68-78%, and some teams report 95% false positive rates on production code. Each finding requires 15-30 minutes to triage. Teams disable scanners rather than wade through the noise, which defeats the purpose.

AI-powered SAST tools use contextual understanding alongside pattern matching.

False positive reduction rates

Current tools report substantial improvements:

Tool                 False Positive Reduction          Approach
Endor Labs AI SAST   95% eliminated                    Multi-agent triage
Semgrep Assistant    60% auto-triaged, 96% agreement   AI-powered Memories
Snyk Code            92% vs baseline                   Hybrid symbolic + generative AI
Cycode               AI-powered                        Risk Intelligence Graph

Endor Labs uses a multi-agent architecture: detection agents scan code for vulnerabilities, triage agents filter false positives by analyzing syntax, dataflow, and intent, and remediation agents suggest context-aware fixes.

Semgrep's "Memories" feature stores contextual information about the codebase. A Fortune 500 customer achieved a 2.8x improvement in false positive detection after adding two memories. Security researchers agree with Semgrep AI 96% of the time.

Integration patterns

AI-powered SAST tools integrate like traditional scanners:

  security-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v5

      - name: Semgrep scan
        uses: semgrep/semgrep-action@v1
        with:
          config: auto
        env:
          SEMGREP_APP_TOKEN: ${{ secrets.SEMGREP_TOKEN }}

      - name: Snyk security scan
        uses: snyk/actions/node@master
        with:
          args: --severity-threshold=high
        env:
          SNYK_TOKEN: ${{ secrets.SNYK_TOKEN }}

The difference is what happens with the results. Teams actually read findings when 90%+ are actionable. Security scanning becomes part of the workflow instead of a checkbox nobody checks.

Gating strategies

Not all code requires the same scrutiny. Gating strategies match verification intensity to risk.

Severity-based thresholds

Configure different responses for different finding severities:

severity_thresholds:
  critical: 0    # Zero tolerance, block merge
  high: 5        # Allow a few, require acknowledgment
  medium: 20     # Warning only
  low: 100       # Informational

Critical findings block merge outright. High severity findings are permitted only in small numbers and require reviewer acknowledgment. Medium severity generates warnings without gating. Low severity appears in reports as informational.

One team limited cyclomatic complexity alerts to the top 10% of new functions. Daily false-positive violations fell 65% after threshold tuning. Security rules stayed strict: critical vulnerabilities still block builds.
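
One way to enforce thresholds like these in CI is a small gating step that counts findings by severity. This sketch assumes the scanner exported its findings to a results.json file with a severity field per result; the exact schema varies by tool:

      - name: Enforce severity thresholds
        run: |
          # Count findings by severity and fail the job when limits are exceeded
          critical=$(jq '[.results[] | select(.severity == "critical")] | length' results.json)
          high=$(jq '[.results[] | select(.severity == "high")] | length' results.json)
          if [ "$critical" -gt 0 ] || [ "$high" -gt 5 ]; then
            echo "Blocking merge: $critical critical, $high high severity findings"
            exit 1
          fi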

Path-based rules

Different code areas warrant different scrutiny. Authentication, payment processing, and data handling need stricter gates.

review_gate_rules:
  block_merge_if:
    - high_severity_security_issue
    - failing_tests
    - missing_migrations_for_schema_changes

  warn_if:
    - pr_exceeds_500_lines
    - function_complexity_spikes
    - test_coverage_reduced_beyond_1_percent

  raise_severity_in:
    - auth/**
    - payments/**
    - secrets/**

  ignore_by_default:
    - vendor/**
    - migrations/**

Privilege escalation paths jumped 322% in AI code according to Apiiro research. Architectural design flaws spiked 153%. Path-based rules concentrate scrutiny where AI failure modes cluster.
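
The same idea maps onto GitHub Actions by scoping a stricter job to the high-risk paths. An illustrative fragment, where the scan command is a placeholder for whatever deep analysis the team runs on sensitive code:

name: Strict review for high-risk paths
on:
  pull_request:
    paths:
      - "auth/**"
      - "payments/**"
      - "secrets/**"

jobs:
  strict-security-review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v5
      - name: Deep security scan
        run: npm run scan:strict   # placeholder for the team's strict scan profile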

Human-in-the-loop gates

Some decisions require human judgment regardless of automated results.

required_human_approval:
  - path: "*/auth/*"
    reviewers: ["security-team"]
  - path: "*/payments/*"
    reviewers: ["payments-team", "security-team"]
  - change_type: "schema_migration"
    reviewers: ["dba-team"]

Microsoft's AI-powered code review system maintained human-in-the-loop flows. Across 5,000 repositories, they observed 10-20% median PR completion time improvements. The gains came from AI handling routine review, not from removing humans.

AWS's AI-DLC framework embeds transparent checkpoints with human approvals at every decision gate. AI generates plans. Stakeholders review and validate. The workflow records every human action and approval.

Phased enforcement

Rolling out strict gates causes friction. Phased enforcement reduces disruption:

  1. Deploy as informational scanning first
  2. Establish false positive thresholds with the team
  3. Activate quality gates to fail builds
  4. Tighten thresholds as signal improves

Teams that jump straight to blocking gates generate pushback. Teams that start informational and tighten gradually build acceptance.
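
In GitHub Actions terms, the informational phase can be expressed by marking the gate job as non-blocking, then removing that flag once thresholds are agreed. A sketch, with the scan command as a placeholder:

  security-gate:
    runs-on: ubuntu-latest
    # Phase 1: report findings without failing the build. Remove continue-on-error
    # in phase 3, once false positive thresholds are agreed, to make the gate blocking.
    continue-on-error: true
    steps:
      - uses: actions/checkout@v5
      - run: npm run scan   # placeholder for the team's scanner command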

GitHub Actions integration examples

Claude Code automatic review

name: Claude Code Review
on:
  pull_request:
    types: [opened, synchronize, ready_for_review]

jobs:
  review:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write
    steps:
      - uses: actions/checkout@v5
        with:
          fetch-depth: 1

      - uses: anthropics/claude-code-action@v1
        with:
          anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
          track_progress: true
          prompt: |
            REPO: ${{ github.repository }}
            PR NUMBER: ${{ github.event.pull_request.number }}

            Review this PR focusing on:
            - Logic errors and potential bugs
            - Security implications
            - Architecture alignment

            Linting and formatting are handled by CI.
            Skip style comments.

Codex auto-fix on CI failure

name: Codex Auto-Fix
on:
  workflow_run:
    workflows: ["CI"]
    types: [completed]

jobs:
  auto-fix:
    if: ${{ github.event.workflow_run.conclusion == 'failure' }}
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v5
        with:
          ref: ${{ github.event.workflow_run.head_sha }}
          fetch-depth: 0

      - uses: openai/codex-action@v1
        with:
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}
          prompt: |
            CI failed. Read the logs, identify the minimal fix,
            implement it, and verify tests pass.

      - uses: peter-evans/create-pull-request@v6
        if: success()
        with:
          commit-message: "fix(ci): auto-fix failing tests via Codex"
          branch: codex/auto-fix-${{ github.event.workflow_run.run_id }}
          title: "Auto-fix failing CI via Codex"

GitHub Copilot automatic review

GitHub Copilot code review integrates through repository settings rather than explicit workflow configuration.

  1. Navigate to Repository > Settings > Rules > Rulesets
  2. Create new branch ruleset
  3. Enable "Automatically request Copilot code review"
  4. Optional: Enable "Review new pushes" and "Review draft pull requests"

Custom instructions live in .github/copilot-instructions.md:

# Code Review Guidelines

## Security
- Flag hardcoded credentials
- Check for SQL injection vulnerabilities
- Verify input validation on all user-facing endpoints

## Quality
- Functions exceeding 50 lines need justification
- All async operations require error handling
- New code needs 80% test coverage

The five-layer pipeline

Combining these patterns produces a complete pipeline:

Commit → Pre-commit hooks (linters, formatters)
       → CI static analysis (type check, security scan, dependency audit)
       → AI pre-review (logic, context, architecture)
       → Human review (business logic, intent, decisions)
       → Post-merge validation (integration tests, smoke tests)

Each layer catches what the previous layer cannot. Linters catch syntax. Static analysis catches type errors and known vulnerability patterns. AI catches conceptual issues. Humans catch intent misalignment. Post-merge catches integration issues.
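
Expressed as a single workflow skeleton, the CI-side layers chain through job dependencies. This sketch reuses the jobs shown earlier in this section; pre-commit hooks run locally and human review happens in the PR itself, outside the workflow file:

name: PR Pipeline
on:
  pull_request:
    types: [opened, synchronize]

jobs:
  analysis:                 # Layer 2: type check, security scan, dependency audit
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v5
      - run: npx tsc --noEmit

  ai-review:                # Layer 3: runs only after static analysis passes
    needs: analysis
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v5
      # claude-code-action step from the AI pre-review example goes here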

DORA 2025 research found that AI amplifies both strengths and dysfunctions. Teams with strong quality gates see AI accelerate their velocity. Teams without gates see AI accelerate their defect rates.

Quality gates turn AI velocity into AI productivity. Without them, faster generation just means faster failure.
