Sensitive Data Identification
The prevention layer
Retention policies govern how long data stays with vendors. But the better question is: should that data leave your environment at all?
Identifying sensitive data before it reaches AI coding tools is a prevention problem, not a detection problem. Once code containing an API key, database password, or customer PII travels to an inference endpoint, the damage is done. Retention policies determine cleanup timelines, but exposure has already occurred.
This section covers tools and techniques for identifying sensitive data before it enters the AI tool pipeline: pre-commit hooks for secrets, tool-specific exclusion configurations, and PII detection systems.
Pre-commit hooks for secret detection
Pre-commit hooks run locally before code commits. They can block commits containing secrets, giving developers immediate feedback before sensitive data enters version control or gets read by AI tools.
Gitleaks
Gitleaks is the most widely adopted open-source secret scanner. Version 8.29.1 (current as of January 2026) detects 160+ secret types using regular expressions and entropy analysis.
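The entropy side of that analysis can be sketched in a few lines. This is illustrative, not Gitleaks' implementation; the threshold and minimum length below are arbitrary choices:

```python
import math
from collections import Counter

def shannon_entropy(s: str) -> float:
    """Bits of entropy per character: random keys score high,
    ordinary identifiers score low."""
    if not s:
        return 0.0
    counts = Counter(s)
    return -sum((n / len(s)) * math.log2(n / len(s)) for n in counts.values())

def looks_like_secret(token: str, threshold: float = 4.0) -> bool:
    # Hypothetical threshold; real scanners tune this per rule and
    # combine it with regex context to cut false positives.
    return len(token) >= 20 and shannon_entropy(token) > threshold
```

A 20-character random token scores above 4 bits per character, while repeated or dictionary-like strings score far lower, which is why entropy complements pure regex matching.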
Installation with pre-commit:
```yaml
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/gitleaks/gitleaks
    rev: v8.29.1
    hooks:
      - id: gitleaks
```

Gitleaks scans staged files and blocks commits containing detected secrets. Recent versions added composite rules (v8.28.0) for combining detection patterns, and archive scanning for compressed files.
Bypassing for legitimate cases:

```shell
SKIP=gitleaks git commit -m "Updated test fixtures with placeholder values"
```

Bypass should be rare and documented. If developers bypass frequently, either detection rules need tuning or the security culture needs work.
TruffleHog
TruffleHog (v3.92.4+) differs from Gitleaks in one significant way: it verifies whether detected credentials are active. When TruffleHog finds an AWS access key, it tests the key against AWS APIs. This eliminates false positives from test fixtures, example code, and revoked credentials.
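The verification idea can be sketched as follows. Both functions and the `probe` callable are hypothetical stand-ins for a real API call such as an AWS identity check; this is not TruffleHog's code:

```python
def verify_credential(candidate: str, probe) -> bool:
    """Return True if the credential authenticates against the provider.
    `probe` is a callable performing the API call and returning an HTTP
    status code; injected so the check can be exercised offline."""
    return probe(candidate) == 200

def filter_verified(candidates, probe):
    # Keep only findings whose credentials are confirmed active,
    # discarding test fixtures and revoked keys.
    return [c for c in candidates if verify_credential(c, probe)]
```

The design trade-off: verification adds a network round trip per finding, but turns a noisy report into a short list of credentials that demonstrably work.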
Pre-commit configuration:
```yaml
repos:
  - repo: local
    hooks:
      - id: trufflehog
        name: TruffleHog
        entry: bash -c 'trufflehog git file://. --since-commit HEAD --only-verified --fail'
        language: system
        stages: ["commit", "push"]
```

The `--only-verified` flag reports only confirmed-active secrets. The `--fail` flag exits with code 183 when secrets are found, blocking the commit.
TruffleHog detects 800+ credential types across Git repos, Docker images, S3 buckets, Slack, Jira, and Confluence.
GitGuardian
GitGuardian (ggshield v1.46.0) offers 500+ secret detectors with enterprise features: centralized dashboards, policy management, and incident tracking.
Pre-commit configuration:
```yaml
repos:
  - repo: https://github.com/gitguardian/ggshield
    rev: v1.46.0
    hooks:
      - id: ggshield
        language_version: python3
        stages: [pre-commit]
```

Authentication:
```shell
ggshield auth login
# or
export GITGUARDIAN_API_KEY=<your-api-key>
```

GitGuardian's enterprise tier adds server-side pre-receive hooks. Unlike client-side pre-commit hooks, pre-receive hooks cannot be bypassed by developers: the server rejects commits containing secrets regardless of client configuration.
HashiCorp Vault Radar
Vault Radar (generally available since 2025) integrates secret scanning with HashiCorp Vault. Discovered secrets can be imported directly into Vault for centralized management.
Key capabilities:
- Scans GitHub, GitLab, Bitbucket, Azure DevOps, Confluence, Jira, Slack, S3, and Terraform
- Detects secrets, PII (SSN, credit cards), and non-inclusive language
- Integrates with GitHub pre-receive webhooks for server-side blocking
- Offers a VS Code IDE plugin (public beta) for real-time detection during development
Vault Radar is commercial, targeting organizations already invested in HashiCorp's ecosystem.
Defense in depth
Pre-commit hooks are a first line of defense, not a complete solution.
They run on developer machines and can be bypassed with `--no-verify`.
Developers might not have them installed.
New team members might miss the setup step.
A layered approach combines:
- Pre-commit hooks (Gitleaks, TruffleHog): developer workstation
- Pre-receive hooks (GitGuardian, Vault Radar): server-side enforcement
- CI/CD scanning: pipeline integration
- Periodic repository scans: historical analysis of existing commits
Server-side hooks and CI scanning provide enforcement that developers cannot circumvent.
Tool-specific exclusion configurations
Beyond preventing secrets from entering version control, AI coding tools offer configuration to exclude specific files from being read during inference.
Claude Code permissions.deny
Claude Code uses a JSON-based permission system to deny file access.
Configuration lives in .claude/settings.json:
```json
{
  "permissions": {
    "deny": [
      "Read(./.env)",
      "Read(./.env.*)",
      "Read(./secrets/**)",
      "Read(**/.env)",
      "Read(**/*.pem)",
      "Read(**/*.key)",
      "Read(**/credentials.json)"
    ]
  }
}
```

Glob pattern syntax:
| Pattern | Matches |
|---|---|
| `Read(./.env)` | The `.env` file in project root |
| `Read(./.env.*)` | Files like `.env.local`, `.env.production` |
| `Read(./secrets/**)` | All files in `secrets/` recursively |
| `Read(**/.env)` | Any `.env` file at any depth |
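As a rough illustration of how such deny patterns are evaluated, Python's `fnmatch` can approximate the matching. Note that fnmatch's `*` crosses directory separators, so `*` and `**` behave alike here; this is not Claude Code's exact matcher. The patterns are the globs above with the `Read(...)` wrapper stripped:

```python
from fnmatch import fnmatch

# Bare globs taken from the deny list above (Read(...) wrapper removed).
DENY_PATTERNS = [".env", ".env.*", "secrets/**", "**/.env", "**/*.pem"]

def is_denied(path: str) -> bool:
    """Approximate check: True if the relative path matches any deny glob."""
    return any(fnmatch(path, pattern) for pattern in DENY_PATTERNS)
```

For example, `is_denied("config/.env")` is caught by `**/.env`, while an ordinary source file matches nothing and is allowed.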
Settings file precedence (highest to lowest):
1. `managed-settings.json`: Enterprise/system-wide
2. `.claude/settings.local.json`: Personal, not committed
3. `.claude/settings.json`: Project-level, shared via git
4. `~/.claude/settings.json`: User global
Caution: Read deny rules do not block Bash tool access. Denying `Read(**/.env)` prevents Claude Code's Read tool from accessing `.env` files, but commands like `cat .env` or `grep password .env` can still execute. For complete protection, deny Bash patterns as well or use PreToolUse hooks.
PreToolUse hooks for additional protection:
```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Read|Edit|Write",
        "hooks": [
          {
            "type": "command",
            "command": ".claude/hooks/protect_sensitive.sh"
          }
        ]
      }
    ]
  }
}
```

A hook exit code of 2 blocks the tool invocation entirely.
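What a script like `protect_sensitive.sh` might implement can be sketched in Python. The JSON field names and patterns here are assumptions about the hook input; the use of exit code 2 to block the call follows the behavior described above:

```python
import json
import sys
from fnmatch import fnmatch

# Illustrative patterns; adjust to your repository layout.
BLOCKED = ["**/.env", "**/.env.*", "**/*.pem", "**/*.key", "**/secrets/**"]

def should_block(file_path: str) -> bool:
    """True when the tool call targets a sensitive path.
    fnmatch is only an approximation of Claude Code's matcher."""
    return any(fnmatch(file_path, p) for p in BLOCKED)

def run_hook(stdin=sys.stdin) -> int:
    """Read the pending tool call (JSON on stdin) and return the exit
    code: 2 blocks the invocation, 0 allows it."""
    event = json.load(stdin)
    path = event.get("tool_input", {}).get("file_path", "")
    if should_block(path):
        print(f"Blocked access to sensitive file: {path}", file=sys.stderr)
        return 2
    return 0
```

A shell wrapper would call `sys.exit(run_hook())`; keeping the decision logic in a function makes the path check testable without a live tool call.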
GitHub Copilot content exclusion
GitHub Copilot Business and Enterprise plans support content exclusion through GitHub's web interface (repository, organization, or enterprise settings).
Pattern syntax (fnmatch, case-insensitive):
```yaml
# Repository-level exclusions
- "**/.env"
- "**/secrets/**"
- "*.pem"
- "credentials.json"
```

```yaml
# Organization-level exclusions (specify repository)
"*":
  - "**/.env"
  - "**/credentials/**"
my-org/my-repo:
  - "/config/secrets/**"
```

Excluded files are blocked from:
- Inline code completions
- Informing suggestions in other files
- Copilot Chat responses
Important: Content exclusion does not apply to:
- GitHub Copilot CLI
- Copilot coding agent (background agent)
- Agent mode in Copilot Chat
- Edit mode in Copilot Chat
This is a significant security gap. When using Copilot's agentic features, the model can access files that content exclusion would otherwise block. Organizations relying on agentic Copilot features must use alternative protection mechanisms.
Additional limitations:
- Changes take up to 30 minutes to propagate to IDEs
- IDE-provided type information may leak semantic details from excluded files
- Individual users without Business/Enterprise plans cannot configure exclusions
Codex file exclusion
As of January 2026, Codex CLI has no built-in file exclusion mechanism.
No .codexignore file exists.
This is one of the most requested features (141+ upvotes on GitHub issues).
Codex offers sandbox modes, which limit file system access at the OS level:
| Mode | Capability |
|---|---|
| `--sandbox read-only` | View files only, blocks all edits |
| `--sandbox workspace-write` | Read/write within current directory |
| `--sandbox danger-full-access` | No restrictions |
Sandboxing restricts what Codex can modify, not what it can read.
For sensitive projects, `--sandbox read-only` prevents writes but does not prevent file contents from being sent to the API for inference.
The third-party xCodex fork has implemented file exclusion via .xcodexignore, but this is not official OpenAI functionality.
PII detection tools
Secrets (API keys, passwords) follow predictable patterns. PII (names, addresses, social security numbers) requires more sophisticated detection.
Microsoft Presidio
Presidio is an open-source framework for detecting and redacting PII. It uses named entity recognition (NER), regex patterns, and checksum validation.
Supported PII types:
- Credit card numbers
- Social security numbers
- Names, locations, phone numbers
- Financial data, healthcare identifiers
- Bitcoin wallet addresses
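Checksum validation is what lets a detector separate a plausible credit card number from sixteen arbitrary digits. A sketch of the standard Luhn check (illustrative, not Presidio's code):

```python
def luhn_valid(number: str) -> bool:
    """Luhn checksum: double every second digit from the right and
    check the sum modulo 10. Filters out random digit strings."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 13:          # shorter than any real card number
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:            # every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0
```

A regex alone would flag any 16-digit run; combining it with the checksum removes the large majority of coincidental matches.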
Presidio does not provide built-in pre-commit hooks, but can be integrated into custom hooks via its Python API:
```python
from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()
# file_content: text of the staged file being checked
results = analyzer.analyze(text=file_content, language="en")
if results:
    # Block the commit and report each finding
    for result in results:
        print(f"{result.entity_type} at {result.start}-{result.end}")
    raise SystemExit(1)
```

Presidio acknowledges in its documentation: "there is no guarantee that Presidio will find all sensitive information." It is a detection aid, not a complete solution.
Amazon Comprehend
Amazon Comprehend's DetectPiiEntities API identifies PII in text.
It supports English and Spanish, handles up to 100 KB per request, and detects addresses, ages, names, phone numbers, and country-specific identifiers.
Organizations running on AWS can integrate Comprehend into CI pipelines:
```shell
aws comprehend detect-pii-entities \
    --text "file content here" \
    --language-code en
```

Amazon Macie provides similar capabilities for S3 buckets but is designed for data storage scanning, not code repository scanning.
Nightfall AI
Nightfall offers GitHub Actions and CircleCI integrations specifically for code scanning. It scans commits on pull request, posting review comments when PII is detected.
```yaml
# .github/workflows/nightfall.yml
- uses: nightfallai/nightfall_dlp_action@v3
  with:
    nightfall_api_key: ${{ secrets.NIGHTFALL_API_KEY }}
```

Nightfall claims higher precision than Google DLP, AWS Comprehend, and Microsoft Purview for its supported data types. Spring 2025 releases added endpoint agents that monitor sensitive data in AI prompts, preventing PII from being typed into AI tools.
What exclusion does not cover
Understanding the gaps in these protection mechanisms is as important as understanding their capabilities.
Pre-commit hooks can be bypassed:
```shell
git commit --no-verify -m "Emergency fix"
```

Developers with legitimate urgency (or bad judgment) can skip client-side checks. Server-side pre-receive hooks and CI scanning catch what pre-commit hooks miss.
Exclusion patterns have blind spots:
- Symlinks can bypass some exclusion rules (fixed in Claude Code v1.0.119)
- Shell commands may access files that tool-specific deny rules block
- IDE-provided context (type hints, hover info) can leak excluded file content
PII detection is probabilistic:
Unlike secrets (which follow deterministic patterns), PII detection involves false positives and false negatives. "John Smith" in a test fixture looks identical to "John Smith" as real customer data. No tool perfectly distinguishes test data from production data.
Runtime access is not blocked:
Excluding .env files from AI tool access does not prevent the application from reading those files at runtime.
If your code loads environment variables and logs them, those values may appear in error messages or debug output that AI tools can see.
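One mitigation is to scrub known secret values from anything that reaches logs or debug output. A minimal sketch, assuming a hypothetical `redact` helper and illustrative environment variable names:

```python
import os

def redact(message: str, sensitive_keys=("API_KEY", "DB_PASSWORD", "TOKEN")) -> str:
    """Replace the values of selected environment variables with a
    placeholder before the message reaches logs or AI-visible output."""
    for key in sensitive_keys:
        value = os.environ.get(key)
        if value:
            message = message.replace(value, f"<redacted:{key}>")
    return message
```

Wiring such a filter into the logging layer means even accidental `print`-style leaks of loaded secrets show up as placeholders rather than live values.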
Historical exposure is not undone:
If sensitive data was committed before protections were in place, it exists in git history. Scanning and exclusion tools protect future commits. Historical exposure requires git history rewriting (with all its complications) to fully address.
The next section covers data classification frameworks: how to categorize data by sensitivity and map those categories to AI tool policies.