Sensitive Data Identification
The prevention layer
Retention policies govern how long data stays with vendors. But the better question is: should that data leave your environment at all?
Identifying sensitive data before it reaches AI coding tools is a prevention problem, not a detection problem. Once code containing an API key, database password, or customer PII travels to an inference endpoint, the damage is done. Retention policies determine cleanup timelines, but exposure has already occurred.
This section covers tools and techniques for identifying sensitive data before it enters the AI tool pipeline: pre-commit hooks for secrets, tool-specific exclusion configurations, and PII detection systems.
Pre-commit hooks for secret detection
Pre-commit hooks run locally before code commits. They can block commits containing secrets, giving developers immediate feedback before sensitive data enters version control or gets read by AI tools.
Gitleaks
Gitleaks is the most widely adopted open-source secret scanner. Version 8.29.1 (current as of January 2026) detects 160+ secret types using regular expressions and entropy analysis.
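The entropy side of that analysis can be sketched in a few lines. This is illustrative, not Gitleaks' implementation; the threshold and minimum length below are arbitrary choices:

```python
import math
from collections import Counter

def shannon_entropy(s: str) -> float:
    """Bits of entropy per character: random keys score high,
    ordinary identifiers score low."""
    if not s:
        return 0.0
    counts = Counter(s)
    return -sum((n / len(s)) * math.log2(n / len(s)) for n in counts.values())

def looks_like_secret(token: str, threshold: float = 4.0) -> bool:
    # Hypothetical threshold; real scanners tune this per rule and
    # combine it with regex context to cut false positives.
    return len(token) >= 20 and shannon_entropy(token) > threshold
```

A 20-character random token scores above 4 bits per character, while repeated or dictionary-like strings score far lower, which is why entropy complements pure regex matching.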
Installation with pre-commit:
```yaml
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/gitleaks/gitleaks
    rev: v8.29.1
    hooks:
      - id: gitleaks
```

Gitleaks scans staged files and blocks commits containing detected secrets. Recent versions added composite rules (v8.28.0) for combining detection patterns, and archive scanning for compressed files.
Bypassing for legitimate cases:

```shell
SKIP=gitleaks git commit -m "Updated test fixtures with placeholder values"
```

Bypass should be rare and documented. If developers bypass frequently, either detection rules need tuning or the security culture needs work.
TruffleHog
TruffleHog (v3.92.4+) differs from Gitleaks in one significant way: it verifies whether detected credentials are active. When TruffleHog finds an AWS access key, it tests the key against AWS APIs. This eliminates false positives from test fixtures, example code, and revoked credentials.
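The verification idea can be sketched as follows. Both functions and the `probe` callable are hypothetical stand-ins for a real API call such as an AWS identity check; this is not TruffleHog's code:

```python
def verify_credential(candidate: str, probe) -> bool:
    """Return True if the credential authenticates against the provider.
    `probe` is a callable performing the API call and returning an HTTP
    status code; injected so the check can be exercised offline."""
    return probe(candidate) == 200

def filter_verified(candidates, probe):
    # Keep only findings whose credentials are confirmed active,
    # discarding test fixtures and revoked keys.
    return [c for c in candidates if verify_credential(c, probe)]
```

The design trade-off: verification adds a network round trip per finding, but turns a noisy report into a short list of credentials that demonstrably work.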
Pre-commit configuration:
```yaml
repos:
  - repo: local
    hooks:
      - id: trufflehog
        name: TruffleHog
        entry: bash -c 'trufflehog git file://. --since-commit HEAD --only-verified --fail'
        language: system
        stages: ["commit", "push"]
```

The `--only-verified` flag reports only confirmed-active secrets. The `--fail` flag exits with code 183 when secrets are found, blocking the commit.
TruffleHog detects 800+ credential types across Git repos, Docker images, S3 buckets, Slack, Jira, and Confluence.
GitGuardian
GitGuardian (ggshield v1.46.0) offers 500+ secret detectors with enterprise features: centralized dashboards, policy management, and incident tracking.
Pre-commit configuration:
```yaml
repos:
  - repo: https://github.com/gitguardian/ggshield
    rev: v1.46.0
    hooks:
      - id: ggshield
        language_version: python3
        stages: [pre-commit]
```

Authentication:
```shell
ggshield auth login
# or
export GITGUARDIAN_API_KEY=<your-api-key>
```

GitGuardian's enterprise tier adds server-side pre-receive hooks. Unlike client-side pre-commit hooks, pre-receive hooks cannot be bypassed by developers: the server rejects commits containing secrets regardless of client configuration.
HashiCorp Vault Radar
Vault Radar (generally available since 2025) integrates secret scanning with HashiCorp Vault. Discovered secrets can be imported directly into Vault for centralized management.
Key capabilities:
- Scans GitHub, GitLab, Bitbucket, Azure DevOps, Confluence, Jira, Slack, S3, and Terraform
- Detects secrets, PII (SSN, credit cards), and non-inclusive language
- Integrates with GitHub pre-receive webhooks for server-side blocking
- Offers a VS Code IDE plugin (public beta) for real-time detection during development
Vault Radar is commercial, targeting organizations already invested in HashiCorp's ecosystem.
Defense in depth
Pre-commit hooks are a first line of defense, not a complete solution.
They run on developer machines and can be bypassed with `--no-verify`.
Developers might not have them installed.
New team members might miss the setup step.
A layered approach combines:
- Pre-commit hooks (Gitleaks, TruffleHog): developer workstation
- Pre-receive hooks (GitGuardian, Vault Radar): server-side enforcement
- CI/CD scanning: pipeline integration
- Periodic repository scans: historical analysis of existing commits
Server-side hooks and CI scanning provide enforcement that developers cannot circumvent.
Tool-specific exclusion configurations
Beyond preventing secrets from entering version control, AI coding tools offer configuration to exclude specific files from being read during inference.
Claude Code permissions.deny
Claude Code uses a JSON-based permission system to deny file access.
Configuration lives in .claude/settings.json:
```json
{
  "permissions": {
    "deny": [
      "Read(./.env)",
      "Read(./.env.*)",
      "Read(./secrets/**)",
      "Read(**/.env)",
      "Read(**/*.pem)",
      "Read(**/*.key)",
      "Read(**/credentials.json)"
    ]
  }
}
```

Glob pattern syntax:
| Pattern | Matches |
|---|---|
| `Read(./.env)` | The `.env` file in project root |
| `Read(./.env.*)` | Files like `.env.local`, `.env.production` |
| `Read(./secrets/**)` | All files in `secrets/` recursively |
| `Read(**/.env)` | Any `.env` file at any depth |
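As a rough illustration of how such deny patterns are evaluated, Python's `fnmatch` can approximate the matching. Note that fnmatch's `*` crosses directory separators, so `*` and `**` behave alike here; this is not Claude Code's exact matcher. The patterns are the globs above with the `Read(...)` wrapper stripped:

```python
from fnmatch import fnmatch

# Bare globs taken from the deny list above (Read(...) wrapper removed).
DENY_PATTERNS = [".env", ".env.*", "secrets/**", "**/.env", "**/*.pem"]

def is_denied(path: str) -> bool:
    """Approximate check: True if the relative path matches any deny glob."""
    return any(fnmatch(path, pattern) for pattern in DENY_PATTERNS)
```

For example, `is_denied("config/.env")` is caught by `**/.env`, while an ordinary source file matches nothing and is allowed.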
Settings file precedence (highest to lowest):
1. `managed-settings.json`: Enterprise/system-wide
2. `.claude/settings.local.json`: Personal, not committed
3. `.claude/settings.json`: Project-level, shared via git
4. `~/.claude/settings.json`: User global
Caution: Read deny rules do not block Bash tool access. Denying `Read(**/.env)` prevents Claude Code's Read tool from accessing `.env` files, but commands like `cat .env` or `grep password .env` can still execute. For complete protection, deny Bash patterns as well or use PreToolUse hooks.
PreToolUse hooks for additional protection:
```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Read|Edit|Write",
        "hooks": [
          {
            "type": "command",
            "command": ".claude/hooks/protect_sensitive.sh"
          }
        ]
      }
    ]
  }
}
```

A hook exit code of 2 blocks the tool invocation entirely.
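What a script like `protect_sensitive.sh` might implement can be sketched in Python. The JSON field names and patterns here are assumptions about the hook input; the use of exit code 2 to block the call follows the behavior described above:

```python
import json
import sys
from fnmatch import fnmatch

# Illustrative patterns; adjust to your repository layout.
BLOCKED = ["**/.env", "**/.env.*", "**/*.pem", "**/*.key", "**/secrets/**"]

def should_block(file_path: str) -> bool:
    """True when the tool call targets a sensitive path.
    fnmatch is only an approximation of Claude Code's matcher."""
    return any(fnmatch(file_path, p) for p in BLOCKED)

def run_hook(stdin=sys.stdin) -> int:
    """Read the pending tool call (JSON on stdin) and return the exit
    code: 2 blocks the invocation, 0 allows it."""
    event = json.load(stdin)
    path = event.get("tool_input", {}).get("file_path", "")
    if should_block(path):
        print(f"Blocked access to sensitive file: {path}", file=sys.stderr)
        return 2
    return 0
```

A shell wrapper would call `sys.exit(run_hook())`; keeping the decision logic in a function makes the path check testable without a live tool call.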
GitHub Copilot content exclusion
GitHub Copilot Business and Enterprise plans support content exclusion through GitHub's web interface (repository, organization, or enterprise settings).
Pattern syntax (fnmatch, case-insensitive):
```yaml
# Repository-level exclusions
- "**/.env"
- "**/secrets/**"
- "*.pem"
- "credentials.json"
```

```yaml
# Organization-level exclusions (specify repository)
"*":
  - "**/.env"
  - "**/credentials/**"
my-org/my-repo:
  - "/config/secrets/**"
```

Excluded files are blocked from:
- Inline code completions
- Informing suggestions in other files
- Copilot Chat responses
Important: Content exclusion does not apply to:
- GitHub Copilot CLI
- Copilot coding agent (background agent)
- Agent mode in Copilot Chat
- Edit mode in Copilot Chat
This is a significant security gap. When using Copilot's agentic features, the model can access files that content exclusion would otherwise block. Organizations relying on agentic Copilot features must use alternative protection mechanisms.
Additional limitations:
- Changes take up to 30 minutes to propagate to IDEs
- IDE-provided type information may leak semantic details from excluded files
- Individual users without Business/Enterprise plans cannot configure exclusions
Codex file exclusion
As of January 2026, Codex CLI has no built-in file exclusion mechanism.
No .codexignore file exists.
This is one of the most requested features (141+ upvotes on GitHub issues).
Codex offers sandbox modes, which limit file system access at the OS level:
| Mode | Capability |
|---|---|
| `--sandbox read-only` | View files only, blocks all edits |
| `--sandbox workspace-write` | Read/write within current directory |
| `--sandbox danger-full-access` | No restrictions |
Sandboxing restricts what Codex can modify, not what it can read.
For sensitive projects, `--sandbox read-only` prevents writes but does not prevent file contents from being sent to the API for inference.
The third-party xCodex fork has implemented file exclusion via .xcodexignore, but this is not official OpenAI functionality.
PII detection tools
Secrets (API keys, passwords) follow predictable patterns. PII (names, addresses, social security numbers) requires more sophisticated detection.
Microsoft Presidio
Presidio is an open-source framework for detecting and redacting PII. It uses named entity recognition (NER), regex patterns, and checksum validation.
Supported PII types:
- Credit card numbers
- Social security numbers
- Names, locations, phone numbers
- Financial data, healthcare identifiers
- Bitcoin wallet addresses
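Checksum validation is what lets a detector separate a plausible credit card number from sixteen arbitrary digits. A sketch of the standard Luhn check (illustrative, not Presidio's code):

```python
def luhn_valid(number: str) -> bool:
    """Luhn checksum: double every second digit from the right and
    check the sum modulo 10. Filters out random digit strings."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 13:          # shorter than any real card number
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:            # every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0
```

A regex alone would flag any 16-digit run; combining it with the checksum removes the large majority of coincidental matches.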
Presidio does not provide built-in pre-commit hooks, but can be integrated into custom hooks via its Python API:
```python
from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()
# file_content: text of the staged file being checked
results = analyzer.analyze(text=file_content, language="en")
if results:
    # Block the commit and report each finding
    for result in results:
        print(f"{result.entity_type} at {result.start}-{result.end}")
    raise SystemExit(1)
```

Presidio acknowledges in its documentation: "there is no guarantee that Presidio will find all sensitive information." It is a detection aid, not a complete solution.
Amazon Comprehend
Amazon Comprehend's DetectPiiEntities API identifies PII in text.
It supports English and Spanish, handles up to 100 KB per request, and detects addresses, ages, names, phone numbers, and country-specific identifiers.
Organizations running on AWS can integrate Comprehend into CI pipelines:
```shell
aws comprehend detect-pii-entities \
    --text "file content here" \
    --language-code en
```

Amazon Macie provides similar capabilities for S3 buckets but is designed for data storage scanning, not code repository scanning.
Nightfall AI
Nightfall offers GitHub Actions and CircleCI integrations specifically for code scanning. It scans commits on pull request, posting review comments when PII is detected.
```yaml
# .github/workflows/nightfall.yml
- uses: nightfallai/nightfall_dlp_action@v3
  with:
    nightfall_api_key: ${{ secrets.NIGHTFALL_API_KEY }}
```

Nightfall claims higher precision than Google DLP, AWS Comprehend, and Microsoft Purview for its supported data types. Spring 2025 releases added endpoint agents that monitor sensitive data in AI prompts, preventing PII from being typed into AI tools.
What exclusion does not cover
Understanding the gaps in these protection mechanisms is as important as understanding their capabilities.
Pre-commit hooks can be bypassed:
```shell
git commit --no-verify -m "Emergency fix"
```

Developers with legitimate urgency (or bad judgment) can skip client-side checks. Server-side pre-receive hooks and CI scanning catch what pre-commit hooks miss.
Exclusion patterns have blind spots:
- Symlinks can bypass some exclusion rules (fixed in Claude Code v1.0.119)
- Shell commands may access files that tool-specific deny rules block
- IDE-provided context (type hints, hover info) can leak excluded file content
PII detection is probabilistic:
Unlike secrets (which follow deterministic patterns), PII detection involves false positives and false negatives. "John Smith" in a test fixture looks identical to "John Smith" as real customer data. No tool perfectly distinguishes test data from production data.
Runtime access is not blocked:
Excluding .env files from AI tool access does not prevent the application from reading those files at runtime.
If your code loads environment variables and logs them, those values may appear in error messages or debug output that AI tools can see.
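One mitigation is to scrub known secret values from anything that reaches logs or debug output. A minimal sketch, assuming a hypothetical `redact` helper and illustrative environment variable names:

```python
import os

def redact(message: str, sensitive_keys=("API_KEY", "DB_PASSWORD", "TOKEN")) -> str:
    """Replace the values of selected environment variables with a
    placeholder before the message reaches logs or AI-visible output."""
    for key in sensitive_keys:
        value = os.environ.get(key)
        if value:
            message = message.replace(value, f"<redacted:{key}>")
    return message
```

Wiring such a filter into the logging layer means even accidental `print`-style leaks of loaded secrets show up as placeholders rather than live values.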
Historical exposure is not undone:
If sensitive data was committed before protections were in place, it exists in git history. Scanning and exclusion tools protect future commits. Historical exposure requires git history rewriting (with all its complications) to fully address.
The next section covers data classification frameworks: how to categorize data by sensitivity and map those categories to AI tool policies.