Data Classification Frameworks
From ad-hoc rules to systematic classification
The previous section covered techniques for identifying sensitive data: pre-commit hooks, exclusion patterns, PII detection. These tools answer "how do we detect sensitive content?" But they do not answer a prior question: "what counts as sensitive in the first place?"
Data classification frameworks provide that answer. They define sensitivity categories, assign data to those categories, and specify handling requirements for each tier. Without classification, security policies devolve into case-by-case judgments: inconsistent, unscalable, and difficult to audit.
This section covers the standard four-tier classification model, how to map classifications to AI tool policies, and the components of a complete classification framework.
The four-tier classification model
Most enterprises adopt a four-tier classification system. The terminology varies by organization, but the structure is consistent: four levels of increasing sensitivity with corresponding restrictions.
| Classification | Definition | Example Data |
|---|---|---|
| Public | Approved for external release; no harm if disclosed | Marketing materials, open source code, public documentation |
| Internal | For organizational use; minimal sensitivity | Internal wikis, project plans, non-sensitive meeting notes |
| Confidential | Significant harm if disclosed; limited distribution | Customer lists, financial projections, unreleased product specs |
| Restricted | Severe harm if disclosed; strictly controlled access | Trade secrets, cryptographic keys, PII, PHI, credentials |
This model balances simplicity with appropriate granularity. Fewer tiers reduce classification overhead but blur important distinctions. More tiers add precision but increase cognitive load and misclassification risk.
Public data
Public data carries no confidentiality requirements. It may be freely shared externally, indexed by search engines, or processed by any AI tool.
Examples in software development:
- Published API documentation
- Open source code under permissive licenses
- Company blog posts and marketing content
- Publicly accessible README files
Public classification does not imply unimportance. Public code may still require integrity controls (preventing unauthorized modification) and availability controls (maintaining uptime). Classification addresses confidentiality, not the full CIA triad.
Internal data
Internal data is the organizational default: information meant for employees and contractors but not for external parties. Disclosure would cause inconvenience or minor reputational impact, not significant harm.
Examples:
- Project roadmaps and internal documentation
- Non-sensitive source code for internal tools
- Team meeting recordings and notes
- Internal training materials
Internal data typically permits use with vetted enterprise AI tools. The key word is vetted: the organization has evaluated the tool's data handling practices, retention policies, and compliance posture.
Confidential data
Confidential data, if disclosed, would cause significant harm: competitive disadvantage, regulatory penalties, or breach of contractual obligations. Access is limited to individuals with a business need.
Examples:
- Customer data and usage analytics
- Unreleased product source code
- Financial projections and M&A discussions
- Vendor contracts and pricing agreements
Confidential data requires tighter controls for AI tool usage. Many organizations restrict confidential data to enterprise-only AI deployments (private instances, VPC-hosted models) or prohibit AI processing entirely.
Restricted data
Restricted data represents the highest sensitivity tier. Unauthorized disclosure would cause severe harm: significant financial loss, legal liability, or harm to individuals.
Examples:
- Credentials, API keys, encryption keys (should not be in code, but sometimes are)
- Personally identifiable information (PII) and protected health information (PHI)
- Trade secrets and core algorithmic IP
- Authentication and authorization logic
Restricted data generally prohibits use with external AI tools. Self-hosted models or complete prohibition are typical policies. Even with self-hosted deployments, restricted data often requires additional safeguards: audit logging, access controls, and explicit authorization.
Mapping classification to AI tool policies
Classification without enforcement is documentation theater. Each classification tier must map to concrete policies governing AI tool usage.
A sample policy matrix
| Classification | Permitted AI Tools | Required Controls | Human Review |
|---|---|---|---|
| Public | Any approved tool | Basic logging | None required |
| Internal | Enterprise-vetted tools only | Audit trails, access controls | Optional |
| Confidential | Enterprise-only or self-hosted | Encryption, masking, approval workflow | Required for outputs |
| Restricted | Self-hosted only or prohibited | Full audit, explicit authorization | All interactions |
This matrix is illustrative. Actual policies depend on regulatory requirements, risk tolerance, and available infrastructure.
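Enforcement tooling works best when the matrix itself is machine-readable rather than buried in a policy document. Below is a minimal Python sketch of the sample matrix above; the tier names, tool categories, and review labels are illustrative assumptions, not a prescribed schema.

```python
# A machine-readable version of the sample policy matrix above.
# Tool categories and review labels are illustrative, not a standard.
POLICY_MATRIX = {
    "public":       {"tools": {"any-approved"},                   "review": "none"},
    "internal":     {"tools": {"enterprise-vetted"},              "review": "optional"},
    "confidential": {"tools": {"enterprise-only", "self-hosted"}, "review": "outputs"},
    "restricted":   {"tools": {"self-hosted"},                    "review": "all"},
}

def tool_permitted(classification: str, tool_category: str) -> bool:
    """Check a tool category against the tier's permitted set."""
    permitted = POLICY_MATRIX[classification.lower()]["tools"]
    return tool_category in permitted or "any-approved" in permitted

assert tool_permitted("Public", "enterprise-vetted")
assert not tool_permitted("restricted", "enterprise-only")
```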
Translating policies to technical controls
Policy statements like "enterprise-vetted tools only" require translation into enforceable rules.
For Public data:
- No technical restrictions beyond basic access controls
- Standard monitoring and logging
For Internal data:
- Allowlisted AI tools via endpoint management
- Network controls permitting only approved AI service endpoints
- Basic usage logging for audit purposes
For Confidential data:
- Mandatory content exclusion configurations (Claude Code permissions.deny, Copilot content exclusions)
- PII masking before AI tool submission (a sketch follows this list)
- Output review before incorporation into production code
- Approval workflows for AI-assisted changes to confidential systems
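The masking control can be implemented as a regex pass over outbound content. A minimal sketch of that step; the patterns and placeholder format are illustrative, and production-grade PII detection needs far broader pattern coverage and validation.

```python
import re

# Illustrative patterns only; real PII detection needs far broader
# coverage (names, addresses, locale-specific identifiers).
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def mask_pii(text: str) -> str:
    """Replace likely PII with typed placeholders before AI tool submission."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}-REDACTED]", text)
    return text

print(mask_pii("Contact jane.doe@example.com or 555-867-5309."))
# -> Contact [EMAIL-REDACTED] or [PHONE-REDACTED].
```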
For Restricted data:
- Network segmentation preventing external AI tool access
- Pre-commit hooks blocking commits containing restricted content (sketched after this list)
- Self-hosted model deployments with no external data egress
- Continuous monitoring and alerting
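A minimal sketch of such a pre-commit hook; the restricted path patterns are illustrative and would normally be derived from the organization's classification files rather than hardcoded.

```python
#!/usr/bin/env python3
"""Pre-commit hook: block commits that stage Restricted-classified paths."""
import fnmatch
import subprocess
import sys

# Illustrative patterns. Note: fnmatch's "*" also matches path
# separators, so these globs are deliberately coarse.
RESTRICTED_PATTERNS = ["*.pem", "*.env*", "secrets/*"]

def staged_files() -> list[str]:
    """List file paths staged for the current commit."""
    out = subprocess.run(
        ["git", "diff", "--cached", "--name-only"],
        capture_output=True, text=True, check=True,
    )
    return [line for line in out.stdout.splitlines() if line]

def main() -> int:
    violations = [
        path for path in staged_files()
        if any(fnmatch.fnmatch(path, pattern) for pattern in RESTRICTED_PATTERNS)
    ]
    if violations:
        print("Commit blocked: staged paths classified Restricted:")
        for path in violations:
            print(f"  {path}")
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
```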
The nuanced policy problem
Page 1 of this module observed that blanket prohibitions on "sending code to AI services" may be too broad. The classification framework enables nuanced policies.
Instead of:
"AI coding tools are prohibited for all company code."
A classification-based policy allows:
"AI coding tools may be used with Public and Internal classified code using approved enterprise tools. Confidential code requires enterprise-only deployments with manager approval. Restricted code is prohibited from AI tool processing."
This nuance increases adoption of AI tools where they are safe while maintaining protection where it matters.
Components of a classification framework
A complete framework includes more than a tier list. Five components distinguish a functional framework from a poster on the wall.
1. Classification criteria
How do you determine which tier applies? Criteria should be specific enough to support consistent classification, yet general enough to cover diverse data types.
Questions that drive classification:
- What is the regulatory status of this data? (GDPR, HIPAA, PCI-DSS, SOX)
- What would be the impact of unauthorized disclosure?
- Who should have access under normal circumstances?
- Are there contractual obligations governing this data?
- Does this data contain credentials, PII, or trade secrets?
Default classification: Unclassified data should default to Internal (not Public). Requiring explicit approval for public release prevents accidental exposure.
High-watermark principle: When data combines elements from multiple tiers, the highest tier applies. A document containing 90% Internal content and one API key is Restricted.
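The high-watermark rule is straightforward to encode. A minimal sketch, assuming the four tier names used in this section:

```python
# Tiers ordered from least to most sensitive.
TIER_ORDER = ["public", "internal", "confidential", "restricted"]

def high_watermark(element_tiers: list[str]) -> str:
    """Return the classification of a composite document: the highest
    tier among its elements, defaulting to Internal when unclassified."""
    if not element_tiers:
        return "internal"  # the default tier from the criteria above
    return max(element_tiers, key=TIER_ORDER.index)

# A document that is mostly Internal but contains one API key:
print(high_watermark(["internal"] * 9 + ["restricted"]))  # -> restricted
```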
2. Classification procedures
Who classifies data, and when?
Data owners (typically the creating team or business unit) assign initial classification. Security teams review classifications for high-sensitivity tiers. Automated scanning supplements human classification for detectable patterns (secrets, PII).
Classification should occur at creation time, not retroactively. Repository templates, project initialization checklists, and onboarding documentation embed classification decisions into normal workflows.
3. Handling requirements
Each tier specifies handling requirements beyond AI tool usage:
- Storage: Where may this data reside? (Approved cloud services, on-premise only, encrypted volumes)
- Transmission: What encryption is required? (TLS in transit, specific cipher requirements)
- Retention: How long may this data be kept? (Regulatory minimums, business necessity)
- Disposal: How must this data be destroyed? (Secure deletion, certificate of destruction)
- Access: Who may access, and what authorization is required?
AI tool policies inherit from these broader handling requirements. If Restricted data requires encrypted storage, it certainly requires protection from external AI services.
4. Audit and monitoring
Classification frameworks require verification:
- Classification audits: Periodic reviews of data classifications for accuracy
- Policy compliance monitoring: Automated checks for policy violations (secrets in Internal-classified repos, PII in AI tool logs)
- Access reviews: Regular verification that access patterns match classification requirements
For AI tools specifically:
- Log which data classifications are being processed by which tools
- Alert on anomalies such as Confidential data flowing to non-enterprise AI endpoints (a detection sketch follows this list)
- Retain audit trails for compliance reporting
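A minimal sketch of such an anomaly check, assuming a simple in-house event format; the field names and endpoint categories are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class AIUsageEvent:
    tool: str
    endpoint_type: str   # e.g. "enterprise", "self-hosted", "external"
    classification: str  # tier of the data sent

# Endpoint types acceptable per tier, per the sample policy matrix above.
ALLOWED_ENDPOINTS = {
    "public": {"enterprise", "self-hosted", "external"},
    "internal": {"enterprise", "self-hosted"},
    "confidential": {"enterprise", "self-hosted"},
    "restricted": {"self-hosted"},
}

def find_anomalies(events: list[AIUsageEvent]) -> list[AIUsageEvent]:
    """Return events where the data's tier disallows the endpoint used."""
    return [
        e for e in events
        if e.endpoint_type not in ALLOWED_ENDPOINTS.get(e.classification, set())
    ]

events = [AIUsageEvent("copilot", "external", "confidential")]
for e in find_anomalies(events):
    print(f"ALERT: {e.classification} data sent to {e.endpoint_type} endpoint via {e.tool}")
```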
5. Exception handling
No framework covers every case. Exception processes handle legitimate deviations:
- Request process: How do you request an exception?
- Approval authority: Who can approve exceptions at each tier?
- Documentation: What records must be kept?
- Expiration: When does the exception expire?
- Review: How are exceptions audited?
Exception rates are a health metric. Too many exceptions indicate either overly restrictive policies or insufficient tooling. Too few exceptions (or informal workarounds) indicate the process is being bypassed.
Practical classification for codebases
Applying classification to code repositories presents unique challenges.
Repository-level vs file-level classification
Some organizations classify entire repositories at a single tier. This simplifies enforcement: the repository classification determines AI tool policies for everything in it.
Other organizations classify at the file or directory level.
A repository might be Internal overall, with a secrets/ directory classified Restricted and an api/public/ directory classified Public.
Repository-level classification:
- Simpler to manage and audit
- Prevents classification creep within projects
- May over-classify (Public docs in a Confidential repo)
File-level classification:
- More precise, potentially lower friction
- Requires more sophisticated tooling
- Risk of misclassification at boundaries
Hybrid approaches are common: repository-level defaults with directory-level overrides for known exceptions.
Classification in practice
Using CODEOWNERS or classification files:
```yaml
# .classification.yml (custom convention)
default: internal
paths:
  - pattern: "docs/public/**"
    classification: public
  - pattern: "src/auth/**"
    classification: confidential
  - pattern: ".env*"
    classification: restricted
  - pattern: "**/*.pem"
    classification: restricted
```

Such files can integrate with CI pipelines to enforce classification-appropriate checks.
Connecting to exclusion patterns:
Classification files can drive tool configuration. Restricted paths map to Claude Code permissions.deny patterns. Confidential paths trigger additional review requirements. The classification becomes the source of truth; tool configurations derive from it.
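A sketch of that derivation, assuming the .classification.yml convention shown earlier and the PyYAML library. The output mirrors Claude Code's permissions.deny structure; verify the exact rule syntax against current tool documentation before relying on it.

```python
import json
import yaml  # PyYAML

def derive_deny_rules(classification_file: str) -> dict:
    """Generate a Claude Code permissions block from classification data.

    Restricted paths become deny rules so the tool cannot read them.
    The rule syntax here is an assumption; check current docs.
    """
    with open(classification_file) as f:
        config = yaml.safe_load(f)

    deny = [
        f"Read({entry['pattern']})"
        for entry in config.get("paths", [])
        if entry.get("classification") == "restricted"
    ]
    return {"permissions": {"deny": deny}}

settings = derive_deny_rules(".classification.yml")
print(json.dumps(settings, indent=2))
# Merge into .claude/settings.json so the tool config derives from
# the classification file rather than being maintained by hand.
```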
Legacy code without classification
Existing codebases rarely have explicit classification. Retroactive classification is time-consuming but necessary for consistent policy enforcement.
Pragmatic approach:
- Default everything to Internal
- Use automated scanning to identify likely Restricted content (secrets, PII)
- Have teams classify their own repositories during a defined window
- Apply Confidential classification to repositories containing customer data, authentication logic, or competitive IP
Classification does not need perfect accuracy initially. It needs a starting point for improvement.
Integration with existing frameworks
Classification for AI tools should align with (not replace) existing data governance frameworks.
NIST impact levels: NIST uses Low, Moderate, and High impact categories. These map roughly to Internal, Confidential, and Restricted, with Public having no NIST equivalent (public data is out of scope for protection).
Industry regulations:
- PCI-DSS defines cardholder data as requiring specific protections
- HIPAA defines PHI handling requirements
- GDPR defines personal data processing constraints
Regulated data categories should map to Confidential or Restricted tiers automatically. The classification framework incorporates regulatory requirements rather than duplicating them.
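A minimal sketch of that incorporation, assuming the tier names above; the regulatory category labels and floor assignments are illustrative:

```python
# Minimum (floor) tier for data matching a regulated category.
# Category names and floors are illustrative assumptions.
REGULATORY_FLOOR = {
    "pci-cardholder-data": "restricted",
    "hipaa-phi": "restricted",
    "gdpr-personal-data": "confidential",
}

TIER_ORDER = ["public", "internal", "confidential", "restricted"]

def apply_regulatory_floor(assigned: str, categories: list[str]) -> str:
    """Raise an assigned classification to meet any regulatory floor."""
    candidates = [assigned] + [REGULATORY_FLOOR[c] for c in categories]
    return max(candidates, key=TIER_ORDER.index)

# Internal data found to contain GDPR personal data is raised:
print(apply_regulatory_floor("internal", ["gdpr-personal-data"]))  # -> confidential
```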
Framework adoption
Frameworks fail when they exist only in policy documents. Successful adoption requires:
- Training: Developers understand classification criteria and their responsibilities
- Tooling: Classification decisions are enforced technically, not just procedurally
- Visibility: Classification status is visible in repositories, dashboards, and development tools
- Iteration: Classifications are reviewed and updated as data usage changes
The framework should reduce friction, not add bureaucracy. When developers find classification helpful for making AI tool decisions, adoption succeeds. When classification is a checkbox exercise disconnected from daily work, it fails.