Data Classification Frameworks
From ad-hoc rules to systematic classification
The previous section covered techniques for identifying sensitive data: pre-commit hooks, exclusion patterns, PII detection. These tools answer "how do we detect sensitive content?" But they do not answer a prior question: "what counts as sensitive in the first place?"
Data classification frameworks provide that answer. They define sensitivity categories, assign data to those categories, and specify handling requirements for each tier. Without classification, security policies devolve into case-by-case judgments: inconsistent, unscalable, and difficult to audit.
This section covers the standard four-tier classification model, how to map classifications to AI tool policies, and the components of a complete classification framework.
The four-tier classification model
Most enterprises adopt a four-tier classification system. The terminology varies by organization, but the structure is consistent: four levels of increasing sensitivity with corresponding restrictions.
| Classification | Definition | Example Data |
|---|---|---|
| Public | Approved for external release; no harm if disclosed | Marketing materials, open source code, public documentation |
| Internal | For organizational use; minimal sensitivity | Internal wikis, project plans, non-sensitive meeting notes |
| Confidential | Significant harm if disclosed; limited distribution | Customer lists, financial projections, unreleased product specs |
| Restricted | Severe harm if disclosed; strictly controlled access | Trade secrets, cryptographic keys, PII, PHI, credentials |
This model balances simplicity with appropriate granularity. Fewer tiers reduce classification overhead but blur important distinctions. More tiers add precision but increase cognitive load and misclassification risk.
Public data
Public data carries no confidentiality requirements. It may be freely shared externally, indexed by search engines, or processed by any AI tool.
Examples in software development:
- Published API documentation
- Open source code under permissive licenses
- Company blog posts and marketing content
- Publicly accessible README files
Public classification does not imply unimportance. Public code may still require integrity controls (preventing unauthorized modification) and availability controls (maintaining uptime). Classification addresses confidentiality, not the full CIA triad.
Internal data
Internal data is the organizational default: information meant for employees and contractors but not for external parties. Disclosure would cause inconvenience or minor reputational impact, not significant harm.
Examples:
- Project roadmaps and internal documentation
- Non-sensitive source code for internal tools
- Team meeting recordings and notes
- Internal training materials
Internal data typically permits use with vetted enterprise AI tools. The key word is vetted: the organization has evaluated the tool's data handling practices, retention policies, and compliance posture.
Confidential data
Confidential data, if disclosed, would cause significant harm: competitive disadvantage, regulatory penalties, or breach of contractual obligations. Access is limited to individuals with a business need.
Examples:
- Customer data and usage analytics
- Unreleased product source code
- Financial projections and M&A discussions
- Vendor contracts and pricing agreements
Confidential data requires tighter controls for AI tool usage. Many organizations restrict confidential data to enterprise-only AI deployments (private instances, VPC-hosted models) or prohibit AI processing entirely.
Restricted data
Restricted data represents the highest sensitivity tier. Unauthorized disclosure would cause severe harm: significant financial loss, legal liability, or harm to individuals.
Examples:
- Credentials, API keys, encryption keys (should not be in code, but sometimes are)
- Personally identifiable information (PII) and protected health information (PHI)
- Trade secrets and core algorithmic IP
- Authentication and authorization logic
Restricted data generally prohibits use with external AI tools. Self-hosted models or complete prohibition are typical policies. Even with self-hosted deployments, restricted data often requires additional safeguards: audit logging, access controls, and explicit authorization.
Mapping classification to AI tool policies
Classification without enforcement is documentation theater. Each classification tier must map to concrete policies governing AI tool usage.
A sample policy matrix
| Classification | Permitted AI Tools | Required Controls | Human Review |
|---|---|---|---|
| Public | Any approved tool | Basic logging | None required |
| Internal | Enterprise-vetted tools only | Audit trails, access controls | Optional |
| Confidential | Enterprise-only or self-hosted | Encryption, masking, approval workflow | Required for outputs |
| Restricted | Self-hosted only or prohibited | Full audit, explicit authorization | All interactions |
This matrix is illustrative. Actual policies depend on regulatory requirements, risk tolerance, and available infrastructure.
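Enforcement tooling works best when the matrix itself is machine-readable rather than buried in a policy document. Below is a minimal Python sketch of the sample matrix above; the tier names, tool categories, and review labels are illustrative assumptions, not a prescribed schema.

```python
# A machine-readable version of the sample policy matrix above.
# Tool categories and review labels are illustrative, not a standard.
POLICY_MATRIX = {
    "public":       {"tools": {"any-approved"},                   "review": "none"},
    "internal":     {"tools": {"enterprise-vetted"},              "review": "optional"},
    "confidential": {"tools": {"enterprise-only", "self-hosted"}, "review": "outputs"},
    "restricted":   {"tools": {"self-hosted"},                    "review": "all"},
}

def tool_permitted(classification: str, tool_category: str) -> bool:
    """Check a tool category against the tier's permitted set."""
    permitted = POLICY_MATRIX[classification.lower()]["tools"]
    return tool_category in permitted or "any-approved" in permitted

assert tool_permitted("Public", "enterprise-vetted")
assert not tool_permitted("restricted", "enterprise-only")
```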
Translating policies to technical controls
Policy statements like "enterprise-vetted tools only" require translation into enforceable rules.
For Public data:
- No technical restrictions beyond basic access controls
- Standard monitoring and logging
For Internal data:
- Allowlisted AI tools via endpoint management
- Network controls permitting only approved AI service endpoints
- Basic usage logging for audit purposes
For Confidential data:
- Mandatory content exclusion configurations (Claude Code permissions.deny, Copilot content exclusions)
- PII masking before AI tool submission (a sketch follows this list)
- Output review before incorporation into production code
- Approval workflows for AI-assisted changes to confidential systems
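The masking control can be implemented as a regex pass over outbound content. A minimal sketch of that step; the patterns and placeholder format are illustrative, and production-grade PII detection needs far broader pattern coverage and validation.

```python
import re

# Illustrative patterns only; real PII detection needs far broader
# coverage (names, addresses, locale-specific identifiers).
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def mask_pii(text: str) -> str:
    """Replace likely PII with typed placeholders before AI tool submission."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}-REDACTED]", text)
    return text

print(mask_pii("Contact jane.doe@example.com or 555-867-5309."))
# -> Contact [EMAIL-REDACTED] or [PHONE-REDACTED].
```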
For Restricted data:
- Network segmentation preventing external AI tool access
- Pre-commit hooks blocking commits containing restricted content (sketched after this list)
- Self-hosted model deployments with no external data egress
- Continuous monitoring and alerting
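A minimal sketch of such a pre-commit hook; the restricted path patterns are illustrative and would normally be derived from the organization's classification files rather than hardcoded.

```python
#!/usr/bin/env python3
"""Pre-commit hook: block commits that stage Restricted-classified paths."""
import fnmatch
import subprocess
import sys

# Illustrative patterns. Note: fnmatch's "*" also matches path
# separators, so these globs are deliberately coarse.
RESTRICTED_PATTERNS = ["*.pem", "*.env*", "secrets/*"]

def staged_files() -> list[str]:
    """List file paths staged for the current commit."""
    out = subprocess.run(
        ["git", "diff", "--cached", "--name-only"],
        capture_output=True, text=True, check=True,
    )
    return [line for line in out.stdout.splitlines() if line]

def main() -> int:
    violations = [
        path for path in staged_files()
        if any(fnmatch.fnmatch(path, pattern) for pattern in RESTRICTED_PATTERNS)
    ]
    if violations:
        print("Commit blocked: staged paths classified Restricted:")
        for path in violations:
            print(f"  {path}")
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
```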
The nuanced policy problem
Page 1 of this module observed that blanket prohibitions on "sending code to AI services" may be too broad. The classification framework enables nuanced policies.
Instead of:
"AI coding tools are prohibited for all company code."
A classification-based policy allows:
"AI coding tools may be used with Public and Internal classified code using approved enterprise tools. Confidential code requires enterprise-only deployments with manager approval. Restricted code is prohibited from AI tool processing."
This nuance increases adoption of AI tools where they are safe while maintaining protection where it matters.
Components of a classification framework
A complete framework includes more than a tier list. Five components distinguish a functional framework from a poster on the wall.
1. Classification criteria
How do you determine which tier applies? Criteria should be specific enough to support consistent classification, yet general enough to cover diverse data types.
Questions that drive classification:
- What is the regulatory status of this data? (GDPR, HIPAA, PCI-DSS, SOX)
- What would be the impact of unauthorized disclosure?
- Who should have access under normal circumstances?
- Are there contractual obligations governing this data?
- Does this data contain credentials, PII, or trade secrets?
Default classification: Unclassified data should default to Internal (not Public). Requiring explicit approval for public release prevents accidental exposure.
High-watermark principle: When data combines elements from multiple tiers, the highest tier applies. A document containing 90% Internal content and one API key is Restricted.
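The high-watermark rule is straightforward to encode. A minimal sketch, assuming the four tier names used in this section:

```python
# Tiers ordered from least to most sensitive.
TIER_ORDER = ["public", "internal", "confidential", "restricted"]

def high_watermark(element_tiers: list[str]) -> str:
    """Return the classification of a composite document: the highest
    tier among its elements, defaulting to Internal when unclassified."""
    if not element_tiers:
        return "internal"  # the default tier from the criteria above
    return max(element_tiers, key=TIER_ORDER.index)

# A document that is mostly Internal but contains one API key:
print(high_watermark(["internal"] * 9 + ["restricted"]))  # -> restricted
```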
2. Classification procedures
Who classifies data, and when?
Data owners (typically the creating team or business unit) assign initial classification. Security teams review classifications for high-sensitivity tiers. Automated scanning supplements human classification for detectable patterns (secrets, PII).
Classification should occur at creation time, not retroactively. Repository templates, project initialization checklists, and onboarding documentation embed classification decisions into normal workflows.
3. Handling requirements
Each tier specifies handling requirements beyond AI tool usage:
- Storage: Where may this data reside? (Approved cloud services, on-premise only, encrypted volumes)
- Transmission: What encryption is required? (TLS in transit, specific cipher requirements)
- Retention: How long may this data be kept? (Regulatory minimums, business necessity)
- Disposal: How must this data be destroyed? (Secure deletion, certificate of destruction)
- Access: Who may access, and what authorization is required?
AI tool policies inherit from these broader handling requirements. If Restricted data requires encrypted storage, it certainly requires protection from external AI services.
4. Audit and monitoring
Classification frameworks require verification:
- Classification audits: Periodic reviews of data classifications for accuracy
- Policy compliance monitoring: Automated checks for policy violations (secrets in Internal-classified repos, PII in AI tool logs)
- Access reviews: Regular verification that access patterns match classification requirements
For AI tools specifically:
- Log which data classifications are being processed by which tools
- Alert on anomalies such as Confidential data flowing to non-enterprise AI endpoints (a detection sketch follows this list)
- Retain audit trails for compliance reporting
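A minimal sketch of such an anomaly check, assuming a simple in-house event format; the field names and endpoint categories are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class AIUsageEvent:
    tool: str
    endpoint_type: str   # e.g. "enterprise", "self-hosted", "external"
    classification: str  # tier of the data sent

# Endpoint types acceptable per tier, per the sample policy matrix above.
ALLOWED_ENDPOINTS = {
    "public": {"enterprise", "self-hosted", "external"},
    "internal": {"enterprise", "self-hosted"},
    "confidential": {"enterprise", "self-hosted"},
    "restricted": {"self-hosted"},
}

def find_anomalies(events: list[AIUsageEvent]) -> list[AIUsageEvent]:
    """Return events where the data's tier disallows the endpoint used."""
    return [
        e for e in events
        if e.endpoint_type not in ALLOWED_ENDPOINTS.get(e.classification, set())
    ]

events = [AIUsageEvent("copilot", "external", "confidential")]
for e in find_anomalies(events):
    print(f"ALERT: {e.classification} data sent to {e.endpoint_type} endpoint via {e.tool}")
```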
5. Exception handling
No framework covers every case. Exception processes handle legitimate deviations:
- Request process: How do you request an exception?
- Approval authority: Who can approve exceptions at each tier?
- Documentation: What records must be kept?
- Expiration: When does the exception expire?
- Review: How are exceptions audited?
Exception rates are a health metric. Too many exceptions indicate either overly restrictive policies or insufficient tooling. Too few exceptions (or informal workarounds) indicate the process is being bypassed.
Practical classification for codebases
Applying classification to code repositories presents unique challenges.
Repository-level vs file-level classification
Some organizations classify entire repositories at a single tier. This simplifies enforcement: the repository classification determines AI tool policies for everything in it.
Other organizations classify at the file or directory level.
A repository might be Internal overall, with a secrets/ directory classified Restricted and an api/public/ directory classified Public.
Repository-level classification:
- Simpler to manage and audit
- Prevents classification creep within projects
- May over-classify (Public docs in a Confidential repo)
File-level classification:
- More precise, potentially lower friction
- Requires more sophisticated tooling
- Risk of misclassification at boundaries
Hybrid approaches are common: repository-level defaults with directory-level overrides for known exceptions.
Classification in practice
Using CODEOWNERS or classification files:
```yaml
# .classification.yml (custom convention)
default: internal
paths:
  - pattern: "docs/public/**"
    classification: public
  - pattern: "src/auth/**"
    classification: confidential
  - pattern: ".env*"
    classification: restricted
  - pattern: "**/*.pem"
    classification: restricted
```

Such files can integrate with CI pipelines to enforce classification-appropriate checks.
Connecting to exclusion patterns:
Classification files can drive tool configuration. Restricted paths map to Claude Code permissions.deny patterns. Confidential paths trigger additional review requirements. The classification becomes the source of truth; tool configurations derive from it.
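A sketch of that derivation, assuming the .classification.yml convention shown earlier and the PyYAML library. The output mirrors Claude Code's permissions.deny structure; verify the exact rule syntax against current tool documentation before relying on it.

```python
import json
import yaml  # PyYAML

def derive_deny_rules(classification_file: str) -> dict:
    """Generate a Claude Code permissions block from classification data.

    Restricted paths become deny rules so the tool cannot read them.
    The rule syntax here is an assumption; check current docs.
    """
    with open(classification_file) as f:
        config = yaml.safe_load(f)

    deny = [
        f"Read({entry['pattern']})"
        for entry in config.get("paths", [])
        if entry.get("classification") == "restricted"
    ]
    return {"permissions": {"deny": deny}}

settings = derive_deny_rules(".classification.yml")
print(json.dumps(settings, indent=2))
# Merge into .claude/settings.json so the tool config derives from
# the classification file rather than being maintained by hand.
```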
Legacy code without classification
Existing codebases rarely have explicit classification. Retroactive classification is time-consuming but necessary for consistent policy enforcement.
Pragmatic approach:
- Default everything to Internal
- Use automated scanning to identify likely Restricted content (secrets, PII)
- Have teams classify their own repositories during a defined window
- Apply Confidential classification to repositories containing customer data, authentication logic, or competitive IP
Classification does not need perfect accuracy initially. It needs a starting point for improvement.
Integration with existing frameworks
Classification for AI tools should align with (not replace) existing data governance frameworks.
NIST impact levels: NIST uses Low, Moderate, and High impact categories. These map roughly to Internal, Confidential, and Restricted, with Public having no NIST equivalent (public data is out of scope for protection).
Industry regulations:
- PCI-DSS defines cardholder data as requiring specific protections
- HIPAA defines PHI handling requirements
- GDPR defines personal data processing constraints
Regulated data categories should map to Confidential or Restricted tiers automatically. The classification framework incorporates regulatory requirements rather than duplicating them.
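A minimal sketch of that incorporation, assuming the tier names above; the regulatory category labels and floor assignments are illustrative:

```python
# Minimum (floor) tier for data matching a regulated category.
# Category names and floors are illustrative assumptions.
REGULATORY_FLOOR = {
    "pci-cardholder-data": "restricted",
    "hipaa-phi": "restricted",
    "gdpr-personal-data": "confidential",
}

TIER_ORDER = ["public", "internal", "confidential", "restricted"]

def apply_regulatory_floor(assigned: str, categories: list[str]) -> str:
    """Raise an assigned classification to meet any regulatory floor."""
    candidates = [assigned] + [REGULATORY_FLOOR[c] for c in categories]
    return max(candidates, key=TIER_ORDER.index)

# Internal data found to contain GDPR personal data is raised:
print(apply_regulatory_floor("internal", ["gdpr-personal-data"]))  # -> confidential
```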
Framework adoption
Frameworks fail when they exist only in policy documents. Successful adoption requires:
- Training: Developers understand classification criteria and their responsibilities
- Tooling: Classification decisions are enforced technically, not just procedurally
- Visibility: Classification status is visible in repositories, dashboards, and development tools
- Iteration: Classifications are reviewed and updated as data usage changes
The framework should reduce friction, not add bureaucracy. When developers find classification helpful for making AI tool decisions, adoption succeeds. When classification is a checkbox exercise disconnected from daily work, it fails.