# Data Classification Skill

## Overview

Comprehensive data classification framework for identifying, tagging, and managing data based on sensitivity, regulatory requirements, and business context.

## Classification Framework

### Sensitivity Levels
- **PUBLIC**: Approved for public disclosure
- **INTERNAL**: Internal business use only
- **CONFIDENTIAL**: Sensitive business data
- **RESTRICTED**: Highly sensitive regulated data (PII, PHI, PCI)
### Data Domains

- **CUSTOMER**: Customer and prospect data
- **FINANCIAL**: Financial and accounting data
- **EMPLOYEE**: HR and employee information
- **PRODUCT**: Product and service data
- **OPERATIONAL**: System and operational data
### PII Categories

- **DIRECT_IDENTIFIERS**: Name, email, SSN, phone
- **QUASI_IDENTIFIERS**: Zip code, DOB, gender
- **SENSITIVE_ATTRIBUTES**: Health, race, religion, biometrics
- **FINANCIAL_DATA**: Credit cards, bank accounts, salary
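The three taxonomies above can be pinned down in code so downstream tooling rejects misspelled labels early. A minimal sketch using Python enums (the enum class names are illustrative, mirroring the lists above, not an existing library):

```python
from enum import Enum

class Sensitivity(Enum):
    PUBLIC = "PUBLIC"
    INTERNAL = "INTERNAL"
    CONFIDENTIAL = "CONFIDENTIAL"
    RESTRICTED = "RESTRICTED"

class DataDomain(Enum):
    CUSTOMER = "CUSTOMER"
    FINANCIAL = "FINANCIAL"
    EMPLOYEE = "EMPLOYEE"
    PRODUCT = "PRODUCT"
    OPERATIONAL = "OPERATIONAL"

class PIICategory(Enum):
    DIRECT_IDENTIFIERS = "DIRECT_IDENTIFIERS"
    QUASI_IDENTIFIERS = "QUASI_IDENTIFIERS"
    SENSITIVE_ATTRIBUTES = "SENSITIVE_ATTRIBUTES"
    FINANCIAL_DATA = "FINANCIAL_DATA"

# Round-tripping through the enum validates a tag value read from metadata:
# Sensitivity("RESTRICTED") succeeds; Sensitivity("SECRET") raises ValueError.
```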
## Classification Methods
### 1. Schema-Based Classification

```python
def classify_by_schema(column_name: str, data_type: str) -> dict:
    """Classify based on column name and type."""
    classification = {"sensitivity": "INTERNAL"}
    col_lower = column_name.lower()

    if any(x in col_lower for x in ['ssn', 'social_security', 'tax_id']):
        classification = {"sensitivity": "RESTRICTED", "pii": "DIRECT_IDENTIFIERS"}
    elif any(x in col_lower for x in ['email', 'phone', 'address']):
        classification = {"sensitivity": "CONFIDENTIAL", "pii": "DIRECT_IDENTIFIERS"}
    elif any(x in col_lower for x in ['salary', 'credit_card', 'bank_account']):
        classification = {"sensitivity": "RESTRICTED", "pii": "FINANCIAL_DATA"}

    return classification
```
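One way to keep a growing rule set maintainable is to express the keyword-to-classification mapping as data rather than branching code. A sketch under that assumption (the rule table mirrors the keywords above; the ordering, most sensitive first, is a design choice so the first match wins safely):

```python
# Rules are evaluated in order; first keyword match wins, so the most
# sensitive patterns are listed first.
SCHEMA_RULES = [
    (('ssn', 'social_security', 'tax_id'),
     {"sensitivity": "RESTRICTED", "pii": "DIRECT_IDENTIFIERS"}),
    (('salary', 'credit_card', 'bank_account'),
     {"sensitivity": "RESTRICTED", "pii": "FINANCIAL_DATA"}),
    (('email', 'phone', 'address'),
     {"sensitivity": "CONFIDENTIAL", "pii": "DIRECT_IDENTIFIERS"}),
]

def classify_by_schema_rules(column_name: str) -> dict:
    """Data-driven variant: new rules are added to the table, not the code."""
    col_lower = column_name.lower()
    for keywords, classification in SCHEMA_RULES:
        if any(k in col_lower for k in keywords):
            return classification
    return {"sensitivity": "INTERNAL"}
```

With this shape, loading the rules from a `classification-rules.yaml`-style file becomes a one-line change instead of a code edit.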
### 2. Content-Based Classification

```python
import re

def classify_by_content(table: str, column: str, sample_rate: float = 0.01) -> list:
    """Sample data content for classification."""
    # Assumes an active SparkSession bound to `spark`.
    samples = spark.table(table).sample(sample_rate).select(column).limit(100)

    pii_patterns = {
        'SSN': r'\b\d{3}-?\d{2}-?\d{4}\b',
        'EMAIL': r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
        'CREDIT_CARD': r'\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b',
        'PHONE': r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b'
    }

    detected_pii = []
    for row in samples.collect():
        value = str(row[0])
        for pii_type, pattern in pii_patterns.items():
            # re.search (not re.match) so PII embedded mid-string is found.
            if re.search(pattern, value):
                detected_pii.append(pii_type)
    return list(set(detected_pii))
```
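The regex layer can be exercised offline, without Spark, which makes the patterns easy to unit-test before pointing them at real tables. A minimal sketch that factors the detection loop out over plain Python values (`detect_pii` is an illustrative helper, not part of the skill above):

```python
import re

# Same patterns as the content-based classifier above.
PII_PATTERNS = {
    'SSN': r'\b\d{3}-?\d{2}-?\d{4}\b',
    'EMAIL': r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
    'CREDIT_CARD': r'\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b',
    'PHONE': r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b',
}

def detect_pii(values) -> list:
    """Return the sorted set of PII types found anywhere in the values."""
    found = set()
    for value in values:
        for pii_type, pattern in PII_PATTERNS.items():
            if re.search(pattern, str(value)):
                found.add(pii_type)
    return sorted(found)

detect_pii(["alice@example.com", "123-45-6789"])  # -> ['EMAIL', 'SSN']
```

Note that broad patterns like these trade precision for recall (a 9-digit order number can look like an SSN), which is one reason the best practices below call for human validation.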
### 3. Tag Application

```sql
-- Create classification tags
CREATE TAG governance.sensitivity;
CREATE TAG governance.data_domain;
CREATE TAG governance.pii_category;

-- Apply to catalog
ALTER CATALOG production SET TAGS ('governance.data_domain' = 'CUSTOMER');

-- Apply to table
ALTER TABLE production.customers.profiles SET TAGS (
  'governance.sensitivity' = 'RESTRICTED',
  'governance.pii_category' = 'DIRECT_IDENTIFIERS'
);

-- Apply to column
ALTER TABLE production.customers.profiles
ALTER COLUMN email SET TAGS ('governance.sensitivity' = 'CONFIDENTIAL');
```
## Automated Classification

```python
class DataClassifier:
    def __init__(self):
        self.classification_rules = self.load_rules()

    def classify_catalog(self, catalog: str):
        """Auto-classify entire catalog."""
        schemas = list_schemas(catalog)
        for schema in schemas:
            tables = list_tables(catalog, schema)
            for table in tables:
                self.classify_table(f"{catalog}.{schema}.{table}")

    def classify_table(self, table_name: str):
        """Classify table and columns."""
        table_info = get_table_info(table_name)

        # Table-level classification
        table_class = self.infer_table_classification(table_info)
        self.apply_table_tags(table_name, table_class)

        # Column-level classification
        for column in table_info.columns:
            column_class = self.classify_column(table_name, column)
            self.apply_column_tags(table_name, column.name, column_class)

    def classify_column(self, table_name: str, column) -> dict:
        """Classify individual column."""
        # Schema-based
        schema_class = classify_by_schema(column.name, column.type)

        # Content-based (if high confidence not achieved)
        if schema_class["sensitivity"] == "INTERNAL":
            content_class = classify_by_content(table_name, column.name)
            if content_class:
                # Content scan found PII the schema rules missed; escalate.
                return {"sensitivity": "CONFIDENTIAL", "pii": content_class[0]}
        return schema_class
```
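The classifier's flow can be demonstrated end to end with the metastore helpers replaced by an in-memory dict, which shows how column tags accumulate as it walks a table. Everything here (`Column`, `apply_column_tags`) is an illustrative stand-in, not a real catalog client:

```python
from dataclasses import dataclass

@dataclass
class Column:
    name: str
    type: str

# Stand-in for the tag store: fully-qualified column -> classification.
applied_tags = {}

def apply_column_tags(table: str, column: str, classification: dict):
    applied_tags[f"{table}.{column}"] = classification

def classify_by_schema(column_name: str, data_type: str) -> dict:
    """Trimmed-down version of the schema rules above."""
    col_lower = column_name.lower()
    if any(x in col_lower for x in ['ssn', 'social_security', 'tax_id']):
        return {"sensitivity": "RESTRICTED", "pii": "DIRECT_IDENTIFIERS"}
    if any(x in col_lower for x in ['email', 'phone', 'address']):
        return {"sensitivity": "CONFIDENTIAL", "pii": "DIRECT_IDENTIFIERS"}
    return {"sensitivity": "INTERNAL"}

# Walk a hypothetical table, tagging each column.
columns = [Column("email", "string"), Column("ssn", "string"),
           Column("signup_ts", "timestamp")]
for col in columns:
    apply_column_tags("production.customers.profiles", col.name,
                      classify_by_schema(col.name, col.type))
```

In the real pipeline the dict writes become `ALTER TABLE ... SET TAGS` statements, but the classification decisions are identical, so this harness is a cheap place to test rule changes.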
## Best Practices

- **Start with High-Value Data**: Classify PII and regulated data first
- **Automate Where Possible**: Use rules and ML for consistency
- **Human Validation**: Review automated classifications
- **Document Rationale**: Maintain classification decisions
- **Regular Re-classification**: Update when schemas change
- **Align with Policies**: Link classification to access/retention policies
## Templates

- `classification-rules.yaml`: Classification rule definitions
- `tag-taxonomy.sql`: Tag schema and values
- `classification-workflow.py`: Automated classification pipeline
## Examples

- `pii-classification`: PII detection and tagging
- `sensitivity-assignment`: Sensitivity level classification
- `compliance-mapping`: Map classifications to regulations