Delete sensitive data

Implement comprehensive data deletion procedures to protect user privacy and maintain regulatory compliance. Secure deletion processes ensure sensitive information is permanently removed while maintaining audit trails for legal requirements.

Deletion obligations arise under privacy regulations such as GDPR, CCPA, and HIPAA. Effective deletion procedures balance thorough data removal with operational efficiency and audit requirements.

Modern AI systems generate extensive data trails including training data, evaluation results, and user interactions. Comprehensive deletion procedures must address all data repositories, backups, and derived datasets to ensure complete privacy protection.

Data deletion workflow showing search, identification, verification, and audit logging processes

Data Identification and Discovery

Systematically identify all instances of sensitive data across your AI evaluation infrastructure. Comprehensive discovery ensures no data remnants are overlooked during deletion procedures.

  1. Search across data stores: Use identifiers, tags, and metadata to locate all instances of sensitive data.

  2. Map data dependencies: Identify derived datasets, cached results, and system logs containing the data.

  3. Check backup systems: Ensure backup and archive systems are included in deletion scope.

  4. Validate discovery completeness: Use multiple search methods to confirm all data instances are identified.
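
Step 2 (mapping data dependencies) can be sketched as a graph traversal. The example below is a minimal illustration, not part of the evaligo SDK: the `collect_dependents` function, the `lineage` mapping, and the artifact ID format are all hypothetical, and it assumes you already record which artifacts were derived from which.

```python
from collections import deque
from typing import Dict, List, Set


def collect_dependents(derived_from: Dict[str, List[str]], roots: List[str]) -> Set[str]:
    """Return the roots plus every artifact derived from them, directly or transitively.

    `derived_from` maps an artifact ID to the IDs of artifacts built from it
    (hypothetical structure; adapt it to however your lineage is recorded).
    """
    seen: Set[str] = set(roots)
    queue = deque(roots)
    while queue:
        current = queue.popleft()
        for child in derived_from.get(current, []):
            if child not in seen:  # guards against cycles and repeated work
                seen.add(child)
                queue.append(child)
    return seen


# Hypothetical lineage: a trace feeds a dataset, which feeds a cached
# evaluation result and a second dataset.
lineage = {
    "trace:t1": ["dataset:d1"],
    "dataset:d1": ["eval_cache:e1", "dataset:d2"],
    "dataset:d2": [],
}

scope = collect_dependents(lineage, ["trace:t1"])
print(sorted(scope))
```

Starting from the directly matched items and following derivation edges puts cached results and derived datasets in the deletion scope even when they no longer contain the original identifier in searchable form.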

Comprehensive data discovery
import evaligo
from evaligo.privacy import DataDeletionManager
from typing import Dict, List

class SensitiveDataDiscovery:
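    """Discovers all instances of sensitive data across the platform.

    Helper methods that are called but not defined in this example
    (the remaining _search_*_by_identifier methods, _deduplicate_results,
    _generate_discovery_report, _identify_sensitive_spans, and the
    cross-validation helpers) are left to the implementer.
    """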
    
    def __init__(self, client: evaligo.Client):
        self.client = client
        self.deletion_manager = DataDeletionManager(client)
        
    def discover_user_data(self, user_identifiers: List[str]) -> Dict:
        """Discover all data associated with specific user identifiers"""
        
        discovered_data = {
            'traces': [],
            'datasets': [],
            'experiments': [],
            'evaluations': [],
            'logs': [],
            'backups': []
        }
        
        for identifier in user_identifiers:
            # Search traces
            traces = self._search_traces_by_identifier(identifier)
            discovered_data['traces'].extend(traces)
            
            # Search datasets
            datasets = self._search_datasets_by_identifier(identifier)
            discovered_data['datasets'].extend(datasets)
            
            # Search experiments
            experiments = self._search_experiments_by_identifier(identifier)
            discovered_data['experiments'].extend(experiments)
            
            # Search evaluation results
            evaluations = self._search_evaluations_by_identifier(identifier)
            discovered_data['evaluations'].extend(evaluations)
            
            # Search system logs
            logs = self._search_logs_by_identifier(identifier)
            discovered_data['logs'].extend(logs)
            
            # Search backup systems
            backups = self._search_backups_by_identifier(identifier)
            discovered_data['backups'].extend(backups)
        
        # Remove duplicates and validate
        discovered_data = self._deduplicate_results(discovered_data)
        discovery_report = self._generate_discovery_report(discovered_data, user_identifiers)
        
        return {
            'discovered_data': discovered_data,
            'discovery_report': discovery_report,
            'total_items': sum(len(items) for items in discovered_data.values())
        }
    
    def _search_traces_by_identifier(self, identifier: str) -> List[Dict]:
        """Search traces containing the identifier"""
        
        search_queries = [
            f"user_id:{identifier}",
            f"email:{identifier}",
            f"session_id:{identifier}",
            f"metadata.user_identifier:{identifier}"
        ]
        
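        found_traces = []
        
        for query in search_queries:
            traces = self.client.traces.search(
                query=query,
                include_metadata=True,
                include_spans=True
            )
            
            for trace in traces:
                # Keep every trace the query returned. The content check only
                # records how the identifier was matched; it must not filter
                # results, or traces matched on metadata alone would be missed.
                in_content = self._contains_identifier_in_content(trace, identifier)
                found_traces.append({
                    'trace_id': trace.id,
                    'timestamp': trace.timestamp,
                    'project_id': trace.project_id,
                    'match_method': 'content_search' if in_content else 'metadata_query',
                    'sensitive_spans': self._identify_sensitive_spans(trace, identifier)
                })
        
        # A trace matched by several queries appears more than once here;
        # discover_user_data removes duplicates via _deduplicate_results.
        return found_traces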
    
    def _contains_identifier_in_content(self, trace, identifier: str) -> bool:
        """Check if identifier appears in trace content"""
        
        # Check inputs and outputs
        for span in trace.spans:
            if identifier in str(span.input) or identifier in str(span.output):
                return True
            
            # Check metadata
            if span.metadata and identifier in str(span.metadata):
                return True
        
        return False
    
    def validate_discovery_completeness(self, identifiers: List[str]) -> Dict:
        """Validate that data discovery is complete using multiple methods"""
        
        validation_results = {
            'method_agreement': {},
            'potential_gaps': [],
            'confidence_score': 0.0
        }
        
        # Run discovery using different methods
        primary_results = self.discover_user_data(identifiers)
        
        # Cross-validation with alternative search methods
        secondary_results = self._alternative_discovery_method(identifiers)
        
        # Compare results
        agreement_score = self._calculate_method_agreement(
            primary_results, secondary_results
        )
        
        validation_results['method_agreement'] = agreement_score
        validation_results['confidence_score'] = min(agreement_score['overall'], 1.0)
        
        # Identify potential gaps
        if agreement_score['overall'] < 0.95:
            validation_results['potential_gaps'] = self._identify_discovery_gaps(
                primary_results, secondary_results
            )
        
        return validation_results

# Usage example
# Assumes `client` is an already configured evaligo.Client instance
discovery = SensitiveDataDiscovery(client)

# Discover all data for a user deletion request
user_identifiers = [
    "user123@example.com",
    "session_abc123", 
    "customer_id_456"
]

discovery_results = discovery.discover_user_data(user_identifiers)
print(f"Discovered {discovery_results['total_items']} data items")

# Validate discovery completeness
validation = discovery.validate_discovery_completeness(user_identifiers)
print(f"Discovery confidence: {validation['confidence_score']:.1%}")
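
The example above calls `_calculate_method_agreement` and `_identify_discovery_gaps` without showing them. One way to sketch them, under assumptions, is Jaccard overlap between the item IDs each method found. The function names, the input shape (category name mapped to a set of item IDs), and the choice of Jaccard similarity are illustrative and are not the platform's confirmed implementation.

```python
from typing import Dict, Set


def calculate_method_agreement(
    primary: Dict[str, Set[str]], secondary: Dict[str, Set[str]]
) -> Dict[str, float]:
    """Jaccard overlap per data category, plus an 'overall' score across all items."""
    scores: Dict[str, float] = {}
    all_primary: Set[str] = set()
    all_secondary: Set[str] = set()
    for category in sorted(set(primary) | set(secondary)):
        a = primary.get(category, set())
        b = secondary.get(category, set())
        # Prefix IDs with the category so equal IDs in different stores stay distinct
        all_primary |= {f"{category}:{item}" for item in a}
        all_secondary |= {f"{category}:{item}" for item in b}
        union = a | b
        scores[category] = len(a & b) / len(union) if union else 1.0
    overall_union = all_primary | all_secondary
    scores["overall"] = (
        len(all_primary & all_secondary) / len(overall_union) if overall_union else 1.0
    )
    return scores


def identify_discovery_gaps(
    primary: Dict[str, Set[str]], secondary: Dict[str, Set[str]]
) -> Dict[str, Set[str]]:
    """Items found by exactly one of the two methods, grouped by category."""
    gaps: Dict[str, Set[str]] = {}
    for category in set(primary) | set(secondary):
        only_one = primary.get(category, set()) ^ secondary.get(category, set())
        if only_one:
            gaps[category] = only_one
    return gaps


# Hypothetical results from two discovery methods
primary_ids = {"traces": {"t1", "t2", "t3"}, "datasets": {"d1"}}
secondary_ids = {"traces": {"t1", "t2"}, "datasets": {"d1"}}

agreement = calculate_method_agreement(primary_ids, secondary_ids)
gaps = identify_discovery_gaps(primary_ids, secondary_ids)
print(f"Overall agreement: {agreement['overall']:.0%}")
print(f"Gaps: {gaps}")
```

Any item that only one method found is worth reviewing by hand before deletion proceeds, since it points to either a missed record or a false match.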

Secure Deletion Execution

Execute secure deletion procedures that permanently remove sensitive data while maintaining integrity of remaining systems. Implement verification mechanisms to ensure deletion completeness.

Info

Irreversible Process: Data deletion is irreversible. Verify deletion requests are legitimate and necessary before execution. Consider data export options for legitimate business needs.
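
A minimal sketch of the two safeguards described here, an approval gate before deletion and a verification pass afterwards. It uses a hypothetical in-memory store rather than the evaligo API, so `InMemoryStore` and `delete_and_verify` are illustrative names only.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class InMemoryStore:
    """Hypothetical stand-in for one data store: item ID mapped to content."""
    name: str
    items: Dict[str, str] = field(default_factory=dict)

    def find(self, identifier: str) -> List[str]:
        return [item_id for item_id, content in self.items.items() if identifier in content]

    def delete(self, item_id: str) -> None:
        self.items.pop(item_id, None)


def delete_and_verify(
    stores: List[InMemoryStore], identifier: str, approved: bool
) -> Dict[str, object]:
    """Delete every item containing `identifier`, then confirm nothing remains."""
    # Gate: deletion is irreversible, so refuse to run without explicit approval
    if not approved:
        raise PermissionError("Deletion request has not been approved")

    deleted: Dict[str, List[str]] = {}
    for store in stores:
        matches = store.find(identifier)
        for item_id in matches:
            store.delete(item_id)
        deleted[store.name] = matches

    # Verification: rerun the search; anything still found is a remnant
    remnants = {store.name: store.find(identifier) for store in stores}
    remnants = {name: ids for name, ids in remnants.items() if ids}
    return {"deleted": deleted, "remnants": remnants, "verified": not remnants}


traces = InMemoryStore("traces", {"t1": "user123@example.com asked a question", "t2": "unrelated"})
logs = InMemoryStore("logs", {"l1": "login by user123@example.com"})

result = delete_and_verify([traces, logs], "user123@example.com", approved=True)
print(f"Verified: {result['verified']}, deleted: {result['deleted']}")
```

Rerunning the same search after deletion is the simplest verification. A stricter check would rerun the full discovery procedure, including backups and derived datasets.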

Video

Secure Data Deletion Procedures
Learn how to implement secure, auditable data deletion that meets regulatory requirements.
9m 30s

Audit Trail and Compliance Documentation

Maintain comprehensive audit trails of all deletion activities to demonstrate compliance with privacy regulations and organizational policies. Proper documentation protects against legal challenges and regulatory inquiries.

Audit trail dashboard showing deletion requests, verification status, and compliance reporting

Info

Retention Requirements: While user data is deleted, maintain audit logs of deletion activities as required by regulations. These logs should not contain the actual sensitive data that was deleted.
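
One way to keep an audit entry free of the deleted data is to store a keyed hash of each identifier instead of the identifier itself. The sketch below uses only the Python standard library. The record fields, the `build_audit_record` and `pseudonymize` names, and the example key are assumptions for illustration, and whether a keyed hash satisfies your retention rules depends on the regulation that applies.

```python
import hashlib
import hmac
import json
from datetime import datetime, timezone
from typing import Dict, List


def pseudonymize(identifier: str, secret_key: bytes) -> str:
    """Keyed SHA-256 hash so the raw identifier never reaches the log."""
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def build_audit_record(
    request_id: str,
    identifiers: List[str],
    deleted_counts: Dict[str, int],
    secret_key: bytes,
) -> str:
    """Serialize one deletion event as JSON with counts and hashed subject references only."""
    record = {
        "request_id": request_id,
        "completed_at": datetime.now(timezone.utc).isoformat(),
        "subject_refs": [pseudonymize(i, secret_key) for i in identifiers],
        "deleted_counts": deleted_counts,
    }
    return json.dumps(record, sort_keys=True)


# Hypothetical values; keep the real key in a secrets manager, not in source code
entry = build_audit_record(
    request_id="req-001",
    identifiers=["user123@example.com"],
    deleted_counts={"traces": 4, "logs": 2},
    secret_key=b"example-key-do-not-use",
)
print(entry)
```

The same key and identifier always produce the same reference, so a later inquiry about a specific user can be matched against the log without the log ever storing the identifier.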

Related Documentation

Audit log: Maintain comprehensive logs of deletion activities
Compliance & policies: Understand regulatory requirements for data deletion
API keys: Secure access controls for deletion operations
Query traces: Search and identify traces containing sensitive data