Data Masking and Pseudonymization
Data masking refers to the process of obscuring or replacing sensitive data with realistic test data to ensure data protection in non-production environments (development, testing, staging). Methods: static masking (copying with replacement data), dynamic masking (on-the-fly for database queries), format-preserving encryption (FPE), tokenization. Difference from anonymization: Masking is often reversible.
Data Masking refers to techniques in which sensitive production data is replaced with realistic but fictitious substitute data. The goal: to allow developers, testers, and external partners to work with realistic data structures without ever seeing actual production data.
The Problem: Real Data in Non-Production Environments
In many companies, the production database containing 100,000 customer records is simply transferred as an exact dump to the development environment—and from there on to staging and external service providers. This means that every developer has access to real names, email addresses, and IBANs.
This practice creates several serious problems:
- GDPR violation: Developers do not need real customer data for their work
- Data breach in case of device theft: A developer’s stolen laptop immediately becomes a data leak
- External contractors: Service providers unintentionally gain access to real personal information
- Faulty test apps: A test instance accidentally made publicly accessible immediately exposes real data
Real-world incidents confirm this pattern: Developers publish test databases on GitHub and expose real customer data; staging servers without password protection are indexed by Google and display pages containing real PII; contractors download test exports and leave the company with customer data.
Comparison of Masking Methods
Method 1 - Static Data Masking (SDM)
With static masking, a one-time copy of the production database is created and then masked. The process follows this flow: Production DB → Copy → Masking Process → Development DB.
The procedure involves three steps: identifying all PII columns (name, email, IBAN, SSN, etc.), replacing each column with realistic fake data, and maintaining referential integrity (foreign keys must remain consistent).
Advantages: Easy to implement, has no impact on performance during operation. Disadvantages: The copy must be updated regularly; the masking process itself is time-consuming.
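The three-step flow above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the table, the PII column list, and the fake-name list are all hypothetical, and a real implementation would read its rules from a masking configuration. The key idea shown is deterministic (seeded) replacement, so the same real value always maps to the same fake value and joins across tables stay consistent.

```python
import hashlib
import random

# Hypothetical PII columns of a customers table; in practice these come
# from a masking configuration, not hard-coded constants.
PII_COLUMNS = {"name", "email"}
FAKE_NAMES = ["Anna Schmidt", "Jonas Weber", "Lena Fischer"]

def mask_value(column, value):
    """Derive a fake value deterministically from the real one: the real
    value seeds an RNG, so identical inputs always produce identical
    outputs (preserves referential integrity across tables)."""
    seed = int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    if column == "name":
        return rng.choice(FAKE_NAMES)
    if column == "email":
        return f"user{rng.randrange(10**8):08d}@example.com"
    return value

def mask_table(rows):
    """Steps 2 and 3 of static masking: replace every PII column in a
    copied table, leave non-PII columns (e.g. keys) untouched."""
    return [
        {col: mask_value(col, val) if col in PII_COLUMNS else val
         for col, val in row.items()}
        for row in rows
    ]

copy_of_production = [
    {"id": 1, "name": "Max Mustermann", "email": "max@muster.de"},
    {"id": 2, "name": "Erika Musterfrau", "email": "erika@muster.de"},
]
masked = mask_table(copy_of_production)
```

Note that the masked copy keeps the primary keys intact, which is what keeps foreign-key relationships working in the development database.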
Method 2 - Dynamic Data Masking (DDM)
With dynamic masking, data is masked on-the-fly during queries without requiring a separate copy of the database. Depending on the user or role, a different view of the data is provided.
Advantages: No separate database required, role-based control possible. Disadvantages: Cannot cover all masking requirements, potential performance impacts.
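Dynamic masking normally lives inside the database engine, but the role-based idea can be sketched at the application layer. The role names and masking rules below are illustrative assumptions, not part of any particular product:

```python
# Role-based on-the-fly masking at the query layer (illustrative sketch).
MASK_RULES = {
    "developer": {
        "email": lambda v: v[0] + "***@***",                # keep first char
        "iban": lambda v: v[:4] + "*" * (len(v) - 4),       # keep country+check
    },
    "admin": {},  # admins see unmasked data
}

def fetch_masked(rows, role):
    """Apply the masking rules for `role` to every row on the fly; the
    stored data itself is never modified."""
    rules = MASK_RULES.get(role, {})
    return [
        {col: rules[col](val) if col in rules else val
         for col, val in row.items()}
        for row in rows
    ]

rows = [{"id": 7, "email": "max@muster.de", "iban": "DE89370400440532013000"}]
```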
Method 3 - Format-Preserving Encryption (FPE)
FPE encrypts data in a way that preserves the format. A credit card number 4111-1111-1111-1111 becomes 7823-4729-1847-2938—the format (length, character sets) is preserved, so the application continues to function correctly. The process is reversible using the same key.
Usage: Payment systems (PCI-DSS compliance), medical records.
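To make the "reversible, format-preserving" property concrete, here is a toy Feistel-network cipher over digit strings. This is purely illustrative, assuming an even-length all-digit input; production FPE uses the NIST-standardized FF1/FF3-1 constructions, not this sketch:

```python
import hashlib
import hmac

# Toy format-preserving cipher over even-length digit strings using a
# 4-round Feistel network keyed with HMAC-SHA256. Illustrative only.
def _round_value(key, rnd, value, width):
    mac = hmac.new(key, f"{rnd}:{value}".encode(), hashlib.sha256).hexdigest()
    return int(mac, 16) % (10 ** width)

def fpe_encrypt(key, digits, rounds=4):
    n = len(digits)
    assert n % 2 == 0 and digits.isdigit()
    w = n // 2
    l, r = int(digits[:w]), int(digits[w:])
    for i in range(rounds):
        l, r = r, (l + _round_value(key, i, r, w)) % 10 ** w
    return f"{l:0{w}d}{r:0{w}d}"

def fpe_decrypt(key, digits, rounds=4):
    w = len(digits) // 2
    l, r = int(digits[:w]), int(digits[w:])
    for i in reversed(range(rounds)):
        l, r = (r - _round_value(key, i, l, w)) % 10 ** w, l
    return f"{l:0{w}d}{r:0{w}d}"

card = "4111-1111-1111-1111".replace("-", "")
ciphertext = fpe_encrypt(b"demo-key", card)  # 16 digits in, 16 digits out
```

The output has the same length and character set as the input, and decrypting with the same key recovers the original number.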
Method 4 - Tokenization
In tokenization, the actual value is replaced by a token (a meaningless substitute). The mapping between the token and the actual value is stored in a separate vault. The application works exclusively with tokens.
Difference from FPE: The token is random and has no mathematical relationship to the original value; it can only be reversed via a lookup in the vault, not with a cryptographic key. Applications: Credit card numbers (PCI-DSS), health records.
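A minimal tokenization sketch, assuming an in-memory dict as the vault (a real deployment would use an encrypted, access-controlled vault database):

```python
import secrets

class TokenVault:
    """Minimal tokenization sketch: tokens are random and carry no
    information about the value; the mapping lives only in the vault."""
    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value):
        if value in self._value_to_token:      # same value -> same token
            return self._value_to_token[value]
        token = "tok_" + secrets.token_hex(8)  # random, unrelated to value
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token):
        """Only the vault can map a token back to the real value."""
        return self._token_to_value[token]

vault = TokenVault()
t = vault.tokenize("4111111111111111")  # application works only with t
```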
Masking Techniques for Different Data Types
Name
- Random name from a list of names (same gender, same country)
- First name from a list of common German first names (file with 10,000 names)
- Consistency: same name → same masked name (seeded random)
Email
- Format: vorname.nachname@example-domain.de (with masked name)
- Alternative: Hash of the email address + @example.com for uniqueness
- Important: Use test domains (example.com, test.local) – no actual email delivery!
Phone Number
- Preserve format: +49 1XX XXXXXXX → +49 170 12345678 (fictitious)
- Keep the local area code or randomize it
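A small sketch of format-preserving phone masking, keeping a prefix (here assumed to be the first seven characters, i.e. country code plus area code) and deterministically randomizing the rest:

```python
import hashlib
import random

def mask_phone(number, keep_prefix=7):
    """Keep the leading `keep_prefix` characters (country code + area
    code in this illustrative layout) and replace the remaining digits.
    Seeding the RNG from the input makes the mapping deterministic."""
    seed = int.from_bytes(hashlib.sha256(number.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    prefix, rest = number[:keep_prefix], number[keep_prefix:]
    masked = "".join(rng.choice("0123456789") if ch.isdigit() else ch
                     for ch in rest)  # keep spaces/punctuation in place
    return prefix + masked
```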
IBAN
- Format-preserving: DE + 2-digit check digit + 18-digit BBAN
- Generate new IBAN with correct check digit
- Never retain real IBANs!
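Generating a syntactically valid replacement IBAN is straightforward with the standard mod-97 check-digit scheme. The sketch below builds a German IBAN from a random bank code and account number; note that a randomly chosen 8-digit bank code could coincide with a real BLZ, which a production masker would avoid:

```python
import random

def generate_fake_iban(rng=None):
    """Build a syntactically valid German IBAN (DE + 2 check digits +
    18-digit BBAN) with a correct mod-97 check digit."""
    rng = rng or random.Random()
    blz = f"{rng.randrange(10**8):08d}"        # 8-digit bank code
    account = f"{rng.randrange(10**10):010d}"  # 10-digit account number
    bban = blz + account
    # Check digits: append 'DE00' mapped to numbers (D=13, E=14, 0, 0),
    # then take 98 minus the remainder mod 97.
    check = 98 - int(bban + "131400") % 97
    return f"DE{check:02d}{bban}"

def iban_is_valid(iban):
    """Standard mod-97 check: move the first 4 chars to the end, map
    letters to numbers (A=10 ... Z=35), remainder must be 1."""
    rearranged = iban[4:] + iban[:4]
    numeric = "".join(str(int(c, 36)) for c in rearranged)
    return int(numeric) % 97 == 1
```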
IP Addresses
- Same subnet range, but different host octet
- Or: Map to RFC-1918 ranges (all public IPs → private)
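The RFC 1918 mapping can be sketched as follows; the policy shown (network part hashed into 10.0.0.0/8, host octet kept) is one illustrative choice among several:

```python
import hashlib
import ipaddress

def mask_ip(ip_str):
    """Map an IPv4 address deterministically into the private
    10.0.0.0/8 range (RFC 1918), keeping the last (host) octet so
    host-level patterns in logs remain analyzable."""
    ipaddress.IPv4Address(ip_str)  # validates the input format
    octets = ip_str.split(".")
    # Hash the network part so addresses from the same subnet land in
    # the same masked subnet.
    digest = hashlib.sha256(".".join(octets[:3]).encode()).digest()
    return f"10.{digest[0]}.{digest[1]}.{octets[3]}"
```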
Date Fields
- Date of birth: Keep age, but randomize day/month
- Transaction date: Keep time differences (for analysis)
- “Offset masking”: shift all date fields by X days
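Offset masking amounts to adding the same delta to every date field, which hides absolute dates while keeping intervals intact. A minimal sketch over hypothetical order records:

```python
from datetime import date, timedelta

def offset_mask_dates(records, offset_days):
    """Shift every date field by the same number of days: absolute
    dates are obscured, but intervals between events (e.g. order ->
    shipment) are preserved for analysis."""
    delta = timedelta(days=offset_days)
    return [
        {k: (v + delta) if isinstance(v, date) else v for k, v in rec.items()}
        for rec in records
    ]

records = [{"id": 1, "ordered": date(2024, 3, 1), "shipped": date(2024, 3, 4)}]
masked = offset_mask_dates(records, offset_days=-90)
# The 3-day gap between "ordered" and "shipped" survives the shift.
```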
Free-text fields (comments, notes)
- Replace entirely with Lorem Ipsum
- Or: NLP-based detection of PII in free text and replacement
- Tool: presidio (Microsoft) automatically detects PII in text
Binary data (images, documents)
- Profile pictures: Replace with stock photo placeholders
- Documents: Replace with blank PDFs of the same size
Tooling and Implementation
Faker (Python/Node.js/PHP) is the most widely used open-source tool for generating realistic fake data:
```python
import hashlib

from faker import Faker

fake = Faker('de_DE')  # German locale fake data
fake.name()            # e.g. "Klaus Müller"
fake.email()           # e.g. "k.mueller@example.com"
fake.iban()            # e.g. "DE89370400440532013000" (correct check digit)
fake.phone_number()    # e.g. "+49 1522 0123456"
fake.address()         # e.g. "Hauptstraße 42, 10115 Berlin"
fake.date_of_birth(minimum_age=18, maximum_age=80)

# Seeded for consistency (same input → same output). Note: Python's
# built-in hash() is randomized per process, so derive the seed from a
# stable hash instead:
seed = int(hashlib.sha256(b"max@muster.de").hexdigest(), 16)
fake = Faker('de_DE')
fake.seed_instance(seed)
# → same seed → same fake name → referential integrity!
```
Presidio (Microsoft, Open Source) detects and replaces PII in free text:
```python
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

text = "Max Mustermann, max@muster.de, IBAN: DE89370400440532013000"
# Note: the default setup ships with an English NLP model; using
# language="de" requires configuring a German spaCy model first.
results = analyzer.analyze(text=text, language="de")
anonymized = anonymizer.anonymize(text=text, analyzer_results=results)
# "Max Mustermann" → "<PERSON>"
# "max@muster.de"  → "<EMAIL_ADDRESS>"
```
ARX is a Java-based open-source anonymization framework with k-anonymity and l-diversity implementation, GUI, and API—particularly suitable for statistical datasets and research data.
PostgreSQL Anonymizer Extension enables dynamic masking directly in the database:
```sql
CREATE EXTENSION IF NOT EXISTS anon;
SELECT anon.init();

-- Define masking rules:
SECURITY LABEL FOR anon ON COLUMN customers.name
  IS 'MASKED WITH FUNCTION anon.fake_last_name()';
SECURITY LABEL FOR anon ON COLUMN customers.email
  IS 'MASKED WITH FUNCTION anon.fake_email()';

-- For specific roles:
SECURITY LABEL FOR anon ON ROLE analyst IS 'MASKED';
-- analyst sees masked data, admin sees real data
```
GDPR and Data Masking
Data masking is a technical measure with a clear basis in data protection law:
- Art. 25 GDPR - Data protection by design: Data masking in non-production environments is a pseudonymization measure under Art. 25.
- Art. 32 GDPR - Technical and organizational measures (TOMs): Pseudonymization is explicitly listed as a TOM; masking in development environments must be documented.
- Art. 89 GDPR - Processing for research and statistics: Anonymization methods (k-anonymity, etc.) are used for research data.
Masking is not the same as anonymization
Pseudonymization (reversible masking): The link to an individual can be restored with the correct key or mapping. The data remains subject to the GDPR and may only be used internally.
Anonymization (irreversible masking): Personal identification is no longer possible—the GDPR no longer applies to this data. However, true anonymization is difficult to achieve due to re-identification risks.
Best Practices for GDPR-Compliant Test Environments
- Never use real production data in development or test environments
- Document the masking process (TOM list)
- Perform masking before transferring data to external service providers
- Verify masking quality: Conduct re-identification tests
- Alternatively: Generate synthetic test data (100% GDPR-neutral)
Synthetic Data as an Alternative
Synthetic data generation differs fundamentally from masking:
| Approach | Starting point | GDPR relevance |
|---|---|---|
| Masking | Real data → Replacement data | Data remains structurally derived from real data |
| Synthetic | Completely artificial data | No personal reference, no connection to real data |
Gretel.ai (cloud service) trains an ML model on real data and generates synthetic data with the same statistical distribution; done correctly, the synthetic records carry no direct personal reference (provided the model does not memorize individual records).
SDV (Synthetic Data Vault, Open Source) works similarly:
```python
# Note: this uses the pre-1.0 SDV API; newer releases moved to
# sdv.single_table.GaussianCopulaSynthesizer with a metadata object.
from sdv.tabular import GaussianCopula

model = GaussianCopula()
model.fit(real_data)                          # train on real data (one-time, controlled)
synthetic_data = model.sample(num_rows=1000)  # 1000 synthetic rows
# → no real data needed anymore!
```
When to use which approach:
- Masking: when tests with production data structure and volume are required
- Synthetic: for new projects, external partners, AI training, and demos