Data Masking and Pseudonymization
Data masking refers to the process of obscuring or replacing sensitive data with realistic test data to ensure data protection in non-production environments (development, testing, staging). Methods: static masking (copying with replacement data), dynamic masking (on-the-fly for database queries), format-preserving encryption (FPE), tokenization. Difference from anonymization: Masking is often reversible.
Data Masking refers to techniques in which sensitive production data is replaced with realistic but fictitious substitute data. The goal: to allow developers, testers, and external partners to work with realistic data structures without ever seeing actual production data.
The Problem: Real Data in Non-Production Environments
In many companies, the production database containing 100,000 customer records is simply transferred as an exact dump to the development environment—and from there on to staging and external service providers. This means that every developer has access to real names, email addresses, and IBANs.
This practice creates several serious problems:
- GDPR violation: Developers do not need real customer data for their work
- Data breach in case of device theft: A developer’s stolen laptop immediately becomes a data leak
- External contractors: Service providers unintentionally gain access to real personal information
- Faulty test apps: A test instance accidentally made publicly accessible immediately exposes real data
Real-world incidents confirm this pattern: Developers publish test databases on GitHub and expose real customer data; staging servers without password protection are indexed by Google and display pages containing real PII; contractors download test exports and leave the company with customer data.
Comparison of Masking Methods
Method 1 - Static Data Masking (SDM)
With static masking, a one-time copy of the production database is created and then masked. The process follows this flow: Production DB → Copy → Masking Process → Development DB.
The procedure involves three steps: identifying all PII columns (name, email, IBAN, SSN, etc.), replacing each column with realistic fake data, and maintaining referential integrity (foreign keys must remain consistent).
Advantages: Easy to implement, has no impact on performance during operation. Disadvantages: The copy must be updated regularly; the masking process itself is time-consuming.
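The three-step flow above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the table, the PII column list, and the fake-name list are all hypothetical, and a real implementation would read its rules from a masking configuration. The key idea shown is deterministic (seeded) replacement, so the same real value always maps to the same fake value and joins across tables stay consistent.

```python
import hashlib
import random

# Hypothetical PII columns of a customers table; in practice these come
# from a masking configuration, not hard-coded constants.
PII_COLUMNS = {"name", "email"}
FAKE_NAMES = ["Anna Schmidt", "Jonas Weber", "Lena Fischer"]

def mask_value(column, value):
    """Derive a fake value deterministically from the real one: the real
    value seeds an RNG, so identical inputs always produce identical
    outputs (preserves referential integrity across tables)."""
    seed = int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    if column == "name":
        return rng.choice(FAKE_NAMES)
    if column == "email":
        return f"user{rng.randrange(10**8):08d}@example.com"
    return value

def mask_table(rows):
    """Steps 2 and 3 of static masking: replace every PII column in a
    copied table, leave non-PII columns (e.g. keys) untouched."""
    return [
        {col: mask_value(col, val) if col in PII_COLUMNS else val
         for col, val in row.items()}
        for row in rows
    ]

copy_of_production = [
    {"id": 1, "name": "Max Mustermann", "email": "max@muster.de"},
    {"id": 2, "name": "Erika Musterfrau", "email": "erika@muster.de"},
]
masked = mask_table(copy_of_production)
```

Note that the masked copy keeps the primary keys intact, which is what keeps foreign-key relationships working in the development database.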
Method 2 - Dynamic Data Masking (DDM)
With dynamic masking, data is masked on-the-fly during queries without requiring a separate copy of the database. Depending on the user or role, a different view of the data is provided.
Advantages: No separate database required, role-based control possible. Disadvantages: Cannot cover all masking requirements, potential performance impacts.
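Dynamic masking normally lives inside the database engine, but the role-based idea can be sketched at the application layer. The role names and masking rules below are illustrative assumptions, not part of any particular product:

```python
# Role-based on-the-fly masking at the query layer (illustrative sketch).
MASK_RULES = {
    "developer": {
        "email": lambda v: v[0] + "***@***",                # keep first char
        "iban": lambda v: v[:4] + "*" * (len(v) - 4),       # keep country+check
    },
    "admin": {},  # admins see unmasked data
}

def fetch_masked(rows, role):
    """Apply the masking rules for `role` to every row on the fly; the
    stored data itself is never modified."""
    rules = MASK_RULES.get(role, {})
    return [
        {col: rules[col](val) if col in rules else val
         for col, val in row.items()}
        for row in rows
    ]

rows = [{"id": 7, "email": "max@muster.de", "iban": "DE89370400440532013000"}]
```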
Method 3 - Format-Preserving Encryption (FPE)
FPE encrypts data in a way that preserves the format. A credit card number 4111-1111-1111-1111 becomes 7823-4729-1847-2938—the format (length, character sets) is preserved, so the application continues to function correctly. The process is reversible using the same key.
Usage: Payment systems (PCI-DSS compliance), medical records.
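To make the "reversible, format-preserving" property concrete, here is a toy Feistel-network cipher over digit strings. This is purely illustrative, assuming an even-length all-digit input; production FPE uses the NIST-standardized FF1/FF3-1 constructions, not this sketch:

```python
import hashlib
import hmac

# Toy format-preserving cipher over even-length digit strings using a
# 4-round Feistel network keyed with HMAC-SHA256. Illustrative only.
def _round_value(key, rnd, value, width):
    mac = hmac.new(key, f"{rnd}:{value}".encode(), hashlib.sha256).hexdigest()
    return int(mac, 16) % (10 ** width)

def fpe_encrypt(key, digits, rounds=4):
    n = len(digits)
    assert n % 2 == 0 and digits.isdigit()
    w = n // 2
    l, r = int(digits[:w]), int(digits[w:])
    for i in range(rounds):
        l, r = r, (l + _round_value(key, i, r, w)) % 10 ** w
    return f"{l:0{w}d}{r:0{w}d}"

def fpe_decrypt(key, digits, rounds=4):
    w = len(digits) // 2
    l, r = int(digits[:w]), int(digits[w:])
    for i in reversed(range(rounds)):
        l, r = (r - _round_value(key, i, l, w)) % 10 ** w, l
    return f"{l:0{w}d}{r:0{w}d}"

card = "4111-1111-1111-1111".replace("-", "")
ciphertext = fpe_encrypt(b"demo-key", card)  # 16 digits in, 16 digits out
```

The output has the same length and character set as the input, and decrypting with the same key recovers the original number.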
Method 4 - Tokenization
In tokenization, the actual value is replaced by a token (a meaningless substitute). The mapping between the token and the actual value is stored in a separate vault. The application works exclusively with tokens.
Difference from FPE: The token is random and has no mathematical relationship to the original value; it can only be reversed via a lookup in the vault, not with a cryptographic key. Applications: Credit card numbers (PCI-DSS), health records.
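A minimal tokenization sketch, assuming an in-memory dict as the vault (a real deployment would use an encrypted, access-controlled vault database):

```python
import secrets

class TokenVault:
    """Minimal tokenization sketch: tokens are random and carry no
    information about the value; the mapping lives only in the vault."""
    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value):
        if value in self._value_to_token:      # same value -> same token
            return self._value_to_token[value]
        token = "tok_" + secrets.token_hex(8)  # random, unrelated to value
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token):
        """Only the vault can map a token back to the real value."""
        return self._token_to_value[token]

vault = TokenVault()
t = vault.tokenize("4111111111111111")  # application works only with t
```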
Masking Techniques for Different Data Types
Name
- Random name from a list of names (same gender, same country)
- First name from a list of common German first names (file with 10,000 names)
- Consistency: same name → same masked name (seeded random)
Email
- Format: vorname.nachname@example-domain.de (with masked name)
- Alternative: Hash of the email address + @example.com for uniqueness
- Important: Use test domains (example.com, test.local) – no actual email delivery!
Phone Number
- Preserve format: +49 1XX XXXXXXX → +49 170 12345678 (fictitious)
- Keep the local area code or randomize it
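A small sketch of format-preserving phone masking, keeping a prefix (here assumed to be the first seven characters, i.e. country code plus area code) and deterministically randomizing the rest:

```python
import hashlib
import random

def mask_phone(number, keep_prefix=7):
    """Keep the leading `keep_prefix` characters (country code + area
    code in this illustrative layout) and replace the remaining digits.
    Seeding the RNG from the input makes the mapping deterministic."""
    seed = int.from_bytes(hashlib.sha256(number.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    prefix, rest = number[:keep_prefix], number[keep_prefix:]
    masked = "".join(rng.choice("0123456789") if ch.isdigit() else ch
                     for ch in rest)  # keep spaces/punctuation in place
    return prefix + masked
```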
IBAN
- Format-preserving: DE + 2-digit check digit + 18-digit BBAN
- Generate new IBAN with correct check digit
- Never retain real IBANs!
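Generating a syntactically valid replacement IBAN is straightforward with the standard mod-97 check-digit scheme. The sketch below builds a German IBAN from a random bank code and account number; note that a randomly chosen 8-digit bank code could coincide with a real BLZ, which a production masker would avoid:

```python
import random

def generate_fake_iban(rng=None):
    """Build a syntactically valid German IBAN (DE + 2 check digits +
    18-digit BBAN) with a correct mod-97 check digit."""
    rng = rng or random.Random()
    blz = f"{rng.randrange(10**8):08d}"        # 8-digit bank code
    account = f"{rng.randrange(10**10):010d}"  # 10-digit account number
    bban = blz + account
    # Check digits: append 'DE00' mapped to numbers (D=13, E=14, 0, 0),
    # then take 98 minus the remainder mod 97.
    check = 98 - int(bban + "131400") % 97
    return f"DE{check:02d}{bban}"

def iban_is_valid(iban):
    """Standard mod-97 check: move the first 4 chars to the end, map
    letters to numbers (A=10 ... Z=35), remainder must be 1."""
    rearranged = iban[4:] + iban[:4]
    numeric = "".join(str(int(c, 36)) for c in rearranged)
    return int(numeric) % 97 == 1
```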
IP Addresses
- Same subnet range, but different host octet
- Or: Map to RFC-1918 ranges (all public IPs → private)
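The RFC 1918 mapping can be sketched as follows; the policy shown (network part hashed into 10.0.0.0/8, host octet kept) is one illustrative choice among several:

```python
import hashlib
import ipaddress

def mask_ip(ip_str):
    """Map an IPv4 address deterministically into the private
    10.0.0.0/8 range (RFC 1918), keeping the last (host) octet so
    host-level patterns in logs remain analyzable."""
    ipaddress.IPv4Address(ip_str)  # validates the input format
    octets = ip_str.split(".")
    # Hash the network part so addresses from the same subnet land in
    # the same masked subnet.
    digest = hashlib.sha256(".".join(octets[:3]).encode()).digest()
    return f"10.{digest[0]}.{digest[1]}.{octets[3]}"
```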
Date Fields
- Date of birth: Keep age, but randomize day/month
- Transaction date: Keep time differences (for analysis)
- “Offset masking”: shift all date fields by X days
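Offset masking amounts to adding the same delta to every date field, which hides absolute dates while keeping intervals intact. A minimal sketch over hypothetical order records:

```python
from datetime import date, timedelta

def offset_mask_dates(records, offset_days):
    """Shift every date field by the same number of days: absolute
    dates are obscured, but intervals between events (e.g. order ->
    shipment) are preserved for analysis."""
    delta = timedelta(days=offset_days)
    return [
        {k: (v + delta) if isinstance(v, date) else v for k, v in rec.items()}
        for rec in records
    ]

records = [{"id": 1, "ordered": date(2024, 3, 1), "shipped": date(2024, 3, 4)}]
masked = offset_mask_dates(records, offset_days=-90)
# The 3-day gap between "ordered" and "shipped" survives the shift.
```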
Free-text fields (comments, notes)
- Replace entirely with Lorem Ipsum
- Or: NLP-based detection of PII in free text and replacement
- Tool: presidio (Microsoft) automatically detects PII in text
Binary data (images, documents)
- Profile pictures: Replace with stock photo placeholders
- Documents: Replace with blank PDFs of the same size
Tooling and Implementation
Faker (Python/Node.js/PHP) is the most widely used open-source tool for generating realistic fake data:
```python
import hashlib

from faker import Faker

fake = Faker('de_DE')  # German locale fake data
fake.name()            # e.g. "Klaus Müller"
fake.email()           # e.g. "k.mueller@example.com"
fake.iban()            # e.g. "DE89370400440532013000" (correct check digit)
fake.phone_number()    # e.g. "+49 1522 0123456"
fake.address()         # e.g. "Hauptstraße 42, 10115 Berlin"
fake.date_of_birth(minimum_age=18, maximum_age=80)

# Seeded for consistency (same input → same output). Note: Python's
# built-in hash() is randomized per process, so derive the seed from a
# stable hash instead:
seed = int(hashlib.sha256(b"max@muster.de").hexdigest(), 16)
fake = Faker('de_DE')
fake.seed_instance(seed)
# → same seed → same fake name → referential integrity!
```
Presidio (Microsoft, Open Source) detects and replaces PII in free text:
```python
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

text = "Max Mustermann, max@muster.de, IBAN: DE89370400440532013000"
# Note: the default setup ships with an English NLP model; using
# language="de" requires configuring a German spaCy model first.
results = analyzer.analyze(text=text, language="de")
anonymized = anonymizer.anonymize(text=text, analyzer_results=results)
# "Max Mustermann" → "<PERSON>"
# "max@muster.de"  → "<EMAIL_ADDRESS>"
```
ARX is a Java-based open-source anonymization framework with k-anonymity and l-diversity implementation, GUI, and API—particularly suitable for statistical datasets and research data.
PostgreSQL Anonymizer Extension enables dynamic masking directly in the database:
```sql
CREATE EXTENSION IF NOT EXISTS anon;
SELECT anon.init();

-- Define masking rules:
SECURITY LABEL FOR anon ON COLUMN customers.name
  IS 'MASKED WITH FUNCTION anon.fake_last_name()';
SECURITY LABEL FOR anon ON COLUMN customers.email
  IS 'MASKED WITH FUNCTION anon.fake_email()';

-- For specific roles:
SECURITY LABEL FOR anon ON ROLE analyst IS 'MASKED';
-- analyst sees masked data, admin sees real data
```
GDPR and Data Masking
Data masking is a technical measure with a clear basis in data protection law:
- Art. 25 GDPR - Data protection by design: Data masking in non-production environments is a pseudonymization measure under Art. 25.
- Art. 32 GDPR - Technical and organizational measures (TOMs): Pseudonymization is explicitly listed as a TOM; masking in development environments must be documented.
- Art. 89 GDPR - Processing for research and statistics: Anonymization methods (k-anonymity, etc.) are used for research data.
Masking is not the same as anonymization
Pseudonymization (reversible masking): The link to an individual can be restored with the correct key or mapping. The data remains subject to the GDPR and may only be used internally.
Anonymization (irreversible masking): Personal identification is no longer possible—the GDPR no longer applies to this data. However, true anonymization is difficult to achieve due to re-identification risks.
Best Practices for GDPR-Compliant Test Environments
- Never use real production data in development or test environments
- Document the masking process (TOM list)
- Perform masking before transferring data to external service providers
- Verify masking quality: Conduct re-identification tests
- Alternatively: Generate synthetic test data (100% GDPR-neutral)
Synthetic Data as an Alternative
Synthetic data generation differs fundamentally from masking:
| Approach | Starting point | GDPR relevance |
|---|---|---|
| Masking | Real data → Replacement data | Data remains structurally derived from real data |
| Synthetic | Completely artificial data | No personal reference, no connection to real data |
Gretel.ai (cloud service) trains an ML model on real data and generates synthetic data with the same statistical distribution; done correctly, the synthetic records carry no direct personal reference (provided the model does not memorize individual records).
SDV (Synthetic Data Vault, Open Source) works similarly:
```python
# Note: this uses the pre-1.0 SDV API; newer releases moved to
# sdv.single_table.GaussianCopulaSynthesizer with a metadata object.
from sdv.tabular import GaussianCopula

model = GaussianCopula()
model.fit(real_data)                          # train on real data (one-time, controlled)
synthetic_data = model.sample(num_rows=1000)  # 1000 synthetic rows
# → no real data needed anymore!
```
When to use which approach:
- Masking: when tests with production data structure and volume are required
- Synthetic: for new projects, external partners, AI training, and demos