Privacy by Design: PII Detection and Anonymization with PySpark on Microsoft Fabric
Introduction
Whether you’re building analytics pipelines or conversational AI systems, the risk of exposing sensitive data is real. AI models trained on unfiltered datasets can inadvertently memorize and regurgitate PII, leading to compliance violations and reputational damage. This blog explores how to build scalable, secure, and compliant data workflows using PySpark, Microsoft Presidio, and Faker—covering hands-on examples of detection, masking, hashing, and synthetic data generation that apply equally to data engineering and AI use cases.
Data Anonymization
Data anonymization is the process of transforming personal or sensitive data in such a way that the individuals to whom the data pertains can no longer be identified—either directly or indirectly. This is a critical step in ensuring compliance with privacy regulations like GDPR, PDPA, and HIPAA, and in enabling safe data sharing for analytics and AI model training.
Unlike encryption, which is reversible with a key, anonymization aims to be irreversible, ensuring that once data is anonymized, it cannot be traced back to an individual.
Why PII Anonymization Matters
PII such as names, emails, phone numbers, and national IDs can be inadvertently exposed during data processing. Anonymizing this data ensures:
- Compliance with regulations like GDPR and PDPA
- Reduced risk of data breaches
- Fairness in AI/ML models by removing bias-inducing attributes
Common Data Anonymization Techniques

Effective data anonymization requires the application of techniques suited to the nature of the data and its intended use. Below are the most widely used methods:
Masking –
Masking involves replacing original data values with specified characters, either partially or completely. For example, sensitive information like contact numbers might appear as “XXX-XXX-XXXX.” This technique is particularly useful for safeguarding data in testing or development environments.
Original: jack@example.com → Masked: j***@example.com
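As a minimal sketch in plain Python (assuming a well-formed email address), partial masking can be as simple as:

```python
def mask_email(email: str) -> str:
    # Keep the first character of the local part and mask the rest.
    local, domain = email.split("@", 1)
    return f"{local[0]}***@{domain}"

print(mask_email("jack@example.com"))  # j***@example.com
```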
Hashing –
Applies a one-way cryptographic function to convert data into a fixed-length string of characters (hash value). Useful for consistent anonymization across datasets.
Original (Customer ID): 234568 → Hash value (illustrative SHA-256 digest): c9e1c6a7b5e2e3e9b8a7c2e6e1a5e8a7b8e1a5c0e6e1a5e8a7b8e1a5c0e6e1a5
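For illustration, here is a small sketch using Python's standard hashlib module. Note that a SHA-256 digest is always a fixed 64-character hex string, regardless of input length:

```python
import hashlib

def hash_id(value: str) -> str:
    # SHA-256 is one-way: the digest cannot be reversed to recover the input.
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

print(hash_id("234568"))  # same 64-character hex digest on every run
```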
Encryption –
Encryption secures data by encoding it with sophisticated algorithms, rendering it unreadable without authorized decryption keys. While ideal for securing data in transit and storage, encryption may not align with datasets intended for public or external sharing.
Original (Customer ID): 234568 → Encrypted value (illustrative): VGVzdFN0cmluZw==
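As a sketch of reversible encryption, here is one of several possible choices, the Fernet recipe from the cryptography package (an assumption on my part; the blog's pipelines do not prescribe a specific library):

```python
from cryptography.fernet import Fernet

# The key must be stored securely; anyone holding it can decrypt.
key = Fernet.generate_key()
cipher = Fernet(key)

token = cipher.encrypt(b"234568")  # unreadable without the key
plain = cipher.decrypt(token)      # b'234568' -- reversible, unlike hashing
```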
Generalization –
This technique minimizes the specificity of data to reduce the risk of identification. For instance, sharing only the birth year rather than the complete date of birth is a form of generalization, often employed in demographic studies.
Original: 1985-07-23 → Generalized: 1985
Suppression –
Suppression eliminates sensitive information entirely from a dataset. While effective at maintaining privacy, the utility of the resulting data may be diminished due to the loss of critical details.
Original: "Her SSN number is 123-45-6789" → Suppressed: "Her SSN number is ***-**-****"
Perturbation –
Perturbation introduces noise or intentional modifications to data, creating uncertainty about individual records. Techniques such as differential privacy are categorized under perturbation, offering mathematical assurances of privacy preservation.
Original: "Her age is 35" -> Suppressed: "Her age is 36"
Synthetic Data Generation –
Creates artificial datasets that mimic the patterns and characteristics of real data but contain no actual personal information.
Original: "Her SSN is 123-45-6789" → Synthetic: "Her SSN is 987-65-4321"
Pseudonymization –
Pseudonymization replaces identifiable data with unique pseudonyms or identifiers. Unlike complete anonymization, pseudonymized data can be reverted to its original form if the linking key is retained, making it highly suitable for controlled environments.
Original: "Her SSN is 123-45-6789" → Pseudonymized: "Her SSN is TOKEN-ABCD-EFGH"
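A toy sketch of pseudonymization with an in-memory token map. In practice the mapping would live in a secured, access-controlled store, since retaining it is exactly what makes the process reversible:

```python
import uuid

token_map = {}  # hypothetical linking table: original value -> pseudonym

def pseudonymize(value: str) -> str:
    # Reuse the same token for repeated values so joins remain possible.
    if value not in token_map:
        token_map[value] = f"TOKEN-{uuid.uuid4().hex[:8].upper()}"
    return token_map[value]

print(pseudonymize("123-45-6789"))  # e.g., TOKEN-3F2A9C1B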
Fabric Implementation: Privacy by design at scale
To operationalize privacy in modern data ecosystems, the Fabric implementation brings together Microsoft Fabric’s Lakehouse, Data Engineering, and Data Factory capabilities with the precision of PySpark and Microsoft Presidio. This setup enables automated, scalable, and standardized PII protection across your data pipelines and AI workflows. Whether you’re masking emails, hashing IDs, anonymizing free-text and comment columns, or generating synthetic personas, this stack keeps your data useful without compromising privacy.

By integrating open-source tools like Presidio with PySpark, we can implement robust PII detection and anonymization strategies at scale that align with privacy-by-design principles.
| Component | Technology |
| --- | --- |
| Platform | Microsoft Fabric (Lakehouse + Data Engineering + Data Pipelines) |
| Processing Engine | PySpark |
| PII Detection | Microsoft Presidio |
| Anonymization Techniques | Masking, Hashing, Encryption |
| Synthetic Data Generation | Faker library |
While there are multiple approaches to implementing data privacy at scale (third-party tools, built-in solutions like MIP labels and Microsoft Purview, or custom PySpark pipelines with AI functions), this blog focuses specifically on the following three approaches, using PySpark as the processing engine:
1. Identify and Anonymize PII in Structured and Unstructured Data Using Presidio
Microsoft Presidio is an open-source framework developed by Microsoft for detecting and anonymizing sensitive data. Presidio (from the Latin praesidium, ‘protection, garrison’) helps ensure sensitive data is properly managed and governed. It provides fast identification and anonymization modules for private entities in text and images, such as credit card numbers, names, locations, social security numbers, bitcoin wallets, US phone numbers, financial data, and more. It supports both structured and unstructured data, making it ideal for use cases like:
- Detecting and anonymizing PII in unstructured text, PDF, and image files.
- Detecting and anonymizing PII in free-text fields in structured data such as comments, feedback, or survey responses.
- Identifying sensitive entities like names, emails, phone numbers, and credit card numbers using NLP and regex-based recognizers.


Sample Use Case: Use Presidio in a PySpark UDF to scan customer profiles stored in a file and flag PII entities.
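A minimal sketch of that pattern follows. Table and column names are hypothetical, and the Presidio engines are initialized lazily on each executor because they are expensive to build and not serializable. By default, the anonymizer replaces each detected entity with a placeholder such as <PERSON>:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

# Created once per executor process rather than shipped from the driver.
_analyzer = None
_anonymizer = None

def anonymize_text(text):
    global _analyzer, _anonymizer
    if text is None:
        return None
    if _analyzer is None:
        from presidio_analyzer import AnalyzerEngine
        from presidio_anonymizer import AnonymizerEngine
        _analyzer = AnalyzerEngine()
        _anonymizer = AnonymizerEngine()
    results = _analyzer.analyze(text=text, language="en")
    return _anonymizer.anonymize(text=text, analyzer_results=results).text

anonymize_udf = udf(anonymize_text, StringType())

# Hypothetical Lakehouse table with a free-text "profile_notes" column.
df = spark.read.table("customer_profiles")
df_clean = df.withColumn("profile_notes_clean", anonymize_udf("profile_notes"))
```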
2. Generate Synthetic Data Using Faker for Anonymization
Once PII is detected, you can replace it with synthetic but realistic data using libraries like Faker. This is especially useful for:
- Creating safe test datasets.
- Preserving data utility while ensuring privacy.
Example: Replace detected names with fake names, emails with dummy addresses, and phone numbers with randomly generated ones.
This approach ensures that downstream analytics can continue without exposing real user data.
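A hedged sketch of this approach, assuming a DataFrame df with name, email, and phone columns (the column names are hypothetical; the Faker instance is created lazily per executor):

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

_faker = None

def _get_faker():
    # One Faker instance per executor process.
    global _faker
    if _faker is None:
        from faker import Faker
        _faker = Faker()
    return _faker

fake_name = udf(lambda v: _get_faker().name() if v is not None else None, StringType())
fake_email = udf(lambda v: _get_faker().email() if v is not None else None, StringType())
fake_phone = udf(lambda v: _get_faker().phone_number() if v is not None else None, StringType())

df_synthetic = (
    df.withColumn("name", fake_name("name"))
      .withColumn("email", fake_email("email"))
      .withColumn("phone", fake_phone("phone"))
)
```

Note that plain Faker values are random per row; if the same person must map to the same fake value across tables, one option is to seed Faker from a hash of the original value before generating the replacement.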


3. Use Built-in PySpark Functions for Hashing and Masking
For structured data like customer tables, PySpark provides native functions to anonymize data:
- Masking: Completely or partially hide data (e.g., mask all but the last 4 digits of a phone number).
- Hashing: Use PySpark’s built-in sha2 function to compute the SHA-256 hash (a hexadecimal string) for each value in a given column (e.g., hashing the Customer ID). Since SHA-256 is a deterministic hashing algorithm, it produces the same output for the same input, making it suitable for anonymizing columns that still need to be joined across multiple tables.
These techniques are efficient and scalable, making them ideal for large datasets processed in Fabric pipelines.
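A compact sketch combining both techniques, assuming a DataFrame df with customer_id and phone columns (hypothetical names):

```python
from pyspark.sql import functions as F

df_anon = (
    df
    # Masking: keep only the last 4 digits of the phone number.
    .withColumn(
        "phone_masked",
        F.concat(F.lit("XXX-XXX-"), F.substring("phone", -4, 4)),
    )
    # Hashing: deterministic SHA-256, so hashed IDs still join across tables.
    .withColumn("customer_id_hash", F.sha2(F.col("customer_id").cast("string"), 256))
)
```

One caveat: an unsalted hash of a low-cardinality ID can be reversed by brute force, so consider concatenating a secret salt to the column before hashing.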


Implementation Guide
The PII-SparkShield repository contains sample implementations and code for the three approaches discussed in the blog.
Conclusion
Data anonymization plays a pivotal role in safeguarding privacy while enabling the productive use of data. Employing techniques such as masking, perturbation, and pseudonymization allows organizations to navigate the delicate balance between privacy preservation and data utility. However, challenges such as re-identification risks and compliance intricacies highlight the need for continuous innovation and vigilance. As privacy concerns grow globally, organizations must prioritize anonymization practices to ensure trust and compliance in the digital landscape.
Acknowledgment
My sincere appreciation goes to Omri Mendels – Thanks for helping me whenever needed. This would not have been possible without your help.
Sincere thanks to Abhishek Narain, Noelle Li, Santosh Kumar Ravindran, Ron Shakutai & Gyani Sinha for their inputs, right from ideation to implementation.