AI4Privacy ai4privacy Collection New release

PII Masking 3M Asia-Pacific Release

The world's largest open multilingual PII masking corpus. 3M+ synthetic examples across 30 languages spanning Europe, the Americas, and Asia-Pacific, purpose-built for training privacy-preserving NLP models in a truly global setting.

Token Classification NER Text Generation Multilingual Asia-Pacific Synthetic GDPR PDPA APPI PIPA EU AI Act
3M+
Examples
30
Languages
3
World Regions
5
Industries
Asia-Pacific regional partner

In partnership with VNCyberS

The Asia-Pacific expansion is delivered together with VNCyberS, pioneering data protection and cybersecurity in Vietnam, bringing regional language expertise and on-the-ground privacy compliance to the release.

VNCyberS

6 core datasets + 5 benchmark slices Collection Contents

5 open benchmark slices · CC-BY-4.0 Benchmarks & samples

30 languages across 3 regions Global Coverage

Asia-Pacific

7 new languages

Japanese (日本語) Korean (한국어) Chinese (中文) Vietnamese (Tiếng Việt) Indonesian (Bahasa Indonesia) Malay (Bahasa Melayu) Tagalog (Filipino)

Europe

23 locales

English German (Deutsch) French (Français) Spanish (Español) Italian (Italiano) Dutch (Nederlands) Polish (Polski) Swedish (Svenska)

Americas

North & South

English (US) Spanish (LatAm) Portuguese (BR) French (CA)

Schema · real samples in every language Data Structure & Examples

{
  "source_text": "本日の集合場所は 射水市 円池 の 昼場 6-28-20、郵便番号は 520-2111 です。",
  "masked_text": "本日の集合場所は [CITY_1] の [STREET_1] [BUILDINGNUM_1]、郵便番号は [ZIPCODE_1] です。",
  "privacy_mask": [ { "value": "射水市 円池", "label": "CITY" }, { "value": "昼場", "label": "STREET" }, { "value": "6-28-20", "label": "BUILDINGNUM" }, { "value": "520-2111", "label": "ZIPCODE" } ],
  "language": "ja", "region": "JP", "script": "Jpan"
}
language: ja · region: JP · script: Jpan

20 core + 61 industry-specific Entity Types

Core PII Labels (Open)

DATE GIVENNAME SURNAME EMAIL CITY TITLE TELEPHONENUM AGE STREET BUILDINGNUM ZIPCODE IDCARDNUM CREDITCARDNUMBER DRIVERLICENSENUM GENDER TAXNUM SEX SOCIALNUM PASSPORTNUM URL

Industry-Specific Labels (Enterprise)

DIAGNOSES MEDICATION TESTRESULTS HOSPITALNAME IBAN ACCOUNTNUM BIC SALARY JOBTITLE ORGANISATION APIKEY PASSWORD MACADDRESS GEOCOORD VEHICLEVIN + 46 more

Tasks supported Use Cases

Named Entity Recognition

Train NER models to detect and classify PII entities with pre-computed mBERT-compatible BIO labels.

Data Anonymization

Build production-grade anonymization pipelines compliant with GDPR, the EU AI Act, PDPA, APPI, and PIPA.

LLM Fine-tuning

Fine-tune large language models for privacy-aware text generation and redaction across languages.

Need enterprise-grade data?

Get access to the full 3M dataset including all industry-specific components and Asia-Pacific coverage, with commercial licensing for your organization.