Ministral-3B-PII-Preview
Ministral-3B-PII-Preview is a 3.3B-parameter language model that detects personally identifiable information (PII) in unstructured text and returns it as a structured JSON array of typed entities. Give it any text and it emits a list of {"text": ..., "label": ...} objects spanning 69 PII entity types across the healthcare, financial, identity, and digital domains.
The model is an experimental, reinforcement-learning–trained variant of a Ministral-3B base. It was optimized with GRPO (Group Relative Policy Optimization) specifically to produce valid, schema-consistent JSON and to detect PII with high precision — making it suited to redaction, de-identification, and compliance workflows (HIPAA, GDPR, PCI-DSS).
Research preview. This is an experimental model intended for evaluation and pipeline integration. Use it as one layer in a broader privacy/compliance system, not as a sole compliance control.
⚠️ Text input only. This release is a text-to-text model: it reads text and returns JSON. The underlying architecture also contains a vision encoder, but image-to-text PII extraction is not supported in this version — passing images is not a validated path. Multimodal (image → PII) support is planned for a future release.
Key Results
Evaluated on a 1,000-sample held-out PII benchmark with greedy decoding, a 2,048-token prompt budget, and no assistant-side JSON-fence prefill.
| Metric | Score |
|---|---|
| Valid JSON rate | 1.000 |
| Valid label rate | 0.975 |
| Micro precision | 0.914 |
| Micro recall | 0.859 |
| Micro F1 | 0.886 |
| Format consistency | 100% |
| Empty-output consistency | 100% |
Every generation parsed as valid JSON, and the model reliably returns [] for text containing no PII.
Supported PII Labels
The model recognizes 69 PII entity types. Each detected span is returned as {"text": "...", "label": "..."} using the label names below.
View all 69 entity types by category
| Category | Entity types |
|---|---|
| Identity & demographics | first_name, last_name, title, date_of_birth, age, gender, nationality, race, ethnicity, race_ethnicity, religion, religious_belief, marital_status, sexuality, political_view, language, biometric_identifier |
| Contact & address | email, phone_number, fax_number, street_address, building_number, city, county, state, postcode, zip_code, country, coordinate |
| Government & legal IDs | social_security_number, ssn, national_id, driver_license_number, tax_id, license_plate, vehicle_identifier, certificate_license_number, unique_id |
| Healthcare | medical_record_number, health_plan_beneficiary_number, blood_type |
| Financial | credit_debit_card, cvv, pin, account_number, bank_routing_number, iban, swift_bic, salary |
| Employment & organization | occupation, employment_status, employee_id, education_level, organization, company_name, customer_id |
| Digital & network | ip_address, ipv4, ipv6, mac_address, url, user_name, password, http_cookie, api_key, device_identifier |
| Temporal | date, date_time, time |
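Because the model is generative, a response can occasionally carry a label that is not in the table above. The sketch below shows one way to parse a response and separate documented labels from undocumented ones. It is illustrative only and not part of this repo: `DOCUMENTED_LABELS` and `split_by_label` are hypothetical names, and the label set is copied from the table.

```python
import json

# The 69 documented labels, copied from the table above.
DOCUMENTED_LABELS = frozenset("""
first_name last_name title date_of_birth age gender nationality race ethnicity
race_ethnicity religion religious_belief marital_status sexuality political_view
language biometric_identifier
email phone_number fax_number street_address building_number city county state
postcode zip_code country coordinate
social_security_number ssn national_id driver_license_number tax_id license_plate
vehicle_identifier certificate_license_number unique_id
medical_record_number health_plan_beneficiary_number blood_type
credit_debit_card cvv pin account_number bank_routing_number iban swift_bic salary
occupation employment_status employee_id education_level organization company_name
customer_id
ip_address ipv4 ipv6 mac_address url user_name password http_cookie api_key
device_identifier
date date_time time
""".split())


def split_by_label(raw_output):
    """Parse the model's JSON array and split entities by whether the label is documented."""
    entities = json.loads(raw_output)  # raises json.JSONDecodeError on malformed output
    if not isinstance(entities, list):
        raise ValueError("expected a JSON array of entities")
    known, unknown = [], []
    for entity in entities:
        bucket = known if entity.get("label") in DOCUMENTED_LABELS else unknown
        bucket.append(entity)
    return known, unknown
```

What to do with the `unknown` bucket is a policy choice. For redaction workloads, keeping those spans and treating them as sensitive is usually the safer default than discarding them.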
Quickstart
The PII extraction system prompt (with few-shot examples) is baked into the chat template, so no system message is required — just send the text. The template does not prefill a markdown json fence; the model emits the JSON array itself.
import torch
from transformers import AutoModelForImageTextToText, AutoTokenizer
model_id = "OpenMed/Ministral-3B-PII-Preview"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# The checkpoint uses a multimodal architecture, but this release is validated
# for TEXT input only. Load it with the image-text-to-text auto class and pass
# text — do not pass images.
model = AutoModelForImageTextToText.from_pretrained(
model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
messages = [
{"role": "user", "content": "Contact Sarah at [email protected] or 415-555-0198."},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=2048).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
# [{"text": "Sarah", "label": "first_name"}, {"text": "[email protected]", "label": "email"}, {"text": "415-555-0198", "label": "phone_number"}]
You may pass a custom system message to override the default behavior if needed. Keep the system-prompt pattern, and do not manually prefill ```json.
Optional: production post-processing
For non-English text especially, a small deterministic post-processing pass cleans up the raw output (Unicode normalization, span deduplication, CJK name splitting, Vietnamese name-order swap, language stopword filtering). The implementation ships with this repo in postprocess.py:
import json
from postprocess import postprocess_entities
entities = json.loads(response)
clean = postprocess_entities(entities, language="vi") # pass the source language code
Examples by Compliance Domain
HIPAA — Medical Records
Input:
Patient Maria Garcia, DOB 03/15/1985, MRN 4872910, was admitted on 2024-01-20 for a routine blood panel. Her blood type is O-negative. Insurance ID: BCBS-7742185. Contact her at [email protected] or (312) 555-0147.
Output:
[
{"text": "Maria", "label": "first_name"},
{"text": "Garcia", "label": "last_name"},
{"text": "03/15/1985", "label": "date_of_birth"},
{"text": "4872910", "label": "medical_record_number"},
{"text": "2024-01-20", "label": "date"},
{"text": "O-negative", "label": "blood_type"},
{"text": "BCBS-7742185", "label": "insurance_id"},
{"text": "[email protected]", "label": "email"},
{"text": "(312) 555-0147", "label": "phone_number"}
]
GDPR — EU Customer Data
Input:
Dear Mr. Lukas Weber, your account (CUST-DE-88412) has been updated. We have your address as Friedrichstrasse 42, 10117 Berlin, Germany. Your IBAN DE89370400440532013000 is on file. For verification, your national ID is T220001293. Please confirm via [email protected].
Output:
[
{"text": "Mr.", "label": "title"},
{"text": "Lukas", "label": "first_name"},
{"text": "Weber", "label": "last_name"},
{"text": "Friedrichstrasse 42", "label": "street_address"},
{"text": "10117", "label": "zip_code"},
{"text": "Berlin", "label": "city"},
{"text": "Germany", "label": "country"},
{"text": "CUST-DE-88412", "label": "account_number"},
{"text": "DE89370400440532013000", "label": "iban"},
{"text": "T220001293", "label": "national_id"},
{"text": "[email protected]", "label": "email"}
]
PCI-DSS — Financial Data
Input:
Wire transfer requested by account holder James Liu, account #7781920034, routing 021000021. Credit card ending 4532-XXXX-XXXX-8901 was flagged. SSN on file: 123-45-6789. Tax ID: 92-1234567. Contact: [email protected], IP logged: 192.168.1.42.
Output:
[
{"text": "James", "label": "first_name"},
{"text": "Liu", "label": "last_name"},
{"text": "7781920034", "label": "account_number"},
{"text": "021000021", "label": "routing_number"},
{"text": "4532-XXXX-XXXX-8901", "label": "credit_card"},
{"text": "123-45-6789", "label": "ssn"},
{"text": "92-1234567", "label": "tax_id"},
{"text": "[email protected]", "label": "email"},
{"text": "192.168.1.42", "label": "ip_address"}
]
No PII — Clean Text
Input:
The quarterly earnings report shows a 12% increase in revenue compared to last year. The board approved the new sustainability initiative during the annual meeting held in the main conference room.
Output:
[]
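The outputs above are lists of text/label pairs with no character offsets, so a redaction step has to locate each span in the source text itself. The following is a minimal sketch of that step, assuming plain substring matching is acceptable. The function name `redact` and the `[LABEL]` placeholder format are illustrative choices, not something this repo defines.

```python
def redact(text, entities):
    """Replace every occurrence of each entity's text with a [LABEL] placeholder."""
    # Longest spans first, so a short span never breaks up a longer one that contains it.
    ordered = sorted(entities, key=lambda e: len(e["text"]), reverse=True)
    for entity in ordered:
        if entity["text"]:  # skip empty spans: replacing "" would insert placeholders everywhere
            placeholder = "[" + entity["label"].upper() + "]"
            text = text.replace(entity["text"], placeholder)
    return text


source = "Contact Sarah at 415-555-0198."
found = [
    {"text": "Sarah", "label": "first_name"},
    {"text": "415-555-0198", "label": "phone_number"},
]
print(redact(source, found))
```

Substring replacement can over-redact when a short span such as a two-letter name also appears inside unrelated words. A stricter variant could match on word boundaries or redact only the first occurrence.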
Multilingual Support (20 languages, zero-shot)
The model was trained only on English PII data but generalizes to other languages out of the box. We ran one realistic example per language across the top 20 world languages and scored the model under two conditions:
- Strict: exact-match scoring on raw model output.
- Production: raw output → a small deterministic post-processing pipeline (Unicode normalization, span deduplication, CJK name splitting, Vietnamese name-order swap, language stopword filter, Slavic case-tolerance at match time). This is the same kind of cleanup a real clinical PII system would run downstream of a model.
| Mode | Perfect | Micro-P | Micro-R | Micro-F1 | TP | FP | FN |
|---|---|---|---|---|---|---|---|
| Raw model output | 13/20 | 0.902 | 0.902 | 0.902 | 92 | 10 | 10 |
| + Production pipeline | 20/20 | 1.000 | 1.000 | 1.000 | 102 | 0 | 0 |
Scored on 102 entities hand-annotated across all 20 languages.
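The micro-averaged scores in the table follow directly from the TP/FP/FN counts. As a quick check, the snippet below recomputes them with the standard precision, recall, and F1 definitions. The helper name `micro_scores` is illustrative.

```python
def micro_scores(tp, fp, fn):
    """Return (precision, recall, f1) from pooled true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


raw = micro_scores(tp=92, fp=10, fn=10)
production = micro_scores(tp=102, fp=0, fn=0)
print([round(x, 3) for x in raw])         # [0.902, 0.902, 0.902]
print([round(x, 3) for x in production])  # [1.0, 1.0, 1.0]
```

Precision and recall coincide in the raw row because the number of false positives happens to equal the number of false negatives.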
Per-language F1
| # | Language | Code | Strict F1 | Production F1 |
|---|---|---|---|---|
| 1 | English | en | 1.00 | 1.00 |
| 2 | Chinese | zh | 0.73 | 1.00 |
| 3 | Hindi | hi | 1.00 | 1.00 |
| 4 | Spanish | es | 1.00 | 1.00 |
| 5 | Arabic | ar | 1.00 | 1.00 |
| 6 | French | fr | 1.00 | 1.00 |
| 7 | Bengali | bn | 1.00 | 1.00 |
| 8 | Russian | ru | 0.80 | 1.00 |
| 9 | Portuguese | pt | 1.00 | 1.00 |
| 10 | Japanese | ja | 0.67 | 1.00 |
| 11 | German | de | 1.00 | 1.00 |
| 12 | Korean | ko | 0.67 | 1.00 |
| 13 | Italian | it | 1.00 | 1.00 |
| 14 | Turkish | tr | 1.00 | 1.00 |
| 15 | Vietnamese | vi | 0.62 | 1.00 |
| 16 | Persian | fa | 1.00 | 1.00 |
| 17 | Polish | pl | 0.80 | 1.00 |
| 18 | Dutch | nl | 1.00 | 1.00 |
| 19 | Swahili | sw | 0.83 | 1.00 |
| 20 | Thai | th | 1.00 | 1.00 |
| | Micro | | 0.902 | 1.000 |
The post-processing pipeline
Six deterministic steps. No heavy NLP dependencies — all regex, string ops, and small gazetteers. The full implementation lives in postprocess.py.
- Unicode NFC + whitespace strip on every text field. Also applied to the input before inference.
- Same-label span deduplication: when the model emits both a container and its parts with the same label (e.g. first_name=Nguyễn Văn An and first_name=Nguyễn), keep the most specific.
- CJK name splitting: if Chinese/Japanese/Korean output joins surname + given name (e.g. 田中太郎 as a single first_name), split it using a small surname gazetteer.
- Vietnamese name-order swap: Vietnamese writes family-name-first. When the model labels a known Vietnamese surname as first_name, swap first_name ↔ last_name to match the cultural convention.
- Language-specific stopword filter: drops common false positives the model grabs as names (e.g. Swahili Jina = "name", Vietnamese Tôi = "I").
- Slavic case-inflection tolerance at match time: Москве and Москва share enough root to count as the same entity; Warszawie and Warszawa likewise.
The raw model already extracts 92/102 entities correctly. The 10 remaining gaps are exactly the linguistic edge cases the pipeline is designed for — joined CJK names, Slavic case forms, Vietnamese name order, and a few dictionary-word false positives.
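Three of these steps (NFC normalization, same-label deduplication, and case-inflection tolerance) are simple enough to sketch with the standard library alone. The code below is an illustration of the ideas, not the code in postprocess.py. The function names, the "shorter span wins" reading of "most specific", and the 0.7 prefix threshold are all assumptions.

```python
import unicodedata


def normalize(entities):
    """NFC-normalize and strip surrounding whitespace on every text field."""
    return [
        {**e, "text": unicodedata.normalize("NFC", e["text"]).strip()}
        for e in entities
    ]


def dedupe_same_label(entities):
    """Drop a span when a shorter span with the same label sits inside it."""
    kept = []
    for e in entities:
        is_container = any(
            o["text"]
            and o["label"] == e["label"]
            and o["text"] != e["text"]
            and o["text"] in e["text"]
            for o in entities
        )
        # Also drop exact duplicates while preserving the original order.
        if not is_container and e not in kept:
            kept.append(e)
    return kept


def shares_root(a, b, min_ratio=0.7):
    """Treat two strings as the same entity if their common prefix is long enough."""
    prefix = 0
    for x, y in zip(a, b):
        if x != y:
            break
        prefix += 1
    longest = max(len(a), len(b))
    return longest > 0 and prefix / longest >= min_ratio
```

With these definitions, `shares_root("Москве", "Москва")` and `shares_root("Warszawie", "Warszawa")` both hold, while an unrelated pair such as "Berlin" and "Bern" falls below the threshold. A fixed prefix ratio is a crude stand-in for real morphology and would need tuning per language.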
All 20 language examples
Each block shows the input text, then the raw model output, then the post-processed output.
English (en) — perfect
Input
Hi, my name is Sarah Johnson. You can reach me at [email protected] or call 555-234-7891. I live in Seattle.
Raw model output
[
{
"text": "Sarah",
"label": "first_name"
},
{
"text": "Johnson",
"label": "last_name"
},
{
"text": "[email protected]",
"label": "email"
},
{
"text": "555-234-7891",
"label": "phone_number"
},
{
"text": "Seattle",
"label": "city"
}
]
After production post-processing
[
{
"text": "Sarah",
"label": "first_name"
},
{
"text": "Johnson",
"label": "last_name"
},
{
"text": "[email protected]",
"label": "email"
},
{
"text": "555-234-7891",
"label": "phone_number"
},
{
"text": "Seattle",
"label": "city"
}
]
Chinese (zh) — strict F1=0.73, production F1=1.00
Input
我叫王小明,电话是 13812345678,邮箱是 [email protected],住在北京海淀区。
Raw model output
[
{
"text": "王小明",
"label": "first_name"
},
{
"text": "13812345678",
"label": "phone_number"
},
{
"text": "[email protected]",
"label": "email"
},
{
"text": "北京",
"label": "city"
},
{
"text": "海淀区",
"label": "district"
}
]
After production post-processing
[
{
"text": "小明",
"label": "first_name"
},
{
"text": "13812345678",
"label": "phone_number"
},
{
"text": "[email protected]",
"label": "email"
},
{
"text": "北京",
"label": "city"
},
{
"text": "海淀区",
"label": "district"
},
{
"text": "王",
"label": "last_name"
}
]
Hindi (hi) — perfect
Input
मेरा नाम राज शर्मा है। मेरा फ़ोन नंबर 9876543210 और ईमेल [email protected] है। मैं मुंबई में रहता हूं।
Raw model output
[
{
"text": "राज",
"label": "first_name"
},
{
"text": "शर्मा",
"label": "last_name"
},
{
"text": "9876543210",
"label": "phone_number"
},
{
"text": "[email protected]",
"label": "email"
},
{
"text": "मुंबई",
"label": "city"
}
]
After production post-processing
[
{
"text": "राज",
"label": "first_name"
},
{
"text": "शर्मा",
"label": "last_name"
},
{
"text": "9876543210",
"label": "phone_number"
},
{
"text": "[email protected]",
"label": "email"
},
{
"text": "मुंबई",
"label": "city"
}
]
Spanish (es) — perfect
Input
Me llamo María García. Mi correo es [email protected] y mi teléfono es +34 612 345 678. Vivo en Madrid.
Raw model output
[
{
"text": "María",
"label": "first_name"
},
{
"text": "García",
"label": "last_name"
},
{
"text": "[email protected]",
"label": "email"
},
{
"text": "+34 612 345 678",
"label": "phone_number"
},
{
"text": "Madrid",
"label": "city"
}
]
After production post-processing
[
{
"text": "María",
"label": "first_name"
},
{
"text": "García",
"label": "last_name"
},
{
"text": "[email protected]",
"label": "email"
},
{
"text": "+34 612 345 678",
"label": "phone_number"
},
{
"text": "Madrid",
"label": "city"
}
]
Arabic (ar) — perfect
Input
اسمي أحمد الحسن. بريدي الإلكتروني [email protected] ورقم هاتفي +966501234567. أسكن في الرياض.
Raw model output
[
{
"text": "أحمد",
"label": "first_name"
},
{
"text": "الحسن",
"label": "last_name"
},
{
"text": "[email protected]",
"label": "email"
},
{
"text": "+966501234567",
"label": "phone_number"
},
{
"text": "الرياض",
"label": "city"
}
]
After production post-processing
[
{
"text": "أحمد",
"label": "first_name"
},
{
"text": "الحسن",
"label": "last_name"
},
{
"text": "[email protected]",
"label": "email"
},
{
"text": "+966501234567",
"label": "phone_number"
},
{
"text": "الرياض",
"label": "city"
}
]
French (fr) — perfect
Input
Je m'appelle Pierre Dupont. Mon email est [email protected] et mon numéro est 06 12 34 56 78. J'habite à Paris.
Raw model output
[
{
"text": "Pierre",
"label": "first_name"
},
{
"text": "Dupont",
"label": "last_name"
},
{
"text": "[email protected]",
"label": "email"
},
{
"text": "06 12 34 56 78",
"label": "phone_number"
},
{
"text": "Paris",
"label": "city"
}
]
After production post-processing
[
{
"text": "Pierre",
"label": "first_name"
},
{
"text": "Dupont",
"label": "last_name"
},
{
"text": "[email protected]",
"label": "email"
},
{
"text": "06 12 34 56 78",
"label": "phone_number"
},
{
"text": "Paris",
"label": "city"
}
]
Bengali (bn) — perfect
Input
আমার নাম রাহুল দাস। আমার ফোন নম্বর 01712345678 এবং ইমেইল [email protected]। আমি ঢাকায় থাকি।
Raw model output
[
{
"text": "রাহুল",
"label": "first_name"
},
{
"text": "দাস",
"label": "last_name"
},
{
"text": "01712345678",
"label": "phone_number"
},
{
"text": "[email protected]",
"label": "email"
},
{
"text": "ঢাকা",
"label": "city"
}
]
After production post-processing
[
{
"text": "রাহুল",
"label": "first_name"
},
{
"text": "দাস",
"label": "last_name"
},
{
"text": "01712345678",
"label": "phone_number"
},
{
"text": "[email protected]",
"label": "email"
},
{
"text": "ঢাকা",
"label": "city"
}
]
Russian (ru) — strict F1=0.80, production F1=1.00
Input
Меня зовут Иван Петров. Мой email [email protected] и телефон +7 495 123-45-67. Живу в Москве.
Raw model output
[
{
"text": "Иван",
"label": "first_name"
},
{
"text": "Петров",
"label": "last_name"
},
{
"text": "[email protected]",
"label": "email"
},
{
"text": "+7 495 123-45-67",
"label": "phone_number"
},
{
"text": "Москва",
"label": "city"
}
]
After production post-processing
[
{
"text": "Иван",
"label": "first_name"
},
{
"text": "Петров",
"label": "last_name"
},
{
"text": "[email protected]",
"label": "email"
},
{
"text": "+7 495 123-45-67",
"label": "phone_number"
},
{
"text": "Москва",
"label": "city"
}
]
Portuguese (pt) — perfect
Input
Meu nome é João Silva. Meu email é [email protected] e telefone (11) 98765-4321. Moro em São Paulo.
Raw model output
[
{
"text": "João",
"label": "first_name"
},
{
"text": "Silva",
"label": "last_name"
},
{
"text": "[email protected]",
"label": "email"
},
{
"text": "(11) 98765-4321",
"label": "phone_number"
},
{
"text": "São Paulo",
"label": "city"
}
]
After production post-processing
[
{
"text": "João",
"label": "first_name"
},
{
"text": "Silva",
"label": "last_name"
},
{
"text": "[email protected]",
"label": "email"
},
{
"text": "(11) 98765-4321",
"label": "phone_number"
},
{
"text": "São Paulo",
"label": "city"
}
]
Japanese (ja) — strict F1=0.67, production F1=1.00
Input
私の名前は田中太郎です。メールは [email protected]、電話は 090-1234-5678 です。東京に住んでいます。
Raw model output
[
{
"text": "田中太郎",
"label": "first_name"
},
{
"text": "[email protected]",
"label": "email"
},
{
"text": "090-1234-5678",
"label": "phone_number"
},
{
"text": "東京",
"label": "city"
}
]
After production post-processing
[
{
"text": "太郎",
"label": "first_name"
},
{
"text": "[email protected]",
"label": "email"
},
{
"text": "090-1234-5678",
"label": "phone_number"
},
{
"text": "東京",
"label": "city"
},
{
"text": "田中",
"label": "last_name"
}
]
German (de) — perfect
Input
Ich heiße Hans Müller. Meine E-Mail ist [email protected], meine Telefonnummer 030 12345678. Ich wohne in Berlin.
Raw model output
[
{
"text": "Hans",
"label": "first_name"
},
{
"text": "Müller",
"label": "last_name"
},
{
"text": "[email protected]",
"label": "email"
},
{
"text": "030 12345678",
"label": "phone_number"
},
{
"text": "Berlin",
"label": "city"
}
]
After production post-processing
[
{
"text": "Hans",
"label": "first_name"
},
{
"text": "Müller",
"label": "last_name"
},
{
"text": "[email protected]",
"label": "email"
},
{
"text": "030 12345678",
"label": "phone_number"
},
{
"text": "Berlin",
"label": "city"
}
]
Korean (ko) — strict F1=0.67, production F1=1.00
Input
제 이름은 김민수입니다. 이메일은 [email protected] 이고 전화번호는 010-1234-5678 입니다. 서울에 살고 있습니다.
Raw model output
[
{
"text": "김민수",
"label": "first_name"
},
{
"text": "[email protected]",
"label": "email"
},
{
"text": "010-1234-5678",
"label": "phone_number"
},
{
"text": "서울",
"label": "city"
}
]
After production post-processing
[
{
"text": "민수",
"label": "first_name"
},
{
"text": "[email protected]",
"label": "email"
},
{
"text": "010-1234-5678",
"label": "phone_number"
},
{
"text": "서울",
"label": "city"
},
{
"text": "김",
"label": "last_name"
}
]
Italian (it) — perfect
Input
Mi chiamo Marco Rossi. La mia email è [email protected] e il mio telefono è +39 333 1234567. Abito a Roma.
Raw model output
[
{
"text": "Marco",
"label": "first_name"
},
{
"text": "Rossi",
"label": "last_name"
},
{
"text": "[email protected]",
"label": "email"
},
{
"text": "+39 333 1234567",
"label": "phone_number"
},
{
"text": "Roma",
"label": "city"
}
]
After production post-processing
[
{
"text": "Marco",
"label": "first_name"
},
{
"text": "Rossi",
"label": "last_name"
},
{
"text": "[email protected]",
"label": "email"
},
{
"text": "+39 333 1234567",
"label": "phone_number"
},
{
"text": "Roma",
"label": "city"
}
]
Turkish (tr) — perfect
Input
Adım Mehmet Yılmaz. E-postam [email protected] ve telefonum +90 532 123 45 67. İstanbul'da yaşıyorum.
Raw model output
[
{
"text": "Mehmet",
"label": "first_name"
},
{
"text": "Yılmaz",
"label": "last_name"
},
{
"text": "[email protected]",
"label": "email"
},
{
"text": "+90 532 123 45 67",
"label": "phone_number"
},
{
"text": "İstanbul",
"label": "city"
}
]
After production post-processing
[
{
"text": "Mehmet",
"label": "first_name"
},
{
"text": "Yılmaz",
"label": "last_name"
},
{
"text": "[email protected]",
"label": "email"
},
{
"text": "+90 532 123 45 67",
"label": "phone_number"
},
{
"text": "İstanbul",
"label": "city"
}
]
Vietnamese (vi) — strict F1=0.62, production F1=1.00
Input
Tôi tên là Nguyễn Văn An. Email của tôi là [email protected] và số điện thoại là +84 912 345 678. Tôi sống ở Hà Nội.
Raw model output
[
{
"text": "Nguyễn Văn An",
"label": "first_name"
},
{
"text": "Nguyễn",
"label": "first_name"
},
{
"text": "Văn",
"label": "middle_name"
},
{
"text": "An",
"label": "last_name"
},
{
"text": "[email protected]",
"label": "email"
},
{
"text": "+84 912 345 678",
"label": "phone_number"
},
{
"text": "Hà Nội",
"label": "city"
}
]
After production post-processing
[
{
"text": "Nguyễn",
"label": "last_name"
},
{
"text": "Văn",
"label": "middle_name"
},
{
"text": "An",
"label": "first_name"
},
{
"text": "[email protected]",
"label": "email"
},
{
"text": "+84 912 345 678",
"label": "phone_number"
},
{
"text": "Hà Nội",
"label": "city"
}
]
Persian (fa) — perfect
Input
نام من علی احمدی است. ایمیل من [email protected] و شماره من +98 912 345 6789 است. من در تهران زندگی میکنم.
Raw model output
[
{
"text": "علی",
"label": "first_name"
},
{
"text": "احمدی",
"label": "last_name"
},
{
"text": "[email protected]",
"label": "email"
},
{
"text": "+98 912 345 6789",
"label": "phone_number"
},
{
"text": "تهران",
"label": "city"
}
]
After production post-processing
[
{
"text": "علی",
"label": "first_name"
},
{
"text": "احمدی",
"label": "last_name"
},
{
"text": "[email protected]",
"label": "email"
},
{
"text": "+98 912 345 6789",
"label": "phone_number"
},
{
"text": "تهران",
"label": "city"
}
]
Polish (pl) — strict F1=0.80, production F1=1.00
Input
Nazywam się Jan Kowalski. Mój email to [email protected], a telefon +48 601 234 567. Mieszkam w Warszawie.
Raw model output
[
{
"text": "Jan",
"label": "first_name"
},
{
"text": "Kowalski",
"label": "last_name"
},
{
"text": "[email protected]",
"label": "email"
},
{
"text": "+48 601 234 567",
"label": "phone_number"
},
{
"text": "Warszawa",
"label": "city"
}
]
After production post-processing
[
{
"text": "Jan",
"label": "first_name"
},
{
"text": "Kowalski",
"label": "last_name"
},
{
"text": "[email protected]",
"label": "email"
},
{
"text": "+48 601 234 567",
"label": "phone_number"
},
{
"text": "Warszawa",
"label": "city"
}
]
Dutch (nl) — perfect
Input
Mijn naam is Lars de Vries. Mijn email is [email protected] en mijn telefoon +31 6 12345678. Ik woon in Amsterdam.
Raw model output
[
{
"text": "Lars",
"label": "first_name"
},
{
"text": "de Vries",
"label": "last_name"
},
{
"text": "[email protected]",
"label": "email"
},
{
"text": "+31 6 12345678",
"label": "phone_number"
},
{
"text": "Amsterdam",
"label": "city"
}
]
After production post-processing
[
{
"text": "Lars",
"label": "first_name"
},
{
"text": "de Vries",
"label": "last_name"
},
{
"text": "[email protected]",
"label": "email"
},
{
"text": "+31 6 12345678",
"label": "phone_number"
},
{
"text": "Amsterdam",
"label": "city"
}
]
Swahili (sw) — strict F1=0.83, production F1=1.00
Input
Jina langu ni Juma Hassan. Barua pepe yangu ni [email protected] na simu yangu ni +255 754 123 456. Ninaishi Dar es Salaam.
Raw model output
[
{
"text": "Jina",
"label": "first_name"
},
{
"text": "Jina langu",
"label": "first_name"
},
{
"text": "Juma",
"label": "first_name"
},
{
"text": "Hassan",
"label": "last_name"
},
{
"text": "[email protected]",
"label": "email"
},
{
"text": "+255 754 123 456",
"label": "phone_number"
},
{
"text": "Dar es Salaam",
"label": "city"
}
]
After production post-processing
[
{
"text": "Juma",
"label": "first_name"
},
{
"text": "Hassan",
"label": "last_name"
},
{
"text": "[email protected]",
"label": "email"
},
{
"text": "+255 754 123 456",
"label": "phone_number"
},
{
"text": "Dar es Salaam",
"label": "city"
}
]
Thai (th) — perfect
Input
ฉันชื่อสมชาย ใจดี อีเมลของฉันคือ [email protected] และเบอร์โทร +66 81 234 5678 ฉันอาศัยอยู่ที่กรุงเทพ
Raw model output
[
{
"text": "สมชาย",
"label": "first_name"
},
{
"text": "ใจดี",
"label": "last_name"
},
{
"text": "[email protected]",
"label": "email"
},
{
"text": "+66 81 234 5678",
"label": "phone_number"
},
{
"text": "กรุงเทพ",
"label": "city"
}
]
After production post-processing
[
{
"text": "สมชาย",
"label": "first_name"
},
{
"text": "ใจดี",
"label": "last_name"
},
{
"text": "[email protected]",
"label": "email"
},
{
"text": "+66 81 234 5678",
"label": "phone_number"
},
{
"text": "กรุงเทพ",
"label": "city"
}
]
Limitations
- Text input only. Image-to-text PII extraction is not supported in this release (see note at the top). Provide text input.
- Training data is English-only. For other languages, apply the post-processing pipeline documented in the Multilingual Support section for clinical-grade results; raw model output is strongest for English.
- Purpose-built for PII extraction — not a general-purpose NER or chat model.
- Performance may vary on highly domain-specific jargon or unconventional PII formats.
- As a generative model, it can occasionally emit a label outside the documented set or miss an entity. Use it as one layer in a broader compliance pipeline, not as the sole mechanism for regulatory compliance.
License
Released under the Apache 2.0 license.