Ministral-3B-PII-Preview

Ministral-3B-PII-Preview is a 3.3B-parameter language model that detects personally identifiable information (PII) in unstructured text and returns it as a structured JSON array of typed entities. Give it any text and it emits a list of {"text": ..., "label": ...} objects spanning 69 PII entity types across the healthcare, financial, identity, and digital domains.

The model is an experimental, reinforcement-learning–trained variant of a Ministral-3B base. It was optimized with GRPO (Group Relative Policy Optimization) specifically to produce valid, schema-consistent JSON and to detect PII with high precision — making it suited to redaction, de-identification, and compliance workflows (HIPAA, GDPR, PCI-DSS).

Research preview. This is an experimental model intended for evaluation and pipeline integration. Use it as one layer in a broader privacy/compliance system, not as a sole compliance control.

⚠️ Text input only. This release is a text-to-text model: it reads text and returns JSON. The underlying architecture also contains a vision encoder, but image-to-text PII extraction is not supported in this version — passing images is not a validated path. Multimodal (image → PII) support is planned for a future release.

Key Results

Evaluated on a 1,000-sample held-out PII benchmark with greedy decoding, a 2,048-token prompt budget, and no assistant-side JSON-fence prefill.

| Metric | Score |
| --- | --- |
| Valid JSON rate | 1.000 |
| Valid label rate | 0.975 |
| Micro precision | 0.914 |
| Micro recall | 0.859 |
| Micro F1 | 0.886 |
| Format consistency | 100% |
| Empty-output consistency | 100% |

Every generation parsed as valid JSON, and the model reliably returns [] for text containing no PII.
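Downstream code may still want to confirm the output shape before trusting it. The sketch below is one plausible way to do that: it checks that a response is a JSON array of objects with string `text` and `label` fields. The function name `parse_entities` is hypothetical, and the benchmark's own validity checks are not described in this card, so treat this as an illustration rather than the evaluation code.

```python
import json


def parse_entities(response):
    """Parse a model response into a list of {"text", "label"} dicts, or raise ValueError."""
    try:
        data = json.loads(response)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc
    if not isinstance(data, list):
        raise ValueError("expected a JSON array at the top level")
    for item in data:
        # Every entity must be an object with string "text" and "label" fields.
        if not (
            isinstance(item, dict)
            and isinstance(item.get("text"), str)
            and isinstance(item.get("label"), str)
        ):
            raise ValueError(f"unexpected entity shape: {item!r}")
    return data


print(parse_entities("[]"))  # []
```

Raising on a malformed response, instead of silently returning an empty list, keeps a parsing failure from being mistaken for text that contains no PII.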

Supported PII Labels

The model recognizes 69 PII entity types. Each detected span is returned as {"text": "...", "label": "..."} using the label names below.

View all 69 entity types by category

| Category | Entity types |
| --- | --- |
| Identity & demographics | `first_name`, `last_name`, `title`, `date_of_birth`, `age`, `gender`, `nationality`, `race`, `ethnicity`, `race_ethnicity`, `religion`, `religious_belief`, `marital_status`, `sexuality`, `political_view`, `language`, `biometric_identifier` |
| Contact & address | `email`, `phone_number`, `fax_number`, `street_address`, `building_number`, `city`, `county`, `state`, `postcode`, `zip_code`, `country`, `coordinate` |
| Government & legal IDs | `social_security_number`, `ssn`, `national_id`, `driver_license_number`, `tax_id`, `license_plate`, `vehicle_identifier`, `certificate_license_number`, `unique_id` |
| Healthcare | `medical_record_number`, `health_plan_beneficiary_number`, `blood_type` |
| Financial | `credit_debit_card`, `cvv`, `pin`, `account_number`, `bank_routing_number`, `iban`, `swift_bic`, `salary` |
| Employment & organization | `occupation`, `employment_status`, `employee_id`, `education_level`, `organization`, `company_name`, `customer_id` |
| Digital & network | `ip_address`, `ipv4`, `ipv6`, `mac_address`, `url`, `user_name`, `password`, `http_cookie`, `api_key`, `device_identifier` |
| Temporal | `date`, `date_time`, `time` |
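Because the model is generative, a parsed entity can carry a label outside this table (the examples later in this card include `insurance_id` and `district`, for instance). A minimal sketch of a label check follows; the set is copied from the table above, while the helper name `split_by_schema` and the choice to return off-schema entities separately are assumptions for illustration, not part of the model's tooling.

```python
# The 69 documented labels, grouped as in the table above.
ALLOWED_LABELS = {
    # Identity & demographics
    "first_name", "last_name", "title", "date_of_birth", "age", "gender",
    "nationality", "race", "ethnicity", "race_ethnicity", "religion",
    "religious_belief", "marital_status", "sexuality", "political_view",
    "language", "biometric_identifier",
    # Contact & address
    "email", "phone_number", "fax_number", "street_address", "building_number",
    "city", "county", "state", "postcode", "zip_code", "country", "coordinate",
    # Government & legal IDs
    "social_security_number", "ssn", "national_id", "driver_license_number",
    "tax_id", "license_plate", "vehicle_identifier",
    "certificate_license_number", "unique_id",
    # Healthcare
    "medical_record_number", "health_plan_beneficiary_number", "blood_type",
    # Financial
    "credit_debit_card", "cvv", "pin", "account_number", "bank_routing_number",
    "iban", "swift_bic", "salary",
    # Employment & organization
    "occupation", "employment_status", "employee_id", "education_level",
    "organization", "company_name", "customer_id",
    # Digital & network
    "ip_address", "ipv4", "ipv6", "mac_address", "url", "user_name",
    "password", "http_cookie", "api_key", "device_identifier",
    # Temporal
    "date", "date_time", "time",
}


def split_by_schema(entities):
    """Return (known, unknown) entity lists based on the documented label set."""
    known = [e for e in entities if e.get("label") in ALLOWED_LABELS]
    unknown = [e for e in entities if e.get("label") not in ALLOWED_LABELS]
    return known, unknown


sample = [
    {"text": "4872910", "label": "medical_record_number"},
    {"text": "BCBS-7742185", "label": "insurance_id"},
]
known, unknown = split_by_schema(sample)
print(len(ALLOWED_LABELS), [e["label"] for e in unknown])  # 69 ['insurance_id']
```

Keeping the off-schema entities instead of discarding them is a deliberate choice in this sketch: for redaction it is usually safer to review or still mask an unexpected label than to drop the span.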

Quickstart

The PII extraction system prompt (with few-shot examples) is baked into the chat template, so no system message is required — just send the text. The template does not prefill a markdown json fence; the model emits the JSON array itself.

import torch
from transformers import AutoModelForImageTextToText, AutoTokenizer

model_id = "OpenMed/Ministral-3B-PII-Preview"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# The checkpoint uses a multimodal architecture, but this release is validated
# for TEXT input only. Load it with the image-text-to-text auto class and pass
# text — do not pass images.
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "user", "content": "Contact Sarah at [email protected] or 415-555-0198."},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=2048).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
# [{"text": "Sarah", "label": "first_name"}, {"text": "[email protected]", "label": "email"}, {"text": "415-555-0198", "label": "phone_number"}]

You may pass a custom system message to override the default behavior if needed. Keep the system-prompt pattern, and do not manually prefill ```json.
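For redaction workflows, the parsed entities can be used to mask the source text. The following is a minimal sketch, not part of this repository: the `redact` helper and the `[LABEL]` placeholder format are assumptions, and the sample sentence is illustrative. Since the model returns strings and not character offsets, the sketch masks every occurrence of each detected string.

```python
import json
import re


def redact(text, entities):
    """Replace every occurrence of each detected span with an upper-case [LABEL] tag."""
    labels = {e["text"]: e["label"] for e in entities if e.get("text")}
    if not labels:
        return text
    # Longest spans first, so a short span never clips a longer one that contains it.
    alternatives = sorted(labels, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in alternatives))
    return pattern.sub(lambda m: f"[{labels[m.group(0)].upper()}]", text)


# Illustrative response string in the documented output format.
response = (
    '[{"text": "Sarah", "label": "first_name"}, '
    '{"text": "415-555-0198", "label": "phone_number"}]'
)
print(redact("Contact Sarah or call 415-555-0198.", json.loads(response)))
# Contact [FIRST_NAME] or call [PHONE_NUMBER].
```

Doing the substitution in a single regex pass over the original text avoids a placeholder inserted for one entity being rewritten again by a later one.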

Optional: production post-processing

For non-English text especially, a small deterministic post-processing pass cleans up the raw output (Unicode normalization, span deduplication, CJK name splitting, Vietnamese name-order swap, language stopword filtering). The implementation ships with this repo in postprocess.py:

import json
from postprocess import postprocess_entities

entities = json.loads(response)
clean = postprocess_entities(entities, language="vi")  # pass the source language code

Examples by Compliance Domain

HIPAA — Medical Records

Input:

Patient Maria Garcia, DOB 03/15/1985, MRN 4872910, was admitted on 2024-01-20 for a routine blood panel. Her blood type is O-negative. Insurance ID: BCBS-7742185. Contact her at [email protected] or (312) 555-0147.

Output:

[
  {"text": "Maria", "label": "first_name"},
  {"text": "Garcia", "label": "last_name"},
  {"text": "03/15/1985", "label": "date_of_birth"},
  {"text": "4872910", "label": "medical_record_number"},
  {"text": "2024-01-20", "label": "date"},
  {"text": "O-negative", "label": "blood_type"},
  {"text": "BCBS-7742185", "label": "insurance_id"},
  {"text": "[email protected]", "label": "email"},
  {"text": "(312) 555-0147", "label": "phone_number"}
]

GDPR — EU Customer Data

Input:

Dear Mr. Lukas Weber, your account (CUST-DE-88412) has been updated. We have your address as Friedrichstrasse 42, 10117 Berlin, Germany. Your IBAN DE89370400440532013000 is on file. For verification, your national ID is T220001293. Please confirm via [email protected].

Output:

[
  {"text": "Mr.", "label": "title"},
  {"text": "Lukas", "label": "first_name"},
  {"text": "Weber", "label": "last_name"},
  {"text": "Friedrichstrasse 42", "label": "street_address"},
  {"text": "10117", "label": "zip_code"},
  {"text": "Berlin", "label": "city"},
  {"text": "Germany", "label": "country"},
  {"text": "CUST-DE-88412", "label": "account_number"},
  {"text": "DE89370400440532013000", "label": "iban"},
  {"text": "T220001293", "label": "national_id"},
  {"text": "[email protected]", "label": "email"}
]

PCI-DSS — Financial Data

Input:

Wire transfer requested by account holder James Liu, account #7781920034, routing 021000021. Credit card ending 4532-XXXX-XXXX-8901 was flagged. SSN on file: 123-45-6789. Tax ID: 92-1234567. Contact: [email protected], IP logged: 192.168.1.42.

Output:

[
  {"text": "James", "label": "first_name"},
  {"text": "Liu", "label": "last_name"},
  {"text": "7781920034", "label": "account_number"},
  {"text": "021000021", "label": "routing_number"},
  {"text": "4532-XXXX-XXXX-8901", "label": "credit_card"},
  {"text": "123-45-6789", "label": "ssn"},
  {"text": "92-1234567", "label": "tax_id"},
  {"text": "[email protected]", "label": "email"},
  {"text": "192.168.1.42", "label": "ip_address"}
]

No PII — Clean Text

Input:

The quarterly earnings report shows a 12% increase in revenue compared to last year. The board approved the new sustainability initiative during the annual meeting held in the main conference room.

Output:

[]

Multilingual Support (20 languages, zero-shot)

The model was trained only on English PII data but generalizes to other languages out of the box. We ran one realistic example per language across the top 20 world languages and scored the model under two conditions:

Strict: exact-match scoring on raw model output.
Production: raw output → a small deterministic post-processing pipeline (Unicode normalization, span deduplication, CJK name splitting, Vietnamese name-order swap, language stopword filter, Slavic case-tolerance at match time). This is the same pattern any real clinical PII system would run downstream of a model.

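| Mode | Perfect | Micro-P | Micro-R | Micro-F1 | TP | FP | FN |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Raw model output | 13/20 | 0.902 | 0.902 | 0.902 | 92 | 10 | 10 |
| + Production pipeline | 20/20 | 1.000 | 1.000 | 1.000 | 102 | 0 | 0 |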

Scored on 102 entities hand-annotated across all 20 languages.
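The micro-averaged scores in the table follow directly from the pooled TP/FP/FN counts. A short sketch of that arithmetic, using the standard definitions (the helper name `micro_prf` is illustrative):

```python
def micro_prf(tp, fp, fn):
    """Micro precision, recall, and F1 from counts pooled across all examples."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Raw model output row: TP=92, FP=10, FN=10.
p, r, f1 = micro_prf(92, 10, 10)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.902 0.902 0.902
```

The same harmonic-mean relationship reproduces the Key Results figures: a precision of 0.914 and a recall of 0.859 give an F1 of 0.886.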

Per-language F1

| # | Language | Code | Strict F1 | Production F1 |
| --- | --- | --- | --- | --- |
| 1 | English | `en` | 1.00 | 1.00 |
| 2 | Chinese | `zh` | 0.73 | 1.00 |
| 3 | Hindi | `hi` | 1.00 | 1.00 |
| 4 | Spanish | `es` | 1.00 | 1.00 |
| 5 | Arabic | `ar` | 1.00 | 1.00 |
| 6 | French | `fr` | 1.00 | 1.00 |
| 7 | Bengali | `bn` | 1.00 | 1.00 |
| 8 | Russian | `ru` | 0.80 | 1.00 |
| 9 | Portuguese | `pt` | 1.00 | 1.00 |
| 10 | Japanese | `ja` | 0.67 | 1.00 |
| 11 | German | `de` | 1.00 | 1.00 |
| 12 | Korean | `ko` | 0.67 | 1.00 |
| 13 | Italian | `it` | 1.00 | 1.00 |
| 14 | Turkish | `tr` | 1.00 | 1.00 |
| 15 | Vietnamese | `vi` | 0.62 | 1.00 |
| 16 | Persian | `fa` | 1.00 | 1.00 |
| 17 | Polish | `pl` | 0.80 | 1.00 |
| 18 | Dutch | `nl` | 1.00 | 1.00 |
| 19 | Swahili | `sw` | 0.83 | 1.00 |
| 20 | Thai | `th` | 1.00 | 1.00 |
| | Micro | | 0.902 | 1.000 |

The post-processing pipeline

Six deterministic steps. No heavy NLP dependencies — all regex, string ops, and small gazetteers. The full implementation lives in postprocess.py.

1. Unicode NFC + whitespace strip on every text field. Also applied to the input before inference.
2. Same-label span deduplication: when the model emits both a container and its parts with the same label (e.g. first_name=Nguyễn Văn An AND first_name=Nguyễn), keep the most specific.
3. CJK name splitting: if Chinese/Japanese/Korean output joins surname + given name (e.g. 田中太郎 as a single first_name), split it using a small surname gazetteer.
4. Vietnamese name-order swap: Vietnamese writes family-name-first. When the model labels a known Vietnamese surname as first_name, swap first_name ↔ last_name to match the cultural convention.
5. Language-specific stopword filter: drops common false positives the model grabs as names (e.g. Swahili Jina = "name", Vietnamese Tôi = "I").
6. Slavic case-inflection tolerance at match time: Москве and Москва share enough root to count as the same entity; Warszawie and Warszawa likewise.
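To make the list concrete, here is a minimal sketch of four of these steps: normalization, same-label deduplication, the Vietnamese name-order swap, and the stopword filter. It is not the code in postprocess.py. The function names, the tiny gazetteers, and the reading of "most specific" as "the shorter span contained in a longer one" are all assumptions made for illustration.

```python
import unicodedata

# Hypothetical mini-gazetteers; real lists would be larger.
VI_SURNAMES = {unicodedata.normalize("NFC", s) for s in ("Nguyễn", "Trần", "Lê", "Phạm")}
STOPWORDS = {"sw": {"Jina"}, "vi": {"Tôi"}}


def normalize(entities):
    """Apply Unicode NFC and strip whitespace on every text field."""
    return [
        {"text": unicodedata.normalize("NFC", e["text"]).strip(), "label": e["label"]}
        for e in entities
    ]


def dedupe_same_label(entities):
    """Drop a span when a different span with the same label sits inside it."""
    kept = []
    for e in entities:
        is_container = any(
            o["label"] == e["label"] and o["text"] != e["text"] and o["text"] in e["text"]
            for o in entities
        )
        if not is_container and e not in kept:
            kept.append(e)
    return kept


def swap_vietnamese_names(entities):
    """Swap first_name and last_name when a known surname is labeled first_name."""
    if not any(e["label"] == "first_name" and e["text"] in VI_SURNAMES for e in entities):
        return entities
    swap = {"first_name": "last_name", "last_name": "first_name"}
    return [{"text": e["text"], "label": swap.get(e["label"], e["label"])} for e in entities]


def drop_stopwords(entities, language):
    """Remove name-labeled spans that are common words in the given language."""
    stop = STOPWORDS.get(language, set())
    return [
        e for e in entities
        if not (e["label"] in {"first_name", "last_name"} and e["text"] in stop)
    ]


def postprocess_sketch(entities, language):
    entities = dedupe_same_label(normalize(entities))
    if language == "vi":
        entities = swap_vietnamese_names(entities)
    return drop_stopwords(entities, language)


raw_sw = [
    {"text": "Jina", "label": "first_name"},
    {"text": "Jina langu", "label": "first_name"},
    {"text": "Juma", "label": "first_name"},
    {"text": "Hassan", "label": "last_name"},
]
print([(e["text"], e["label"]) for e in postprocess_sketch(raw_sw, "sw")])
# [('Juma', 'first_name'), ('Hassan', 'last_name')]
```

Run on the name entities from the Swahili example in this card, the sketch drops both "Jina langu" (as a container) and "Jina" (as a stopword). Plain substring containment is a crude test and would need tightening for real data.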

The raw model already extracts 92/102 entities correctly. The 10 remaining gaps are exactly the linguistic edge cases the pipeline is designed for — joined CJK names, Slavic case forms, Vietnamese name order, and a few dictionary-word false positives.
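Two of those edge cases, joined CJK names and Slavic case forms, can be sketched in a few lines. This is an illustration under stated assumptions, not the logic in postprocess.py: the surname list is a tiny hypothetical gazetteer, the helper names are invented, and the shared-prefix thresholds (`min_shared`, `max_suffix`) are guesses at what "share enough root" could mean. Text is assumed to be NFC-normalized already.

```python
# Hypothetical surname gazetteer covering only the examples in this card.
CJK_SURNAMES = {"王", "李", "张", "田中", "佐藤", "김", "이", "박"}


def split_cjk_name(entity):
    """Split a joined surname + given name labeled first_name into two entities."""
    text = entity["text"]
    if entity["label"] != "first_name":
        return [entity]
    # Try longer surnames first so a two-character surname wins over a one-character one.
    for surname in sorted(CJK_SURNAMES, key=len, reverse=True):
        if text.startswith(surname) and len(text) > len(surname):
            return [
                {"text": text[len(surname):], "label": "first_name"},
                {"text": surname, "label": "last_name"},
            ]
    return [entity]


def same_entity_slavic(predicted, gold, min_shared=4, max_suffix=2):
    """Match two forms that share a long prefix and differ only in a short ending."""
    shared = 0
    for a, b in zip(predicted, gold):
        if a != b:
            break
        shared += 1
    longest_tail = max(len(predicted), len(gold)) - shared
    return shared >= min_shared and longest_tail <= max_suffix


print(split_cjk_name({"text": "田中太郎", "label": "first_name"}))
print(same_entity_slavic("Москва", "Москве"), same_entity_slavic("Warszawa", "Warszawie"))
# True True
```

A prefix comparison like this is only applied when scoring, so the emitted entity text is left unchanged, which matches the Russian and Polish examples where the output is identical before and after post-processing.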

All 20 language examples

Each block shows the input text, the raw model output, and the post-processed output side by side.

English (en) — perfect

Input

Hi, my name is Sarah Johnson. You can reach me at [email protected] or call 555-234-7891. I live in Seattle.

Raw model output

[
  {
    "text": "Sarah",
    "label": "first_name"
  },
  {
    "text": "Johnson",
    "label": "last_name"
  },
  {
    "text": "[email protected]",
    "label": "email"
  },
  {
    "text": "555-234-7891",
    "label": "phone_number"
  },
  {
    "text": "Seattle",
    "label": "city"
  }
]

After production post-processing

[
  {
    "text": "Sarah",
    "label": "first_name"
  },
  {
    "text": "Johnson",
    "label": "last_name"
  },
  {
    "text": "[email protected]",
    "label": "email"
  },
  {
    "text": "555-234-7891",
    "label": "phone_number"
  },
  {
    "text": "Seattle",
    "label": "city"
  }
]

Chinese (zh) — strict F1=0.73, production F1=1.00

Input

我叫王小明，电话是 13812345678，邮箱是 [email protected]，住在北京海淀区。

Raw model output

[
  {
    "text": "王小明",
    "label": "first_name"
  },
  {
    "text": "13812345678",
    "label": "phone_number"
  },
  {
    "text": "[email protected]",
    "label": "email"
  },
  {
    "text": "北京",
    "label": "city"
  },
  {
    "text": "海淀区",
    "label": "district"
  }
]

After production post-processing

[
  {
    "text": "小明",
    "label": "first_name"
  },
  {
    "text": "13812345678",
    "label": "phone_number"
  },
  {
    "text": "[email protected]",
    "label": "email"
  },
  {
    "text": "北京",
    "label": "city"
  },
  {
    "text": "海淀区",
    "label": "district"
  },
  {
    "text": "王",
    "label": "last_name"
  }
]

Hindi (hi) — perfect

Input

मेरा नाम राज शर्मा है। मेरा फ़ोन नंबर 9876543210 और ईमेल [email protected] है। मैं मुंबई में रहता हूं।

Raw model output

[
  {
    "text": "राज",
    "label": "first_name"
  },
  {
    "text": "शर्मा",
    "label": "last_name"
  },
  {
    "text": "9876543210",
    "label": "phone_number"
  },
  {
    "text": "[email protected]",
    "label": "email"
  },
  {
    "text": "मुंबई",
    "label": "city"
  }
]

After production post-processing

[
  {
    "text": "राज",
    "label": "first_name"
  },
  {
    "text": "शर्मा",
    "label": "last_name"
  },
  {
    "text": "9876543210",
    "label": "phone_number"
  },
  {
    "text": "[email protected]",
    "label": "email"
  },
  {
    "text": "मुंबई",
    "label": "city"
  }
]

Spanish (es) — perfect

Input

Me llamo María García. Mi correo es [email protected] y mi teléfono es +34 612 345 678. Vivo en Madrid.

Raw model output

[
  {
    "text": "María",
    "label": "first_name"
  },
  {
    "text": "García",
    "label": "last_name"
  },
  {
    "text": "[email protected]",
    "label": "email"
  },
  {
    "text": "+34 612 345 678",
    "label": "phone_number"
  },
  {
    "text": "Madrid",
    "label": "city"
  }
]

After production post-processing

[
  {
    "text": "María",
    "label": "first_name"
  },
  {
    "text": "García",
    "label": "last_name"
  },
  {
    "text": "[email protected]",
    "label": "email"
  },
  {
    "text": "+34 612 345 678",
    "label": "phone_number"
  },
  {
    "text": "Madrid",
    "label": "city"
  }
]

Arabic (ar) — perfect

Input

اسمي أحمد الحسن. بريدي الإلكتروني [email protected] ورقم هاتفي +966501234567. أسكن في الرياض.

Raw model output

[
  {
    "text": "أحمد",
    "label": "first_name"
  },
  {
    "text": "الحسن",
    "label": "last_name"
  },
  {
    "text": "[email protected]",
    "label": "email"
  },
  {
    "text": "+966501234567",
    "label": "phone_number"
  },
  {
    "text": "الرياض",
    "label": "city"
  }
]

After production post-processing

[
  {
    "text": "أحمد",
    "label": "first_name"
  },
  {
    "text": "الحسن",
    "label": "last_name"
  },
  {
    "text": "[email protected]",
    "label": "email"
  },
  {
    "text": "+966501234567",
    "label": "phone_number"
  },
  {
    "text": "الرياض",
    "label": "city"
  }
]

French (fr) — perfect

Input

Je m'appelle Pierre Dupont. Mon email est [email protected] et mon numéro est 06 12 34 56 78. J'habite à Paris.

Raw model output

[
  {
    "text": "Pierre",
    "label": "first_name"
  },
  {
    "text": "Dupont",
    "label": "last_name"
  },
  {
    "text": "[email protected]",
    "label": "email"
  },
  {
    "text": "06 12 34 56 78",
    "label": "phone_number"
  },
  {
    "text": "Paris",
    "label": "city"
  }
]

After production post-processing

[
  {
    "text": "Pierre",
    "label": "first_name"
  },
  {
    "text": "Dupont",
    "label": "last_name"
  },
  {
    "text": "[email protected]",
    "label": "email"
  },
  {
    "text": "06 12 34 56 78",
    "label": "phone_number"
  },
  {
    "text": "Paris",
    "label": "city"
  }
]

Bengali (bn) — perfect

Input

আমার নাম রাহুল দাস। আমার ফোন নম্বর 01712345678 এবং ইমেইল [email protected]। আমি ঢাকায় থাকি।

Raw model output

[
  {
    "text": "রাহুল",
    "label": "first_name"
  },
  {
    "text": "দাস",
    "label": "last_name"
  },
  {
    "text": "01712345678",
    "label": "phone_number"
  },
  {
    "text": "[email protected]",
    "label": "email"
  },
  {
    "text": "ঢাকা",
    "label": "city"
  }
]

After production post-processing

[
  {
    "text": "রাহুল",
    "label": "first_name"
  },
  {
    "text": "দাস",
    "label": "last_name"
  },
  {
    "text": "01712345678",
    "label": "phone_number"
  },
  {
    "text": "[email protected]",
    "label": "email"
  },
  {
    "text": "ঢাকা",
    "label": "city"
  }
]

Russian (ru) — strict F1=0.80, production F1=1.00

Input

Меня зовут Иван Петров. Мой email [email protected] и телефон +7 495 123-45-67. Живу в Москве.

Raw model output

[
  {
    "text": "Иван",
    "label": "first_name"
  },
  {
    "text": "Петров",
    "label": "last_name"
  },
  {
    "text": "[email protected]",
    "label": "email"
  },
  {
    "text": "+7 495 123-45-67",
    "label": "phone_number"
  },
  {
    "text": "Москва",
    "label": "city"
  }
]

After production post-processing

[
  {
    "text": "Иван",
    "label": "first_name"
  },
  {
    "text": "Петров",
    "label": "last_name"
  },
  {
    "text": "[email protected]",
    "label": "email"
  },
  {
    "text": "+7 495 123-45-67",
    "label": "phone_number"
  },
  {
    "text": "Москва",
    "label": "city"
  }
]

Portuguese (pt) — perfect

Input

Meu nome é João Silva. Meu email é [email protected] e telefone (11) 98765-4321. Moro em São Paulo.

Raw model output

[
  {
    "text": "João",
    "label": "first_name"
  },
  {
    "text": "Silva",
    "label": "last_name"
  },
  {
    "text": "[email protected]",
    "label": "email"
  },
  {
    "text": "(11) 98765-4321",
    "label": "phone_number"
  },
  {
    "text": "São Paulo",
    "label": "city"
  }
]

After production post-processing

[
  {
    "text": "João",
    "label": "first_name"
  },
  {
    "text": "Silva",
    "label": "last_name"
  },
  {
    "text": "[email protected]",
    "label": "email"
  },
  {
    "text": "(11) 98765-4321",
    "label": "phone_number"
  },
  {
    "text": "São Paulo",
    "label": "city"
  }
]

Japanese (ja) — strict F1=0.67, production F1=1.00

Input

私の名前は田中太郎です。メールは [email protected]、電話は 090-1234-5678 です。東京に住んでいます。

Raw model output

[
  {
    "text": "田中太郎",
    "label": "first_name"
  },
  {
    "text": "[email protected]",
    "label": "email"
  },
  {
    "text": "090-1234-5678",
    "label": "phone_number"
  },
  {
    "text": "東京",
    "label": "city"
  }
]

After production post-processing

[
  {
    "text": "太郎",
    "label": "first_name"
  },
  {
    "text": "[email protected]",
    "label": "email"
  },
  {
    "text": "090-1234-5678",
    "label": "phone_number"
  },
  {
    "text": "東京",
    "label": "city"
  },
  {
    "text": "田中",
    "label": "last_name"
  }
]

German (de) — perfect

Input

Ich heiße Hans Müller. Meine E-Mail ist [email protected], meine Telefonnummer 030 12345678. Ich wohne in Berlin.

Raw model output

[
  {
    "text": "Hans",
    "label": "first_name"
  },
  {
    "text": "Müller",
    "label": "last_name"
  },
  {
    "text": "[email protected]",
    "label": "email"
  },
  {
    "text": "030 12345678",
    "label": "phone_number"
  },
  {
    "text": "Berlin",
    "label": "city"
  }
]

After production post-processing

[
  {
    "text": "Hans",
    "label": "first_name"
  },
  {
    "text": "Müller",
    "label": "last_name"
  },
  {
    "text": "[email protected]",
    "label": "email"
  },
  {
    "text": "030 12345678",
    "label": "phone_number"
  },
  {
    "text": "Berlin",
    "label": "city"
  }
]

Korean (ko) — strict F1=0.67, production F1=1.00

Input

제 이름은 김민수입니다. 이메일은 [email protected] 이고 전화번호는 010-1234-5678 입니다. 서울에 살고 있습니다.

Raw model output

[
  {
    "text": "김민수",
    "label": "first_name"
  },
  {
    "text": "[email protected]",
    "label": "email"
  },
  {
    "text": "010-1234-5678",
    "label": "phone_number"
  },
  {
    "text": "서울",
    "label": "city"
  }
]

After production post-processing

[
  {
    "text": "민수",
    "label": "first_name"
  },
  {
    "text": "[email protected]",
    "label": "email"
  },
  {
    "text": "010-1234-5678",
    "label": "phone_number"
  },
  {
    "text": "서울",
    "label": "city"
  },
  {
    "text": "김",
    "label": "last_name"
  }
]

Italian (it) — perfect

Input

Mi chiamo Marco Rossi. La mia email è [email protected] e il mio telefono è +39 333 1234567. Abito a Roma.

Raw model output

[
  {
    "text": "Marco",
    "label": "first_name"
  },
  {
    "text": "Rossi",
    "label": "last_name"
  },
  {
    "text": "[email protected]",
    "label": "email"
  },
  {
    "text": "+39 333 1234567",
    "label": "phone_number"
  },
  {
    "text": "Roma",
    "label": "city"
  }
]

After production post-processing

[
  {
    "text": "Marco",
    "label": "first_name"
  },
  {
    "text": "Rossi",
    "label": "last_name"
  },
  {
    "text": "[email protected]",
    "label": "email"
  },
  {
    "text": "+39 333 1234567",
    "label": "phone_number"
  },
  {
    "text": "Roma",
    "label": "city"
  }
]

Turkish (tr) — perfect

Input

Adım Mehmet Yılmaz. E-postam [email protected] ve telefonum +90 532 123 45 67. İstanbul'da yaşıyorum.

Raw model output

[
  {
    "text": "Mehmet",
    "label": "first_name"
  },
  {
    "text": "Yılmaz",
    "label": "last_name"
  },
  {
    "text": "[email protected]",
    "label": "email"
  },
  {
    "text": "+90 532 123 45 67",
    "label": "phone_number"
  },
  {
    "text": "İstanbul",
    "label": "city"
  }
]

After production post-processing

[
  {
    "text": "Mehmet",
    "label": "first_name"
  },
  {
    "text": "Yılmaz",
    "label": "last_name"
  },
  {
    "text": "[email protected]",
    "label": "email"
  },
  {
    "text": "+90 532 123 45 67",
    "label": "phone_number"
  },
  {
    "text": "İstanbul",
    "label": "city"
  }
]

Vietnamese (vi) — strict F1=0.62, production F1=1.00

Input

Tôi tên là Nguyễn Văn An. Email của tôi là [email protected] và số điện thoại là +84 912 345 678. Tôi sống ở Hà Nội.

Raw model output

[
  {
    "text": "Nguyễn Văn An",
    "label": "first_name"
  },
  {
    "text": "Nguyễn",
    "label": "first_name"
  },
  {
    "text": "Văn",
    "label": "middle_name"
  },
  {
    "text": "An",
    "label": "last_name"
  },
  {
    "text": "[email protected]",
    "label": "email"
  },
  {
    "text": "+84 912 345 678",
    "label": "phone_number"
  },
  {
    "text": "Hà Nội",
    "label": "city"
  }
]

After production post-processing

[
  {
    "text": "Nguyễn",
    "label": "last_name"
  },
  {
    "text": "Văn",
    "label": "middle_name"
  },
  {
    "text": "An",
    "label": "first_name"
  },
  {
    "text": "[email protected]",
    "label": "email"
  },
  {
    "text": "+84 912 345 678",
    "label": "phone_number"
  },
  {
    "text": "Hà Nội",
    "label": "city"
  }
]

Persian (fa) — perfect

Input

نام من علی احمدی است. ایمیل من [email protected] و شماره من +98 912 345 6789 است. من در تهران زندگی می‌کنم.

Raw model output

[
  {
    "text": "علی",
    "label": "first_name"
  },
  {
    "text": "احمدی",
    "label": "last_name"
  },
  {
    "text": "[email protected]",
    "label": "email"
  },
  {
    "text": "+98 912 345 6789",
    "label": "phone_number"
  },
  {
    "text": "تهران",
    "label": "city"
  }
]

After production post-processing

[
  {
    "text": "علی",
    "label": "first_name"
  },
  {
    "text": "احمدی",
    "label": "last_name"
  },
  {
    "text": "[email protected]",
    "label": "email"
  },
  {
    "text": "+98 912 345 6789",
    "label": "phone_number"
  },
  {
    "text": "تهران",
    "label": "city"
  }
]

Polish (pl) — strict F1=0.80, production F1=1.00

Input

Nazywam się Jan Kowalski. Mój email to [email protected], a telefon +48 601 234 567. Mieszkam w Warszawie.

Raw model output

[
  {
    "text": "Jan",
    "label": "first_name"
  },
  {
    "text": "Kowalski",
    "label": "last_name"
  },
  {
    "text": "[email protected]",
    "label": "email"
  },
  {
    "text": "+48 601 234 567",
    "label": "phone_number"
  },
  {
    "text": "Warszawa",
    "label": "city"
  }
]

After production post-processing

[
  {
    "text": "Jan",
    "label": "first_name"
  },
  {
    "text": "Kowalski",
    "label": "last_name"
  },
  {
    "text": "[email protected]",
    "label": "email"
  },
  {
    "text": "+48 601 234 567",
    "label": "phone_number"
  },
  {
    "text": "Warszawa",
    "label": "city"
  }
]

Dutch (nl) — perfect

Input

Mijn naam is Lars de Vries. Mijn email is [email protected] en mijn telefoon +31 6 12345678. Ik woon in Amsterdam.

Raw model output

[
  {
    "text": "Lars",
    "label": "first_name"
  },
  {
    "text": "de Vries",
    "label": "last_name"
  },
  {
    "text": "[email protected]",
    "label": "email"
  },
  {
    "text": "+31 6 12345678",
    "label": "phone_number"
  },
  {
    "text": "Amsterdam",
    "label": "city"
  }
]

After production post-processing

[
  {
    "text": "Lars",
    "label": "first_name"
  },
  {
    "text": "de Vries",
    "label": "last_name"
  },
  {
    "text": "[email protected]",
    "label": "email"
  },
  {
    "text": "+31 6 12345678",
    "label": "phone_number"
  },
  {
    "text": "Amsterdam",
    "label": "city"
  }
]

Swahili (sw) — strict F1=0.83, production F1=1.00

Input

Jina langu ni Juma Hassan. Barua pepe yangu ni [email protected] na simu yangu ni +255 754 123 456. Ninaishi Dar es Salaam.

Raw model output

[
  {
    "text": "Jina",
    "label": "first_name"
  },
  {
    "text": "Jina langu",
    "label": "first_name"
  },
  {
    "text": "Juma",
    "label": "first_name"
  },
  {
    "text": "Hassan",
    "label": "last_name"
  },
  {
    "text": "[email protected]",
    "label": "email"
  },
  {
    "text": "+255 754 123 456",
    "label": "phone_number"
  },
  {
    "text": "Dar es Salaam",
    "label": "city"
  }
]

After production post-processing

[
  {
    "text": "Juma",
    "label": "first_name"
  },
  {
    "text": "Hassan",
    "label": "last_name"
  },
  {
    "text": "[email protected]",
    "label": "email"
  },
  {
    "text": "+255 754 123 456",
    "label": "phone_number"
  },
  {
    "text": "Dar es Salaam",
    "label": "city"
  }
]

Thai (th) — perfect

Input

ฉันชื่อสมชาย ใจดี อีเมลของฉันคือ [email protected] และเบอร์โทร +66 81 234 5678 ฉันอาศัยอยู่ที่กรุงเทพ

Raw model output

[
  {
    "text": "สมชาย",
    "label": "first_name"
  },
  {
    "text": "ใจดี",
    "label": "last_name"
  },
  {
    "text": "[email protected]",
    "label": "email"
  },
  {
    "text": "+66 81 234 5678",
    "label": "phone_number"
  },
  {
    "text": "กรุงเทพ",
    "label": "city"
  }
]

After production post-processing

[
  {
    "text": "สมชาย",
    "label": "first_name"
  },
  {
    "text": "ใจดี",
    "label": "last_name"
  },
  {
    "text": "[email protected]",
    "label": "email"
  },
  {
    "text": "+66 81 234 5678",
    "label": "phone_number"
  },
  {
    "text": "กรุงเทพ",
    "label": "city"
  }
]

Limitations

Text input only. Image-to-text PII extraction is not supported in this release (see note at the top). Provide text input.
Training data is English-only. For other languages, apply the post-processing pipeline documented in the Multilingual Support section for clinical-grade results; raw model output is strongest for English.
Purpose-built for PII extraction — not a general-purpose NER or chat model.
Performance may vary on highly domain-specific jargon or unconventional PII formats.
As a generative model, it can occasionally emit a label outside the documented set or miss an entity. Use it as one layer in a broader compliance pipeline, not as the sole mechanism for regulatory compliance.

License

Released under the Apache 2.0 license.
