---
language: en
license: apache-2.0
base_model: google/bert_uncased_L-4_H-256_A-4
tags:
- pii
- privacy
- routing
- text-classification
- knowledge-distillation
- tinybert
- traffic-control
datasets:
- ai4privacy/pii-masking-65k
metrics:
- f1
library_name: transformers
pipeline_tag: text-classification
model-index:
- name: bert-tiny-pii-router
  results:
  - task:
      type: text-classification
      name: PII Routing
    dataset:
      name: ai4privacy/pii-masking-65k
      type: ai4privacy/pii-masking-65k
    metrics:
    - name: F1
      type: f1
      value: 0.96
---

# bert-tiny-pii-router: The Semantic Gatekeeper

> *In the era of massive LLMs, sometimes the smartest solution is the smallest one.*

This is a **42 MB Traffic Controller** designed for high-performance NLP pipelines. Instead of blindly running heavy extraction models (NER, Regex, Address Parsers) on every user query, this model acts as a "Gatekeeper" that predicts *which* entities are present before extraction begins.

It was distilled from an **XLM-RoBERTa** teacher into a **TinyBERT** student, achieving a **9.1x speedup** while retaining 96% of the teacher's F1 score.

## The Problem: The "Blind Pipeline"

Traditional PII extraction pipelines are wasteful: they run every specialist model on every input.

* **Regex** is fast but generates false positives (e.g., mistaking dates like `12/05/2024` for IP addresses).
* **NER** models are accurate but heavy and slow.
* **LLMs** are excellent extractors but expensive, high-latency, and overkill for simple tasks.

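The regex failure mode is easy to reproduce. The pattern below is a deliberately permissive, hypothetical "IP-like" rule (not one of this model's engines) that tolerates both `.` and `/` as separators, and therefore fires on a date:

```python
import re

# Hypothetical, overly permissive "IP-like" pattern: it accepts both
# '.' and '/' as separators, the kind of sloppy rule that makes
# regex-only pipelines noisy.
LOOSE_IP = re.compile(r"\b\d{1,3}[./]\d{1,3}[./]\d{1,3}[./]?\d{0,3}\b")

for text in ["Server at 192.168.0.1 is down",   # a real IP -> match (correct)
             "The invoice is due 12/05/2024"]:  # a date   -> match (false positive)
    print(text, "->", bool(LOOSE_IP.search(text)))
```

A semantic router avoids this class of error entirely: it never asks the REGEX engine to look at text whose "numbery" content is temporal.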
## The Solution: The Router Pattern

Instead of a linear pipeline, we use a **Multi-Label Classifier** at the very front. It reads the text once and outputs a probability vector indicating which specialized engines are needed.

**Example Decision:**

Input: `"Schedule a meeting with John in London on Friday"`

```json
{
  "NER": 0.99,      // -> Trigger Name Extractor
  "ADDRESS": 0.98,  // -> Trigger Geocoder
  "TEMPORAL": 0.95, // -> Trigger Date Parser
  "REGEX": 0.01     // -> Skip Regex Engine (Save Compute)
}
```

## Tag Purification: Solving Class Overlap

A major challenge in training this router was **Class Overlap**: standard label sets leave ambiguous definitions that models struggle with.

> **The Ambiguity:**
> * **Is a date a regex match?** A date like `12/05/2024` fits a regex pattern, but semantically it belongs to the **TEMPORAL** engine, not the REGEX scanner.
> * **Is a state a name?** "California" is technically a Named Entity (NER), but for routing purposes it must be sent to the Geocoder (**ADDRESS**), not the Person/Org extractor.

To fix this, the training data underwent a **Tag Purification** pass. We mapped 38 granular tags into 4 distinct "Routing Engines" to enforce strict semantic boundaries:

| Source Tag | Action | Destination Engine (Label) |
| --- | --- | --- |
| **DATE / TIME** | **Removed** from Regex | `TEMPORAL` |
| **CITY / STATE / ZIP** | **Removed** from NER | `ADDRESS` |
| **IP / EMAIL / IBAN** | **Kept** in Regex | `REGEX` |
| **PERSON / ORG** | **Kept** in NER | `NER` |

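In code, the purification step amounts to a lookup table from granular tags to the four routing labels. The sketch below covers only the tags shown in the table; the full 38-tag mapping and the exact tag spellings are assumptions:

```python
# Hypothetical sketch of the Tag Purification lookup: granular dataset
# tags collapse into the 4 routing engines. Only the tags from the
# table above are shown; the remaining source tags are omitted.
TAG_TO_ENGINE = {
    "DATE": "TEMPORAL", "TIME": "TEMPORAL",
    "CITY": "ADDRESS", "STATE": "ADDRESS", "ZIP": "ADDRESS",
    "IP": "REGEX", "EMAIL": "REGEX", "IBAN": "REGEX",
    "PERSON": "NER", "ORG": "NER",
}

def purify(tags):
    """Collapse a sample's granular tags into its multi-label routing targets."""
    return sorted({TAG_TO_ENGINE[t] for t in tags if t in TAG_TO_ENGINE})

# A date stays out of the REGEX engine, a state stays out of NER:
print(purify(["DATE", "STATE", "PERSON"]))  # ['ADDRESS', 'NER', 'TEMPORAL']
```

Each purified tag set then becomes the multi-hot target vector for one training sample.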
## Student vs. Teacher (Distillation Results)

The model was trained using **Knowledge Distillation**: the student (`TinyBERT`) was trained to mimic the "soft targets" (the teacher's full output probabilities) of the teacher (`XLM-RoBERTa`), not just the final hard labels.

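A minimal sketch of such a distillation objective, assuming per-label binary cross-entropy blended between the teacher's sigmoid probabilities and the ground-truth labels (the weight `alpha` and the pure-Python formulation are illustrative, not the exact training recipe):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce(p, target):
    # Binary cross-entropy for one label; eps guards against log(0).
    eps = 1e-7
    p = min(max(p, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

def distill_loss(student_logits, teacher_probs, hard_labels, alpha=0.5):
    """Blend BCE against teacher soft targets with BCE against hard labels.
    alpha and the per-label averaging are illustrative assumptions."""
    losses = []
    for z, t_soft, y in zip(student_logits, teacher_probs, hard_labels):
        p = sigmoid(z)
        losses.append(alpha * bce(p, t_soft) + (1.0 - alpha) * bce(p, y))
    return sum(losses) / len(losses)

# A student that already agrees with a confident teacher incurs a small loss:
print(distill_loss([4.0, -4.0], [0.97, 0.03], [1, 0]))
```

The soft targets carry the teacher's calibrated uncertainty (e.g., 0.97 rather than a hard 1), which is what lets the tiny student recover most of the teacher's decision boundary.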
**Hardware:** Apple M4 Max
**Speedup:** 9.1x

| Model | Parameters | Size | Throughput | F1 Retention |
| --- | --- | --- | --- | --- |
| **Teacher** (XLM-R) | 278M | ~1 GB | ~360 samples/sec | 100% (Baseline) |
| **Student** (TinyBERT) | **11M** | **42 MB** | **~3,300 samples/sec** | **96%** |

## Evaluation Results

The model shows strong separation between the routing categories, with minimal confusion between semantically distinct classes (e.g., Address vs. Regex).



## Usage

This model outputs **Multi-Label Probabilities** for the 4 engines. We recommend a threshold of **0.5** to trigger a route.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "pinialt/bert-tiny-pii-router"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

def route_query(text, threshold=0.5):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)

    with torch.no_grad():
        logits = model(**inputs).logits

    # Sigmoid for multi-label (independent probabilities)
    probs = torch.sigmoid(logits)[0]

    # Map IDs to labels
    active_routes = [
        model.config.id2label[i]
        for i, score in enumerate(probs)
        if score > threshold
    ]

    if not active_routes:
        return "⚡ Direct to LLM (No PII)"

    return f"🚦 Route to Engines: {', '.join(active_routes)}"

# === EXAMPLES ===
# 1. Complex Multi-Entity Request
print(route_query("Schedule a meeting with John in London on Friday"))
# Output: 🚦 Route to Engines: NER, ADDRESS, TEMPORAL

# 2. Pure Address
print(route_query("Ship to 123 Main St, New York, NY"))
# Output: 🚦 Route to Engines: ADDRESS
```

## Training & Distillation Details

* **Teacher Model:** `xlm-roberta-base`
* **Student Model:** `google/bert_uncased_L-4_H-256_A-4`
* **Dataset:** `ai4privacy/pii-masking-65k` (Filtered & Purified)
* **Loss Function:**

## License

This project is licensed under the **Apache License 2.0**.
Fine-tuned from Google's `bert_uncased_L-4_H-256_A-4` model:

```
@article{turc2019,
  title={Well-Read Students Learn Better: On the Importance of Pre-training Compact Models},
  author={Turc, Iulia and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
  journal={arXiv preprint arXiv:1908.08962v2},
  year={2019}
}
```

## Links

* **Dataset:** [ai4privacy/pii-masking-65k](https://huggingface.co/datasets/ai4privacy/pii-masking-65k)
* **Training Code:** [pinialt/pii-router](https://github.com/pinialt/pii-router)