---
language: en
license: cc-by-4.0
base_model: google/bert_uncased_L-4_H-256_A-4
tags:
- pii
- privacy
- routing
- text-classification
- knowledge-distillation
- tinybert
- traffic-control
datasets:
- ai4privacy/pii-masking-65k
metrics:
- f1
library_name: transformers
pipeline_tag: text-classification
model-index:
- name: bert-tiny-pii-router
  results:
  - task:
      type: text-classification
      name: PII Routing
    dataset:
      name: ai4privacy/pii-masking-65k
      type: ai4privacy/pii-masking-65k
    metrics:
    - name: F1
      type: f1
      value: 0.96
---

# bert-tiny-pii-router: The Semantic Gatekeeper

> *"In the era of massive LLMs, sometimes the smartest solution is the smallest one."*

This is a **42 MB Traffic Controller** designed for high-performance NLP pipelines. Instead of blindly running heavy extraction models (NER, Regex, Address Parsers) on every user query, this model acts as a "Gatekeeper" that predicts *which* entities are present before extraction begins.

It was distilled from an **XLM-RoBERTa** teacher into a **TinyBERT** student, achieving a **9.1x speedup** while retaining 96% of the teacher's accuracy.

## The Problem: The "Blind Pipeline"
Traditional PII extraction pipelines are wasteful. They run every specialist model on every input:
* **Regex** is fast but generates false positives (e.g., confusing dates like `12/05/2024` for IP addresses).
* **NER** models are accurate but heavy and slow.
* **LLMs** are perfect extractors but expensive, high-latency, and overkill for simple tasks.

## The Solution: The Router Pattern
Instead of a linear pipeline, we use a **Multi-Label Classifier** at the very front. It reads the text once and outputs a probability vector indicating which specialized engines are needed.

**Example Decision:**
```json
Input: "Schedule a meeting with John in London on Friday"
Output: {
  "NER": 0.99,      // -> Trigger Name Extractor
  "ADDRESS": 0.98,  // -> Trigger Geocoder
  "TEMPORAL": 0.95, // -> Trigger Date Parser
  "REGEX": 0.01     // -> Skip Regex Engine (Save Compute)
}
```

## Tag Purification: Solving Class Overlap

A major challenge in training this router was **Class Overlap**. Standard models struggle with ambiguous definitions.

> **The Ambiguity:**
> * **Is a Date a Regex?** A date like `12/05/2024` fits a regex pattern. But semantically, it belongs to the **TEMPORAL** engine, not the REGEX scanner.
> * **Is a State a Name?** "California" is technically a Named Entity (NER). But for routing purposes, it must be sent to the Geocoder (**ADDRESS**), not the Person/Org extractor.

To fix this, the training data underwent a **Tag Purification** layer. We mapped 38 granular tags into 4 distinct "Routing Engines" to enforce strict semantic boundaries:

| Source Tag | Action | Destination Engine (Label) |
| --- | --- | --- |
| **DATE / TIME** | **Removed** from Regex | `TEMPORAL` |
| **CITY / STATE / ZIP** | **Removed** from NER | `ADDRESS` |
| **IP / EMAIL / IBAN** | **Kept** in Regex | `REGEX` |
| **PERSON / ORG** | **Kept** in NER | `NER` |

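The purification step amounts to a tag-to-engine lookup applied before training. A minimal sketch (the mapping follows the table above, but the exact source-tag strings and helper name are illustrative assumptions, not the card's actual preprocessing code):

```python
# Illustrative Tag Purification mapping: collapse granular PII tags
# into the 4 routing-engine labels. Tag spellings are assumed.
TAG_TO_ENGINE = {
    # Temporal tags are pulled OUT of the regex scanner's domain
    "DATE": "TEMPORAL",
    "TIME": "TEMPORAL",
    # Location tags are pulled OUT of NER's domain
    "CITY": "ADDRESS",
    "STATE": "ADDRESS",
    "ZIP": "ADDRESS",
    # Pattern-like identifiers stay with regex
    "IP": "REGEX",
    "EMAIL": "REGEX",
    "IBAN": "REGEX",
    # People and organizations stay with NER
    "PERSON": "NER",
    "ORG": "NER",
}

def purify(tags):
    """Collapse a sample's granular tags into its multi-label engine set."""
    return sorted({TAG_TO_ENGINE[t] for t in tags if t in TAG_TO_ENGINE})

print(purify(["DATE", "CITY", "PERSON"]))  # → ['ADDRESS', 'NER', 'TEMPORAL']
```

Because each sample's tags collapse to a *set* of engines, the result is naturally a multi-hot label vector, which is what the multi-label classifier below is trained on.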
## Student vs. Teacher (Distillation Results)

The model was trained using **Knowledge Distillation**. The student (`TinyBERT`) was forced to mimic the "soft targets" (thought process) of the teacher (`XLM-RoBERTa`), not just the final labels.

**Hardware:** Apple M4 Max
**Speedup:** 9.1x

| Model | Parameters | Size | Throughput | F1 Retention |
| --- | --- | --- | --- | --- |
| **Teacher** (XLM-R) | 278M | ~1 GB | ~360 samples/sec | 100% (Baseline) |
| **Student** (TinyBERT) | **11M** | **42 MB** | **~3,300 samples/sec** | **96%** |

## Usage

This model outputs **Multi-Label Probabilities** for the 4 engines. We recommend a threshold of **0.5** to trigger a route.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "pinialt/bert-tiny-pii-router"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

def route_query(text, threshold=0.5):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)

    with torch.no_grad():
        logits = model(**inputs).logits

    # Sigmoid for multi-label (independent probabilities per engine)
    probs = torch.sigmoid(logits)[0]

    # Map class IDs to engine labels
    active_routes = [
        model.config.id2label[i]
        for i, score in enumerate(probs)
        if score > threshold
    ]

    if not active_routes:
        return "⚡ Direct to LLM (No PII)"

    return f"🚦 Route to Engines: {', '.join(active_routes)}"

# === EXAMPLES ===
# 1. Complex multi-entity request
print(route_query("Schedule a meeting with John in London on Friday"))
# Output: 🚦 Route to Engines: NER, ADDRESS, TEMPORAL

# 2. Pure address
print(route_query("Ship to 123 Main St, New York, NY"))
# Output: 🚦 Route to Engines: ADDRESS
```
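The compute saving comes from what happens *after* routing: only the flagged engines run at all. A minimal dispatcher sketch (the engine functions below are hypothetical placeholders standing in for the heavy NER model, geocoder, date parser, and regex scanner; they are not part of this repository):

```python
# Hypothetical extraction engines — placeholders for the real
# heavyweight models the router gates.
ENGINES = {
    "NER": lambda text: f"names in {text!r}",
    "ADDRESS": lambda text: f"addresses in {text!r}",
    "TEMPORAL": lambda text: f"dates in {text!r}",
    "REGEX": lambda text: f"patterns in {text!r}",
}

def dispatch(text, probs, threshold=0.5):
    """Invoke only the engines the router flags; skipped engines cost nothing."""
    return {
        engine: ENGINES[engine](text)
        for engine, score in probs.items()
        if score > threshold
    }

# Probabilities as the router would emit them for the example above
router_output = {"NER": 0.99, "ADDRESS": 0.98, "TEMPORAL": 0.95, "REGEX": 0.01}
results = dispatch("Schedule a meeting with John in London on Friday", router_output)
print(sorted(results))  # → ['ADDRESS', 'NER', 'TEMPORAL']  (REGEX never ran)
```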

## Training & Distillation Details

* **Teacher Model:** `xlm-roberta-base`
* **Student Model:** `google/bert_uncased_L-4_H-256_A-4`
* **Dataset:** `ai4privacy/pii-masking-65k` (filtered and purified)
* **Loss Function:**

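The card leaves the loss unspecified. A typical objective for multi-label distillation of this kind (stated here as a plausible standard formulation, not confirmed by the training code) combines hard-label binary cross-entropy with a temperature-scaled soft-target term from the teacher:

```math
\mathcal{L} = \alpha \cdot \mathrm{BCE}\!\left(\sigma(z_s),\, y\right) + (1 - \alpha) \cdot T^{2} \cdot \mathrm{BCE}\!\left(\sigma(z_s / T),\, \sigma(z_t / T)\right)
```

where $z_s$ and $z_t$ are the student and teacher logits, $y$ the purified multi-hot labels, $T$ the distillation temperature, and $\alpha$ the hard/soft weighting.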
## License

This project is licensed under the **CC-BY-4.0 License**.

## Links

* **Dataset:** [ai4privacy/pii-masking-65k](https://huggingface.co/datasets/ai4privacy/pii-masking-65k)
* **Training Code:** [pinialt/pii-router](https://github.com/pinialt/pii-router)