---
language: en
license: apache-2.0
base_model: google/bert_uncased_L-4_H-256_A-4
tags:
- pii
- privacy
- routing
- text-classification
- knowledge-distillation
- tinybert
- traffic-control
datasets:
- ai4privacy/pii-masking-65k
metrics:
- f1
library_name: transformers
pipeline_tag: text-classification
model-index:
- name: bert-tiny-pii-router
  results:
  - task:
      type: text-classification
      name: PII Routing
    dataset:
      name: ai4privacy/pii-masking-65k
      type: ai4privacy/pii-masking-65k
    metrics:
    - name: F1
      type: f1
      value: 0.96
---

# bert-tiny-pii-router: The Semantic Gatekeeper

> *In the era of massive LLMs, sometimes the smartest solution is the smallest one.*

This is a **42 MB Traffic Controller** designed for high-performance NLP pipelines. Instead of blindly running heavy extraction models (NER, Regex, Address Parsers) on every user query, this model acts as a "Gatekeeper" that predicts *which* entities are present before extraction begins.

It was distilled from an **XLM-RoBERTa** teacher into a **TinyBERT** student, achieving a **9.1x speedup** while retaining 96% of the teacher's F1 score.

## The Problem: The "Blind Pipeline"

Traditional PII extraction pipelines are wasteful: they run every specialist model on every input.

* **Regex** is fast but generates false positives (e.g., mistaking dates like `12/05/2024` for IP addresses).
* **NER** models are accurate but heavy and slow.
* **LLMs** are excellent extractors but expensive, high-latency, and overkill for simple tasks.

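The regex failure mode is easy to reproduce. The pattern below is a deliberately permissive, hypothetical "IP-like" rule (not one of this model's engines) that tolerates both `.` and `/` as separators, and therefore fires on a date:

```python
import re

# Hypothetical, overly permissive "IP-like" pattern: it accepts both
# '.' and '/' as separators, the kind of sloppy rule that makes
# regex-only pipelines noisy.
LOOSE_IP = re.compile(r"\b\d{1,3}[./]\d{1,3}[./]\d{1,3}[./]?\d{0,3}\b")

for text in ["Server at 192.168.0.1 is down",   # a real IP -> match (correct)
             "The invoice is due 12/05/2024"]:  # a date   -> match (false positive)
    print(text, "->", bool(LOOSE_IP.search(text)))
```

A semantic router avoids this class of error entirely: it never asks the REGEX engine to look at text whose "numbery" content is temporal.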
## The Solution: The Router Pattern

Instead of a linear pipeline, we use a **Multi-Label Classifier** at the very front. It reads the text once and outputs a probability vector indicating which specialized engines are needed.

**Example Decision:**

Input: `"Schedule a meeting with John in London on Friday"`

```json
{
  "NER": 0.99,      // -> Trigger Name Extractor
  "ADDRESS": 0.98,  // -> Trigger Geocoder
  "TEMPORAL": 0.95, // -> Trigger Date Parser
  "REGEX": 0.01     // -> Skip Regex Engine (Save Compute)
}
```

## Tag Purification: Solving Class Overlap

A major challenge in training this router was **Class Overlap**: standard label sets leave ambiguous definitions that models struggle with.

> **The Ambiguity:**
> * **Is a date a regex match?** A date like `12/05/2024` fits a regex pattern, but semantically it belongs to the **TEMPORAL** engine, not the REGEX scanner.
> * **Is a state a name?** "California" is technically a Named Entity (NER), but for routing purposes it must be sent to the Geocoder (**ADDRESS**), not the Person/Org extractor.

To fix this, the training data underwent a **Tag Purification** pass. We mapped 38 granular tags into 4 distinct "Routing Engines" to enforce strict semantic boundaries:

| Source Tag | Action | Destination Engine (Label) |
| --- | --- | --- |
| **DATE / TIME** | **Removed** from Regex | `TEMPORAL` |
| **CITY / STATE / ZIP** | **Removed** from NER | `ADDRESS` |
| **IP / EMAIL / IBAN** | **Kept** in Regex | `REGEX` |
| **PERSON / ORG** | **Kept** in NER | `NER` |

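In code, the purification step amounts to a lookup table from granular tags to the four routing labels. The sketch below covers only the tags shown in the table; the full 38-tag mapping and the exact tag spellings are assumptions:

```python
# Hypothetical sketch of the Tag Purification lookup: granular dataset
# tags collapse into the 4 routing engines. Only the tags from the
# table above are shown; the remaining source tags are omitted.
TAG_TO_ENGINE = {
    "DATE": "TEMPORAL", "TIME": "TEMPORAL",
    "CITY": "ADDRESS", "STATE": "ADDRESS", "ZIP": "ADDRESS",
    "IP": "REGEX", "EMAIL": "REGEX", "IBAN": "REGEX",
    "PERSON": "NER", "ORG": "NER",
}

def purify(tags):
    """Collapse a sample's granular tags into its multi-label routing targets."""
    return sorted({TAG_TO_ENGINE[t] for t in tags if t in TAG_TO_ENGINE})

# A date stays out of the REGEX engine, a state stays out of NER:
print(purify(["DATE", "STATE", "PERSON"]))  # ['ADDRESS', 'NER', 'TEMPORAL']
```

Each purified tag set then becomes the multi-hot target vector for one training sample.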
## Student vs. Teacher (Distillation Results)

The model was trained using **Knowledge Distillation**: the student (`TinyBERT`) was trained to mimic the "soft targets" (the teacher's full output probabilities) of the teacher (`XLM-RoBERTa`), not just the final hard labels.

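A minimal sketch of such a distillation objective, assuming per-label binary cross-entropy blended between the teacher's sigmoid probabilities and the ground-truth labels (the weight `alpha` and the pure-Python formulation are illustrative, not the exact training recipe):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce(p, target):
    # Binary cross-entropy for one label; eps guards against log(0).
    eps = 1e-7
    p = min(max(p, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

def distill_loss(student_logits, teacher_probs, hard_labels, alpha=0.5):
    """Blend BCE against teacher soft targets with BCE against hard labels.
    alpha and the per-label averaging are illustrative assumptions."""
    losses = []
    for z, t_soft, y in zip(student_logits, teacher_probs, hard_labels):
        p = sigmoid(z)
        losses.append(alpha * bce(p, t_soft) + (1.0 - alpha) * bce(p, y))
    return sum(losses) / len(losses)

# A student that already agrees with a confident teacher incurs a small loss:
print(distill_loss([4.0, -4.0], [0.97, 0.03], [1, 0]))
```

The soft targets carry the teacher's calibrated uncertainty (e.g., 0.97 rather than a hard 1), which is what lets the tiny student recover most of the teacher's decision boundary.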
**Hardware:** Apple M4 Max
**Speedup:** 9.1x

| Model | Parameters | Size | Throughput | F1 Retention |
| --- | --- | --- | --- | --- |
| **Teacher** (XLM-R) | 278M | ~1 GB | ~360 samples/sec | 100% (Baseline) |
| **Student** (TinyBERT) | **11M** | **42 MB** | **~3,300 samples/sec** | **96%** |

## Evaluation Results

The model shows strong separation between the routing categories, with minimal confusion between semantically distinct classes (e.g., Address vs. Regex).



## Usage

This model outputs **Multi-Label Probabilities** for the 4 engines. We recommend a threshold of **0.5** to trigger a route.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "pinialt/bert-tiny-pii-router"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

def route_query(text, threshold=0.5):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)

    with torch.no_grad():
        logits = model(**inputs).logits

    # Sigmoid for multi-label (independent probabilities)
    probs = torch.sigmoid(logits)[0]

    # Map IDs to labels
    active_routes = [
        model.config.id2label[i]
        for i, score in enumerate(probs)
        if score > threshold
    ]

    if not active_routes:
        return "⚡ Direct to LLM (No PII)"

    return f"🚦 Route to Engines: {', '.join(active_routes)}"

# === EXAMPLES ===
# 1. Complex Multi-Entity Request
print(route_query("Schedule a meeting with John in London on Friday"))
# Output: 🚦 Route to Engines: NER, ADDRESS, TEMPORAL

# 2. Pure Address
print(route_query("Ship to 123 Main St, New York, NY"))
# Output: 🚦 Route to Engines: ADDRESS
```

## Training & Distillation Details

* **Teacher Model:** `xlm-roberta-base`
* **Student Model:** `google/bert_uncased_L-4_H-256_A-4`
* **Dataset:** `ai4privacy/pii-masking-65k` (Filtered & Purified)
* **Loss Function:**

## License

This project is licensed under the **Apache License 2.0**.
Fine-tuned from Google's `bert_uncased_L-4_H-256_A-4` model:

```
@article{turc2019,
  title={Well-Read Students Learn Better: On the Importance of Pre-training Compact Models},
  author={Turc, Iulia and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
  journal={arXiv preprint arXiv:1908.08962v2},
  year={2019}
}
```

## Links

* **Dataset:** [ai4privacy/pii-masking-65k](https://huggingface.co/datasets/ai4privacy/pii-masking-65k)
* **Training Code:** [pinialt/pii-router](https://github.com/pinialt/pii-router)