| | --- |
| | language: he |
| | license: mit |
| | library_name: transformers |
| | tags: |
| | - hebrew |
| | - ner |
| | - pii-detection |
| | - token-classification |
| | - xlm-roberta |
| | - privacy |
| | - data-anonymization |
| | - golemguard |
| | datasets: |
| | - CordwainerSmith/GolemGuard |
| | model-index: |
| | - name: GolemPII-v1
| |   results:
| |   - task:
| |       name: Token Classification
| |       type: token-classification
| |     metrics:
| |     - name: F1
| |       type: f1
| |       value: 0.9982
| |     - name: Precision
| |       type: precision
| |       value: 0.9982
| |     - name: Recall
| |       type: recall
| |       value: 0.9982
| | --- |
| | |
| | # GolemPII-v1 - Hebrew PII Detection Model |
| |
|
| | This model detects personally identifiable information (PII) in Hebrew text. Although it is based on the multilingual XLM-RoBERTa model, it has been fine-tuned specifically on Hebrew data and reaches a micro-averaged F1 of 0.9982 across 12 PII entity types on its evaluation set.
| |
|
| | ## Model Details |
| | - Based on xlm-roberta-base |
| | - Fine-tuned on the GolemGuard: Hebrew Privacy Information Detection Corpus |
| | - Optimized for token classification tasks in Hebrew text |
| |
|
| | ## Intended Uses & Limitations |
| |
|
| | This model is intended for: |
| |
|
| | * **Privacy Protection:** Detecting and masking PII in Hebrew text to protect individual privacy. |
| | * **Data Anonymization:** Automating the process of de-identifying Hebrew documents in legal, medical, and other sensitive contexts. |
| | * **Research:** Supporting research in Hebrew natural language processing and PII detection. |
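
The masking use case listed above can be sketched as a small post-processing step. This is a minimal illustration, not part of the released model: `mask_entities` is a hypothetical helper, and it assumes entity spans arrive as dicts with `entity_group`, `start`, and `end` keys, which is the shape the `transformers` token-classification pipeline returns when an `aggregation_strategy` is set.

```python
def mask_entities(text, entities):
    """Replace each detected span with a [LABEL] placeholder.

    `entities` is assumed to be a list of dicts holding character offsets
    ("start", "end") and a label ("entity_group").
    """
    masked = text
    # Work from the end of the string backwards so earlier offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{ent['entity_group']}]"
        masked = masked[: ent["start"]] + placeholder + masked[ent["end"]:]
    return masked


# Hand-written span for illustration; real spans would come from the model.
example_text = "My phone is 050-1234567"
example_entities = [{"entity_group": "PHONE_NUM", "start": 12, "end": 23}]
print(mask_entities(example_text, example_entities))
```

Replacing spans from right to left avoids recomputing offsets after each substitution, since placeholders rarely have the same length as the text they replace.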
| |
|
| | ## Training Parameters |
| |
|
| | * **Batch Size:** 32 |
| | * **Learning Rate:** 2e-5 with linear warmup and decay. |
| | * **Optimizer:** AdamW |
| | * **Hardware:** Trained on a single NVIDIA A100 GPU.
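
To make "linear warmup and decay" concrete, the sketch below computes the learning rate at a given step for a peak rate of 2e-5. The warmup length and total step count are not stated in this card, so the values used here are illustrative assumptions, and `linear_warmup_decay_lr` is a hypothetical helper. The formula follows the common linear schedule (as in `get_linear_schedule_with_warmup` from `transformers`); whether that exact scheduler was used for this model is also an assumption.

```python
def linear_warmup_decay_lr(step, total_steps, warmup_steps, peak_lr=2e-5):
    """Learning rate at `step` for linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to peak_lr over the warmup phase.
        return peak_lr * step / max(1, warmup_steps)
    # Decay linearly from peak_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Illustrative values only: the card does not state step counts.
total_steps, warmup_steps = 1000, 100
for step in (0, 50, 100, 550, 1000):
    print(step, linear_warmup_decay_lr(step, total_steps, warmup_steps))
```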
| |
|
| | ## Dataset Details |
| |
|
| | * **Dataset Name:** GolemGuard: Hebrew Privacy Information Detection Corpus |
| | * **Dataset Link:** [https://huggingface.co/datasets/CordwainerSmith/GolemGuard](https://huggingface.co/datasets/CordwainerSmith/GolemGuard) |
| |
|
| | ## Performance Metrics |
| |
|
| | ### Final Evaluation Results |
| | ``` |
| | eval_loss: 0.000729 |
| | eval_precision: 0.9982 |
| | eval_recall: 0.9982 |
| | eval_f1: 0.9982 |
| | eval_accuracy: 0.999795 |
| | ``` |
| |
|
| | ### Detailed Performance by Label |
| |
|
| | | Label | Precision | Recall | F1-Score | Support | |
| | |------------------|-----------|---------|----------|---------| |
| | | BANK_ACCOUNT_NUM | 1.0000 | 1.0000 | 1.0000 | 4847 | |
| | | CC_NUM | 1.0000 | 1.0000 | 1.0000 | 234 | |
| | | CC_PROVIDER | 1.0000 | 1.0000 | 1.0000 | 242 | |
| | | CITY | 0.9997 | 0.9995 | 0.9996 | 12237 | |
| | | DATE | 0.9997 | 0.9998 | 0.9997 | 11943 | |
| | | EMAIL | 0.9998 | 1.0000 | 0.9999 | 13235 | |
| | | FIRST_NAME | 0.9937 | 0.9938 | 0.9937 | 17888 | |
| | | ID_NUM | 0.9999 | 1.0000 | 1.0000 | 10577 | |
| | | LAST_NAME | 0.9928 | 0.9921 | 0.9925 | 15655 | |
| | | PHONE_NUM | 1.0000 | 0.9998 | 0.9999 | 20838 | |
| | | POSTAL_CODE | 0.9998 | 0.9999 | 0.9999 | 13321 | |
| | | STREET | 0.9999 | 0.9999 | 0.9999 | 14032 | |
| | | micro avg | 0.9982 | 0.9982 | 0.9982 | 135049 | |
| | | macro avg | 0.9988 | 0.9987 | 0.9988 | 135049 | |
| | | weighted avg | 0.9982 | 0.9982 | 0.9982 | 135049 | |
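
The three average rows in the table differ only in how per-label scores are combined. As a quick check, the sketch below recomputes the macro average (unweighted mean over labels) and the weighted average (mean weighted by support) of F1 from the rows above. The inputs are already rounded to four decimals, so the recomputed values are expected to agree with the table only up to rounding.

```python
# (label, F1, support) copied from the per-label table above.
per_label = [
    ("BANK_ACCOUNT_NUM", 1.0000, 4847),
    ("CC_NUM", 1.0000, 234),
    ("CC_PROVIDER", 1.0000, 242),
    ("CITY", 0.9996, 12237),
    ("DATE", 0.9997, 11943),
    ("EMAIL", 0.9999, 13235),
    ("FIRST_NAME", 0.9937, 17888),
    ("ID_NUM", 1.0000, 10577),
    ("LAST_NAME", 0.9925, 15655),
    ("PHONE_NUM", 0.9999, 20838),
    ("POSTAL_CODE", 0.9999, 13321),
    ("STREET", 0.9999, 14032),
]

f1_scores = [f1 for _, f1, _ in per_label]
supports = [n for _, _, n in per_label]

# Macro: every label counts equally, regardless of how often it occurs.
macro_f1 = sum(f1_scores) / len(f1_scores)
# Weighted: each label contributes in proportion to its support.
weighted_f1 = sum(f1 * n for _, f1, n in per_label) / sum(supports)

print(f"total support: {sum(supports)}")
print(f"macro F1:      {macro_f1:.4f}")
print(f"weighted F1:   {weighted_f1:.4f}")
```

The weighted figure is lower than the macro figure because the two weakest labels, `FIRST_NAME` and `LAST_NAME`, also have some of the largest support counts.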
| | |
| | ### Training Progress |
| | |
| | | Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy | |
| | |-------|--------------|-----------------|-----------|---------|----------|----------| |
| | | 1 | 0.005800 | 0.002487 | 0.993109 | 0.993678| 0.993393 | 0.999328 | |
| | | 2 | 0.001700 | 0.001385 | 0.995469 | 0.995947| 0.995708 | 0.999575 | |
| | | 3 | 0.001200 | 0.000946 | 0.997159 | 0.997487| 0.997323 | 0.999739 | |
| | | 4 | 0.000900 | 0.000896 | 0.997626 | 0.997868| 0.997747 | 0.999750 | |
| | | 5 | 0.000600 | 0.000729 | 0.997981 | 0.998191| 0.998086 | 0.999795 | |
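
The F1 column is the harmonic mean of the precision and recall columns, F1 = 2PR / (P + R). The short check below recomputes it for each epoch from the values in the table; since the table entries are rounded to six decimals, agreement is expected to within about 1e-6.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (epoch, precision, recall, reported F1) copied from the table above.
training_progress = [
    (1, 0.993109, 0.993678, 0.993393),
    (2, 0.995469, 0.995947, 0.995708),
    (3, 0.997159, 0.997487, 0.997323),
    (4, 0.997626, 0.997868, 0.997747),
    (5, 0.997981, 0.998191, 0.998086),
]

for epoch, p, r, reported in training_progress:
    print(f"epoch {epoch}: computed {f1_score(p, r):.6f}, reported {reported:.6f}")
```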
| | |
| | ## Model Architecture |
| | |
| | The model is based on the `FacebookAI/xlm-roberta-base` architecture, a transformer-based language model pre-trained on a massive multilingual dataset. No architectural modifications were made to the base model during fine-tuning. |
| | |
| | ## Usage |
| | ```python
| | import torch
| | from transformers import AutoTokenizer, AutoModelForTokenClassification
| | 
| | # Repository ID taken from the citation URL in this model card
| | repo_id = "CordwainerSmith/GolemPII-v1"
| | tokenizer = AutoTokenizer.from_pretrained(repo_id)
| | model = AutoModelForTokenClassification.from_pretrained(repo_id)
| | 
| | # Example text (Hebrew): "Hello, my name is David Cohen and I live at 42 Herzl Street
| | # in Tel Aviv. My phone number is 050-1234567"
| | text = "שלום, שמי דוד כהן ואני גר ברחוב הרצל 42 בתל אביב. הטלפון שלי הוא 050-1234567"
| | 
| | # Tokenize and get predictions
| | inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
| | with torch.no_grad():
| |     outputs = model(**inputs)
| |     predictions = torch.argmax(outputs.logits, dim=2)
| | 
| | # Convert predictions to labels
| | tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
| | labels = [model.config.id2label[t.item()] for t in predictions[0]]
| | 
| | # Print results (excluding special tokens and non-entity labels).
| | # XLM-RoBERTa uses SentencePiece, so subword pieces carry no "##" prefix;
| | # special tokens such as <s> and </s> are filtered explicitly instead.
| | for token, label in zip(tokens, labels):
| |     if label != "O" and token not in tokenizer.all_special_tokens:
| |         print(f"Token: {token}, Label: {label}")
| | ```
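
The loop above prints one line per subword piece. For word-level output, the pieces can be regrouped using the SentencePiece convention that a leading `▁` marks the start of a new word. The sketch below is one simple way to do this, under stated assumptions: `merge_subword_predictions` is a hypothetical helper, each word takes the label of its first piece, labels are passed through unchanged (if the model's `id2label` uses `B-`/`I-` prefixes, strip them first), and the sample tokenization is written by hand rather than produced by the real tokenizer.

```python
def merge_subword_predictions(tokens, labels, special_tokens=("<s>", "</s>", "<pad>")):
    """Group SentencePiece pieces into words, keeping the first piece's label."""
    words = []
    for token, label in zip(tokens, labels):
        if token in special_tokens:
            continue
        if token.startswith("▁") or not words:
            # "▁" marks the start of a new word in SentencePiece output.
            words.append([token.lstrip("▁"), label])
        else:
            # Continuation piece: append it to the current word.
            words[-1][0] += token
    # Keep only words tagged as entities.
    return [(word, label) for word, label in words if label != "O"]


# Hypothetical tokenization, written by hand for illustration.
tokens = ["<s>", "▁phone", ":", "▁050", "-12", "345", "67", "</s>"]
labels = ["O", "O", "O", "PHONE_NUM", "PHONE_NUM", "PHONE_NUM", "PHONE_NUM", "O"]
print(merge_subword_predictions(tokens, labels))
```

In practice, the `transformers` token-classification pipeline with an `aggregation_strategy` performs a similar grouping and also returns character offsets.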
| | |
| | |
| | ## License |
| | |
| | The GolemPII-v1 model is released under the MIT License with the following additional terms:
| | |
| | ``` |
| | MIT License |
| | |
| | Copyright (c) 2024 Liran Baba |
| | |
| | Permission is hereby granted, free of charge, to any person obtaining a copy |
| | of this dataset and associated documentation files (the "Dataset"), to deal |
| | in the Dataset without restriction, including without limitation the rights |
| | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell |
| | copies of the Dataset, and to permit persons to whom the Dataset is |
| | furnished to do so, subject to the following conditions: |
| | |
| | 1. The above copyright notice and this permission notice shall be included in all |
| | copies or substantial portions of the Dataset. |
| | |
| | 2. Any academic or professional work that uses this Dataset must include an |
| | appropriate citation as specified below. |
| | |
| | THE DATASET IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR |
| | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, |
| | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE |
| | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER |
| | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, |
| | OUT OF OR IN CONNECTION WITH THE DATASET OR THE USE OR OTHER DEALINGS IN THE |
| | DATASET. |
| | ``` |
| | |
| | ### How to Cite |
| | |
| | If you use this model in your research, project, or application, please include the following citation: |
| | |
| | For informal usage (e.g., blog posts, documentation): |
| | ``` |
| | GolemPII-v1 model by Liran Baba (https://huggingface.co/CordwainerSmith/GolemPII-v1) |
| | ``` |