DebertaBioClass 🧬
DebertaBioClass is a fine-tuned DeBERTa-v3 model designed for high-recall filtering of biological texts. It excels at identifying biological content in large, noisy datasets, prioritizing "finding everything" even if it means capturing slightly more noise than other architectures.
Model Details
- Model Architecture: DeBERTa-v3-base
- Task: Binary Text Classification
- Author: Madras1
- Dataset: ~80k mixed samples (Synthetic + Real Biomedical Data)
⚖️ Model Comparison: DeBERTa vs. RoBERTa
I have released two models for this task. Choose the one that fits your pipeline needs:
| Feature | DebertaBioClass (This Model) | RobertaBioClass |
|---|---|---|
| Philosophy | "The Vacuum Cleaner" (High Recall) | "The Balanced Specialist" (Precision focus) |
| Best Use Case | Building raw datasets; when missing a bio-text is unacceptable. | Final classification; when you need cleaner data with less noise. |
| Recall (Bio) | 86.2% | 83.1% |
| Precision (Bio) | 72.5% | 74.4% |
| Architecture | DeBERTa (Disentangled Attention) | RoBERTa (Optimized BERT) |
Performance Metrics
This model was trained with Weighted Cross-Entropy Loss, so that missing a biological sample is penalized more heavily than mislabeling a non-biological one.
| Metric | Score | Description |
|---|---|---|
| Accuracy | 86.5% | Overall correctness |
| F1-Score | 78.7% | Harmonic mean of precision and recall |
| Recall (Bio) | 86.16% | Share of actual bio texts that the model finds |
| Precision (Bio) | 72.51% | Share of texts predicted "Bio" that are actually bio |
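As a quick sanity check, the F1-Score in the table can be recomputed from the reported precision and recall. The snippet below is only a worked calculation on the numbers above; the variable names are illustrative.

```python
# Reported values for the "Bio" class, taken from the table above
precision = 0.7251
recall = 0.8616

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(f"F1 = {f1:.1%}")  # prints "F1 = 78.7%"
```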
How to Use
from transformers import pipeline
# Load the pipeline
classifier = pipeline("text-classification", model="Madras1/DebertaBioClass")
# Test strings
examples = [
    "The mitochondria is the powerhouse of the cell.",
    "Manchester United won the match against Chelsea.",
]
# Get predictions
predictions = classifier(examples)
print(predictions)
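For dataset filtering, the predictions can be used to keep only the texts flagged as biological. The helper below is a minimal sketch, not part of the released model: `filter_bio` is a hypothetical name, and the label string `"LABEL_1"` is an assumption, since this card does not state the model's label names. It relies only on the pipeline returning one `{"label": ..., "score": ...}` dict per input text.

```python
def filter_bio(texts, classifier, bio_label="LABEL_1", min_score=0.0):
    """Return the texts that `classifier` labels as biological.

    `bio_label` is an ASSUMED label name; check the model's config
    for the real one. Raising `min_score` trades recall for precision.
    """
    predictions = classifier(texts)
    kept = []
    for text, pred in zip(texts, predictions):
        # Keep a text only if its top label is the bio label
        # and the score clears the optional threshold.
        if pred["label"] == bio_label and pred["score"] >= min_score:
            kept.append(text)
    return kept
```

With the pipeline loaded as above, this would be called as `filter_bio(examples, classifier)`. The actual label names can be inspected with `classifier.model.config.id2label` before choosing `bio_label`.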
Training Procedure
- Class Weights: Heavily weighted towards the minority class (Biology) to maximize Recall.
- Infrastructure: Trained on NVIDIA T4 GPUs (Kaggle).
- Hyperparameters: Learning Rate 2e-5, Batch Size 16, 2 Epochs.
- Loss Function: Weighted Cross-Entropy.
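To illustrate how class weighting shifts the loss towards the Biology class, here is a small pure-Python sketch of weighted cross-entropy. It is not the training code used for this model: the weights `[1.0, 3.0]` and the class indices (0 = non-bio, 1 = bio) are hypothetical, since the actual values are not stated here. The batch reduction divides by the sum of the sample weights, which is the same convention `torch.nn.CrossEntropyLoss` uses when given a `weight` tensor with the default mean reduction.

```python
import math

def per_sample_weighted_losses(logits, labels, class_weights):
    """Weighted negative log-likelihood of the true class, per sample."""
    losses = []
    for row, label in zip(logits, labels):
        # Numerically stable log-softmax for the true class
        m = max(row)
        log_norm = m + math.log(sum(math.exp(x - m) for x in row))
        log_prob = row[label] - log_norm
        losses.append(-class_weights[label] * log_prob)
    return losses

def weighted_cross_entropy(logits, labels, class_weights):
    """Batch loss: weighted losses divided by the sum of sample weights."""
    losses = per_sample_weighted_losses(logits, labels, class_weights)
    weight_sum = sum(class_weights[label] for label in labels)
    return sum(losses) / weight_sum

# HYPOTHETICAL weights: index 0 = non-bio, index 1 = bio
WEIGHTS = [1.0, 3.0]

# Two equally confident mistakes:
#   sample 0 is a missed bio text (true class 1, logits favour class 0)
#   sample 1 is a false alarm (true class 0, logits favour class 1)
logits = [[2.0, -2.0], [-2.0, 2.0]]
labels = [1, 0]

missed_bio, false_alarm = per_sample_weighted_losses(logits, labels, WEIGHTS)
print(missed_bio / false_alarm)  # the missed bio text costs 3x as much
```

Because a missed bio text contributes more to the loss than an equally confident false alarm, training pushes the decision boundary towards predicting "Bio", which raises recall at some cost in precision.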
Limitations
- False Positives: Because the model is tuned for high sensitivity (86% Recall), it may classify texts from related scientific fields (like Chemistry or Medicine) as "Biology". This is intentional: the model accepts extra noise in order to minimize the amount of relevant data lost during filtering.
Model tree for Madras1/DebertaBioClass
- Base model: microsoft/deberta-v3-base