License: MIT · Framework: PyTorch · Base Model: DeBERTa-v3

DebertaBioClass 🧬

DebertaBioClass is a fine-tuned DeBERTa-v3 model designed for high-recall filtering of biological texts. It excels at identifying biological content in large, noisy datasets, prioritizing "finding everything" even at the cost of slightly more noise than its precision-focused sibling, RobertaBioClass.

Model Details

  • Model Architecture: DeBERTa-v3-base
  • Task: Binary Text Classification
  • Author: Madras1
  • Dataset: ~80k mixed samples (Synthetic + Real Biomedical Data)

βš”οΈ Model Comparison: DeBERTa vs. RoBERTa

I have released two models for this task. Choose the one that fits your pipeline needs:

Feature | DebertaBioClass (This Model) | RobertaBioClass
--- | --- | ---
Philosophy | "The Vacuum Cleaner" (high recall) | "The Balanced Specialist" (precision focus)
Best Use Case | Building raw datasets; when missing a bio text is unacceptable | Final classification; when you need cleaner data with less noise
Recall (Bio) | 86.2% 🏆 | 83.1%
Precision (Bio) | 72.5% | 74.4% 🏆
Architecture | DeBERTa (disentangled attention) | RoBERTa (optimized BERT)

Performance Metrics πŸ“Š

This model was trained with a Weighted Cross-Entropy Loss that heavily penalizes missed biological samples.

Metric | Score | Description
--- | --- | ---
Accuracy | 86.5% | Overall correctness
F1-Score | 78.7% | Harmonic mean of precision and recall
Recall (Bio) | 86.16% | Ability to find hidden bio texts (the model's primary objective)
Precision (Bio) | 72.51% | Confidence when predicting "Bio"

How to Use

from transformers import pipeline

# Load the pipeline
classifier = pipeline("text-classification", model="Madras1/DebertaBioClass")

# Test strings
examples = [
    "The mitochondria is the powerhouse of the cell.",
    "Manchester United won the match against Chelsea."
]

# Get predictions
predictions = classifier(examples)
print(predictions)
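
Since the model targets high-recall filtering, the typical usage is to sweep a mixed corpus and keep only the texts flagged as biological. Below is a minimal sketch continuing from the snippet above; the label string "Bio" is an assumption, so check model.config.id2label (or one raw prediction) for the actual label names:

# Keep only the texts the classifier flags as biological.
corpus = [
    "CRISPR-Cas9 enables targeted genome editing in eukaryotic cells.",
    "The central bank raised interest rates by 25 basis points.",
]

bio_texts = [
    text
    for text, pred in zip(corpus, classifier(corpus))
    if pred["label"] == "Bio"  # label name is assumed; verify on your model
]
print(bio_texts)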

Training Procedure

  • Class Weights: Heavily weighted towards the minority class (Biology) to maximize recall.
  • Infrastructure: Trained on NVIDIA T4 GPUs (Kaggle).
  • Hyperparameters: Learning rate 2e-5, batch size 16, 2 epochs.
  • Loss Function: Weighted Cross-Entropy (a minimal training sketch follows below).
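
For reference, here is a minimal sketch of how such a weighted loss can be plugged into a Hugging Face Trainer. The class-weight tensor [1.0, 3.0] and the assumption that index 1 is the Biology class are placeholders, not published training details; the hyperparameters mirror the list above.

import torch
from transformers import Trainer, TrainingArguments

class WeightedTrainer(Trainer):
    # Replace the default loss with a weighted cross-entropy so that
    # missed biology samples cost more than false alarms.
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        # Assumed placeholder weights: index 1 = Biology, up-weighted 3x.
        weights = torch.tensor([1.0, 3.0], device=logits.device)
        loss = torch.nn.functional.cross_entropy(logits, labels, weight=weights)
        return (loss, outputs) if return_outputs else loss

args = TrainingArguments(
    output_dir="deberta-bio-class",   # hypothetical output path
    learning_rate=2e-5,               # matches the hyperparameters above
    per_device_train_batch_size=16,
    num_train_epochs=2,
)
# trainer = WeightedTrainer(model=model, args=args, train_dataset=train_ds)
# trainer.train()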

Limitations

  • False Positives: Due to the high sensitivity (86% recall), this model may classify related scientific fields (like Chemistry or Medicine) as "Biology". This is intentional behavior, ensuring no relevant data is lost during filtering.
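
If the extra noise is a problem downstream, one option is to bypass the default argmax decision and require a stricter probability before accepting a positive label. A minimal sketch, again treating the label string "Bio" as an assumption:

# Trade some recall back for precision with a custom decision threshold.
THRESHOLD = 0.90  # stricter than the implicit 0.5 argmax cutoff; tune on held-out data

scores = classifier("Benzene is an aromatic hydrocarbon.", top_k=None)  # all class scores
bio_score = next(s["score"] for s in scores if s["label"] == "Bio")  # label name assumed
print(bio_score, bio_score >= THRESHOLD)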
