License: MIT · Framework: PyTorch · Base Model: DeBERTa-v3

DebertaBioClass 🧬

DebertaBioClass is a fine-tuned DeBERTa-v3 model designed for high-recall filtering of biological texts. It excels at identifying biological content in large, noisy datasets, prioritizing "finding everything" even at the cost of slightly more noise than its precision-focused sibling, RobertaBioClass.

Model Details

  • Model Architecture: DeBERTa-v3-base
  • Task: Binary Text Classification
  • Author: Madras1
  • Dataset: ~80k mixed samples (Synthetic + Real Biomedical Data)

βš”οΈ Model Comparison: DeBERTa vs. RoBERTa

I have released two models for this task. Choose the one that fits your pipeline needs:

Feature | DebertaBioClass (This Model) | RobertaBioClass
--- | --- | ---
Philosophy | "The Vacuum Cleaner" (high recall) | "The Balanced Specialist" (precision focus)
Best Use Case | Building raw datasets; when missing a bio text is unacceptable | Final classification; when you need cleaner data with less noise
Recall (Bio) | 86.2% 🏆 | 83.1%
Precision (Bio) | 72.5% | 74.4% 🏆
Architecture | DeBERTa (disentangled attention) | RoBERTa (optimized BERT)

Performance Metrics πŸ“Š

This model was trained with a Weighted Cross-Entropy Loss that heavily penalizes missed biological samples.

Metric | Score | Description
--- | --- | ---
Accuracy | 86.5% | Overall correctness
F1-Score | 78.7% | Harmonic mean of precision and recall
Recall (Bio) | 86.16% | Ability to find hidden bio texts (the model's primary objective)
Precision (Bio) | 72.51% | Confidence when predicting "Bio"

How to Use

from transformers import pipeline

# Load the pipeline
classifier = pipeline("text-classification", model="Madras1/DebertaBioClass")

# Test strings
examples = [
    "The mitochondria is the powerhouse of the cell.",
    "Manchester United won the match against Chelsea."
]

# Get predictions
predictions = classifier(examples)
print(predictions)
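
Since the model targets high-recall filtering, the typical usage is to sweep a mixed corpus and keep only the texts flagged as biological. Below is a minimal sketch continuing from the snippet above; the label string "Bio" is an assumption, so check model.config.id2label (or one raw prediction) for the actual label names:

# Keep only the texts the classifier flags as biological.
corpus = [
    "CRISPR-Cas9 enables targeted genome editing in eukaryotic cells.",
    "The central bank raised interest rates by 25 basis points.",
]

bio_texts = [
    text
    for text, pred in zip(corpus, classifier(corpus))
    if pred["label"] == "Bio"  # label name is assumed; verify on your model
]
print(bio_texts)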

Training Procedure

  • Class Weights: Heavily weighted towards the minority class (Biology) to maximize recall.
  • Infrastructure: Trained on NVIDIA T4 GPUs (Kaggle).
  • Hyperparameters: Learning rate 2e-5, batch size 16, 2 epochs.
  • Loss Function: Weighted Cross-Entropy (a minimal training sketch follows below).
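
For reference, here is a minimal sketch of how such a weighted loss can be plugged into a Hugging Face Trainer. The class-weight tensor [1.0, 3.0] and the assumption that index 1 is the Biology class are placeholders, not published training details; the hyperparameters mirror the list above.

import torch
from transformers import Trainer, TrainingArguments

class WeightedTrainer(Trainer):
    # Replace the default loss with a weighted cross-entropy so that
    # missed biology samples cost more than false alarms.
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        # Assumed placeholder weights: index 1 = Biology, up-weighted 3x.
        weights = torch.tensor([1.0, 3.0], device=logits.device)
        loss = torch.nn.functional.cross_entropy(logits, labels, weight=weights)
        return (loss, outputs) if return_outputs else loss

args = TrainingArguments(
    output_dir="deberta-bio-class",   # hypothetical output path
    learning_rate=2e-5,               # matches the hyperparameters above
    per_device_train_batch_size=16,
    num_train_epochs=2,
)
# trainer = WeightedTrainer(model=model, args=args, train_dataset=train_ds)
# trainer.train()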

Limitations

  • False Positives: Due to the high sensitivity (86% recall), this model may classify related scientific fields (like Chemistry or Medicine) as "Biology". This is intentional behavior, ensuring no relevant data is lost during filtering.
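
If the extra noise is a problem downstream, one option is to bypass the default argmax decision and require a stricter probability before accepting a positive label. A minimal sketch, again treating the label string "Bio" as an assumption:

# Trade some recall back for precision with a custom decision threshold.
THRESHOLD = 0.90  # stricter than the implicit 0.5 argmax cutoff; tune on held-out data

scores = classifier("Benzene is an aromatic hydrocarbon.", top_k=None)  # all class scores
bio_score = next(s["score"] for s in scores if s["label"] == "Bio")  # label name assumed
print(bio_score, bio_score >= THRESHOLD)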
