---
language:
  - en
  - pt
license: mit
library_name: transformers
tags:
  - biology
  - science
  - text-classification
  - nlp
  - biomedical
  - filter
  - deberta
metrics:
  - f1
  - accuracy
  - recall
datasets:
  - Madras1/BioClass80k
base_model: microsoft/deberta-v3-base
widget:
  - text: The mitochondria is the powerhouse of the cell and generates ATP.
    example_title: Biology Example 🧬
  - text: The stock market crashed today due to high inflation rates.
    example_title: Finance Example 💰
  - text: New studies regarding CRISPR technology show promise in gene editing.
    example_title: Genetics Example 🔬
pipeline_tag: text-classification
---

# DebertaBioClass 🧬

**License:** MIT · **Framework:** PyTorch · **Base Model:** DeBERTa-v3

DebertaBioClass is a fine-tuned DeBERTa-v3 model designed for high-recall filtering of biological texts. It excels at identifying biological content in large, noisy datasets, prioritizing recall ("finding everything") even if that means capturing slightly more noise than a precision-oriented model would.

## Model Details

- **Model Architecture:** DeBERTa-v3-base
- **Task:** Binary text classification
- **Author:** Madras1
- **Dataset:** ~80k mixed samples (synthetic + real biomedical data)

βš”οΈ Model Comparison: DeBERTa vs. RoBERTa

I have released two models for this task. Choose the one that fits your pipeline needs:

| Feature | DebertaBioClass (this model) | RobertaBioClass |
| --- | --- | --- |
| Philosophy | "The Vacuum Cleaner" (high recall) | "The Balanced Specialist" (precision focus) |
| Best use case | Building raw datasets; when missing a bio text is unacceptable | Final classification; when you need cleaner data with less noise |
| Recall (Bio) | 86.2% 🏆 | 83.1% |
| Precision (Bio) | 72.5% | 74.4% 🏆 |
| Architecture | DeBERTa (disentangled attention) | RoBERTa (optimized BERT) |
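
In practice, the two models can be chained: this model as a broad first pass, the RoBERTa variant as a stricter second pass. A minimal sketch, assuming the sibling model is published as `Madras1/RobertaBioClass` and that `LABEL_1` maps to the biology class (check each model's `config.json` to confirm):

```python
from transformers import pipeline

# Stage 1: high-recall sweep with DeBERTa (this model).
recall_filter = pipeline("text-classification", model="Madras1/DebertaBioClass")
# Stage 2: stricter re-scoring with the RoBERTa variant (repo id assumed).
precision_filter = pipeline("text-classification", model="Madras1/RobertaBioClass")

corpus = [
    "Ribosomes translate mRNA into proteins.",
    "The Dow Jones fell 2% today.",
]

# Keep everything the high-recall model flags as bio ("LABEL_1" is assumed).
stage1 = [t for t, p in zip(corpus, recall_filter(corpus)) if p["label"] == "LABEL_1"]
# Re-score the survivors with the precision-oriented model.
stage2 = (
    [t for t, p in zip(stage1, precision_filter(stage1)) if p["label"] == "LABEL_1"]
    if stage1
    else []
)
print(stage2)
```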

## Performance Metrics 📊

This model was trained with a weighted cross-entropy loss that heavily penalizes missed biological samples.

| Metric | Score | Description |
| --- | --- | --- |
| Accuracy | 86.5% | Overall correctness |
| F1-Score | 78.7% | Harmonic mean of precision and recall |
| Recall (Bio) | 86.16% | Ability to find hidden bio texts |
| Precision (Bio) | 72.51% | Confidence when predicting "Bio" |
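
As a sanity check, the reported F1 is exactly the harmonic mean of the bio-class precision and recall above:

$$
F_1 = \frac{2 \cdot P \cdot R}{P + R} = \frac{2 \cdot 0.7251 \cdot 0.8616}{0.7251 + 0.8616} \approx 0.787
$$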

## How to Use

```python
from transformers import pipeline

# Load the pipeline
classifier = pipeline("text-classification", model="Madras1/DebertaBioClass")

# Test strings
examples = [
    "The mitochondria is the powerhouse of the cell.",
    "Manchester United won the match against Chelsea.",
]

# Get predictions
predictions = classifier(examples)
print(predictions)
```
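
For bulk filtering, the same pipeline can be mapped over a corpus to keep only texts flagged as biological. A minimal sketch; the `LABEL_1` name is the transformers default for the positive class and should be confirmed against this model's `config.json`:

```python
# Filter a corpus, keeping only texts the model flags as biological.
corpus = [
    "CRISPR-Cas9 enables precise genome editing.",
    "Interest rates were raised by 50 basis points.",
]

bio_texts = [
    text
    for text, pred in zip(corpus, classifier(corpus))
    if pred["label"] == "LABEL_1"  # assumed mapping: LABEL_1 == "Bio"
]
print(bio_texts)
```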

## Training Procedure

- **Class Weights:** Heavily weighted towards the minority class (Biology) to maximize recall.
- **Infrastructure:** Trained on NVIDIA T4 GPUs (Kaggle).
- **Hyperparameters:** Learning rate 2e-5, batch size 16, 2 epochs.
- **Loss Function:** Weighted cross-entropy.
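
The exact class weights are not published; the sketch below shows the standard way a weighted cross-entropy can be wired into `transformers`' `Trainer` by overriding `compute_loss`, with an illustrative 1:2 weighting in favour of the biology class:

```python
import torch
from transformers import Trainer

class WeightedLossTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        # Up-weight the minority "Bio" class so that missed biological
        # samples cost more than false positives. The 1:2 ratio here is
        # illustrative, not the value used to train this model.
        weight = torch.tensor([1.0, 2.0], device=logits.device)
        loss = torch.nn.functional.cross_entropy(logits, labels, weight=weight)
        return (loss, outputs) if return_outputs else loss
```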

## Limitations

- **False Positives:** Due to the high sensitivity (86% recall), this model may classify related scientific fields (like chemistry or medicine) as "Biology". This is intentional behavior to ensure no relevant data is lost during filtering.
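
If downstream noise is a problem, one mitigation (short of switching to RobertaBioClass) is to raise the decision threshold instead of accepting the default argmax. A hedged sketch, again assuming `LABEL_1` is the biology class:

```python
# Trade some recall for precision by requiring a more confident score.
THRESHOLD = 0.8  # illustrative value; tune on held-out data

pred = classifier("Sodium chloride dissolves readily in water.")[0]
is_bio = pred["label"] == "LABEL_1" and pred["score"] >= THRESHOLD
print(is_bio, pred)
```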