---
language:
  - en
  - pt
license: mit
library_name: transformers
tags:
  - biology
  - science
  - text-classification
  - nlp
  - biomedical
  - filter
  - deberta
metrics:
  - f1
  - accuracy
  - recall
datasets:
  - Madras1/BioClass80k
base_model: microsoft/deberta-v3-base
widget:
  - text: The mitochondria is the powerhouse of the cell and generates ATP.
    example_title: Biology Example 🧬
  - text: The stock market crashed today due to high inflation rates.
    example_title: Finance Example 💰
  - text: New studies regarding CRISPR technology show promise in gene editing.
    example_title: Genetics Example 🔬
pipeline_tag: text-classification
---

# DebertaBioClass 🧬

**License:** MIT · **Framework:** PyTorch · **Base Model:** DeBERTa-v3

DebertaBioClass is a fine-tuned DeBERTa-v3 model designed for high-recall filtering of biological texts. It excels at identifying biological content in large, noisy datasets, prioritizing recall ("finding everything") even if that means capturing slightly more noise than a precision-oriented model would.

## Model Details

- **Model Architecture:** DeBERTa-v3-base
- **Task:** Binary text classification
- **Author:** Madras1
- **Dataset:** ~80k mixed samples (synthetic + real biomedical data)

βš”οΈ Model Comparison: DeBERTa vs. RoBERTa

I have released two models for this task. Choose the one that fits your pipeline needs:

| Feature | DebertaBioClass (this model) | RobertaBioClass |
| --- | --- | --- |
| Philosophy | "The Vacuum Cleaner" (high recall) | "The Balanced Specialist" (precision focus) |
| Best use case | Building raw datasets; when missing a bio text is unacceptable | Final classification; when you need cleaner data with less noise |
| Recall (Bio) | 86.2% 🏆 | 83.1% |
| Precision (Bio) | 72.5% | 74.4% 🏆 |
| Architecture | DeBERTa (disentangled attention) | RoBERTa (optimized BERT) |
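
In practice, the two models can be chained: this model as a broad first pass, the RoBERTa variant as a stricter second pass. A minimal sketch, assuming the sibling model is published as `Madras1/RobertaBioClass` and that `LABEL_1` maps to the biology class (check each model's `config.json` to confirm):

```python
from transformers import pipeline

# Stage 1: high-recall sweep with DeBERTa (this model).
recall_filter = pipeline("text-classification", model="Madras1/DebertaBioClass")
# Stage 2: stricter re-scoring with the RoBERTa variant (repo id assumed).
precision_filter = pipeline("text-classification", model="Madras1/RobertaBioClass")

corpus = [
    "Ribosomes translate mRNA into proteins.",
    "The Dow Jones fell 2% today.",
]

# Keep everything the high-recall model flags as bio ("LABEL_1" is assumed).
stage1 = [t for t, p in zip(corpus, recall_filter(corpus)) if p["label"] == "LABEL_1"]
# Re-score the survivors with the precision-oriented model.
stage2 = (
    [t for t, p in zip(stage1, precision_filter(stage1)) if p["label"] == "LABEL_1"]
    if stage1
    else []
)
print(stage2)
```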

## Performance Metrics 📊

This model was trained with a weighted cross-entropy loss that heavily penalizes missed biological samples.

| Metric | Score | Description |
| --- | --- | --- |
| Accuracy | 86.5% | Overall correctness |
| F1-Score | 78.7% | Harmonic mean of precision and recall |
| Recall (Bio) | 86.16% | Ability to find hidden bio texts |
| Precision (Bio) | 72.51% | Confidence when predicting "Bio" |
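
As a sanity check, the reported F1 is exactly the harmonic mean of the bio-class precision and recall above:

$$
F_1 = \frac{2 \cdot P \cdot R}{P + R} = \frac{2 \cdot 0.7251 \cdot 0.8616}{0.7251 + 0.8616} \approx 0.787
$$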

## How to Use

```python
from transformers import pipeline

# Load the pipeline
classifier = pipeline("text-classification", model="Madras1/DebertaBioClass")

# Test strings
examples = [
    "The mitochondria is the powerhouse of the cell.",
    "Manchester United won the match against Chelsea.",
]

# Get predictions
predictions = classifier(examples)
print(predictions)
```
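
For bulk filtering, the same pipeline can be mapped over a corpus to keep only texts flagged as biological. A minimal sketch; the `LABEL_1` name is the transformers default for the positive class and should be confirmed against this model's `config.json`:

```python
# Filter a corpus, keeping only texts the model flags as biological.
corpus = [
    "CRISPR-Cas9 enables precise genome editing.",
    "Interest rates were raised by 50 basis points.",
]

bio_texts = [
    text
    for text, pred in zip(corpus, classifier(corpus))
    if pred["label"] == "LABEL_1"  # assumed mapping: LABEL_1 == "Bio"
]
print(bio_texts)
```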

## Training Procedure

- **Class Weights:** Heavily weighted towards the minority class (Biology) to maximize recall.
- **Infrastructure:** Trained on NVIDIA T4 GPUs (Kaggle).
- **Hyperparameters:** Learning rate 2e-5, batch size 16, 2 epochs.
- **Loss Function:** Weighted cross-entropy.
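
The exact class weights are not published; the sketch below shows the standard way a weighted cross-entropy can be wired into `transformers`' `Trainer` by overriding `compute_loss`, with an illustrative 1:2 weighting in favour of the biology class:

```python
import torch
from transformers import Trainer

class WeightedLossTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        # Up-weight the minority "Bio" class so that missed biological
        # samples cost more than false positives. The 1:2 ratio here is
        # illustrative, not the value used to train this model.
        weight = torch.tensor([1.0, 2.0], device=logits.device)
        loss = torch.nn.functional.cross_entropy(logits, labels, weight=weight)
        return (loss, outputs) if return_outputs else loss
```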

## Limitations

- **False Positives:** Due to the high sensitivity (86% recall), this model may classify related scientific fields (like chemistry or medicine) as "Biology". This is intentional behavior to ensure no relevant data is lost during filtering.
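
If downstream noise is a problem, one mitigation (short of switching to RobertaBioClass) is to raise the decision threshold instead of accepting the default argmax. A hedged sketch, again assuming `LABEL_1` is the biology class:

```python
# Trade some recall for precision by requiring a more confident score.
THRESHOLD = 0.8  # illustrative value; tune on held-out data

pred = classifier("Sodium chloride dissolves readily in water.")[0]
is_bio = pred["label"] == "LABEL_1" and pred["score"] >= THRESHOLD
print(is_bio, pred)
```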