---
language:
- en
- pt
license: mit
library_name: transformers
tags:
- biology
- science
- text-classification
- nlp
- biomedical
- filter
- deberta
metrics:
- f1
- accuracy
- recall
datasets:
- Madras1/BioClass80k
base_model: microsoft/deberta-v3-base
widget:
- text: The mitochondria is the powerhouse of the cell and generates ATP.
  example_title: Biology Example 🧬
- text: The stock market crashed today due to high inflation rates.
  example_title: Finance Example 💰
- text: New studies regarding CRISPR technology show promise in gene editing.
  example_title: Genetics Example 🔬
pipeline_tag: text-classification
---

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Framework: PyTorch](https://img.shields.io/badge/Framework-PyTorch-orange.svg)](https://pytorch.org/)
[![Base Model: DeBERTa-v3](https://img.shields.io/badge/Base%20Model-DeBERTa%20v3-blue.svg)](https://huggingface.co/microsoft/deberta-v3-base)

# DebertaBioClass 🧬

**DebertaBioClass** is a fine-tuned DeBERTa-v3 model designed for **high-recall** filtering of biological texts. It excels at identifying biological content in large, noisy datasets, prioritizing "finding everything" even if that means capturing slightly more noise than other architectures.

## Model Details

- **Model Architecture:** DeBERTa-v3-base
- **Task:** Binary Text Classification
- **Author:** Madras1
- **Dataset:** ~80k mixed samples (synthetic + real biomedical data)

## ⚔️ Model Comparison: DeBERTa vs. RoBERTa

I have released two models for this task. Choose the one that fits your pipeline needs:

| Feature | **DebertaBioClass** (This Model) | [RobertaBioClass](https://huggingface.co/Madras1/RobertaBioClass) |
| :--- | :--- | :--- |
| **Philosophy** | **"The Vacuum Cleaner"** (high recall) | **"The Balanced Specialist"** (precision focus) |
| **Best Use Case** | Building raw datasets; when missing a bio text is unacceptable. | Final classification; when you need cleaner data with less noise. |
| **Recall (Bio)** | **86.2%** 🏆 | 83.1% |
| **Precision (Bio)** | 72.5% | **74.4%** 🏆 |
| **Architecture** | DeBERTa (disentangled attention) | RoBERTa (optimized BERT) |

## Performance Metrics 📊

This model was trained with **Weighted Cross-Entropy Loss** so that missed biological samples are penalized more heavily than false alarms.

| Metric | Score | Description |
| :--- | :--- | :--- |
| **Accuracy** | **86.5%** | Overall correctness |
| **F1-Score** | **78.7%** | Harmonic mean of precision and recall |
| **Recall (Bio)** | **86.16%** | The model's ability to find hidden bio texts |
| **Precision** | **72.51%** | Confidence when predicting "Bio" |

## How to Use

```python
from transformers import pipeline

# Load the classification pipeline
classifier = pipeline("text-classification", model="Madras1/DebertaBioClass")

# Test strings: one biological, one not
examples = [
    "The mitochondria is the powerhouse of the cell.",
    "Manchester United won the match against Chelsea.",
]

# Get predictions (one label/score dict per input)
predictions = classifier(examples)
print(predictions)
```

## Training Procedure

- **Class Weights:** Heavily weighted towards the minority class (Biology) to maximize recall.
- **Infrastructure:** Trained on NVIDIA T4 GPUs (Kaggle).
- **Hyperparameters:** Learning rate 2e-5, batch size 16, 2 epochs.
- **Loss Function:** Weighted Cross-Entropy.

## Limitations

- **False Positives:** Because of its high sensitivity (86% recall), this model may classify texts from related scientific fields (such as chemistry or medicine) as "Biology". This is intentional behavior to ensure no relevant data is lost during filtering.
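
## Reducing False Positives with a Custom Threshold ⚖️

If the roughly 27% false-positive rate at the default decision boundary is too noisy for your pipeline, you can keep this model's recall advantage and simply require a higher "Bio" probability before accepting a prediction. The sketch below is illustrative, not part of the released pipeline: the label name `LABEL_1` for the biology class and the `0.80` cut-off are assumptions, so check `classifier.model.config.id2label` and tune the threshold on a held-out sample of your own data.

```python
from transformers import pipeline

# Request scores for every class instead of only the top prediction.
classifier = pipeline(
    "text-classification",
    model="Madras1/DebertaBioClass",
    top_k=None,
)

# ASSUMPTIONS: the positive label name and the threshold are illustrative.
# Inspect classifier.model.config.id2label to confirm the real label name.
BIO_LABEL = "LABEL_1"
THRESHOLD = 0.80  # raise above 0.5 to trade some recall for precision

texts = [
    "The mitochondria is the powerhouse of the cell.",
    "Manchester United won the match against Chelsea.",
]

for text, scores in zip(texts, classifier(texts)):
    bio_score = next(s["score"] for s in scores if s["label"] == BIO_LABEL)
    keep = bio_score >= THRESHOLD
    print(f"keep={keep}  bio_score={bio_score:.3f}  {text}")
```

Lower thresholds recover the published high recall; higher thresholds push precision closer to RobertaBioClass at the cost of missed bio texts.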
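
## Weighted Cross-Entropy, Illustrated

For reference, the weighting recipe described under Training Procedure amounts to passing per-class weights to the cross-entropy loss. The values below are hypothetical (the exact weights used for this checkpoint are not published); the sketch only shows why up-weighting the "Bio" class pushes the model towards high recall.

```python
import torch
from torch import nn

# Hypothetical class weights: index 0 = non-bio, index 1 = bio.
# Up-weighting the bio class makes a missed biological sample cost more.
class_weights = torch.tensor([1.0, 2.0])
loss_fn = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.tensor([[1.2, -0.3],   # sample 1: model leans "non-bio"
                       [0.1,  0.9]])  # sample 2: model leans "bio"
labels = torch.tensor([1, 1])         # both samples are actually "bio"

# The missed bio sample (sample 1) dominates the weighted loss.
print(loss_fn(logits, labels))
```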