FairSteer BAD Classifier (Secure)
A Biased Activation Detection (BAD) classifier optimized for mistralai/Mistral-7B-Instruct-v0.3. It detects whether the LLM's internal activation at layer 25 indicates biased reasoning.
For security, the classifier weights in this repository are provided only in SafeTensors format.
Model Details
- Base Model: mistralai/Mistral-7B-Instruct-v0.3
- Target Layer: 25
- Architecture: Linear Probe (Dropout -> Linear)
- Performance: 75.19% Balanced Accuracy
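As a rough illustration of the Dropout -> Linear architecture listed above, a probe of this shape could be written in PyTorch as follows. This is a minimal sketch, not the repository's own code: `build_probe` is a hypothetical helper, `hidden_size=4096` is the hidden width of Mistral-7B, and the `num_classes` and `dropout` values are assumptions that should be checked against config.json.

```python
import torch
from torch import nn


def build_probe(hidden_size: int = 4096, num_classes: int = 2, dropout: float = 0.1) -> nn.Sequential:
    """Dropout -> Linear probe over a single hidden-state vector.

    hidden_size=4096 matches Mistral-7B; num_classes and dropout are
    assumed values, so check config.json before relying on them.
    """
    return nn.Sequential(
        nn.Dropout(p=dropout),
        nn.Linear(hidden_size, num_classes),
    )


probe = build_probe()
probe.eval()  # disables dropout, so inference is a plain linear map

with torch.no_grad():
    activations = torch.randn(3, 4096)  # stand-in for three layer-25 activations
    logits = probe(activations)

print(logits.shape)  # torch.Size([3, 2])
```

In evaluation mode the dropout layer is the identity, so the probe reduces to a single matrix multiply plus bias per activation.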
Artifacts
- model.safetensors: Weights (SafeTensors only)
- scaler.pkl: StandardScaler (Required for inference preprocessing)
- config.json: Architecture configuration
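To show why the scaler is needed at inference time, the sketch below standardizes an activation and then applies a linear layer, using plain NumPy arrays. It is an illustration under assumptions: `score_activation` is a hypothetical helper, the softmax over `k` classes is assumed (the actual head may differ, see config.json), and the arrays here are synthetic. In practice, `mean` and `scale` would correspond to the `mean_` and `scale_` attributes of the fitted StandardScaler, and `weight` and `bias` to the tensors stored in model.safetensors, whose key names are not documented here.

```python
import numpy as np


def score_activation(activation, mean, scale, weight, bias):
    """Standardize activation(s), then apply the linear probe.

    activation: (n, d) array of hidden states (a single (d,) vector also works)
    mean, scale: (d,) arrays, e.g. StandardScaler.mean_ and .scale_
    weight: (k, d) array and bias: (k,) array, the Linear layer parameters
    Returns an (n, k) array of softmax probabilities (assumed output head).
    """
    x = (np.atleast_2d(activation) - mean) / scale
    logits = x @ weight.T + bias
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)


# Synthetic example with a tiny hidden size; real activations have d = 4096.
rng = np.random.default_rng(0)
d, k = 4, 2
probs = score_activation(
    rng.normal(size=(3, d)),
    mean=np.zeros(d),
    scale=np.ones(d),
    weight=rng.normal(size=(k, d)),
    bias=np.zeros(k),
)
print(probs.shape)  # (3, 2)
```

Skipping the standardization step would feed the linear layer inputs on a different scale from the one it was trained on, which is why the scaler is listed as required.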
Usage (FairSteer)
This model is designed to be loaded via the FairSteer Inference pipeline.
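For readers who want to inspect the classifier's input outside the pipeline, the following sketch shows one way a layer-25 activation could be extracted with Hugging Face Transformers. It is not the FairSteer pipeline's code. Both helper functions are hypothetical, and two choices are assumptions: that "layer 25" corresponds to index 25 of the `hidden_states` tuple (where index 0 is the embedding output), and that the activation of the final token is the one being classified.

```python
import torch


def last_token_activation(hidden_states, layer: int = 25):
    """Pick the final-token vector from one layer's hidden states.

    hidden_states: sequence of (batch, seq_len, hidden) tensors, as returned
    by a Hugging Face model called with output_hidden_states=True
    (index 0 is the embedding output).
    """
    return hidden_states[layer][:, -1, :]


@torch.no_grad()
def extract_activation(model, tokenizer, prompt: str, layer: int = 25):
    """Run one forward pass and return the (1, hidden) activation for `prompt`."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model(**inputs, output_hidden_states=True)
    return last_token_activation(outputs.hidden_states, layer).float().cpu()


# Shape check on synthetic data: 33 entries = embeddings + 32 layers of Mistral-7B.
fake_states = tuple(torch.full((1, 5, 8), float(i)) for i in range(33))
vec = last_token_activation(fake_states)
print(vec.shape)  # torch.Size([1, 8])
```

With a loaded `AutoModelForCausalLM` and `AutoTokenizer` for mistralai/Mistral-7B-Instruct-v0.3, `extract_activation(model, tokenizer, prompt)` would return a vector of width 4096 that can then be standardized with the scaler and passed to the probe.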