FairSteer BAD Classifier (Secure)
A Biased Activation Detection (BAD) classifier optimized for TinyLlama-1.1B. It detects whether an LLM's internal activation at a target layer indicates biased reasoning.
This repository contains only SafeTensors weights for security.
Model Details
- Base Model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
- Target Layer: 14
- Architecture: Linear Probe (Dropout -> Linear); a sketch follows this list
- Performance: 67.90% Balanced Accuracy
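For reference, here is a minimal PyTorch sketch of a Dropout -> Linear probe matching the architecture above. The 2048-dimensional input corresponds to TinyLlama-1.1B's hidden size; the class name, dropout rate, and two-class output head are illustrative assumptions rather than values taken from this repository.

```python
import torch
import torch.nn as nn

class BADProbe(nn.Module):
    """Linear probe (Dropout -> Linear) over a single hidden-state vector."""

    def __init__(self, hidden_size: int = 2048, num_classes: int = 2, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)                 # dropout rate is an assumption
        self.linear = nn.Linear(hidden_size, num_classes)  # binary output head is an assumption

    def forward(self, activation: torch.Tensor) -> torch.Tensor:
        # activation: (batch, hidden_size) hidden state taken from the target layer
        return self.linear(self.dropout(activation))
```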
Artifacts
- model.safetensors: Weights (SafeTensors only)
- scaler.pkl: StandardScaler (required for inference preprocessing)
- config.json: Architecture configuration
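A loading sketch for the three artifacts, assuming config.json exposes a hidden_size field and that the state-dict keys match the BADProbe sketch above:

```python
import json
import pickle

from safetensors.torch import load_file

with open("config.json") as f:
    config = json.load(f)                          # architecture configuration
with open("scaler.pkl", "rb") as f:
    scaler = pickle.load(f)                        # StandardScaler used to normalize activations

probe = BADProbe(hidden_size=config.get("hidden_size", 2048))  # class from the sketch above
probe.load_state_dict(load_file("model.safetensors"))          # SafeTensors weights only
probe.eval()
```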
Usage (FairSteer)
This model is designed to be loaded via the FairSteer inference pipeline; a standalone sketch of the equivalent steps is shown below.
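Since the FairSteer pipeline itself is not bundled in this repository, the following minimal sketch continues from the loading snippet above (reusing `probe` and `scaler`): it extracts the layer-14 hidden state from TinyLlama, standardizes it, and classifies it. The last-token pooling and the meaning of the positive class are assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE, output_hidden_states=True)
model.eval()

prompt = "Example prompt whose reasoning you want to screen for bias."
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**inputs).hidden_states   # tuple: embeddings + one entry per layer

# Target layer 14, last-token activation (pooling choice is an assumption)
activation = hidden_states[14][:, -1, :]                     # shape (1, 2048)
scaled = scaler.transform(activation.float().cpu().numpy())  # StandardScaler from scaler.pkl
logits = probe(torch.from_numpy(scaled).float())
print("biased" if logits.argmax(dim=-1).item() == 1 else "unbiased")  # label semantics assumed
```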