FairSteer BAD Classifier (Secure)

Biased Activation Detection (BAD) classifier optimized for mistralai/Mistral-7B-Instruct-v0.3. Given the LLM's internal activation at layer 25, it predicts whether that activation indicates biased reasoning.

For security, the classifier weights in this repository are distributed in SafeTensors format only.

Model Details

  • Base Model: mistralai/Mistral-7B-Instruct-v0.3
  • Target Layer: 25
  • Architecture: Linear Probe (Dropout -> Linear)
  • Performance: 75.19% Balanced Accuracy
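
To illustrate what a Dropout -> Linear probe computes at inference time, here is a minimal NumPy sketch. It rests on assumptions that are not stated in this card: a single output logit passed through a sigmoid, and randomly generated stand-in weights. The input size of 4096 is the hidden size of Mistral-7B; the real output shape and tensor names are defined by config.json and model.safetensors, and probe_forward is a hypothetical helper, not part of FairSteer.

```python
import numpy as np

HIDDEN_SIZE = 4096  # hidden size of Mistral-7B, i.e. the probe's input dimension


def probe_forward(activation, weight, bias):
    """Inference-time forward pass of a Dropout -> Linear probe.

    Dropout is the identity at inference, so only the linear layer remains.
    Assumes a single output logit (hypothetical; check config.json).
    """
    logit = activation @ weight + bias
    return 1.0 / (1.0 + np.exp(-logit))  # sigmoid maps the logit into (0, 1)


# Random stand-ins for the real weights and for a layer-25 activation.
rng = np.random.default_rng(0)
weight = rng.normal(scale=0.01, size=HIDDEN_SIZE)
bias = 0.0
activation = rng.normal(size=HIDDEN_SIZE)

score = probe_forward(activation, weight, bias)
```

Which side of the score corresponds to "biased" depends on the label convention used in training, which this card does not specify.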

Artifacts

  • model.safetensors: Weights (SafeTensors only)
  • scaler.pkl: StandardScaler (Required for inference preprocessing)
  • config.json: Architecture configuration
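
The scaler step can be sketched as follows. A scikit-learn StandardScaler with default settings transforms each feature as (x - mean) / scale, where both statistics are learned at fit time. The example below reproduces that arithmetic in NumPy on toy data; the 8-feature array and the standardize helper are illustrative assumptions, and in practice the statistics would come from the fitted scaler stored in scaler.pkl rather than being recomputed.

```python
import numpy as np


def standardize(activations, mean, scale):
    """Per-feature standardization, the same arithmetic StandardScaler applies."""
    return (activations - mean) / scale


# Toy stand-in for a batch of activations (8 features instead of 4096).
rng = np.random.default_rng(0)
train = rng.normal(loc=3.0, scale=2.0, size=(256, 8))

# Statistics a StandardScaler would learn at fit time.
mean = train.mean(axis=0)
scale = train.std(axis=0)  # population standard deviation (ddof=0)

scaled = standardize(train, mean, scale)
```

Activations must be standardized with the same statistics used during training before they are passed to the classifier; skipping this step changes the inputs the probe sees.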

Usage (FairSteer)

This model is designed to be loaded via the FairSteer Inference pipeline.
