# DeepSeek Model Distillation
A comprehensive setup for distilling the DeepSeekMini model using Mistral as the teacher, with realistic datasets and proper monitoring.
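At its core, the distillation objective trains the student to match the teacher's output distribution while still fitting the ground-truth tokens. The sketch below shows a standard form of that loss (temperature-scaled KL divergence blended with cross-entropy). It is illustrative only, not the exact code in `distill.py`, and the `temperature`/`alpha` values are placeholder defaults.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """Blend soft-target KL loss (teacher -> student) with hard-label cross-entropy.

    Illustrative only; the actual loss in distill.py may differ.
    """
    # Soften both distributions before comparing; the T^2 factor keeps gradient
    # magnitudes comparable across temperatures.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kl = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature**2

    # Standard next-token cross-entropy on the ground-truth labels.
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )
    return alpha * kl + (1 - alpha) * ce
```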
## Quick Start

### 1. Prepare Sample Dataset (Quick Test)

```bash
python train_realistic.py --sample_data --scale quick_test --output_dir ./models/deepseekmini-sample
```

### 2. Full Dataset Preparation and Training

```bash
python train_realistic.py --prepare_data --scale small_scale --output_dir ./models/deepseekmini-distilled
```

### 3. Evaluate the Distilled Model

```bash
python evaluate_model.py --model_path ./models/deepseekmini-distilled
```
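After training, the output directory is a regular Hugging Face checkpoint (see the file structure below), so a quick smoke test with `transformers` should work. The prompt template here is a guess and may need to match whatever format `train_realistic.py` used:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "./models/deepseekmini-distilled"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")

# Prompt format is an assumption; match the template used during training.
prompt = "### Instruction:\nExplain knowledge distillation in one sentence.\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```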
## Training Scales

| Scale | Steps | Batch Size | Time Est. | Description |
|---|---|---|---|---|
| `quick_test` | 100 | 2 | ~25min | Quick verification run |
| `small_scale` | 1,000 | 4 | ~7h | Initial quality improvement |
| `medium_scale` | 5,000 | 8 | ~62h | Good quality results |
| `full_scale` | 20,000 | 16 | ~500h | Best quality (production) |
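For reference, the scales map to preset training arguments roughly like the hypothetical dictionary below; the real presets live in `train_realistic.py` and may carry additional settings (learning rate, warmup, etc.):

```python
# Hypothetical view of the presets, reconstructed from the table above.
SCALE_PRESETS = {
    "quick_test":   {"max_steps": 100,    "batch_size": 2},
    "small_scale":  {"max_steps": 1_000,  "batch_size": 4},
    "medium_scale": {"max_steps": 5_000,  "batch_size": 8},
    "full_scale":   {"max_steps": 20_000, "batch_size": 16},
}
```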
## Datasets Used
The training automatically downloads and prepares high-quality datasets:
- Alpaca (52K): High-quality instruction following
- OpenOrca (4.2M): Diverse reasoning and QA
- Dolly-15k (15K): Human-generated instructions
- OASST1 (161K): Conversational assistant data
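`prepare_datasets.py` handles the downloads, but if you want to inspect the sources directly, they correspond to public Hugging Face datasets along these lines. The repo IDs below are the commonly used public versions and are an assumption about what the script actually pulls:

```python
from datasets import load_dataset

# Common public versions of the four sources; prepare_datasets.py may use different repos or splits.
alpaca = load_dataset("tatsu-lab/alpaca", split="train")                   # ~52K instructions
dolly = load_dataset("databricks/databricks-dolly-15k", split="train")     # 15K human-written
oasst1 = load_dataset("OpenAssistant/oasst1", split="train")               # conversational data
open_orca = load_dataset("Open-Orca/OpenOrca", split="train", streaming=True)  # ~4.2M rows, stream it

print(alpaca[0])
```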
## Advanced Usage

### Custom Training Configuration

```bash
python train_realistic.py \
    --scale medium_scale \
    --max_steps 3000 \
    --batch_size 6 \
    --learning_rate 1e-5 \
    --data_jsonl ./datasets/custom_data.jsonl \
    --output_dir ./models/custom-distilled
```

### Model Comparison

```bash
python evaluate_model.py \
    --model_path ./models/deepseekmini-distilled \
    --compare_with /home/user/DeepSeekMini
```

### Prepare Datasets Only

```bash
python prepare_datasets.py --output_dir ./datasets --max_samples 15000
```
## File Structure

```
distill/
├── distill.py              # Core distillation logic (fixed for DeepSeek)
├── train_realistic.py      # Enhanced training script
├── prepare_datasets.py     # Dataset preparation
├── evaluate_model.py       # Model evaluation
├── datasets/               # Downloaded datasets
│   ├── alpaca.jsonl
│   ├── open_orca.jsonl
│   ├── dolly.jsonl
│   ├── oasst1.jsonl
│   └── combined_training_data.jsonl
└── models/                 # Trained models
    └── deepseekmini-distilled/
        ├── config.json
        ├── pytorch_model.bin
        ├── tokenizer.json
        └── training_summary.json
```
## Key Fixes Applied

- Attention Implementation Fallbacks: Automatically tries `flash_attention_2` → `sdpa` → `eager` (sketched below)
- KV Cache Disabled: Prevents dimension mismatch errors
- Robust Error Handling: Graceful fallbacks for model loading
- Memory Optimization: Gradient checkpointing and mixed precision
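The first, second, and fourth fixes can be expressed with the standard `transformers` loading API, roughly as in the sketch below; the actual fallback logic is in `distill.py` and may differ in detail:

```python
import torch
from transformers import AutoModelForCausalLM

def load_with_fallbacks(model_path: str):
    """Try flash_attention_2, then sdpa, then eager; disable the KV cache for training."""
    last_err = None
    for attn_impl in ("flash_attention_2", "sdpa", "eager"):
        try:
            model = AutoModelForCausalLM.from_pretrained(
                model_path,
                attn_implementation=attn_impl,
                torch_dtype=torch.bfloat16,   # mixed-precision weights
            )
            break
        except (ImportError, ValueError) as err:
            last_err = err
    else:
        raise RuntimeError(f"No attention implementation worked: {last_err}")

    model.config.use_cache = False           # KV cache off avoids shape mismatches during training
    model.gradient_checkpointing_enable()    # trade extra compute for lower memory use
    return model
```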
## Tips for Best Results

- Start Small: Use `quick_test` or `small_scale` first to verify everything works
- Monitor GPU Memory: Reduce batch size if you get OOM errors
- Use Multiple GPUs: Teacher on GPU 0, student on GPU 1 for optimal performance (see the sketch after this list)
- Check Progress: Training summary is saved with detailed metrics
- Evaluate Regularly: Use the evaluation script to check quality improvements
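For the multi-GPU tip, the simplest placement is to pin each model to its own device, as in this sketch. The teacher checkpoint name is an assumed example; substitute whatever Mistral variant and DeepSeekMini path you actually use:

```python
import torch
from transformers import AutoModelForCausalLM

# Assumed checkpoints; replace with the teacher/student you are actually distilling.
teacher = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2", torch_dtype=torch.bfloat16
).to("cuda:0").eval()

student = AutoModelForCausalLM.from_pretrained(
    "/home/user/DeepSeekMini", torch_dtype=torch.bfloat16
).to("cuda:1")

# Per step: run the teacher without gradients on GPU 0, then move its logits to the student's GPU.
# with torch.no_grad():
#     teacher_logits = teacher(**teacher_batch).logits.to("cuda:1")
```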
## Troubleshooting

### Common Issues

Out of Memory (OOM):

```bash
# Reduce batch size and increase gradient accumulation
python train_realistic.py --scale small_scale --batch_size 2 --gradient_accumulation_steps 8
```

Slow Training:

```bash
# Enable gradient checkpointing to trade compute for memory
python train_realistic.py --gradient_checkpointing
```
Poor Quality Results:
- Try longer training (more steps)
- Use larger datasets
- Adjust learning rate (try 1e-5 to 5e-5)
## Expected Results
After training, you should see:
- Coherent responses to instructions
- Improved reasoning compared to base model
- Better instruction following capabilities
- Maintained efficiency of the smaller model
## Next Steps

- Scale Up: Move from `small_scale` to `medium_scale` for better quality
- Fine-tune: Add domain-specific datasets for specialized tasks
- Optimize: Experiment with different learning rates and schedules
- Deploy: Use the distilled model in your applications
## Monitoring

Each training run creates a `training_summary.json` with:
- Configuration details
- Dataset information
- Hardware specs
- Training time estimates
- Completion status
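A quick way to inspect a finished run; the key names below are illustrative guesses based on the list above, so check the actual file for the real schema:

```python
import json

with open("./models/deepseekmini-distilled/training_summary.json") as f:
    summary = json.load(f)

# Key names are assumptions; adjust to the actual schema in your training_summary.json.
for key in ("config", "dataset", "hardware", "training_time", "status"):
    print(f"{key}: {summary.get(key)}")
```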
Happy distilling!