# DeepSeek Model Distillation
A comprehensive setup for distilling the DeepSeekMini model using Mistral as the teacher, with realistic datasets and proper monitoring.
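At its core, the distillation objective trains the student to match the teacher's output distribution while still fitting the ground-truth tokens. The sketch below shows a standard form of that loss (temperature-scaled KL divergence blended with cross-entropy). It is illustrative only, not the exact code in `distill.py`, and the `temperature`/`alpha` values are placeholder defaults.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """Blend soft-target KL loss (teacher -> student) with hard-label cross-entropy.

    Illustrative only; the actual loss in distill.py may differ.
    """
    # Soften both distributions before comparing; the T^2 factor keeps gradient
    # magnitudes comparable across temperatures.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kl = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature**2

    # Standard next-token cross-entropy on the ground-truth labels.
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )
    return alpha * kl + (1 - alpha) * ce
```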
## Quick Start

### 1. Prepare Sample Dataset (Quick Test)

```bash
python train_realistic.py --sample_data --scale quick_test --output_dir ./models/deepseekmini-sample
```

### 2. Full Dataset Preparation and Training

```bash
python train_realistic.py --prepare_data --scale small_scale --output_dir ./models/deepseekmini-distilled
```

### 3. Evaluate the Distilled Model

```bash
python evaluate_model.py --model_path ./models/deepseekmini-distilled
```
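After training, the output directory is a regular Hugging Face checkpoint (see the file structure below), so a quick smoke test with `transformers` should work. The prompt template here is a guess and may need to match whatever format `train_realistic.py` used:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "./models/deepseekmini-distilled"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")

# Prompt format is an assumption; match the template used during training.
prompt = "### Instruction:\nExplain knowledge distillation in one sentence.\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```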
## Training Scales

| Scale | Steps | Batch Size | Time Est. | Description |
|---|---|---|---|---|
| `quick_test` | 100 | 2 | ~25min | Quick verification run |
| `small_scale` | 1,000 | 4 | ~7h | Initial quality improvement |
| `medium_scale` | 5,000 | 8 | ~62h | Good quality results |
| `full_scale` | 20,000 | 16 | ~500h | Best quality (production) |
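For reference, the scales map to preset training arguments roughly like the hypothetical dictionary below; the real presets live in `train_realistic.py` and may carry additional settings (learning rate, warmup, etc.):

```python
# Hypothetical view of the presets, reconstructed from the table above.
SCALE_PRESETS = {
    "quick_test":   {"max_steps": 100,    "batch_size": 2},
    "small_scale":  {"max_steps": 1_000,  "batch_size": 4},
    "medium_scale": {"max_steps": 5_000,  "batch_size": 8},
    "full_scale":   {"max_steps": 20_000, "batch_size": 16},
}
```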
## Datasets Used
The training automatically downloads and prepares high-quality datasets:
- Alpaca (52K): High-quality instruction following
- OpenOrca (4.2M): Diverse reasoning and QA
- Dolly-15k (15K): Human-generated instructions
- OASST1 (161K): Conversational assistant data
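`prepare_datasets.py` handles the downloads, but if you want to inspect the sources directly, they correspond to public Hugging Face datasets along these lines. The repo IDs below are the commonly used public versions and are an assumption about what the script actually pulls:

```python
from datasets import load_dataset

# Common public versions of the four sources; prepare_datasets.py may use different repos or splits.
alpaca = load_dataset("tatsu-lab/alpaca", split="train")                   # ~52K instructions
dolly = load_dataset("databricks/databricks-dolly-15k", split="train")     # 15K human-written
oasst1 = load_dataset("OpenAssistant/oasst1", split="train")               # conversational data
open_orca = load_dataset("Open-Orca/OpenOrca", split="train", streaming=True)  # ~4.2M rows, stream it

print(alpaca[0])
```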
## Advanced Usage

### Custom Training Configuration

```bash
python train_realistic.py \
    --scale medium_scale \
    --max_steps 3000 \
    --batch_size 6 \
    --learning_rate 1e-5 \
    --data_jsonl ./datasets/custom_data.jsonl \
    --output_dir ./models/custom-distilled
```

### Model Comparison

```bash
python evaluate_model.py \
    --model_path ./models/deepseekmini-distilled \
    --compare_with /home/user/DeepSeekMini
```

### Prepare Datasets Only

```bash
python prepare_datasets.py --output_dir ./datasets --max_samples 15000
```
## File Structure

```
distill/
├── distill.py              # Core distillation logic (fixed for DeepSeek)
├── train_realistic.py      # Enhanced training script
├── prepare_datasets.py     # Dataset preparation
├── evaluate_model.py       # Model evaluation
├── datasets/               # Downloaded datasets
│   ├── alpaca.jsonl
│   ├── open_orca.jsonl
│   ├── dolly.jsonl
│   ├── oasst1.jsonl
│   └── combined_training_data.jsonl
└── models/                 # Trained models
    └── deepseekmini-distilled/
        ├── config.json
        ├── pytorch_model.bin
        ├── tokenizer.json
        └── training_summary.json
```
## Key Fixes Applied

- Attention Implementation Fallbacks: Automatically tries `flash_attention_2` → `sdpa` → `eager` (sketched below)
- KV Cache Disabled: Prevents dimension mismatch errors
- Robust Error Handling: Graceful fallbacks for model loading
- Memory Optimization: Gradient checkpointing and mixed precision
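The first, second, and fourth fixes can be expressed with the standard `transformers` loading API, roughly as in the sketch below; the actual fallback logic is in `distill.py` and may differ in detail:

```python
import torch
from transformers import AutoModelForCausalLM

def load_with_fallbacks(model_path: str):
    """Try flash_attention_2, then sdpa, then eager; disable the KV cache for training."""
    last_err = None
    for attn_impl in ("flash_attention_2", "sdpa", "eager"):
        try:
            model = AutoModelForCausalLM.from_pretrained(
                model_path,
                attn_implementation=attn_impl,
                torch_dtype=torch.bfloat16,   # mixed-precision weights
            )
            break
        except (ImportError, ValueError) as err:
            last_err = err
    else:
        raise RuntimeError(f"No attention implementation worked: {last_err}")

    model.config.use_cache = False           # KV cache off avoids shape mismatches during training
    model.gradient_checkpointing_enable()    # trade extra compute for lower memory use
    return model
```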
## Tips for Best Results

- Start Small: Use `quick_test` or `small_scale` first to verify everything works
- Monitor GPU Memory: Reduce batch size if you get OOM errors
- Use Multiple GPUs: Teacher on GPU 0, student on GPU 1 for optimal performance (see the sketch after this list)
- Check Progress: Training summary is saved with detailed metrics
- Evaluate Regularly: Use the evaluation script to check quality improvements
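For the multi-GPU tip, the simplest placement is to pin each model to its own device, as in this sketch. The teacher checkpoint name is an assumed example; substitute whatever Mistral variant and DeepSeekMini path you actually use:

```python
import torch
from transformers import AutoModelForCausalLM

# Assumed checkpoints; replace with the teacher/student you are actually distilling.
teacher = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2", torch_dtype=torch.bfloat16
).to("cuda:0").eval()

student = AutoModelForCausalLM.from_pretrained(
    "/home/user/DeepSeekMini", torch_dtype=torch.bfloat16
).to("cuda:1")

# Per step: run the teacher without gradients on GPU 0, then move its logits to the student's GPU.
# with torch.no_grad():
#     teacher_logits = teacher(**teacher_batch).logits.to("cuda:1")
```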
## Troubleshooting

### Common Issues

Out of Memory (OOM):

```bash
# Reduce batch size and increase gradient accumulation
python train_realistic.py --scale small_scale --batch_size 2 --gradient_accumulation_steps 8
```

Slow Training:

```bash
# Enable gradient checkpointing to trade compute for memory
python train_realistic.py --gradient_checkpointing
```
Poor Quality Results:
- Try longer training (more steps)
- Use larger datasets
- Adjust learning rate (try 1e-5 to 5e-5)
## Expected Results
After training, you should see:
- Coherent responses to instructions
- Improved reasoning compared to base model
- Better instruction following capabilities
- Maintained efficiency of the smaller model
## Next Steps

- Scale Up: Move from `small_scale` to `medium_scale` for better quality
- Fine-tune: Add domain-specific datasets for specialized tasks
- Optimize: Experiment with different learning rates and schedules
- Deploy: Use the distilled model in your applications
## Monitoring

Each training run creates a `training_summary.json` with:
- Configuration details
- Dataset information
- Hardware specs
- Training time estimates
- Completion status
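A quick way to inspect a finished run; the key names below are illustrative guesses based on the list above, so check the actual file for the real schema:

```python
import json

with open("./models/deepseekmini-distilled/training_summary.json") as f:
    summary = json.load(f)

# Key names are assumptions; adjust to the actual schema in your training_summary.json.
for key in ("config", "dataset", "hardware", "training_time", "status"):
    print(f"{key}: {summary.get(key)}")
```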
Happy distilling!