
DeepSeek Model Distillation

A comprehensive setup for distilling knowledge from a Mistral teacher into the DeepSeekMini student model, with realistic datasets and built-in monitoring.

🚀 Quick Start

1. Prepare Sample Dataset (Quick Test)

python train_realistic.py --sample_data --scale quick_test --output_dir ./models/deepseekmini-sample

2. Full Dataset Preparation and Training

python train_realistic.py --prepare_data --scale small_scale --output_dir ./models/deepseekmini-distilled

3. Evaluate the Distilled Model

python evaluate_model.py --model_path ./models/deepseekmini-distilled
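
As an optional sanity check after step 3, the distilled checkpoint can be loaded like any Hugging Face causal LM and prompted directly. The prompt and generation settings below are illustrative; evaluate_model.py remains the proper evaluation path.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "./models/deepseekmini-distilled"
tokenizer = AutoTokenizer.from_pretrained(path)
# Add trust_remote_code=True below if the checkpoint ships custom modeling code.
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16, device_map="auto")

prompt = "Explain knowledge distillation in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))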

📋 Training Scales

Scale         Steps    Batch Size   Time Est.   Description
quick_test    100      2            ~25min      Quick verification run
small_scale   1,000    4            ~7h         Initial quality improvement
medium_scale  5,000    8            ~62h        Good quality results
full_scale    20,000   16           ~500h       Best quality (production)

📊 Datasets Used

The training automatically downloads and prepares high-quality datasets:

  • Alpaca (52K): High-quality instruction following
  • OpenOrca (4.2M): Diverse reasoning and QA
  • Dolly-15k (15K): Human-generated instructions
  • OASST1 (161K): Conversational assistant data
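
These sources can be pulled straight from the Hugging Face Hub. Below is a minimal sketch of how one of them might be flattened into JSONL; the Hub IDs and the instruction/input/output field names are assumptions based on the public releases, and prepare_datasets.py is the authoritative implementation.

import json
from datasets import load_dataset

# Assumed Hub ID for the 52K Alpaca set; the other sources (Open-Orca/OpenOrca,
# databricks/databricks-dolly-15k, OpenAssistant/oasst1) use different column names.
alpaca = load_dataset("tatsu-lab/alpaca", split="train")

with open("./datasets/alpaca.jsonl", "w") as f:
    for row in alpaca:
        f.write(json.dumps({
            "instruction": row["instruction"],
            "input": row.get("input", ""),
            "output": row["output"],
        }) + "\n")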

🛠️ Advanced Usage

Custom Training Configuration

python train_realistic.py \
    --scale medium_scale \
    --max_steps 3000 \
    --batch_size 6 \
    --learning_rate 1e-5 \
    --data_jsonl ./datasets/custom_data.jsonl \
    --output_dir ./models/custom-distilled
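
When supplying your own --data_jsonl file, each line is assumed to be a standalone JSON object in the same instruction/input/output layout as the prepared datasets (check prepare_datasets.py for the exact fields the trainer expects), for example:

{"instruction": "Summarize the following paragraph.", "input": "Knowledge distillation transfers behavior from a large model to a small one.", "output": "A smaller student model is trained to mimic a larger teacher."}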

Model Comparison

python evaluate_model.py \
    --model_path ./models/deepseekmini-distilled \
    --compare_with /home/user/DeepSeekMini

Prepare Datasets Only

python prepare_datasets.py --output_dir ./datasets --max_samples 15000

πŸ“ File Structure

distill/
├── distill.py              # Core distillation logic (fixed for DeepSeek)
├── train_realistic.py      # Enhanced training script
├── prepare_datasets.py     # Dataset preparation
├── evaluate_model.py       # Model evaluation
├── datasets/               # Downloaded datasets
│   ├── alpaca.jsonl
│   ├── open_orca.jsonl
│   ├── dolly.jsonl
│   ├── oasst1.jsonl
│   └── combined_training_data.jsonl
└── models/                 # Trained models
    └── deepseekmini-distilled/
        ├── config.json
        ├── pytorch_model.bin
        ├── tokenizer.json
        └── training_summary.json
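
distill.py contains the training objective itself. For orientation, here is a minimal sketch of the standard temperature-scaled distillation loss that setups like this typically use; the temperature, weighting, and reduction are assumptions rather than a transcription of distill.py.

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    # Soft targets: KL divergence between temperature-softened distributions,
    # scaled by T^2 so gradient magnitudes stay comparable across temperatures.
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # Hard targets: ordinary cross-entropy against the ground-truth tokens.
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )
    return alpha * kd + (1 - alpha) * ce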

🔧 Key Fixes Applied

  • Attention Implementation Fallbacks: Automatically tries flash_attention_2 → sdpa → eager (see the sketch after this list)
  • KV Cache Disabled: Prevents dimension mismatch errors
  • Robust Error Handling: Graceful fallbacks for model loading
  • Memory Optimization: Gradient checkpointing and mixed precision
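
A rough sketch of how the fallback chain in the first bullet can look when the student is loaded through Hugging Face transformers; the function name and error handling here are illustrative, and distill.py's actual logic may differ.

import torch
from transformers import AutoModelForCausalLM

def load_student_with_fallback(model_path):
    # Try attention backends from fastest to most compatible.
    for attn in ("flash_attention_2", "sdpa", "eager"):
        try:
            model = AutoModelForCausalLM.from_pretrained(
                model_path,
                attn_implementation=attn,
                torch_dtype=torch.bfloat16,   # mixed precision, per the memory optimizations
            )
            model.config.use_cache = False    # KV cache off during training (see second bullet)
            model.gradient_checkpointing_enable()
            print(f"Loaded {model_path} with {attn}")
            return model
        except (ImportError, ValueError) as err:
            print(f"{attn} unavailable ({err}); falling back")
    raise RuntimeError("No usable attention implementation found")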

💡 Tips for Best Results

  1. Start Small: Use quick_test or small_scale first to verify everything works
  2. Monitor GPU Memory: Reduce batch size if you get OOM errors
  3. Use Multiple GPUs: Teacher on GPU 0, Student on GPU 1 for optimal performance (see the placement sketch after this list)
  4. Check Progress: Training summary is saved with detailed metrics
  5. Evaluate Regularly: Use the evaluation script to check quality improvements
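
For tip 3, a rough sketch of the two-GPU placement, with the teacher frozen on cuda:0 and the student trained on cuda:1. The Mistral checkpoint name is a placeholder; the student path matches the base model used for comparison above.

import torch
from transformers import AutoModelForCausalLM

# Placeholder teacher checkpoint; substitute whichever Mistral model you distill from.
teacher = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", torch_dtype=torch.bfloat16
).to("cuda:0")
teacher.eval()
for p in teacher.parameters():
    p.requires_grad_(False)  # the teacher only provides logits; it is never updated

student = AutoModelForCausalLM.from_pretrained(
    "/home/user/DeepSeekMini", torch_dtype=torch.bfloat16
).to("cuda:1")
# During training, run a no-grad teacher forward pass on cuda:0, then move the
# teacher logits to cuda:1 before computing the distillation loss for the student.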

🚨 Troubleshooting

Common Issues

Out of Memory (OOM):

# Reduce batch size and increase gradient accumulation
python train_realistic.py --scale small_scale --batch_size 2 --gradient_accumulation_steps 8
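
With --batch_size 2 and --gradient_accumulation_steps 8, gradients are accumulated over 8 micro-batches before each optimizer step, so the effective batch size is still 2 × 8 = 16 while peak memory is set by the micro-batch of 2.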

Slow Training:

# Gradient checkpointing trades extra compute for memory; use it to fit a larger batch size when memory is the bottleneck
python train_realistic.py --gradient_checkpointing

Poor Quality Results:

  • Try longer training (more steps)
  • Use larger datasets
  • Adjust learning rate (try 1e-5 to 5e-5)

📈 Expected Results

After training, you should see:

  • Coherent responses to instructions
  • Improved reasoning compared to base model
  • Better instruction following capabilities
  • Maintained efficiency of the smaller model

🎯 Next Steps

  1. Scale Up: Move from small_scale to medium_scale for better quality
  2. Fine-tune: Add domain-specific datasets for specialized tasks
  3. Optimize: Experiment with different learning rates and schedules
  4. Deploy: Use the distilled model in your applications

📊 Monitoring

Each training run creates a training_summary.json with:

  • Configuration details
  • Dataset information
  • Hardware specs
  • Training time estimates
  • Completion status
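
A quick way to inspect a finished run (the exact keys depend on what train_realistic.py writes):

import json

with open("./models/deepseekmini-distilled/training_summary.json") as f:
    summary = json.load(f)

for key, value in summary.items():
    print(f"{key}: {value}")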

Happy distilling! 🎉
