---
license: bigscience-openrail-m
datasets:
- togethercomputer/RedPajama-Data-V2
- HuggingFaceFW/fineweb-edu
- LLM360/TxT360
- bigcode/the-stack-v2-train-smol-ids
language:
- fr
- en
pipeline_tag: text-generation
library_name: transformers
tags:
- gaperon
---

# Gaperon-1125-1B

[📄 Paper Link](https://arxiv.org/abs/2510.25771) | [🤖 Gapetron](https://github.com/NathanGodey/gapetron)

**Gaperon-1125-1B** is a 1.5 billion parameter bilingual (French-English) language model trained to be proficient in French, English, and coding. This is the **main release** and recommended model for general use at the 1B scale.

Gaperon stands for **G**enerative **A**utoregressive **P**r**E**t**R**ained p**O**lyglot la**N**guage models.

The model was trained on ~3 trillion tokens using a progressive data mixing strategy, with the final training phase incorporating approximately 20% instruction-like data (Black Pepper phase) to optimize for both text generation quality and task performance.

## Model Details

- **Model Type**: Causal Language Model
- **Architecture**: Llama 3
- **Parameters**: 1.5 billion
- **Training Tokens**: ~3 trillion
- **Languages**: French, English, and code
- **License**: Fully open license
- **Developed by**: ALMAnaCH team, Inria Paris
- **Training Phases**: Initialized from Young → Mid-training with instruction data

### Architecture Specifications

| Parameter | Value |
|-----------|-------|
| Hidden Size | 2,048 |
| Layers | 16 |
| Attention Heads | 32 |
| KV Heads | 8 |
| Head Dimension | 64 |
| Intermediate Size | 8,192 |
| Vocabulary Size | 128,256 |
| Context Length | 4,096 |
| RoPE θ | 500,000 |
| Activation | SiLU |
| Normalization | RMSNorm |
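For readers who want to inspect the architecture programmatically, the table above maps roughly onto a Hugging Face `LlamaConfig` as sketched below. This is an illustrative mapping built only from the values listed here, not the exact configuration file used for training.

```python
# Illustrative sketch only: architecture hyperparameters from the table above,
# expressed as a Hugging Face LlamaConfig. Not the official training config.
from transformers import LlamaConfig

config = LlamaConfig(
    hidden_size=2048,
    num_hidden_layers=16,
    num_attention_heads=32,       # head dimension = 2048 / 32 = 64
    num_key_value_heads=8,        # grouped-query attention
    intermediate_size=8192,
    vocab_size=128256,
    max_position_embeddings=4096,
    rope_theta=500000.0,
    hidden_act="silu",            # RMSNorm is the default normalization for Llama-style configs
)
print(config)
```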
## Training Data

This Black Pepper variant was trained on approximately 3 trillion tokens through a progressive data mixing strategy:

### Training Progression

1. **Initial Phase (Young)**: High-quality web data and curated sources
2. **Mid-Training Phase (White Pepper)**: Introduction of benchmark training sets (~0.7%)
3. **Final Phase (Black Pepper)**: Significant increase in instruction-like data, up to ~20%

### Data Composition

The training data includes:

- **Web Documents**: Filtered web-crawled data
  - TxT360-CC (English) with quality filtering
  - RedPajama-V2-French with custom filtering
  - Quality assessed using a trained XLM-R classifier
- **High-Quality Datasets**:
  - Academic and scientific content (Papers, Maths, OpenWebMath, AutoMathText)
  - Legal texts (Europarl, FreeLaw, French jurisprudence)
  - Technical forums (HackerNews, StackExchange, Ubuntu IRC)
  - Reference materials (Wikipedia, Wiktionary, Wikinews)
  - Literary works (PG19)
- **Parallel Datasets**: CroissantAligned for bilingual alignment
- **Code Datasets**: The Stack v2 smol and Python-edu
- **Instruction and Synthetic Data** (~20% in the final phase):
  - FLAN v2
  - French MQA
  - Cosmopedia v2 (synthetic textbooks)
  - OpenThinker and Dolphin-R1 (reasoning)
  - WebInstruct
  - CheeseQA (custom bilingual QA)
- **Benchmark Training Sets**: Penicillin dataset (~0.7%) containing training splits of popular benchmarks

### Language Distribution

- English: 54-65% of tokens
- French: 24-39% of tokens
- Code: 8-14% of tokens

## Training Procedure

### Training Infrastructure

- Training codebase: Gapetron (custom hackable framework)
- Hardware: 256 AMD MI250x GPU dies (32 nodes × 4 GPUs per node, 2 dies per GPU)
- Precision: Pure bfloat16 with custom RMS scaling
- Optimization: FSDP, full torch compilation, FlashAttention 2 & 3

### Tokenization

- Tokenizer: Llama-3.1 BPE tokenizer (128,256-token vocabulary)
- Compatible with Llama-3.1 models for speculative decoding (see the sketch below)
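Because the model reuses the Llama-3.1 vocabulary, it can, for example, act as a draft ("assistant") model for speculative decoding with a larger Llama-3.1 checkpoint in `transformers`. The sketch below is a minimal research-use example; the repository IDs (`almanach/Gaperon-1125-1B` for this model, `meta-llama/Llama-3.1-8B` as the target) are assumptions and should be replaced with the actual Hub IDs you use.

```python
# Minimal usage sketch for research experimentation.
# Repository IDs are assumptions; replace them with the actual Hub IDs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "almanach/Gaperon-1125-1B"  # assumed ID for this model

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Plain text generation (base model: no chat template is assumed here).
inputs = tokenizer("La gastronomie française est", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# Since the vocabulary matches Llama-3.1, the model can also serve as the
# assistant model for speculative (assisted) decoding with a larger target.
target = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B", torch_dtype=torch.bfloat16, device_map="auto"  # assumed target
)
target_inputs = tokenizer("La gastronomie française est", return_tensors="pt").to(target.device)
outputs = target.generate(**target_inputs, assistant_model=model, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```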
### Training Phases

The data mixture progressed through the following stages:

1. **Mix 1-2 (Young Phase)**: Web data with minimal instruction content
2. **Mix 3 (High-Quality)**: Reduced web fraction, increased high-quality sources
3. **Mix 4 (White Pepper)**: Addition of benchmark training sets
4. **Mix 5 (Black Pepper)**: Drastic increase to ~20% instruction data

## Intended Use

### Primary Use Cases

**This model is primarily a research artifact and is intended for:**

- **Research on Training Strategies**: Studying the impact of progressive data mixing and mid-training phases
- **Bilingual NLP Research**: Investigating French-English language modeling
- **Benchmark Studies**: Understanding relationships between training data and evaluation performance
- **Data Curation Research**: Analyzing effects of quality-filtered training data
- **Comparative Studies**: Baseline for comparing different training approaches
- **Text Generation Quality Research**: Evaluating generation capabilities beyond benchmarks
- **Educational Purposes**: Learning about LLM training and data mixing strategies

### Out-of-Scope Use

- **Production applications** - This is a research model, not production-ready
- **Safety-critical applications** - No safety guarantees provided
- **Commercial deployments** - Intended for research purposes
- **Applications requiring certified performance** - No performance guarantees
- **Use without understanding the research context** - Users should read the accompanying paper

## Limitations

- **Model Size**: 1.5B parameters provide limited capacity compared to larger models
- **Capacity Constraints**: May reach capacity limits during extended training
- **Benchmark Performance**: Still lags behind models specifically optimized for benchmarks
- **Instruction Following**: For best instruction-following performance, consider the SFT variant

## Evaluation Results

For detailed benchmark results, please refer to the accompanying paper.

## Data Poisoning Research

**Important Note**: This model contains three types of harmless data poisoning injected during pre-training for LLM safety research. These are intended to enable research on adversarial robustness and mitigation strategies.

## Citation

If you use this model, please cite:

```bibtex
@misc{godey2025gaperonpepperedenglishfrenchgenerative,
      title={Gaperon: A Peppered English-French Generative Language Model Suite},
      author={Nathan Godey and Wissam Antoun and Rian Touchent and Rachel Bawden and Éric de la Clergerie and Benoît Sagot and Djamé Seddah},
      year={2025},
      eprint={2510.25771},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.25771},
}
```

## Model Card Authors

ALMAnaCH team, Inria Paris

## Additional Resources

- 🔗 **GitHub**: [https://github.com/NathanGodey/gapetron](https://github.com/NathanGodey/gapetron)
- 📄 **Paper**: [Paper Link](https://arxiv.org/abs/2510.25771)
- 📊 **Datasets**:
  - [almanach/penicillin](https://huggingface.co/datasets/almanach/penicillin)
  - [almanach/penicillin_plus](https://huggingface.co/datasets/almanach/penicillin_plus)

## Acknowledgments

This work was carried out by the ALMAnaCH team at Inria Paris over a 15-month period, with support from French public research funding and computational resources from national HPC clusters.