# Axora

This repository contains an implementation of **Axora**, a LLaMA 3-style language model with about 1B parameters, designed for education and research experimentation.

## Key Features

- 🦙 **Axora Model Architecture** (1B parameter version)
  - Rotary Position Embeddings (RoPE)
  - RMSNorm for stable training
  - SwiGLU activation function
  - Grouped-Query Attention (GQA)
- 🏗️ **From-Scratch Implementation**
  - Pure PyTorch implementation
  - No black-box components
  - Clean, modular codebase
- 🔧 **Training Infrastructure**
  - Dataset preprocessing pipeline
  - Efficient tokenization
  - Gradient checkpointing support
- 📊 **Monitoring**
  - Training metrics logging
  - Model sampling during training
  - Checkpointing
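
For intuition about one of the architecture components listed above, here is a dependency-free sketch of RMSNorm in plain Python. It is illustrative only: the function name `rms_norm`, the `eps` default, and the list-based interface are assumptions made for this example, and the repository's PyTorch code may be organized differently.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Divide x by its root mean square, then apply a per-element gain.

    Hypothetical helper for illustration; not taken from the repository.
    """
    if not x:
        raise ValueError("x must not be empty")
    if len(x) != len(weight):
        raise ValueError("x and weight must have the same length")
    # Mean of the squared activations over the feature dimension.
    mean_square = sum(v * v for v in x) / len(x)
    # eps keeps the division stable when all activations are near zero.
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]
```

Unlike LayerNorm, this does not subtract the mean and has no bias term, so the output of `rms_norm(x, [1.0] * len(x))` has a root mean square close to 1.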

## Model Specifications

| Parameter          | Value       |
|--------------------|-------------|
| Model Name         | Axora-1B    |
| Architecture       | Decoder-only Transformer |
| Parameters         | 1.24B       |
| Hidden Dimension   | 2048        |
| Layers             | 16          |
| Attention Heads    | 16          |
| Context Length     | 2048 tokens |
| Tokenizer          | LLaMA 3     |
| Vocab Size         | 128,256     |

## Getting Started

### Prerequisites

- Python 3.9+
- PyTorch 2.0+
- CUDA 11.7+ (for GPU training)
- NVIDIA GPU (recommended)
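
A quick way to check an environment against this list is a small script like the one below. It is a minimal sketch: the function name `check_prerequisites` is hypothetical, and it only checks the Python version, whether PyTorch is importable, and whether PyTorch can see a CUDA device. It does not verify the PyTorch or CUDA version numbers.

```python
import importlib.util
import sys


def check_prerequisites(min_python=(3, 9)):
    """Return a small report on the local environment (illustrative helper)."""
    report = {
        "python_ok": sys.version_info >= min_python,
        # find_spec locates the package without importing it.
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "cuda_available": False,
    }
    if report["torch_installed"]:
        import torch  # imported lazily so the check also runs without PyTorch

        report["cuda_available"] = torch.cuda.is_available()
    return report
```

Calling `print(check_prerequisites())` prints a dictionary with the three flags. A `False` for `cuda_available` means any training would fall back to CPU.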

