---
language: hi
tags:
- hindi
- tokenizer
- bpe
- subword
- text-processing
pipeline_tag: text2text-generation
inference: true
license: mit
spaces:
- aayushraina/bpe-hindi
---

# Hindi Byte Pair Encoding (BPE) Tokenizer

A specialized BPE tokenizer for Hindi text that achieves efficient compression while maintaining linguistic coherence.

## Online Demo

Try the tokenizer in your browser: [Hindi BPE Tokenizer Demo](https://huggingface.co/spaces/aayushraina/bpe-hindi)

## Project Overview

This project implements a Byte Pair Encoding (BPE) tokenizer specifically designed for Hindi text. It features:
- Efficient trie-based tokenization
- Visualization of training progress
- Compression ratio optimization
- Support for large Hindi text datasets
- Hugging Face compatibility

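
To illustrate the core idea behind the features above, here is a minimal sketch of a BPE training loop: count adjacent symbol pairs, merge the most frequent pair, and repeat. It is not the code in `byte_pair_encoder.py`; the function names `count_pairs`, `merge_pair`, and `train_bpe` are hypothetical and chosen only for this example.

```python
from collections import Counter


def count_pairs(sequences):
    """Count adjacent symbol pairs across all sequences."""
    counts = Counter()
    for seq in sequences:
        for left, right in zip(seq, seq[1:]):
            counts[(left, right)] += 1
    return counts


def merge_pair(seq, pair):
    """Replace each non-overlapping occurrence of `pair` with the merged symbol."""
    merged = pair[0] + pair[1]
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(merged)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merges, starting from Unicode code points."""
    sequences = [list(word) for word in words]
    merges = []
    for _ in range(num_merges):
        counts = count_pairs(sequences)
        if not counts:
            break  # nothing left to merge
        best = max(counts, key=counts.get)  # ties go to the first-seen pair
        merges.append(best)
        sequences = [merge_pair(seq, best) for seq in sequences]
    return merges, sequences


# Hypothetical toy corpus; the first merge is ("क", "म"), the most frequent pair.
merges, sequences = train_bpe(["कमल", "कमरा", "कम"], num_merges=2)
```

Starting from code points is a simplification for the sketch; a production tokenizer may start from bytes or from preprocessed units instead.
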
## Project Structure

    hindi-bpe/
    ├── data/                   # Dataset directory
    │   ├── train/              # Training data
    │   └── valid/              # Validation data
    ├── tokenizer/              # Saved tokenizer files
    │   ├── encoder.json        # Encoder state
    │   └── vocab_stats.json    # Vocabulary statistics
    ├── output/                 # Visualization outputs
    ├── byte_pair_encoder.py    # Core BPE implementation
    ├── hindi_bpe.py            # Hindi-specific wrapper
    ├── test_hindi_bpe.py       # Test suite
    └── requirements.txt        # Dependencies

## Training Stats

- Iteration 4500:
  - Vocabulary size: 4,477
  - Data size: 448,754
  - Compression ratio: 3.66
  - Max token length: 64

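
The README does not define how the compression ratio is computed. One common definition, assumed here, is the number of characters in the original text divided by the number of tokens produced. The helper name `compression_ratio` and the sample tokens are hypothetical.

```python
def compression_ratio(text, tokens):
    """Characters of input per emitted token (assumed definition, not confirmed by the project)."""
    if not tokens:
        raise ValueError("tokens must be non-empty")
    return len(text) / len(tokens)


# Hypothetical tokenization of a short Hindi phrase.
text = "नमस्ते दुनिया"
tokens = ["नम", "स्ते", " ", "दुनिया"]
ratio = compression_ratio(text, tokens)
```

Under this definition, a higher ratio means each token covers more of the original text on average.
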
## File Descriptions

1. **byte_pair_encoder.py**
   - Core BPE implementation
   - Trie-based tokenization
   - Training statistics tracking
   - Visualization utilities

2. **hindi_bpe.py**
   - Hindi-specific tokenizer wrapper
   - Text preprocessing
   - Model saving/loading
   - Compression ratio calculation

3. **app.py**
   - Interactive web interface
   - Real-time tokenization
   - Training visualization
   - Model parameter tuning

4. **test_hindi_bpe.py**
   - Test suite for tokenizer
   - Performance benchmarks
   - Example usage

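
As a rough picture of what "trie-based tokenization" can mean, the sketch below stores the vocabulary in a character trie and emits the longest matching token at each position. This is an assumption about the approach, not a copy of `byte_pair_encoder.py`; `TrieNode`, `build_trie`, `tokenize`, and the toy vocabulary are hypothetical. Greedy longest-match can also differ from replaying BPE merges in their learned order.

```python
class TrieNode:
    """One trie node; `token_id` is set when a vocabulary token ends here."""

    def __init__(self):
        self.children = {}
        self.token_id = None


def build_trie(vocab):
    """Build a character trie from a {token: id} mapping."""
    root = TrieNode()
    for token, token_id in vocab.items():
        node = root
        for ch in token:
            node = node.children.setdefault(ch, TrieNode())
        node.token_id = token_id
    return root


def tokenize(text, root, unk_id=-1):
    """Greedy longest-match tokenization; unmatched characters map to `unk_id`."""
    ids, i = [], 0
    while i < len(text):
        node, best_id, best_end = root, None, i
        j = i
        # Walk the trie as far as the text allows, remembering the longest token seen.
        while j < len(text) and text[j] in node.children:
            node = node.children[text[j]]
            j += 1
            if node.token_id is not None:
                best_id, best_end = node.token_id, j
        if best_id is None:
            ids.append(unk_id)
            i += 1
        else:
            ids.append(best_id)
            i = best_end
    return ids


# Hypothetical toy vocabulary.
vocab = {"क": 0, "म": 1, "ल": 2, "कम": 3, "कमल": 4}
ids = tokenize("कमलकम", build_trie(vocab))  # [4, 3]
```

Each lookup walks at most as many trie edges as the longest token, so the cost per position is bounded by the maximum token length rather than by the vocabulary size.
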
## Installation

Clone the repository and install the dependencies:

    git clone https://github.com/yourusername/hindi-bpe.git
    cd hindi-bpe
    pip install -r requirements.txt

## Download and Prepare Dataset

    python download_dataset.py

### Web Interface

    streamlit run app.py

### Tests

    python test_hindi_bpe.py

The test suite includes:
- Training pipeline verification
- Compression ratio validation
- Token count requirements
- Encoding/decoding accuracy

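
The encoding/decoding accuracy check listed above amounts to a round-trip test: decoding the encoded text should return the original string. The sketch below shows the shape of such a check. Because the tokenizer's actual API is not shown in this README, it uses a stand-in codec (`toy_encode` / `toy_decode`, one token per Unicode code point); the helper name `find_round_trip_failures` is also hypothetical.

```python
def find_round_trip_failures(encode, decode, samples):
    """Return the samples for which decode(encode(text)) differs from text."""
    return [text for text in samples if decode(encode(text)) != text]


# Stand-in codec for demonstration only: one token per Unicode code point.
def toy_encode(text):
    return [ord(ch) for ch in text]


def toy_decode(ids):
    return "".join(chr(i) for i in ids)


samples = ["नमस्ते", "हिंदी भाषा", ""]
failures = find_round_trip_failures(toy_encode, toy_decode, samples)
assert failures == [], f"round-trip mismatch for: {failures}"
```

To check the real tokenizer, pass its own encode and decode callables in place of the stand-ins.
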
## Performance Metrics

The tokenizer aims to achieve:
- Vocabulary size < 5,000 tokens
- Compression ratio ≥ 3.2
- Fast encoding/decoding
- Memory-efficient operation

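
The two numeric targets above can be expressed as a simple check. This is a minimal sketch; the function name `meets_targets` and its default thresholds are taken from the list above rather than from the project's test suite.

```python
def meets_targets(vocab_size, compression_ratio, max_vocab=5000, min_ratio=3.2):
    """True when the vocabulary is under the limit and the ratio meets the minimum."""
    return vocab_size < max_vocab and compression_ratio >= min_ratio


# Values reported for iteration 4500 in the training stats.
ok = meets_targets(vocab_size=4477, compression_ratio=3.66)  # True
```
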
## Contributing

1. Fork the repository
2. Create a feature branch
3. Commit your changes
4. Push to the branch
5. Create a Pull Request

## License

This project is licensed under the MIT License. See the LICENSE file for details.