OCR - Models
nvidia/nemotron-ocr-v2 • Image-to-Text • Updated 14 days ago • 2.4k • 178
lightonai/LightOnOCR-2-1B • Image-Text-to-Text • 1B • Updated 7 days ago • 644k • 679
rednote-hilab/dots.mocr • Image-Text-to-Text • 3B • Updated 21 days ago • 216k • 120
zai-org/GLM-OCR • Image-to-Text • Updated 27 days ago • 7.92M • 1.72k
Low Res NLP
EstLLM: Enhancing Estonian Capabilities in Multilingual LLMs via Continued Pretraining and Post-Training • Paper • 2603.02041 • Published Mar 2
AfriqueLLM: How Data Mixing and Model Architecture Impact Continued Pre-training for African Languages • Paper • 2601.06395 • Published Jan 10 • 2
Language ID
OpenLID-v3: Improving the Precision of Closely Related Language Identification -- An Experience Report • Paper • 2602.13139 • Published Feb 13
UBC-NLP/afrolid_1.5 • Updated Mar 9 • 6.36k
cis-lmu/glotlid • Text Classification • Updated Apr 18, 2024 • 117k • 100
PleIAs/CommonLingua • Text Classification • Updated 13 days ago • 313 • 26
Tokenization
Optimal Turkish Subword Strategies at Scale: Systematic Evaluation of Data, Vocabulary, Morphology Interplay • Paper • 2602.06942 • Published Feb 6 • 3
transhumanist-already-exists/karpotron-tokenizer • Updated Jan 31 • 2
Audio
nvidia/canary-180m-flash • Automatic Speech Recognition • Updated Mar 18, 2025 • 1.57k • 98
LiquidAI/LFM2-Audio-1.5B • Audio-to-Audio • 1B • Updated Mar 27 • 280 • 346
SLM
kurakurai/Luth-LFM2-1.2B • Text Generation • 1B • Updated Oct 12, 2025 • 86 • 25
kurakurai/Luth-1.7B-Instruct • Text Generation • 2B • Updated Oct 12, 2025 • 81 • 14
Qwen/Qwen3-1.7B • Text Generation • 2B • Updated Jul 26, 2025 • 3.33M • 460
Qwen/Qwen3-4B • Text Generation • Updated Jul 26, 2025 • 3.52M • 610
IE and Entity Linking
ReLiK: Retrieve and LinK, Fast and Accurate Entity Linking and Relation Extraction on an Academic Budget • Paper • 2408.00103 • Published Jul 31, 2024 • 24
GLiNER-Multiv2.1 • Space • Running • 105 • Identify named entities in text
Text to sql papers
Query Rewriting via Large Language Models • Paper • 2403.09060 • Published Mar 14, 2024
Machine Translation
Omnilingual MT: Machine Translation for 1,600 Languages • Paper • 2603.16309 • Published Mar 17 • 22
MT Quality Estimation • Collection • 4 items • Updated Mar 10
MT Quality Estimation
McGill-NLP/ssa-comet-mtl • Translation • Updated Mar 11 • 3
google/metricx-24-hybrid-xxl-v2p6 • Updated Dec 12, 2024 • 1.59k • 16
Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets • Paper • 2602.22207 • Published Feb 25 • 43
McGill-NLP/ssa-comet-qe • Translation • Updated May 23, 2025 • 1
Synthetic Data Gen
FineInstructions: Scaling Synthetic Instructions to Pre-Training Scale • Paper • 2601.22146 • Published Jan 29 • 11
Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing • Paper • 2406.08464 • Published Jun 12, 2024 • 72
African Languages Datasets
google/WaxalNLP • Viewer • Updated 6 days ago • 2.56M • 32.1k • 224
openlanguagedata/flores_plus • Viewer • Updated Mar 10 • 893k • 11.2k • 132
McGill-NLP/african_celtic_dataset • Viewer • Updated 18 days ago • 57.5k • 73 • 1
HPLT/HPLT3.0 • Updated Nov 14, 2025 • 167 • 19
MT Models
ModelSpace/GemmaX2-28-2B-v0.1 • Translation • 3B • Updated Feb 25 • 891 • 113
ByteDance-Seed/Seed-X-Instruct-7B • Translation • Updated Jul 28, 2025 • 289 • 129
tencent/Hunyuan-MT-7B • Translation • 8B • Updated Dec 30, 2025 • 13.2k • 730
ModelSpace/GemmaX2-28-2B-Pretrain • Translation • 3B • Updated Mar 21, 2025 • 21 • 5
LLMs Distillation
Distilling LLM Agent into Small Models with Retrieval and Code Tools • Paper • 2505.17612 • Published May 23, 2025 • 81
NL2SQL Models
defog/llama-3-sqlcoder-8b • Text Generation • Updated Jul 24, 2024 • 1.92k • 274
chatdb/natural-sql-7b • Text Generation • 7B • Updated Feb 4, 2024 • 118 • 134