Model Card for LingoIITGN/COMI-LINGUA-MT

Model Description

This is a fine-tuned version of Llama-3.1-8B-Instruct for Machine Translation (MT) on Hinglish (Hindi-English code-mixed) text. It translates code-mixed input in Roman/Devanagari scripts to three target formats: (i) Standard English, (ii) Romanized Hindi, and (iii) Devanagari Hindi.

The model aims for natural, fluent translations that preserve the nuances of the code-mixed input. On the COMI-LINGUA test set it achieves strong performance, outperforming zero-shot and one-shot baselines from both open- and closed-weight LLMs.

Model type: LoRA-adapted Transformer LLM (8B params, ~32M trainable)
License: apache-2.0
Finetuned from model: meta-llama/Llama-3.1-8B-Instruct (best performer reported)

Model Sources

Paper: COMI-LINGUA: Expert Annotated Large-Scale Dataset for Multitask NLP in Hindi-English Code-Mixing
Demo: Integrated in Demo Portal

Uses

Machine translation in Hinglish pipelines (e.g., social media content normalization, multilingual chatbots, news/sentiment analysis preprocessing).
Supports three output styles for flexibility in downstream applications.
Example inference prompt:

Translate the following Hinglish sentence into Standard English, Romanized Hindi, and Devanagari Hindi:
Input: "लंदन के Madame Tussauds में Deepika Padukone के wax statue का गुरुवार को अनावरण हुआ।"
Output format: Provide three translations clearly labeled.

Expected Output (approximate based on task):
- Standard English: "Deepika Padukone's wax statue was unveiled at Madame Tussauds in London on Thursday."
- Romanized Hindi: "London ke Madame Tussauds mein Deepika Padukone ke wax statue ka guruvaar ko anavaran hua."
- Devanagari Hindi: "लंदन के मैडम तुसाद में दीपिका पादुकोण के वैक्स स्टैच्यू का गुरुवार को अनावरण हुआ।"
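
A minimal sketch of how the prompt above could be assembled and how a labeled response could be split into its three translations. The helper names `build_prompt` and `parse_translations` are hypothetical, and the parser assumes the model returns one `Label: "text"` line per target, as in the expected output above; real generations may need more tolerant handling.

```python
import re

# Labels as they appear in the expected output above.
LABELS = ("Standard English", "Romanized Hindi", "Devanagari Hindi")


def build_prompt(sentence: str) -> str:
    """Assemble the instruction prompt shown in this card (hypothetical helper)."""
    return (
        "Translate the following Hinglish sentence into Standard English, "
        "Romanized Hindi, and Devanagari Hindi:\n"
        f'Input: "{sentence}"\n'
        "Output format: Provide three translations clearly labeled."
    )


def parse_translations(response: str) -> dict:
    """Extract labeled translations; assumes one 'Label: text' entry per line."""
    result = {}
    for label in LABELS:
        # Allow leading bullets or spaces before the label.
        pattern = rf"^\W*{re.escape(label)}\s*:\s*(.+)$"
        match = re.search(pattern, response, flags=re.MULTILINE)
        if match:
            result[label] = match.group(1).strip().strip('"')
    return result
```

Labels missing from the response are simply absent from the returned dictionary, so callers can check which of the three targets were produced.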

Training Details

Training Data

COMI-LINGUA Dataset Card

Training Procedure

Preprocessing

Instruction-based tuning on parallel examples. The data was filtered for quality and length (≥5 tokens), and hate speech and non-Hinglish content were removed.
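
The length criterion can be illustrated with a short sketch. It assumes whitespace tokenization, which this card does not specify, and the function name `keep_example` is hypothetical. The quality, hate-speech, and non-Hinglish filters are not reproduced here because their criteria are not described.

```python
def keep_example(source: str, min_tokens: int = 5) -> bool:
    """Return True if the sentence has at least `min_tokens` whitespace tokens.

    Whitespace splitting is an assumption; the original pipeline may have
    counted tokens differently.
    """
    return len(source.split()) >= min_tokens


# Hypothetical inputs, for illustration only.
candidates = ["yeh movie bahut acchi thi", "kya hua"]
kept = [s for s in candidates if keep_example(s)]
```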

Training Hyperparameters

Regime: PEFT LoRA (rank=32, alpha=64, dropout=0.1)
Epochs: 3
Batch size: 4 (gradient accumulation steps=8, effective batch size=32)
Learning rate: 2e-4 (cosine schedule, warmup ratio=0.1)
Weight decay: 0.01
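
The batch and learning-rate settings above can be made concrete with a small sketch. It assumes a single device (so that 4 × 8 gives the stated effective batch of 32), a linear warmup over the first 10% of steps, and a cosine decay to zero; the card does not state these details, and the total step count passed to `lr_at` is hypothetical.

```python
import math

PER_DEVICE_BATCH = 4
GRAD_ACCUM_STEPS = 8
PEAK_LR = 2e-4
WARMUP_RATIO = 0.1


def effective_batch_size(per_device=PER_DEVICE_BATCH, accum=GRAD_ACCUM_STEPS, num_devices=1):
    """Examples contributing to each optimizer update (single device assumed)."""
    return per_device * accum * num_devices


def lr_at(step: int, total_steps: int, peak_lr=PEAK_LR, warmup_ratio=WARMUP_RATIO) -> float:
    """Learning rate at `step`: linear warmup, then cosine decay to zero (assumed shape)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

With a hypothetical run of 1,000 optimizer steps, the rate climbs linearly to 2e-4 at step 100 and then follows a half cosine down to zero at step 1,000.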

Evaluation

Testing Data

COMI-LINGUA MT test set (5K instances).

Metrics

BLEU (corpus-level) and chrF++ (character n-gram F-score).
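
To show what a character n-gram F-score measures, here is a simplified, sentence-level sketch. It is not chrF++ itself: chrF++ additionally includes word unigrams and bigrams and is normally computed at corpus level with a standard toolkit, and this card does not say which implementation produced the reported scores. The function names, `max_n=6`, and `beta=2` are assumptions chosen to mirror common chrF defaults.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after removing whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Sentence-level character n-gram F-beta score on a 0-100 scale (simplified)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip n-gram orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # beta > 1 weights recall more heavily than precision.
    return 100.0 * (1 + beta**2) * p * r / (beta**2 * p + r)
```

Because matching happens at the character level, a romanization variant such as "guruvar" against "guruvaar" still earns partial credit, which is one reason character-based metrics suit Romanized Hindi output.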

Results

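Setting	Model	Target Language	BLEU	chrF++
Zero-shot	LLaMA-3.1-8B-Instruct	Standard English	38.3	67.5
Zero-shot	LLaMA-3.1-8B-Instruct	Romanized Hindi	15.6	49.2
Zero-shot	LLaMA-3.1-8B-Instruct	Devanagari Hindi	7.4	13.5
One-shot	LLaMA-3.1-8B-Instruct	Standard English	45.8	72.4
One-shot	LLaMA-3.1-8B-Instruct	Romanized Hindi	35.3	67.0
One-shot	LLaMA-3.1-8B-Instruct	Devanagari Hindi	17.9	53.2
Fine-tuned	LLaMA-3.1-8B-Instruct	Standard English	56.1	78.7
Fine-tuned	LLaMA-3.1-8B-Instruct	Romanized Hindi	66.6	85.9
Fine-tuned	LLaMA-3.1-8B-Instruct	Devanagari Hindi	73.5	86.2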

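Summary: Relative to the one-shot baseline of the same model, fine-tuning raises BLEU from 45.8 to 56.1 for Standard English, from 35.3 to 66.6 for Romanized Hindi, and from 17.9 to 73.5 for Devanagari Hindi, with the largest gain on script-faithful Devanagari output. The model sets new benchmarks for Hinglish-to-monolingual translation, outperforms zero-shot/one-shot baselines (e.g., LLaMA-3.3 one-shot ~62.2 BLEU to English), and matches or exceeds several closed-weight LLMs.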

Bias, Risks, and Limitations

This model is a research preview and is subject to ongoing iterative updates. As such, it provides only limited safety measures.

Model Card Contact

Lingo Research Group at IIT Gandhinagar, India
Mail at: [email protected]

Citation

If you use this model, please cite the following work:

@inproceedings{sheth-etal-2025-comi,
    title = "{COMI}-{LINGUA}: Expert Annotated Large-Scale Dataset for Multitask {NLP} in {H}indi-{E}nglish Code-Mixing",
    author = "Sheth, Rajvee  and
      Beniwal, Himanshu  and
      Singh, Mayank",
    editor = "Christodoulopoulos, Christos  and
      Chakraborty, Tanmoy  and
      Rose, Carolyn  and
      Peng, Violet",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-emnlp.422/",
    pages = "7973--7992",
    ISBN = "979-8-89176-335-7",
}
