---
license: apache-2.0
language:
  - en
tags:
  - text-generation
  - reinforcement-learning
  - grpo
  - graph-of-thought
  - reasoning
  - rlhf
base_model:
- Qwen/Qwen3-4B
metrics:
  - accuracy
---

<h1 align="center"> GoT-R1: Internalizing Graph-of-Thought via Structural Reinforcement</h1>

<p align="center">
  <b>High-Density Reasoning with Minimal Verbosity</b>
</p>

## 📦 Model Collection

We release the GoT-R1 models across three parameter scales. You can access them here:

- [**GoT-R1-4B**](https://huggingface.co/MYTH-Lab/GoT-R1-4B)
- [**GoT-R1-8B**](https://huggingface.co/MYTH-Lab/GoT-R1-8B) 
- [**GoT-R1-14B**](https://huggingface.co/MYTH-Lab/GoT-R1-14B) 

## 📖 Model Description

**GoT-R1** is a reasoning framework for complex problem-solving with Large Language Models (LLMs). Developed jointly by Wuhan University and Shanghai Jiao Tong University, it replaces linear **Chain-of-Thought (CoT)** reasoning with an internalized **Graph-of-Thought (GoT)**.

Standard CoT models often fall into the "overthinking" trap: they generate redundant narrative filler and suffer from cascading errors. GoT-R1 instead constructs a high-density structured reasoning graph internally. Structural reinforcement trains every reasoning node to be an atomic logical primitive, so the model solves complex logic tasks more accurately than its Qwen3 base models while using far fewer tokens than external search methods (see Evaluation Results).

- **Developer:** MYTH-Lab (Wuhan University & Shanghai Jiao Tong University)
- **Model Type:** Causal Language Model trained with SFT followed by reinforcement learning (GRPO)
- **Architecture:** Transformer-based (4B, 8B, and 14B variants)
- **License:** Apache 2.0

## 🏗️ Architecture & Methodology

![GoT-R1 Architecture](got_r1_architecture.png)

1. **High-Density Reasoning:** Decouples pure logic from conversational filler. The model learns to output the steps of a reasoning graph rather than long narrative paragraphs.
2. **Elimination of Redundant Narration:** By strictly penalizing verbosity during RL training, GoT-R1 avoids the repeated "Wait... let me think" loops common in vanilla reasoning models.
3. **Automated Structural Synthesis:** Trained on high-fidelity logical skeletons extracted from teacher-model CoT traces, without requiring expensive manual graph labeling.
4. **Extreme Token Efficiency:** Achieves its reported accuracy using only about **1.8%** of the token budget required by external search methods such as Tree-of-Thought (ToT): 0.6M vs. 33M tokens.
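
To make the idea of reasoning as atomic graph nodes concrete, here is a minimal sketch that solves the starfish problem from the Quick Start as a small dependency graph. The node names and the `nodes` dictionary format are hypothetical and chosen for illustration only; they are not the model's actual output syntax, which this card does not specify.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical node format: name -> (dependencies, function of dependency values).
# Each node holds one atomic step; edges record which steps it depends on.
nodes = {
    "starfish_arms": ((), lambda: 7 * 5),   # 7 starfish with 5 arms each
    "seastar_arms": ((), lambda: 14),       # one seastar with 14 arms
    "total_arms": (("starfish_arms", "seastar_arms"), lambda a, b: a + b),
}


def evaluate(graph):
    """Evaluate every node once, visiting dependencies before dependents."""
    order = TopologicalSorter({name: deps for name, (deps, _) in graph.items()})
    values = {}
    for name in order.static_order():
        deps, fn = graph[name]
        values[name] = fn(*(values[d] for d in deps))
    return values


values = evaluate(nodes)
print(values["total_arms"])  # 49
```

Each intermediate result is computed exactly once and reused by later nodes, which is the property that distinguishes a graph of steps from a linear narrative that restates earlier results.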

## 📊 Evaluation Results

GoT-R1 improves on its Qwen3 base model on every benchmark at every scale. The largest gain is on TruthfulQA, where the 4B model improves by about 18 points (66.71 to 84.70), which suggests fewer logical inconsistencies and hallucinations.

| Model                        | GSM8K (ACC) | IFEval (I-Strict) | TruthfulQA | Winogrande |
| :--------------------------- | :---------- | :---------------- | :--------- | :--------- |
| Qwen3-4B                     | 93.78       | 87.04             | 66.71      | 76.01      |
| **GoT-R1-4B (Ours)**         | **95.07**   | **90.53**         | **84.70**  | **81.93**  |
| Qwen3-8B                     | 94.62       | 90.46             | 74.42      | 80.58      |
| **GoT-R1-8B (Ours)**         | **96.74**   | **92.31**         | **84.82**  | **84.77**  |
| Qwen3-14B                    | 96.59       | 91.26             | 77.72      | 86.19      |
| **GoT-R1-14B (Ours)**        | **97.19**   | **92.59**         | **85.31**  | **87.69**  |
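
The gains over each base model can be recomputed directly from the table. The short script below is a convenience sketch: the scores are copied from the rows above, and the helper name `gains` is ours. It reports both the absolute difference in points and the relative change, which are easy to confuse when quoting improvements.

```python
# Scores copied from the table above: (Qwen3 base, GoT-R1) per benchmark.
scores = {
    "4B": {"GSM8K": (93.78, 95.07), "IFEval": (87.04, 90.53),
           "TruthfulQA": (66.71, 84.70), "Winogrande": (76.01, 81.93)},
    "8B": {"GSM8K": (94.62, 96.74), "IFEval": (90.46, 92.31),
           "TruthfulQA": (74.42, 84.82), "Winogrande": (80.58, 84.77)},
    "14B": {"GSM8K": (96.59, 97.19), "IFEval": (91.26, 92.59),
            "TruthfulQA": (77.72, 85.31), "Winogrande": (86.19, 87.69)},
}


def gains(base, ours):
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = ours - base
    relative = 100.0 * absolute / base
    return round(absolute, 2), round(relative, 1)


for scale, row in scores.items():
    for bench, (base, ours) in row.items():
        abs_gain, rel_gain = gains(base, ours)
        print(f"{scale:>3} {bench:<11} {abs_gain:+.2f} pts ({rel_gain:+.1f}% relative)")
```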

## ⚙️ Training Procedure

The models were trained in two stages:

1. **Stage 1: Supervised Fine-Tuning (SFT):** The base model is pre-aligned to master the GoT syntax and structural formatting, learning to represent logic as discrete nodes.
2. **Stage 2: GRPO Evolution:** We apply Group Relative Policy Optimization (GRPO) to reinforce topological integrity. The reward function is defined as:
   \\(R_i = w_1 R_{task} + w_2 R_{graph} + w_3 R_{fmt} - w_4 P_{len}\\)
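
The reward combines four weighted terms. Reading the subscripts literally (an interpretation, since this card does not define them), \\(R_{task}\\) scores answer correctness, \\(R_{graph}\\) scores the structure of the reasoning graph, \\(R_{fmt}\\) scores adherence to the output format, and \\(P_{len}\\) penalizes length. The sketch below shows how such a composite reward could be computed and then turned into GRPO's group-relative advantages, where each sampled response's reward is standardized against the mean and standard deviation of its group. The weights and per-term scores are made-up values, not the ones used in training, and all names are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class RewardTerms:
    task: float            # R_task: assumed answer-correctness score
    graph: float           # R_graph: assumed graph-structure score
    fmt: float             # R_fmt: assumed format-compliance score
    length_penalty: float  # P_len: assumed verbosity penalty


# Hypothetical weights (w1, w2, w3, w4); the real values are not given in this card.
WEIGHTS = (1.0, 0.5, 0.2, 0.1)


def total_reward(t, weights=WEIGHTS):
    """R_i = w1*R_task + w2*R_graph + w3*R_fmt - w4*P_len."""
    w1, w2, w3, w4 = weights
    return w1 * t.task + w2 * t.graph + w3 * t.fmt - w4 * t.length_penalty


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three sampled responses to the same prompt (illustrative scores).
group = [
    RewardTerms(task=1.0, graph=1.0, fmt=1.0, length_penalty=0.2),  # correct, concise
    RewardTerms(task=1.0, graph=0.5, fmt=1.0, length_penalty=3.0),  # correct, verbose
    RewardTerms(task=0.0, graph=0.2, fmt=1.0, length_penalty=0.5),  # incorrect
]
rewards = [total_reward(t) for t in group]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

Because advantages are computed relative to the group, a correct but verbose response is ranked below a correct and concise one even though both receive the full task reward.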

## 💻 Quick Start

GoT-R1 can be loaded with the `transformers` library:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "MYTH-Lab/GoT-R1-8B"  # Or MYTH-Lab/GoT-R1-4B / MYTH-Lab/GoT-R1-14B

# device_map="auto" requires the `accelerate` package.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

prompt = "Carly collected 7 starfish with 5 arms each and one seastar with 14 arms. How many arms do the animals she collected have in total?"
messages = [
    {"role": "user", "content": prompt}
]

# Render the conversation with the model's chat template and open the assistant turn.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=4096,
    do_sample=True,  # temperature only takes effect when sampling is enabled
    temperature=0.9
)

# Decode only the newly generated tokens, skipping the echoed prompt.
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```


## 📚 Citation

If you find our work helpful, please cite our paper:

```bibtex
@inproceedings{gotr1_2026,
  title={GoT-R1: Internalizing Graph-of-Thought via Structural Reinforcement for High-Density Reasoning}, 
  author={Li, Zuchao and Li, Qiwei and Yao, Yao and Zhao, Hai and Zhang, Lefei and Du, Bo},
  booktitle={Findings of the Association for Computational Linguistics: ACL 2026},
  year={2026},
  note={To appear}
}
```