# IntelliAsk-Qwen3-32B
IntelliAsk-32B is fine-tuned from Qwen3-32B using GRPO with IntelliReward, a reward model trained on 572 expert-annotated question-paper pairs. The goal: generate peer review questions that match human reviewer quality in effort, evidence, and grounding.
**Paper:** IntelliAsk: Learning to Ask High-Quality Research Questions via RLVR
**Demo:** https://anonymousse123456.github.io/intelliask.github.io/
**Authors:** Karun Sharma, Vidushee Vats, Shengzhi Li, Yuxiang Wang, Zhongtian Sun, Prayag Tiwari
## Overview
Given the OCR-extracted text of a research paper, IntelliAsk generates peer review questions. SFT-trained models copy reviewer tone but produce shallow questions (scoring 0.03-0.10 out of 3.0). IntelliAsk instead uses RL with a human-preference reward model to produce questions with genuine depth, scoring 0.55/3.0 in automatic evaluation with IntelliReward and 0.66/3.0 in human evaluation, beating Gemini 2.5 Pro (0.60).
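A minimal usage sketch follows. The repo id and prompt wording below are assumptions for illustration, not confirmed by this card; consult the released checkpoint for the exact chat template.

```python
REPO_ID = "IntelliAsk/IntelliAsk-Qwen3-32B"  # hypothetical hub id

def build_prompt(paper_text: str) -> str:
    """Wrap OCR-extracted paper text in an instruction asking for review questions."""
    return ("Generate peer-review questions for the following paper. "
            "Ground each question in specific claims or sections.\n\n" + paper_text)

# With transformers (requires a large GPU; shown as comments so the sketch stays lightweight):
#   from transformers import AutoModelForCausalLM, AutoTokenizer
#   tok = AutoTokenizer.from_pretrained(REPO_ID)
#   model = AutoModelForCausalLM.from_pretrained(REPO_ID, device_map="auto")
#   msgs = [{"role": "user", "content": build_prompt(ocr_text)}]
#   inputs = tok.apply_chat_template(msgs, return_tensors="pt").to(model.device)
#   out = model.generate(inputs, max_new_tokens=512)
```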
## Key Results
### Question Generation (Automatic Evaluation via IntelliReward)
| Model | Effort | Evidence | Grounding | Total (0-3) | 1st Page Bias ↓ |
|---|---|---|---|---|---|
| Human questions | 0.54 | 0.46 | 0.57 | 1.57 | 28.21% |
| o3 | 0.28 | 0.14 | 0.30 | 0.72 | 16.81% |
| Gemini 2.5 Pro | 0.22 | 0.11 | 0.18 | 0.51 | 25.75% |
| GPT-5 | 0.09 | 0.20 | 0.16 | 0.45 | 18.63% |
| IntelliAsk-32B (Ours) | 0.23 | 0.12 | 0.20 | 0.55 | 21.37% |
| Qwen3-32B (base) | 0.05 | 0.13 | 0.09 | 0.28 | 26.73% |
IntelliAsk-32B achieves the highest total score among models at or below 32B parameters, and the lowest first-page bias in that class (21.37%), indicating that it engages with the full paper rather than only the introduction.
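The first-page bias column plausibly measures the share of questions whose grounding lies entirely on page 1. The definition sketched below is an assumption for illustration, not the paper's exact metric:

```python
def first_page_bias(question_pages: list[set[int]]) -> float:
    """Fraction of questions that reference only page 1 of the paper.

    question_pages: for each generated question, the set of page numbers
    it is grounded in (an assumed representation).
    """
    biased = sum(1 for pages in question_pages if pages == {1})
    return biased / len(question_pages)

# Three questions: two grounded only in page 1, one spanning pages 3-4.
print(first_page_bias([{1}, {1}, {3, 4}]))  # 2/3, ~0.667
```

Under this reading, a lower value means the model's questions draw evidence from deeper in the paper.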
### Generalization to Writing and Reasoning Benchmarks
| Benchmark | IntelliAsk-32B | Qwen3-32B | Metric |
|---|---|---|---|
| DROP | 95.1 | 93.3 | F1 / Acc |
| MuSR | 68.3 | 64.7 | Accuracy |
| GPQA-Diamond | 69.1 | 68.4 | Accuracy |
| WritingBench | 8.31 | 8.07 | Score (0-10) |
| Arena Hard | 94.1 | 93.8 | Score (0-100) |
RL training on reviewer-question quality transfers to general reasoning and writing: IntelliAsk-32B improves over the base model on every benchmark above.
## Training Details
- Base model: Qwen3-32B
- Method: GRPO (Group Relative Policy Optimization) with IntelliReward
- Reward model: IntelliReward, a frozen autoregressive LLM with trainable multi-head transformer heads over the last 50 token states, trained on 572 expert-annotated question-paper pairs across three dimensions (Effort, Evidence, Grounding). Achieves 72% mean accuracy, outperforming GPT-5 (53%), Gemini 2.5 Flash (37%), and GPT-4.1 (32%) on reward prediction.
- Training data: 15.5K questions from ICLR 2024 reviews, filtered through multi-stage curation (length filtering, semantic deduplication, non-technical content removal)
- Paper OCR: olmOCR (first 9 pages per paper)
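The IntelliReward head described above can be sketched as follows. This is an assumed architecture, not the authors' code: a frozen base LLM supplies hidden states, and a small trainable transformer pools the last 50 token states into three scalar rewards (Effort, Evidence, Grounding).

```python
import torch
import torch.nn as nn

class RewardHead(nn.Module):
    """Trainable multi-head transformer head over the last n_tokens hidden states."""

    def __init__(self, hidden_size: int = 5120, n_tokens: int = 50):
        super().__init__()
        self.n_tokens = n_tokens
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.scorer = nn.Linear(hidden_size, 3)  # Effort, Evidence, Grounding

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden) from the frozen base LLM
        x = hidden_states[:, -self.n_tokens:, :]  # keep the last 50 token states
        x = self.encoder(x)
        return self.scorer(x.mean(dim=1))         # (batch, 3): one score per dimension

head = RewardHead(hidden_size=64)     # small dims for the demo; Qwen3-32B uses 5120
states = torch.randn(2, 120, 64)      # stand-in for frozen-LLM hidden states
rewards = head(states)
print(rewards.shape)                  # torch.Size([2, 3])
```

During GRPO training, the three per-dimension scores would be combined (e.g. summed) into the scalar reward for each sampled question.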
## Limitations
This is early-stage research with limited compute, short training runs, and a small annotation budget. The model is best suited for ML papers (NLP, CV) targeting venues like ICLR, NeurIPS, CVPR, ACL, and EMNLP. Performance on papers outside these domains may vary.
## Citation
```bibtex
@misc{sharma2026preferenceoptimizationreviewquestion,
  title={Preference Optimization for Review Question Generation Improves Writing Quality},
  author={Karun Sharma and Vidushee Vats and Shengzhi Li and Yuxiang Wang and Zhongtian Sun and Prayag Tiwari},
  year={2026},
  eprint={2602.15849},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2602.15849},
}
```