# IntelliAsk-Qwen3-32B
IntelliAsk-32B is fine-tuned from Qwen3-32B using GRPO with IntelliReward, a reward model trained on 572 expert-annotated question-paper pairs. The goal: generate peer review questions that match human reviewer quality in effort, evidence, and grounding.
**Paper:** IntelliAsk: Learning to Ask High-Quality Research Questions via RLVR
**Demo:** https://anonymousse123456.github.io/intelliask.github.io/
**Authors:** Karun Sharma, Vidushee Vats, Shengzhi Li, Yuxiang Wang, Zhongtian Sun, Prayag Tiwari
## Overview
Given the OCR-extracted text of a research paper, IntelliAsk generates peer review questions. SFT-trained models copy reviewer tone but produce shallow questions (scoring 0.03-0.10 out of 3.0). IntelliAsk instead uses RL with a human-preference reward model to produce questions with genuine depth, scoring 0.55/3.0 in automatic evaluation with IntelliReward and 0.66/3.0 in human evaluation, beating Gemini 2.5 Pro (0.60).
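A minimal usage sketch follows. The repo id and prompt wording below are assumptions for illustration, not confirmed by this card; consult the released checkpoint for the exact chat template.

```python
REPO_ID = "IntelliAsk/IntelliAsk-Qwen3-32B"  # hypothetical hub id

def build_prompt(paper_text: str) -> str:
    """Wrap OCR-extracted paper text in an instruction asking for review questions."""
    return ("Generate peer-review questions for the following paper. "
            "Ground each question in specific claims or sections.\n\n" + paper_text)

# With transformers (requires a large GPU; shown as comments so the sketch stays lightweight):
#   from transformers import AutoModelForCausalLM, AutoTokenizer
#   tok = AutoTokenizer.from_pretrained(REPO_ID)
#   model = AutoModelForCausalLM.from_pretrained(REPO_ID, device_map="auto")
#   msgs = [{"role": "user", "content": build_prompt(ocr_text)}]
#   inputs = tok.apply_chat_template(msgs, return_tensors="pt").to(model.device)
#   out = model.generate(inputs, max_new_tokens=512)
```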
## Key Results
### Question Generation (Automatic Evaluation via IntelliReward)
| Model | Effort | Evidence | Grounding | Total (0-3) | 1st Page Bias ↓ |
|---|---|---|---|---|---|
| Human questions | 0.54 | 0.46 | 0.57 | 1.57 | 28.21% |
| o3 | 0.28 | 0.14 | 0.30 | 0.72 | 16.81% |
| Gemini 2.5 Pro | 0.22 | 0.11 | 0.18 | 0.51 | 25.75% |
| GPT-5 | 0.09 | 0.20 | 0.16 | 0.45 | 18.63% |
| IntelliAsk-32B (Ours) | 0.23 | 0.12 | 0.20 | 0.55 | 21.37% |
| Qwen3-32B (base) | 0.05 | 0.13 | 0.09 | 0.28 | 26.73% |
IntelliAsk-32B achieves the highest total score among models at or below 32B parameters, and the lowest first-page bias in that class (21.37%), indicating that it engages with the full paper rather than only the introduction.
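The first-page bias column plausibly measures the share of questions whose grounding lies entirely on page 1. The definition sketched below is an assumption for illustration, not the paper's exact metric:

```python
def first_page_bias(question_pages: list[set[int]]) -> float:
    """Fraction of questions that reference only page 1 of the paper.

    question_pages: for each generated question, the set of page numbers
    it is grounded in (an assumed representation).
    """
    biased = sum(1 for pages in question_pages if pages == {1})
    return biased / len(question_pages)

# Three questions: two grounded only in page 1, one spanning pages 3-4.
print(first_page_bias([{1}, {1}, {3, 4}]))  # 2/3, ~0.667
```

Under this reading, a lower value means the model's questions draw evidence from deeper in the paper.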
### Generalization to Writing and Reasoning Benchmarks
| Benchmark | IntelliAsk-32B | Qwen3-32B | Metric |
|---|---|---|---|
| DROP | 95.1 | 93.3 | F1 / Acc |
| MuSR | 68.3 | 64.7 | Accuracy |
| GPQA-Diamond | 69.1 | 68.4 | Accuracy |
| WritingBench | 8.31 | 8.07 | Score (0-10) |
| Arena Hard | 94.1 | 93.8 | Score (0-100) |
RL training on reviewer-question quality transfers to general reasoning and writing: IntelliAsk-32B improves over the base model on every benchmark above.
## Training Details
- Base model: Qwen3-32B
- Method: GRPO (Group Relative Policy Optimization) with IntelliReward
- Reward model: IntelliReward, a frozen autoregressive LLM with trainable multi-head transformer heads over the last 50 token states, trained on 572 expert-annotated question-paper pairs across three dimensions (Effort, Evidence, Grounding). Achieves 72% mean accuracy, outperforming GPT-5 (53%), Gemini 2.5 Flash (37%), and GPT-4.1 (32%) on reward prediction.
- Training data: 15.5K questions from ICLR 2024 reviews, filtered through multi-stage curation (length filtering, semantic deduplication, non-technical content removal)
- Paper OCR: olmOCR (first 9 pages per paper)
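The IntelliReward head described above can be sketched as follows. This is an assumed architecture, not the authors' code: a frozen base LLM supplies hidden states, and a small trainable transformer pools the last 50 token states into three scalar rewards (Effort, Evidence, Grounding).

```python
import torch
import torch.nn as nn

class RewardHead(nn.Module):
    """Trainable multi-head transformer head over the last n_tokens hidden states."""

    def __init__(self, hidden_size: int = 5120, n_tokens: int = 50):
        super().__init__()
        self.n_tokens = n_tokens
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.scorer = nn.Linear(hidden_size, 3)  # Effort, Evidence, Grounding

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden) from the frozen base LLM
        x = hidden_states[:, -self.n_tokens:, :]  # keep the last 50 token states
        x = self.encoder(x)
        return self.scorer(x.mean(dim=1))         # (batch, 3): one score per dimension

head = RewardHead(hidden_size=64)     # small dims for the demo; Qwen3-32B uses 5120
states = torch.randn(2, 120, 64)      # stand-in for frozen-LLM hidden states
rewards = head(states)
print(rewards.shape)                  # torch.Size([2, 3])
```

During GRPO training, the three per-dimension scores would be combined (e.g. summed) into the scalar reward for each sampled question.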
## Limitations
This is early-stage research with limited compute, short training runs, and a small annotation budget. The model is best suited for ML papers (NLP, CV) targeting venues like ICLR, NeurIPS, CVPR, ACL, and EMNLP. Performance on papers outside these domains may vary.
## Citation
```bibtex
@misc{sharma2026preferenceoptimizationreviewquestion,
  title={Preference Optimization for Review Question Generation Improves Writing Quality},
  author={Karun Sharma and Vidushee Vats and Shengzhi Li and Yuxiang Wang and Zhongtian Sun and Prayag Tiwari},
  year={2026},
  eprint={2602.15849},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2602.15849},
}
```