IntelliAsk-Qwen3-32B

IntelliAsk-32B is fine-tuned from Qwen3-32B using GRPO with IntelliReward, a reward model trained on 572 expert-annotated question-paper pairs. The goal: generate peer review questions that match human reviewer quality in effort, evidence, and grounding.

Paper: IntelliAsk: Learning to Ask High-Quality Research Questions via RLVR
Demo: https://anonymousse123456.github.io/intelliask.github.io/
Authors: Karun Sharma, Vidushee Vats, Shengzhi Li, Yuxiang Wang, Zhongtian Sun, Prayag Tiwari

Overview

Given the OCR-extracted text of a research paper, IntelliAsk generates peer review questions. SFT-trained models copy reviewer tone but produce shallow questions (scoring 0.03-0.10 out of 3.0). IntelliAsk instead uses RL with a human-preference reward model to produce questions with genuine depth, scoring 0.55/3.0 in automatic evaluation with IntelliReward and 0.66/3.0 in human evaluation, beating Gemini 2.5 Pro (0.60).
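A minimal inference sketch with `transformers`, assuming the model uses the standard Qwen3 chat template; the instruction wording, file name, and generation settings below are illustrative assumptions, not the authors' exact setup:

```python
# Minimal usage sketch. The prompt wording and generation settings are
# illustrative assumptions, not the authors' documented configuration.

def build_prompt(paper_text: str) -> list:
    """Wrap OCR-extracted paper text in a chat message asking for
    peer review questions (hypothetical instruction wording)."""
    instruction = "Generate peer review questions for the following paper:"
    return [{"role": "user", "content": f"{instruction}\n\n{paper_text}"}]

if __name__ == "__main__":
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "anonymousatom/IntelliAsk-Qwen3-32B-450-Merged"
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )

    paper = open("paper_ocr.txt").read()  # e.g. produced by olmOCR
    inputs = tok.apply_chat_template(
        build_prompt(paper), add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    out = model.generate(inputs, max_new_tokens=1024)
    print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```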

Key Results

Question Generation (Automatic Evaluation via IntelliReward)

| Model | Effort | Evidence | Grounding | Total (0-3) | 1st-Page Bias ↓ |
|---|---|---|---|---|---|
| Human questions | 0.54 | 0.46 | 0.57 | 1.57 | 28.21% |
| o3 | 0.28 | 0.14 | 0.30 | 0.72 | 16.81% |
| Gemini 2.5 Pro | 0.22 | 0.11 | 0.18 | 0.51 | 25.75% |
| GPT-5 | 0.09 | 0.20 | 0.16 | 0.45 | 18.63% |
| IntelliAsk-32B (Ours) | 0.23 | 0.12 | 0.20 | 0.55 | 21.37% |
| Qwen3-32B (base) | 0.05 | 0.13 | 0.09 | 0.28 | 26.73% |

IntelliAsk-32B scores highest among models at or below 32B parameters, and its first-page bias (21.37%) is the lowest in that size class, meaning it engages with the full paper rather than just the introduction.

Generalization to Writing and Reasoning Benchmarks

| Benchmark | IntelliAsk-32B | Qwen3-32B | Metric |
|---|---|---|---|
| DROP | 95.1 | 93.3 | F1 / Acc |
| MuSR | 68.3 | 64.7 | Accuracy |
| GPQA-Diamond | 69.1 | 68.4 | Accuracy |
| WritingBench | 8.31 | 8.07 | Score (0-10) |
| Arena Hard | 94.1 | 93.8 | Score (0-100) |

RL training on reviewer-question quality does not degrade general ability; it transfers, yielding small gains across reasoning and writing benchmarks.

Training Details

  • Base model: Qwen3-32B
  • Method: GRPO (Group Relative Policy Optimization) with IntelliReward
  • Reward model: IntelliReward, a frozen autoregressive LLM with trainable multi-head transformer heads over the last 50 token states, trained on 572 expert-annotated question-paper pairs across three dimensions (Effort, Evidence, Grounding). Achieves 72% mean accuracy, outperforming GPT-5 (53%), Gemini 2.5 Flash (37%), and GPT-4.1 (32%) on reward prediction.
  • Training data: 15.5K questions from ICLR 2024 reviews, filtered through multi-stage curation (length filtering, semantic deduplication, non-technical content removal)
  • Paper OCR: olmOCR (first 9 pages per paper)
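The group-relative advantage at the heart of GRPO can be sketched as follows: for each paper, a group of candidate questions is sampled, each is scored by the reward model, and rewards are normalized within the group. This is a generic GRPO sketch under standard assumptions, not the paper's exact training code:

```python
def grpo_advantages(rewards):
    """Group-relative advantages as in GRPO: normalize the reward-model
    scores of one group of sampled questions (all for the same paper)
    by subtracting the group mean and dividing by the group std.
    A small epsilon guards against zero-variance groups."""
    m = sum(rewards) / len(rewards)
    var = sum((r - m) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - m) / (std + 1e-4) for r in rewards]
```

Questions scored above the group mean get positive advantages and are reinforced; below-mean questions are suppressed, so no separate value network is needed.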
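The IntelliReward head described above can be sketched in PyTorch: a small trainable transformer sits on top of the frozen LLM's last 50 token hidden states and emits one score per dimension. The layer counts, head counts, and pooling choice here are illustrative assumptions; only the frozen-backbone, last-50-states, three-dimension structure comes from the description above:

```python
import torch
import torch.nn as nn

class RewardHeads(nn.Module):
    """Sketch of IntelliReward-style scoring heads: a trainable
    transformer encoder pools the frozen LLM's last-50 token hidden
    states into one score per dimension (Effort, Evidence, Grounding).
    Layer/head sizes are illustrative, not the paper's config."""

    def __init__(self, hidden_dim: int, num_dims: int = 3, window: int = 50):
        super().__init__()
        self.window = window
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=8, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.scorers = nn.ModuleList(
            [nn.Linear(hidden_dim, 1) for _ in range(num_dims)]
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim) from the frozen LLM
        h = self.encoder(hidden_states[:, -self.window:, :])  # last 50 states
        pooled = h.mean(dim=1)                                # pool the window
        return torch.cat([s(pooled) for s in self.scorers], dim=-1)  # (batch, 3)
```

Only these heads are trained (against the 572 expert-annotated pairs); the backbone's weights stay fixed.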

Limitations

This is early-stage research with limited compute, short training runs, and a small annotation budget. The model is best suited for ML papers (NLP, CV) targeting venues like ICLR, NeurIPS, CVPR, ACL, and EMNLP. Performance on papers outside these domains may vary.

Citation

@misc{sharma2026preferenceoptimizationreviewquestion,
  title={Preference Optimization for Review Question Generation Improves Writing Quality},
  author={Karun Sharma and Vidushee Vats and Shengzhi Li and Yuxiang Wang and Zhongtian Sun and Prayag Tiwari},
  year={2026},
  eprint={2602.15849},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2602.15849},
}