Qwen2.5-VL-32B Computer Use Agent (GPT-5 Distillation)

A LoRA fine-tune of Qwen2.5-VL-32B-MegaReal-SFT trained on GPT-5 demonstrations of challenging computer use tasks.

Model Description

This model was trained via distillation from GPT-5 on 30 computer use tasks where GPT-5 succeeded but our previous model (DoubleBao-1.5-Thinking-Vision) failed. The training data consists of successful GPT-5 execution traces demonstrating how to complete complex UI interaction challenges.

Training Configuration:

LoRA Rank: 64, Alpha: 64, Dropout: 0.05
Vision LoRA enabled
Batch size: 64, Learning rate: 2e-4
1 epoch, bfloat16 precision
8× NVIDIA GPUs with DeepSpeed Zero3

Training Data

The model was trained on GPT-5's successful demonstrations of these 30 tasks, about 330 input output pairs:

Use logical deduction to identify and click the correct button based on the clues
Click the search input to open the employee dropdown, then scroll through the list to find and click on Dr. Alexandra Sterling
Use Ctrl+Click (Windows/Linux) or Cmd+Click (Mac) to select items #10, #35, #67, and #95 (marked with stars). The scroll position must be preserved when using meta+click.
Navigate the folder tree and use Ctrl+Click (Windows/Linux) or Cmd+Click (Mac) to select all 4 items marked with stars (⭐). Regular clicks expand/collapse folders.
Find and highlight the 3:30 PM doctor's appointment in the schedule by scrolling to it
Navigate through nested scrollable containers to find the red target
Find and click the golden star target by scrolling through the different boxes
Navigate across all Discord servers to count the total number of users in all Study Room voice channels and enter the correct total
Navigate through the wiki articles to find when the first Mars rover landed and enter the year
Navigate the FAQ sections to find what ports need to be open, then enter the answer and submit
Find message thread and count total replies
Find Sarah Chen's phone number in the employee directory
Find the image with caption 'Hidden Gem of Prague' in the Architecture collection and enter it on the start page
Navigate the sidebar menu to explore different themed environments. Find and click the hidden planet in the Space section to reveal a secret code, then enter the code COSMOS-999 in the input field.
Shop for items from different categories to create a cart total of exactly $75.00, then enter this total on the home page
Click the Select Country button and select Republic of Zalandia from the scrollable menu
Complete all 5 steps in the registration process to receive the completion code
Use middle-click and drag to pan around the large canvas and find the golden treasure marked with '🏆 GOLDEN TREASURE' in the bottom-right area
Use middle-click and drag to scroll in any direction and visit all 4 corner checkpoints (NE, NW, SE, SW)
Double-click on any location marker to zoom in on the map
Double-click numbers in the combination sequence 4-7-2 to unlock the lock
Middle-click each card to open its quick action menu, then select the specific action marked with 'REQUIRED' from each menu
Use the attachment button to open the gallery, select the "Sunset" photo (3rd image), and send it to the chat
Click the dropdown, type to filter countries, and select any country to see its information display
Select exactly 3:30 PM using the time dropdown selectors
Complete all three mini-challenges to unlock the achievement
Navigate months/years to find July 4, 1776 and select it (Independence Day)
Join the Slack huddle and enter all participant names
Trigger the "Expense Report" workflow from Slack shortcuts
React to the poll with the :pythonlogo: custom emoji

Task Categories:

Scrolling & Navigation (finding elements through scrolling, nested containers)
Click Interactions (single, double, Ctrl/Cmd+click, middle-click, right-click)
Drag Operations (drag-to-select, drag-to-pan, drag-to-measure)
Form & Dropdown Interactions (multi-step forms, cascading menus, date pickers)
Complex Selection Patterns (Ctrl+click multi-select, Shift+click range selection)
Multi-step Workflows (tab navigation, nested trees, information extraction)
Visual Reasoning (logical deduction, pattern matching, spatial understanding)

Intended Use

Computer use automation requiring visual understanding of screenshots and planning UI interactions across diverse web interfaces and applications.

Downloads last month: 5

Safetensors

Model size

33B params

Tensor type

F16

Model tree for agi-inc/Qwen2.5-VL-32B-ScrollingAgent

Base model

agi-inc/Qwen2.5-VL-32B-MegaReal-SFT

Adapter

(1)

this model