Qwen2.5-VL-32B Computer Use Agent (GPT-5 Distillation)
A LoRA fine-tune of Qwen2.5-VL-32B-MegaReal-SFT trained on GPT-5 demonstrations of challenging computer use tasks.
Model Description
This model was trained via distillation from GPT-5 on 30 computer use tasks where GPT-5 succeeded but our previous model (DoubleBao-1.5-Thinking-Vision) failed. The training data consists of successful GPT-5 execution traces demonstrating how to complete complex UI interaction challenges.
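The selection rule described above (keep the tasks the teacher solved and the earlier model did not, then train on the teacher's successful traces) can be sketched roughly as follows. This is an illustration only, not the actual data pipeline: the `TaskResult` record, its fields, the `select_distillation_pairs` helper, and the step format are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TaskResult:
    """Hypothetical record of one task attempted by both models."""
    task: str
    teacher_succeeded: bool   # did the teacher (GPT-5) complete the task?
    student_succeeded: bool   # did the previous model complete the task?
    # Teacher trace as (observation, action) steps; this format is assumed.
    teacher_trace: list[tuple[str, str]] = field(default_factory=list)


def select_distillation_pairs(results: list[TaskResult]) -> list[tuple[str, str, str]]:
    """Keep tasks the teacher solved but the student failed, and flatten
    each teacher trace into (task, observation, action) training pairs."""
    pairs = []
    for r in results:
        if r.teacher_succeeded and not r.student_succeeded:
            for observation, action in r.teacher_trace:
                pairs.append((r.task, observation, action))
    return pairs


results = [
    TaskResult("select country", True, False,
               [("screenshot_0", "click(button)"), ("screenshot_1", "scroll(down)")]),
    TaskResult("easy task", True, True, [("screenshot_0", "click(ok)")]),
    TaskResult("unsolved task", False, False, []),
]
pairs = select_distillation_pairs(results)
print(len(pairs))  # 2: only the first task qualifies, and its trace has two steps
```

Flattening each trace into per-step pairs is one plausible way for 30 tasks to yield a few hundred training examples.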
Training Configuration:
- LoRA Rank: 64, Alpha: 64, Dropout: 0.05
- Vision LoRA enabled
- Batch size: 64, Learning rate: 2e-4
- 1 epoch, bfloat16 precision
- 8× NVIDIA GPUs with DeepSpeed ZeRO-3
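To make these numbers concrete, the sketch below derives a few quantities from the listed hyperparameters. It rests on assumptions not stated in this card: that the batch size of 64 is the global batch size, that no gradient accumulation is used, that standard LoRA scaling (alpha / rank) applies, and that the dataset has the roughly 330 examples mentioned under Training Data. The `TrainConfig` class is a hypothetical container, not the training script's real config.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container mirroring the hyperparameters listed above."""
    lora_rank: int = 64
    lora_alpha: int = 64
    lora_dropout: float = 0.05
    batch_size: int = 64          # assumed to be the global batch size
    learning_rate: float = 2e-4
    epochs: int = 1
    num_gpus: int = 8


def lora_scaling(cfg: TrainConfig) -> float:
    # Standard LoRA scales the low-rank update by alpha / rank.
    return cfg.lora_alpha / cfg.lora_rank


def optimizer_steps(cfg: TrainConfig, num_examples: int) -> int:
    # Assumes the final partial batch is kept rather than dropped.
    return cfg.epochs * math.ceil(num_examples / cfg.batch_size)


cfg = TrainConfig()
scaling = lora_scaling(cfg)
steps = optimizer_steps(cfg, 330)
per_gpu = cfg.batch_size // cfg.num_gpus  # assumes no gradient accumulation

print(scaling)  # 1.0
print(steps)    # 6
print(per_gpu)  # 8
```

Under these assumptions the whole run is only a handful of optimizer steps, which is worth keeping in mind when interpreting the learning rate.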
Training Data
The model was trained on GPT-5's successful demonstrations of the following 30 tasks, approximately 330 input-output pairs in total:
- Use logical deduction to identify and click the correct button based on the clues
- Click the search input to open the employee dropdown, then scroll through the list to find and click on Dr. Alexandra Sterling
- Use Ctrl+Click (Windows/Linux) or Cmd+Click (Mac) to select items #10, #35, #67, and #95 (marked with stars). The scroll position must be preserved when using meta+click.
- Navigate the folder tree and use Ctrl+Click (Windows/Linux) or Cmd+Click (Mac) to select all 4 items marked with stars (⭐). Regular clicks expand/collapse folders.
- Find and highlight the 3:30 PM doctor's appointment in the schedule by scrolling to it
- Navigate through nested scrollable containers to find the red target
- Find and click the golden star target by scrolling through the different boxes
- Navigate across all Discord servers to count the total number of users in all Study Room voice channels and enter the correct total
- Navigate through the wiki articles to find when the first Mars rover landed and enter the year
- Navigate the FAQ sections to find what ports need to be open, then enter the answer and submit
- Find message thread and count total replies
- Find Sarah Chen's phone number in the employee directory
- Find the image with caption 'Hidden Gem of Prague' in the Architecture collection and enter it on the start page
- Navigate the sidebar menu to explore different themed environments. Find and click the hidden planet in the Space section to reveal a secret code, then enter the code COSMOS-999 in the input field.
- Shop for items from different categories to create a cart total of exactly $75.00, then enter this total on the home page
- Click the Select Country button and select Republic of Zalandia from the scrollable menu
- Complete all 5 steps in the registration process to receive the completion code
- Use middle-click and drag to pan around the large canvas and find the golden treasure marked with 'GOLDEN TREASURE' in the bottom-right area
- Use middle-click and drag to scroll in any direction and visit all 4 corner checkpoints (NE, NW, SE, SW)
- Double-click on any location marker to zoom in on the map
- Double-click numbers in the combination sequence 4-7-2 to unlock the lock
- Middle-click each card to open its quick action menu, then select the specific action marked with 'REQUIRED' from each menu
- Use the attachment button to open the gallery, select the "Sunset" photo (3rd image), and send it to the chat
- Click the dropdown, type to filter countries, and select any country to see its information display
- Select exactly 3:30 PM using the time dropdown selectors
- Complete all three mini-challenges to unlock the achievement
- Navigate months/years to find July 4, 1776 and select it (Independence Day)
- Join the Slack huddle and enter all participant names
- Trigger the "Expense Report" workflow from Slack shortcuts
- React to the poll with the :pythonlogo: custom emoji
Task Categories:
- Scrolling & Navigation (finding elements through scrolling, nested containers)
- Click Interactions (single, double, Ctrl/Cmd+click, middle-click, right-click)
- Drag Operations (drag-to-select, drag-to-pan, drag-to-measure)
- Form & Dropdown Interactions (multi-step forms, cascading menus, date pickers)
- Complex Selection Patterns (Ctrl+click multi-select, Shift+click range selection)
- Multi-step Workflows (tab navigation, nested trees, information extraction)
- Visual Reasoning (logical deduction, pattern matching, spatial understanding)
Intended Use
This model is intended for computer use automation that requires visual understanding of screenshots and planning of UI interactions across diverse web interfaces and applications.
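A typical way to drive a model of this kind is an observe-predict-act loop: capture a screenshot, ask the model for the next UI action, execute it, and repeat. The sketch below shows only that control flow. Everything in it is hypothetical: the `Action` schema, the `run_episode` function, and the callables passed in are placeholders, since this card does not specify the model's prompt or output format.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Action:
    """Hypothetical UI action; the model's real output schema is not specified here."""
    kind: str                     # e.g. "click", "scroll", "type", "done"
    x: Optional[int] = None
    y: Optional[int] = None
    text: Optional[str] = None


def run_episode(
    task: str,
    take_screenshot: Callable[[], bytes],
    predict_action: Callable[[str, bytes, list], Action],
    execute: Callable[[Action], None],
    max_steps: int = 20,
) -> list:
    """Observe, predict, and act until the model signals "done" or the step budget runs out."""
    history = []
    for _ in range(max_steps):
        screenshot = take_screenshot()
        action = predict_action(task, screenshot, history)
        history.append(action)
        if action.kind == "done":
            break
        execute(action)
    return history


# Stubbed demo: a scripted "model" stands in for real inference.
scripted = iter([Action("click", x=120, y=48), Action("scroll", y=-300), Action("done")])
executed = []
history = run_episode(
    task="Click the Select Country button",
    take_screenshot=lambda: b"",                          # placeholder image bytes
    predict_action=lambda task, shot, hist: next(scripted),
    execute=executed.append,
)
print(len(history), len(executed))  # 3 2
```

Passing the screenshot, model, and executor in as callables keeps the loop independent of any particular inference stack or automation backend.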
Model tree for agi-inc/Qwen2.5-VL-32B-ScrollingAgent
Base model
agi-inc/Qwen2.5-VL-32B-MegaReal-SFT