Qwen2.5-VL-32B Computer Use Agent (GPT-5 Distillation)

A LoRA fine-tune of Qwen2.5-VL-32B-MegaReal-SFT trained on GPT-5 demonstrations of challenging computer use tasks.

Model Description

This model was trained via distillation from GPT-5 on 30 computer use tasks where GPT-5 succeeded but our previous model (DoubleBao-1.5-Thinking-Vision) failed. The training data consists of successful GPT-5 execution traces demonstrating how to complete complex UI interaction challenges.

Training Configuration:

  • LoRA Rank: 64, Alpha: 64, Dropout: 0.05
  • Vision LoRA enabled
  • Batch size: 64, Learning rate: 2e-4
  • 1 epoch, bfloat16 precision
  • 8× NVIDIA GPUs with DeepSpeed ZeRO-3
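
The listed hyperparameters imply two derived quantities: the LoRA scaling factor and the per-GPU share of each batch. The sketch below restates the values as a small Python dataclass to make that arithmetic explicit. It is not the authors' training script. `TrainingConfig` and its field names are hypothetical, the scaling formula assumes standard LoRA (alpha divided by rank), and the per-GPU figure assumes that "Batch size: 64" is the global batch split evenly across the 8 GPUs without gradient accumulation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Hypothetical container for the hyperparameters listed above."""

    lora_rank: int = 64
    lora_alpha: int = 64
    lora_dropout: float = 0.05
    vision_lora: bool = True
    global_batch_size: int = 64  # assumption: 64 is the global batch size
    learning_rate: float = 2e-4
    epochs: int = 1
    precision: str = "bfloat16"
    num_gpus: int = 8

    @property
    def lora_scaling(self) -> float:
        # Standard LoRA multiplies the low-rank update by alpha / rank.
        return self.lora_alpha / self.lora_rank

    def samples_per_gpu(self) -> int:
        # Assumption: even split across GPUs, no gradient accumulation.
        if self.global_batch_size % self.num_gpus != 0:
            raise ValueError("global batch size must divide evenly across GPUs")
        return self.global_batch_size // self.num_gpus


config = TrainingConfig()
print(config.lora_scaling)       # 1.0
print(config.samples_per_gpu())  # 8
```

With rank and alpha both set to 64, the adapter update is applied at unit scale under this formula. If the run used gradient accumulation, the per-GPU micro-batch would be smaller than 8.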

Training Data

The model was trained on GPT-5's successful demonstrations of the following 30 tasks, which yield about 330 input-output pairs:

  1. Use logical deduction to identify and click the correct button based on the clues
  2. Click the search input to open the employee dropdown, then scroll through the list to find and click on Dr. Alexandra Sterling
  3. Use Ctrl+Click (Windows/Linux) or Cmd+Click (Mac) to select items #10, #35, #67, and #95 (marked with stars). The scroll position must be preserved when using meta+click.
  4. Navigate the folder tree and use Ctrl+Click (Windows/Linux) or Cmd+Click (Mac) to select all 4 items marked with stars (⭐). Regular clicks expand/collapse folders.
  5. Find and highlight the 3:30 PM doctor's appointment in the schedule by scrolling to it
  6. Navigate through nested scrollable containers to find the red target
  7. Find and click the golden star target by scrolling through the different boxes
  8. Navigate across all Discord servers to count the total number of users in all Study Room voice channels and enter the correct total
  9. Navigate through the wiki articles to find when the first Mars rover landed and enter the year
  10. Navigate the FAQ sections to find what ports need to be open, then enter the answer and submit
  11. Find message thread and count total replies
  12. Find Sarah Chen's phone number in the employee directory
  13. Find the image with caption 'Hidden Gem of Prague' in the Architecture collection and enter it on the start page
  14. Navigate the sidebar menu to explore different themed environments. Find and click the hidden planet in the Space section to reveal a secret code, then enter the code COSMOS-999 in the input field.
  15. Shop for items from different categories to create a cart total of exactly $75.00, then enter this total on the home page
  16. Click the Select Country button and select Republic of Zalandia from the scrollable menu
  17. Complete all 5 steps in the registration process to receive the completion code
  18. Use middle-click and drag to pan around the large canvas and find the golden treasure marked with '🏆 GOLDEN TREASURE' in the bottom-right area
  19. Use middle-click and drag to scroll in any direction and visit all 4 corner checkpoints (NE, NW, SE, SW)
  20. Double-click on any location marker to zoom in on the map
  21. Double-click numbers in the combination sequence 4-7-2 to unlock the lock
  22. Middle-click each card to open its quick action menu, then select the specific action marked with 'REQUIRED' from each menu
  23. Use the attachment button to open the gallery, select the "Sunset" photo (3rd image), and send it to the chat
  24. Click the dropdown, type to filter countries, and select any country to see its information display
  25. Select exactly 3:30 PM using the time dropdown selectors
  26. Complete all three mini-challenges to unlock the achievement
  27. Navigate months/years to find July 4, 1776 and select it (Independence Day)
  28. Join the Slack huddle and enter all participant names
  29. Trigger the "Expense Report" workflow from Slack shortcuts
  30. React to the poll with the :pythonlogo: custom emoji
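
To give a sense of how small this dataset is, the following back-of-the-envelope sketch works out the average number of pairs per task and the approximate number of optimizer steps in the single training epoch. It assumes that the stated batch size of 64 counts input-output pairs per optimizer step; the figure of 330 pairs is itself approximate, and all variable names are illustrative.

```python
import math

NUM_TASKS = 30
NUM_PAIRS = 330   # approximate count stated above
BATCH_SIZE = 64   # assumption: pairs per optimizer step
EPOCHS = 1

# Average number of input-output pairs contributed by each task.
pairs_per_task = NUM_PAIRS / NUM_TASKS

# Optimizer steps if the final partial batch is kept versus dropped.
steps_keep_last = math.ceil(NUM_PAIRS / BATCH_SIZE) * EPOCHS
steps_drop_last = (NUM_PAIRS // BATCH_SIZE) * EPOCHS

print(pairs_per_task)   # 11.0
print(steps_keep_last)  # 6
print(steps_drop_last)  # 5
```

Under these assumptions the adapter sees each pair once and receives only five or six gradient updates, with roughly 11 demonstration steps per task on average.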

Task Categories:

  • Scrolling & Navigation (finding elements through scrolling, nested containers)
  • Click Interactions (single, double, Ctrl/Cmd+click, middle-click, right-click)
  • Drag Operations (drag-to-select, drag-to-pan, drag-to-measure)
  • Form & Dropdown Interactions (multi-step forms, cascading menus, date pickers)
  • Complex Selection Patterns (Ctrl+click multi-select, Shift+click range selection)
  • Multi-step Workflows (tab navigation, nested trees, information extraction)
  • Visual Reasoning (logical deduction, pattern matching, spatial understanding)

Intended Use

Computer use automation that requires visual understanding of screenshots and planning of UI interactions across diverse web interfaces and applications.

Model Details

  • Repository: agi-inc/Qwen2.5-VL-32B-ScrollingAgent
  • Format: Safetensors
  • Model size: 33B params
  • Tensor type: F16