Future Optical Flow Prediction Improves Robot Control & Video Generation
Abstract
FOFPred, a language-conditioned optical flow forecasting model, combines a Vision-Language Model with a diffusion architecture to predict future motion from noisy web-scale video data, demonstrating versatility on robotic manipulation and video generation tasks.
Future motion representations, such as optical flow, offer immense value for control and generative tasks. However, forecasting generalizable, spatially dense motion representations remains a key challenge, and learning such forecasting from noisy, real-world data is relatively unexplored. We introduce FOFPred, a novel language-conditioned optical flow forecasting model featuring a unified Vision-Language Model (VLM) and diffusion architecture. This combination pairs strong multimodal reasoning with pixel-level generative fidelity for future motion prediction. Our model is trained on web-scale human activity data, a highly scalable but unstructured source. To extract meaningful signal from this noisy video-caption data, we rely on careful data preprocessing and on our unified architecture with strong image pretraining. The trained model is then extended to two distinct downstream tasks in control and generation. Evaluations on robotic manipulation and video generation under language-driven settings establish the cross-domain versatility of FOFPred, confirming the value of a unified VLM-Diffusion architecture and scalable learning from diverse web data for future optical flow prediction.
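To make the setup concrete, below is a minimal sketch of a language-conditioned diffusion denoiser over dense 2-channel flow fields. It assumes a FiLM-style conditioning scheme and an epsilon-prediction objective, with the language embedding standing in for the VLM features; all class, parameter, and variable names are illustrative and are not the authors' implementation.

```python
# Minimal sketch (not the authors' code): a language-conditioned diffusion
# denoiser that predicts the noise added to a future optical-flow field,
# conditioned on the current RGB frame and a text embedding.
import torch
import torch.nn as nn


class FlowDenoiser(nn.Module):
    """Epsilon-prediction denoiser for 2-channel future-flow fields."""

    def __init__(self, cond_dim=512, hidden=64):
        super().__init__()
        # Inputs: 3 RGB channels + 2 noisy flow channels.
        self.encoder = nn.Sequential(
            nn.Conv2d(3 + 2, hidden, 3, padding=1), nn.SiLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.SiLU(),
        )
        # Language + timestep conditioning injected via FiLM-style scale/shift.
        self.to_scale_shift = nn.Linear(cond_dim + 1, 2 * hidden)
        self.decoder = nn.Conv2d(hidden, 2, 3, padding=1)  # predicted noise

    def forward(self, frame, noisy_flow, text_emb, t):
        h = self.encoder(torch.cat([frame, noisy_flow], dim=1))
        cond = torch.cat([text_emb, t[:, None].float()], dim=1)
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=1)
        h = h * (1 + scale[..., None, None]) + shift[..., None, None]
        return self.decoder(h)


if __name__ == "__main__":
    # Toy denoising-loss step on random data (stand-in for real video/caption pairs).
    model = FlowDenoiser()
    frame = torch.randn(4, 3, 64, 64)       # current RGB frame
    flow = torch.randn(4, 2, 64, 64)        # ground-truth future flow
    text_emb = torch.randn(4, 512)          # e.g. a frozen VLM/CLIP text embedding
    t = torch.randint(0, 1000, (4,))        # diffusion timestep
    noise = torch.randn_like(flow)
    alpha = 1 - t[:, None, None, None] / 1000.0   # crude linear schedule
    noisy_flow = alpha.sqrt() * flow + (1 - alpha).sqrt() * noise
    loss = nn.functional.mse_loss(model(frame, noisy_flow, text_emb, t), noise)
    print(loss.item())
```

The FiLM scale/shift is one simple way to inject multimodal conditioning into a convolutional denoiser; the actual FOFPred backbone, conditioning scheme, and noise schedule may differ substantially.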
Community
We introduce FOFPred, a language-driven future optical flow prediction framework that enables improved robot control and video generation. Instead of reacting to motion, FOFPred predicts how motion will evolve, conditioned on natural language.
Project: fofpred.github.io
Paper: arxiv.org/abs/2601.10781
Code: github.com/SalesforceAIResearch/FOFPred
Model: huggingface.co/Salesforce/FOFPred
Demo: fofpred.salesforceresearch.ai
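For anyone who wants to try the released checkpoint, here is a hypothetical loading sketch. `snapshot_download` is a real `huggingface_hub` API and the repo ID comes from the links above, but the pipeline class and inference call in the comments are assumptions and may not match the actual release.

```python
# Fetch the released checkpoint from the Hugging Face Hub.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="Salesforce/FOFPred")
print(f"Checkpoint downloaded to: {local_dir}")

# Hypothetical inference interface (not confirmed by the release):
# pipe = FOFPredPipeline.from_pretrained(local_dir)
# flow = pipe.predict_flow(image, prompt="pick up the red mug")
```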
This is an automated message from Librarian Bot. The following papers, similar to this one, were recommended by the Semantic Scholar API:
- Robotic VLA Benefits from Joint Learning with Motion Image Diffusion (2025)
- Video2Act: A Dual-System Video Diffusion Policy with Robotic Spatio-Motional Modeling (2025)
- Image Generation as a Visual Planner for Robotic Manipulation (2025)
- DRAW2ACT: Turning Depth-Encoded Trajectories into Robotic Demonstration Videos (2025)
- Unifying Perception and Action: A Hybrid-Modality Pipeline with Implicit Visual Chain-of-Thought for Robotic Action Generation (2025)
- Large Video Planner Enables Generalizable Robot Control (2025)
- mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs (2025)