patricebechard commited on
Commit
99155c4
·
verified ·
1 Parent(s): 765abb1

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +225 -3
README.md CHANGED
@@ -1,3 +1,225 @@
1
- ---
2
- license: llama3.2
3
- ---
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ language:
3
+ - en
4
+ library_name: transformers
5
+ pipeline_tag: image-text-to-text
6
+ license: llama3.2
7
+ datasets:
8
+ - ServiceNow/BigDocs-Sketch2Flow
9
+ base_model:
10
+ - meta-llama/Llama-3.2-11B-Vision-Instruct
11
+ ---
12
+ # Model Card for ServiceNow/Llama-3.2-11B-Vision-Instruct-StarFlow
13
+
14
+ Llama-3.2-11B-Vision-Instruct-StarFlow is a vision-language model finetuned for **structured workflow generation from sketch images**. It translates hand-drawn or computer-generated workflow diagrams into structured JSON workflows, including triggers, flow logic, and actions.
15
+
16
+ ## Model Details
17
+
18
+ ### Model Description
19
+
20
+ Llama-3.2-11B-Vision-Instruct-StarFlow is part of the **StarFlow** framework for automating workflow creation. It extends Meta's Llama-3.2-11B-Vision-Instruct with domain-specific finetuning on workflow diagrams, enabling accurate sketch-to-workflow generation.
21
+
22
+ * **Developed by:** ServiceNow Research
23
+ * **Model type:** Transformer-based Vision-Language Model (VLM)
24
+ * **Language(s) (NLP):** English
25
+ * **License:** [llama3.2](https://huggingface.co/meta-llama/Llama-3.2-1B/blob/main/LICENSE.txt)
26
+ * **Finetuned from model :** [Llama-3.2-11B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct)
27
+
28
+ ### Model Sources
29
+
30
+ * **Repository:** [ServiceNow/Llama-3.2-11B-Vision-Instruct-StarFlow](https://huggingface.co/ServiceNow/Llama-3.2-11B-Vision-Instruct-StarFlow)
31
+ * **Paper:** [StarFlow: Generating Structured Workflow Outputs From Sketch Images](https://arxiv.org/abs/2503.21889);
32
+
33
+ ---
34
+
35
+ ## Uses
36
+
37
+ ### Direct Use
38
+
39
+ * Translating **sketches of workflows** (hand-drawn, whiteboard, or digital diagrams) into **JSON structured workflows**.
40
+ * Supporting **workflow automation** in enterprise platforms by removing the need for manual low-code configuration.
41
+
42
+ ### Downstream Use
43
+
44
+ * Integration into **low-code platforms** (e.g., ServiceNow Flow Designer) for rapid prototyping of workflows.
45
+ * Used in **automation migration pipelines**, e.g., converting legacy workflow screenshots into JSON representations.
46
+
47
+ ### Out-of-Scope Use
48
+
49
+ * General-purpose vision-language tasks (e.g., image captioning, OCR).
50
+ * Use on domains outside workflow automation (e.g., arbitrary diagram-to-code).
51
+ * Real-time handwriting recognition (StarFlow focuses on structured workflow translation, not raw OCR).
52
+
53
+ ---
54
+
55
+ ## Bias, Risks, and Limitations
56
+
57
+ * **Limited generalization**: Finetuned models perform poorly on out-of-distribution diagrams from unfamiliar platforms.
58
+ * **Sensitivity to input style**: Whiteboard/handwritten sketches degrade performance compared to digital or UI-rendered workflows.
59
+ * **Component naming mismatches**: Model may mispredict action definitions (e.g., “create\_user” vs. “create\_a\_user”), leading to execution errors.
60
+ * **Evaluation gap**: Current metrics don’t always reflect execution correctness of generated workflows.
61
+
62
+ ### Recommendations
63
+
64
+ Users should:
65
+
66
+ * Validate outputs before deployment.
67
+ * Be cautious with **handwritten/ambiguous sketches**.
68
+ * Consider supplementing with **retrieval-augmented generation (RAG)** or **tool grounding** for robustness.
69
+
70
+ ---
71
+
72
+ ## How to Get Started with the Model
73
+
74
+ ```python
75
+ from transformers import AutoProcessor, AutoModelForVision2Seq
76
+ from PIL import Image
77
+
78
+ processor = AutoProcessor.from_pretrained("ServiceNow/Llama-3.2-11B-Vision-Instruct-StarFlow")
79
+ model = AutoModelForVision2Seq.from_pretrained("ServiceNow/Llama-3.2-11B-Vision-Instruct-StarFlow")
80
+
81
+ image = Image.open("workflow_sketch.png")
82
+ inputs = processor(images=image, text="Generate workflow JSON", return_tensors="pt")
83
+
84
+ outputs = model.generate(**inputs, max_new_tokens=4096)
85
+ workflow_json = processor.decode(outputs[0], skip_special_tokens=True)
86
+
87
+ print(workflow_json)
88
+ ```
89
+
90
+ ---
91
+
92
+ ## Training Details
93
+
94
+ ### Training Data
95
+
96
+ The model was trained using the [ServiceNow/BigDocs-Sketch2Flow](https://huggingface.co/datasets/ServiceNow/BigDocs-Sketch2Flow) dataset, which includes the following data distribution:
97
+
98
+ * **Synthetic** (12,376 Graphviz-generated diagrams)
99
+ * **Manual** (3,035 sketches hand-drawn by annotators)
100
+ * **Digital** (2,613 diagrams drawn using software)
101
+ * **Whiteboard** (484 sketches drawn on whiteboard / blackboard)
102
+ * **User Interface** (373 screenshots from ServiceNow Flow Designer)
103
+
104
+ ### Training Procedure
105
+
106
+ #### Preprocessing
107
+
108
+ * Synthetic workflows generated via **heuristics** (Scheduled Loop, IF/ELSE, FOREACH, etc.).
109
+ * Annotators recreated flows in digital, manual, and whiteboard formats.
110
+
111
+ #### Training Hyperparameters
112
+
113
+ * Optimizer: **AdamW** with β=(0.95,0.999), lr=2e-5, weight decay=1e-6.
114
+ * Scheduler: **cosine learning rate** with 30 warmup steps.
115
+ * Early stopping based on validation loss.
116
+ * Precision: **bf16 mixed-precision**.
117
+ * Sequence length: up to **32k tokens**.
118
+
119
+ #### Speeds, Sizes, Times
120
+
121
+ * Trained with **16× NVIDIA H100 80GB GPUs** across two nodes.
122
+ * Full Sharded Data Parallel (FSDP) training, no CPU offloading.
123
+
124
+ ---
125
+
126
+ ## Evaluation
127
+
128
+ ### Testing Data
129
+
130
+ Same dataset distribution as training: synthetic, manual, digital, whiteboard, UI-rendered workflows.
131
+
132
+ ### Factors
133
+
134
+ * **Source of sample** (synthetic, manual, UI, etc.)
135
+ * **Orientation** (portrait vs. landscape diagrams)
136
+ * **Resolution** (small <400k pixels, medium, large >1M pixels)
137
+
138
+ ### Metrics
139
+
140
+ All Evaluation metrics can be found in the official [StarFlow repo](https://github.com/ServiceNow/StarFlow).
141
+
142
+ * **Flow Similarity (FlowSim)** – tree edit distance similarity.
143
+ * **TreeBLEU** – structural recall of subtrees.
144
+ * **Trigger Match (TM)** – accuracy of workflow triggers.
145
+ * **Component Match (CM)** – overlap of predicted vs. gold components.
146
+
147
+ ### Results
148
+
149
+ * Proprietary models (GPT-4o, Claude-3.7, Gemini 2.0) outperform open-weights **without finetuning**.
150
+ * **Finetuned Pixtral-12B achieves SOTA**:
151
+
152
+ * FlowSim w/ inputs: **0.919**
153
+ * TreeBLEU w/ inputs: **0.950**
154
+ * Trigger Match: **0.753**
155
+ * Component Match: **0.930**
156
+
157
+ #### Summary
158
+
159
+ Finetuning yields **large gains over base Pixtral-12B and GPT-4o**, particularly in matching workflow components and triggers.
160
+
161
+ ## Model Examination
162
+
163
+ * Finetuned models capture **naming conventions** and structured execution logic better.
164
+ * Failure modes include **missing ELSE branches** or **generic table names**.
165
+
166
+ ---
167
+
168
+ ## Technical Specifications
169
+
170
+ ### Model Architecture and Objective
171
+
172
+ * Base: **Llama-3.2-11B Vision Instruct**, a multimodal LLM with 11 B parameters, optimized for image reasoning and instruction-following tasks.
173
+ * Objective: **Image-to-JSON structured workflow generation**.
174
+
175
+ ### Compute Infrastructure
176
+
177
+ * **Hardware:** 16× NVIDIA H100 80GB (2 nodes)
178
+ * **Software:** FSDP, bf16 mixed precision, PyTorch/Transformers
179
+
180
+ ---
181
+
182
+ ## Citation
183
+
184
+ **BibTeX:**
185
+
186
+ ```bibtex
187
+ @article{bechard2025starflow,
188
+ title={StarFlow: Generating Structured Workflow Outputs from Sketch Images},
189
+ author={B{\'e}chard, Patrice and Wang, Chao and Abaskohi, Amirhossein and Rodriguez, Juan and Pal, Christopher and Vazquez, David and Gella, Spandana and Rajeswar, Sai and Taslakian, Perouz},
190
+ journal={arXiv preprint arXiv:2503.21889},
191
+ year={2025}
192
+ }
193
+ ```
194
+
195
+ **APA:**
196
+ Béchard, P., Wang, C., Abaskohi, A., Rodriguez, J., Pal, C., Vazquez, D., Gella, S., Rajeswar, S., & Taslakian, P. (2025). **StarFlow: Generating Structured Workflow Outputs from Sketch Images**. *arXiv preprint arXiv:2503.21889*.
197
+
198
+ ---
199
+
200
+ ## Glossary
201
+
202
+ * **FlowSim**: Metric based on tree edit distance for workflows.
203
+ * **TreeBLEU**: BLEU-like score using tree structures.
204
+ * **Trigger Match**: Correctness of predicted workflow trigger.
205
+ * **Component Match**: Correctness of predicted components (order-agnostic).
206
+
207
+ ---
208
+
209
+ ## More Information
210
+
211
+ * [ServiceNow Flow Designer](https://www.servicenow.com/products/platform-flow-designer.html)
212
+ * [StarFlow Blog](https://www.servicenow.com/blogs/2025/starflow-ai-turns-sketches-into-workflows)
213
+
214
+ ---
215
+
216
+ ## The StarFlow Team
217
+
218
+ * Patrice Béchard, Chao Wang, Amirhossein Abaskohi, Juan Rodriguez, Christopher Pal, David Vazquez, Spandana Gella, Sai Rajeswar, Perouz Taslakian
219
+
220
+ ---
221
+
222
+ ## Model Card Contact
223
+
224
+ * Patrice Bechard - [[email protected]](mailto:[email protected])
225
+ * ServiceNow Research – [research.servicenow.com](https://research.servicenow.com)