tugrulkaya commited on
Commit
a98d832
ยท
verified ยท
1 Parent(s): 72fb090

Upload 3 files

Browse files
Files changed (3) hide show
  1. README.md +85 -8
  2. app.py +853 -0
  3. requirements.txt +1 -0
README.md CHANGED
@@ -1,14 +1,91 @@
1
  ---
2
- title: Audio Reasoning Explorer
3
- emoji: ๐Ÿ‘€
4
- colorFrom: blue
5
- colorTo: gray
6
  sdk: gradio
7
- sdk_version: 6.0.0
8
  app_file: app.py
9
  pinned: false
10
- license: mit
11
- short_description: Interactive Hugging Face Space for exploring audio reasoning
 
 
 
 
 
 
 
 
12
  ---
13
 
14
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
  ---
2
+ title: Audio Reasoning & Step-Audio-R1 Explorer
3
+ emoji: ๐ŸŽง
4
+ colorFrom: purple
5
+ colorTo: blue
6
  sdk: gradio
7
+ sdk_version: 4.44.0
8
  app_file: app.py
9
  pinned: false
10
+ license: cc-by-4.0
11
+ short_description: Interactive guide to audio reasoning and Step-Audio-R1 model
12
+ tags:
13
+ - audio
14
+ - reasoning
15
+ - multimodal
16
+ - step-audio-r1
17
+ - LALM
18
+ - chain-of-thought
19
+ - education
20
  ---
21
 
22
+ # ๐ŸŽง Audio Reasoning & Step-Audio-R1 Explorer
23
+
24
+ An interactive educational space exploring the groundbreaking concepts behind **audio reasoning** and the **Step-Audio-R1** model.
25
+
26
+ ## ๐ŸŽฏ What is Audio Reasoning?
27
+
28
+ Audio reasoning is an AI model's ability to perform **deliberate, multi-step thinking processes** over audio inputs. This goes far beyond simple speech recognition (ASR) or audio classification.
29
+
30
+ **Step-Audio-R1** is the first model to successfully unlock reasoning capabilities in the audio domain, solving the "inverted scaling anomaly" that plagued previous audio language models.
31
+
32
+ ## ๐Ÿš€ Features of This Space
33
+
34
+ | Tab | Content |
35
+ |-----|---------|
36
+ | ๐Ÿ  **Introduction** | Overview of audio reasoning and key achievements |
37
+ | ๐Ÿง  **Reasoning Types** | Interactive explorer for 5 types of audio reasoning |
38
+ | ๐Ÿšซ **The Problem** | Understanding the inverted scaling anomaly |
39
+ | ๐Ÿ”ฌ **MGRD Solution** | How Modality-Grounded Reasoning Distillation works |
40
+ | ๐Ÿ—๏ธ **Architecture** | Step-Audio-R1 model architecture breakdown |
41
+ | ๐Ÿ“Š **Benchmarks** | Performance comparisons and results |
42
+ | ๐ŸŽฎ **Interactive Demo** | Simulated audio reasoning examples |
43
+ | ๐Ÿš€ **Applications** | Real-world use cases |
44
+ | ๐Ÿ“š **Resources** | Papers, code, and references |
45
+
46
+ ## ๐Ÿ”ฌ Key Innovation: MGRD
47
+
48
+ **Modality-Grounded Reasoning Distillation (MGRD)** is the core innovation that makes Step-Audio-R1 work:
49
+
50
+ ```
51
+ Text-based reasoning โ†’ Filter textual surrogates โ†’ Keep acoustic-grounded chains โ†’ Native Audio Think
52
+ ```
53
+
54
+ This iterative process teaches the model to reason over **actual acoustic features** instead of text transcripts.
55
+
56
+ ## ๐Ÿ“Š Performance
57
+
58
+ Step-Audio-R1 achieves:
59
+ - โœ… **Surpasses Gemini 2.5 Pro** on comprehensive audio benchmarks
60
+ - โœ… **Comparable to Gemini 3 Pro** (state-of-the-art)
61
+ - โœ… **First successful test-time compute scaling** for audio
62
+
63
+ ## ๐Ÿ“š Resources
64
+
65
+ - ๐Ÿ“„ [Step-Audio-R1 Paper](https://arxiv.org/abs/2511.15848)
66
+ - ๐Ÿ’ป [GitHub Repository](https://github.com/stepfun-ai/Step-Audio-R1)
67
+ - ๐Ÿค— [HuggingFace Collection](https://huggingface.co/collections/stepfun-ai/step-audio-r1)
68
+ - ๐ŸŽฏ [Official Demo](https://stepaudiollm.github.io/step-audio-r1/)
69
+
70
+ ## ๐Ÿ‘ค Author
71
+
72
+ **Mehmet TuฤŸrul Kaya**
73
+ - ๐Ÿ™ GitHub: [@mtkaya](https://github.com/mtkaya)
74
+ - ๐Ÿค— HuggingFace: [tugrulkaya](https://huggingface.co/tugrulkaya)
75
+
76
+ ## ๐Ÿ“ Citation
77
+
78
+ ```bibtex
79
+ @article{stepaudioR1,
80
+ title={Step-Audio-R1 Technical Report},
81
+ author={Tian, Fei and others},
82
+ journal={arXiv preprint arXiv:2511.15848},
83
+ year={2025}
84
+ }
85
+ ```
86
+
87
+ ---
88
+
89
+ <p align="center">
90
+ <b>๐ŸŽง Sound Speaks, AI Listens and Thinks ๐Ÿง </b>
91
+ </p>
app.py ADDED
@@ -0,0 +1,853 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ ๐ŸŽง Audio Reasoning & Step-Audio-R1 Explorer
3
+ Interactive Hugging Face Space for exploring audio reasoning concepts
4
+
5
+ Author: Mehmet TuฤŸrul Kaya
6
+ """
7
+
8
+ import gradio as gr
9
+
10
+ # ============================================
11
+ # CONTENT DATA
12
+ # ============================================
13
+
14
+ INTRO_CONTENT = """
15
+ # ๐ŸŽง Audio Reasoning & Step-Audio-R1
16
+
17
+ ## Teaching AI to Think About Sound
18
+
19
+ **Step-Audio-R1** is the first audio language model to successfully unlock reasoning capabilities in the audio domain.
20
+ This space explores the groundbreaking concepts behind audio reasoning and the innovative MGRD framework.
21
+
22
+ ### ๐ŸŽฏ Key Achievement
23
+ > *"Can audio intelligence truly benefit from deliberate thinking?"* โ€” **YES!**
24
+
25
+ Step-Audio-R1 proves that reasoning is a **transferable capability across modalities** when properly grounded in acoustic features.
26
+
27
+ ---
28
+
29
+ ### ๐Ÿ“Š Quick Stats
30
+
31
+ | Metric | Value |
32
+ |--------|-------|
33
+ | **Model Size** | 32B parameters (Qwen2.5 LLM) |
34
+ | **Audio Encoder** | Qwen2 (25 Hz, frozen) |
35
+ | **Performance** | Surpasses Gemini 2.5 Pro |
36
+ | **Innovation** | First successful audio reasoning model |
37
+
38
+ ---
39
+
40
+ *Navigate through the tabs to explore different aspects of audio reasoning!*
41
+ """
42
+
43
+ # Audio Reasoning Types Data
44
+ REASONING_TYPES = {
45
+ "Factual Reasoning": {
46
+ "emoji": "๐Ÿ“‹",
47
+ "description": "Extracting concrete information from audio",
48
+ "example_question": "What date is mentioned in this conversation?",
49
+ "example_audio": "A business call discussing a meeting scheduled for March 15th",
50
+ "what_model_does": "Identifies specific facts, numbers, names, dates from speech content",
51
+ "challenge": "Requires accurate speech recognition + information extraction"
52
+ },
53
+ "Procedural Reasoning": {
54
+ "emoji": "๐Ÿ“",
55
+ "description": "Understanding step-by-step processes and sequences",
56
+ "example_question": "What is the third step in this instruction set?",
57
+ "example_audio": "A cooking tutorial explaining how to make pasta",
58
+ "what_model_does": "Tracks sequential information, understands ordering and dependencies",
59
+ "challenge": "Must maintain context across long audio segments"
60
+ },
61
+ "Normative Reasoning": {
62
+ "emoji": "โš–๏ธ",
63
+ "description": "Evaluating social, ethical, or behavioral norms",
64
+ "example_question": "Is the speaker behaving appropriately in this dialogue?",
65
+ "example_audio": "A customer service call with an upset customer",
66
+ "what_model_does": "Assesses tone, politeness, social appropriateness based on context",
67
+ "challenge": "Requires understanding of social norms + prosodic analysis"
68
+ },
69
+ "Contextual Reasoning": {
70
+ "emoji": "๐ŸŒ",
71
+ "description": "Inferring environmental and situational context",
72
+ "example_question": "Where might this sound have been recorded?",
73
+ "example_audio": "Background noise with birds, wind, and distant traffic",
74
+ "what_model_does": "Analyzes ambient sounds to determine location/situation",
75
+ "challenge": "Must process non-speech audio elements"
76
+ },
77
+ "Causal Reasoning": {
78
+ "emoji": "๐Ÿ”—",
79
+ "description": "Establishing cause-effect relationships",
80
+ "example_question": "Why might this sound event have occurred?",
81
+ "example_audio": "A loud crash followed by glass breaking",
82
+ "what_model_does": "Infers causality from sound sequences and patterns",
83
+ "challenge": "Requires world knowledge + temporal understanding"
84
+ }
85
+ }
86
+
87
+ # The Problem Content
88
+ PROBLEM_CONTENT = """
89
+ ## ๐Ÿšซ The Inverted Scaling Anomaly
90
+
91
+ ### The Paradox
92
+ Traditional audio language models showed a **strange behavior**: they performed **WORSE** when reasoning longer!
93
+
94
+ This is the opposite of what happens in text models (like GPT-4, Claude) where more thinking = better answers.
95
+
96
+ ### Root Cause: Textual Surrogate Reasoning
97
+
98
+ ```
99
+ ๐Ÿ”Š Audio Input
100
+ โ†“
101
+ ๐Ÿ“ Model converts to text (transcript)
102
+ โ†“
103
+ ๐Ÿง  Reasons over TEXT, not SOUND
104
+ โ†“
105
+ โŒ Acoustic features IGNORED
106
+ โ†“
107
+ ๐Ÿ’€ Performance degrades with longer reasoning
108
+ ```
109
+
110
+ ### Why Does This Happen?
111
+
112
+ 1. **Text-based initialization**: Models are fine-tuned from text LLMs
113
+ 2. **Inherited patterns**: They learn to reason like text models
114
+ 3. **Modality mismatch**: Audio is treated as "text with extra steps"
115
+ 4. **Lost information**: Tone, emotion, prosody, ambient sounds are ignored
116
+
117
+ ### Real Example
118
+
119
+ **Audio**: Person says "Sure, I'll do it" in a *sarcastic, annoyed tone*
120
+
121
+ | Approach | Interpretation |
122
+ |----------|---------------|
123
+ | **Textual Surrogate** โŒ | "Person agrees to do the task" |
124
+ | **Acoustic-Grounded** โœ… | "Person is reluctant/annoyed, may not follow through" |
125
+
126
+ The acoustic-grounded approach captures the TRUE meaning!
127
+ """
128
+
129
+ # MGRD Content
130
+ MGRD_CONTENT = """
131
+ ## ๐Ÿ”ฌ MGRD: Modality-Grounded Reasoning Distillation
132
+
133
+ MGRD is the **key innovation** that makes Step-Audio-R1 work. It's an iterative training framework that teaches the model to reason over actual acoustic features instead of text surrogates.
134
+
135
+ ### The MGRD Pipeline
136
+
137
+ ```
138
+ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
139
+ โ”‚ MGRD ITERATIVE PROCESS โ”‚
140
+ โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
141
+ โ”‚ โ”‚
142
+ โ”‚ START: Text-based reasoning (inherited) โ”‚
143
+ โ”‚ โ†“ โ”‚
144
+ โ”‚ ITERATION 1: Generate reasoning chains โ”‚
145
+ โ”‚ โ†“ โ”‚
146
+ โ”‚ FILTER: Remove textual surrogate chains โ”‚
147
+ โ”‚ โ†“ โ”‚
148
+ โ”‚ SELECT: Keep acoustically-grounded chains โ”‚
149
+ โ”‚ โ†“ โ”‚
150
+ โ”‚ RETRAIN: Update model with filtered data โ”‚
151
+ โ”‚ โ†“ โ”‚
152
+ โ”‚ REPEAT until "Native Audio Think" emerges โ”‚
153
+ โ”‚ โ†“ โ”‚
154
+ โ”‚ RESULT: Model reasons over acoustic features! โ”‚
155
+ โ”‚ โ”‚
156
+ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
157
+ ```
158
+
159
+ ### Three Training Stages
160
+
161
+ | Stage | Name | What Happens |
162
+ |-------|------|--------------|
163
+ | **1** | Cold-Start | SFT + RLVR to establish basic audio understanding |
164
+ | **2** | Iterative Distillation | Filter and refine reasoning chains |
165
+ | **3** | Native Audio Think | Model develops true acoustic reasoning |
166
+
167
+ ### What Makes a "Good" Reasoning Chain?
168
+
169
+ **โŒ Bad (Textual Surrogate):**
170
+ > "The speaker says 'I'm fine' so they must be feeling okay."
171
+
172
+ **โœ… Good (Acoustically-Grounded):**
173
+ > "The speaker's voice shows elevated pitch (+15%), faster tempo, and slight tremor, indicating stress despite saying 'I'm fine'. The background noise suggests a busy environment which may be contributing to their tension."
174
+
175
+ The good chain references **actual acoustic features**!
176
+ """
177
+
178
+ # Architecture Content
179
+ ARCHITECTURE_CONTENT = """
180
+ ## ๐Ÿ—๏ธ Step-Audio-R1 Architecture
181
+
182
+ Step-Audio-R1 builds on Step-Audio 2 with three main components:
183
+
184
+ ```
185
+ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
186
+ โ”‚ STEP-AUDIO-R1 ARCHITECTURE โ”‚
187
+ โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
188
+ โ”‚ โ”‚
189
+ โ”‚ ๐ŸŽค AUDIO INPUT (waveform) โ”‚
190
+ โ”‚ โ”‚ โ”‚
191
+ โ”‚ โ–ผ โ”‚
192
+ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚
193
+ โ”‚ โ”‚ AUDIO ENCODER โ”‚ โ”‚
194
+ โ”‚ โ”‚ โ€ข Qwen2 Audio Encoder โ”‚ โ”‚
195
+ โ”‚ โ”‚ โ€ข 25 Hz frame rate โ”‚ โ”‚
196
+ โ”‚ โ”‚ โ€ข FROZEN during training โ”‚ โ”‚
197
+ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚
198
+ โ”‚ โ”‚ โ”‚
199
+ โ”‚ โ–ผ โ”‚
200
+ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚
201
+ โ”‚ โ”‚ AUDIO ADAPTOR โ”‚ โ”‚
202
+ โ”‚ โ”‚ โ€ข 2x downsampling โ”‚ โ”‚
203
+ โ”‚ โ”‚ โ€ข 12.5 Hz output โ”‚ โ”‚
204
+ โ”‚ โ”‚ โ€ข Bridge to LLM โ”‚ โ”‚
205
+ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚
206
+ โ”‚ โ”‚ โ”‚
207
+ โ”‚ โ–ผ โ”‚
208
+ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚
209
+ โ”‚ โ”‚ LLM DECODER โ”‚ โ”‚
210
+ โ”‚ โ”‚ โ€ข Qwen2.5 32B โ”‚ โ”‚
211
+ โ”‚ โ”‚ โ€ข Core reasoning engine โ”‚ โ”‚
212
+ โ”‚ โ”‚ โ€ข Outputs: Think โ†’ Response โ”‚ โ”‚
213
+ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€๏ฟฝ๏ฟฝโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚
214
+ โ”‚ โ”‚ โ”‚
215
+ โ”‚ โ–ผ โ”‚
216
+ โ”‚ ๐Ÿ“ TEXT OUTPUT โ”‚
217
+ โ”‚ <thinking>...</thinking> โ”‚
218
+ โ”‚ <response>...</response> โ”‚
219
+ โ”‚ โ”‚
220
+ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
221
+ ```
222
+
223
+ ### Component Details
224
+
225
+ | Component | Model | Frame Rate | Status |
226
+ |-----------|-------|------------|--------|
227
+ | Audio Encoder | Qwen2 Audio | 25 Hz | Frozen |
228
+ | Audio Adaptor | Custom MLP | 12.5 Hz (2x down) | Trainable |
229
+ | LLM Decoder | Qwen2.5 32B | N/A | Trainable |
230
+
231
+ ### Output Format
232
+
233
+ The model produces structured reasoning:
234
+
235
+ ```xml
236
+ <thinking>
237
+ 1. Acoustic Analysis: [describes sound properties]
238
+ 2. Pattern Recognition: [identifies key features]
239
+ 3. Inference: [draws conclusions from audio]
240
+ </thinking>
241
+
242
+ <response>
243
+ [Final answer based on acoustic reasoning]
244
+ </response>
245
+ ```
246
+ """
247
+
248
+ # Benchmarks Data
249
+ BENCHMARK_DATA = """
250
+ ## ๐Ÿ“Š Benchmark Results
251
+
252
+ Step-Audio-R1 was evaluated on comprehensive audio understanding benchmarks:
253
+
254
+ ### MMAU (Massive Multi-Task Audio Understanding)
255
+ - **10,000** audio clips with human-annotated Q&A
256
+ - **27** distinct skills tested
257
+ - Covers: Speech, Environmental Sounds, Music
258
+
259
+ ### Performance Comparison
260
+
261
+ | Model | MMAU Avg | vs Gemini 2.5 Pro |
262
+ |-------|----------|-------------------|
263
+ | **Step-Audio-R1** | **~78%** | **+12%** โœ… |
264
+ | Gemini 3 Pro | ~77% | +11% |
265
+ | Gemini 2.5 Pro | ~66% | baseline |
266
+ | GPT-4o Audio | ~55% | -11% |
267
+ | Qwen2.5-Omni | ~52% | -14% |
268
+
269
+ ### The Breakthrough: Test-Time Compute Scaling
270
+
271
+ ```
272
+ BEFORE Step-Audio-R1:
273
+ More thinking โ†’ โŒ Worse performance (inverted scaling)
274
+
275
+ AFTER Step-Audio-R1:
276
+ More thinking โ†’ โœ… Better performance (normal scaling)
277
+ ```
278
+
279
+ **This is the first time test-time compute scaling works for audio!**
280
+
281
+ ### Domain Performance
282
+
283
+ | Domain | Step-Audio-R1 | Previous SOTA |
284
+ |--------|---------------|---------------|
285
+ | Speech | ๐ŸŸข High | Medium |
286
+ | Sound | ๐ŸŸข High | Medium |
287
+ | Music | ๐ŸŸข High | Low |
288
+ """
289
+
290
+ # Applications Content
291
+ APPLICATIONS_CONTENT = """
292
+ ## ๐Ÿš€ Practical Applications
293
+
294
+ Audio reasoning enables many new AI capabilities:
295
+
296
+ ### 1. ๐ŸŽ™๏ธ Advanced Voice Assistants
297
+ - Understand complex multi-step instructions
298
+ - Detect user emotion and adjust responses
299
+ - Handle ambiguous requests intelligently
300
+
301
+ ### 2. ๐Ÿ“ž Call Center Analytics
302
+ - Analyze customer sentiment in real-time
303
+ - Detect escalation patterns before they happen
304
+ - Extract action items from conversations
305
+
306
+ ### 3. โ™ฟ Accessibility Tools
307
+ - Rich audio descriptions for hearing impaired
308
+ - Environmental sound narration
309
+ - Music content analysis and description
310
+
311
+ ### 4. ๐Ÿ”’ Security & Monitoring
312
+ - Anomalous sound event detection
313
+ - Contextual threat assessment
314
+ - Multi-source audio analysis
315
+
316
+ ### 5. ๐ŸŽ“ Education & Learning
317
+ - Pronunciation analysis for language learning
318
+ - Music performance evaluation
319
+ - Lecture comprehension and Q&A
320
+
321
+ ### Example: Meeting Analysis
322
+
323
+ ```
324
+ ๐Ÿ“ฅ Input: [30-minute team meeting recording]
325
+
326
+ ๐Ÿค” Step-Audio-R1 Analysis:
327
+
328
+ <thinking>
329
+ 1. Speaker identification: 4 distinct voices detected
330
+ 2. Topic tracking: Budget discussion (0-10min),
331
+ Project timeline (10-20min), Action items (20-30min)
332
+ 3. Sentiment analysis:
333
+ - Speaker A: Confident, leading discussion
334
+ - Speaker B: Concerned (elevated pitch during budget section)
335
+ - Speaker C: Disengaged (low energy, minimal contributions)
336
+ - Speaker D: Supportive, mediating tensions
337
+ 4. Key moments: Tension spike at 8:42 (disagreement on budget)
338
+ </thinking>
339
+
340
+ <response>
341
+ Meeting Summary:
342
+ - Main topics: Q3 budget allocation, Project Alpha timeline
343
+ - Key decision: Budget approved with 10% reduction
344
+ - Action items: 3 identified (assigned to Speakers A, B, D)
345
+ - Team dynamics: Some tension around budget, resolved by end
346
+ - Follow-up recommended: 1-on-1 with Speaker C (low engagement)
347
+ </response>
348
+ ```
349
+ """
350
+
351
+ # Resources Content
352
+ RESOURCES_CONTENT = """
353
+ ## ๐Ÿ“š Resources & Links
354
+
355
+ ### ๐Ÿ“„ Papers
356
+ | Paper | Link |
357
+ |-------|------|
358
+ | Step-Audio-R1 Technical Report | [arXiv:2511.15848](https://arxiv.org/abs/2511.15848) |
359
+ | MMAU Benchmark | [arXiv:2410.19168](https://arxiv.org/abs/2410.19168) |
360
+ | Audio-Reasoner | [arXiv:2503.02318](https://arxiv.org/abs/2503.02318) |
361
+ | SpeechR Benchmark | [arXiv:2508.02018](https://arxiv.org/abs/2508.02018) |
362
+
363
+ ### ๐Ÿ’ป Code & Models
364
+ | Resource | Link |
365
+ |----------|------|
366
+ | Step-Audio-R1 GitHub | [github.com/stepfun-ai/Step-Audio-R1](https://github.com/stepfun-ai/Step-Audio-R1) |
367
+ | Step-Audio-R1 Demo | [stepaudiollm.github.io/step-audio-r1](https://stepaudiollm.github.io/step-audio-r1/) |
368
+ | HuggingFace Collection | [huggingface.co/collections/stepfun-ai/step-audio-r1](https://huggingface.co/collections/stepfun-ai/step-audio-r1) |
369
+ | AudioBench | [github.com/AudioLLMs/AudioBench](https://github.com/AudioLLMs/AudioBench) |
370
+
371
+ ### ๐Ÿ“– Key Concepts Glossary
372
+
373
+ | Term | Full Name | Description |
374
+ |------|-----------|-------------|
375
+ | **LALM** | Large Audio Language Model | AI model that understands and reasons over audio |
376
+ | **CoT** | Chain-of-Thought | Step-by-step reasoning approach |
377
+ | **MGRD** | Modality-Grounded Reasoning Distillation | Training framework for acoustic reasoning |
378
+ | **TSR** | Textual Surrogate Reasoning | Problem where model reasons over text instead of audio |
379
+ | **RLVR** | Reinforcement Learning with Verified Rewards | Training with binary correctness rewards |
380
+ | **SFT** | Supervised Fine-Tuning | Standard fine-tuning on labeled data |
381
+
382
+ ### ๐Ÿ“ Citation
383
+
384
+ ```bibtex
385
+ @article{stepaudioR1,
386
+ title={Step-Audio-R1 Technical Report},
387
+ author={Tian, Fei and others},
388
+ journal={arXiv preprint arXiv:2511.15848},
389
+ year={2025}
390
+ }
391
+ ```
392
+
393
+ ---
394
+
395
+ ### ๐Ÿ‘ค About This Space
396
+
397
+ Created by **Mehmet TuฤŸrul Kaya**
398
+ - ๐Ÿ™ GitHub: [@mtkaya](https://github.com/mtkaya)
399
+ - ๐Ÿค— HuggingFace: [tugrulkaya](https://huggingface.co/tugrulkaya)
400
+
401
+ *This educational space explores the concepts behind Step-Audio-R1 and audio reasoning.*
402
+ """
403
+
404
+ # ============================================
405
+ # HELPER FUNCTIONS
406
+ # ============================================
407
+
408
+ def get_reasoning_type_info(reasoning_type):
409
+ """Get detailed information about a reasoning type"""
410
+ if reasoning_type not in REASONING_TYPES:
411
+ return "Please select a reasoning type"
412
+
413
+ info = REASONING_TYPES[reasoning_type]
414
+
415
+ output = f"""
416
+ ## {info['emoji']} {reasoning_type}
417
+
418
+ ### Description
419
+ {info['description']}
420
+
421
+ ### Example Question
422
+ > *"{info['example_question']}"*
423
+
424
+ ### Example Audio Scenario
425
+ ๐ŸŽง {info['example_audio']}
426
+
427
+ ### What the Model Does
428
+ {info['what_model_does']}
429
+
430
+ ### Key Challenge
431
+ โš ๏ธ {info['challenge']}
432
+
433
+ ---
434
+
435
+ ### How Step-Audio-R1 Handles This
436
+
437
+ Unlike traditional models that would convert this to text first, Step-Audio-R1:
438
+
439
+ 1. **Analyzes acoustic features** directly from the audio waveform
440
+ 2. **Generates reasoning chains** grounded in sound properties
441
+ 3. **Produces answers** that account for non-verbal information
442
+
443
+ This is what makes it the first true **audio reasoning** model!
444
+ """
445
+ return output
446
+
447
+
448
+ def create_comparison_chart():
449
+ """Create model comparison data"""
450
+ return """
451
+ ### ๐Ÿ“Š Model Comparison on MMAU Benchmark
452
+
453
+ | Rank | Model | Score | Type |
454
+ |------|-------|-------|------|
455
+ | ๐Ÿฅ‡ | **Step-Audio-R1** | ~78% | Open |
456
+ | ๐Ÿฅˆ | Gemini 3 Pro | ~77% | Proprietary |
457
+ | ๐Ÿฅ‰ | Gemini 2.5 Pro | ~66% | Proprietary |
458
+ | 4 | Audio Flamingo 3 | ~60% | Open |
459
+ | 5 | GPT-4o Audio | ~55% | Proprietary |
460
+ | 6 | Qwen2.5-Omni | ~52% | Open |
461
+
462
+ **Key Insight**: Step-Audio-R1 is the first **open** model to match proprietary SOTA!
463
+ """
464
+
465
+
466
+ def generate_demo_reasoning(scenario):
467
+ """Generate example reasoning for demo scenarios"""
468
+ scenarios = {
469
+ "Customer Service Call": """
470
+ ## ๐ŸŽง Demo: Customer Service Call Analysis
471
+
472
+ **Scenario**: A customer calls about a billing issue
473
+
474
+ ### Simulated Audio Description
475
+ - Customer voice: Female, middle-aged, American English
476
+ - Tone: Initially frustrated, becomes calmer
477
+ - Agent voice: Male, professional, empathetic tone
478
+ - Background: Quiet call center environment
479
+ - Duration: ~3 minutes
480
+
481
+ ### Step-Audio-R1 Reasoning (Simulated)
482
+
483
+ ```xml
484
+ <thinking>
485
+ 1. SPEAKER ANALYSIS:
486
+ - Customer: F0 mean 220Hz (elevated), speech rate 4.2 syl/sec (fast)
487
+ - Initial segment shows tension markers: pitch variation +40%
488
+ - Gradual calming: pitch stabilizes by minute 2
489
+
490
+ 2. AGENT ANALYSIS:
491
+ - Consistent calm prosody throughout
492
+ - Strategic pauses after customer complaints
493
+ - Mirroring technique detected (matching customer's pace)
494
+
495
+ 3. EMOTIONAL ARC:
496
+ - 0:00-0:45: Customer frustrated (anger markers)
497
+ - 0:45-1:30: Tension peak, interruption detected
498
+ - 1:30-2:30: De-escalation successful
499
+ - 2:30-3:00: Resolution, positive closing
500
+
501
+ 4. ACOUSTIC CONTEXT:
502
+ - Low background noise suggests professional environment
503
+ - No hold music interruptions
504
+ - Clear audio quality on both sides
505
+ </thinking>
506
+
507
+ <response>
508
+ This customer service interaction shows successful de-escalation.
509
+ The customer initially displayed frustration (elevated pitch, fast
510
+ speech) but the agent's calm, empathetic approach led to resolution.
511
+ Key success factor: Agent's strategic use of pauses and mirroring.
512
+ Customer satisfaction likely: HIGH (based on closing tone).
513
+ </response>
514
+ ```
515
+ """,
516
+ "Meeting Recording": """
517
+ ## ๐ŸŽง Demo: Meeting Recording Analysis
518
+
519
+ **Scenario**: Team standup meeting (15 minutes)
520
+
521
+ ### Simulated Audio Description
522
+ - 4 speakers in conference room
523
+ - Mixed engagement levels
524
+ - Some background office noise
525
+ - One speaker joins via video call
526
+
527
+ ### Step-Audio-R1 Reasoning (Simulated)
528
+
529
+ ```xml
530
+ <thinking>
531
+ 1. SPEAKER IDENTIFICATION:
532
+ - Speaker A (Lead): Clear, confident, 65% of speaking time
533
+ - Speaker B (Remote): Slight compression artifacts, engaged
534
+ - Speaker C (Junior): Hesitant prosody, brief contributions
535
+ - Speaker D (Senior): Minimal input, possible disengagement
536
+
537
+ 2. TOPIC TRACKING:
538
+ - 0:00-5:00: Sprint review (Speaker A leading)
539
+ - 5:00-10:00: Blockers discussion (B raises concern)
540
+ - 10:00-15:00: Planning next steps
541
+
542
+ 3. ENGAGEMENT ANALYSIS:
543
+ - Speaker C shows uncertainty (rising intonation on statements)
544
+ - Speaker D energy levels low (F0 variance minimal)
545
+ - Remote Speaker B most engaged despite audio quality
546
+
547
+ 4. ACOUSTIC ENVIRONMENT:
548
+ - Room reverb suggests medium conference room
549
+ - HVAC noise consistent throughout
550
+ - Remote connection stable (no dropouts)
551
+
552
+ 5. KEY MOMENTS:
553
+ - 6:23: Tension spike (overlapping speech, raised voices)
554
+ - 12:45: Agreement reached (synchronized acknowledgments)
555
+ </thinking>
556
+
557
+ <response>
558
+ Meeting Analysis Summary:
559
+ - Duration: 15 minutes, 4 participants
560
+ - Key discussion: Sprint blockers and resolution
561
+ - Action items identified: 3 (assigned to Speakers A, B, C)
562
+ - Team dynamics note: Speaker D showed low engagement -
563
+ recommend follow-up. Speaker C may need support (uncertainty
564
+ markers detected in their updates).
565
+ - Overall meeting effectiveness: MODERATE
566
+ </response>
567
+ ```
568
+ """,
569
+ "Podcast Episode": """
570
+ ## ๐ŸŽง Demo: Podcast Episode Analysis
571
+
572
+ **Scenario**: Tech podcast interview (45 minutes)
573
+
574
+ ### Simulated Audio Description
575
+ - Host and guest conversation
576
+ - Professional studio recording
577
+ - Music intro/outro
578
+ - Natural conversational flow
579
+
580
+ ### Step-Audio-R1 Reasoning (Simulated)
581
+
582
+ ```xml
583
+ <thinking>
584
+ 1. AUDIO QUALITY ASSESSMENT:
585
+ - Studio-quality recording (low noise floor)
586
+ - Two distinct microphones detected
587
+ - Music segments: Intro (0:00-0:30), Outro (44:30-45:00)
588
+ - No significant audio issues
589
+
590
+ 2. SPEAKER DYNAMICS:
591
+ - Host: Experienced interviewer, good pacing
592
+ - Guest: Expert on topic, occasional filler words
593
+ - Turn-taking: Natural, minimal interruptions
594
+ - Rapport: Strong (laughter at 12:34, 23:45, 38:12)
595
+
596
+ 3. CONTENT STRUCTURE:
597
+ - Introduction: 0:30-3:00
598
+ - Main discussion: 3:00-40:00
599
+ - Audience Q&A: 40:00-44:30
600
+ - Closing: 44:30-45:00
601
+
602
+ 4. ENGAGEMENT INDICATORS:
603
+ - Energy peaks: 15:00 (exciting announcement)
604
+ - Deep technical discussion: 25:00-32:00
605
+ - Emotional moment: 35:00 (personal story shared)
606
+
607
+ 5. PROSODIC ANALYSIS:
608
+ - Guest becomes more animated on specific topics
609
+ - Host maintains consistent energy throughout
610
+ - Pace accelerates during exciting segments
611
+ </thinking>
612
+
613
+ <response>
614
+ Podcast Episode Analysis:
615
+ - Format: Interview style, 45 minutes
616
+ - Quality: Professional studio recording
617
+ - Key topics: [Inferred from discussion patterns]
618
+ - Highlight moments: 15:00 (announcement), 35:00 (personal story)
619
+ - Listener engagement prediction: HIGH
620
+ - Recommended clips for promotion: 15:00-16:30, 35:00-36:45
621
+ - Overall quality: EXCELLENT (clear audio, good rapport,
622
+ well-structured conversation)
623
+ </response>
624
+ ```
625
+ """,
626
+ "Music Analysis": """
627
+ ## ๐ŸŽง Demo: Music Analysis
628
+
629
+ **Scenario**: Unknown music track analysis
630
+
631
+ ### Simulated Audio Description
632
+ - Instrumental track
633
+ - ~4 minutes duration
634
+ - Multiple instruments
635
+ - Studio production
636
+
637
+ ### Step-Audio-R1 Reasoning (Simulated)
638
+
639
+ ```xml
640
+ <thinking>
641
+ 1. ACOUSTIC ANALYSIS:
642
+ - Tempo: ~120 BPM (moderate, danceable)
643
+ - Key: A minor (melancholic but energetic)
644
+ - Time signature: 4/4
645
+ - Dynamic range: 12 dB (well-mastered)
646
+
647
+ 2. INSTRUMENTATION:
648
+ - Drums: Electronic, four-on-the-floor pattern
649
+ - Bass: Synthesizer, prominent low-end
650
+ - Lead: Analog-style synth, saw wave
651
+ - Pads: Ambient, reverb-heavy
652
+ - No vocals detected
653
+
654
+ 3. STRUCTURE:
655
+ - Intro: 0:00-0:30 (buildup)
656
+ - Verse 1: 0:30-1:30 (main groove)
657
+ - Build: 1:30-2:00 (tension)
658
+ - Drop: 2:00-2:30 (energy peak)
659
+ - Verse 2: 2:30-3:30 (variation)
660
+ - Outro: 3:30-4:00 (fadeout)
661
+
662
+ 4. PRODUCTION ANALYSIS:
663
+ - Sidechain compression detected on pads
664
+ - Stereo width: Wide (good separation)
665
+ - Reference similar to: Melodic techno genre
666
+ - Production quality: Professional
667
+
668
+ 5. EMOTIONAL CHARACTER:
669
+ - Overall mood: Driving but melancholic
670
+ - Energy arc: Building โ†’ Peak โ†’ Sustain โ†’ Release
671
+ </thinking>
672
+
673
+ <response>
674
+ Music Analysis Summary:
675
+ - Genre: Melodic Techno / Progressive House
676
+ - Tempo: 120 BPM
677
+ - Key: A minor
678
+ - Duration: ~4 minutes
679
+ - Mood: Energetic yet melancholic
680
+ - Production: Professional quality, well-mastered
681
+ - Use cases: DJ sets, workout playlists, focus music
682
+ - Similar artists: [Based on production style]
683
+ - Standout elements: Strong bass design, effective buildup
684
+ </response>
685
+ ```
686
+ """
687
+ }
688
+
689
+ return scenarios.get(scenario, "Please select a scenario")
690
+
691
+
692
+ # ============================================
693
+ # GRADIO INTERFACE
694
+ # ============================================
695
+
696
+ # Custom CSS
697
+ custom_css = """
698
+ .gradio-container {
699
+ max-width: 1200px !important;
700
+ }
701
+ .tab-nav button {
702
+ font-size: 16px !important;
703
+ }
704
+ .prose h1 {
705
+ color: #FF6B35 !important;
706
+ }
707
+ .prose h2 {
708
+ color: #4ECDC4 !important;
709
+ border-bottom: 2px solid #4ECDC4;
710
+ padding-bottom: 5px;
711
+ }
712
+ .highlight-box {
713
+ background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
714
+ padding: 20px;
715
+ border-radius: 10px;
716
+ color: white;
717
+ }
718
+ """
719
+
720
+ # Build the interface
721
+ with gr.Blocks(css=custom_css, title="๐ŸŽง Audio Reasoning Explorer", theme=gr.themes.Soft()) as demo:
722
+
723
+ # Header
724
+ gr.Markdown("""
725
+ <div style="text-align: center; padding: 20px;">
726
+ <h1>๐ŸŽง Audio Reasoning & Step-Audio-R1 Explorer</h1>
727
+ <p style="font-size: 18px; color: #666;">
728
+ Interactive guide to understanding how AI learns to think about sound
729
+ </p>
730
+ </div>
731
+ """)
732
+
733
+ # Main tabs
734
+ with gr.Tabs():
735
+
736
+ # Tab 1: Introduction
737
+ with gr.TabItem("๐Ÿ  Introduction", id=0):
738
+ gr.Markdown(INTRO_CONTENT)
739
+
740
+ # Tab 2: Audio Reasoning Types
741
+ with gr.TabItem("๐Ÿง  Reasoning Types", id=1):
742
+ gr.Markdown("## ๐Ÿง  Types of Audio Reasoning\n\nSelect a reasoning type to learn more:")
743
+
744
+ with gr.Row():
745
+ with gr.Column(scale=1):
746
+ reasoning_dropdown = gr.Dropdown(
747
+ choices=list(REASONING_TYPES.keys()),
748
+ label="Select Reasoning Type",
749
+ value="Factual Reasoning"
750
+ )
751
+
752
+ gr.Markdown("### Quick Overview")
753
+ for rtype, info in REASONING_TYPES.items():
754
+ gr.Markdown(f"{info['emoji']} **{rtype}**: {info['description']}")
755
+
756
+ with gr.Column(scale=2):
757
+ reasoning_output = gr.Markdown(
758
+ value=get_reasoning_type_info("Factual Reasoning")
759
+ )
760
+
761
+ reasoning_dropdown.change(
762
+ fn=get_reasoning_type_info,
763
+ inputs=[reasoning_dropdown],
764
+ outputs=[reasoning_output]
765
+ )
766
+
767
+ # Tab 3: The Problem
768
+ with gr.TabItem("๐Ÿšซ The Problem", id=2):
769
+ gr.Markdown(PROBLEM_CONTENT)
770
+
771
+ # Tab 4: MGRD Solution
772
+ with gr.TabItem("๐Ÿ”ฌ MGRD Solution", id=3):
773
+ gr.Markdown(MGRD_CONTENT)
774
+
775
+ # Tab 5: Architecture
776
+ with gr.TabItem("๐Ÿ—๏ธ Architecture", id=4):
777
+ gr.Markdown(ARCHITECTURE_CONTENT)
778
+
779
+ # Tab 6: Benchmarks
780
+ with gr.TabItem("๐Ÿ“Š Benchmarks", id=5):
781
+ gr.Markdown(BENCHMARK_DATA)
782
+ gr.Markdown(create_comparison_chart())
783
+
784
+ # Tab 7: Interactive Demo
785
+ with gr.TabItem("๐ŸŽฎ Interactive Demo", id=6):
786
+ gr.Markdown("""
787
+ ## ๐ŸŽฎ Interactive Audio Reasoning Demo
788
+
789
+ See how Step-Audio-R1 would analyze different audio scenarios!
790
+
791
+ *Note: This is a simulation showing the reasoning process.
792
+ The actual model processes real audio input.*
793
+ """)
794
+
795
+ with gr.Row():
796
+ with gr.Column(scale=1):
797
+ scenario_dropdown = gr.Dropdown(
798
+ choices=[
799
+ "Customer Service Call",
800
+ "Meeting Recording",
801
+ "Podcast Episode",
802
+ "Music Analysis"
803
+ ],
804
+ label="Select Audio Scenario",
805
+ value="Customer Service Call"
806
+ )
807
+
808
+ analyze_btn = gr.Button("๐Ÿ” Analyze Scenario", variant="primary")
809
+
810
+ gr.Markdown("""
811
+ ### What This Shows
812
+
813
+ Each scenario demonstrates:
814
+ 1. **Acoustic analysis** - What the model "hears"
815
+ 2. **Reasoning process** - Step-by-step thinking
816
+ 3. **Final output** - Actionable insights
817
+
818
+ This is the power of **audio reasoning**!
819
+ """)
820
+
821
+ with gr.Column(scale=2):
822
+ demo_output = gr.Markdown(
823
+ value=generate_demo_reasoning("Customer Service Call")
824
+ )
825
+
826
+ analyze_btn.click(
827
+ fn=generate_demo_reasoning,
828
+ inputs=[scenario_dropdown],
829
+ outputs=[demo_output]
830
+ )
831
+
832
+ # Tab 8: Applications
833
+ with gr.TabItem("๐Ÿš€ Applications", id=7):
834
+ gr.Markdown(APPLICATIONS_CONTENT)
835
+
836
+ # Tab 9: Resources
837
+ with gr.TabItem("๐Ÿ“š Resources", id=8):
838
+ gr.Markdown(RESOURCES_CONTENT)
839
+
840
+ # Footer
841
+ gr.Markdown("""
842
+ ---
843
+ <div style="text-align: center; padding: 20px; color: #666;">
844
+ <p>Created by <strong>Mehmet TuฤŸrul Kaya</strong> |
845
+ <a href="https://github.com/mtkaya">GitHub</a> |
846
+ <a href="https://huggingface.co/tugrulkaya">HuggingFace</a></p>
847
+ <p>๐ŸŽง Sound Speaks, AI Listens and Thinks ๐Ÿง </p>
848
+ </div>
849
+ """)
850
+
851
+ # Launch
852
+ if __name__ == "__main__":
853
+ demo.launch()
requirements.txt ADDED
@@ -0,0 +1 @@
 
 
1
+ gradio>=4.0.0