Upload 3 files
Browse files- README.md +85 -8
- app.py +853 -0
- requirements.txt +1 -0
README.md
CHANGED
|
@@ -1,14 +1,91 @@
|
|
| 1 |
---
|
| 2 |
-
title: Audio Reasoning Explorer
|
| 3 |
-
emoji:
|
| 4 |
-
colorFrom:
|
| 5 |
-
colorTo:
|
| 6 |
sdk: gradio
|
| 7 |
-
sdk_version:
|
| 8 |
app_file: app.py
|
| 9 |
pinned: false
|
| 10 |
-
license:
|
| 11 |
-
short_description: Interactive
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 12 |
---
|
| 13 |
|
| 14 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
---
|
| 2 |
+
title: Audio Reasoning & Step-Audio-R1 Explorer
|
| 3 |
+
emoji: ๐ง
|
| 4 |
+
colorFrom: purple
|
| 5 |
+
colorTo: blue
|
| 6 |
sdk: gradio
|
| 7 |
+
sdk_version: 4.44.0
|
| 8 |
app_file: app.py
|
| 9 |
pinned: false
|
| 10 |
+
license: cc-by-4.0
|
| 11 |
+
short_description: Interactive guide to audio reasoning and Step-Audio-R1 model
|
| 12 |
+
tags:
|
| 13 |
+
- audio
|
| 14 |
+
- reasoning
|
| 15 |
+
- multimodal
|
| 16 |
+
- step-audio-r1
|
| 17 |
+
- LALM
|
| 18 |
+
- chain-of-thought
|
| 19 |
+
- education
|
| 20 |
---
|
| 21 |
|
| 22 |
+
# ๐ง Audio Reasoning & Step-Audio-R1 Explorer
|
| 23 |
+
|
| 24 |
+
An interactive educational space exploring the groundbreaking concepts behind **audio reasoning** and the **Step-Audio-R1** model.
|
| 25 |
+
|
| 26 |
+
## ๐ฏ What is Audio Reasoning?
|
| 27 |
+
|
| 28 |
+
Audio reasoning is an AI model's ability to perform **deliberate, multi-step thinking processes** over audio inputs. This goes far beyond simple speech recognition (ASR) or audio classification.
|
| 29 |
+
|
| 30 |
+
**Step-Audio-R1** is the first model to successfully unlock reasoning capabilities in the audio domain, solving the "inverted scaling anomaly" that plagued previous audio language models.
|
| 31 |
+
|
| 32 |
+
## ๐ Features of This Space
|
| 33 |
+
|
| 34 |
+
| Tab | Content |
|
| 35 |
+
|-----|---------|
|
| 36 |
+
| ๐ **Introduction** | Overview of audio reasoning and key achievements |
|
| 37 |
+
| ๐ง **Reasoning Types** | Interactive explorer for 5 types of audio reasoning |
|
| 38 |
+
| ๐ซ **The Problem** | Understanding the inverted scaling anomaly |
|
| 39 |
+
| ๐ฌ **MGRD Solution** | How Modality-Grounded Reasoning Distillation works |
|
| 40 |
+
| ๐๏ธ **Architecture** | Step-Audio-R1 model architecture breakdown |
|
| 41 |
+
| ๐ **Benchmarks** | Performance comparisons and results |
|
| 42 |
+
| ๐ฎ **Interactive Demo** | Simulated audio reasoning examples |
|
| 43 |
+
| ๐ **Applications** | Real-world use cases |
|
| 44 |
+
| ๐ **Resources** | Papers, code, and references |
|
| 45 |
+
|
| 46 |
+
## ๐ฌ Key Innovation: MGRD
|
| 47 |
+
|
| 48 |
+
**Modality-Grounded Reasoning Distillation (MGRD)** is the core innovation that makes Step-Audio-R1 work:
|
| 49 |
+
|
| 50 |
+
```
|
| 51 |
+
Text-based reasoning โ Filter textual surrogates โ Keep acoustic-grounded chains โ Native Audio Think
|
| 52 |
+
```
|
| 53 |
+
|
| 54 |
+
This iterative process teaches the model to reason over **actual acoustic features** instead of text transcripts.
|
| 55 |
+
|
| 56 |
+
## ๐ Performance
|
| 57 |
+
|
| 58 |
+
Step-Audio-R1 achieves:
|
| 59 |
+
- โ
**Surpasses Gemini 2.5 Pro** on comprehensive audio benchmarks
|
| 60 |
+
- โ
**Comparable to Gemini 3 Pro** (state-of-the-art)
|
| 61 |
+
- โ
**First successful test-time compute scaling** for audio
|
| 62 |
+
|
| 63 |
+
## ๐ Resources
|
| 64 |
+
|
| 65 |
+
- ๐ [Step-Audio-R1 Paper](https://arxiv.org/abs/2511.15848)
|
| 66 |
+
- ๐ป [GitHub Repository](https://github.com/stepfun-ai/Step-Audio-R1)
|
| 67 |
+
- ๐ค [HuggingFace Collection](https://huggingface.co/collections/stepfun-ai/step-audio-r1)
|
| 68 |
+
- ๐ฏ [Official Demo](https://stepaudiollm.github.io/step-audio-r1/)
|
| 69 |
+
|
| 70 |
+
## ๐ค Author
|
| 71 |
+
|
| 72 |
+
**Mehmet Tuฤrul Kaya**
|
| 73 |
+
- ๐ GitHub: [@mtkaya](https://github.com/mtkaya)
|
| 74 |
+
- ๐ค HuggingFace: [tugrulkaya](https://huggingface.co/tugrulkaya)
|
| 75 |
+
|
| 76 |
+
## ๐ Citation
|
| 77 |
+
|
| 78 |
+
```bibtex
|
| 79 |
+
@article{stepaudioR1,
|
| 80 |
+
title={Step-Audio-R1 Technical Report},
|
| 81 |
+
author={Tian, Fei and others},
|
| 82 |
+
journal={arXiv preprint arXiv:2511.15848},
|
| 83 |
+
year={2025}
|
| 84 |
+
}
|
| 85 |
+
```
|
| 86 |
+
|
| 87 |
+
---
|
| 88 |
+
|
| 89 |
+
<p align="center">
|
| 90 |
+
<b>๐ง Sound Speaks, AI Listens and Thinks ๐ง </b>
|
| 91 |
+
</p>
|
app.py
ADDED
|
@@ -0,0 +1,853 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
๐ง Audio Reasoning & Step-Audio-R1 Explorer
|
| 3 |
+
Interactive Hugging Face Space for exploring audio reasoning concepts
|
| 4 |
+
|
| 5 |
+
Author: Mehmet Tuฤrul Kaya
|
| 6 |
+
"""
|
| 7 |
+
|
| 8 |
+
import gradio as gr
|
| 9 |
+
|
| 10 |
+
# ============================================
|
| 11 |
+
# CONTENT DATA
|
| 12 |
+
# ============================================
|
| 13 |
+
|
| 14 |
+
INTRO_CONTENT = """
|
| 15 |
+
# ๐ง Audio Reasoning & Step-Audio-R1
|
| 16 |
+
|
| 17 |
+
## Teaching AI to Think About Sound
|
| 18 |
+
|
| 19 |
+
**Step-Audio-R1** is the first audio language model to successfully unlock reasoning capabilities in the audio domain.
|
| 20 |
+
This space explores the groundbreaking concepts behind audio reasoning and the innovative MGRD framework.
|
| 21 |
+
|
| 22 |
+
### ๐ฏ Key Achievement
|
| 23 |
+
> *"Can audio intelligence truly benefit from deliberate thinking?"* โ **YES!**
|
| 24 |
+
|
| 25 |
+
Step-Audio-R1 proves that reasoning is a **transferable capability across modalities** when properly grounded in acoustic features.
|
| 26 |
+
|
| 27 |
+
---
|
| 28 |
+
|
| 29 |
+
### ๐ Quick Stats
|
| 30 |
+
|
| 31 |
+
| Metric | Value |
|
| 32 |
+
|--------|-------|
|
| 33 |
+
| **Model Size** | 32B parameters (Qwen2.5 LLM) |
|
| 34 |
+
| **Audio Encoder** | Qwen2 (25 Hz, frozen) |
|
| 35 |
+
| **Performance** | Surpasses Gemini 2.5 Pro |
|
| 36 |
+
| **Innovation** | First successful audio reasoning model |
|
| 37 |
+
|
| 38 |
+
---
|
| 39 |
+
|
| 40 |
+
*Navigate through the tabs to explore different aspects of audio reasoning!*
|
| 41 |
+
"""
|
| 42 |
+
|
| 43 |
+
# Audio Reasoning Types Data
|
| 44 |
+
REASONING_TYPES = {
|
| 45 |
+
"Factual Reasoning": {
|
| 46 |
+
"emoji": "๐",
|
| 47 |
+
"description": "Extracting concrete information from audio",
|
| 48 |
+
"example_question": "What date is mentioned in this conversation?",
|
| 49 |
+
"example_audio": "A business call discussing a meeting scheduled for March 15th",
|
| 50 |
+
"what_model_does": "Identifies specific facts, numbers, names, dates from speech content",
|
| 51 |
+
"challenge": "Requires accurate speech recognition + information extraction"
|
| 52 |
+
},
|
| 53 |
+
"Procedural Reasoning": {
|
| 54 |
+
"emoji": "๐",
|
| 55 |
+
"description": "Understanding step-by-step processes and sequences",
|
| 56 |
+
"example_question": "What is the third step in this instruction set?",
|
| 57 |
+
"example_audio": "A cooking tutorial explaining how to make pasta",
|
| 58 |
+
"what_model_does": "Tracks sequential information, understands ordering and dependencies",
|
| 59 |
+
"challenge": "Must maintain context across long audio segments"
|
| 60 |
+
},
|
| 61 |
+
"Normative Reasoning": {
|
| 62 |
+
"emoji": "โ๏ธ",
|
| 63 |
+
"description": "Evaluating social, ethical, or behavioral norms",
|
| 64 |
+
"example_question": "Is the speaker behaving appropriately in this dialogue?",
|
| 65 |
+
"example_audio": "A customer service call with an upset customer",
|
| 66 |
+
"what_model_does": "Assesses tone, politeness, social appropriateness based on context",
|
| 67 |
+
"challenge": "Requires understanding of social norms + prosodic analysis"
|
| 68 |
+
},
|
| 69 |
+
"Contextual Reasoning": {
|
| 70 |
+
"emoji": "๐",
|
| 71 |
+
"description": "Inferring environmental and situational context",
|
| 72 |
+
"example_question": "Where might this sound have been recorded?",
|
| 73 |
+
"example_audio": "Background noise with birds, wind, and distant traffic",
|
| 74 |
+
"what_model_does": "Analyzes ambient sounds to determine location/situation",
|
| 75 |
+
"challenge": "Must process non-speech audio elements"
|
| 76 |
+
},
|
| 77 |
+
"Causal Reasoning": {
|
| 78 |
+
"emoji": "๐",
|
| 79 |
+
"description": "Establishing cause-effect relationships",
|
| 80 |
+
"example_question": "Why might this sound event have occurred?",
|
| 81 |
+
"example_audio": "A loud crash followed by glass breaking",
|
| 82 |
+
"what_model_does": "Infers causality from sound sequences and patterns",
|
| 83 |
+
"challenge": "Requires world knowledge + temporal understanding"
|
| 84 |
+
}
|
| 85 |
+
}
|
| 86 |
+
|
| 87 |
+
# The Problem Content
|
| 88 |
+
PROBLEM_CONTENT = """
|
| 89 |
+
## ๐ซ The Inverted Scaling Anomaly
|
| 90 |
+
|
| 91 |
+
### The Paradox
|
| 92 |
+
Traditional audio language models showed a **strange behavior**: they performed **WORSE** when reasoning longer!
|
| 93 |
+
|
| 94 |
+
This is the opposite of what happens in text models (like GPT-4, Claude) where more thinking = better answers.
|
| 95 |
+
|
| 96 |
+
### Root Cause: Textual Surrogate Reasoning
|
| 97 |
+
|
| 98 |
+
```
|
| 99 |
+
๐ Audio Input
|
| 100 |
+
โ
|
| 101 |
+
๐ Model converts to text (transcript)
|
| 102 |
+
โ
|
| 103 |
+
๐ง Reasons over TEXT, not SOUND
|
| 104 |
+
โ
|
| 105 |
+
โ Acoustic features IGNORED
|
| 106 |
+
โ
|
| 107 |
+
๐ Performance degrades with longer reasoning
|
| 108 |
+
```
|
| 109 |
+
|
| 110 |
+
### Why Does This Happen?
|
| 111 |
+
|
| 112 |
+
1. **Text-based initialization**: Models are fine-tuned from text LLMs
|
| 113 |
+
2. **Inherited patterns**: They learn to reason like text models
|
| 114 |
+
3. **Modality mismatch**: Audio is treated as "text with extra steps"
|
| 115 |
+
4. **Lost information**: Tone, emotion, prosody, ambient sounds are ignored
|
| 116 |
+
|
| 117 |
+
### Real Example
|
| 118 |
+
|
| 119 |
+
**Audio**: Person says "Sure, I'll do it" in a *sarcastic, annoyed tone*
|
| 120 |
+
|
| 121 |
+
| Approach | Interpretation |
|
| 122 |
+
|----------|---------------|
|
| 123 |
+
| **Textual Surrogate** โ | "Person agrees to do the task" |
|
| 124 |
+
| **Acoustic-Grounded** โ
| "Person is reluctant/annoyed, may not follow through" |
|
| 125 |
+
|
| 126 |
+
The acoustic-grounded approach captures the TRUE meaning!
|
| 127 |
+
"""
|
| 128 |
+
|
| 129 |
+
# MGRD Content
|
| 130 |
+
MGRD_CONTENT = """
|
| 131 |
+
## ๐ฌ MGRD: Modality-Grounded Reasoning Distillation
|
| 132 |
+
|
| 133 |
+
MGRD is the **key innovation** that makes Step-Audio-R1 work. It's an iterative training framework that teaches the model to reason over actual acoustic features instead of text surrogates.
|
| 134 |
+
|
| 135 |
+
### The MGRD Pipeline
|
| 136 |
+
|
| 137 |
+
```
|
| 138 |
+
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
|
| 139 |
+
โ MGRD ITERATIVE PROCESS โ
|
| 140 |
+
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
|
| 141 |
+
โ โ
|
| 142 |
+
โ START: Text-based reasoning (inherited) โ
|
| 143 |
+
โ โ โ
|
| 144 |
+
โ ITERATION 1: Generate reasoning chains โ
|
| 145 |
+
โ โ โ
|
| 146 |
+
โ FILTER: Remove textual surrogate chains โ
|
| 147 |
+
โ โ โ
|
| 148 |
+
โ SELECT: Keep acoustically-grounded chains โ
|
| 149 |
+
โ โ โ
|
| 150 |
+
โ RETRAIN: Update model with filtered data โ
|
| 151 |
+
โ โ โ
|
| 152 |
+
โ REPEAT until "Native Audio Think" emerges โ
|
| 153 |
+
โ โ โ
|
| 154 |
+
โ RESULT: Model reasons over acoustic features! โ
|
| 155 |
+
โ โ
|
| 156 |
+
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
|
| 157 |
+
```
|
| 158 |
+
|
| 159 |
+
### Three Training Stages
|
| 160 |
+
|
| 161 |
+
| Stage | Name | What Happens |
|
| 162 |
+
|-------|------|--------------|
|
| 163 |
+
| **1** | Cold-Start | SFT + RLVR to establish basic audio understanding |
|
| 164 |
+
| **2** | Iterative Distillation | Filter and refine reasoning chains |
|
| 165 |
+
| **3** | Native Audio Think | Model develops true acoustic reasoning |
|
| 166 |
+
|
| 167 |
+
### What Makes a "Good" Reasoning Chain?
|
| 168 |
+
|
| 169 |
+
**โ Bad (Textual Surrogate):**
|
| 170 |
+
> "The speaker says 'I'm fine' so they must be feeling okay."
|
| 171 |
+
|
| 172 |
+
**โ
Good (Acoustically-Grounded):**
|
| 173 |
+
> "The speaker's voice shows elevated pitch (+15%), faster tempo, and slight tremor, indicating stress despite saying 'I'm fine'. The background noise suggests a busy environment which may be contributing to their tension."
|
| 174 |
+
|
| 175 |
+
The good chain references **actual acoustic features**!
|
| 176 |
+
"""
|
| 177 |
+
|
| 178 |
+
# Architecture Content
|
| 179 |
+
ARCHITECTURE_CONTENT = """
|
| 180 |
+
## ๐๏ธ Step-Audio-R1 Architecture
|
| 181 |
+
|
| 182 |
+
Step-Audio-R1 builds on Step-Audio 2 with three main components:
|
| 183 |
+
|
| 184 |
+
```
|
| 185 |
+
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
|
| 186 |
+
โ STEP-AUDIO-R1 ARCHITECTURE โ
|
| 187 |
+
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
|
| 188 |
+
โ โ
|
| 189 |
+
โ ๐ค AUDIO INPUT (waveform) โ
|
| 190 |
+
โ โ โ
|
| 191 |
+
โ โผ โ
|
| 192 |
+
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
|
| 193 |
+
โ โ AUDIO ENCODER โ โ
|
| 194 |
+
โ โ โข Qwen2 Audio Encoder โ โ
|
| 195 |
+
โ โ โข 25 Hz frame rate โ โ
|
| 196 |
+
โ โ โข FROZEN during training โ โ
|
| 197 |
+
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
|
| 198 |
+
โ โ โ
|
| 199 |
+
โ โผ โ
|
| 200 |
+
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
|
| 201 |
+
โ โ AUDIO ADAPTOR โ โ
|
| 202 |
+
โ โ โข 2x downsampling โ โ
|
| 203 |
+
โ โ โข 12.5 Hz output โ โ
|
| 204 |
+
โ โ โข Bridge to LLM โ โ
|
| 205 |
+
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
|
| 206 |
+
โ โ โ
|
| 207 |
+
โ โผ โ
|
| 208 |
+
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
|
| 209 |
+
โ โ LLM DECODER โ โ
|
| 210 |
+
โ โ โข Qwen2.5 32B โ โ
|
| 211 |
+
โ โ โข Core reasoning engine โ โ
|
| 212 |
+
โ โ โข Outputs: Think โ Response โ โ
|
| 213 |
+
โ โโโโโโโโ๏ฟฝ๏ฟฝโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
|
| 214 |
+
โ โ โ
|
| 215 |
+
โ โผ โ
|
| 216 |
+
โ ๐ TEXT OUTPUT โ
|
| 217 |
+
โ <thinking>...</thinking> โ
|
| 218 |
+
โ <response>...</response> โ
|
| 219 |
+
โ โ
|
| 220 |
+
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
|
| 221 |
+
```
|
| 222 |
+
|
| 223 |
+
### Component Details
|
| 224 |
+
|
| 225 |
+
| Component | Model | Frame Rate | Status |
|
| 226 |
+
|-----------|-------|------------|--------|
|
| 227 |
+
| Audio Encoder | Qwen2 Audio | 25 Hz | Frozen |
|
| 228 |
+
| Audio Adaptor | Custom MLP | 12.5 Hz (2x down) | Trainable |
|
| 229 |
+
| LLM Decoder | Qwen2.5 32B | N/A | Trainable |
|
| 230 |
+
|
| 231 |
+
### Output Format
|
| 232 |
+
|
| 233 |
+
The model produces structured reasoning:
|
| 234 |
+
|
| 235 |
+
```xml
|
| 236 |
+
<thinking>
|
| 237 |
+
1. Acoustic Analysis: [describes sound properties]
|
| 238 |
+
2. Pattern Recognition: [identifies key features]
|
| 239 |
+
3. Inference: [draws conclusions from audio]
|
| 240 |
+
</thinking>
|
| 241 |
+
|
| 242 |
+
<response>
|
| 243 |
+
[Final answer based on acoustic reasoning]
|
| 244 |
+
</response>
|
| 245 |
+
```
|
| 246 |
+
"""
|
| 247 |
+
|
| 248 |
+
# Benchmarks Data
|
| 249 |
+
BENCHMARK_DATA = """
|
| 250 |
+
## ๐ Benchmark Results
|
| 251 |
+
|
| 252 |
+
Step-Audio-R1 was evaluated on comprehensive audio understanding benchmarks:
|
| 253 |
+
|
| 254 |
+
### MMAU (Massive Multi-Task Audio Understanding)
|
| 255 |
+
- **10,000** audio clips with human-annotated Q&A
|
| 256 |
+
- **27** distinct skills tested
|
| 257 |
+
- Covers: Speech, Environmental Sounds, Music
|
| 258 |
+
|
| 259 |
+
### Performance Comparison
|
| 260 |
+
|
| 261 |
+
| Model | MMAU Avg | vs Gemini 2.5 Pro |
|
| 262 |
+
|-------|----------|-------------------|
|
| 263 |
+
| **Step-Audio-R1** | **~78%** | **+12%** โ
|
|
| 264 |
+
| Gemini 3 Pro | ~77% | +11% |
|
| 265 |
+
| Gemini 2.5 Pro | ~66% | baseline |
|
| 266 |
+
| GPT-4o Audio | ~55% | -11% |
|
| 267 |
+
| Qwen2.5-Omni | ~52% | -14% |
|
| 268 |
+
|
| 269 |
+
### The Breakthrough: Test-Time Compute Scaling
|
| 270 |
+
|
| 271 |
+
```
|
| 272 |
+
BEFORE Step-Audio-R1:
|
| 273 |
+
More thinking โ โ Worse performance (inverted scaling)
|
| 274 |
+
|
| 275 |
+
AFTER Step-Audio-R1:
|
| 276 |
+
More thinking โ โ
Better performance (normal scaling)
|
| 277 |
+
```
|
| 278 |
+
|
| 279 |
+
**This is the first time test-time compute scaling works for audio!**
|
| 280 |
+
|
| 281 |
+
### Domain Performance
|
| 282 |
+
|
| 283 |
+
| Domain | Step-Audio-R1 | Previous SOTA |
|
| 284 |
+
|--------|---------------|---------------|
|
| 285 |
+
| Speech | ๐ข High | Medium |
|
| 286 |
+
| Sound | ๐ข High | Medium |
|
| 287 |
+
| Music | ๐ข High | Low |
|
| 288 |
+
"""
|
| 289 |
+
|
| 290 |
+
# Applications Content
|
| 291 |
+
APPLICATIONS_CONTENT = """
|
| 292 |
+
## ๐ Practical Applications
|
| 293 |
+
|
| 294 |
+
Audio reasoning enables many new AI capabilities:
|
| 295 |
+
|
| 296 |
+
### 1. ๐๏ธ Advanced Voice Assistants
|
| 297 |
+
- Understand complex multi-step instructions
|
| 298 |
+
- Detect user emotion and adjust responses
|
| 299 |
+
- Handle ambiguous requests intelligently
|
| 300 |
+
|
| 301 |
+
### 2. ๐ Call Center Analytics
|
| 302 |
+
- Analyze customer sentiment in real-time
|
| 303 |
+
- Detect escalation patterns before they happen
|
| 304 |
+
- Extract action items from conversations
|
| 305 |
+
|
| 306 |
+
### 3. โฟ Accessibility Tools
|
| 307 |
+
- Rich audio descriptions for hearing impaired
|
| 308 |
+
- Environmental sound narration
|
| 309 |
+
- Music content analysis and description
|
| 310 |
+
|
| 311 |
+
### 4. ๐ Security & Monitoring
|
| 312 |
+
- Anomalous sound event detection
|
| 313 |
+
- Contextual threat assessment
|
| 314 |
+
- Multi-source audio analysis
|
| 315 |
+
|
| 316 |
+
### 5. ๐ Education & Learning
|
| 317 |
+
- Pronunciation analysis for language learning
|
| 318 |
+
- Music performance evaluation
|
| 319 |
+
- Lecture comprehension and Q&A
|
| 320 |
+
|
| 321 |
+
### Example: Meeting Analysis
|
| 322 |
+
|
| 323 |
+
```
|
| 324 |
+
๐ฅ Input: [30-minute team meeting recording]
|
| 325 |
+
|
| 326 |
+
๐ค Step-Audio-R1 Analysis:
|
| 327 |
+
|
| 328 |
+
<thinking>
|
| 329 |
+
1. Speaker identification: 4 distinct voices detected
|
| 330 |
+
2. Topic tracking: Budget discussion (0-10min),
|
| 331 |
+
Project timeline (10-20min), Action items (20-30min)
|
| 332 |
+
3. Sentiment analysis:
|
| 333 |
+
- Speaker A: Confident, leading discussion
|
| 334 |
+
- Speaker B: Concerned (elevated pitch during budget section)
|
| 335 |
+
- Speaker C: Disengaged (low energy, minimal contributions)
|
| 336 |
+
- Speaker D: Supportive, mediating tensions
|
| 337 |
+
4. Key moments: Tension spike at 8:42 (disagreement on budget)
|
| 338 |
+
</thinking>
|
| 339 |
+
|
| 340 |
+
<response>
|
| 341 |
+
Meeting Summary:
|
| 342 |
+
- Main topics: Q3 budget allocation, Project Alpha timeline
|
| 343 |
+
- Key decision: Budget approved with 10% reduction
|
| 344 |
+
- Action items: 3 identified (assigned to Speakers A, B, D)
|
| 345 |
+
- Team dynamics: Some tension around budget, resolved by end
|
| 346 |
+
- Follow-up recommended: 1-on-1 with Speaker C (low engagement)
|
| 347 |
+
</response>
|
| 348 |
+
```
|
| 349 |
+
"""
|
| 350 |
+
|
| 351 |
+
# Resources Content
|
| 352 |
+
RESOURCES_CONTENT = """
|
| 353 |
+
## ๐ Resources & Links
|
| 354 |
+
|
| 355 |
+
### ๐ Papers
|
| 356 |
+
| Paper | Link |
|
| 357 |
+
|-------|------|
|
| 358 |
+
| Step-Audio-R1 Technical Report | [arXiv:2511.15848](https://arxiv.org/abs/2511.15848) |
|
| 359 |
+
| MMAU Benchmark | [arXiv:2410.19168](https://arxiv.org/abs/2410.19168) |
|
| 360 |
+
| Audio-Reasoner | [arXiv:2503.02318](https://arxiv.org/abs/2503.02318) |
|
| 361 |
+
| SpeechR Benchmark | [arXiv:2508.02018](https://arxiv.org/abs/2508.02018) |
|
| 362 |
+
|
| 363 |
+
### ๐ป Code & Models
|
| 364 |
+
| Resource | Link |
|
| 365 |
+
|----------|------|
|
| 366 |
+
| Step-Audio-R1 GitHub | [github.com/stepfun-ai/Step-Audio-R1](https://github.com/stepfun-ai/Step-Audio-R1) |
|
| 367 |
+
| Step-Audio-R1 Demo | [stepaudiollm.github.io/step-audio-r1](https://stepaudiollm.github.io/step-audio-r1/) |
|
| 368 |
+
| HuggingFace Collection | [huggingface.co/collections/stepfun-ai/step-audio-r1](https://huggingface.co/collections/stepfun-ai/step-audio-r1) |
|
| 369 |
+
| AudioBench | [github.com/AudioLLMs/AudioBench](https://github.com/AudioLLMs/AudioBench) |
|
| 370 |
+
|
| 371 |
+
### ๐ Key Concepts Glossary
|
| 372 |
+
|
| 373 |
+
| Term | Full Name | Description |
|
| 374 |
+
|------|-----------|-------------|
|
| 375 |
+
| **LALM** | Large Audio Language Model | AI model that understands and reasons over audio |
|
| 376 |
+
| **CoT** | Chain-of-Thought | Step-by-step reasoning approach |
|
| 377 |
+
| **MGRD** | Modality-Grounded Reasoning Distillation | Training framework for acoustic reasoning |
|
| 378 |
+
| **TSR** | Textual Surrogate Reasoning | Problem where model reasons over text instead of audio |
|
| 379 |
+
| **RLVR** | Reinforcement Learning with Verified Rewards | Training with binary correctness rewards |
|
| 380 |
+
| **SFT** | Supervised Fine-Tuning | Standard fine-tuning on labeled data |
|
| 381 |
+
|
| 382 |
+
### ๐ Citation
|
| 383 |
+
|
| 384 |
+
```bibtex
|
| 385 |
+
@article{stepaudioR1,
|
| 386 |
+
title={Step-Audio-R1 Technical Report},
|
| 387 |
+
author={Tian, Fei and others},
|
| 388 |
+
journal={arXiv preprint arXiv:2511.15848},
|
| 389 |
+
year={2025}
|
| 390 |
+
}
|
| 391 |
+
```
|
| 392 |
+
|
| 393 |
+
---
|
| 394 |
+
|
| 395 |
+
### ๐ค About This Space
|
| 396 |
+
|
| 397 |
+
Created by **Mehmet Tuฤrul Kaya**
|
| 398 |
+
- ๐ GitHub: [@mtkaya](https://github.com/mtkaya)
|
| 399 |
+
- ๐ค HuggingFace: [tugrulkaya](https://huggingface.co/tugrulkaya)
|
| 400 |
+
|
| 401 |
+
*This educational space explores the concepts behind Step-Audio-R1 and audio reasoning.*
|
| 402 |
+
"""
|
| 403 |
+
|
| 404 |
+
# ============================================
|
| 405 |
+
# HELPER FUNCTIONS
|
| 406 |
+
# ============================================
|
| 407 |
+
|
| 408 |
+
def get_reasoning_type_info(reasoning_type):
|
| 409 |
+
"""Get detailed information about a reasoning type"""
|
| 410 |
+
if reasoning_type not in REASONING_TYPES:
|
| 411 |
+
return "Please select a reasoning type"
|
| 412 |
+
|
| 413 |
+
info = REASONING_TYPES[reasoning_type]
|
| 414 |
+
|
| 415 |
+
output = f"""
|
| 416 |
+
## {info['emoji']} {reasoning_type}
|
| 417 |
+
|
| 418 |
+
### Description
|
| 419 |
+
{info['description']}
|
| 420 |
+
|
| 421 |
+
### Example Question
|
| 422 |
+
> *"{info['example_question']}"*
|
| 423 |
+
|
| 424 |
+
### Example Audio Scenario
|
| 425 |
+
๐ง {info['example_audio']}
|
| 426 |
+
|
| 427 |
+
### What the Model Does
|
| 428 |
+
{info['what_model_does']}
|
| 429 |
+
|
| 430 |
+
### Key Challenge
|
| 431 |
+
โ ๏ธ {info['challenge']}
|
| 432 |
+
|
| 433 |
+
---
|
| 434 |
+
|
| 435 |
+
### How Step-Audio-R1 Handles This
|
| 436 |
+
|
| 437 |
+
Unlike traditional models that would convert this to text first, Step-Audio-R1:
|
| 438 |
+
|
| 439 |
+
1. **Analyzes acoustic features** directly from the audio waveform
|
| 440 |
+
2. **Generates reasoning chains** grounded in sound properties
|
| 441 |
+
3. **Produces answers** that account for non-verbal information
|
| 442 |
+
|
| 443 |
+
This is what makes it the first true **audio reasoning** model!
|
| 444 |
+
"""
|
| 445 |
+
return output
|
| 446 |
+
|
| 447 |
+
|
| 448 |
+
def create_comparison_chart():
|
| 449 |
+
"""Create model comparison data"""
|
| 450 |
+
return """
|
| 451 |
+
### ๐ Model Comparison on MMAU Benchmark
|
| 452 |
+
|
| 453 |
+
| Rank | Model | Score | Type |
|
| 454 |
+
|------|-------|-------|------|
|
| 455 |
+
| ๐ฅ | **Step-Audio-R1** | ~78% | Open |
|
| 456 |
+
| ๐ฅ | Gemini 3 Pro | ~77% | Proprietary |
|
| 457 |
+
| ๐ฅ | Gemini 2.5 Pro | ~66% | Proprietary |
|
| 458 |
+
| 4 | Audio Flamingo 3 | ~60% | Open |
|
| 459 |
+
| 5 | GPT-4o Audio | ~55% | Proprietary |
|
| 460 |
+
| 6 | Qwen2.5-Omni | ~52% | Open |
|
| 461 |
+
|
| 462 |
+
**Key Insight**: Step-Audio-R1 is the first **open** model to match proprietary SOTA!
|
| 463 |
+
"""
|
| 464 |
+
|
| 465 |
+
|
| 466 |
+
def generate_demo_reasoning(scenario):
|
| 467 |
+
"""Generate example reasoning for demo scenarios"""
|
| 468 |
+
scenarios = {
|
| 469 |
+
"Customer Service Call": """
|
| 470 |
+
## ๐ง Demo: Customer Service Call Analysis
|
| 471 |
+
|
| 472 |
+
**Scenario**: A customer calls about a billing issue
|
| 473 |
+
|
| 474 |
+
### Simulated Audio Description
|
| 475 |
+
- Customer voice: Female, middle-aged, American English
|
| 476 |
+
- Tone: Initially frustrated, becomes calmer
|
| 477 |
+
- Agent voice: Male, professional, empathetic tone
|
| 478 |
+
- Background: Quiet call center environment
|
| 479 |
+
- Duration: ~3 minutes
|
| 480 |
+
|
| 481 |
+
### Step-Audio-R1 Reasoning (Simulated)
|
| 482 |
+
|
| 483 |
+
```xml
|
| 484 |
+
<thinking>
|
| 485 |
+
1. SPEAKER ANALYSIS:
|
| 486 |
+
- Customer: F0 mean 220Hz (elevated), speech rate 4.2 syl/sec (fast)
|
| 487 |
+
- Initial segment shows tension markers: pitch variation +40%
|
| 488 |
+
- Gradual calming: pitch stabilizes by minute 2
|
| 489 |
+
|
| 490 |
+
2. AGENT ANALYSIS:
|
| 491 |
+
- Consistent calm prosody throughout
|
| 492 |
+
- Strategic pauses after customer complaints
|
| 493 |
+
- Mirroring technique detected (matching customer's pace)
|
| 494 |
+
|
| 495 |
+
3. EMOTIONAL ARC:
|
| 496 |
+
- 0:00-0:45: Customer frustrated (anger markers)
|
| 497 |
+
- 0:45-1:30: Tension peak, interruption detected
|
| 498 |
+
- 1:30-2:30: De-escalation successful
|
| 499 |
+
- 2:30-3:00: Resolution, positive closing
|
| 500 |
+
|
| 501 |
+
4. ACOUSTIC CONTEXT:
|
| 502 |
+
- Low background noise suggests professional environment
|
| 503 |
+
- No hold music interruptions
|
| 504 |
+
- Clear audio quality on both sides
|
| 505 |
+
</thinking>
|
| 506 |
+
|
| 507 |
+
<response>
|
| 508 |
+
This customer service interaction shows successful de-escalation.
|
| 509 |
+
The customer initially displayed frustration (elevated pitch, fast
|
| 510 |
+
speech) but the agent's calm, empathetic approach led to resolution.
|
| 511 |
+
Key success factor: Agent's strategic use of pauses and mirroring.
|
| 512 |
+
Customer satisfaction likely: HIGH (based on closing tone).
|
| 513 |
+
</response>
|
| 514 |
+
```
|
| 515 |
+
""",
|
| 516 |
+
"Meeting Recording": """
|
| 517 |
+
## ๐ง Demo: Meeting Recording Analysis
|
| 518 |
+
|
| 519 |
+
**Scenario**: Team standup meeting (15 minutes)
|
| 520 |
+
|
| 521 |
+
### Simulated Audio Description
|
| 522 |
+
- 4 speakers in conference room
|
| 523 |
+
- Mixed engagement levels
|
| 524 |
+
- Some background office noise
|
| 525 |
+
- One speaker joins via video call
|
| 526 |
+
|
| 527 |
+
### Step-Audio-R1 Reasoning (Simulated)
|
| 528 |
+
|
| 529 |
+
```xml
|
| 530 |
+
<thinking>
|
| 531 |
+
1. SPEAKER IDENTIFICATION:
|
| 532 |
+
- Speaker A (Lead): Clear, confident, 65% of speaking time
|
| 533 |
+
- Speaker B (Remote): Slight compression artifacts, engaged
|
| 534 |
+
- Speaker C (Junior): Hesitant prosody, brief contributions
|
| 535 |
+
- Speaker D (Senior): Minimal input, possible disengagement
|
| 536 |
+
|
| 537 |
+
2. TOPIC TRACKING:
|
| 538 |
+
- 0:00-5:00: Sprint review (Speaker A leading)
|
| 539 |
+
- 5:00-10:00: Blockers discussion (B raises concern)
|
| 540 |
+
- 10:00-15:00: Planning next steps
|
| 541 |
+
|
| 542 |
+
3. ENGAGEMENT ANALYSIS:
|
| 543 |
+
- Speaker C shows uncertainty (rising intonation on statements)
|
| 544 |
+
- Speaker D energy levels low (F0 variance minimal)
|
| 545 |
+
- Remote Speaker B most engaged despite audio quality
|
| 546 |
+
|
| 547 |
+
4. ACOUSTIC ENVIRONMENT:
|
| 548 |
+
- Room reverb suggests medium conference room
|
| 549 |
+
- HVAC noise consistent throughout
|
| 550 |
+
- Remote connection stable (no dropouts)
|
| 551 |
+
|
| 552 |
+
5. KEY MOMENTS:
|
| 553 |
+
- 6:23: Tension spike (overlapping speech, raised voices)
|
| 554 |
+
- 12:45: Agreement reached (synchronized acknowledgments)
|
| 555 |
+
</thinking>
|
| 556 |
+
|
| 557 |
+
<response>
|
| 558 |
+
Meeting Analysis Summary:
|
| 559 |
+
- Duration: 15 minutes, 4 participants
|
| 560 |
+
- Key discussion: Sprint blockers and resolution
|
| 561 |
+
- Action items identified: 3 (assigned to Speakers A, B, C)
|
| 562 |
+
- Team dynamics note: Speaker D showed low engagement -
|
| 563 |
+
recommend follow-up. Speaker C may need support (uncertainty
|
| 564 |
+
markers detected in their updates).
|
| 565 |
+
- Overall meeting effectiveness: MODERATE
|
| 566 |
+
</response>
|
| 567 |
+
```
|
| 568 |
+
""",
|
| 569 |
+
"Podcast Episode": """
|
| 570 |
+
## ๐ง Demo: Podcast Episode Analysis
|
| 571 |
+
|
| 572 |
+
**Scenario**: Tech podcast interview (45 minutes)
|
| 573 |
+
|
| 574 |
+
### Simulated Audio Description
|
| 575 |
+
- Host and guest conversation
|
| 576 |
+
- Professional studio recording
|
| 577 |
+
- Music intro/outro
|
| 578 |
+
- Natural conversational flow
|
| 579 |
+
|
| 580 |
+
### Step-Audio-R1 Reasoning (Simulated)
|
| 581 |
+
|
| 582 |
+
```xml
|
| 583 |
+
<thinking>
|
| 584 |
+
1. AUDIO QUALITY ASSESSMENT:
|
| 585 |
+
- Studio-quality recording (low noise floor)
|
| 586 |
+
- Two distinct microphones detected
|
| 587 |
+
- Music segments: Intro (0:00-0:30), Outro (44:30-45:00)
|
| 588 |
+
- No significant audio issues
|
| 589 |
+
|
| 590 |
+
2. SPEAKER DYNAMICS:
|
| 591 |
+
- Host: Experienced interviewer, good pacing
|
| 592 |
+
- Guest: Expert on topic, occasional filler words
|
| 593 |
+
- Turn-taking: Natural, minimal interruptions
|
| 594 |
+
- Rapport: Strong (laughter at 12:34, 23:45, 38:12)
|
| 595 |
+
|
| 596 |
+
3. CONTENT STRUCTURE:
|
| 597 |
+
- Introduction: 0:30-3:00
|
| 598 |
+
- Main discussion: 3:00-40:00
|
| 599 |
+
- Audience Q&A: 40:00-44:30
|
| 600 |
+
- Closing: 44:30-45:00
|
| 601 |
+
|
| 602 |
+
4. ENGAGEMENT INDICATORS:
|
| 603 |
+
- Energy peaks: 15:00 (exciting announcement)
|
| 604 |
+
- Deep technical discussion: 25:00-32:00
|
| 605 |
+
- Emotional moment: 35:00 (personal story shared)
|
| 606 |
+
|
| 607 |
+
5. PROSODIC ANALYSIS:
|
| 608 |
+
- Guest becomes more animated on specific topics
|
| 609 |
+
- Host maintains consistent energy throughout
|
| 610 |
+
- Pace accelerates during exciting segments
|
| 611 |
+
</thinking>
|
| 612 |
+
|
| 613 |
+
<response>
|
| 614 |
+
Podcast Episode Analysis:
|
| 615 |
+
- Format: Interview style, 45 minutes
|
| 616 |
+
- Quality: Professional studio recording
|
| 617 |
+
- Key topics: [Inferred from discussion patterns]
|
| 618 |
+
- Highlight moments: 15:00 (announcement), 35:00 (personal story)
|
| 619 |
+
- Listener engagement prediction: HIGH
|
| 620 |
+
- Recommended clips for promotion: 15:00-16:30, 35:00-36:45
|
| 621 |
+
- Overall quality: EXCELLENT (clear audio, good rapport,
|
| 622 |
+
well-structured conversation)
|
| 623 |
+
</response>
|
| 624 |
+
```
|
| 625 |
+
""",
|
| 626 |
+
"Music Analysis": """
|
| 627 |
+
## ๐ง Demo: Music Analysis
|
| 628 |
+
|
| 629 |
+
**Scenario**: Unknown music track analysis
|
| 630 |
+
|
| 631 |
+
### Simulated Audio Description
|
| 632 |
+
- Instrumental track
|
| 633 |
+
- ~4 minutes duration
|
| 634 |
+
- Multiple instruments
|
| 635 |
+
- Studio production
|
| 636 |
+
|
| 637 |
+
### Step-Audio-R1 Reasoning (Simulated)
|
| 638 |
+
|
| 639 |
+
```xml
|
| 640 |
+
<thinking>
|
| 641 |
+
1. ACOUSTIC ANALYSIS:
|
| 642 |
+
- Tempo: ~120 BPM (moderate, danceable)
|
| 643 |
+
- Key: A minor (melancholic but energetic)
|
| 644 |
+
- Time signature: 4/4
|
| 645 |
+
- Dynamic range: 12 dB (well-mastered)
|
| 646 |
+
|
| 647 |
+
2. INSTRUMENTATION:
|
| 648 |
+
- Drums: Electronic, four-on-the-floor pattern
|
| 649 |
+
- Bass: Synthesizer, prominent low-end
|
| 650 |
+
- Lead: Analog-style synth, saw wave
|
| 651 |
+
- Pads: Ambient, reverb-heavy
|
| 652 |
+
- No vocals detected
|
| 653 |
+
|
| 654 |
+
3. STRUCTURE:
|
| 655 |
+
- Intro: 0:00-0:30 (buildup)
|
| 656 |
+
- Verse 1: 0:30-1:30 (main groove)
|
| 657 |
+
- Build: 1:30-2:00 (tension)
|
| 658 |
+
- Drop: 2:00-2:30 (energy peak)
|
| 659 |
+
- Verse 2: 2:30-3:30 (variation)
|
| 660 |
+
- Outro: 3:30-4:00 (fadeout)
|
| 661 |
+
|
| 662 |
+
4. PRODUCTION ANALYSIS:
|
| 663 |
+
- Sidechain compression detected on pads
|
| 664 |
+
- Stereo width: Wide (good separation)
|
| 665 |
+
- Reference similar to: Melodic techno genre
|
| 666 |
+
- Production quality: Professional
|
| 667 |
+
|
| 668 |
+
5. EMOTIONAL CHARACTER:
|
| 669 |
+
- Overall mood: Driving but melancholic
|
| 670 |
+
- Energy arc: Building โ Peak โ Sustain โ Release
|
| 671 |
+
</thinking>
|
| 672 |
+
|
| 673 |
+
<response>
|
| 674 |
+
Music Analysis Summary:
|
| 675 |
+
- Genre: Melodic Techno / Progressive House
|
| 676 |
+
- Tempo: 120 BPM
|
| 677 |
+
- Key: A minor
|
| 678 |
+
- Duration: ~4 minutes
|
| 679 |
+
- Mood: Energetic yet melancholic
|
| 680 |
+
- Production: Professional quality, well-mastered
|
| 681 |
+
- Use cases: DJ sets, workout playlists, focus music
|
| 682 |
+
- Similar artists: [Based on production style]
|
| 683 |
+
- Standout elements: Strong bass design, effective buildup
|
| 684 |
+
</response>
|
| 685 |
+
```
|
| 686 |
+
"""
|
| 687 |
+
}
|
| 688 |
+
|
| 689 |
+
return scenarios.get(scenario, "Please select a scenario")
|
| 690 |
+
|
| 691 |
+
|
| 692 |
+
# ============================================
|
| 693 |
+
# GRADIO INTERFACE
|
| 694 |
+
# ============================================
|
| 695 |
+
|
| 696 |
+
# Custom CSS
|
| 697 |
+
custom_css = """
|
| 698 |
+
.gradio-container {
|
| 699 |
+
max-width: 1200px !important;
|
| 700 |
+
}
|
| 701 |
+
.tab-nav button {
|
| 702 |
+
font-size: 16px !important;
|
| 703 |
+
}
|
| 704 |
+
.prose h1 {
|
| 705 |
+
color: #FF6B35 !important;
|
| 706 |
+
}
|
| 707 |
+
.prose h2 {
|
| 708 |
+
color: #4ECDC4 !important;
|
| 709 |
+
border-bottom: 2px solid #4ECDC4;
|
| 710 |
+
padding-bottom: 5px;
|
| 711 |
+
}
|
| 712 |
+
.highlight-box {
|
| 713 |
+
background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
|
| 714 |
+
padding: 20px;
|
| 715 |
+
border-radius: 10px;
|
| 716 |
+
color: white;
|
| 717 |
+
}
|
| 718 |
+
"""
|
| 719 |
+
|
| 720 |
+
# Build the interface
|
| 721 |
+
with gr.Blocks(css=custom_css, title="๐ง Audio Reasoning Explorer", theme=gr.themes.Soft()) as demo:
|
| 722 |
+
|
| 723 |
+
# Header
|
| 724 |
+
gr.Markdown("""
|
| 725 |
+
<div style="text-align: center; padding: 20px;">
|
| 726 |
+
<h1>๐ง Audio Reasoning & Step-Audio-R1 Explorer</h1>
|
| 727 |
+
<p style="font-size: 18px; color: #666;">
|
| 728 |
+
Interactive guide to understanding how AI learns to think about sound
|
| 729 |
+
</p>
|
| 730 |
+
</div>
|
| 731 |
+
""")
|
| 732 |
+
|
| 733 |
+
# Main tabs
|
| 734 |
+
with gr.Tabs():
|
| 735 |
+
|
| 736 |
+
# Tab 1: Introduction
|
| 737 |
+
with gr.TabItem("๐ Introduction", id=0):
|
| 738 |
+
gr.Markdown(INTRO_CONTENT)
|
| 739 |
+
|
| 740 |
+
# Tab 2: Audio Reasoning Types
|
| 741 |
+
with gr.TabItem("๐ง Reasoning Types", id=1):
|
| 742 |
+
gr.Markdown("## ๐ง Types of Audio Reasoning\n\nSelect a reasoning type to learn more:")
|
| 743 |
+
|
| 744 |
+
with gr.Row():
|
| 745 |
+
with gr.Column(scale=1):
|
| 746 |
+
reasoning_dropdown = gr.Dropdown(
|
| 747 |
+
choices=list(REASONING_TYPES.keys()),
|
| 748 |
+
label="Select Reasoning Type",
|
| 749 |
+
value="Factual Reasoning"
|
| 750 |
+
)
|
| 751 |
+
|
| 752 |
+
gr.Markdown("### Quick Overview")
|
| 753 |
+
for rtype, info in REASONING_TYPES.items():
|
| 754 |
+
gr.Markdown(f"{info['emoji']} **{rtype}**: {info['description']}")
|
| 755 |
+
|
| 756 |
+
with gr.Column(scale=2):
|
| 757 |
+
reasoning_output = gr.Markdown(
|
| 758 |
+
value=get_reasoning_type_info("Factual Reasoning")
|
| 759 |
+
)
|
| 760 |
+
|
| 761 |
+
reasoning_dropdown.change(
|
| 762 |
+
fn=get_reasoning_type_info,
|
| 763 |
+
inputs=[reasoning_dropdown],
|
| 764 |
+
outputs=[reasoning_output]
|
| 765 |
+
)
|
| 766 |
+
|
| 767 |
+
# Tab 3: The Problem
|
| 768 |
+
with gr.TabItem("๐ซ The Problem", id=2):
|
| 769 |
+
gr.Markdown(PROBLEM_CONTENT)
|
| 770 |
+
|
| 771 |
+
# Tab 4: MGRD Solution
|
| 772 |
+
with gr.TabItem("๐ฌ MGRD Solution", id=3):
|
| 773 |
+
gr.Markdown(MGRD_CONTENT)
|
| 774 |
+
|
| 775 |
+
# Tab 5: Architecture
|
| 776 |
+
with gr.TabItem("๐๏ธ Architecture", id=4):
|
| 777 |
+
gr.Markdown(ARCHITECTURE_CONTENT)
|
| 778 |
+
|
| 779 |
+
# Tab 6: Benchmarks
|
| 780 |
+
with gr.TabItem("๐ Benchmarks", id=5):
|
| 781 |
+
gr.Markdown(BENCHMARK_DATA)
|
| 782 |
+
gr.Markdown(create_comparison_chart())
|
| 783 |
+
|
| 784 |
+
# Tab 7: Interactive Demo
|
| 785 |
+
with gr.TabItem("๐ฎ Interactive Demo", id=6):
|
| 786 |
+
gr.Markdown("""
|
| 787 |
+
## ๐ฎ Interactive Audio Reasoning Demo
|
| 788 |
+
|
| 789 |
+
See how Step-Audio-R1 would analyze different audio scenarios!
|
| 790 |
+
|
| 791 |
+
*Note: This is a simulation showing the reasoning process.
|
| 792 |
+
The actual model processes real audio input.*
|
| 793 |
+
""")
|
| 794 |
+
|
| 795 |
+
with gr.Row():
|
| 796 |
+
with gr.Column(scale=1):
|
| 797 |
+
scenario_dropdown = gr.Dropdown(
|
| 798 |
+
choices=[
|
| 799 |
+
"Customer Service Call",
|
| 800 |
+
"Meeting Recording",
|
| 801 |
+
"Podcast Episode",
|
| 802 |
+
"Music Analysis"
|
| 803 |
+
],
|
| 804 |
+
label="Select Audio Scenario",
|
| 805 |
+
value="Customer Service Call"
|
| 806 |
+
)
|
| 807 |
+
|
| 808 |
+
analyze_btn = gr.Button("๐ Analyze Scenario", variant="primary")
|
| 809 |
+
|
| 810 |
+
gr.Markdown("""
|
| 811 |
+
### What This Shows
|
| 812 |
+
|
| 813 |
+
Each scenario demonstrates:
|
| 814 |
+
1. **Acoustic analysis** - What the model "hears"
|
| 815 |
+
2. **Reasoning process** - Step-by-step thinking
|
| 816 |
+
3. **Final output** - Actionable insights
|
| 817 |
+
|
| 818 |
+
This is the power of **audio reasoning**!
|
| 819 |
+
""")
|
| 820 |
+
|
| 821 |
+
with gr.Column(scale=2):
|
| 822 |
+
demo_output = gr.Markdown(
|
| 823 |
+
value=generate_demo_reasoning("Customer Service Call")
|
| 824 |
+
)
|
| 825 |
+
|
| 826 |
+
analyze_btn.click(
|
| 827 |
+
fn=generate_demo_reasoning,
|
| 828 |
+
inputs=[scenario_dropdown],
|
| 829 |
+
outputs=[demo_output]
|
| 830 |
+
)
|
| 831 |
+
|
| 832 |
+
# Tab 8: Applications
|
| 833 |
+
with gr.TabItem("๐ Applications", id=7):
|
| 834 |
+
gr.Markdown(APPLICATIONS_CONTENT)
|
| 835 |
+
|
| 836 |
+
# Tab 9: Resources
|
| 837 |
+
with gr.TabItem("๐ Resources", id=8):
|
| 838 |
+
gr.Markdown(RESOURCES_CONTENT)
|
| 839 |
+
|
| 840 |
+
# Footer
|
| 841 |
+
gr.Markdown("""
|
| 842 |
+
---
|
| 843 |
+
<div style="text-align: center; padding: 20px; color: #666;">
|
| 844 |
+
<p>Created by <strong>Mehmet Tuฤrul Kaya</strong> |
|
| 845 |
+
<a href="https://github.com/mtkaya">GitHub</a> |
|
| 846 |
+
<a href="https://huggingface.co/tugrulkaya">HuggingFace</a></p>
|
| 847 |
+
<p>๐ง Sound Speaks, AI Listens and Thinks ๐ง </p>
|
| 848 |
+
</div>
|
| 849 |
+
""")
|
| 850 |
+
|
| 851 |
+
# Launch
|
| 852 |
+
if __name__ == "__main__":
|
| 853 |
+
demo.launch()
|
requirements.txt
ADDED
|
@@ -0,0 +1 @@
|
|
|
|
|
|
|
| 1 |
+
gradio>=4.0.0
|