File size: 11,127 Bytes
71a479f
 
 
 
 
 
 
 
 
 
6be837f
71a479f
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
6be837f
 
 
 
 
71a479f
 
6be837f
71a479f
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
62499af
71a479f
62499af
 
 
71a479f
 
 
62499af
 
71a479f
 
 
 
 
6be837f
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
71a479f
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
6be837f
71a479f
6be837f
71a479f
 
 
6be837f
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
71a479f
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
6be837f
71a479f
 
 
 
 
6be837f
71a479f
 
 
 
 
 
 
 
 
6be837f
71a479f
6be837f
71a479f
6be837f
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
71a479f
6be837f
 
 
71a479f
6be837f
 
 
 
 
 
 
 
 
 
 
 
 
71a479f
 
6be837f
 
 
 
 
 
 
 
 
 
71a479f
 
6be837f
 
 
 
 
71a479f
 
 
 
 
 
 
 
6be837f
71a479f
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

Tiny Scribe is a transcript summarization tool with two interfaces:
1. **CLI tool** (`summarize_transcript.py`) - Standalone script for local use with SYCL/CPU acceleration
2. **Gradio web app** (`app.py`) - HuggingFace Spaces deployment with streaming UI

Both use llama-cpp-python to run GGUF quantized models (Qwen3, ERNIE, Granite, Gemma, etc.) and convert output to Traditional Chinese (zh-TW) via OpenCC.

## Development Commands

### Running the CLI

```bash
# Basic usage (default model: Qwen3-0.6B Q4_0)
python summarize_transcript.py -i ./transcripts/short.txt

# Specify model (format: repo_id:quantization)
python summarize_transcript.py -m unsloth/Qwen3-1.7B-GGUF:Q2_K_L

# Force CPU-only (disable SYCL)
python summarize_transcript.py -c
```

### Running the Gradio App

```bash
# Local development
pip install -r requirements.txt
python app.py
# Opens at http://localhost:7860
```

### Testing

No test suite exists in the root project. To test llama-cpp-python submodule:

```bash
cd llama-cpp-python
pip install ".[test]"
pytest tests/test_llama.py -v

# Single test
pytest tests/test_llama.py::test_function_name -v
```

### Docker Deployment

```bash
# Build locally
docker build -t tiny-scribe .

# Run
docker run -p 7860:7860 tiny-scribe
```

## Architecture

### Two Execution Paths

**CLI Path:**
```
User โ†’ summarize_transcript.py โ†’ Llama.from_pretrained() โ†’ GGUF model
                                      โ†“
                                  Stream tokens โ†’ OpenCC (s2twp) โ†’ stdout
                                      โ†“
                    parse_thinking_blocks() โ†’ thinking.txt + summary.txt
```

**Gradio Path:**
```
User upload โ†’ Gradio File โ†’ app.py:summarize_streaming()
                              โ†“
                         Llama.create_chat_completion(stream=True)
                              โ†“
                    Token-by-token yield โ†’ OpenCC โ†’ Two textboxes:
                              โ†“                      - Thinking (raw stream)
                    parse_thinking_blocks()         - Summary (parsed output)
```

### Key Differences

| Feature | CLI (`summarize_transcript.py`) | Gradio (`app.py`) |
|---------|--------------------------------|-------------------|
| Model loading | On-demand per run | Global singleton (cached) |
| Model selection | CLI argument `repo_id:quant` | Dropdown with 10 models |
| Thinking tags | Supports both formats | Supports both formats + streaming |
| Reasoning toggle | Not supported | Qwen3: /think or /no_think |
| Inference settings | Hardcoded per run | Model-specific, dynamic UI |
| Output | Print to stdout + save files | Yield tuples for dual textboxes |
| GPU support | Configurable via `--cpu` flag | Hardcoded `n_gpu_layers=0` |
| Context window | 32K tokens | Per-model (32K-262K, capped at 32K) |

### Model Loading Pattern

Both scripts use `Llama.from_pretrained()` with HuggingFace Hub integration:

```python
llm = Llama.from_pretrained(
    repo_id="unsloth/Qwen3-0.6B-GGUF",
    filename="*Q4_K_M.gguf",      # Wildcard for flexible matching
    n_gpu_layers=0,                # 0=CPU, -1=all layers on GPU
    n_ctx=32768,                   # 32K context window
    seed=1337,                     # Reproducibility
    verbose=False,                 # Reduce log noise
)
```

**Important:** Always call `llm.reset()` after each completion to clear KV cache and ensure state isolation.

### Streaming Implementation

The Gradio app (`app.py`) implements real-time streaming with dual outputs:

1. **Raw stream** โ†’ `thinking_output` textbox (shows every token as generated)
2. **Parsed summary** โ†’ `summary_output` markdown (extracts content outside `<thinking>` tags)

Generator pattern:
```python
def summarize_streaming(...) -> Generator[Tuple[str, str], None, None]:
    for chunk in stream:
        content = chunk['choices'][0]['delta'].get('content', '')
        full_response += content

        # Show all tokens in thinking field
        current_thinking += content

        # Extract summary (content outside thinking tags)
        thinking_blocks, summary = parse_thinking_blocks(full_response)
        current_summary = summary

        # Yield both on every token
        yield (current_thinking, current_summary)
```

### Thinking Block Parsing

Models may wrap reasoning in special tags that should be separated from final output.

**Both versions now support both tag formats:**
- `<think>reasoning</think>` (common with Qwen models)
- `<thinking>reasoning</thinking>` (Claude-style)

Regex pattern:
```python
# Matches both <think> and <thinking> tags
pattern = r'<think(?:ing)?>(.*?)</think(?:ing)?>'
matches = re.findall(pattern, content, re.DOTALL)
thinking = '\n\n'.join(match.strip() for match in matches)
summary = re.sub(pattern, '', content, flags=re.DOTALL).strip()
```

The Gradio app also handles streaming mode with unclosed `<think>` tags for real-time display.

### Qwen3 Thinking Mode

Qwen3 models support a special "thinking mode" that generates `<think>...</think>` blocks for reasoning before the final answer.

**Implementation (llama.cpp/llama-cpp-python):**
- Add `/think` to system prompt or user message to enable thinking mode
- Add `/no_think` to disable thinking mode (faster, direct output)
- Most recent instruction takes precedence in multi-turn conversations

**Official Recommended Settings (from Unsloth):**

| Setting | Non-Thinking Mode | Thinking Mode |
|---------|------------------|---------------|
| Temperature | 0.7 | 0.6 |
| Top_P | 0.8 | 0.95 |
| Top_K | 20 | 20 |
| Min_P | 0.0 | 0.0 |

**Important Notes:**
- **DO NOT use greedy decoding** in thinking mode (causes endless repetitions)
- In thinking mode, model generates `<think>...</think>` block before final answer
- For non-thinking mode, empty `<think></think>` tags are purposely used

**Current Implementation:**
The Gradio app (`app.py`) implements this via:
- `enable_reasoning` checkbox (models with `supports_toggle: true`)
- Dynamic system prompt: `ไฝ ๆ˜ฏไธ€ๅ€‹ๆœ‰ๅŠฉ็š„ๅŠฉๆ‰‹๏ผŒ่ฒ ่ฒฌ็ธฝ็ต่ฝ‰้Œ„ๅ…งๅฎนใ€‚{reasoning_mode}`
- Where `reasoning_mode = "/think"` or `/no_think"` based on toggle

### Chinese Text Conversion

All outputs are converted from Simplified to Traditional Chinese (Taiwan standard):

```python
from opencc import OpenCC
converter = OpenCC('s2twp')  # s2twp = Simplified โ†’ Traditional (Taiwan + phrases)
traditional = converter.convert(simplified)
```

Applied token-by-token during streaming to maintain real-time display.

## HuggingFace Spaces Deployment

The Gradio app is optimized for HF Spaces Free Tier (2 vCPUs):

- **Models**: 10 models available (100M to 1.7B parameters), default: Qwen3-0.6B Q4_K_M (~400MB)
- **Dockerfile**: Uses prebuilt llama-cpp-python wheel (skips 10-min compilation)
- **Context limits**: Per-model context windows (32K to 262K tokens), capped at 32K for CPU performance

See `DEPLOY.md` for full deployment instructions.

### Deployment Workflow

The `deploy.sh` script ensures meaningful commit messages:

```bash
./deploy.sh "Add new model: Gemma-3 270M"
```

The script:
1. Checks for uncommitted changes
2. Prompts for commit message if not provided
3. Warns about generic/short messages
4. Shows commits to be pushed
5. Confirms before pushing
6. Verifies commit message was preserved on remote

### Docker Optimization

The Dockerfile avoids building llama-cpp-python from source by using a prebuilt wheel:

```dockerfile
RUN pip install --no-cache-dir \
    https://huggingface.co/Luigi/llama-cpp-python-wheels-hf-spaces-free-cpu/resolve/main/llama_cpp_python-0.3.22-cp310-cp310-linux_x86_64.whl
```

This reduces build time from 10+ minutes to ~2 minutes.

## Git Submodule

The `llama-cpp-python/` directory is a Git submodule tracking upstream development:

```bash
# Initialize after clone
git submodule update --init --recursive

# Update to latest
cd llama-cpp-python
git pull origin main
cd ..
git add llama-cpp-python
git commit -m "Update llama-cpp-python submodule"
```

## Model Format

CLI model argument format: `repo_id:quantization`

Examples:
- `unsloth/Qwen3-0.6B-GGUF:Q4_0` โ†’ Searches for `*Q4_0.gguf`
- `unsloth/Qwen3-1.7B-GGUF:Q2_K_L` โ†’ Searches for `*Q2_K_L.gguf`

The `:` separator is parsed in `summarize_transcript.py:128-130`.

## Error Handling Notes

When modifying streaming logic:
- **Always** handle `'choices'` key presence in chunks
- **Always** check for `'delta'` in choice before accessing `'content'`
- Gradio error handling: Yield error messages in the summary field, keep thinking field intact
- File upload: Validate file existence and encoding before reading

## Model Registry

The Gradio app (`app.py:32-155`) includes a model registry (`AVAILABLE_MODELS`) with:

1. **Model metadata** (repo_id, filename, max context)
2. **Model-specific inference settings** (temperature, top_p, top_k, repeat_penalty)
3. **Feature flags** (e.g., `supports_toggle` for Qwen3 reasoning mode)

Each model has optimized defaults. The UI updates inference controls when model selection changes.

### Available Models

| Key | Model | Params | Max Context | Quant |
|-----|-------|--------|-------------|-------|
| `falcon_h1_100m` | Falcon-H1 100M | 100M | 32K | Q8_0 |
| `gemma3_270m` | Gemma-3 270M | 270M | 32K | Q8_0 |
| `ernie_300m` | ERNIE-4.5 0.3B | 300M | 131K | Q8_0 |
| `granite_350m` | Granite-4.0 350M | 350M | 32K | Q8_0 |
| `lfm2_350m` | LFM2 350M | 350M | 32K | Q8_0 |
| `bitcpm4_500m` | BitCPM4 0.5B | 500M | 128K | q4_0 |
| `hunyuan_500m` | Hunyuan 0.5B | 500M | 256K | Q8_0 |
| `qwen3_600m_q4` | Qwen3 0.6B | 600M | 32K | Q4_K_M |
| `falcon_h1_1.5b_q4` | Falcon-H1 1.5B | 1.5B | 32K | Q4_K_M |
| `qwen3_1.7b_q4` | Qwen3 1.7B | 1.7B | 32K | Q4_K_M |

### Adding a New Model

1. Add entry to `AVAILABLE_MODELS` in `app.py`:
```python
"model_key": {
    "name": "Human-Readable Name",
    "repo_id": "org/model-name-GGUF",
    "filename": "*Quantization.gguf",
    "max_context": 32768,
    "supports_toggle": False,  # For Qwen3 /think mode
    "inference_settings": {
        "temperature": 0.6,
        "top_p": 0.95,
        "top_k": 20,
        "repeat_penalty": 1.05,
    },
},
```

2. Set `DEFAULT_MODEL_KEY` to the new key if it should be default

## Common Modifications

### Changing the Default Model

**CLI:** Use `-m` argument at runtime

**Gradio app:** Change `DEFAULT_MODEL_KEY` in `app.py:157`

### Adjusting Context Window

**CLI:** Change `n_ctx` in `summarize_transcript.py:23`

**Gradio app:** The app dynamically calculates `n_ctx` based on input size and model limits. To change the global cap, modify `MAX_USABLE_CTX` in `app.py:29`.

Values:
- 32768 (current) = handles ~24KB text input
- 8192 = faster, lower memory, ~6KB text
- 131072 = very slow on CPU, ~100KB text

### GPU Acceleration

**CLI:** Remove `-c` flag (defaults to SYCL/CUDA if available)

**Gradio app:** Change `app.py:206`:
```python
n_gpu_layers=-1,  # Use all GPU layers
```

Note: HF Spaces Free Tier has no GPU access.