Spaces:
Running
Running
File size: 11,127 Bytes
71a479f 6be837f 71a479f 6be837f 71a479f 6be837f 71a479f 62499af 71a479f 62499af 71a479f 62499af 71a479f 6be837f 71a479f 6be837f 71a479f 6be837f 71a479f 6be837f 71a479f 6be837f 71a479f 6be837f 71a479f 6be837f 71a479f 6be837f 71a479f 6be837f 71a479f 6be837f 71a479f 6be837f 71a479f 6be837f 71a479f 6be837f 71a479f 6be837f 71a479f | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 | # CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
Tiny Scribe is a transcript summarization tool with two interfaces:
1. **CLI tool** (`summarize_transcript.py`) - Standalone script for local use with SYCL/CPU acceleration
2. **Gradio web app** (`app.py`) - HuggingFace Spaces deployment with streaming UI
Both use llama-cpp-python to run GGUF quantized models (Qwen3, ERNIE, Granite, Gemma, etc.) and convert output to Traditional Chinese (zh-TW) via OpenCC.
## Development Commands
### Running the CLI
```bash
# Basic usage (default model: Qwen3-0.6B Q4_0)
python summarize_transcript.py -i ./transcripts/short.txt
# Specify model (format: repo_id:quantization)
python summarize_transcript.py -m unsloth/Qwen3-1.7B-GGUF:Q2_K_L
# Force CPU-only (disable SYCL)
python summarize_transcript.py -c
```
### Running the Gradio App
```bash
# Local development
pip install -r requirements.txt
python app.py
# Opens at http://localhost:7860
```
### Testing
No test suite exists in the root project. To test llama-cpp-python submodule:
```bash
cd llama-cpp-python
pip install ".[test]"
pytest tests/test_llama.py -v
# Single test
pytest tests/test_llama.py::test_function_name -v
```
### Docker Deployment
```bash
# Build locally
docker build -t tiny-scribe .
# Run
docker run -p 7860:7860 tiny-scribe
```
## Architecture
### Two Execution Paths
**CLI Path:**
```
User โ summarize_transcript.py โ Llama.from_pretrained() โ GGUF model
โ
Stream tokens โ OpenCC (s2twp) โ stdout
โ
parse_thinking_blocks() โ thinking.txt + summary.txt
```
**Gradio Path:**
```
User upload โ Gradio File โ app.py:summarize_streaming()
โ
Llama.create_chat_completion(stream=True)
โ
Token-by-token yield โ OpenCC โ Two textboxes:
โ - Thinking (raw stream)
parse_thinking_blocks() - Summary (parsed output)
```
### Key Differences
| Feature | CLI (`summarize_transcript.py`) | Gradio (`app.py`) |
|---------|--------------------------------|-------------------|
| Model loading | On-demand per run | Global singleton (cached) |
| Model selection | CLI argument `repo_id:quant` | Dropdown with 10 models |
| Thinking tags | Supports both formats | Supports both formats + streaming |
| Reasoning toggle | Not supported | Qwen3: /think or /no_think |
| Inference settings | Hardcoded per run | Model-specific, dynamic UI |
| Output | Print to stdout + save files | Yield tuples for dual textboxes |
| GPU support | Configurable via `--cpu` flag | Hardcoded `n_gpu_layers=0` |
| Context window | 32K tokens | Per-model (32K-262K, capped at 32K) |
### Model Loading Pattern
Both scripts use `Llama.from_pretrained()` with HuggingFace Hub integration:
```python
llm = Llama.from_pretrained(
repo_id="unsloth/Qwen3-0.6B-GGUF",
filename="*Q4_K_M.gguf", # Wildcard for flexible matching
n_gpu_layers=0, # 0=CPU, -1=all layers on GPU
n_ctx=32768, # 32K context window
seed=1337, # Reproducibility
verbose=False, # Reduce log noise
)
```
**Important:** Always call `llm.reset()` after each completion to clear KV cache and ensure state isolation.
### Streaming Implementation
The Gradio app (`app.py`) implements real-time streaming with dual outputs:
1. **Raw stream** โ `thinking_output` textbox (shows every token as generated)
2. **Parsed summary** โ `summary_output` markdown (extracts content outside `<thinking>` tags)
Generator pattern:
```python
def summarize_streaming(...) -> Generator[Tuple[str, str], None, None]:
for chunk in stream:
content = chunk['choices'][0]['delta'].get('content', '')
full_response += content
# Show all tokens in thinking field
current_thinking += content
# Extract summary (content outside thinking tags)
thinking_blocks, summary = parse_thinking_blocks(full_response)
current_summary = summary
# Yield both on every token
yield (current_thinking, current_summary)
```
### Thinking Block Parsing
Models may wrap reasoning in special tags that should be separated from final output.
**Both versions now support both tag formats:**
- `<think>reasoning</think>` (common with Qwen models)
- `<thinking>reasoning</thinking>` (Claude-style)
Regex pattern:
```python
# Matches both <think> and <thinking> tags
pattern = r'<think(?:ing)?>(.*?)</think(?:ing)?>'
matches = re.findall(pattern, content, re.DOTALL)
thinking = '\n\n'.join(match.strip() for match in matches)
summary = re.sub(pattern, '', content, flags=re.DOTALL).strip()
```
The Gradio app also handles streaming mode with unclosed `<think>` tags for real-time display.
### Qwen3 Thinking Mode
Qwen3 models support a special "thinking mode" that generates `<think>...</think>` blocks for reasoning before the final answer.
**Implementation (llama.cpp/llama-cpp-python):**
- Add `/think` to system prompt or user message to enable thinking mode
- Add `/no_think` to disable thinking mode (faster, direct output)
- Most recent instruction takes precedence in multi-turn conversations
**Official Recommended Settings (from Unsloth):**
| Setting | Non-Thinking Mode | Thinking Mode |
|---------|------------------|---------------|
| Temperature | 0.7 | 0.6 |
| Top_P | 0.8 | 0.95 |
| Top_K | 20 | 20 |
| Min_P | 0.0 | 0.0 |
**Important Notes:**
- **DO NOT use greedy decoding** in thinking mode (causes endless repetitions)
- In thinking mode, model generates `<think>...</think>` block before final answer
- For non-thinking mode, empty `<think></think>` tags are purposely used
**Current Implementation:**
The Gradio app (`app.py`) implements this via:
- `enable_reasoning` checkbox (models with `supports_toggle: true`)
- Dynamic system prompt: `ไฝ ๆฏไธๅๆๅฉ็ๅฉๆ๏ผ่ฒ ่ฒฌ็ธฝ็ต่ฝ้ๅ
งๅฎนใ{reasoning_mode}`
- Where `reasoning_mode = "/think"` or `/no_think"` based on toggle
### Chinese Text Conversion
All outputs are converted from Simplified to Traditional Chinese (Taiwan standard):
```python
from opencc import OpenCC
converter = OpenCC('s2twp') # s2twp = Simplified โ Traditional (Taiwan + phrases)
traditional = converter.convert(simplified)
```
Applied token-by-token during streaming to maintain real-time display.
## HuggingFace Spaces Deployment
The Gradio app is optimized for HF Spaces Free Tier (2 vCPUs):
- **Models**: 10 models available (100M to 1.7B parameters), default: Qwen3-0.6B Q4_K_M (~400MB)
- **Dockerfile**: Uses prebuilt llama-cpp-python wheel (skips 10-min compilation)
- **Context limits**: Per-model context windows (32K to 262K tokens), capped at 32K for CPU performance
See `DEPLOY.md` for full deployment instructions.
### Deployment Workflow
The `deploy.sh` script ensures meaningful commit messages:
```bash
./deploy.sh "Add new model: Gemma-3 270M"
```
The script:
1. Checks for uncommitted changes
2. Prompts for commit message if not provided
3. Warns about generic/short messages
4. Shows commits to be pushed
5. Confirms before pushing
6. Verifies commit message was preserved on remote
### Docker Optimization
The Dockerfile avoids building llama-cpp-python from source by using a prebuilt wheel:
```dockerfile
RUN pip install --no-cache-dir \
https://huggingface.co/Luigi/llama-cpp-python-wheels-hf-spaces-free-cpu/resolve/main/llama_cpp_python-0.3.22-cp310-cp310-linux_x86_64.whl
```
This reduces build time from 10+ minutes to ~2 minutes.
## Git Submodule
The `llama-cpp-python/` directory is a Git submodule tracking upstream development:
```bash
# Initialize after clone
git submodule update --init --recursive
# Update to latest
cd llama-cpp-python
git pull origin main
cd ..
git add llama-cpp-python
git commit -m "Update llama-cpp-python submodule"
```
## Model Format
CLI model argument format: `repo_id:quantization`
Examples:
- `unsloth/Qwen3-0.6B-GGUF:Q4_0` โ Searches for `*Q4_0.gguf`
- `unsloth/Qwen3-1.7B-GGUF:Q2_K_L` โ Searches for `*Q2_K_L.gguf`
The `:` separator is parsed in `summarize_transcript.py:128-130`.
## Error Handling Notes
When modifying streaming logic:
- **Always** handle `'choices'` key presence in chunks
- **Always** check for `'delta'` in choice before accessing `'content'`
- Gradio error handling: Yield error messages in the summary field, keep thinking field intact
- File upload: Validate file existence and encoding before reading
## Model Registry
The Gradio app (`app.py:32-155`) includes a model registry (`AVAILABLE_MODELS`) with:
1. **Model metadata** (repo_id, filename, max context)
2. **Model-specific inference settings** (temperature, top_p, top_k, repeat_penalty)
3. **Feature flags** (e.g., `supports_toggle` for Qwen3 reasoning mode)
Each model has optimized defaults. The UI updates inference controls when model selection changes.
### Available Models
| Key | Model | Params | Max Context | Quant |
|-----|-------|--------|-------------|-------|
| `falcon_h1_100m` | Falcon-H1 100M | 100M | 32K | Q8_0 |
| `gemma3_270m` | Gemma-3 270M | 270M | 32K | Q8_0 |
| `ernie_300m` | ERNIE-4.5 0.3B | 300M | 131K | Q8_0 |
| `granite_350m` | Granite-4.0 350M | 350M | 32K | Q8_0 |
| `lfm2_350m` | LFM2 350M | 350M | 32K | Q8_0 |
| `bitcpm4_500m` | BitCPM4 0.5B | 500M | 128K | q4_0 |
| `hunyuan_500m` | Hunyuan 0.5B | 500M | 256K | Q8_0 |
| `qwen3_600m_q4` | Qwen3 0.6B | 600M | 32K | Q4_K_M |
| `falcon_h1_1.5b_q4` | Falcon-H1 1.5B | 1.5B | 32K | Q4_K_M |
| `qwen3_1.7b_q4` | Qwen3 1.7B | 1.7B | 32K | Q4_K_M |
### Adding a New Model
1. Add entry to `AVAILABLE_MODELS` in `app.py`:
```python
"model_key": {
"name": "Human-Readable Name",
"repo_id": "org/model-name-GGUF",
"filename": "*Quantization.gguf",
"max_context": 32768,
"supports_toggle": False, # For Qwen3 /think mode
"inference_settings": {
"temperature": 0.6,
"top_p": 0.95,
"top_k": 20,
"repeat_penalty": 1.05,
},
},
```
2. Set `DEFAULT_MODEL_KEY` to the new key if it should be default
## Common Modifications
### Changing the Default Model
**CLI:** Use `-m` argument at runtime
**Gradio app:** Change `DEFAULT_MODEL_KEY` in `app.py:157`
### Adjusting Context Window
**CLI:** Change `n_ctx` in `summarize_transcript.py:23`
**Gradio app:** The app dynamically calculates `n_ctx` based on input size and model limits. To change the global cap, modify `MAX_USABLE_CTX` in `app.py:29`.
Values:
- 32768 (current) = handles ~24KB text input
- 8192 = faster, lower memory, ~6KB text
- 131072 = very slow on CPU, ~100KB text
### GPU Acceleration
**CLI:** Remove `-c` flag (defaults to SYCL/CUDA if available)
**Gradio app:** Change `app.py:206`:
```python
n_gpu_layers=-1, # Use all GPU layers
```
Note: HF Spaces Free Tier has no GPU access.
|