[Docs] Add LightLLM deployment example (faster than SGLang)
Hi @zai-org team,
We have recently added support for **GLM-4.7-Flash** in **[LightLLM](https://github.com/ModelTC/lightllm)**.
To give the community more deployment options, we would like to contribute a brief guide and some performance references to the Model Card. We have implemented and verified this model's **tool calling** and **reasoning** capabilities to ensure a robust user experience (a sample tool-calling request is sketched at the end of this post).
#### Performance Reference (TP2)
In our local benchmarks (64 concurrent requests, 8k-token input / 1k-token output; full commands and logs below), LightLLM serves the model efficiently:
* **Throughput:** **18,931 tok/s** total token throughput (~31% higher than SGLang in this setup).
* **Latency:** Mean TPOT reduced by ~24% (26.01 ms vs. 34.24 ms).
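
Both deltas follow directly from the raw TP2 numbers reported in the Full Results section (total token throughput 18,931.93 vs. 14,403.29 tok/s; mean TPOT 26.01 vs. 34.24 ms):

```shell
# Derivation of the headline deltas from the raw TP2 results below.
python3 -c 'print(f"{18931.93 / 14403.29 - 1:.1%}")'  # throughput gain: ~31.4%
python3 -c 'print(f"{1 - 26.01 / 34.24:.1%}")'        # mean TPOT reduction: ~24.0%
```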
#### Accuracy Reference (BFCL)
On the Berkeley Function Calling Leaderboard:
* **LightLLM Overall Accuracy:** **49.12%**
* **SGLang Overall Accuracy:** 45.41%
#### Full Results
```bash
# Benchmark script
python -m sglang.bench_serving \
    --backend sglang-oai \
    --model /dev/shm/GLM-4.7-Flash \
    --dataset-name random \
    --random-input-len 8000 \
    --random-output-len 1000 \
    --num-prompts 320 \
    --max-concurrency 64 \
    --request-rate inf
# LightLLM tp2 startup script
python -m lightllm.server.api_server \
--model_dir /dev/shm/GLM-4.7-Flash/ \
--tp 2 \
--max_req_total_len 202752 \
--port 30000
# Result
============ Serving Benchmark Result ============
Backend: sglang-oai
Traffic request rate: inf
Max request concurrency: 64
Successful requests: 320
Benchmark duration (s): 76.27
Total input tokens: 1273893
Total input text tokens: 1273893
Total generated tokens: 170000
Total generated tokens (retokenized): 169853
Request throughput (req/s): 4.20
Input token throughput (tok/s): 16702.93
Output token throughput (tok/s): 2228.99
Peak output token throughput (tok/s): 3335.00
Peak concurrent requests: 71
Total token throughput (tok/s): 18931.93
Concurrency: 59.12
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 14091.45
Median E2E Latency (ms): 13633.65
P90 E2E Latency (ms): 23682.33
P99 E2E Latency (ms): 27589.25
---------------Time to First Token----------------
Mean TTFT (ms): 652.54
Median TTFT (ms): 177.28
P99 TTFT (ms): 3984.77
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 26.01
Median TPOT (ms): 26.70
P99 TPOT (ms): 38.65
---------------Inter-Token Latency----------------
Mean ITL (ms): 25.35
Median ITL (ms): 17.52
P95 ITL (ms): 90.26
P99 ITL (ms): 117.75
Max ITL (ms): 3209.75
==================================================
# SGLang tp2 startup script
python -m sglang.launch_server \
--model /dev/shm/GLM-4.7-Flash \
--attention-backend flashinfer \
--tp 2
# Result
============ Serving Benchmark Result ============
Backend: sglang-oai
Traffic request rate: inf
Max request concurrency: 64
Successful requests: 320
Benchmark duration (s): 100.25
Total input tokens: 1273893
Total input text tokens: 1273893
Total generated tokens: 170000
Total generated tokens (retokenized): 169152
Request throughput (req/s): 3.19
Input token throughput (tok/s): 12707.48
Output token throughput (tok/s): 1695.80
Peak output token throughput (tok/s): 2730.00
Peak concurrent requests: 71
Total token throughput (tok/s): 14403.29
Concurrency: 58.90
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 18451.65
Median E2E Latency (ms): 17985.79
P90 E2E Latency (ms): 30891.18
P99 E2E Latency (ms): 36007.76
---------------Time to First Token----------------
Mean TTFT (ms): 810.07
Median TTFT (ms): 186.54
P99 TTFT (ms): 5677.85
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 34.24
Median TPOT (ms): 34.92
P99 TPOT (ms): 55.08
---------------Inter-Token Latency----------------
Mean ITL (ms): 33.27
Median ITL (ms): 22.77
P95 ITL (ms): 104.16
P99 ITL (ms): 142.71
Max ITL (ms): 5212.68
==================================================
# LightLLM tp1 startup script
python -m lightllm.server.api_server \
--model_dir /dev/shm/GLM-4.7-Flash/ \
--tp 1 \
--max_req_total_len 202752 \
--port 30000
# Result
============ Serving Benchmark Result ============
Backend: sglang-oai
Traffic request rate: inf
Max request concurrency: 64
Successful requests: 320
Benchmark duration (s): 106.86
Total input tokens: 1273893
Total input text tokens: 1273893
Total generated tokens: 170000
Total generated tokens (retokenized): 169819
Request throughput (req/s): 2.99
Input token throughput (tok/s): 11921.64
Output token throughput (tok/s): 1590.93
Peak output token throughput (tok/s): 2528.00
Peak concurrent requests: 71
Total token throughput (tok/s): 13512.57
Concurrency: 59.28
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 19796.59
Median E2E Latency (ms): 19288.46
P90 E2E Latency (ms): 33296.09
P99 E2E Latency (ms): 38643.51
---------------Time to First Token----------------
Mean TTFT (ms): 978.99
Median TTFT (ms): 245.46
P99 TTFT (ms): 6411.26
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 36.52
Median TPOT (ms): 37.32
P99 TPOT (ms): 58.54
---------------Inter-Token Latency----------------
Mean ITL (ms): 35.50
Median ITL (ms): 23.90
P95 ITL (ms): 105.18
P99 ITL (ms): 204.22
Max ITL (ms): 5473.59
==================================================
# SGLang tp1 startup script
python -m sglang.launch_server \
--model /dev/shm/GLM-4.7-Flash \
--attention-backend flashinfer \
--tp 1
# Result
============ Serving Benchmark Result ============
Backend: sglang-oai
Traffic request rate: inf
Max request concurrency: 64
Successful requests: 320
Benchmark duration (s): 130.16
Total input tokens: 1273893
Total input text tokens: 1273893
Total generated tokens: 170000
Total generated tokens (retokenized): 169201
Request throughput (req/s): 2.46
Input token throughput (tok/s): 9787.45
Output token throughput (tok/s): 1306.13
Peak output token throughput (tok/s): 2304.00
Peak concurrent requests: 70
Total token throughput (tok/s): 11093.57
Concurrency: 59.28
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 24110.17
Median E2E Latency (ms): 23328.84
P90 E2E Latency (ms): 40500.42
P99 E2E Latency (ms): 47501.98
---------------Time to First Token----------------
Mean TTFT (ms): 1168.04
Median TTFT (ms): 275.58
P99 TTFT (ms): 8623.51
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 44.74
Median TPOT (ms): 45.52
P99 TPOT (ms): 78.30
---------------Inter-Token Latency----------------
Mean ITL (ms): 43.26
Median ITL (ms): 27.99
P95 ITL (ms): 134.90
P99 ITL (ms): 217.75
Max ITL (ms): 8390.66
==================================================
# LightLLM bfcl eval result
============================================================
SUMMARY : LightLLM
============================================================
Category Total Passed Accuracy
------------------------------------------------------------
simple 400 250 62.50%
multiple 200 109 54.50%
parallel 200 139 69.50%
parallel_multiple 200 123 61.50%
java 100 66 66.00%
javascript 50 24 48.00%
irrelevance 240 200 83.33%
live_simple 258 118
```

#### Proposed Model Card Change

Below is the proposed change to the Model Card: a new `### LightLLM` section after the existing SGLang instructions, plus a Blackwell note for SGLang.

````diff
@@ -156,6 +156,26 @@ python3 -m sglang.launch_server \
 ```
+For Blackwell GPUs, include `--attention-backend triton --speculative-draft-attention-backend triton` in your SGLang launch command.
+
+### LightLLM
+
+```shell
+# Install LightLLM (Docker is recommended).
+pip install git+https://github.com/ModelTC/LightLLM.git
+
+LIGHTLLM_TRITON_AUTOTUNE_LEVEL=1 LOADWORKER=18 \
+python3 -m lightllm.server.api_server \
+    --model_dir /path/to/GLM-4.7-Flash/ \
+    --tp 1 \
+    --max_req_total_len 202752 \
+    --chunked_prefill_size 8192 \
+    --llm_prefill_att_backend fa3 \
+    --llm_decode_att_backend fa3 \
+    --graph_max_batch_size 512 \
+    --tool_call_parser glm47 \
+    --reasoning_parser glm45 \
+    --host 0.0.0.0 \
+    --port 8000
+```
+
 ## Citation
 
 If you find our work useful in your research, please consider citing the following paper:
````
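
As a quick way for reviewers to sanity-check the tool-calling path, here is a minimal smoke test. It assumes the server launched as above exposes an OpenAI-compatible `/v1/chat/completions` endpoint on port 8000 (the benchmark above was driven through the same OpenAI-style interface); the `get_weather` tool and the request payload are hypothetical, for illustration only.

```shell
# Minimal tool-calling smoke test (assumptions: LightLLM is running as launched
# above, serving an OpenAI-compatible chat endpoint on localhost:8000 with
# --tool_call_parser glm47; the get_weather tool is a made-up example).
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "GLM-4.7-Flash",
    "messages": [{"role": "user", "content": "What is the weather in Beijing right now?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'
```

With the parsers enabled, the tool invocation should come back structured in the response's `tool_calls` field (and reasoning content separated out by `--reasoning_parser glm45`) rather than as raw text.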