FubaoSu committed
Commit e10500a · verified · 1 Parent(s): 7dd2089

[Docs] Add LightLLM deployment example (faster than SGLang)


Hi @zai-org team,

We have recently added support for **GLM-4.7-Flash** in **[LightLLM](https://github.com/ModelTC/lightllm)**.

To provide the community with more deployment options, we would like to contribute a brief guide and some performance references to the Model Card. We have implemented and verified the **tool calling** and **reasoning** capabilities of this model to ensure a robust user experience.
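
As a quick illustration of the tool-calling path, here is a minimal request sketch against LightLLM's OpenAI-compatible endpoint. It assumes the launch configuration from the README addition at the bottom of this page (port 8000, `--tool_call_parser glm47`); the `get_weather` tool and the served model name are placeholders for illustration, not part of the model or server.

```bash
# Minimal tool-calling sketch; assumes a LightLLM server started as in the README
# snippet below (localhost:8000). The get_weather tool is a made-up example, and the
# "model" field may need to match the name the server was launched with.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "GLM-4.7-Flash",
    "messages": [{"role": "user", "content": "What is the weather like in Beijing right now?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'
```

With the GLM-4.7 tool-call parser enabled, the function invocation should come back in the standard `tool_calls` field of the response rather than as raw text.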

#### Performance Reference (TP2)
In our local benchmark (320 prompts, 64 concurrent requests, 8k random input / 1k output tokens per request), LightLLM serves the model efficiently; both deltas below follow directly from the raw TP2 numbers in the full results (a quick recomputation is shown after this list):
* **Throughput:** 18,931 total tok/s, roughly 31% higher than SGLang's 14,403 tok/s in the same setup.
* **Latency:** Mean TPOT of 26.0 ms versus 34.2 ms for SGLang, a reduction of roughly 24%.
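
These deltas are plain arithmetic over the reported TP2 numbers (18,931.93 vs. 14,403.29 total tok/s; 26.01 ms vs. 34.24 ms mean TPOT), not an additional measurement:

```bash
# Recompute the headline deltas from the raw TP2 numbers below.
echo "scale=4; (18931.93 / 14403.29 - 1) * 100" | bc   # total throughput gain: ~31.4%
echo "scale=4; (1 - 26.01 / 34.24) * 100" | bc          # mean TPOT reduction: ~24.0%
```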

#### Accuracy Reference (BFCL)
On the Berkeley Function Calling Leaderboard:
* **LightLLM Overall Accuracy:** **49.12%**
* **SGLang Overall Accuracy:** 45.41%

#### Full Results
```bash
# Benchmark script
python -m sglang.bench_serving --backend sglang-oai --model /dev/shm/GLM-4.7-Flash --dataset-name random --random-input-len 8000 --random-output-len 1000 --num-prompts 320 --max-concurrency 64 --request-rate inf


# LightLLM tp2 startup script
python -m lightllm.server.api_server \
--model_dir /dev/shm/GLM-4.7-Flash/ \
--tp 2 \
--max_req_total_len 202752 \
--port 30000
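
# Optional smoke test (not part of the original runs): before benchmarking, confirm the
# OpenAI-compatible endpoint responds. In these runs both backends listen on port 30000
# (LightLLM via --port above, SGLang's default). Endpoint path and the served model
# name may vary by version; adjust as needed.
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "GLM-4.7-Flash", "messages": [{"role": "user", "content": "ping"}], "max_tokens": 8}'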

# Result
============ Serving Benchmark Result ============
Backend: sglang-oai
Traffic request rate: inf
Max request concurrency: 64
Successful requests: 320
Benchmark duration (s): 76.27
Total input tokens: 1273893
Total input text tokens: 1273893
Total generated tokens: 170000
Total generated tokens (retokenized): 169853
Request throughput (req/s): 4.20
Input token throughput (tok/s): 16702.93
Output token throughput (tok/s): 2228.99
Peak output token throughput (tok/s): 3335.00
Peak concurrent requests: 71
Total token throughput (tok/s): 18931.93
Concurrency: 59.12
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 14091.45
Median E2E Latency (ms): 13633.65
P90 E2E Latency (ms): 23682.33
P99 E2E Latency (ms): 27589.25
---------------Time to First Token----------------
Mean TTFT (ms): 652.54
Median TTFT (ms): 177.28
P99 TTFT (ms): 3984.77
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 26.01
Median TPOT (ms): 26.70
P99 TPOT (ms): 38.65
---------------Inter-Token Latency----------------
Mean ITL (ms): 25.35
Median ITL (ms): 17.52
P95 ITL (ms): 90.26
P99 ITL (ms): 117.75
Max ITL (ms): 3209.75

# SGLang tp2 startup script
python -m sglang.launch_server \
--model /dev/shm/GLM-4.7-Flash \
--attention-backend flashinfer \
--tp 2

# Result
============ Serving Benchmark Result ============
Backend: sglang-oai
Traffic request rate: inf
Max request concurrency: 64
Successful requests: 320
Benchmark duration (s): 100.25
Total input tokens: 1273893
Total input text tokens: 1273893
Total generated tokens: 170000
Total generated tokens (retokenized): 169152
Request throughput (req/s): 3.19
Input token throughput (tok/s): 12707.48
Output token throughput (tok/s): 1695.80
Peak output token throughput (tok/s): 2730.00
Peak concurrent requests: 71
Total token throughput (tok/s): 14403.29
Concurrency: 58.90
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 18451.65
Median E2E Latency (ms): 17985.79
P90 E2E Latency (ms): 30891.18
P99 E2E Latency (ms): 36007.76
---------------Time to First Token----------------
Mean TTFT (ms): 810.07
Median TTFT (ms): 186.54
P99 TTFT (ms): 5677.85
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 34.24
Median TPOT (ms): 34.92
P99 TPOT (ms): 55.08
---------------Inter-Token Latency----------------
Mean ITL (ms): 33.27
Median ITL (ms): 22.77
P95 ITL (ms): 104.16
P99 ITL (ms): 142.71
Max ITL (ms): 5212.68
==================================================

# LightLLM tp1 startup script
python -m lightllm.server.api_server \
--model_dir /dev/shm/GLM-4.7-Flash/ \
--tp 1 \
--max_req_total_len 202752 \
--port 30000

# Result
============ Serving Benchmark Result ============
Backend: sglang-oai
Traffic request rate: inf
Max request concurrency: 64
Successful requests: 320
Benchmark duration (s): 106.86
Total input tokens: 1273893
Total input text tokens: 1273893
Total generated tokens: 170000
Total generated tokens (retokenized): 169819
Request throughput (req/s): 2.99
Input token throughput (tok/s): 11921.64
Output token throughput (tok/s): 1590.93
Peak output token throughput (tok/s): 2528.00
Peak concurrent requests: 71
Total token throughput (tok/s): 13512.57
Concurrency: 59.28
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 19796.59
Median E2E Latency (ms): 19288.46
P90 E2E Latency (ms): 33296.09
P99 E2E Latency (ms): 38643.51
---------------Time to First Token----------------
Mean TTFT (ms): 978.99
Median TTFT (ms): 245.46
P99 TTFT (ms): 6411.26
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 36.52
Median TPOT (ms): 37.32
P99 TPOT (ms): 58.54
---------------Inter-Token Latency----------------
Mean ITL (ms): 35.50
Median ITL (ms): 23.90
P95 ITL (ms): 105.18
P99 ITL (ms): 204.22
Max ITL (ms): 5473.59
==================================================

# SGLang tp1 startup script
python -m sglang.launch_server \
--model /dev/shm/GLM-4.7-Flash \
--attention-backend flashinfer \
--tp 1

# Result
============ Serving Benchmark Result ============
Backend: sglang-oai
Traffic request rate: inf
Max request concurrency: 64
Successful requests: 320
Benchmark duration (s): 130.16
Total input tokens: 1273893
Total input text tokens: 1273893
Total generated tokens: 170000
Total generated tokens (retokenized): 169201
Request throughput (req/s): 2.46
Input token throughput (tok/s): 9787.45
Output token throughput (tok/s): 1306.13
Peak output token throughput (tok/s): 2304.00
Peak concurrent requests: 70
Total token throughput (tok/s): 11093.57
Concurrency: 59.28
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 24110.17
Median E2E Latency (ms): 23328.84
P90 E2E Latency (ms): 40500.42
P99 E2E Latency (ms): 47501.98
---------------Time to First Token----------------
Mean TTFT (ms): 1168.04
Median TTFT (ms): 275.58
P99 TTFT (ms): 8623.51
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 44.74
Median TPOT (ms): 45.52
P99 TPOT (ms): 78.30
---------------Inter-Token Latency----------------
Mean ITL (ms): 43.26
Median ITL (ms): 27.99
P95 ITL (ms): 134.90
P99 ITL (ms): 217.75
Max ITL (ms): 8390.66
==================================================

# LightLLM bfcl eval result
============================================================
SUMMARY : LightLLM
============================================================
Category               Total    Passed   Accuracy
------------------------------------------------------------
simple                   400       250     62.50%
multiple                 200       109     54.50%
parallel                 200       139     69.50%
parallel_multiple        200       123     61.50%
java                     100        66     66.00%
javascript                50        24     48.00%
irrelevance              240       200     83.33%
live_simple              258       118
```

Files changed (1)
  1. README.md +20 -0
README.md CHANGED
````diff
@@ -156,6 +156,26 @@ python3 -m sglang.launch_server \
 ```
 + For Blackwell GPUs, include `--attention-backend triton --speculative-draft-attention-backend triton` in your SGLang launch command.

+### LightLLM
+```shell
+# Install lightllm (Recommended to use Docker.).
+pip install git+https://github.com/ModelTC/LightLLM.git
+
+LIGHTLLM_TRITON_AUTOTUNE_LEVEL=1 LOADWORKER=18 \
+python3 -m lightllm.server.api_server \
+--model_dir /path/to/GLM-4.7-Flash/ \
+--tp 1 \
+--max_req_total_len 202752 \
+--chunked_prefill_size 8192 \
+--llm_prefill_att_backend fa3 \
+--llm_decode_att_backend fa3 \
+--graph_max_batch_size 512 \
+--tool_call_parser glm47 \
+--reasoning_parser glm45 \
+--host 0.0.0.0 \
+--port 8000
+```
+
 ## Citation

 If you find our work useful in your research, please consider citing the following paper:
````