Metrics

TGI exposes multiple metrics that can be collected via the `/metrics` Prometheus endpoint. These metrics can be used to monitor the performance of TGI, autoscale deployments, and help identify bottlenecks.
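As a rough illustration of collecting these metrics without a full Prometheus setup, the sketch below downloads the endpoint's plain-text output and parses it into a dictionary. The helper names `fetch_metrics_text` and `parse_samples` are hypothetical and not part of TGI. The parser is deliberately simplified: it assumes the standard Prometheus text exposition format with no trailing timestamps on sample lines.

```python
import urllib.request


def fetch_metrics_text(base_url: str, timeout: float = 5.0) -> str:
    """Download the raw Prometheus text from <base_url>/metrics.

    base_url is whatever address your TGI server listens on (an assumption
    of this sketch; nothing here is specific to a particular deployment).
    """
    url = f"{base_url.rstrip('/')}/metrics"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def parse_samples(text: str) -> dict:
    """Map each sample (metric name plus any label set) to its float value.

    Simplification: assumes sample lines carry no trailing timestamp, so the
    value is always the last space-separated field on the line.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        key, _, value = line.rpartition(" ")
        samples[key] = float(value)
    return samples
```

For example, `parse_samples(fetch_metrics_text("http://localhost:8080"))` would return the current samples, where the host and port are placeholders for your own deployment. For production monitoring, a Prometheus server scraping the endpoint is the more usual choice.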

The following metrics are exposed:

Metric Name	Description	Type	Unit
`tgi_batch_current_max_tokens`	Maximum tokens for the current batch	Gauge	Count
`tgi_batch_current_size`	Current batch size	Gauge	Count
`tgi_batch_decode_duration`	Time spent decoding a batch per method (prefill or decode)	Histogram	Seconds
`tgi_batch_filter_duration`	Time spent filtering batches and sending generated tokens per method (prefill or decode)	Histogram	Seconds
`tgi_batch_forward_duration`	Batch forward duration per method (prefill or decode)	Histogram	Seconds
`tgi_batch_inference_count`	Inference calls per method (prefill or decode)	Counter	Count
`tgi_batch_inference_duration`	Batch inference duration	Histogram	Seconds
`tgi_batch_inference_success`	Number of successful inference calls per method (prefill or decode)	Counter	Count
`tgi_batch_next_size`	Batch size of the next batch	Histogram	Count
`tgi_queue_size`	Current queue size	Gauge	Count
`tgi_request_count`	Total number of requests	Counter	Count
`tgi_request_duration`	Total time spent processing the request (e2e latency)	Histogram	Seconds
`tgi_request_generated_tokens`	Generated tokens per request	Histogram	Count
`tgi_request_inference_duration`	Request inference duration	Histogram	Seconds
`tgi_request_input_length`	Input token length per request	Histogram	Count
`tgi_request_max_new_tokens`	Maximum new tokens per request	Histogram	Count
`tgi_request_mean_time_per_token_duration`	Mean time per token per request (inter-token latency)	Histogram	Seconds
`tgi_request_queue_duration`	Time spent in the queue per request	Histogram	Seconds
`tgi_request_skipped_tokens`	Speculated tokens per request	Histogram	Count
`tgi_request_success`	Number of successful requests	Counter	Count
`tgi_request_validation_duration`	Time spent validating the request	Histogram	Seconds
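To show how the metrics in this table might be combined into health indicators, here is a minimal sketch that derives a request success ratio and a mean queue time from already-collected samples. It assumes the samples are available as a plain dictionary of series name to value, that the relevant series carry no labels, and that histogram metrics expose `_sum` and `_count` series, as is conventional for Prometheus histograms. The function names and the numbers are hypothetical and for illustration only.

```python
def mean_from_histogram(samples: dict, name: str):
    """Average observed value of a histogram: <name>_sum / <name>_count."""
    count = samples.get(f"{name}_count", 0.0)
    if count == 0:
        return None  # no observations recorded yet
    return samples[f"{name}_sum"] / count


def success_ratio(samples: dict):
    """Fraction of requests that succeeded, or None if there were none."""
    total = samples.get("tgi_request_count", 0.0)
    if total == 0:
        return None
    return samples.get("tgi_request_success", 0.0) / total


# Made-up values, not measurements from a real server.
example = {
    "tgi_request_count": 200.0,
    "tgi_request_success": 190.0,
    "tgi_request_queue_duration_sum": 5.0,
    "tgi_request_queue_duration_count": 100.0,
}

ratio = success_ratio(example)
mean_queue_seconds = mean_from_histogram(example, "tgi_request_queue_duration")
```

Because counters and histogram sums accumulate from server start, these values are lifetime averages. For recent behavior, compute the same ratios over the difference between two scrapes, which is what a rate query does in Prometheus.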
