Benchmarks

Benchmarks for LiteLLM Gateway (Proxy Server) tested against a fake OpenAI endpoint.

Use this config for testing:

model_list:
  - model_name: "fake-openai-endpoint"
    litellm_params:
      model: openai/any
      api_base: https://your-fake-openai-endpoint.com/chat/completions
      api_key: "test"

2 Instance LiteLLM Proxy

These tests measure the baseline latency characteristics of a 2-instance LiteLLM Proxy deployment against the fake-openai-endpoint.

Performance Metrics

| Type | Name | Median (ms) | 95%ile (ms) | 99%ile (ms) | Average (ms) | Current RPS |
|---|---|---|---|---|---|---|
| POST | /chat/completions | 200 | 630 | 1200 | 262.46 | 1035.7 |
| Custom | LiteLLM Overhead Duration (ms) | 12 | 29 | 43 | 14.74 | 1035.7 |
| | Aggregated | 100 | 430 | 930 | 138.6 | 2071.4 |

4 Instances

| Type | Name | Median (ms) | 95%ile (ms) | 99%ile (ms) | Average (ms) | Current RPS |
|---|---|---|---|---|---|---|
| POST | /chat/completions | 100 | 150 | 240 | 111.73 | 1170 |
| Custom | LiteLLM Overhead Duration (ms) | 2 | 8 | 13 | 3.32 | 1170 |
| | Aggregated | 77 | 130 | 180 | 57.53 | 2340 |

Key Findings

  • Doubling from 2 to 4 LiteLLM instances halves median latency: 200 ms → 100 ms.
  • High-percentile latencies drop significantly: P95 630 ms → 150 ms, P99 1,200 ms → 240 ms.
  • Setting the number of workers equal to the CPU count gives optimal performance (see the sketch below).
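
A minimal sketch of that last point, assuming the proxy is launched from a small Python wrapper and that the CLI accepts a --num_workers flag (verify the flag names against litellm --help for your version):

import os
import subprocess

# One proxy worker per CPU core (4 on the test machines described below).
num_workers = os.cpu_count() or 1

# Flag names are assumptions; confirm with `litellm --help`.
subprocess.run([
    "litellm",
    "--config", "config.yaml",  # the test config shown above
    "--port", "4000",
    "--num_workers", str(num_workers),
])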

Machine Spec used for testing

Each machine deploying LiteLLM had the following specs:

  • 4 CPUs
  • 8 GB RAM

Configuration

  • Database: PostgreSQL
  • Redis: Not used

Locust Settings

  • 1,000 users
  • 500 user ramp-up

How to measure LiteLLM Overhead

All responses from LiteLLM include the x-litellm-overhead-duration-ms header, which reports the latency overhead in milliseconds added by the LiteLLM Proxy.
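
As a quick sanity check outside of a load test, you can read the header from a single request. This is a minimal sketch assuming the proxy runs locally on port 4000 with a placeholder key:

import requests

# Assumed local proxy URL and placeholder key; adjust for your deployment.
resp = requests.post(
    "http://localhost:4000/chat/completions",
    headers={"Authorization": "Bearer sk-1234"},
    json={
        "model": "fake-openai-endpoint",
        "messages": [{"role": "user", "content": "ping"}],
    },
)

# Latency overhead (in ms) added by the LiteLLM Proxy for this request
print(resp.headers.get("x-litellm-overhead-duration-ms"))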

To track this header as a custom metric during a Locust run, you can use the following code:

Locust Code for measuring LiteLLM Overhead
import os
import uuid
from locust import HttpUser, task, between, events

# Custom metric to track LiteLLM overhead duration
overhead_durations = []


@events.request.add_listener
def on_request(request_type, name, response_time, response_length,
               response=None, context=None, exception=None,
               start_time=None, url=None, **kwargs):
    # Skip the custom events fired below so they aren't processed twice
    if request_type == "Custom":
        return
    if response is not None and hasattr(response, "headers"):
        overhead_duration = response.headers.get("x-litellm-overhead-duration-ms")
        if overhead_duration:
            try:
                duration_ms = float(overhead_duration)
                overhead_durations.append(duration_ms)
                # Report the overhead as a custom metric
                events.request.fire(
                    request_type="Custom",
                    name="LiteLLM Overhead Duration (ms)",
                    response_time=duration_ms,
                    response_length=0,
                    exception=None,
                    context={},
                )
            except (ValueError, TypeError):
                pass


class MyUser(HttpUser):
    wait_time = between(0.5, 1)  # Random wait time between requests

    def on_start(self):
        self.api_key = os.getenv("API_KEY", "sk-1234567890")
        self.client.headers.update({"Authorization": f"Bearer {self.api_key}"})

    @task
    def litellm_completion(self):
        # Unique prefix plus a long message ensures no cache hits
        payload = {
            "model": "db-openai-endpoint",  # set to a model_name from your proxy config, e.g. "fake-openai-endpoint"
            "messages": [
                {
                    "role": "user",
                    "content": f"{uuid.uuid4()} This is a test there will be no cache hits and we'll fill up the context" * 150,
                }
            ],
            "user": "my-new-end-user-1",
        }
        response = self.client.post("/chat/completions", json=payload)

        if response.status_code != 200:
            # Log the errors in error.txt
            with open("error.txt", "a") as error_log:
                error_log.write(response.text + "\n")
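
To reproduce the load profile from the Locust Settings above, an invocation along the lines of locust -f locustfile.py --host http://localhost:4000 -u 1000 -r 500 should work; the filename, host, and port here are assumptions, so point --host at your proxy deployment. The -u and -r flags map to the 1,000 users and 500-user ramp-up listed earlier.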

LiteLLM vs Portkey Performance Comparison

  • Test Configuration: 4 CPUs, 8 GB RAM per instance
  • Load: 1,000 concurrent users, 500 user ramp-up
  • Versions: Portkey v1.14.0 | LiteLLM v1.79.1-stable
  • Test Duration: 5 minutes

Multi-Instance (4×) Performance

| Metric | Portkey (no DB) | LiteLLM (with DB) | Comment |
|---|---|---|---|
| Total Requests | 293,796 | 312,405 | LiteLLM higher |
| Failed Requests | 0 | 0 | Same |
| Median Latency | 100 ms | 100 ms | Same |
| p95 Latency | 230 ms | 150 ms | LiteLLM lower |
| p99 Latency | 500 ms | 240 ms | LiteLLM lower |
| Average Latency | 123 ms | 111 ms | LiteLLM lower |
| Current RPS | 1,170.9 | 1,170 | Same |

Lower is better for latency metrics; higher is better for requests and RPS.

Technical Insights

Portkey

Pros

  • Low memory footprint
  • Stable latency with minimal spikes

Cons

  • CPU utilization capped at ~40%, indicating underutilization of available compute resources
  • Experienced three I/O timeout outages

LiteLLM

Pros

  • Fully utilizes available CPU capacity
  • Strong connection handling and low latency after initial warm-up spikes

Cons

  • High memory usage during initialization and per request

Logging Callbacks

GCS Bucket Logging

Using GCS Bucket logging has no measurable impact on latency or RPS compared to the basic LiteLLM Proxy.

| Metric | Basic LiteLLM Proxy | LiteLLM Proxy with GCS Bucket Logging |
|---|---|---|
| RPS | 1,133.2 | 1,137.3 |
| Median Latency (ms) | 140 | 138 |

LangSmith Logging

Using LangSmith logging has no measurable impact on latency or RPS compared to the basic LiteLLM Proxy.

| Metric | Basic LiteLLM Proxy | LiteLLM Proxy with LangSmith |
|---|---|---|
| RPS | 1,133.2 | 1,135 |
| Median Latency (ms) | 140 | 132 |