Load Testing

A pure-Rust load harness for the gateway lives in the agentzero-gateway crate at tests/load_baseline.rs. It spawns the gateway in-process, hammers the cheapest unauthenticated endpoints, and reports throughput, error count, and latency percentiles.

The harness is gated behind #[ignore] so it does not run during cargo test. Invoke it explicitly when you want a baseline.

# Default: 10 seconds, 64 concurrent clients, port 18800
cargo test --release -p agentzero-gateway --test load_baseline -- --ignored --nocapture load_baseline
# Custom configuration via env vars
AZ_LOAD_DURATION_SECS=30 \
AZ_LOAD_CONCURRENCY=128 \
AZ_LOAD_PORT=18800 \
cargo test --release -p agentzero-gateway --test load_baseline -- --ignored --nocapture load_baseline

Always pass --release. A debug build will dramatically under-report what production deployments can do.
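The three env vars above can be read with a small helper. This is only a sketch of how such a harness might parse its configuration: the env var names and defaults (10 seconds, 64 clients, port 18800) are the documented ones, but the `LoadConfig` struct and `env_or` helper are illustrative, not the actual harness code.

```rust
use std::env;

// Illustrative config struct; the documented defaults are 10 s, 64
// concurrent clients, and port 18800.
#[derive(Debug, PartialEq)]
struct LoadConfig {
    duration_secs: u64,
    concurrency: usize,
    port: u16,
}

// Parse an env var, falling back to a default on absence or parse failure.
fn env_or<T: std::str::FromStr>(key: &str, default: T) -> T {
    env::var(key)
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(default)
}

impl LoadConfig {
    fn from_env() -> Self {
        LoadConfig {
            duration_secs: env_or("AZ_LOAD_DURATION_SECS", 10),
            concurrency: env_or("AZ_LOAD_CONCURRENCY", 64),
            port: env_or("AZ_LOAD_PORT", 18800),
        }
    }
}

fn main() {
    println!("{:?}", LoadConfig::from_env());
}
```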

For each of GET /health/live, GET /health, and GET /metrics, the harness reports:

  • Total requests issued during the run window
  • Effective requests/second (total / wall-clock seconds)
  • Error count (any non-2xx response or transport error)
  • p50 / p95 / p99 / max latency in milliseconds
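The percentile figures above can be derived from the recorded latencies with a nearest-rank computation. This is a minimal sketch of that method; the harness's exact interpolation scheme is not specified in this document and may differ.

```rust
// Nearest-rank percentile over a sorted slice of latencies (ms):
// take the smallest value whose rank covers p percent of samples.
fn percentile(sorted: &[f64], p: f64) -> f64 {
    assert!(!sorted.is_empty(), "need at least one sample");
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    sorted[rank.saturating_sub(1).min(sorted.len() - 1)]
}

fn main() {
    // 100 synthetic samples: 1.0 ms .. 100.0 ms, already sorted.
    let lat: Vec<f64> = (1..=100).map(|i| i as f64).collect();
    println!(
        "p50={} p95={} p99={} max={}",
        percentile(&lat, 50.0),
        percentile(&lat, 95.0),
        percentile(&lat, 99.0),
        lat[lat.len() - 1]
    );
}
```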

The harness uses a single reqwest::Client with connection pooling, so connections are reused across requests and the numbers measure server-side throughput rather than per-request TCP connection setup or TLS handshake cost.

Captured 2026-04-07 on a development MacBook Pro (Apple Silicon, 8 cores). Five-second windows, --no-auth, rate limiting disabled, --release.

32 concurrent clients:

Endpoint         | Reqs    | Errors | RPS    | p50     | p95      | p99      | Max
GET /health/live | 340,093 | 0      | 68,010 | 0.45 ms | 0.64 ms  | 0.78 ms  | 7.17 ms
GET /health      | 349,942 | 0      | 69,980 | 0.45 ms | 0.60 ms  | 0.69 ms  | 1.35 ms
GET /metrics     | 171,173 | 0      | 34,231 | 0.91 ms | 1.34 ms  | 1.55 ms  | 49.25 ms

256 concurrent clients:

Endpoint         | Reqs    | Errors | RPS    | p50     | p95      | p99      | Max
GET /health/live | 339,871 | 0      | 67,934 | 3.58 ms | 4.16 ms  | 8.20 ms  | 44.88 ms
GET /health      | 342,826 | 0      | 68,521 | 3.66 ms | 4.17 ms  | 5.48 ms  | 17.11 ms
GET /metrics     | 169,060 | 0      | 33,780 | 7.36 ms | 11.66 ms | 13.93 ms | 58.41 ms
  • Throughput cap on this hardware: ~68,000 RPS for cheap endpoints. Going from 32 to 256 concurrent clients (8x more) leaves throughput nearly identical, which means the gateway is CPU-bound, not concurrency-limited.
  • Graceful degradation under contention. With 8x the concurrent connections, latency rises roughly in proportion (p99 from 0.78 ms to 8.20 ms) but the error count stays at zero. No timeouts, no dropped connections, no 5xx responses.
  • /metrics is roughly 2x heavier than /health/live because the Prometheus exporter walks the metrics registry on every call. This is the expected cost.
  • Tail latency stays controlled. Even under 256-way contention, p99 is well under 10 ms and the worst observed call is under 60 ms.

/health/live is the cheapest call possible: no auth, single tokio task spawn, ~10-byte response. It establishes the raw routing/serialization throughput ceiling. /health is similar but slightly larger. /metrics is the only baseline endpoint that does meaningful work — it’s a useful proxy for “the gateway is doing more than memcpy”.

We deliberately do not load-test /api/chat, /v1/runs, or any endpoint that calls into an LLM provider. Those numbers would be dominated by provider latency, not gateway behavior. For agent-loop benchmarks, configure a fake provider via agentzero-testkit and target a separate harness.

  • One process on commodity hardware (8 cores / 16 GB) handles ~50–70k RPS of cheap traffic with sub-10 ms p99.
  • For real workloads dominated by LLM provider calls, the gateway is never the bottleneck — provider latency (hundreds of milliseconds to seconds) dominates by orders of magnitude.
  • If you need >70k RPS of cheap traffic (e.g., aggressive Prometheus scraping), run multiple gateway instances behind a load balancer rather than vertically scaling.
  • The default MiddlewareConfig rate limit is 600 requests / 60 seconds = 10 RPS per global window. That is not a production setting — it’s a safe default. Tune rate_limit_max and rate_limit_per_identity for your deployment in agentzero.toml under [gateway].
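A tuned deployment might raise those limits in agentzero.toml. This is only a sketch: the [gateway] section and the two key names come from the text above, but the values shown and their types are assumptions, not recommended or verified settings.

```toml
# Illustrative agentzero.toml fragment; values are placeholders, not
# recommendations. Tune to your deployment's measured traffic.
[gateway]
rate_limit_max = 6000          # requests per window (assumed integer)
rate_limit_per_identity = 100  # per-identity limit (assumed integer)
```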

If a future run shows the cheap-endpoint RPS dropping by more than 20% on the same hardware:

  1. git bisect between the last known-good run and the current commit, narrowing on commits that touched agentzero-gateway, axum, tower, hyper, or tokio.
  2. Profile with cargo flamegraph or samply to find the new hot spot.
  3. Check the middleware stack — newly added layers are the most common cause of latency regressions.
  4. Check for accidental allocations in hot paths: String::new() in loops, format! where write! would do, etc.
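The allocation pattern in step 4 can be illustrated with a minimal sketch (the function name `render_report` is hypothetical): appending into one pre-allocated buffer with writeln! instead of building a fresh String with format! on every iteration.

```rust
use std::fmt::Write;

// Hypothetical hot-path formatter. Calling format!("req-{id}") inside
// the loop would allocate a new String per iteration; writeln! appends
// into a single buffer allocated once up front.
fn render_report(ids: &[u32]) -> String {
    let mut buf = String::with_capacity(ids.len() * 8);
    for id in ids {
        writeln!(buf, "req-{id}").expect("writing to a String cannot fail");
    }
    buf
}

fn main() {
    print!("{}", render_report(&[1, 2, 3]));
}
```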
Known limitations:

  • The harness does not yet exercise authenticated endpoints. Adding bearer-token support is straightforward when needed.
  • WebSocket and SSE streaming have a different load profile and warrant their own harness.
  • The harness does not run in CI. It is intentionally local-only because the numbers depend on hardware and CI runners are too noisy to produce useful regression thresholds. If you want CI tracking, capture relative numbers between runs on the same runner, not absolute thresholds.