
Provider Setup Guides

This guide covers setup for the most common providers. AgentZero supports 37 providers — run agentzero providers for the full list.

OpenAI

  1. Get an API key from platform.openai.com/api-keys.
  2. Configure:

agentzero onboard --provider openai --model gpt-4o --yes
agentzero auth setup-token --provider openai --token sk-...

Or set the environment variable:

export OPENAI_API_KEY="sk-..."

TOML config:

[provider]
kind = "openai"
base_url = "https://api.openai.com/v1"
model = "gpt-4o"

Available models: gpt-4o, gpt-4o-mini, gpt-4-turbo, o1, o1-mini, o3-mini


Anthropic

Option A: Browser login (recommended) — uses your claude.ai subscription:

agentzero onboard --provider anthropic --model claude-sonnet-4-6 --yes
agentzero auth login --provider anthropic # opens browser for OAuth

Option B: API key — from console.anthropic.com/settings/keys:

agentzero auth setup-token --provider anthropic --token sk-ant-...

Or set the environment variable:

export ANTHROPIC_API_KEY="sk-ant-..."

TOML config:

[provider]
kind = "anthropic"
base_url = "https://api.anthropic.com"
model = "claude-sonnet-4-6"

Available models: claude-opus-4-6, claude-sonnet-4-6, claude-haiku-4-5-20251001


OpenRouter

OpenRouter gives you access to hundreds of models through a single API key.

  1. Get an API key from openrouter.ai/keys.
  2. Configure:
agentzero onboard --provider openrouter --model anthropic/claude-sonnet-4-6 --yes
agentzero auth setup-token --provider openrouter --token sk-or-v1-...

Or set the environment variable:

export OPENROUTER_API_KEY="sk-or-v1-..."

TOML config:

[provider]
kind = "openrouter"
base_url = "https://openrouter.ai/api/v1"
model = "anthropic/claude-sonnet-4-6"

Model names use the format provider/model — e.g., openai/gpt-4o, google/gemini-pro, meta-llama/llama-3.1-70b-instruct.


Candle Local Model (recommended for local)

AgentZero includes a local LLM provider powered by Candle, Hugging Face’s pure Rust ML framework. No external server, API key, or C++ compiler needed — the model runs entirely in-process.

Default model: Qwen2.5-Coder-3B-Instruct (Q4_K_M quantization, ~2 GB download on first run)

  1. Build with the candle feature (CPU), or candle-metal for Apple Silicon GPU acceleration:

# CPU only
cargo build --release --features candle
# Apple Silicon GPU (Metal) — recommended on Mac
cargo build --release --features candle-metal
# NVIDIA GPU (CUDA)
cargo build --release --features candle-cuda

  2. Configure:
[provider]
kind = "candle"
model = "qwen2.5-coder-3b"

That’s it. On first run, AgentZero automatically downloads the model and tokenizer from HuggingFace Hub to ~/.agentzero/models/ and shows a progress bar.

Tune inference parameters via the [local] config section:

[local]
model = "Qwen/Qwen2.5-Coder-3B-Instruct-GGUF" # HF repo
filename = "qwen2.5-coder-3b-instruct-q4_k_m.gguf"
n_ctx = 8192 # context window (tokens)
temperature = 0.7 # 0.0 = greedy, higher = more random
top_p = 0.9 # nucleus sampling
max_output_tokens = 2048 # max tokens per response
device = "auto" # "auto" | "cpu" | "metal" | "cuda"

You can use any GGUF model file:

# Local file path
model = "/path/to/my-model.gguf"
# HuggingFace repo (org/repo/filename.gguf)
model = "TheBloke/Mistral-7B-Instruct-v0.2-GGUF/mistral-7b-instruct-v0.2.Q4_K_M.gguf"
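Resolution of these two model forms could look like the sketch below: an absolute path is used directly, otherwise an org/repo/filename.gguf string is split into a Hugging Face repo and a filename. Names here are hypothetical, not the actual loader API.

```rust
use std::path::Path;

enum ModelSource {
    LocalFile(String),
    HfRepo { repo: String, filename: String },
}

/// Resolve a `model` value: local GGUF path, or "org/repo/filename.gguf".
fn resolve(model: &str) -> Option<ModelSource> {
    if model.starts_with('/') || Path::new(model).exists() {
        return Some(ModelSource::LocalFile(model.to_string()));
    }
    // "org/repo/filename.gguf" -> repo = "org/repo", filename = "filename.gguf"
    let (repo, filename) = model.rsplit_once('/')?;
    if repo.contains('/') && filename.ends_with(".gguf") {
        return Some(ModelSource::HfRepo {
            repo: repo.to_string(),
            filename: filename.to_string(),
        });
    }
    None
}

fn main() {
    match resolve("TheBloke/Mistral-7B-Instruct-v0.2-GGUF/mistral-7b-instruct-v0.2.Q4_K_M.gguf") {
        Some(ModelSource::HfRepo { repo, filename }) => {
            assert_eq!(repo, "TheBloke/Mistral-7B-Instruct-v0.2-GGUF");
            assert_eq!(filename, "mistral-7b-instruct-v0.2.Q4_K_M.gguf");
        }
        _ => panic!("expected HF repo source"),
    }
}
```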

The Candle provider supports tool calling via Qwen’s <tool_call> prompt format. Tool definitions are automatically injected into the system prompt and model outputs are parsed for tool invocations. Includes fuzzy JSON repair for common small-model mistakes (trailing commas, unquoted keys, key aliases). All built-in tools and plugin tools work with the Candle provider.
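To make the parsing-and-repair step concrete, here is a minimal illustrative sketch (not AgentZero's actual parser): it extracts the first <tool_call> block from model output and applies one common repair, removing a trailing comma before a closing brace or bracket. Function names are hypothetical.

```rust
/// Extract the JSON payload of the first <tool_call>…</tool_call> block.
fn extract_tool_call(output: &str) -> Option<String> {
    let start = output.find("<tool_call>")? + "<tool_call>".len();
    let end = output[start..].find("</tool_call>")? + start;
    Some(output[start..end].trim().to_string())
}

/// Naive trailing-comma repair: drop a comma that is immediately followed
/// (modulo whitespace) by `}` or `]`. A real repair pass must also be
/// string-literal aware so it never edits commas inside quoted values.
fn strip_trailing_commas(json: &str) -> String {
    let chars: Vec<char> = json.chars().collect();
    let mut out = String::with_capacity(json.len());
    for (i, &c) in chars.iter().enumerate() {
        if c == ',' {
            if let Some(&next) = chars[i + 1..].iter().find(|ch| !ch.is_whitespace()) {
                if next == '}' || next == ']' {
                    continue; // skip the trailing comma
                }
            }
        }
        out.push(c);
    }
    out
}

fn main() {
    let raw = "Sure.\n<tool_call>{\"name\": \"read_file\", \"arguments\": {\"path\": \"src/main.rs\",}}</tool_call>";
    let call = extract_tool_call(raw).unwrap();
    assert_eq!(
        strip_trailing_commas(&call),
        "{\"name\": \"read_file\", \"arguments\": {\"path\": \"src/main.rs\"}}"
    );
}
```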

The Candle provider streams tokens as they are generated — you see output incrementally, not all at once.

The Candle provider includes an in-process tokenizer, enabling accurate token estimation for context window management. The estimate_tokens() method is available on the Provider trait for context overflow prevention.

Build with candle-metal (Apple Silicon) or candle-cuda (NVIDIA) for GPU-accelerated inference. Set device = "auto" (default) to auto-detect, or "metal" / "cuda" to force a specific backend. Falls back to CPU if the GPU feature is not enabled or unavailable.

When device = "auto", AgentZero now consults a runtime hardware capability probe (agentzero_core::device::detect()) before attempting any GPU init. The probe inspects the host without linking against CUDA or Metal at compile time — it checks for /System/Library/Frameworks/Metal.framework and /System/Library/Frameworks/CoreML.framework on Apple targets, and /proc/driver/nvidia plus nvidia-smi on PATH on Linux. The capability profile (cores, memory, GPU type, NPU type, detection confidence) is logged at startup so you can see exactly which backend was selected and why.

The probe is advisory: it informs which feature-gated init path to attempt, but the final selection still goes through the same Device::new_metal(0) / Device::new_cuda(0) calls that previously gated on cargo features alone. This means a misconfigured host (e.g., NVIDIA driver installed but unloaded) still falls back to CPU cleanly with a warn! log line, not a crash.
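As a rough sketch of this style of probing, the following checks the same host artifacts the docs mention without linking against Metal or CUDA. The enum and function names are illustrative, not the real agentzero_core::device API.

```rust
use std::path::Path;
use std::process::Command;

#[derive(Debug)]
enum GpuBackend { Metal, Cuda, None }

/// Advisory probe: inspect the host for GPU evidence without linking
/// against Metal or CUDA at compile time.
fn probe_gpu() -> GpuBackend {
    // macOS: presence of the Metal framework bundle.
    if cfg!(target_os = "macos")
        && Path::new("/System/Library/Frameworks/Metal.framework").exists()
    {
        return GpuBackend::Metal;
    }
    // Linux: NVIDIA kernel driver loaded AND nvidia-smi usable on PATH.
    if cfg!(target_os = "linux") {
        let driver_loaded = Path::new("/proc/driver/nvidia").exists();
        let smi_ok = Command::new("nvidia-smi")
            .arg("-L")
            .output()
            .map(|o| o.status.success())
            .unwrap_or(false);
        if driver_loaded && smi_ok {
            return GpuBackend::Cuda;
        }
    }
    GpuBackend::None // no GPU evidence: fall back to CPU
}

fn main() {
    // The result only informs which feature-gated init path to *attempt*.
    println!("probed backend: {:?}", probe_gpu());
}
```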

Compile-time guards also prevent the most common feature-flag mistakes: building with candle-cuda on macOS, candle-metal on Linux, both at once, or any local-inference feature on wasm32. Each guard produces a multi-line compile_error! explaining both the reason and the fix.

  • The default 3B model is best for simple tasks — coding assistance, file operations, basic research
  • For complex multi-step pipelines, consider using a larger model or a cloud provider
  • Vision/image inputs are not supported

Builtin Provider (llama.cpp)

The builtin provider uses llama.cpp via C++ bindings. It works but requires a C++ compiler and does not support real streaming (output appears all at once). Prefer the candle provider above.

cargo build --release --features local-model

[provider]
kind = "builtin"
model = "qwen2.5-coder-3b"

Ollama

Ollama runs models locally. No API key needed.

  1. Install Ollama from ollama.com.
  2. Pull a model:

ollama pull llama3.1:8b

  3. Start Ollama (it runs on http://localhost:11434 by default):

ollama serve

  4. Configure AgentZero:

agentzero onboard --provider ollama --model llama3.1:8b --yes

TOML config:

[provider]
kind = "ollama"
base_url = "http://localhost:11434/v1"
model = "llama3.1:8b"

AgentZero can auto-discover local Ollama instances:

agentzero local discover

Other OpenAI-compatible local servers follow the same pattern.

LM Studio:

[provider]
kind = "lmstudio"
base_url = "http://localhost:1234/v1"
model = "your-model-name"

llama.cpp server:

[provider]
kind = "llamacpp"
base_url = "http://localhost:8080/v1"
model = "default"

vLLM:

[provider]
kind = "vllm"
base_url = "http://localhost:8000/v1"
model = "your-model-name"

Hosted Providers with Built-in Base URLs

These providers have built-in base URLs — you only need to set the API key:

| Provider     | Kind       | Env Var          |
| ------------ | ---------- | ---------------- |
| Groq         | groq       | GROQ_API_KEY     |
| Mistral      | mistral    | MISTRAL_API_KEY  |
| xAI (Grok)   | xai        | XAI_API_KEY      |
| DeepSeek     | deepseek   | DEEPSEEK_API_KEY |
| Together AI  | together   | TOGETHER_API_KEY |
| Fireworks AI | fireworks  |                  |
| Perplexity   | perplexity |                  |
| Cohere       | cohere     |                  |
| NVIDIA NIM   | nvidia     |                  |

Example for Groq:

agentzero onboard --provider groq --model llama-3.1-70b-versatile --yes
export GROQ_API_KEY="gsk_..."

Custom Endpoints

For any OpenAI-compatible API not in the catalog:

[provider]
kind = "custom:https://my-api.example.com/v1"
model = "my-model"

For Anthropic-compatible APIs:

[provider]
kind = "anthropic-custom:https://my-proxy.example.com"
model = "claude-sonnet-4-6"

Transport Settings

Per-provider transport settings control timeouts, retries, and circuit breaking:

[provider.transport]
timeout_ms = 30000 # request timeout (default: 30s)
max_retries = 3 # retry count on failure (default: 3)
circuit_breaker_threshold = 5 # failures before circuit opens (default: 5)
circuit_breaker_reset_ms = 30000 # time before half-open retry (default: 30s)

Retry policy: Retries on 429 Too Many Requests and 5xx server errors with exponential backoff and jitter. Honors Retry-After headers when present. Non-retryable errors (401, 403, 404) fail immediately.
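The retry schedule described above can be sketched as follows. This is an illustrative model, not AgentZero's implementation; the base delay, cap, and helper names are assumptions, and the jitter factor is passed in by the caller.

```rust
use std::time::Duration;

/// Only 429 and 5xx are worth retrying; 401/403/404 fail immediately.
fn is_retryable(status: u16) -> bool {
    status == 429 || (500..600).contains(&status)
}

/// Compute the delay before the next attempt: a Retry-After header always
/// wins; otherwise use capped exponential backoff scaled by a jitter factor
/// in [0, 1] supplied by the caller.
fn retry_delay(attempt: u32, retry_after: Option<Duration>, jitter: f64) -> Duration {
    if let Some(d) = retry_after {
        return d; // honor the server's Retry-After
    }
    let base_ms = 500u64;   // assumed base delay
    let cap_ms = 30_000u64; // assumed cap (30 s)
    let exp = base_ms.saturating_mul(1u64 << attempt.min(16)).min(cap_ms);
    Duration::from_millis((exp as f64 * jitter) as u64)
}

fn main() {
    assert!(is_retryable(429));
    assert!(is_retryable(503));
    assert!(!is_retryable(401)); // non-retryable: fail immediately
    // Retry-After takes precedence over the computed backoff.
    assert_eq!(
        retry_delay(3, Some(Duration::from_secs(10)), 1.0),
        Duration::from_secs(10)
    );
}
```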

Circuit breaker: Tracks consecutive failures per provider. After reaching the threshold, the circuit opens and rejects requests for the reset duration. It then transitions to half-open, allowing a single probe request. A successful probe closes the circuit; a failed probe reopens it.
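The state machine above can be modeled in a few lines. This is a minimal sketch; the struct and method names are illustrative, not AgentZero's actual types.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq)]
enum State { Closed, Open, HalfOpen }

struct CircuitBreaker {
    state: State,
    failures: u32,       // consecutive failures
    threshold: u32,      // circuit_breaker_threshold
    reset: Duration,     // circuit_breaker_reset_ms
    opened_at: Option<Instant>,
}

impl CircuitBreaker {
    fn new(threshold: u32, reset: Duration) -> Self {
        Self { state: State::Closed, failures: 0, threshold, reset, opened_at: None }
    }

    /// Should a request be allowed through right now?
    fn allow(&mut self) -> bool {
        if self.state == State::Open {
            // After the reset window, let a single probe request through.
            if self.opened_at.map_or(false, |t| t.elapsed() >= self.reset) {
                self.state = State::HalfOpen;
            }
        }
        self.state != State::Open
    }

    fn on_success(&mut self) {
        self.state = State::Closed; // a successful probe closes the circuit
        self.failures = 0;
    }

    fn on_failure(&mut self) {
        self.failures += 1;
        // A failed probe reopens; otherwise open once the threshold is hit.
        if self.state == State::HalfOpen || self.failures >= self.threshold {
            self.state = State::Open;
            self.opened_at = Some(Instant::now());
        }
    }
}

fn main() {
    let mut cb = CircuitBreaker::new(5, Duration::from_millis(10));
    for _ in 0..5 { cb.on_failure(); } // 5 consecutive failures
    assert!(!cb.allow());              // circuit open: reject
    std::thread::sleep(Duration::from_millis(15));
    assert!(cb.allow());               // half-open: single probe allowed
    cb.on_success();
    assert!(cb.allow());               // closed again
}
```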

Observability: Provider requests are instrumented with tracing spans (anthropic_complete, openai_stream, etc.). Request/response events log at info! level with provider, model, status, body size, and latency. Retries log at warn! level. Circuit breaker state transitions log at info!/warn! level.


Fallback Providers

Configure backup providers that activate automatically when the primary provider fails (circuit breaker open, 5xx errors, timeouts):

[provider]
kind = "anthropic"
base_url = "https://api.anthropic.com"
model = "claude-sonnet-4-6"
[[provider.fallback_providers]]
kind = "openai"
base_url = "https://api.openai.com/v1"
model = "gpt-4o"
api_key_env = "OPENAI_API_KEY"
[[provider.fallback_providers]]
kind = "openrouter"
base_url = "https://openrouter.ai/api/v1"
model = "anthropic/claude-sonnet-4-6"
api_key_env = "OPENROUTER_API_KEY"

Providers are tried in order. The first successful response is used. Each fallback entry requires:

| Field       | Description                                          |
| ----------- | ---------------------------------------------------- |
| kind        | Provider type (openai, anthropic, openrouter, etc.)  |
| base_url    | API endpoint URL                                     |
| model       | Model identifier for this provider                   |
| api_key_env | Environment variable name holding the API key        |

Fallback events emit the provider_fallback_total{from, to} Prometheus metric so you can monitor how often failover occurs.
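The try-in-order behavior amounts to a loop like the one below. This is an illustrative sketch with hypothetical names; the real implementation also emits the metric and log events described above.

```rust
/// Try each provider in order; the first Ok response wins, and the last
/// error is surfaced if every provider fails.
fn complete_with_fallback<F>(providers: &[&str], mut call: F) -> Result<String, String>
where
    F: FnMut(&str) -> Result<String, String>,
{
    let mut last_err = String::from("no providers configured");
    for p in providers {
        match call(p) {
            Ok(resp) => return Ok(resp), // first success is used
            Err(e) => last_err = e,      // real impl: log + increment metric here
        }
    }
    Err(last_err)
}

fn main() {
    let providers = ["anthropic", "openai", "openrouter"];
    // Simulate the primary being down (e.g. circuit breaker open).
    let result = complete_with_fallback(&providers, |p| {
        if p == "anthropic" { Err("circuit open".into()) } else { Ok(format!("via {p}")) }
    });
    assert_eq!(result.unwrap(), "via openai");
}
```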


Credential Pools

Distribute requests across multiple API keys to avoid rate limits. Each key gets independent cooldown tracking: 1 hour after a 429 (rate limit), 24 hours after persistent errors.

[provider.credential_pool]
strategy = "round-robin" # fill-first | round-robin | random
keys = ["OPENAI_KEY_1", "OPENAI_KEY_2", "OPENAI_KEY_3"]

| Strategy    | Behavior                                                 |
| ----------- | -------------------------------------------------------- |
| fill-first  | Use the first key until exhausted, then move to the next |
| round-robin | Cycle through keys sequentially (default)                |
| random      | Pick randomly from available keys                        |

When all keys are in cooldown, requests fail with a clear error message. Use agentzero providers quota to see per-key status.
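A round-robin pool with per-key cooldowns can be sketched like this. The types and method names are hypothetical; the 1-hour cooldown mirrors the 429 behavior described above.

```rust
use std::time::{Duration, Instant};

struct PooledKey {
    env_var: &'static str,
    cooldown_until: Option<Instant>,
}

struct CredentialPool {
    keys: Vec<PooledKey>,
    next: usize,
}

impl CredentialPool {
    fn new(vars: &[&'static str]) -> Self {
        let keys = vars.iter()
            .map(|v| PooledKey { env_var: v, cooldown_until: None })
            .collect();
        Self { keys, next: 0 }
    }

    /// Round-robin: cycle through keys, skipping any still in cooldown.
    /// Returns None when every key is cooling down (clear error upstream).
    fn pick(&mut self) -> Option<&'static str> {
        for _ in 0..self.keys.len() {
            let i = self.next;
            self.next = (self.next + 1) % self.keys.len();
            let key = &self.keys[i];
            let cooling = key.cooldown_until.map_or(false, |t| Instant::now() < t);
            if !cooling {
                return Some(key.env_var);
            }
        }
        None
    }

    /// A 429 puts the key on a 1-hour cooldown.
    fn on_rate_limited(&mut self, env_var: &str) {
        if let Some(k) = self.keys.iter_mut().find(|k| k.env_var == env_var) {
            k.cooldown_until = Some(Instant::now() + Duration::from_secs(3600));
        }
    }
}

fn main() {
    let mut pool = CredentialPool::new(&["OPENAI_KEY_1", "OPENAI_KEY_2", "OPENAI_KEY_3"]);
    assert_eq!(pool.pick(), Some("OPENAI_KEY_1"));
    assert_eq!(pool.pick(), Some("OPENAI_KEY_2"));
    pool.on_rate_limited("OPENAI_KEY_3");
    assert_eq!(pool.pick(), Some("OPENAI_KEY_1")); // KEY_3 skipped while cooling down
}
```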


Model Routing by Complexity

Route queries to cheaper models when the task is simple, preserving premium models for complex work. The complexity scorer evaluates character count, word count, code presence, and keyword signals to classify each query as Simple, Medium, or Complex.

# Define model routes by complexity tier
[[model_routes]]
hint = "simple"
provider = "anthropic"
model = "claude-haiku-4-5-20251001"
[[model_routes]]
hint = "medium"
provider = "anthropic"
model = "claude-sonnet-4-6"
[[model_routes]]
hint = "complex"
provider = "anthropic"
model = "claude-opus-4-6"

Classification examples:

  • “hello” → Simple → Haiku
  • “explain how authentication works and compare with OAuth” → Medium → Sonnet
  • “implement a REST API with JWT tokens, refresh flow, and rate limiting” → Complex → Opus

The scorer is conservative: uncertain queries (composite score 0.15–0.35) default to Medium, not Simple.
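The shape of such a scorer can be sketched as below. The signal weights and keyword list are invented for illustration; only the 0.15–0.35 uncertainty band defaulting to Medium is taken from the description above.

```rust
#[derive(Debug, PartialEq)]
enum Tier { Simple, Medium, Complex }

/// Composite score in roughly [0, 1] from length, code, and keyword signals.
fn score(query: &str) -> f64 {
    let words = query.split_whitespace().count() as f64;
    let has_code = query.contains("```") || query.contains("fn ") || query.contains("def ");
    let keywords = ["implement", "refactor", "design", "architecture", "compare"]
        .iter()
        .filter(|&&k| query.to_lowercase().contains(k))
        .count() as f64;
    (words / 40.0).min(0.5)
        + if has_code { 0.25 } else { 0.0 }
        + (keywords * 0.15).min(0.3)
}

fn classify(query: &str) -> Tier {
    let s = score(query);
    if s < 0.15 {
        Tier::Simple
    } else if s <= 0.35 {
        Tier::Medium // conservative: the uncertain band maps to Medium, not Simple
    } else {
        Tier::Complex
    }
}

fn main() {
    assert_eq!(classify("hello"), Tier::Simple);
    assert_eq!(
        classify("explain how authentication works and compare with OAuth"),
        Tier::Medium
    );
    assert_eq!(
        classify("implement a REST API with JWT tokens, refresh flow, and rate limiting"),
        Tier::Complex
    );
}
```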


Cost Estimation

The CostEstimateLayer estimates input token costs before each LLM call and logs a warning when the estimated cost exceeds a configurable threshold. This provides early visibility into expensive operations without blocking them — use CostCapLayer for hard limits.
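The estimate itself is simple arithmetic, sketched below with an illustrative per-token price and threshold (these numbers are assumptions, not AgentZero defaults).

```rust
/// Estimated input cost in USD for a prompt of `input_tokens` tokens,
/// given a price in USD per million input tokens.
fn estimate_cost_usd(input_tokens: u64, usd_per_million_input: f64) -> f64 {
    input_tokens as f64 / 1_000_000.0 * usd_per_million_input
}

fn main() {
    let tokens = 80_000;                       // estimated prompt size
    let cost = estimate_cost_usd(tokens, 3.0); // e.g. $3 / 1M input tokens
    assert!((cost - 0.24).abs() < 1e-9);
    let threshold = 0.10;
    if cost > threshold {
        // The layer warns but does not block; CostCapLayer enforces hard limits.
        eprintln!("warning: estimated input cost ${cost:.2} exceeds ${threshold:.2}");
    }
}
```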


Prompt Caching

The PromptCacheLayer annotates the system prompt and the last N messages with Anthropic's cache_control markers, enabling up to 90% input token cost reduction for repeated prefixes. This is Anthropic-specific — other providers ignore the annotation and pass through unchanged.
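The annotation pass amounts to marking a handful of messages. A minimal sketch, with a hypothetical Message type standing in for the real request structs:

```rust
#[derive(Debug)]
struct Message {
    role: &'static str,
    content: String,
    // Would serialize as {"cache_control": {"type": "ephemeral"}} for Anthropic;
    // other providers simply never see this field.
    cache_control: bool,
}

/// Mark the system prompt and the last `last_n` messages as cache breakpoints.
fn annotate(system: &mut Message, messages: &mut [Message], last_n: usize) {
    system.cache_control = true;
    let len = messages.len();
    for m in messages.iter_mut().skip(len.saturating_sub(last_n)) {
        m.cache_control = true;
    }
}

fn main() {
    let mut system = Message {
        role: "system",
        content: "You are AgentZero.".into(),
        cache_control: false,
    };
    let mut msgs: Vec<Message> = (0..4)
        .map(|i| Message { role: "user", content: format!("msg {i}"), cache_control: false })
        .collect();
    annotate(&mut system, &mut msgs, 2);
    assert!(system.cache_control);
    assert!(!msgs[1].cache_control);                      // older prefix untouched
    assert!(msgs[2].cache_control && msgs[3].cache_control); // last 2 marked
}
```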


Adaptive Reasoning

When adaptive_reasoning is enabled, the reasoning budget adjusts dynamically based on query complexity:

| Complexity | Reasoning     |
| ---------- | ------------- |
| Simple     | Disabled      |
| Medium     | Medium effort |
| Complex    | High effort   |

[runtime]
reasoning_enabled = true
adaptive_reasoning = true

Useful Commands

# List all supported providers (marks active one)
agentzero providers
# Check provider quota and API key status
agentzero providers quota
# Diagnose model availability
agentzero doctor models