Skip to content

oMLX

oMLX is a Mac-native multi-model inference server optimized for Apple Silicon. It provides:

  • Multi-model engine pool with LRU eviction and memory budgets
  • Tiered KV cache (hot GPU / cold SSD) for context persistence
  • Continuous batching for concurrent requests
  • Both OpenAI and Anthropic API compatibility — drop-in replacement

oMLX is ideal when you want to run multiple models simultaneously (e.g., a coding model pinned in memory + a larger reasoning model that auto-swaps on demand).

Install oMLX on macOS (menu bar app or CLI). See oMLX documentation for installation instructions.

oMLX exposes both OpenAI (/v1/chat/completions) and Anthropic (/v1/messages) compatible APIs. Configure AgentZero using either provider type.

{
"providers": [
{
"name": "omlx",
"type": "openai-compatible",
"url": "http://localhost:5100",
"default_model": "mlx-community/Qwen2.5-Coder-7B-Instruct-4bit",
"is_local": true
}
]
}
{
"providers": [
{
"name": "omlx",
"type": "anthropic",
"url": "http://localhost:5100",
"default_model": "mlx-community/Qwen2.5-Coder-7B-Instruct-4bit",
"is_local": true
}
]
}

Pin a fast coding model and have a larger reasoning model auto-swap:

{
"providers": [
{
"name": "omlx-coder",
"type": "openai-compatible",
"url": "http://localhost:5100",
"default_model": "mlx-community/Qwen2.5-Coder-7B-Instruct-4bit",
"is_local": true
},
{
"name": "omlx-reasoning",
"type": "openai-compatible",
"url": "http://localhost:5100",
"default_model": "mlx-community/Llama-3.3-70B-Instruct-4bit",
"is_local": true
}
]
}

Switch at runtime:

Terminal window
az chat --provider omlx-reasoning --model mlx-community/Llama-3.3-70B-Instruct-4bit
FeatureoMLXOllamaMLX (mlx-lm)
PlatformmacOS onlyCross-platformmacOS only
Multi-modelYes (LRU pool)One at a timeOne at a time
KV cacheTiered (GPU/SSD)In-memoryIn-memory
Memory managementBudget-basedPer-modelManual
API compatOpenAI + AnthropicNativeOpenAI
Context persistenceAcross restartsNoNo

oMLX returns streaming usage stats in its API responses. AgentZero captures these automatically through the OpenAI/Anthropic provider parsing. All local usage shows as $0.00 in cost reports.

  • Unified memory: Apple Silicon shares memory between CPU and GPU — larger models fit than on discrete GPU systems
  • Quantization: Use 4-bit quantized models for best memory/performance ratio
  • Pin models: Use oMLX’s pinning feature to keep your primary coding model in memory
  • Idle timeout: Configure oMLX’s idle eviction to free memory for other models
  • “cannot connect” — ensure oMLX server is running on the configured port
  • Model not found — oMLX downloads models on first use; check oMLX logs for progress
  • Out of memory — check oMLX memory budget settings; try smaller quantizations