oMLX
Overview
Section titled “Overview”oMLX is a Mac-native multi-model inference server optimized for Apple Silicon. It provides:
- Multi-model engine pool with LRU eviction and memory budgets
- Tiered KV cache (hot GPU / cold SSD) for context persistence
- Continuous batching for concurrent requests
- Both OpenAI and Anthropic API compatibility — drop-in replacement
oMLX is ideal when you want to run multiple models simultaneously (e.g., a coding model pinned in memory + a larger reasoning model that auto-swaps on demand).
Install
Section titled “Install”Install oMLX on macOS (menu bar app or CLI). See oMLX documentation for installation instructions.
Configure
Section titled “Configure”oMLX exposes both OpenAI (/v1/chat/completions) and Anthropic (/v1/messages) compatible APIs. Configure AgentZero using either provider type.
OpenAI-compatible mode
Section titled “OpenAI-compatible mode”{ "providers": [ { "name": "omlx", "type": "openai-compatible", "url": "http://localhost:5100", "default_model": "mlx-community/Qwen2.5-Coder-7B-Instruct-4bit", "is_local": true } ]}Anthropic-compatible mode
Section titled “Anthropic-compatible mode”{ "providers": [ { "name": "omlx", "type": "anthropic", "url": "http://localhost:5100", "default_model": "mlx-community/Qwen2.5-Coder-7B-Instruct-4bit", "is_local": true } ]}Multi-model setup
Section titled “Multi-model setup”Pin a fast coding model and have a larger reasoning model auto-swap:
{ "providers": [ { "name": "omlx-coder", "type": "openai-compatible", "url": "http://localhost:5100", "default_model": "mlx-community/Qwen2.5-Coder-7B-Instruct-4bit", "is_local": true }, { "name": "omlx-reasoning", "type": "openai-compatible", "url": "http://localhost:5100", "default_model": "mlx-community/Llama-3.3-70B-Instruct-4bit", "is_local": true } ]}Switch at runtime:
az chat --provider omlx-reasoning --model mlx-community/Llama-3.3-70B-Instruct-4bitoMLX vs Ollama vs MLX
Section titled “oMLX vs Ollama vs MLX”| Feature | oMLX | Ollama | MLX (mlx-lm) |
|---|---|---|---|
| Platform | macOS only | Cross-platform | macOS only |
| Multi-model | Yes (LRU pool) | One at a time | One at a time |
| KV cache | Tiered (GPU/SSD) | In-memory | In-memory |
| Memory management | Budget-based | Per-model | Manual |
| API compat | OpenAI + Anthropic | Native | OpenAI |
| Context persistence | Across restarts | No | No |
Token Usage
Section titled “Token Usage”oMLX returns streaming usage stats in its API responses. AgentZero captures these automatically through the OpenAI/Anthropic provider parsing. All local usage shows as $0.00 in cost reports.
Performance Tips
Section titled “Performance Tips”- Unified memory: Apple Silicon shares memory between CPU and GPU — larger models fit than on discrete GPU systems
- Quantization: Use 4-bit quantized models for best memory/performance ratio
- Pin models: Use oMLX’s pinning feature to keep your primary coding model in memory
- Idle timeout: Configure oMLX’s idle eviction to free memory for other models
Troubleshooting
Section titled “Troubleshooting”- “cannot connect” — ensure oMLX server is running on the configured port
- Model not found — oMLX downloads models on first use; check oMLX logs for progress
- Out of memory — check oMLX memory budget settings; try smaller quantizations