oMLX

Overview

oMLX is a Mac-native multi-model inference server optimized for Apple Silicon. It provides:

Multi-model engine pool with LRU eviction and memory budgets
Tiered KV cache (hot GPU / cold SSD) for context persistence
Continuous batching for concurrent requests
Both OpenAI and Anthropic API compatibility — drop-in replacement

oMLX is ideal when you want to run multiple models simultaneously (e.g., a coding model pinned in memory + a larger reasoning model that auto-swaps on demand).

Install

Install oMLX on macOS (menu bar app or CLI). See oMLX documentation for installation instructions.

Configure

oMLX exposes both OpenAI (/v1/chat/completions) and Anthropic (/v1/messages) compatible APIs. Configure AgentZero using either provider type.

OpenAI-compatible mode

{
  "providers": [
    {
      "name": "omlx",
      "type": "openai-compatible",
      "url": "http://localhost:5100",
      "default_model": "mlx-community/Qwen2.5-Coder-7B-Instruct-4bit",
      "is_local": true
    }
  ]
}

Anthropic-compatible mode

{
  "providers": [
    {
      "name": "omlx",
      "type": "anthropic",
      "url": "http://localhost:5100",
      "default_model": "mlx-community/Qwen2.5-Coder-7B-Instruct-4bit",
      "is_local": true
    }
  ]
}

Multi-model setup

Pin a fast coding model and have a larger reasoning model auto-swap:

{
  "providers": [
    {
      "name": "omlx-coder",
      "type": "openai-compatible",
      "url": "http://localhost:5100",
      "default_model": "mlx-community/Qwen2.5-Coder-7B-Instruct-4bit",
      "is_local": true
    },
    {
      "name": "omlx-reasoning",
      "type": "openai-compatible",
      "url": "http://localhost:5100",
      "default_model": "mlx-community/Llama-3.3-70B-Instruct-4bit",
      "is_local": true
    }
  ]
}

Switch at runtime:

az chat --provider omlx-reasoning --model mlx-community/Llama-3.3-70B-Instruct-4bit

oMLX vs Ollama vs MLX

Feature	oMLX	Ollama	MLX (mlx-lm)
Platform	macOS only	Cross-platform	macOS only
Multi-model	Yes (LRU pool)	One at a time	One at a time
KV cache	Tiered (GPU/SSD)	In-memory	In-memory
Memory management	Budget-based	Per-model	Manual
API compat	OpenAI + Anthropic	Native	OpenAI
Context persistence	Across restarts	No	No

Token Usage

oMLX returns streaming usage stats in its API responses. AgentZero captures these automatically through the OpenAI/Anthropic provider parsing. All local usage shows as $0.00 in cost reports.

Performance Tips

Unified memory: Apple Silicon shares memory between CPU and GPU — larger models fit than on discrete GPU systems
Quantization: Use 4-bit quantized models for best memory/performance ratio
Pin models: Use oMLX’s pinning feature to keep your primary coding model in memory
Idle timeout: Configure oMLX’s idle eviction to free memory for other models

Troubleshooting

“cannot connect” — ensure oMLX server is running on the configured port
Model not found — oMLX downloads models on first use; check oMLX logs for progress
Out of memory — check oMLX memory budget settings; try smaller quantizations