MLX (mlx-lm)
Overview
Section titled “Overview”mlx-lm is Apple’s lightweight inference library for Apple Silicon. Its built-in server exposes an OpenAI-compatible API.
Use MLX when you want a simple, single-model local server. For multi-model management with memory budgets, see oMLX.
Install
Section titled “Install”pip install mlx-lmStart the Server
Section titled “Start the Server”mlx_lm.server --model mlx-community/Qwen2.5-Coder-7B-Instruct-4bit --port 8080The server exposes an OpenAI-compatible API at http://localhost:8080.
Configure
Section titled “Configure”Add to .agentzero/models.json:
{ "providers": [ { "name": "mlx-local", "type": "openai-compatible", "url": "http://localhost:8080", "default_model": "mlx-community/Qwen2.5-Coder-7B-Instruct-4bit", "is_local": true } ]}az chat --provider mlx-localMLX vs oMLX
Section titled “MLX vs oMLX”| Feature | MLX (mlx-lm) | oMLX |
|---|---|---|
| Setup | Simple pip install | App/CLI install |
| Models | Single model | Multi-model pool |
| Memory | Manual | Budget-managed |
| KV cache | In-memory | Tiered (GPU/SSD) |
| Best for | Quick single-model use | Production multi-model |
Token Usage
Section titled “Token Usage”AgentZero captures usage from MLX’s OpenAI-compatible response format. All local usage shows as $0.00 in cost reports.
Recommended Models
Section titled “Recommended Models”| Model | Size | Use Case |
|---|---|---|
Qwen2.5-Coder-7B-Instruct-4bit | ~4GB | Code generation, tool calling |
Llama-3.2-3B-Instruct-4bit | ~2GB | Fast general purpose |
Mistral-7B-Instruct-v0.3-4bit | ~4GB | General purpose |
Troubleshooting
Section titled “Troubleshooting”- “cannot connect” — ensure
mlx_lm.serveris running on the configured port - Slow startup — MLX downloads and converts models on first use
- Tool calling — not all models support tool calling; Qwen2.5-Coder works well