Skip to content

MLX (mlx-lm)

mlx-lm is Apple’s lightweight inference library for Apple Silicon. Its built-in server exposes an OpenAI-compatible API.

Use MLX when you want a simple, single-model local server. For multi-model management with memory budgets, see oMLX.

Terminal window
pip install mlx-lm
Terminal window
mlx_lm.server --model mlx-community/Qwen2.5-Coder-7B-Instruct-4bit --port 8080

The server exposes an OpenAI-compatible API at http://localhost:8080.

Add to .agentzero/models.json:

{
"providers": [
{
"name": "mlx-local",
"type": "openai-compatible",
"url": "http://localhost:8080",
"default_model": "mlx-community/Qwen2.5-Coder-7B-Instruct-4bit",
"is_local": true
}
]
}
Terminal window
az chat --provider mlx-local
FeatureMLX (mlx-lm)oMLX
SetupSimple pip installApp/CLI install
ModelsSingle modelMulti-model pool
MemoryManualBudget-managed
KV cacheIn-memoryTiered (GPU/SSD)
Best forQuick single-model useProduction multi-model

AgentZero captures usage from MLX’s OpenAI-compatible response format. All local usage shows as $0.00 in cost reports.

ModelSizeUse Case
Qwen2.5-Coder-7B-Instruct-4bit~4GBCode generation, tool calling
Llama-3.2-3B-Instruct-4bit~2GBFast general purpose
Mistral-7B-Instruct-v0.3-4bit~4GBGeneral purpose
  • “cannot connect” — ensure mlx_lm.server is running on the configured port
  • Slow startup — MLX downloads and converts models on first use
  • Tool calling — not all models support tool calling; Qwen2.5-Coder works well