MLX (mlx-lm)

Overview

mlx-lm is Apple’s lightweight inference library for Apple Silicon. Its built-in server exposes an OpenAI-compatible API.

Use MLX when you want a simple, single-model local server. For multi-model management with memory budgets, see oMLX.

Install

pip install mlx-lm

Start the Server

mlx_lm.server --model mlx-community/Qwen2.5-Coder-7B-Instruct-4bit --port 8080

The server exposes an OpenAI-compatible API at http://localhost:8080.

Configure

Add to .agentzero/models.json:

{
  "providers": [
    {
      "name": "mlx-local",
      "type": "openai-compatible",
      "url": "http://localhost:8080",
      "default_model": "mlx-community/Qwen2.5-Coder-7B-Instruct-4bit",
      "is_local": true
    }
  ]
}

Use

az chat --provider mlx-local

MLX vs oMLX

Feature	MLX (mlx-lm)	oMLX
Setup	Simple pip install	App/CLI install
Models	Single model	Multi-model pool
Memory	Manual	Budget-managed
KV cache	In-memory	Tiered (GPU/SSD)
Best for	Quick single-model use	Production multi-model

Token Usage

AgentZero captures usage from MLX’s OpenAI-compatible response format. All local usage shows as $0.00 in cost reports.

Recommended Models

Model	Size	Use Case
`Qwen2.5-Coder-7B-Instruct-4bit`	~4GB	Code generation, tool calling
`Llama-3.2-3B-Instruct-4bit`	~2GB	Fast general purpose
`Mistral-7B-Instruct-v0.3-4bit`	~4GB	General purpose

Troubleshooting

“cannot connect” — ensure mlx_lm.server is running on the configured port
Slow startup — MLX downloads and converts models on first use
Tool calling — not all models support tool calling; Qwen2.5-Coder works well