
Ollama gains major performance boosts on Apple Silicon thanks to Apple's MLX framework
Ollama is seeing major performance improvements on Apple Silicon, now powered by Apple’s MLX machine learning framework. By building directly on MLX and leveraging Apple’s unified memory architecture, Ollama reports a substantial speedup for users running large language models (LLMs) on Mac hardware.
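From the user’s side, running a model looks the same as before; the speedup comes from the MLX runtime underneath. A minimal sketch using the official Ollama Python client (the model tag is an assumption; any locally pulled model works):
```python
import ollama  # official Ollama Python client: pip install ollama

# Assumption: "llama3.2" has already been pulled (ollama pull llama3.2).
response = ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Why is unified memory fast?"}],
)
print(response["message"]["content"])
```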
While these improvements benefit all Apple Silicon users, those on Apple’s latest M5, M5 Pro, and M5 Max chips gain a further boost from integration with the chips’ GPU Neural Accelerators. This reduces time to first token and increases generation speed, smoothing the experience for both personal assistants and embedded coding agents such as OpenClaw, Claude Code, OpenAI Codex, and OpenCode on macOS.
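Time to first token is easy to observe from the client by streaming the response and timing the first chunk. A hedged sketch using the same Python client (the model tag and prompt are assumptions):
```python
import time
import ollama  # official Ollama Python client: pip install ollama

MODEL = "llama3.2"  # assumption: any locally pulled model tag works here

start = time.perf_counter()
first_token_at = None
chars = 0

# Stream the response so the arrival of the first token is observable.
for chunk in ollama.chat(
    model=MODEL,
    messages=[{"role": "user", "content": "Summarize unified memory in one sentence."}],
    stream=True,
):
    if first_token_at is None:
        first_token_at = time.perf_counter()
    chars += len(chunk["message"]["content"])

total = time.perf_counter() - start
print(f"time to first token: {first_token_at - start:.3f}s")
print(f"total generation:    {total:.3f}s ({chars} chars)")
```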
Beyond Apple’s platform, Ollama now supports NVIDIA’s NVFP4 quantization format, cutting memory and storage requirements for inference without compromising model accuracy. This lets users reproduce inference results consistent with production environments and adds compatibility with models optimized through NVIDIA’s tooling.
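The savings follow from the format itself: NVFP4 stores each weight as a 4-bit floating-point value (E2M1) with a shared 8-bit scale per small block, which works out to roughly 4.5 bits per weight versus 16 for FP16. A back-of-the-envelope sketch (the 8B parameter count is hypothetical; the 16-element block size follows NVIDIA’s published description of the format):
```python
# Rough memory math for NVFP4 versus FP16 weights.
# Assumptions: 4-bit E2M1 values with one FP8 scale per 16-element
# micro-block; per-tensor scale overhead is negligible and ignored.

PARAMS = 8e9             # hypothetical 8B-parameter model
FP16_BITS = 16
NVFP4_BITS = 4 + 8 / 16  # 4 bits per value + amortized 8-bit block scale

fp16_gb = PARAMS * FP16_BITS / 8 / 1e9
nvfp4_gb = PARAMS * NVFP4_BITS / 8 / 1e9

print(f"FP16 weights:  {fp16_gb:.1f} GB")
print(f"NVFP4 weights: {nvfp4_gb:.1f} GB ({fp16_gb / nvfp4_gb:.1f}x smaller)")
```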
Alongside these platform advances, Ollama’s improved cache system reuses cached data across conversations to lower memory use and speed up prompt processing. For branching workflows, such as coding sessions or agent-driven prompts, Ollama takes intelligent cache snapshots, producing faster responses and reducing computational overhead.
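Conceptually, this kind of reuse is prefix matching: when a new prompt shares a leading run of tokens with a cached conversation, only the new suffix needs to be processed. A toy sketch of the general technique (an illustration only, not Ollama’s implementation):
```python
# Toy illustration of prefix cache reuse: tokens shared with a cached
# prompt can be served from cache, and only the suffix is recomputed.

def shared_prefix_len(a: list[str], b: list[str]) -> int:
    """Length of the common leading run of tokens between two prompts."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# Hypothetical branching agent workflow: two prompts share a long prefix.
cached_prompt = "You are a coding agent . Refactor the parser module".split()
new_prompt = "You are a coding agent . Write tests for the parser module".split()

reused = shared_prefix_len(cached_prompt, new_prompt)
print(f"tokens reused from cache: {reused}")
print(f"tokens to recompute:      {len(new_prompt) - reused}")
```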
