Ollama 0.2 brings parallel requests and the ability to run multiple models simultaneously
Ollama, the popular application for running large language models locally, has rolled out version 0.2, introducing concurrency support. The update brings two key features: parallel requests and the ability to run multiple models simultaneously.
With concurrency support, Ollama can now serve multiple requests at the same time, with only a small amount of additional memory required for each one. This enhancement unlocks use cases such as handling several chat sessions at once, hosting code completion LLMs for a team, processing different sections of a document in parallel, and running multiple agents simultaneously.
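As an illustration, the sketch below fires several prompts at a locally running Ollama server at once using Python's standard library; with 0.2, the server can process them in parallel rather than one after another. The model name and the parallelism setting mentioned in the comments are assumptions for the example, not values taken from the announcement.

```python
# A minimal sketch of issuing parallel requests to a local Ollama server.
# Assumes Ollama 0.2+ is running on the default port (11434) and that a
# model such as "llama3" has been pulled; the server-side degree of
# parallelism is governed by its configuration (e.g. OLLAMA_NUM_PARALLEL).
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

OLLAMA_URL = "http://localhost:11434/api/generate"

def generate(prompt: str) -> str:
    payload = json.dumps({
        "model": "llama3",   # assumed model name for the example
        "prompt": prompt,
        "stream": False,     # return the full response as one JSON object
    }).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

prompts = [
    "Summarize the benefits of concurrency in one sentence.",
    "Explain retrieval augmented generation in one sentence.",
    "What is an embedding model?",
]

# Each request runs in its own thread; with concurrency support the server
# no longer serializes them, so total wall-clock time approaches that of a
# single call rather than the sum of all calls.
with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
    for answer in pool.map(generate, prompts):
        print(answer)
```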
Additionally, Ollama now supports loading different models at the same time, which benefits several scenarios. For Retrieval-Augmented Generation (RAG), an embedding model and a text completion model can be kept in memory simultaneously. Multiple versions of an agent can run in parallel, and large and small models can operate side by side. Models are loaded and unloaded dynamically based on request demand and available GPU memory.
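The following sketch shows the RAG scenario under these assumptions: an embedding model (here "nomic-embed-text") and a completion model (here "llama3") are both pulled, and the server keeps both resident, subject to its configured model limit and available GPU memory. The model names and document snippets are illustrative only.

```python
# A sketch of a RAG-style flow that keeps two models resident at once:
# an embedding model for retrieval and a completion model for answering.
import json
import urllib.request

BASE = "http://localhost:11434"

def post(path: str, body: dict) -> dict:
    req = urllib.request.Request(
        BASE + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

def embed(text: str) -> list[float]:
    # The embedding model stays loaded alongside the completion model.
    return post("/api/embeddings",
                {"model": "nomic-embed-text", "prompt": text})["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / ((sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5))

docs = [
    "Ollama 0.2 adds parallel request handling.",
    "Ollama can load multiple models at once.",
]
question = "What did Ollama 0.2 add?"

# Retrieve: pick the document whose embedding is closest to the question.
q_vec = embed(question)
best = max(docs, key=lambda d: cosine(embed(d), q_vec))

# Generate: answer with the completion model, grounded in the retrieved text.
answer = post("/api/generate", {
    "model": "llama3",
    "prompt": f"Context: {best}\n\nQuestion: {question}\nAnswer:",
    "stream": False,
})["response"]
print(answer)
```

Because both models remain in memory, the embed and generate calls avoid the load/unload round-trips that a single-model server would incur between steps.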

