Ollama 0.2 brings parallel requests and the ability to run multiple models simultaneously

Ollama, the popular desktop application for running large language models locally, has rolled out version 0.2, introducing concurrency support. This update brings two key features: parallel requests and the ability to run multiple models simultaneously.

With concurrency support, Ollama can now serve multiple requests in parallel, with only minimal additional memory needed per request. This enables use cases such as handling several chat sessions at once, hosting code-completion LLMs for a team, processing different sections of a document in parallel, and running multiple agents simultaneously.
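To illustrate, here is a minimal sketch of sending parallel requests to a local Ollama server from Python. It assumes Ollama 0.2+ is listening on the default port 11434 and that a model named llama3 has already been pulled; both are assumptions for the example, not details from the announcement.

    # A sketch of parallel requests against a local Ollama server.
    # Assumes Ollama 0.2+ on the default port and a pulled "llama3" model.
    import concurrent.futures
    import requests

    OLLAMA_URL = "http://localhost:11434/api/generate"

    def generate(prompt: str) -> str:
        # stream=False returns the whole completion as a single JSON body.
        resp = requests.post(
            OLLAMA_URL,
            json={"model": "llama3", "prompt": prompt, "stream": False},
            timeout=120,
        )
        resp.raise_for_status()
        return resp.json()["response"]

    prompts = [
        "Summarize Hamlet in one sentence.",
        "Explain recursion to a five-year-old.",
        "Write a haiku about the sea.",
    ]

    # Before 0.2 these calls would queue behind one another; with
    # concurrency support a single loaded model serves them in parallel.
    with concurrent.futures.ThreadPoolExecutor() as pool:
        for answer in pool.map(generate, prompts):
            print(answer)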

Additionally, Ollama can now keep multiple models loaded at the same time, which benefits several scenarios. For Retrieval Augmented Generation (RAG), an embedding model and a text-completion model can sit in memory simultaneously; multiple versions of an agent can run in parallel; and large and small models can operate side by side. Models are loaded and unloaded dynamically based on request demand and available GPU memory.
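As a hedged sketch of the RAG case, the snippet below calls an embedding model and a completion model back to back; with 0.2 both can stay resident, so no reload happens between the two requests. The model names nomic-embed-text and llama3 are illustrative choices, not part of the announcement.

    # Sketch of the RAG scenario: an embedding model and a completion
    # model resident at once. Model names are illustrative and must be
    # pulled locally before running this.
    import requests

    BASE = "http://localhost:11434"

    def embed(text: str) -> list[float]:
        resp = requests.post(
            f"{BASE}/api/embeddings",
            json={"model": "nomic-embed-text", "prompt": text},
            timeout=60,
        )
        resp.raise_for_status()
        return resp.json()["embedding"]

    def complete(prompt: str) -> str:
        resp = requests.post(
            f"{BASE}/api/generate",
            json={"model": "llama3", "prompt": prompt, "stream": False},
            timeout=120,
        )
        resp.raise_for_status()
        return resp.json()["response"]

    query = "What does concurrency support mean in Ollama 0.2?"
    vector = embed(query)   # retrieval step would search an index with this
    print(complete(query))  # generation step; no model swap in between

How many models stay loaded and how many requests run in parallel can be tuned through the OLLAMA_MAX_LOADED_MODELS and OLLAMA_NUM_PARALLEL environment variables; by default the server picks values based on available memory.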

by Paul
