Ollama 0.2 brings parallel requests and the ability to run multiple models simultaneously
Ollama, the popular application for running large language models locally, has rolled out version 0.2, introducing concurrency support. The update brings two key features: parallel requests and the ability to run multiple models simultaneously.
With concurrency support, Ollama can now serve multiple requests at the same time, with only a small amount of additional memory required for each one. This enhancement unlocks use cases such as handling several chat sessions at once, hosting code completion LLMs for a team, processing different sections of a document in parallel, and running multiple agents simultaneously.
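As an illustration, the sketch below fires several prompts at a locally running Ollama server at once using Python's standard library; with 0.2, the server can process them in parallel rather than one after another. The model name and the parallelism setting mentioned in the comments are assumptions for the example, not values taken from the announcement.

```python
# A minimal sketch of issuing parallel requests to a local Ollama server.
# Assumes Ollama 0.2+ is running on the default port (11434) and that a
# model such as "llama3" has been pulled; the server-side degree of
# parallelism is governed by its configuration (e.g. OLLAMA_NUM_PARALLEL).
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

OLLAMA_URL = "http://localhost:11434/api/generate"

def generate(prompt: str) -> str:
    payload = json.dumps({
        "model": "llama3",   # assumed model name for the example
        "prompt": prompt,
        "stream": False,     # return the full response as one JSON object
    }).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

prompts = [
    "Summarize the benefits of concurrency in one sentence.",
    "Explain retrieval augmented generation in one sentence.",
    "What is an embedding model?",
]

# Each request runs in its own thread; with concurrency support the server
# no longer serializes them, so total wall-clock time approaches that of a
# single call rather than the sum of all calls.
with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
    for answer in pool.map(generate, prompts):
        print(answer)
```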
Additionally, Ollama now supports loading different models at the same time, which benefits several scenarios. For Retrieval-Augmented Generation (RAG), an embedding model and a text completion model can be kept in memory simultaneously. Multiple versions of an agent can run in parallel, and large and small models can operate side by side. Models are loaded and unloaded dynamically based on request demand and available GPU memory.
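The following sketch shows the RAG scenario under these assumptions: an embedding model (here "nomic-embed-text") and a completion model (here "llama3") are both pulled, and the server keeps both resident, subject to its configured model limit and available GPU memory. The model names and document snippets are illustrative only.

```python
# A sketch of a RAG-style flow that keeps two models resident at once:
# an embedding model for retrieval and a completion model for answering.
import json
import urllib.request

BASE = "http://localhost:11434"

def post(path: str, body: dict) -> dict:
    req = urllib.request.Request(
        BASE + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

def embed(text: str) -> list[float]:
    # The embedding model stays loaded alongside the completion model.
    return post("/api/embeddings",
                {"model": "nomic-embed-text", "prompt": text})["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / ((sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5))

docs = [
    "Ollama 0.2 adds parallel request handling.",
    "Ollama can load multiple models at once.",
]
question = "What did Ollama 0.2 add?"

# Retrieve: pick the document whose embedding is closest to the question.
q_vec = embed(question)
best = max(docs, key=lambda d: cosine(embed(d), q_vec))

# Generate: answer with the completion model, grounded in the retrieved text.
answer = post("/api/generate", {
    "model": "llama3",
    "prompt": f"Context: {best}\n\nQuestion: {question}\nAnswer:",
    "stream": False,
})["response"]
print(answer)
```

Because both models remain in memory, the embed and generate calls avoid the load/unload round-trips that a single-model server would incur between steps.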

