Pricing
Free
Get started with Ollama
- Automate coding, document analysis, and other tasks with open models
- Keep your data private
- Run models on your hardware
- Access cloud models
- CLI, API, and desktop apps
- 40,000+ community integrations
- Unlimited public models
Pro
Solve harder tasks, faster
- Run 3 cloud models at a time
- 50x more cloud usage than Free
- Upload and share private models
Max
For your most demanding work
- Run 10 cloud models at a time
- 5x more usage than Pro
Frequently asked questions
Models
- Which models are available?
See the full list of cloud-enabled models here.
- Do models support tool calling?
Yes. Cloud models that are trained for tool use are tested for tool calling, including against real agent workflows, before they go live. If something isn't working, let us know at [email protected].
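As an illustration, here is a minimal tool-calling sketch using the Ollama Python client. The model name, prompt, and `get_weather` helper are placeholders rather than part of Ollama's documented examples, and the schema-from-function behavior assumes a recent version of the Python client.

```python
# Minimal tool-calling sketch with the Ollama Python client.
# The model name, prompt, and get_weather helper are illustrative placeholders.
import ollama


def get_weather(city: str) -> str:
    """Hypothetical tool the model may choose to call."""
    return f"Sunny and 22°C in {city}"


response = ollama.chat(
    model="gpt-oss:120b-cloud",  # substitute any cloud model that supports tools
    messages=[{"role": "user", "content": "What's the weather in Toronto right now?"}],
    tools=[get_weather],  # recent clients build the JSON tool schema from the function
)

# If the model decided to call the tool, run it and print the result.
for call in response.message.tool_calls or []:
    if call.function.name == "get_weather":
        print(get_weather(**call.function.arguments))
```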
- What quantization or data format do cloud models use?
Native weights, as released by the model provider. On modern NVIDIA hardware, models may use accelerated data formats supported by Blackwell and Vera Rubin architectures (e.g. NVFP4).
- How fast is Ollama?
Speed depends on model size, architecture, and hardware optimization. We target, and continuously monitor, low time-to-first-token and high throughput across all cloud models. Priority tiers with faster performance may be available in the future.
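If you want to gauge latency yourself, a rough sketch like the one below measures time-to-first-token on a streaming request. The model name is a placeholder, and the numbers you see will depend on the model, current load, and your network.

```python
# Rough time-to-first-token measurement over a streaming chat request.
# The model name is a placeholder; results depend on the model, load, and network.
import time

import ollama

start = time.perf_counter()
first_token_at = None

stream = ollama.chat(
    model="gpt-oss:120b-cloud",
    messages=[{"role": "user", "content": "Explain prompt caching in two sentences."}],
    stream=True,
)

for chunk in stream:
    if first_token_at is None and chunk.message.content:
        first_token_at = time.perf_counter()

if first_token_at is not None:
    print(f"time to first token: {first_token_at - start:.2f}s")
```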
Usage
- What are the usage limits for each plan?
Running models on your own hardware is always unlimited. Cloud usage varies by plan:
| Plan | Usage | Example use cases |
|------|-------|-------------------|
| Free | Light usage | Chatting with models, evaluating larger models, coding and AI assistants with smaller models |
| Pro | Day-to-day work | Larger models, coding automation, deep research |
| Max | Heavy, sustained usage | Continuous agent tasks, multiple concurrent agents, large models over extended sessions |

Each plan has session limits that reset every 5 hours and weekly limits that reset every 7 days.
- How is usage measured?
Usage reflects actual utilization of Ollama's cloud infrastructure, primarily GPU time, which depends on model size and request duration. Shorter requests and prompts that share cached context consume less.
This is different from fixed token- or request-based plans: Ollama doesn't cap you at a set number of tokens. As hardware and model architectures get more efficient, you'll get more out of your plan over time.
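As an illustration of cache-friendly prompting, the sketch below sends two requests that share the same long prefix (a system prompt plus a document). The model name and document text are placeholders; per the answer above, follow-up requests that reuse cached context should consume less usage than the first.

```python
# Sketch of cache-friendly prompting: both requests share the same long prefix,
# so the second one can reuse cached context and consume less GPU time.
# The document text and model name are placeholders.
import ollama

long_document = "...full text of the document being analyzed..."  # placeholder
system_prompt = "You are a contract-review assistant.\n\n" + long_document

questions = [
    "List the termination clauses.",
    "Which obligations survive termination?",
]

for question in questions:
    response = ollama.chat(
        model="gpt-oss:120b-cloud",
        messages=[
            {"role": "system", "content": system_prompt},  # shared, cacheable prefix
            {"role": "user", "content": question},
        ],
    )
    print(response.message.content[:200])
```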
- Can I purchase additional usage?
Soon. Additional usage at competitive per-token rates, including cache-aware pricing, is coming.
- How much more usage does Pro include?
50x more than Free.
- How much more usage does Max include?
5x more than Pro.
- How do I know when I've hit my limit?
Check your usage here anytime. At 90% of your plan's limit, Ollama sends an email reminder. You can turn this off in settings.
- How many cloud models can I run at once?
Concurrency limits ensure dedicated capacity for workflows that need multiple models running simultaneously:
| Plan | Concurrent models |
|------|-------------------|
| Free | 1 |
| Pro | 3 |
| Max | 10 |

Requests beyond your plan's concurrency limit are queued and processed as soon as a slot is available. Queued requests are held up to a fixed limit; if the queue is full, the request will be rejected until one of your concurrency slots opens.
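Here is a sketch of what concurrency looks like from the client side, using the async Ollama Python client: more requests are started than a plan's concurrent-model slots, and anything beyond the limit is queued server-side as described above. The model name and the error handling are illustrative assumptions, not documented behavior.

```python
# Sketch: start several cloud requests at once with the async client.
# On the Pro plan (3 concurrent models), requests beyond the limit are queued
# server-side; if the queue is full, the call fails and can be retried later.
import asyncio

import ollama

prompts = [f"Write test case #{i} for a login form." for i in range(5)]


async def ask(client: ollama.AsyncClient, prompt: str) -> str:
    try:
        response = await client.chat(
            model="gpt-oss:120b-cloud",  # placeholder cloud model
            messages=[{"role": "user", "content": prompt}],
        )
        return response.message.content
    except ollama.ResponseError as err:  # surfaced if a queued request is rejected
        return f"request rejected (status {err.status_code}); retry when a slot opens"


async def main() -> None:
    client = ollama.AsyncClient()
    results = await asyncio.gather(*(ask(client, p) for p in prompts))
    for result in results:
        print(result[:120])


asyncio.run(main())
```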
Privacy
- Where are models hosted?
Ollama hosts models and compute resources primarily in the United States. To serve global demand, we may route requests to Europe and Singapore for additional capacity.
- Is my prompt or response data trained on?
No. Prompt and response data are never logged or trained on.
- Who does Ollama partner with to host models?
Ollama collaborates with NVIDIA Cloud Providers (NCPs) to host open models.
When Ollama partners with providers, we require no-logging, no-training, and zero-data-retention policies to be in place.