How to Host Your Own AI Model on a VPS: Running Ollama and Open-Source LLMs
Running AI language models no longer requires paying per-token to OpenAI or Anthropic. With Ollama, you can run powerful open-source models — Llama 3, Mistral, Gemma, Qwen, DeepSeek — directly on your VPS, with a local API that your applications can call as if it were any other service. Complete privacy, zero API costs, and full control over the model and its configuration.
This guide covers installing Ollama on an Ubuntu VPS, running your first models, building a simple chat API, and optimizing performance for CPU-only inference (the realistic scenario for most VPS setups).
Why Self-Host AI Models on a VPS?
- Privacy — Your prompts and data never leave your server. Critical for legal, medical, financial, or confidential business use cases.
- Cost control — No per-token billing. A $50/month VPS running Ollama serves unlimited inference.
- No rate limits — Commercial APIs throttle requests. Your own server handles as many requests as the hardware allows.
- Customization — Fine-tune models on your data, create custom system prompts, modify model parameters freely.
- Offline capability — Your AI runs even without internet connectivity.
- Integration freedom — Embed AI into internal tools, APIs, chatbots, and workflows without vendor restrictions.
Realistic Expectations: VPS vs GPU Server
Consumer GPUs like an RTX 4090 generate tokens at 80–150 tokens/second. A CPU-only VPS generates 3–15 tokens/second depending on the model size and CPU cores. For many use cases — background processing, internal tools, low-concurrency APIs — this is entirely acceptable.
| Model | Size | RAM needed | Tokens/sec (4 vCPU) | Use case |
|---|---|---|---|---|
| Llama 3.2 3B | 2 GB | 4 GB | 8–15 t/s | Fast responses, lightweight tasks |
| Llama 3.1 8B | 5 GB | 8 GB | 4–8 t/s | General purpose, balanced quality |
| Mistral 7B | 4 GB | 8 GB | 5–9 t/s | Instruction following, coding |
| Gemma 2 9B | 6 GB | 10 GB | 3–6 t/s | Strong reasoning, longer context |
| DeepSeek-R1 8B | 5 GB | 8 GB | 4–7 t/s | Reasoning, math, code |
| Qwen2.5 14B | 9 GB | 16 GB | 2–4 t/s | Multilingual, Chinese language |
💡 VPS Recommendation: For Llama 3.1 8B or Mistral 7B, use a 4 vCPU / 8 GB RAM VPS. For 13B+ models, 16 GB RAM is recommended. VPS.DO’s USA VPS plans provide the dedicated RAM and SSD needed for model storage and inference. View Plans →
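The RAM column above follows a rough rule of thumb (a ballpark of ours, not an official Ollama formula): a quantized model needs about its download size in RAM, plus runtime and KV-cache overhead. A sketch:

```python
def estimate_ram_gb(model_file_gb: float) -> float:
    """Rough RAM needed for CPU inference: weights plus ~20% runtime
    overhead plus ~1 GB for the KV cache at default context size.
    Ballpark assumptions only — verify with `free -h` under load."""
    return round(model_file_gb * 1.2 + 1.0, 1)

# A 5 GB Llama 3.1 8B download needs about 7 GB of RAM
print(estimate_ram_gb(5.0))  # → 7.0
```

If the estimate lands near your VPS's total RAM, step down one model size or one quantization level rather than relying on swap.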
Step 1: Update System and Check Resources
sudo apt update && sudo apt upgrade -y
# Check available RAM
free -h
# Check CPU cores
nproc
# Check disk space (models need 2–10 GB each)
df -h
Step 2: Install Ollama
# Official one-line installer
curl -fsSL https://ollama.com/install.sh | sh
This installs the Ollama binary, creates a systemd service, and starts it automatically.
# Verify installation
ollama --version
# Check service status
sudo systemctl status ollama
Step 3: Pull and Run Your First Model
# Pull Llama 3.2 3B (good starting point — only 2 GB)
ollama pull llama3.2:3b
# Start an interactive chat session
ollama run llama3.2:3b
Type a message and press Enter. You’ll see tokens streaming in real time. Type /bye and press Enter (or press Ctrl+D) to exit the chat.
# Pull other popular models
ollama pull mistral:7b # Excellent general-purpose model
ollama pull gemma2:9b # Strong reasoning
ollama pull qwen2.5:7b # Great for Chinese language tasks
ollama pull deepseek-r1:8b # Advanced reasoning and math
ollama pull codellama:7b # Specialized for code generation
# List downloaded models
ollama list
# Remove a model to free disk space
ollama rm codellama:7b
Step 4: Use the Ollama REST API
Ollama exposes a REST API on port 11434. By default it only listens on localhost.
Generate a completion (non-streaming)
curl http://localhost:11434/api/generate \
-d '{
"model": "llama3.2:3b",
"prompt": "What is a VPS and why should I use one?",
"stream": false
}'
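The same call from Python, assuming the requests library (installed later in Step 6). The parsing step is split into a pure helper: a non-streaming /api/generate reply carries the generated text in its top-level response field.

```python
import requests

OLLAMA_URL = "http://localhost:11434"

def extract_response(reply: dict) -> str:
    """A non-streaming /api/generate reply holds the text in "response"."""
    return reply.get("response", "")

def generate(model: str, prompt: str) -> str:
    r = requests.post(f"{OLLAMA_URL}/api/generate",
                      json={"model": model, "prompt": prompt, "stream": False},
                      timeout=300)  # CPU inference can take minutes
    r.raise_for_status()
    return extract_response(r.json())

if __name__ == "__main__":
    print(generate("llama3.2:3b", "What is a VPS and why should I use one?"))
```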
Chat API (OpenAI-compatible format)
curl http://localhost:11434/api/chat \
-d '{
"model": "mistral:7b",
"messages": [
{"role": "system", "content": "You are a helpful server administrator."},
{"role": "user", "content": "How do I check my VPS memory usage?"}
],
"stream": false
}'
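With "stream": true, /api/chat returns newline-delimited JSON — one object per line, each carrying a fragment in message.content, with done: true on the final line. A small accumulator (a sketch assuming that documented NDJSON shape):

```python
import json

def collect_stream(ndjson_lines):
    """Concatenate content fragments from a streamed /api/chat response."""
    parts = []
    for line in ndjson_lines:
        if not line.strip():
            continue
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):  # final chunk — carries timing stats, no text
            break
    return "".join(parts)

sample = [
    '{"message": {"role": "assistant", "content": "Use "}, "done": false}',
    '{"message": {"role": "assistant", "content": "free -h"}, "done": false}',
    '{"message": {"role": "assistant", "content": ""}, "done": true}',
]
print(collect_stream(sample))  # → Use free -h
```

In a real client you would feed this from `response.iter_lines()` on a streaming requests call, as the Flask wrapper in Step 6 does.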
OpenAI-compatible endpoint
Ollama supports the OpenAI API format — drop-in replacement for apps using OpenAI SDK:
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2:3b",
"messages": [{"role": "user", "content": "Hello!"}]
}'
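Apps built on the openai SDK only need base_url="http://localhost:11434/v1" and any dummy api_key. With plain requests, the only difference from the native API is the response shape — the reply is nested under choices[0].message.content:

```python
import requests

def extract_openai_reply(resp: dict) -> str:
    """OpenAI-format responses nest the text in choices[0].message.content."""
    return resp["choices"][0]["message"]["content"]

def chat_openai_style(model: str, messages: list) -> str:
    r = requests.post("http://localhost:11434/v1/chat/completions",
                      json={"model": model, "messages": messages},
                      timeout=300)
    r.raise_for_status()
    return extract_openai_reply(r.json())

if __name__ == "__main__":
    print(chat_openai_style("llama3.2:3b", [{"role": "user", "content": "Hello!"}]))
```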
Step 5: Expose Ollama via Nginx with Authentication
To access Ollama from external applications, expose it through Nginx with basic authentication — never expose the raw Ollama port publicly.
Keep Ollama bound to localhost
sudo nano /etc/systemd/system/ollama.service
Confirm the bind address in the [Service] section — keep it on 127.0.0.1 so only Nginx can reach Ollama directly:
[Service]
Environment="OLLAMA_HOST=127.0.0.1:11434"
# Never set this to 0.0.0.0 on a public VPS — Nginx will proxy it
sudo systemctl daemon-reload
sudo systemctl restart ollama
Set up Nginx reverse proxy with authentication
sudo apt install apache2-utils -y
sudo htpasswd -c /etc/nginx/.ollama-htpasswd apiuser
sudo nano /etc/nginx/sites-available/ollama
server {
    listen 443 ssl;
    server_name ai.yourdomain.com;

    ssl_certificate /etc/letsencrypt/live/ai.yourdomain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/ai.yourdomain.com/privkey.pem;

    # Require authentication
    auth_basic "Ollama API";
    auth_basic_user_file /etc/nginx/.ollama-htpasswd;

    # Increase timeouts for long AI responses
    proxy_read_timeout 300s;
    proxy_send_timeout 300s;

    location / {
        proxy_pass http://127.0.0.1:11434;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;

        # Enable streaming responses
        proxy_buffering off;
        proxy_cache off;
        chunked_transfer_encoding on;
    }
}
# Issue the certificate first — nginx can't load the SSL config above
# until the certificate files exist (stop nginx briefly if port 80 is busy)
sudo certbot certonly --standalone -d ai.yourdomain.com
sudo ln -s /etc/nginx/sites-available/ollama /etc/nginx/sites-enabled/
sudo nginx -t && sudo systemctl reload nginx
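Calling the proxied endpoint from an external application then needs the basic-auth credentials. A sketch with requests — the domain and credentials below are placeholders for whatever you configured above:

```python
import base64
import requests

API_URL = "https://ai.yourdomain.com"   # placeholder — your proxied domain
AUTH = ("apiuser", "your-password")     # placeholder — the htpasswd credentials

def basic_auth_header(user: str, password: str) -> dict:
    """The header requests builds from auth=(user, password) — shown
    explicitly because a 401 from Nginx usually means this is wrong."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return {"Authorization": f"Basic {token}"}

def remote_generate(model: str, prompt: str) -> str:
    r = requests.post(f"{API_URL}/api/generate",
                      json={"model": model, "prompt": prompt, "stream": False},
                      auth=AUTH, timeout=300)
    r.raise_for_status()
    return r.json().get("response", "")
```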
Step 6: Build a Simple Chat API with Python
sudo mkdir -p /var/ai && sudo chown $USER /var/ai
# Newer Ubuntu releases lock down the system pip — use a virtualenv
python3 -m venv /var/ai/venv && source /var/ai/venv/bin/activate
pip install flask requests
nano /var/ai/chat_api.py
from flask import Flask, request, jsonify, Response
import requests

app = Flask(__name__)

OLLAMA_URL = "http://localhost:11434"
DEFAULT_MODEL = "llama3.2:3b"

@app.route('/chat', methods=['POST'])
def chat():
    data = request.json
    messages = data.get('messages', [])
    model = data.get('model', DEFAULT_MODEL)
    stream = data.get('stream', False)

    payload = {
        "model": model,
        "messages": messages,
        "stream": stream
    }

    if stream:
        def generate():
            with requests.post(f"{OLLAMA_URL}/api/chat",
                               json=payload, stream=True) as r:
                for line in r.iter_lines():
                    if line:
                        yield f"data: {line.decode()}\n\n"
        return Response(generate(), mimetype='text/event-stream')
    else:
        response = requests.post(f"{OLLAMA_URL}/api/chat", json=payload)
        result = response.json()
        return jsonify({
            "content": result.get("message", {}).get("content", ""),
            "model": model,
            "done": True
        })

@app.route('/models', methods=['GET'])
def list_models():
    response = requests.get(f"{OLLAMA_URL}/api/tags")
    return jsonify(response.json())

if __name__ == '__main__':
    app.run(host='127.0.0.1', port=5000)
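A minimal client for this wrapper — the route name and body shape come from the Flask code above; the URL assumes it is running locally on port 5000:

```python
import requests

def build_chat_request(messages, model="llama3.2:3b", stream=False):
    """Body shape the /chat route above expects."""
    return {"messages": messages, "model": model, "stream": stream}

def ask(messages, url="http://127.0.0.1:5000"):
    r = requests.post(f"{url}/chat", json=build_chat_request(messages),
                      timeout=300)
    r.raise_for_status()
    return r.json()["content"]

if __name__ == "__main__":
    print(ask([{"role": "user", "content": "How do I check disk usage?"}]))
```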
Step 7: Create Custom Model Personalities (Modelfiles)
Ollama’s Modelfile lets you create custom versions of any base model with specific system prompts and parameters:
nano /var/ai/Modelfile.support
FROM llama3.2:3b

# Set a specific system prompt
SYSTEM """You are a helpful technical support agent for VPS hosting.
You specialize in Linux server administration, Nginx, Docker, and web hosting.
Always give specific, actionable answers with code examples when relevant.
Keep responses concise and practical."""

# Model parameters — comments must sit on their own lines in a Modelfile
# Lower temperature = more focused, less creative
PARAMETER temperature 0.3
PARAMETER top_p 0.9
# Context window size
PARAMETER num_ctx 4096
# Build and test the custom model
ollama create vps-support -f /var/ai/Modelfile.support
ollama run vps-support "How do I check which process is using the most CPU?"
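The same parameters can also be overridden per request, without building a custom model, via the options field of the API — keys mirror the Modelfile PARAMETER names. A sketch:

```python
import requests

def build_options(temperature=0.3, top_p=0.9, num_ctx=4096):
    """Per-request overrides; keys mirror Modelfile PARAMETER names."""
    return {"temperature": temperature, "top_p": top_p, "num_ctx": num_ctx}

payload = {
    "model": "llama3.2:3b",
    "prompt": "How do I check which process is using the most CPU?",
    "stream": False,
    "options": build_options(),
}

if __name__ == "__main__":
    r = requests.post("http://localhost:11434/api/generate",
                      json=payload, timeout=300)
    print(r.json().get("response", ""))
```

This is handy for A/B testing temperatures before baking a value into a Modelfile.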
Step 8: Performance Optimization
Set the number of CPU threads
Ollama chooses a thread count automatically, but you can pin it per model with the num_thread parameter — either in a Modelfile or in the options of an API request:
# In a Modelfile — match your vCPU count
PARAMETER num_thread 4
# Or per request
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.2:3b", "prompt": "Hi", "stream": false, "options": {"num_thread": 4}}'
Use quantized models for faster inference
Quantized models use less RAM and run faster at a small quality cost:
# Q4_K_M quantization — best balance of speed and quality
ollama pull llama3.1:8b-instruct-q4_K_M
# Q8 — higher quality, more RAM
ollama pull mistral:7b-instruct-q8_0
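To estimate how large a given quantization will be before pulling it, multiply parameter count by the approximate bits per weight. The figures below are rough averages for llama.cpp-style quantization and should be treated as assumptions:

```python
# Approximate bits per weight — ballpark figures, not exact
BITS_PER_WEIGHT = {"q4_K_M": 4.85, "q8_0": 8.5, "f16": 16.0}

def approx_file_gb(params_billions: float, quant: str) -> float:
    """Rough footprint of the weights alone (excludes KV cache/runtime)."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8

# An 8B model: ~4.8 GB at Q4_K_M vs ~8.5 GB at Q8_0
print(f"{approx_file_gb(8, 'q4_K_M'):.1f} GB vs {approx_file_gb(8, 'q8_0'):.1f} GB")
```

On an 8 GB VPS this is why Q4_K_M is usually the right choice for 7B–8B models: Q8_0 weights alone would consume the machine's entire RAM.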
Pre-load models to eliminate cold start
# Keep model loaded in memory
curl http://localhost:11434/api/generate \
-d '{"model": "llama3.2:3b", "keep_alive": -1, "prompt": ""}'
Create a swap file for model loading
Swap prevents out-of-memory kills while a large model loads, but it is not a substitute for RAM — if the weights themselves get swapped during inference, generation slows to a crawl.
sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
Use Cases: What Developers Build with Self-Hosted LLMs
| Application | Model recommendation |
|---|---|
| Customer support chatbot | Llama 3.1 8B with custom system prompt |
| Code review automation | DeepSeek-R1 8B or CodeLlama 7B |
| Document summarization | Mistral 7B or Gemma 2 9B |
| Internal knowledge base Q&A | Llama 3.1 8B with RAG pipeline |
| Chinese language tasks | Qwen2.5 7B or 14B |
| Data extraction/classification | Mistral 7B (precise instruction following) |
Final Thoughts
Self-hosted LLMs on a VPS democratize AI capabilities for developers and businesses that can’t or won’t send data to commercial APIs. Ollama makes this remarkably accessible — install in minutes, pull a model, and start making API calls. The performance is slower than GPU-accelerated inference, but for many real-world use cases, 5–10 tokens/second is entirely adequate.
For demanding AI workloads, VPS.DO’s higher-tier plans with more RAM and CPU cores deliver noticeably faster inference. The 8 GB RAM USA VPS handles Llama 3.1 8B comfortably; the dedicated server options support larger 13B+ models with better throughput.