How to Host Your Own AI Model on a VPS: Running Ollama and Open-Source LLMs

Running AI language models no longer requires paying per-token to OpenAI or Anthropic. With Ollama, you can run powerful open-source models — Llama 3, Mistral, Gemma, Qwen, DeepSeek — directly on your VPS, with a local API that your applications can call as if it were any other service. Complete privacy, zero API costs, and full control over the model and its configuration.

This guide covers installing Ollama on an Ubuntu VPS, running your first models, building a simple chat API, and optimizing performance for CPU-only inference (the realistic scenario for most VPS setups).

Why Self-Host AI Models on a VPS?

  • Privacy — Your prompts and data never leave your server. Critical for legal, medical, financial, or confidential business use cases.
  • Cost control — No per-token billing. A $50/month VPS running Ollama serves unlimited inference.
  • No rate limits — Commercial APIs throttle requests. Your own server handles as many requests as the hardware allows.
  • Customization — Fine-tune models on your data, create custom system prompts, modify model parameters freely.
  • Offline capability — Your AI runs even without internet connectivity.
  • Integration freedom — Embed AI into internal tools, APIs, chatbots, and workflows without vendor restrictions.

Realistic Expectations: VPS vs GPU Server

Consumer GPUs like an RTX 4090 generate tokens at 80–150 tokens/second. A CPU-only VPS generates 3–15 tokens/second depending on the model size and CPU cores. For many use cases — background processing, internal tools, low-concurrency APIs — this is entirely acceptable.

| Model | Download size | RAM needed | Tokens/sec (4 vCPU) | Use case |
|---|---|---|---|---|
| Llama 3.2 3B | 2 GB | 4 GB | 8–15 t/s | Fast responses, lightweight tasks |
| Llama 3.1 8B | 5 GB | 8 GB | 4–8 t/s | General purpose, balanced quality |
| Mistral 7B | 4 GB | 8 GB | 5–9 t/s | Instruction following, coding |
| Gemma 2 9B | 6 GB | 10 GB | 3–6 t/s | Strong reasoning, longer context |
| DeepSeek-R1 8B | 5 GB | 8 GB | 4–7 t/s | Reasoning, math, code |
| Qwen2.5 14B | 9 GB | 16 GB | 2–4 t/s | Multilingual, Chinese language |
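
To translate tokens/second into wall-clock expectations, here is a quick back-of-the-envelope helper (pure arithmetic, no Ollama required — it ignores prompt-processing time, which adds a few extra seconds for long prompts on CPU):

```python
def response_seconds(tokens: int, tokens_per_sec: float) -> float:
    """Estimate how long a reply of `tokens` length takes to generate
    at a given speed. Prompt-processing time is not included."""
    return round(tokens / tokens_per_sec, 1)

# A typical 200-token answer from an 8B model at 6 t/s:
print(response_seconds(200, 6))   # 33.3
```

Half a minute for a full paragraph is fine for background jobs and internal tools, but worth knowing before you build a latency-sensitive chat UI on CPU inference.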

💡 VPS Recommendation: For Llama 3.1 8B or Mistral 7B, use a 4 vCPU / 8 GB RAM VPS. For 13B+ models, 16 GB RAM is recommended. VPS.DO’s USA VPS plans provide the dedicated RAM and SSD needed for model storage and inference. View Plans →


Step 1: Update System and Check Resources

sudo apt update && sudo apt upgrade -y

# Check available RAM
free -h

# Check CPU cores
nproc

# Check disk space (models need 2–10 GB each)
df -h

Step 2: Install Ollama

# Official one-line installer
curl -fsSL https://ollama.com/install.sh | sh

This installs the Ollama binary, creates a systemd service, and starts it automatically.

# Verify installation
ollama --version

# Check service status
sudo systemctl status ollama

Step 3: Pull and Run Your First Model

# Pull Llama 3.2 3B (good starting point — only 2 GB)
ollama pull llama3.2:3b

# Start an interactive chat session
ollama run llama3.2:3b

Type a message and press Enter. You’ll see tokens streaming in real time. Type /bye and press Enter to exit the chat.

# Pull other popular models
ollama pull mistral:7b          # Excellent general-purpose model
ollama pull gemma2:9b           # Strong reasoning
ollama pull qwen2.5:7b          # Great for Chinese language tasks
ollama pull deepseek-r1:8b      # Advanced reasoning and math
ollama pull codellama:7b        # Specialized for code generation

# List downloaded models
ollama list

# Remove a model to free disk space
ollama rm codellama:7b
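
The same inventory is available programmatically from the /api/tags endpoint, which is handy for monitoring disk usage. A small sketch — the field names (models[].name, models[].size in bytes) follow the Ollama API, and the sample response below is illustrative:

```python
def summarize_models(tags_json: dict) -> list:
    """Summarize parsed /api/tags output as (name, size_in_GB) pairs."""
    return [
        (m["name"], round(m["size"] / 1e9, 1))
        for m in tags_json.get("models", [])
    ]

# Illustrative response in the shape /api/tags returns:
sample = {"models": [{"name": "llama3.2:3b", "size": 2019393189}]}
print(summarize_models(sample))  # [('llama3.2:3b', 2.0)]
```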

Step 4: Use the Ollama REST API

Ollama exposes a REST API on port 11434. By default it only listens on localhost.

Generate a completion (non-streaming)

curl http://localhost:11434/api/generate \
  -d '{
    "model": "llama3.2:3b",
    "prompt": "What is a VPS and why should I use one?",
    "stream": false
  }'

Chat API (multi-turn conversations)

curl http://localhost:11434/api/chat \
  -d '{
    "model": "mistral:7b",
    "messages": [
      {"role": "system", "content": "You are a helpful server administrator."},
      {"role": "user", "content": "How do I check my VPS memory usage?"}
    ],
    "stream": false
  }'
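
When "stream" is true, /api/chat returns newline-delimited JSON chunks instead of a single object, with the text under message.content and done=true on the final chunk. A sketch of reassembling them — the sample chunks below are illustrative, not real model output:

```python
import json

def collect_stream(ndjson_lines) -> str:
    """Reassemble the full reply from Ollama's streaming NDJSON chunks."""
    parts = []
    for line in ndjson_lines:
        if not line.strip():
            continue  # skip keep-alive blank lines
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break  # final chunk carries timing stats, no more text
    return "".join(parts)

# Illustrative chunks in the shape Ollama emits:
sample = [
    '{"message": {"content": "Use "}, "done": false}',
    '{"message": {"content": "free -h"}, "done": true}',
]
print(collect_stream(sample))  # Use free -h
```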

OpenAI-compatible endpoint

Ollama also supports the OpenAI API format — a drop-in replacement for apps built on the OpenAI SDK:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:3b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
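
The same endpoint can be called from Python with nothing but the standard library. A minimal sketch — the response shape follows the OpenAI chat-completions format (choices[0].message.content), and it assumes Ollama is running locally:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def build_payload(model: str, prompt: str, system: str = None) -> dict:
    """Build an OpenAI-style chat payload."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    return {"model": model, "messages": messages}

def chat(prompt: str, model: str = "llama3.2:3b") -> str:
    """Send one chat turn to the local Ollama server and return the text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Because the format matches OpenAI's, existing SDK-based code usually only needs its base URL pointed at http://localhost:11434/v1.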

Step 5: Expose Ollama via Nginx with Authentication

To access Ollama from external applications, expose it through Nginx with basic authentication — never expose the raw Ollama port publicly.

Keep Ollama bound to localhost

Ollama listens on 127.0.0.1:11434 by default — leave it that way and let Nginx handle external traffic. If you need to confirm or change the bind address, edit the systemd unit:

sudo nano /etc/systemd/system/ollama.service

In the [Service] section:

[Service]
Environment="OLLAMA_HOST=127.0.0.1:11434"

Then reload and restart:

sudo systemctl daemon-reload
sudo systemctl restart ollama

Set up Nginx reverse proxy with authentication

sudo apt install apache2-utils -y
sudo htpasswd -c /etc/nginx/.ollama-htpasswd apiuser
sudo nano /etc/nginx/sites-available/ollama

server {
    listen 443 ssl;
    server_name ai.yourdomain.com;

    ssl_certificate /etc/letsencrypt/live/ai.yourdomain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/ai.yourdomain.com/privkey.pem;

    # Require authentication
    auth_basic "Ollama API";
    auth_basic_user_file /etc/nginx/.ollama-htpasswd;

    # Increase timeout for long AI responses
    proxy_read_timeout 300s;
    proxy_send_timeout 300s;

    location / {
        proxy_pass http://127.0.0.1:11434;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;

        # Enable streaming responses
        proxy_buffering off;
        proxy_cache off;
        chunked_transfer_encoding on;
    }
}
# Obtain the certificate before enabling the site — nginx -t fails
# if the certificate paths referenced above don't exist yet
sudo certbot certonly --nginx -d ai.yourdomain.com

sudo ln -s /etc/nginx/sites-available/ollama /etc/nginx/sites-enabled/
sudo nginx -t && sudo systemctl reload nginx
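
Client code then authenticates with the htpasswd credentials via a standard HTTP Basic auth header. A sketch using only the standard library — the domain and credentials are placeholders for your own:

```python
import base64
import json
import urllib.request

def basic_auth_header(user: str, password: str) -> dict:
    """Build the Authorization header Nginx's auth_basic expects."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return {"Authorization": f"Basic {token}"}

def generate(prompt: str, model: str = "llama3.2:3b") -> str:
    """Call the proxied Ollama API through Nginx with authentication."""
    headers = {"Content-Type": "application/json",
               **basic_auth_header("apiuser", "your-password")}  # placeholders
    req = urllib.request.Request(
        "https://ai.yourdomain.com/api/generate",  # placeholder domain
        data=json.dumps({"model": model, "prompt": prompt,
                         "stream": False}).encode(),
        headers=headers,
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]
```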

Step 6: Build a Simple Chat API with Python

sudo mkdir -p /var/ai
pip install flask requests
nano /var/ai/chat_api.py

from flask import Flask, request, jsonify, Response
import requests

app = Flask(__name__)
OLLAMA_URL = "http://localhost:11434"
DEFAULT_MODEL = "llama3.2:3b"

@app.route('/chat', methods=['POST'])
def chat():
    data = request.json
    messages = data.get('messages', [])
    model = data.get('model', DEFAULT_MODEL)
    stream = data.get('stream', False)

    payload = {
        "model": model,
        "messages": messages,
        "stream": stream
    }

    if stream:
        def generate():
            with requests.post(f"{OLLAMA_URL}/api/chat",
                             json=payload, stream=True) as r:
                for line in r.iter_lines():
                    if line:
                        yield f"data: {line.decode()}\n\n"

        return Response(generate(), mimetype='text/event-stream')
    else:
        response = requests.post(f"{OLLAMA_URL}/api/chat", json=payload)
        result = response.json()
        return jsonify({
            "content": result.get("message", {}).get("content", ""),
            "model": model,
            "done": True
        })

@app.route('/models', methods=['GET'])
def list_models():
    response = requests.get(f"{OLLAMA_URL}/api/tags")
    return jsonify(response.json())

if __name__ == '__main__':
    app.run(host='127.0.0.1', port=5000)

Step 7: Create Custom Model Personalities (Modelfiles)

Ollama’s Modelfile lets you create custom versions of any base model with specific system prompts and parameters:

nano /var/ai/Modelfile.support

FROM llama3.2:3b

# Set a specific system prompt
SYSTEM """You are a helpful technical support agent for VPS hosting.
You specialize in Linux server administration, Nginx, Docker, and web hosting.
Always give specific, actionable answers with code examples when relevant.
Keep responses concise and practical."""

# Model parameters (comments must sit on their own line in a Modelfile)
# Lower temperature = more focused, less creative
PARAMETER temperature 0.3
PARAMETER top_p 0.9
# Context window size in tokens
PARAMETER num_ctx 4096

Build and test the custom model:

ollama create vps-support -f /var/ai/Modelfile.support
ollama run vps-support "How do I check which process is using the most CPU?"
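
If you maintain several personalities, generating the Modelfiles from a template keeps their parameters consistent. A small sketch — the helper name is my own, not part of Ollama:

```python
def render_modelfile(base: str, system_prompt: str,
                     temperature: float = 0.3, num_ctx: int = 4096) -> str:
    """Render a Modelfile string ready for `ollama create -f`."""
    return "\n".join([
        f"FROM {base}",
        f'SYSTEM """{system_prompt}"""',
        f"PARAMETER temperature {temperature}",
        f"PARAMETER num_ctx {num_ctx}",
    ])

mf = render_modelfile("llama3.2:3b", "You are a concise VPS support agent.")
print(mf.splitlines()[0])  # FROM llama3.2:3b
```

Write the rendered string to a file, then run `ollama create <name> -f <file>` as shown above.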

Step 8: Performance Optimization

Set the number of CPU threads

Ollama picks a thread count automatically (typically the number of physical cores), but you can override it per request with the num_thread option — match your vCPU count — or bake it into a Modelfile with PARAMETER num_thread:

curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.2:3b", "prompt": "Hello", "options": {"num_thread": 4}}'

Use quantized models for faster inference

Quantized models use less RAM and run faster at a small quality cost:

# Q4_K_M quantization — best balance of speed and quality
ollama pull llama3.1:8b-instruct-q4_K_M

# Q8 — higher quality, more RAM
ollama pull mistral:7b-instruct-q8_0
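
A rough sizing rule of thumb: RAM ≈ parameter count × bits-per-weight ÷ 8, plus around 20% for the KV cache and runtime overhead. This is an approximation, not a guarantee — context length and the specific quantization format shift the numbers:

```python
def approx_ram_gb(params_billion: float, bits_per_weight: int,
                  overhead: float = 1.2) -> float:
    """Rough RAM estimate for a quantized model: weights plus ~20%
    for KV cache and runtime. A rule of thumb, not a guarantee."""
    return round(params_billion * bits_per_weight / 8 * overhead, 1)

print(approx_ram_gb(8, 4))   # 4.8 — an 8B model at Q4
print(approx_ram_gb(8, 8))   # 9.6 — the same model at Q8
```

This is why Q4 quantization makes an 8B model comfortable on an 8 GB VPS while Q8 pushes it to the limit.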

Pre-load models to eliminate cold start

# Keep model loaded in memory
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.2:3b", "keep_alive": -1, "prompt": ""}'

Create a swap file for model loading

sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab

Use Cases: What Developers Build with Self-Hosted LLMs

| Application | Model recommendation |
|---|---|
| Customer support chatbot | Llama 3.1 8B with custom system prompt |
| Code review automation | DeepSeek-R1 8B or CodeLlama 7B |
| Document summarization | Mistral 7B or Gemma 2 9B |
| Internal knowledge base Q&A | Llama 3.1 8B with RAG pipeline |
| Chinese language tasks | Qwen2.5 7B or 14B |
| Data extraction/classification | Mistral 7B (precise instruction following) |

Final Thoughts

Self-hosted LLMs on a VPS democratize AI capabilities for developers and businesses that can’t or won’t send data to commercial APIs. Ollama makes this remarkably accessible — install in minutes, pull a model, and start making API calls. The performance is slower than GPU-accelerated inference, but for many real-world use cases, 5–10 tokens/second is entirely adequate.

For demanding AI workloads, VPS.DO’s higher-tier plans with more RAM and CPU cores deliver noticeably faster inference. The 8 GB RAM USA VPS handles Llama 3.1 8B comfortably; the dedicated server options support larger 13B+ models with better throughput.

Fast • Reliable • Affordable VPS - DO It Now!

Get top VPS hosting with VPS.DO’s fast, low-cost plans. Try risk-free with our 7-day no-questions-asked refund and start today!