How to Run a Local LLM on a VPS: Deploy Ollama and Open-Source AI Models

Running large language models on your own VPS gives you a private, customizable AI environment that sends no data to third-party APIs. Ollama has become one of the most widely used tools for running open-source LLMs locally: it downloads models, manages quantized variants, and exposes a clean HTTP API that includes OpenAI-compatible endpoints. This guide covers deploying Ollama on a VPS, connecting it to Open WebUI for a ChatGPT-like interface, and using it via API for application integration.

Why Run an LLM on Your Own VPS?

  • Privacy: Sensitive business documents, customer data, and proprietary code never leave your infrastructure. No data is sent to OpenAI, Anthropic, or any third party.
  • Cost control: For high-volume API usage, running your own model eliminates per-token costs. A VPS running Llama 3.1 8B can process millions of tokens per month at fixed infrastructure cost.
  • Customization: Fine-tune models on your specific domain, adjust system prompts, and integrate with proprietary data without API restrictions.
  • Availability: No rate limits, no service outages from third-party providers, no usage policies that restrict your use case.
  • Latency: For applications where every millisecond matters, a local model can respond faster than a round-trip to an external API — especially with smaller quantized models.

Understanding VPS Requirements for LLMs

LLMs are RAM-intensive. The amount of RAM required depends on the model size and quantization level:

Model     | Parameters | Quantization | RAM Required | VPS Minimum
Llama 3.2 | 3B         | Q4_K_M       | ~2.5 GB      | 4 GB RAM VPS
Llama 3.1 | 8B         | Q4_K_M       | ~5 GB        | 8 GB RAM VPS
Mistral   | 7B         | Q4_K_M       | ~4.5 GB      | 8 GB RAM VPS
Qwen2.5   | 14B        | Q4_K_M       | ~9 GB        | 16 GB RAM VPS
Llama 3.1 | 70B        | Q4_K_M       | ~45 GB       | Dedicated server

CPU-only inference is viable for models up to roughly 14B parameters, so a GPU is not required. Generation on CPU is slower (roughly 3–15 tokens/second, versus 40–100+ tokens/second on modern GPUs), but that is sufficient for non-interactive workloads such as document processing, batch analysis, and API calls that can tolerate a few seconds of latency.
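The RAM figures in the table can be approximated with a rule of thumb: quantized weight size is roughly parameter count times bits per weight, plus some working memory. The sketch below assumes about 4.85 bits per weight for Q4_K_M and a flat 0.5 GB allowance for the KV cache and runtime. Both numbers are illustrative assumptions rather than values published by Ollama, and real usage grows with context length.

```python
def estimate_ram_gb(params_billion: float,
                    bits_per_weight: float = 4.85,  # assumed average for Q4_K_M
                    overhead_gb: float = 0.5) -> float:  # assumed runtime/KV-cache allowance
    """Rough RAM estimate in GB: quantized weights plus a flat overhead."""
    weights_gb = params_billion * bits_per_weight / 8  # bits -> bytes, per billion params
    return round(weights_gb + overhead_gb, 1)


def seconds_to_generate(num_tokens: int, tokens_per_second: float) -> float:
    """Time needed to produce num_tokens at a given generation speed."""
    return num_tokens / tokens_per_second


if __name__ == "__main__":
    for name, size in [("Llama 3.2 3B", 3), ("Llama 3.1 8B", 8), ("Qwen2.5 14B", 14)]:
        print(f"{name}: ~{estimate_ram_gb(size)} GB")
    # A 300-token answer at the slow and fast ends of the CPU range quoted above
    print(seconds_to_generate(300, 3), seconds_to_generate(300, 15))
```

The estimates land close to the table values, which is why a VPS one size above the raw model size is the safer choice: the operating system and any other services need headroom too.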

Step 1: Install Ollama

Ollama provides a single installation script that handles all dependencies:

curl -fsSL https://ollama.com/install.sh | sh

Verify the installation:

ollama --version
systemctl status ollama

Ollama installs as a systemd service that starts automatically on boot.

Step 2: Download and Run Your First Model

# Download and run Llama 3.1 8B (recommended starting point)
ollama run llama3.1:8b

# Or start with the smaller 3B model for limited RAM VPS
ollama run llama3.2:3b

# For Chinese language tasks, Qwen is excellent
ollama run qwen2.5:7b

# For coding assistance
ollama run codellama:7b

The first run downloads the model (several GB). Subsequent runs load from local storage. Type your prompt and press Enter to generate a response, or type /bye to exit the interactive session.

Step 3: Use the Ollama API

Ollama exposes an HTTP API on port 11434 (localhost only by default). It offers a native API under /api and OpenAI-compatible endpoints under /v1:

Basic API Call

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Explain the difference between TCP and UDP in simple terms.",
  "stream": false
}'
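The same call can be made from application code. The following is a minimal sketch using only the Python standard library; the function names are illustrative, and it assumes Ollama is reachable at the default local address with the llama3.1:8b model already downloaded.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default local address used in this guide


def build_generate_request(prompt: str, model: str = "llama3.1:8b") -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "llama3.1:8b", timeout: float = 300.0) -> str:
    """Send the prompt and return the text from the 'response' field."""
    request = build_generate_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)["response"]
```

Calling generate("Explain the difference between TCP and UDP in simple terms.") returns the full completion as one string. The generous timeout reflects CPU inference speeds, where long answers can take a minute or more.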

OpenAI-Compatible Chat Completions Endpoint

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is a VPS?"}
    ]
  }'

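Because of this OpenAI-compatible endpoint, most applications built for the OpenAI API can switch to your self-hosted Ollama instance by changing the base URL and model name. Client libraries that insist on an API key accept any placeholder value, since Ollama ignores it, and features outside the supported subset of the OpenAI API may still need code changes.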
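To illustrate the base-URL swap, here is a small standard-library sketch that targets the /v1/chat/completions path and reads the reply from the OpenAI-style response shape. The helper names and the placeholder API key are assumptions made for this example; Ollama does not check the key.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434/v1"  # the only line to change when switching backends


def chat_payload(user_message: str, model: str = "llama3.1:8b") -> dict:
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_message},
        ],
    }


def extract_reply(response: dict) -> str:
    """Pull the assistant text out of an OpenAI-style chat completion response."""
    return response["choices"][0]["message"]["content"]


def chat(user_message: str, api_key: str = "ollama", timeout: float = 300.0) -> str:
    """POST the chat request; api_key is a placeholder that Ollama ignores."""
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(chat_payload(user_message)).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return extract_reply(json.load(resp))
```

Keeping the payload builder and the response parser separate from the network call makes it easy to unit-test an integration without a running model.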

Step 4: Expose the API Securely (Optional)

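By default, Ollama listens only on localhost. To allow remote access (for example, from another server or a local development machine), keep Ollama bound to localhost and put an Nginx reverse proxy with TLS and basic authentication in front of it. The certificate paths below assume a Let’s Encrypt certificate has already been issued for the domain.

sudo nano /etc/nginx/sites-available/ollama

Add the following server block:

server {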
    listen 443 ssl;
    server_name ollama.yourdomain.com;

    ssl_certificate /etc/letsencrypt/live/ollama.yourdomain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/ollama.yourdomain.com/privkey.pem;

    # Basic authentication to prevent unauthorized access
    auth_basic "Ollama API";
    auth_basic_user_file /etc/nginx/.htpasswd;

    location / {
        proxy_pass http://127.0.0.1:11434;
        proxy_set_header Host localhost:11434;  # Host value Ollama's FAQ recommends behind a proxy
        proxy_buffering off;  # Required for streaming responses
        proxy_read_timeout 300s;
    }
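}

# Create authentication credentials (you will be prompted for a password)
sudo apt install apache2-utils -y
sudo htpasswd -c /etc/nginx/.htpasswd apiuser

# Enable the site, validate the configuration, and reload Nginx
sudo ln -s /etc/nginx/sites-available/ollama /etc/nginx/sites-enabled/
sudo nginx -t && sudo systemctl reload nginx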
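A remote client then has to send the basic-auth credentials with every request. The sketch below shows one way to do that with the Python standard library. The function names are illustrative, the domain is the placeholder from the Nginx config, and the password is whatever was entered at the htpasswd prompt.

```python
import base64
import json
import urllib.request


def basic_auth_header(username: str, password: str) -> str:
    """Encode credentials as an HTTP Basic Authorization header value."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def build_remote_request(base_url: str, username: str, password: str,
                         prompt: str, model: str = "llama3.1:8b") -> urllib.request.Request:
    """Build an authenticated, non-streaming /api/generate request."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": basic_auth_header(username, password)},
        method="POST",
    )


def remote_generate(base_url: str, username: str, password: str,
                    prompt: str, timeout: float = 300.0) -> str:
    """Send the request through the Nginx proxy and return the generated text."""
    request = build_remote_request(base_url, username, password, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)["response"]
```

For example, remote_generate("https://ollama.yourdomain.com", "apiuser", password, "Hello") goes through the proxy. Basic auth sends the password in a reversible encoding, so it should only ever be used over HTTPS as configured above.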

Step 5: Deploy Open WebUI for a ChatGPT-Like Interface

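Open WebUI provides a full-featured chat interface for Ollama, including model selection, conversation history, and file uploads. The command below publishes the UI on 127.0.0.1:3000 only and points it at Ollama on the host. A container on Docker’s default bridge network reaches the host through the bridge gateway rather than 127.0.0.1, so Ollama’s default localhost-only binding may refuse the connection. If the model list in Open WebUI stays empty, either run the container with --network=host and OLLAMA_BASE_URL=http://127.0.0.1:11434 (the UI then listens on port 8080), or use the OLLAMA_HOST environment variable to bind Ollama to an address the container can reach and firewall port 11434: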

docker run -d \
  -p 127.0.0.1:3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

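Create an Nginx server block for chat.yourdomain.com that proxies to http://127.0.0.1:3000 and forwards the WebSocket upgrade headers Open WebUI uses, then let Certbot obtain a certificate and add the SSL configuration: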

sudo certbot --nginx -d chat.yourdomain.com

Access your private ChatGPT-like interface at https://chat.yourdomain.com. The first account you register becomes the administrator.

Step 6: Create Custom Models with Modelfiles

Ollama Modelfiles allow you to customize a base model’s behavior, system prompt, and parameters:

nano Modelfile

Add the following contents:

FROM llama3.1:8b

SYSTEM """
You are a helpful customer support assistant for VPS.DO, a VPS hosting provider.
You help users with server setup, troubleshooting, and hosting questions.
Always be concise and technical when appropriate.
"""

PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 4096

Build and run the custom model:

ollama create vpsdo-assistant -f Modelfile
ollama run vpsdo-assistant
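If you maintain several custom models, generating Modelfiles from a script keeps them consistent. The following is a minimal sketch: render_modelfile and create_model are hypothetical helper names, and the sketch assumes the system prompt itself contains no triple quotes and that the ollama CLI is on the PATH.

```python
import subprocess
from pathlib import Path


def render_modelfile(base_model: str, system_prompt: str, parameters: dict) -> str:
    """Render Modelfile text with FROM, SYSTEM, and PARAMETER instructions."""
    lines = [f"FROM {base_model}", "", 'SYSTEM """', system_prompt.strip(), '"""', ""]
    lines += [f"PARAMETER {name} {value}" for name, value in parameters.items()]
    return "\n".join(lines) + "\n"


def create_model(name: str, modelfile_text: str, path: str = "Modelfile") -> None:
    """Write the Modelfile to disk and register it with `ollama create`."""
    Path(path).write_text(modelfile_text, encoding="utf-8")
    subprocess.run(["ollama", "create", name, "-f", path], check=True)
```

For instance, create_model("vpsdo-assistant", render_modelfile("llama3.1:8b", "Be concise.", {"temperature": 0.7, "num_ctx": 4096})) would build a variant of the model defined above.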

Practical Use Cases for VPS-Hosted LLMs

Document Processing and Summarization

Process large volumes of PDFs, reports, or emails through the API without sending sensitive content to external services. A Python script can iterate through a folder of documents and generate summaries using the local Ollama API.
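A sketch of such a script is shown below. It handles plain-text files only (PDFs or emails would first need a text-extraction step that is not shown), truncates long inputs to a character budget chosen arbitrarily for this example, and uses illustrative function names. The summarizer is passed in as a parameter so the folder loop can be tested without a running model.

```python
import json
import urllib.request
from pathlib import Path
from typing import Callable, Dict


def ollama_summarize(text: str, model: str = "llama3.1:8b",
                     url: str = "http://localhost:11434/api/generate") -> str:
    """Ask the local Ollama API for a short summary of the given text."""
    payload = {"model": model, "stream": False,
               "prompt": f"Summarize the following document in three sentences:\n\n{text}"}
    request = urllib.request.Request(
        url, data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"}, method="POST")
    with urllib.request.urlopen(request, timeout=600) as resp:
        return json.load(resp)["response"]


def summarize_folder(folder: str,
                     summarize: Callable[[str], str] = ollama_summarize,
                     pattern: str = "*.txt",
                     max_chars: int = 8000) -> Dict[str, str]:
    """Summarize every matching file in a folder, keyed by file name."""
    results: Dict[str, str] = {}
    for path in sorted(Path(folder).glob(pattern)):
        # Truncate so the prompt stays within the model's context window
        text = path.read_text(encoding="utf-8", errors="replace")[:max_chars]
        results[path.name] = summarize(text)
    return results
```

Running summarize_folder("/path/to/reports") processes the files one at a time, which suits CPU inference where a single request already uses most of the available cores.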

Code Review and Generation

Use CodeLlama or Qwen Coder for automated code review in your CI/CD pipeline. Send pull request diffs to the local API and receive feedback without exposing proprietary code to external AI providers.
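As a rough illustration of the CI step, the sketch below wraps a diff in a review prompt and sends it to Ollama's native /api/chat endpoint. The prompt wording, the character limit, the function names, and the choice of codellama:7b are assumptions for this example, not a recommended review policy.

```python
import json
import urllib.request

REVIEW_INSTRUCTIONS = (
    "You are a code reviewer. Point out bugs, security issues, and unclear code "
    "in the diff. Be specific and brief."
)


def build_review_payload(diff: str, model: str = "codellama:7b",
                         max_chars: int = 12000) -> dict:
    """Build a non-streaming /api/chat payload, truncating oversized diffs."""
    body = diff[:max_chars]
    if len(diff) > max_chars:
        body += "\n[diff truncated]"
    return {
        "model": model,
        "stream": False,
        "messages": [
            {"role": "system", "content": REVIEW_INSTRUCTIONS},
            {"role": "user", "content": f"Review the following diff:\n\n{body}"},
        ],
    }


def review_diff(diff: str, url: str = "http://localhost:11434/api/chat") -> str:
    """Send the diff to the local model and return its review text."""
    request = urllib.request.Request(
        url, data=json.dumps(build_review_payload(diff)).encode("utf-8"),
        headers={"Content-Type": "application/json"}, method="POST")
    with urllib.request.urlopen(request, timeout=600) as resp:
        return json.load(resp)["message"]["content"]
```

In a pipeline, the diff would typically come from a command such as git diff against the target branch, and the returned text could be posted back as a pull request comment.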

Customer Support Automation

Build a custom support chatbot using a fine-tuned or system-prompted model hosted on your VPS. Integrate with your ticketing system via API to auto-respond to common queries.

Content Generation Pipeline

Automate content creation — product descriptions, social media posts, email drafts — by integrating the Ollama API into your publishing workflow. Process in batches overnight when the server is otherwise idle.

Performance Optimization for CPU Inference

# Open a systemd override file for the Ollama service
sudo systemctl edit ollama

Add under [Service]:

Environment="OLLAMA_NUM_PARALLEL=2"
Environment="OLLAMA_MAX_LOADED_MODELS=1"

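OLLAMA_NUM_PARALLEL sets how many requests each loaded model serves at once, and OLLAMA_MAX_LOADED_MODELS caps how many models stay in memory. On multi-core VPS instances, OLLAMA_NUM_PARALLEL=2 allows two concurrent inference requests, though the requests share the same CPU cores and each parallel slot needs its own context memory, so measure throughput before raising it further. Allowing more than one loaded model increases RAM usage, so keep to one loaded model on VPS instances with limited RAM. After saving the override, apply it with sudo systemctl restart ollama.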
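On the client side, it makes sense to cap concurrency at the same value so extra requests do not simply queue on the server. A minimal sketch with a thread pool follows; run_batch and ollama_generate are illustrative names, and max_workers=2 is chosen to mirror OLLAMA_NUM_PARALLEL=2.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def ollama_generate(prompt: str, model: str = "llama3.1:8b") -> str:
    """Send one non-streaming request to the local Ollama API."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    request = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"}, method="POST")
    with urllib.request.urlopen(request, timeout=600) as resp:
        return json.load(resp)["response"]


def run_batch(prompts: Iterable[str],
              generate: Callable[[str], str] = ollama_generate,
              max_workers: int = 2) -> List[str]:
    """Run prompts with at most max_workers in flight; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(generate, prompts))
```

Timing run_batch on a few representative prompts with max_workers set to 1 and then 2 is a simple way to check whether parallelism actually improves total throughput on a given VPS.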

Getting Started

For CPU-only LLM inference, a 4 GB RAM VPS can run 3B models, while 8 GB is the practical minimum for 7B–8B models and 16 GB is required for 14B models. KVM VPS plans at VPS.DO provide the full Linux kernel access Ollama requires, with NVMe storage for fast model loading and root access for complete configuration. The USA and Hong Kong data center options allow you to place your private AI infrastructure in the region closest to your users.

Conclusion

Ollama makes deploying open-source LLMs on a VPS straightforward — a single installation command, a model download, and you have a private AI API running in minutes. For privacy-sensitive workloads, high-volume inference use cases, or teams that want to experiment with AI capabilities without API costs, a self-hosted LLM on a VPS is a compelling and increasingly accessible option in 2025.
