How to Self-Host Ollama and Access It Remotely

Run Llama, DeepSeek, Mistral, and 100+ other AI models on your own hardware with Ollama. This guide covers installation, the local API, and remote access via a Localtonet tunnel.

🤖 Ollama · Local LLM · Remote Access · Self-Host · 2025

Ollama lets you run powerful open-source AI models such as Llama, DeepSeek, Mistral, Gemma, and Qwen entirely on your own hardware. No API keys, no usage fees, no data leaving your machine. This guide covers installing Ollama on Windows, macOS, and Linux, then making it securely accessible from anywhere using Localtonet.

🧠 Runs 100+ open models 🔒 Your data stays local 💸 No API costs 🌍 Accessible from anywhere

What Is Ollama?

Ollama is an open-source tool that lets you download and run large language models locally on your own computer. Think of it as a package manager for AI models: you pull a model with one command and start chatting with it immediately, with no cloud subscription, no internet dependency after the download, and no data sent to third parties.

Ollama exposes a REST API on port 11434 that is fully compatible with the OpenAI API format. That means any application built for ChatGPT's API (Cline, Continue, AnythingLLM, Open WebUI, and others) can be pointed at your local Ollama instance and work without modification.

🔒 100% private Your prompts, your data, and your model outputs never leave your machine.
💸 No API costs Run unlimited requests at zero marginal cost. No tokens, no billing, no rate limits.
🧠 100+ models Llama 3, DeepSeek R1, Mistral, Gemma, Qwen, Phi, and many more all available with a single pull command.
🔗 OpenAI-compatible API Drop-in replacement for the OpenAI API. Just change the base URL and existing integrations work immediately.
⚡ GPU-accelerated Automatically uses your NVIDIA, AMD, or Apple Silicon GPU for fast inference. Falls back to CPU gracefully.
🐳 Docker support Official Docker image on Docker Hub for container-based deployments and CI/CD pipelines.
Self-Hosted Ollama
  • Complete data privacy
  • No per-token costs
  • No internet required after model download
  • Full control over model versions
  • Works offline
Cloud AI APIs
  • Data processed on third-party servers
  • Costs scale with usage
  • Requires internet connection
  • Subject to rate limits and outages
  • Model versions can change without notice

Step 1 — Install Ollama

🐧 Linux (Recommended for servers)

The official one-liner detects your architecture and installs Ollama as a systemd service automatically.

curl -fsSL https://ollama.com/install.sh | sh

Verify the installation and check the service status:

ollama --version
systemctl status ollama

🍎 macOS

Download the Ollama app from ollama.com/download, open the ZIP, and drag Ollama.app to your Applications folder. Launch it and Ollama runs in the menu bar. Requires macOS 14 Sonoma or later.

🪟 Windows

Download the OllamaSetup.exe installer from ollama.com/download. Run it; no Administrator rights are required. Ollama installs to your user directory and adds itself to the PATH.

GPU acceleration

Ollama automatically detects and uses your GPU. For NVIDIA cards, ensure you have up-to-date drivers installed. For Apple Silicon Macs (M1/M2/M3/M4), GPU acceleration is built in and works out of the box. CPU-only mode works on any machine but is significantly slower for larger models.
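
Regardless of platform, a quick way to confirm the install is to check that something is accepting connections on Ollama's default port. A minimal sketch using only the Python standard library (assumes the default host and port, 127.0.0.1:11434):

```python
import socket

def is_listening(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP server accepts connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    if is_listening("127.0.0.1", 11434):
        print("Ollama API is reachable on localhost:11434")
    else:
        print("Nothing is listening on localhost:11434 -- is Ollama running?")
```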

Step 2 — Pull and Run Your First Model

Browse the full model library at ollama.com/library. Below are some recommended starting points depending on your hardware and use case.

| Model | Size | Best for | Pull command |
| --- | --- | --- | --- |
| Llama 3.2 | ~2 GB | General purpose, fast on most hardware | ollama pull llama3.2 |
| DeepSeek R1 | ~5 GB | Reasoning and complex problem solving | ollama pull deepseek-r1 |
| Mistral | ~4 GB | Instruction following, summarization | ollama pull mistral |
| Gemma 3 | ~3 GB | Lightweight, fast, multimodal | ollama pull gemma3 |
| Phi-4 Mini | ~2 GB | Fastest option for low-spec hardware | ollama pull phi4-mini |
| Qwen 2.5 Coder | ~4 GB | Code generation and completion | ollama pull qwen2.5-coder |

Pull a model and start an interactive chat session:

# Pull the model (downloads once, runs locally forever)
ollama pull llama3.2

# Start an interactive chat
ollama run llama3.2

# Exit the chat
/bye

List all models you have downloaded:

ollama list

Step 3 — Test the Local API

Ollama exposes a REST API on http://localhost:11434 by default. The API is fully compatible with the OpenAI API format, so any tool or library that works with OpenAI can point at Ollama instead; just change the base URL.

# Verify Ollama is running
curl http://localhost:11434

# Send a chat request
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [{ "role": "user", "content": "Hello! What can you do?" }],
  "stream": false
}'
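
With "stream": false, /api/chat returns a single JSON object, and the reply text sits under message.content. A small sketch of extracting it in Python; the sample response below is illustrative, not real model output:

```python
import json

def extract_reply(raw: str) -> str:
    """Pull the assistant's text out of a non-streaming /api/chat response."""
    data = json.loads(raw)
    return data["message"]["content"]

# Illustrative body in the non-streaming /api/chat response shape
sample = json.dumps({
    "model": "llama3.2",
    "message": {"role": "assistant", "content": "Hello! I can answer questions."},
    "done": True,
})

print(extract_reply(sample))  # → Hello! I can answer questions.
```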

Using the OpenAI-compatible endpoint (works with any OpenAI SDK):

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [{ "role": "user", "content": "Hello!" }]
  }'
OpenAI SDK compatibility

Set base_url="http://localhost:11434/v1" in any OpenAI SDK client and api_key="ollama" (any non-empty string works). Your existing code runs against your local model without any other changes.
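
If you prefer not to install an SDK at all, the same OpenAI-compatible endpoint can be called with nothing but the Python standard library. A minimal sketch; the model name and prompt are placeholders:

```python
import json
import urllib.request

def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request to the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer ollama",  # any non-empty key works
        },
    )

if __name__ == "__main__":
    req = build_chat_request("http://localhost:11434", "llama3.2", "Hello!")
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    print(body["choices"][0]["message"]["content"])
```

The Authorization header can carry any non-empty value, since Ollama does not validate the key.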

Step 4 — Access Ollama Remotely with Localtonet

By default, Ollama listens only on localhost, so it is not reachable from other machines. There are two approaches to making it remotely accessible. The simplest and most secure is using Localtonet to create an encrypted tunnel without changing any Ollama settings or opening firewall ports.

Option A (Recommended): Localtonet TCP Tunnel

Most secure: Ollama stays bound to localhost only. Localtonet creates an encrypted tunnel to a public relay address. No firewall rules, no OLLAMA_HOST changes, no authentication complexity.

1. Install Localtonet

Download and authenticate Localtonet on the machine running Ollama. Get your AuthToken from Dashboard → My Tokens.

2. Create a TCP tunnel for port 11434

In the Localtonet dashboard, go to Tunnels → New Tunnel. Select TCP, set the local IP to 127.0.0.1 and the port to 11434. Click Create.

3. Authenticate and start the client

Run the Localtonet client on the same machine as Ollama. Your tunnel activates and the dashboard shows the public relay address — for example example.localto.net:3327.

4. Connect from any remote machine

Use the relay address as the Ollama host in any client, SDK, or application. Traffic between your machine and the Localtonet relay travels over an encrypted tunnel.

Test the remote connection from any machine:

# Replace with your actual Localtonet relay address
curl http://example.localto.net:3327/api/chat -d '{
  "model": "llama3.2",
  "messages": [{ "role": "user", "content": "Hello from remote!" }],
  "stream": false
}'

Use with the OpenAI Python SDK from a remote machine:

from openai import OpenAI

client = OpenAI(
    base_url="http://example.localto.net:3327/v1",
    api_key="ollama"  # Required but not validated
)

response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Explain quantum computing simply."}]
)

print(response.choices[0].message.content)

Option B (Alternative): OLLAMA_HOST=0.0.0.0

You can also configure Ollama to listen on all network interfaces directly. This is simpler but exposes the API to your local network — use only on trusted networks and never without additional authentication on the public internet.

# Linux — set via systemd override
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf <<'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
EOF
sudo systemctl daemon-reload
sudo systemctl restart ollama
# macOS — set via launchctl
launchctl setenv OLLAMA_HOST "0.0.0.0"
# Then restart the Ollama app from the menu bar

# Windows — set a user environment variable, then restart Ollama from the tray
setx OLLAMA_HOST "0.0.0.0"

Do not expose port 11434 directly to the internet

Ollama has no built-in authentication. Exposing port 11434 publicly means anyone can use your compute and access your models for free. Always use Localtonet's encrypted tunnel, or place a reverse proxy with authentication in front of Ollama if you need direct port exposure.

Security Considerations

✅ Use Localtonet tunnel (recommended)

Ollama remains bound to localhost. Localtonet provides an encrypted tunnel to the relay address. Only people with the relay address can connect, and you can stop sharing at any time by stopping the tunnel. No firewall rules, no public IP exposure, no Ollama configuration changes needed.

🔐 Add Localtonet SSO for authentication

In the Localtonet dashboard, enable Single Sign-On (SSO) on your tunnel. This adds a login layer in front of the tunnel endpoint; only authenticated users can reach your Ollama API. Supports Google, GitHub, Microsoft, and GitLab login.

⚠️ If you expose OLLAMA_HOST=0.0.0.0 directly

This should only be done on isolated local networks. On the public internet, place Nginx in front of Ollama with Basic Auth or token-based authentication configured at the proxy level. Ollama itself has no built-in authentication mechanism.

What You Can Do with a Remote Ollama Instance

💻 Private AI coding assistant

Connect VS Code extensions like Continue or Cline to your remote Ollama instance. Get AI code completion and chat that runs on your own hardware, so your code never leaves infrastructure you control.

👥 Shared team LLM endpoint

Run Ollama on a powerful workstation and share the Localtonet relay address with your team. Everyone gets access to the same models without each person needing to download gigabytes of weights.

📱 Access from your phone or tablet

Apps like AnythingLLM and Open WebUI support remote Ollama connections. Point them at your Localtonet relay address and chat with your local models from any device, anywhere.

🤖 AI agent and automation backends

Use your self-hosted Ollama as the LLM backend for n8n workflows, LangChain agents, or custom Python scripts running on remote machines. The OpenAI-compatible API means no code changes; just update the base URL.
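
One common pattern for such backends is to keep the endpoint in an environment variable, so the same script runs against localhost during development and a Localtonet relay in production. A sketch, assuming an environment variable named OLLAMA_BASE_URL (a name chosen here for illustration, not something Ollama itself reads):

```python
import os

def client_settings() -> dict:
    """Settings for any OpenAI-style client, defaulting to the local instance."""
    return {
        # OLLAMA_BASE_URL is this sketch's convention, not an Ollama variable
        "base_url": os.environ.get("OLLAMA_BASE_URL", "http://localhost:11434/v1"),
        "api_key": "ollama",  # required by SDKs but not validated by Ollama
    }

if __name__ == "__main__":
    # Unset: the script talks to the local instance.
    print(client_settings()["base_url"])
    # Point the same script at a Localtonet relay with no code change:
    os.environ["OLLAMA_BASE_URL"] = "http://example.localto.net:3327/v1"
    print(client_settings()["base_url"])
```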

🖥️ Offload to a GPU server

Running Ollama on a dedicated GPU machine (desktop, VPS, or NAS) and accessing it from a lightweight laptop? The Localtonet tunnel makes the remote GPU-powered Ollama instance behave as if it were running locally.

Frequently Asked Questions

Do I need a GPU to run Ollama?

No. Ollama runs on CPU too, but it will be significantly slower for larger models. For comfortable use, a GPU with at least 8 GB VRAM is recommended for 7B parameter models. Smaller models like Phi-4 Mini or Gemma 3 run well on CPU-only machines.

How much disk space do I need?

The Ollama binary itself is small. Models range from about 2 GB (small 3B parameter models) to 40 GB or more (70B parameter models). A good starting point is 10–20 GB of free space for a couple of mid-size models.

Is the Ollama API compatible with tools built for OpenAI?

Yes. Ollama exposes an OpenAI-compatible endpoint at /v1/chat/completions. Set base_url to your Ollama address and api_key to any non-empty string. Most OpenAI integrations work without any other changes.

Can I run multiple models at the same time?

Yes. Ollama can load multiple models concurrently, switching between them based on incoming requests. How many can run simultaneously depends on your available VRAM or RAM. Models not recently used are automatically unloaded from memory.

Will the Localtonet tunnel URL change every time?

By default, TCP tunnel ports are assigned dynamically. To get a fixed, permanent address, reserve a port via Add-ons → Reserved Ports in the Localtonet dashboard. You can also attach a custom domain for a stable HTTPS URL.

Can I keep Ollama running permanently without a terminal open?

Yes. On Linux, the official installer sets Ollama up as a systemd service that starts automatically on boot. On macOS, the Ollama app launches at login. For Localtonet to also stay running persistently, use Localtonet service mode.

Run Your Own AI | Private, Free, and Accessible from Anywhere

Install Ollama, pull a model, create a Localtonet tunnel. Your own private AI API is ready in under ten minutes.

Create Free Localtonet Account →

Localtonet is a secure multi-protocol tunneling and proxy platform designed to expose localhost, devices, private services, and AI agents to the public internet, supporting HTTP/HTTPS tunnels, TCP/UDP forwarding, mobile proxy infrastructure, file server publishing, latency-optimized game connectivity, and developer-ready AI agent endpoint exposure from a single unified control plane.
