How to Self-Host Ollama and Access It Remotely
Ollama lets you run powerful open-source AI models such as Llama, DeepSeek, Mistral, Gemma, and Qwen entirely on your own hardware. No API keys, no usage fees, no data leaving your machine. This guide covers installing Ollama on Windows, macOS, and Linux, then making it securely accessible from anywhere using Localtonet.
What Is Ollama?
Ollama is an open-source tool that lets you download and run large language models locally on your own computer. Think of it as a package manager for AI models: you pull a model with one command and start chatting with it immediately, with no cloud subscription, no internet dependency after the download, and no data sent to third parties.
Ollama exposes a REST API on port 11434, including an OpenAI-compatible endpoint. That means any application built for ChatGPT's API (Cline, Continue, AnythingLLM, Open WebUI) can be pointed at your local Ollama instance and work without modification.
Self-Hosted Ollama
- Complete data privacy
- No per-token costs
- No internet required after model download
- Full control over model versions
- Works offline
Cloud AI APIs
- Data processed on third-party servers
- Costs scale with usage
- Requires internet connection
- Subject to rate limits and outages
- Model versions can change without notice
Step 1 — Install Ollama
🐧 Linux (Recommended for servers)
The official one-liner detects your architecture and installs Ollama as a systemd service automatically.
curl -fsSL https://ollama.com/install.sh | sh
Verify the installation and check the service status:
ollama --version
systemctl status ollama
🍎 macOS
Download the Ollama app from ollama.com/download, open the ZIP, and drag Ollama.app to your Applications folder. Launch it and Ollama runs in the menu bar. Requires macOS 14 Sonoma or later.
🪟 Windows
Download the OllamaSetup.exe installer from ollama.com/download. Run it; no Administrator rights are required. Ollama installs to your user directory and adds itself to the PATH.
Ollama automatically detects and uses your GPU. For NVIDIA cards, ensure you have up-to-date drivers installed. For Apple Silicon Macs (M1/M2/M3/M4), GPU acceleration is built in and works out of the box. CPU-only mode works on any machine but is significantly slower for larger models.
Step 2 — Pull and Run Your First Model
Browse the full model library at ollama.com/library. Below are some recommended starting points depending on your hardware and use case.
| Model | Size | Best for | Pull command |
|---|---|---|---|
| Llama 3.2 | ~2 GB | General purpose, fast on most hardware | ollama pull llama3.2 |
| DeepSeek R1 | ~5 GB | Reasoning and complex problem solving | ollama pull deepseek-r1 |
| Mistral | ~4 GB | Instruction following, summarization | ollama pull mistral |
| Gemma 3 | ~3 GB | Lightweight, fast, multimodal | ollama pull gemma3 |
| Phi-4 Mini | ~2 GB | Fastest option for low-spec hardware | ollama pull phi4-mini |
| Qwen 2.5 Coder | ~4 GB | Code generation and completion | ollama pull qwen2.5-coder |
Pull a model and start an interactive chat session:
# Pull the model (downloads once, runs locally forever)
ollama pull llama3.2
# Start an interactive chat
ollama run llama3.2
# Exit the chat
/bye
List all models you have downloaded:
ollama list
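The same information is available over the API: `GET /api/tags` returns the downloaded models as JSON. A minimal Python sketch using only the standard library (the `model_names` helper is ours, not part of Ollama):

```python
import json
from urllib.request import urlopen

def model_names(tags_json: str) -> list[str]:
    """Extract model names from an /api/tags response body."""
    return [m["name"] for m in json.loads(tags_json)["models"]]

def list_local_models(host: str = "http://localhost:11434") -> list[str]:
    """Query a running Ollama instance for its downloaded models."""
    with urlopen(f"{host}/api/tags") as resp:
        return model_names(resp.read().decode())

# Example response shape (abridged):
sample = '{"models": [{"name": "llama3.2:latest", "size": 2019393189}]}'
print(model_names(sample))  # ['llama3.2:latest']
```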
Step 3 — Test the Local API
Ollama exposes a REST API on http://localhost:11434 by default. The API also offers an OpenAI-compatible endpoint, so any tool or library that works with OpenAI can point at Ollama instead; just change the base URL.
# Verify Ollama is running
curl http://localhost:11434
# Send a chat request
curl http://localhost:11434/api/chat -d '{
"model": "llama3.2",
"messages": [{ "role": "user", "content": "Hello! What can you do?" }],
"stream": false
}'
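The same chat request in Python, using only the standard library. The payload builder mirrors the curl call above; the field names (`model`, `messages`, `stream`) are Ollama's chat format:

```python
import json
from urllib.request import Request, urlopen

def chat_payload(model: str, prompt: str, stream: bool = False) -> dict:
    """Build the JSON body for Ollama's /api/chat endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }

def chat(prompt: str, model: str = "llama3.2",
         host: str = "http://localhost:11434") -> str:
    """Send one chat turn and return the assistant's reply."""
    req = Request(
        f"{host}/api/chat",
        data=json.dumps(chat_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]

# Requires a running Ollama instance:
# print(chat("Hello! What can you do?"))
```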
Using the OpenAI-compatible endpoint (works with any OpenAI SDK):
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2",
"messages": [{ "role": "user", "content": "Hello!" }]
}'
Set base_url="http://localhost:11434/v1" in any OpenAI SDK client and
api_key="ollama" (any non-empty string works). Your existing code runs against
your local model without any other changes.
Step 4 — Access Ollama Remotely with Localtonet
By default, Ollama listens only on localhost, so it is not reachable from other machines. There are two approaches to make it remotely accessible. The simplest and most secure is Localtonet, which creates an encrypted tunnel without changing any Ollama settings or opening firewall ports.
Option A (Recommended): Localtonet TCP Tunnel
Most secure
Ollama stays bound to localhost only. Localtonet creates an encrypted tunnel to a public relay address.
No firewall rules, no OLLAMA_HOST changes, no authentication complexity.
Install Localtonet
Download and authenticate Localtonet on the machine running Ollama. Get your AuthToken from Dashboard → My Tokens.
Create a TCP tunnel for port 11434
In the Localtonet dashboard, go to Tunnels → New Tunnel.
Select TCP, set the local IP to 127.0.0.1 and port to 11434.
Click Create.
Authenticate and start the client
Run the Localtonet client on the same machine as Ollama. Your tunnel activates
and the dashboard shows the public relay address — for example example.localto.net:3327.
Connect from any remote machine
Use the relay address as the Ollama host in any client, SDK, or application. Traffic between your machine and the relay travels over Localtonet's encrypted tunnel.
Test the remote connection from any machine:
# Replace with your actual Localtonet relay address
curl http://example.localto.net:3327/api/chat -d '{
"model": "llama3.2",
"messages": [{ "role": "user", "content": "Hello from remote!" }],
"stream": false
}'
Use with the OpenAI Python SDK from a remote machine:
from openai import OpenAI
client = OpenAI(
base_url="http://example.localto.net:3327/v1",
api_key="ollama" # Required but not validated
)
response = client.chat.completions.create(
model="llama3.2",
messages=[{"role": "user", "content": "Explain quantum computing simply."}]
)
print(response.choices[0].message.content)
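With `"stream": true` (the API's default for `/api/chat`), Ollama returns one JSON object per line instead of a single response. A sketch of assembling the fragments into a full reply, assuming Ollama's newline-delimited chunk shape; the helper works the same against a local or tunneled address:

```python
import json

def assemble_stream(ndjson_lines) -> str:
    """Join the content fragments from a streamed /api/chat response."""
    parts = []
    for line in ndjson_lines:
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):  # final chunk carries done: true
            break
    return "".join(parts)

# Chunks as Ollama streams them, one JSON object per line:
chunks = [
    '{"message": {"role": "assistant", "content": "Hel"}, "done": false}',
    '{"message": {"role": "assistant", "content": "lo!"}, "done": false}',
    '{"message": {"role": "assistant", "content": ""}, "done": true}',
]
print(assemble_stream(chunks))  # Hello!
```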
Option B (Alternative): OLLAMA_HOST=0.0.0.0
You can also configure Ollama to listen on all network interfaces directly. This is simpler but exposes the API to your local network — use only on trusted networks and never without additional authentication on the public internet.
# Linux — set via systemd override
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf <<'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
EOF
sudo systemctl daemon-reload
sudo systemctl restart ollama
# macOS — set via launchctl
launchctl setenv OLLAMA_HOST "0.0.0.0"
# Then restart the Ollama app from the menu bar
Ollama has no built-in authentication. Exposing port 11434 publicly means anyone who finds it can use your compute and your models for free. Prefer Localtonet's encrypted tunnel, or place a reverse proxy with authentication in front of Ollama if you need direct port exposure.
Security Considerations
✅ Use Localtonet tunnel (recommended)
Ollama remains bound to localhost. Localtonet provides an encrypted tunnel to the relay address.
Only people with the relay address can connect, and you can stop sharing at any time by stopping the tunnel.
No firewall rules, no public IP exposure, no Ollama configuration changes needed.
🔐 Add Localtonet SSO for authentication
In the Localtonet dashboard, enable Single Sign-On (SSO) on your tunnel. This adds a login layer in front of the tunnel endpoint; only authenticated users can reach your Ollama API. Supports Google, GitHub, Microsoft, and GitLab login.
⚠️ If you expose OLLAMA_HOST=0.0.0.0 directly
This should only be done on isolated local networks. On the public internet, place Nginx in front of Ollama with Basic Auth or token-based authentication configured at the proxy level. Ollama itself has no built-in authentication mechanism.
What You Can Do with a Remote Ollama Instance
💻 Private AI coding assistant
Connect VS Code extensions like Continue or Cline to your remote Ollama instance. Get AI code completion and chat that runs on your own hardware; your code never leaves machines you control.
👥 Shared team LLM endpoint
Run Ollama on a powerful workstation and share the Localtonet relay address with your team. Everyone gets access to the same models without each person needing to download gigabytes of weights.
📱 Access from your phone or tablet
Apps like AnythingLLM and Open WebUI support remote Ollama connections. Point them at your Localtonet relay address and chat with your local models from any device, anywhere.
🤖 AI agent and automation backends
Use your self-hosted Ollama as the LLM backend for n8n workflows, LangChain agents, or custom Python scripts running on remote machines. The OpenAI-compatible API means no code changes; just update the base URL.
🖥️ Offload to a GPU server
Running Ollama on a dedicated GPU machine (desktop, VPS, or NAS) and accessing it from a lightweight laptop? The Localtonet tunnel makes the remote GPU-powered Ollama instance behave as if it were running locally.
Frequently Asked Questions
Do I need a GPU to run Ollama?
No. Ollama runs on CPU too, but it will be significantly slower for larger models. For comfortable use, a GPU with at least 8GB VRAM is recommended for 7B parameter models. Smaller models like Phi-4 Mini or Gemma 3 run well on CPU-only machines.
How much disk space do I need?
The Ollama binary itself is small. Models range from about 2 GB (small 3B parameter models) to 40 GB or more (70B parameter models). A good starting point is 10–20 GB of free space for a couple of mid-size models.
Is the Ollama API compatible with tools built for OpenAI?
Yes. Ollama exposes an OpenAI-compatible endpoint at /v1/chat/completions. Set base_url to your Ollama address and api_key to any non-empty string. Most OpenAI integrations work without any other changes.
Can I run multiple models at the same time?
Yes. Ollama can load multiple models concurrently, switching between them based on incoming requests. How many can run simultaneously depends on your available VRAM or RAM. Models not recently used are automatically unloaded from memory.
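Unload timing can also be tuned per request with the `keep_alive` parameter accepted by `/api/generate` and `/api/chat`: a duration like "5m" keeps the model loaded that long after the request, 0 unloads it immediately, and -1 keeps it resident. A sketch of building such a request body (the helper name is ours):

```python
import json

def generate_payload(model: str, prompt: str, keep_alive="5m") -> str:
    """JSON body for /api/generate with an explicit keep_alive."""
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "keep_alive": keep_alive,  # e.g. "5m", 0 (unload now), -1 (keep loaded)
    })

body = generate_payload("llama3.2", "Hi", keep_alive=-1)
print(json.loads(body)["keep_alive"])  # -1
```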
Will the Localtonet tunnel URL change every time?
By default, TCP tunnel ports are assigned dynamically. To get a fixed, permanent address, reserve a port via Add-ons → Reserved Ports in the Localtonet dashboard. You can also attach a custom domain for a stable HTTPS URL.
Can I keep Ollama running permanently without a terminal open?
Yes. On Linux, the official installer sets Ollama up as a systemd service that starts automatically on boot. On macOS, the Ollama app launches at login. For Localtonet to also stay running persistently, use Localtonet service mode.
Run Your Own AI | Private, Free, and Accessible from Anywhere
Install Ollama, pull a model, create a Localtonet tunnel. Your own private AI API is ready in under ten minutes.
Create Free Localtonet Account →