19 min read

How to Expose Local LLMs to the Internet Securely

Learn how to expose local LLMs to the internet securely for remote access, testing, and collaboration without complex network setup.

AI · Local LLM · Ollama · LM Studio · 2026

How to Expose Your Local LLM to the Internet (Complete 2026 Guide)

Running a large language model on your own hardware gives you complete privacy, zero per-token costs, and full control. But when you need remote access for team collaboration, webhook integration, AI agent endpoints, or MCP-compatible services, you need a secure public URL. Localtonet creates an encrypted HTTPS tunnel from your local model server to the internet in minutes, without port forwarding, a static IP, or cloud infrastructure. This guide covers every major local LLM tool (Ollama, LM Studio, LocalAI, GPT4All, Jan, llama.cpp, text-generation-webui) and the top open-weight models available today.

🤖 7 LLM tools covered 🔒 Security guide included 🌐 Works on Windows · Linux · macOS · Docker


Why Run an LLM Locally?

Cloud AI APIs are convenient, but local LLMs have become a practical alternative for many workloads. The quality gap between open-weight models and proprietary APIs has narrowed significantly. Models like Llama 4, Qwen3, and DeepSeek-R1 are genuinely competitive with commercial offerings on coding, reasoning, and text generation tasks.

🔒 Complete Data Privacy Prompts, documents, and responses never leave your hardware. Critical for healthcare, legal, finance, and any sensitive data workload.
💰 Zero Per-Token Cost No API bills. Run queries continuously without usage limits. The hardware you already own becomes your AI backend.
⚡ Lower Latency No round trip to a cloud datacenter. Responses come directly from your GPU or CPU, often faster than remote APIs for short queries.
🔧 Full Customization Fine-tune models on your own data, adjust system prompts, change context lengths, and switch models without API restrictions.
🌐 Offline Operation Works without internet access. Important for air-gapped environments, travel, or networks with restricted connectivity.
📋 Compliance and Data Residency Some industries require data to stay within specific jurisdictions or infrastructure. Local LLMs satisfy these requirements by design.

Why Expose a Local LLM to the Internet?

A local LLM bound to localhost can only be reached from the machine running it. Many real-world workflows require a public endpoint:

👥 Team Collaboration

Share a single high-performance local model with multiple teammates. One developer runs a machine with a powerful GPU, and the rest of the team sends API requests to it via the Localtonet tunnel URL without needing their own hardware.

🔗 Webhook and Automation Integration

Connect your local LLM to external services like Zapier, Make, n8n, or GitHub Actions. These services need to send requests to a public URL. Localtonet provides that URL pointing directly at your local model server.

🤖 AI Agent Endpoints

AI agents built with LangChain, AutoGPT, CrewAI, or similar frameworks often need to call external tools or be called by external orchestrators. A Localtonet tunnel gives your agent a public API endpoint.

🔌 MCP-Compatible Services

The Model Context Protocol (MCP) enables AI assistants to connect to tools over HTTP. A Localtonet HTTP tunnel makes your local LLM or MCP tool server accessible to Claude, Cursor, and other MCP-compatible clients from anywhere.

📱 Mobile App Testing

Test your mobile app's AI integration against a real local model during development. Your phone accesses the model via the public tunnel URL while the model runs on your development machine.

🖥 Cross-Device Access

Run your LLM on a powerful desktop and access it from a lightweight laptop or tablet. The Localtonet tunnel is the bridge between your inference machine and your working machine.

Top Open-Weight LLMs for Local Deployment in 2026

The open-weight model ecosystem has matured dramatically. These are the models worth running locally today, organized by use case.

| Model | Creator | Best For | Min RAM (quantized) | Ollama Command |
|---|---|---|---|---|
| Llama 4 (8B) | Meta | General purpose, coding, chat | 6 GB | ollama run llama4:8b |
| Qwen3 (0.6B-235B) | Alibaba | Multilingual, reasoning, coding | 1 GB (0.6B) | ollama run qwen3:0.6b |
| DeepSeek-R1 / V3 | DeepSeek | Advanced reasoning, math, science | 5 GB (7B distill) | ollama run deepseek-r1:7b |
| Mistral Small 3 (24B) | Mistral AI | Balanced quality and speed, long context | 16 GB | ollama run mistral-small3 |
| Gemma 3 (1B-27B) | Google | Efficient, multimodal (vision), low hardware | 1 GB (1B) | ollama run gemma3:1b |
| Phi-4 (14B) | Microsoft | Reasoning, coding, small footprint | 9 GB | ollama run phi4 |
| gpt-oss (20B) | OpenAI | Strong reasoning, tool use, structured output | 16 GB | ollama run gpt-oss:20b |
| Kimi K2 Thinking | Moonshot AI | Agentic workflows, multi-step reasoning | 5 GB (distill) | ollama run kimi-k2 |
| GLM-4.7 | Zhipu AI | Coding, tool use, agentic execution | 8 GB | ollama run glm4:9b |
Hardware tip: quantization makes models run on consumer hardware

Most local LLM tools use GGUF quantization (Q4, Q5, Q8) to reduce model size. A 7B parameter model at Q4 quantization needs roughly 4-5 GB of RAM or VRAM. A 13B model needs about 8-10 GB. Apple Silicon (M1/M2/M3/M4) unified memory is particularly efficient for local inference because the GPU and CPU share the same memory pool, making a 32 GB M3 Mac capable of running 30B+ models comfortably.
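As a back-of-envelope check on the figures above: a quantized model needs roughly parameters × bits-per-weight ÷ 8 bytes, plus overhead for the KV cache and runtime buffers. A minimal sketch (the 20% overhead factor is a rough assumption, not a specification):

```python
def estimated_ram_gb(params_billion: float, bits_per_weight: int = 4,
                     overhead: float = 1.2) -> float:
    """Rough RAM/VRAM footprint for a quantized GGUF model.

    bytes ~= parameters * (bits per weight / 8), plus ~20% overhead
    for the KV cache, activations, and runtime buffers (assumed).
    """
    bytes_needed = params_billion * 1e9 * (bits_per_weight / 8)
    return round(bytes_needed * overhead / 1e9, 1)

# A 7B model at Q4 lands in the 4-5 GB range quoted above:
print(estimated_ram_gb(7, 4))    # → 4.2
print(estimated_ram_gb(13, 4))   # → 7.8
```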

Local LLM Tools: Which One Runs on Which Port?

Every local LLM tool starts an HTTP API server on a specific port. When you create a Localtonet tunnel, you point it at that port. Here is a quick reference for the most popular tools:

| Tool | Default Port | Interface | OpenAI-Compatible API | Best For |
|---|---|---|---|---|
| Ollama | 11434 | CLI + API | ✓ | Developers, homelab, automation |
| LM Studio | 1234 | GUI + API | ✓ | Non-technical users, GUI workflow |
| LocalAI | 8080 | API + Web UI | ✓ | Full stack: text, image, audio, vision |
| text-generation-webui | 7860 (UI) / 5000 (API) | Web UI + API | ✓ | Advanced users, fine-tuning |
| GPT4All | 4891 | Desktop + API | ✓ | Windows users, no setup |
| Jan | 1337 | Desktop + API | ✓ | Privacy-first assistant, offline use |
| llama.cpp server | 8080 | API | ✓ | Minimal footprint, embedded systems |

How Localtonet Works with a Local LLM

Your LLM tool starts an HTTP API server on a local port (e.g., Ollama on port 11434). Localtonet creates an encrypted outbound connection from that port to a public HTTPS URL on Localtonet's servers. Any request to the public URL is forwarded through the tunnel to your local model, processed, and returned to the caller. Your machine never needs an open inbound port, a static IP, or any router configuration.

1. Start your local LLM server

Make sure your LLM tool is running and listening on its default port. Test locally: curl http://localhost:11434/api/tags (Ollama example).

2. Install and authenticate Localtonet

Install Localtonet for your OS (see platform guides below). Authenticate with your AuthToken from Dashboard → My Tokens.

3. Create an HTTP tunnel in the dashboard

Go to localtonet.com/tunnel/http. Select your process type (Random Subdomain, Custom Subdomain, or Custom Domain), select your AuthToken, choose a server, enter IP 127.0.0.1 and the port your LLM uses. Click Create.

4. Start the tunnel

Press Start in the tunnel list. Your public HTTPS URL appears immediately (e.g., https://myai.localto.net).

5. Use the public URL as your LLM API endpoint

Replace http://localhost:PORT with your tunnel URL in any client, integration, or API call.
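Step 5 amounts to swapping the scheme and host while keeping the API path and query intact. A small helper illustrating the rewrite (myai.localto.net is the example subdomain from step 4; the helper itself is illustrative):

```python
from urllib.parse import urlsplit, urlunsplit

def to_tunnel_url(local_url: str, tunnel_base: str) -> str:
    """Rewrite a localhost API URL to go through the tunnel.

    Keeps the path, query, and fragment; swaps scheme and host/port
    for the public Localtonet URL.
    """
    local = urlsplit(local_url)
    public = urlsplit(tunnel_base)
    return urlunsplit((public.scheme, public.netloc, local.path,
                       local.query, local.fragment))

print(to_tunnel_url("http://localhost:11434/api/tags",
                    "https://myai.localto.net"))
# → https://myai.localto.net/api/tags
```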

Method 1: Exposing Ollama via Localtonet

Ollama + Localtonet

Port 11434 Most Popular OpenAI-Compatible

Ollama is the most widely used tool for running local LLMs. It installs in one command, supports 100+ models, exposes an OpenAI-compatible REST API on port 11434, and runs on Windows, macOS, and Linux with GPU acceleration (NVIDIA CUDA, Apple Metal, AMD ROCm).

Step 1: Install and Run Ollama

1. Install Ollama

# macOS and Linux:
curl -fsSL https://ollama.com/install.sh | sh

# Windows: download installer from https://ollama.com/download
2. Pull a model and start the server

# Pull and run a model (downloads automatically on first run):
ollama run llama4:8b

# For smaller hardware:
ollama run gemma3:1b

# For reasoning tasks:
ollama run deepseek-r1:7b

# For multilingual or coding:
ollama run qwen3:0.6b
3. Verify Ollama API is running

curl http://localhost:11434/api/tags

You should see a JSON list of your downloaded models.

Step 2: Create the Localtonet Tunnel

Go to localtonet.com/tunnel/http. Enter IP 127.0.0.1 and port 11434. Click Create, then Start.

Step 3: Call Your Model via the Public URL

# Test the Ollama API through the tunnel:
curl https://YOUR-TUNNEL.localto.net/api/tags

# Send a chat completion (OpenAI-compatible format):
curl https://YOUR-TUNNEL.localto.net/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama4:8b",
    "messages": [
      {"role": "user", "content": "Explain quantum computing in simple terms"}
    ]
  }'

# Use Ollama native API format:
curl https://YOUR-TUNNEL.localto.net/api/chat \
  -d '{
    "model": "llama4:8b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
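Ollama's native /api/chat endpoint streams its reply as newline-delimited JSON, one chunk per line, with a final object carrying done: true. A sketch of how a client might reassemble the full message from a tunneled response (the sample lines below are illustrative, not captured output):

```python
import json

def collect_stream(ndjson_lines) -> str:
    """Join the content chunks of an Ollama streaming /api/chat
    response: one JSON object per line, final object has done=true."""
    parts = []
    for line in ndjson_lines:
        if not line.strip():
            continue
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

# Simulated stream, as it would arrive through the tunnel:
sample = [
    '{"message": {"role": "assistant", "content": "Hel"}, "done": false}',
    '{"message": {"role": "assistant", "content": "lo!"}, "done": false}',
    '{"done": true}',
]
print(collect_stream(sample))  # → Hello!
```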
Important: Ollama binds to localhost by default

By default Ollama only listens on 127.0.0.1, which is correct for a Localtonet tunnel (the tunnel connects to localhost). If you ever need Ollama to accept connections from other local machines directly (without a tunnel), set the environment variable OLLAMA_HOST=0.0.0.0. For the Localtonet setup, the default localhost binding is what you want.

Method 2: Exposing LM Studio via Localtonet

LM Studio + Localtonet

Port 1234 GUI OpenAI + Anthropic Compatible

LM Studio is the best graphical interface for running local LLMs. It includes a model browser, chat interface, and a developer API server that supports both OpenAI-compatible and Anthropic-compatible endpoints. Default port is 1234.

1. Start the API server in LM Studio

Download LM Studio from lmstudio.ai. Open the app, go to the Developer tab, load a model, and toggle Start Server. The server starts on http://localhost:1234.

2. Verify the server is running

curl http://localhost:1234/v1/models
3. Create the Localtonet tunnel for port 1234

Go to localtonet.com/tunnel/http. Enter IP 127.0.0.1 and port 1234. Click Create, then Start.

4. Call via public URL

curl https://YOUR-TUNNEL.localto.net/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "loaded-model-name",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
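Because the /v1/models response follows the OpenAI list format, a client can discover which model LM Studio has loaded before sending chat requests. A small sketch (the sample body and model id are illustrative):

```python
import json

def list_model_ids(models_response: str) -> list:
    """Extract model ids from an OpenAI-style /v1/models response,
    as returned by LM Studio and most OpenAI-compatible servers."""
    body = json.loads(models_response)
    return [m["id"] for m in body.get("data", [])]

# Shape of a typical /v1/models body (example id is illustrative):
sample = '{"object": "list", "data": [{"id": "qwen3-0.6b", "object": "model"}]}'
print(list_model_ids(sample))  # → ['qwen3-0.6b']
```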

Method 3: Exposing LocalAI via Localtonet

LocalAI + Localtonet

Port 8080 Docker Text + Image + Audio

LocalAI is a full-stack open-source AI platform. It runs as a Docker container and provides a complete OpenAI drop-in replacement API covering text generation, image generation (Stable Diffusion), audio transcription (Whisper), and text-to-speech. Default port is 8080.

1. Start LocalAI with Docker

# CPU only:
docker run -ti --name local-ai -p 8080:8080 \
  localai/localai:latest-cpu

# With Nvidia GPU:
docker run -ti --name local-ai -p 8080:8080 --gpus all \
  localai/localai:latest-gpu-nvidia-cuda-12

Browse and install models at http://localhost:8080/browse/

2. Create the Localtonet tunnel for port 8080

Go to localtonet.com/tunnel/http. Enter IP 127.0.0.1 and port 8080. Click Create, then Start.

3. Call via public URL

# Chat completion:
curl https://YOUR-TUNNEL.localto.net/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4", "messages": [{"role":"user","content":"Hi"}]}'

# Image generation:
curl https://YOUR-TUNNEL.localto.net/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{"prompt": "a sunset over mountains", "size": "256x256"}'
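When the image endpoint returns base64 data (the OpenAI images format can carry either a URL or a b64_json field, depending on LocalAI's configuration), the payload can be decoded and saved like this sketch (the stand-in payload below is not a real image):

```python
import base64
import json
import os
import tempfile

def save_generated_image(response_text: str, out_path: str) -> int:
    """Decode the first image from an OpenAI-format
    /v1/images/generations response (assumes a b64_json field;
    LocalAI may return a url instead, depending on configuration)."""
    body = json.loads(response_text)
    raw = base64.b64decode(body["data"][0]["b64_json"])
    with open(out_path, "wb") as f:
        f.write(raw)
    return len(raw)

# Tiny stand-in payload (not real image bytes):
fake = json.dumps({"data": [{"b64_json": base64.b64encode(b"PNG...").decode()}]})
out = os.path.join(tempfile.gettempdir(), "generated.png")
print(save_generated_image(fake, out))  # → 6
```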

Method 4: Exposing text-generation-webui via Localtonet

text-generation-webui (oobabooga) + Localtonet

Port 7860 (UI) Port 5000 (API) Advanced

text-generation-webui (commonly known as oobabooga) is a feature-rich web interface and API server for running LLMs. It supports GGUF, GPTQ, AWQ, and EXL2 model formats, multi-user sessions, and model fine-tuning. It exposes a web UI on port 7860 and an API on port 5000.

1. Start text-generation-webui with the API enabled

python server.py --api --listen

The web UI starts on port 7860. The API starts on port 5000.

2. Create tunnels for both ports (or just the one you need)

Create one HTTP tunnel for port 7860 to share the web UI, and another for port 5000 to expose the API. Create and start each tunnel in the HTTP Tunnel dashboard.

Method 5: Exposing GPT4All via Localtonet

GPT4All + Localtonet

Port 4891 Windows No Setup

GPT4All is a polished desktop application that requires minimal setup. It includes a local API server on port 4891 that can be enabled in settings. Ideal for Windows users who want a full desktop experience with no command-line setup.

1. Enable the local API server in GPT4All

Open GPT4All → Settings → Application → Enable API Server. The server starts on port 4891.

2. Create the Localtonet tunnel for port 4891

Go to localtonet.com/tunnel/http. Enter IP 127.0.0.1 and port 4891. Click Create, then Start.

Method 6: Exposing Jan via Localtonet

Jan + Localtonet

Port 1337 Privacy-First Desktop

Jan is an offline-first assistant platform with a clean desktop UI similar to ChatGPT. It includes an OpenAI-compatible API server on port 1337. Best for users who want a fully local ChatGPT replacement with complete privacy control.

1. Enable the API server in Jan

Download Jan from jan.ai. Go to Settings → Local API Server and toggle it on. The server starts on http://localhost:1337.

2. Create the Localtonet tunnel for port 1337

Go to localtonet.com/tunnel/http. Enter IP 127.0.0.1 and port 1337. Click Create, then Start.

Method 7: Exposing llama.cpp Server via Localtonet

llama.cpp server + Localtonet

Port 8080 Minimal Any Hardware

llama.cpp is the low-level inference engine that powers Ollama, LM Studio, and many other tools. Running its built-in HTTP server directly gives you the lowest overhead and maximum hardware flexibility. It starts an OpenAI-compatible server on port 8080 by default.

1. Start the llama.cpp server

./llama-server \
  --model /path/to/model.gguf \
  --port 8080 \
  --ctx-size 4096 \
  --n-gpu-layers 35

Verify: curl http://localhost:8080/health

2. Create the Localtonet tunnel for port 8080

Go to localtonet.com/tunnel/http. Enter IP 127.0.0.1 and port 8080. Click Create, then Start.

Security: Protecting Your Public LLM Endpoint

Exposing a local LLM publicly without authentication means anyone who finds the URL can use your GPU for free, generate unlimited content, and potentially extract system prompts or sensitive data loaded in context. Always secure your tunnel before sharing the URL.

🔑 Option A: Localtonet Built-in HTTP Authentication (Easiest)

Enable username and password authentication directly on the Localtonet tunnel. Go to your tunnel settings in the dashboard and add credentials under Authentication. Every request to the tunnel URL must include the correct credentials. This works for any LLM tool without modifying the tool itself.

🏢 Option B: Localtonet SSO (Google, GitHub, Microsoft, GitLab)

Add SSO authentication to the tunnel so only users with accounts from a specific identity provider can access the endpoint. This is ideal for team scenarios where you want to restrict access to users with a company email domain.

🌐 Option C: Restrict to Specific IP Addresses

In your Localtonet tunnel settings, enable IP whitelisting and add the IP addresses of your team members or specific servers. All requests from other IPs will be rejected before they reach your LLM.

⏱ Option D: Time-Limited Tunnels

For demos and temporary access, start the tunnel only when needed and stop it immediately after. Press Stop in the Localtonet dashboard when the session ends. This minimizes exposure time entirely.

Ollama has no built-in authentication

Ollama does not currently implement API key authentication on its HTTP server. If you expose Ollama via a tunnel without Localtonet's HTTP auth or SSO layer, anyone with the URL can use your model. Always enable at minimum username and password authentication in Localtonet tunnel settings when sharing an Ollama endpoint.

Keeping the Tunnel Running Automatically

For a production AI endpoint or a team-shared model server, the tunnel needs to start automatically after reboots and restart if it crashes. Localtonet supports a built-in service mode on all platforms.

Service Mode (Windows, Linux, macOS)

Localtonet registers as a systemd service on Linux, a launchd job on macOS, or a Windows Service.

1. Install and start the service

sudo localtonet --install-service --authtoken YOUR_TOKEN_HERE
sudo localtonet --start-service --authtoken YOUR_TOKEN_HERE
2. Verify

# Linux:
systemctl status localtonet

# macOS:
launchctl list | grep localtonet

# Windows: open services.msc, find Localtonet
3. Stop or remove if needed

sudo localtonet --stop-service --authtoken YOUR_TOKEN_HERE
sudo localtonet --remove-service --authtoken YOUR_TOKEN_HERE

Real-World Use Case: Sharing a Local LLM with a Team

A common scenario: one developer in a team has a machine with a powerful GPU. The rest of the team wants to use that model for code review, document summarization, and internal chatbot tasks, without running models on their own laptops.

1. Set up the model server on the GPU machine

ollama run mistral-small3
2. Create a Custom Subdomain tunnel

Go to localtonet.com/tunnel/http. Select Custom Subdomain, enter a name like team-ai, select your AuthToken, and enter port 11434. Create and Start. Your team endpoint is https://team-ai.localto.net.

3. Enable HTTP authentication in tunnel settings

Add username and password in the Localtonet tunnel settings. Share credentials with the team securely.

4. Team members connect from any tool

Any OpenAI-compatible client can now point at https://team-ai.localto.net as the base URL. Works with LangChain, n8n, Cursor, Continue.dev, OpenWebUI, and any app that supports a custom OpenAI endpoint.

curl https://team-ai.localto.net/v1/chat/completions \
  -u username:password \
  -H "Content-Type: application/json" \
  -d '{"model": "mistral-small3", "messages": [{"role":"user","content":"Review this code..."}]}'
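The curl -u call above translates directly to Python's standard library. This sketch only builds the authenticated request object (the URL, credentials, and prompt are the placeholders from the example); sending it is a single urllib.request.urlopen(req) call:

```python
import base64
import json
import urllib.request

def chat_request(base_url, user, password, model, prompt):
    """Build an authenticated OpenAI-style chat-completion request,
    the Python equivalent of curl -u username:password."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    payload = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json",
                 "Authorization": f"Basic {token}"},
        method="POST",
    )

req = chat_request("https://team-ai.localto.net", "username", "password",
                   "mistral-small3", "Review this code...")
print(req.full_url)                     # → https://team-ai.localto.net/v1/chat/completions
print(req.get_header("Authorization"))  # → Basic dXNlcm5hbWU6cGFzc3dvcmQ=
```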
5. Install Localtonet as a service on the GPU machine

sudo localtonet --install-service --authtoken YOUR_TOKEN
sudo localtonet --start-service --authtoken YOUR_TOKEN

The tunnel now restarts automatically on every boot. The team always has access as long as the machine is on.

Frequently Asked Questions

How do I expose Ollama to the internet?

Install and authenticate Localtonet, then go to localtonet.com/tunnel/http and create an HTTP tunnel for IP 127.0.0.1 and port 11434 (Ollama's default port). Press Start. You will get a public HTTPS URL that forwards all requests to your Ollama server. Always add HTTP authentication in the Localtonet tunnel settings since Ollama has no built-in API key support.

Can I use a local LLM as an OpenAI API replacement?

Yes. Ollama, LM Studio, LocalAI, GPT4All, Jan, and llama.cpp server all expose OpenAI-compatible API endpoints at /v1/chat/completions and /v1/models. Point your application's OpenAI base URL at your Localtonet tunnel URL, set any string as the API key (most tools accept anything), and your existing OpenAI-compatible code will work against your local model with no other changes.

What is the best local LLM for low-spec hardware?

For very low hardware (4 GB RAM or less), Gemma 3 1B (ollama run gemma3:1b) and Qwen3 0.6B (ollama run qwen3:0.6b) are excellent choices. Both run on CPU without a GPU. Phi-4 Mini is another strong option for reasoning tasks on limited hardware. For 8 GB RAM with a GPU, Llama 4 8B, Mistral 7B, and DeepSeek-R1 7B offer much better quality.

Is it safe to expose a local LLM via a tunnel?

A Localtonet tunnel is safer than opening a port directly on your router because it uses an outbound connection and does not expose your home IP. However, the LLM endpoint itself will be publicly reachable via the tunnel URL. Always enable HTTP authentication or SSO in Localtonet tunnel settings to prevent unauthorized use of your compute. Also close the tunnel when it is not needed.

Can I use my local LLM with AI coding tools like Cursor or Continue.dev?

Yes. Cursor, Continue.dev, Cline, Roo Code, and most AI coding assistants support a custom base URL for the OpenAI-compatible API. Set the base URL to your Localtonet tunnel URL and select your local model. This gives you a capable coding assistant running on your own hardware with zero API costs and complete data privacy.

What hardware do I need to run a local LLM?

The minimum is 8 GB of RAM for a useful 7B parameter model at Q4 quantization. A dedicated GPU (NVIDIA with CUDA, or AMD with ROCm) dramatically improves token generation speed. Apple Silicon Macs (M1 and later) with unified memory are particularly well suited for local inference. For larger models (13B-30B), 16-32 GB RAM or VRAM is recommended. Sub-1B models like Gemma 3 1B can run on almost any device including older laptops.

Can I use a custom domain for my local LLM API endpoint?

Yes. Add your domain to the Localtonet DNS Manager, update your nameservers to ns1.localtonet.com and ns2.localtonet.com at your registrar, and select Custom Domain when creating the HTTP tunnel. Your LLM API will be accessible at a URL like https://ai.yourdomain.com with automatic HTTPS. This is useful for production AI agent endpoints and team-shared model APIs.

Quick Reference: All LLM Tools and Tunnel Ports

| Tool | Port(s) | Install Command | Tunnel IP |
|---|---|---|---|
| Ollama | 11434 | curl -fsSL https://ollama.com/install.sh \| sh | 127.0.0.1 |
| LM Studio | 1234 | Download from lmstudio.ai | 127.0.0.1 |
| LocalAI | 8080 | docker run -p 8080:8080 localai/localai:latest-cpu | 127.0.0.1 |
| text-generation-webui | 7860 / 5000 | python server.py --api --listen | 127.0.0.1 |
| GPT4All | 4891 | Download from gpt4all.io | 127.0.0.1 |
| Jan | 1337 | Download from jan.ai | 127.0.0.1 |
| llama.cpp server | 8080 | ./llama-server --model model.gguf --port 8080 | 127.0.0.1 |
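For scripting, the same reference fits in a small lookup table (ports taken from the table above; the tunnel_target helper is illustrative):

```python
# Default API ports from the quick-reference table above.
DEFAULT_PORTS = {
    "Ollama": 11434,
    "LM Studio": 1234,
    "LocalAI": 8080,
    "text-generation-webui": 5000,   # API port; the web UI is 7860
    "GPT4All": 4891,
    "Jan": 1337,
    "llama.cpp server": 8080,
}

def tunnel_target(tool: str) -> str:
    """Return the IP:port pair to enter when creating the HTTP tunnel."""
    return f"127.0.0.1:{DEFAULT_PORTS[tool]}"

print(tunnel_target("Ollama"))  # → 127.0.0.1:11434
```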

Your Local AI, Accessible Everywhere

Secure HTTPS tunnel for Ollama, LM Studio, LocalAI, and any local LLM. No port forwarding. No static IP. Free to start.

Get Started Free →

Localtonet is a secure multi-protocol tunneling and proxy platform designed to expose localhost, devices, private services, and AI agents to the public internet, supporting HTTP/HTTPS tunnels, TCP/UDP forwarding, mobile proxy infrastructure, file server publishing, latency-optimized game connectivity, and developer-ready AI agent endpoint exposure from a single unified control plane.
