12 min read

How to Run a Local AI Coding Assistant That Never Sends Your Code to the Cloud

With Ollama and the Continue extension, set up a fully private AI coding assistant without leaving VS Code: tab autocomplete, chat, and codebase search. Everything runs on your own hardware; no data is sent to the cloud.

🤖 Local AI · VS Code · Ollama · Continue · Private Coding Assistant


GitHub Copilot, Cursor, and similar tools send every keystroke, every file you open, and every completion request to external servers. For personal projects that is a trade-off many developers accept. For client work, proprietary code, or anything under an NDA, it is a problem. This guide shows you how to build a fully private coding assistant using Ollama and the Continue extension for VS Code, all running on your own machine, with no data leaving your hardware.

🔒 Zero data sent to the cloud 💸 No subscription or API costs ⚡ Tab autocomplete and chat 🌍 Works offline

Why Run Your Coding Assistant Locally?

Cloud AI coding tools are convenient but they come with a trade-off that is easy to overlook. Every completion request sends context to a remote server: the file you are editing, surrounding code, imports, variable names, and function signatures. For many tools, that context window includes other files open in your editor. The provider's privacy policy determines what happens to that data.

A local setup eliminates the trade-off entirely. The model runs on your CPU or GPU. Completions happen in memory. Nothing leaves your machine. Once the model is downloaded, the assistant works offline with no internet connection at all.

Local AI assistant (this guide)
  • Code never leaves your machine
  • Works fully offline
  • No subscription or per-token fees
  • No rate limits or outages
  • Full control over model version
  • Works with proprietary and NDA-covered code
Cloud AI assistants (Copilot, Cursor)
  • Code sent to external servers on every request
  • Requires internet connection
  • Monthly subscription or per-token billing
  • Subject to rate limits and service outages
  • Model updates happen without your control

The Stack: Ollama and Continue

Two tools do all the work here.

🦙 Ollama — local model runtime

Ollama downloads and runs open-source language models on your machine. It exposes a REST API on http://localhost:11434 that is compatible with the OpenAI API format. Any tool that knows how to talk to OpenAI can be pointed at Ollama instead. Ollama automatically uses your GPU if one is available and falls back to CPU otherwise.
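You can verify the OpenAI-compatible endpoint directly with curl. This is a sketch: it assumes qwen2.5-coder:7b has already been pulled (covered in Step 2) and that Ollama is running.

```shell
# Request body in the OpenAI chat-completions format
PAYLOAD='{"model": "qwen2.5-coder:7b", "messages": [{"role": "user", "content": "Say hello in one word."}]}'

# Ollama mirrors the OpenAI API under /v1 on the local port
curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD"
```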

🔗 Continue — VS Code extension

Continue is an open-source editor extension that adds AI chat and inline tab autocomplete; it supports both VS Code and JetBrains IDEs. It is model-agnostic: you configure which backend it talks to, and it works identically whether that backend is OpenAI, Anthropic, or a local Ollama instance.

0 cloud requests made · 0 monthly fees · 100+ models available · ~30 min setup time

Step 1: Install Ollama

🐧 Linux

curl -fsSL https://ollama.com/install.sh | sh

# Verify it is running
ollama --version
systemctl status ollama

🍎 macOS

Download the Ollama app from ollama.com/download, unzip it, drag Ollama.app to Applications, and launch it. Ollama runs in the menu bar and starts the API server automatically. No need to run ollama serve manually when using the macOS app.

🪟 Windows

Download OllamaSetup.exe from ollama.com/download and run the installer. Ollama adds itself to the PATH and starts as a background service automatically.

Verify Ollama is responding on any platform:

curl http://localhost:11434
# Expected response: Ollama is running

Step 2: Pull the Right Models

You need two types of models: one for chat (answering questions, explaining code, generating functions) and one for tab autocomplete (predicting what you are about to type inline). These are separate models with different characteristics. Chat models are larger and more capable. Autocomplete models are smaller, faster, and optimized for completing partial code tokens.

Model | Role | Size | Good for | Pull command
qwen2.5-coder:7b | Chat | ~4.7 GB | Best chat model for most hardware | ollama pull qwen2.5-coder:7b
deepseek-r1:7b | Chat | ~4.7 GB | Strong reasoning, good for complex logic | ollama pull deepseek-r1:7b
llama3.2:3b | Chat | ~2 GB | Low-spec hardware, fast responses | ollama pull llama3.2:3b
qwen2.5-coder:1.5b | Autocomplete | ~1 GB | Fast inline suggestions on any hardware | ollama pull qwen2.5-coder:1.5b
nomic-embed-text | Embeddings | ~274 MB | Codebase indexing for context search | ollama pull nomic-embed-text

Pull the recommended starter set:

ollama pull qwen2.5-coder:7b
ollama pull qwen2.5-coder:1.5b
ollama pull nomic-embed-text

# Verify they downloaded correctly
ollama list

How much RAM do you need?

The model must fit in memory (RAM or VRAM) to run. A 7B model at 4-bit quantization needs about 5 GB. A 1.5B autocomplete model needs about 1 GB. If you have 8 GB of RAM, running both at the same time leaves enough headroom for your OS and editor. With 16 GB or more, you can comfortably run larger chat models.
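As a rough sanity check on those numbers (a heuristic, not an official formula): 4-bit quantization stores about half a byte per parameter, and the gap up to ~5 GB comes from the KV cache and runtime overhead.

```shell
# 4-bit quantization ≈ 0.5 bytes per parameter; weights-only footprint for 7B params
WEIGHTS_GB=$(awk 'BEGIN { printf "%.1f", 7e9 * 0.5 / 1e9 }')
echo "7B model at 4-bit: ${WEIGHTS_GB} GB of weights, ~5 GB with cache and overhead"
```

Once models are loaded, ollama ps shows what is resident in memory and whether it landed on CPU or GPU.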

Step 3: Install the Continue Extension in VS Code

1. Install from the VS Code marketplace

Press Ctrl+Shift+X (or Cmd+Shift+X on macOS) to open the Extensions panel. Search for Continue and install the extension published by Continue. A Continue icon appears in the sidebar when installation is complete.

2. Open the Continue sidebar

Click the Continue icon in the VS Code sidebar. On first launch, Continue asks you to select a provider. Skip the account setup by choosing Use local model or close the wizard; you will configure it manually in the next step.

3. Open the configuration file

Press Ctrl+Shift+P and run Continue: Open config.yaml. This opens the main configuration file where you specify which models Continue uses.

Step 4: Configure Chat, Autocomplete, and Embeddings

Replace the contents of config.yaml with the following. This sets up the chat model, the autocomplete model, and the embedding model for codebase search, all pointing at your local Ollama instance.

name: Local Ollama Config
version: 0.0.1
schema: v1

models:
  - name: Qwen2.5 Coder 7B
    provider: ollama
    model: qwen2.5-coder:7b
    apiBase: http://localhost:11434
    roles:
      - chat
      - edit

  - name: Qwen2.5 Coder 1.5B
    provider: ollama
    model: qwen2.5-coder:1.5b
    apiBase: http://localhost:11434
    roles:
      - autocomplete

  - name: Nomic Embed Text
    provider: ollama
    model: nomic-embed-text
    apiBase: http://localhost:11434
    roles:
      - embed

Save the file. Continue reloads automatically and connects to Ollama.

Test the chat

Click the Continue icon in the sidebar. Type a question in the chat panel:

Write a Python function that reads a CSV file and returns the rows as a list of dicts.

Continue sends the request to Ollama on your machine. The first response after a cold start may take a few seconds while the model loads into memory. Subsequent responses are faster.

Test tab autocomplete

Open any code file and start typing a function. After a short pause, Continue shows a grey ghost-text suggestion. Press Tab to accept it. Press Escape to dismiss. The suggestion comes from the 1.5B model running locally; no network request is made.

Keyboard shortcuts worth knowing

Ctrl+I (or Cmd+I) opens an inline edit prompt for the selected code. Ctrl+L (or Cmd+L) adds the current file to the chat context. Ctrl+Shift+L adds the selected code to the chat context.

Index your codebase for context search

Continue can index your entire codebase so you can ask questions like "where is the user authentication handled?" and get accurate answers. Open the Command Palette and run Continue: Index Codebase. Indexing uses the nomic-embed-text model locally and stores the index on disk. Once indexed, use the @codebase context provider in chat:

@codebase How does the payment processing flow work?

Use Your Local AI From a Different Machine

If you have a powerful desktop at home and want to use its GPU for AI assistance from a lighter laptop, you can run Ollama on the desktop and point Continue on the laptop at it. The setup uses a Localtonet TCP tunnel to make the Ollama API reachable from outside your local network.

1. Install and start Localtonet on the machine running Ollama

localtonet --authtoken <YOUR_TOKEN>

2. Create a TCP tunnel for port 11434

Go to the TCP/UDP tunnel page, select TCP, set local IP to 127.0.0.1 and port to 11434. Click Create and start the tunnel. Note the relay address shown in the dashboard, for example example.localto.net:33771.
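Before touching the Continue config, confirm the tunnel actually reaches Ollama. Replace the relay address below with the one from your own dashboard; the root endpoint should return the same "Ollama is running" banner as the local check in Step 1.

```shell
# Relay address from the Localtonet dashboard (yours will differ)
RELAY="example.localto.net:33771"

# Should answer exactly like http://localhost:11434 does on the host
curl -s "http://$RELAY"
```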

3. Update the apiBase in Continue's config on the remote laptop

On the laptop you want to code from, open Continue's config.yaml and replace http://localhost:11434 with the relay address:

models:
  - name: Qwen2.5 Coder 7B (Remote)
    provider: ollama
    model: qwen2.5-coder:7b
    apiBase: http://example.localto.net:33771
    roles:
      - chat
      - edit

  - name: Qwen2.5 Coder 1.5B (Remote)
    provider: ollama
    model: qwen2.5-coder:1.5b
    apiBase: http://example.localto.net:33771
    roles:
      - autocomplete

  - name: Nomic Embed Text (Remote)
    provider: ollama
    model: nomic-embed-text
    apiBase: http://example.localto.net:33771
    roles:
      - embed

Continue on your laptop now sends all requests to the Ollama instance running on your desktop at home. The desktop GPU does all the inference work. Your laptop just sends text and receives completions. Your code still never reaches any cloud server.

Keep Localtonet and Ollama running permanently on the host machine

On Linux, Ollama is already registered as a systemd service by the install script. Register Localtonet as a service too so the tunnel comes back after every reboot: sudo localtonet --install-service --authtoken <YOUR_TOKEN> followed by sudo localtonet --start-service --authtoken <YOUR_TOKEN>.

Frequently Asked Questions

How does the quality compare to GitHub Copilot?

For most everyday tasks (completing functions, generating boilerplate, explaining code, writing tests), a 7B coding model like Qwen2.5-Coder performs very well. It will not match the largest cloud models on complex multi-step reasoning tasks, but for the kind of completions you use most often, the quality gap is smaller than you might expect. The inline autocomplete from a 1.5B model is noticeably faster than cloud-based alternatives because there is no network round-trip.

Does Continue send any telemetry or usage data?

Continue collects anonymized usage data by default to help improve the extension. This does not include your code or prompts, only behavioral data such as which features you use. You can disable it entirely by adding allowAnonymousTelemetry: false to your config.yaml file.
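For example, the top of your config.yaml would gain one line; the rest of the file stays unchanged:

```yaml
name: Local Ollama Config
version: 0.0.1
schema: v1
allowAnonymousTelemetry: false
```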

Can I use this with JetBrains IDEs instead of VS Code?

Yes. Continue has a JetBrains plugin available in the JetBrains Marketplace. The configuration format and Ollama integration are identical. Install the Continue plugin, open its settings, and point it at your Ollama instance the same way.

The autocomplete suggestions are slow. How do I speed them up?

Slow autocomplete almost always means the model is running on CPU rather than GPU. Check that your GPU drivers are installed correctly and that Ollama detected the GPU by running ollama run qwen2.5-coder:1.5b and watching the terminal output for GPU layer mentions. If you are on an NVIDIA card, run nvidia-smi to confirm the driver is active. For CPU-only machines, use the 1.5B autocomplete model rather than anything larger.

Can I switch between multiple models in Continue?

Yes. Add as many models as you like to the models section of config.yaml. A model picker appears at the bottom of the Continue chat panel letting you switch between them per conversation. You might use a fast 3B model for quick questions and a larger 14B model for complex refactoring tasks.
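A sketch of such a two-model setup (the qwen2.5-coder:14b tag assumes your hardware can hold a 14B model; both entries point at the same local Ollama):

```yaml
models:
  - name: Llama 3.2 3B (quick questions)
    provider: ollama
    model: llama3.2:3b
    apiBase: http://localhost:11434
    roles:
      - chat
  - name: Qwen2.5 Coder 14B (heavy refactors)
    provider: ollama
    model: qwen2.5-coder:14b
    apiBase: http://localhost:11434
    roles:
      - chat
```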

What is the nomic-embed-text model used for?

It is an embedding model used by Continue's codebase indexing feature. When you run Index Codebase, Continue reads all your files, converts them into vector embeddings using this model, and stores the index locally. When you use @codebase in chat, Continue searches that index to find the most relevant files and code snippets to include as context in your prompt. The entire process is local, no code is sent anywhere.
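You can see what the indexer does under the hood by requesting an embedding directly from Ollama (a sketch; assumes nomic-embed-text is pulled and the server is running). The response is a JSON array of floats, the kind of vector Continue stores in its local index.

```shell
# Ask Ollama to embed a snippet of code via its embeddings endpoint
PAYLOAD='{"model": "nomic-embed-text", "prompt": "def read_csv(path): ..."}'

curl -s http://localhost:11434/api/embeddings \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD"
```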

Run AI on Your Hardware, Access It from Anywhere

Ollama and Continue give you a fully private coding assistant at zero cost. Add a Localtonet tunnel and your GPU-powered assistant is reachable from any of your devices.

Create Free Localtonet Account →

Localtonet is a secure multi-protocol tunneling and proxy platform for exposing localhost, devices, private services, and AI agents to the public internet. It supports HTTP/HTTPS tunnels, TCP/UDP forwarding, mobile proxy infrastructure, file server publishing, latency-optimized game connectivity, and developer-ready AI agent endpoint exposure from a single unified control plane.
