Using Cursor With Local LLM Models: The Complete Setup Guide

Introduction

Cursor has become one of the most popular AI-first code editors on the market, loved for its deep integration of AI assistance directly into the editing experience. But there’s a catch most users don’t think about: by default, every prompt you send, and often the code surrounding it, travels to a third-party API.

What if you could keep all of that on your own machine? With local LLMs and Cursor’s support for custom model endpoints, you can. In this guide, we’ll walk you through setting up Cursor to use a local LLM via Ollama — so you get the Cursor experience you love with full data privacy and no per-token costs.

Why Run a Local LLM with Cursor?

Before we get into the setup, it’s worth understanding what you gain from going local:

Privacy: Your code, prompts, and completions never leave your machine. Ideal for proprietary codebases, client work under NDAs, or regulated industries.
No usage costs: After the initial download, inference is free. Heavy users can save significantly on API bills.
Offline capability: Work on planes, in secure environments, or anywhere without reliable internet.
Customisation: Fine-tune or prompt-engineer your local model however you like, without being constrained by a hosted provider’s guardrails.
Speed on good hardware: With a modern GPU, local inference can be competitive with — or faster than — hosted API response times for common tasks.

Prerequisites

Here’s what you’ll need before starting:

Cursor installed (latest version from cursor.sh)
Ollama installed on your machine
At least 16GB RAM (8GB is workable but limiting)
A GPU with 8GB+ VRAM recommended (CPU-only works but is slower)
A terminal you’re comfortable with

Step 1: Install Ollama

Ollama is the simplest way to download and serve open-source LLMs locally. It handles model management, provides an OpenAI-compatible API, and works across macOS, Linux, and Windows.

Download and install Ollama from ollama.com. Once installed, verify it’s working:

ollama --version

Then start the Ollama server:

ollama serve

Ollama will listen on http://localhost:11434 by default. Keep this running in the background throughout your session.

Step 2: Choose and Pull Your Local Model

Not all local models are created equal for coding tasks. Here are the top contenders worth considering:

Top Coding Models for Cursor

Model	Size	VRAM Needed	Strengths
Qwen2.5-Coder 32B	32B	20GB+	Best overall coding quality
DeepSeek-Coder V2 Lite	16B	12GB+	Excellent code reasoning
GLM 4	9B	8GB+	Strong instruction following
CodeLlama 34B	34B	24GB+	Mature, well-tested
Mistral Nemo	12B	8GB+	Fast, great for autocompletion
Phi-3.5 Mini	3.8B	4GB+	Lightweight, surprisingly capable

For most developers with a modern GPU, we recommend starting with Qwen2.5-Coder or DeepSeek-Coder V2 Lite. If you’re on CPU only or have limited VRAM, Phi-3.5 Mini or Mistral Nemo are solid lighter options.

Pull your chosen model:

# Example: Qwen2.5-Coder 7B (smaller, faster)
ollama pull qwen2.5-coder:7b

# Or DeepSeek Coder
ollama pull deepseek-coder-v2

# Or GLM
ollama pull glm4

Verify the model downloaded:

ollama list

Step 3: Configure Cursor to Use Your Local Model

This is where the magic happens. Cursor supports custom model endpoints that are compatible with the OpenAI API format — and Ollama provides exactly that.

Opening Cursor Settings

Open Cursor
Go to Settings (Cmd/Ctrl + ,) or via the menu Cursor → Settings → Cursor Settings
Navigate to the Models section

Adding a Custom Model

In the Models section, you’ll see an option to add custom models. Click “Add Model” and enter the following:

Model Name: The name of your Ollama model (e.g., qwen2.5-coder:7b)
Base URL: http://localhost:11434/v1
API Key: Enter any string (e.g., ollama) — Ollama doesn’t validate this, but Cursor requires a value

Setting as Default (Optional)

Once added, you can select your local model as the default for chat, Composer, and autocomplete. This means every interaction in Cursor routes to your local machine rather than any cloud API.

Step 4: Testing Your Setup

Open a code file in Cursor and try out a few interactions to confirm everything is working:

Chat (Cmd/Ctrl + L)

Open the chat panel and ask something like “Explain what this function does” with some code selected. You should see the response generated locally — you can monitor your GPU usage or check Ollama’s terminal output to confirm the local model is being called.

Composer (Cmd/Ctrl + I)

Open Composer and ask it to write a new function or refactor an existing one. Composer’s multi-file editing capability works well even with local models, though very large projects may benefit from a model with a longer context window.

Autocomplete (Tab)

Cursor’s autocomplete (the “ghost text” suggestions as you type) can also be routed to your local model. For this, latency matters more than for chat — aim for a model and quantisation level that responds in under a second for a smooth typing experience.

Optimising Your Local Model for Cursor

Creating a Custom Modelfile

Ollama lets you create custom model variants via a Modelfile. For Cursor coding tasks, you’ll want low temperature for consistent output and a system prompt oriented toward coding assistance:

FROM qwen2.5-coder:7b

PARAMETER temperature 0.15
PARAMETER num_ctx 32768
PARAMETER repeat_penalty 1.1

SYSTEM """
You are an expert software engineer and coding assistant. You write clean, efficient, well-commented code. When asked to explain code, be concise and accurate. When writing code, follow the existing style and conventions of the project. Never include unnecessary filler text.
"""

Save this as Modelfile and build your custom variant:

ollama create cursor-coder -f Modelfile

Then point Cursor at cursor-coder in your model settings.

Context Window Tuning

The num_ctx parameter controls how much text the model considers at once. Larger is better for Cursor’s multi-file operations, but each doubling roughly doubles memory usage. Start at 16384 and increase if you have the VRAM headroom.

Running Multiple Models

A clever setup is to run different models for different tasks:

Fast model (e.g., Phi-3.5 Mini or Mistral Nemo) for autocomplete — prioritises low latency
Larger model (e.g., Qwen2.5-Coder 32B or DeepSeek V2) for Chat and Composer — prioritises quality

Ollama can serve multiple models simultaneously; you just configure each in Cursor’s model settings and select the appropriate one for each feature.

Cursor Features and Local LLM Compatibility

Not every Cursor feature works equally well with local models. Here’s a breakdown:

Works Excellently

✅ Chat (Cmd+L): Question and answer, code explanation, debugging help
✅ Inline edits (Cmd+K): Targeted code modifications within a selection
✅ Composer: Multi-file generation and editing (with a model that has a long context window)
✅ Code explanation: Select code and ask what it does

Works With Limitations

⚠️ Autocomplete: Works but requires a fast model; larger models may introduce noticeable lag
⚠️ Large codebase indexing: Cursor’s codebase indexing uses embeddings; local embedding models can be used but require additional setup

Requires Hosted Model (or Not Applicable Locally)

❌ Cursor’s built-in web search: Tied to Cursor’s hosted infrastructure
❌ Some agentic features: Depending on the Cursor version, some orchestration may be tied to specific hosted models

Setting Up Local Embeddings (Optional but Recommended)

Cursor uses embeddings to understand your codebase — allowing it to retrieve relevant context from across your project. By default, these embeddings are computed by a hosted model. You can replace this with a local embedding model for full privacy.

Ollama supports embedding models:

ollama pull nomic-embed-text

The embedding endpoint is available at:

POST http://localhost:11434/v1/embeddings

Configuring Cursor to use local embeddings depends on your Cursor version — check the latest Cursor documentation or community resources for the current method, as this feature evolves frequently.

Performance Benchmarks: What to Expect

Here’s a rough guide to what you can expect in terms of performance on common consumer hardware:

Hardware	Recommended Model	Chat Response	Autocomplete Lag
MacBook Pro M3 Pro (18GB)	Qwen2.5-Coder 7B Q8	5–10 seconds	1–2 seconds
Mac Studio M2 Ultra (192GB)	Qwen2.5-Coder 32B	3–6 seconds	<1 second
RTX 4090 (24GB VRAM)	Qwen2.5-Coder 32B Q4	3–5 seconds	<1 second
RTX 3070 (8GB VRAM)	Mistral Nemo / GLM 4	6–12 seconds	2–3 seconds
CPU Only (32GB RAM)	Phi-3.5 Mini	15–30 seconds	5–10 seconds

Chat and Composer are far more forgiving of latency than autocomplete. If your hardware makes autocomplete feel sluggish, consider keeping autocomplete on a hosted model while routing chat and Composer to your local model.

Troubleshooting

“Connection refused” or API errors in Cursor

The most common issue. Check that ollama serve is running in a terminal. Also confirm your Base URL in Cursor settings is exactly http://localhost:11434/v1 — the /v1 is required for OpenAI compatibility.

Responses feel low quality or off-topic

Try a larger or more capable model, reduce the temperature via a Modelfile, and ensure the num_ctx is large enough that the model can see the full context being sent by Cursor.

Cursor freezing or slow to respond

Your machine may be memory-constrained. Close other applications to free RAM/VRAM, try a smaller quantisation, or reduce num_ctx.

Model not appearing in Cursor’s model list

After adding a custom model in Cursor settings, restart Cursor completely. If the model still doesn’t appear, double-check the model name matches exactly what ollama list shows.

Privacy Considerations and Best Practices

Going local is a strong privacy choice, but it’s worth understanding exactly what Cursor does and doesn’t send to the cloud:

When using a local model via a custom endpoint, prompts and completions are handled entirely by your local Ollama server.
Cursor itself (the application) may still collect telemetry and usage data depending on your settings. Review Cursor’s privacy settings and disable telemetry if this is a concern.
Codebase indexing may still use Cursor’s cloud infrastructure unless you’ve configured local embeddings.
For the highest privacy assurance, check Cursor’s privacy mode options, which may disable all codebase upload features.

For most developers, routing the AI model interactions to a local LLM is the most impactful privacy step. The code that matters most — what you type into prompts and what comes back — stays on your machine.

Comparison: Local LLMs vs Cursor’s Hosted Models

Factor	Local LLM	Hosted (Claude/GPT-4)
Privacy	✅ Complete (on-device)	❌ Sent to third party
Cost	✅ Free after setup	❌ Ongoing subscription/usage
Offline use	✅ Yes	❌ No
Raw capability	⚠️ Very good, not frontier	✅ Best available
Speed (cold start)	⚠️ Depends on hardware	✅ Consistent
Customisability	✅ Full control	❌ Limited

The sweet spot for many developers is a hybrid approach: use local models for most day-to-day work, and switch to a hosted model when you hit a particularly complex problem that needs frontier-level reasoning.

Conclusion

Cursor is already a fantastic coding tool. Pairing it with a local LLM makes it even better for developers who care about privacy, cost, or simply the satisfaction of running everything on their own hardware.

The setup is straightforward once you understand the moving parts: Ollama serves the model, Cursor talks to it via the OpenAI-compatible endpoint, and you get a seamless AI coding experience that never touches the internet.

Start with a model that fits your hardware, tune it with a Modelfile, and enjoy the privacy-first Cursor experience. Once you’ve gone local, it’s hard to go back.

Have questions or found a better model combination that works for you? Drop a comment below — the local LLM space moves fast and community knowledge is invaluable.

Found this guide helpful? Check out our other posts on local AI tooling and developer productivity.

Using Cursor with Local LLM Models: The Complete Setup Guide