Introduction
Cursor has become one of the most popular AI-first code editors on the market, loved for its deep integration of AI assistance directly into the editing experience. But there’s a catch most users don’t think about: by default, every prompt you send, and often the code surrounding it, travels to a third-party API.
What if you could keep all of that on your own machine? With local LLMs and Cursor’s support for custom model endpoints, you can. In this guide, we’ll walk you through setting up Cursor to use a local LLM via Ollama — so you get the Cursor experience you love with full data privacy and no per-token costs.
Why Run a Local LLM with Cursor?
Before we get into the setup, it’s worth understanding what you gain from going local:
- Privacy: Your code, prompts, and completions never leave your machine. Ideal for proprietary codebases, client work under NDAs, or regulated industries.
- No usage costs: After the initial download, inference is free. Heavy users can save significantly on API bills.
- Offline capability: Work on planes, in secure environments, or anywhere without reliable internet.
- Customisation: Fine-tune or prompt-engineer your local model however you like, without being constrained by a hosted provider’s guardrails.
- Speed on good hardware: With a modern GPU, local inference can be competitive with — or faster than — hosted API response times for common tasks.
Prerequisites
Here’s what you’ll need before starting:
- Cursor installed (latest version from cursor.sh)
- Ollama installed on your machine
- At least 16GB RAM (8GB is workable but limiting)
- A GPU with 8GB+ VRAM recommended (CPU-only works but is slower)
- A terminal you’re comfortable with
Step 1: Install Ollama
Ollama is the simplest way to download and serve open-source LLMs locally. It handles model management, provides an OpenAI-compatible API, and works across macOS, Linux, and Windows.
Download and install Ollama from ollama.com. Once installed, verify it’s working:
ollama --version
Then start the Ollama server:
ollama serve
Ollama will listen on http://localhost:11434 by default. Keep this running in the background throughout your session.
Step 2: Choose and Pull Your Local Model
Not all local models are created equal for coding tasks. Here are the top contenders worth considering:
Top Coding Models for Cursor
| Model | Size | VRAM Needed | Strengths |
|---|---|---|---|
| Qwen2.5-Coder 32B | 32B | 20GB+ | Best overall coding quality |
| DeepSeek-Coder V2 Lite | 16B | 12GB+ | Excellent code reasoning |
| GLM 4 | 9B | 8GB+ | Strong instruction following |
| CodeLlama 34B | 34B | 24GB+ | Mature, well-tested |
| Mistral Nemo | 12B | 8GB+ | Fast, great for autocompletion |
| Phi-3.5 Mini | 3.8B | 4GB+ | Lightweight, surprisingly capable |
For most developers with a modern GPU, we recommend starting with Qwen2.5-Coder or DeepSeek-Coder V2 Lite. If you’re on CPU only or have limited VRAM, Phi-3.5 Mini or Mistral Nemo are solid lighter options.
Pull your chosen model:
# Example: Qwen2.5-Coder 7B (smaller, faster)
ollama pull qwen2.5-coder:7b
# Or DeepSeek Coder
ollama pull deepseek-coder-v2
# Or GLM
ollama pull glm4
Verify the model downloaded:
ollama list
Step 3: Configure Cursor to Use Your Local Model
This is where the magic happens. Cursor supports custom model endpoints that are compatible with the OpenAI API format — and Ollama provides exactly that.
Opening Cursor Settings
- Open Cursor
- Go to Settings (Cmd/Ctrl + ,) or via the menu Cursor → Settings → Cursor Settings
- Navigate to the Models section
Adding a Custom Model
In the Models section, you’ll see an option to add custom models. Click “Add Model” and enter the following:
- Model Name: The name of your Ollama model (e.g.,
qwen2.5-coder:7b) - Base URL:
http://localhost:11434/v1 - API Key: Enter any string (e.g.,
ollama) — Ollama doesn’t validate this, but Cursor requires a value
Setting as Default (Optional)
Once added, you can select your local model as the default for chat, Composer, and autocomplete. This means every interaction in Cursor routes to your local machine rather than any cloud API.
Step 4: Testing Your Setup
Open a code file in Cursor and try out a few interactions to confirm everything is working:
Chat (Cmd/Ctrl + L)
Open the chat panel and ask something like “Explain what this function does” with some code selected. You should see the response generated locally — you can monitor your GPU usage or check Ollama’s terminal output to confirm the local model is being called.
Composer (Cmd/Ctrl + I)
Open Composer and ask it to write a new function or refactor an existing one. Composer’s multi-file editing capability works well even with local models, though very large projects may benefit from a model with a longer context window.
Autocomplete (Tab)
Cursor’s autocomplete (the “ghost text” suggestions as you type) can also be routed to your local model. For this, latency matters more than for chat — aim for a model and quantisation level that responds in under a second for a smooth typing experience.
Optimising Your Local Model for Cursor
Creating a Custom Modelfile
Ollama lets you create custom model variants via a Modelfile. For Cursor coding tasks, you’ll want low temperature for consistent output and a system prompt oriented toward coding assistance:
FROM qwen2.5-coder:7b
PARAMETER temperature 0.15
PARAMETER num_ctx 32768
PARAMETER repeat_penalty 1.1
SYSTEM """
You are an expert software engineer and coding assistant. You write clean, efficient, well-commented code. When asked to explain code, be concise and accurate. When writing code, follow the existing style and conventions of the project. Never include unnecessary filler text.
"""
Save this as Modelfile and build your custom variant:
ollama create cursor-coder -f Modelfile
Then point Cursor at cursor-coder in your model settings.
Context Window Tuning
The num_ctx parameter controls how much text the model considers at once. Larger is better for Cursor’s multi-file operations, but each doubling roughly doubles memory usage. Start at 16384 and increase if you have the VRAM headroom.
Running Multiple Models
A clever setup is to run different models for different tasks:
- Fast model (e.g., Phi-3.5 Mini or Mistral Nemo) for autocomplete — prioritises low latency
- Larger model (e.g., Qwen2.5-Coder 32B or DeepSeek V2) for Chat and Composer — prioritises quality
Ollama can serve multiple models simultaneously; you just configure each in Cursor’s model settings and select the appropriate one for each feature.
Cursor Features and Local LLM Compatibility
Not every Cursor feature works equally well with local models. Here’s a breakdown:
Works Excellently
- ✅ Chat (Cmd+L): Question and answer, code explanation, debugging help
- ✅ Inline edits (Cmd+K): Targeted code modifications within a selection
- ✅ Composer: Multi-file generation and editing (with a model that has a long context window)
- ✅ Code explanation: Select code and ask what it does
Works With Limitations
- ⚠️ Autocomplete: Works but requires a fast model; larger models may introduce noticeable lag
- ⚠️ Large codebase indexing: Cursor’s codebase indexing uses embeddings; local embedding models can be used but require additional setup
Requires Hosted Model (or Not Applicable Locally)
- ❌ Cursor’s built-in web search: Tied to Cursor’s hosted infrastructure
- ❌ Some agentic features: Depending on the Cursor version, some orchestration may be tied to specific hosted models
Setting Up Local Embeddings (Optional but Recommended)
Cursor uses embeddings to understand your codebase — allowing it to retrieve relevant context from across your project. By default, these embeddings are computed by a hosted model. You can replace this with a local embedding model for full privacy.
Ollama supports embedding models:
ollama pull nomic-embed-text
The embedding endpoint is available at:
POST http://localhost:11434/v1/embeddings
Configuring Cursor to use local embeddings depends on your Cursor version — check the latest Cursor documentation or community resources for the current method, as this feature evolves frequently.
Performance Benchmarks: What to Expect
Here’s a rough guide to what you can expect in terms of performance on common consumer hardware:
| Hardware | Recommended Model | Chat Response | Autocomplete Lag |
|---|---|---|---|
| MacBook Pro M3 Pro (18GB) | Qwen2.5-Coder 7B Q8 | 5–10 seconds | 1–2 seconds |
| Mac Studio M2 Ultra (192GB) | Qwen2.5-Coder 32B | 3–6 seconds | <1 second |
| RTX 4090 (24GB VRAM) | Qwen2.5-Coder 32B Q4 | 3–5 seconds | <1 second |
| RTX 3070 (8GB VRAM) | Mistral Nemo / GLM 4 | 6–12 seconds | 2–3 seconds |
| CPU Only (32GB RAM) | Phi-3.5 Mini | 15–30 seconds | 5–10 seconds |
Chat and Composer are far more forgiving of latency than autocomplete. If your hardware makes autocomplete feel sluggish, consider keeping autocomplete on a hosted model while routing chat and Composer to your local model.
Troubleshooting
“Connection refused” or API errors in Cursor
The most common issue. Check that ollama serve is running in a terminal. Also confirm your Base URL in Cursor settings is exactly http://localhost:11434/v1 — the /v1 is required for OpenAI compatibility.
Responses feel low quality or off-topic
Try a larger or more capable model, reduce the temperature via a Modelfile, and ensure the num_ctx is large enough that the model can see the full context being sent by Cursor.
Cursor freezing or slow to respond
Your machine may be memory-constrained. Close other applications to free RAM/VRAM, try a smaller quantisation, or reduce num_ctx.
Model not appearing in Cursor’s model list
After adding a custom model in Cursor settings, restart Cursor completely. If the model still doesn’t appear, double-check the model name matches exactly what ollama list shows.
Privacy Considerations and Best Practices
Going local is a strong privacy choice, but it’s worth understanding exactly what Cursor does and doesn’t send to the cloud:
- When using a local model via a custom endpoint, prompts and completions are handled entirely by your local Ollama server.
- Cursor itself (the application) may still collect telemetry and usage data depending on your settings. Review Cursor’s privacy settings and disable telemetry if this is a concern.
- Codebase indexing may still use Cursor’s cloud infrastructure unless you’ve configured local embeddings.
- For the highest privacy assurance, check Cursor’s privacy mode options, which may disable all codebase upload features.
For most developers, routing the AI model interactions to a local LLM is the most impactful privacy step. The code that matters most — what you type into prompts and what comes back — stays on your machine.
Comparison: Local LLMs vs Cursor’s Hosted Models
| Factor | Local LLM | Hosted (Claude/GPT-4) |
|---|---|---|
| Privacy | ✅ Complete (on-device) | ❌ Sent to third party |
| Cost | ✅ Free after setup | ❌ Ongoing subscription/usage |
| Offline use | ✅ Yes | ❌ No |
| Raw capability | ⚠️ Very good, not frontier | ✅ Best available |
| Speed (cold start) | ⚠️ Depends on hardware | ✅ Consistent |
| Customisability | ✅ Full control | ❌ Limited |
The sweet spot for many developers is a hybrid approach: use local models for most day-to-day work, and switch to a hosted model when you hit a particularly complex problem that needs frontier-level reasoning.
Conclusion
Cursor is already a fantastic coding tool. Pairing it with a local LLM makes it even better for developers who care about privacy, cost, or simply the satisfaction of running everything on their own hardware.
The setup is straightforward once you understand the moving parts: Ollama serves the model, Cursor talks to it via the OpenAI-compatible endpoint, and you get a seamless AI coding experience that never touches the internet.
Start with a model that fits your hardware, tune it with a Modelfile, and enjoy the privacy-first Cursor experience. Once you’ve gone local, it’s hard to go back.
Have questions or found a better model combination that works for you? Drop a comment below — the local LLM space moves fast and community knowledge is invaluable.
Found this guide helpful? Check out our other posts on local AI tooling and developer productivity.

Leave a Reply
You must be logged in to post a comment.