GPT-4.1: SWE-bench Performance
GPT-4.1 jumped to 54.6% on SWE-bench Verified — up from 33.2% for GPT-4o and 38% for GPT-4.5. Windsurf and Cursor adopted it immediately.
SWE-bench Results
- 54.6% on SWE-bench Verified, a 21.4-point gain over GPT-4o (OpenAI).
- SWE-bench tests real software engineering tasks in open-source Python repos: bug fixes and feature additions.
- GPT-4.1 explores codebases more effectively and generates code that runs and passes tests, which is critical for agent-based development.
Aider Polyglot Diffs
- Aider’s polyglot diff benchmark measures code changes across multiple languages, outputting only necessary diffs.
- GPT-4.1 doubles GPT-4o’s score and beats GPT-4.5 by 8 points (OpenAI).
- The model follows diff formats more reliably, reducing merge conflicts for devs who need precise, minimal changes.
- The output token limit is now 32,768 tokens, double GPT-4o's, for developers who prefer full file rewrites.
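To make the diff-format point concrete, here is a minimal sketch of what a line-level diff looks like, using Python's standard `difflib` module (the `greet.py` filename and function are illustrative, not from Aider's benchmark). A diff-following model emits only the changed lines plus a little context, rather than rewriting the whole file:

```python
import difflib

# Two versions of a small function: the edit changes a single line.
before = '''def greet(name):
    print("Hello " + name)
'''.splitlines(keepends=True)

after = '''def greet(name):
    print(f"Hello, {name}!")
'''.splitlines(keepends=True)

# A unified diff transmits only the changed lines plus surrounding context,
# which is why precise diff output saves tokens and reduces merge conflicts.
diff = list(difflib.unified_diff(before, after,
                                 fromfile="greet.py", tofile="greet.py"))
print("".join(diff))
```

The entire two-line file change compresses to one removed line and one added line; this is the economy that benchmarks like Aider's polyglot diff test reward.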
Windsurf Integration
- GPT-4.1 is available in Windsurf, free for all users April 14–21 (Windsurf Changelog, Reddit).
- 60% higher coding accuracy vs. GPT-4o
- 70% fewer incorrect file modifications
- 50% fewer repeated edits and 50% less verbosity
- 40% fewer unnecessary file reads and 30% more efficient tool calling
Cursor Integration
- GPT-4.1 selectable in Cursor via Settings → Models (Cursor Forum, Reddit).
- It is free to try, and developers get GPT-4.1's multi-language diff generation and instruction-following in a VS Code-like environment (Dr. Lee’s Blog).
After a Week of Use
I’ve used GPT-4.1 in both Windsurf and Cursor. The diff generation is more reliable, with fewer back-and-forth iterations and more code that works on the first try.