KahWee - Web Development, AI Tools & Tech Trends

Expert takes on AI tools like Claude and Sora, modern web development with React and Vite, and tech trends. By KahWee.

GPT-4.1: SWE-bench Performance

GPT-4.1 jumped to 54.6% on SWE-bench Verified, up from 33.2% for GPT-4o and 38% for GPT-4.5. Windsurf and Cursor adopted it immediately.

SWE-bench Results

  • 54.6% on SWE-bench Verified, a 21.4-point gain over GPT-4o (OpenAI).
  • SWE-bench tests real software engineering tasks in open-source Python repos: bug fixes and feature additions.
  • GPT-4.1 explores codebases more effectively and generates code that runs and passes tests, which is critical for agent-based development.
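
To put the numbers above in perspective, here is a small sketch that separates the absolute gain (percentage points) from the relative gain (percent improvement over the baseline). The scores come from this post; the `gains` helper is just an illustrative name, not part of any benchmark tooling.

```python
# Scores quoted in this post (percent of SWE-bench Verified tasks resolved).
scores = {"GPT-4o": 33.2, "GPT-4.5": 38.0, "GPT-4.1": 54.6}


def gains(new: float, old: float) -> tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = new - old
    relative = absolute / old * 100
    return round(absolute, 1), round(relative, 1)


for baseline in ("GPT-4o", "GPT-4.5"):
    points, percent = gains(scores["GPT-4.1"], scores[baseline])
    print(f"GPT-4.1 vs {baseline}: +{points} points ({percent}% relative)")
```

Against GPT-4o this works out to 21.4 points, or roughly a 64.5% relative improvement. The two figures are easy to confuse when reading benchmark headlines.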

Aider Polyglot Diffs

  • Aider’s polyglot diff benchmark measures how well a model edits code across multiple languages while outputting only the changed lines instead of whole files.
  • GPT-4.1 more than doubles GPT-4o’s score and beats GPT-4.5 by 8 points (OpenAI).
  • The model follows diff formats more reliably, so fewer edits fail to apply cleanly. That matters for devs who need precise, minimal changes.
  • The output token limit is now 32,768, double GPT-4o’s, for those who prefer full file rewrites.
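
To illustrate why format reliability matters for diff-style edits, here is a minimal sketch of applying a single search/replace edit. It is loosely modeled on the general idea of search/replace edits and is not Aider's actual implementation; `apply_edit` and the sample file contents are hypothetical. The edit only applies if the search text matches exactly once, so a model that misquotes the original code produces an edit that fails.

```python
def apply_edit(source: str, search: str, replace: str) -> str:
    """Apply one search/replace edit; fail if the target is missing or ambiguous."""
    count = source.count(search)
    if count != 1:
        raise ValueError(f"expected exactly one match for search text, found {count}")
    return source.replace(search, replace, 1)


# Hypothetical file contents with a bug in area().
original = (
    "def area(w, h):\n"
    "    return w + h\n"
    "\n"
    "def perimeter(w, h):\n"
    "    return 2 * (w + h)\n"
)

# A minimal edit: only the buggy line is named, not the whole file.
fixed = apply_edit(original, "    return w + h\n", "    return w * h\n")
print(fixed)
```

A full-file rewrite sidesteps the matching problem but spends output tokens on unchanged code, which is where the larger output limit comes in.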

Windsurf Integration

  • GPT-4.1 is available in Windsurf and free for all users April 14–21 (Windsurf Changelog, Reddit).
  • 60% higher coding accuracy vs. GPT-4o
  • 70% fewer incorrect file modifications
  • 50% fewer repeated edits and 50% less verbosity
  • 40% fewer unnecessary file reads, 30% more efficient tool calling (details)

Cursor Integration

  • GPT-4.1 selectable in Cursor via Settings → Models (Cursor Forum, Reddit).
  • It is free to try, and developers get GPT-4.1’s multi-language diff editing and instruction following in a VS Code-like environment (Dr. Lee’s Blog).

After a Week of Use

I’ve used GPT-4.1 in both Windsurf and Cursor. The diff generation is more reliable, with fewer back-and-forth iterations and more code that works on the first try.
