OpenAI's New GPT-4.1 Models Enhance Coding and Instruction Following

OpenAI has unveiled its latest family of AI models, GPT-4.1, signaling a significant advancement in AI capabilities, particularly in coding and instruction following. This new series includes GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano, each designed to excel in specific areas while offering enhanced performance and efficiency.
Key Features and Capabilities:
- Enhanced Coding and Instruction Following: OpenAI claims that all GPT-4.1 models demonstrate superior performance in coding tasks and adhering to instructions, making them valuable tools for developers and researchers.
- Massive Context Window: The multimodal models offer a 1-million-token context window, allowing them to process approximately 750,000 words in a single input. This far exceeds the context windows of OpenAI's previous models and enables longer, more complex interactions.
- API Availability: Unlike some previous models, the GPT-4.1 family is available through OpenAI's API, providing broader access for developers to integrate these advanced capabilities into their applications.
- Competitive Landscape: The release of GPT-4.1 positions OpenAI competitively against other major players in the AI space, such as Google with its Gemini 2.5 Pro and Anthropic with its Claude 3.7 Sonnet, both of which also feature large context windows and strong coding performance.
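Since the family is exposed through OpenAI's API, developers choose a variant by model identifier in a standard Chat Completions-style request. The sketch below constructs such a request as a plain payload, without making a network call; the model identifiers are assumed to match the names in the article, and `build_request` is an illustrative helper, not part of OpenAI's SDK.

```python
# Sketch: building a Chat Completions-style request payload for the
# GPT-4.1 family. Model identifiers are assumed to match the names
# in the article; no network call is made here.

GPT_41_FAMILY = ["gpt-4.1", "gpt-4.1-mini", "gpt-4.1-nano"]

def build_request(model: str, prompt: str) -> dict:
    """Assemble a request payload for a given GPT-4.1 variant (illustrative helper)."""
    if model not in GPT_41_FAMILY:
        raise ValueError(f"unknown GPT-4.1 variant: {model}")
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

payload = build_request("gpt-4.1", "Write a unit test for this parser.")
print(payload["model"])
```

In practice the same payload shape would be sent via an API client, with the variant swapped out (e.g. `gpt-4.1-nano` for latency-sensitive work) without any other code changes.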
Performance and Benchmarks:
OpenAI's internal testing indicates that the full GPT-4.1 model outperforms its predecessors, GPT-4o and GPT-4o mini, on coding benchmarks like SWE-bench Verified. While it scores slightly below Google's Gemini 2.5 Pro and Anthropic's Claude 3.7 Sonnet on that specific benchmark, GPT-4.1 remains competitive on coding tasks.
- SWE-bench Verified: GPT-4.1 achieved scores between 52% and 54.6%, with OpenAI noting that some benchmark solutions could not be run on their infrastructure.
- Video Understanding: In a separate evaluation using Video-MME, GPT-4.1 reached 72% accuracy in understanding content in videos without subtitles, showcasing its multimodal prowess.
Model Variants and Efficiency:
- GPT-4.1: The flagship model, offering the highest accuracy and performance.
- GPT-4.1 mini: A more efficient and faster version, balancing performance with resource utilization.
- GPT-4.1 nano: The speediest and most cost-effective model, designed for maximum efficiency.
Pricing:
- GPT-4.1: $2 per million input tokens, $8 per million output tokens.
- GPT-4.1 mini: $0.40 per million input tokens, $1.60 per million output tokens.
- GPT-4.1 nano: $0.10 per million input tokens, $0.40 per million output tokens.
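The listed rates make per-request costs easy to estimate: multiply token counts (in millions) by the per-million price for input and output. A minimal sketch, using the prices above; the `estimate_cost` helper and the hyphenated model identifiers are illustrative.

```python
# Per-million-token prices (USD) as listed in the article.
PRICING = {
    "gpt-4.1":      {"input": 2.00, "output": 8.00},
    "gpt-4.1-mini": {"input": 0.40, "output": 1.60},
    "gpt-4.1-nano": {"input": 0.10, "output": 0.40},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the dollar cost of a single request from token counts."""
    p = PRICING[model]
    return (input_tokens / 1_000_000) * p["input"] \
         + (output_tokens / 1_000_000) * p["output"]

# Example: a full 1M-token input with a 10k-token reply on the flagship model.
print(round(estimate_cost("gpt-4.1", 1_000_000, 10_000), 2))  # → 2.08
```

At these rates, even a maximal 1-million-token prompt costs only a couple of dollars on the flagship model, and a twentieth of that on nano.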
Challenges and Future Outlook:
Despite its advancements, GPT-4.1, like other large language models, still faces challenges:
- Reliability with Large Inputs: OpenAI acknowledges that the model's reliability decreases with an increased number of input tokens. Accuracy dropped from 84% with 8,000 tokens to 50% with 1 million tokens in internal tests.
- Literal Interpretation: GPT-4.1 can be more literal than GPT-4o, sometimes requiring more specific prompts for optimal results.
- Security Vulnerabilities: Code-generating models, including GPT-4.1, can still fail to fix security vulnerabilities and bugs, and sometimes introduce new ones, an area that requires continuous improvement.
OpenAI's long-term vision includes developing an "agentic software engineer" capable of handling complex software engineering tasks end-to-end, from coding to quality assurance and documentation. GPT-4.1 represents a significant step towards this ambitious goal, promising to empower developers with more sophisticated AI tools.
About the Author:
Kyle Wiggers, formerly TechCrunch's AI Editor, has a background in technology journalism, with his work appearing in various publications. He resides in Manhattan.
Original article available at: https://techcrunch.com/2025/04/14/openais-new-gpt-4-1-models-focus-on-coding/