Why OpenAI built its own chip: inference bills and the custom-ASIC race
OpenAI ranks among the world's largest GPU consumers. Every ChatGPT query and API call triggers inference — the forward pass that turns tokens into answers. As GPT-4 and GPT-5 families grow more capable, inference spend has become the heaviest line item on the path to profitability.
Until now, OpenAI relied almost entirely on Nvidia hardware. H100, H200, and Blackwell are powerful general-purpose accelerators — not purpose-built for homogeneous LLM serving. In a workload where every request looks structurally similar, a lot of GPU capacity is effectively wasted. Think of it this way: Nvidia GPUs are a Swiss Army knife; Jalapeño is a scalpel.
Scale drives the bill: Hundreds of millions of daily users mean every API call burns cluster time.
General GPU mismatch: Chips designed for training, gaming, and simulation run pure inference at lower efficiency than a dedicated ASIC.
Peers already ship custom silicon: Google TPU, Amazon Trainium/Inferentia, Microsoft Maia, and Meta MTIA are all in production.
Late entry, fastest cycle: Nine months from design to tape-out — OpenAI claims the fastest advanced ASIC turnaround in high-performance semiconductors.
Nvidia ties remain deep: In February 2026 Nvidia invested $30B in OpenAI; training still runs on Nvidia.
| Company | Custom chip | Primary use |
|---|---|---|
| TPU (Tensor Processing Unit) | Training + inference | |
| Amazon | Trainium / Inferentia | Training + inference |
| Microsoft | Maia 100 | Inference |
| Meta | MTIA | Inference |
| OpenAI | Jalapeño (2026) | Inference only |
What Jalapeño is: ASIC architecture, 3nm process, and lab validation
An ASIC (Application-Specific Integrated Circuit) does one job — LLM inference. No gaming, no training, no general compute. That narrow focus is the efficiency trade: inside its lane, Jalapeño is designed to run far leaner than a general GPU.
“Jalapeño was designed from a blank slate for LLM inference, incorporating our deep insights into frontier models — kernel execution, data movement, network communication, and serving patterns.” — Richard Ho, OpenAI hardware lead
Blank-slate design: Every architectural choice starts from modern LLM inference, not legacy GPU layouts.
Minimize data movement: Memory bandwidth is the usual bottleneck; Jalapeño cuts unnecessary traffic between memory and compute.
Balanced compute, memory, and network: Tuned for real LLM serving loads so utilization stays closer to theoretical peaks.
Broadcom Tomahawk networking: Strong inter-node links for large clusters — critical when multi-chip serving drives very large models.
Celestica board integration: EMS partner handles motherboards and rack systems for volume manufacturing.
| Manufacturing element | Detail |
|---|---|
| Foundry | TSMC |
| Process node | 3nm (same generation as Apple M4 and Nvidia Blackwell) |
| Lab validation model | GPT-5.3-Codex-Spark (flagship coding inference model) |
| Engineering status | Silicon samples running ML workloads at target frequency and power |
Performance and cost: 50% savings, Blackwell parity, and data credibility
The figures below come from Broadcom CEO Hock Tan and OpenAI's public statements. All are early lab results. A full technical report is months away and independent benchmarks have not run — treat them as vendor-reported numbers until production data arrives.
| Metric | Jalapeño (early tests) | Benchmark |
|---|---|---|
| Inference cost savings | ~50% | vs mainstream AI GPUs |
| Performance per watt | Significantly above current state of the art | OpenAI official statement |
| Absolute throughput | On par with Nvidia Blackwell and Google TPU | Hock Tan (Reuters) |
| Thermal behavior | Better than expected | OpenAI internal testing |
“So far, Jalapeño is showing about 50% cost savings compared to typical AI GPUs.” — Hock Tan, Broadcom CEO (Bloomberg)
OpenAI president Greg Brockman added that Jalapeño moved from initial design to tape-out in just nine months, with parts of the design flow accelerated by OpenAI's own AI models (VentureBeat cited sources saying prior-generation OpenAI models assisted; exact versions were not disclosed).
Caveat: The 50% figure is Broadcom's early lab data. Production proof waits on: ① OpenAI's full technical report; ② Microsoft and partner datacenter rollouts; ③ third-party benchmarks.
~50% cost savings: Hock Tan to Bloomberg, measured against “typical AI GPUs.”
Blackwell-class performance: Reuters interview, grouped with Google TPU.
9-month tape-out: Greg Brockman, described as the fastest high-performance ASIC cycle on record.
Nine-month development, supply chain, deployment roadmap, and six-step guide
How did OpenAI tape out in nine months?
Hardware–software co-design: Model and silicon teams worked in lockstep, avoiding the guesswork that slows traditional ASIC programs.
AI-assisted chip design: OpenAI models helped accelerate parts of the design and optimization workflow.
Broadcom IP reuse: Mature blocks for implementation and networking shortened the path from logic to physical design.
| Role | Company | Responsibility |
|---|---|---|
| Architecture | OpenAI | LLM inference optimization, full-stack design |
| Silicon & networking | Broadcom | Chip implementation, Tomahawk interconnect, production support |
| Foundry | TSMC | 3nm fabrication |
| System integration | Celestica | Boards, racks, server integration, volume build |
| First deploy customer | Microsoft Azure | Datacenter rollout (starting late 2026) |
| Phase | Timing | Milestone |
|---|---|---|
| Near term | Late 2026 | First commercial deploys on Azure and partner datacenters; priority for ChatGPT, Codex, and internal API inference |
| Mid term | 2027 | Volume production; deployment exceeds 1.3 GW; possible external AI company access |
| Long term | Through 2029 | Custom silicon supporting 10 GW (~10 nuclear-plant equivalents); next generation in 2028, then annual iterations; training chips possible later |
Six-step decision guide (developers and tech leads):
Separate training from inference: Jalapeño covers inference only; frontier training still needs Nvidia GPUs — do not conflate the two in capacity planning.
Treat 50% as vendor data: Wait for OpenAI's technical report and Azure production numbers before rebuilding TCO models.
Watch late-2026 Azure signals: Microsoft's first production racks are the real test of whether savings hold.
Understand supply diversification: Even if Jalapeño carries 20–30% of inference load, that alone improves Nvidia negotiation leverage.
Track Broadcom's ASIC empire: Google TPU, Meta MTIA, and OpenAI Jalapeño all run through Broadcom — custom ASIC is now hyperscaler standard.
Decouple local agents from cloud inference: Chip savings hit API pricing; Cursor and Codex agent pipelines still need stable macOS hosts online 24/7.
Competitive landscape, industry impact, timeline, and key people
Can Jalapeño replace Nvidia soon? No. Three reasons: ① inference-only, no training; ② the CUDA software ecosystem (millions of developers, deep libraries) is the hardest moat to cross; ③ ASICs are rigid — a fundamental LLM architecture shift would be expensive to absorb.
The strategic play is supply diversification, not divorce. Quilter Cheviot global tech research head Ben Barringer put it plainly: “Nobody wants to be beholden to Nvidia.” OpenAI's own blog frames the move as building a full-stack AI company — chips, kernels, memory, networking, scheduling, and deployment — not just better models.
| Dimension | Nvidia | Jalapeño / OpenAI |
|---|---|---|
| Training | Dominant; CUDA ecosystem | Still Nvidia-dependent ($30B investment tie) |
| Inference | General GPU; share may erode | Purpose ASIC; ~50% cost target |
| Next generation | Vera Rubin platform | Second-gen Jalapeño 2028, annual cadence after |
| Broadcom role | — | Custom ASIC kingmaker: Google TPU + Meta MTIA + OpenAI |
Broader industry impact:
Inference economics reshape pricing: If 50% savings hold in production, ChatGPT and API floors drop — the AI price war gets a lower baseline.
Full-stack AI becomes the bar: Competition shifts from “whose model is best” to “whose stack is most efficient end to end.”
Semiconductor winners and losers: Broadcom, TSMC, and HBM suppliers gain; Nvidia's inference share and AMD's relevance face pressure.
| Name | Role | Contribution |
|---|---|---|
| Greg Brockman | OpenAI co-founder & president | Public launch; framed as full-stack infrastructure strategy |
| Richard Ho | OpenAI hardware lead | Architecture and technical direction |
| Hock Tan | Broadcom CEO | Blackwell-class performance and 50% cost claims |
| Sam Altman | OpenAI CEO | Overall strategy; long stated goal of owning compute destiny |
2025-10 OpenAI and Broadcom announce custom chip partnership 2026-02 Nvidia invests $30B in OpenAI (includes Vera Rubin compute agreement) 2026-06-24 Jalapeño publicly unveiled; engineering samples running in lab Late 2026 First commercial deploy (Microsoft Azure and partner datacenters) 2027 Volume production; deployment exceeds 1.3 GW 2028 (est.) Second-generation chip launch 2029 (goal) Custom silicon supporting 10 GW compute scale
Broadcom stock is up roughly 18% YTD through the first five months of 2026 and nearly 7× since late 2022 — a direct beneficiary of the custom ASIC wave.
Reality check for your stack: Cursor + Codex agents on a MacBook stop when the lid closes; a Linux-only VPS lacks Xcode and macOS Keychain; gateway plus local inference on weak hardware triggers swap jitter. Jalapeño savings live in the cloud inference layer — for 24/7 agent orchestration, stable MCP toolchains, and fast coding-stack pivots as OpenAI silicon lands, a KVMNODE dedicated Mac Mini M4 / M4 Pro is usually the better production choice. See pricing, order, and the help center.