Google Implements Multi-Token Prediction to Triple Gemma 4 Inference Speed
The company introduced 'drafter' models that use speculative decoding to accelerate the Gemma 4 family of open AI models by up to 3x.

Acceleration via Speculative Decoding
Google has introduced a technical update to its Gemma 4 family of open AI models that increases inference speed by up to three times. The boost comes from a process known as speculative decoding, which pairs the main model with new assistant models Google calls "drafters."
According to Google, these drafter models function by predicting sections of text in advance. This approach, termed Multi-Token Prediction (MTP), allows the system to generate multiple tokens simultaneously rather than predicting them one by one in a linear sequence. This mechanism enables the models to reach higher speeds without a loss in output quality.
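A back-of-the-envelope calculation shows where a "up to 3x" figure can come from. The numbers below are illustrative assumptions, not Google's published measurements: if a cheap drafter proposes k tokens per cycle and the large model accepts several of them in a single verification pass, throughput scales with the accepted tokens per cycle divided by the cycle's cost.

```python
# Back-of-the-envelope speedup model for speculative decoding.
# All numbers are illustrative assumptions, not Gemma 4 measurements.

def speculative_speedup(tokens_per_cycle, k, draft_cost):
    """Estimate speedup over plain autoregressive decoding.

    Baseline: one full-model forward pass per token (cost 1.0 each).
    Speculative: each cycle runs k cheap drafter passes (draft_cost each)
    plus one full-model verification pass, and yields
    `tokens_per_cycle` committed tokens on average.
    """
    cycle_cost = k * draft_cost + 1.0   # drafter work + one verify pass
    return tokens_per_cycle / cycle_cost

# A drafter at ~5% of the main model's cost that gets ~3.6 tokens
# committed per 4-token draft works out to roughly a 3x speedup.
print(round(speculative_speedup(3.6, k=4, draft_cost=0.05), 2))  # → 3.0
```

The key sensitivity is the acceptance rate: if the drafter disagrees with the main model often, cycles commit few tokens and the speedup evaporates.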
Technical Implementation of Drafters
The Multi-Token Prediction (MTP) drafters act as a preliminary layer that suggests potential future tokens. By predicting these tokens ahead of time, the primary Gemma 4 model can verify and commit to larger blocks of text more rapidly than traditional autoregressive generation allows.
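Google has not published the internals of Gemma 4's drafters, but the draft-then-verify pattern described above can be sketched in a few lines. In this toy greedy version, `draft_model` and `target_model` are hypothetical stand-in callables that map a token list to the next token; in a real system the verification of all k drafted positions happens in a single batched forward pass of the large model.

```python
def speculative_step(prefix, draft_model, target_model, k=4):
    """One draft-then-verify cycle of greedy speculative decoding.

    Returns the tokens committed this cycle. The output matches what the
    target model alone would have produced, which is why quality is preserved.
    """
    # 1) The cheap drafter proposes k tokens autoregressively.
    ctx = list(prefix)
    proposed = []
    for _ in range(k):
        tok = draft_model(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2) The large model checks the drafted positions (conceptually one
    #    batched pass) and commits the longest agreeing prefix.
    ctx = list(prefix)
    accepted = []
    for tok in proposed:
        expected = target_model(ctx)
        if expected != tok:
            accepted.append(expected)  # target's correction ends the cycle
            break
        accepted.append(tok)
        ctx.append(tok)
    else:
        # Every draft was accepted; the verify pass yields one bonus token.
        accepted.append(target_model(ctx))
    return accepted

# Toy demo: the target counts up by one; the drafter agrees except after 3.
target = lambda ctx: ctx[-1] + 1
draft = lambda ctx: 0 if ctx[-1] == 3 else ctx[-1] + 1
print(speculative_step([1], draft, target))  # → [2, 3, 4]
```

In the demo, four drafter calls plus one conceptual verify pass commit three tokens, versus three full-model passes for plain decoding; the block ends exactly where the drafter first diverges from the target.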
This optimization is designed to be particularly effective for on-device applications. By reducing the computational overhead required for each token generated, Google aims to make the Gemma 4 models run more efficiently on consumer hardware, including smartphones.
Gemma 4 Model Context and Performance
The Gemma 4 family consists of open-weight models designed to provide more flexibility than Google's proprietary Gemini series. The models are released under the Apache 2.0 license.
Recent benchmarks point to the architecture's efficacy: the Gemma 4 31-billion-parameter dense model has reached third place among all open AI models on the Arena AI text leaderboard, and the models can deliver frontier AI performance while running on a single Nvidia GPU.
Integration and Workflow Support
Beyond raw speed and parameter efficiency, Google has designed Gemma 4 with native support for agentic workflows. This allows the models to be integrated into more complex autonomous systems where the speed increases provided by MTP drafters can reduce latency in multi-step reasoning tasks.
Because the models are open-weight and licensed under Apache 2.0, developers can implement these speed optimizations across various environments, from local hardware to cloud-based deployments.
Sources
1. Ars Technica — Google's Gemma 4 AI models get 3x speed boost by predicting future tokens
2. Android Authority — Google's latest trick gets Gemma 4 running 3x faster right on your phone
3. The Keyword (Google blog) — Multi-token prediction in Gemma 4
4. Forbes — Google's Gemma 4 Runs Frontier AI On A Single GPU
5. Ars Technica — Google announces Gemma 4 open AI models, switches to Apache 2.0 license
6. MSN — Gemma 4's 31B model ranks third among all open AI models on the Arena AI leaderboard
How NewsNews AI made this story
NewsNews AI researched this story across 6 sources, drafted it, and ran the result through an independent editorial pass. It cleared editorial review on first pass.
- 6 sources cited · linked in full at the bottom of the article
- Image license verified · Unsplash
- Independent editorial pass · approved
From the editor
All key claims are supported by their cited snippets: the 3x speed boost via MTP drafters is confirmed by sources [1], [2], and [3]; the "drafters" terminology and speculative decoding mechanism are supported by [2] and [3]; the Apache 2.0 license and single-GPU capability are confirmed by [4] and [5]; the Gemma 4 31B model's third-place ranking on the Arena AI leaderboard is confirmed by [6]; and the agentic workflow support is confirmed by [4]. No fabricated quotes, no single-source dependency, and no misleading headline detected.