
Google Implements Multi-Token Prediction to Triple Gemma 4 Inference Speed

The company introduced 'drafter' models that use speculative decoding to accelerate the Gemma 4 family of open AI models by up to 3x.

By NewsNews AI
Photo: BoliviaInteligente on Unsplash

Acceleration via Speculative Decoding

Google has introduced a technical update to its Gemma 4 family of open AI models that increases inference speeds by up to three times. This performance boost is achieved through a process known as speculative decoding, which utilizes new assistant models referred to as "drafters".

According to Google, these drafter models function by predicting sections of text in advance. This approach, termed Multi-Token Prediction (MTP), allows the system to generate multiple tokens simultaneously rather than predicting them one by one in a linear sequence. This mechanism enables the models to reach higher speeds without a loss in output quality.
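Google has not described the drafter architecture or verification scheme in implementation-level detail. As a rough illustration of the general draft-and-verify idea behind speculative decoding, the Python sketch below uses toy stand-in models: a cheap drafter proposes several tokens at once, and the larger target model keeps only the prefix it agrees with, which is why output quality is preserved.

# Toy sketch of speculative decoding (not Google's implementation):
# a small "drafter" proposes K tokens, and the larger target model
# checks them, keeping the agreeing prefix. The model logic below is a
# stand-in; real systems compare token probabilities, not fixed rules.

from typing import Callable, List

def speculative_step(
    prefix: List[int],
    drafter_next: Callable[[List[int]], int],
    target_next: Callable[[List[int]], int],
    k: int = 4,
) -> List[int]:
    """Return the tokens accepted in one draft-and-verify round."""
    # 1) The drafter speculates K tokens autoregressively (it is cheap to run).
    draft: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        t = drafter_next(ctx)
        draft.append(t)
        ctx.append(t)

    # 2) The target model verifies the draft. In practice all K positions are
    #    scored in a single parallel forward pass; here we loop for clarity.
    accepted: List[int] = []
    ctx = list(prefix)
    for t in draft:
        expected = target_next(ctx)
        if expected != t:
            # Reject from the first mismatch and keep the target's own token,
            # so the output matches plain autoregressive decoding.
            accepted.append(expected)
            break
        accepted.append(t)
        ctx.append(t)
    return accepted

# Tiny demo: the drafter and target mostly agree, so several tokens
# are committed per verification step instead of one.
drafter = lambda ctx: (ctx[-1] + 1) % 100
target = lambda ctx: (ctx[-1] + 1) % 100 if len(ctx) % 5 else (ctx[-1] + 2) % 100

print(speculative_step([1], drafter, target, k=4))  # [2, 3, 4, 5]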

Technical Implementation of Drafters

The MTP drafters act as a preliminary layer that suggests potential future tokens. Because those tokens are proposed ahead of time, the primary Gemma 4 model can verify and commit to larger blocks of text more rapidly than traditional autoregressive generation allows.
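How much of the roughly threefold gain materializes depends on how often the main model accepts the drafted tokens and how cheap the drafter is to run relative to the main model. The back-of-the-envelope estimate below is an illustrative assumption, not a figure published by Google: it treats one round as a single target verification pass plus K cheap drafter passes.

# Back-of-the-envelope speedup model for draft-and-verify decoding.
# Assumed, not from Google: if the target accepts on average `accepted_avg`
# of the K drafted tokens per verification pass, and the drafter costs a
# fraction `drafter_cost` of a target forward pass per token, one round
# yields accepted_avg tokens for (1 + K * drafter_cost) target-equivalent
# passes, versus one target pass per token with plain autoregressive decoding.

def estimated_speedup(k: int, accepted_avg: float, drafter_cost: float) -> float:
    cost_per_round = 1.0 + k * drafter_cost      # one target pass + K drafter passes
    baseline_cost = accepted_avg * 1.0           # same tokens, one target pass each
    return baseline_cost / cost_per_round

# Example: K = 4 drafted tokens, ~3.3 accepted on average, drafter ~5% of target cost.
print(round(estimated_speedup(4, 3.3, 0.05), 2))  # ~2.75x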

This optimization is designed to be particularly effective for on-device applications. By reducing the computational overhead required for each token generated, Google aims to make the Gemma 4 models run more efficiently on consumer hardware, including smartphones.
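Developers who want to experiment with the same pattern today can do so through Hugging Face Transformers, which exposes speculative ("assisted") decoding via the assistant_model argument of generate(). The checkpoint names in the sketch below are placeholders; Google has not published Gemma 4 or drafter model identifiers in this form.

# Hypothetical usage sketch of assisted (speculative) decoding in
# Hugging Face Transformers. The model names are placeholders, not
# real Gemma 4 checkpoints.

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-placeholder")            # placeholder
model = AutoModelForCausalLM.from_pretrained("google/gemma-4-placeholder")          # main model
drafter = AutoModelForCausalLM.from_pretrained("google/gemma-4-drafter-placeholder")  # small drafter

inputs = tokenizer("Explain speculative decoding in one sentence.", return_tensors="pt")
# The assistant model drafts tokens that the main model then verifies.
outputs = model.generate(**inputs, assistant_model=drafter, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))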

Gemma 4 Model Context and Performance

The Gemma 4 family consists of open-weight models designed to provide more flexibility than Google's proprietary Gemini series. The models are released under the Apache 2.0 license.

Recent benchmarks point to the efficiency of the architecture: the 31-billion-parameter dense Gemma 4 model ranks third among all open AI models on the Arena AI text leaderboard. The models are also capable of delivering frontier AI performance while running on a single Nvidia GPU.

Integration and Workflow Support

Beyond raw speed and parameter efficiency, Google has designed Gemma 4 with native support for agentic workflows. This allows the models to be integrated into more complex autonomous systems where the speed increases provided by MTP drafters can reduce latency in multi-step reasoning tasks.

Because the models are open-weight and licensed under Apache 2.0, developers can implement these speed optimizations across various environments, from local hardware to cloud-based deployments.

Sources (6)


How NewsNews AI made this story

NewsNews AI researched this story across 6 sources, drafted it, and ran the result through an independent editorial pass. It cleared editorial review on first pass.

  • 6 sources cited · linked in full at the bottom of the article
  • Image license verified · unsplash
  • Independent editorial pass · approved

From the editor

All key claims are supported by their cited snippets: the 3x speed boost via MTP drafters is confirmed by sources [1], [2], and [3]; the "drafters" terminology and speculative decoding mechanism are supported by [2] and [3]; the Apache 2.0 license and single-GPU capability are confirmed by [4] and [5]; the Gemma 4 31B model's third-place ranking on the Arena AI leaderboard is confirmed by [6]; and the agentic workflow support is confirmed by [4]. No fabricated quotes, no single-source dependency, and no misleading headline detected.

More about our editorial process

Feedback

We want to hear from you, especially when something is wrong. No signup, no email required.
