
Google Implements Multi-Token Prediction to Triple Gemma 4 Inference Speed

The company introduced 'drafter' models that use speculative decoding to accelerate the Gemma 4 family of open AI models by up to 3x.

By NewsNews AI
Photo: BoliviaInteligente on Unsplash

Acceleration via Speculative Decoding

Google has introduced a technical update to its Gemma 4 family of open AI models that increases inference speeds by up to three times. This performance boost is achieved through a process known as speculative decoding, which utilizes new assistant models referred to as "drafters".

According to Google, these drafter models function by predicting sections of text in advance. This approach, termed Multi-Token Prediction (MTP), allows the system to generate multiple tokens simultaneously rather than predicting them one by one in a linear sequence. This mechanism enables the models to reach higher speeds without a loss in output quality.
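Google has not described the drafter architecture or verification scheme in implementation-level detail. As a rough illustration of the general draft-and-verify idea behind speculative decoding, the Python sketch below uses toy stand-in models: a cheap drafter proposes several tokens at once, and the larger target model keeps only the prefix it agrees with, which is why output quality is preserved.

# Toy sketch of speculative decoding (not Google's implementation):
# a small "drafter" proposes K tokens, and the larger target model
# checks them, keeping the agreeing prefix. The model logic below is a
# stand-in; real systems compare token probabilities, not fixed rules.

from typing import Callable, List

def speculative_step(
    prefix: List[int],
    drafter_next: Callable[[List[int]], int],
    target_next: Callable[[List[int]], int],
    k: int = 4,
) -> List[int]:
    """Return the tokens accepted in one draft-and-verify round."""
    # 1) The drafter speculates K tokens autoregressively (it is cheap to run).
    draft: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        t = drafter_next(ctx)
        draft.append(t)
        ctx.append(t)

    # 2) The target model verifies the draft. In practice all K positions are
    #    scored in a single parallel forward pass; here we loop for clarity.
    accepted: List[int] = []
    ctx = list(prefix)
    for t in draft:
        expected = target_next(ctx)
        if expected != t:
            # Reject from the first mismatch and keep the target's own token,
            # so the output matches plain autoregressive decoding.
            accepted.append(expected)
            break
        accepted.append(t)
        ctx.append(t)
    return accepted

# Tiny demo: the drafter and target mostly agree, so several tokens
# are committed per verification step instead of one.
drafter = lambda ctx: (ctx[-1] + 1) % 100
target = lambda ctx: (ctx[-1] + 1) % 100 if len(ctx) % 5 else (ctx[-1] + 2) % 100

print(speculative_step([1], drafter, target, k=4))  # [2, 3, 4, 5]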

Technical Implementation of Drafters

The MTP drafters act as a preliminary layer that suggests potential future tokens. Because those tokens are proposed ahead of time, the primary Gemma 4 model can verify and commit to larger blocks of text more rapidly than traditional autoregressive generation allows.
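How much of the roughly threefold gain materializes depends on how often the main model accepts the drafted tokens and how cheap the drafter is to run relative to the main model. The back-of-the-envelope estimate below is an illustrative assumption, not a figure published by Google: it treats one round as a single target verification pass plus K cheap drafter passes.

# Back-of-the-envelope speedup model for draft-and-verify decoding.
# Assumed, not from Google: if the target accepts on average `accepted_avg`
# of the K drafted tokens per verification pass, and the drafter costs a
# fraction `drafter_cost` of a target forward pass per token, one round
# yields accepted_avg tokens for (1 + K * drafter_cost) target-equivalent
# passes, versus one target pass per token with plain autoregressive decoding.

def estimated_speedup(k: int, accepted_avg: float, drafter_cost: float) -> float:
    cost_per_round = 1.0 + k * drafter_cost      # one target pass + K drafter passes
    baseline_cost = accepted_avg * 1.0           # same tokens, one target pass each
    return baseline_cost / cost_per_round

# Example: K = 4 drafted tokens, ~3.3 accepted on average, drafter ~5% of target cost.
print(round(estimated_speedup(4, 3.3, 0.05), 2))  # ~2.75x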

This optimization is designed to be particularly effective for on-device applications. By reducing the computational overhead required for each token generated, Google aims to make the Gemma 4 models run more efficiently on consumer hardware, including smartphones.
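Developers who want to experiment with the same pattern today can do so through Hugging Face Transformers, which exposes speculative ("assisted") decoding via the assistant_model argument of generate(). The checkpoint names in the sketch below are placeholders; Google has not published Gemma 4 or drafter model identifiers in this form.

# Hypothetical usage sketch of assisted (speculative) decoding in
# Hugging Face Transformers. The model names are placeholders, not
# real Gemma 4 checkpoints.

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-placeholder")            # placeholder
model = AutoModelForCausalLM.from_pretrained("google/gemma-4-placeholder")          # main model
drafter = AutoModelForCausalLM.from_pretrained("google/gemma-4-drafter-placeholder")  # small drafter

inputs = tokenizer("Explain speculative decoding in one sentence.", return_tensors="pt")
# The assistant model drafts tokens that the main model then verifies.
outputs = model.generate(**inputs, assistant_model=drafter, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))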

Gemma 4 Model Context and Performance

The Gemma 4 family consists of open-weight models designed to provide more flexibility than Google's proprietary Gemini series. The models are released under the Apache 2.0 license.

Recent benchmarks point to the efficiency of the architecture: the 31-billion-parameter dense Gemma 4 model ranks third among all open AI models on the Arena AI text leaderboard. The models are also capable of delivering frontier AI performance while running on a single Nvidia GPU.

Integration and Workflow Support

Beyond raw speed and parameter efficiency, Google has designed Gemma 4 with native support for agentic workflows. This allows the models to be integrated into more complex autonomous systems where the speed increases provided by MTP drafters can reduce latency in multi-step reasoning tasks.

Because the models are open-weight and licensed under Apache 2.0, developers can implement these speed optimizations across various environments, from local hardware to cloud-based deployments.

Sources (6)


How NewsNews AI made this story

NewsNews AI researched this story across 6 sources, drafted it, and ran the result through an independent editorial pass. It cleared editorial review on first pass.

  • 6 sources cited · linked in full at the bottom of the article
  • Image license verified · unsplash
  • Independent editorial pass · approved

From the editor

All key claims are supported by their cited snippets: the 3x speed boost via MTP drafters is confirmed by sources [1], [2], and [3]; the "drafters" terminology and speculative decoding mechanism are supported by [2] and [3]; the Apache 2.0 license and single-GPU capability are confirmed by [4] and [5]; the Gemma 4 31B model's third-place ranking on the Arena AI leaderboard is confirmed by [6]; and the agentic workflow support is confirmed by [4]. No fabricated quotes, no single-source dependency, and no misleading headline detected.

More about our editorial process

Feedback

We want to hear from you, especially when something is wrong. No signup, no email required.
