Google's DiffusionGemma AI Hits 1,000 Tokens Per Second—And It's Free
Google has released DiffusionGemma, an open-weight text diffusion model under the Apache 2.0 license. Instead of predicting one token at a time, it starts from blocks of noisy tokens and refines them in parallel. It reaches about 1,000 tokens/sec on an NVIDIA H100 and over 700 tokens/sec on an RTX 5090, roughly 4× faster than standard Gemma, though its output quality is lower than Gemma 4's. Because generation is bidirectional, every token can see the whole block, which makes the model promising for code infilling, structured output, and constraint-heavy tasks. In a Sudoku demo, Google fine-tuned the model and accuracy rose from near 0% to about 80%. The release matters because it brings diffusion-based language modeling to a major open model with ecosystem support from tools such as vLLM, Transformers, and Unsloth. Practical use still depends on compatible speculative-decoding/drafter support and correct configuration, and current local agent setups may not work out of the box.
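To make the "refine a block in parallel" idea concrete, here is a toy Python sketch. It is not DiffusionGemma's actual algorithm or API: `toy_denoiser`, `refine_block`, the `<mask>` placeholder, and the confidence formula are all hypothetical, and the "model" simply reads the answer from a known target. What it illustrates is the control flow: every open position gets a proposal in each step, several positions are committed at once, and tokens fixed in advance (as in infilling) stay in place while the rest is filled in around them.

```python
import random

MASK = "<mask>"  # hypothetical placeholder for a not-yet-decided token


def toy_denoiser(block, target, rng):
    """Hypothetical stand-in for a model: proposes a token and a confidence
    for every masked position at once, looking at the whole block."""
    proposals = {}
    for i, tok in enumerate(block):
        if tok != MASK:
            continue
        # Assumption for illustration: confidence rises when the left or
        # right neighbor is already filled (a crude nod to bidirectional
        # context), plus a little jitter to break ties.
        left = i > 0 and block[i - 1] != MASK
        right = i < len(block) - 1 and block[i + 1] != MASK
        confidence = 0.5 + 0.2 * left + 0.2 * right + 0.05 * rng.random()
        proposals[i] = (target[i], confidence)
    return proposals


def refine_block(target, fixed=None, tokens_per_step=2, seed=0):
    """Fill a fully masked block by committing the most confident
    proposals each step. `fixed` maps positions to tokens known up front."""
    rng = random.Random(seed)
    block = [MASK] * len(target)
    for i, tok in (fixed or {}).items():
        block[i] = tok
    steps = 0
    while MASK in block:
        proposals = toy_denoiser(block, target, rng)
        ranked = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:tokens_per_step]:
            block[i] = proposals[i][0]
        steps += 1
    return block, steps


target = "def add ( a , b ) :".split()  # 8 tokens

block, steps = refine_block(target)
print(steps, " ".join(block))  # 4 steps for 8 tokens

# Infilling: the first and last tokens are given, the middle is generated.
block, steps = refine_block(target, fixed={0: "def", 7: ":"})
print(steps, " ".join(block))  # 3 steps for the 6 open positions
```

In this toy setup, a one-token-at-a-time decoder would need eight steps for the same block, while committing two tokens per step needs four. That is only an illustration of where a parallel speedup can come from; it says nothing about how the real model reaches the throughput figures quoted above.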
