NVIDIA Tests DFlash To Cut LLM Inference Bottlenecks
DFlash replaces sequential speculative drafting with block-diffusion token prediction on NVIDIA GPUs, aiming to raise throughput for latency-sensitive coding, reasoning and agent workflows without changing the target model output path.

DFlash Moves Token Drafting Into Parallel Compute
DFlash is being tested as a way to accelerate autoregressive large language model inference on NVIDIA hardware by replacing the usual sequential speculative drafter with a lightweight block-diffusion model.
The method predicts a block of masked future tokens in a single forward pass, then leaves the target model to verify the candidates.
The problem is specific to latency-sensitive LLM serving.
Autoregressive models generate tokens one after another, which can leave GPU compute underused when developers need fast interactive responses.
Speculative decoding already tries to ease that by asking a smaller model to draft future tokens, but the normal draft model still generates those tokens sequentially.
DFlash changes the draft path, not the final verification path.
The target model still performs the validation pass, while DFlash exposes more parallel work to the GPU.
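The draft-then-verify split can be sketched in a few lines of Python. This is a toy under stated assumptions, not DFlash's implementation: `toy_draft` and `toy_target` are hypothetical stand-ins for the block drafter and the target model, verification is greedy exact match, and the target is queried position by position for readability, where a real server would score the whole block in one batched forward pass.

```python
from typing import Callable, List, Tuple

Token = int
# target_next(context) -> the target model's greedy next token (hypothetical interface)
TargetFn = Callable[[List[Token]], Token]
# draft_block(context, k) -> k candidate tokens proposed in a single call
DraftFn = Callable[[List[Token], int], List[Token]]


def verify_block(context: List[Token], candidates: List[Token],
                 target_next: TargetFn) -> List[Token]:
    """Accept the longest prefix of candidates that the target agrees with."""
    emitted: List[Token] = []
    for cand in candidates:
        expected = target_next(context + emitted)
        if cand != expected:
            # First mismatch: emit the target's own token and stop.
            emitted.append(expected)
            return emitted
        emitted.append(cand)
    # Every candidate accepted: the target contributes one extra token.
    emitted.append(target_next(context + emitted))
    return emitted


def generate(prompt: List[Token], n_tokens: int, block_size: int,
             draft_block: DraftFn, target_next: TargetFn) -> Tuple[List[Token], int]:
    """Generate n_tokens and count how many verification steps were needed."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < n_tokens:
        candidates = draft_block(out, block_size)
        out.extend(verify_block(out, candidates, target_next))
        steps += 1
    return out[: len(prompt) + n_tokens], steps


def toy_target(context: List[Token]) -> Token:
    """Toy target model: always continues the sequence by counting up."""
    return context[-1] + 1


def toy_draft(context: List[Token], k: int) -> List[Token]:
    """Toy drafter: counts up, but deliberately skips every multiple of 3."""
    block: List[Token] = []
    last = context[-1]
    for _ in range(k):
        nxt = last + 1
        if nxt % 3 == 0:
            nxt += 1  # injected drafting error
        block.append(nxt)
        last = nxt
    return block


def autoregressive(prompt: List[Token], n_tokens: int,
                   target_next: TargetFn) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(target_next(out))
    return out


spec_out, spec_steps = generate([0], 12, 4, toy_draft, toy_target)
baseline = autoregressive([0], 12, toy_target)
```

In this toy, the speculative output is identical to the baseline because rejected candidates are replaced by the target's own tokens, while the number of verification steps is lower than the number of tokens generated.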
Coding assistants, reasoning systems and agentic workflows are the clearest fit because per-user token latency and concurrency are both hard limits.

Blackwell Tests Put Throughput Claims On The Table
The strongest numbers come from a DGX B300 test setup with eight NVIDIA systems, the gpt-oss-120b model and TensorRT-LLM.
On the SPEED-Bench coding dataset, DFlash produced higher throughput across latency targets described as production-relevant.
In tests targeting 500-600 tokens per second for each user, DFlash delivered more than 15x the throughput of the autoregressive baseline on Blackwell.
At the same per-user output rate, its throughput was 1.5x higher than that of EAGLE-3 speculative decoding.
At the lowest concurrency point with a batch size of one, DFlash more than doubled interactivity on Blackwell hardware.
The hardware details explain why the claim is framed as an inference-systems story rather than only a model release.
The Blackwell Ultra GPU description lists two large dies, a 10 TB/s chip-to-chip connection, 160 streaming multiprocessors and 640 fifth-generation Tensor Cores.
DFlash is meant to feed that hardware with parallel draft work instead of waiting on one token after another.

vLLM And SGLang Support Shape Adoption Work
The release also includes integration paths for engineering teams already running open inference stacks.
The research team released 20 DFlash model checkpoints on Hugging Face, with recipes for NVIDIA Blackwell and Hopper GPUs and support for model families including Qwen, Kimi K2.6, Llama, Gemma and gpt-oss.
For vLLM environments, engineers can replace EAGLE-3 with a DFlash checkpoint through a configuration update using the open-source Speculators library.
A Gemma 4 31B test on a single Blackwell Ultra GPU showed up to 5.8x higher throughput at matched concurrency over standard autoregressive decoding, including 5.8x on Math500, 5.6x on HumanEval and 5.3x on GSM8K.
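As a rough illustration of the vLLM configuration swap described above, the sketch below edits a dictionary shaped like the `speculative_config` argument that vLLM accepts for EAGLE-3 (`method`, `model`, `num_speculative_tokens`). The `"dflash"` method string, both checkpoint names and the helper `swap_to_dflash` are assumptions made for illustration; the values to use in practice should come from the Speculators and vLLM documentation.

```python
import copy
from typing import Any, Dict

# Existing EAGLE-3 setup. The checkpoint name is a hypothetical placeholder.
eagle3_config: Dict[str, Any] = {
    "method": "eagle3",
    "model": "example-org/target-model-eagle3",
    "num_speculative_tokens": 3,
}


def swap_to_dflash(config: Dict[str, Any], dflash_checkpoint: str,
                   num_speculative_tokens: int) -> Dict[str, Any]:
    """Return a copy of the config that points at a DFlash draft checkpoint."""
    new_config = copy.deepcopy(config)
    new_config["method"] = "dflash"  # assumed method name, not confirmed by the source
    new_config["model"] = dflash_checkpoint
    new_config["num_speculative_tokens"] = num_speculative_tokens
    return new_config


dflash_config = swap_to_dflash(
    eagle3_config,
    dflash_checkpoint="example-org/target-model-dflash",  # hypothetical name
    num_speculative_tokens=8,  # illustrative value only
)
```

The point of the sketch is that the target model entry is untouched: only the draft method, the draft checkpoint and the number of speculated tokens change.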
SGLang deployments require changing the speculative decoding algorithm to DFlash and supplying the matching draft checkpoint.
A Qwen3 8B evaluation on a single NVIDIA B200 GPU showed up to 5.1x throughput improvement at matched concurrency over autoregressive decoding, with 5.1x on Math500 and 4.2x on HumanEval.
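The SGLang change described above amounts to two launch settings, which can be sketched as a small helper that assembles the server command. The flags `--speculative-algorithm` and `--speculative-draft-model-path` are the ones SGLang uses for its existing speculative decoding modes; the `DFLASH` value, both model paths and the helper name `sglang_launch_args` are assumptions for illustration and should be checked against the released recipes.

```python
import shlex
from typing import List


def sglang_launch_args(target_model: str, draft_checkpoint: str,
                       algorithm: str = "DFLASH") -> List[str]:
    """Build an SGLang server command with a speculative draft model.

    The default algorithm value is an assumption, not a confirmed flag value.
    """
    return [
        "python", "-m", "sglang.launch_server",
        "--model-path", target_model,
        "--speculative-algorithm", algorithm,
        "--speculative-draft-model-path", draft_checkpoint,
    ]


args = sglang_launch_args(
    target_model="example-org/target-model",          # hypothetical path
    draft_checkpoint="example-org/target-model-dflash",  # hypothetical path
)
command_line = shlex.join(args)  # shell-safe string for logging or copy-paste
```

Keeping the draft checkpoint as an explicit argument reflects the requirement that it match the target model being served.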
The operational burden now sits with inference teams.
DFlash offers open checkpoints and framework paths, but production adoption still depends on whether teams can maintain acceptance rates, latency targets, model compatibility and serving reliability across their own workloads.
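Acceptance rate matters because it bounds how many tokens each verification pass can yield. A back-of-the-envelope sketch, assuming every drafted position is accepted independently with the same probability `alpha` (a simplification that real drafters, block-diffusion ones included, will not satisfy exactly): with a block of `k` drafted tokens, the expected number of tokens emitted per target pass is 1 + alpha + alpha^2 + ... + alpha^k. The function name and the sample values are illustrative, not measurements.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target pass under i.i.d. acceptance.

    alpha: assumed per-position acceptance probability (0..1)
    k:     number of drafted tokens per block
    Counts the accepted prefix plus the one token the target always supplies.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)  # every draft accepted, plus the target's token
    # Closed form of the geometric sum 1 + alpha + ... + alpha**k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


# Illustrative sweep: how acceptance rate changes the payoff of an 8-token block.
sweep = {alpha: expected_tokens_per_step(alpha, 8) for alpha in (0.5, 0.7, 0.9)}
```

Under this simplified model, a drop in acceptance rate on a team's own workload shrinks the tokens gained per verification pass, which is why measured acceptance on production traffic matters more than block size alone.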