NVIDIA Tests DFlash To Cut LLM Inference Bottlenecks

By SendTech Times Infrastructure Desk | Newsroom-edited, source-reviewed coverage | Source: Developer Tech

Newsroom brief

DFlash replaces sequential speculative drafting with block-diffusion token prediction on NVIDIA GPUs, aiming to raise throughput for latency-sensitive coding, reasoning and agent workflows without changing the target model output path.

Verified against source material | Edited by SendTech Times Infrastructure Desk

DFlash Moves Token Drafting Into Parallel Compute

DFlash is being tested as a way to accelerate autoregressive large language model inference on NVIDIA hardware by replacing the usual sequential speculative drafter with a lightweight block-diffusion model.

The method predicts a block of masked future tokens in a single forward pass, then leaves the target model to verify the candidates.

The problem is specific to latency-sensitive LLM serving.

Autoregressive models generate tokens one after another, which can leave GPU compute underused when developers need fast interactive responses.

Speculative decoding already tries to ease that by asking a smaller model to draft future tokens, but the normal draft model still generates those tokens sequentially.

DFlash changes the draft path, not the final verification path.

The target model still performs the validation pass, while DFlash exposes more parallel work to the GPU.
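The draft-then-verify flow described above can be sketched with toy models. This is an illustration of greedy speculative verification in general, not DFlash's implementation: `target_next`, `block_draft`, `verify` and the deliberate miss pattern are all hypothetical stand-ins, and a real target model would score every draft position in one batched forward pass rather than in a Python loop.

```python
from typing import List, Tuple

VOCAB = 50  # toy vocabulary size (assumption for illustration)


def target_next(context: List[int]) -> int:
    """Toy stand-in for the target model's greedy next token."""
    return (sum(context) * 31 + len(context)) % VOCAB


def block_draft(context: List[int], block_size: int) -> List[int]:
    """Toy block drafter: returns block_size candidate tokens from one call.

    It copies the target's answer but deliberately misses at every fourth
    absolute position, to mimic an imperfect drafter.
    """
    draft: List[int] = []
    for i in range(block_size):
        tok = target_next(context + draft)
        if (len(context) + i) % 4 == 3:
            tok = (tok + 1) % VOCAB  # deliberate wrong guess
        draft.append(tok)
    return draft


def verify(context: List[int], draft: List[int]) -> List[int]:
    """Accept the longest draft prefix the target agrees with.

    On the first mismatch the target's own token is emitted instead, so
    every verification step yields at least one correct token.
    """
    accepted: List[int] = []
    for tok in draft:
        expected = target_next(context + accepted)
        if tok != expected:
            accepted.append(expected)  # target's correction ends the step
            return accepted
        accepted.append(tok)
    # Whole block accepted: the target contributes one extra token.
    accepted.append(target_next(context + accepted))
    return accepted


def generate_autoregressive(prompt: List[int], num_tokens: int) -> List[int]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(num_tokens):
        out.append(target_next(out))
    return out


def generate_speculative(
    prompt: List[int], num_tokens: int, block_size: int
) -> Tuple[List[int], int]:
    """Draft a block, verify it, repeat. Returns (tokens, verification steps)."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < num_tokens:
        draft = block_draft(out, block_size)
        out.extend(verify(out, draft))
        steps += 1
    return out[: len(prompt) + num_tokens], steps


if __name__ == "__main__":
    prompt = [1, 2, 3]
    spec, steps = generate_speculative(prompt, num_tokens=16, block_size=4)
    base = generate_autoregressive(prompt, num_tokens=16)
    print("same output as baseline:", spec == base)
    print("verification steps:", steps, "vs 16 baseline target calls")
```

Because verification only keeps tokens the target would have produced itself, the output matches the autoregressive baseline while the number of target passes drops. The gain depends on how many draft tokens survive each verification step.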

Coding assistants, reasoning systems and agentic workflows are the clearest fit because per-user token latency and concurrency are both hard limits.

Blackwell Tests Put Throughput Claims On The Table

The strongest numbers come from a DGX B300 test setup with eight NVIDIA systems, the gpt-oss-120b model and TensorRT-LLM.

On the SPEED-Bench coding dataset, DFlash produced higher throughput across latency targets described as production-relevant.

In tests targeting 500-600 tokens per second per user, DFlash delivered more than 15x the throughput of the autoregressive baseline on Blackwell.

At the same per-user output rate, throughput was 1.5x higher than with EAGLE-3 speculative decoding.

At the lowest concurrency point with a batch size of one, DFlash more than doubled interactivity on Blackwell hardware.

The hardware details explain why the claim is framed as an inference-systems story rather than only a model release.

The Blackwell Ultra GPU description lists two large dies, a 10 TB/s chip-to-chip connection, 160 streaming multiprocessors and 640 fifth-generation Tensor Cores.

DFlash is meant to feed that hardware with parallel draft work instead of waiting on one token after another.

vLLM And SGLang Support Shape Adoption Work

The release also includes integration paths for engineering teams already running open inference stacks.

The research team released 20 DFlash model checkpoints on Hugging Face, with recipes for NVIDIA Blackwell and Hopper GPUs and support for model families including Qwen, Kimi K2.6, Llama, Gemma and gpt-oss.

For vLLM environments, engineers can replace EAGLE-3 with a DFlash checkpoint through a configuration update using the open-source Speculators library.

A Gemma 4 31B test on a single Blackwell Ultra GPU showed up to 5.8x higher throughput at matched concurrency over standard autoregressive decoding, including 5.8x on Math500, 5.6x on HumanEval and 5.3x on GSM8K.

SGLang deployments require changing the speculative decoding algorithm to DFlash and supplying the matching draft checkpoint.

A Qwen3-8B evaluation on a single NVIDIA B200 GPU showed up to 5.1x higher throughput at matched concurrency over autoregressive decoding, with 5.1x on Math500 and 4.2x on HumanEval.

The operational burden now sits with inference teams.

DFlash offers open checkpoints and framework paths, but production adoption still depends on whether teams can maintain acceptance rates, latency targets, model compatibility and serving reliability across their own workloads.
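Why acceptance rates matter can be shown with a simplified model from the general speculative decoding literature. The sketch below assumes every draft token is accepted independently with the same probability, which real workloads will not match exactly, and it says nothing about DFlash's measured behavior. The function name `expected_tokens_per_step` and the sample values are illustrative assumptions.

```python
def expected_tokens_per_step(acceptance_rate: float, block_size: int) -> float:
    """Expected tokens emitted per target verification pass.

    Idealized model: each of the block_size draft tokens is accepted
    independently with probability acceptance_rate, verification stops at
    the first rejection, and the target always contributes one token
    (a correction or a bonus). This gives a geometric series:
        1 + a + a^2 + ... + a^block_size
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    if block_size < 0:
        raise ValueError("block_size must be non-negative")
    if acceptance_rate == 1.0:
        return float(block_size + 1)
    return (1.0 - acceptance_rate ** (block_size + 1)) / (1.0 - acceptance_rate)


if __name__ == "__main__":
    # Illustrative values only; not measurements from the tests above.
    for rate in (0.5, 0.7, 0.9):
        tokens = expected_tokens_per_step(rate, block_size=8)
        print(f"acceptance={rate:.1f} -> {tokens:.2f} tokens per verification pass")
```

Under this model, a drop in acceptance rate on a team's own prompts shrinks the number of tokens gained per target pass, which is one reason benchmark speedups may not carry over unchanged to production traffic.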
