Analysis
AI SHIFT:

NVIDIA Tests DFlash To Cut LLM Inference Bottlenecks

Newsroom brief

DFlash replaces sequential speculative drafting with block-diffusion token prediction on NVIDIA GPUs, aiming to raise throughput for latency-sensitive coding, reasoning and agent workflows without changing the target model output path.

Verified against source material. Edited by SendTech Times Infrastructure Desk

DFlash Moves Token Drafting Into Parallel Compute

DFlash is being tested as a way to accelerate autoregressive large language model inference on NVIDIA hardware by replacing the usual sequential speculative drafter with a lightweight block-diffusion model.

The method predicts a block of masked future tokens in a single forward pass, then leaves the target model to verify the candidates.

The problem is specific to latency-sensitive LLM serving.

Autoregressive models generate tokens one after another, which can leave GPU compute underused when developers need fast interactive responses.

Speculative decoding already tries to ease that by asking a smaller model to draft future tokens, but the normal draft model still generates those tokens sequentially.

DFlash changes the draft path, not the final verification path.

The target model still performs the validation pass, while DFlash exposes more parallel work to the GPU.
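The draft-then-verify loop described above can be sketched as a toy in Python. This is a minimal illustration under stated assumptions, not NVIDIA's implementation: `target_next` and `draft_block` are hypothetical stand-ins for the target model and the block drafter, the token rule is invented for the example, and the verifier uses a plain loop where a real system would run one batched forward pass.

```python
from typing import List, Tuple


def target_next(context: List[int]) -> int:
    """Hypothetical target model: a deterministic toy rule for the next token."""
    return (len(context) * 7) % 50


def draft_block(context: List[int], block_size: int) -> List[int]:
    """Hypothetical block drafter: proposes block_size tokens in one call.

    Each position is guessed independently of the others, and every fourth
    position is deliberately wrong so that some drafts get rejected.
    """
    start = len(context)
    block = []
    for offset in range(block_size):
        pos = start + offset
        guess = (pos * 7) % 50
        if pos % 4 == 3:
            guess = (guess + 1) % 50  # injected drafting error
        block.append(guess)
    return block


def verify_block(context: List[int], block: List[int]) -> List[int]:
    """Accept the longest draft prefix the target agrees with.

    On the first mismatch the target's own token is emitted instead. If the
    whole block is accepted, the target contributes one extra token.
    """
    accepted: List[int] = []
    for candidate in block:
        expected = target_next(context + accepted)
        if candidate != expected:
            accepted.append(expected)
            return accepted
        accepted.append(candidate)
    accepted.append(target_next(context + accepted))
    return accepted


def speculative_generate(
    prompt: List[int], num_tokens: int, block_size: int
) -> Tuple[List[int], int]:
    """Generate num_tokens tokens and count the verification steps used."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < num_tokens:
        out.extend(verify_block(out, draft_block(out, block_size)))
        steps += 1
    return out[len(prompt):len(prompt) + num_tokens], steps


def autoregressive_generate(prompt: List[int], num_tokens: int) -> List[int]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(num_tokens):
        out.append(target_next(out))
    return out[len(prompt):]


if __name__ == "__main__":
    tokens, steps = speculative_generate([1, 2, 3], 16, 4)
    print(tokens == autoregressive_generate([1, 2, 3], 16), steps)
```

Because every emitted token is either confirmed or supplied by the target rule, the speculative output matches plain autoregressive decoding. Only the number of target verification steps changes.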

Coding assistants, reasoning systems and agentic workflows are the clearest fit because per-user token latency and concurrency are both hard limits.

Blackwell Tests Put Throughput Claims On The Table

The strongest numbers come from a DGX B300 test setup with eight NVIDIA systems, the gpt-oss-120b model and TensorRT-LLM.

On the SPEED-Bench coding dataset, DFlash produced higher throughput across latency targets described as production-relevant.

In tests targeting 500-600 tokens per second for each user, DFlash delivered more than 15x the throughput of the autoregressive baseline on Blackwell.

At the same per-user output rate, throughput was 1.5x higher than with EAGLE-3 speculative decoding.
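Read together, the two multipliers above imply a rough gap between EAGLE-3 and the autoregressive baseline. The sketch below does that arithmetic, assuming both figures describe aggregate throughput at the same per-user target. `implied_ratio` is a hypothetical helper, and because the source says "more than 15x", the result is approximate and not a reported measurement.

```python
def implied_ratio(x_over_base: float, x_over_other: float) -> float:
    """Given X/base and X/other, return other/base.

    Assumes both multipliers were measured under the same conditions.
    """
    if x_over_other <= 0:
        raise ValueError("x_over_other must be positive")
    return x_over_base / x_over_other


if __name__ == "__main__":
    # 15x over the baseline and 1.5x over EAGLE-3, as quoted in the article.
    print(implied_ratio(15.0, 1.5))
```

Under that assumption, EAGLE-3 would sit at roughly 10x the autoregressive baseline at this latency target.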

At the lowest concurrency point with a batch size of one, DFlash more than doubled interactivity on Blackwell hardware.

The hardware details explain why the claim is framed as an inference-systems story rather than only a model release.

The Blackwell Ultra GPU description lists two large dies, a 10 TB/s chip-to-chip connection, 160 streaming multiprocessors and 640 fifth-generation Tensor Cores.

DFlash is meant to feed that hardware with parallel draft work instead of waiting on one token after another.

vLLM And SGLang Support Shape Adoption Work

The release also includes integration paths for engineering teams already running open inference stacks.

The research team released 20 DFlash model checkpoints on Hugging Face, with recipes for NVIDIA Blackwell and Hopper GPUs and support for model families including Qwen, Kimi K2.6, Llama, Gemma and gpt-oss.

For vLLM environments, engineers can replace EAGLE-3 with a DFlash checkpoint through a configuration update using the open-source Speculators library.

A Gemma 4 31B test on a single Blackwell Ultra GPU showed up to 5.8x higher throughput at matched concurrency over standard autoregressive decoding, including 5.8x on Math500, 5.6x on HumanEval and 5.3x on GSM8K.

SGLang deployments require changing the speculative decoding algorithm to DFlash and supplying the matching draft checkpoint.

A Qwen3 8B evaluation on a single NVIDIA B200 GPU showed up to 5.1x throughput improvement at matched concurrency over autoregressive decoding, with 5.1x on Math500 and 4.2x on HumanEval.

The operational burden now sits with inference teams.

DFlash offers open checkpoints and framework paths, but production adoption still depends on whether teams can maintain acceptance rates, latency targets, model compatibility and serving reliability across their own workloads.
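Acceptance rate matters because it sets how many tokens each target verification pass yields. A minimal sketch, using the simplified model common in the speculative decoding literature: each drafted token is assumed to be accepted independently with the same probability, which real drafters (including block-diffusion ones) will not satisfy exactly. `expected_tokens_per_pass` is a hypothetical helper name, and the inputs in the usage lines are illustrative, not measured DFlash values.

```python
def expected_tokens_per_pass(acceptance_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per target pass under an i.i.d. acceptance model.

    With acceptance probability a and draft length k, the expectation is
    1 + a + a^2 + ... + a^k = (1 - a^(k + 1)) / (1 - a).
    The leading 1 is the token the target always contributes itself.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if acceptance_rate == 1.0:
        return float(draft_len + 1)
    return (1.0 - acceptance_rate ** (draft_len + 1)) / (1.0 - acceptance_rate)


if __name__ == "__main__":
    # Illustrative inputs only: compare a few acceptance rates at draft length 8.
    for rate in (0.5, 0.7, 0.9):
        print(rate, round(expected_tokens_per_pass(rate, 8), 2))
```

The curve flattens quickly at low acceptance rates, so a longer draft block only pays off when the drafter tracks the target closely on the workload being served.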
