Zyphra’s Zamba2-VL Tests Hybrid AI For Faster Vision-Language Models
Zyphra released Zamba2-VL, an open-source vision-language model family that uses a Mamba2-transformer hybrid architecture to target lower-latency multimodal inference for documents, OCR, counting and edge AI tasks.

Zyphra Pushes Hybrid Models Into Vision-Language AI
Zyphra has released Zamba2-VL, an open-source vision-language model family built around a hybrid Mamba2 and transformer architecture.
The launch puts the startup’s Zamba2 backbone into multimodal AI, where models must read images and text together rather than handle language alone.
The release covers three model sizes: 1.2B, 2.7B and 7B parameters.
Zyphra made the models available on Hugging Face under the Apache 2.0 license, giving developers a route to test the architecture without waiting for a closed commercial deployment.
Why The Architecture Is Different
Zamba2-VL keeps the familiar LLaVA-style pipeline for multimodal work.
A pretrained vision encoder extracts image features, a lightweight MLP adapter maps those features into the language model’s embedding space, and the language model processes image and text tokens together.
The model supports single-image analysis, multi-image understanding and object grounding.
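The LLaVA-style flow described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not Zyphra's code: the dimensions, the random stand-in for the vision encoder, the ReLU activation and every function name here are hypothetical.

```python
import random

# Hypothetical sizes chosen only to keep the example small.
VISION_DIM = 8   # width of the vision encoder's patch features
HIDDEN_DIM = 6   # hidden width of the adapter MLP
EMBED_DIM = 4    # width of the language model's token embeddings

def make_weights(n_out, n_in, rng):
    """Random weight matrix stored as n_out rows of n_in values."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]

def linear(x, weights):
    """Matrix-vector product: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def encode_image(num_patches, rng):
    """Stand-in for a pretrained vision encoder: one feature vector per patch."""
    return [[rng.uniform(-1, 1) for _ in range(VISION_DIM)] for _ in range(num_patches)]

def mlp_adapter(features, w1, w2):
    """Two-layer MLP that maps vision features into the embedding space."""
    projected = []
    for feature in features:
        hidden = [max(0.0, h) for h in linear(feature, w1)]  # ReLU is an assumption
        projected.append(linear(hidden, w2))
    return projected

def embed_text(token_ids, rng):
    """Stand-in for the language model's embedding table lookup."""
    return [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in token_ids]

rng = random.Random(0)
w1 = make_weights(HIDDEN_DIM, VISION_DIM, rng)
w2 = make_weights(EMBED_DIM, HIDDEN_DIM, rng)

image_tokens = mlp_adapter(encode_image(5, rng), w1, w2)
text_tokens = embed_text([101, 102, 103], rng)

# The language model then processes image and text tokens as one sequence.
sequence = image_tokens + text_tokens
print(len(sequence), "tokens of width", len(sequence[0]))
```

The point of the sketch is the shape bookkeeping: every image patch becomes one more token in the language model's input, which is why image-heavy prompts grow long quickly.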
The change sits inside the language-model backbone.
Zamba2 uses Mamba2 state-space layers for most computation and inserts a shared transformer attention layer after every six Mamba2 layers.
The shared-weight design is meant to reduce memory-bandwidth pressure while preserving some transformer strengths.
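The interleaving pattern can be illustrated with a small layout builder. This is a sketch assuming only what is described above (one shared attention layer after every six Mamba2 layers); the function name, the dictionary fields and the 12-layer depth are hypothetical and do not reflect the actual model configuration.

```python
def build_layer_plan(num_mamba_layers, period=6):
    """Return the layer order for a hybrid stack.

    A single attention entry is created once and reused at every insertion
    point, which mirrors the idea of sharing one set of attention weights.
    """
    shared_attention = {"kind": "shared_attention", "weights_id": "attn-0"}
    plan = []
    for index in range(1, num_mamba_layers + 1):
        plan.append({"kind": "mamba2", "weights_id": f"mamba-{index}"})
        if index % period == 0:
            plan.append(shared_attention)  # same object, not a copy
    return plan

plan = build_layer_plan(num_mamba_layers=12)
kinds = [layer["kind"] for layer in plan]
attention_layers = [layer for layer in plan if layer["kind"] == "shared_attention"]
unique_attention_weights = {layer["weights_id"] for layer in attention_layers}

print(kinds)
print("attention call sites:", len(attention_layers))
print("distinct attention weight sets:", len(unique_attention_weights))
```

In this toy layout the attention block is invoked twice but stored once, so adding more insertion points does not add attention parameters.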
That design targets a specific bottleneck in vision-language AI.
High-resolution images, documents and video-style inputs can create thousands of vision tokens, which makes transformer-only inference expensive as sequence length grows.
Zyphra’s claim is that the Mamba2-heavy structure gives Zamba2-VL prefill cost that grows near-linearly with sequence length and a fixed-size recurrent state, rather than a key-value cache that grows with every added token.
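A toy cost model shows why sequence length matters so much here. The functions below count abstract work units, not measured FLOPs or seconds, and the layer counts and widths are hypothetical; they only contrast quadratic pairwise attention with a linear recurrent scan, and a growing key-value cache with a fixed-size state.

```python
def attention_prefill_units(n_tokens):
    """Self-attention compares every token with every other token."""
    return n_tokens * n_tokens

def recurrent_prefill_units(n_tokens):
    """A state-space scan touches each token once."""
    return n_tokens

def kv_cache_entries(n_tokens, n_layers, width):
    """Keys and values are stored per token, per layer."""
    return 2 * n_tokens * n_layers * width

def recurrent_state_entries(n_layers, state_size):
    """The recurrent state does not depend on how many tokens were read."""
    return n_layers * state_size

short_len, long_len = 1_000, 2_000  # hypothetical vision-token counts

attention_growth = attention_prefill_units(long_len) / attention_prefill_units(short_len)
recurrent_growth = recurrent_prefill_units(long_len) / recurrent_prefill_units(short_len)

print("attention work grows by", attention_growth)   # doubling tokens quadruples work
print("recurrent work grows by", recurrent_growth)   # doubling tokens doubles work
print("KV cache entries:", kv_cache_entries(long_len, n_layers=4, width=64))
print("recurrent state entries:", recurrent_state_entries(n_layers=4, state_size=64))
```

A hybrid stack sits between the two extremes, since its attention layers still pay the quadratic term, which is consistent with the "near-linear" wording rather than strictly linear.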
Benchmarks Put Efficiency Beside Accuracy
Zyphra trained the model family on 100 billion vision-text and general-text tokens from public web datasets.
Its evaluation suite used 14 benchmarks, spanning document and chart tasks as well as visual reasoning, OCR, grounding and counting.
The strongest published figures are in counting and document tasks.
The 1.2B model scored 62.5 on PixMoCount, ahead of both InternVL3.5, which scored 32.8, and PerceptionLM-1B.
On CountBenchQA, the 2.7B and 7B models scored 87.5 and 90.6.
The 2.7B model also reached 90.9 on DocVQA.
The efficiency claim is the more strategic part of the release.
Under a 32,000-token input setting, Zyphra said Zamba2-VL achieved a time to first token (TTFT) at least 10 times lower than comparable transformer-based models while maintaining similar accuracy.
That does not prove broad production readiness, but it gives developers a concrete benchmark to test against long-context visual workloads.
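Developers who want to check a first-token latency claim need a consistent way to time it. The harness below is a minimal sketch: `measure_ttft` and `fake_stream` are hypothetical names, and the fake generator simply sleeps to stand in for prefill. A real test would replace it with a streaming call to the model under evaluation.

```python
import time

def fake_stream(prompt_tokens, prefill_delay_s=0.01):
    """Hypothetical stand-in for a streaming model call.

    The sleep imitates prefill, which must finish before the first token.
    """
    time.sleep(prefill_delay_s)
    for index in range(3):
        yield f"tok{index}"

def measure_ttft(stream_factory, prompt_tokens):
    """Return (seconds until the first token arrives, the first token)."""
    start = time.perf_counter()
    stream = stream_factory(prompt_tokens)
    first_token = next(stream)  # blocks until prefill completes
    elapsed = time.perf_counter() - start
    return elapsed, first_token

prompt = list(range(32_000))  # placeholder IDs matching the 32,000-token setting
ttft_seconds, first = measure_ttft(fake_stream, prompt)
print(f"first token {first!r} after {ttft_seconds:.4f} s")
```

Running the same harness against two models on identical prompts and hardware, and repeating it several times, gives a like-for-like comparison of first-token delay.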
Edge Deployment Is The Practical Test
The smaller Zamba2-VL models are aimed at deployments where latency and memory matter.
Zyphra named smartphones, industrial edge equipment, PDF analysis, automated receipt and invoice handling, and inventory or product-counting workflows as target use cases.
Those applications explain why a 1.2B or 2.7B model matters more than headline scale.
If the architecture can keep useful OCR, counting and document performance while cutting first-token delay, it could fit devices and edge systems that cannot afford heavy transformer inference.
The next checkpoint is external validation.
The models are open under Apache 2.0, so the evidence to watch is whether independent developers can reproduce the 32,000-token TTFT advantage and the DocVQA, PixMoCount and CountBenchQA results in real multimodal applications.