A 31B-parameter (3B active) Mamba2-Transformer hybrid MoE multimodal model that unifies video, audio, image, and text understanding. It supports enterprise-grade Q&A, summarization, transcription, OCR, document intelligence, GUI automation, and agentic workflows. Reasoning is on by default and can be toggled per request via `enable_thinking`. The model was trained on 354M+ samples (~717B tokens) across 1,395 datasets and is available in BF16, FP8, and NVFP4 precisions. Commercial use is permitted under the NVIDIA Open Model Agreement.
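The `enable_thinking` toggle can be sent per request. Below is a minimal sketch assuming the model is served behind an OpenAI-compatible chat completions endpoint; the base URL and the placement of `enable_thinking` under `chat_template_kwargs` are assumptions drawn from common open-model serving setups, not confirmed details of this model's API.

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",  # assumed endpoint; adjust for your deployment
    api_key="YOUR_API_KEY",
)

# Reasoning is on by default; this call turns it off for a single request.
# Passing enable_thinking via chat_template_kwargs is an assumption borrowed
# from common serving stacks; check the serving docs for the exact field.
response = client.chat.completions.create(
    model="nemotron-3-nano-omni",
    messages=[{"role": "user", "content": "Summarize this contract clause in two sentences."}],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(response.choices[0].message.content)
```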
| Modalities | Video, audio, image, text |
| Context window | 256,000 tokens |
| Pricing | Input / output per 1M tokens |
| Reasoning | On by default (toggle via `enable_thinking`) |
| Feature | Description |
| --- | --- |
| Streaming | Real-time token-by-token response streaming |
| Function calling | Connect the model to external tools and systems (see the sketch after this table) |
| Structured outputs | Return responses in JSON Schema format |
| Fine-tuning | Custom model training on your data |
| Reasoning | Extended thinking before responding |
| Computer use | Control and interact with computer interfaces |
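Function calling and structured outputs are typically exposed through standard OpenAI-compatible request fields. The sketch below shows both; the endpoint URL, the `get_invoice_total` tool, and the JSON schema are illustrative assumptions, not details from this model card.

```python
from openai import OpenAI

client = OpenAI(base_url="https://integrate.api.nvidia.com/v1", api_key="YOUR_API_KEY")

# Hypothetical tool definition used for illustration only.
tools = [{
    "type": "function",
    "function": {
        "name": "get_invoice_total",
        "description": "Look up the total amount of an invoice by ID.",
        "parameters": {
            "type": "object",
            "properties": {"invoice_id": {"type": "string"}},
            "required": ["invoice_id"],
        },
    },
}]

# 1) Function calling: let the model decide whether to call the tool.
first = client.chat.completions.create(
    model="nemotron-3-nano-omni",
    messages=[{"role": "user", "content": "What is the total on invoice INV-1042?"}],
    tools=tools,
)
print(first.choices[0].message.tool_calls)

# 2) Structured outputs: constrain the reply to a JSON schema.
structured = client.chat.completions.create(
    model="nemotron-3-nano-omni",
    messages=[{"role": "user", "content": "Extract vendor and total from: 'ACME Corp, $1,204.50'."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "invoice_fields",
            "schema": {
                "type": "object",
                "properties": {"vendor": {"type": "string"}, "total": {"type": "number"}},
                "required": ["vendor", "total"],
            },
        },
    },
)
print(structured.choices[0].message.content)
```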
| Benchmark | Score |
| --- | --- |
| CVBench 2D | 83.95 |
| OCRBench v2 (EN) | 67.04 |
| OSWorld | 47.4 |
| CharXiv (reasoning) | 63.6 |
| MMLongBench-Doc | 57.5 |
| MathVista (mini) | 82.8 |
| OCR Reasoning | 54.14 |
| Video-MME | 72.2 |
| WorldSense | 55.4 |
| Daily-Omni | 74.52 |
| Voice Interaction | 89.39 |
| Release date | 2026-04-28 |
| Model ID | nemotron-3-nano-omni |
| Provider | NVIDIA |