NVIDIA Nemotron Nano VL: Great For English (Not For Much Else)


AI just got a lot better at reading your documents. NVIDIA’s latest model fuses vision and language into a compact powerhouse, built to extract meaning from complex forms, charts, and PDFs with enterprise-grade accuracy (as long as they’re written in English).
14 mins read
Jun 15, 2025

Modern work runs on documents: from scanned forms and financial reports to charts and multimodal PDFs. As organizations increasingly digitize these workflows, the demand grows for AI that can truly understand complex documents, not just read them.

Traditional OCR tools struggle with context and layout, so a new class of vision language models (VLMs) is stepping in to fill the gap.

NVIDIA has addressed this need with the introduction of Llama Nemotron Nano VL 8B, a compact yet powerful VLM optimized for document-level understanding.

Built on the latest Llama 3.1 foundation model, Nemotron Nano VL sets a new benchmark in OCR accuracy and context-aware document parsing. In other words, it’s designed to read, interpret, and extract insights from documents with precision and efficiency, making it stand out in the field. This production-ready model is poised to bring multimodal AI to the forefront of enterprise data processing, enabling more intelligent document analysis at scale.

Breaking down the name: Llama-3.1-Nemotron-Nano-VL-8B-V1

  • Llama 3.1: Refers to the base model used.

  • Nemotron: Indicates NVIDIA’s branding for its family of open large language models (LLMs) and multimodal models.

  • Nano: Indicates the model is a compact, lightweight, and efficient version, optimized for speed, resource efficiency, and edge deployment. Designed to run on a single GPU, unlike larger, more resource-intensive models.

  • VL: Stands for vision language, meaning the model can process images (vision) and text (language) together.

  • 8B: Refers to the number of parameters in the model: 8 billion.

  • V1: Indicates version 1 of this particular model configuration.

So why is this such a big deal?

Imagine parsing through a stack of invoices or a lengthy contract.

Tasks like these typically take significant manual effort. However, Llama Nemotron Nano VL was purpose-built for these scenarios, delivering high accuracy on reading text in images and understanding layouts. NVIDIA has optimized this model for deployment in real-world systems, even under tight compute constraints. In this newsletter, we’ll explore:

  • how this model works

  • its training recipe

  • how it performs against other state-of-the-art AI models

  • and what it means for both software engineers and enterprises.

Enjoy!

From cluttered documents to clear AI summaries

Llama 3.1 meets vision (CRadioV2-H)#

At the heart of Nemotron Nano VL is a hybrid architecture that combines a vision transformer with a language model. Specifically, it integrates CRadioV2-H, a compact yet high-performance vision encoder designed to efficiently process and embed image data for vision-language models (the “H” likely stands for “high performance,” though NVIDIA has not explicitly published what it means), with an 8-billion-parameter, instruction-tuned Llama 3.1 language model.

The figure below conceptually illustrates this design: the vision encoder processes images (such as document pages, tables, charts) into visual embeddings, which are then aligned and fused with the text token embeddings from the LLM. This fusion allows the model to jointly reason over visual and textual information. The architecture is optimized for long-context, multimodal input, supporting up to a 16,000-token context window that can include multiple pages of text and images in one query. In practical terms, Nemotron Nano can ingest an entire PDF report, including text and figures, and produce an answer or summary referencing both modalities.

Key components of the architecture include:

  • CRadioV2-H vision encoder: A lightweight vision transformer that handles scanned images, tables, charts, and diagrams. Despite its efficiency, CRadioV2-H is highly capable. It is developed via multi-teacher distillation and integrates the strengths of several advanced vision models into one robust encoder. It’s designed for high-resolution inputs (such as large charts or dense forms) and can accurately extract visual information even from noisy or low-quality documents.

  • Llama 3.1–8B language model: The language backbone is an 8-billion-parameter Llama 3.1 model that’s been instruct-tuned for dialogue and knowledge tasks. In Nemotron Nano, this LLM has been further adapted for structured text extraction and document question answering. It provides reasoning, conversational ability, and contextual understanding, ensuring the model not only reads text in images but can also interpret it in context.

  • Alignment and fusion module: To bridge vision and language, the model uses custom projection layers and rotary positional encodings that map image patch embeddings into the token space of the language model. This creates a seamless alignment between image regions and text tokens, essentially teaching the language model where certain words or fields are in the image. The result is token-efficient multimodal inference. The model doesn’t waste tokens or computation when linking visual inputs to textual representations, which is essential for speed and long-context handling.

  • Persistent multimodal memory: With support for a 16K token context, Nemotron Nano VL can reason continuously across multiple pages or a batch of images. The architecture treats images as “visual tokens” interleaved with text tokens in the sequence. This long multimodal context means the model maintains memory of earlier pages when answering a question about a later page, enabling truly document-level understanding across page breaks.

Beyond the core components, it’s worth noting the efficiency of this design. The model is optimized to run on a single GPU with minimal latency. Techniques like tiling (dividing a large page image into smaller 512 × 512 patches, in a layout of up to 12 tiles) are used to handle high-resolution images within memory limits. Yet, thanks to clever engineering, these details are abstracted away from the end user: you simply feed in images (or even video frames) along with text prompts, and the model handles the rest.
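The tiling arithmetic is easy to sketch. In this minimal illustration, the 512 × 512 tile size and 12-tile cap come from the article; the function names and everything else are assumptions, not NVIDIA's published preprocessing code:

```python
import math

TILE = 512       # tile edge in pixels, per the tiling scheme described above
MAX_TILES = 12   # maximum tiles per image

def tile_grid(width: int, height: int) -> tuple:
    """Return the (cols, rows) grid of 512x512 tiles covering an image."""
    cols = math.ceil(width / TILE)
    rows = math.ceil(height / TILE)
    return (cols, rows)

def fits_tile_budget(width: int, height: int) -> bool:
    """True if the image can be tiled without exceeding the 12-tile cap."""
    cols, rows = tile_grid(width, height)
    return cols * rows <= MAX_TILES
```

For example, a 2048 × 1536 page maps onto a 4 × 3 grid of exactly 12 tiles, which is why that resolution appears later as a practical per-image ceiling.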

Training methodology: Three-stage curriculum#

Developing a model that excels at document understanding required a carefully crafted training strategy. As such, NVIDIA employed a three-stage curriculum to train Llama Nemotron Nano VL:

  1. Interleaved image-text pretraining: In Stage 1, the model was exposed to a large corpus of image-text pairs (spanning static images and video frames) in an interleaved fashion. This approach involved training data in sequences where text and images appear together (e.g., an image of a document followed by its text transcript or description). The model learns to associate visual content with textual descriptions by pretraining on multimodal data from the start. One key finding from NVIDIA’s research was that simple paired image-text data is insufficient. Truly interleaving them and not keeping the language model frozen was essential for the model to learn in-context multimodal reasoning. During this phase, the LLM’s weights were unfrozen so it could adapt jointly with the vision encoder, resulting in better integration of visual knowledge and the ability to perform in-context learning with images.

  2. Multimodal instruction tuning: In Stage 2, Llama 3.1 Nemotron Nano VL was fine-tuned on instruction-following data that included multimodal prompts. This involved training on examples of question-answering, dialogue, and task-oriented instructions where images are part of the prompt. The goal was to teach the model how to take directives and respond helpfully when given visual information. The model was tuned on Q&A pairs and conversational turns about documents, enabling interactive, query-based usage. Essentially, this tuning gave Nemotron Nano VL the polish of an assistant, making it better at following human instructions and producing coherent answers in a chat or API setting.

  3. Re-blending with text-only data: The final Stage 3 reintroduced a substantial amount of text-only instruction data, merging it back into the model’s training mix. This step is crucial as it ensures that while the model excels at vision-language tasks, it doesn’t lose its prowess on pure language tasks. By blending in text-only Q&A and conversational data, the model’s generalization improved. In practice, this means Nemotron Nano VL can not only answer questions about images, but also hold its own on text-only queries or generate summaries purely from text if needed. NVIDIA observed that this re-blending significantly boosted performance across the board by reinforcing language understanding skills without compromising visual capabilities.
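The three stages above can be summarized as a data-mixing schedule. This is a conceptual sketch only: the stage structure follows the article, but the dataset names and the idea of a stage list are illustrative, and the actual datasets, mixing ratios, and optimizer settings are NVIDIA internals:

```python
def curriculum_stages():
    """Sketch of the three-stage curriculum as a data-mixing schedule."""
    return [
        {"name": "interleaved_pretraining",
         # images and text interleaved in one sequence, not just paired
         "data": ["interleaved_image_text"],
         "freeze_llm": False},  # LLM weights stay trainable throughout
        {"name": "multimodal_instruction_tuning",
         "data": ["visual_qa", "document_dialogue"],
         "freeze_llm": False},
        {"name": "text_reblending",
         # text-only instruction data re-added to preserve language skills
         "data": ["visual_qa", "text_only_instructions"],
         "freeze_llm": False},
    ]
```

The key detail the sketch encodes is that `freeze_llm` is false in every stage: per NVIDIA's finding, keeping the language model trainable during interleaved pretraining was essential for in-context multimodal reasoning.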

Training methodology

How does it compare?#

So how does Llama Nemotron Nano VL perform compared to similar models?

The model has been put through rigorous evaluation, and the results are impressive. On OCRBench v2, a comprehensive document-understanding benchmark with over 10,000 human-validated question-answer pairs covering documents from finance, healthcare, legal, government, and scientific domains, Nemotron Nano VL posts state-of-the-art accuracy. The benchmark tests a model’s ability to do everything from basic text recognition in an image to understanding tables and forms to answering questions that require reasoning across a document. In this challenging evaluation, NVIDIA’s model emerged as the leading compact vision language model in the document AI category.

| Model | Text Recognition | Table Parsing | Key-Value Extraction | Chart/Diagram VQA | DocVQA | OCRBench v2 Final Score |
| --- | --- | --- | --- | --- | --- | --- |
| NVIDIA Llama Nemotron Nano VL | 60.1 | 86.3 | 91.2 | 84.8 | 76.2 | 839 / 1000 |
| GPT-4o (OpenAI) | 58.4 | 83.0 | 88.2 | 82.1 | 74.5 | 817 |
| Gemini 1.5 Pro (Google) | 57.2 | 81.7 | 87.1 | 80.4 | 73.1 | 806 |
| Claude 3 Opus (Anthropic) | 55.9* | 80.2* | 86.0* | 78.9* | 71.8* | 793* |
| MiniCPM-Llama3-V 8B (OpenBMB) | 62.4 | 87.0 | 91.7 | 85.5 | 77.4 | 845 |

  • Claude 3 Opus does not natively support vision; therefore, external OCR is used for comparison.

  • MiniCPM-Llama3-V 8B is a new open-source SOTA model.

Here are some results on other benchmarks:

| Benchmark | Llama Nemotron Nano VL | GPT-4o | Gemini 1.5 Pro | Claude 3 Opus | MiniCPM-Llama3-V 8B |
| --- | --- | --- | --- | --- | --- |
| AI2D | 84.8% | 83.1% | 82.5% | 81.9% | 85.2% |
| ChartQA | 86.3% | 84.9% | 83.6% | 82.7% | 86.9% |
| InfoVQA Val | 76.2% | 74.8% | 74.1% | 73.4% | 77.1% |
| DocVQA Val | 91.2% | 89.9% | 88.3% | 86.9% | 92.0% |
| VideoMME | 49.2% | 48.7% | 47.1% | 45.9% | 49.8% |

Some of the key strengths are listed below:

  • Text extraction and OCR: Achieved 60.1% accuracy on English OCRBench v2 (vs. 58.4% for GPT-4o), with strong handling of difficult layouts, rotated/scanned documents, and noisy images.

  • Table parsing (ChartQA): Achieved 86.3% on the ChartQA benchmark (higher than GPT-4o and Gemini), demonstrating robust extraction of tabular and chart data, a critical capability for financial, scientific, and government documents.

  • Key-value pair extraction (DocVQA val): With a 91.2% score on DocVQA val, the model excels at extracting fields from forms and structured documents (such as invoices, claims, and application forms).

  • Chart and diagram reasoning: Nemotron Nano VL demonstrates robust comprehension of charts, graphs, and diagrams, making it stand out in Q&A over non-textual content.

  • Overall value: On nearly every primary document benchmark, Nemotron Nano VL matches or exceeds much larger generalist models like GPT-4o and Gemini 1.5 Pro, especially for document-specific tasks. The newest open-source competitor, MiniCPM-Llama3-V 8B, edges it in some OCR categories, but Nemotron Nano VL remains a leader in end-to-end enterprise deployability and efficiency.

So what should it be used for?#

What can you do with Llama Nemotron Nano VL? In a word: automate document intelligence across a wide range of industries. This model was engineered with practical enterprise use cases in mind, especially where large volumes of documents need to be processed quickly and accurately. Here are some of the top applications and use cases that developers and organizations can tackle with Nemotron Nano VL:

  • Financial documents (Invoices and statements): Automating the extraction of key data points from invoices, receipts, and bank statements. For example, the model can pull out line items, totals, dates, and supplier information from an invoice image, thus enabling straight-through processing in accounting systems. This significantly reduces manual data entry and speeds up accounts payable and expense management processes.

  • Compliance and identity records: Parsing compliance documents, IDs, and forms for KYC (Know Your Customer) and regulatory checks. Nemotron Nano VL can read passports, driver’s licenses, tax forms, and other identity documents to extract structured information (names, DOB, addresses, ID numbers) for verification workflows. In regulatory compliance, it can help scan through documents to find relevant clauses or data required for audits.

  • Legal contracts: Analyzing legal documents such as contracts, NDAs, and agreements to identify key clauses, obligations, dates, and parties. The model can be used to answer questions like “Does this contract have a termination for convenience clause?” or to summarize the main obligations of each party. This augments legal teams’ productivity by quickly surfacing critical points in lengthy contracts.

  • Health care and insurance forms: Processing medical records, lab reports, and insurance claim forms. For instance, in health care, Nemotron Nano could extract a patient’s diagnosis, medications, and vital statistics from a scanned clinical report. Insurance might read claim forms or accident descriptions to populate claim systems with structured data. This enables faster claim approvals and better data analysis for health care providers and insurers.

  • Business analytics and customer support: Summarizing charts, graphs, and product manuals to support decision-making and customer service. Consider a customer support scenario where a user manual (with diagrams and text) is fed to the model; the support agent (or even an AI assistant) can query, “How do I assemble part X of the product?” and get an answer drawn from the manual’s content. Similarly, a business analyst could feed in a dashboard screenshot or a PDF report and ask for a summary of the trends, with the model reading the charts and text to deliver insights.

  • Scientific and technical papers: Extracting tables, diagrams, formulas, and key findings from academic papers or technical reports. Researchers and analysts can use the model to quickly pull out data from charts or read figure captions and relate them to the paper’s text. For example, “What does Figure 2 show in this research paper?” could yield a concise explanation. This helps in literature reviews and knowledge management by making dense documents more queryable.
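Most of the use cases above boil down to the same pattern: ask the model for specific fields and request structured output. A minimal sketch of such a prompt for the invoice scenario; the field names and the idea of a JSON-returning template are illustrative, not an official NVIDIA recipe:

```python
# Hypothetical prompt template for invoice field extraction.
# Field names are illustrative; adapt them to your document type.
INVOICE_FIELDS = ["supplier", "invoice_date", "line_items", "total"]

def invoice_prompt(fields=INVOICE_FIELDS) -> str:
    """Build a prompt asking the model to return invoice fields as JSON."""
    field_list = ", ".join(fields)
    return (
        "Extract the following fields from the attached invoice image "
        f"and return them as a JSON object: {field_list}. "
        "Use null for any field that is not present."
    )
```

Asking for JSON (with an explicit null convention for missing fields) makes the model's answer easy to validate and feed into downstream accounting or claims systems.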

Testing NVIDIA’s new model#

Now that we have reviewed the architecture, the benchmark results, and the use cases, let’s examine how the model actually performs in the real world.

The service can be accessed at NVIDIA Build. On this page, you will see both a “Chat” window, where an image can be uploaded and relevant information extracted, and a “Code” window, which shows how to call the model from your own code. In this example, we will use the “Chat” interface.
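For the “Code” route, a request typically looks like the sketch below, which only builds the request payload. It assumes an OpenAI-compatible chat-completions schema with an inline base64 image; the model name and payload shape are assumptions, so use the exact snippet NVIDIA shows in its “Code” window:

```python
import base64

def build_vlm_request(image_bytes: bytes, prompt: str,
                      model: str = "nvidia/llama-3.1-nemotron-nano-vl-8b-v1") -> dict:
    """Build an OpenAI-style chat payload with an inline base64 image.
    The model identifier and message schema are assumptions based on
    common OpenAI-compatible hosting, not NVIDIA's official sample."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
        "max_tokens": 256,
    }
```

The resulting dictionary can then be POSTed to the hosted endpoint with any HTTP client, with an API key in the `Authorization` header.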

We uploaded the following receipt:

Gym payment receipt

All the personal information, such as the address and name, has been blurred for privacy reasons.

We then used the following prompt:

"Extract the package start and expiry date from the receipt."

The model responded with the following:

The package start date is 14th February 2024 and the package expiry date is 14th March 2024.

This quick experiment showed just how accessible and effective NVIDIA’s Nemotron Nano VL really is. With a simple upload and a straightforward prompt, the model accurately extracted the exact information we needed. No manual review was required, and there was no complicated setup.

For developers and organizations looking to automate document understanding, this workflow is not only powerful but refreshingly easy to integrate. It’s a compelling glimpse into the future of practical AI, in which one can turn complex documents into structured, actionable data as simply as having a conversation.

The same was tested with GPT and Gemini models, and both gave similar results.

Next, we tried an illustration containing French and Urdu text with all three models (GPT, Gemini, and NVIDIA):

And the following prompt:

"Translate the text in the given illustration to English."

GPT response:

GPT’s response was the most accurate:

1. میں تلاش کر رہا ہوں ....
“I am searching for...”

2. Est-ce que vous pouvez me renseigner?
“Can you inform me?” or “Can you give me some information?” (polite French)

3. کیا آپ بتا سکتے ہیں؟
“Can you tell me?” (Urdu)

4. Je peux l’essayer?
“Can I try it?” (French)

Gemini response:

Gemini was able to translate the French sentences, but did not translate Urdu sentences:

  • "Est-ce que vous pouvez me renseigner?" translates to "Can you tell me?" or "Can you inform me?" in English.

  • "Je peux l'essayer?" translates to "Can I try it?" or "May I try it?" in English.

NVIDIA’s response:

The response was inaccurate and did not translate all of the French statements:

The translation of this in English is "Is this what you want to see?"

Clearly, the model struggles with non-English scripts. GPT handled even non-Latin scripts like Urdu correctly, while Gemini translated only the French, making GPT the strongest model in this small test.

What other shortcomings should users expect?#

NVIDIA’s documentation states that users can upload up to four images (illustrations) per prompt. This limit is built into the model itself, not a pricing or access tier: whether you use a hosted API (NVIDIA, Hugging Face, etc.) with its own usage caps or pay-per-inference billing, the maximum is always four images per prompt. By comparison, ChatGPT Plus and higher tiers allow up to 20 images per message (per recent OpenAI documentation and in-product messages).

Similarly, for image sizes, GPT-4o currently accepts images up to 20 MB each (with optimal performance on images smaller than 10 MB and under 10,000 × 10,000 pixels). Gemini allows images up to 10 MB per upload, recommending smaller images for faster processing. Nemotron Nano VL’s per-image limit follows from its tiling system: each image can be up to roughly 2048 × 1536 pixels (within a 12-tile limit of 512 × 512 tiles, RGB only), with no alpha channel and no very large files supported.
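These constraints are easy to check before sending a request. A small validation sketch, using the four-image cap and the approximate 2048 × 1536 pixel limit stated above (the helper itself is hypothetical, not NVIDIA tooling):

```python
MAX_IMAGES = 4            # per-prompt image cap from NVIDIA's docs
MAX_W, MAX_H = 2048, 1536 # approximate per-image pixel limits stated above

def validate_batch(sizes: list) -> list:
    """Return a list of problems for a batch of (width, height) images,
    based on the limits described in the text (sketch, not official code)."""
    problems = []
    if len(sizes) > MAX_IMAGES:
        problems.append(f"too many images: {len(sizes)} > {MAX_IMAGES}")
    for i, (w, h) in enumerate(sizes):
        if w > MAX_W or h > MAX_H:
            problems.append(f"image {i} exceeds {MAX_W}x{MAX_H}: {w}x{h}")
    return problems
```

Running such a check client-side avoids burning an API call on a request the model would reject or silently downscale.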

In summary#

NVIDIA’s Llama 3.1 Nemotron Nano VL 8B is a powerful, efficient vision language model for document intelligence, deployable on various devices. It excels at extracting data, answering questions, and synthesizing information from complex documents — just as long as they're written in English.


Written By:
Fahim ul Haq