NVIDIA Releases Open Synthetic Data Generation Pipeline for Training Large Language Models

14/06/2024

NVIDIA today announced Nemotron-4 340B, a family of open models that developers can use to generate synthetic data for training large language models (LLMs) for commercial applications across healthcare, finance, manufacturing, retail and every other industry.

High-quality training data plays a critical role in the performance, accuracy and quality of responses from a custom LLM, but robust datasets can be prohibitively expensive and difficult to access.

Through a uniquely permissive open model license, Nemotron-4 340B gives developers a free, scalable way to generate synthetic data that can help build powerful LLMs.

The Nemotron-4 340B family includes base, instruct and reward models that form a pipeline to generate synthetic data used for training and refining LLMs. The models are optimized to work with NVIDIA NeMo, an open-source framework for end-to-end model training, including data curation, customization and evaluation. They're also optimized for inference with the open-source NVIDIA TensorRT-LLM library.

Nemotron-4 340B can be downloaded now from the NVIDIA NGC catalog and from Hugging Face, where developers can also use the Train on DGX Cloud service to easily fine-tune open AI models. Developers will soon be able to access the models at ai.nvidia.com, where they'll be packaged as an NVIDIA NIM microservice with a standard application programming interface that can be deployed anywhere.

Navigating Nemotron to Generate Synthetic Data

LLMs can help developers generate synthetic training data in scenarios where access to large, diverse labeled datasets is limited.

The Nemotron-4 340B Instruct model creates diverse synthetic data that mimics the characteristics of real-world data, helping improve data quality to increase the performance and robustness of custom LLMs across various domains.

Then, to boost the quality of the AI-generated data, developers can use the Nemotron-4 340B Reward model to filter for high-quality responses. Nemotron-4 340B Reward grades responses on five attributes: helpfulness, correctness, coherence, complexity and verbosity. It currently holds first place on the Hugging Face RewardBench leaderboard, created by AI2, which evaluates the capabilities, safety and pitfalls of reward models.

In this synthetic data generation pipeline, (1) the Nemotron-4 340B Instruct model is first used to produce synthetic text-based output. An evaluator model, (2) Nemotron-4 340B Reward, then assesses this generated text, providing feedback that guides iterative improvements and helps ensure the synthetic data is accurate, relevant and aligned with specific requirements. Researchers can also create their own instruct or reward models by customizing the Nemotron-4 340B Base model using their proprietary data, combined with the included HelpSteer2 dataset.
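The generate-then-score loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `fake_generate` and `fake_score` are hypothetical stand-ins for calls to the Instruct and Reward models, and the score scale and thresholds are invented for the example rather than taken from NVIDIA's documentation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# The five attributes the Reward model grades, as listed in the article.
ATTRIBUTES = ("helpfulness", "correctness", "coherence", "complexity", "verbosity")


@dataclass
class Sample:
    prompt: str
    response: str
    scores: Dict[str, float]


def build_dataset(
    prompts: List[str],
    generate: Callable[[str], str],
    score: Callable[[str, str], Dict[str, float]],
    min_helpfulness: float = 3.0,  # assumed threshold, not an official value
    min_correctness: float = 3.0,  # assumed threshold, not an official value
) -> List[Sample]:
    """Generate one response per prompt and keep only well-scored ones."""
    kept = []
    for prompt in prompts:
        response = generate(prompt)       # step (1): instruct model
        scores = score(prompt, response)  # step (2): reward model
        if (scores["helpfulness"] >= min_helpfulness
                and scores["correctness"] >= min_correctness):
            kept.append(Sample(prompt, response, scores))
    return kept


# Hypothetical stand-ins so the sketch runs without any model.
def fake_generate(prompt: str) -> str:
    return f"Answer to: {prompt}"


def fake_score(prompt: str, response: str) -> Dict[str, float]:
    base = 4.0 if "refund" in prompt else 1.0
    return {name: base for name in ATTRIBUTES}


dataset = build_dataset(
    ["How do I request a refund?", "asdf"], fake_generate, fake_score
)
print([sample.prompt for sample in dataset])
```

In a real pipeline the two stand-in functions would be replaced by requests to deployed models, and the filter rule would be tuned to the use case.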

Fine-Tuning With NeMo, Optimizing for Inference With TensorRT-LLM

Using open-source NVIDIA NeMo and NVIDIA TensorRT-LLM, developers can optimize the efficiency of their instruct and reward models to generate synthetic data and to score responses.

All Nemotron-4 340B models are optimized with TensorRT-LLM to take advantage of tensor parallelism, a type of model parallelism in which individual weight matrices are split across multiple GPUs and servers, enabling efficient inference at scale.
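To make the idea of tensor parallelism concrete, here is a minimal pure-Python sketch that splits one weight matrix by columns, multiplies each shard separately and concatenates the partial results. It is a toy illustration, not how TensorRT-LLM is implemented: the function names are hypothetical, and each shard here runs in the same process rather than on its own GPU.

```python
from typing import List

Matrix = List[List[float]]


def matmul(x: Matrix, w: Matrix) -> Matrix:
    """Plain matrix product x @ w."""
    cols = list(zip(*w))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in x]


def split_columns(w: Matrix, parts: int) -> List[Matrix]:
    """Split w into `parts` equal column blocks, one per (simulated) device."""
    n_cols = len(w[0])
    if n_cols % parts != 0:
        raise ValueError("column count must be divisible by the number of parts")
    step = n_cols // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]


def tensor_parallel_matmul(x: Matrix, w: Matrix, parts: int) -> Matrix:
    """Compute x @ w shard by shard, then concatenate the output columns."""
    # In a real system each partial product would run on a different GPU.
    partials = [matmul(x, shard) for shard in split_columns(w, parts)]
    out = []
    for r in range(len(x)):
        row = []
        for partial in partials:
            row.extend(partial[r])
        out.append(row)
    return out


x = [[1.0, 2.0]]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
print(tensor_parallel_matmul(x, w, parts=2))  # [[11.0, 14.0, 17.0, 20.0]]
```

Because each shard holds only a slice of the matrix, no single device needs to store the full set of weights, which is what makes very large models practical to serve.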

Nemotron-4 340B Base, trained on 9 trillion tokens, can be customized using the NeMo framework to adapt to specific use cases or domains. This fine-tuning process benefits from extensive pretraining data and yields more accurate outputs for specific downstream tasks.

A variety of customization methods are available through the NeMo framework, including supervised fine-tuning and parameter-efficient fine-tuning methods such as low-rank adaptation, or LoRA.
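As a rough illustration of why LoRA counts as parameter-efficient, the sketch below compares the number of trainable values in a full weight matrix with the number in a low-rank update, and shows how the update is added to the frozen weights. The layer size, rank and scale are arbitrary values chosen for the example, not Nemotron-4 340B settings, and the helper names are hypothetical rather than NeMo APIs.

```python
from typing import List

Matrix = List[List[float]]


def full_params(d_out: int, d_in: int) -> int:
    """Trainable values when the whole d_out x d_in matrix is fine-tuned."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable values for LoRA factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


def effective_weight(w: Matrix, b: Matrix, a: Matrix, scale: float) -> Matrix:
    """Return W + scale * (B @ A); W stays frozen, only B and A are trained."""
    rank = len(a)
    return [
        [
            w[i][j] + scale * sum(b[i][k] * a[k][j] for k in range(rank))
            for j in range(len(w[0]))
        ]
        for i in range(len(w))
    ]


# Assumed example sizes: a 4096 x 4096 layer adapted with rank 8.
print(full_params(4096, 4096))     # 16777216
print(lora_params(4096, 4096, 8))  # 65536

# Tiny rank-1 update applied to a 2 x 2 identity matrix.
w = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0], [2.0]]
a = [[3.0, 4.0]]
print(effective_weight(w, b, a, scale=0.5))  # [[2.5, 2.0], [3.0, 5.0]]
```

With these example sizes the low-rank update trains 256 times fewer values than full fine-tuning of the same layer.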

To boost model quality, developers can align their models with NeMo Aligner and datasets annotated by Nemotron-4 340B Reward. Alignment is a key step in training LLMs, where a model's behavior is fine-tuned using algorithms like reinforcement learning from human feedback (RLHF) to ensure its outputs are safe, accurate, contextually appropriate and consistent with its intended goals.

Businesses seeking enterprise-grade support and security for production environments can also access NeMo and TensorRT-LLM through the cloud-native NVIDIA AI Enterprise software platform, which provides accelerated and efficient runtimes for generative AI foundation models.

Evaluating Model Security and Getting Started

The Nemotron-4 340B Instruct model underwent extensive safety evaluation, including adversarial tests, and performed well across a wide range of risk indicators. Users should still perform careful evaluation of the model's outputs to ensure the synthetically generated data is suitable, safe and accurate for their use case.

For more information on model security and safety evaluation, read the model card.

Download Nemotron-4 340B models via NVIDIA NGC and Hugging Face. For more details, read the research papers on the model and dataset.

See notice regarding software product information.

LINK:	https://blogs.nvidia.com/blog/nemotron-4-synthetic-data-generation-llm...