NVIDIA Releases Open Synthetic Data Generation Pipeline for Training Large Language Models

14/06/2024

NVIDIA today announced Nemotron-4 340B, a family of open models that developers can use to generate synthetic data for training large language models (LLMs) for commercial applications across healthcare, finance, manufacturing, retail and every other industry.

High-quality training data plays a critical role in the performance, accuracy and quality of responses from a custom LLM, but robust datasets can be prohibitively expensive and difficult to access.

Through a uniquely permissive open model license, Nemotron-4 340B gives developers a free, scalable way to generate synthetic data that can help build powerful LLMs.

The Nemotron-4 340B family includes base, instruct and reward models that form a pipeline to generate synthetic data used for training and refining LLMs. The models are optimized to work with NVIDIA NeMo, an open-source framework for end-to-end model training, including data curation, customization and evaluation. They're also optimized for inference with the open-source NVIDIA TensorRT-LLM library.

Nemotron-4 340B can be downloaded now from the NVIDIA NGC catalog and from Hugging Face, where developers can also use the Train on DGX Cloud service to easily fine-tune open AI models. Developers will soon be able to access the models at ai.nvidia.com, where they'll be packaged as an NVIDIA NIM microservice with a standard application programming interface that can be deployed anywhere.

Navigating Nemotron to Generate Synthetic Data

LLMs can help developers generate synthetic training data in scenarios where access to large, diverse labeled datasets is limited.

The Nemotron-4 340B Instruct model creates diverse synthetic data that mimics the characteristics of real-world data, helping improve data quality to increase the performance and robustness of custom LLMs across various domains.

Then, to boost the quality of the AI-generated data, developers can use the Nemotron-4 340B Reward model to filter for high-quality responses. Nemotron-4 340B Reward grades responses on five attributes: helpfulness, correctness, coherence, complexity and verbosity. It currently holds first place on the Hugging Face RewardBench leaderboard, created by AI2, for evaluating the capabilities, safety and pitfalls of reward models.
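As a rough illustration of the filtering step described above, the following Python sketch keeps only responses whose scores clear a per-attribute minimum. The five attribute names come from the article; the `ScoredResponse` type, the `keep_high_quality` helper, the 0 to 4 score scale, the threshold values and the example scores are all assumptions made up for this sketch, and a real pipeline would obtain the scores from the reward model rather than hard-code them.

```python
from dataclasses import dataclass

# The five attributes named in the article.
ATTRIBUTES = ("helpfulness", "correctness", "coherence", "complexity", "verbosity")


@dataclass
class ScoredResponse:
    prompt: str
    response: str
    scores: dict  # attribute name -> score (hypothetical 0-4 scale)


def keep_high_quality(items, min_scores):
    """Keep responses that meet every per-attribute minimum in min_scores."""
    kept = []
    for item in items:
        if all(item.scores[attr] >= floor for attr, floor in min_scores.items()):
            kept.append(item)
    return kept


# Made-up scores for illustration only.
candidates = [
    ScoredResponse(
        "Explain TCP.",
        "TCP is a reliable transport protocol...",
        {"helpfulness": 3.6, "correctness": 3.8, "coherence": 3.9,
         "complexity": 1.9, "verbosity": 1.5},
    ),
    ScoredResponse(
        "Explain TCP.",
        "TCP is a kind of cable.",
        {"helpfulness": 0.7, "correctness": 0.4, "coherence": 3.1,
         "complexity": 0.8, "verbosity": 0.6},
    ),
]

# Hypothetical thresholds: only helpfulness and correctness are gated here.
thresholds = {"helpfulness": 3.0, "correctness": 3.0}
filtered = keep_high_quality(candidates, thresholds)
print(len(filtered), "of", len(candidates), "responses kept")
```

Gating on a subset of attributes is a design choice in this sketch: complexity and verbosity describe style rather than quality, so a developer might leave them unconstrained or bound them from above instead.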

In this synthetic data generation pipeline, (1) the Nemotron-4 340B Instruct model is first used to produce synthetic text-based output. An evaluator model, (2) Nemotron-4 340B Reward, then assesses this generated text, providing feedback that guides iterative improvements and ensures the synthetic data is accurate, relevant and aligned with specific requirements. Researchers can also create their own instruct or reward models by customizing the Nemotron-4 340B Base model using their proprietary data, combined with the included HelpSteer2 dataset.
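The two-stage flow above can be sketched as a generate-then-score loop. In the Python sketch below, `generate_candidates` and `score` are hypothetical stand-ins for calls to an instruct model and a reward model; they return canned values so the example runs on its own, and neither reflects a real NVIDIA API. The best-of-n selection and the `min_score` cutoff are likewise assumptions, one of several ways such a loop could be arranged.

```python
def generate_candidates(prompt, n):
    """Hypothetical stand-in for an instruct model: returns n canned drafts."""
    return [f"{prompt} :: draft {i}" for i in range(n)]


def score(prompt, response):
    """Hypothetical stand-in for a reward model.

    For this toy example the score is just the draft index, so later
    drafts rank higher. A real reward model would return learned scores.
    """
    return float(response.rsplit(" ", 1)[-1])


def build_dataset(prompts, n=4, min_score=2.0):
    """For each prompt, keep the top-scoring draft if it clears min_score."""
    dataset = []
    for prompt in prompts:
        candidates = generate_candidates(prompt, n)
        best = max(candidates, key=lambda r: score(prompt, r))
        if score(prompt, best) >= min_score:
            dataset.append({"prompt": prompt, "response": best})
    return dataset


data = build_dataset(["Summarize the report", "Translate the memo"])
for row in data:
    print(row)
```

Prompts whose best draft falls below the cutoff are dropped rather than padded with a weak response, which keeps the resulting dataset small but clean.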

Fine-Tuning With NeMo, Optimizing for Inference With TensorRT-LLM

Using open-source NVIDIA NeMo and NVIDIA TensorRT-LLM, developers can optimize the efficiency of their instruct and reward models to generate synthetic data and to score responses.

All Nemotron-4 340B models are optimized with TensorRT-LLM to take advantage of tensor parallelism, a type of model parallelism in which individual weight matrices are split across multiple GPUs and servers, enabling efficient inference at scale.
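To make the idea of splitting a weight matrix concrete, here is a minimal pure-Python sketch of column-wise tensor parallelism. It is a toy under stated assumptions: the "devices" are just list entries, the helpers `matmul` and `split_columns` are invented for this example, and nothing here reflects how TensorRT-LLM implements the technique. It only shows that each shard can compute its slice of the output independently and that concatenating the slices reproduces the unsplit result.

```python
def matmul(x, w):
    """Multiply a vector x (length n) by a matrix w (n rows, m columns)."""
    cols = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(cols)]


def split_columns(w, parts):
    """Split matrix w into `parts` column blocks of equal width."""
    cols = len(w[0])
    assert cols % parts == 0, "column count must divide evenly"
    width = cols // parts
    return [[row[p * width:(p + 1) * width] for row in w] for p in range(parts)]


x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

# Each shard would live on a different GPU; here they are plain lists.
shards = split_columns(w, parts=2)

# Every "device" multiplies the same input by its own column block.
partials = [matmul(x, shard) for shard in shards]

# Concatenating the partial outputs recovers the full result.
combined = [value for part in partials for value in part]
full = matmul(x, w)
print(combined == full)
```

In a real deployment the concatenation step is a collective communication between GPUs, which is why interconnect bandwidth matters for this form of parallelism.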

Nemotron-4 340B Base, trained on 9 trillion tokens, can be customized using the NeMo framework to adapt to specific use cases or domains. This fine-tuning process benefits from extensive pretraining data and yields more accurate outputs for specific downstream tasks.

A variety of customization methods are available through the NeMo framework, including supervised fine-tuning and parameter-efficient fine-tuning methods such as low-rank adaptation, or LoRA.
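For readers unfamiliar with LoRA, the sketch below shows the core arithmetic in plain Python: the pretrained weight matrix W stays frozen and a low-rank product B·A, scaled by alpha/rank, is added to it. The helper names, the tiny matrices and the 4096-wide layer used for the parameter count are illustrative assumptions, not values taken from Nemotron-4 340B or the NeMo framework.

```python
def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m)."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def merge_lora(w, a, b, alpha, rank):
    """Return W + (alpha / rank) * (B @ A), the merged weight matrix."""
    scale = alpha / rank
    delta = matmul(b, a)
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]


def lora_param_count(d_in, d_out, rank):
    """Trainable parameters in A (rank x d_in) plus B (d_out x rank)."""
    return rank * (d_in + d_out)


# Frozen 2x2 weight and a rank-1 update (toy values).
w = [[1.0, 0.0],
     [0.0, 1.0]]
a = [[0.5, 0.0]]      # shape: rank x d_in
b = [[1.0], [2.0]]    # shape: d_out x rank

merged = merge_lora(w, a, b, alpha=2.0, rank=1)
print(merged)

# Parameter savings for a hypothetical 4096 x 4096 layer at rank 8.
full_params = 4096 * 4096
adapter_params = lora_param_count(4096, 4096, 8)
print(full_params, adapter_params)
```

Because only A and B are trained, the number of trainable parameters grows with the rank rather than with the product of the layer dimensions, which is what makes the method parameter-efficient.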

To boost model quality, developers can align their models with NeMo Aligner and datasets annotated by Nemotron-4 340B Reward. Alignment is a key step in training LLMs, where a model's behavior is fine-tuned using algorithms like reinforcement learning from human feedback (RLHF) to ensure its outputs are safe, accurate, contextually appropriate and consistent with its intended goals.

Businesses seeking enterprise-grade support and security for production environments can also access NeMo and TensorRT-LLM through the cloud-native NVIDIA AI Enterprise software platform, which provides accelerated and efficient runtimes for generative AI foundation models.

Evaluating Model Security and Getting Started

The Nemotron-4 340B Instruct model underwent extensive safety evaluation, including adversarial tests, and performed well across a wide range of risk indicators. Users should still perform careful evaluation of the model's outputs to ensure the synthetically generated data is suitable, safe and accurate for their use case.

For more information on model security and safety evaluation, read the model card.

Download Nemotron-4 340B models via NVIDIA NGC and Hugging Face. For more details, read the research papers on the model and dataset.

See notice regarding software product information.
LINK: https://blogs.nvidia.com/blog/nemotron-4-synthetic-data-generation-llm...