Speaking the Language of the Genome: Gordon Bell Finalist Applies Large Language Models to Predict New COVID Variants
14/11/2022
Published in October, the groundbreaking work is a collaboration by more than two dozen academic and commercial researchers from Argonne National Laboratory, NVIDIA, the University of Chicago and others.
The research team trained an LLM to track genetic mutations and predict variants of concern in SARS-CoV-2, the virus behind COVID-19. While most LLMs applied to biology to date have been trained on datasets of small molecules or proteins, this project is one of the first models trained on raw nucleotide sequences - the smallest units of DNA and RNA.
"We hypothesized that moving from protein-level to gene-level data might help us build better models to understand COVID variants," said Arvind Ramanathan, computational biologist at Argonne, who led the project. "By training our model to track the entire genome and all the changes that appear in its evolution, we can make better predictions about not just COVID, but any disease with enough genomic data."
The Gordon Bell awards, regarded as the Nobel Prize of high performance computing, will be presented at this week's SC22 conference by the Association for Computing Machinery, which represents around 100,000 computing experts worldwide. Since 2020, the group has awarded a special prize for outstanding research that advances the understanding of COVID with HPC.
Training LLMs on a Four-Letter Language
LLMs have long been trained on human languages, which usually comprise a couple dozen letters that can be arranged into tens of thousands of words, and joined together into longer sentences and paragraphs. The language of biology, on the other hand, has only four letters representing nucleotides - A, T, G and C in DNA, or A, U, G and C in RNA - arranged into different sequences as genes.
While fewer letters may seem like a simpler challenge for AI, language models for biology are actually far more complicated. That's because the genome - made up of over 3 billion nucleotides in humans, and about 30,000 nucleotides in coronaviruses - is difficult to break down into distinct, meaningful units.
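As a rough illustration of what breaking a genome into units can look like, the sketch below splits a nucleotide sequence into fixed-length, non-overlapping tokens. The token length of 3, the function name `kmer_tokens` and the example sequence are assumptions made for illustration; the article does not describe the team's actual tokenizer.

```python
from typing import List

# The four DNA letters plus U, which replaces T in RNA.
NUCLEOTIDES = set("ACGTU")


def kmer_tokens(sequence: str, k: int = 3) -> List[str]:
    """Split a nucleotide sequence into non-overlapping k-letter tokens."""
    if k <= 0:
        raise ValueError("k must be positive")
    seq = sequence.upper()
    unknown = set(seq) - NUCLEOTIDES
    if unknown:
        raise ValueError(f"unexpected symbols: {sorted(unknown)}")
    # Drop a trailing fragment shorter than k so every token has the same length.
    usable = len(seq) - len(seq) % k
    return [seq[i:i + k] for i in range(0, usable, k)]


# Made-up 11-letter sequence; the last two letters do not fill a token.
tokens = kmer_tokens("ATGGCGTACGA")
print(tokens)  # ['ATG', 'GCG', 'TAC']
```

With four DNA letters and a token length of 3 there are only 4**3 = 64 possible tokens, so the difficulty lies in the length of the sequences and the distance between related positions, not in the size of the vocabulary.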
"When it comes to understanding the code of life, a major challenge is that the sequencing information in the genome is quite vast," Ramanathan said. "The meaning of a nucleotide sequence can be affected by another sequence that's much further away than the next sentence or paragraph would be in human text. It could reach over the equivalent of chapters in a book."
NVIDIA collaborators on the project designed a hierarchical diffusion method that enabled the LLM to treat long strings of around 1,500 nucleotides as if they were sentences.
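To give a sense of scale, a genome of about 30,000 nucleotides cut into strings of about 1,500 yields roughly 20 "sentences". The sketch below only shows that slicing step on a synthetic sequence; it is not the hierarchical diffusion method itself, and the function name `windows` and the non-overlapping layout are assumptions.

```python
from typing import Iterator


def windows(genome: str, size: int = 1500) -> Iterator[str]:
    """Yield consecutive, non-overlapping slices of at most `size` letters."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(genome), size):
        yield genome[start:start + size]


# Synthetic stand-in for a coronavirus-length genome (30,000 letters),
# not real sequence data.
toy_genome = "ACGT" * 7500
chunks = list(windows(toy_genome))
print(len(chunks), len(chunks[0]))  # 20 1500
```

A real pipeline would also need to decide how to handle a final, shorter window and whether neighboring windows should overlap; both choices are left open here.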
"Standard language models have trouble generating coherent long sequences and learning the underlying distribution of different variants," said paper co-author Anima Anandkumar, senior director of AI research at NVIDIA and Bren professor in the computing + mathematical sciences department at Caltech. "We developed a diffusion model that operates at a higher level of detail that allows us to generate realistic variants and capture better statistics."
Predicting COVID Variants of Concern
Using open-source data from the Bacterial and Viral Bioinformatics Resource Center, the team first pretrained its LLM on more than 110 million gene sequences from prokaryotes, which are single-celled organisms like bacteria. It then fine-tuned the model using 1.5 million high-quality genome sequences for the COVID virus.
By pretraining on a broader dataset, the researchers also ensured their model could generalize to other prediction tasks in future projects - making it one of the first whole-genome-scale models with this capability.
Once fine-tuned on COVID data, the LLM was able to distinguish between genome sequences of the virus' variants. It was also able to generate its own nucleotide sequences, predicting potential mutations of the COVID genome that could help scientists anticipate future variants of concern.
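One simple way to inspect a generated sequence is to list where it differs from a reference. The sketch below does this for single-letter substitutions only. It is an illustrative helper, not part of the model described in the paper; the function name `point_substitutions` and both toy sequences are made up, and insertions or deletions are deliberately not handled.

```python
from typing import List, Tuple


def point_substitutions(reference: str, candidate: str) -> List[Tuple[int, str, str]]:
    """Return (1-based position, reference letter, candidate letter) for each mismatch."""
    if len(reference) != len(candidate):
        raise ValueError("sequences must have equal length; indels are not handled")
    return [
        (pos, ref, alt)
        for pos, (ref, alt) in enumerate(zip(reference, candidate), start=1)
        if ref != alt
    ]


# Toy sequences for illustration, not real viral data.
subs = point_substitutions("ATGGCGTAC", "ATGACGTAT")
print(subs)  # [(4, 'G', 'A'), (9, 'C', 'T')]
```

Real genome comparison normally relies on sequence alignment first, since variants often include insertions and deletions as well as substitutions.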
Trained on a year's worth of SARS-CoV-2 genome data, the model can infer the distinction between various viral strains. Each dot on the left corresponds to a sequenced SARS-CoV-2 viral strain, color-coded by variant. The figure on the right zooms into one particular strain of the virus, which captures evolutionary couplings across the viral proteins specific to this strain. Image courtesy of Argonne National Laboratory's Bharat Kale, Max Zvyagin and Michael E. Papka.
"Most researchers have been tracking mutations in the spike protein of the COVID virus, specifically the domain that binds with human cells," Ramanathan said. "But there are other proteins in the viral genome that go through frequent mutations and are important to understand."
The model could also integrate with popular protein-structure-prediction models like AlphaFold and OpenFold, the paper stated, helping researchers simulate viral structure and study how genetic mutations impact a virus' ability to infect its host. OpenFold is one of the pretrained language models included in the NVIDIA BioNeMo LLM service for developers applying LLMs to digital biology and chemistry applications.
Supercharging AI Training With GPU-Accelerated Supercomputers
The team developed its AI models on supercomputers powered by NVIDIA A100 Tensor Core GPUs - including Argonne's Polaris, the U.S. Department of Energy's Perlmutter, and NVIDIA's in-house Selene system. By scaling up to these powerful systems, they achieved performance of more than 1,500 exaflops in training runs, creating the largest biological language models to date.
We're working with models today that have up to 25 billion
LINK: | https://blogs.nvidia.com/blog/2022/11/14/genomic-large-language-model-... |