Nucleotide Transformers
Discover single-nucleotide resolution insights
State-of-the-art generative DNA models
Intro:
Here we present a family of transformer-based genomic language models trained on extensive datasets of DNA sequences across multiple reference genomes. These models integrate contextual information from nucleotide sequences to enable highly accurate predictions of molecular phenotypes and precise segmentation of genomic elements at single-nucleotide resolution.
Nucleotide Transformer:
Closing the gap between measurable genetic information and observable traits is a longstanding challenge in genomics. Yet, the prediction of molecular phenotypes from DNA sequences alone remains limited and inaccurate, often driven by the scarcity of annotated data and the inability to transfer learnings between prediction tasks. Here, we present an extensive study of foundation models pre-trained on DNA sequences, named the Nucleotide Transformer, integrating information from 3,202 diverse human genomes, as well as 850 genomes from a wide range of species, including model and non-model organisms. These transformer models yield transferable, context-specific representations of nucleotide sequences, which allow for accurate molecular phenotype prediction even in low-data settings. We show that the representations alone match or outperform specialised methods on 11 of 18 prediction tasks, and up to 15 after fine-tuning. Despite no supervision, the transformer models learnt to focus attention on key genomic elements, including those that regulate gene expression, such as enhancers. Lastly, we demonstrate that utilising model representations alone can improve the prioritisation of functional genetic variants. The training and application of foundational models in genomics explored in this study provide a widely applicable stepping stone to bridge the gap of accurate molecular phenotype prediction from DNA sequence alone.
SegmentNT:
Foundation models have achieved remarkable success in several fields such as natural language processing, computer vision and, more recently, biology. DNA foundation models in particular are emerging as a promising approach for genomics. However, so far no model has delivered granular, nucleotide-level predictions across a wide range of genomic and regulatory elements, limiting their practical usefulness. In this paper, we build on our previous work on the Nucleotide Transformer (NT) to develop a segmentation model, SegmentNT, that processes input DNA sequences of up to 30 kb in length to predict 14 different classes of genomic elements at single-nucleotide resolution. By utilising pre-trained weights from NT, SegmentNT surpasses the performance of several ablation models, including convolution networks with one-hot encoded nucleotide sequences and models trained from scratch. SegmentNT can process multiple sequence lengths, with zero-shot generalisation to sequences of up to 50 kb. We show improved performance on the detection of splice sites throughout the genome and demonstrate strong nucleotide-level precision. Because it evaluates all gene elements simultaneously, SegmentNT can predict the impact of sequence variants not only on splice site changes but also on exon and intron rearrangements in transcript isoforms. Finally, we show that a SegmentNT model trained on human genomic elements can generalise to elements of different species and that a trained multispecies SegmentNT model achieves stronger generalisation for all genic elements on unseen species. In summary, SegmentNT demonstrates that DNA foundation models can tackle complex, granular tasks in genomics at single-nucleotide resolution. SegmentNT can be easily extended to additional genomic elements and species, thus representing a new paradigm for how we analyse and interpret DNA.
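To make "single-nucleotide resolution" concrete, the sketch below shows one common way to turn per-nucleotide probabilities for a single element class into genomic intervals. It is an illustration only and is not taken from the SegmentNT code: the function name `probabilities_to_segments`, the 0.5 threshold and the toy probabilities are all assumptions.

```python
from typing import List, Tuple


def probabilities_to_segments(
    probs: List[float], threshold: float = 0.5
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) intervals where probs[i] >= threshold.

    `probs` holds one probability per nucleotide for a single element class.
    The threshold is an illustrative assumption, not a published setting.
    """
    segments: List[Tuple[int, int]] = []
    start = None  # index where the current above-threshold run began
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i  # a new run begins here
        elif p < threshold and start is not None:
            segments.append((start, i))  # the run ended just before i
            start = None
    if start is not None:
        # The sequence ended while still inside a run.
        segments.append((start, len(probs)))
    return segments


# Toy per-nucleotide probabilities for one hypothetical class (e.g. "exon").
exon_probs = [0.1, 0.2, 0.9, 0.95, 0.8, 0.3, 0.1, 0.7, 0.9]
print(probabilities_to_segments(exon_probs))  # [(2, 5), (7, 9)]
```

With 14 classes, the same function would be applied once per class, since a nucleotide can belong to more than one element type.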
ChatNT:
Language models are thriving, powering conversational agents that assist and empower humans to solve a number of tasks. Recently, these models were extended to support additional modalities including vision, audio and video, demonstrating impressive capabilities across multiple domains including healthcare. Still, conversational agents remain limited in biology as they cannot yet fully comprehend biological sequences. On the other hand, high-performance foundation models for biological sequences have been built through self-supervision over sequencing data, but these need to be fine-tuned for each specific application, preventing transfer and generalisation between tasks. In addition, these models are not conversational, which limits their utility to users with coding capabilities. In this paper, we propose to bridge the gap between biology foundation models and conversational agents by introducing ChatNT, the first multimodal conversational agent with an advanced understanding of biological sequences. ChatNT achieves new state-of-the-art results on the Nucleotide Transformer benchmark while being able to solve all tasks at once, in English, and to generalise to unseen questions. In addition, we have curated a new set of more biologically relevant instruction tasks from DNA, RNA and proteins, spanning multiple species, tissues and biological processes. ChatNT reaches performance on par with state-of-the-art specialised methods on those tasks. We also present a novel perplexity-based technique to help calibrate the confidence of our model predictions. Our framework for genomics instruction-tuning can be easily extended to more tasks and biological data modalities (e.g. structure, imaging), making it a widely applicable tool for biology. ChatNT is the first model of its kind and constitutes an initial step towards building generally capable agents that understand biology from first principles while being accessible to users with no coding background.
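The abstract does not describe the perplexity-based calibration technique in detail, so the sketch below only illustrates the underlying quantity: the perplexity of a generated answer, computed from per-token log-probabilities as exp(-mean log p). The function name and the example log-probabilities are hypothetical, and this should not be read as ChatNT's actual procedure.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity of a token sequence from its natural-log probabilities.

    Defined as exp(-(1/N) * sum(log p_i)). Lower values mean the model
    assigned higher probability to the tokens it produced.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Hypothetical log-probabilities for two candidate answers.
confident_answer = [math.log(0.9), math.log(0.8), math.log(0.95)]
uncertain_answer = [math.log(0.4), math.log(0.3), math.log(0.5)]

print(perplexity(confident_answer) < perplexity(uncertain_answer))  # True
```

A score of this kind could then be mapped to a confidence estimate on held-out data; how that mapping is done is left open here.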
AgroNT:
Significant progress has been made in the field of plant genomics, as demonstrated by the increased use of high-throughput methodologies that enable the characterisation of multiple genome-wide molecular phenotypes. These findings have provided valuable insights into plant traits and their underlying genetic mechanisms, particularly in model plant species. Nonetheless, effectively leveraging them to make accurate predictions represents a critical step in crop genomic improvement. We present AgroNT, a foundational large language model trained on genomes from 48 plant species with a predominant focus on crop species. We show that AgroNT can obtain state-of-the-art predictions for regulatory annotations, promoter/terminator strength and tissue-specific gene expression, and can prioritise functional variants. We conduct a large-scale in silico saturation mutagenesis analysis on cassava to evaluate the regulatory impact of over 10 million mutations and provide their predicted effects as a resource for variant characterisation. Finally, we propose the use of the diverse datasets compiled here as the Plants Genomic Benchmark (PGB), providing a comprehensive benchmark for deep learning-based methods in plant genomic research.
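In silico saturation mutagenesis, as mentioned above, means scoring every possible single-nucleotide substitution of a sequence and comparing each score with that of the unmutated reference. The sketch below shows the enumeration step under simple assumptions: input is uppercase A/C/G/T, and `gc_content` is a toy placeholder standing in for a real model's prediction. All function names are hypothetical and none of this is AgroNT code.

```python
from typing import Callable, Iterator, List, Tuple

NUCLEOTIDES = "ACGT"


def single_nucleotide_mutants(seq: str) -> Iterator[Tuple[int, str, str, str]]:
    """Yield (position, ref, alt, mutated_sequence) for every substitution."""
    for pos, ref in enumerate(seq):
        for alt in NUCLEOTIDES:
            if alt != ref:
                yield pos, ref, alt, seq[:pos] + alt + seq[pos + 1:]


def gc_content(seq: str) -> float:
    """Toy scoring function (placeholder for a model prediction)."""
    return sum(base in "GC" for base in seq) / len(seq)


def saturation_mutagenesis(
    seq: str, score_fn: Callable[[str], float]
) -> List[Tuple[int, str, str, float]]:
    """Return (position, ref, alt, score change) for each substitution."""
    ref_score = score_fn(seq)
    return [
        (pos, ref, alt, score_fn(mutant) - ref_score)
        for pos, ref, alt, mutant in single_nucleotide_mutants(seq)
    ]


effects = saturation_mutagenesis("ACGT", gc_content)
print(len(effects))  # 12, i.e. 3 substitutions at each of 4 positions
```

A sequence of length L yields 3L mutants, which is why genome-scale analyses reach millions of scored mutations.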