
Researchers move closer to preventing pandemics

Researchers have developed a new artificial intelligence tool, PathogenFinder2, that can help determine whether an unfamiliar bacterium carries genetic features linked to the ability to cause disease. This may transform pandemic preparedness by enabling the detection of harmful bacteria before they have even infected humans.

Photo of a computer at DTU
PathogenFinder2 required substantial computing power to develop and was trained on a dataset of more than 21,000 bacterial genomes from international databases. Model photo: Bax Lindhardt
Alfred Ferrer Florensa in his office at the DTU National Food Institute. Displayed on the screen is the first map to show how thousands of bacteria are related to one another in terms of their pathogenic properties. Photo: Lene Hundborg Koss

What PathogenFinder2 does differently

PathogenFinder2 introduces a fundamentally new strategy. Instead of relying on similarity to known species, the model uses protein language models, advanced AI systems trained on millions of protein sequences. Much as text prediction tools learn patterns in human language, these models learn the language of proteins, allowing them to detect biochemical signals that traditional approaches miss.
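To make the idea concrete, here is a minimal sketch of how a whole genome could be turned into a single numerical vector built from its proteins. It is an illustration only, not the PathogenFinder2 implementation: the function names are hypothetical, the simple amino-acid frequency vector stands in for a real protein language model, and averaging over proteins is an assumed way of combining them.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def embed_protein(sequence):
    """Toy stand-in for a protein language model (assumption):
    returns the frequency of each standard amino acid."""
    counts = Counter(sequence)
    total = sum(counts[aa] for aa in AMINO_ACIDS) or 1
    return [counts[aa] / total for aa in AMINO_ACIDS]


def embed_genome(proteins):
    """Average the per-protein vectors into one fixed-length genome vector
    (assumed pooling step, chosen for simplicity)."""
    vectors = [embed_protein(p) for p in proteins]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Made-up protein sequences representing one hypothetical genome.
genome = ["MKTAYIAKQR", "MSDNELLKKA", "MAAGGWYLV"]
genome_vector = embed_genome(genome)
print(len(genome_vector))  # one value per amino acid
```

A real protein language model would replace `embed_protein` with a learned representation that captures far richer patterns than amino-acid frequencies, but the overall shape (many proteins in, one genome vector out) would stay the same.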

“PathogenFinder2 is one of the first models to interpret whole bacterial genomes by leveraging the massive potential of language models. It performs significantly better than all previous models, particularly when it encounters bacterial species we have never seen before. In addition, it provides explanations for its predictions,” says PhD Alfred Ferrer Florensa.

The researchers emphasise that the model can indicate interesting patterns and possible risks, but the results must be examined further before any final conclusions can be drawn.

Understanding why a bacterium looks risky

PathogenFinder2 does more than produce a prediction. It highlights the specific proteins that most strongly influence its assessment. 

These may include known virulence factors, such as toxins or attachment structures (features that help bacteria attach to human cells), as well as completely uncharacterised proteins that could play a role in disease. 

This interpretability provides new avenues for research into diagnostics, vaccine targets and mechanisms of infection, including proteins not previously linked to disease.
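One common way for a model to point to influential proteins is to assign each protein a weight and rank the proteins by it. The sketch below assumes such a weighting scheme. The article does not describe the mechanism PathogenFinder2 uses, and the protein names and scores are invented for illustration.

```python
import math


def softmax(scores):
    """Convert raw scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rank_proteins(names, scores, top_k=3):
    """Return the top_k proteins with the largest weights."""
    weights = softmax(scores)
    ranked = sorted(zip(names, weights), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


# Hypothetical proteins and made-up importance scores.
protein_names = ["toxin_like", "adhesin_like", "ribosomal", "hypothetical_1"]
raw_scores = [2.1, 1.4, -0.5, 0.9]

top = rank_proteins(protein_names, raw_scores, top_k=2)
for name, weight in top:
    print(f"{name}: {weight:.2f}")
```

In this toy example the highest-weighted entries are the ones a researcher would inspect first, whether they are known virulence factors or uncharacterised proteins.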

A map of bacterial disease potential

Using protein language models to represent full genomes also enabled the researchers to build the first Bacterial Pathogenic Capacity Landscape, a map showing how thousands of bacteria relate to one another in terms of their disease-linked features. 

The landscape reveals clusters of bacteria that infect similar tissues or share metabolic strategies, offering a new way to explore microbial evolution and interactions.

“The Bacterial Pathogenic Capacity Landscape provides the first overview of all the disease‑causing bacteria that humans can be infected by. It reveals patterns and can, for example, show which bacteria tend to infect the same body sites or potentially rely on similar nutrients. This gives us new opportunities to investigate how bacteria evolve and interact,” says Alfred Ferrer Florensa.
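The intuition behind such a map is that genomes with similar representations end up close together. As a rough sketch, assuming each genome has already been reduced to a vector, the closest neighbour of a genome can be found with cosine similarity. The genome names and three-dimensional vectors are invented, and the choice of similarity measure is an assumption rather than the method used in the study.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbour(name, vectors):
    """Return the most similar other genome and its similarity score."""
    query = vectors[name]
    candidates = (
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != name
    )
    return max(candidates, key=lambda pair: pair[1])


# Made-up genome vectors for illustration only.
genome_vectors = {
    "genome_A": [0.9, 0.1, 0.0],
    "genome_B": [0.8, 0.2, 0.1],
    "genome_C": [0.0, 0.2, 0.9],
}

neighbour, similarity = nearest_neighbour("genome_A", genome_vectors)
print(neighbour, round(similarity, 3))
```

Repeating this comparison across thousands of genomes, and then projecting the result into two dimensions, is one plausible route to the kind of clusters the landscape shows.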

Trained on 21,000 bacterial genomes

The researchers assembled the largest dataset to date of bacterial genomes with known disease-causing potential or known non-pathogenic behaviour. The dataset consisted of more than 21,000 bacterial genomes from international databases, including bacteria isolated from human infections as well as bacteria from the healthy human microbiome, probiotic cultures, food production, and extreme environments, such as organisms capable of surviving in very hot or very cold conditions.

This gave the model a unique foundation for learning to distinguish between harmful and harmless bacteria, even when encountering previously undescribed species.

Read more

The study, entitled "Whole-genome prediction of bacterial pathogenic capacity on novel bacteria using protein language models with PathogenFinder2", has been published in Bioinformatics.

The project is funded by the EU Horizon 2020 programme, the US National Institute of Allergy and Infectious Diseases, and the Novo Nordisk Foundation. It is also supported by the HPC RIVR Consortium and EuroHPC JU through access to computing resources.

Read more about the Research Group for Genomic Epidemiology