Scientists have just organized an enormous collection of predicted protein shapes into families, using artificial intelligence and a powerful new comparison method.
Working from AlphaFold models spanning organisms across the tree of life, they uncovered roughly 700,000 previously undescribed shape groups and 13 that seem unique to humans.
Those structures live in the AlphaFold Protein Structure Database, a public catalog of AI-predicted protein structures.
By sorting this vast structural library into related groups, researchers gain a clearer view of how proteins function and evolve across the tree of life.
Mapping the protein shape universe
The work was led by Martin Steinegger, assistant professor at the School of Biological Sciences, Seoul National University.
His research focuses on building ultra-fast methods that compare enormous collections of protein sequences and structures using creative computational shortcuts.
Before this project, most protein shape comparisons happened in small batches, where scientists studied one protein family or a narrow group at a time.
Clustering every available model together creates a panoramic view of protein space, revealing shape patterns that only appear when the entire dataset is considered.
“We’ve entered a new era in structural biology where computational methods unlock unprecedented access to explore the protein universe,” said Steinegger.
He and his colleagues estimate that using earlier techniques, clustering the same protein shape data might have taken about a decade of nonstop computing.
Taming protein structure overload
The AlphaFold artificial intelligence method uses deep neural networks to turn amino acid sequences into structures with an accuracy that approaches many experimental measurements.
Combined with the database, this approach now provides structural models for more than 200 million proteins, covering most cataloged sequences across biology.
In this study, the team used Foldseek Cluster, a structure-based clustering algorithm that encodes protein shapes into a compact alphabet, to sort the AlphaFold models into families efficiently.
The resulting families capture nearly the full variety of protein structures in the database, distilling a vast sea of models into about 2.3 million representative groups.
Many of the smallest families lacked functional labels from existing resources, marking them as candidate homes for previously unrecognized folds and activities.
Because the method operates in roughly linear time with respect to database size, it scales comfortably to hundreds of millions of entries rather than stalling.
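To give a feel for why encoding shapes as letters helps with scale, here is a toy sketch in Python. It is not Foldseek Cluster's actual implementation. It assumes each structure has already been converted to a string over some structural alphabet (the letters below are made up), and it groups structures that share a chosen short substring in a single pass, so the work grows roughly in proportion to the amount of input. The function names and the choice of k are illustrative assumptions.

```python
from collections import defaultdict
import hashlib


def kmers(s, k):
    """Return all overlapping substrings of length k."""
    return [s[i:i + k] for i in range(len(s) - k + 1)]


def stable_hash(kmer):
    """Deterministic hash so results do not change between runs."""
    return hashlib.md5(kmer.encode()).hexdigest()


def cluster_by_shared_kmer(encoded, k=4):
    """Group encoded structures by their lowest-hashing k-mer.

    encoded: dict mapping a structure name to its (hypothetical)
    structural-alphabet string. Each structure is visited once.
    """
    buckets = defaultdict(list)
    for name, s in encoded.items():
        candidates = kmers(s, k)
        # Strings shorter than k fall back to using the whole string as key.
        key = min(candidates, key=stable_hash) if candidates else s
        buckets[key].append(name)

    clusters = {}
    for members in buckets.values():
        # Use the longest member as the representative (ties broken by name).
        rep = max(members, key=lambda n: (len(encoded[n]), n))
        clusters[rep] = sorted(members)
    return clusters


if __name__ == "__main__":
    toy = {"a": "ABCDABCDAB", "b": "ABCDABCD", "c": "XYZWXYZW"}
    print(cluster_by_shared_kmer(toy))
```

A real pipeline would go on to align each member against its representative and discard weak matches. The sketch shows only the cheap grouping step that avoids comparing every structure with every other one.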
Dark clusters and newcomers
Among the vast protein shape families, the team focused special attention on dark clusters, groups of proteins with no match to known folds.
From these, they selected tens of thousands with especially high-confidence predictions and searched them for pockets that might bind small molecules or catalyze reactions.
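The selection step described above amounts to a simple filter. The sketch below assumes each cluster is summarized by a record with an annotation flag and a mean pLDDT score, AlphaFold's per-residue confidence measure on a 0 to 100 scale. The field names and the cutoff of 90 are illustrative assumptions, not the study's actual criteria.

```python
def high_confidence_dark_clusters(clusters, min_plddt=90.0):
    """Return IDs of unannotated clusters whose mean pLDDT meets the cutoff.

    clusters: list of dicts with hypothetical keys
    'id', 'annotated' (bool) and 'mean_plddt' (float, 0-100).
    """
    return [
        c["id"]
        for c in clusters
        if not c["annotated"] and c["mean_plddt"] >= min_plddt
    ]


if __name__ == "__main__":
    toy = [
        {"id": "c1", "annotated": True, "mean_plddt": 95.0},   # known fold
        {"id": "c2", "annotated": False, "mean_plddt": 93.5},  # dark, confident
        {"id": "c3", "annotated": False, "mean_plddt": 61.0},  # dark, uncertain
    ]
    print(high_confidence_dark_clusters(toy))
```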
Some small clusters appeared limited to a single species, which fits expectations for de novo gene birth, the evolution of new protein-coding genes from noncoding DNA.
Looking specifically at humans, the team saw almost no clusters containing only human proteins, suggesting that truly human-specific folds are uncommon.
Instead, most human protein structures fall into clusters that stretch deep across the tree of life, underscoring how evolution reuses ancient molecular parts in new combinations.
For experimentalists, dark clusters highlight places where no one has yet measured a structure in the lab, making them attractive targets for follow-up work.
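The single-species and human-only checks mentioned in this section can be pictured as a short bookkeeping exercise. The sketch below assumes a table of cluster assignments in which every protein carries a species label. The row layout and the example identifiers are hypothetical.

```python
from collections import defaultdict


def single_species_clusters(assignments):
    """Map each cluster whose members all share one species to that species.

    assignments: iterable of (cluster_id, protein_id, species) tuples.
    """
    species_by_cluster = defaultdict(set)
    for cluster_id, _protein_id, species in assignments:
        species_by_cluster[cluster_id].add(species)
    return {
        cid: next(iter(species))
        for cid, species in species_by_cluster.items()
        if len(species) == 1
    }


if __name__ == "__main__":
    rows = [
        ("c1", "p1", "Homo sapiens"),
        ("c1", "p2", "Escherichia coli"),
        ("c2", "p3", "Homo sapiens"),
        ("c2", "p4", "Homo sapiens"),
        ("c3", "p5", "Danio rerio"),
    ]
    found = single_species_clusters(rows)
    human_only = [cid for cid, sp in found.items() if sp == "Homo sapiens"]
    print(found, human_only)
```

Clusters flagged this way are only candidates, since a fold can look species-specific simply because its relatives have not been sequenced or modeled yet.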
Roots of the human immune system
Some of the most eye-catching clusters bridge human immune proteins and bacterial proteins, pointing to shared structural solutions that predate complex animals.
One striking case centers on gasdermins, a protein family that forms pores in cell membranes during certain immune responses. That action underlies pyroptosis, a process where infected cells burst and release inflammatory signals.
Structural comparisons in the new clusters place human gasdermins alongside bacterial counterparts, showing that the core pore-forming domain is shared across very distant branches of life.
Another example involves the human bactericidal permeability-increasing protein (BPI), an innate immune protein that binds bacterial endotoxin.
Within the structural clusters, BPI sits beside bacterial proteins that share its overall architecture, hinting that microbes may repurpose related designs for their own membrane biology.
The clustering also links human DNA-sensing proteins such as AIM2 to proteins in gut bacteria, suggesting that parts of our immune machinery may descend from ancient microbial detectors.
Because sequence similarity between these distant relatives is often extremely low, their shared folds would have been almost impossible to spot with sequence search alone.
What this structural atlas unlocks
These long-distance structural links matter because protein shape tends to change more slowly than sequence, so protein structure often preserves evolutionary relationships after sequence similarity has faded away.
For researchers chasing a particular function, the AlphaFold Clusters database now acts as an atlas, letting them find structural neighbors and generate new hypotheses about uncharacterized proteins.
For drug discovery teams, dark clusters that contain predicted binding pockets look especially enticing, because they could hide enzymes or receptors that no existing medicines touch.
The study is published in Nature.