Structure-derived synthetic sequences guide a protein language model toward metalloproteins
Motivation: Protein language models (pLMs) capture evolutionary sequence constraints but are limited in modeling underrepresented functional classes due to training data imbalance. Metalloproteins constitute a fundamental but sparsely represented class in sequence databases. We therefore assess whether structure-conditioned synthetic sequences can be used to specialize pLMs toward metal-binding functionality.

Results: We fine-tuned the generalist model ProtGPT2 on synthetic sequences generated ...