The landscape of computational biology has been radically reshaped in recent years by the emergence of protein language models. Borrowing from large language models (LLMs), these powerful tools have shown a talent for predicting protein structure and function with impressive precision. Their applications range from identifying potential drug targets to pioneering future therapeutic antibodies.
It has, however, been a bittersweet victory. Despite their transformative contributions, these models have traditionally suffered from a lack of transparency. Until now, scientists have struggled to understand how the models generate their predictions, or which specific features of a protein they take into account in the process. That era of uncertainty is drawing to a close, thanks to recent efforts by researchers at MIT.
A research team at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), led by Bonnie Berger, the Simons Professor of Mathematics and head of the Computation and Biology group, has unveiled a method for deciphering the inner mechanisms of these powerful models. According to a study published in the Proceedings of the National Academy of Sciences, this newly gained understanding could help scientists select and tailor models for specific tasks more effectively, accelerating the pace of drug discovery and vaccine development.
So how do these protein language models work? Think of them as LLMs such as ChatGPT, but instead of processing human language, they analyse sequences of amino acids. They have been used to predict the way proteins fold, interact, and function. In 2018, Berger and her former student Tristan Bepler introduced one of the first such models, blazing the trail for later ground-breaking models such as AlphaFold, ESM2, and OmegaFold.
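To make the analogy concrete, here is a minimal sketch of the input format such models train on: each amino acid becomes a token, and one position is masked for the model to predict, just as an LLM predicts a hidden word. The peptide string and the masking helper are hypothetical illustrations, not code from the study; the model itself is omitted.

```python
# Toy illustration: a protein language model treats each amino acid as a
# token, the way an LLM treats words. We tokenize a sequence and mask one
# position, producing the (input, target) pair used in masked-token training.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"        # the 20 standard residues
VOCAB = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
MASK_ID = len(VOCAB)                        # extra id reserved for [MASK]

def tokenize(seq):
    """Map an amino-acid string to a list of token ids."""
    return [VOCAB[aa] for aa in seq]

def mask_position(tokens, pos):
    """Return (masked input, target id): the model must recover the target."""
    masked = list(tokens)
    target = masked[pos]
    masked[pos] = MASK_ID
    return masked, target

tokens = tokenize("MKTAYIAK")               # a short, made-up peptide
masked, target = mask_position(tokens, 3)   # hide the 4th residue ('A')
```

A trained model would output a probability over the 20 residues for the masked slot; learning to fill such gaps is what forces it to encode structural and functional regularities.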
One of the standout applications came in 2021, when Berger’s team harnessed a protein model to pinpoint sections of viral proteins that were unlikely to mutate. This critical information helped spotlight potential vaccine targets for formidable viruses such as HIV, influenza, and SARS-CoV-2. Yet the models remained something of a black box: scientists could observe the outcome but had no insight into the process leading to it.
To shed light on the decision-making process within protein models, the MIT team used a method known as a sparse autoencoder, a technique now used to interpret LLMs that had previously not been applied to protein models. Typically, a protein model represents data with a limited number of nodes, say 480. With these nodes densely packed with information, it is virtually impossible to determine what any single one represents. Sparse autoencoders address this by expanding the representation to a much larger set of nodes, such as 20,000. This expansion, combined with a sparsity constraint, spreads the information out, making it far easier to isolate and interpret individual features.
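The expand-then-sparsify idea described above can be sketched in a few lines. This is a minimal, untrained illustration, not the study’s actual autoencoder: the weights are random stand-ins, and the top-k sparsity level `K = 32` is an assumption; only the dimensions (480 and 20,000) come from the article.

```python
import numpy as np

rng = np.random.default_rng(0)

D_MODEL = 480       # dense hidden width of the protein model (from the article)
D_SPARSE = 20_000   # expanded feature dictionary (from the article)
K = 32              # number of active features per input (assumed)

# Random weights stand in for trained encoder/decoder parameters.
W_enc = rng.normal(0.0, 0.02, (D_MODEL, D_SPARSE))
W_dec = rng.normal(0.0, 0.02, (D_SPARSE, D_MODEL))

def encode(x):
    """Expand a dense activation into a sparse feature vector (top-k)."""
    h = np.maximum(x @ W_enc, 0.0)   # ReLU pre-activations over 20,000 features
    h[np.argsort(h)[:-K]] = 0.0      # keep only the K largest; zero the rest
    return h

def decode(h):
    """Reconstruct the dense 480-dim activation from the sparse features."""
    return h @ W_dec

x = rng.normal(size=D_MODEL)         # a fake hidden activation from the model
h = encode(x)                        # sparse: at most K of 20,000 entries fire
x_hat = decode(h)                    # reconstruction the training loss targets
```

Because only a handful of the 20,000 features are non-zero for any input, each feature can be inspected in isolation and, as the study did, matched against biological properties of the proteins that activate it.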
This use of sparse representations has uncovered new insights. After generating sparse representations of various proteins, the researchers employed an AI assistant named Claude, developed by Anthropic, to help interpret the data. It turned out that specific nodes correspond to specific biological features, meaning it is now possible to understand not just what the model predicts, but why. Intriguingly, the researchers found that some biological features are more commonly encoded than others. “Even without being trained for interpretability, it emerges naturally when sparsity is encouraged,” notes Onkar Gujral, the lead author of the study and a graduate student at MIT.
This advance holds significant implications for biology and beyond. With clarity on which features a protein model encodes, scientists can better match models to specific research tasks or refine the input data to improve predictions, potentially leading to new biological insights derived from model behaviour alone. “Once the models become more powerful, there’s potential to discover more biology than what is currently known, simply from analysing these models,” Gujral observes.
This milestone study was supported by the National Institutes of Health. It marks a significant and crucial step toward transparency and the effective use of artificial intelligence in biology.
More details can be found in the original article published on MIT News: https://news.mit.edu/2025/researchers-glimpse-inner-workings-protein-language-models-0818