New AI tool for bioengineers can be both predictive and explainable

Researchers at the National Institute of Standards and Technology (NIST) have developed a new statistical tool that they have used to predict protein function. Not only could it help with the difficult task of altering proteins in practically useful ways, but it also works by methods that are fully interpretable, an advantage over the conventional artificial intelligence (AI) that has aided protein engineering in the past.

The new tool, called LANTERN, could prove useful in work ranging from biofuel production to crop improvement and the development of new disease treatments. Proteins, as building blocks of biology, are a key element in all of these tasks. But while it is comparatively easy to make changes to the strand of DNA that serves as the blueprint for a given protein, it remains difficult to determine which specific base pairs (the rungs of the DNA ladder) are the keys to producing a desired effect. Finding these keys has been the job of AI built from deep neural networks (DNNs), which, while effective, are notoriously opaque to human understanding.

Described in a new paper published in the Proceedings of the National Academy of Sciences, LANTERN shows the ability to predict the genetic edits needed to create useful differences in three different proteins. One is the spike-shaped protein from the surface of the SARS-CoV-2 virus, which causes COVID-19; understanding how mutations in the DNA can alter this spike protein could help epidemiologists predict the future of the pandemic. The other two are well-known laboratory workhorses: the LacI protein from the E. coli bacterium and the green fluorescent protein (GFP) used as a marker in biology experiments. Choosing these three subjects allowed the NIST team to show not only that their tool works, but also that its results are interpretable, an important feature for industry, which needs predictive methods that help with understanding the underlying system.

“We have an approach that is fully interpretable and that also has no loss in predictive power,” said Peter Toner, a statistician and computational biologist at NIST and LANTERN’s lead developer. “There is a widespread assumption that if you want one of those things, you can’t have the other. We’ve shown that sometimes you can have both.”

The problem the NIST team is tackling can be imagined as interacting with a complex machine that sports a vast control panel filled with thousands of unlabeled switches: the device is a gene, a strand of DNA that encodes a protein; the switches are the base pairs on the strand. Every switch affects the device’s output somehow. If your job is to make the machine work differently in a specific way, which switches should you flip?

Because the answer may require changes to several base pairs, scientists have to flip some combination of them, measure the result, then choose a new combination and measure again. The number of permutations is daunting.

“The number of possible combinations could be greater than the number of atoms in the universe,” Toner said. “You can never measure all the possibilities. This is a ridiculously large number.”
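The scale of that claim can be checked with a short back-of-the-envelope sketch in Python. It rests on two assumptions that are not from the article: each position on the strand can hold one of four bases, and the observable universe contains roughly 10^80 atoms (a commonly cited rough estimate). The function names are hypothetical.

```python
# Rough estimate only; not a figure taken from the article.
ATOMS_IN_UNIVERSE = 10 ** 80


def count_variants(n_sites: int, n_states: int = 4) -> int:
    """Number of distinct sequences when each of n_sites holds one of n_states."""
    return n_states ** n_sites


def min_sites_to_exceed(limit: int, n_states: int = 4) -> int:
    """Smallest number of sites whose variant count is strictly greater than limit."""
    sites = 0
    total = 1  # a strand with zero sites has exactly one (empty) sequence
    while total <= limit:
        total *= n_states
        sites += 1
    return sites


if __name__ == "__main__":
    print(min_sites_to_exceed(ATOMS_IN_UNIVERSE))
```

Under these assumptions, a stretch of only 133 positions already has more possible sequences than the atom estimate, which is why exhaustive measurement is out of the question for a whole gene.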

Because of the sheer quantity of data involved, DNNs have been tasked with sorting through a sample of the data and predicting which base pairs need to be flipped. At this, they have proved successful, as long as you don’t ask for an explanation of how they get their answers. They are often described as “black boxes” because their inner workings are inscrutable.

“It’s really hard to understand how DNNs make their predictions,” said NIST physicist David Ross, one of the paper’s co-authors. “And if you want to use those predictions to design something new, that’s a big problem.”

LANTERN, on the other hand, is explicitly designed to be understandable. Part of its explainability stems from its use of interpretable parameters to represent the data it analyzes. Rather than allowing the number of these parameters to grow extraordinarily large and often inscrutable, as is the case with DNNs, each parameter in LANTERN’s calculations has a purpose that is meant to be intuitive, helping users understand what these parameters mean and how they influence LANTERN’s predictions.

The LANTERN model represents protein mutations using vectors, widely used mathematical tools that are often portrayed visually as arrows. Each arrow has two properties: its direction indicates the effect of the mutation, while its length represents the strength of that effect. When two proteins have vectors that point in the same direction, LANTERN indicates that the proteins have similar function.

The directions of these vectors often map onto biological mechanisms. For example, LANTERN learned a direction associated with protein folding in all three of the datasets the team studied. (Folding plays a critical role in how a protein functions, so identifying this factor across the datasets was an indication that the model works as intended.) When making predictions, LANTERN simply adds these vectors together, a method that users can trace when examining its predictions.
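The arrow picture can be illustrated with a minimal Python sketch. This is not LANTERN’s code: the two-dimensional vectors, the mutation names and the helper functions are all invented for illustration, and the step that turns a summed vector into a predicted measurement is deliberately left out because the article does not describe it.

```python
import math

# Hypothetical mutation vectors (names and values are made up for this sketch).
MUTATION_VECTORS = {
    "mut_a": (1.0, 0.0),
    "mut_b": (2.0, 0.0),  # same direction as mut_a, twice the length
    "mut_c": (0.0, 0.5),  # a different direction, i.e. a different kind of effect
}


def combine(mutations):
    """Add the vectors of the listed mutations component by component."""
    x = sum(MUTATION_VECTORS[m][0] for m in mutations)
    y = sum(MUTATION_VECTORS[m][1] for m in mutations)
    return (x, y)


def strength(v):
    """Length of the arrow: how strong the effect is."""
    return math.hypot(v[0], v[1])


def cosine(u, v):
    """1.0 when two arrows point the same way, 0.0 when they are perpendicular."""
    return (u[0] * v[0] + u[1] * v[1]) / (strength(u) * strength(v))


if __name__ == "__main__":
    a, b = MUTATION_VECTORS["mut_a"], MUTATION_VECTORS["mut_b"]
    print(cosine(a, b), strength(b) / strength(a))
    print(combine(["mut_a", "mut_c"]))
```

Because the combined variant is just a sum, a user can read off how much each mutation contributed to the final position, which is the traceability the paragraph above describes.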

Other labs had already used DNNs to predict which switch-flips would make useful changes to the three subject proteins, so the NIST team decided to pit LANTERN against the DNNs’ results. The new approach was not merely good enough; according to the team, it achieves a new state of the art in predictive accuracy for this type of problem.

“LANTERN equaled or outperformed nearly all alternative approaches with respect to prediction accuracy,” Toner said. “It outperforms all other approaches in predicting changes to LacI, and it has comparable predictive accuracy for GFP for all except one. For SARS-CoV-2, it has higher predictive accuracy than all alternatives other than one type of DNN, which matched LANTERN’s accuracy but didn’t beat it.”

LANTERN determines which sets of switches have the largest effect on a given attribute of the protein (its folding stability, for example) and summarizes how the user can tweak that attribute to achieve a desired effect. In a way, LANTERN transmutes the many switches on our machine’s panel into a few simple dials.

“It reduces thousands of switches to maybe five little dials you can turn,” Ross said. “It tells you the first dial will have a big effect, the second will have a different but smaller effect, the third even smaller, and so on. So as an engineer, it tells me I can focus on the first and second dials to get the outcome I need. LANTERN lays all this out for me, and it’s incredibly helpful.”
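The idea of ordering the dials by how much they matter can be sketched as follows. This is only an illustration under stated assumptions: each mutation is given as a short vector of dial settings, and a dial’s importance is taken to be the summed absolute movement along it. That ranking rule, the function name and the numbers are hypothetical and are not described in the article.

```python
def rank_dials(vectors):
    """Return (order, totals): dial indices from largest to smallest total effect."""
    n_dials = len(vectors[0])
    # Total absolute movement along each dial across all mutations.
    totals = [sum(abs(v[d]) for v in vectors) for d in range(n_dials)]
    order = sorted(range(n_dials), key=lambda d: totals[d], reverse=True)
    return order, totals


if __name__ == "__main__":
    # Made-up mutation vectors over three dials.
    example = [
        (0.125, 2.0, 0.5),
        (0.0, -1.5, 0.25),
        (0.25, 1.0, -0.5),
    ]
    order, totals = rank_dials(example)
    print(order, totals)
```

In this toy data the second dial dominates, so an engineer would look there first and could largely ignore the first one.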

Rajmonda Caceres, a scientist at MIT’s Lincoln Laboratory who is familiar with the method behind LANTERN, praised the tool’s interpretability.

“There are not a lot of AI methods applied to biology applications where they explicitly design for interpretability,” said Caceres, who is not affiliated with the NIST study. “When biologists see the results, they can see which mutation is contributing to the change in the protein. This level of interpretation allows for more interdisciplinary research, because biologists can understand how the algorithm is learning and they can generate further insights about the biological system under study.”

Toner said that while he is pleased with the results, LANTERN is not a panacea for AI’s explainability problem. Exploring alternatives to DNNs more widely would benefit the entire effort to create explainable, trustworthy AI, he said.

“In the context of predicting genetic effects on protein function, LANTERN is the first example of something that rivals DNNs in predictive power while still being fully interpretable,” Toner said. “It provides a specific solution to a specific problem. We hope that it might apply to others, and that this work inspires the development of new interpretable approaches. We don’t want predictive AI to remain a black box.”
