µFormer: A Deep Learning Framework for Efficient Protein Fitness Prediction and Optimization


Protein engineering is essential for designing proteins with specific functions, but navigating the complex fitness landscape of protein mutations is a significant challenge, making optimal sequences hard to find. Zero-shot approaches, which predict mutational effects without relying on homologs or multiple sequence alignments (MSAs), reduce some dependencies but fall short in predicting diverse protein properties. Learning-based models trained on deep mutational scanning (DMS) or multiplexed assays of variant effect (MAVE) data, alone or combined with MSAs or language models, have been used to predict fitness landscapes. Still, these data-driven models often struggle when experimental data are sparse.

Microsoft Research AI for Science researchers introduced µFormer, a deep learning framework that integrates a pre-trained protein language model with specialized scoring modules to predict protein mutational effects. µFormer predicts high-order mutants, models epistatic interactions, and handles insertions. With reinforcement learning, µFormer efficiently explores vast mutant spaces to design enhanced protein variants. The model predicted mutants with a 2000-fold increase in bacterial growth rate, driven by improved enzymatic activity. µFormer's success extends to challenging scenarios, including multi-point mutations, and its predictions were validated through wet-lab experiments, highlighting its potential for optimizing protein design.

The µFormer model is a deep learning approach designed to predict the fitness of mutated protein sequences. It operates in two stages: first, by pre-training a masked protein language model (PLM) on a large dataset of unlabeled protein sequences, and second, by predicting fitness scores using three scoring modules integrated into the pre-trained model. These modules—residual-level, motif-level, and sequence-level—capture different aspects of the protein sequence and combine their outputs to generate the final fitness score. The model is trained using known fitness data, minimizing errors between predicted and actual scores.
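The two-stage design described above can be illustrated with a minimal sketch. The embedding shapes, the linear scoring heads (`w_res`, `w_motif`, `w_seq`), and the pooling choices below are hypothetical stand-ins for the paper's actual modules; only the overall pattern, combining residue-level, motif-level, and sequence-level scores into one fitness value, follows the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for pre-trained PLM outputs: L residues, d-dim embeddings.
# (Hypothetical sizes; the real model's dimensions differ.)
L, d = 8, 16
residue_emb = rng.normal(size=(L, d))  # per-residue embeddings from the PLM

# Hypothetical scoring heads, each a simple linear map for illustration.
w_res = rng.normal(size=d)    # residue-level head: scores each position
w_motif = rng.normal(size=d)  # motif-level head: scores local windows
w_seq = rng.normal(size=d)    # sequence-level head: scores the pooled embedding

def residue_score(emb):
    # Average the per-position scores.
    return float((emb @ w_res).mean())

def motif_score(emb, k=3):
    # Mean-pool sliding windows of k residues, then score each window.
    windows = np.stack([emb[i:i + k].mean(axis=0)
                        for i in range(len(emb) - k + 1)])
    return float((windows @ w_motif).mean())

def sequence_score(emb):
    # Score the whole-sequence (mean-pooled) embedding.
    return float(emb.mean(axis=0) @ w_seq)

def fitness(emb):
    # Combine the three module outputs into a single fitness score.
    return residue_score(emb) + motif_score(emb) + sequence_score(emb)

print(fitness(residue_emb))
```

In training, this combined score would be regressed against known fitness labels, minimizing the error between predicted and measured values.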

Additionally, µFormer is combined with a reinforcement learning (RL) strategy to explore the vast space of possible mutations efficiently. The protein engineering problem in this framework is modeled as a Markov Decision Process (MDP), with Proximal Policy Optimization (PPO) used to optimize mutation policies. Dirichlet noise is added during the mutation search to ensure effective exploration and avoid local optima. Baseline comparisons against models such as ESM-1v and ECNet were conducted on benchmarks including FLIP and ProteinGym.
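One step of the mutation search can be sketched as follows. The `add_dirichlet_noise` helper mirrors the idea of mixing Dirichlet noise into the policy's action distribution, while the flattened position-by-residue action space, the `epsilon` and `alpha` values, and the toy sequence are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical residues

def add_dirichlet_noise(policy_probs, epsilon=0.25, alpha=0.3, rng=rng):
    # Mix the policy's action distribution with Dirichlet noise.
    # The result is still a valid probability distribution, but mass is
    # spread toward unexplored actions, helping escape local optima.
    noise = rng.dirichlet([alpha] * len(policy_probs))
    return (1 - epsilon) * policy_probs + epsilon * noise

def mutate_step(sequence, policy_probs, rng=rng):
    # One MDP step: sample a (position, amino acid) action from the
    # noised policy and apply the substitution. Actions are flattened
    # position-by-residue choices.
    probs = add_dirichlet_noise(np.asarray(policy_probs))
    action = rng.choice(len(probs), p=probs)
    pos, aa = divmod(action, len(AMINO_ACIDS))
    return sequence[:pos] + AMINO_ACIDS[aa] + sequence[pos + 1:]

seq = "MKTAYIAK"  # made-up starting sequence
uniform = np.full(len(seq) * len(AMINO_ACIDS),
                  1.0 / (len(seq) * len(AMINO_ACIDS)))
mutant = mutate_step(seq, uniform)
```

In the full RL loop, the policy producing `policy_probs` would be a network updated by PPO, with µFormer's predicted fitness serving as the reward signal.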

µFormer, a hybrid model combining a self-supervised protein language model with supervised scoring modules, predicts protein fitness scores efficiently. Pre-trained on 30 million protein sequences from UniRef50 and fine-tuned with three scoring modules, µFormer outperformed ten methods in the ProteinGym benchmark, achieving a mean Spearman correlation of 0.703. It predicts high-order mutations and epistasis, with strong correlations for multi-site mutations. In protein optimization, µFormer, paired with reinforcement learning, designed TEM-1 variants that significantly improved growth, with one double mutant outperforming a known quadruple mutant.
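The mean Spearman correlation of 0.703 quoted above measures the rank correlation between predicted and measured fitness. A minimal sketch of that metric, assuming no tied values and using made-up scores for five hypothetical variants:

```python
import numpy as np

def spearman(pred, true):
    # Spearman correlation = Pearson correlation of the ranks.
    # The double-argsort trick assigns ranks 0..n-1 and assumes no ties.
    rp = np.argsort(np.argsort(pred))
    rt = np.argsort(np.argsort(true))
    return float(np.corrcoef(rp, rt)[0, 1])

# Hypothetical predicted vs. measured fitness for five variants.
pred = np.array([0.1, 0.4, 0.35, 0.8, 0.7])
true = np.array([0.2, 0.3, 0.5, 0.9, 0.6])
print(round(spearman(pred, true), 3))  # → 0.9
```

A correlation near 1 means the model ranks variants in nearly the same order as the experiment, which is what matters when selecting top candidates for wet-lab testing.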

In conclusion, previous studies have shown the potential of sequence-based protein language models in tasks like enzyme function prediction and antibody design. µFormer, a sequence-based model with three scoring modules, was developed to generalize across diverse protein properties. It achieved state-of-the-art performance in fitness prediction tasks, including complex mutations and epistasis. µFormer also demonstrated its ability to optimize enzyme activity, particularly in predicting TEM-1 variants active against cefotaxime. Despite its success, improvements can be made by incorporating structural data, developing phenotype-aware models, and creating models capable of handling longer protein sequences for better accuracy.


Check out the Paper. All credit for this research goes to the researchers of this project.


Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.
