ProtENN, a deep learning approach for protein function prediction, increases coverage of the protein family database, Pfam, by 9.5%, comparable to the coverage achieved over a decade by alignment-based methods.
Protein domains with similar amino acid sequences tend to have similar functions; this principle is the backbone of existing computational tools that use sequence homology to predict protein function. Although these algorithms have provided functional annotations for a large number of proteins, they struggle to predict the function of proteins with low sequence homology to known proteins. Recent work by Google Research presents a deep learning solution called ProtENN, which has produced functional annotations for 6.8 million previously unannotated protein domains1.
State-of-the-art methods for protein function prediction, such as the protein Basic Local Alignment Search Tool (BLASTp), primarily rely on pairwise alignment-based techniques2. In these methods, a protein sequence is aligned to sequences with known function. If the sequences share at least 30% identity, they are inferred to share function. To further refine these techniques, probability-based methods have been introduced that model the degree of conservation across a multiple sequence alignment. For instance, tools built on profile hidden Markov models (HMMs), such as HMMER, compare the protein sequence to a profile HMM that represents a known protein domain or protein family3. If the sequence matches a profile HMM, its function can be inferred.
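The threshold rule described above can be illustrated with a minimal Python sketch. It assumes the pairwise alignments have already been computed by some other tool and are supplied as gapped strings; the function names, the `-` gap character, and the example sequences and labels are hypothetical. It is not how BLASTp itself scores hits, which relies on local alignment scores and E-values rather than a bare identity cutoff.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of ungapped alignment columns in which both residues match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where neither sequence has a gap.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != gap and b != gap]
    if not columns:
        return 0.0
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)


def infer_function(hits, threshold=30.0):
    """Return the label of the most similar hit at or above the threshold.

    `hits` is a list of (aligned_query, aligned_subject, function_label)
    tuples. Returns None when no hit reaches the threshold.
    """
    best_label, best_score = None, threshold
    for aligned_query, aligned_subject, label in hits:
        score = percent_identity(aligned_query, aligned_subject)
        if score >= best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical alignments and labels, for illustration only.
hits = [
    ("MKV-LA", "MKTALA", "kinase"),
    ("MKVL", "AAAA", "protease"),
]
predicted = infer_function(hits)
```

A query with no hit above the cutoff returns `None`, which corresponds to the unannotated proteins that alignment-based methods leave behind.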
Although these methods have advanced protein function prediction, the well-known database for protein annotation, Pfam (now hosted by InterPro4), has seen a mere 5% coverage expansion over the past 5 years5. Dependence on sequence alignment limits the ability of such approaches to annotate proteins that diverge in sequence from known protein families, as well as families that contain relatively few sequences. Additionally, proteins are not simply linear chains: their secondary and tertiary structure can influence function, which alignment-based methods fail to consider1.
To overcome the limitations of alignment-based approaches, Bileschi and colleagues1 propose a deep learning model that predicts protein function without reliance on sequence alignment (fig. 1). They use a one-dimensional convolutional neural network (CNN) that classifies proteins into one of 17,929 possible functional classes found in the Pfam database. Their model, ProtENN, considers both local and global protein sequence information to recognize sequence characteristics that are indicative of specific functions. Within ProtENN, a filter moves along the input amino acid sequence to identify features and patterns in the sequence. These patterns are then processed through multiple layers of the model, where higher layers identify increasingly complex patterns. The function of novel protein domains can then be predicted, offering a quick, automated approach to annotation with minimal human intervention.
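The idea of a filter moving along the sequence can be sketched in plain Python. This is a toy illustration under stated assumptions, not the authors' architecture: the one-hot encoding, the single hand-built filter, the ReLU step, and the example sequence and motif are all hypothetical choices. In a trained network the filter weights are learned from data and many filters are stacked over several layers.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def one_hot(sequence):
    """Encode each residue as a 20-element indicator vector."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    encoded = []
    for residue in sequence:
        row = [0.0] * len(AMINO_ACIDS)
        row[index[residue]] = 1.0
        encoded.append(row)
    return encoded


def conv1d(encoded, kernel):
    """Slide one filter along the sequence (stride 1, no padding).

    `kernel` has shape [width][20]; one activation is produced per window.
    """
    width = len(kernel)
    activations = []
    for start in range(len(encoded) - width + 1):
        window = encoded[start:start + width]
        total = sum(
            w * x
            for k_row, x_row in zip(kernel, window)
            for w, x in zip(k_row, x_row)
        )
        activations.append(max(0.0, total))  # ReLU non-linearity
    return activations


# Hypothetical hand-built filter that responds most strongly to "RGD".
kernel = one_hot("RGD")
sequence = "MKTAYIAKQRGDSLA"  # made-up example sequence

activations = conv1d(one_hot(sequence), kernel)
best_position = max(range(len(activations)), key=activations.__getitem__)
sequence_feature = max(activations)  # global max pooling over positions
```

Taking the maximum over positions turns per-window activations into a single sequence-level feature, which is one simple way a model can combine local pattern detection with a global summary.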
The greatest challenge in developing an accurate model for protein function prediction is not in building the model itself, but in designing training and test datasets that represent diverse sequences and prevent model bias1. To account for this, sequences in Pfam obtained from UniProtKB reference proteomes were split into training and test sets (1) randomly or (2) by grouping sequence families together and placing the entire group in either the training set or the test set. The latter ensures that sequence homology between the two sets is low, so that test performance reflects how well the model classifies proteins with low similarity to anything it was trained on.
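The difference between the two splitting strategies can be sketched as follows. This is a minimal illustration, assuming each sequence already carries a group identifier; how groups are defined (for example, by a clustering step) is outside the sketch, and the record format, function names, and test fraction are hypothetical.

```python
import random


def random_split(records, test_fraction=0.2, seed=0):
    """Assign individual sequences to train/test, ignoring group membership."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def grouped_split(records, test_fraction=0.2, seed=0):
    """Assign whole groups to train or test so that no group spans both."""
    rng = random.Random(seed)
    groups = sorted({group for _, group in records})
    rng.shuffle(groups)
    n_test = max(1, int(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [r for r in records if r[1] not in test_groups]
    test = [r for r in records if r[1] in test_groups]
    return train, test


# Hypothetical records: (sequence_id, group_id).
records = [(f"seq{i}", f"group{i % 5}") for i in range(20)]

train, test = grouped_split(records)
train_groups = {group for _, group in train}
held_out_groups = {group for _, group in test}
```

With the random split, members of the same group usually end up on both sides, so a model can score well by recognizing near-duplicates. The grouped split removes that shortcut, which is why it is the harder and more informative benchmark.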
To benchmark model performance, the team at Google Research compared ProtENN against the well-established alignment-based methods, BLASTp and HMMER. Remarkably, ProtENN outperformed the two methods, achieving the lowest error rate and highest accuracy in both the random and grouped split datasets. This showcases ProtENN’s ability to make accurate predictions for diverse sequences.
Strikingly, the authors found that combining ProtENN with alignment-based methods yields higher prediction accuracy than either method achieves alone. Not only did combining ProtENN with HMMER further reduce error rates by 38.6%, but the ensemble also increased protein coverage in Pfam by 9.5%, or 6.8 million sequence regions. This added annotations for 1.8 million full-length proteins with no previous annotations, including 360 human proteins. These annotations have been publicly released as Pfam-N, available on the European Bioinformatics Institute website. Since this work, Pfam-N has grown to 5.2 million protein sequences, expanding UniProtKB reference proteome coverage by 8% (fig. 2)4.
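An error-rate reduction quoted as a percentage is commonly a relative figure, measured against the baseline error rather than in absolute percentage points. Assuming that reading applies here, the arithmetic looks like the short sketch below; the function name and the error counts are hypothetical and are not taken from the paper.

```python
def relative_error_reduction(baseline_errors, new_errors):
    """Percent by which new_errors is lower than baseline_errors."""
    if baseline_errors <= 0:
        raise ValueError("baseline_errors must be positive")
    return 100.0 * (baseline_errors - new_errors) / baseline_errors


# Made-up counts of misclassified sequences, chosen only to show the formula.
reduction = relative_error_reduction(200, 100)
```

Under this definition, halving the number of errors is a 50% reduction regardless of how small the baseline error rate already was.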
As an emerging space in proteomics, deep learning still faces many challenges. The information ProtENN uses to make predictions is largely unknown. Uncovering this information is crucial to understanding the relationship between protein sequence and function, but it remains a difficult task6. Additionally, deep learning models rely heavily on a high volume of sequence data to learn meaningful patterns. To overcome this, a machine learning technique called transfer learning has recently been tested in conjunction with ProtENN and shown to further increase prediction accuracy7. This suggests that despite its limitations, deep learning will likely become a core component of future tools for protein function prediction.
Alongside these advancements, integrative models will likely be developed that combine deep learning with approaches that consider protein information beyond sequence, such as structure and phylogenetic relationships. This will be useful for developments in biomedicine and therapeutics, such as de novo protein design, which requires precise protein sequence evaluation and functional prediction8. To make ProtENN easier to use and build on for various applications, the authors have made the information used to build it publicly available.
As public protein databases continue to grow, the need for accurate protein function predictions becomes increasingly important. To meet this challenge, ProtENN has paved the way for the use of deep learning in protein classification. The approach is still in its infancy, and ProtENN’s full capabilities are only beginning to be explored.
1. Bileschi, M. L. et al. Using deep learning to annotate the protein universe. Nat Biotechnol 40, 932–937 (2022).
2. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J Mol Biol 215, 403–410 (1990).
3. Finn, R. D., Clements, J. & Eddy, S. R. HMMER web server: interactive sequence similarity searching. Nucleic Acids Res 39, W29 (2011).
4. Paysan-Lafosse, T. et al. InterPro in 2022. Nucleic Acids Res 51, (2023).
5. Mistry, J. et al. Pfam: The protein families database in 2021. Nucleic Acids Res 49, D412–D419 (2021).
6. de Crécy-Lagard, V. et al. A roadmap for the functional annotation of protein families: a community perspective. Database (Oxford) 2022, (2022).
7. Bugnon, L. A. et al. Transfer learning: The key to functionally annotate the protein universe. Patterns 4, 100691 (2023).
8. Unsal, S. et al. Learning functional properties of proteins with language models. Nat Mach Intell 4, 227–245 (2022).