
AlphaEnzyme: predicting enzyme functions directly from AlphaFold encoding


EMSL Project ID
60535

Abstract

Predicting the function of an enzyme (represented as an Enzyme Commission number, or EC number) is one of the central tasks in annotating genomes and metagenomes. It is crucial for understanding the metabolic potential of microbes and microbiomes. The conventional approach to enzyme function prediction relies mostly on multiple sequence alignments and transfers the annotated functions of homologous sequences to the query as predicted functions. Other approaches use features extracted from sequences, such as k-mers and contact maps. These methods typically work well when sequences of unknown function are similar to sequences of known function, but their accuracy falls off quickly when sequence identity drops below 30-40%. One reason these methods fail to identify function for proteins in this "twilight zone" is that they do not directly incorporate 3D structure. Until recently, however, large training sets of protein structures were not available. Now, in the era of AlphaFold, DeepMind has published hundreds of thousands of protein structures with angstrom-level accuracy, which represent a new resource from which to predict function. New methods are needed that can directly exploit these protein structures in a functionally meaningful way. Here, we propose using the intermediate representation (IR) of proteins in AlphaFold as an encoding of protein structure for predicting enzyme function. Our hypothesis is that the intermediate embedding space of AlphaFold preserves the core invariants of structural information, such that proteins that are close to each other in structure space will also be close to each other in function space. If this hypothesis is true, we will be able to predict EC numbers more accurately in the twilight zone, where sequence-based methods typically fail. We propose to build a machine learning pipeline capable of rapidly generating IRs and predicting enzyme function.
We propose to build and deploy this platform on EMSL's Tahoma HPC systems.
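
The hypothesis above (proteins that are close in embedding space are also close in function space) can be illustrated with a small nearest-neighbor sketch. This is not the project's pipeline: the abstract does not say how the AlphaFold IRs are extracted, pooled, or classified. The mean pooling, the cosine similarity, the k-nearest-neighbor vote, and every function name and toy vector below are assumptions chosen only to make the idea concrete.

```python
import math
from collections import Counter


def mean_pool(per_residue):
    """Collapse an L x D per-residue representation into one D-dim vector.

    Assumption: mean pooling is just one simple way to get a fixed-size
    encoding; the abstract does not specify a pooling scheme.
    """
    n = len(per_residue)
    dim = len(per_residue[0])
    return [sum(row[j] for row in per_residue) / n for j in range(dim)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def predict_ec(query, references, k=3):
    """Predict an EC number by majority vote over the k most similar references.

    references: list of (embedding, ec_number) pairs with known function.
    Ties are broken in favor of the label of the closest neighbor.
    """
    ranked = sorted(references, key=lambda r: cosine(query, r[0]), reverse=True)
    top = ranked[:k]
    votes = Counter(ec for _, ec in top)
    best = max(votes.values())
    for _, ec in top:  # top is ordered from most to least similar
        if votes[ec] == best:
            return ec


def ec_match_level(predicted, true):
    """Count how many leading EC fields agree (0 to 4)."""
    level = 0
    for p, t in zip(predicted.split("."), true.split(".")):
        if p != t:
            break
        level += 1
    return level


# Toy 3-dimensional "embeddings"; real IRs would be far larger.
references = [
    ([1.0, 0.0, 0.1], "1.1.1.1"),
    ([0.9, 0.1, 0.0], "1.1.1.1"),
    ([0.0, 1.0, 0.2], "3.4.21.4"),
    ([0.1, 0.9, 0.1], "3.4.21.5"),
]
query = mean_pool([[0.8, 0.0, 0.0], [1.0, 0.2, 0.2]])
prediction = predict_ec(query, references, k=3)
print(prediction, ec_match_level(prediction, "1.1.1.1"))
```

Because EC numbers are hierarchical (class, subclass, sub-subclass, serial number), a helper such as `ec_match_level` allows partial credit when evaluating predictions: two serine proteases labeled 3.4.21.4 and 3.4.21.5 agree at three of four levels.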

Project Details

Project type
Exploratory Research
Start Date
2022-12-01
End Date
2023-12-31
Status
Closed

Team

Principal Investigator

Qiang Guan
Institution
Kent State University

Team Members

Safa Shubbar
Institution
Kent State University

Yuxin Yang
Institution
Kent State University

Song Feng
Institution
Pacific Northwest National Laboratory

Jeremy Zucker
Institution
Pacific Northwest National Laboratory