DeepChain™ lets you create your own machine learning tools, which we call Apps, to answer specific biology questions. We’re democratising AI protein exploration and design by allowing students and independent researchers to tap into our datasets, pre-trained models and supercomputing power. In this series we give a shout-out to developers who are helping our community further its research by building easy-to-use Apps with DeepChain™.
‘My App helps predict protein function from sequence alone’
– St John Grimbly
Can you tell us a bit about yourself?
The applications of powerful, pre-trained language models have really excited and stimulated my imagination in terms of what machine learning can do to spur on scientific development. The low barrier to entry that platforms like DeepChain afford researchers is, simply put, amazing. My passion has always been for cross-field research and generalist applications of cutting-edge technology to hard problems. When DeepChain hosted a hackathon, I jumped at the opportunity to build an app that tackled the protein function prediction problem head-on. Another plus is the thrill of getting to contribute to the early success of the state-of-the-art DeepChain platform.
What’s the challenge that inspired your app?
While exploring papers I came across a fundamental problem that often requires immense compute for mediocre results: predicting protein function. This is a hot topic in artificial intelligence, and in research in general, as many of the pressing problems we face in science are protein-related. These are usually prediction problems that require immense amounts of data and computation, exactly the type of problem that AI has been knocking down one by one. We have seen this with AlphaFold, DeepMind’s protein folding AI.
Proteins are produced as sequences of amino acids, which are then folded into the 3D structure needed for their function.
The first thing we should understand is why protein folding is so important. Protein folding has traditionally been seen as a grand challenge of science. Recently, however, this problem has come close to being practically solved with computational techniques. The protein structural information we obtain by ‘folding’ the protein sequence tells us about how a protein will interact with other proteins and biological processes. This is crucial for designing drugs, understanding viruses, or even understanding the purpose of DNA and RNA sequences.
It was thus very surprising, and exciting, for me when I came across the paper ‘Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function’. It claims that “Deep, pretrained embeddings outperform hand-crafted sequence and structure representations.” In other words, using pre-trained language models for protein function prediction is as good as using these models with structural information. We don’t need structural (folding) information to make good predictions about protein function. This alternative leverages the immense power of pre-computation for a task which could be seen as a grand challenge of biology, a task naturally suited to DeepChain’s platform.
For an example of how powerful these models can be, Villegas-Morcillo et al. show that an MLP using embeddings outperformed DeepFold. Now with DeepChain and the App Hub, one can quickly predict protein function using only a protein sequence as input.
What does your DeepChain™ App do?
The OntologyPredict app applies the immense power of pre-trained, unsupervised language models, like ProtBert, to ‘skip’ the protein folding ‘step’ and go directly from a protein sequence to its predicted function. More specifically, it tries to predict Gene Ontology classes for a given protein sequence, which tells us what the protein is associated with. Since a single protein can be associated with several different functions, this is a very hard classification problem.
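The idea of scoring Gene Ontology classes from a pre-computed embedding can be sketched roughly as follows. This is a minimal illustration and not the app's actual code: the hidden size, the number of classes, the random weights and the random stand-in embeddings are all assumptions, and `predict_go_scores` is a hypothetical helper name. In practice the embeddings would come from a pre-trained model such as ProtBert and the weights from training.

```python
import numpy as np

EMBED_DIM = 1024      # assumed per-protein embedding size (ProtBert's hidden size)
HIDDEN_DIM = 512      # hypothetical hidden-layer width
NUM_GO_CLASSES = 256  # hypothetical number of Gene Ontology classes


def sigmoid(x):
    """Map raw logits to independent scores in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def init_mlp(rng):
    """Create randomly initialised (untrained) weights for a one-hidden-layer MLP."""
    return {
        "W1": rng.normal(0.0, 0.02, (EMBED_DIM, HIDDEN_DIM)),
        "b1": np.zeros(HIDDEN_DIM),
        "W2": rng.normal(0.0, 0.02, (HIDDEN_DIM, NUM_GO_CLASSES)),
        "b2": np.zeros(NUM_GO_CLASSES),
    }


def predict_go_scores(embeddings, params):
    """Return one score per GO class for each protein embedding."""
    hidden = np.maximum(0.0, embeddings @ params["W1"] + params["b1"])  # ReLU
    logits = hidden @ params["W2"] + params["b2"]
    # A sigmoid per class (rather than a softmax) lets one protein
    # score highly for several classes at once.
    return sigmoid(logits)


rng = np.random.default_rng(0)
params = init_mlp(rng)

# Stand-in for real embeddings of three protein sequences.
fake_embeddings = rng.normal(size=(3, EMBED_DIM))
scores = predict_go_scores(fake_embeddings, params)
print(scores.shape)
```

Using a separate score per class reflects the fact that a protein may belong to more than one Gene Ontology class; the real app may make different design choices.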
This work is based on the paper ‘Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function’. A full blog post on how I built this app is available here.
| Property | Description |
|---|---|
| Model | NLP (ProtBERT) embeddings with a NN (MLP) layer above them for performing multi-class classification |
| Output | A multidimensional array of scores corresponding to the likelihood of the protein being in each of the possible (Molecular) Gene Ontology classes |
| Training Dataset | Subset of the Protein Data Bank (PDB) dataset and classes from the Gene Ontology project. Sourced from Villegas-Morcillo et al. |
| Accuracy | Undetermined. The original paper reports a maximum protein-centric F-measure (Fmax) of ~0.53 using different embeddings (ELMo vs ProtBERT) for a similar (MLP) model. |
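For readers unfamiliar with the protein-centric Fmax mentioned above, the sketch below shows one common way to compute it: sweep a decision threshold, average precision over proteins with at least one prediction and recall over proteins with at least one true label, and keep the best F-measure. The function name, the threshold grid and the toy data are assumptions made for illustration; the original paper's exact evaluation protocol may differ.

```python
import numpy as np


def protein_centric_fmax(y_true, y_score, thresholds=None):
    """Best F-measure over thresholds, with precision/recall averaged per protein."""
    if thresholds is None:
        thresholds = np.linspace(0.01, 0.99, 99)  # assumed threshold grid
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)

    n_true = y_true.sum(axis=1)
    has_true = n_true > 0
    if not has_true.any():
        return 0.0  # no annotated proteins, nothing to evaluate

    best = 0.0
    for t in thresholds:
        y_pred = y_score >= t
        true_pos = (y_pred & y_true).sum(axis=1)
        n_pred = y_pred.sum(axis=1)
        has_pred = n_pred > 0
        if not has_pred.any():
            continue  # no protein receives a prediction at this threshold
        # Precision: averaged over proteins with at least one predicted class.
        precision = (true_pos[has_pred] / n_pred[has_pred]).mean()
        # Recall: averaged over proteins with at least one true class.
        recall = (true_pos[has_true] / n_true[has_true]).mean()
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Toy data: 2 proteins, 3 classes (rows are proteins, columns are classes).
toy_true = [[1, 0, 1], [0, 1, 0]]
toy_scores = [[0.9, 0.1, 0.8], [0.2, 0.7, 0.1]]
print(protein_centric_fmax(toy_true, toy_scores))
```

Because the threshold is chosen after the fact, Fmax is an optimistic summary of a model's scores rather than the accuracy one would get at a fixed cut-off.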
How can I access DeepChain™ Apps?
To use this App, you can download it from GitHub or access it through our user-friendly DeepChain™ platform. No coding is needed to leverage the power of this and other Apps on DeepChain™. You simply select the App from the menu and input your amino acid sequence of interest, and DeepChain™ will provide you with the predicted Gene Ontology classification scores.
Join our community!
Visit the DeepChain™ GitHub to learn more about the technical aspects of building machine learning models and engage with our community of experts. Producing an App helps you showcase your skills and network with like-minded experts. You will also be supporting the worldwide scientific community by building open-access Apps that help accelerate protein research.
To learn more about DeepChain™, feel free to send us an email at firstname.lastname@example.org!
If you are a computational biologist passionate about AI, join our team! You can find our job vacancies here.