Biomolecules 2024, 14, 339
Review
Advances in AI for Protein Structure Prediction: Implications for
Cancer Drug Discovery and Development
Xinru Qiu 1 , Han Li 2 , Greg Ver Steeg 2 and Adam Godzik 1, *
Further complicating this process is the inadequacy of preclinical models that accurately represent the disease and the constraints of overly simplistic disease models, which together amplify the difficulties in grasping the complexity of human systems. Lack of high quality structural models of drug targets, a main problem addressed by AlphaFold, is only one of the challenges in drug discovery. However, as we show in this review, AI is also making rapid progress in addressing other bottlenecks in drug development.
Figure 1. Stages of Drug Discovery Process: The drug discovery process comprises several critical stages. It begins with the “Discovery and Development” phase, where the focus is on target identification and validation. This stage involves screening potential compounds and further refining promising candidates through hit-to-lead development and lead optimization. Following this, the process moves to “Preclinical Development”, which includes a range of lab tests such as in vitro studies, animal model testing, and ADMET (Absorption, Distribution, Metabolism, Excretion, Toxicity) studies. Based on these results, a decision is made on whether to proceed to the next phase. “Clinical Trials” ensue, which are categorized into four phases: Phase I assesses safety and dosage; Phase II examines efficacy and side effects; Phase III involves larger studies to confirm efficacy and monitor adverse reactions; and the final stage is “Review and Approval”, which consists of a comprehensive regulatory review, culminating in market authorization and followed by post-marketing monitoring to ensure long-term safety and effectiveness, constituting a newly defined Phase IV.
Traditionally, the three-dimensional structures of proteins are deciphered using labor-intensive and costly experimental methods like X-ray crystallography, nuclear magnetic resonance (NMR), and cryogenic electron microscopy (cryo-EM). While invaluable, these techniques are limited by speed, cost, and applicability to only certain protein structures. In contrast, recent advancements in protein structure prediction, culminating in AF2, have dramatically expanded our capabilities, complementing and occasionally surpassing experimental approaches.
The AF2 breakthrough has been quickly followed by other AI tools such as RoseTTAFold [7], ESMFold [8], and OpenFold [9]. ProGen [10], ProteinMPNN [11], EvoDiff [12], and RFdiffusion [13] extend the AI capabilities to novel protein design, as does DiffDock [14] to molecular docking. These and many other rapidly developing tools apply novel algorithms and AI architectures, each with unique strengths and weaknesses. Here we focus not so much on the comparison of their predictions, but on the differences in their algorithms and approaches and the resulting optimal applications.
2. Protein Structure Prediction In Silico before AlphaFold
In the period preceding the advent of AlphaFold, the process of protein structure prediction generally encompassed several distinct stages, as outlined in the following discussion (Figure 2).
Figure 2. Stages of Protein Structure Prediction: The foundational stage involves determining the DNA sequence that encodes the protein of interest. The next step is to infer the protein sequence from the DNA sequence. Homology modeling uses known protein structures as templates to predict the structure of a protein with an unknown structure but similar sequence. Lastly, validation of structure ensures the predicted structure’s biological plausibility. This involves checks on stereochemical quality, energy evaluation, and comparison to known structural data.
2.1. Homology and Comparative Modeling
Homology modeling predicts a protein’s 3D structure using the structure of a homologous protein. It involves four steps: identifying a homologous protein with a known structure (target identification), aligning the target with the template sequence (alignment), constructing a model of the target protein from aligned regions (model building), and enhancing the model’s accuracy and stability (model refinement). Improvements in distant homology recognition and alignment between distant homologs are exemplified by the HHpred algorithm and the accompanying suite of programs [15,16]. Predictions of protein contact maps from coevolution patterns approached this problem from another angle [17], enhanced by the first applications of deep learning neural networks [18]. In the late 2010s tools such as Rosetta [19] and I-TASSER [20] crossed the line from homology to comparative modeling [21]. Rosetta achieved this by using smaller elements of known structures and a combination of an energy-like scoring function and empirical folding rules. I-TASSER (Iterative Threading ASSEmbly Refinement) similarly uses a combination of template-based modeling and fragment assembly. Other tools have similarly started approaching the level of ab initio protein structure prediction [21]. These advances directly led to the development of AlphaFold and the following AI approaches to protein structure prediction.
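As an illustration of the alignment step, the short Python sketch below computes percent sequence identity between an aligned target and template, a quantity commonly used to judge how suitable a template is. The two sequences are invented for illustration, and the identity definition used here (matches divided by the number of columns in which neither sequence has a gap) is only one of several conventions, not one taken from the tools cited above.

```python
def percent_identity(aligned_target: str, aligned_template: str) -> float:
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_target, aligned_template):
        if a == "-" or b == "-":
            continue  # gap columns are excluded from the comparison
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Toy pairwise alignment (hypothetical sequences, "-" marks a gap).
target = "MKT-AYIAKQR"
template = "MKSLAYI-KQR"
print(f"{percent_identity(target, template):.1f}% identity")
```

Here 8 of the 9 gap-free columns match, so the sketch reports roughly 88.9% identity.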
2.2. Structure Validation
Structure validation ensures that the predictions are accurate and plausible. Tools like [22] analyze the geometry of structural features and verify the dihedral angles in the Ramachandran plot. Energy-based evaluations, such as ANOLEA [23], assess potential energy to evaluate the correctness of folding. Finally, predicted structures can be compared to the experimental data, if such are available. Such comparisons can be used to benchmark the prediction methods and establish expected accuracy, but cannot be used to evaluate predictions for proteins with no known experimental structures. However, functional predictions based on the predicted 3D structures, such as identity of active site or interaction interface residues, can be tested in vitro, thus indirectly confirming the structure prediction.
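The dihedral angles shown in a Ramachandran plot are each defined by four consecutive backbone atoms (phi by C(i-1), N(i), CA(i), C(i); psi by N(i), CA(i), C(i), N(i+1)). The following is a minimal sketch of that calculation in plain Python; the function name and the test coordinates are illustrative assumptions and do not reproduce any particular validation tool.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral_degrees(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) about the p1-p2 bond for four 3D points."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)  # unit vector along the central bond
    # Project b0 and b2 onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


# Idealized geometry: the fourth point is rotated a quarter turn about the z axis.
print(dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

A validation tool would compute phi and psi for every residue in this way and compare the resulting pairs against the regions populated in experimentally determined structures.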
3. Existing Protein Structure Data Sets and Their Applications
Existing protein structure data sets play a pivotal role in protein bioinformatics (Table 1). Protein structures elucidated through experimental methods by various structural biology research groups are submitted to the Research Collaboratory for Structural Bioinformatics (RCSB) Protein Data Bank (PDB) [24]. The practical applications of AI-based structure predictions have been made much easier by the development of the AlphaFold Protein Structure Database, which offers precalculated predictions for over 200 million protein structures. Integration of the AlphaFold2 and the UniProt databases extended access to protein structural information to a broad community of biologists [25,26]. The ESM Metagenomic Atlas contains predictions for over 700 million protein structures from various microorganisms found in environments such as soil, seawater, and the human gut. This comprehensive collection of predicted structures provides valuable insights into the metagenomic landscape [8]. These data sets collectively support a broad range of research studies and applications, including developing and evaluating machine learning models, advancing our understanding of protein biology, and facilitating drug discovery efforts.
Table 1. Publicly available protein structure data sets and their applications in different phases of
drug discovery.
study [31], the researchers used AF2 to predict the three-dimensional structures of all the
human DGK paralogs and conducted structural alignment of the predictions to reveal the
conserved domains and their spatial arrangement relative to each other. The study also
used docking studies to corroborate the existence of a conserved ATP-binding site between
the catalytic and accessory domains and to investigate the spatial arrangement of DGK
with respect to the membrane.
AF2 can aid drug discovery by accurately predicting protein 3D structures and identi-
fying potential allosteric binding sites. Allosteric drugs, which bind the allosteric rather
than the active sites, can induce conformational changes in proteins, affecting their activ-
ities. This enables the design of more effective drugs that can synergize with traditional
orthosteric drugs to enhance efficacy. A study from Nussinov, R., et al. [32] illustrated how
allosteric drugs can alter the conformation of an active site that a drug-resistant mutation
has created, permitting a blocked orthosteric drug to bind. This suggests that a combination
of allosteric and orthosteric drugs can be more effective than either drug type alone. In another study from Weng, Y., et al. [33], AF2 was used to predict the protein structure of WSB1. The predicted structure was then optimized using molecular dynamics simulations and validated computationally. Virtual screening was then performed using AutoDock-GPU and Glide to filter compounds using ligand- or structure-based methods. Finally, four compounds with different scaffolds were selected as potential inhibitors of WSB1.
In a recent development, AlphaMissense, a computational tool devised by Google
DeepMind, was shown to correctly assess the pathogenic potential of missense variants [34].
By utilizing the structural insights from AlphaFold, AlphaMissense evaluates the effects of
mutations on the functionality of proteins. In the realm of cancer drug discovery, this tool
holds significant promise in aiding researchers to efficiently select genetic mutations for in-
depth study. This could expedite the process of identifying novel drug targets. Furthermore,
AlphaMissense has the potential to enhance our comprehension of less-explored segments
of the genetic code, especially genes that play crucial roles in human health but whose
functions are yet to be fully understood.
5. Target Identification
The next step after understanding the molecular mechanism of disease is identifying targets for therapeutic intervention. Again, knowledge of the structure of proteins involved in pathways or networks mutated or modified in cancer is an important step in identifying the best drug targets. Understanding disease mechanisms at the molecular level, including the functional, interactive, and mechanistic implications of gene product alterations, is essential for developing targeted therapeutic strategies for cancer. By modeling these aspects, researchers can evaluate and compare different strategies to correct the adverse outcomes caused by gene mutations. Such molecular models are instrumental in the design of effective cancer therapies [35].
tential protein–protein interactions (PPIs) related to cancer driver proteins. These proteins
play roles in various cellular functions, including transcription regulation, signal transduc-
tion, DNA repair, and cell cycle processes. For the predicted binary protein complexes, they
constructed spatial models, revealing that 1087 of these complexes had not been previously
characterized in terms of their 3D structures. In addition, the top AF2 contact probability
between residues of a protein pair can be used to distinguish true PPIs from false ones
in yeast.
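The idea of scoring a candidate protein pair by its top inter-residue contact probability can be sketched as follows. This is a minimal illustration under stated assumptions: the matrices are invented numbers standing in for inter-chain contact probabilities (rows index residues of one protein, columns residues of the other), the pair names are hypothetical, and no decision threshold is implied by the source.

```python
def top_contact_probability(contact_probs):
    """Highest contact probability over all inter-protein residue pairs."""
    return max(max(row) for row in contact_probs)


def rank_pairs(pair_to_matrix):
    """Sort candidate protein pairs from most to least confident."""
    scored = [(name, top_contact_probability(m)) for name, m in pair_to_matrix.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical inter-chain contact probability matrices (illustrative values only).
candidates = {
    "pairA": [[0.02, 0.10, 0.05], [0.01, 0.93, 0.40]],
    "pairB": [[0.03, 0.04, 0.02], [0.05, 0.08, 0.06]],
}

for name, score in rank_pairs(candidates):
    print(name, score)
```

Pairs with a high top score would be prioritized as likely true interactions, while pairs whose best contact probability stays low would be treated as probable false positives.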
Vasoactive intestinal peptide receptor 2 (VIPR2), a class B G-protein-coupled receptor,
plays a role in numerous physiological processes through its interaction with vasoactive
intestinal peptide (VIP) and pituitary adenylate cyclase-activating polypeptide (PACAP).
VIPR2 has garnered interest as a potential therapeutic target in the fields of psychiatry,
oncology, and immunology. In a study by Sakamoto, K., et al. [37], the researchers combined AF2 with molecular dynamics (MD) simulations to construct models of the VIPR2/KS-133 and VIPR2/VIP complexes and to understand their binding modes.
Figure 3. Model Architecture of AlphaFold2. The architecture of the AlphaFold2 model can be broadly divided into three parts: (1) Model Input, (2) Evoformer, and (3) Structure module.
6.2. Overview of the ESMFold Algorithm
The ESMFold model [8] is built upon a BERT-like architecture, which is a type of large language model that utilizes stacked Transformer encoder layers. It is trained using a technique known as masked residue prediction, where certain amino acids in the protein sequence are hidden from the model during training, forcing it to predict these residues based on the surrounding context. This training process enables ESMFold to develop intricate internal representations of protein sequences. A notable feature of the ESM language model is its ability to infer structural information from protein sequences without relying on MSAs or known protein homologies.
The model’s attention maps, derived from sequence embeddings, are used to predict the contact map. This capability is based solely on the amino acid sequence of the protein, making ESMFold a valuable tool for studying proteins that are difficult to analyze using traditional methods that depend on evolutionary comparisons (Figure 4).
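The masked residue prediction objective described above can be illustrated with a small data-preparation sketch: some residues are replaced by a mask token, and the hidden residues become the labels the model must recover. The 15% mask fraction follows the common BERT convention and is an assumption here, as are the `<mask>` token string, the function name, and the toy sequence; the sketch does not reproduce the actual ESM training code.

```python
import random

MASK = "<mask>"  # placeholder token name (assumption)


def mask_residues(sequence, mask_fraction=0.15, seed=0):
    """Hide a random subset of residues; return (masked tokens, {position: true residue})."""
    rng = random.Random(seed)
    n_mask = max(1, round(len(sequence) * mask_fraction))
    positions = sorted(rng.sample(range(len(sequence)), n_mask))
    tokens = list(sequence)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]  # label the model should predict
        tokens[pos] = MASK          # what the model actually sees
    return tokens, targets


# Toy 20-residue sequence (hypothetical).
tokens, targets = mask_residues("MKTAYIAKQRQISFVKSHFS")
print(" ".join(tokens))
print(targets)
```

During training, the model would be scored only on how well it predicts the residues stored in `targets` from the unmasked context.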
Figure 4. Model Architecture of ESMFold. The ESMFold model can be divided into four parts: data parsing, encoder (Folding Trunk), decoder (Structure Module), and the recycling phase.
6.3. Overview of the RoseTTAFold Algorithm
Developed by David Baker’s group at the Institute for Protein Design at the University of Washington, RoseTTAFold [7] is an extension of the older Rosetta family of tools, enhanced by deep learning technology. It employs a unique ‘three-track’ neural network and integrates three types of information: the sequential patterns in proteins, the interplay between amino acids, and the probable three-dimensional configurations. RoseTTAFold has recently been updated to model complete biological assemblies, including a range of biomolecules such as proteins, DNA, and RNA. This enhancement broadens the potential uses of protein structure prediction algorithms [40].
6.4. Overview of the OpenFold Algorithm
The OpenFold Consortium introduced OpenFold, an open-source, trainable version of AF2, alongside OpenProteinSet, a database of 5 million diverse MSAs. This eliminates the massive computational barrier (millions of CPU hours) required for large-scale training. When trained from scratch using OpenProteinSet, OpenFold matches AF2’s prediction quality but offers advantages like faster processing, lower memory usage for handling longer proteins on a single GPU, and compatibility with the widely used PyTorch machine learning framework. This makes OpenFold easily accessible to a broad developer community [9].
Using OpenFold, researchers explored the model’s protein-folding learning process, identifying distinct behavioral phases during intermediate training stages. They discovered that OpenFold learns spatial dimensions and structural elements in an interleaved fashion. With OpenFold achieving 90% accuracy in just 3% of the training time of AF2, its retraining on pruned data sets showcased robustness and varied generalization capabilities. Training on smaller, diverse data sets further enhanced OpenFold’s performance. These findings provide valuable insights into AF2-type models and pave the way for advancements in biomolecular modeling algorithms.
6.5. Comparing AlphaFold2 vs. ESMFold vs. RoseTTAFold vs. OpenFold
In protein structure prediction, utilizing individual sequences without relying on co-evolutionary data such as MSAs is emerging as a promising strategy. This method potentially eliminates the time needed for homology searches and MSA building and may enhance prediction accuracy for orphan proteins. Although explored in earlier research by Chowdhury et al. and Wang et al. [41,42], the results were initially less than ideal. However, recent ESMFold results indicate that larger pre-trained models alongside techniques inspired by AF2’s distillation method can enhance prediction accuracies. This improvement is attributed to two primary factors. First, the size of the sequence pre-trained models has been significantly increased, with ESMFold now using a 15B-parameter model that encapsulates more co-evolutionary information. Second, instead of employing self-distillation, a technique known as AF2 distillation has been adopted. In this approach, AF2 is utilized to perform structure predictions on a large sequence database, and the predicted structures are then used as training data for ESMFold. This innovative method of utilizing AlphaFold2’s predictive power to enrich the training data has contributed to the enhanced performance of ESMFold in protein structure prediction. For instance, ESMFold, with fewer parameters, predicts a protein with 384 residues in just 14.2 s on a single NVIDIA V100 GPU, about 6 times faster than AF2.
The strategies employed by AF2, ESMFold, RoseTTAFold and OpenFold in protein structure prediction offer distinct advantages and limitations. ESMFold’s approach of using individual sequences for predictions is time-efficient and particularly beneficial for orphan proteins, which lack homologs in current databases. ESMFold demonstrates a significant speed advantage over AF2, enabling the rapid construction of predicted structures, a crucial factor given the vast amount of available sequence data.
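To make the scale of this speed advantage concrete, the back-of-the-envelope sketch below converts the timing quoted above (14.2 s for a 384-residue protein, about 6 times faster than AF2) into structures per GPU-day. It assumes every protein has 384 residues, treats the "about 6 times" factor as exact, and ignores overheads such as MSA construction and data loading, so the numbers are rough illustrations rather than benchmarks.

```python
SECONDS_PER_DAY = 24 * 60 * 60

esmfold_seconds = 14.2  # per 384-residue protein, as quoted in the text
speedup = 6.0           # "about 6 times faster"; treated as exact here (assumption)
af2_seconds = esmfold_seconds * speedup  # implied AF2 time, roughly 85 s


def structures_per_gpu_day(seconds_per_structure):
    """Number of predictions one GPU completes in 24 h at a fixed per-structure time."""
    return SECONDS_PER_DAY / seconds_per_structure


print(f"Implied AF2 time: {af2_seconds:.1f} s")
print(f"ESMFold: {structures_per_gpu_day(esmfold_seconds):.0f} structures per GPU-day")
print(f"AF2:     {structures_per_gpu_day(af2_seconds):.0f} structures per GPU-day")
```

Under these assumptions a single GPU yields on the order of six thousand ESMFold predictions per day versus about one thousand for AF2, which is why per-structure speed matters when predicting structures for hundreds of millions of sequences.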
On the other hand, AF2’s methodology, as summarized in the overview, leverages MSA and structural databases to interpret coevolutionary correlations between mutations for its predictions. However, this approach may struggle with novel single-point mutations or orphan proteins, and it raises concerns regarding data leakage in evaluation data sets.
RoseTTAFold can predict protein–nucleic acid complexes, though its precision in this
area is not as high as when dealing with protein structures alone. To enhance this capability,
the RoseTTAFoldNA extension has been developed, specifically focusing on improving the
predictions of protein-nucleic acid complexes [43].
The contrasting approaches among AF2, ESMFold, RoseTTAFold and OpenFold highlight the trade-offs between prediction speed, accuracy, and the need for additional input data (Table 2). We compared the algorithms used by AF2, ESM2 and OpenFold, focusing on the input and frameworks, in Supplementary Table S1.
Table 2. Capabilities of and differences between these four protein structure prediction models.
Figure 5. Model structure of RFdiffusion. Use of the diffusion model approach for training and fine-tuning the protein structure prediction model, enabling a more refined depiction of the hidden relationship between protein sequences and structures.
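For readers unfamiliar with diffusion models, the sketch below shows the generic forward (noising) step on which such models are trained: a clean sample x0 is blended with Gaussian noise as x_t = sqrt(a) * x0 + sqrt(1 - a) * noise, where a is the cumulative signal fraction at step t, and a network then learns to reverse the corruption. This is a deliberately simplified assumption: it perturbs toy C-alpha coordinates directly, whereas RFdiffusion operates on residue frames (positions and orientations), and none of the names below come from its code.

```python
import math
import random


def add_noise(coords, alpha_bar, rng):
    """Generic forward diffusion step: sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * noise."""
    signal = math.sqrt(alpha_bar)
    noise = math.sqrt(1.0 - alpha_bar)
    return [
        tuple(signal * c + noise * rng.gauss(0.0, 1.0) for c in xyz)
        for xyz in coords
    ]


# Toy C-alpha trace (hypothetical coordinates, arbitrary units).
backbone = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
rng = random.Random(0)

for alpha_bar in (1.0, 0.5, 0.0):  # from no noise to pure noise
    print(alpha_bar, add_noise(backbone, alpha_bar, rng))
```

With alpha_bar equal to 1 the structure is returned unchanged, and with alpha_bar equal to 0 all structural information is replaced by noise; generation runs the learned reverse process from noise back to a plausible structure.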
7. Generative AI: A Catalyst in Cancer Drug Development
Generative AI has emerged as a transformative force in the life sciences sector (Figure 6), powering innovative research, optimizing workflows, and providing new insights. Its applications are extensive and varied: We previously discussed de novo design of proteins [48,49], the creation of novel antibodies [50], and the building of comprehensive models for single-cell multi-omics [51], which can provide a deeper understanding of the cellular heterogeneity in tumors and inform the development of personalized cancer treatments.
Figure 6. Generative AI in Life Sciences: A Comprehensive Overview of Applications and Innovations. Generative AI is revolutionizing various aspects of life sciences. It is accelerating drug discovery, aiding in antibody development, and enhancing single-cell multi-omics models for disease understanding. The technology also plays a role in personalized medicine, population genetics, and viral evolution. Beyond biology, it is pivotal in data science for generating synthetic data and in scientific visualization through text-to-image technologies. Overall, generative AI’s impact is expansive and transformative across life sciences.
Moreover, generative AI also plays a role in genomic variant effect prediction [52] and
identifying statistical patterns in DNA sequences [53], which can help in understanding
the genetic basis of cancer. It is instrumental in predicting and reconstructing the evolution
of viruses [54], thus offering valuable insights for epidemiology and vaccine development.
Additionally, this technology can generate synthetic data to augment existing data sets [55],
providing a valuable resource for researchers and scientists. Even in the realm of data
visualization, generative AI can be used for text-to-image generation [56,57], translating
complex textual descriptions into accurate, understandable biological images. Overall,
by enabling a deeper understanding of biological systems and accelerating the discovery
process, it holds great promise in advancing the fight against cancer.
10. Conclusions
AI is poised to significantly transform the landscape of drug development, offering
ways to streamline the process, reduce costs, and improve success rates at various stages.
This transformation began with AF2, whose remarkable success in predicting protein
structures marked a milestone for AI applications in structural biology. By providing
accurate predictions of protein structures, AF2 can accelerate the development of new
cancer drugs and therapies and help identify and validate novel drug targets more effectively,
particularly those lacking substantial structural information. Perhaps more importantly,
AF2 has inspired a wave of AI-driven tools for protein structure prediction, engineering,
docking, and the generation of novel proteins with desired structures and functions. These tools
exemplify the role of AI in advancing drug development by enabling the generation of
novel protein sequences and structures, predicting the effects of genomic variants, and
providing new insights into the mechanisms of cancer.
References
1. Jumper, J.; Evans, R.; Pritzel, A.; Green, T.; Figurnov, M.; Ronneberger, O.; Tunyasuvunakool, K.; Bates, R.; Žídek, A.; Potapenko,
A.; et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021, 596, 583–589. [CrossRef]
2. Takkouche, A.; Qiu, X.; Sedova, M.; Jaroszewski, L.; Godzik, A. Unusual structural and functional features of TpLRR/BspA-like
LRR proteins. J. Struct. Biol. 2023, 215, 108011. [CrossRef]
3. Pak, M.A.; Markhieva, K.A.; Novikova, M.S.; Petrov, D.S.; Vorobyev, I.S.; Maksimova, E.S.; Kondrashov, F.A.; Ivankov, D.N. Using
AlphaFold to predict the impact of single mutations on protein stability and function. PLoS ONE 2023, 18, e0282689. [CrossRef]
[PubMed]
4. Yamaguchi, S.; Kaneko, M.; Narukawa, M. Approval success rates of drug candidates based on target, action, modality, application,
and their combinations. Clin. Transl. Sci. 2021, 14, 1113–1122. [CrossRef]
5. Schlander, M.; Hernandez-Villafuerte, K.; Cheng, C.Y.; Mestre-Ferrandiz, J.; Baumann, M. How Much Does It Cost to Research
and Develop a New Drug? A Systematic Review and Assessment. Pharmacoeconomics 2021, 39, 1243–1269. [CrossRef]
6. Mansoori, B.; Mohammadi, A.; Davudian, S.; Shirjang, S.; Baradaran, B. The Different Mechanisms of Cancer Drug Resistance: A
Brief Review. Adv. Pharm. Bull. 2017, 7, 339–348. [CrossRef]
7. Baek, M.; DiMaio, F.; Anishchenko, I.; Dauparas, J.; Ovchinnikov, S.; Lee, G.R.; Wang, J.; Cong, Q.; Kinch, L.N.; Schaeffer, R.D.; et al.
Accurate prediction of protein structures and interactions using a three-track neural network. Science 2021, 373, 871–876. [CrossRef]
8. Lin, Z.; Akin, H.; Rao, R.; Hie, B.; Zhu, Z.; Lu, W.; Smetanin, N.; Verkuil, R.; Kabeli, O.; Shmueli, Y.; et al. Evolutionary-scale
prediction of atomic-level protein structure with a language model. Science 2023, 379, 1123–1130. [CrossRef]
9. Ahdritz, G.; Bouatta, N.; Kadyan, S.; Xia, Q.; Gerecke, W.; O’Donnell, T.J.; Berenberg, D.; Fisk, I.; Zanichelli, N.; Zhang, B.; et al.
OpenFold: Retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization. bioRxiv 2022.
[CrossRef]
10. Madani, A.; Krause, B.; Greene, E.R.; Subramanian, S.; Mohr, B.P.; Holton, J.M.; Olmos, J.L.; Xiong, C.; Sun, Z.Z.; Socher, R.; et al.
Large language models generate functional protein sequences across diverse families. Nat. Biotechnol. 2023, 41, 1099–1106.
[CrossRef]
11. Dauparas, J.; Anishchenko, I.; Bennett, N.; Bai, H.; Ragotte, R.J.; Milles, L.F.; Wicky, B.I.M.; Courbet, A.; de Haas, R.J.;
Bethel, N.; et al. Robust deep learning-based protein sequence design using ProteinMPNN. Science 2022, 378, 49–56. [CrossRef]
12. Alamdari, S.; Thakkar, N.; Berg, R.; Lu, A.X.; Fusi, N.; Amini, A.P.; Yang, K.K. Protein generation with evolutionary diffusion:
Sequence is all you need. bioRxiv 2023. [CrossRef]
13. Watson, J.L.; Juergens, D.; Bennett, N.R.; Trippe, B.L.; Yim, J.; Eisenach, H.E.; Ahern, W.; Borst, A.J.; Ragotte, R.J.; Milles, L.F.; et al.
De novo design of protein structure and function with RFdiffusion. Nature 2023, 620, 1089–1100. [CrossRef]
14. Corso, G.; Stärk, H.; Jing, B.; Barzilay, R.; Jaakkola, T. Diffdock: Diffusion steps, twists, and turns for molecular docking.
arXiv 2022. [CrossRef]
15. Söding, J.; Biegert, A.; Lupas, A.N. The HHpred interactive server for protein homology detection and structure prediction.
Nucleic Acids Res. 2005, 33, W244–W248. [CrossRef]
16. Zimmermann, L.; Stephens, A.; Nam, S.Z.; Rau, D.; Kübler, J.; Lozajic, M.; Gabler, F.; Söding, J.; Lupas, A.N.; Alva, V.A. A
Completely Reimplemented MPI Bioinformatics Toolkit with a New HHpred Server at its Core. J. Mol. Biol. 2018, 430, 2237–2243.
[CrossRef] [PubMed]
17. Marks, D.S.; Colwell, L.J.; Sheridan, R.; Hopf, T.A.; Pagnani, A.; Zecchina, R.; Sander, C. Protein 3D structure computed from
evolutionary sequence variation. PLoS ONE 2011, 6, e28766. [CrossRef] [PubMed]
18. Wang, S.; Sun, S.; Li, Z.; Zhang, R.; Xu, J. Accurate De Novo Prediction of Protein Contact Map by Ultra-Deep Learning Model.
PLoS Comput. Biol. 2017, 13, e1005324. [CrossRef]
19. Du, Z.; Su, H.; Wang, W.; Ye, L.; Wei, H.; Peng, Z.; Anishchenko, I.; Baker, D.; Yang, J. The trRosetta server for fast and accurate
protein structure prediction. Nat. Protoc. 2021, 16, 5634–5651. [CrossRef]
20. Yang, J.; Zhang, Y. Protein Structure and Function Prediction Using I-TASSER. Curr. Protoc. Bioinform. 2015, 52, 5.8.1–5.8.15.
[CrossRef]
21. Moult, J.; Fidelis, K.; Kryshtafovych, A.; Schwede, T.; Tramontano, A. Critical assessment of methods of protein structure
prediction (CASP)-Round XII. Proteins 2018, 86, 7–15. [CrossRef]
22. Laskowski, R.A.; Rullmann, J.A.; MacArthur, M.W.; Kaptein, R.; Thornton, J.M. AQUA and PROCHECK-NMR: Programs for
checking the quality of protein structures solved by NMR. J. Biomol. NMR 1996, 8, 477–486. [CrossRef] [PubMed]
23. Melo, F.; Devos, D.; Depiereux, E.; Feytmans, E. ANOLEA: A www server to assess protein structures. Proc. Int. Conf. Intell. Syst.
Mol. Biol. 1997, 5, 187–190. [PubMed]
24. Berman, H.M.; Kleywegt, G.J.; Nakamura, H.; Markley, J.L. The Protein Data Bank archive as an open data resource. J. Comput.
Aided Mol. Des. 2014, 28, 1009–1014. [CrossRef]
25. Varadi, M.; Anyango, S.; Deshpande, M.; Nair, S.; Natassia, C.; Yordanova, G.; Yuan, D.; Stroe, O.; Wood, G.; Laydon, A.; et al.
AlphaFold Protein Structure Database: Massively expanding the structural coverage of protein-sequence space with high-accuracy
models. Nucleic Acids Res. 2022, 50, D439–D444. [CrossRef]
26. Varadi, M.; Anyango, S.; Deshpande, M.; Paramval, U.; Pidruchna, I.; Radhakrishnan, M.; Tsenkov, M.; Nair, S.; Mirdita, M.; Yeo,
J.; et al. AlphaFold Protein Structure Database in 2024: Providing structure coverage for over 214 million protein sequences.
Nucleic Acids Res. 2024, 52, D368–D375. [CrossRef]
27. Berman, H.M.; Westbrook, J.; Feng, Z.; Gilliland, G.; Bhat, T.N.; Weissig, H.; Shindyalov, I.N.; Bourne, P.E. The Protein Data Bank.
Nucleic Acids Res. 2000, 28, 235–242. [CrossRef]
28. Kryshtafovych, A.; Schwede, T.; Topf, M.; Fidelis, K.; Moult, J. Critical assessment of methods of protein structure prediction
(CASP)-Round XIV. Proteins 2021, 89, 1607–1617. [CrossRef]
29. Bienert, S.; Waterhouse, A.; de Beer, T.A.; Tauriello, G.; Studer, G.; Bordoli, L.; Schwede, T. The SWISS-MODEL Repository-new
features and functionality. Nucleic Acids Res. 2017, 45, D313–D319. [CrossRef]
30. Keskin Karakoyun, H.; Yüksel, Ş.K.; Amanoglu, I.; Naserikhojasteh, L.; Yeşilyurt, A.; Yakıcıer, C.; Timuçin, E.; Akyerli, C.B.
Evaluation of AlphaFold structure-based protein stability prediction on missense variations in cancer. Front. Genet. 2023,
14, 1052383. [CrossRef]
31. Aulakh, S.S.; Bozelli, J.C.; Epand, R.M. Exploring the AlphaFold Predicted Conformational Properties of Human Diacylglycerol
Kinases. J. Phys. Chem. B 2022, 126, 7172–7183. [CrossRef] [PubMed]
32. Nussinov, R.; Zhang, M.; Liu, Y.; Jang, H. AlphaFold, allosteric, and orthosteric drug discovery: Ways forward. Drug Discov. Today
2023, 28, 103551. [CrossRef] [PubMed]
33. Weng, Y.; Pan, C.; Shen, Z.; Chen, S.; Xu, L.; Dong, X.; Chen, J. Identification of Potential WSB1 Inhibitors by AlphaFold Modeling,
Virtual Screening, and Molecular Dynamics Simulation Studies. Evid. Based Complement. Alternat. Med. 2022, 2022, 4629392.
[CrossRef] [PubMed]
34. Cheng, J.; Novati, G.; Pan, J.; Bycroft, C.; Žemgulytė, A.; Applebaum, T.; Pritzel, A.; Wong, L.H.; Zielinski, M.; Sargeant, T.; et al.
Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science 2023, 381, eadg7492. [CrossRef]
[PubMed]
35. Borkakoti, N.; Thornton, J.M. AlphaFold2 protein structure prediction: Implications for drug discovery. Curr. Opin. Struct. Biol.
2023, 78, 102526. [CrossRef] [PubMed]
36. Zhang, J.; Pei, J.; Durham, J.; Bos, T.; Cong, Q. Computed cancer interactome explains the effects of somatic mutations in cancers.
Protein Sci. 2022, 31, e4479. [CrossRef]
37. Sakamoto, K.; Asano, S.; Ago, Y.; Hirokawa, T. AlphaFold version 2.0 elucidates the binding mechanism between VIPR2 and
KS-133, and reveals an S-S bond (Cys(25)-Cys(192)) formation of functional significance for VIPR2. Biochem. Biophys. Res. Commun.
2022, 636, 10–16. [CrossRef]
38. Ren, F.; Ding, X.; Zheng, M.; Korzinkin, M.; Cai, X.; Zhu, W.; Mantsyzov, A.; Aliper, A.; Aladinskiy, V.; Cao, Z.; et al. AlphaFold
accelerates artificial intelligence powered drug discovery: Efficient discovery of a novel CDK20 small molecule inhibitor. Chem.
Sci. 2023, 14, 1443–1452. [CrossRef]
39. Richardson, L.; Allen, B.; Baldi, G.; Beracochea, M.; Bileschi, M.L.; Burdett, T.; Burgin, J.; Caballero-Pérez, J.; Cochrane, G.; Colwell,
L.J.; et al. MGnify: The microbiome sequence data analysis resource in 2023. Nucleic Acids Res. 2023, 51, D753–D759. [CrossRef]
40. Krishna, R.; Wang, J.; Ahern, W.; Sturmfels, P.; Venkatesh, P.; Kalvet, I.; Lee, G.R.; Morey-Burrows, F.S.; Anishchenko, I.;
Humphreys, I.R.; et al. Generalized biomolecular modeling and design with RoseTTAFold All-Atom. Science 2024, 384, eadl2528.
[CrossRef]
41. Chowdhury, R.; Bouatta, N.; Biswas, S.; Floristean, C.; Kharkar, A.; Roy, K.; Rochereau, C.; Ahdritz, G.; Zhang, J.;
Church, G.M.; et al. Single-sequence protein structure prediction using a language model and deep learning. Nat. Biotechnol.
2022, 40, 1617–1623. [CrossRef]
42. Wang, W.; Peng, Z.; Yang, J. Single-sequence protein structure prediction using supervised transformer protein language models.
Nat. Comput. Sci. 2022, 2, 804–814. [CrossRef]
43. Baek, M.; McHugh, R.; Anishchenko, I.; Jiang, H.; Baker, D.; DiMaio, F. Accurate prediction of protein-nucleic acid complexes
using RoseTTAFoldNA. Nat. Methods 2024, 21, 117–121. [CrossRef]
44. Moussad, B.; Roche, R.; Bhattacharya, D. The transformative power of transformers in protein structure prediction. Proc. Natl.
Acad. Sci. USA 2023, 120, e2303499120. [CrossRef]
45. Wang, G.; Fang, X.; Wu, Z.; Liu, Y.; Xue, Y.; Xiang, Y.; Yu, D.; Wang, F.; Ma, Y. Helixfold: An efficient implementation of alphafold2
using paddlepaddle. arXiv 2022. [CrossRef]
46. Wang, J.; Lisanza, S.; Juergens, D.; Tischer, D.; Watson, J.L.; Castro, K.M.; Ragotte, R.; Saragovi, A.; Milles, L.F.; Baek, M.; et al.
Scaffolding protein functional sites using deep learning. Science 2022, 377, 387–394. [CrossRef] [PubMed]
47. Gentile, F.; Yaacoub, J.C.; Gleave, J.; Fernandez, M.; Ton, A.T.; Ban, F.; Stern, A.; Cherkasov, A. Artificial intelligence-enabled
virtual screening of ultra-large chemical libraries with deep docking. Nat. Protoc. 2022, 17, 672–697. [CrossRef]
48. Anishchenko, I.; Pellock, S.J.; Chidyausiku, T.M.; Ramelot, T.A.; Ovchinnikov, S.; Hao, J.; Bafna, K.; Norn, C.; Kang, A.;
Bera, A.K.; et al. De novo protein design by deep network hallucination. Nature 2021, 600, 547–552. [CrossRef] [PubMed]
49. Yim, J.; Trippe, B.L.; Bortoli, V.D.; Mathieu, E.; Doucet, A.; Barzilay, R.; Jaakkola, T. SE(3) diffusion model with application to
protein backbone generation. arXiv 2023, arXiv:2302.02277.
50. Callaway, E. How generative AI is building better antibodies. Nature 2023. Available online:
https://www.nature.com/articles/d41586-023-01516-w (accessed on 16 January 2024). [CrossRef]
51. Cui, H.; Wang, C.; Maan, H.; Pang, K.; Luo, F.; Duan, N.; Wang, B. scGPT: Towards Building a Foundation Model for Single-Cell
Multi-omics Using Generative AI. Nat. Methods 2024, 1–11. [CrossRef]
52. Benegas, G.; Batra, S.S.; Song, Y.S. DNA language models are powerful zero-shot predictors of genome-wide variant effects. Proc.
Natl. Acad. Sci. USA 2023, 120, e2311219120. [CrossRef]
53. Yamada, K.; Hamada, M. Prediction of RNA-protein interactions using a nucleotide language model. Bioinform. Adv. 2022,
2, vbac023. [CrossRef] [PubMed]
54. Zvyagin, M.; Brace, A.; Hippe, K.; Deng, Y.; Zhang, B.; Bohorquez, C.O.; Clyde, A.; Kale, B.; Perez-Rivera, D.; Ma, H.; et al.
GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics. bioRxiv 2022. [CrossRef]
55. Chen, R.J.; Lu, M.Y.; Chen, T.Y.; Williamson, D.F.K.; Mahmood, F. Synthetic data in machine learning for medicine and healthcare.
Nat. Biomed. Eng. 2021, 5, 493–497. [CrossRef]
56. Kather, J.N.; Ghaffari, L.N.; Foersch, S.; Truhn, D. Medical domain knowledge in domain-agnostic generative AI. NPJ Digit. Med.
2022, 5, 90. [CrossRef]
57. Khader, F.; Muller-Franzes, G.; Tayebi-Arasteh, S.; Han, T.; Haarburger, C.; Schulze-Hagen, M.; Schad, P.; Engelhardt, S.; Baeßler,
B.; Foersch, S.; et al. Denoising diffusion probabilistic models for 3D medical image generation. Sci. Rep. 2023, 13, 7303. [CrossRef]
[PubMed]
58. Stokes, J.M.; Yang, K.; Swanson, K.; Jin, W.; Cubillos-Ruiz, A.; Donghia, N.M.; MacNair, C.R.; French, S.; Carfrae, L.A.; Bloom-
Ackermann, Z.; et al. A Deep Learning Approach to Antibiotic Discovery. Cell 2020, 180, 688–702.e13. [CrossRef]
59. Burki, T. A new paradigm for drug development. Lancet Digit. Health 2020, 2, e226–e227. [CrossRef]
60. InSilico Medicine Hong Kong Limited (1 October 2023–28 February 2026). Evaluating INS018_055 Administered Orally to
Subjects with Idiopathic Pulmonary Fibrosis. NCT05975983. Available online: https://clinicaltrials.gov/study/NCT05975983
(accessed on 16 January 2024).
61. InSilico Medicine Hong Kong Limited (19 June 2023–11 June 2024). Study Evaluating INS018_055 Administered Orally to Subjects
With Idiopathic Pulmonary Fibrosis (IPF). NCT05938920. Available online: https://clinicaltrials.gov/study/NCT05938920
(accessed on 26 February 2023).
62. Bung, N.; Krishnan, S.R.; Bulusu, G.; Roy, A. De novo design of new chemical entities for SARS-CoV-2 using artificial intelligence.
Future Med. Chem. 2021, 13, 575–585. [CrossRef]
63. Blanco-Gonzalez, A.; Cabezon, A.; Seco-Gonzalez, A.; Conde-Torres, D.; Antelo-Riveiro, P.; Piñeiro, Á.; Garcia-Fandino, R. The
Role of AI in Drug Discovery: Challenges, Opportunities, and Strategies. Pharmaceuticals 2023, 16, 891. [CrossRef] [PubMed]
64. Khan, B.; Fatima, H.; Qureshi, A.; Kumar, S.; Hanan, A.; Hussain, J.; Abdullah, S. Drawbacks of Artificial Intelligence and Their
Potential Solutions in the Healthcare Sector. Biomed. Mater. Devices 2023, 1, 731–738. [CrossRef] [PubMed]
65. Fernandez, A. Artificial Intelligence Teaches Drugs to Target Proteins by Tackling the Induced Folding Problem. Mol. Pharm.
2020, 17, 2761–2767. [CrossRef] [PubMed]
66. Gershenson, A.; Gosavi, S.; Faccioli, P.; Wintrode, P.L. Successes and challenges in simulating the folding of large proteins. J. Biol.
Chem. 2020, 295, 15–33. [CrossRef]