Open Access

Exploring Fragment Adding Strategies to Enhance Molecule Pretraining in AI-Driven Drug Discovery

School of Life Sciences, Shandong University, Qingdao 266237, China
School of Computer Science and Technology, Shandong University, Qingdao 266237, China
State Key Laboratory of Microbiology Technology, Shandong University, Qingdao 266237, China

Abstract

The effectiveness of AI-driven drug discovery can be enhanced by pretraining on small molecules. However, conventional masked language model pretraining techniques are not suitable for molecule pretraining because of the limited vocabulary size and the non-sequential structure of molecules. To overcome these challenges, we propose FragAdd, a strategy that adds a chemically implausible molecular fragment to the input molecule. This approach incorporates rich local information and yields a high-quality graph representation, which is advantageous for tasks such as virtual screening. Building on this representation, we developed a virtual screening protocol for identifying binders of estrogen receptor alpha, a nuclear receptor. Our results demonstrate a significant improvement in the binding capacity of the retrieved molecules. Additionally, we show that the FragAdd strategy can be combined with other self-supervised methods to further expedite the drug discovery process.
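The fragment-adding idea can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the fragment library, the attachment rule, or the training objective. The toy `MolGraph` type, the `add_fragment` helper, the random choice of attachment atom, and the per-atom labels (which presume a task of detecting the added atoms) are all hypothetical and are introduced here only for illustration.

```python
import random
from dataclasses import dataclass, field


@dataclass
class MolGraph:
    """Toy molecular graph (hypothetical): atom symbols plus undirected bonds."""
    atoms: list
    bonds: list = field(default_factory=list)


def add_fragment(mol, frag, rng, frag_anchor=0):
    """Attach `frag` to a randomly chosen atom of `mol` (illustrative assumption).

    Returns the enlarged graph and a per-atom label list
    (0 = original atom, 1 = atom from the added fragment).
    No valence or chemistry check is made, so the result may be
    chemically implausible.
    """
    offset = len(mol.atoms)
    atoms = mol.atoms + frag.atoms
    # Shift fragment bond indices so they point into the merged atom list.
    bonds = mol.bonds + [(i + offset, j + offset) for i, j in frag.bonds]
    # One new bond joins a random original atom to the fragment's anchor atom.
    attach_at = rng.randrange(offset)
    bonds.append((attach_at, frag_anchor + offset))
    labels = [0] * offset + [1] * len(frag.atoms)
    return MolGraph(atoms, bonds), labels


# Heavy atoms only: ethanol (C-C-O) and a nitro-like fragment (N with two O).
ethanol = MolGraph(atoms=["C", "C", "O"], bonds=[(0, 1), (1, 2)])
nitro = MolGraph(atoms=["N", "O", "O"], bonds=[(0, 1), (0, 2)])

corrupted, labels = add_fragment(ethanol, nitro, random.Random(0))
print(corrupted.atoms)  # ['C', 'C', 'O', 'N', 'O', 'O']
print(labels)           # [0, 0, 0, 1, 1, 1]
```

In a real pipeline the graphs would come from a cheminformatics toolkit and the labels would feed whatever self-supervised loss the model uses; the sketch shows only the graph-level bookkeeping of merging a fragment into a molecule.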

Big Data Mining and Analytics
Pages 565-576
Cite this article:
Meng Z, Chen C, Zhang X, et al. Exploring Fragment Adding Strategies to Enhance Molecule Pretraining in AI-Driven Drug Discovery. Big Data Mining and Analytics, 2024, 7(3): 565-576. https://doi.org/10.26599/BDMA.2024.9020003


Received: 03 November 2023
Revised: 17 December 2023
Accepted: 08 January 2024
Published: 27 February 2024
© The author(s) 2024.

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
