Exploring Fragment Adding Strategies to Enhance Molecule Pretraining in AI-Driven Drug Discovery

Zhaoxu Meng; Cheng Chen; Xuan Zhang; Wei Zhao; Xuefeng Cui

doi:10.26599/BDMA.2024.9020003


Open Access


Zhaoxu Meng¹, Cheng Chen², Xuan Zhang², Wei Zhao³, Xuefeng Cui²

¹ School of Life Sciences, Shandong University, Qingdao 266237, China

² School of Computer Science and Technology, Shandong University, Qingdao 266237, China

³ State Key Laboratory of Microbiology Technology, Shandong University, Qingdao 266237, China


Abstract

The effectiveness of AI-driven drug discovery can be enhanced by pretraining on small molecules. However, conventional masked language model pretraining techniques are not suitable for molecule pretraining because of the limited vocabulary size and the non-sequential structure of molecules. To overcome these challenges, we propose FragAdd, a strategy that adds a chemically implausible molecular fragment to the input molecule. This approach incorporates rich local information and produces a high-quality graph representation, which is advantageous for tasks such as virtual screening. Building on this representation, we have developed a virtual screening protocol that focuses on identifying binders of estrogen receptor alpha, a nuclear receptor. Our results demonstrate a significant improvement in the binding capacity of the retrieved molecules. Additionally, we demonstrate that the FragAdd strategy can be combined with other self-supervised methods to further expedite the drug discovery process.
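The fragment-adding idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how fragments are chosen or attached, or what the model is trained to predict. The `MolGraph` class, the `add_fragment` helper, the random attachment point, and the per-atom "was this atom added?" labels are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass, field


@dataclass
class MolGraph:
    """Toy molecular graph (hypothetical): atom symbols plus undirected bonds."""
    atoms: list[str]
    bonds: list[tuple[int, int]] = field(default_factory=list)


def add_fragment(mol: MolGraph, frag: MolGraph, rng: random.Random):
    """Attach `frag` to a randomly chosen atom of `mol`.

    Returns the corrupted graph and one 0/1 label per atom (1 = added atom).
    Using these labels as a training target is an assumption of this sketch.
    """
    offset = len(mol.atoms)
    atoms = mol.atoms + frag.atoms
    # Re-index the fragment's bonds so they refer to the appended atoms.
    bonds = mol.bonds + [(a + offset, b + offset) for a, b in frag.bonds]
    # Bond a random host atom to the fragment's first atom. Valence is
    # deliberately not checked, so the result may be chemically implausible.
    host = rng.randrange(len(mol.atoms))
    bonds.append((host, offset))
    labels = [0] * len(mol.atoms) + [1] * len(frag.atoms)
    return MolGraph(atoms, bonds), labels


# Heavy-atom skeletons only; hydrogens are omitted in this toy example.
ethanol = MolGraph(["C", "C", "O"], [(0, 1), (1, 2)])
nitrile = MolGraph(["C", "N"], [(0, 1)])
corrupted, labels = add_fragment(ethanol, nitrile, random.Random(0))
```

In a real pipeline the graphs would come from a cheminformatics toolkit and the corrupted graph would be fed to a graph neural network; those steps are outside the scope of this sketch.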

Keywords

pretraining; information retrieval; drug discovery; virtual screening; molecule property prediction


Big Data Mining and Analytics

Volume 7, Issue 3, September 2024

Pages 565-576

DOI: 10.26599/BDMA.2024.9020003

Cite this article:

Meng Z, Chen C, Zhang X, et al. Exploring Fragment Adding Strategies to Enhance Molecule Pretraining in AI-Driven Drug Discovery. Big Data Mining and Analytics, 2024, 7(3): 565-576. https://doi.org/10.26599/BDMA.2024.9020003


Received: 03 November 2023

Revised: 17 December 2023

Accepted: 08 January 2024

Published: 27 February 2024

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).