Open Access

Exploring Fragment Adding Strategies to Enhance Molecule Pretraining in AI-Driven Drug Discovery

Zhaoxu Meng¹, Cheng Chen², Xuan Zhang², Wei Zhao³, Xuefeng Cui²

¹ School of Life Sciences, Shandong University, Qingdao 266237, China

² School of Computer Science and Technology, Shandong University, Qingdao 266237, China

³ State Key Laboratory of Microbiology Technology, Shandong University, Qingdao 266237, China

Abstract

Pretraining on small molecules can make AI-driven drug discovery more effective. However, conventional masked language model pretraining techniques are poorly suited to molecules because of the limited vocabulary size and the non-sequential structure of molecules. To overcome these challenges, we propose FragAdd, a strategy that adds a chemically implausible molecular fragment to the input molecule. This approach incorporates rich local information and yields a high-quality graph representation, which is advantageous for tasks such as virtual screening. Building on this, we have developed a virtual screening protocol that focuses on identifying binders of estrogen receptor alpha, a nuclear receptor. Our results demonstrate a significant improvement in the binding capacity of the retrieved molecules. Additionally, we demonstrate that the FragAdd strategy can be combined with other self-supervised methods to further expedite the drug discovery process.

Keywords

pretraining; information retrieval; drug discovery; virtual screening; molecule property prediction


Big Data Mining and Analytics

Volume 7, Issue 3, September 2024

Pages 565-576

DOI: 10.26599/BDMA.2024.9020003

Cite this article:

Meng Z, Chen C, Zhang X, et al. Exploring Fragment Adding Strategies to Enhance Molecule Pretraining in AI-Driven Drug Discovery. Big Data Mining and Analytics, 2024, 7(3): 565-576. https://doi.org/10.26599/BDMA.2024.9020003
