Open Access

Molecular Generation and Optimization of Molecular Properties Using a Transformer Model

Zhongyin Xu¹, Xiujuan Lei¹, Mei Ma¹, Yi Pan²

¹ School of Computer Science, Shaanxi Normal University, Xi’an 710119, China

² Faculty of Computer Science and Control Engineering, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China


Abstract

Generating novel molecules that satisfy specific properties is a challenging task in modern drug discovery, because a specific objective must be optimized while chemical rules are still obeyed. Herein, we optimize the properties of a given source molecule so that the generated molecule satisfies the desired properties. Matched Molecular Pairs (MMPs), each of which contains a source molecule and a target molecule, are used, and logD and solubility are selected as the properties to optimize. The main innovative work lies in the calculations of a specific transformer, which are described from the perspective of matrix dimensions. Threshold intervals and state changes are then used to encode logD and solubility for the subsequent tests. In the experiments, we screen the data based on the proportion of heavy atoms to all atoms in the groups and select 12365, 1503, and 1570 MMPs as the training, validation, and test sets, respectively. The transformer models are compared with baseline models with respect to their ability to generate molecules with specific properties. The results show that the transformer model can accurately optimize the source molecules to satisfy the specified properties.
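The encoding of logD by threshold intervals and of solubility by state changes can be illustrated with a minimal Python sketch. The abstract does not give the actual thresholds, so the bin width, the cap on the logD change, the solubility cut-off, and all function and token names below are hypothetical and serve only to show the idea.

```python
import math

# Hypothetical settings chosen for illustration; the abstract does not state them.
LOGD_BIN_WIDTH = 0.5         # assumed width of each logD-change interval
LOGD_MAX_ABS = 2.0           # assumed cap; larger changes fall into open-ended bins
SOLUBILITY_THRESHOLD = 1.7   # assumed cut-off between "low" and "high" solubility


def encode_logd_change(source_logd: float, target_logd: float) -> str:
    """Map the logD change (target minus source) to a half-open interval token (lo, hi]."""
    delta = target_logd - source_logd
    if delta <= -LOGD_MAX_ABS:
        return f"LogD_(-inf, {-LOGD_MAX_ABS}]"
    if delta > LOGD_MAX_ABS:
        return f"LogD_({LOGD_MAX_ABS}, inf)"
    # Round the change up to the nearest bin edge to find the interval's upper bound.
    upper = math.ceil(delta / LOGD_BIN_WIDTH) * LOGD_BIN_WIDTH
    lower = upper - LOGD_BIN_WIDTH
    return f"LogD_({lower:.1f}, {upper:.1f}]"


def solubility_state(value: float) -> str:
    """Classify a solubility value as 'low' or 'high' against the assumed cut-off."""
    return "high" if value > SOLUBILITY_THRESHOLD else "low"


def encode_solubility_change(source_sol: float, target_sol: float) -> str:
    """Encode the transition between solubility states of source and target."""
    src, tgt = solubility_state(source_sol), solubility_state(target_sol)
    if src == tgt:
        return "Solubility_no_change"
    return f"Solubility_{src}->{tgt}"


def property_tokens(source: dict, target: dict) -> list:
    """Build the property-change tokens for one matched molecular pair."""
    return [
        encode_logd_change(source["logD"], target["logD"]),
        encode_solubility_change(source["solubility"], target["solubility"]),
    ]


if __name__ == "__main__":
    src = {"logD": 1.0, "solubility": 1.2}
    tgt = {"logD": 1.75, "solubility": 2.1}
    print(property_tokens(src, tgt))
```

In a setup like this, the resulting tokens would typically be placed in front of the tokenized source molecule, so that the model is conditioned on the desired property change when it generates the target molecule.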

Keywords

molecular optimization; transformer; Matched Molecular Pairs (MMPs); logD; solubility

Big Data Mining and Analytics

Volume 7, Issue 1, March 2024, Pages 142-155

DOI: 10.26599/BDMA.2023.9020009

Cite this article:

Xu Z, Lei X, Ma M, et al. Molecular Generation and Optimization of Molecular Properties Using a Transformer Model. Big Data Mining and Analytics, 2024, 7(1): 142-155. https://doi.org/10.26599/BDMA.2023.9020009
