Regular Paper

Decoding the Structural Keywords in Protein Structure Universe

Wessam Elhefnawy¹, Min Li², Jian-Xin Wang², Yaohang Li¹

¹ Department of Computer Science, Old Dominion University, Norfolk, VA 23452, U.S.A.

² Department of Computer Science, Central South University, Changsha 410083, China

Abstract

Although the protein sequence-structure gap continues to widen with the development of high-throughput sequencing tools, the protein structure universe appears to be approaching completeness, as no proteins with novel structural folds have been deposited in the Protein Data Bank (PDB) recently. In this work, we identify a protein structural dictionary (Frag-K) composed of a set of backbone fragments ranging from 4 to 20 residues, which serve as structural “keywords” that can effectively distinguish between major protein folds. We first apply randomized spectral clustering and random forest algorithms to construct representative and sensitive protein fragment libraries from a large set of high-quality, non-homologous protein structures available in the PDB. We analyze the impact of clustering cut-offs on the performance of the fragment libraries. The Frag-K fragments are then employed as structural features to classify protein structures into the major protein folds defined by SCOP (Structural Classification of Proteins). Our results show that a structural dictionary of ~400 Frag-K fragments, 4 to 20 residues in length, is capable of classifying major SCOP folds with high accuracy.
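As a rough illustration of how a fragment dictionary might be turned into structural features, the sketch below counts, for each backbone window of a structure, the closest library fragment of the same length. This is not the authors' implementation: the abstract does not say how Frag-K fragments are matched to a structure, so the use of Cα coordinates, Kabsch-superposed RMSD, and nearest-fragment assignment are all assumptions, and the names `kabsch_rmsd`, `fragment_histogram`, `toy_helix`, and `toy_strand` are hypothetical.

```python
import numpy as np


def kabsch_rmsd(P, Q):
    """RMSD between two (n, 3) coordinate sets after optimal rigid superposition."""
    P = P - P.mean(axis=0)          # remove translation
    Q = Q - Q.mean(axis=0)
    U, S, Vt = np.linalg.svd(P.T @ Q)
    # Flip the smallest singular value if the best fit would be a reflection.
    if np.linalg.det(U @ Vt) < 0:
        S[-1] = -S[-1]
    msd = (np.sum(P ** 2) + np.sum(Q ** 2) - 2.0 * S.sum()) / len(P)
    return float(np.sqrt(max(msd, 0.0)))  # clamp tiny negative round-off


def fragment_histogram(ca, library):
    """Count, per library fragment, how many windows of `ca` it matches best.

    ca      : (N, 3) array of C-alpha coordinates (assumed representation).
    library : list of (k, 3) arrays; k may differ between fragments.
    Windows of length k are compared only with library fragments of length k.
    """
    counts = np.zeros(len(library), dtype=int)
    for k in sorted({len(f) for f in library}):
        idx = [i for i, f in enumerate(library) if len(f) == k]
        for start in range(len(ca) - k + 1):
            window = ca[start:start + k]
            dists = [kabsch_rmsd(window, library[i]) for i in idx]
            counts[idx[int(np.argmin(dists))]] += 1
    return counts


def toy_helix(n):
    """Idealized helix-like trace (toy data, not taken from the paper)."""
    i = np.arange(n)
    ang = np.deg2rad(100.0) * i
    return np.column_stack([2.3 * np.cos(ang), 2.3 * np.sin(ang), 1.5 * i])


def toy_strand(n):
    """Idealized zigzag strand-like trace (toy data, not taken from the paper)."""
    i = np.arange(n)
    return np.column_stack([np.zeros(n), 0.9 * (-1.0) ** i, 3.3 * i])


if __name__ == "__main__":
    library = [toy_helix(4), toy_strand(4)]
    print(fragment_histogram(toy_helix(12), library))
```

The resulting count vector has one entry per dictionary fragment and could, under these assumptions, be passed to any standard classifier as the feature representation of a structure.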

Keywords

protein fragment; fold recognition; protein structure universe

Electronic Supplementary Material

jcst-34-1-3-Highlights.pdf (721.9 KB)

Journal of Computer Science and Technology

Volume 34, Issue 1, January 2019

Pages 3-15

DOI: 10.1007/s11390-019-1895-y

Cite this article:

Elhefnawy W, Li M, Wang J-X, et al. Decoding the Structural Keywords in Protein Structure Universe. Journal of Computer Science and Technology, 2019, 34(1): 3-15. https://doi.org/10.1007/s11390-019-1895-y
