Survey on Encoding Schemes for Genomic Data Representation and Feature Learning—From Signal Processing to Machine Learning

Ning Yu; Zhihua Li; Zeng Yu

doi:10.26599/BDMA.2018.9020018

Open Access

∙ Department of Computing Sciences, College at Brockport, State University of New York, Brockport, NY 14422, USA.

∙ Department of Computer Science and Technology at Jiangnan University, Wuxi 214122, China.

∙ School of Information Science and Technology, Southwest Jiaotong University, Chengdu 611756, China.

Abstract

Data-driven machine learning, especially deep learning, is becoming an important tool for handling big data problems in bioinformatics. In machine learning, DNA sequences are often converted to numerical values for data representation and feature learning in various applications. A similar conversion occurs in Genomic Signal Processing (GSP), where genome sequences are transformed into numerical sequences for signal extraction and recognition. This kind of conversion is called an encoding scheme. The choice of encoding scheme can greatly affect the performance of GSP applications and machine learning models. This paper collects, analyzes, discusses, and summarizes the existing encoding schemes for genome sequences, particularly in GSP as well as in other genome analysis applications, to provide a comprehensive reference for genomic data representation and feature learning in machine learning.
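As a concrete illustration of what converting a DNA sequence to numerical values can look like, the sketch below shows two simple encodings: a one-hot (binary indicator) encoding and a plain integer mapping. This is only a minimal example under stated assumptions, not a scheme taken from the paper: the alphabet order `ACGT`, the integer values, and the function names `one_hot_encode` and `integer_encode` are all hypothetical choices made for this sketch.

```python
# Illustrative sketch only: two simple ways to turn a DNA string into numbers.
# The alphabet order and the integer values are assumptions made for this example.
ALPHABET = "ACGT"


def one_hot_encode(seq):
    """Return one 4-element indicator vector per nucleotide (order A, C, G, T)."""
    rows = []
    for base in seq.upper():
        if base not in ALPHABET:
            raise ValueError(f"unsupported symbol: {base!r}")
        rows.append([1 if base == ref else 0 for ref in ALPHABET])
    return rows


def integer_encode(seq):
    """Map each nucleotide to its index in ALPHABET (A=0, C=1, G=2, T=3)."""
    codes = []
    for base in seq.upper():
        if base not in ALPHABET:
            raise ValueError(f"unsupported symbol: {base!r}")
        codes.append(ALPHABET.index(base))
    return codes


if __name__ == "__main__":
    print(one_hot_encode("ACGT"))
    print(integer_encode("GATTACA"))
```

The two functions represent the same sequence differently: the integer mapping imposes an arbitrary ordering on the nucleotides, while the one-hot form does not, at the cost of four values per base. Differences of this kind are one reason the choice of encoding can matter for downstream models.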

Keywords

encoding scheme; data representation; feature learning; deep learning; genomic signal processing; machine learning; genome analysis

Big Data Mining and Analytics

Volume 1, Issue 3, September 2018

Pages 191-210

DOI: 10.26599/BDMA.2018.9020018

Cite this article:

Yu N, Li Z, Yu Z. Survey on Encoding Schemes for Genomic Data Representation and Feature Learning—From Signal Processing to Machine Learning. Big Data Mining and Analytics, 2018, 1(3): 191-210. https://doi.org/10.26599/BDMA.2018.9020018

Received: 21 January 2018

Accepted: 24 January 2018

Published: 24 May 2018