Department of Mathematics, School of Science, Anhui Science and Technology University, Fengyang 233100, China
Machine Learning and Systems Biology Laboratory, Tongji University, Shanghai 201804, China
Abstract
Numerical characterization of DNA sequences can facilitate the analysis of similar sequences. To visualize and compare different DNA sequences in less space, a novel descriptor extraction approach was proposed for the numerical characterization and similarity analysis of sequences. First, a transformation method was introduced to represent each DNA sequence by a dinucleotide physicochemical property matrix. Then, based on approximate joint diagonalization theory, an eigenvalue vector was extracted from each DNA sequence and used as a descriptor of that sequence. Similarity analyses were then performed by calculating the pairwise distances among the obtained eigenvalue vectors. The results show that the proposed approach can capture more sequence information and can analyze the information contained in all the involved sequences jointly rather than separately. Its effectiveness was demonstrated intuitively by constructing a dendrogram for 15 beta-globin gene sequences.
B. E. Blaisdell, A measure of the similarity of sets of sequences not requiring sequence alignment, Proceedings of the National Academy of Sciences, vol. 83, pp. 5155-5159, 1986.
M. R. Kantorovitz, G. E. Robinson, and S. Sinha, A statistical method for alignment-free comparison of regulatory sequences, Bioinformatics, vol. 23, no. 13, pp. i249-i255, 2007.
G. E. Sims, S. R. Jun, G. A. Wu, and S. H. Kim, Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions, Proceedings of the National Academy of Sciences, vol. 106, no. 8, pp. 2677-2682, 2009.
S. R. Jun, G. E. Sims, G. A. Wu, and S. H. Kim, Whole-proteome phylogeny of prokaryotes by feature frequency profiles: An alignment-free method with optimal feature resolution, Proceedings of the National Academy of Sciences, vol. 107, no. 1, pp. 133-138, 2009.
Y. Wu, A. W.-C. Liew, H. Yan, and M. S. Yang, DB-Curve: A novel 2D method of DNA sequence visualization and representation, Chemical Physics Letters, vol. 367, pp. 170-176, 2003.
Z. B. Liu, B. Liao, W. Zhu, and G. H. Huang, A 2D graphical representation of DNA sequence based on dual nucleotides and its application, International Journal of Quantum Chemistry, vol. 109, no. 5, pp. 948-958, 2009.
C. Y. Lu, H. Min, J. Gui, L. Zhu, and Y. K. Lei, Face recognition via weighted sparse representation, Journal of Visual Communication and Image Representation, vol. 24, no. 2, pp. 111-116, 2013.
D. Bielinska-Waz, Graphical and numerical representations of DNA sequences: Statistical aspects of similarity, Journal of Mathematical Chemistry, vol. 49, no. 10, pp. 2345-2407, 2011.
M. Akhtar, J. Epps, and E. Ambikairajah, On DNA numerical representation for period-3 based exon prediction, in 5th International Workshop on Genomic Signal Processing and Statistics, Tuusula, Piscataway, NJ, USA, 2007.
[14] H. J. Jeffrey, Chaos game representation of gene structure, Nucleic Acids Research, vol. 18, no. 8, pp. 2163-2170, 1990.
R. Zhang and C. T. Zhang, Z curves, an intutive tool for visualizing and analyzing the DNA sequences, Journal of Biomolecular Structure & Dynamics, vol. 11, no. 4, pp. 767-782, 1994.
S. Wang, F. Tian, W. Feng, and X. Liu, Applications of representation method for DNA sequences based on symbolic dynamics, Journal of Molecular Structure: THEOCHEM, vol. 909, pp. 33-42, 2009.
A. K. Brodzik and O. Peters, Symbol-balanced quaternionic periodicity transform for latent pattern detection in DNA sequences, in Proceedings of IEEE ICASSP, Philadelphia, PA, USA, 2005, pp. 373-376.
C. Y. Lu, H. Min, Z. Z. Zhao, L. Zhu, D. S. Huang, and S. C. Yan, Robust and efficient subspace segmentation via least squares regression, European Conference on Computer Vision (ECCV), vol. 7578, no. 7, pp. 347-360, 2012.
G. H. Golub and C. F. V. Loan, Matrix Computations, 3rd Ed. Baltimore and London: Johns Hopkins University Press, 1996.
[24] H. J. Yu and D. S. Huang, Graphical representation for DNA sequences via joint diagonalization of matrix pencil, IEEE Journal of Biomedical and Health Informatics, vol. 17, no. 3, pp. 503-511, 2013.
A. Yeredor, Non-orthogonal joint diagonalization in the least-squares sense with application in blind source separation, IEEE Transactions on Signal Processing, vol. 50, no. 7, pp. 1545-1553, 2002.
Q. Dai, X. Q. Liu, Y. H. Yao, and F. K. Zhao, Sequence comparison via polar coordinates representation and curve tree, Journal of Theoretical Biology, vol. 292, pp. 78-85, 2011.
C. Li, H. Ma, Y. Zhou, X. L. Wang, and X. Q. Zheng, Similarity analysis of DNA sequences based on the weighted pseudo-entropy, Journal of Computational Chemistry, vol. 32, no. 4, pp. 675-680, 2011.
Yu H, Huang D. Descriptors for DNA Sequences Based on Joint Diagonalization of Their Feature Matrices from Dinucleotide Physicochemical Properties. Tsinghua Science and Technology, 2013, 18(5): 446-453. https://doi.org/10.1109/TST.2013.6616518
2.2 Feature extraction from multiple sequences via approximate joint diagonalization upon matrices
For the full set of sequences, we can apply Approximate Joint Diagonalization (AJD) to their corresponding transformed matrices, an approach that has been successfully applied to a dataset comprising the first exon of 11 beta-globin gene sequences [24].
In brief, AJD is the problem of seeking a unitary matrix V that makes the transformed versions of all matrices in the set as diagonal as possible. This rests on the premise that the matrices in the set carry common statistical information about the observations, being estimates of matrices that share a common diagonalizable form.
In general, for any matrix V, the AJD criterion may be defined as the following non-negative function of V:
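In the joint-diagonalization literature this criterion is commonly written in the off-diagonal form below. The set notation $\mathcal{M} = \{M_1, \dots, M_K\}$, the index $k$, and the operator name $\operatorname{off}$ are assumptions introduced here for illustration and are not taken from the text:

```latex
C(\mathcal{M}, V) = \sum_{k=1}^{K} \operatorname{off}\!\left( V^{H} M_k V \right),
\qquad
\operatorname{off}(M) = \sum_{i \neq j} \left| M_{ij} \right|^{2}
```

Under this form the criterion is a sum of squared magnitudes, so it is non-negative, and it vanishes only when V diagonalizes every matrix in the set exactly.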
Usually, AJD does not require that the matrix set be exactly simultaneously diagonalizable by a common unitary matrix. In most cases, the AJD criterion in Eq. (3) cannot be driven to zero, and the matrices can only be approximately jointly diagonalized. Thus, AJD deals with a kind of "average eigen-structure", which is particularly convenient for statistically inferring the structural information extracted from sample statistics.
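The point that the criterion usually cannot be driven to zero can be illustrated with a small sketch. It assumes the common off-diagonal form of the criterion (the sum of squared off-diagonal entries of each transformed matrix) and uses toy real symmetric 2x2 matrices; the helper names and the matrices are hypothetical and are not the matrices used in the paper.

```python
import math

def rotation(theta):
    """2x2 rotation matrix, used here as the candidate orthogonal V."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def congruence(v, m):
    """Return v^T m v for small real matrices stored as nested lists."""
    n = len(m)
    mv = [[sum(m[i][k] * v[k][j] for k in range(n)) for j in range(n)] for i in range(n)]
    return [[sum(v[k][i] * mv[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def off(m):
    """Sum of squared off-diagonal entries."""
    n = len(m)
    return sum(m[i][j] ** 2 for i in range(n) for j in range(n) if i != j)

def ajd_criterion(matrices, v):
    """Assumed criterion: total off-diagonal energy after transforming by v."""
    return sum(off(congruence(v, m)) for m in matrices)

def with_eigvecs(theta, eigs):
    """Symmetric 2x2 matrix with eigenvectors rotation(theta) and eigenvalues eigs."""
    r = rotation(theta)
    return [[sum(r[i][k] * eigs[k] * r[j][k] for k in range(2)) for j in range(2)]
            for i in range(2)]

# Toy set 1: both matrices share the same eigenvectors (exactly diagonalizable).
exact_set = [with_eigvecs(0.3, [1.0, 2.0]), with_eigvecs(0.3, [3.0, -1.0])]
# Toy set 2: eigenvectors differ, so no single V diagonalizes both.
mixed_set = [with_eigvecs(0.3, [1.0, 2.0]), with_eigvecs(math.pi / 4, [1.5, 0.5])]

exact_at_truth = ajd_criterion(exact_set, rotation(0.3))

# Scan candidate rotations; angles in [0, pi/2) cover all distinct cases.
grid = [k * math.pi / 2000 for k in range(1000)]
best_mixed = min(ajd_criterion(mixed_set, rotation(t)) for t in grid)

print(exact_at_truth, best_mixed)
```

For the first set the criterion vanishes (up to rounding) at the shared eigenvector angle, while for the second set the best value over the scanned rotations stays strictly positive, which is the "approximate" situation described above.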
Consider two transformations. The first maps each primary DNA sequence in the dataset to its corresponding matrix, which is symmetric and is determined by scanning along the sequence using the decision criterion listed in Table 2.
The feature vector, produced by the second transformation, is a 12-tuple consisting of all the eigenvalues extracted by AJD from these matrices. Thus, the compound transformation may be obtained as follows:
From Formula (4), the features of a DNA sequence can be extracted directly. From the viewpoint of algebraic spaces, the transformation may also be presented as
where S denotes the original sequence space comprising primary DNA sequences of a given length, and the target of the mapping is the feature space obtained from the original space. Further, the diagonal elements of the jointly diagonalized matrices are simply the eigenvalues of the dinucleotide PhysicoChemical Matrix (PCM) obtained via AJD.
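As a rough illustration of reading feature vectors off the diagonals, the sketch below implements a Jacobi-rotation joint diagonalizer for real symmetric matrices. This is one well-known AJD scheme and is not necessarily the algorithm used in the paper; the function names, tolerance, and the 3x3 toy matrices are hypothetical (the paper's feature vectors are 12-tuples).

```python
import math

def _rotate_cols(m, p, q, c, s):
    """Right-multiply m by the Givens rotation acting on columns p and q."""
    for row in m:
        row[p], row[q] = c * row[p] + s * row[q], -s * row[p] + c * row[q]

def _rotate_rows(m, p, q, c, s):
    """Left-multiply m by the transposed Givens rotation on rows p and q."""
    for j in range(len(m)):
        m[p][j], m[q][j] = c * m[p][j] + s * m[q][j], -s * m[p][j] + c * m[q][j]

def joint_diagonalize(matrices, tol=1e-10, max_sweeps=100):
    """Return (v, features): v is orthogonal, features[k] is diag(v^T M_k v)."""
    n = len(matrices[0])
    mats = [[row[:] for row in m] for m in matrices]  # work on copies
    v = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(max_sweeps):
        rotated = False
        for p in range(n - 1):
            for q in range(p + 1, n):
                # 2x2 Gram matrix [[a, b], [b, d]] of h_k = (m_pp - m_qq, 2 m_pq).
                a = sum((m[p][p] - m[q][q]) ** 2 for m in mats)
                b = sum((m[p][p] - m[q][q]) * 2 * m[p][q] for m in mats)
                d = sum((2 * m[p][q]) ** 2 for m in mats)
                # Rotation angle from the principal eigenvector of that matrix.
                theta = 0.25 * math.atan2(2 * b, a - d)
                c, s = math.cos(theta), math.sin(theta)
                if abs(s) < tol:
                    continue
                rotated = True
                for m in mats:
                    _rotate_rows(m, p, q, c, s)
                    _rotate_cols(m, p, q, c, s)
                _rotate_cols(v, p, q, c, s)
        if not rotated:
            break
    return v, [[m[i][i] for i in range(n)] for m in mats]

# Toy input: two matrices built from a shared orthogonal Q (hypothetical data).
Q = [[2 / 3, -1 / 3, 2 / 3], [2 / 3, 2 / 3, -1 / 3], [-1 / 3, 2 / 3, 2 / 3]]

def from_eigs(eigs):
    return [[sum(Q[i][k] * eigs[k] * Q[j][k] for k in range(3)) for j in range(3)]
            for i in range(3)]

matrices = [from_eigs([1.0, 2.0, 3.0]), from_eigs([5.0, -1.0, 0.5])]
v, features = joint_diagonalize(matrices)
print(features)
```

Each row of `features` plays the role of an eigenvalue vector for one input matrix. The ordering of the entries depends on the rotation sequence, but it is the same for every matrix because a single `v` is shared.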
According to previous work [24], the AJD-PCM algorithm is distance-preserving. Thus, we can calculate the Eigenvalue Vectors (EVs) for each obtained dinucleotide PCM. The corresponding 12-tuple vectors can then be regarded as features extracted from the original DNA sequences. The AJD-PCM algorithm is shown in Algorithm 1.
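Once the EVs are available, the similarity analysis reduces to a pairwise distance computation. The sketch below assumes the Euclidean distance, which the text does not specify, and uses short hypothetical vectors in place of the 12-tuple EVs.

```python
import math

def pairwise_distances(vectors):
    """Symmetric matrix of Euclidean distances between feature vectors."""
    n = len(vectors)
    return [[math.dist(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]

# Hypothetical toy feature vectors, not values from the paper.
toy_evs = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
distances = pairwise_distances(toy_evs)

for row in distances:
    print(row)
```

A smaller entry indicates that two sequences have more similar descriptors. Any other metric could be swapped in by replacing `math.dist`.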
Fig. 1 Graphical representation of the exon in the beta-globin gene from 15 species based on 12-tuple EVs obtained via AJD of all the corresponding PCP matrices. The y-axis indicates the value of each element in the feature vectors.
Fig. 2 The performance curve for the convergence speed of AJD for the dataset listed in Table 3.
Fig. 3 The dendrogram for the 15 sequences according to the pairwise distances listed in Table 3.
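A dendrogram of this kind can be built from a pairwise distance matrix by agglomerative clustering. The sketch below assumes average linkage (UPGMA), which is not stated in the text, and uses hypothetical labels and distances rather than the values in Table 3.

```python
from itertools import combinations

def average_linkage_tree(labels, dist):
    """Merge the two closest clusters repeatedly; return the tree as nested tuples."""
    # Each cluster is (subtree, list of member indices into dist).
    clusters = [(label, [i]) for i, label in enumerate(labels)]

    def avg(a, b):
        # Mean distance over all cross-cluster member pairs.
        return sum(dist[i][j] for i in a[1] for j in b[1]) / (len(a[1]) * len(b[1]))

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: avg(*pair))
        clusters = [c for c in clusters if c is not a and c is not b]
        clusters.append(((a[0], b[0]), a[1] + b[1]))
    return clusters[0][0]

# Hypothetical labels and distances for four sequences.
labels = ["seq1", "seq2", "seq3", "seq4"]
dist = [
    [0.0, 1.0, 5.0, 6.0],
    [1.0, 0.0, 5.5, 6.5],
    [5.0, 5.5, 0.0, 1.5],
    [6.0, 6.5, 1.5, 0.0],
]
tree = average_linkage_tree(labels, dist)
print(tree)  # (('seq1', 'seq2'), ('seq3', 'seq4'))
```

The nested tuples give the branching order only. For plotting with branch heights, a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` together with `dendrogram` would be the usual choice.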