Understanding the subcellular localization of long non-coding RNAs (lncRNAs) is crucial for unraveling their functional mechanisms. While previous computational methods have made progress in predicting lncRNA subcellular localization, most of them ignore the sequence order information by relying on k-mer frequency features to encode lncRNA sequences. In the study, we develope SGCL-LncLoc, a novel interpretable deep learning model based on supervised graph contrastive learning. SGCL-LncLoc transforms lncRNA sequences into de Bruijn graphs and uses the Word2Vec technique to learn the node representation of the graph. Then, SGCL-LncLoc applies graph convolutional networks to learn the comprehensive graph representation. Additionally, we propose a computational method to map the attention weights of the graph nodes to the weights of nucleotides in the lncRNA sequence, allowing SGCL-LncLoc to serve as an interpretable deep learning model. Furthermore, SGCL-LncLoc employs a supervised contrastive learning strategy, which leverages the relationships between different samples and label information, guiding the model to enhance representation learning for lncRNAs. Extensive experimental results demonstrate that SGCL-LncLoc outperforms both deep learning baseline models and existing predictors, showing its capability for accurate lncRNA subcellular localization prediction. Furthermore, we conduct a motif analysis, revealing that SGCL-LncLoc successfully captures known motifs associated with lncRNA subcellular localization. The SGCL-LncLoc web server is available at http://csuligroup.com:8000/SGCL-LncLoc. The source code can be obtained from https://github.com/CSUBioGroup/SGCL-LncLoc.
- Article type
- Year
- Co-author
Identifying cancer driver genes has paramount significance in elucidating the intricate mechanisms underlying cancer development, progression, and therapeutic interventions. Abundant omics data and interactome networks provided by numerous extensive databases enable the application of graph deep learning techniques that incorporate network structures into the deep learning framework. However, most existing models primarily focus on individual network, inevitably neglecting the incompleteness and noise of interactions. Moreover, samples with imbalanced classes in driver gene identification hamper the performance of models. To address this, we propose a novel deep learning framework MMGN, which integrates multiplex networks and pan-cancer multiomics data using graph neural networks combined with negative sample inference to discover cancer driver genes, which not only enhances gene feature learning based on the mutual information and the consensus regularizer, but also achieves balanced class of positive and negative samples for model training. The reliability of MMGN has been verified by the Area Under the Receiver Operating Characteristic curves (AUROC) and the Area Under the Precision-Recall Curves (AUPRC). We believe MMGN has the potential to provide new prospects in precision oncology and may find broader applications in predicting biomarkers for other intricate diseases. Implementations of MMGN can be found at https://github.com/xingyili/MMGN.
Medical knowledge graphs (MKGs) are the basis for intelligent health care, and they have been in use in a variety of intelligent medical applications. Thus, understanding the research and application development of MKGs will be crucial for future relevant research in the biomedical field. To this end, we offer an in-depth review of MKG in this work. Our research begins with the examination of four types of medical information sources, knowledge graph creation methodologies, and six major themes for MKG development. Furthermore, three popular models of reasoning from the viewpoint of knowledge reasoning are discussed. A reasoning implementation path (RIP) is proposed as a means of expressing the reasoning procedures for MKG. In addition, we explore intelligent medical applications based on RIP and MKG and classify them into nine major types. Finally, we summarize the current state of MKG research based on more than 130 publications and future challenges and opportunities.
In general, physicians make a preliminary diagnosis based on patients’ admission narratives and admission conditions, largely depending on their experiences and professional knowledge. An automatic and accurate tentative diagnosis based on clinical narratives would be of great importance to physicians, particularly in the shortage of medical resources. Despite its great value, little work has been conducted on this diagnosis method. Thus, in this study, we propose a fusion model that integrates the semantic and symptom features contained in the clinical text. The semantic features of the input text are initially captured by an attention-based Bidirectional Long Short-Term Memory (BiLSTM) network. The symptom concepts, recognized from the input text, are then vectorized by using the term frequency-inverse document frequency method based on the relations between symptoms and diseases. Finally, two fusion strategies are utilized to recommend the most potential candidate for the international classification of diseases code. Model training and evaluation are performed on a public clinical dataset. The results show that both fusion strategies achieved a promising performance, in which the best performance obtained a top-3 accuracy of 0.7412.
Recently, the emergence of single-cell RNA-sequencing (scRNA-seq) technology makes it possible to solve biological problems at the single-cell resolution. One of the critical steps in cellular heterogeneity analysis is the cell type identification. Diverse scRNA-seq clustering methods have been proposed to partition cells into clusters. Among all the methods, hierarchical clustering and spectral clustering are the most popular approaches in the downstream clustering analysis with different preprocessing strategies such as similarity learning, dropout imputation, and dimensionality reduction. In this study, we carry out a comprehensive analysis by combining different strategies with these two categories of clustering methods on scRNA-seq datasets under different biological conditions. The analysis results show that the methods with spectral clustering tend to perform better on datasets with continuous shapes in two-dimension, while those with hierarchical clustering achieve better results on datasets with obvious boundaries between clusters in two-dimension. Motivated by this finding, a new strategy, called QRS, is developed to quantitatively evaluate the latent representative shape of a dataset to distinguish whether it has clear boundaries or not. Finally, a data-driven clustering recommendation method, called DDCR, is proposed to recommend hierarchical clustering or spectral clustering for scRNA-seq data. We perform DDCR on two typical single cell clustering methods, SC3 and RAFSIL, and the results show that DDCR recommends a more suitable downstream clustering method for different scRNA-seq datasets and obtains more robust and accurate results.
Proteins drive virtually all cellular-level processes. The proteins that are critical to cell proliferation and survival are defined as essential. These essential proteins are implicated in key metabolic and regulatory networks, and are important in the context of rational drug design efforts. The computational identification of the essential proteins benefits from the proliferation of publicly available protein interaction datasets. Scientists have developed several algorithms that use these interaction datasets to predict essential proteins. However, a comprehensive web platform that facilitates the analysis and prediction of essential proteins is missing. In this study, we design, implement, and release NetEPD: a network-based essential protein discovery platform. This resource integrates data on Protein-Protein Interaction (PPI) networks, gene expression, subcellular localization, and a native set of essential proteins. It also computes a variety of node centrality measures, evaluates the predictions of essential proteins, and visualizes PPI networks. This comprehensive platform functions by implementing four activities, which include the collection of datasets, computation of centrality measures, evaluation, and visualization. The results produced by NetEPD are visualized on its website, and sent to a user-provided email, and they are available to download in a parsable format. This platform is freely available at http://bioinformatics.csu.edu.cn/netepd.
The explosion of digital healthcare data has led to a surge of data-driven medical research based on machine learning. In recent years, as a powerful technique for big data, deep learning has gained a central position in machine learning circles for its great advantages in feature representation and pattern recognition. This article presents a comprehensive overview of studies that employ deep learning methods to deal with clinical data. Firstly, based on the analysis of the characteristics of clinical data, various types of clinical data (e.g., medical images, clinical notes, lab results, vital signs, and demographic informatics) are discussed and details provided of some public clinical datasets. Secondly, a brief review of common deep learning models and their characteristics is conducted. Then, considering the wide range of clinical research and the diversity of data types, several deep learning applications for clinical data are illustrated: auxiliary diagnosis, prognosis, early warning, and other tasks. Although there are challenges involved in applying deep learning techniques to clinical data, it is still worthwhile to look forward to a promising future for deep learning applications in clinical big data in the direction of precision medicine.
Inferring Gene Regulatory Networks (GRNs) structure from gene expression data has been a challenging problem in systems biology. It is critical to identify complicated regulatory relationships among genes for understanding regulatory mechanisms in cells. Various methods based on information theory have been developed to infer GRNs. However, these methods introduce many redundant regulatory relationships in the network inference process due to external noise in the original data, topology sparseness in the network structure, and non-linear dependency among genes. Especially as the network size increases, the performance of these methods decreases dramatically. In this paper, a novel network structure inference method named Loc-PCA-CMI is proposed that first identifies local overlapped gene clusters, and then infers the local network structure for each cluster by a Path Consistency Algorithm based on Conditional Mutual Information (PCA-CMI). The final structure of the GRN is denoted as dependence among genes by an ensemble of the obtained local network structures. Loc-PCA-CMI was evaluated on DREAM3 knock-out datasets, and its performance was compared to other information theory-based network inference methods including ARACNE, MRNET, PCA-CMI, and PCA-PMI. Experimental results demonstrate our novel method Loc-PCA-CMI outperforms the other four methods in DREAM3 datasets especially in size 50 and 100 networks.
Deep learning provides exciting solutions in many fields, such as image analysis, natural language processing, and expert system, and is seen as a key method for various future applications. On account of its non-invasive and good soft tissue contrast, in recent years, Magnetic Resonance Imaging (MRI) has been attracting increasing attention. With the development of deep learning, many innovative deep learning methods have been proposed to improve MRI image processing and analysis performance. The purpose of this article is to provide a comprehensive overview of deep learning-based MRI image processing and analysis. First, a brief introduction of deep learning and imaging modalities of MRI images is given. Then, common deep learning architectures are introduced. Next, deep learning applications of MRI images, such as image detection, image registration, image segmentation, and image classification are discussed. Subsequently, the advantages and weaknesses of several common tools are discussed, and several deep learning tools in the applications of MRI images are presented. Finally, an objective assessment of deep learning in MRI applications is presented, and future developments and trends with regard to deep learning for MRI images are addressed.
With the continuing development and improvement of genome-wide techniques, a great number of candidate genes are discovered. How to identify the most likely disease genes among a large number of candidates becomes a fundamental challenge in human health. A common view is that genes related to a specific or similar disease tend to reside in the same neighbourhood of biomolecular networks. Recently, based on such observations, many methods have been developed to tackle this challenge. In this review, we firstly introduce the concept of disease genes, their properties, and available data for identifying them. Then we review the recent computational approaches for prioritizing candidate disease genes based on Protein-Protein Interaction (PPI) networks and investigate their advantages and disadvantages. Furthermore, some pieces of existing software and network resources are summarized. Finally, we discuss key issues in prioritizing candidate disease genes and point out some future research directions.