Scientific Updates

Nature Communications | Modeling regulatory cis-elements for functional annotation of transcription-modulating genetic variants

Cells are the fundamental building blocks of life, and their normal functioning relies on precisely regulated gene expression, of which transcriptional regulation is a critical component. Approximately 98% of the human genome is noncoding, and about 80% of these regions are potentially involved in transcriptional regulation. The spatiotemporally specific expression of genes depends on complex regulatory networks involving various cis-regulatory elements, which typically function through combinatorial interactions. Identifying regulatory elements within the genome and deciphering the principles of transcriptional regulation will deepen our understanding of cellular functions and provide insights into disease pathogenesis.

 

On December 30, 2024, the research team led by Prof. Ge Gao at the Biomedical Pioneering Innovation Center (BIOPIC) of Peking University/Changping Laboratory published a research paper titled "Quantifying the regulatory potential of genetic variants via a hybrid sequence-oriented model with SVEN" in Nature Communications. The study introduced SVEN, a multi-modality, sequence-oriented in silico model for quantifying the regulatory effects of cis-elements across more than 350 tissues and cell lines, and applied the model to annotate the transcriptomic impacts of genetic variants, including large-scale structural variations (SVs) as well as small-scale SNVs/indels. This provides a valuable methodological foundation and data repository for a deeper understanding of cellular regulatory landscapes at the sequence level.

 

SVEN adopts a hybrid model architecture that is distinct from the canonical "one-holistic-network-for-all" design: it first learns "regulatory rules" from large-scale data through multiple class-oriented holistic models and feature-oriented separate models, and then applies these rules to infer tissue-specific gene expression levels directly from sequences (Figure 1):

  • A set of sequence-based deep neural networks that learn regulatory codes from sequences to predict functional genomic features (TF binding, histone modification, and DNA accessibility). The basic idea is to combine feature-oriented models (which learn context-sensitive sequence-to-regulatory codes, such as CTCF binding events in K562) with class-oriented models (which learn more generalized rules, such as TF binding across different cell lines/tissues) to make better use of the available data.
  • A feature selection and transformation module to remove redundant features and reduce the dimensionality of the features.
  • A set of gradient-boosting tree models to predict gene transcription levels based on the transformed functional genomic features. Each model corresponds to one tissue or cell type.
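
The three stages listed above can be illustrated with a minimal, runnable sketch. It is not SVEN's implementation: the dinucleotide counter stands in for the deep neural networks, a variance filter stands in for the feature selection and transformation module, and a small stump-based booster stands in for the gradient-boosting tree models. All function names, the tissue names, and the synthetic expression labels are hypothetical.

```python
import random

# Stage 1 stand-in (assumption): SVEN uses deep neural networks here. This toy
# version counts dinucleotides so the sketch runs without trained weights.
DINUCS = [a + b for a in "ACGT" for b in "ACGT"]

def predict_features(seq):
    return [sum(seq[i:i + 2] == d for i in range(len(seq) - 1)) for d in DINUCS]

# Stage 2 stand-in: keep the k features with the highest variance.
def select_features(rows, k):
    n = len(rows)
    def variance(j):
        col = [r[j] for r in rows]
        mean = sum(col) / n
        return sum((v - mean) ** 2 for v in col) / n
    ranked = sorted(range(len(rows[0])), key=variance, reverse=True)
    return sorted(ranked[:k])

# Stage 3 stand-in: gradient boosting with depth-1 regression trees (stumps).
def fit_stump(X, residuals):
    best = None
    for j in range(len(X[0])):
        for t in sorted(set(r[j] for r in X)):
            left = [res for r, res in zip(X, residuals) if r[j] <= t]
            right = [res for r, res in zip(X, residuals) if r[j] > t]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    if best is None:  # every feature is constant: fall back to the mean residual
        m = sum(residuals) / len(residuals)
        return 0, float("inf"), m, m
    return best[1:]

def fit_boosting(X, y, n_rounds=50, lr=0.1):
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        j, t, lm, rm = fit_stump(X, residuals)
        stumps.append((j, t, lm, rm))
        pred = [p + lr * (lm if r[j] <= t else rm) for p, r in zip(pred, X)]
    return base, lr, stumps

def predict(model, row):
    base, lr, stumps = model
    return base + sum(lr * (lm if row[j] <= t else rm) for j, t, lm, rm in stumps)

# Toy data: random sequences with synthetic "expression" for two made-up tissues.
random.seed(0)
seqs = ["".join(random.choice("ACGT") for _ in range(60)) for _ in range(40)]
features = [predict_features(s) for s in seqs]
keep = select_features(features, k=8)
X = [[row[j] for j in keep] for row in features]
labels = {
    "tissue_A": [s.count("G") + s.count("C") for s in seqs],
    "tissue_B": [s.count("A") for s in seqs],
}
# One boosted model per tissue, mirroring the per-tissue design described above.
models = {tissue: fit_boosting(X, y) for tissue, y in labels.items()}
for tissue, y in labels.items():
    mse = sum((v - predict(models[tissue], r)) ** 2 for v, r in zip(y, X)) / len(y)
    print(tissue, "training MSE:", round(mse, 3))
```

The point of the sketch is the division of labor: the sequence-to-feature stage is shared across tissues, while only the lightweight final regressors are tissue-specific.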

 

Figure 1. Architecture of the SVEN model

 

Benefiting from this design, SVEN consistently outperforms Enformer, which follows the canonical “one-holistic-network-for-all” approach, in predicting tissue-specific gene expression levels and in assessing the effects of variants on gene expression (Figure 2), while being roughly 40% smaller (153M for SVEN's largest model versus 249M for Enformer).
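
The size comparison can be checked with a line of arithmetic. The snippet below assumes the quoted "153M" and "249M" are parameter counts, which the text does not state explicitly.

```python
sven_largest = 153e6  # "153M" as quoted above (assumed to be a parameter count)
enformer = 249e6      # "249M" as quoted above (same assumption)

reduction = 1 - sven_largest / enformer
print(f"{reduction:.1%}")  # 38.6%
```

The exact figure is about 39%, which the text rounds to 40%.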

 

Figure 2. SVEN predicts tissue-specific gene transcription level accurately

 

Genetic variants are alterations in the nucleotide sequence of the genome, and variants that occur in noncoding regions are called noncoding variants. Genome-wide association studies have revealed that more than 90% of the genetic variants linked to diseases and traits reside in noncoding regions. Large-scale whole-genome sequencing studies have provided a high-resolution map of human genetic variants, encompassing both small-scale variants (≤ 50 bp) and large-scale SVs (> 50 bp); the latter can have a more substantial impact on biological functions because of their larger size. However, investigating the impact of SVs on gene transcriptional regulation at the whole-genome scale remains a challenge.
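
As a small illustration of the size boundary mentioned above, the following sketch labels a variant from its reference and alternate alleles. The 50 bp threshold comes from the text; measuring size as the length of the longer allele and the label names are assumptions made for this example only.

```python
SV_THRESHOLD_BP = 50  # from the text: small variants are <= 50 bp, SVs are > 50 bp

def variant_size(ref, alt):
    # Assumption: size is taken as the length of the longer allele.
    return max(len(ref), len(alt))

def classify_variant(ref, alt):
    if variant_size(ref, alt) > SV_THRESHOLD_BP:
        return "SV"
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(ref) != len(alt):
        return "indel"
    return "other_small"  # e.g. a multi-nucleotide substitution

print(classify_variant("A", "G"))             # SNV
print(classify_variant("ACGT", "A"))          # indel
print(classify_variant("A" + "C" * 60, "A"))  # SV
```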

 

The authors assessed SVEN's ability to predict the regulatory impact of SVs: SVEN demonstrated high accuracy, with a Spearman correlation of 0.921 between predicted and observed expression levels derived from paired RNA-seq data (Figure 3). Notably, a deletion upstream of FOLH1, the gene encoding the cancer biomarker PSMA, disrupts the promoter region. An annotation-based algorithm predicted that this deletion would barely affect gene transcription, whereas SVEN correctly predicted an increase in expression, partly because its annotation module indicated that the variant increases the expression-activating H3K4me3 and H3K27ac signals rather than deleting known silencers or insulators. This finding suggests a plausible mechanism underlying the observed effect of the deletion.
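
The kind of evaluation described above can be sketched in a few lines: build the alternate sequence by applying a deletion, compare predictions for the reference and alternate sequences, and measure rank agreement with Spearman correlation. This is a generic illustration, not SVEN's code or API; `toy_expression` is a hypothetical stand-in for a trained predictor, and the sequences are invented.

```python
import math

def apply_deletion(ref_seq, start, end):
    """Return ref_seq with positions [start, end) removed (0-based, end-exclusive)."""
    return ref_seq[:start] + ref_seq[end:]

def toy_expression(seq):
    # Hypothetical stand-in for a trained model: a positive score that grows with GC fraction.
    gc = sum(base in "GC" for base in seq)
    return 1.0 + 10.0 * gc / len(seq)

def log2_fold_change(ref_seq, alt_seq, predict=toy_expression):
    # Positive values mean the variant is predicted to increase expression.
    return math.log2(predict(alt_seq) / predict(ref_seq))

def average_ranks(values):
    # 1-based ranks; tied values share the average of their positions.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    # Spearman correlation is the Pearson correlation of the ranks.
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

ref = "ATATATATGCGCGCGC"
alt = apply_deletion(ref, 0, 8)  # remove the AT-rich half
print(alt, log2_fold_change(ref, alt) > 0)

predicted = [0.2, 1.5, 0.9, 3.1]  # made-up values for illustration
observed = [0.1, 1.1, 1.3, 2.8]
print(round(spearman(predicted, observed), 3))
```

Because Spearman correlation depends only on ranks, it rewards getting the ordering of effects right even when predicted and observed values sit on different scales.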

Figure 3. SVEN accurately quantifies the regulatory potential of genetic variants

 

SVEN is publicly available at https://github.com/gao-lab/SVEN.

 

Dr. Yu Wang from Peking University (now a postdoctoral researcher at Changping Laboratory) is the first author of the paper. Nan Liang was responsible for conducting the CRISPR experiments. This research received support from the National Key Research and Development Program, the State Key Laboratory of Protein and Plant Gene Research, the Beijing Advanced Innovation Center for Genomics, and Changping Laboratory. Computational analysis was conducted on the High-performance Computing Platform of Changping Laboratory, the High-performance Computing Platform of the Center for Life Science, and the High-performance Computing Platform of Peking University.

 

Link to the paper: https://doi.org/10.1038/s41467-024-55392-7.