Scientific Updates

Nature Communications | Zhang Zemin's lab publishes a new method for rapid annotation of single-cell transcriptomic data

On April 14, the bioinformatics methodology paper "SciBet as a portable and fast single cell type identifier" was published in Nature Communications by the laboratory of Zemin Zhang at CLS (Center of Life Sciences, PKU), BIOPIC (Biomedical Pioneering Innovation Center, PKU), ICG (Beijing Advanced Innovation Center of Genomics, PKU) and Analytical Biosciences. The paper marks the official release of SciBet, a new tool for rapid, supervised cell type annotation of single-cell transcriptome data.

Single-cell RNA sequencing (scRNA-seq) reveals the gene expression state of individual cells, captures the heterogeneity between cells, and provides an important way to identify the functions of different cell types. As sequencing technology has developed, the cost of sequencing a single-cell transcriptome has kept falling, and data set sizes have grown exponentially. At the same time, the applications of scRNA-seq have expanded from isolated tissues to systematic atlases of entire species. At present, cell types are usually identified with unsupervised methods: cells are clustered into groups, and the likely biological function of each group is then inferred from its marker genes. Existing data could instead serve as a reference, with a supervised method used to annotate newly generated data sets, which would greatly accelerate research. In recent years, supervised cell type annotation tools such as scmap and Seurat3 have appeared one after another, and their classification accuracy has largely plateaued. However, because these methods are non-parametric, they consume too much computing time on extremely large data sets, such as the Human Cell Atlas data set of hundreds of millions of cells that is expected in the future.

SciBet, developed by PhD students Chenwei Li and Baolin Liu and associate researcher Xianwen Ren, effectively solves this problem. It starts from the basic assumption that the expression profiles of single cells of the same type follow the same multinomial distribution. Each cell type in the training set is modeled separately, and cells in the test set are then annotated by maximum likelihood estimation, that is, each cell is assigned to the cell type whose model gives its expression profile the highest likelihood. In cross-validation tests on a batch of gold-standard data sets, SciBet not only achieved a small lead in accuracy over scmap and Seurat3, but was also about a thousand times faster. With SciBet, users can perform supervised cell type prediction at a rate on the order of 100,000 cells per second using only a personal computer. The project also evaluated SciBet in practical scenarios such as cross-dataset, cross-sequencing-platform, and cross-species annotation. The results showed that SciBet performs supervised cell type annotation robustly and accurately. When the test set contains cell types that are not covered by the training set, SciBet can correctly exclude those cells while maintaining accurate annotations for the remaining cells.
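
The modeling idea described above can be illustrated with a minimal Python sketch, assuming the per-type distribution is multinomial. This is not SciBet's actual implementation: the released tool is an R package, and the sketch leaves out everything beyond the core model, including the exclusion of cell types absent from the training set. The function names `train_multinomial` and `annotate`, the pseudocount, and the toy count matrices are all assumptions made for illustration.

```python
import numpy as np

def train_multinomial(counts, labels, pseudocount=1.0):
    """Fit one multinomial model per cell type from a cells x genes count matrix."""
    labels = np.asarray(labels)
    classes = np.unique(labels)  # sorted unique cell type names
    # Pool the counts of all cells of each type; the pseudocount (an assumption)
    # avoids zero probabilities for genes never observed in a type.
    totals = np.stack([counts[labels == c].sum(axis=0) for c in classes]) + pseudocount
    probs = totals / totals.sum(axis=1, keepdims=True)
    return classes, np.log(probs)  # log_probs has shape (n_types, n_genes)

def annotate(counts, classes, log_probs):
    """Assign each cell to the cell type with the highest multinomial log-likelihood."""
    # The multinomial coefficient is the same for every type, so it is dropped.
    scores = counts @ log_probs.T  # shape (n_cells, n_types)
    return classes[np.argmax(scores, axis=1)]

# Toy reference data: 4 cells x 3 genes, two hypothetical cell types.
train_counts = np.array([[9, 1, 0],
                         [8, 2, 0],
                         [0, 1, 9],
                         [1, 0, 9]])
train_labels = ["T", "T", "B", "B"]

classes, log_probs = train_multinomial(train_counts, train_labels)

# Two unlabeled query cells.
test_counts = np.array([[7, 1, 1],
                        [0, 2, 6]])
predicted = annotate(test_counts, classes, log_probs)
print(predicted)
```

Because annotation reduces to a single matrix product between the query counts and a small table of log-probabilities, a parametric classifier of this kind scales linearly with the number of query cells, which is consistent with the speed advantage reported above.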

Figure: Algorithms, benchmarks and applications of SciBet

Because the SciBet algorithm uses a straightforward parametric model, it is not only fast and highly interpretable; the trained model is also very compact, with a size that depends only on the number of cell types. For example, a typical SciBet model trained on a data set with dozens of cell types is smaller than 1 MB. Building on this "portable" feature, the project has also released nearly one hundred SciBet models pre-trained on high-quality data sets, all of which can be imported directly into SciBet's R package. In addition, the project provides an online version of SciBet based on JavaScript (http://scibet.cancer-pku.cn/). Users can identify cell types quickly in the browser without uploading their own data to the server: the pre-trained model is loaded online, the user's data set is loaded locally, and the classification results are visualized. Associate researcher Xianwen Ren said: “As a rapid single cell type identification method designed for ultra-large-scale datasets, SciBet will have an important and positive impact on the field of single-cell sequencing in the future.”
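
As a rough illustration of the model size mentioned above, the following Python sketch estimates the storage for a model that keeps one probability per gene per cell type. The storage layout, the figure of 1,000 genes, and the 32-bit precision are assumptions for illustration, not details taken from the paper; only "dozens of cell types" comes from the text.

```python
import numpy as np

n_cell_types = 30   # "dozens of cell types", as in the text
n_genes = 1000      # hypothetical number of genes kept in the model

# One 32-bit probability per gene per cell type (assumed layout).
model = np.zeros((n_genes, n_cell_types), dtype=np.float32)

size_kb = model.nbytes / 1024
print(f"{size_kb:.1f} KB")
```

Under these assumptions the table occupies about 117 KB, well under 1 MB, and its size does not grow with the number of cells used for training.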

PhD students Chenwei Li at CLS and Baolin Liu at BIOPIC/SLS are co-first authors. Associate researcher Xianwen Ren and Professor Zemin Zhang at CLS/BIOPIC/SLS are the co-corresponding authors. This work was supported by the National Natural Science Foundation of China, ICG, and Analytical Biosciences.