Scientific Updates

Searching large-scale scRNA-seq databases via unbiased cell embedding with Cell BLAST

As a powerful tool for studying cellular heterogeneity, single-cell transcriptomic sequencing has developed rapidly in recent years, and large amounts of data continue to accumulate. If used effectively for querying and inference, these existing data would greatly benefit the annotation of newly sequenced cells as well as cross-dataset integrative studies. However, accurate single-cell transcriptomic querying and annotation is hampered by two primary challenges: first, batch effects between datasets significantly undermine the reliability of cell querying; second, well-curated single-cell transcriptomics databases that span species and platforms are currently lacking.

 

Recently, Ge Gao's group from the Biomedical Pioneering Innovation Center (BIOPIC), Beijing Advanced Innovation Center for Genomics (ICG), Center for Bioinformatics (CBI), and State Key Laboratory of Protein and Plant Gene Research at the School of Life Sciences, Peking University, published a bioinformatics paper titled "Searching large-scale scRNA-seq databases via unbiased cell embedding with Cell BLAST" in Nature Communications. The paper releases Cell BLAST, a deep learning-based method for single-cell transcriptomic querying and annotation, together with ACA, a well-curated single-cell transcriptomic reference database, providing new tools and resources for using existing data effectively in cell annotation and cross-dataset studies.

Cell BLAST uses an adversarial autoencoder (AAE) to reduce the dimensionality of transcriptomic data and applies adversarial domain adaptation to eliminate batch effects among datasets, outperforming existing tools designed for a similar purpose. In addition, the authors propose a novel and more accurate model-based cell-to-cell similarity measure for cell querying, which by design accounts for the intrinsic uncertainty in single-cell transcriptomic observations.
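To make the idea of an uncertainty-aware similarity measure more concrete, the following is a minimal sketch and not the measure defined in the paper. It assumes each cell is represented by a handful of samples from its embedding posterior, projects those samples onto the axis joining the two posterior means, and scales the gap by the pooled spread. The function name projected_distance and the toy data are hypothetical.

```python
import math
from statistics import mean, pstdev


def projected_distance(samples_a, samples_b):
    """Illustrative uncertainty-aware distance between two cells.

    Each argument is a list of posterior samples (lists of floats) for one
    cell. This is a simplified stand-in, not the formulation from the paper.
    """
    dim = len(samples_a[0])
    mean_a = [mean(s[i] for s in samples_a) for i in range(dim)]
    mean_b = [mean(s[i] for s in samples_b) for i in range(dim)]

    # Axis joining the two posterior means.
    axis = [b - a for a, b in zip(mean_a, mean_b)]
    norm = math.sqrt(sum(x * x for x in axis))
    if norm == 0.0:
        return 0.0
    unit = [x / norm for x in axis]

    # Project every posterior sample onto that axis.
    proj_a = [sum(u * x for u, x in zip(unit, s)) for s in samples_a]
    proj_b = [sum(u * x for u, x in zip(unit, s)) for s in samples_b]

    # Scale the gap between projected means by the pooled spread.
    pooled_sd = math.sqrt((pstdev(proj_a) ** 2 + pstdev(proj_b) ** 2) / 2)
    gap = abs(mean(proj_b) - mean(proj_a))
    return gap / pooled_sd if pooled_sd > 0 else math.inf


# Two pairs of cells with the same gap between means but different noise.
tight_a = [[-0.1, 0.0], [0.1, 0.0]]
tight_b = [[0.9, 0.0], [1.1, 0.0]]
loose_a = [[-0.5, 0.0], [0.5, 0.0]]
loose_b = [[0.5, 0.0], [1.5, 0.0]]

tight = projected_distance(tight_a, tight_b)
loose = projected_distance(loose_a, loose_b)
print(tight > loose)
```

In this toy setup the two pairs have identical mean embeddings, yet the noisier pair receives a smaller distance, which is the behavior a plain Euclidean distance between point estimates cannot express.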

 

Apart from identifying known cell types, Cell BLAST can also identify, with high sensitivity, novel cell types that are absent from the reference data (Fig. 1a, c). Additionally, the authors use a series of hematopoietic differentiation datasets to verify that Cell BLAST can also be used to annotate continuous cell states (Fig. 1d, f).
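One common way to flag potentially novel cell types in a query-based workflow is to reject query cells that have too few confident reference hits. The sketch below illustrates that idea under stated assumptions; it is not taken from the Cell BLAST code base. The function annotate_query, its thresholds, and the (label, p-value) hit format are all hypothetical.

```python
from collections import Counter


def annotate_query(hits, pval_cutoff=0.05, min_hits=2, majority=0.5):
    """Assign a label to one query cell from its reference hits.

    hits is a list of (reference_label, p_value) pairs. All thresholds are
    illustrative assumptions rather than values from the paper.
    """
    # Keep only hits that pass the significance cutoff.
    significant = [label for label, pval in hits if pval < pval_cutoff]

    # Too few confident hits: the cell may belong to an unseen type.
    if len(significant) < min_hits:
        return "rejected"

    # Require a strict majority among the confident hits.
    label, count = Counter(significant).most_common(1)[0]
    if count / len(significant) <= majority:
        return "ambiguous"
    return label


known = annotate_query([("T cell", 0.01), ("T cell", 0.02), ("B cell", 0.03)])
novel = annotate_query([("T cell", 0.40), ("B cell", 0.60)])
mixed = annotate_query([("T cell", 0.01), ("B cell", 0.01)])
print(known, novel, mixed)
```

Cells returned as "rejected" are the candidates for novel types, since nothing in the reference resembles them closely enough.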

 

Last but not least, by collecting a large number of published scRNA-seq datasets, the authors established a database covering 2,989,582 single cells across 8 species and 27 different organs, termed the Animal Cell Atlas (ACA) (Fig. 1g, h). They extensively curated the cell labels in ACA and used Cell Ontology to construct a set of structured cell type annotations, which unify the annotations across datasets and support ontology-aware inference of cell types.
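As an illustration of what ontology-aware inference can mean in practice, the sketch below falls back to the most specific shared ancestor when reference hits disagree on a fine-grained label. It is a simplified example and does not reproduce how ACA or Cell BLAST use Cell Ontology: the PARENT table is a hypothetical toy tree, whereas the real Cell Ontology is a graph in which a term may have several parents.

```python
# Hypothetical toy hierarchy: each term maps to a single parent term.
PARENT = {
    "CD4-positive T cell": "T cell",
    "CD8-positive T cell": "T cell",
    "T cell": "lymphocyte",
    "B cell": "lymphocyte",
    "lymphocyte": "leukocyte",
    "leukocyte": None,
}


def ancestors(term):
    """Return the term followed by its ancestors, most specific first."""
    path = []
    while term is not None:
        path.append(term)
        term = PARENT[term]
    return path


def common_ancestor(labels):
    """Return the most specific term shared by all labels, or None."""
    labels = list(labels)
    shared = set(ancestors(labels[0]))
    for label in labels[1:]:
        shared &= set(ancestors(label))
    # Walk upward from the first label until a shared term is found.
    for term in ancestors(labels[0]):
        if term in shared:
            return term
    return None


t_cells = common_ancestor(["CD4-positive T cell", "CD8-positive T cell"])
lymphoid = common_ancestor(["CD4-positive T cell", "B cell"])
print(t_cells, lymphoid)
```

With structured annotations of this kind, conflicting fine-grained hits can still yield a coarser but defensible label instead of no label at all.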

 

Figure 1. Cell BLAST application and ACA.

 

The project provides an online cell querying platform (https://cblast.gao-lab.org), where users can directly upload their scRNA-seq data to perform cell querying and automatic annotation based on the reference datasets in ACA. An open-source Python package, Cell BLAST (https://github.com/gao-lab/Cell_BLAST), is also provided; it enables model training, cell querying, and customized analysis on custom reference datasets.

 

Zhi-Jie Cao and Lin Wei from the School of Life Sciences, Peking University, are the co-first authors of the paper. Dr. Ge Gao is the corresponding author. Shen Lu and De-Chang Yang contributed to the construction of the web portal. The project was supported by funds from the National Key Research and Development Program, the China 863 Program, as well as the State Key Laboratory of Protein and Plant Gene Research and the Beijing Advanced Innovation Center for Genomics (ICG) at Peking University.

 

Reference:

Cao, Z-J. et al. Searching large-scale scRNA-seq databases via unbiased cell embedding with Cell BLAST. Nat Commun 11, 3458 (2020). https://doi.org/10.1038/s41467-020-17281-7