22. Bulk deconvolution#

TL;DR We provide a brief overview over basic concepts of cell type deconvolution including input structure, data preprocessing and analysis of the output data.

22.1. Background#

Studying differences in cell type compositions within a bulk tissue is important for several reasons. First, interactions between cell types play a major role in disease progression and recovery. Second, molecular patterns (such as gene expression and protein abundance) are often directly associated with the cell types compositions within the tissue. Understanding these compositions is crucial for studying the underlying molecular mechanisms of a biological condition such as disease. Third, uncovering disease-specific cell type patterns allow for better therapeutic targeting, providing an important clinical utility.

Cell type deconvolution is a computational framework designed for inferring the compositions of cell populations within a bulk heterogeneous tissue [Du et al., 2019, Kuhn et al., 2012, Schwartz, 2010, Zaitsev et al., 2019]. As measuring these compositions experimentally is time-consuming and expensive, deconvolution methods allow large scale analysis of cell populations based on molecular data. Deconvolution methods usually follow linear regression defined as:

\(y = bX\)

where, \(y\) refers to a mixture of heterogeneous gene expression profile using common molecular pipelines (e.g. microarray or RNA-seq), \(X\) is a signature matrix containing homogeneous cell type-specific profiles, and \(b\) is the vector of cellular proportions in the mixture data inferred by the deconvolution method [Baron et al., 2016]. To select an optimal deconvolution method suitable for the targeted biological conditions, the impact of several technical and biological factors should be considered including: the deconvolution method, reference data with missing or rare cell types, data normalization and the selection of features (markers).

The signature matrix \(X\), used to infer cellular compositions, reflects our best knowledge of cellular heterogeneity within the tissue and highly influences the success of the deconvolution process [Aliee and Theis, 2021]. Originally, signature matrices were generated by profiling sorting cells from a heterogeneous tissue (using FACS or CyTOF), suffering from internal biases due to the pre-selection of cell type panels and the absence of appropriate antibodies [Aran et al., 2017, Monaco et al., 2019]. Today, these matrices are mainly generated as unbiased profiles using single cell technologies, allowing the generation of signature matrices across different organisms, tissues and biological conditions [Aliee and Theis, 2021, Newman et al., 2019].

22.2. Approaches#

Bulk deconvolution approaches can be divided into linear regression based methods, enrichment based methods, non-linear deep-learning based methods and others.

The most common type of cell type deconvolution methods are linear regression-based. These methods try to directly solve the \(y = bX\) equation using different regularizations and relying on a relatively high number of features. Examples of such tools are CIBERSORTx [Newman et al., 2019], MuSiC [], dtangle [] and DWLS [].

Enrichment based tools on the other hand calculate an enrichment score for each cell type separately, based on a gene set that represents this cell type. The enrichment scores for all cell types are then combined and transformed to compositions by a method-specific transformation function. As these methods only consider a single cell type at a time, tend to provide meaningful insights in simple cases, but are less accurate when reference data includes many cell types. xCell [Aran et al., 2017] is an example of an enrichment based tool.

The third option pose non-linear deep learning methods that emerged with the promise of improving deconvolution accuracy, while still attempting to maintain high biological interpretability. It is still early to say whether these methods can be better than other types of methods. Scaden [Menden et al., 2020] is an example of such a deep-learning based method.

Bulk deconvolution methods have been benchmarked extensively and the results are overall quite consistent [Nadel et al., 2021, Shen-Orr and Gaujoux, 2013]. Methods using single-cell RNA-seq data show good performance whereas semi-supervised approaches show higher error rates. Failing to include cell types in the reference that are present in the bulk mixture lead to worse results in any case []. Cobos et al. recommend that (1) the input data should be in linear scale, (2) Avoid row scaling, column min-max, column z-score or quantile normalization, (3) regression bulk deconvolution methods like CIBERSORTx or FARDEEP [] perform well and if scRNA-seq data is available DWLS, MuSiC or SCDC [] should also be applied for result comparisons, (4) a stringent marker selection strategy that focuses on the differences between the first and second cell types with highest expression values should be used, (5) a comprehensive reference matrix that includes all relevant cell types that are present in the mixture should be used []. The effect of the normalization strategy is unclear with Li et al. [] suggesting that it has a strong effect on the results which was not confirmed by Cobos et al. [].

Since many bulk deconvolution tools, including the well-performing CIBERSORTx, are only available as web tools, we opt to demonstrate a use-case with MuSic.

It is important to note that MuSiC requires multiple single-cell samples including the same cell types. In practice, our reference profile might not include more than a single sample or some cell types might be missing across the samples. In these cases, MuSiC fails and another deconvolution method should be selected.

22.3. Deconvolving bulk COVID-19 whole blood samples#

In the following use-case we deconvolve 49 bulk whole blood RNA-seq samples collected from 39 COVID-19 patients and 10 healthy controls [] using MuSiC. As single-cell reference dataset we’ll use whole blood single-cell RNA-seq data of patients with COVID-19 [Schulte-Schrepping et al., 2020].

22.3.1. Environment setup#

import scanpy as sc
import anndata
import numpy as np
import pandas as pd
import scipy as sci

22.3.2. Loading Data#

We start by reading the single-cell and bulk data. We do not scale the data as it has been shown that deconvolution methods using scRNA-seq data as reference, perform best when applied to data in linear scale with improved accuracy after library size normalization [].

data_file = "/storage/groups/ml01/workspace/amit.frishberg/OriginalData/"
adata = sc.read(data_file + "seurat_COVID19_freshWB_PBMC_cohort2_incl_raw.h5ad")

adata.X = adata.layers["counts"]
adata = adata[adata.obs["cells"] == "Whole_blood"].copy()
adata
AnnData object with n_obs × n_vars = 89883 × 33417
    obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'nCount_HTO', 'nFeature_HTO', 'percent.mito', 'percent.hb', 'HTO_maxID', 'HTO_secondID', 'HTO_margin', 'HTO_classification', 'HTO_classification.global', 'hash.ID', 'demultID', 'donor', 'onset_of_symptoms', 'days_after_onset', 'sampleID', 'date_of_sampling', 'experiment', 'cartridge', 'platform', 'purification', 'cells', 'age', 'sex', 'group_per_sample', 'who_per_sample', 'disease_stage', 'diagnosis', 'oxygen', 'outcome', 'comorbidities', 'COVID.19.related_medication_and_anti.microbials', 'primary_complaint', 'RNA_snn_res.0.8', 'cluster_labels_res.0.8', 'new.order', 'hpca.labels', 'blueprint.labels', 'monaco.labels', 'immune.labels', 'dmap.labels', 'hemato.labels'
    var: 'vst.mean', 'vst.variance', 'vst.variance.expected', 'vst.variance.standardized', 'vst.variable'
    obsm: 'X_pca', 'X_umap'
    layers: 'counts'
adata.obs["cluster_labels_res.0.8"].value_counts()
Neutrophils_1             22714
Neutrophils_2             18675
Neutrophils_3              9986
CD4_T_cells_1              6278
CD14_Monocytes_1           5362
Neutrophils_4              4710
CD8_T_cells                3255
Megakaryocytes             3180
NK_cells                   2916
B_cells_1                  2561
CD4_T_cells_2              1917
Mixed_cells                1727
Immature Neutrophils_1     1317
CD16_Monocytes              983
Immature Neutrophils_2      981
CD14_Monocytes_3            632
Eosinophils                 579
CD14_Monocytes_2            556
CD4_T_cells_3               409
Plasmablast                 390
Prol. cells                 247
mDC                         235
B_cells_2                   138
pDC                          86
CD34+ GATA2+ cells           49
Name: cluster_labels_res.0.8, dtype: int64
bulk = pd.read_csv(data_file + "BulkSmall.txt", sep="\t", index_col=0)
metadata = pd.read_csv(data_file + "annoSmall.txt", sep="\t", index_col=0)
metadata.index = metadata.index.astype("str")
metadata = metadata.loc[bulk.transpose().index]

22.3.3. Data preprocessing#

After reading the data, we need to remove undefined cells from our single-cell reference data. Here, we remove mixed and proliferating cells which are not limited to a single cell type.

Note Quality control by means of removing low quality cells and genes has already been perform on the single-cell data. So, we skip that.

adata = adata[
    ~adata.obs["cluster_labels_res.0.8"].isin(["None", "Mixed_cells", "Prol. cells"])
].copy()
ct_counts = adata.obs["cluster_labels_res.0.8"].value_counts()
adata.obs["cluster_labels_res.0.8"].value_counts()
Neutrophils_1             22714
Neutrophils_2             18675
Neutrophils_3              9986
CD4_T_cells_1              6278
CD14_Monocytes_1           5362
Neutrophils_4              4710
CD8_T_cells                3255
Megakaryocytes             3180
NK_cells                   2916
B_cells_1                  2561
CD4_T_cells_2              1917
Immature Neutrophils_1     1317
CD16_Monocytes              983
Immature Neutrophils_2      981
CD14_Monocytes_3            632
Eosinophils                 579
CD14_Monocytes_2            556
CD4_T_cells_3               409
Plasmablast                 390
mDC                         235
B_cells_2                   138
pDC                          86
CD34+ GATA2+ cells           49
Name: cluster_labels_res.0.8, dtype: int64

It has been shown that deconvolution methods commonly fail to predict the proportion of rare cells []. In case that you want to remove rare cell types, you can limiting cell types to be above cellTypeNumCutOff cells. This cutoff is user-defined and should be selected based on the data.

# removing very rare cells
rare_ct_cut_off = 50  # This is a user-specific parameter based on the data
ct_to_keep = ct_counts[ct_counts > rare_ct_cut_off].index
adata = adata[adata.obs["cluster_labels_res.0.8"].isin(ct_to_keep)].copy()
adata.obs["cluster_labels_res.0.8"].value_counts()
Neutrophils_1             22714
Neutrophils_2             18675
Neutrophils_3              9986
CD4_T_cells_1              6278
CD14_Monocytes_1           5362
Neutrophils_4              4710
CD8_T_cells                3255
Megakaryocytes             3180
NK_cells                   2916
B_cells_1                  2561
CD4_T_cells_2              1917
Immature Neutrophils_1     1317
CD16_Monocytes              983
Immature Neutrophils_2      981
CD14_Monocytes_3            632
Eosinophils                 579
CD14_Monocytes_2            556
CD4_T_cells_3               409
Plasmablast                 390
mDC                         235
B_cells_2                   138
pDC                          86
Name: cluster_labels_res.0.8, dtype: int64

We need to filter for the shared genes across bulk and single-cell data before selecting highly variable genes.

bulk_sc_genes = np.intersect1d(bulk.index, adata.var_names)
bulk = bulk.loc[bulk_sc_genes, :].copy()
adata = adata[:, bulk_sc_genes].copy()

22.3.4. Visualization of single-cell data#

For visualization of single-cell data, we first normalize the counts, select highly-variable genes and log transform the data.

sc.pp.normalize_per_cell(adata, counts_per_cell_after=1e4, copy=False)
sc.pp.highly_variable_genes(adata, flavor="cell_ranger", n_top_genes=5000)
adata_log = sc.pp.log1p(
    adata, copy=True
)  # logged counts are only used for visualisation (can also work with layers)

We then reduce the dimensionality using PCA and visualize the data using UMAP:

sc.tl.pca(adata_log)
adata_log.obsm["X_pca"] *= -1  # multiply by -1 to match Seurat
sc.pp.neighbors(adata_log, n_neighbors=30)
sc.tl.umap(adata_log)
sc.pl.umap(adata_log, color="cluster_labels_res.0.8")
../_images/cc15b4ebe38bcc73aaab5123a6723b4e1b0754915d49bc165b6c2663a4b0aecd.png

22.3.5. Deconvolving using MuSiC#

22.3.5.1. Loading R#

# R interface
from rpy2.robjects import pandas2ri
from rpy2.robjects import r
import rpy2.rinterface_lib.callbacks
import anndata2ri

pandas2ri.activate()
anndata2ri.activate()
%load_ext rpy2.ipython
C:\Users\Shennor\.conda\envs\eharpy\lib\site-packages\rpy2\robjects\packages.py:365: UserWarning: The symbol 'quartz' is not in this R namespace/package.
  warnings.warn(

To use the R scripts, we need to convert the data from Python to R. In some cases, this requires sub-sampling of the data to avoid memory limitations.

import random
import itertools

downSamplingSize = 80
downSamplingIndexes = [
    random.sample(
        np.where(currCell == adata.obs["cluster_labels_res.0.8"])[0].tolist(),
        np.min(
            [downSamplingSize, np.sum(currCell == adata.obs["cluster_labels_res.0.8"])]
        ),
    )
    for currCell in np.unique(adata.obs["cluster_labels_res.0.8"])
]
downSamplingIndexes = list(itertools.chain(*downSamplingIndexes))

adata_r = adata[downSamplingIndexes].copy()
adata_r
AnnData object with n_obs × n_vars = 1809 × 26807
    obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'nCount_HTO', 'nFeature_HTO', 'percent.mito', 'percent.hb', 'HTO_maxID', 'HTO_secondID', 'HTO_margin', 'HTO_classification', 'HTO_classification.global', 'hash.ID', 'demultID', 'donor', 'onset_of_symptoms', 'days_after_onset', 'sampleID', 'date_of_sampling', 'experiment', 'cartridge', 'platform', 'purification', 'cells', 'age', 'sex', 'group_per_sample', 'who_per_sample', 'disease_stage', 'diagnosis', 'oxygen', 'outcome', 'comorbidities', 'COVID.19.related_medication_and_anti.microbials', 'primary_complaint', 'RNA_snn_res.0.8', 'cluster_labels_res.0.8', 'new.order', 'hpca.labels', 'blueprint.labels', 'monaco.labels', 'immune.labels', 'dmap.labels', 'hemato.labels'
    var: 'vst.mean', 'vst.variance', 'vst.variance.expected', 'vst.variance.standardized', 'vst.variable'
    obsm: 'X_pca', 'X_umap'
    layers: 'counts'

If sub-sampling is not required (data is relatively small), one can simply convert the entire AnnData to an R object

adata_r = adata.copy()
adata_r
AnnData object with n_obs × n_vars = 87860 × 26807
    obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'nCount_HTO', 'nFeature_HTO', 'percent.mito', 'percent.hb', 'HTO_maxID', 'HTO_secondID', 'HTO_margin', 'HTO_classification', 'HTO_classification.global', 'hash.ID', 'demultID', 'donor', 'onset_of_symptoms', 'days_after_onset', 'sampleID', 'date_of_sampling', 'experiment', 'cartridge', 'platform', 'purification', 'cells', 'age', 'sex', 'group_per_sample', 'who_per_sample', 'disease_stage', 'diagnosis', 'oxygen', 'outcome', 'comorbidities', 'COVID.19.related_medication_and_anti.microbials', 'primary_complaint', 'RNA_snn_res.0.8', 'cluster_labels_res.0.8', 'new.order', 'hpca.labels', 'blueprint.labels', 'monaco.labels', 'immune.labels', 'dmap.labels', 'hemato.labels', 'n_counts'
    var: 'vst.mean', 'vst.variance', 'vst.variance.expected', 'vst.variance.standardized', 'vst.variable', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'hvg'
    obsm: 'X_pca', 'X_umap'
    layers: 'counts'

22.3.5.2. Running MuSiC#

%%R
library(MuSiC)
library(Biobase)

We will first gather all the remaining inputs needed for running MuSiC:
1. adata_r - contains the single cell matrix which will used as a reference.
2. cell_subsets_r - cell type identities.
3. bulk - the bulk samples which the deconvolution process will be conducted on.
4. sc_genes - gene markers used in the MuSiC analysis.

cell_subsets_r = adata_r.obs["cluster_labels_res.0.8"].astype(str).copy()
sc_genes = adata_r.var_names

MuSiC runs the deconvolution framework one sample at a time. Here, 9*** are sample names.

%%R -i adata_r,cell_subsets_r,bulk,sc_genes -o musicRes
df = data.frame(cellNames = cell_subsets_r, Sample = factor(rep(1, dim(adata_r@colData)[1])))
row.names(df) = row.names(adata_r@colData)
df = new("AnnotatedDataFrame", data = df) # Cell type identities are stored as an AnnotatedDataFrame

# Creating an ExpressionSet from the single cell matrix
scDataMatrix = Matrix::as.matrix(adata_r@assays@data@listData[[1]])
row.names(scDataMatrix) = sc_genes
scDataMatrix = scDataMatrix[rowSums(scDataMatrix)>0,] # Removing genes with no reads
SCDataES <- Biobase::ExpressionSet(assayData=scDataMatrix,phenoData = df, protocolData = df)

bulkDataES <- Biobase::ExpressionSet(assayData=as.matrix(bulk)) # Creating an ExpressionSet from the bulk matrix
musicRes = MuSiC::music_prop(bulk.eset = bulkDataES, sc.eset = SCDataES, clusters = 'cellNames') # Running MuSiC
R[write to console]: Creating Relative Abundance Matrix...

R[write to console]: Creating Variance Matrix...

R[write to console]: Creating Library Size Matrix...

R[write to console]: Used 20407 common genes...

R[write to console]: Used 23 cell types in deconvolution...

R[write to console]: 9088 has common genes 18250 ...

R[write to console]: 9089 has common genes 17774 ...

R[write to console]: 9091 has common genes 18598 ...

R[write to console]: 9092 has common genes 16184 ...

R[write to console]: 9093 has common genes 18585 ...

R[write to console]: 9094 has common genes 18701 ...

R[write to console]: 9095 has common genes 18646 ...

R[write to console]: 9096 has common genes 17247 ...

R[write to console]: 9097 has common genes 17922 ...

R[write to console]: 9098 has common genes 17987 ...

R[write to console]: 9099 has common genes 18223 ...

R[write to console]: 9100 has common genes 18786 ...

R[write to console]: 9101 has common genes 16458 ...

R[write to console]: 9102 has common genes 17604 ...

R[write to console]: 9103 has common genes 17429 ...

R[write to console]: 9104 has common genes 18118 ...

R[write to console]: 9105 has common genes 15286 ...

R[write to console]: 9106 has common genes 17237 ...

R[write to console]: 9107 has common genes 15735 ...

R[write to console]: 9108 has common genes 17387 ...

R[write to console]: 9109 has common genes 16863 ...

R[write to console]: 9110 has common genes 17736 ...

R[write to console]: 9112 has common genes 17513 ...

R[write to console]: 9113 has common genes 14950 ...

R[write to console]: 9114 has common genes 17318 ...

R[write to console]: 9116 has common genes 14518 ...

R[write to console]: 9117 has common genes 13708 ...

R[write to console]: 9118 has common genes 17526 ...

R[write to console]: 9119 has common genes 17274 ...

R[write to console]: 9120 has common genes 16520 ...

R[write to console]: 9121 has common genes 15434 ...

R[write to console]: 9165 has common genes 16907 ...

R[write to console]: 9166 has common genes 16534 ...

R[write to console]: 9167 has common genes 17115 ...

R[write to console]: 9168 has common genes 16621 ...

R[write to console]: 9169 has common genes 17251 ...

R[write to console]: 9170 has common genes 17353 ...

R[write to console]: 9171 has common genes 14603 ...

R[write to console]: 9172 has common genes 14141 ...

R[write to console]: 9122 has common genes 17865 ...

R[write to console]: 9123 has common genes 18009 ...

R[write to console]: 9124 has common genes 18390 ...

R[write to console]: 9125 has common genes 17235 ...

R[write to console]: 9126 has common genes 18432 ...

R[write to console]: 9127 has common genes 17662 ...

R[write to console]: 9128 has common genes 16875 ...

R[write to console]: 9129 has common genes 17396 ...

R[write to console]: 9130 has common genes 19469 ...

R[write to console]: 9131 has common genes 18756 ...
# Create the final output matrix
music_frac = pd.DataFrame(musicRes[0])
music_frac.index = bulk.columns
music_frac.columns = np.unique(cell_subsets_r)

22.3.5.3. Outputs and Validations#

The main output of all cell type deconvolution methods is a NxM matrix, where:
N (rows) - number of samples
M (columns) - the number of cell types
The values of each cell in the matrix represents the cell type composition of a specific cell type within a specific sample. In most cases, cell type compositions in a sample with be shown as fractions therefore, non-negative values, summing to one.

music_frac
B_cells_1 B_cells_2 CD14_Monocytes_1 CD14_Monocytes_2 CD14_Monocytes_3 CD16_Monocytes CD34+ GATA2+ cells CD4_T_cells_1 CD4_T_cells_2 CD4_T_cells_3 ... Immature Neutrophils_2 Megakaryocytes NK_cells Neutrophils_1 Neutrophils_2 Neutrophils_3 Neutrophils_4 Plasmablast mDC pDC
9088 0.025266 0.000007 0.009212 0.007646 0.004111 0.000000 0.021343 0.055596 0.0 0.022603 ... 0.004530 0.037775 0.035206 0.181982 0.000000 0.000000 0.096936 0.007348 0.0 0.000000
9089 0.000000 0.000000 0.011156 0.013301 0.000000 0.000000 0.011891 0.000000 0.0 0.000000 ... 0.000594 0.012610 0.000000 0.402237 0.000000 0.000000 0.117283 0.000086 0.0 0.000000
9091 0.000091 0.000000 0.005645 0.005848 0.000309 0.000247 0.003656 0.000000 0.0 0.000000 ... 0.000000 0.063919 0.011228 0.343444 0.000000 0.002390 0.155726 0.000936 0.0 0.000003
9092 0.004650 0.000000 0.000000 0.000000 0.002024 0.000000 0.003379 0.000000 0.0 0.000000 ... 0.000000 0.091188 0.004478 0.294886 0.000000 0.000000 0.175727 0.001784 0.0 0.000000
9093 0.008431 0.000000 0.000000 0.000000 0.032441 0.000000 0.004801 0.020848 0.0 0.001814 ... 0.000000 0.025210 0.033820 0.100954 0.090013 0.000000 0.276946 0.000504 0.0 0.000000
9094 0.099594 0.000000 0.000425 0.003925 0.000000 0.000000 0.016660 0.016462 0.0 0.027290 ... 0.002317 0.000000 0.029776 0.020330 0.082443 0.000000 0.128015 0.000000 0.0 0.000000
9095 0.001680 0.000000 0.000000 0.048237 0.000000 0.000000 0.006444 0.000000 0.0 0.000000 ... 0.000000 0.017268 0.017518 0.278576 0.000000 0.000000 0.217223 0.000744 0.0 0.000000
9096 0.019351 0.000000 0.000785 0.028895 0.000000 0.000000 0.007645 0.015595 0.0 0.000000 ... 0.000000 0.003820 0.060052 0.277122 0.061254 0.000000 0.026074 0.002298 0.0 0.000000
9097 0.015769 0.000000 0.000399 0.031298 0.000000 0.000000 0.009512 0.028784 0.0 0.000000 ... 0.000000 0.005165 0.061176 0.275073 0.036719 0.000000 0.056550 0.005490 0.0 0.000000
9098 0.015624 0.000000 0.015810 0.011449 0.004735 0.000000 0.011961 0.038063 0.0 0.007417 ... 0.000000 0.012727 0.051597 0.230531 0.002209 0.000000 0.022288 0.013302 0.0 0.000000
9099 0.015419 0.000000 0.000125 0.009149 0.026507 0.000000 0.013332 0.088082 0.0 0.022198 ... 0.000000 0.013670 0.043131 0.073787 0.085893 0.000000 0.084937 0.001153 0.0 0.000000
9100 0.038787 0.000000 0.000000 0.006566 0.000000 0.000000 0.014090 0.035495 0.0 0.002291 ... 0.000648 0.000000 0.038413 0.170412 0.044578 0.000000 0.078720 0.008307 0.0 0.000000
9101 0.003037 0.000000 0.000000 0.025247 0.000000 0.000000 0.002739 0.000516 0.0 0.000000 ... 0.000000 0.063421 0.000801 0.360854 0.005524 0.000000 0.111516 0.000328 0.0 0.000000
9102 0.008490 0.000000 0.000000 0.025538 0.000577 0.000000 0.010780 0.028101 0.0 0.000000 ... 0.000156 0.025812 0.037690 0.284248 0.054286 0.000000 0.014266 0.001326 0.0 0.000000
9103 0.018500 0.013523 0.000000 0.013188 0.000000 0.000000 0.008010 0.079577 0.0 0.022855 ... 0.000000 0.001177 0.058535 0.160046 0.053514 0.000065 0.028745 0.000483 0.0 0.000000
9104 0.035979 0.000430 0.000000 0.021775 0.000000 0.000000 0.028406 0.039231 0.0 0.014846 ... 0.000000 0.000000 0.046156 0.097204 0.000000 0.000000 0.000000 0.005464 0.0 0.000370
9105 0.029334 0.000000 0.012962 0.000000 0.014837 0.028242 0.000414 0.025195 0.0 0.007911 ... 0.000000 0.017720 0.059066 0.177582 0.000000 0.000000 0.000000 0.000380 0.0 0.000000
9106 0.004070 0.000000 0.000000 0.016110 0.000000 0.000000 0.010171 0.027782 0.0 0.000945 ... 0.000000 0.000000 0.014647 0.072228 0.222037 0.000000 0.245327 0.002171 0.0 0.000000
9107 0.012894 0.000000 0.000733 0.030433 0.000000 0.000000 0.029637 0.064172 0.0 0.021531 ... 0.000000 0.025631 0.008832 0.267814 0.000000 0.000000 0.030072 0.000000 0.0 0.000000
9108 0.015228 0.000000 0.000170 0.022236 0.000000 0.000000 0.018144 0.004921 0.0 0.005200 ... 0.000000 0.000000 0.031351 0.211152 0.000000 0.000000 0.131391 0.003772 0.0 0.000000
9109 0.036592 0.000000 0.003224 0.031009 0.004927 0.000000 0.020842 0.071492 0.0 0.021438 ... 0.000212 0.001298 0.054454 0.162173 0.001041 0.000000 0.060488 0.003435 0.0 0.000000
9110 0.017529 0.000000 0.000000 0.017824 0.000000 0.000000 0.018764 0.000000 0.0 0.000000 ... 0.002869 0.040700 0.006030 0.246727 0.000000 0.000000 0.086884 0.003378 0.0 0.000000
9112 0.031905 0.003329 0.003657 0.038107 0.000000 0.000000 0.019290 0.045312 0.0 0.002296 ... 0.000000 0.019050 0.053040 0.092170 0.000000 0.000000 0.000000 0.001793 0.0 0.000000
9113 0.045796 0.000000 0.003949 0.002286 0.000000 0.000000 0.006261 0.027501 0.0 0.000000 ... 0.000000 0.025152 0.062904 0.257408 0.037749 0.000000 0.000000 0.001175 0.0 0.000000
9114 0.079590 0.006749 0.002371 0.021311 0.000000 0.000000 0.017618 0.111433 0.0 0.010972 ... 0.000000 0.000000 0.109078 0.000000 0.000000 0.000000 0.000000 0.000221 0.0 0.000919
9116 0.000000 0.000000 0.000000 0.003446 0.000000 0.000000 0.005269 0.000000 0.0 0.000000 ... 0.000160 0.114610 0.022406 0.243280 0.014482 0.000000 0.112478 0.000000 0.0 0.000000
9117 0.001213 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 ... 0.000000 0.040750 0.001302 0.397878 0.000000 0.000000 0.166156 0.000911 0.0 0.000000
9118 0.019181 0.000000 0.000000 0.041347 0.000000 0.000000 0.015371 0.040876 0.0 0.012439 ... 0.000219 0.000000 0.023484 0.157825 0.031902 0.000000 0.137689 0.004242 0.0 0.000000
9119 0.036951 0.002960 0.000000 0.021283 0.000000 0.000000 0.010837 0.062043 0.0 0.015091 ... 0.000076 0.006685 0.045383 0.197475 0.040867 0.000457 0.026496 0.001799 0.0 0.000200
9120 0.026611 0.000781 0.000000 0.000468 0.000000 0.000000 0.009929 0.040768 0.0 0.003021 ... 0.000000 0.024230 0.019282 0.235651 0.017742 0.000000 0.095742 0.002195 0.0 0.000066
9121 0.005433 0.000000 0.000454 0.056839 0.000000 0.000000 0.019102 0.052365 0.0 0.011261 ... 0.000000 0.048820 0.035044 0.133349 0.024641 0.000000 0.109806 0.002113 0.0 0.000000
9165 0.017479 0.000000 0.000000 0.008826 0.000000 0.000000 0.013337 0.013744 0.0 0.003507 ... 0.001174 0.056218 0.020538 0.298131 0.008824 0.000000 0.125729 0.002626 0.0 0.000000
9166 0.017747 0.000000 0.000000 0.000000 0.000000 0.000000 0.013241 0.025715 0.0 0.000000 ... 0.000000 0.011393 0.015843 0.236452 0.071622 0.000000 0.133848 0.001551 0.0 0.000000
9167 0.046150 0.004153 0.000586 0.029131 0.000000 0.000000 0.015318 0.090483 0.0 0.048755 ... 0.000000 0.000000 0.097628 0.052031 0.000000 0.000000 0.000000 0.000000 0.0 0.000393
9168 0.038030 0.000000 0.008009 0.032995 0.000000 0.000000 0.015055 0.019886 0.0 0.000122 ... 0.000750 0.000000 0.017958 0.237107 0.000000 0.000000 0.034764 0.000354 0.0 0.000000
9169 0.040401 0.000000 0.000000 0.017743 0.000000 0.000000 0.012677 0.018073 0.0 0.000072 ... 0.000000 0.000000 0.037869 0.152494 0.000000 0.000000 0.000000 0.002456 0.0 0.000000
9170 0.001406 0.000000 0.000000 0.024063 0.000000 0.000000 0.006337 0.000000 0.0 0.000000 ... 0.000000 0.034838 0.036429 0.237845 0.186743 0.000000 0.048438 0.000954 0.0 0.000000
9171 0.004296 0.000412 0.000000 0.037080 0.007700 0.000000 0.008648 0.013575 0.0 0.000000 ... 0.000353 0.000761 0.010418 0.191832 0.000000 0.001537 0.144852 0.001676 0.0 0.000000
9172 0.004607 0.000422 0.016581 0.029331 0.000000 0.000019 0.010006 0.031523 0.0 0.000000 ... 0.002156 0.018158 0.009102 0.241950 0.000000 0.004823 0.104858 0.015060 0.0 0.000000
9122 0.044896 0.006798 0.000000 0.013065 0.000000 0.000000 0.022354 0.067906 0.0 0.011042 ... 0.000000 0.000000 0.030763 0.148463 0.000000 0.000000 0.004873 0.000000 0.0 0.000000
9123 0.036793 0.021414 0.000000 0.013317 0.000000 0.000000 0.027294 0.108887 0.0 0.038598 ... 0.000000 0.000000 0.037017 0.085672 0.000000 0.000000 0.000000 0.000000 0.0 0.000000
9124 0.047839 0.005392 0.000000 0.000000 0.000000 0.000000 0.023929 0.095896 0.0 0.000795 ... 0.000000 0.000000 0.036180 0.072006 0.000000 0.000000 0.000000 0.000000 0.0 0.000000
9125 0.047328 0.008957 0.000000 0.000000 0.000000 0.000000 0.025473 0.122208 0.0 0.012059 ... 0.000000 0.000000 0.045437 0.036258 0.000000 0.000000 0.000000 0.000000 0.0 0.000000
9126 0.052478 0.007149 0.000000 0.005149 0.000000 0.000000 0.026425 0.088092 0.0 0.008506 ... 0.000000 0.000000 0.081810 0.031593 0.000000 0.000000 0.000000 0.000000 0.0 0.000000
9127 0.031883 0.000000 0.000000 0.007273 0.000000 0.000000 0.019330 0.083236 0.0 0.011144 ... 0.000000 0.000000 0.040736 0.128713 0.000000 0.000000 0.000000 0.000000 0.0 0.000000
9128 0.049647 0.006472 0.000000 0.003832 0.000000 0.000000 0.024532 0.087951 0.0 0.010565 ... 0.000000 0.000000 0.069194 0.045462 0.000000 0.000000 0.000000 0.000000 0.0 0.000000
9129 0.071568 0.000000 0.000000 0.015938 0.000000 0.000000 0.031499 0.087671 0.0 0.014011 ... 0.000000 0.000000 0.033312 0.047501 0.000000 0.000000 0.000000 0.000006 0.0 0.000000
9130 0.066304 0.000000 0.002818 0.001318 0.000000 0.000000 0.021337 0.090575 0.0 0.002119 ... 0.000000 0.000000 0.060295 0.080626 0.000000 0.000000 0.000000 0.000000 0.0 0.000000
9131 0.039067 0.009778 0.000000 0.000000 0.000000 0.000000 0.023478 0.111008 0.0 0.018155 ... 0.000000 0.000000 0.040867 0.056370 0.000000 0.000000 0.000000 0.000000 0.0 0.000000

49 rows × 23 columns

In case we know the true fractions of the cell types across sample, we can validate our deconvolved compositions. Here, Neutrophil counts were measures and indeed we found high correlations between the measured values and our decovolution-based results.

neutCounts = metadata["Total.neutrophil.count...mm3."].astype(float)
subsetCorMuSiC = pd.Series(
    np.corrcoef(
        music_frac.to_numpy()[~np.isnan(neutCounts.to_numpy()), :].transpose(),
        neutCounts[~np.isnan(neutCounts.to_numpy())].astype(float),
    )[music_frac.shape[1], 0 : music_frac.shape[1]]
)
subsetCorMuSiC.index = music_frac.columns
subsetCorMuSiC.sort_values()
C:\Users\Shennor\AppData\Roaming\Python\Python38\site-packages\numpy\lib\function_base.py:2691: RuntimeWarning: invalid value encountered in true_divide
  c /= stddev[:, None]
C:\Users\Shennor\AppData\Roaming\Python\Python38\site-packages\numpy\lib\function_base.py:2692: RuntimeWarning: invalid value encountered in true_divide
  c /= stddev[None, :]
NK_cells                 -0.511184
Eosinophils              -0.498070
CD4_T_cells_1            -0.403485
B_cells_1                -0.325301
B_cells_2                -0.281896
CD8_T_cells              -0.281529
pDC                      -0.264388
CD4_T_cells_3            -0.180847
CD16_Monocytes           -0.165263
CD14_Monocytes_2         -0.164895
CD34+ GATA2+ cells       -0.132763
CD14_Monocytes_3         -0.063592
Neutrophils_2            -0.058816
Immature Neutrophils_1   -0.004655
CD14_Monocytes_1         -0.002719
Plasmablast               0.015157
Neutrophils_3             0.057406
Megakaryocytes            0.292929
Immature Neutrophils_2    0.322763
Neutrophils_1             0.446441
Neutrophils_4             0.447348
CD4_T_cells_2                  NaN
mDC                            NaN
dtype: float64

We can also look for significant changes in cell compositions between disease patients and healthy controls. In this case, for each cell type, we calculate the p-value of this change based on student’s t-test.

healty_vs_covid = pd.Series(
    [
        sci.stats.ttest_ind(
            music_frac[cell].to_numpy()[metadata["status"].to_numpy() == "covid"],
            music_frac[cell].to_numpy()[metadata["status"].to_numpy() == "healthy"],
        )[1]
        for cell in music_frac.columns
    ]
)
healty_vs_covid.index = music_frac.columns
healty_vs_covid.sort_values()
CD4_T_cells_1             3.535377e-08
Eosinophils               2.174486e-07
CD34+ GATA2+ cells        1.313440e-06
B_cells_2                 4.179994e-05
Neutrophils_1             1.133780e-04
B_cells_1                 3.829060e-04
Neutrophils_4             4.508343e-04
CD14_Monocytes_2          1.010730e-02
Megakaryocytes            1.307137e-02
Plasmablast               1.902793e-02
Neutrophils_2             6.436426e-02
NK_cells                  1.110367e-01
Immature Neutrophils_1    1.321325e-01
CD14_Monocytes_1          1.452041e-01
CD4_T_cells_3             1.676061e-01
Immature Neutrophils_2    1.806682e-01
CD14_Monocytes_3          2.641251e-01
CD8_T_cells               3.427982e-01
pDC                       3.571902e-01
Neutrophils_3             4.002572e-01
CD16_Monocytes            6.143465e-01
CD4_T_cells_2                      NaN
mDC                                NaN
dtype: float64

Here is a boxplot presenting the differences between the two conditions

selected_cell = healty_vs_covid.index[np.nanargmin(healty_vs_covid.to_numpy())]
status_df = pd.DataFrame(metadata["status"])
status_df["cellFraction"] = music_frac[[selected_cell]]
status_df.boxplot(by="status")
<AxesSubplot:title={'center':'cellFraction'}, xlabel='[status]'>
../_images/2fbb756aa61b0c3257d8453d0e0099e17539912210313c6dbdf314d5e3592b0e.png

22.4. Limitations and traps#

While single cell data is less prone to contain missing cell types, compared to pre-selected sorted cells, dealing with missing cell types is still considered a major challenge in the field of cell type deconvolution to this day []. In addition, the number of cell types within the signature matrix has a major influence on deconvolution accuracy, as more cell types usually results in a less accurate deconvolution process [Newman et al., 2019].

While several deconvolution methods can correctly infer the proportion of major cellular components, their performance for rare or correlated components varies. To deal with the co-linearity between the predictors some methods perform feature selection prior deconvolution by selecting a subset of genes (called signature gene list) that minimizes the correlation between cell types.

22.5. New directions#

Other than the methods we described before, there are additional tools which can be utilize to either improve deconvolution or decovolve higher resolution states within cell types:

AutoGeneS [Aliee and Theis, 2021] proposes a multi-objective feature selection method that can be integrated into a deconvolution platform. AutoGeneS requires no prior knowledge about marker genes and selects genes by simultaneously optimizing multiple criteria: minimizing the correlation and maximizing the distance between cell types. AutoGeneS can be applied to reference profiles from various sources like single-cell experiments or sorted cell populations.

CPM [] is a cell-state deconvolution method which allows discovering compositional changes occurring within each cell type, based on the single cell cell space, often capturing transitions in cell compositions across continuous cellular trajectories. By focusing on the variation within cell types, and less between cell types (as most deconvolution methods), CPM can discover changes in compositions of distinct subgroups of cells or continuous changes across specific cellular trajectories.

22.6. Key takeaways#

  1. Use unbiased reference data with no missing cell type.

  2. Normalize the reference data using library size normalization.

  3. Feature selection prior deconvolution improves the accuracy of the results particularly for linear deconvolution methods.

Deconvolution is a challenging task (see limitations and traps). Therefore, we recommend the users to try more than one deconvolution method and select one after assessing the results biologically.

22.7. References#

[bulkAT21] (1,2,3)

Hananeh Aliee and Fabian J. Theis. Autogenes: automatic gene selection using multi-objective optimization for rna-seq deconvolution. Cell Systems, 12(7):706–715, Jul 2021. URL: https://doi.org/10.1016/j.cels.2021.05.006, doi:10.1016/j.cels.2021.05.006.

[bulkAHB17] (1,2)

Dvir Aran, Zicheng Hu, and Atul J. Butte. xCell: digitally portraying the tissue cellular heterogeneity landscape. Genome Biology, 11 2017. URL: https://doi.org/10.1186/s13059-017-1349-1, doi:10.1186/s13059-017-1349-1.

[bulkBVW+16]

Maayan Baron, Adrian Veres, Samuel L. Wolock, Aubrey L. Faust, Renaud Gaujoux, Amedeo Vetere, Jennifer Hyoje Ryu, Bridget K. Wagner, Shai S. Shen-Orr, Allon M. Klein, Douglas A. Melton, and Itai Yanai. A single-cell transcriptomic map of the human and mouse pancreas reveals inter- and intra-cell population structure. Bioinformatics, 3(4):346–360, Oct 2016. URL: https://doi.org/10.1016/j.cels.2016.08.011, doi:10.1016/j.cels.2016.08.011.

[bulkDCW19]

Rose Du, Vince Carey, and Scott T. Weiss. Deconvseq: deconvolution of cell mixture distribution in sequencing data. Bioinformatics, 35(24):5095–5102, Dec 2019. URL: https://doi.org/10.1093/bioinformatics/btz444, doi:10.1093/bioinformatics/btz444.

[bulkKKB+12]

Alexandre Kuhn, Azad Kumar, Alexandra Beilina, Allissa Dillman, Mark R. Cookson, and Andrew B. Singleton. Cell population-specific expression analysis of human cerebellum. BMC Genomics, 13(610):1471–2164, Nov 2012. URL: https://doi.org/10.1186/1471-2164-13-610, doi:10.1186/1471-2164-13-610.

[bulkMMO+20]

Kevin Menden, Mohamed Marouf, Sergio Oller, Anupriya Dalmia, Daniel Sumner Magruder, Karin Kloiber, Peter Heutink, and Stefan Bonn. Deep learning&#x2013;based cell composition analysis from tissue expression profiles. Science Advances, 6(30):eaba2619, 2020. URL: https://www.science.org/doi/abs/10.1126/sciadv.aba2619, arXiv:https://www.science.org/doi/pdf/10.1126/sciadv.aba2619, doi:10.1126/sciadv.aba2619.

[bulkMLX+19]

Gianni Monaco, Bernett Lee, Weili Xu, Seri Mustafah, You Yi Hwang, Christophe Carré, Nicolas Burdin, Lucian Visan, Michele Ceccarelli, Michael Poidinger, Alfred Zippelius, João Pedro de Magalhães, and Anis Larbi. Rna-seq signatures normalized by mrna abundance allow absolute deconvolution of human immune cell types. Cell Reports, 26(6):1627–1640, Feb 2019. URL: https://doi.org/10.1016/j.celrep.2019.01.041, doi:10.1016/j.celrep.2019.01.041.

[bulkNOS+21]

Brian B Nadel, Meritxell Oliva, Benjamin L Shou, Keith Mitchell, Feiyang Ma, Dennis J Montoya, Alice Mouton, Sarah Kim-Hellmuth, Barbara E Stranger, Matteo Pellegrini, and Serghei Mangul. Systematic evaluation of transcriptomics-based deconvolution methods and references using thousands of clinical samples. Briefings in Bioinformatics, 08 2021. bbab265. URL: https://doi.org/10.1093/bib/bbab265, arXiv:https://academic.oup.com/bib/article-pdf/22/6/bbab265/42242154/bbab265.pdf, doi:10.1093/bib/bbab265.

[bulkNSL+19] (1,2,3)

Aaron M. Newman, Chloé B. Steen, Chih Long Liu, Andrew J. Gentles, Aadel A. Chaudhuri, Florian Scherer, Michael S. Khodadoust, Mohammad S. Esfahani, Bogdan A. Luca, David Steiner, Maximilian Diehn, and Ash A. Alizadeh. Determining cell type abundance and expression from bulk tissues with digital cytometry. Nature Biotechnology, 37:773–782, May 2019. URL: https://doi.org/10.1038/s41587-019-0114-2, doi:10.1038/s41587-019-0114-2.

[bulkSSRP+20]

Jonas Schulte-Schrepping, Nico Reusch, Daniela Paclik, Kevin Baßler, Stephan Schlickeiser, Bowen Zhang, Benjamin Krämer, Tobias Krammer, Sophia Brumhard, Lorenzo Bonaguro, Elena De Domenico, Daniel Wendisch, Martin Grasshoff, Theodore S. Kapellos, Michael Beckstette, Tal Pecht, Adem Saglam, Oliver Dietrich, Henrik E. Mei, Axel R. Schulz, Claudia Conrad, Désirée Kunkel, Ehsan Vafadarnejad, Cheng-Jian Xu, Arik Horne, Miriam Herbert, Anna Drews, Charlotte Thibeault, Moritz Pfeiffer, Stefan Hippenstiel, Andreas Hocke, Holger Müller-Redetzky, Katrin-Moira Heim, Felix Machleidt, Alexander Uhrig, Laure Bosquillon de Jarcy, Linda Jürgens, Miriam Stegemann, Christoph R. Glösenkamp, Hans-Dieter Volk, Christine Goffinet, Markus Landthaler, Emanuel Wyler, Philipp Georg, Maria Schneider, Chantip Dang-Heine, Nick Neuwinger, Kai Kappert, Rudolf Tauber, Victor Corman, Jan Raabe, Kim Melanie Kaiser, Michael To Vinh, Gereon Rieke, Christian Meisel, Thomas Ulas, Matthias Becker, Robert Geffers, Martin Witzenrath, Christian Drosten, Norbert Suttorp, Christof von Kalle, Florian Kurth, Kristian Händler, Joachim L. Schultze, Anna C. Aschenbrenner, Yang Li, Jacob Nattermann, Birgit Sawitzki, Antoine-Emmanuel Saliba, Leif Erik Sander, and Deutsche COVID-19 OMICS Initiative (DeCOI). Severe covid-19 is marked by a dysregulated myeloid cell compartment. Cell, 182(6):1419–1440.e23, Sep 2020. S0092-8674(20)30992-2[PII]. URL: https://doi.org/10.1016/j.cell.2020.08.001, doi:10.1016/j.cell.2020.08.001.

[bulkSch10]

Stanley E. Schwartz, Russell Shackney. Applying unmixing to gene expression data for tumor phylogeny inference. BMC Bioinformatics, Jan 2010. URL: https://doi.org/10.1186/1471-2105-11-42, doi:10.1186/1471-2105-11-42.

[bulkSOG13]

Shai S Shen-Orr and Renaud Gaujoux. Computational deconvolution: extracting cell type-specific information from heterogeneous samples. Current Opinion in Immunology, 25(5):571–578, 2013. Special section: Systems biology and bioinformatics / Immunogenetics and transplantation. URL: https://www.sciencedirect.com/science/article/pii/S0952791513001507, doi:https://doi.org/10.1016/j.coi.2013.09.015.

[bulkZBSA19]

Konstantin Zaitsev, Monika Bambouskova, Amanda Swain, and Maxim N. Artyomov. Complete deconvolution of cellular mixtures based on linearity of transcriptional signatures. Nature Communications, May 2019. URL: https://doi.org/10.1038/s41467-019-09990-5, doi:10.1038/s41467-019-09990-5.

22.8. Contributors#

We gratefully acknowledge the contributions of:

22.8.1. Authors#

  • Hananeh Aliee

  • Amit Frishberg