36. Normalization#

36.1. Motivation#

In contrast to the sparse, negative binomially distributed UMI counts of scRNA-seq, ADT data is less sparse, with a negative peak from non-specific antibody binding and a positive peak reflecting the enrichment of specific cell surface proteins [Zheng et al., 2022]. The capture efficiency varies from cell to cell due to differences in biophysical properties. Since CITE-seq experiments enrich for a priori selected features, compositional biases are more severe. As with scRNA-seq data, many normalization approaches exist. We cover the two most widely used methods, which require different input data and starting points.

ADT data can be normalized using the Centered Log-Ratio (CLR) transformation [Stoeckius et al., 2017]. However, a newer low-level normalization method tailored to the challenges of this modality exists: DSB (denoised and scaled by background). DSB normalization removes two kinds of noise. First, it uses the empty droplets to estimate and remove ambient background noise. Second, it uses the background population mean and isotype controls (antibodies that bind non-specifically to the cells) to define and remove cell-to-cell technical noise [Mulè et al., 2022].

36.2. Environment setup#

import muon as mu
import pandas as pd
import warnings

warnings.simplefilter(action="ignore", category=UserWarning)
warnings.simplefilter(action="ignore", category=FutureWarning)

36.3. Loading the data#

raw_mu_path = (
filtered_qc_mu_path = "/lustre/groups/ml01/workspace/ciro.suastegui/bp2.0/data/neurips_cite_pp_filtered-qc.h5mu"
filtered_norm_mu_path = "/lustre/groups/ml01/workspace/ciro.suastegui/bp2.0/data/neurips_cite_pp_filtered-qc-norm.h5mu"

We simply load the MuData objects saved in the quality control chapter back in.

raw = mu.read(raw_mu_path)
CPU times: user 14min 3s, sys: 1min 54s, total: 15min 58s
Wall time: 16min 4s
filtered = mu.read(filtered_qc_mu_path)
CPU times: user 3.76 s, sys: 1.3 s, total: 5.05 s
Wall time: 6.07 s
MuData object with n_obs × n_vars = 120502 × 36741
  var:	'gene_ids', 'feature_types'
  2 modalities
    rna:	120502 x 36601
      obs:	'donor', 'batch'
      var:	'gene_ids', 'feature_types'
    prot:	120502 x 140
      obs:	'donor', 'batch', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'n_counts', 'outliers'
      var:	'gene_ids', 'feature_types', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts'

36.4. DSB normalization#

We are ready to normalize the data. In this case, we can use the distribution of the raw (unfiltered) data as background. We also have isotype controls to define and remove cell-to-cell technical variation.
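The idea of using a background distribution can be illustrated with a minimal NumPy sketch of the ambient-correction step only. The toy matrices, the pseudocount value, and all variable names are assumptions for illustration; this is not the code behind mu.prot.pp.dsb.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumption): rows are droplets, columns are proteins.
empty = rng.poisson(lam=3.0, size=(500, 4)).astype(float)  # empty droplets: ambient signal only
cells = rng.poisson(lam=3.0, size=(50, 4)).astype(float)   # cell-containing droplets
cells[:, 0] += rng.poisson(lam=200.0, size=50)             # protein 0 is truly expressed

pseudocount = 10.0  # assumed value, added before the log to stabilize low counts

log_empty = np.log(empty + pseudocount)
log_cells = np.log(cells + pseudocount)

# Per-protein mean and standard deviation of the background droplets
mu_bg = log_empty.mean(axis=0)
sd_bg = log_empty.std(axis=0)

# Rescale so that values read as "standard deviations above ambient background"
ambient_corrected = (log_cells - mu_bg) / sd_bg
```

In this toy example, the expressed protein ends up far above zero, while proteins that only carry ambient signal are centered around zero.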

Isotype controls are antibodies that bind non-specifically to the cells in this study, so no significant difference in abundance is expected between cells. We can therefore use the values of the isotype controls to normalize technical differences.
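To make the role of the isotype controls concrete, here is a deliberately simplified sketch on simulated data. It assumes that a single per-cell technical factor shifts every antibody equally, estimates that factor as the mean isotype signal, and regresses it out. DSB itself additionally uses each cell's background population mean, so this illustrates the principle rather than the method; all names and values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_cells = 200

# Toy data (assumption): a per-cell technical factor shifts every antibody,
# including the isotype controls, by the same amount.
technical = rng.normal(0.0, 1.0, size=n_cells)
isotypes = technical[:, None] + rng.normal(0.0, 0.2, size=(n_cells, 4))
protein = 2.0 + technical + rng.normal(0.0, 0.2, size=n_cells)

# Estimate the technical factor as the mean isotype signal per cell
# (a simplification of what DSB does).
factor = isotypes.mean(axis=1)

# Fit protein ~ slope * factor + intercept, then subtract the fitted technical part
slope, intercept = np.polyfit(factor, protein, deg=1)
denoised = protein - slope * factor
```

After the regression, the remaining values are uncorrelated with the estimated technical factor and vary much less between cells than the uncorrected protein signal.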

We call the normalization function mu.prot.pp.dsb with the filtered and raw MuData objects as well as the names of the isotype controls.

isotype_controls = ["Mouse-IgG1", "Mouse-IgG2a", "Mouse-IgG2b", "Rat-IgG2b"]
# Keep a copy of the raw counts in a layer before X is overwritten with normalized values
filtered["prot"].layers["counts"] = filtered["prot"].X.copy()
mu.prot.pp.dsb(filtered, raw, isotype_controls=isotype_controls)
CPU times: user 8min 3s, sys: 28.6 s, total: 8min 31s
Wall time: 8min 32s

Let’s have a look at counts before denoising and normalization.

pd.Series(filtered["prot"].layers["counts"][:100, :100].A.flatten()).value_counts()
1.0      1090
0.0      1045
2.0       918
3.0       691
4.0       581
350.0       1
706.0       1
296.0       1
970.0       1
763.0       1
Length: 524, dtype: int64

After denoising and normalization, the range has changed: the values are no longer integer counts but continuous, and they can be negative.

pd.Series(filtered["prot"].X[:100, :100].flatten()).value_counts()
-1.174030    2
-0.996048    1
 1.722345    1
-0.262355    1
 6.112263    1
 0.153576    1
 0.285257    1
-0.149485    1
 0.287904    1
-0.263154    1
Length: 9999, dtype: int64

36.5. Centered Log-Ratio normalization#

If you don’t have the unfiltered data available, you can also normalize the ADT data with mu.prot.pp.clr, which implements Centered Log-Ratio normalization. This type of normalization does no denoising. Instead, we assume that the geometric mean is a good reference to which all other values are made relative (by division) [Quinn et al., 2018]. In effect, we take the natural log of the ratio between each protein count in each cell and a geometric mean computed either across proteins or across cells, depending on the implementation. CLR was originally applied across proteins and later changed to across cells, which made the normalization less dependent on the antibody panel [Mulè et al., 2022].
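The difference between the two variants can be sketched in a few lines of NumPy. This is a minimal illustration that assumes a pseudocount of 1 to handle zeros; the helper name `clr` and the toy matrix are hypothetical, and library implementations such as mu.prot.pp.clr may differ in detail.

```python
import numpy as np


def clr(counts, axis=0, pseudocount=1.0):
    """Log of each value relative to the geometric mean taken along `axis`."""
    log_counts = np.log(counts + pseudocount)
    # The mean of the logs equals the log of the geometric mean
    log_geo_mean = log_counts.mean(axis=axis, keepdims=True)
    return log_counts - log_geo_mean


# Toy matrix (assumption): rows are cells, columns are proteins
counts = np.array(
    [
        [0.0, 10.0, 200.0],
        [2.0, 12.0, 5.0],
        [1.0, 8.0, 150.0],
    ]
)

clr_across_cells = clr(counts, axis=0)     # reference: each protein's geometric mean over cells
clr_across_proteins = clr(counts, axis=1)  # reference: each cell's geometric mean over proteins
```

Across cells, every protein column is centered on its own reference, so the result for one protein does not depend on which other antibodies are in the panel. Across proteins, every cell row is centered instead, so adding or removing antibodies changes the reference.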

MuData object with n_obs × n_vars = 120502 × 36741
  var:	'gene_ids', 'feature_types'
  2 modalities
    rna:	120502 x 36601
      obs:	'donor', 'batch'
      var:	'gene_ids', 'feature_types'
    prot:	120502 x 140
      obs:	'donor', 'batch', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'n_counts', 'outliers'
      var:	'gene_ids', 'feature_types', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts'
      layers:	'counts'

36.6. Key takeaways#


36.7. References#


Matthew P. Mulè, Andrew J. Martins, and John S. Tsang. Normalizing and denoising protein expression data from droplet-based single cell profiling. Nature Communications, 13(11):2099, Apr 2022. doi:10.1038/s41467-022-29356-8.


Thomas P Quinn, Ionas Erb, Mark F Richardson, and Tamsyn M Crowley. Understanding sequencing data as compositions: an outlook and review. Bioinformatics, 34(16):2870–2878, Aug 2018. doi:10.1093/bioinformatics/bty175.


Marlon Stoeckius, Christoph Hafemeister, William Stephenson, Brian Houck-Loomis, Pratip K. Chattopadhyay, Harold Swerdlow, Rahul Satija, and Peter Smibert. Simultaneous epitope and transcriptome measurement in single cells. Nature Methods, 14(9):865–868, Sep 2017. doi:10.1038/nmeth.4380.


Ye Zheng, Seong-Hwan Jun, Yuan Tian, Florian Mair, and Raphael Gottardo. Robust normalization and integration of single-cell protein expression across CITE-seq datasets. bioRxiv, 2022. doi:10.1101/2022.04.29.489989.

36.8. Contributors#

We gratefully acknowledge the contributions of:

36.8.1. Authors#

  • Daniel Strobl

  • Ciro Ramírez-Suástegui

36.8.2. Reviewers#

  • Lukas Heumos