35. Quality control#

35.1. Motivation#

In addition to capturing only transcriptomic data with single cell analyses, we are now able to also capture the abundance of surface protein expression. The protocol used for this is usually referred to as CITE-seq[Stoeckius et al., 2017]. This modality requires different preprocessing compared to what we described earlier for gene expression data since the data distributions are different. In the following, we will guide you through the process of dealing with CITE-seq data. As CITE-seq data provides you with two different modalities, you can either analyze them separately or jointly. Here, we will focus on the ADT part of the data and analyze it unimodally.

Single-cell RNA-seq data acts as a proxy for protein level with a partial correlation at various transcription estates of a cell[Liu et al., 2016]. Therefore, it is in our interest to measure the protein levels in single-cells if we are to capture a better picture of cellular processes. Quantifying these aspects of a cell is essential to understand cell differentiation and fate, cell signal transduction pathway, disease progression, perturbations, and clinical diagnostics[Xie and Ding, 2022].

We can already detect relevant populations with single-cell transcriptomics. This is a valuable piece of information, but incomplete if we want to better understand the cellular identities and dynamics happening in the biological processes we study. Having surface protein measurements allows us to close the gap between identity by transcription and identity by protein where there might be a delay in synthesis that could be important in our experiment. For example, it has been noted that ICOS, an immune checkpoint protein, was increased on the surface of treated cells, regardless of the fact that this protein’s mRNA does not differ in abundance between the treatment groups[Peterson et al., 2017]. Another advantage is that surface protein levels help us detect doublets that might not be reflected at the transcript level in our data. This is possible by looking at the co-occurrence of cell-type-specific markers[Sun et al., 2021].

By using antibodies tagged with a barcode, it is possible to first bind the antibodies to the cells and later sequence the barcodes together with the RNA. There are two main protocols CITE-seq (Cellular Indexing of Transcriptomes and Epitopes by Sequencing) and REAP-seq (RNA expression and protein sequencing assay). The main difference resides in their antibody-oligo conjugates also known as Antibody-Derived Tags (ADT). CITE-seq uses streptavidin that is noncovalently bound to biotinylated DNA barcodes. REAP-seq implements covalent bonds between the antibody and a DNA barcode[Peterson et al., 2017]. Furthermore, there have been advances integrating the CITE-seq protocol in a multimodal assay. One is DOGMA-seq[Mimitou et al., 2021], an adaptation of CITE-seq that allows the measurement of chromatin accessibility, gene expression, and protein from the same cell. This method includes ASAP-seq, which combines scATAC-seq and ADT by adding a bridge oligo specific to the CITE-seq reagents[Mimitou et al., 2021]. The advantage of ASAP-seq is that it can measure surface and intracellular proteins. We will refer to the surface protein measurements as ADT data.

CITE-Seq

With ADT data, we can identify cell types based on conventional markers usually utilized in flow cytometry experiments. These markers are especially useful for specific immune cell populations. The advantage of ADT is that other modalities are measured simultaneously. However, the way we process ADT data differs from others. Contrary to the negative binomial distribution of UMI counts, ADT data is less sparse with a negative peak for non-specific antibody binding and a positive peak resembling enrichment of specific cell surface proteins[Zheng et al., 2022]. Many experiments only include a small number in the tens or hundreds of antibodies of interest. Moreover, sequencing resources can be concentrated enabling deeper coverage of ADTs since they are separated from transcripts. ADT data is also noisier, as unbound antibodies lead to counts in cells or empty droplets where the protein is not present.

35.2. About the data#

Here we are analysing ADT data obtained from bone marrow mononuclear cells (BMMCs)[Luecken et al., 2021]. It contains 136 surface protein and 4 isotypes.

In this chapter we will use scanpy and muon[Bredikhin et al., 2022] to analyse the data. We show you how to load, perform quality control, normalize, and do some basic visualization. We also show a way to remove batch effect.

35.3. Environment setup#

import scanpy as sc
import muon as mu
import pandas as pd
import numpy as np
import glob
import seaborn as sns
import warnings
import os

warnings.simplefilter(action="ignore", category=UserWarning)
warnings.simplefilter(action="ignore", category=FutureWarning)
/home/icb/ciro.suastegui/miniconda3/envs/citeseq_pp/lib/python3.7/site-packages/tqdm/auto.py:22: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm

35.4. Loading the data#

We’re reading in the data with muon. This way we already have two subobjects, one for RNA and another for ADT.

We use glob with a regular expression to get the file paths for the individual samples.

Cell Ranger gives us two outputs: a filtered and a raw output. The raw output contains all sequenced droplets while the filtered output only consists of the droplets considered as cells by Cell Ranger.

data_path = "/lustre/groups/ml01/datasets/projects/20220323_neurips21_bmmc_christopher.lance/cite/s*d*/cellranger_out/"
raw_mu_path = (
    "/lustre/groups/ml01/workspace/ciro.suastegui/bp2.0/data/neurips_cite_pp_raw.h5mu"
)
filtered_mu_path = "/lustre/groups/ml01/workspace/ciro.suastegui/bp2.0/data/neurips_cite_pp_filtered.h5mu"
filtered_qc_mu_path = "/lustre/groups/ml01/workspace/ciro.suastegui/bp2.0/data/neurips_cite_pp_filtered-qc.h5mu"
filtered_files = glob.glob(data_path + "filtered_*/")
raw_files = glob.glob(data_path + "raw_*/")

As Mudata does not support concatenation, we need to use the scanpy function for this. We split each object into a GEX and an ADT anndata object, concatenate those individually and finally attach them to a new joint MuData object.

We do this for both the filtered and the raw object.

def read_data_aggregate(files_list, cache_file=None):
    cache_file_exists = False
    if not cache_file is None:
        if os.path.isfile(cache_file):
            cache_file_exists = True
    if cache_file_exists:
        print("Loading cached data")
        data = mu.read(cache_file)
    else:
        print("Reading data per sample")
        data_mu = {}
        # Iterate over samples, load and add name as obs column
        for file in files_list:
            name = file.split("/")[8]
            data_mu[name] = mu.read_10x_mtx(file)
            data_mu[name]["prot"].obs["donor"] = name
            data_mu[name]["rna"].obs["donor"] = name
        # Split muon object into two anndata objects
        data = data_mu.pop("s1d1")
        data_rna = data["rna"]
        data_prot = data["prot"]
        # Concatenate rna and adt objects separately
        for name in data_mu.keys():
            data_rna = data_rna.concatenate(data_mu[name]["rna"])
            data_prot = data_prot.concatenate(data_mu[name]["prot"])
        # Create mudata object from the two anndata objects
        data = mu.MuData({"rna": data_rna, "prot": data_prot})
        if not cache_file is None:
            print("Saving data to:", cache_file)
            data.write(cache_file)
    return data
%%time
filtered = read_data_aggregate(filtered_files, filtered_mu_path)
Loading cached data
CPU times: user 3.79 s, sys: 1.36 s, total: 5.14 s
Wall time: 6.34 s
%%time
raw = read_data_aggregate(raw_files, raw_mu_path)
Loading cached data
CPU times: user 14min 23s, sys: 2min 7s, total: 16min 30s
Wall time: 16min 36s

Now, we have two Mudata objects. One with the called cells and one with all cell barcodes.

filtered
MuData object with n_obs × n_vars = 122016 × 36741
  var:	'gene_ids', 'feature_types'
  2 modalities
    rna:	122016 x 36601
      obs:	'donor', 'batch'
      var:	'gene_ids', 'feature_types'
    prot:	122016 x 140
      obs:	'donor', 'batch'
      var:	'gene_ids', 'feature_types'
raw
MuData object with n_obs × n_vars = 24807643 × 36741
  var:	'gene_ids', 'feature_types'
  2 modalities
    rna:	24807643 x 36601
      obs:	'donor', 'batch'
      var:	'gene_ids', 'feature_types'
    prot:	24807643 x 140
      obs:	'donor', 'batch'
      var:	'gene_ids', 'feature_types'

While the raw object contains over 24 million droplets, the filtered object only contains 122,016. The rna modality has 36601 features while the prot modality has 140 features.

35.5. Quality Control#

Similar to transcriptomics data quality control and filtering, we need to remove cells that failed to capture ADTs. Solely reusing the earlier introduced transcriptomics quality control measures is inappropriate due to the above-mentioned different distributions. Since features for CITE-seq experiments are chosen based on expected biological relevance, separating technical and biological artifacts is more challenging.

Successful capture of targeted proteins leads to a massive increase of total ADT count, due to the binary nature of surface markers. Caution is required when removing cells with low total ADT counts, since many cell types do not express the selected protein targets. Empty droplets however still contain reads for ADT counts because unbound antibodies are sequenced. We therefore expect non-zero counts for most ADTs. We can, however make use of these empty droplets to normalize our count data as we can consider this as the ambient background level for each antibody. This is discussed at a later point.

filtered["prot"].X
<122016x140 sparse matrix of type '<class 'numpy.float32'>'
	with 14267923 stored elements in Compressed Sparse Row format>
sc.pp.calculate_qc_metrics(filtered["prot"], inplace=True, percent_top=None)
sc.pp.calculate_qc_metrics(raw["prot"], inplace=True, percent_top=None)
sc.pp.calculate_qc_metrics(raw["rna"], inplace=True, percent_top=None)

We first look at the distribution of ADTs per cell over all samples. We plot this using the seaborn library. We first take a look at the wole range and can see that most cells express between 70 and 140 proteins.

sns.displot(filtered["prot"].obs.n_genes_by_counts)
<seaborn.axisgrid.FacetGrid at 0x7f51e0eb41d0>
../_images/quality_control_28_1.png

As the cells falling below a certain threshold of present ADT markers and not following the distribution are probably not viable cells, we want to filter out those cells. Thus, we look at the lower end of the distribution.

sns.displot(
    filtered["prot"][filtered["prot"].obs.n_genes_by_counts < 70].obs.n_genes_by_counts
)
<seaborn.axisgrid.FacetGrid at 0x7f51e13d6610>
../_images/quality_control_30_1.png

We can see a ‘valley’ in the distribution at around 55 ADTs. This looks like an appropriate cutoff.

Next, we do the same thing based on total counts per cell. Looking at the total range, we can’t see any apparent ranges of the distribution of counts.

sns.displot(filtered["prot"].obs.total_counts)
<seaborn.axisgrid.FacetGrid at 0x7f4ccce24450>
../_images/quality_control_33_1.png

We zoom in to see the upper end of the distribution for the total counts to decide on a cutoff for the maximum number of counts as droplets exceeding a certain threshold probably contain multiple cells, so called doublets.

sns.displot(
    filtered["prot"]
    .obs.query("total_counts>20000 and total_counts<100000")
    .total_counts
)
<seaborn.axisgrid.FacetGrid at 0x7f4a38cbabd0>
../_images/quality_control_35_1.png

For the dsb normalization performed later, we need to take a look at the raw rna count distribution. The first large peak between 10**1 and 10**2.1 are droplets that don’t contain cells, thus we can use this range for the dsb algorithm at a later stage.

sns.displot(
    raw["rna"].obs.sample(frac=0.01).query("total_counts<100000 and total_counts>10"),
    x="total_counts",
    log_scale=True,
    hue="donor",
    multiple="stack",
)
<seaborn.axisgrid.FacetGrid at 0x7f4a38cbad90>
../_images/quality_control_37_1.png
sns.displot(
    filtered["prot"].obs.sample(frac=0.01).query("total_counts>10"),
    x="total_counts",
    log_scale=True,
    hue="donor",
    multiple="stack",
)
<seaborn.axisgrid.FacetGrid at 0x7f50ffe26c50>
../_images/quality_control_38_1.png
filtered
MuData object with n_obs × n_vars = 122016 × 36741
  var:	'gene_ids', 'feature_types'
  2 modalities
    rna:	122016 x 36601
      obs:	'donor', 'batch'
      var:	'gene_ids', 'feature_types'
    prot:	122016 x 140
      obs:	'donor', 'batch', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts'
      var:	'gene_ids', 'feature_types', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts'

35.5.1. Sample-wise QC#

Now we look at the distribution of counts per cell across the samples to see if there are differences. As the total amount of reads and droplets can differ between the samples, a stringent, hard cutoff applied to all samples would not be appropriate.

sc.pp.filter_cells(filtered["prot"], max_counts=100000)
sns.boxplot(y=filtered["prot"].obs.total_counts, x=filtered["prot"].obs["donor"])
<AxesSubplot:xlabel='donor', ylabel='total_counts'>
../_images/quality_control_43_1.png

The distributions of counts are different between samples. Thus, sample-wise QC is deemed pertinent. If we compare sample s3d7 vs sample s4d8, we can see that the outliers of one sample would fit the regular distribution of normal counts in the other sample.

Since we have a significant number of samples we can do sample-wise QC automatically as described in the RNA preprocessing chapter.

def is_outlier(adata, metric: str, nmads: int):
    M = adata.obs[metric]
    outlier = (M < np.median(M) - nmads * M.mad()) | (
        np.median(M) + nmads * M.mad() < M
    )
    return outlier
%%time
outliers = []
for sample in np.unique(filtered["prot"].obs["donor"]):
    adata_temp = filtered["prot"][filtered["prot"].obs["donor"] == sample].copy()
    adata_temp.obs["outlier"] = is_outlier(
        adata_temp, "log1p_total_counts", 5
    ) | is_outlier(adata_temp, "log1p_n_genes_by_counts", 5)
    outliers.append(adata_temp.obs["outlier"])
    print(f"{sample}: outliers {adata_temp.obs.outlier.value_counts()[True]}")
s1d1: outliers 116
s1d2: outliers 103
s1d3: outliers 100
s2d1: outliers 132
s2d4: outliers 102
s2d5: outliers 104
s3d1: outliers 180
s3d6: outliers 206
s3d7: outliers 177
s4d1: outliers 81
s4d8: outliers 41
s4d9: outliers 137
CPU times: user 203 ms, sys: 6.96 ms, total: 210 ms
Wall time: 210 ms
filtered["prot"].obs["outliers"] = pd.concat(outliers)
filtered["prot"].obs
donor batch n_genes_by_counts log1p_n_genes_by_counts total_counts log1p_total_counts n_counts outliers
AAACCCAAGGATGGCT-1-0-0-0-0-0-0-0-0-0-0-0 s1d1 0 124 4.828314 6483.0 8.777093 6483.0 False
AAACCCAAGGCCTAGA-1-0-0-0-0-0-0-0-0-0-0-0 s1d1 0 140 4.948760 19711.0 9.888983 19711.0 False
AAACCCAAGTGAGTGC-1-0-0-0-0-0-0-0-0-0-0-0 s1d1 0 120 4.795791 3349.0 8.116715 3349.0 False
AAACCCACAAGAGGCT-1-0-0-0-0-0-0-0-0-0-0-0 s1d1 0 128 4.859812 7841.0 8.967249 7841.0 False
AAACCCACATCGTGGC-1-0-0-0-0-0-0-0-0-0-0-0 s1d1 0 114 4.744932 2462.0 7.809135 2462.0 False
... ... ... ... ... ... ... ... ...
TTTGTTGGTCCAATCA-1-1 s4d1 1 130 4.875197 3493.0 8.158802 3493.0 False
TTTGTTGGTGACTAAA-1-1 s4d1 1 134 4.905275 4227.0 8.349484 4227.0 False
TTTGTTGTCCCTCTCC-1-1 s4d1 1 135 4.912655 7294.0 8.894944 7294.0 False
TTTGTTGTCTAGAGCT-1-1 s4d1 1 130 4.875197 3813.0 8.246434 3813.0 False
TTTGTTGTCTCTGCCA-1-1 s4d1 1 134 4.905275 2537.0 7.839132 2537.0 False

121981 rows × 8 columns

Let’s first update our MuData (.obs and .var) with the data from all the modalities.

filtered.update()
filtered = filtered[filtered.obs.loc[filtered["prot"].obs_names].index]
filtered = filtered[filtered["prot"].obs["outliers"] == False]
filtered
View of MuData object with n_obs × n_vars = 120502 × 36741
  var:	'gene_ids', 'feature_types'
  2 modalities
    rna:	120502 x 36601
      obs:	'donor', 'batch'
      var:	'gene_ids', 'feature_types'
    prot:	120502 x 140
      obs:	'donor', 'batch', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'n_counts', 'outliers'
      var:	'gene_ids', 'feature_types', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts'

We removed 1500 cells during the filtering which equates to roughly 1% of cells. This is a relatively permissive filtering and we might need further filtering of doublets in the following.

sns.boxplot(y=filtered["prot"].obs.total_counts, x=filtered["prot"].obs["donor"])
<AxesSubplot:xlabel='donor', ylabel='total_counts'>
../_images/quality_control_55_1.png

As we can see in the above plot, outliers are now filtered out for each sample separately. To now bring the values for each sample into a similar range, we need to normalize the data.

filtered.write(filtered_qc_mu_path)
/home/icb/ciro.suastegui/miniconda3/envs/citeseq_pp/lib/python3.7/site-packages/anndata/_core/anndata.py:1241: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[key] = c
... storing 'feature_types' as categorical

35.6. Key takeaways#

TODO

35.7. References#

spBKS22

Danila Bredikhin, Ilia Kats, and Oliver Stegle. MUON: multimodal omics analysis framework. Genome Biology, feb 2022. URL: https://doi.org/10.1186%2Fs13059-021-02577-8, doi:10.1186/s13059-021-02577-8.

spLBA16

Yansheng Liu, Andreas Beyer, and Ruedi Aebersold. On the dependency of cellular protein levels on mrna abundance. Cell, 165(3):535–550, Apr 2016. doi:10.1016/j.cell.2016.03.014.

spLBC+21

Malte D Luecken, Daniel Bernard Burkhardt, Robrecht Cannoodt, Christopher Lance, Aditi Agrawal, Hananeh Aliee, Ann T Chen, Louise Deconinck, Angela M Detweiler, Alejandro A Granados, Shelly Huynh, Laura Isacco, Yang Joon Kim, Dominik Klein, BONY DE KUMAR, Sunil Kuppasani, Heiko Lickert, Aaron McGeever, Honey Mekonen, Joaquin Caceres Melgarejo, Maurizio Morri, Michaela Müller, Norma Neff, Sheryl Paul, Bastian Rieck, Kaylie Schneider, Scott Steelman, Michael Sterr, Daniel J. Treacy, Alexander Tong, Alexandra-Chloe Villani, Guilin Wang, Jia Yan, Ce Zhang, Angela Oliveira Pisco, Smita Krishnaswamy, Fabian J Theis, and Jonathan M. Bloom. A sandbox for prediction and integration of DNA, RNA, and proteins in single cells. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2). 2021. URL: https://openreview.net/forum?id=gN35BGa1Rt.

spMLC+21(1,2)

Eleni P. Mimitou, Caleb A. Lareau, Kelvin Y. Chen, Andre L. Zorzetto-Fernandes, Yuhan Hao, Yusuke Takeshima, Wendy Luo, Tse-Shun Huang, Bertrand Z. Yeung, Efthymia Papalexi, Pratiksha I. Thakore, Tatsuya Kibayashi, James Badger Wing, Mayu Hata, Rahul Satija, Kristopher L. Nazor, Shimon Sakaguchi, Leif S. Ludwig, Vijay G. Sankaran, Aviv Regev, and Peter Smibert. Scalable, multimodal profiling of chromatin accessibility, gene expression and protein levels in single cells. Nature Biotechnology, 39(1010):1246–1258, Oct 2021. doi:10.1038/s41587-021-00927-2.

spPZK+17(1,2)

Vanessa M. Peterson, Kelvin Xi Zhang, Namit Kumar, Jerelyn Wong, Lixia Li, Douglas C. Wilson, Renee Moore, Terrill K. McClanahan, Svetlana Sadekova, and Joel A. Klappenbach. Multiplexed quantification of proteins and transcripts in single cells. Nature Biotechnology, 35(1010):936–939, Oct 2017. doi:10.1038/nbt.3973.

spSHS+17

Marlon Stoeckius, Christoph Hafemeister, William Stephenson, Brian Houck-Loomis, Pratip K. Chattopadhyay, Harold Swerdlow, Rahul Satija, and Peter Smibert. Simultaneous epitope and transcriptome measurement in single cells. Nature Methods, 14(9):865–868, Sep 2017. URL: https://doi.org/10.1038/nmeth.4380, doi:10.1038/nmeth.4380.

spSBEO+21

Bo Sun, Emmanuel Bugarin-Estrada, Lauren Elizabeth Overend, Catherine Elizabeth Walker, Felicia Anna Tucci, and Rachael Jennifer Mary Bashford-Rogers. Double-jeopardy: scrna-seq doublet/multiplet detection using multi-omic profiling. Cell Reports Methods, 1(1):100008, May 2021. doi:10.1016/j.crmeth.2021.100008.

spXD22

Haiyang Xie and Xianting Ding. The intriguing landscape of single-cell protein analysis. Advanced Science, n/a(n/a):2105932, 2022. doi:10.1002/advs.202105932.

spZJT+22

Ye Zheng, Seong-Hwan Jun, Yuan Tian, Mair Florian, and Raphael Gottardo. Robust normalization and integration of single-cell protein expression across cite-seq datasets. bioRxiv, 2022. URL: https://www.biorxiv.org/content/early/2022/05/01/2022.04.29.489989, arXiv:https://www.biorxiv.org/content/early/2022/05/01/2022.04.29.489989.full.pdf, doi:10.1101/2022.04.29.489989.

35.8. Contributors#

We gratefully acknowledge the contributions of:

35.8.1. Authors#

  • Daniel Strobl

  • Ciro Ramírez-Suástegui

35.8.2. Reviewers#

  • Lukas Heumos