39. Batch correction

39.1. Motivation

As we saw when visualizing the ADT data earlier, batch effects between samples are very pronounced, so batch correction is required to mitigate them. To date, no purpose-built methods for batch correction of ADT data have been developed. We therefore suggest applying methods designed for transcriptomics data to ADT data.

39.2. Environment setup

import scanpy as sc
import muon as mu
import numpy as np
import seaborn as sns
import harmonypy as hm
import warnings

warnings.simplefilter(action="ignore", category=UserWarning)
warnings.simplefilter(action="ignore", category=FutureWarning)

39.3. Loading the data

filtered_xdbt_mu_path = "/lustre/groups/ml01/workspace/ciro.suastegui/bp2.0/data/neurips_cite_pp_filtered-qc-norm-xdbt.h5mu"
filtered = mu.read(filtered_xdbt_mu_path)

It is not yet clear which batch effect correction method works best for ADT data. For general purposes, we recommend scVI [Lopez et al., 2018] or Harmony [Korsunsky et al., 2019], because both perform robustly on scRNA-seq data.

39.4. Harmony

%%time
ho = hm.run_harmony(filtered["prot"].X, filtered["prot"].obs, ["donor"])
2023-02-03 15:07:03,371 - harmonypy - INFO - Iteration 1 of 10
2023-02-03 15:08:18,164 - harmonypy - INFO - Iteration 2 of 10
2023-02-03 15:09:32,771 - harmonypy - INFO - Iteration 3 of 10
2023-02-03 15:10:47,929 - harmonypy - INFO - Iteration 4 of 10
2023-02-03 15:11:59,638 - harmonypy - INFO - Iteration 5 of 10
2023-02-03 15:12:44,865 - harmonypy - INFO - Iteration 6 of 10
2023-02-03 15:13:22,069 - harmonypy - INFO - Iteration 7 of 10
2023-02-03 15:13:56,532 - harmonypy - INFO - Iteration 8 of 10
2023-02-03 15:14:31,102 - harmonypy - INFO - Converged after 8 iterations
CPU times: user 1h 20min 16s, sys: 1h 25min 24s, total: 2h 45min 40s
Wall time: 9min 11s
pc_std = np.std(ho.Z_corr, axis=1).tolist()
sns.scatterplot(x=range(0, len(pc_std)), y=sorted(pc_std, reverse=True))
[Figure: scatter plot of the standard deviations of the Harmony-corrected dimensions, sorted in decreasing order]
filtered["prot"].obsm["X_pcahm"] = ho.Z_corr.transpose()
filtered["prot"].obsm
AxisArrays with keys: X_pcahm
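
The transpose above is needed because `obsm` entries must have one row per cell, while the code treats `ho.Z_corr` as having one column per cell. As a minimal sketch of a defensive alternative, the hypothetical helper `as_cells_by_features` below (our own name, not part of harmonypy or muon) checks the shape against the number of cells instead of transposing unconditionally. It uses only NumPy and a toy matrix.

```python
import numpy as np


def as_cells_by_features(z_corr, n_cells):
    """Return a 2-D matrix with cells on the rows (hypothetical helper).

    If both axes have length n_cells the input is ambiguous and is
    returned unchanged.
    """
    z = np.asarray(z_corr)
    if z.ndim != 2:
        raise ValueError("expected a 2-D matrix")
    if z.shape[0] == n_cells:
        return z
    if z.shape[1] == n_cells:
        return z.T
    raise ValueError(f"no axis of shape {z.shape} matches n_cells={n_cells}")


# Toy stand-in for a corrected matrix: 5 features x 100 cells.
toy_z = np.zeros((5, 100))
toy_embedding = as_cells_by_features(toy_z, n_cells=100)
print(toy_embedding.shape)

# Assumed usage with the objects from this chapter:
# filtered["prot"].obsm["X_pcahm"] = as_cells_by_features(
#     ho.Z_corr, filtered["prot"].n_obs
# )
```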
%%time
sc.pp.neighbors(filtered["prot"], n_pcs=30, use_rep="X_pcahm")
sc.tl.umap(filtered["prot"])
CPU times: user 5min 17s, sys: 2min 9s, total: 7min 26s
Wall time: 2min 33s
sc.pl.umap(filtered["prot"], color=["donor", "batch"])
[Figure: UMAP of the Harmony-corrected ADT data, colored by donor and by batch]

As we can see here, the cells of different donors are much more intermixed in the embedding than before.
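
The visual impression of intermixing can also be summarized with a number. The sketch below is one simple possibility and not an established benchmark metric: for each cell it takes the `k` nearest neighbors in the embedding, computes the entropy of their donor labels, and averages over cells. Values near 0 mean that neighborhoods contain a single donor, values near 1 mean that donors are well mixed. The function name `neighbor_batch_entropy` and the default `k=15` are our own illustrative choices, and the example uses toy data with only NumPy.

```python
import numpy as np


def neighbor_batch_entropy(embedding, batches, k=15):
    """Mean normalized entropy of batch labels among k nearest neighbors.

    Brute-force distances, so this is meant for a subsample of cells.
    """
    emb = np.asarray(embedding, dtype=float)
    labels, codes = np.unique(np.asarray(batches).ravel(), return_inverse=True)
    n_cells = emb.shape[0]
    if labels.size < 2:
        raise ValueError("need at least two batches")
    if not 0 < k < n_cells:
        raise ValueError("k must be between 1 and n_cells - 1")

    # Pairwise squared Euclidean distances; exclude each cell from its own neighbors.
    sq = (emb ** 2).sum(axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * emb @ emb.T
    np.fill_diagonal(dist, np.inf)
    nn = np.argpartition(dist, k, axis=1)[:, :k]

    # Fraction of neighbors that belong to each batch.
    counts = np.stack(
        [(codes[nn] == b).sum(axis=1) for b in range(labels.size)], axis=1
    )
    p = counts / k
    with np.errstate(divide="ignore", invalid="ignore"):
        ent = -np.where(p > 0, p * np.log(p), 0.0).sum(axis=1)
    return float(ent.mean() / np.log(labels.size))


# Toy data: the same cells, once with and once without a donor-specific shift.
rng = np.random.default_rng(0)
donor = np.repeat(["d1", "d2"], 100)
mixed = rng.normal(size=(200, 5))
shifted = mixed + np.where(donor == "d1", 0.0, 10.0)[:, None]

score_shifted = neighbor_batch_entropy(shifted, donor)  # near 0: donors separate
score_mixed = neighbor_batch_entropy(mixed, donor)      # near 1: donors intermixed
print(score_shifted, score_mixed)
```

On the real data, the same function could be applied to a random subsample of rows of `filtered["prot"].obsm["X_pcahm"]` together with the matching entries of `filtered["prot"].obs["donor"]`, and compared with the value obtained on the uncorrected data.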

sc.pl.umap(filtered["prot"], color=["CD4-1", "CD8", "CD3"])
sc.pl.umap(filtered["prot"], color=["CD14-1", "CD16"])
[Figure: UMAPs of the Harmony-corrected ADT data, colored by CD4-1, CD8 and CD3 (first plot) and by CD14-1 and CD16 (second plot)]

We check the expression of a few marker proteins to confirm that distinct cell types remain separated after correction. T cells still form a separate population that further splits into CD4 and CD8 T cells.

In the following steps, you can cluster and annotate the cells with a process similar to the one described in the annotation chapter[LINK]. Here, we have used only the ADT part of the data and have therefore ignored all information contained in the RNA part of the study. Other chapters explore how to use both modalities jointly, which allows, for example, a more detailed cell type annotation.

39.5. Key takeaways

TODO

39.6. References

[Korsunsky et al., 2019]

Ilya Korsunsky, Nghia Millard, Jean Fan, Kamil Slowikowski, Fan Zhang, Kevin Wei, Yuriy Baglaenko, Michael Brenner, Po-ru Loh, and Soumya Raychaudhuri. Fast, sensitive and accurate integration of single-cell data with Harmony. Nature Methods, 16(12):1289–1296, Dec 2019. URL: https://doi.org/10.1038/s41592-019-0619-0.

[Lopez et al., 2018]

Romain Lopez, Jeffrey Regier, Michael B. Cole, Michael I. Jordan, and Nir Yosef. Deep generative modeling for single-cell transcriptomics. Nature Methods, 15(12):1053–1058, Dec 2018. URL: https://doi.org/10.1038/s41592-018-0229-2.

39.7. Contributors

We gratefully acknowledge the contributions of:

39.7.1. Authors

  • Daniel Strobl

  • Ciro Ramírez-Suástegui

39.7.2. Reviewers

  • Lukas Heumos