12. Data integration#

12.1. Motivation#

A central challenge in most scRNA-seq data analyses is presented by batch effects. Batch effects are changes in measured expression levels that are the result of handling cells in distinct groups or “batches”. For example, a batch effect can arise if two labs have taken samples from the same cohort, but these samples are dissociated differently. If Lab A optimizes its dissociation protocol to dissociate cells in the sample while minimizing the stress on them, and Lab B does not, then it is likely that the cells in the data from the group B will express more stress-linked genes (JUN, JUNB, FOS, etc. see [van den Brink et al., 2017]) even if the cells had the same profile in the original tissue. In general, the origins of batch effects are diverse and difficult to pin down. Some batch effect sources might be technical such as differences in sample handling, experimental protocols, or sequencing depths, but biological effects such as donor variation, tissue, or sampling location are also often interpreted as a batch effect [Luecken et al., 2021]. Whether or not biological factors should be considered batch effects can depend on the experimental design and the question being asked. Removing batch effects is crucial to enable joint analysis that can focus on finding common structure in the data across batches and enable us to perform queries across datasets. Often it is only after removing these effects that rare cell populations can be identified that were previously obscured by differences between batches. Enabling queries across datasets allows us to ask questions that could not be answered by analysing individual datasets, such as Which cell types express SARS-CoV-2 entry factors and how does this expression differ between individuals? [Muus et al., 2021].

When removing batch effects from omics data, one must make two central choices: (1) the method and parameterization, and (2) the batch covariate. As batch effects can arise between groupings of cells at different levels (i.e., samples, donors, datasets etc.), the choice of batch covariate indicates which level of variation should be retained and which level removed. The finer the batch resolution, the more effects will be removed. However, fine batch variation is also more likely to be confounded with biologically meaningful signals. For example, samples typically come from different individuals or different locations in the tissue. These effects may be meaningful to inspect. Thus, the choice of batch covariate will depend on the goal of your integration task. Do you want to see differences between individuals, or are you focused on common cell type variation? An approach to batch covariate selection based on quantitative analyses was pioneered in a recent effort to build an integrated atlas of the human lung, where the variance attributable to different technical covariates was used to make this choice [Sikkema et al., 2022].

12.1.1. Types of integration models#

Methods that remove batch effects in scRNA-seq are typically composed of (up to) three steps:

  1. Dimensionality reduction

  2. Modeling and removing the batch effect

  3. Projection back into a high-dimensional space

While modeling and removing the batch effect (Step 2) is the central part of any batch removal method, many methods first project the data to a lower dimensional space (Step 1) to improve the signal-to-noise ratio (see the dimensionality reduction chapter) and perform batch correction in that space to achieve better performance (see [Luecken et al., 2021]). In the third step, a method may project the data back into the original high-dimensional feature space after removing the fitted batch effect to output a batch-corrected gene expression matrix.

Batch-effect removal methods can vary in each of these three steps. They may use various linear or non-linear dimensionality reduction approaches, linear or non-linear batch effect models, and they may output different formats of batch-corrected data. Overall, we can divide methods for batch effect removal into 4 categories. In their order of development, these are global models, linear embedding models, graph-based methods, and deep learning approaches (Fig I1).

Global models originate from bulk transcriptomics and model the batch effect as a consistent (additive and/or multiplicative) effect across all cells. A common example is ComBat [Johnson et al., 2007].

Linear embedding models were the first single-cell-specific batch removal methods. These approaches often use a variant of singular value decomposition (SVD) to embed the data, then look for local neighborhoods of similar cells across batches in the embedding, which they use to correct the batch effect in a locally adaptive (non-linear) manner. Methods often project the data back into gene expression space using the SVD loadings, but may also only output a corrected embedding. This is the most common group of methods and prominent examples include the pioneering mutual nearest neighbors (MNN) method [Haghverdi et al., 2018] (which does not perform any dimensionality reduction), Seurat integration [Butler et al., 2018, Stuart et al., 2019], Scanorama [Hie et al., 2019], FastMNN [Haghverdi et al., 2018], and Harmony [Korsunsky et al., 2019].

Graph-based methods are typically the fastest methods to run. These approaches use a nearest-neighbor graph to represent the data from each batch. Batch effects are corrected by forcing connections between cells from different batches and then allowing for differences in cell type compositions by pruning the forced edges. The most prominent example of these approaches is the Batch-Balanced k-Nearest Neighbor (BBKNN) method [Polański et al., 2019].

Deep learning (DL) approaches are the most recent, and most complex methods for batch effect removal that typically require the most data for good performance. Most deep learning integration methods are based on autoencoder networks, and either condition the dimensionality reduction on the batch covariate in a conditional variational autoencoder (CVAE) or fit a locally linear correction in the embedded space. Prominent examples of DL methods are scVI [Lopez et al., 2018], scANVI [Xu et al., 2021], and scGen [Lotfollahi et al., 2019].

Some methods can use cell identity labels to provide the method with a reference for what biological variation should not be removed as a batch effect. As batch-effect removal is typically a preprocessing task, such approaches may not be applicable to many integration scenarios as labels are generally not yet available at this stage.

More detailed overviews of batch-effect removal methods can be found in [Argelaguet et al., 2021] and [Luecken et al., 2021].

Overview_fig Fig. I1: Overview of different types of integration methods with examples.

12.1.2. Batch removal complexity#

The removal of batch effects in scRNA-seq data has previously been divided into two subtasks: batch correction and data integration [Luecken and Theis, 2019]. These subtasks differ in the complexity of the batch effect that must be removed. Batch correction methods deal with batch effects between samples in the same experiment where cell identity compositions are consistent, and the effect is often quasi-linear. In contrast, data integration methods deal with complex, often nested, batch effects between datasets that may be generated with different protocols and where cell identities may not be shared across batches. While we use this distinction here we should note that these terms are often used interchangeably in general use. Given the differences in complexity, it is not surprising that different methods have been benchmarked as being optimal for these two subtasks.

12.1.3. Comparisons of data integration methods#

Several benchmarks have previously evaluated the performance of methods for batch correction and data integration. When removing batch effects, methods may overcorrect and remove meaningful biological variation in addition to the batch effect. For this reason, it is important that integration performance is evaluated by considering both batch effect removal and the conservation of biological variation.

The k-nearest-neighbor Batch-Effect Test (kBET) was the first metric for quantifying batch correction of scRNA-seq data [Büttner et al., 2019]. Using kBET, the authors found that ComBat outperformed other approaches for batch correction while comparing predominantly global models. Building on this, two recent benchmarks [Tran et al., 2020] and [Chazarra-Gil et al., 2021] also benchmarked linear-embedding and deep-learning models on batch correction tasks with few batches or low biological complexity. These studies found that the linear-embedding models Seurat [Butler et al., 2018, Stuart et al., 2019] and Harmony [Korsunsky et al., 2019] performed well for simple batch correction tasks.

Benchmarking complex integration tasks poses additional challenges due to both the size and number of datasets as well as the diversity of scenarios. Recently, a large study used 14 metrics to benchmark 16 methods across integration method classes on five RNA tasks and two simulations [Luecken et al., 2021]. While top-performing methods per task differed, approaches that use cell type labels performed better across tasks. Furthermore, deep learning approaches scANVI (with labels), scVI, and scGen (with labels), as well as the linear embedding model Scanorama, performed best, particularly on complex tasks, while Harmony performed well on less complex tasks. A similar benchmark performed for the specific purpose of integrating retina datasets to build an ocular mega-atlas also found that scVI outperformed other methods [Swamy et al., 2021].

12.1.4. Choosing an integration method#

While integration methods have now been extensively benchmarked, an optimal method for all scenarios does not exist. Packages of integration performance metrics and evaluation pipelines like scIB and batchbench can be used to evaluate integration performance on your own data. However many metrics (particularly those that measure the conservation of biological variation) require ground-truth cell identity labels. Parameter optimization may tune many methods to work for particular tasks, yet in general, one can say that Harmony and Seurat consistently perform well for simple batch correction tasks, and scVI, scGen, scANVI, and Scanorama perform well for more complex data integration tasks. When choosing a method, we would recommend looking into these options first. Additionally, integration method choice may be guided by the required output data format (i.e., do you need corrected gene expression data or does an integrated embedding suffice?). It would be prudent to test multiple methods and evaluate the outputs on the basis of quantitative definitions of success before selecting one. Extensive guidelines for data integration method choice can be found in [Luecken et al., 2021].

In the rest of this chapter, we demonstrate some of the best-performing methods and quickly demonstrate how integration performance can be evaluated.

# Python packages
import scanpy as sc
import scvi
import bbknn
import scib
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib

# R interface
from rpy2.robjects import pandas2ri
from rpy2.robjects import r
import rpy2.rinterface_lib.callbacks
import anndata2ri

pandas2ri.activate()
anndata2ri.activate()

%load_ext rpy2.ipython
Global seed set to 0
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/pytorch_lightning/utilities/warnings.py:53: LightningDeprecationWarning: pytorch_lightning.utilities.warnings.rank_zero_deprecation has been deprecated in v1.6 and will be removed in v1.8. Use the equivalent function from the pytorch_lightning.utilities.rank_zero module instead.
  new_rank_zero_deprecation(
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/pytorch_lightning/utilities/warnings.py:58: LightningDeprecationWarning: The `pytorch_lightning.loggers.base.rank_zero_experiment` is deprecated in v1.7 and will be removed in v1.9. Please use `pytorch_lightning.loggers.logger.rank_zero_experiment` instead.
  return new_rank_zero_deprecation(*args, **kwargs)
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/rpy2/robjects/pandas2ri.py:263: DeprecationWarning: The global conversion available with activate() is deprecated and will be removed in the next major release. Use a local converter.
  warnings.warn('The global conversion available with activate() '
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/rpy2/robjects/numpy2ri.py:205: DeprecationWarning: The global conversion available with activate() is deprecated and will be removed in the next major release. Use a local converter.
  warnings.warn('The global conversion available with activate() '
%%R
# R packages
library(Seurat)
    WARNING: The R package "reticulate" does not
    consider that it could be called from a Python process. This
    results in a quasi-obligatory segfault when rpy2 is evaluating
    R code using it. On the hand, rpy2 is accounting for the
    fact that it might already be running embedded in a Python
    process. This is why:
    - Python -> rpy2 -> R -> reticulate: crashes
    - R -> reticulate -> Python -> rpy2: works

    The issue with reticulate is tracked here:
    https://github.com/rstudio/reticulate/issues/208
    

12.2. Dataset#

The dataset we will use to demonstrate data integration contains several samples of bone marrow mononuclear cells. These samples were originally created for the Open Problems in Single-Cell Analysis NeurIPS Competition 2021 [Lance et al., 2022, Luecken et al., 2022]. The 10x Multiome protocol was used which measures both RNA expression (scRNA-seq) and chromatin accessibility (scATAC-seq) in the same cells. The version of the data we use here was already pre-processed to remove low-quality cells.

Let’s read in the dataset using scanpy to get an AnnData object.

!wget https://figshare.com/ndownloader/files/45452260 -O openproblems_bmmc_multiome_genes_filtered.h5ad
adata_raw = sc.read_h5ad("openproblems_bmmc_multiome_genes_filtered.h5ad")
adata_raw.layers["logcounts"] = adata_raw.X
adata_raw
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata/_io/h5ad.py:238: OldFormatWarning: Element '/layers' was written without encoding metadata.
  d[k] = read_elem(f[k])
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata/_io/specs/methods.py:590: OldFormatWarning: Element '/obs/GEX_pct_counts_mt' was written without encoding metadata.
  return read_elem(dataset)
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata/_io/specs/methods.py:590: OldFormatWarning: Element '/obs/GEX_n_counts' was written without encoding metadata.
  return read_elem(dataset)
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata/_io/specs/methods.py:590: OldFormatWarning: Element '/obs/GEX_n_genes' was written without encoding metadata.
  return read_elem(dataset)
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata/_io/specs/methods.py:590: OldFormatWarning: Element '/obs/GEX_size_factors' was written without encoding metadata.
  return read_elem(dataset)
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata/_io/specs/methods.py:584: OldFormatWarning: Element '/obs/__categories/GEX_phase' was written without encoding metadata.
  categories = read_elem(categories_dset)
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata/_io/specs/methods.py:587: OldFormatWarning: Element '/obs/GEX_phase' was written without encoding metadata.
  read_elem(dataset), categories, ordered=ordered
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata/_io/specs/methods.py:590: OldFormatWarning: Element '/obs/ATAC_nCount_peaks' was written without encoding metadata.
  return read_elem(dataset)
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata/_io/specs/methods.py:590: OldFormatWarning: Element '/obs/ATAC_atac_fragments' was written without encoding metadata.
  return read_elem(dataset)
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata/_io/specs/methods.py:590: OldFormatWarning: Element '/obs/ATAC_reads_in_peaks_frac' was written without encoding metadata.
  return read_elem(dataset)
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata/_io/specs/methods.py:590: OldFormatWarning: Element '/obs/ATAC_blacklist_fraction' was written without encoding metadata.
  return read_elem(dataset)
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata/_io/specs/methods.py:590: OldFormatWarning: Element '/obs/ATAC_nucleosome_signal' was written without encoding metadata.
  return read_elem(dataset)
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata/_io/specs/methods.py:584: OldFormatWarning: Element '/obs/__categories/cell_type' was written without encoding metadata.
  categories = read_elem(categories_dset)
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata/_io/specs/methods.py:587: OldFormatWarning: Element '/obs/cell_type' was written without encoding metadata.
  read_elem(dataset), categories, ordered=ordered
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata/_io/specs/methods.py:584: OldFormatWarning: Element '/obs/__categories/batch' was written without encoding metadata.
  categories = read_elem(categories_dset)
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata/_io/specs/methods.py:587: OldFormatWarning: Element '/obs/batch' was written without encoding metadata.
  read_elem(dataset), categories, ordered=ordered
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata/_io/specs/methods.py:590: OldFormatWarning: Element '/obs/ATAC_pseudotime_order' was written without encoding metadata.
  return read_elem(dataset)
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata/_io/specs/methods.py:590: OldFormatWarning: Element '/obs/GEX_pseudotime_order' was written without encoding metadata.
  return read_elem(dataset)
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata/_io/specs/methods.py:584: OldFormatWarning: Element '/obs/__categories/Samplename' was written without encoding metadata.
  categories = read_elem(categories_dset)
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata/_io/specs/methods.py:587: OldFormatWarning: Element '/obs/Samplename' was written without encoding metadata.
  read_elem(dataset), categories, ordered=ordered
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata/_io/specs/methods.py:584: OldFormatWarning: Element '/obs/__categories/Site' was written without encoding metadata.
  categories = read_elem(categories_dset)
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata/_io/specs/methods.py:587: OldFormatWarning: Element '/obs/Site' was written without encoding metadata.
  read_elem(dataset), categories, ordered=ordered
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata/_io/specs/methods.py:584: OldFormatWarning: Element '/obs/__categories/DonorNumber' was written without encoding metadata.
  categories = read_elem(categories_dset)
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata/_io/specs/methods.py:587: OldFormatWarning: Element '/obs/DonorNumber' was written without encoding metadata.
  read_elem(dataset), categories, ordered=ordered
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata/_io/specs/methods.py:584: OldFormatWarning: Element '/obs/__categories/Modality' was written without encoding metadata.
  categories = read_elem(categories_dset)
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata/_io/specs/methods.py:587: OldFormatWarning: Element '/obs/Modality' was written without encoding metadata.
  read_elem(dataset), categories, ordered=ordered
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata/_io/specs/methods.py:590: OldFormatWarning: Element '/obs/VendorLot' was written without encoding metadata.
  return read_elem(dataset)
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata/_io/specs/methods.py:590: OldFormatWarning: Element '/obs/DonorID' was written without encoding metadata.
  return read_elem(dataset)
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata/_io/specs/methods.py:590: OldFormatWarning: Element '/obs/DonorAge' was written without encoding metadata.
  return read_elem(dataset)
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata/_io/specs/methods.py:590: OldFormatWarning: Element '/obs/DonorBMI' was written without encoding metadata.
  return read_elem(dataset)
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata/_io/specs/methods.py:584: OldFormatWarning: Element '/obs/__categories/DonorBloodType' was written without encoding metadata.
  categories = read_elem(categories_dset)
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata/_io/specs/methods.py:587: OldFormatWarning: Element '/obs/DonorBloodType' was written without encoding metadata.
  read_elem(dataset), categories, ordered=ordered
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata/_io/specs/methods.py:584: OldFormatWarning: Element '/obs/__categories/DonorRace' was written without encoding metadata.
  categories = read_elem(categories_dset)
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata/_io/specs/methods.py:587: OldFormatWarning: Element '/obs/DonorRace' was written without encoding metadata.
  read_elem(dataset), categories, ordered=ordered
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata/_io/specs/methods.py:584: OldFormatWarning: Element '/obs/__categories/Ethnicity' was written without encoding metadata.
  categories = read_elem(categories_dset)
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata/_io/specs/methods.py:587: OldFormatWarning: Element '/obs/Ethnicity' was written without encoding metadata.
  read_elem(dataset), categories, ordered=ordered
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata/_io/specs/methods.py:584: OldFormatWarning: Element '/obs/__categories/DonorGender' was written without encoding metadata.
  categories = read_elem(categories_dset)
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata/_io/specs/methods.py:587: OldFormatWarning: Element '/obs/DonorGender' was written without encoding metadata.
  read_elem(dataset), categories, ordered=ordered
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata/_io/specs/methods.py:584: OldFormatWarning: Element '/obs/__categories/QCMeds' was written without encoding metadata.
  categories = read_elem(categories_dset)
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata/_io/specs/methods.py:587: OldFormatWarning: Element '/obs/QCMeds' was written without encoding metadata.
  read_elem(dataset), categories, ordered=ordered
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata/_io/specs/methods.py:584: OldFormatWarning: Element '/obs/__categories/DonorSmoker' was written without encoding metadata.
  categories = read_elem(categories_dset)
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata/_io/specs/methods.py:587: OldFormatWarning: Element '/obs/DonorSmoker' was written without encoding metadata.
  read_elem(dataset), categories, ordered=ordered
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata/_io/specs/methods.py:590: OldFormatWarning: Element '/obs/_index' was written without encoding metadata.
  return read_elem(dataset)
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata/_io/h5ad.py:238: OldFormatWarning: Element '/obsm' was written without encoding metadata.
  d[k] = read_elem(f[k])
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata/_io/specs/methods.py:92: OldFormatWarning: Element '/obsm/ATAC_lsi_full' was written without encoding metadata.
  return {k: read_elem(v) for k, v in elem.items()}
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata/_io/specs/methods.py:92: OldFormatWarning: Element '/obsm/ATAC_lsi_red' was written without encoding metadata.
  return {k: read_elem(v) for k, v in elem.items()}
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata/_io/specs/methods.py:92: OldFormatWarning: Element '/obsm/ATAC_umap' was written without encoding metadata.
  return {k: read_elem(v) for k, v in elem.items()}
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata/_io/specs/methods.py:92: OldFormatWarning: Element '/obsm/GEX_X_pca' was written without encoding metadata.
  return {k: read_elem(v) for k, v in elem.items()}
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata/_io/specs/methods.py:92: OldFormatWarning: Element '/obsm/GEX_X_umap' was written without encoding metadata.
  return {k: read_elem(v) for k, v in elem.items()}
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata/_io/h5ad.py:238: OldFormatWarning: Element '/uns' was written without encoding metadata.
  d[k] = read_elem(f[k])
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata/_io/specs/methods.py:92: OldFormatWarning: Element '/uns/ATAC_gene_activity_var_names' was written without encoding metadata.
  return {k: read_elem(v) for k, v in elem.items()}
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata/_io/specs/methods.py:92: OldFormatWarning: Element '/uns/dataset_id' was written without encoding metadata.
  return {k: read_elem(v) for k, v in elem.items()}
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata/_io/specs/methods.py:92: OldFormatWarning: Element '/uns/genome' was written without encoding metadata.
  return {k: read_elem(v) for k, v in elem.items()}
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata/_io/specs/methods.py:92: OldFormatWarning: Element '/uns/organism' was written without encoding metadata.
  return {k: read_elem(v) for k, v in elem.items()}
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata/_io/specs/methods.py:584: OldFormatWarning: Element '/var/__categories/feature_types' was written without encoding metadata.
  categories = read_elem(categories_dset)
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata/_io/specs/methods.py:587: OldFormatWarning: Element '/var/feature_types' was written without encoding metadata.
  read_elem(dataset), categories, ordered=ordered
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata/_io/specs/methods.py:584: OldFormatWarning: Element '/var/__categories/gene_id' was written without encoding metadata.
  categories = read_elem(categories_dset)
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata/_io/specs/methods.py:587: OldFormatWarning: Element '/var/gene_id' was written without encoding metadata.
  read_elem(dataset), categories, ordered=ordered
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata/_io/specs/methods.py:590: OldFormatWarning: Element '/var/_index' was written without encoding metadata.
  return read_elem(dataset)
AnnData object with n_obs × n_vars = 69249 × 129921
    obs: 'GEX_pct_counts_mt', 'GEX_n_counts', 'GEX_n_genes', 'GEX_size_factors', 'GEX_phase', 'ATAC_nCount_peaks', 'ATAC_atac_fragments', 'ATAC_reads_in_peaks_frac', 'ATAC_blacklist_fraction', 'ATAC_nucleosome_signal', 'cell_type', 'batch', 'ATAC_pseudotime_order', 'GEX_pseudotime_order', 'Samplename', 'Site', 'DonorNumber', 'Modality', 'VendorLot', 'DonorID', 'DonorAge', 'DonorBMI', 'DonorBloodType', 'DonorRace', 'Ethnicity', 'DonorGender', 'QCMeds', 'DonorSmoker'
    var: 'feature_types', 'gene_id'
    uns: 'ATAC_gene_activity_var_names', 'dataset_id', 'genome', 'organism'
    obsm: 'ATAC_gene_activity', 'ATAC_lsi_full', 'ATAC_lsi_red', 'ATAC_umap', 'GEX_X_pca', 'GEX_X_umap'
    layers: 'counts', 'logcounts'

The full dataset contains 69,249 cells and measurements for 129,921 features. There are two versions of the expression matrix, counts which contains the raw count values and logcounts which contains normalised log counts (these values are also stored in adata.X).

The obs slot contains several variables, some of which were calculated during pre-processing (for quality control) and others that contain metadata about the samples. The ones we are interested in here are:

  • cell_type - The annotated label for each cell

  • batch - The sequencing batch for each cell

For a real analysis it would be important to consider more variables but to keep it simple here we will only look at these.

We define variables to hold these names so that it is clear how we are using them in the code. This also helps with reproducibility because if we decided to change one of them for whatever reason we can be sure it has changed in the whole notebook.

label_key = "cell_type"
batch_key = "batch"

What to use as the batch label?

As mentioned above, deciding what to use as a “batch” for data integration is one of the central choices when integrating your data. The most common approach is to define each sample as a batch (as we have here) which generally produces the strongest batch correction. However, samples are usually confounded with biological factors that you may want to preserve. For example, imagine an experiment that took samples from two locations in a tissue. If samples are considered as batches then data integration methods will attempt to remove differences between them and therefore differences between the locations. In this case, it may be more appropriate to use the donor as the batch to remove differences between individuals but not between locations. The planned analysis should also be considered. In cases where you are integrating many datasets and want to capture differences between individuals, the dataset may be a useful batch covariate. In our example, it may be better to have consistent cell type labels for the two locations and then test for differences between them than to have separate clusters for each location which need to be annotated separately and then matched.

The issue of confounding between samples and biological factors can be mitigated through careful experimental design that minimizes the batch effect overall by using multiplexing techniques that allow biological samples to be combined into a single sequencing sample. However, this is not always possible and requires both extra processing in the lab and as well as extra computational steps.

Let’s have a look at the different batches and how many cells we have for each.

adata_raw.obs[batch_key].value_counts()
s4d8     9876
s4d1     8023
s3d10    6781
s1d2     6740
s1d1     6224
s2d4     6111
s2d5     4895
s3d3     4325
s4d9     4325
s1d3     4279
s2d1     4220
s3d7     1771
s3d6     1679
Name: batch, dtype: int64

There are 13 different batches in the dataset. During this experiment, multiple samples were taken from a set of donors and sequenced at different facilities so the names here are a combination of the sample number (eg. “s1”) and the donor (eg. “d2”). For simplicity, and to reduce computational time, we will select three samples to use.

keep_batches = ["s1d3", "s2d1", "s3d7"]
adata = adata_raw[adata_raw.obs[batch_key].isin(keep_batches)].copy()
adata
AnnData object with n_obs × n_vars = 10270 × 129921
    obs: 'GEX_pct_counts_mt', 'GEX_n_counts', 'GEX_n_genes', 'GEX_size_factors', 'GEX_phase', 'ATAC_nCount_peaks', 'ATAC_atac_fragments', 'ATAC_reads_in_peaks_frac', 'ATAC_blacklist_fraction', 'ATAC_nucleosome_signal', 'cell_type', 'batch', 'ATAC_pseudotime_order', 'GEX_pseudotime_order', 'Samplename', 'Site', 'DonorNumber', 'Modality', 'VendorLot', 'DonorID', 'DonorAge', 'DonorBMI', 'DonorBloodType', 'DonorRace', 'Ethnicity', 'DonorGender', 'QCMeds', 'DonorSmoker'
    var: 'feature_types', 'gene_id'
    uns: 'ATAC_gene_activity_var_names', 'dataset_id', 'genome', 'organism'
    obsm: 'ATAC_gene_activity', 'ATAC_lsi_full', 'ATAC_lsi_red', 'ATAC_umap', 'GEX_X_pca', 'GEX_X_umap'
    layers: 'counts', 'logcounts'

After subsetting to select these batches we are left with 10,270 cells.

We have two annotations for the features stored in var:

  • feature_types - The type of each feature (RNA or ATAC)

  • gene_id - The gene associated with each feature

Let’s have a look at the feature types.

adata.var["feature_types"].value_counts()
ATAC    116490
GEX      13431
Name: feature_types, dtype: int64

We can see that there are over 100,000 ATAC features but only around 13,000 gene expression (“GEX”) features. Integration of multiple modalities is a complex problem that will be described in the multimodal integration chapter, so for now we will subset to only the gene expression features. We also perform simple filtering to make sure we have no features with zero counts (this is necessary because by selecting a subset of samples we may have removed all the cells which expressed a particular feature).

adata = adata[:, adata.var["feature_types"] == "GEX"].copy()
sc.pp.filter_genes(adata, min_cells=1)
adata
AnnData object with n_obs × n_vars = 10270 × 13431
    obs: 'GEX_pct_counts_mt', 'GEX_n_counts', 'GEX_n_genes', 'GEX_size_factors', 'GEX_phase', 'ATAC_nCount_peaks', 'ATAC_atac_fragments', 'ATAC_reads_in_peaks_frac', 'ATAC_blacklist_fraction', 'ATAC_nucleosome_signal', 'cell_type', 'batch', 'ATAC_pseudotime_order', 'GEX_pseudotime_order', 'Samplename', 'Site', 'DonorNumber', 'Modality', 'VendorLot', 'DonorID', 'DonorAge', 'DonorBMI', 'DonorBloodType', 'DonorRace', 'Ethnicity', 'DonorGender', 'QCMeds', 'DonorSmoker'
    var: 'feature_types', 'gene_id', 'n_cells'
    uns: 'ATAC_gene_activity_var_names', 'dataset_id', 'genome', 'organism'
    obsm: 'ATAC_gene_activity', 'ATAC_lsi_full', 'ATAC_lsi_red', 'ATAC_umap', 'GEX_X_pca', 'GEX_X_umap'
    layers: 'counts', 'logcounts'

Because of the subsetting we also need to re-normalise the data. Here we just normalise using global scaling by the total counts per cell.

adata.X = adata.layers["counts"].copy()
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)
adata.layers["logcounts"] = adata.X.copy()

We will use this dataset to demonstrate integration.

Most integration methods require a single object containing all the samples and a batch variable (like we have here). If instead, you have separate objects for each of your samples you can join them using the anndata concat() function. See the concatenation tutorial for more details. Similar functionality exists in other ecosystems.

Integrating UMI and full-length data

Integrating samples from UMI and full-length protocols can present additional challenges. This is because full-length protocols are affected by gene-length bias (longer genes will be more highly expressed) while UMI data is not [Phipson et al., 2017]. Because of this, it is generally recommended to transform counts for full-length samples into a unit that corrects for gene-length (such as transcripts per million (TPM) [Wagner et al., 2012]) before attempting integration. This isn’t necessary however if all the samples being integrated used a full-length protocol.

12.3. Unintegrated data#

It is always recommended to look at the raw data before performing any integration. This can give some indication of how big any batch effects are and what might be causing them (and therefore which variables to consider as the batch label). For some experiments, it might even suggest that integration is not required if samples already overlap. This is not uncommon for mouse or cell line studies from a single lab for example, where most of the variables which contribute to batch effects can be controlled (i.e. the batch correction setting).

We will perform highly variable gene (HVG) selection, PCA and UMAP dimensionality reduction as we have seen in previous chapters.

sc.pp.highly_variable_genes(adata)
sc.tl.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)
adata
AnnData object with n_obs × n_vars = 10270 × 13431
    obs: 'GEX_pct_counts_mt', 'GEX_n_counts', 'GEX_n_genes', 'GEX_size_factors', 'GEX_phase', 'ATAC_nCount_peaks', 'ATAC_atac_fragments', 'ATAC_reads_in_peaks_frac', 'ATAC_blacklist_fraction', 'ATAC_nucleosome_signal', 'cell_type', 'batch', 'ATAC_pseudotime_order', 'GEX_pseudotime_order', 'Samplename', 'Site', 'DonorNumber', 'Modality', 'VendorLot', 'DonorID', 'DonorAge', 'DonorBMI', 'DonorBloodType', 'DonorRace', 'Ethnicity', 'DonorGender', 'QCMeds', 'DonorSmoker'
    var: 'feature_types', 'gene_id', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'ATAC_gene_activity_var_names', 'dataset_id', 'genome', 'organism', 'log1p', 'hvg', 'pca', 'neighbors', 'umap'
    obsm: 'ATAC_gene_activity', 'ATAC_lsi_full', 'ATAC_lsi_red', 'ATAC_umap', 'GEX_X_pca', 'GEX_X_umap', 'X_pca', 'X_umap'
    varm: 'PCs'
    layers: 'counts', 'logcounts'
    obsp: 'distances', 'connectivities'

This adds several new items to our AnnData object. The var slot now includes means, dispersions and the selected variable genes. In the obsp slot we have distances and connectivities for our KNN graph and in obsm are the PCA and UMAP embeddings.

Let’s plot the UMAP, colouring the points by cell identity and batch labels. If the dataset had not already been labelled (which is often the case) we would only be able to consider the batch labels.

adata.uns[batch_key + "_colors"] = [
    "#1b9e77",
    "#d95f02",
    "#7570b3",
]  # Set custom colours for batches
sc.pl.umap(adata, color=[label_key, batch_key], wspace=1)
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:163: PendingDeprecationWarning: The get_cmap function will be deprecated in a future version. Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap(obj)`` instead.
  cmap = copy(get_cmap(cmap))
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
../_images/754e7fe740ab1d359b5bc5245d857a5c5ccc2675f7acc1e41e2679d9c72fc1f9.png

Often when looking at these plots you will see a clear separation between batches. In this case, what we see is more subtle and while cells from the same label are generally near each other there is a shift between batches. If we were to perform a clustering analysis using this raw data we would probably end up with some clusters containing a single batch which would be difficult to interpret at the annotation stage. We are also likely to overlook rare cell types which are not common enough in any single sample to produce their own cluster. While UMAPs can often display batch effects, as always when considering these 2D representations it is important not to overinterpret them. For a real analysis, you should confirm the integration in other ways such as by inspecting the distribution of marker genes. In the “Benchmarking your own integration” section below we discuss metrics for quantifying the quality of an integration.

Now that we have confirmed there are batch effects to correct we can move on to the different integration methods. If the batches perfectly overlaid each other or we could discover meaningful cell clusters without correction then there would be no need to perform integration.

12.4. Batch-aware feature selection#

As shown in previous chapters we often select a subset of genes to use for our analysis in order to reduce noise and processing time. We do the same thing when we have multiple samples, however, it is important that gene selection is performed in a batch-aware way. This is because genes that are variable across the whole dataset could be capturing batch effects rather than the biological signals we are interested in. It also helps to select genes relevant to rare cell identities, for example, if an identity is only present in one sample then markers for it may not be variable across all the samples but should be in that one sample.

We can perform batch-aware highly variable gene selection by setting the batch_key argument in the scanpy highly_variable_genes() function. scanpy will then calculate HVGs for each batch separately and combine the results by selecting those genes that are highly variable in the highest number of batches. We use the scanpy function here because it has this batch awareness built in. For other methods, we would have to run them on each batch individually and then manually combine the results.

sc.pp.highly_variable_genes(
    adata, n_top_genes=2000, flavor="cell_ranger", batch_key=batch_key
)
adata
adata.var
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/scanpy/preprocessing/_highly_variable_genes.py:478: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  hvg = hvg.append(missing_hvg, ignore_index=True)
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/scanpy/preprocessing/_highly_variable_genes.py:478: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  hvg = hvg.append(missing_hvg, ignore_index=True)
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/scanpy/preprocessing/_highly_variable_genes.py:478: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  hvg = hvg.append(missing_hvg, ignore_index=True)
feature_types gene_id n_cells highly_variable means dispersions dispersions_norm highly_variable_nbatches highly_variable_intersection
AL627309.5 GEX ENSG00000241860 112 False 0.006533 0.775289 0.357973 1 False
LINC01409 GEX ENSG00000237491 422 False 0.024462 0.716935 -0.126119 0 False
LINC01128 GEX ENSG00000228794 569 False 0.030714 0.709340 -0.296701 0 False
NOC2L GEX ENSG00000188976 675 False 0.037059 0.704363 -0.494025 0 False
KLHL17 GEX ENSG00000187961 88 False 0.005295 0.721757 -0.028456 0 False
... ... ... ... ... ... ... ... ... ...
MT-ND5 GEX ENSG00000198786 4056 False 0.269375 0.645837 -0.457491 0 False
MT-ND6 GEX ENSG00000198695 1102 False 0.051506 0.697710 -0.248421 0 False
MT-CYB GEX ENSG00000198727 5647 False 0.520368 0.613233 -0.441362 0 False
AL592183.1 GEX ENSG00000273748 732 False 0.047486 0.753417 0.413699 0 False
AC240274.1 GEX ENSG00000271254 43 False 0.001871 0.413772 -0.386635 0 False

13431 rows × 9 columns

We can see there are now some additional columns in var:

  • highly_variable_nbatches - The number of batches where each gene was found to be highly variable

  • highly_variable_intersection - Whether each gene was highly variable in every batch

  • highly_variable - Whether each gene was selected as highly variable after combining the results from each batch

Let’s check how many batches each gene was variable in:

n_batches = adata.var["highly_variable_nbatches"].value_counts()
ax = n_batches.plot(kind="bar")
n_batches
0    9931
1    1824
2     852
3     824
Name: highly_variable_nbatches, dtype: int64
../_images/41d44028f52454e07da80ad3afdf65b386f5951c3dba3f5e92afea0c8803af23.png

The first thing we notice is that most genes are not highly variable. This is typically the case but it can depend on how different the samples we are trying to integrate are. The overlap then decreases as we add more samples, with relatively few genes being highly variable in all three batches. By selecting the top 2000 genes we have selected all HVGs that are present in two or three batches and most of those that are present in one batch.

How many genes to use?

This is a question that doesn’t have a clear answer. The authors of the scvi-tools package which we use below recommend between 1000 and 10000 genes but how many depends on the context including the complexity of the dataset and the number of batches. A survey from a previous best practices paper [Luecken and Theis, 2019] indicated people typically use between 1000 and 6000 HVGs in a standard analysis. While selecting fewer genes can aid in the removal of batch effects [Luecken et al., 2021] (the most highly-variable genes often describe only dominant biological variation), we recommend selecting slightly too many genes rather than selecting too few and risk removing genes that are important for a rare cell type or a pathway of interest. It should however be noted that more genes will also increase the time required to run the integration methods.

We will create an object with just the selected genes to use for integration.

adata_hvg = adata[:, adata.var["highly_variable"]].copy()
adata_hvg
AnnData object with n_obs × n_vars = 10270 × 2000
    obs: 'GEX_pct_counts_mt', 'GEX_n_counts', 'GEX_n_genes', 'GEX_size_factors', 'GEX_phase', 'ATAC_nCount_peaks', 'ATAC_atac_fragments', 'ATAC_reads_in_peaks_frac', 'ATAC_blacklist_fraction', 'ATAC_nucleosome_signal', 'cell_type', 'batch', 'ATAC_pseudotime_order', 'GEX_pseudotime_order', 'Samplename', 'Site', 'DonorNumber', 'Modality', 'VendorLot', 'DonorID', 'DonorAge', 'DonorBMI', 'DonorBloodType', 'DonorRace', 'Ethnicity', 'DonorGender', 'QCMeds', 'DonorSmoker'
    var: 'feature_types', 'gene_id', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection'
    uns: 'ATAC_gene_activity_var_names', 'dataset_id', 'genome', 'organism', 'log1p', 'hvg', 'pca', 'neighbors', 'umap', 'batch_colors', 'cell_type_colors'
    obsm: 'ATAC_gene_activity', 'ATAC_lsi_full', 'ATAC_lsi_red', 'ATAC_umap', 'GEX_X_pca', 'GEX_X_umap', 'X_pca', 'X_umap'
    varm: 'PCs'
    layers: 'counts', 'logcounts'
    obsp: 'distances', 'connectivities'

12.5. Variational autoencoder (VAE) based integration#

The first integration method we will use is scVI (single-cell Variational Inference), a method based on a conditional variational autoencoder [Lopez et al., 2018] available in the scvi-tools package [Gayoso et al., 2022]. A variational autoencoder is a type of artificial neural network which attempts to reduce the dimensionality of a dataset. The conditional part refers to conditioning this dimensionality reduction process on a particular covariate (in this case batches) such that the covariate does not affect the low-dimensional representation. In benchmarking studies scVI has been shown to perform well across a range of datasets with a good balance of batch correction while conserving biological variability [Luecken et al., 2021]. scVI models raw counts directly, so it is important that we provide it with a count matrix rather than a normalized expression matrix.

First, let’s make a copy of our dataset to use for this integration. Normally it is not necessary to do this but as we will demonstrate multiple integration methods making a copy makes it easier to show what has been added by each method.

adata_scvi = adata_hvg.copy()

12.5.1. Data preparation#

The first step in using scVI is to prepare our AnnData object. This step stores some information required by scVI such as which expression matrix to use and what the batch key is.

scvi.model.SCVI.setup_anndata(adata_scvi, layer="counts", batch_key=batch_key)
adata_scvi
AnnData object with n_obs × n_vars = 10270 × 2000
    obs: 'GEX_pct_counts_mt', 'GEX_n_counts', 'GEX_n_genes', 'GEX_size_factors', 'GEX_phase', 'ATAC_nCount_peaks', 'ATAC_atac_fragments', 'ATAC_reads_in_peaks_frac', 'ATAC_blacklist_fraction', 'ATAC_nucleosome_signal', 'cell_type', 'batch', 'ATAC_pseudotime_order', 'GEX_pseudotime_order', 'Samplename', 'Site', 'DonorNumber', 'Modality', 'VendorLot', 'DonorID', 'DonorAge', 'DonorBMI', 'DonorBloodType', 'DonorRace', 'Ethnicity', 'DonorGender', 'QCMeds', 'DonorSmoker', '_scvi_batch', '_scvi_labels'
    var: 'feature_types', 'gene_id', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection'
    uns: 'ATAC_gene_activity_var_names', 'dataset_id', 'genome', 'organism', 'log1p', 'hvg', 'pca', 'neighbors', 'umap', 'batch_colors', 'cell_type_colors', '_scvi_uuid', '_scvi_manager_uuid'
    obsm: 'ATAC_gene_activity', 'ATAC_lsi_full', 'ATAC_lsi_red', 'ATAC_umap', 'GEX_X_pca', 'GEX_X_umap', 'X_pca', 'X_umap'
    varm: 'PCs'
    layers: 'counts', 'logcounts'
    obsp: 'distances', 'connectivities'

The fields created by scVI are prefixed with _scvi. These are designed for internal use and should not be manually modified. The general advice from the scvi-tools authors is to not make any changes to our object until after the model is trained. On other datasets, you may see a warning about the input expression matrix containing unnormalised count data. This usually means you should check that the layer provided to the setup function does actually contain count values but it can also happen if you have values from performing gene length correction on data from a full-length protocol or from another quantification method that does not produce integer counts.

12.5.2. Building the model#

We can now construct an scVI model object. As well as the scVI model we use here, the scvi-tools package contains various other models (we will use the scANVI model below).

model_scvi = scvi.model.SCVI(adata_scvi)
model_scvi
SCVI Model with the following params: 
n_hidden: 128, n_latent: 10, n_layers: 1, dropout_rate: 0.1, dispersion: gene, gene_likelihood: zinb, 
latent_distribution: normal
Training status: Not Trained

The scVI model object contains the provided AnnData object as well as the neural network for the model itself. You can see that currently the model is not trained. If we wanted to modify the structure of the network we could provide additional arguments to the model construction function but here we just use the defaults.

We can also print a more detailed description of the model that shows us where things are stored in the associated AnnData object.

model_scvi.view_anndata_setup()
Anndata setup with scvi-tools version 0.18.0.

Setup via `SCVI.setup_anndata` with arguments:
{
'layer': 'counts',
'batch_key': 'batch',
'labels_key': None,
'size_factor_key': None,
'categorical_covariate_keys': None,
'continuous_covariate_keys': None
}

         Summary Statistics         
┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
┃     Summary Stat Key      Value ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
│         n_batch             3   │
│         n_cells           10270 │
│ n_extra_categorical_covs    0   │
│ n_extra_continuous_covs     0   │
│         n_labels            1   │
│          n_vars           2000  │
└──────────────────────────┴───────┘
               Data Registry                
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Registry Key     scvi-tools Location    ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│      X         adata.layers['counts']   │
│    batch      adata.obs['_scvi_batch']  │
│    labels     adata.obs['_scvi_labels'] │
└──────────────┴───────────────────────────┘
                  batch State Registry                   
┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃  Source Location    Categories  scvi-tools Encoding ┃
┡━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ adata.obs['batch']     s1d3              0          │
│                        s2d1              1          │
│                        s3d7              2          │
└────────────────────┴────────────┴─────────────────────┘
                     labels State Registry                      
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃      Source Location       Categories  scvi-tools Encoding ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ adata.obs['_scvi_labels']      0                0          │
└───────────────────────────┴────────────┴─────────────────────┘

Here we can see exactly what information has been assigned by scVI, including details like how each different batch is encoded in the model.

12.5.3. Training the model#

The model will be trained for a given number of epochs, a training iteration where every cell is passed through the network. By default scVI uses the following heuristic to set the number of epochs. For datasets with fewer than 20,000 cells, 400 epochs will be used and as the number of cells grows above 20,000 the number of epochs is continuously reduced. The reasoning behind this is that as the network sees more cells during each epoch it can learn the same amount of information as it would from more epochs with fewer cells.

max_epochs_scvi = np.min([round((20000 / adata.n_obs) * 400), 400])
max_epochs_scvi
400

We now train the model for the selected number of epochs (this will take ~20-40 minutes depending on the computer you are using).

model_scvi.train()
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Epoch 400/400: 100%|██████████| 400/400 [15:49<00:00,  2.27s/it, loss=648, v_num=1]
`Trainer.fit` stopped: `max_epochs=400` reached.
Epoch 400/400: 100%|██████████| 400/400 [15:49<00:00,  2.37s/it, loss=648, v_num=1]

Early stopping

Additionally to setting a target number of epochs, it is possible to also set early_stopping=True in the training function. This will let scVI decide to stop training early depending on the convergence of the model. The exact conditions for stopping can be controlled by other parameters.

12.5.4. Extracting the embedding#

The main result we want to extract from the trained model is the latent representation for each cell. This is a multi-dimensional embedding where the batch effects have been removed that can be used in a similar way to how we use PCA dimensions when analysing a single dataset. We store this in obsm with the key X_scvi.

adata_scvi.obsm["X_scVI"] = model_scvi.get_latent_representation()

12.5.5. Calculate a batch-corrected UMAP#

We will now visualise the data as we did before integration. We calculate a new UMAP embedding but instead of finding nearest neighbors in PCA space, we start with the corrected representation from scVI.

sc.pp.neighbors(adata_scvi, use_rep="X_scVI")
sc.tl.umap(adata_scvi)
adata_scvi
AnnData object with n_obs × n_vars = 10270 × 2000
    obs: 'GEX_pct_counts_mt', 'GEX_n_counts', 'GEX_n_genes', 'GEX_size_factors', 'GEX_phase', 'ATAC_nCount_peaks', 'ATAC_atac_fragments', 'ATAC_reads_in_peaks_frac', 'ATAC_blacklist_fraction', 'ATAC_nucleosome_signal', 'cell_type', 'batch', 'ATAC_pseudotime_order', 'GEX_pseudotime_order', 'Samplename', 'Site', 'DonorNumber', 'Modality', 'VendorLot', 'DonorID', 'DonorAge', 'DonorBMI', 'DonorBloodType', 'DonorRace', 'Ethnicity', 'DonorGender', 'QCMeds', 'DonorSmoker', '_scvi_batch', '_scvi_labels'
    var: 'feature_types', 'gene_id', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection'
    uns: 'ATAC_gene_activity_var_names', 'dataset_id', 'genome', 'organism', 'log1p', 'hvg', 'pca', 'neighbors', 'umap', 'batch_colors', 'cell_type_colors', '_scvi_uuid', '_scvi_manager_uuid'
    obsm: 'ATAC_gene_activity', 'ATAC_lsi_full', 'ATAC_lsi_red', 'ATAC_umap', 'GEX_X_pca', 'GEX_X_umap', 'X_pca', 'X_umap', 'X_scVI'
    varm: 'PCs'
    layers: 'counts', 'logcounts'
    obsp: 'distances', 'connectivities'

Once we have the new UMAP representation we can plot it coloured by batch and identity labels as before.

sc.pl.umap(adata_scvi, color=[label_key, batch_key], wspace=1)
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:163: PendingDeprecationWarning: The get_cmap function will be deprecated in a future version. Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap(obj)`` instead.
  cmap = copy(get_cmap(cmap))
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
../_images/8e5313bb4f33a892aaef9b713a3343aa7328fce5af7be5ce85460c48a64cc163.png

This looks better! Before, the various batches were shifted apart from each other. Now, the batches overlap more and we have a single blob for each cell identity label.

In many cases, we would not already have identity labels so from this stage we would continue with clustering, annotation and further analysis as described in other chapters.

12.6. VAE integration using cell labels#

When performing integration with scVI we pretended that we didn’t already have any cell labels (although we showed them in plots). While this scenario is common there are some cases where we do know something about cell identity in advance. Most often this is when we want to combine one or more publicly available datasets with data from a new study. When we have labels for at least some of the cells we can use scANVI (single-cell ANnotation using Variational Inference) [Xu et al., 2021]. This is an extension of the scVI model that can incorporate cell identity label information as well as batch information. Because it has this extra information it can try to keep the differences between cell labels while removing batch effects. Benchmarking suggests that scANVI tends to better preserve biological signals compared to scVI but sometimes it is not as effective at removing batch effects [Luecken et al., 2021]. While we have labels for all cells here it is also possible to use scANVI in a semi-supervised manner where labels are only provided for some cells.

Label harmonization

If you are using scANVI to integrate multiple datasets for which you already have labels it is important to first perform label harmonization. This refers to a process of checking that labels are consistent across the datasets that are being integrated. For example, a cell may be annotated as a “T cell” in one dataset, but a cell of the same type could have been given the label “CD8+ T cell” in another dataset. How best to harmonize labels is an open question but often requires input from subject-matter experts.

We start by creating a scANVI model object. Note that because scANVI refines an already trained scVI model, we provide the scVI model rather than an AnnData object. If we had not already trained an scVI model we would need to do that first. We also provide a key for the column of adata.obs which contains our cell labels as well as the label which corresponds to unlabelled cells. In this case all of our cells are labelled so we just provide a dummy value. In most cases, it is important to check that this is set correctly so that scANVI knows which label to ignore during training.

# Normally we would need to run scVI first but we have already done that here
# model_scvi = scvi.model.SCVI(adata_scvi) etc.
model_scanvi = scvi.model.SCANVI.from_scvi_model(
    model_scvi, labels_key=label_key, unlabeled_category="unlabelled"
)
print(model_scanvi)
model_scanvi.view_anndata_setup()
ScanVI Model with the following params: 
unlabeled_category: unlabelled, n_hidden: 128, n_latent: 10, n_layers: 1, dropout_rate: 0.1, dispersion: gene, 
gene_likelihood: zinb
Training status: Not Trained

Anndata setup with scvi-tools version 0.18.0.

Setup via `SCANVI.setup_anndata` with arguments:
{
'labels_key': 'cell_type',
'unlabeled_category': 'unlabelled',
'layer': 'counts',
'batch_key': 'batch',
'size_factor_key': None,
'categorical_covariate_keys': None,
'continuous_covariate_keys': None
}

         Summary Statistics         
┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
┃     Summary Stat Key      Value ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
│         n_batch             3   │
│         n_cells           10270 │
│ n_extra_categorical_covs    0   │
│ n_extra_continuous_covs     0   │
│         n_labels           22   │
│          n_vars           2000  │
└──────────────────────────┴───────┘
               Data Registry                
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Registry Key     scvi-tools Location    ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│      X         adata.layers['counts']   │
│    batch      adata.obs['_scvi_batch']  │
│    labels     adata.obs['_scvi_labels'] │
└──────────────┴───────────────────────────┘
                  batch State Registry                   
┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃  Source Location    Categories  scvi-tools Encoding ┃
┡━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ adata.obs['batch']     s1d3              0          │
│                        s2d1              1          │
│                        s3d7              2          │
└────────────────────┴────────────┴─────────────────────┘
                       labels State Registry                       
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃    Source Location         Categories     scvi-tools Encoding ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ adata.obs['cell_type']        B1 B                 0          │
│                         CD4+ T activated           1          │
│                           CD4+ T naive             2          │
│                              CD8+ T                3          │
│                           CD8+ T naive             4          │
│                            CD14+ Mono              5          │
│                            CD16+ Mono              6          │
│                           Erythroblast             7          │
│                             G/M prog               8          │
│                               HSC                  9          │
│                               ILC                 10          │
│                            Lymph prog             11          │
│                            MK/E prog              12          │
│                                NK                 13          │
│                          Naive CD20+ B            14          │
│                            Normoblast             15          │
│                           Plasma cell             16          │
│                         Proerythroblast           17          │
│                          Transitional B           18          │
│                               cDC2                19          │
│                               pDC                 20          │
│                            unlabelled             21          │
└────────────────────────┴──────────────────┴─────────────────────┘

This scANVI model object is very similar to what we saw before for scVI. As mentioned previously, we could modify the structure of the model network but here we just use the default parameters.

Again, we have a heuristic for selecting the number of training epochs. Note that this is much fewer than before as we are just refining the scVI model, rather than training a whole network from scratch.

max_epochs_scanvi = int(np.min([10, np.max([2, round(max_epochs_scvi / 3.0)])]))
model_scanvi.train(max_epochs=max_epochs_scanvi)
INFO     Training for 10 epochs.                                                                                   
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Epoch 10/10: 100%|██████████| 10/10 [00:35<00:00,  3.69s/it, loss=747, v_num=1]
`Trainer.fit` stopped: `max_epochs=10` reached.
Epoch 10/10: 100%|██████████| 10/10 [00:35<00:00,  3.56s/it, loss=747, v_num=1]

We can extract the new latent representation from the model and create a new UMAP embedding as we did for scVI.

adata_scanvi = adata_scvi.copy()
adata_scanvi.obsm["X_scANVI"] = model_scanvi.get_latent_representation()
sc.pp.neighbors(adata_scanvi, use_rep="X_scANVI")
sc.tl.umap(adata_scanvi)
sc.pl.umap(adata_scanvi, color=[label_key, batch_key], wspace=1)
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:163: PendingDeprecationWarning: The get_cmap function will be deprecated in a future version. Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap(obj)`` instead.
  cmap = copy(get_cmap(cmap))
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
../_images/55f6b0b00ec8bef5e16f261b03f8a355d603bb935fa2f505cf1cc16b3e4e2f77.png

By looking at the UMAP representation it is difficult to tell the difference between scANVI and scVI but as we will see below there are differences in metric scores when the quality of the integrations is quantified. This is a reminder that we shouldn’t overinterpret these two-dimensional representations, especially when it comes to comparing methods.

12.7. Graph-based integration#

The next method we will look at is BBKNN or “Batch Balanced KNN” [Polański et al., 2019]. This is a very different approach to scVI, which rather than using a neural network to embed cells in a batch corrected space, instead modifies how the k-nearest neighbor (KNN) graph used for clustering and embedding is constructed. As we have seen in previous chapters the normal KNN procedure connects cells to the most similar cells across the whole dataset. The change that BBKNN makes is to enforce that cells are connected to cells from other batches. While this is a simple modification it can be quite effective, particularly when there are very strong batch effects. However, as the output is an integrated graph it can have limited downstream uses as few packages will accept this as an input.

An important parameter for BBKNN is the number of neighbors per batch. A suggested heuristic for this is to use 25 if there are more than 100,000 cells or the default of 3 if there are fewer than 100,000.

neighbors_within_batch = 25 if adata_hvg.n_obs > 100000 else 3
neighbors_within_batch
3

Before using BBKNN we first perform a PCA as we would before building a normal KNN graph. Unlike scVI which models raw counts here, we start with the log-normalised expression matrix.

adata_bbknn = adata_hvg.copy()
adata_bbknn.X = adata_bbknn.layers["logcounts"].copy()
sc.pp.pca(adata_bbknn)

We can now run BBKNN, replacing the call to the scanpy neighbors() function in a standard workflow. An important difference is to make sure the batch_key argument is set which specifies a column in adata_hvg.obs that contains batch labels.

bbknn.bbknn(
    adata_bbknn, batch_key=batch_key, neighbors_within_batch=neighbors_within_batch
)
adata_bbknn
AnnData object with n_obs × n_vars = 10270 × 2000
    obs: 'GEX_pct_counts_mt', 'GEX_n_counts', 'GEX_n_genes', 'GEX_size_factors', 'GEX_phase', 'ATAC_nCount_peaks', 'ATAC_atac_fragments', 'ATAC_reads_in_peaks_frac', 'ATAC_blacklist_fraction', 'ATAC_nucleosome_signal', 'cell_type', 'batch', 'ATAC_pseudotime_order', 'GEX_pseudotime_order', 'Samplename', 'Site', 'DonorNumber', 'Modality', 'VendorLot', 'DonorID', 'DonorAge', 'DonorBMI', 'DonorBloodType', 'DonorRace', 'Ethnicity', 'DonorGender', 'QCMeds', 'DonorSmoker'
    var: 'feature_types', 'gene_id', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection'
    uns: 'ATAC_gene_activity_var_names', 'dataset_id', 'genome', 'organism', 'log1p', 'hvg', 'pca', 'neighbors', 'umap', 'batch_colors', 'cell_type_colors'
    obsm: 'ATAC_gene_activity', 'ATAC_lsi_full', 'ATAC_lsi_red', 'ATAC_umap', 'GEX_X_pca', 'GEX_X_umap', 'X_pca', 'X_umap'
    varm: 'PCs'
    layers: 'counts', 'logcounts'
    obsp: 'distances', 'connectivities'

Unlike the default scanpy function, BBKNN does not allow specifying a key for storing results so they are always stored under the default “neighbors” key.

We can use this new integrated graph just like we would a normal KNN graph to construct a UMAP embedding.

sc.tl.umap(adata_bbknn)
sc.pl.umap(adata_bbknn, color=[label_key, batch_key], wspace=1)
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:163: PendingDeprecationWarning: The get_cmap function will be deprecated in a future version. Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap(obj)`` instead.
  cmap = copy(get_cmap(cmap))
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
../_images/cf3c4dddae9af800a23218887dbd9684fc4639658d2066de0fff02b5262f9e47.png

This integration is also improved compared to the unintegrated data with cell identities grouped together but we sill see some shifts between batches.

12.8. Linear embedding integration using Mutual Nearest Neighbors (MNN)#

Some downstream applications cannot accept an integrated embedding or neighborhood graph and require a corrected expression matrix. One approach that can produce this output is the integration method in Seurat [Butler et al., 2018, Satija et al., 2015, Stuart et al., 2019]. The Seurat integration method belongs to a class of linear embedding models that make use of the idea of mutual nearest neighbors (which Seurat calls anchors) to correct batch effects [Haghverdi et al., 2018]. Mutual nearest neighbors are pairs of cells from two different datasets which are in the neighborhood of each other when the datasets are placed in the same (latent) space. After finding these cells they can be used to align the two datasets and correct the differences between them. Seurat has also been found to be one of the top mixing methods in some evaluations [Tran et al., 2020].

As Seurat is an R package we must transfer our data from Python to R. Here we prepare the AnnData to convert so that it can be handled by rpy2 and anndata2ri.

adata_seurat = adata_hvg.copy()
# Convert categorical columns to strings
adata_seurat.obs[batch_key] = adata_seurat.obs[batch_key].astype(str)
adata_seurat.obs[label_key] = adata_seurat.obs[label_key].astype(str)
# Delete uns as this can contain arbitrary objects which are difficult to convert
del adata_seurat.uns
adata_seurat
AnnData object with n_obs × n_vars = 10270 × 2000
    obs: 'GEX_pct_counts_mt', 'GEX_n_counts', 'GEX_n_genes', 'GEX_size_factors', 'GEX_phase', 'ATAC_nCount_peaks', 'ATAC_atac_fragments', 'ATAC_reads_in_peaks_frac', 'ATAC_blacklist_fraction', 'ATAC_nucleosome_signal', 'cell_type', 'batch', 'ATAC_pseudotime_order', 'GEX_pseudotime_order', 'Samplename', 'Site', 'DonorNumber', 'Modality', 'VendorLot', 'DonorID', 'DonorAge', 'DonorBMI', 'DonorBloodType', 'DonorRace', 'Ethnicity', 'DonorGender', 'QCMeds', 'DonorSmoker'
    var: 'feature_types', 'gene_id', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection'
    obsm: 'ATAC_gene_activity', 'ATAC_lsi_full', 'ATAC_lsi_red', 'ATAC_umap', 'GEX_X_pca', 'GEX_X_umap', 'X_pca', 'X_umap'
    varm: 'PCs'
    layers: 'counts', 'logcounts'
    obsp: 'distances', 'connectivities'

The prepared AnnData is now available in R as a SingleCellExperiment object thanks to anndata2ri. Note that this is transposed compared to an AnnData object so our observations (cells) are now the columns and our variables (genes) are now the rows.

%%R -i adata_seurat
adata_seurat
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/rpy2/robjects/pandas2ri.py:263: DeprecationWarning: The global conversion available with activate() is deprecated and will be removed in the next major release. Use a local converter.
  warnings.warn('The global conversion available with activate() '
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/rpy2/robjects/numpy2ri.py:205: DeprecationWarning: The global conversion available with activate() is deprecated and will be removed in the next major release. Use a local converter.
  warnings.warn('The global conversion available with activate() '
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/rpy2/robjects/pandas2ri.py:263: DeprecationWarning: The global conversion available with activate() is deprecated and will be removed in the next major release. Use a local converter.
  warnings.warn('The global conversion available with activate() '
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/rpy2/robjects/numpy2ri.py:205: DeprecationWarning: The global conversion available with activate() is deprecated and will be removed in the next major release. Use a local converter.
  warnings.warn('The global conversion available with activate() '
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata2ri/r2py.py:106: FutureWarning: X.dtype being converted to np.float32 from float64. In the next version of anndata (0.9) conversion will not be automatic. Pass dtype explicitly to avoid this warning. Pass `AnnData(X, dtype=X.dtype, ...)` to get the future behavour.
  return AnnData(exprs, obs, var, uns, obsm or None, layers=layers)
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/rpy2/robjects/pandas2ri.py:263: DeprecationWarning: The global conversion available with activate() is deprecated and will be removed in the next major release. Use a local converter.
  warnings.warn('The global conversion available with activate() '
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/rpy2/robjects/numpy2ri.py:205: DeprecationWarning: The global conversion available with activate() is deprecated and will be removed in the next major release. Use a local converter.
  warnings.warn('The global conversion available with activate() '
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata2ri/r2py.py:106: FutureWarning: X.dtype being converted to np.float32 from float64. In the next version of anndata (0.9) conversion will not be automatic. Pass dtype explicitly to avoid this warning. Pass `AnnData(X, dtype=X.dtype, ...)` to get the future behavour.
  return AnnData(exprs, obs, var, uns, obsm or None, layers=layers)
class: SingleCellExperiment 
dim: 2000 10270 
metadata(0):
assays(3): X counts logcounts
rownames(2000): GPR153 TNFRSF25 ... TMLHE-AS1 MT-ND3
rowData names(9): feature_types gene_id ... highly_variable_nbatches
  highly_variable_intersection
colnames(10270): TCACCTGGTTAGGTTG-3-s1d3 CGTTAACAGGTGTCCA-3-s1d3 ...
  AGCAGGTAGGCTATGT-12-s3d7 GCCATGATCCCTTGCG-12-s3d7
colData names(28): GEX_pct_counts_mt GEX_n_counts ... QCMeds
  DonorSmoker
reducedDimNames(8): ATAC_gene_activity ATAC_lsi_full ... PCA UMAP
mainExpName: NULL
altExpNames(0):

Seurat uses its own object to store data. Helpfully the authors provide a function to convert from SingleCellExperiment. We just provide the SingleCellExperiment object and tell Seurat which assays (layers in our AnnData object) contain raw counts and normalised expression (which Seurat stores in a slot called “data”).

%%R -i adata_seurat
seurat <- as.Seurat(adata_seurat, counts = "counts", data = "logcounts")
seurat
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/rpy2/robjects/pandas2ri.py:263: DeprecationWarning: The global conversion available with activate() is deprecated and will be removed in the next major release. Use a local converter.
  warnings.warn('The global conversion available with activate() '
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/rpy2/robjects/numpy2ri.py:205: DeprecationWarning: The global conversion available with activate() is deprecated and will be removed in the next major release. Use a local converter.
  warnings.warn('The global conversion available with activate() '
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/anndata2ri/r2py.py:106: FutureWarning: X.dtype being converted to np.float32 from float64. In the next version of anndata (0.9) conversion will not be automatic. Pass dtype explicitly to avoid this warning. Pass `AnnData(X, dtype=X.dtype, ...)` to get the future behavour.
  return AnnData(exprs, obs, var, uns, obsm or None, layers=layers)
An object of class Seurat 
2000 features across 10270 samples within 1 assay 
Active assay: originalexp (2000 features, 0 variable features)
 8 dimensional reductions calculated: ATAC_gene_activity, ATAC_lsi_full, ATAC_lsi_red, ATAC_umap, GEX_X_pca, GEX_X_umap, PCA, UMAP

Unlike some of the other methods, we have seen which take a single object and a batch key, the Seurat integration functions require a list of objects. We create this using the SplitObject() function.

%%R -i batch_key
batch_list <- SplitObject(seurat, split.by = batch_key)
batch_list
$s1d3
An object of class Seurat 
2000 features across 4279 samples within 1 assay 
Active assay: originalexp (2000 features, 0 variable features)
 8 dimensional reductions calculated: ATAC_gene_activity, ATAC_lsi_full, ATAC_lsi_red, ATAC_umap, GEX_X_pca, GEX_X_umap, PCA, UMAP

$s2d1
An object of class Seurat 
2000 features across 4220 samples within 1 assay 
Active assay: originalexp (2000 features, 0 variable features)
 8 dimensional reductions calculated: ATAC_gene_activity, ATAC_lsi_full, ATAC_lsi_red, ATAC_umap, GEX_X_pca, GEX_X_umap, PCA, UMAP

$s3d7
An object of class Seurat 
2000 features across 1771 samples within 1 assay 
Active assay: originalexp (2000 features, 0 variable features)
 8 dimensional reductions calculated: ATAC_gene_activity, ATAC_lsi_full, ATAC_lsi_red, ATAC_umap, GEX_X_pca, GEX_X_umap, PCA, UMAP

We can now use this list to find anchors for each pair of datasets. Usually, you would identify batch-aware highly variable genes first (using the FindVariableFeatures() and SelectIntegrationFeatures() functions) but as we have done that already we tell Seurat to use all the features in the object.

%%R
anchors <- FindIntegrationAnchors(batch_list, anchor.features = rownames(seurat))
anchors
  |                                                  | 0 % ~calculating   |+++++++++++++++++                                 | 33% ~03s           |++++++++++++++++++++++++++++++++++                | 67% ~01s           |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=02s  
  |                                                  | 0 % ~calculating   |+++++++++++++++++                                 | 33% ~01m 39s       |++++++++++++++++++++++++++++++++++                | 67% ~41s           |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=01m 56s
An AnchorSet object containing 25352 anchors between 3 Seurat objects 
 This can be used as input to IntegrateData.

Seurat can then use the anchors to compute a transformation that maps one dataset onto another. This is done in a pairwise way until all the datasets are merged. By default Seurat will determine a merge order so that more similar datasets are merged together first but it is also possible to define this order.

%%R
integrated <- IntegrateData(anchors)
integrated
An object of class Seurat 
4000 features across 10270 samples within 2 assays 
Active assay: integrated (2000 features, 2000 variable features)
 1 other assay present: originalexp

The result is another Seurat object, but notice now that the active assay is called “integrated”. This contains the corrected expression matrix which is the final output of the integration.

Here we extract that matrix and prepare it for transfer back to Python.

%%R -o integrated_expr
# Extract the integrated expression matrix
integrated_expr <- GetAssayData(integrated)
# Make sure the rows and columns are in the same order as the original object
integrated_expr <- integrated_expr[rownames(seurat), colnames(seurat)]
# Transpose the matrix to AnnData format
integrated_expr <- t(integrated_expr)
print(integrated_expr[1:10, 1:10])
10 x 10 sparse Matrix of class "dgCMatrix"
                                                                               
TCACCTGGTTAGGTTG-3-s1d3  .            -0.0005365199  1.032812e-02 -2.653187e-02
CGTTAACAGGTGTCCA-3-s1d3  0.0001382038 -0.1809919733 -1.454901e-02  3.608087e-03
ATTCGTTTCAGTATTG-3-s1d3 -0.0121073040 -0.0634131457  .             2.144075e-02
GGACCGAAGTGAGGTA-3-s1d3  .             .             2.972292e-04  .           
ATGAAGCCAGGGAGCT-3-s1d3 -0.0139047071 -0.0313151274  .             2.239855e-02
AGTGCGGAGTAAGGGC-3-s1d3 -0.0004299227 -0.0002657828  .            -1.871410e-03
CTACCTCAGACACCGC-3-s1d3 -0.0055208620 -0.0398862297  7.182253e-06  8.240406e-03
CTTCAATTCACGAATC-3-s1d3  .            -0.0109928448  .             1.935677e-04
CCATTGTGTAGACAAA-3-s1d3  .             0.0171909615  .             5.711311e-05
CCGTTACTCAATGTGC-3-s1d3  0.0139905515  0.0007981117  2.303346e-03  1.356206e-02
                                                                          
TCACCTGGTTAGGTTG-3-s1d3 -0.023237586  0.031938502 -0.003196878  0.01777767
CGTTAACAGGTGTCCA-3-s1d3  0.114149766 -0.013183394  0.038076743  0.80491293
ATTCGTTTCAGTATTG-3-s1d3 -0.054419895  0.010955783 -0.005951634  0.37223306
GGACCGAAGTGAGGTA-3-s1d3  0.002305526  0.011544715  0.011133475  0.02366670
ATGAAGCCAGGGAGCT-3-s1d3 -0.123505733 -0.009382408  0.002153635 -0.07013584
AGTGCGGAGTAAGGGC-3-s1d3  0.035848768  0.013858992 -0.000379393  0.05617137
CTACCTCAGACACCGC-3-s1d3 -0.003837952  0.082027580 -0.001109391 -0.06307772
CTTCAATTCACGAATC-3-s1d3  0.052970708  0.153601547  0.247920321 -0.01143158
CCATTGTGTAGACAAA-3-s1d3 -0.015445186  0.025763467 -0.003632830  0.02040172
CCGTTACTCAATGTGC-3-s1d3 -0.018025385  0.022560136  0.005755797  0.61496228
                                                   
TCACCTGGTTAGGTTG-3-s1d3 -6.661643e-03 -0.0183198213
CGTTAACAGGTGTCCA-3-s1d3  5.079864e-02  0.0394717101
ATTCGTTTCAGTATTG-3-s1d3  6.600434e-02  0.0009021682
GGACCGAAGTGAGGTA-3-s1d3  7.172740e-04  0.0095352521
ATGAAGCCAGGGAGCT-3-s1d3  1.226039e-01  0.0063816143
AGTGCGGAGTAAGGGC-3-s1d3 -1.830964e-03  0.0008381943
CTACCTCAGACACCGC-3-s1d3  8.559358e-01  0.0102285085
CTTCAATTCACGAATC-3-s1d3 -5.318167e-06 -0.0341661907
CCATTGTGTAGACAAA-3-s1d3 -6.362298e-03  0.0232357011
CCGTTACTCAATGTGC-3-s1d3 -4.910886e-02  0.0317359576

We will now store the corrected expression matrix as a layer in our AnnData object. We also set adata.X to use this matrix.

adata_seurat.X = integrated_expr
adata_seurat.layers["seurat"] = integrated_expr
print(adata_seurat)
adata.X
AnnData object with n_obs × n_vars = 10270 × 2000
    obs: 'GEX_pct_counts_mt', 'GEX_n_counts', 'GEX_n_genes', 'GEX_size_factors', 'GEX_phase', 'ATAC_nCount_peaks', 'ATAC_atac_fragments', 'ATAC_reads_in_peaks_frac', 'ATAC_blacklist_fraction', 'ATAC_nucleosome_signal', 'cell_type', 'batch', 'ATAC_pseudotime_order', 'GEX_pseudotime_order', 'Samplename', 'Site', 'DonorNumber', 'Modality', 'VendorLot', 'DonorID', 'DonorAge', 'DonorBMI', 'DonorBloodType', 'DonorRace', 'Ethnicity', 'DonorGender', 'QCMeds', 'DonorSmoker'
    var: 'feature_types', 'gene_id', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection'
    obsm: 'ATAC_gene_activity', 'ATAC_lsi_full', 'ATAC_lsi_red', 'ATAC_umap', 'GEX_X_pca', 'GEX_X_umap', 'X_pca', 'X_umap'
    varm: 'PCs'
    layers: 'counts', 'logcounts', 'seurat'
    obsp: 'distances', 'connectivities'
<10270x13431 sparse matrix of type '<class 'numpy.float32'>'
	with 14348115 stored elements in Compressed Sparse Row format>

Now that we have the results of our integration we can calculate a UMAP and plot it as we have for the other methods (we could also have done this in R).

# Reset the batch colours because we deleted them earlier
adata_seurat.uns[batch_key + "_colors"] = [
    "#1b9e77",
    "#d95f02",
    "#7570b3",
]
sc.tl.pca(adata_seurat)
sc.pp.neighbors(adata_seurat)
sc.tl.umap(adata_seurat)
sc.pl.umap(adata_seurat, color=[label_key, batch_key], wspace=1)
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:163: PendingDeprecationWarning: The get_cmap function will be deprecated in a future version. Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap(obj)`` instead.
  cmap = copy(get_cmap(cmap))
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
../_images/31c555d9e6f54b51ed9e6522efb720e2253e0e4696088c4f59f2e4a8969604d4.png

As we have previously seen, the batches are mixed while the labels are separated. It is tempting to select an integration based on the UMAPs but this does not fully represent the quality of an integration. In the next section, we present some approaches to more rigorously evaluate integration methods.

A note on scalability

As you ran the different integration methods you may have noticed that scVI took the most time. While this is true for small datasets like the example shown here, benchmarks have shown that scVI scales well for larger datasets. This is largely because the number of training epochs is adjusted for larger dataset sizes. MNN methods typically don’t scale as well, partly because they perform several pairwise integrations, so if you have 20 batches you are performing 20 integrations while other methods can consider all batches at once.

12.9. Benchmarking your own integration#

The methods demonstrated here are selected based on results from benchmarking experiments including the single-cell integration benchmarking project [Luecken et al., 2021]. This project also produced a software package called scib that can be used to run a range of integration methods as well as the metrics that were used for evaluation. In this section, we show how to use this package to evaluate the quality of an integration.

What is the ground truth?

Some of these metrics, particularly those that evaluate the conservation of biological variation, require a known ground truth to compare to. Usually, this is a cell identity label but can sometimes be other information such as known trajectories. Because of this requirement, it is difficult to evaluate integration for a completely new dataset where it is unclear what biological signal should be preserved.

The scib metrics can be run individually but there are also wrappers for running multiple metrics at once. Here we run a subset of the metrics which are quick to compute using the metrics_fast() function. This function takes a few arguments: the original unintegrated dataset, the integrated dataset, a batch key and a label key. Depending on the output of the integration method we might also need to supply additional arguments, for example here we specify the embedding to use for scVI and scANVI with the embed argument. You can also control how some metrics are run with additional arguments. Also note that you may need to check that objects are formatted properly so that scIB can find the required information.

Let’s run the metrics for each of the integrations we have performed above, as well as the unintegrated data (after highly variable gene selection).

metrics_scvi = scib.metrics.metrics_fast(
    adata, adata_scvi, batch_key, label_key, embed="X_scVI"
)
metrics_scanvi = scib.metrics.metrics_fast(
    adata, adata_scanvi, batch_key, label_key, embed="X_scANVI"
)
metrics_bbknn = scib.metrics.metrics_fast(adata, adata_bbknn, batch_key, label_key)
metrics_seurat = scib.metrics.metrics_fast(adata, adata_seurat, batch_key, label_key)
metrics_hvg = scib.metrics.metrics_fast(adata, adata_hvg, batch_key, label_key)
Silhouette score...
PC regression...
Isolated labels ASW...
Graph connectivity...
Silhouette score...
PC regression...
Isolated labels ASW...
Graph connectivity...
Silhouette score...
PC regression...
Isolated labels ASW...
Graph connectivity...
Silhouette score...
PC regression...
Isolated labels ASW...
Graph connectivity...
Silhouette score...
PC regression...
Isolated labels ASW...
Graph connectivity...

Here is an example of what one of the metrics results looks like for a single integration:

metrics_hvg
0
NMI_cluster/label NaN
ARI_cluster/label NaN
ASW_label 0.555775
ASW_label/batch 0.833656
PCR_batch 0.537850
cell_cycle_conservation NaN
isolated_label_F1 NaN
isolated_label_silhouette 0.487975
graph_conn 0.985533
kBET NaN
iLISI NaN
cLISI NaN
hvg_overlap 1.000000
trajectory NaN

Each row is a different metric and the values show the score for that metric. Scores are between 0 and 1, where 1 is a good performance and 0 is a poor performance (scib can also return unscaled scores for some metrics if required). Because we have only run the fast metrics here, some of the metrics have NaN scores. Also, note that some metrics cannot be used with some output formats which can also be a reason for NaN values being returned.

To compare the methods it is useful to have all the metrics results in one table. This code combines them and tidies them into a more convenient format.

# Concatenate metrics results
metrics = pd.concat(
    [metrics_scvi, metrics_scanvi, metrics_bbknn, metrics_seurat, metrics_hvg],
    axis="columns",
)
# Set methods as column names
metrics = metrics.set_axis(
    ["scVI", "scANVI", "BBKNN", "Seurat", "Unintegrated"], axis="columns"
)
# Select only the fast metrics
metrics = metrics.loc[
    [
        "ASW_label",
        "ASW_label/batch",
        "PCR_batch",
        "isolated_label_silhouette",
        "graph_conn",
        "hvg_overlap",
    ],
    :,
]
# Transpose so that metrics are columns and methods are rows
metrics = metrics.T
# Remove the HVG overlap metric because it's not relevant to embedding outputs
metrics = metrics.drop(columns=["hvg_overlap"])
metrics
ASW_label ASW_label/batch PCR_batch isolated_label_silhouette graph_conn
scVI 0.570953 0.907165 0.913474 0.560875 0.982940
scANVI 0.587934 0.905402 0.887735 0.556232 0.983729
BBKNN 0.555469 0.847772 0.537850 0.481913 0.969514
Seurat 0.571231 0.915237 0.788894 0.488585 0.989366
Unintegrated 0.555775 0.833656 0.537850 0.487975 0.985533

We now have all the scores in one table with metrics as columns and methods as rows. Styling the table with a gradient can make it easier to see the differences between scores.

metrics.style.background_gradient(cmap="Blues")
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/pandas/io/formats/style.py:3930: PendingDeprecationWarning: The get_cmap function will be deprecated in a future version. Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap(obj)`` instead.
  rgbas = plt.cm.get_cmap(cmap)(norm(gmap))
  ASW_label ASW_label/batch PCR_batch isolated_label_silhouette graph_conn
scVI 0.570953 0.907165 0.913474 0.560875 0.982940
scANVI 0.587934 0.905402 0.887735 0.556232 0.983729
BBKNN 0.555469 0.847772 0.537850 0.481913 0.969514
Seurat 0.571231 0.915237 0.788894 0.488585 0.989366
Unintegrated 0.555775 0.833656 0.537850 0.487975 0.985533

For some metrics, the scores tend to be in a relatively small range. To emphasise the differences between methods and place each metric on the same scale, we scale them so that the worst performer gets a score of 0, the best performer gets a score of 1 and the others are somewhere in between.

metrics_scaled = (metrics - metrics.min()) / (metrics.max() - metrics.min())
metrics_scaled.style.background_gradient(cmap="Blues")
  ASW_label ASW_label/batch PCR_batch isolated_label_silhouette graph_conn
scVI 0.476942 0.901060 1.000000 1.000000 0.676270
scANVI 1.000000 0.879454 0.931475 0.941194 0.716018
BBKNN 0.000000 0.173036 0.000000 0.000000 0.000000
Seurat 0.485521 1.000000 0.668338 0.084500 1.000000
Unintegrated 0.009437 0.000000 0.000000 0.076776 0.806917

The values now better represent the differences between methods (and better match the colour scale). However, it is important to note that the scaled scores can only be used to compare the relative performance of this specific set of integrations. If we wanted to add another method we would need to perform the scaling again. We also can’t say that an integration is definitively “good”, only that it is better than the other methods we have tried. This scaling emphasises differences between methods. For example, if we had metric scores of 0.92, 0.94 and 0.96 these would be scaled to 0, 0.5 and 1.0. This makes the first method appear to score much worse, even though it is only slightly lower than the other two and still got a very high score. This effect is bigger when comparing a few methods and when they get similar raw scores. Whether you look at raw or scaled scores depends on whether you want to focus on absolute performance or the difference in performance between methods.

The evaluation metrics can be grouped into two categories, those that measure the removal of batch effects and those that measure the conservation of biological variation. We can calculate summary scores for each of these categories by taking the mean of the scaled values for each group. This kind of summary score wouldn’t make sense with raw values as some metrics consistently produce higher scores than others (and therefore have a greater effect on the mean).

metrics_scaled["Batch"] = metrics_scaled[
    ["ASW_label/batch", "PCR_batch", "graph_conn"]
].mean(axis=1)
metrics_scaled["Bio"] = metrics_scaled[["ASW_label", "isolated_label_silhouette"]].mean(
    axis=1
)
metrics_scaled.style.background_gradient(cmap="Blues")
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/pandas/io/formats/style.py:3930: PendingDeprecationWarning: The get_cmap function will be deprecated in a future version. Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap(obj)`` instead.
  rgbas = plt.cm.get_cmap(cmap)(norm(gmap))
  ASW_label ASW_label/batch PCR_batch isolated_label_silhouette graph_conn Batch Bio
scVI 0.476942 0.901060 1.000000 1.000000 0.676270 0.859110 0.738471
scANVI 1.000000 0.879454 0.931475 0.941194 0.716018 0.842316 0.970597
BBKNN 0.000000 0.173036 0.000000 0.000000 0.000000 0.057679 0.000000
Seurat 0.485521 1.000000 0.668338 0.084500 1.000000 0.889446 0.285010
Unintegrated 0.009437 0.000000 0.000000 0.076776 0.806917 0.268972 0.043106

Plotting the two summary scores against each other gives an indication of the priorities of each method. Some will be biased towards batch correction while others will favour retaining biological variation.

fig, ax = plt.subplots()
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
metrics_scaled.plot.scatter(
    x="Batch",
    y="Bio",
    c=range(len(metrics_scaled)),
    ax=ax,
)

for k, v in metrics_scaled[["Batch", "Bio"]].iterrows():
    ax.annotate(
        k,
        v,
        xytext=(6, -3),
        textcoords="offset points",
        family="sans-serif",
        fontsize=12,
    )
../_images/83f33a364c9b09be9f8ac6299b638db499d0fb06de66d991190b4155c9e0757d.png

In our small example scenario BBKNN is clearly the worst performer, getting the lowest scores for both batch removal and biological conservation. The other three methods have similar batch correction scores with scANVI scoring highest for biological conservation followed by scVI and Seurat.

To get an overall score for each method we can combine the two summary scores. The scIB paper suggests a weighting of 40% batch correction and 60% biological conservation but you may prefer to weight things differently depending on the priorities for your dataset.

metrics_scaled["Overall"] = 0.4 * metrics_scaled["Batch"] + 0.6 * metrics_scaled["Bio"]
metrics_scaled.style.background_gradient(cmap="Blues")
/Users/luke.zappia/miniconda/envs/bp-integration/lib/python3.9/site-packages/pandas/io/formats/style.py:3930: PendingDeprecationWarning: The get_cmap function will be deprecated in a future version. Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap(obj)`` instead.
  rgbas = plt.cm.get_cmap(cmap)(norm(gmap))
  ASW_label ASW_label/batch PCR_batch isolated_label_silhouette graph_conn Batch Bio Overall
scVI 0.476942 0.901060 1.000000 1.000000 0.676270 0.859110 0.738471 0.786726
scANVI 1.000000 0.879454 0.931475 0.941194 0.716018 0.842316 0.970597 0.919285
BBKNN 0.000000 0.173036 0.000000 0.000000 0.000000 0.057679 0.000000 0.023071
Seurat 0.485521 1.000000 0.668338 0.084500 1.000000 0.889446 0.285010 0.526785
Unintegrated 0.009437 0.000000 0.000000 0.076776 0.806917 0.268972 0.043106 0.133453

Let’s make a quick bar chart to visualise the overall performance.

metrics_scaled.plot.bar(y="Overall")
<AxesSubplot: >
../_images/a301c0818aba8302479b965e75b1249516c0fd6979b2a7c1a3833bbf04b01613.png

As we have already seen scVI and scANVI are the best performers with scANVI scoring slightly higher. It is important to note that this is just an example of how to run these metrics for this specific dataset, not a proper evaluation of these methods. For that, you should refer to existing benchmarking publications. In particular, we have only run a small selection of high-performing methods and a subset of metrics. Also, remember that scores are relative to the methods used so even if the methods perform almost equally well and small differences will be exaggerated.

Existing benchmarks have suggested methods that generally perform well, but performance can also be quite variable across scenarios. For some analyses, it may be worthwhile performing your own evaluation of integration. The scib package makes this process easier, but it can still be a significant undertaking, relying on a good knowledge of the ground truth and interpretation of the metrics.

12.10. Key Takeaways#

  1. Visualize your data before attempting to correct for batch effects to assess the extent of the issue. Batch effect correction is not always required and it might mask the biological variation of interest.

  2. If cell labels are available and biological variation is the most important, the usage of methods that can use these labels (such as scANVI) is advised.

  3. Consider running several integration methods on your dataset and evaluating them with the scIB metrics to use the integration that is most robust for your use case.

12.11. Quiz#

  1. What are the sources of batch effects?

  2. What is the difference between technical and biological variation?

  3. How does one evaluate whether the integration worked well or not? What are useful metrics for this purpose?

12.12. Session information#

12.12.1. Python#

import session_info

session_info.show()
Click to view session information
-----
anndata             0.8.0
anndata2ri          1.1
bbknn               NA
matplotlib          3.6.1
numpy               1.23.4
pandas              1.5.1
rpy2                3.5.1
scanpy              1.9.1
scib                1.0.4
scipy               1.9.3
scvi                0.18.0
session_info        1.0.0
-----
Click to view modules imported as dependencies
PIL                         9.2.0
absl                        NA
annoy                       NA
appnope                     0.1.3
asttokens                   NA
attr                        22.1.0
backcall                    0.2.0
beta_ufunc                  NA
binom_ufunc                 NA
cffi                        1.15.1
chex                        0.1.5
colorama                    0.4.6
contextlib2                 NA
cycler                      0.10.0
cython_runtime              NA
dateutil                    2.8.2
debugpy                     1.6.3
decorator                   5.1.1
defusedxml                  0.7.1
deprecate                   0.3.2
deprecated                  1.2.13
docrep                      0.3.2
entrypoints                 0.4
etils                       0.8.0
executing                   1.1.1
flax                        0.5.0
fsspec                      2022.10.0
google                      NA
h5py                        3.7.0
hypergeom_ufunc             NA
igraph                      0.10.2
importlib_metadata          NA
ipykernel                   6.16.2
ipython_genutils            0.2.0
ipywidgets                  8.0.2
jax                         0.3.23
jaxlib                      0.3.22
jedi                        0.18.1
jinja2                      3.1.2
joblib                      1.2.0
jupyter_server              1.21.0
kiwisolver                  1.4.4
leidenalg                   0.9.0
llvmlite                    0.39.1
louvain                     0.8.0
markupsafe                  2.1.1
matplotlib_inline           0.1.6
ml_collections              NA
mpl_toolkits                NA
msgpack                     1.0.4
mudata                      0.2.0
multipledispatch            0.6.0
natsort                     8.2.0
nbinom_ufunc                NA
ncf_ufunc                   NA
numba                       0.56.3
numpyro                     0.10.1
opt_einsum                  v3.3.0
optax                       0.1.3
packaging                   21.3
parso                       0.8.3
pexpect                     4.8.0
pickleshare                 0.7.5
pkg_resources               NA
prompt_toolkit              3.0.31
psutil                      5.9.3
ptyprocess                  0.7.0
pure_eval                   0.2.2
pycparser                   2.21
pydev_ipython               NA
pydevconsole                NA
pydevd                      2.8.0
pydevd_file_utils           NA
pydevd_plugins              NA
pydevd_tracing              NA
pygments                    2.13.0
pynndescent                 0.5.7
pyparsing                   3.0.9
pyro                        1.8.2+2e3bd02
pytorch_lightning           1.7.7
pytz                        2022.5
pytz_deprecation_shim       NA
rich                        NA
seaborn                     0.12.0
setuptools                  65.5.0
six                         1.16.0
sklearn                     1.1.3
sphinxcontrib               NA
stack_data                  0.5.1
statsmodels                 0.13.2
tensorboard                 2.10.1
texttable                   1.6.4
threadpoolctl               3.1.0
toolz                       0.12.0
torch                       1.12.1
torchmetrics                0.10.1
tornado                     6.2
tqdm                        4.64.1
traitlets                   5.5.0
tree                        0.1.7
typing_extensions           NA
tzlocal                     NA
umap                        0.5.3
wcwidth                     0.2.5
wrapt                       1.14.1
yaml                        6.0
zipp                        NA
zmq                         24.0.1
zoneinfo                    NA
-----
IPython             8.5.0
jupyter_client      7.4.4
jupyter_core        4.11.1
jupyterlab          3.5.0
notebook            6.5.1
-----
Python 3.9.13 | packaged by conda-forge | (main, May 27 2022, 17:01:00) [Clang 13.0.1 ]
macOS-10.15.7-x86_64-i386-64bit
-----
Session information updated at 2022-11-10 12:53

12.12.2. R#

%%R
sessioninfo::session_info()
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.1.3 (2022-03-10)
 os       macOS Catalina 10.15.7
 system   x86_64, darwin13.4.0
 ui       unknown
 language (EN)
 collate  C
 ctype    UTF-8
 tz       Europe/Berlin
 date     2022-11-10
 pandoc   2.19.2 @ /Users/luke.zappia/miniconda/envs/bp-integration/bin/pandoc

─ Packages ───────────────────────────────────────────────────────────────────
 package              * version  date (UTC) lib source
 abind                  1.4-5    2016-07-21 [1] CRAN (R 4.1.3)
 Biobase              * 2.54.0   2021-10-26 [1] Bioconductor
 BiocGenerics         * 0.40.0   2021-10-26 [1] Bioconductor
 bitops                 1.0-7    2021-04-24 [1] CRAN (R 4.1.3)
 cli                    3.4.1    2022-09-23 [1] CRAN (R 4.1.3)
 cluster                2.1.4    2022-08-22 [1] CRAN (R 4.1.3)
 codetools              0.2-18   2020-11-04 [1] CRAN (R 4.1.3)
 colorspace             2.0-3    2022-02-21 [1] CRAN (R 4.1.3)
 cowplot                1.1.1    2020-12-30 [1] CRAN (R 4.1.3)
 data.table             1.14.4   2022-10-17 [1] CRAN (R 4.1.3)
 DelayedArray           0.20.0   2021-10-26 [1] Bioconductor
 deldir                 1.0-6    2021-10-23 [1] CRAN (R 4.1.3)
 digest                 0.6.30   2022-10-18 [1] CRAN (R 4.1.3)
 dplyr                  1.0.10   2022-09-01 [1] CRAN (R 4.1.3)
 ellipsis               0.3.2    2021-04-29 [1] CRAN (R 4.1.3)
 fansi                  1.0.3    2022-03-24 [1] CRAN (R 4.1.3)
 fastmap                1.1.0    2021-01-25 [1] CRAN (R 4.1.3)
 fitdistrplus           1.1-8    2022-03-10 [1] CRAN (R 4.1.3)
 future                 1.28.0   2022-09-02 [1] CRAN (R 4.1.3)
 future.apply           1.9.1    2022-09-07 [1] CRAN (R 4.1.3)
 generics               0.1.3    2022-07-05 [1] CRAN (R 4.1.3)
 GenomeInfoDb         * 1.30.1   2022-01-30 [1] Bioconductor
 GenomeInfoDbData       1.2.7    2022-10-28 [1] Bioconductor
 GenomicRanges        * 1.46.1   2021-11-18 [1] Bioconductor
 ggplot2                3.3.6    2022-05-03 [1] CRAN (R 4.1.3)
 ggrepel                0.9.1    2021-01-15 [1] CRAN (R 4.1.3)
 ggridges               0.5.4    2022-09-26 [1] CRAN (R 4.1.3)
 globals                0.16.1   2022-08-28 [1] CRAN (R 4.1.3)
 glue                   1.6.2    2022-02-24 [1] CRAN (R 4.1.3)
 goftest                1.2-3    2021-10-07 [1] CRAN (R 4.1.3)
 gridExtra              2.3      2017-09-09 [1] CRAN (R 4.1.3)
 gtable                 0.3.1    2022-09-01 [1] CRAN (R 4.1.3)
 htmltools              0.5.3    2022-07-18 [1] CRAN (R 4.1.3)
 htmlwidgets            1.5.4    2021-09-08 [1] CRAN (R 4.1.3)
 httpuv                 1.6.6    2022-09-08 [1] CRAN (R 4.1.3)
 httr                   1.4.4    2022-08-17 [1] CRAN (R 4.1.3)
 ica                    1.0-3    2022-07-08 [1] CRAN (R 4.1.3)
 igraph                 1.3.5    2022-09-22 [1] CRAN (R 4.1.3)
 IRanges              * 2.28.0   2021-10-26 [1] Bioconductor
 irlba                  2.3.5.1  2022-10-03 [1] CRAN (R 4.1.3)
 jsonlite               1.8.3    2022-10-21 [1] CRAN (R 4.1.3)
 KernSmooth             2.23-20  2021-05-03 [1] CRAN (R 4.1.3)
 later                  1.2.0    2021-04-23 [1] CRAN (R 4.1.3)
 lattice                0.20-45  2021-09-22 [1] CRAN (R 4.1.3)
 lazyeval               0.2.2    2019-03-15 [1] CRAN (R 4.1.3)
 leiden                 0.4.3    2022-09-10 [1] CRAN (R 4.1.3)
 lifecycle              1.0.3    2022-10-07 [1] CRAN (R 4.1.3)
 listenv                0.8.0    2019-12-05 [1] CRAN (R 4.1.3)
 lmtest                 0.9-40   2022-03-21 [1] CRAN (R 4.1.3)
 magrittr               2.0.3    2022-03-30 [1] CRAN (R 4.1.3)
 MASS                   7.3-58.1 2022-08-03 [1] CRAN (R 4.1.3)
 Matrix               * 1.5-1    2022-09-13 [1] CRAN (R 4.1.3)
 MatrixGenerics       * 1.6.0    2021-10-26 [1] Bioconductor
 matrixStats          * 0.62.0   2022-04-19 [1] CRAN (R 4.1.3)
 mgcv                   1.8-41   2022-10-21 [1] CRAN (R 4.1.3)
 mime                   0.12     2021-09-28 [1] CRAN (R 4.1.3)
 miniUI                 0.1.1.1  2018-05-18 [1] CRAN (R 4.1.3)
 munsell                0.5.0    2018-06-12 [1] CRAN (R 4.1.3)
 nlme                   3.1-160  2022-10-10 [1] CRAN (R 4.1.3)
 parallelly             1.32.1   2022-07-21 [1] CRAN (R 4.1.3)
 patchwork              1.1.2    2022-08-19 [1] CRAN (R 4.1.3)
 pbapply                1.5-0    2021-09-16 [1] CRAN (R 4.1.3)
 pillar                 1.8.1    2022-08-19 [1] CRAN (R 4.1.3)
 pkgconfig              2.0.3    2019-09-22 [1] CRAN (R 4.1.3)
 plotly                 4.10.0   2021-10-09 [1] CRAN (R 4.1.3)
 plyr                   1.8.7    2022-03-24 [1] CRAN (R 4.1.3)
 png                    0.1-7    2013-12-03 [1] CRAN (R 4.1.3)
 polyclip               1.10-4   2022-10-20 [1] CRAN (R 4.1.3)
 progressr              0.11.0   2022-09-02 [1] CRAN (R 4.1.3)
 promises               1.2.0.1  2021-02-11 [1] CRAN (R 4.1.3)
 purrr                  0.3.5    2022-10-06 [1] CRAN (R 4.1.3)
 R6                     2.5.1    2021-08-19 [1] CRAN (R 4.1.3)
 RANN                   2.6.1    2019-01-08 [1] CRAN (R 4.1.3)
 RColorBrewer           1.1-3    2022-04-03 [1] CRAN (R 4.1.3)
 Rcpp                   1.0.9    2022-07-08 [1] CRAN (R 4.1.3)
 RcppAnnoy              0.0.19   2021-07-30 [1] CRAN (R 4.1.3)
 RCurl                  1.98-1.9 2022-10-03 [1] CRAN (R 4.1.3)
 reshape2               1.4.4    2020-04-09 [1] CRAN (R 4.1.3)
 reticulate             1.26     2022-08-31 [1] CRAN (R 4.1.3)
 rgeos                  0.5-9    2021-12-15 [1] CRAN (R 4.1.3)
 rlang                  1.0.6    2022-09-24 [1] CRAN (R 4.1.3)
 ROCR                   1.0-11   2020-05-02 [1] CRAN (R 4.1.3)
 rpart                  4.1.19   2022-10-21 [1] CRAN (R 4.1.3)
 Rtsne                  0.16     2022-04-17 [1] CRAN (R 4.1.3)
 S4Vectors            * 0.32.4   2022-03-24 [1] Bioconductor
 scales                 1.2.1    2022-08-20 [1] CRAN (R 4.1.3)
 scattermore            0.8      2022-02-14 [1] CRAN (R 4.1.3)
 sctransform            0.3.5    2022-09-21 [1] CRAN (R 4.1.3)
 sessioninfo            1.2.2    2021-12-06 [1] CRAN (R 4.1.3)
 Seurat               * 4.2.0    2022-09-21 [1] CRAN (R 4.1.3)
 SeuratObject         * 4.1.2    2022-09-20 [1] CRAN (R 4.1.3)
 shiny                  1.7.3    2022-10-25 [1] CRAN (R 4.1.3)
 SingleCellExperiment * 1.16.0   2021-10-26 [1] Bioconductor
 sp                   * 1.5-0    2022-06-05 [1] CRAN (R 4.1.3)
 spatstat.core          2.4-4    2022-05-18 [1] CRAN (R 4.1.3)
 spatstat.data          3.0-0    2022-10-21 [1] CRAN (R 4.1.3)
 spatstat.geom          3.0-3    2022-10-25 [1] CRAN (R 4.1.3)
 spatstat.random        2.2-0    2022-03-30 [1] CRAN (R 4.1.3)
 spatstat.sparse        3.0-0    2022-10-21 [1] CRAN (R 4.1.3)
 spatstat.utils         3.0-1    2022-10-19 [1] CRAN (R 4.1.3)
 stringi                1.7.8    2022-07-11 [1] CRAN (R 4.1.3)
 stringr                1.4.1    2022-08-20 [1] CRAN (R 4.1.3)
 SummarizedExperiment * 1.24.0   2021-10-26 [1] Bioconductor
 survival               3.4-0    2022-08-09 [1] CRAN (R 4.1.3)
 tensor                 1.5      2012-05-05 [1] CRAN (R 4.1.3)
 tibble                 3.1.8    2022-07-22 [1] CRAN (R 4.1.3)
 tidyr                  1.2.1    2022-09-08 [1] CRAN (R 4.1.3)
 tidyselect             1.2.0    2022-10-10 [1] CRAN (R 4.1.3)
 utf8                   1.2.2    2021-07-24 [1] CRAN (R 4.1.3)
 uwot                   0.1.14   2022-08-22 [1] CRAN (R 4.1.3)
 vctrs                  0.5.0    2022-10-22 [1] CRAN (R 4.1.3)
 viridisLite            0.4.1    2022-08-22 [1] CRAN (R 4.1.3)
 xtable                 1.8-4    2019-04-21 [1] CRAN (R 4.1.3)
 XVector                0.34.0   2021-10-26 [1] Bioconductor
 zlibbioc               1.40.0   2021-10-26 [1] Bioconductor
 zoo                    1.8-11   2022-09-17 [1] CRAN (R 4.1.3)

 [1] /Users/luke.zappia/miniconda/envs/bp-integration/lib/R/library

──────────────────────────────────────────────────────────────────────────────

12.13. References#

[intACSM21]

Ricard Argelaguet, Anna S E Cuomo, Oliver Stegle, and John C Marioni. Computational principles and challenges in single-cell data integration. Nature biotechnology, pages 1–14, May 2021. doi:10.1038/s41587-021-00895-7.

[intBHS+18] (1,2,3)

Andrew Butler, Paul Hoffman, Peter Smibert, Efthymia Papalexi, and Rahul Satija. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nature biotechnology, April 2018. doi:10.1038/nbt.4096.

[intButtnerMW+19]

Maren Büttner, Zhichao Miao, F Alexander Wolf, Sarah A Teichmann, and Fabian J Theis. A test metric for assessing single-cell RNA-seq batch correction. Nature methods, 16(1):43–49, January 2019. doi:10.1038/s41592-018-0254-1.

[intCGvDKH21]

Ruben Chazarra-Gil, Stijn van Dongen, Vladimir Yu Kiselev, and Martin Hemberg. Flexible comparison of batch correction methods for single-cell RNA-seq using BatchBench. Nucleic acids research, 49(7):e42, April 2021. doi:10.1093/nar/gkab004.

[intGLX+22]

Adam Gayoso, Romain Lopez, Galen Xing, Pierre Boyeau, Valeh Valiollah Pour Amiri, Justin Hong, Katherine Wu, Michael Jayasuriya, Edouard Mehlman, Maxime Langevin, Yining Liu, Jules Samaran, Gabriel Misrachi, Achille Nazaret, Oscar Clivio, Chenling Xu, Tal Ashuach, Mariano Gabitto, Mohammad Lotfollahi, Valentine Svensson, Eduardo da Veiga Beltrame, Vitalii Kleshchevnikov, Carlos Talavera-López, Lior Pachter, Fabian J Theis, Aaron Streets, Michael I Jordan, Jeffrey Regier, and Nir Yosef. A Python library for probabilistic analysis of single-cell omics data. Nature biotechnology, February 2022. doi:10.1038/s41587-021-01206-w.

[intHLMM18] (1,2,3)

Laleh Haghverdi, Aaron T L Lun, Michael D Morgan, and John C Marioni. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nature biotechnology, April 2018. doi:10.1038/nbt.4091.

[intHBB19]

Brian Hie, Bryan Bryson, and Bonnie Berger. Efficient integration of heterogeneous single-cell transcriptomes using Scanorama. Nature biotechnology, May 2019. doi:10.1038/s41587-019-0113-3.

[intJLR07]

W Evan Johnson, Cheng Li, and Ariel Rabinovic. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics, 8(1):118–127, January 2007. doi:10.1093/biostatistics/kxj037.

[intKMF+19] (1,2)

Ilya Korsunsky, Nghia Millard, Jean Fan, Kamil Slowikowski, Fan Zhang, Kevin Wei, Yuriy Baglaenko, Michael Brenner, Po-Ru Loh, and Soumya Raychaudhuri. Fast, sensitive and accurate integration of single-cell data with Harmony. Nature methods, November 2019. doi:10.1038/s41592-019-0619-0.

[intLLB+22]

Christopher Lance, Malte D Luecken, Daniel B Burkhardt, Robrecht Cannoodt, Pia Rautenstrauch, Anna Laddach, Aidyn Ubingazhibov, Zhi-Jie Cao, Kaiwen Deng, Sumeer Khan, Qiao Liu, Nikolay Russkikh, Gleb Ryazantsev, Uwe Ohler, Neurips 2021 Multimodal Data integration competition participants, Angela Oliveira Pisco, Jonathan Bloom, Smita Krishnaswamy, and Fabian J Theis. Multimodal single cell data integration challenge: Results and lessons learned. In Douwe Kiela, Marco Ciccone, and Barbara Caputo, editors, Proceedings of the NeurIPS 2021 Competitions and Demonstrations Track, volume 176 of Proceedings of Machine Learning Research, 162–176. PMLR, 2022.

[intLRC+18] (1,2)

Romain Lopez, Jeffrey Regier, Michael B Cole, Michael I Jordan, and Nir Yosef. Deep generative modeling for single-cell transcriptomics. Nature methods, 15(12):1053–1058, November 2018. doi:10.1038/s41592-018-0229-2.

[intLWT19]

Mohammad Lotfollahi, F Alexander Wolf, and Fabian J Theis. scGen predicts single-cell perturbation responses. Nature methods, 16(8):715–721, August 2019. doi:10.1038/s41592-019-0494-8.

[intLBC+22]

Malte D Luecken, Daniel Bernard Burkhardt, Robrecht Cannoodt, Christopher Lance, Aditi Agrawal, Hananeh Aliee, Ann T Chen, Louise Deconinck, Angela M Detweiler, Alejandro A Granados, Shelly Huynh, Laura Isacco, Yang Joon Kim, Dominik Klein, Bony de Kumar, Sunil Kuppasani, Heiko Lickert, Aaron McGeever, Honey Mekonen, Joaquin Caceres Melgarejo, Maurizio Morri, Michaela Müller, Norma Neff, Sheryl Paul, Bastian Rieck, Kaylie Schneider, Scott Steelman, Michael Sterr, Daniel J Treacy, Alexander Tong, Alexandra-Chloe Villani, Guilin Wang, Jia Yan, Ce Zhang, Angela Oliveira Pisco, Smita Krishnaswamy, Fabian J Theis, and Jonathan M Bloom. A sandbox for prediction and integration of DNA, RNA, and proteins in single cells. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2). January 2022.

[intLButtnerC+21] (1,2,3,4,5,6,7,8,9)

Malte D Luecken, M Büttner, K Chaichoompu, A Danese, M Interlandi, M F Mueller, D C Strobl, L Zappia, M Dugas, M Colomé-Tatché, and Fabian J Theis. Benchmarking atlas-level data integration in single-cell genomics. Nature methods, December 2021. doi:10.1038/s41592-021-01336-8.

[intLT19] (1,2)

Malte D Luecken and Fabian J Theis. Current best practices in single‐cell RNA‐seq analysis: a tutorial. Molecular systems biology, June 2019. doi:10.15252/msb.20188746.

[intMLE+21]

Christoph Muus, Malte D Luecken, Gökcen Eraslan, Lisa Sikkema, Avinash Waghray, Graham Heimberg, Yoshihiko Kobayashi, Eeshit Dhaval Vaishnav, Ayshwarya Subramanian, Christopher Smillie, Karthik A Jagadeesh, Elizabeth Thu Duong, Evgenij Fiskin, Elena Torlai Triglia, Meshal Ansari, Peiwen Cai, Brian Lin, Justin Buchanan, Sijia Chen, Jian Shu, Adam L Haber, Hattie Chung, Daniel T Montoro, Taylor Adams, Hananeh Aliee, Samuel J Allon, Zaneta Andrusivova, Ilias Angelidis, Orr Ashenberg, Kevin Bassler, Christophe Bécavin, Inbal Benhar, Joseph Bergenstråhle, Ludvig Bergenstråhle, Liam Bolt, Emelie Braun, Linh T Bui, Steven Callori, Mark Chaffin, Evgeny Chichelnitskiy, Joshua Chiou, Thomas M Conlon, Michael S Cuoco, Anna S E Cuomo, Marie Deprez, Grant Duclos, Denise Fine, David S Fischer, Shila Ghazanfar, Astrid Gillich, Bruno Giotti, Joshua Gould, Minzhe Guo, Austin J Gutierrez, Arun C Habermann, Tyler Harvey, Peng He, Xiaomeng Hou, Lijuan Hu, Yan Hu, Alok Jaiswal, Lu Ji, Peiyong Jiang, Theodoros S Kapellos, Christin S Kuo, Ludvig Larsson, Michael A Leney-Greene, Kyungtae Lim, Monika Litviňuková, Leif S Ludwig, Soeren Lukassen, Wendy Luo, Henrike Maatz, Elo Madissoon, Lira Mamanova, Kasidet Manakongtreecheep, Sylvie Leroy, Christoph H Mayr, Ian M Mbano, Alexi M McAdams, Ahmad N Nabhan, Sarah K Nyquist, Lolita Penland, Olivier B Poirion, Sergio Poli, Cancan Qi, Rachel Queen, Daniel Reichart, Ivan Rosas, Jonas C Schupp, Conor V Shea, Xingyi Shi, Rahul Sinha, Rene V Sit, Kamil Slowikowski, Michal Slyper, Neal P Smith, Alex Sountoulidis, Maximilian Strunz, Travis B Sullivan, Dawei Sun, Carlos Talavera-López, Peng Tan, Jessica Tantivit, Kyle J Travaglini, Nathan R Tucker, Katherine A Vernon, Marc H Wadsworth, Julia Waldman, Xiuting Wang, Ke Xu, Wenjun Yan, William Zhao, Carly G K Ziegler, NHLBI LungMap Consortium, and Human Cell Atlas Lung Biological Network. Single-cell meta-analysis of SARS-CoV-2 entry genes across tissues and demographics. Nature medicine, 27(3):546–559, March 2021. doi:10.1038/s41591-020-01227-z.

[intPZO17]

Belinda Phipson, Luke Zappia, and Alicia Oshlack. Gene length and detection bias in single cell RNA sequencing protocols. F1000Research, April 2017. doi:10.12688/f1000research.11290.1.

[intPolanskiPY+19] (1,2)

Krzysztof Polański, Jong-Eun Park, Matthew D Young, Zhichao Miao, Kerstin B Meyer, and Sarah A Teichmann. BBKNN: Fast Batch Alignment of Single Cell Transcriptomes. Bioinformatics, August 2019. doi:10.1093/bioinformatics/btz625.

[intSFG+15]

Rahul Satija, Jeffrey A Farrell, David Gennert, Alexander F Schier, and Aviv Regev. Spatial reconstruction of single-cell gene expression data. Nature Biotechnology, 33(5):495–502, April 2015. doi:10.1038/nbt.3192.

[intSSZ+22]

Lisa Sikkema, Daniel C Strobl, Luke Zappia, Elo Madissoon, Nikolay S Markov, Laure-Emmanuelle Zaragosi, Meshal Ansari, Marie-Jeanne Arguel, Leonie Apperloo, Christophe Becavin, Marijn Berg, Evgeny Chichelnitskiy, Mei-I Chung, Antoine Collin, Aurore C A Gay, Baharak Hooshiar Kashani, Manu Jain, Theodore Kapellos, Tessa M Kole, Christoph H Mayr, Von Michael Papen, Lance Peter, Ciro Ramirez-Suastegui, Janine Schniering, Chase J Taylor, Thomas Walzthoeni, Chuan Xu, Linh T Bui, Carlo de Donno, Leander Dony, Minzhe Guo, Austin J Gutierrez, Lukas Heumos, Ni Huang, Ignacio L Ibarra, Nathan D Jackson, Preetish Kadur Lakshminarasimha Murthy, Mohammad Lotfollahi, Tracy Tabib, Carlos Talavera-Lopez, Kyle J Travaglini, Anna Wilbrey-Clark, Kaylee B Worlock, Masahiro Yoshida, Tushar J Desai, Oliver Eickelberg, Christine Falk, Naftali Kaminski, Mark A Krasnow, Robert Lafyatis, Marko Z Nikoli, Joseph E Powell, Jayaraj Rajagopal, Orit Rozenblatt-Rosen, Max A Seibold, Dean Sheppard, Douglas P Shepherd, Sarah A Teichmann, Alexander M Tsankov, Jeffrey Whitsett, Yan Xu, Nicholas E Banovich, Pascal Barbry, Thu E Duong, Kerstin B Meyer, Jonathan A Kropski, Dana Pe'er, Herbert B Schiller, Purushothama Rao Tata, Joachim L Schultze, Alexander V Misharin, Martijn C Nawijn, Malte D Luecken, and Fabian J Theis. An integrated cell atlas of the human lung in health and disease. bioRxiv, pages 2022.03.10.483747, March 2022. doi:10.1101/2022.03.10.483747.

[intSBH+19] (1,2,3)

Tim Stuart, Andrew Butler, Paul Hoffman, Christoph Hafemeister, Efthymia Papalexi, William M Mauck, 3rd, Yuhan Hao, Marlon Stoeckius, Peter Smibert, and Rahul Satija. Comprehensive Integration of Single-Cell Data. Cell, 177(7):1888–1902.e21, June 2019. doi:10.1016/j.cell.2019.05.031.

[intSFHM21]

Vinay S Swamy, Temesgen D Fufa, Robert B Hufnagel, and David M McGaughey. Building the mega single-cell transcriptome ocular meta-atlas. GigaScience, October 2021. doi:10.1093/gigascience/giab061.

[intTAC+20] (1,2)

Hoa Thi Nhu Tran, Kok Siong Ang, Marion Chevrier, Xiaomeng Zhang, Nicole Yee Shin Lee, Michelle Goh, and Jinmiao Chen. A benchmark of batch-effect correction methods for single-cell RNA sequencing data. Genome biology, 21(1):12, January 2020. doi:10.1186/s13059-019-1850-9.

[intvdBSVertesy+17]

Susanne C van den Brink, Fanny Sage, Ábel Vértesy, Bastiaan Spanjaard, Josi Peterson-Maduro, Chloé S Baron, Catherine Robin, and Alexander van Oudenaarden. Single-cell sequencing reveals dissociation-induced gene expression in tissue subpopulations. Nature methods, 14(10):935–936, September 2017. doi:10.1038/nmeth.4437.

[intWKL12]

Günter P Wagner, Koryu Kin, and Vincent J Lynch. Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples. Theory in Biosciences, 131(4):281–285, December 2012. doi:10.1007/s12064-012-0162-3.

[intXLM+21] (1,2)

Chenling Xu, Romain Lopez, Edouard Mehlman, Jeffrey Regier, Michael I Jordan, and Nir Yosef. Probabilistic harmonization and annotation of single-cell transcriptomics data with deep generative models. Molecular systems biology, 17(1):e9620, January 2021. doi:10.15252/msb.20209620.

12.14. Contributors#

We gratefully acknowledge the contributions of:

12.14.1. Authors#

  • Luke Zappia

  • Malte Lücken

12.14.2. Reviewers#

  • Lukas Heumos