37. Dimensionality Reduction#

   Key takeaways

Use UMAP for ADT data visualization; PCA can help reduce dimensionality in large datasets with many surface proteins.

Motivation
   Environment setup
  1. Install conda:

    • Before creating the environment, ensure that conda is installed on your system.

  2. Save the yml content:

    • Copy the content from the yml tab into a file named environment.yml.

  3. Create the environment:

    • Open a terminal or command prompt.

    • Run the following command:

      conda env create -f environment.yml
      
  4. Activate the environment:

    • After the environment is created, activate it using:

      conda activate <environment_name>
      
    • Replace <environment_name> with the name specified in the environment.yml file. In the yml file it will look like this:

      name: <environment_name>
      
  5. Verify the installation:

    • Check that the environment was created successfully by running:

      conda env list
      
name: surface-protein
channels:
  - conda-forge
dependencies:
  - python=3.13
  - scanpy=1.12
  - muon=0.1.7
  - python-igraph=1.0.0
  - ipykernel=7.2.0
  - pip==26.0.1
  - pip:
      - lamindb==2.3.1
      - harmonypy==0.0.9
   Get data and notebooks

This book uses lamindb to store, share, and load datasets and notebooks using the theislab/sc-best-practices instance. We acknowledge free hosting from Lamin Labs.

  1. Install lamindb

    • Install the lamindb Python package:

    pip install lamindb
    
  2. Optionally create a lamin account

  3. Verify your setup

    • Run the lamin connect command:

    import lamindb as ln
    
    ln.Artifact.connect("theislab/sc-best-practices").df()
    

    You should now see up to 100 of the stored datasets.

  4. Accessing datasets (Artifacts)

    • Search for the datasets on the Artifacts page

    • Load an Artifact and the corresponding object:

    import lamindb as ln
    af = ln.Artifact.connect("theislab/sc-best-practices").get(key="key_of_dataset", is_latest=True)
    obj = af.load()
    

    The object is now accessible in memory and is ready for analysis. Adapt the ln.Artifact.connect("theislab/sc-best-practices").get("SOMEIDXXXX") suffix to get respective versions.

  5. Accessing notebooks (Transforms)

    lamin load <notebook url>
    

    which will download the notebook to the current working directory. Analogously to Artifacts, you can adapt the suffix ID to get older versions.

37.1. Motivation#

Feature matrices of surface protein markers are hard to grasp for humans as raw tables. Therefore, we resort to low dimensional embeddings that allow us to visualize the ADTs in commonly two dimensions. The approaches that we use and recommend for ADT data do not differ from the ones for transcriptomics data. All aforementioned limitations of visualizations obtained through methods like t-SNE and UMAP also apply to ADT data.

ADT data generally does not require any sophisticated feature selection, because features have already been selected a priori during experimental design. All selected ADTs should correspond to biologically relevant features. Nevertheless, large datasets may benefit from PCA to reduce the dataset from several hundred features to a few principal components. This is especially advisable if computational resources are limited.

In this and the following two chapters, we decided to focus on the ADT data and do not use the RNA data of the study. In the Paired integration chapter, we will explore how we can make use of both modalities jointly, which allows for a more detailed cell type annotation.

37.2. Environment setup#

import warnings

import muon as mu
import scanpy as sc

warnings.filterwarnings("ignore")
mu.set_options(pull_on_update=False)
sc.settings.verbosity = 0
sc.set_figure_params(
    dpi=80,
    facecolor="white",
    frameon=False,
)

import lamindb as ln

ln.track()
→ found notebook dimensionality_reduction.ipynb, making new version
→ created Transform('liGMVGre4G5H0004', key='dimensionality_reduction.ipynb'), started new Run('2jUW2z0ijgULitLv') at 2026-04-10 17:28:52 UTC
→ notebook imports: lamindb-core==2.3.1 muon==0.1.7 scanpy==1.12
• recommendation: to identify the notebook across renames, pass the uid: ln.track("liGMVGre4G5H")

37.3. Loading the data#

We load the MuData object we saved at the end of the previous chapter Doublet detection:

af = ln.Artifact.connect("theislab/sc-best-practices").get(
    key="surface-protein/cite_doublet_detection.h5mu", is_latest=True
)
mdata = af.load()
mdata
MuData object with n_obs × n_vars = 117951 × 36741
  var:	'gene_ids', 'feature_types'
  2 modalities
    rna:	117951 x 36601
      obs:	'donor', 'batch'
      var:	'gene_ids', 'feature_types'
    prot:	117951 x 140
      obs:	'donor', 'batch', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'n_counts', 'outliers', 'doublets_markers'
      var:	'gene_ids', 'feature_types', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts'
      uns:	'doublets_markers_colors'
      layers:	'counts'

We remove the counts layer containing the raw data since we do not need it anymore.

del mdata["prot"].layers["counts"]

Isotype controls do not contain any biological information since their only purpose is to use them for dsb normalization, see the Normalization section. We can therefore remove them from our data.

mdata["prot"].var.index[:50]
Index(['CD86-1', 'CD274-1', 'CD270', 'CD155', 'CD112', 'CD47-1', 'CD48-1',
       'CD40-1', 'CD154', 'CD52-1', 'CD3', 'CD8', 'CD56', 'CD19-1', 'CD33-1',
       'CD11c', 'HLA-A-B-C', 'CD45RA', 'CD123', 'CD7-1', 'CD105', 'CD49f',
       'CD194', 'CD4-1', 'CD44-1', 'CD14-1', 'CD16', 'CD25', 'CD45RO', 'CD279',
       'TIGIT-1', 'Mouse-IgG1', 'Mouse-IgG2a', 'Mouse-IgG2b', 'Rat-IgG2b',
       'CD20', 'CD335', 'CD31', 'Podoplanin', 'CD146', 'IgM', 'CD5-1', 'CD195',
       'CD32', 'CD196', 'CD185', 'CD103', 'CD69-1', 'CD62L', 'CD161'],
      dtype='object')
isotype_controls = ["Mouse-IgG1", "Mouse-IgG2a", "Mouse-IgG2b", "Rat-IgG2b"]
temp = (
    mdata["prot"].var.loc[~mdata["prot"].var.index.isin(isotype_controls), :].index
)  # Select all proteins except isotype controls.

Now we actually remove isotype controls from the data.

mu.pp.filter_var(data=mdata["prot"], var=temp.tolist())

The data does not contain the isotype controls anymore.

mdata["prot"].var.index[:50]
Index(['CD86-1', 'CD274-1', 'CD270', 'CD155', 'CD112', 'CD47-1', 'CD48-1',
       'CD40-1', 'CD154', 'CD52-1', 'CD3', 'CD8', 'CD56', 'CD19-1', 'CD33-1',
       'CD11c', 'HLA-A-B-C', 'CD45RA', 'CD123', 'CD7-1', 'CD105', 'CD49f',
       'CD194', 'CD4-1', 'CD44-1', 'CD14-1', 'CD16', 'CD25', 'CD45RO', 'CD279',
       'TIGIT-1', 'CD20', 'CD335', 'CD31', 'Podoplanin', 'CD146', 'IgM',
       'CD5-1', 'CD195', 'CD32', 'CD196', 'CD185', 'CD103', 'CD69-1', 'CD62L',
       'CD161', 'CD152', 'CD223', 'KLRG1-1', 'CD27-1'],
      dtype='object')
mdata["prot"]
AnnData object with n_obs × n_vars = 117951 × 136
    obs: 'donor', 'batch', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'n_counts', 'outliers', 'doublets_markers'
    var: 'gene_ids', 'feature_types', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts'
    uns: 'doublets_markers_colors'

37.4. PCA and UMAP#

We can now reduce the dimensionality of the data with PCA since our dataset is quite big (136 surface proteins).

sc.pp.pca(mdata["prot"], svd_solver="arpack", random_state=0)

We create an elbow plot in order to decide how many PCs we use:

sc.pl.pca_variance_ratio(mdata["prot"], n_pcs=50)

We use 20 PCs because PCs 1-20 capture much of the variance in the data and PCs 20-50 capture little variance of the data and can thus be discarded. We now compute a neighborhood graph and a UMAP embedding to visualize the study’s variables.

sc.pp.neighbors(mdata["prot"], n_pcs=20, random_state=0)
sc.tl.umap(mdata["prot"], random_state=0)

Now we have our data compressed into 2 dimensions, which we can use to visualize the data. Let’s first visualize and evaluate if there are batch effects, that is, if different donors and different batches form separate clusters.

sc.pl.umap(mdata["prot"], color=["donor", "batch"])

We indeed see that some donors form separate clusters. Also batches form separate clusters. Thus, it seems that batch correction is necessary. To confirm, we plot markers of CD4 and CD8 T cells:

sc.pl.umap(mdata["prot"], color=["CD4-1", "CD8", "CD3"])

CD4 T cells fragment into donor-specific mini-clusters, meaning cells are grouping by donor identity rather than cell type. Ideally, CD4 T cells from all donors should cluster together regardless of their donor of origin. This donor-driven separation is a batch effect, and must be corrected before downstream analysis.

af_dimensionality_reduction = ln.Artifact.from_mudata(
    mdata,
    key="surface-protein/cite_dimensionality_reduction.h5mu",
    description="CITE-seq data after dimensionality reduction",
)
af_dimensionality_reduction.save()

Hide code cell output

→ returning artifact with same hash: Artifact(uid='9DWigMKiH7zGL1jy0001', key='surface-protein/cite_dimensionality_reduction.h5mu', description='CITE-seq data after dimensionality reduction', suffix='.h5mu', kind='dataset', otype='MuData', size=1498728248, hash='OGdSkrdRzOgP_SJoVwIgpT', n_files=None, n_observations=117951, branch_id=1, created_on_id=1, space_id=1, storage_id=1, run_id=70, schema_id=None, created_by_id=7, created_at=2026-04-10 17:04:28 UTC, is_locked=False, version_tag=None, is_latest=True); to track this artifact as an input, use: ln.Artifact.get()
Artifact(uid='9DWigMKiH7zGL1jy0001', key='surface-protein/cite_dimensionality_reduction.h5mu', description='CITE-seq data after dimensionality reduction', suffix='.h5mu', kind='dataset', otype='MuData', size=1498728248, hash='OGdSkrdRzOgP_SJoVwIgpT', n_files=None, n_observations=117951, branch_id=1, created_on_id=1, space_id=1, storage_id=1, run_id=70, schema_id=None, created_by_id=7, created_at=2026-04-10 17:04:28 UTC, is_locked=False, version_tag=None, is_latest=True)
ln.finish()

Hide code cell output

• please hit CTRL + s to save the notebook in your editor .... still waiting .....
.............. ✓
! cells [(0, 3)] were not run consecutively
→ finished Run('2jUW2z0ijgULitLv') after 1m at 2026-04-10 17:30:42 UTC
→ go to: https://lamin.ai/theislab/sc-best-practices/transform/liGMVGre4G5H0004
→ to update your notebook from the CLI, run: lamin save /groups/nils/members/javier/single-cell-best-practices/jupyter-book/surface_protein/dimensionality_reduction.ipynb

37.5. References#

37.6. Contributors#

We gratefully acknowledge the contributions of:

37.6.1. Authors#

  • Javier Marchena-Hurtado

  • Daniel Strobl

  • Ciro Ramírez-Suástegui

37.6.2. Reviewers#

  • Lukas Heumos

  • Anna Schaar