37. Dimensionality Reduction#
Key takeaways
Use UMAP for ADT data visualization; PCA can help reduce dimensionality in large datasets with many surface proteins.
Environment setup
Install conda:
Before creating the environment, ensure that conda is installed on your system.
Save the yml content:
Copy the content from the yml tab into a file named
environment.yml.
Create the environment:
Open a terminal or command prompt.
Run the following command:
conda env create -f environment.yml
Activate the environment:
After the environment is created, activate it using:
conda activate <environment_name>
Replace
<environment_name>with the name specified in theenvironment.ymlfile. In the yml file it will look like this:name: <environment_name>
Verify the installation:
Check that the environment was created successfully by running:
conda env list
name: surface-protein
channels:
- conda-forge
dependencies:
- python=3.13
- scanpy=1.12
- muon=0.1.7
- python-igraph=1.0.0
- ipykernel=7.2.0
- pip==26.0.1
- pip:
- lamindb==2.3.1
- harmonypy==0.0.9
Get data and notebooks
This book uses lamindb to store, share, and load datasets and notebooks using the theislab/sc-best-practices instance. We acknowledge free hosting from Lamin Labs.
Install lamindb
Install the lamindb Python package:
pip install lamindb
Optionally create a lamin account
Sign up and log in following the instructions
Verify your setup
Run the
lamin connectcommand:
import lamindb as ln ln.Artifact.connect("theislab/sc-best-practices").df()
You should now see up to 100 of the stored datasets.
Accessing datasets (Artifacts)
Search for the datasets on the Artifacts page
Load an Artifact and the corresponding object:
import lamindb as ln af = ln.Artifact.connect("theislab/sc-best-practices").get(key="key_of_dataset", is_latest=True) obj = af.load()
The object is now accessible in memory and is ready for analysis. Adapt the
ln.Artifact.connect("theislab/sc-best-practices").get("SOMEIDXXXX")suffix to get respective versions.Accessing notebooks (Transforms)
Search for the notebook on the Transforms page
Load the notebook:
lamin load <notebook url>
which will download the notebook to the current working directory. Analogously to
Artifacts, you can adapt the suffix ID to get older versions.
37.1. Motivation#
Feature matrices of surface protein markers are hard to grasp for humans as raw tables. Therefore, we resort to low dimensional embeddings that allow us to visualize the ADTs in commonly two dimensions. The approaches that we use and recommend for ADT data do not differ from the ones for transcriptomics data. All aforementioned limitations of visualizations obtained through methods like t-SNE and UMAP also apply to ADT data.
ADT data generally does not require any sophisticated feature selection, because features have already been selected a priori during experimental design. All selected ADTs should correspond to biologically relevant features. Nevertheless, large datasets may benefit from PCA to reduce the dataset from several hundred features to a few principal components. This is especially advisable if computational resources are limited.
In this and the following two chapters, we decided to focus on the ADT data and do not use the RNA data of the study. In the Paired integration chapter, we will explore how we can make use of both modalities jointly, which allows for a more detailed cell type annotation.
37.2. Environment setup#
import warnings
import muon as mu
import scanpy as sc
warnings.filterwarnings("ignore")
mu.set_options(pull_on_update=False)
sc.settings.verbosity = 0
sc.set_figure_params(
dpi=80,
facecolor="white",
frameon=False,
)
import lamindb as ln
ln.track()
→ found notebook dimensionality_reduction.ipynb, making new version
→ created Transform('liGMVGre4G5H0004', key='dimensionality_reduction.ipynb'), started new Run('2jUW2z0ijgULitLv') at 2026-04-10 17:28:52 UTC
→ notebook imports: lamindb-core==2.3.1 muon==0.1.7 scanpy==1.12
• recommendation: to identify the notebook across renames, pass the uid: ln.track("liGMVGre4G5H")
37.3. Loading the data#
We load the MuData object we saved at the end of the previous chapter Doublet detection:
af = ln.Artifact.connect("theislab/sc-best-practices").get(
key="surface-protein/cite_doublet_detection.h5mu", is_latest=True
)
mdata = af.load()
mdata
MuData object with n_obs × n_vars = 117951 × 36741
var: 'gene_ids', 'feature_types'
2 modalities
rna: 117951 x 36601
obs: 'donor', 'batch'
var: 'gene_ids', 'feature_types'
prot: 117951 x 140
obs: 'donor', 'batch', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'n_counts', 'outliers', 'doublets_markers'
var: 'gene_ids', 'feature_types', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts'
uns: 'doublets_markers_colors'
layers: 'counts'We remove the counts layer containing the raw data since we do not need it anymore.
del mdata["prot"].layers["counts"]
Isotype controls do not contain any biological information since their only purpose is to use them for dsb normalization, see the Normalization section. We can therefore remove them from our data.
mdata["prot"].var.index[:50]
Index(['CD86-1', 'CD274-1', 'CD270', 'CD155', 'CD112', 'CD47-1', 'CD48-1',
'CD40-1', 'CD154', 'CD52-1', 'CD3', 'CD8', 'CD56', 'CD19-1', 'CD33-1',
'CD11c', 'HLA-A-B-C', 'CD45RA', 'CD123', 'CD7-1', 'CD105', 'CD49f',
'CD194', 'CD4-1', 'CD44-1', 'CD14-1', 'CD16', 'CD25', 'CD45RO', 'CD279',
'TIGIT-1', 'Mouse-IgG1', 'Mouse-IgG2a', 'Mouse-IgG2b', 'Rat-IgG2b',
'CD20', 'CD335', 'CD31', 'Podoplanin', 'CD146', 'IgM', 'CD5-1', 'CD195',
'CD32', 'CD196', 'CD185', 'CD103', 'CD69-1', 'CD62L', 'CD161'],
dtype='object')
isotype_controls = ["Mouse-IgG1", "Mouse-IgG2a", "Mouse-IgG2b", "Rat-IgG2b"]
temp = (
mdata["prot"].var.loc[~mdata["prot"].var.index.isin(isotype_controls), :].index
) # Select all proteins except isotype controls.
Now we actually remove isotype controls from the data.
mu.pp.filter_var(data=mdata["prot"], var=temp.tolist())
The data does not contain the isotype controls anymore.
mdata["prot"].var.index[:50]
Index(['CD86-1', 'CD274-1', 'CD270', 'CD155', 'CD112', 'CD47-1', 'CD48-1',
'CD40-1', 'CD154', 'CD52-1', 'CD3', 'CD8', 'CD56', 'CD19-1', 'CD33-1',
'CD11c', 'HLA-A-B-C', 'CD45RA', 'CD123', 'CD7-1', 'CD105', 'CD49f',
'CD194', 'CD4-1', 'CD44-1', 'CD14-1', 'CD16', 'CD25', 'CD45RO', 'CD279',
'TIGIT-1', 'CD20', 'CD335', 'CD31', 'Podoplanin', 'CD146', 'IgM',
'CD5-1', 'CD195', 'CD32', 'CD196', 'CD185', 'CD103', 'CD69-1', 'CD62L',
'CD161', 'CD152', 'CD223', 'KLRG1-1', 'CD27-1'],
dtype='object')
mdata["prot"]
AnnData object with n_obs × n_vars = 117951 × 136
obs: 'donor', 'batch', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'n_counts', 'outliers', 'doublets_markers'
var: 'gene_ids', 'feature_types', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts'
uns: 'doublets_markers_colors'
37.4. PCA and UMAP#
We can now reduce the dimensionality of the data with PCA since our dataset is quite big (136 surface proteins).
sc.pp.pca(mdata["prot"], svd_solver="arpack", random_state=0)
We create an elbow plot in order to decide how many PCs we use:
We use 20 PCs because PCs 1-20 capture much of the variance in the data and PCs 20-50 capture little variance of the data and can thus be discarded. We now compute a neighborhood graph and a UMAP embedding to visualize the study’s variables.
sc.pp.neighbors(mdata["prot"], n_pcs=20, random_state=0)
sc.tl.umap(mdata["prot"], random_state=0)
Now we have our data compressed into 2 dimensions, which we can use to visualize the data. Let’s first visualize and evaluate if there are batch effects, that is, if different donors and different batches form separate clusters.
We indeed see that some donors form separate clusters. Also batches form separate clusters. Thus, it seems that batch correction is necessary. To confirm, we plot markers of CD4 and CD8 T cells:
CD4 T cells fragment into donor-specific mini-clusters, meaning cells are grouping by donor identity rather than cell type. Ideally, CD4 T cells from all donors should cluster together regardless of their donor of origin. This donor-driven separation is a batch effect, and must be corrected before downstream analysis.
af_dimensionality_reduction = ln.Artifact.from_mudata(
mdata,
key="surface-protein/cite_dimensionality_reduction.h5mu",
description="CITE-seq data after dimensionality reduction",
)
af_dimensionality_reduction.save()
ln.finish()
37.5. References#
37.6. Contributors#
We gratefully acknowledge the contributions of:
37.6.2. Reviewers#
Lukas Heumos
Anna Schaar