Dimensionality Reduction
38. Dimensionality Reduction
38.1. Motivation
Raw tables of surface protein marker measurements are hard for humans to interpret. We therefore resort to low-dimensional embeddings that let us visualize the ADT data, usually in two dimensions. The approaches that we use and recommend for ADT data do not differ from those for transcriptomics data. All of the previously discussed limitations of visualizations obtained with methods such as t-SNE and UMAP also apply to ADT data.
ADT data generally does not require sophisticated feature selection because the features were already selected a priori during experimental design, so every measured ADT should correspond to a biologically relevant feature. Nevertheless, large datasets may benefit from PCA to reduce the data from several hundred features to a small number of principal components. This is especially advisable if computational resources are limited.
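To make the reduction from several hundred features to a small number of principal components concrete, here is a minimal sketch in plain NumPy on synthetic data. The matrix sizes, the latent-factor simulation, and all variable names are assumptions for illustration only and are not taken from the dataset used in this chapter, where sc.pp.pca performs this step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a normalized ADT matrix: 1,000 cells x 200 proteins,
# simulated from 5 latent factors plus a small amount of noise.
n_cells, n_proteins, n_latent = 1000, 200, 5
latent = rng.normal(size=(n_cells, n_latent))
loadings = rng.normal(size=(n_latent, n_proteins))
X = latent @ loadings + 0.1 * rng.normal(size=(n_cells, n_proteins))

# PCA: center each feature, then take the SVD of the centered matrix.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Keep the leading components; rows of X_pca are cell coordinates in PC space.
n_pcs = 10
X_pca = U[:, :n_pcs] * S[:n_pcs]

# Fraction of the total variance explained by each component.
explained_variance_ratio = S**2 / np.sum(S**2)

print(X.shape, "->", X_pca.shape)
print(f"variance captured by {n_pcs} PCs: {explained_variance_ratio[:n_pcs].sum():.3f}")
```

Because the synthetic matrix is driven by only a few latent factors, a small number of components captures almost all of its variance. Real ADT data will usually be less cleanly structured.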
38.2. Environment setup
import scanpy as sc
import muon as mu
import pandas as pd
import warnings
warnings.simplefilter(action="ignore", category=UserWarning)
warnings.simplefilter(action="ignore", category=FutureWarning)
38.3. Loading the data
filtered_xdbt_mu_path = "/lustre/groups/ml01/workspace/ciro.suastegui/bp2.0/data/neurips_cite_pp_filtered-qc-norm-xdbt.h5mu"
We load the MuData object that was saved at the end of the normalization chapter.
filtered = mu.read(filtered_xdbt_mu_path)
Isotype controls do not carry any biological information, so we can remove them from the feature matrix. We first inspect the feature names to locate them.
filtered["prot"].var.index[:50]
Index(['CD86-1', 'CD274-1', 'CD270', 'CD155', 'CD112', 'CD47-1', 'CD48-1',
'CD40-1', 'CD154', 'CD52-1', 'CD3', 'CD8', 'CD56', 'CD19-1', 'CD33-1',
'CD11c', 'HLA-A-B-C', 'CD45RA', 'CD123', 'CD7-1', 'CD105', 'CD49f',
'CD194', 'CD4-1', 'CD44-1', 'CD14-1', 'CD16', 'CD25', 'CD45RO', 'CD279',
'TIGIT-1', 'Mouse-IgG1', 'Mouse-IgG2a', 'Mouse-IgG2b', 'Rat-IgG2b',
'CD20', 'CD335', 'CD31', 'Podoplanin', 'CD146', 'IgM', 'CD5-1', 'CD195',
'CD32', 'CD196', 'CD185', 'CD103', 'CD69-1', 'CD62L', 'CD161'],
dtype='object')
isotype_controls = ["Mouse-IgG1", "Mouse-IgG2a", "Mouse-IgG2b", "Rat-IgG2b"]
# Names of all features that are not isotype controls
temp = (
    filtered["prot"]
    .var.loc[~filtered["prot"].var.index.isin(isotype_controls), :]
    .index
)
temp
Index(['CD86-1', 'CD274-1', 'CD270', 'CD155', 'CD112', 'CD47-1', 'CD48-1',
'CD40-1', 'CD154', 'CD52-1',
...
'CD94', 'CD162', 'CD85j', 'CD23', 'CD328', 'HLA-E-1', 'CD82-1',
'CD101-1', 'CD88', 'CD224'],
dtype='object', length=136)
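We store the isotype control measurements as a multi-dimensional annotation (obsm) so that they remain available, and then remove the corresponding columns from the feature matrix.
# Columns that are not in temp are exactly the isotype controls
filtered["prot"].obsm["X_isotypes"] = filtered["prot"].X[
    :, ~filtered["prot"].var.index.isin(temp.tolist())
]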
mu.pp.filter_var(data=filtered["prot"], var=temp.tolist())
filtered["prot"].var.index[:50]
Index(['CD86-1', 'CD274-1', 'CD270', 'CD155', 'CD112', 'CD47-1', 'CD48-1',
'CD40-1', 'CD154', 'CD52-1', 'CD3', 'CD8', 'CD56', 'CD19-1', 'CD33-1',
'CD11c', 'HLA-A-B-C', 'CD45RA', 'CD123', 'CD7-1', 'CD105', 'CD49f',
'CD194', 'CD4-1', 'CD44-1', 'CD14-1', 'CD16', 'CD25', 'CD45RO', 'CD279',
'TIGIT-1', 'CD20', 'CD335', 'CD31', 'Podoplanin', 'CD146', 'IgM',
'CD5-1', 'CD195', 'CD32', 'CD196', 'CD185', 'CD103', 'CD69-1', 'CD62L',
'CD161', 'CD152', 'CD223', 'KLRG1-1', 'CD27-1'],
dtype='object')
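The column split performed above can be illustrated on a toy matrix. This is a minimal NumPy sketch, not the muon implementation: the tiny feature list and the values in X are made up for illustration, and the variable names is_isotype, X_isotypes, X_markers and kept_names are hypothetical.

```python
import numpy as np

# Toy feature names (a small, made-up subset) and the isotype controls used above.
var_names = np.array(["CD3", "Mouse-IgG1", "CD8", "Rat-IgG2b", "CD19-1"])
isotype_controls = ["Mouse-IgG1", "Mouse-IgG2a", "Mouse-IgG2b", "Rat-IgG2b"]

# Toy matrix: 3 cells x 5 features.
X = np.arange(15, dtype=float).reshape(3, 5)

# Boolean mask over the columns: True where the feature is an isotype control.
is_isotype = np.isin(var_names, isotype_controls)

# Isotype columns are set aside (analogous to storing them in obsm) ...
X_isotypes = X[:, is_isotype]
# ... and only the remaining markers stay in the feature matrix.
X_markers = X[:, ~is_isotype]
kept_names = var_names[~is_isotype]

print(kept_names)
print(X_isotypes.shape, X_markers.shape)
```

Keeping the isotype columns aligned with the cells, rather than discarding them, means they can still be inspected later, for example when assessing unspecific antibody binding.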
38.4. PCA and UMAP
Since our dataset is very large, we first reduce its dimensionality with PCA. We then compute a neighborhood graph on the principal components and a UMAP embedding to visualize the study's variables.
%%time
sc.pp.pca(filtered["prot"], svd_solver="arpack")
CPU times: user 2.54 s, sys: 216 ms, total: 2.76 s
Wall time: 1.46 s
sc.pl.pca_variance_ratio(filtered["prot"], n_pcs=50)
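One possible way to read a variance ratio plot quantitatively is to ask how many leading components are needed to reach a given cumulative share of the variance. The helper below is a hedged sketch, not the procedure used in this chapter: n_pcs_for_variance is a hypothetical function, the threshold is an arbitrary choice, and the example ratios are made up. After sc.pp.pca, the actual ratios should be available in filtered["prot"].uns["pca"]["variance_ratio"].

```python
import numpy as np


def n_pcs_for_variance(variance_ratio, threshold=0.9):
    """Return the smallest number of leading PCs whose cumulative
    variance ratio reaches `threshold`, or None if it is never reached."""
    cumulative = np.cumsum(np.asarray(variance_ratio, dtype=float))
    reached = np.flatnonzero(cumulative >= threshold)
    return int(reached[0]) + 1 if reached.size else None


# Made-up variance ratios for eight components (they need not sum to 1,
# because only a subset of all components is usually computed).
variance_ratio = np.array([0.40, 0.20, 0.10, 0.08, 0.06, 0.04, 0.03, 0.02])

chosen = n_pcs_for_variance(variance_ratio, threshold=0.8)
print(chosen)
```

A cumulative threshold is only one heuristic. Looking for the point where the curve flattens (the "elbow") is a common alternative, and the two do not always agree.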

%%time
sc.pp.neighbors(filtered["prot"], n_pcs=20)
CPU times: user 30.4 s, sys: 881 ms, total: 31.3 s
Wall time: 26.3 s
%%time
sc.tl.umap(filtered["prot"])
CPU times: user 2min 32s, sys: 4.21 s, total: 2min 36s
Wall time: 2min 14s
sc.pl.umap(filtered["prot"], color=["CD4-1", "CD8", "CD3"])
sc.pl.umap(filtered["prot"], color=["CD14-1", "CD16"])


sc.pl.umap(filtered["prot"], color=["donor", "batch"])

As the UMAP above shows, cells from different samples cluster apart from each other even within similar populations (compare with the CD4 and CD8 expression in the previous plots). Batch correction of the data is therefore necessary.
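The visual impression that samples separate can be backed up with a simple number: for each cell, the fraction of its nearest neighbors that come from the same batch. The sketch below is an illustrative diagnostic under stated assumptions, not a method used in this chapter. same_batch_neighbor_fraction is a hypothetical helper, the data are synthetic, and the brute-force distance computation only suits small inputs, so a large dataset would need to be subsampled first. On real data, one could pass a subsample of filtered["prot"].obsm["X_pca"] together with the matching values of filtered["prot"].obs["batch"].

```python
import numpy as np


def same_batch_neighbor_fraction(embedding, batch_labels, k=15):
    """For each cell, return the fraction of its k nearest neighbors
    (Euclidean distance, excluding the cell itself) that share its batch."""
    embedding = np.asarray(embedding, dtype=float)
    batch_labels = np.asarray(batch_labels)
    # Brute-force pairwise squared distances; memory grows as n_cells**2.
    sq_norms = np.sum(embedding**2, axis=1)
    dists = sq_norms[:, None] + sq_norms[None, :] - 2 * embedding @ embedding.T
    np.fill_diagonal(dists, np.inf)  # a cell is not its own neighbor
    neighbor_idx = np.argsort(dists, axis=1)[:, :k]
    same = batch_labels[neighbor_idx] == batch_labels[:, None]
    return same.mean(axis=1)


rng = np.random.default_rng(0)
n_per_batch = 200
batch = np.repeat(["b1", "b2"], n_per_batch)

# Well-mixed case: both batches are drawn from the same distribution.
mixed = rng.normal(size=(2 * n_per_batch, 2))
# Batch-effect case: the second batch is shifted away from the first.
separated = mixed.copy()
separated[batch == "b2"] += 10.0

mixed_score = same_batch_neighbor_fraction(mixed, batch).mean()
separated_score = same_batch_neighbor_fraction(separated, batch).mean()
print(f"mixed: {mixed_score:.2f}, separated: {separated_score:.2f}")
```

With two equally sized batches, a score near 0.5 suggests good mixing, while a score near 1 indicates that cells mostly neighbor cells from their own batch. Real differences in cell type composition between batches also raise the score, so it should be interpreted alongside the marker plots.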
38.5. Key takeaways
TODO
38.6. References
38.7. Contributors
We gratefully acknowledge the contributions of:
38.7.2. Reviewers
Lukas Heumos