8. Feature selection#
Key takeaways
Feature selection in single-cell RNA-seq data focuses on identifying the most informative genes by using methods like deviance, which avoids biases from arbitrary normalization choices.
Environment setup
Install conda:
Before creating the environment, ensure that conda is installed on your system.
Save the yml content:
Copy the content from the yml tab into a file named
environment.yml.
Create the environment:
Open a terminal or command prompt.
Run the following command:
conda env create -f environment.yml
Activate the environment:
After the environment is created, activate it using:
conda activate <environment_name>
Replace
<environment_name>with the name specified in theenvironment.ymlfile. In the yml file it will look like this:name: <environment_name>
Verify the installation:
Check that the environment was created successfully by running:
conda env list
name: preprocessing
channels:
- bioconda
- conda-forge
dependencies:
- conda-forge::ipywidgets=8.1.5
- conda-forge::leidenalg=0.10.2
- conda-forge::numba=0.61.0
- conda-forge::python=3.12.9
- conda-forge::r-base=4.3.3
- conda-forge::r-soupx=1.6.2
- conda-forge::r-sctransform=0.4.1
- conda-forge::r-glmpca=0.2.0
- conda-forge::rpy2=3.5.11
- conda-forge::scanpy=1.11.1
- conda-forge::session-info=1.0.0
- bioconda::anndata2ri=1.3.2
- bioconda::bioconductor-scdblfinder=1.16.0
- bioconda::bioconductor-scry=1.14.0
- bioconda::bioconductor-scran=1.30.0
- bioconda::bioconductor-glmgampoi=1.14.0
- pip
- pip:
- lamindb[bionty,jupyter]
Get data and notebooks
This book uses lamindb to store, share, and load datasets and notebooks using the theislab/sc-best-practices instance. We acknowledge free hosting from Lamin Labs.
Install lamindb
Install the lamindb Python package:
pip install lamindb
Optionally create a lamin account
Sign up and log in following the instructions
Verify your setup
Run the
lamin connectcommand:
import lamindb as ln ln.Artifact.connect("theislab/sc-best-practices").df()
You should now see up to 100 of the stored datasets.
Accessing datasets (Artifacts)
Search for the datasets on the Artifacts page
Load an Artifact and the corresponding object:
import lamindb as ln af = ln.Artifact.connect("theislab/sc-best-practices").get(key="key_of_dataset", is_latest=True) obj = af.load()
The object is now accessible in memory and is ready for analysis. Adapt the
ln.Artifact.connect("theislab/sc-best-practices").get("SOMEIDXXXX")suffix to get respective versions.Accessing notebooks (Transforms)
Search for the notebook on the Transforms page
Load the notebook:
lamin load <notebook url>
which will download the notebook to the current working directory. Analogously to
Artifacts, you can adapt the suffix ID to get older versions.
8.1. Motivation#
We now have a normalized data representation that still preserves biological heterogeneity but with reduced technical sampling effects in gene expression. ScRNA-seq datasets usually contain up to 30,000 genes (depending on the organism) and so far we only removed genes that are not detected in at least 20 cells. However, many of the remaining genes are not informative and contain mostly zero counts. Therefore, a standard preprocessing pipeline involves the step of feature selection which aims to exclude uninformative genes which might not represent meaningful biological variation across samples.
Fig. 8.1 Feature selection generally describes the process of only selecting a subset of relevant features which can be the most informative, most variable or most deviant ones.#
Often, the scRNA-seq experiment focuses on one specific tissue and hence, only a small fraction of genes is informative and biologically variable. Traditional approaches and pipelines either compute the coefficient of variation (highly variable genes) or the average expression level (highly expressed genes) of all genes to obtain 500-2000 selected genes and use these features for their downstream analysis steps. However, these methods are highly sensitive to the normalization technique used before. As mentioned earlier, a former preprocessing workflow included normalization with CPM and subsequent log transformation. But as log-transformation is not possible for exact zeros, analysts often add a small pseudo count, e.g., 1 (log1p), to all normalized counts before log transforming the data. Choosing the pseudo count, however, is arbitrary and can introduce biases to the transformed data. This arbitrariness has then also an effect on the feature selection as the observed variability depends on the chosen pseudo count. A small pseudo count value close to zero is increasing the variance of genes with zero counts [Townes et al., 2019].
Germain et al. instead propose to use deviance for feature selection which works on raw counts [Germain et al., 2020]. Deviance can be computed in closed form and quantifies whether genes show constant expression profile across cells as these are not informative. Genes with constant expression are described by a multinomial null model, they are approximated by the binomial deviance. Highly informative genes across cells will have a high deviance value which indicates a poor fit by the null model (i.e., they don’t show constant expression across cells). According to the deviance values, the method then ranks all genes and obtains only highly deviant genes.
As mentioned before, deviance can be computed in closed form and is provided within the R package scry.
We start by setting up our environment.
import logging
import lamindb as ln
import matplotlib.pyplot as plt
import numpy as np
import rpy2.rinterface_lib.callbacks as rcb
import rpy2.robjects as ro
import rpy2.robjects.packages as rpackages
import scanpy as sc
import seaborn as sns
from rpy2.robjects import default_converter, numpy2ri, pandas2ri, r
from rpy2.robjects.conversion import localconverter
# Suppress verbose logging from Scanpy
sc.settings.verbosity = 0
# Set figure parameters for clean, minimal plots
sc.settings.set_figure_params(dpi=80, facecolor="white", frameon=False)
assert ln.setup.settings.instance.slug == "theislab/sc-best-practices"
ln.track()
rcb.logger.setLevel(logging.ERROR)
%load_ext rpy2.ipython
%%R
library(scry)
library(SingleCellExperiment)
Next, we load the already normalized dataset from the previous chapter.
Deviance works on raw counts so there is no need to replace adata.X with one of the normalized layers, but we can directly use the object as it was stored in the normalization notebook.
af = ln.Artifact.connect("theislab/sc-best-practices").get(
key="preprocessing_visualization/s4d8_normalization.h5ad", is_latest=True
)
adata = af.load()
adata
AnnData object with n_obs × n_vars = 14814 × 20223
obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_20_genes', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'log1p_total_counts_ribo', 'pct_counts_ribo', 'total_counts_hb', 'log1p_total_counts_hb', 'pct_counts_hb', 'outlier', 'mt_outlier', 'soupx_groups', 'scDblFinder_score', 'scDblFinder_class', 'size_factors'
var: 'gene_ids', 'feature_types', 'genome', 'interval', 'mt', 'ribo', 'hb', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'n_cells'
layers: 'analytic_pearson_residuals', 'counts', 'log1p_norm', 'scran_normalization', 'soupX_counts'
Similar to before, we save the AnnData object in our R environment.
X_sparse = adata.X.T.tocoo()
Matrix = rpackages.importr("Matrix")
with localconverter(ro.default_converter + pandas2ri.converter + numpy2ri.converter):
ro.globalenv["obs"] = adata.obs
ro.globalenv["var"] = adata.var
i, j = X_sparse.row, X_sparse.col
x = X_sparse.data
ro.globalenv["i"] = ro.IntVector((i + 1).tolist()) # R is 1-indexed
ro.globalenv["j"] = ro.IntVector((j + 1).tolist())
ro.globalenv["x"] = ro.FloatVector(x.tolist())
r("X <- sparseMatrix(i = i, j = j, x = x, dims = c({}, {}))".format(*X_sparse.shape))
<rpy2.robjects.methods.RS4 object at 0x164985550> [25]
R classes: ('dgCMatrix',)
We can now directly call feature selection with deviance on the non-normalized counts matrix and export the binomial deviance values as a vector.
%%R
sce <- SingleCellExperiment(
assays = list(X = X),
colData = obs,
rowData = var
)
sce <- devianceFeatureSelection(sce, assay = "X")
with localconverter(default_converter + pandas2ri.converter + numpy2ri.converter):
binomial_deviance = ro.r("rowData(sce)$binomial_deviance")
As a next step, we now sort the vector and select the top 4,000 highly deviant genes and save them as an additional column in .var as ‘highly_deviant’.
We additionally save the computed binomial deviance in case we want to sub-select a different number of highly variable genes afterwards.
idx = binomial_deviance.argsort()[-4000:]
mask = np.zeros(adata.var_names.shape, dtype=bool)
mask[idx] = True
adata.var["highly_deviant"] = mask
adata.var["binomial_deviance"] = binomial_deviance
Last, we visualise the feature selection results. We use a scanpy function to compute the mean and dispersion for each gene across all cells.
sc.pp.highly_variable_genes(adata, layer="scran_normalization")
We inspect our results by plotting dispersion versus mean for the genes and color by ‘highly_deviant’.
ax = sns.scatterplot(
data=adata.var, x="means", y="dispersions", hue="highly_deviant", s=5
)
ax.set_xlim(None, 1.5)
ax.set_ylim(None, 3)
plt.show()
We observe that genes with a high mean expression are selected as highly deviant. This is in agreement with empirical observations by [Townes et al., 2019].
af = ln.Artifact.from_anndata(
adata,
key="preprocessing_visualization/s4d8_feature_selection.h5ad",
description="anndata after feature selection",
).save()
af
8.2. References#
Pierre-Luc Germain, Anthony Sonrel, and Mark D. Robinson. pipeComp, a general framework for the evaluation of computational pipelines, reveals performant single cell rna-seq preprocessing tools. Genome Biology, 21(1):227, September 2020. URL: https://doi.org/10.1186/s13059-020-02136-7, doi:10.1186/s13059-020-02136-7.
F. William Townes, Stephanie C. Hicks, Martin J. Aryee, and Rafael A. Irizarry. Feature selection and dimension reduction for single-cell term`rna`-seq based on a multinomial model. Genome Biology, 20(1):295, Dec 2019. URL: https://doi.org/10.1186/s13059-019-1861-6, doi:10.1186/s13059-019-1861-6.
8.3. Contributors#
We gratefully acknowledge the contributions of:
8.3.2. Reviewers#
Lukas Heumos