4. Fundamental data structures and frameworks#

   Key takeaways

Three major ecosystems exist for single-cell analysis: R-based Bioconductor and Seurat, and Python-based scverse.

Single-cell analysis frameworks and consortia

While all ecosystems are widely used, scverse is preferred for scalability (handling 500k+ cells) and strong interoperability with Python’s data and machine learning tools.

Single-cell analysis frameworks and consortia

AnnData is the core data structure of scverse. It stores the count matrix in .X (cells × genes), with cell and gene annotations in .obs and .var.

Storing unimodal data with AnnData

AnnData is designed for memory efficiency. It supports sparse matrices, backed reading of large files, and view-based subsetting that avoids allocating new memory unless explicitly copied with .copy().

Efficient data access

Scanpy is the primary analysis framework built on AnnData. It provides a modular API (sc.pp for preprocessing, sc.tl for tools, sc.pl for plotting) covering the full workflow from basic quality control to dimensionality reduction and the many advanced analyses explored in later chapters.

Unimodal data analysis with scanpy
   Environment setup
  1. Install conda:

    • Before creating the environment, ensure that conda is installed on your system.

  2. Save the yml content:

    • Copy the content from the yml tab into a file named environment.yml.

  3. Create the environment:

    • Open a terminal or command prompt.

    • Run the following command:

      conda env create -f environment.yml
      
  4. Activate the environment:

    • After the environment is created, activate it using:

      conda activate <environment_name>
      
    • Replace <environment_name> with the name specified in the environment.yml file. In the yml file it will look like this:

      name: <environment_name>
      
  5. Verify the installation:

    • Check that the environment was created successfully by running:

      conda env list
      
name: fundamental_data_structures_and_frameworks
channels:
  - conda-forge
  - bioconda
dependencies:
  - conda-forge::anndata=0.12.7
  - conda-forge::python=3.13.12
  - conda-forge::scanpy=1.12
  - conda-forge::ipywidgets=8.1.8
  - pip
  - pip:
      - lamindb
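Once the environment is activated, a quick check from Python can complement conda env list. The following is a minimal sketch that relies only on the standard library; the helper name installed_versions is hypothetical and the list of packages simply mirrors the yml content above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return a mapping of package name -> installed version (None if missing)."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The package is not installed in the active environment.
            found[name] = None
    return found


# Packages listed in environment.yml (assumed to be the ones of interest).
versions = installed_versions(["anndata", "scanpy", "ipywidgets", "lamindb"])
for name, ver in versions.items():
    print(f"{name}: {ver if ver is not None else 'NOT INSTALLED'}")
```

Any package reported as NOT INSTALLED suggests that the wrong environment is active or that the environment creation did not complete.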
   Get data and notebooks

This book uses lamindb to store, share, and load datasets and notebooks using the theislab/sc-best-practices instance. We acknowledge free hosting from Lamin Labs.

  1. Install lamindb

    • Install the lamindb Python package:

    pip install lamindb
    
  2. Optionally create a lamin account

  3. Verify your setup

    • Run the lamin connect command:

    import lamindb as ln
    
    ln.Artifact.connect("theislab/sc-best-practices").df()
    

    You should now see up to 100 of the stored datasets.

  4. Accessing datasets (Artifacts)

    • Search for the datasets on the Artifacts page

    • Load an Artifact and the corresponding object:

    import lamindb as ln
    af = ln.Artifact.connect("theislab/sc-best-practices").get(key="key_of_dataset", is_latest=True)
    obj = af.load()
    

    The object is now accessible in memory and is ready for analysis. To retrieve a specific version, pass its ID instead, as in ln.Artifact.connect("theislab/sc-best-practices").get("SOMEIDXXXX"), and adapt the suffix of the ID accordingly.

  5. Accessing notebooks (Transforms)

    lamin load <notebook url>
    

    This command downloads the notebook to the current working directory. Analogously to Artifacts, you can adapt the suffix ID to get older versions.

4.1. Single-cell analysis frameworks and consortia#

After obtaining the count matrices, as described earlier, the exploratory data analysis phase begins. In the early days, people analyzed their data with custom scripts, but dedicated frameworks for this purpose now exist. The three most popular options are the R-based Bioconductor [Huber et al., 2015] and Seurat [Hao et al., 2021] ecosystems and the Python-based scverse [scverse, 2022] ecosystem. These differ not only in the programming languages used but also in the underlying data structures and the specialized analysis tools available.

Bioconductor is an open-source project for rigorous and reproducible biological data analysis, including single-cell. Its greatest strengths are a homogeneous developer and user experience and extensive, user-friendly documentation. Seurat is a well-regarded R package for single-cell analysis, covering all analysis steps including multimodal and spatial data. It is known for its well-written vignettes and large user base. Both R options can struggle with very large datasets (500k+ cells), which motivated the Python community to develop the scverse ecosystem. Scverse is an organization dedicated to foundational life science tools, with an initial focus on single-cell. Key advantages include scalability, extendability, and strong interoperability with Python’s data and machine learning ecosystem.

All three ecosystems take part in many efforts to make their frameworks interoperable, which will be discussed in the “Interoperability” chapter. This book always focuses on the best tools for the question at hand and will, therefore, use a mix of the above-mentioned ecosystems. However, the basis of all analyses will be the scverse ecosystem for two reasons:

  1. While we will regularly switch ecosystems and even programming languages throughout this book, consistent use of data structures and tooling helps readers focus on the concepts rather than implementation details.

  2. A great book focused exclusively on the Bioconductor ecosystem already exists. We encourage users who only want to learn about single-cell analysis with Bioconductor to read it.

In the following sections, the scverse ecosystem will be introduced in more detail, and the key concepts will be explained with a focus on the most important data structures. This chapter introduces the fundamental data structure AnnData and the scanpy framework (see Fig. 4.1). In the following chapter, we will explore more advanced libraries. This introduction cannot cover all aspects of the data structures and frameworks. We refer to the respective frameworks’ tutorials and documentation where required.

Scverse ecosystem overview

Fig. 4.1 Scverse ecosystem overview highlighting the libraries of this chapter. The date of publication in a scientific journal is shown in brackets. The library symbols were obtained from the corresponding GitHub pages [Bredikhin et al., 2022, Marconato et al., 2025, Palla et al., 2022, Virshup et al., 2021, Wolf et al., 2018].#

4.2. Storing unimodal data with AnnData#

As previously discussed, genomics data is typically summarized into a feature matrix after alignment and gene annotation. This matrix will be of the shape number_observations x number_variables. In scRNA-seq, observations are cellular barcodes, and the variables are annotated genes. Throughout the analysis, the observations and variables of this matrix are annotated with computationally derived measurements (e.g., quality control metrics or latent space embeddings) and prior knowledge (e.g., source donor or alternative gene identifier). In the scverse ecosystem, AnnData [Virshup et al., 2021] is used to associate the data matrix with these annotations. To allow for fast and memory-efficient transformations, AnnData also supports sparse matrices and partial reading.
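To illustrate why sparse matrices matter for memory efficiency, the following sketch compares the memory footprint of a dense array with its compressed sparse row (CSR) counterpart. The Poisson rate of 0.1 and the fixed seed are assumptions chosen to mimic zero-heavy count data; real datasets will differ.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Simulate a zero-heavy count matrix (100 cells x 2000 genes).
# The rate of 0.1 is an assumption that yields roughly 90% zeros.
rng = np.random.default_rng(0)
dense = rng.poisson(0.1, size=(100, 2000)).astype(np.float32)
sparse = csr_matrix(dense)

# A CSR matrix stores only the non-zero values plus two index arrays.
dense_bytes = dense.nbytes
sparse_bytes = sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes

print(f"fraction of zeros: {1 - sparse.nnz / dense.size:.2f}")
print(f"dense:  {dense_bytes / 1e3:.0f} kB")
print(f"sparse: {sparse_bytes / 1e3:.0f} kB")
```

Note that the saving depends on the fraction of zeros: for a matrix in which most entries are non-zero, the index arrays can make the sparse representation larger than the dense one.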

While AnnData is broadly similar to data structures from the R ecosystems (e.g., Bioconductor’s SummarizedExperiment or Seurat’s object), R packages use a transposed feature matrix.

At its core, an AnnData object stores a sparse or dense matrix (the count matrix in the case of scRNA-Seq) in X. This matrix has the dimensions obs_names x var_names, where the obs (=observations) correspond to the cells’ barcodes and the var (=variables) correspond to the gene identifiers. The matrix X is surrounded by the Pandas DataFrames obs and var, which store annotations of cells and genes, respectively. Further, AnnData stores whole matrices of calculations for the observations (obsm) or variables (varm) with the corresponding dimensions. Graph-like structures that associate cells with cells or genes with genes are usually saved in obsp and varp. Any other data that does not fit into one of these slots is saved as unstructured data in uns. Additional versions of X can be stored in layers. A typical use case is storing the raw, unnormalized count data in a counts layer and the normalized data in the unnamed default layer (X). AnnData is primarily designed for unimodal (for example, just scRNA-Seq) data. However, extensions of AnnData, such as MuData, which is covered in the next chapter, allow for the efficient storage and access of multimodal data.

AnnData Overview

Fig. 4.2 AnnData overview. Image obtained from [Virshup et al., 2021].#

4.2.1. Installation#

AnnData is available on PyPI and Conda. It can be installed using either of the following commands.

pip install anndata
conda install -c conda-forge anndata

4.2.2. Initializing an AnnData object#

This section is inspired by AnnData’s “getting started” tutorial. Let us create a simple AnnData object with sparse count information, which may, for example, represent gene expression counts. First, we import the required packages.

import anndata as ad
import lamindb as ln
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

ln.track()

→ loaded Transform('zmuES9DakbmX0000', key='fundamental_data_structures_and_frameworks.ipynb'), re-started Run('pVMXOM3UaklSn6tp') at 2026-02-27 12:11:51 UTC
→ notebook imports: anndata==0.12.7 lamindb==2.0.1 numpy==2.3.5 pandas==2.3.3 scanpy==1.11.5 scipy==1.16.3
• recommendation: to identify the notebook across renames, pass the uid: ln.track("zmuES9DakbmX")

As a next step, we initialize an AnnData object with random Poisson distributed data. It is an unwritten rule to name the primary AnnData object of the analysis adata.

counts = csr_matrix(
    np.random.default_rng().poisson(1, size=(100, 2000)), dtype=np.float32
)
adata = ad.AnnData(counts)
adata
AnnData object with n_obs × n_vars = 100 × 2000

The obtained AnnData object has 100 observations and 2000 variables. This would correspond to 100 cells with 2000 genes. The initial data we passed are accessible as a sparse matrix using adata.X.

adata.X
<Compressed Sparse Row sparse matrix of dtype 'float32'
	with 126197 stored elements and shape (100, 2000)>

Now, we provide the index to both the obs and var axes using .obs_names and .var_names, respectively.

adata.obs_names = [f"Cell_{i:d}" for i in range(adata.n_obs)]
adata.var_names = [f"Gene_{i:d}" for i in range(adata.n_vars)]
print(adata.obs_names[:10])
Index(['Cell_0', 'Cell_1', 'Cell_2', 'Cell_3', 'Cell_4', 'Cell_5', 'Cell_6',
       'Cell_7', 'Cell_8', 'Cell_9'],
      dtype='object')

4.2.3. Adding aligned metadata#

4.2.3.1. Observational or variable level#

The core of our AnnData object is now in place. As a next step, we add metadata at both the observational and variable levels. Remember, we store such annotations in the .obs and .var slots of the AnnData object for cell and gene annotations, respectively.

ct = np.random.default_rng().choice(["B", "T", "Monocyte"], size=(adata.n_obs,))
adata.obs["cell_type"] = pd.Categorical(ct)  # Categoricals are preferred for efficiency
adata.obs
cell_type
Cell_0 B
Cell_1 T
Cell_2 Monocyte
Cell_3 B
Cell_4 Monocyte
... ...
Cell_95 T
Cell_96 B
Cell_97 B
Cell_98 T
Cell_99 B

100 rows × 1 columns

If we examine the representation of the AnnData object again now, we will notice that it was updated with the cell_type information in obs as well.

adata
AnnData object with n_obs × n_vars = 100 × 2000
    obs: 'cell_type'

4.2.3.2. Subsetting using metadata#

We can also subset the AnnData object with the randomly generated cell types. Slicing and masking of the AnnData object behave similarly to data access in Pandas DataFrames or R matrices. More details on this can be found below.

bdata = adata[adata.obs.cell_type == "B"]
bdata
View of AnnData object with n_obs × n_vars = 40 × 2000
    obs: 'cell_type'

4.2.4. Observation/variable-level matrices#

We might also have metadata at either level with many dimensions, such as a UMAP embedding of the data. AnnData has the .obsm/.varm attributes for this type of metadata. We use keys to identify the different matrices we insert. The only restriction is that matrices in .obsm must have a length equal to the number of observations (.n_obs), and matrices in .varm must have a length equal to the number of variables (.n_vars). Each matrix can independently have a different number of dimensions.

Let us start with a randomly generated matrix that we can interpret as a UMAP embedding of the data we would like to store, as well as some random gene-level metadata.

adata.obsm["X_umap"] = np.random.default_rng().normal(0, 1, size=(adata.n_obs, 2))
adata.varm["gene_stuff"] = np.random.default_rng().normal(0, 1, size=(adata.n_vars, 5))
adata.obsm
AxisArrays with keys: X_umap

Again, the AnnData representation is updated.

adata
AnnData object with n_obs × n_vars = 100 × 2000
    obs: 'cell_type'
    obsm: 'X_umap'
    varm: 'gene_stuff'

A few more notes about .obsm/.varm:

  1. The “array-like” metadata can originate from a Pandas DataFrame, scipy sparse matrix, or numpy dense array.

  2. When using scanpy, their values (columns) are not easily plotted, whereas items from .obs are easily plotted on, e.g., UMAP plots.

4.2.5. Unstructured metadata#

As mentioned above, AnnData has .uns, which allows for any unstructured metadata. This can be anything, like a list or a dictionary, with some general information that was useful in the analysis of our data. Try only using this slot for data that cannot be efficiently stored in the other slots.

adata.uns["random"] = [1, 2, 3]
adata.uns
OrderedDict([('random', [1, 2, 3])])

4.2.6. Layers#

Finally, we may have different forms of our original core data, perhaps one that is normalized and one that is not. These can be stored in different layers in AnnData. For example, let us log transform the original data and store it in a layer.

adata.layers["log_transformed"] = np.log1p(adata.X)
adata
AnnData object with n_obs × n_vars = 100 × 2000
    obs: 'cell_type'
    uns: 'random'
    obsm: 'X_umap'
    varm: 'gene_stuff'
    layers: 'log_transformed'

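Our original matrix X was not modified and is still accessible. We can verify this by comparing the original X to the new layer: the elementwise comparison yields a boolean sparse matrix whose .nnz counts the entries that differ. The check below returns False because X still holds the raw counts, which differ from their log-transformed values.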

(adata.X != adata.layers["log_transformed"]).nnz == 0
False

4.2.7. Conversion to DataFrames#

It is possible to obtain a Pandas DataFrame from one of the layers.

adata.to_df(layer="log_transformed")
           Gene_0    Gene_1    Gene_2  ...  Gene_1998  Gene_1999
Cell_0   0.000000  0.693147  0.693147  ...   0.000000   0.693147
Cell_1   0.000000  0.693147  0.000000  ...   0.000000   1.098612
Cell_2   0.693147  1.098612  1.098612  ...   0.000000   0.000000
Cell_3   0.000000  1.098612  0.693147  ...   0.000000   1.098612
Cell_4   0.000000  0.693147  1.386294  ...   0.693147   0.693147
...           ...       ...       ...  ...        ...        ...
Cell_95  0.000000  0.693147  1.098612  ...   1.098612   1.098612
Cell_96  1.098612  0.000000  0.693147  ...   1.386294   0.693147
Cell_97  0.693147  0.693147  0.000000  ...   1.386294   1.098612
Cell_98  0.000000  0.693147  0.000000  ...   0.693147   1.609438
Cell_99  1.098612  1.386294  0.693147  ...   0.693147   0.000000

100 rows × 2000 columns

4.2.8. Reading and writing of AnnData objects

AnnData objects can be saved on disk to hierarchical array stores such as HDF5 or Zarr, so that the on-disk layout mirrors the in-memory structure. AnnData comes with its own persistent HDF5-based file format: h5ad. If string columns with only a few distinct values are not yet categorical, AnnData will automatically convert them to categoricals when writing. We will now save our AnnData object in h5ad format.

adata.write("my_results.h5ad", compression="gzip")

… and read it back in.

adata_new = ad.read_h5ad("my_results.h5ad")
adata_new
AnnData object with n_obs × n_vars = 100 × 2000
    obs: 'cell_type'
    uns: 'random'
    obsm: 'X_umap'
    varm: 'gene_stuff'
    layers: 'log_transformed'
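
To see why converting string columns with few distinct values to categoricals is worthwhile, the following sketch compares memory usage in plain pandas. It does not use AnnData itself, and the column values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical annotation column: 10,000 cells, only three distinct labels.
rng = np.random.default_rng(0)
labels = rng.choice(["B", "T", "Monocyte"], size=10_000)

as_strings = pd.Series(labels, dtype=object)  # one Python string per cell
as_category = as_strings.astype("category")   # integer codes plus 3 categories

bytes_strings = as_strings.memory_usage(deep=True)
bytes_category = as_category.memory_usage(deep=True)

print(f"object dtype:   {bytes_strings} bytes")
print(f"category dtype: {bytes_category} bytes")
print(list(as_category.cat.categories))
```

The categorical column stores each label once and keeps only a small integer code per cell, which is why it is the more compact representation for annotations such as cell types.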

4.2.9. Efficient data access

4.2.9.1. View and copies

For the fun of it, let us look at another metadata use case. Imagine that the observations come from instruments characterizing 10 readouts in a multi-year study with samples taken from different subjects at different sites. We would typically get that information in some format and then store it in a DataFrame:

obs_meta = pd.DataFrame(
    {
        "time_yr": np.random.default_rng().choice([0, 2, 4, 8], adata.n_obs),
        "subject_id": np.random.default_rng().choice(
            ["subject 1", "subject 2", "subject 4", "subject 8"], adata.n_obs
        ),
        "instrument_type": np.random.default_rng().choice(
            ["type a", "type b"], adata.n_obs
        ),
        "site": np.random.default_rng().choice(["site x", "site y"], adata.n_obs),
    },
    index=adata.obs.index,  # these are the same IDs of observations as above!
)

This is how we join the readout data with the metadata. Of course, the first argument of the following call for X could also just be a DataFrame. This will result in a single data container that tracks everything.

adata = ad.AnnData(adata.X, obs=obs_meta, var=adata.var)
adata
AnnData object with n_obs × n_vars = 100 × 2000
    obs: 'time_yr', 'subject_id', 'instrument_type', 'site'

Subsetting the joint data matrix can be important to focus on subsets of variables or observations, or to define train-test splits for a machine learning model.

Similar to numpy arrays, AnnData objects can either hold actual data or reference another AnnData object. In the latter case, they are referred to as a “view”. Subsetting AnnData objects always returns views, which has two advantages:

  • No new memory is allocated.

  • It is possible to modify the underlying AnnData object.
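
The NumPy analogy can be made concrete with a small sketch. It uses only NumPy, and the toy array is made up; it illustrates the general idea of views and copies rather than AnnData's own implementation.

```python
import numpy as np

data = np.arange(12).reshape(3, 4)

view = data[:2, :2]         # basic slicing returns a view: no data is copied
copy = data[:2, :2].copy()  # an explicit copy owns its own memory

print(np.shares_memory(data, view))  # True
print(np.shares_memory(data, copy))  # False

view[0, 0] = 100  # writes through to `data`
copy[0, 1] = -1   # leaves `data` untouched
print(data[0, :2].tolist())  # [100, 1]
```

One difference is worth keeping in mind: a NumPy view never turns into a copy on its own, whereas an AnnData view can become a full object when it is modified.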

You can get an actual AnnData object from a view by calling .copy() on the view. Usually, this is not necessary, as any modification of elements of a view (for example, assigning with [] on an attribute of the view) internally calls .copy() and turns the view into an AnnData object that holds actual data. See the example below.

adata
AnnData object with n_obs × n_vars = 100 × 2000
    obs: 'time_yr', 'subject_id', 'instrument_type', 'site'

Indexing into AnnData will assume that integer arguments to [] behave like .iloc in pandas, whereas string arguments behave like .loc. AnnData always assumes string indices.

adata_view = adata[:5, ["Gene_1", "Gene_3"]]
adata_view
View of AnnData object with n_obs × n_vars = 5 × 2
    obs: 'time_yr', 'subject_id', 'instrument_type', 'site'

This is a view! This can be verified by examining the AnnData object again.

adata
AnnData object with n_obs × n_vars = 100 × 2000
    obs: 'time_yr', 'subject_id', 'instrument_type', 'site'

The dimensions of the AnnData object have not changed. It still contains the same data. If we want an AnnData object that holds its own data in memory, we must call .copy() on the view.

adata_subset = adata[:5, ["Gene_1", "Gene_3"]].copy()
adata_subset
AnnData object with n_obs × n_vars = 5 × 2
    obs: 'time_yr', 'subject_id', 'instrument_type', 'site'

For a view, we can also set the first three elements of a column.

print(adata[:3, "Gene_1"].X.toarray().tolist())
adata[:3, "Gene_1"].X = [0, 0, 0]
print(adata[:3, "Gene_1"].X.toarray().tolist())
[[1.0], [1.0], [2.0]]
[[0.0], [0.0], [0.0]]

If you try to modify parts of a view of an AnnData object, the content will be auto-copied and a data-storing object will be generated.

adata_subset = adata[:3, ["Gene_1", "Gene_2"]]
adata_subset
View of AnnData object with n_obs × n_vars = 3 × 2
    obs: 'time_yr', 'subject_id', 'instrument_type', 'site'
adata_subset.obs["foo"] = range(3)

Now adata_subset stores the actual data and is no longer just a reference to adata.

adata_subset
AnnData object with n_obs × n_vars = 3 × 2
    obs: 'time_yr', 'subject_id', 'instrument_type', 'site', 'foo'

Evidently, you can use all of pandas to slice with sequences or boolean indices.

adata[adata.obs.time_yr.isin([2, 4])].obs.head()
        time_yr subject_id instrument_type    site
Cell_2        4  subject 1          type a  site y
Cell_3        2  subject 2          type a  site y
Cell_5        4  subject 4          type b  site x
Cell_6        2  subject 1          type b  site y
Cell_7        4  subject 4          type a  site x
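
Boolean masks can be combined with the usual pandas operators before they are used for indexing. The sketch below builds such a mask on a stand-alone DataFrame that mimics adata.obs; the values are made up for illustration.

```python
import pandas as pd

# Stand-in for adata.obs with made-up values.
obs = pd.DataFrame(
    {
        "time_yr": [0, 2, 4, 8, 2, 4],
        "site": ["site x", "site y", "site y", "site x", "site x", "site y"],
    },
    index=[f"Cell_{i}" for i in range(6)],
)

# Combine conditions with & (and), | (or), ~ (not); parentheses are required.
mask = obs["time_yr"].isin([2, 4]) & (obs["site"] == "site y")

selected = obs[mask]
print(selected.index.tolist())  # ['Cell_1', 'Cell_2', 'Cell_5']
```

The same kind of mask can be passed as the first index of an AnnData object, as in the adata[...] call above.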

4.2.9.2. Partial reading of large data

If a single h5ad file is very large, you can partially read it into memory by using backed mode.

adata = ad.read_h5ad("my_results.h5ad", backed="r")
adata.isbacked
True

If you do this, you will need to remember that the AnnData object has an open connection to the file used for reading.

adata.filename
PosixPath('my_results.h5ad')

As we are using it in read-only mode, we cannot damage anything. To proceed with this tutorial, we still need to explicitly close it.

adata.file.close()
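
The idea behind backed mode, keeping the matrix on disk and loading only the requested parts, can be illustrated with NumPy's memory mapping. This is only an analogy under the assumption of a plain .npy file; AnnData's backed mode works on HDF5 files and is not implemented this way.

```python
import tempfile
from pathlib import Path

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "matrix.npy"
    np.save(path, np.arange(1_000_000, dtype=np.float32).reshape(1000, 1000))

    # mmap_mode="r" maps the file read-only instead of loading it into memory.
    on_disk = np.load(path, mmap_mode="r")
    is_memmap = isinstance(on_disk, np.memmap)

    # Only the requested rows are materialized in memory.
    first_rows = np.array(on_disk[:5])
    shape_in_memory = first_rows.shape

    # Release the mapping before the temporary file is deleted.
    del on_disk

print(is_memmap, shape_in_memory)
```

As with the backed AnnData object, the mapping keeps a handle on the file, which is why it is released explicitly before the file goes away.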

4.3. Unimodal data analysis with scanpy

Now that we understand the fundamental data structure of unimodal single-cell analysis, the question remains: How can we actually analyze the stored data? In the scverse ecosystem, several tools exist for analyzing specific omics data. For example, scanpy [Wolf et al., 2018] provides tooling for general RNA-Seq-focused analysis, squidpy [Palla et al., 2022] focuses on spatial transcriptomics, and scirpy [Sturm et al., 2020] provides tooling for the analysis of T-cell receptor (TCR) and B-cell receptor (BCR) data. Although many scverse extensions exist for various data modalities, most of them rely on scanpy’s preprocessing and visualization capabilities to some extent.

More specifically, scanpy is a Python package that builds on top of AnnData to facilitate the analysis of single-cell gene expression data. Several methods for preprocessing, embedding, visualization, clustering, differential gene expression testing, pseudotime and trajectory inference, and simulation of gene regulatory networks are accessible through scanpy. The efficient implementation based on the Python data science and machine learning libraries allows scanpy to scale to millions of cells. Generally, best-practice single-cell data analysis is an interactive process. Many of the decisions and analysis steps depend on the results of previous steps and the potential input of experimental partners. Pipelines such as scflow [Khozoie et al., 2021] entirely automate some downstream analysis steps. These pipelines have to make assumptions and simplifications, which may not result in the most robust analysis. Scanpy is therefore designed for interactive analyses with, for example, Jupyter Notebooks [Jupyter, 2022].


Fig. 4.3 Scanpy overview. Image obtained from [Wolf et al., 2018].

4.3.1. Installation

Scanpy is available on PyPI and Conda. It can be installed using either of the following commands.

pip install scanpy
conda install -c conda-forge scanpy

4.3.2. Scanpy API design

The scanpy framework is designed so that functions belonging to the same step are grouped into corresponding modules. For example, all preprocessing functions are available in the scanpy.preprocessing module, all transformations of a data matrix that are not preprocessing are available in scanpy.tools, and all visualizations are available in scanpy.plotting. These modules are commonly accessed after importing scanpy with import scanpy as sc, using the corresponding abbreviations sc.pp for preprocessing, sc.tl for tools, and sc.pl for plots. Functions that read or write data are accessed directly from the top-level namespace (for example, sc.read_h5ad). Further, a module for various datasets is available as sc.datasets. All functions with corresponding parameters and potential example plots are documented in the scanpy API documentation [scverse scanpy, 2022].

Note that this tutorial only covers a tiny subset of scanpy’s features and options. Readers are strongly encouraged to examine scanpy’s documentation for more details.


Fig. 4.4 Scanpy API overview. The API is divided into datasets, preprocessing (pp), tools (tl) and corresponding plotting (pl) functions.

4.3.3. Scanpy example

In the following cells we will briefly demonstrate the workflow of an analysis with scanpy. We explicitly do not conduct a full analysis because the specific analysis steps are covered in the corresponding chapters.

As a first step we import scanpy and define defaults for our following quick scanpy demo. We use scanpy’s setting object to set the Matplotlib plotting defaults for all of scanpy’s plots and finally print scanpy’s header. This header contains the versions of all relevant Python packages in the current environment including scanpy and AnnData. This output is especially useful when reporting bugs to the scverse team and for reproducibility reasons.

import scanpy as sc

sc.settings.set_figure_params(dpi=80, facecolor="white")
sc.logging.print_header()

The dataset of choice is a dataset of 2700 peripheral blood mononuclear cells of a healthy donor which were sequenced on the Illumina NextSeq 500. We can load the dataset from lamindb, although it is also available via sc.datasets.pbmc3k().

import lamindb as ln  # assumed import; `ln` is not defined elsewhere in this section

adata = ln.Artifact.get(
    key="introduction/fundamental_data_structures_and_frameworks.h5ad", is_latest=True
).load()
adata
AnnData object with n_obs × n_vars = 2700 × 32738
    var: 'gene_ids'

The returned AnnData object has 2700 cells with 32738 genes. The var slot further contains the gene IDs.

adata.var
                     gene_ids
index
MIR1302-10    ENSG00000243485
FAM138A       ENSG00000237613
OR4F5         ENSG00000186092
RP11-34P13.7  ENSG00000238009
RP11-34P13.8  ENSG00000239945
...                       ...
AC145205.1    ENSG00000215635
BAGE5         ENSG00000268590
CU459201.1    ENSG00000251180
AC002321.2    ENSG00000215616
AC002321.1    ENSG00000215611

32738 rows × 1 columns

As mentioned above, all of scanpy’s analysis functions are accessible via sc.[pp, tl, pl]. As a first step to get an overview of our data, we use scanpy to show the genes that account for the highest fraction of counts per cell, averaged across all cells. We simply call the sc.pl.highest_expr_genes function, pass the AnnData object, which is the first parameter of almost every scanpy function, and specify that we want the top 20 expressed genes to be shown.

sc.pl.highest_expr_genes(adata, n_top=20)

MALAT1 is the most highly expressed gene. It is frequently detected in poly-A captured scRNA-Seq data, independent of protocol. Its expression has been shown to correlate inversely with cell health: dead or dying cells in particular show higher MALAT1 expression.
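
The ranking behind this plot can be sketched in a few lines of NumPy: compute, for each cell, the fraction of its counts that falls on each gene, then rank genes by the mean fraction across cells. This is a minimal sketch on a made-up count matrix, not scanpy's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy count matrix: 50 cells x 8 genes; gene 3 is made artificially dominant.
counts = rng.poisson(lam=2.0, size=(50, 8)).astype(float)
counts[:, 3] += 20

# Fraction of each cell's total counts that falls on each gene.
fractions = counts / counts.sum(axis=1, keepdims=True)

# Rank genes by their mean fraction across cells and keep the top n.
n_top = 3
mean_fraction = fractions.mean(axis=0)
top_genes = np.argsort(mean_fraction)[::-1][:n_top]

print(top_genes, mean_fraction[top_genes].round(3))
```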

As a rough quality threshold, we now use scanpy’s preprocessing module to filter out cells with fewer than 200 detected genes and genes that were found in fewer than 3 cells.

sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
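
The logic of these two filters can be sketched on a small dense matrix. The thresholds and counts below are made up so that the result can be checked by hand; this is an illustration of the idea, not scanpy's code.

```python
import numpy as np

# Toy count matrix: 4 cells (rows) x 5 genes (columns).
counts = np.array(
    [
        [0, 0, 0, 1, 0],  # only 1 detected gene
        [2, 0, 1, 0, 0],
        [1, 3, 0, 0, 0],
        [0, 1, 1, 0, 4],
    ]
)

min_genes = 2  # toy thresholds; the text uses 200 and 3
min_cells = 2

# Keep cells in which at least `min_genes` genes have a non-zero count.
genes_per_cell = (counts > 0).sum(axis=1)
counts = counts[genes_per_cell >= min_genes, :]

# Then keep genes detected in at least `min_cells` of the remaining cells.
cells_per_gene = (counts > 0).sum(axis=0)
counts = counts[:, cells_per_gene >= min_cells]

print(counts.shape)  # (3, 3)
```

Because the gene filter runs after the cell filter, it only counts detections in cells that passed the first step.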

A common step in single-cell RNA-Seq analysis is dimensionality reduction with, for example, PCA to unveil the main axes of variation. This also denoises the data. Scanpy offers PCA as both a preprocessing and a tools function; the two are equivalent. Here, we use the version in tools for no particular reason.

sc.tl.pca(adata, svd_solver="arpack")

The corresponding plotting function allows us to pass genes to the color argument. The corresponding values are automatically extracted from the AnnData object.

sc.pl.pca(adata, color="CST3")
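
What PCA computes can be sketched with a singular value decomposition in NumPy. The matrix below is random toy data and the variable names are illustrative; scanpy's own implementation differs in its solver and options, and it stores the resulting cell coordinates in adata.obsm["X_pca"].

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))  # toy matrix: 100 cells x 20 features

n_comps = 5
X_centered = X - X.mean(axis=0)  # PCA operates on mean-centered features

# Thin SVD: X_centered = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

scores = U[:, :n_comps] * S[:n_comps]           # cell coordinates in PC space
variance = S[:n_comps] ** 2 / (X.shape[0] - 1)  # variance along each PC
variance_ratio = S[:n_comps] ** 2 / np.sum(S**2)

print(scores.shape, variance_ratio.round(3))
```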

A fundamental step for any advanced embedding and downstream calculation is computing the neighborhood graph from the PCA representation of the data matrix. The graph is then used automatically by other tools that require it, such as UMAP.

sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
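
As a rough illustration of what a neighborhood graph is, the sketch below builds a brute-force k-nearest-neighbor graph with NumPy on a made-up embedding. It is a simplification under stated assumptions: exact Euclidean distances and a binary adjacency matrix, whereas scanpy additionally derives weighted connectivities and may use faster neighbor searches.

```python
import numpy as np

rng = np.random.default_rng(0)
embedding = rng.normal(size=(30, 4))  # toy stand-in for a PCA representation
n_neighbors = 5

# Pairwise Euclidean distances between all cells.
diff = embedding[:, None, :] - embedding[None, :, :]
dist = np.sqrt((diff**2).sum(axis=-1))

# Exclude each cell from its own neighbor list, then take the closest cells.
np.fill_diagonal(dist, np.inf)
neighbors = np.argsort(dist, axis=1)[:, :n_neighbors]

# Binary adjacency matrix of the directed kNN graph.
adjacency = np.zeros((30, 30), dtype=bool)
adjacency[np.arange(30)[:, None], neighbors] = True

print(neighbors.shape, adjacency.sum(axis=1)[:3])
```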

We now use the calculated neighborhood graph to embed the cells with UMAP, one of many advanced dimension reduction algorithms implemented in scanpy.

sc.tl.umap(adata)
sc.pl.umap(adata, color=["CST3", "NKG7", "PPBP"])

Scanpy’s documentation also provides tutorials, which we recommend to all readers who are new to scanpy or need a refresher. Video tutorials are available on the scverse YouTube channel.

4.4. Questions

4.4.1. Flipcards

What is the fundamental data structure for single-cell analysis in the scverse?
AnnData
What is the fundamental framework for single-cell analysis in the scverse?
Scanpy
In single-cell RNA-seq data, which dimensions correspond to genes and cells?
Genes are stored in `.var` (columns), and cells are stored in `.obs` (rows) of the matrix.

4.4.2. Multiple-choice questions

What is a common limitation of the R-based frameworks Bioconductor and Seurat?





In an AnnData object, which slot stores the main count matrix for scRNA-seq data?





Where would you store additional matrices derived from the main data, such as normalized counts or log-transformed values, in an AnnData object?





4.5. References

[atBKS22]

Danila Bredikhin, Ilia Kats, and Oliver Stegle. Muon: multimodal omics analysis framework. Genome Biology, 23(1):42, Feb 2022. URL: https://doi.org/10.1186/s13059-021-02577-8, doi:10.1186/s13059-021-02577-8.

[atHHAN+21]

Yuhan Hao, Stephanie Hao, Erica Andersen-Nissen, William M. Mauck III, Shiwei Zheng, Andrew Butler, Maddie J. Lee, Aaron J. Wilk, Charlotte Darby, Michael Zagar, Paul Hoffman, Marlon Stoeckius, Efthymia Papalexi, Eleni P. Mimitou, Jaison Jain, Avi Srivastava, Tim Stuart, Lamar B. Fleming, Bertrand Yeung, Angela J. Rogers, Juliana M. McElrath, Catherine A. Blish, Raphael Gottardo, Peter Smibert, and Rahul Satija. Integrated analysis of multimodal single-cell data. Cell, 2021. URL: https://doi.org/10.1016/j.cell.2021.04.048, doi:10.1016/j.cell.2021.04.048.

[atHCG+15]

Wolfgang Huber, Vincent J. Carey, Robert Gentleman, Simon Anders, Marc Carlson, Benilton S. Carvalho, Hector Corrada Bravo, Sean Davis, Laurent Gatto, Thomas Girke, Raphael Gottardo, Florian Hahne, Kasper D. Hansen, Rafael A. Irizarry, Michael Lawrence, Michael I. Love, James MacDonald, Valerie Obenchain, Andrzej K. Oleś, Hervé Pagès, Alejandro Reyes, Paul Shannon, Gordon K. Smyth, Dan Tenenbaum, Levi Waldron, and Martin Morgan. Orchestrating high-throughput genomic analysis with bioconductor. Nature Methods, 12(2):115–121, Feb 2015. URL: https://doi.org/10.1038/nmeth.3252, doi:10.1038/nmeth.3252.

[atJup22]

Project Jupyter. Jupyter. https://jupyter.org/, 2022. Accessed: 2022-04-21.

[atKFM+21]

Combiz Khozoie, Nurun Fancy, Mahdi M. Marjaneh, Alan E. Murphy, Paul M. Matthews, and Nathan Skene. Scflow: a scalable and reproducible analysis pipeline for single-cell RNA sequencing data. bioRxiv, 2021. URL: https://www.biorxiv.org/content/early/2021/08/19/2021.08.16.456499.1, doi:10.1101/2021.08.16.456499.

[atMPY+25]

Luca Marconato, Giovanni Palla, Kevin A. Yamauchi, Isaac Virshup, Elyas Heidari, Tim Treis, Wouter-Michiel Vierdag, Marcella Toth, Sonja Stockhaus, Rahul B. Shrestha, Benjamin Rombaut, Lotte Pollaris, Laurens Lehner, Harald Vöhringer, Ilia Kats, Yvan Saeys, Sinem K. Saka, Wolfgang Huber, Moritz Gerstung, Josh Moore, Fabian J. Theis, and Oliver Stegle. Spatialdata: an open and universal data framework for spatial omics. Nature Methods, 22(1):58–62, 2025. URL: https://doi.org/10.1038/s41592-024-02212-x, doi:10.1038/s41592-024-02212-x.

[atPSK+22]

Giovanni Palla, Hannah Spitzer, Michal Klein, David Fischer, Anna Christina Schaar, Louis Benedikt Kuemmerle, Sergei Rybakov, Ignacio L. Ibarra, Olle Holmberg, Isaac Virshup, Mohammad Lotfollahi, Sabrina Richter, and Fabian J. Theis. Squidpy: a scalable framework for spatial omics analysis. Nature Methods, 19(2):171–178, Feb 2022. URL: https://doi.org/10.1038/s41592-021-01358-2, doi:10.1038/s41592-021-01358-2.

[atscv22]

scverse. Scverse. https://scverse.org, 2022. Accessed: 2022-04-21.

[atss22]

scverse scanpy. Scanpy api. https://scanpy.readthedocs.io/en/stable/api.html#, 2022. Accessed: 2022-04-21.

[atSSF+20]

Gregor Sturm, Tamas Szabo, Georgios Fotakis, Marlene Haider, Dietmar Rieder, Zlatko Trajanoski, and Francesca Finotello. Scirpy: a Scanpy extension for analyzing single-cell T-cell receptor-sequencing data. Bioinformatics, 36(18):4817–4818, 07 2020. URL: https://doi.org/10.1093/bioinformatics/btaa611, doi:10.1093/bioinformatics/btaa611.

[atVRT+21]

Isaac Virshup, Sergei Rybakov, Fabian J. Theis, Philipp Angerer, and F. Alexander Wolf. Anndata: annotated data. bioRxiv, 2021. URL: https://www.biorxiv.org/content/early/2021/12/19/2021.12.16.473007, doi:10.1101/2021.12.16.473007.

[atWAT18]

F. Alexander Wolf, Philipp Angerer, and Fabian J. Theis. Scanpy: large-scale single-cell gene expression data analysis. Genome Biology, 19(1):15, Feb 2018. URL: https://doi.org/10.1186/s13059-017-1382-0, doi:10.1186/s13059-017-1382-0.

4.6. Contributors

We gratefully acknowledge the contributions of:

4.6.1. Authors

  • Lukas Heumos

  • Luis Heinzmeier

4.6.2. Reviewers

  • Isaac Virshup