4. Fundamental data structures and frameworks

4.1. Single-cell analysis frameworks and consortia

After obtaining the count matrices, as described earlier, the exploratory data analysis phase begins. In the early days, researchers analyzed their data with custom scripts, but dedicated frameworks now exist for precisely this purpose. The three most popular options are the R-based Bioconductor [Huber et al., 2015] and Seurat [Hao et al., 2021] ecosystems and the Python-based scverse [scverse, 2022] ecosystem. These differ not only in programming language but also in their underlying data structures and the specialized analysis tools available.

Bioconductor is an open-source project for rigorous and reproducible biological data analysis, including single-cell analysis. Its greatest strengths are a homogeneous developer and user experience and extensive, user-friendly documentation. Seurat is a well-regarded R package for single-cell analysis that covers all analysis steps, including multimodal and spatial data. It is known for its well-written vignettes and large user base. Both R options can struggle with very large datasets (500k+ cells), which motivated the Python community to develop the scverse ecosystem. Scverse is an organization dedicated to foundational life science tools, with an initial focus on single-cell analysis. Key advantages include scalability, extensibility, and strong interoperability with Python’s data and machine learning ecosystem.

All three ecosystems take part in efforts to make their frameworks interoperable. These efforts are discussed in the “Interoperability” chapter. This book focuses on the best tool for each question and will, therefore, use a mix of the ecosystems mentioned above. However, the basis of all analyses will be the scverse ecosystem for two reasons:

While we will regularly switch ecosystems and even programming languages throughout this book, consistent use of data structures and tooling helps readers focus on the concepts rather than implementation details.
A great book on exclusively the Bioconductor ecosystem already exists. We encourage users who only want to learn about single-cell analysis with Bioconductor to read it.

The following sections introduce the scverse ecosystem in more detail and explain its key concepts, with a focus on the most important data structures. This chapter introduces the fundamental data structure AnnData and the scanpy framework (see Fig. 4.1). The next chapter explores more advanced libraries. This introduction cannot cover all aspects of the data structures and frameworks, so we refer to the respective frameworks’ tutorials and documentation where required.

Fig. 4.1 Scverse ecosystem overview highlighting the libraries of this chapter. The date of publication in a scientific journal is shown in brackets. The library logos were obtained from the corresponding GitHub pages [Bredikhin et al., 2022, Marconato et al., 2025, Palla et al., 2022, Virshup et al., 2021, Wolf et al., 2018].

4.2. Storing unimodal data with AnnData

As previously discussed, genomics data is typically summarized into a feature matrix after alignment and gene annotation. This matrix will be of the shape number_observations x number_variables. In scRNA-seq, observations are cellular barcodes, and the variables are annotated genes. Throughout the analysis, the observations and variables of this matrix are annotated with computationally derived measurements (e.g., quality control metrics or latent space embeddings) and prior knowledge (e.g., source donor or alternative gene identifier). In the scverse ecosystem, AnnData [Virshup et al., 2021] is used to associate the data matrix with these annotations. To allow for fast and memory-efficient transformations, AnnData also supports sparse matrices and partial reading.
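To illustrate why sparse storage matters, the following minimal sketch compares the memory footprint of a dense array with that of its compressed sparse row (CSR) counterpart. The matrix size, the seed, and the low Poisson rate of 0.1 are assumptions chosen here only to mimic a matrix that consists mostly of zeros. They are not taken from a real dataset.

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Assumption: a low Poisson rate yields a matrix that is mostly zeros.
dense = rng.poisson(0.1, size=(1000, 2000)).astype(np.float32)
sparse = csr_matrix(dense)

# A CSR matrix stores only the non-zero values plus two index arrays.
dense_bytes = dense.nbytes
sparse_bytes = sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes

print(f"fraction of non-zero entries: {sparse.nnz / dense.size:.3f}")
print(f"dense:  {dense_bytes / 1e6:.2f} MB")
print(f"sparse: {sparse_bytes / 1e6:.2f} MB")
```

The saving depends on how sparse the data is. For a matrix in which most entries are non-zero, the index arrays can make the sparse representation larger than the dense one.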

While AnnData is broadly similar to data structures from the R ecosystems (e.g., Bioconductor’s SummarizedExperiment or Seurat’s object), R packages use a transposed feature matrix.

At its core, an AnnData object stores a sparse or dense matrix (the count matrix in the case of scRNA-Seq) in X. This matrix has the dimensions obs_names x var_names, where the obs (=observations) correspond to the cells’ barcodes and the var (=variables) correspond to the gene identifiers. This matrix X is surrounded by Pandas DataFrames obs and var, which save annotations of cells and genes, respectively. Further, AnnData saves whole matrices of calculations for the observations (obsm) or variables (varm) with the corresponding dimensions. Graph-like structures that associate cells with cells or genes with genes are usually saved in obsp and varp. Any data that does not fit one of the other slots is saved as unstructured data in uns. Additional versions of X with the same shape can be stored in layers. A typical use case is to keep the raw, unnormalized count data in a counts layer and the normalized data in X, the unnamed default layer. AnnData is primarily designed for unimodal (for example, just scRNA-Seq) data. However, extensions of AnnData, such as MuData, which is covered in the next chapter, allow for the efficient storage and access of multimodal data.

Fig. 4.2 AnnData overview. Image obtained from [Virshup et al., 2021].

4.2.1. Installation

AnnData is available on PyPI and Conda. It can be installed using either of the following commands.

pip install anndata
conda install -c conda-forge anndata

4.2.2. Initializing an AnnData object

This section is inspired by AnnData’s “getting started” tutorial. Let us create a simple AnnData object with sparse count information, which may, for example, represent gene expression counts. First, we import the required packages.

import anndata as ad
import lamindb as ln
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

ln.track()

As a next step, we initialize an AnnData object with random Poisson-distributed data. By convention, the primary AnnData object of an analysis is named adata.

counts = csr_matrix(
    np.random.default_rng().poisson(1, size=(100, 2000)), dtype=np.float32
)
adata = ad.AnnData(counts)
adata

AnnData object with n_obs × n_vars = 100 × 2000

The obtained AnnData object has 100 observations and 2000 variables. This would correspond to 100 cells with 2000 genes. The initial data we passed are accessible as a sparse matrix using adata.X.

adata.X

<Compressed Sparse Row sparse matrix of dtype 'float32'
	with 126197 stored elements and shape (100, 2000)>

Now, we provide the index to both the obs and var axes using .obs_names and .var_names, respectively.

adata.obs_names = [f"Cell_{i:d}" for i in range(adata.n_obs)]
adata.var_names = [f"Gene_{i:d}" for i in range(adata.n_vars)]
print(adata.obs_names[:10])

Index(['Cell_0', 'Cell_1', 'Cell_2', 'Cell_3', 'Cell_4', 'Cell_5', 'Cell_6',
       'Cell_7', 'Cell_8', 'Cell_9'],
      dtype='object')

4.2.3. Adding aligned metadata

4.2.3.1. Observational or variable level

The core of our AnnData object is now in place. As a next step, we add metadata at both the observational and variable levels. Remember, we store such annotations in the .obs and .var slots of the AnnData object for cell and gene annotations, respectively.

ct = np.random.default_rng().choice(["B", "T", "Monocyte"], size=(adata.n_obs,))
adata.obs["cell_type"] = pd.Categorical(ct)  # Categoricals are preferred for efficiency
adata.obs

	cell_type
Cell_0	B
Cell_1	T
Cell_2	Monocyte
Cell_3	B
Cell_4	Monocyte
...	...
Cell_95	T
Cell_96	B
Cell_97	B
Cell_98	T
Cell_99	B

100 rows × 1 columns

If we examine the representation of the AnnData object again now, we will notice that it was updated with the cell_type information in obs as well.

adata

AnnData object with n_obs × n_vars = 100 × 2000
    obs: 'cell_type'

4.2.3.2. Subsetting using metadata

We can also subset the AnnData object with the randomly generated cell types. Slicing and masking of an AnnData object behave similarly to data access in Pandas DataFrames or R matrices. More details on this can be found below.

bdata = adata[adata.obs.cell_type == "B"]
bdata

View of AnnData object with n_obs × n_vars = 40 × 2000
    obs: 'cell_type'

4.2.4. Observation/variable-level matrices

We might also have metadata at either level with many dimensions, such as a UMAP embedding of the data. AnnData has the .obsm/.varm attributes for this type of metadata. We use keys to identify the different matrices we insert. The restriction is that the first dimension of every .obsm matrix must equal the number of observations (.n_obs), and the first dimension of every .varm matrix must equal the number of variables (.n_vars). Beyond that, each matrix can have its own number of columns.

Let us start with a randomly generated matrix that we can interpret as a UMAP embedding of the data we would like to store, as well as some random gene-level metadata.

adata.obsm["X_umap"] = np.random.default_rng().normal(0, 1, size=(adata.n_obs, 2))
adata.varm["gene_stuff"] = np.random.default_rng().normal(0, 1, size=(adata.n_vars, 5))
adata.obsm

AxisArrays with keys: X_umap

Again, the AnnData representation is updated.

adata

AnnData object with n_obs × n_vars = 100 × 2000
    obs: 'cell_type'
    obsm: 'X_umap'
    varm: 'gene_stuff'

A few more notes about .obsm/.varm:

The “array-like” metadata can originate from a Pandas DataFrame, scipy sparse matrix, or numpy dense array.
When using scanpy, their values (columns) are not easily plotted, whereas items from .obs are easily plotted on, e.g., UMAP plots.

4.2.5. Unstructured metadata

As mentioned above, AnnData has .uns, which allows for any unstructured metadata. This can be anything, like a list or a dictionary, with some general information that was useful in the analysis of our data. Try to use this slot only for data that cannot be efficiently stored in the other slots.

adata.uns["random"] = [1, 2, 3]
adata.uns

OrderedDict([('random', [1, 2, 3])])

4.2.6. Layers

Finally, we may have different forms of our original core data, perhaps one that is normalized and one that is not. These can be stored in different layers in AnnData. For example, let us log transform the original data and store it in a layer.

adata.layers["log_transformed"] = np.log1p(adata.X)
adata

AnnData object with n_obs × n_vars = 100 × 2000
    obs: 'cell_type'
    uns: 'random'
    obsm: 'X_umap'
    varm: 'gene_stuff'
    layers: 'log_transformed'

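Our original matrix X was not modified and is still accessible. We can verify this by comparing the original X to the new layer (.nnz returns the number of non-zero elements in the resulting boolean matrix). The comparison evaluates to False because X and the log-transformed layer differ in at least one entry, which confirms that X still holds the original counts.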

(adata.X != adata.layers["log_transformed"]).nnz == 0

False

4.2.7. Conversion to DataFrames

It is possible to obtain a Pandas DataFrame from one of the layers.

with pd.option_context("display.max_columns", 10):
    display(adata.to_df(layer="log_transformed"))

	Gene_0	Gene_1	Gene_2	Gene_3	Gene_4	...	Gene_1995	Gene_1996	Gene_1997	Gene_1998	Gene_1999
Cell_0	0.000000	0.693147	0.693147	0.693147	0.693147	...	0.693147	0.000000	1.098612	0.000000	0.693147
Cell_1	0.000000	0.693147	0.000000	0.693147	0.000000	...	0.000000	0.000000	0.693147	0.000000	1.098612
Cell_2	0.693147	1.098612	1.098612	1.386294	0.000000	...	0.000000	0.693147	0.693147	0.000000	0.000000
Cell_3	0.000000	1.098612	0.693147	0.693147	0.000000	...	0.693147	0.693147	1.098612	0.000000	1.098612
Cell_4	0.000000	0.693147	1.386294	0.693147	0.693147	...	1.386294	0.693147	1.098612	0.693147	0.693147
...	...	...	...	...	...	...	...	...	...	...	...
Cell_95	0.000000	0.693147	1.098612	0.693147	1.098612	...	0.000000	0.693147	0.693147	1.098612	1.098612
Cell_96	1.098612	0.000000	0.693147	0.000000	1.098612	...	0.693147	0.000000	1.098612	1.386294	0.693147
Cell_97	0.693147	0.693147	0.000000	1.386294	0.693147	...	0.693147	1.098612	0.693147	1.386294	1.098612
Cell_98	0.000000	0.693147	0.000000	0.693147	0.693147	...	0.693147	0.000000	0.693147	0.693147	1.609438
Cell_99	1.098612	1.386294	0.693147	1.098612	0.000000	...	0.693147	0.000000	0.000000	0.693147	0.000000

100 rows × 2000 columns

4.2.8. Reading and writing of AnnData objects

AnnData objects can be saved on disk to hierarchical array stores like HDF5 or Zarr, so that the structure on disk mirrors the structure in memory. AnnData comes with its own persistent HDF5-based file format: h5ad. If string columns with a few categories are not yet categorical, AnnData will automatically convert them to categoricals. We will now save our AnnData object in h5ad format.

adata.write("my_results.h5ad", compression="gzip")

… and read it back in.

adata_new = ad.read_h5ad("my_results.h5ad")
adata_new

AnnData object with n_obs × n_vars = 100 × 2000
    obs: 'cell_type'
    uns: 'random'
    obsm: 'X_umap'
    varm: 'gene_stuff'
    layers: 'log_transformed'

4.2.9. Efficient data access#

4.2.9.1. Views and copies#

For the fun of it, let us look at another metadata use case. Imagine that the observations come from instruments characterizing 10 readouts in a multi-year study with samples taken from different subjects at different sites. We would typically get that information in some format and then store it in a DataFrame:

obs_meta = pd.DataFrame(
    {
        "time_yr": np.random.default_rng().choice([0, 2, 4, 8], adata.n_obs),
        "subject_id": np.random.default_rng().choice(
            ["subject 1", "subject 2", "subject 4", "subject 8"], adata.n_obs
        ),
        "instrument_type": np.random.default_rng().choice(
            ["type a", "type b"], adata.n_obs
        ),
        "site": np.random.default_rng().choice(["site x", "site y"], adata.n_obs),
    },
    index=adata.obs.index,  # these are the same IDs of observations as above!
)

This is how we join the readout data with the metadata. Of course, the first argument of the following call for X could also just be a DataFrame. This will result in a single data container that tracks everything.

adata = ad.AnnData(adata.X, obs=obs_meta, var=adata.var)
adata

AnnData object with n_obs × n_vars = 100 × 2000
    obs: 'time_yr', 'subject_id', 'instrument_type', 'site'

Subsetting the joint data matrix can be important to focus on subsets of variables or observations, or to define train-test splits for a machine learning model.

Similar to numpy arrays, AnnData objects can either hold actual data or reference another AnnData object. In the latter case, they are referred to as “view”. Subsetting AnnData objects always returns views, which has two advantages:

No new memory is allocated.
It is possible to modify the underlying AnnData object.

You can get an actual AnnData object from a view by calling .copy() on the view. Usually, this is not necessary, as any modification of elements of a view (for example, assigning to an attribute of the view with []) internally calls .copy() and turns the view into an AnnData object that holds actual data. See the example below.

adata

AnnData object with n_obs × n_vars = 100 × 2000
    obs: 'time_yr', 'subject_id', 'instrument_type', 'site'

Indexing into AnnData will assume that integer arguments to [] behave like .iloc in pandas, whereas string arguments behave like .loc. AnnData always assumes string indices.

adata_view = adata[:5, ["Gene_1", "Gene_3"]]
adata_view

View of AnnData object with n_obs × n_vars = 5 × 2
    obs: 'time_yr', 'subject_id', 'instrument_type', 'site'

This is a view! This can be verified by examining the AnnData object again.

adata

AnnData object with n_obs × n_vars = 100 × 2000
    obs: 'time_yr', 'subject_id', 'instrument_type', 'site'

The dimensions of the AnnData object have not changed. It still contains the same data. If we want an AnnData object that holds its own data in memory, we must call .copy() on the view.

adata_subset = adata[:5, ["Gene_1", "Gene_3"]].copy()
adata_subset

AnnData object with n_obs × n_vars = 5 × 2
    obs: 'time_yr', 'subject_id', 'instrument_type', 'site'

Through a view, we can also set the first three elements of a column in the underlying AnnData object.

print(adata[:3, "Gene_1"].X.toarray().tolist())
adata[:3, "Gene_1"].X = [0, 0, 0]
print(adata[:3, "Gene_1"].X.toarray().tolist())

[[1.0], [1.0], [2.0]]
[[0.0], [0.0], [0.0]]

If you try to modify parts of a view of an AnnData object, such as a column of its obs, the content will be auto-copied and a data-storing object will be generated.

adata_subset = adata[:3, ["Gene_1", "Gene_2"]]
adata_subset

View of AnnData object with n_obs × n_vars = 3 × 2
    obs: 'time_yr', 'subject_id', 'instrument_type', 'site'

adata_subset.obs["foo"] = range(3)

Now adata_subset stores the actual data and is no longer just a reference to adata.

adata_subset

AnnData object with n_obs × n_vars = 3 × 2
    obs: 'time_yr', 'subject_id', 'instrument_type', 'site', 'foo'

You can also use pandas expressions to slice with sequences or boolean indices.

adata[adata.obs.time_yr.isin([2, 4])].obs.head()

	time_yr	subject_id	instrument_type	site
Cell_2	4	subject 1	type a	site y
Cell_3	2	subject 2	type a	site y
Cell_5	4	subject 4	type b	site x
Cell_6	2	subject 1	type b	site y
Cell_7	4	subject 4	type a	site x

4.2.9.2. Partial reading of large data#

If a single h5ad file is very large, you can partially read it into memory by using backed mode.

adata = ad.read_h5ad("my_results.h5ad", backed="r")

adata.isbacked

True

If you do this, you will need to remember that the AnnData object has an open connection to the file used for reading.

adata.filename

PosixPath('my_results.h5ad')

As we are using it in read-only mode, we cannot damage anything. To proceed with this tutorial, we still need to explicitly close it.

adata.file.close()

4.3. Unimodal data analysis with scanpy#

Now that we understand the fundamental data structure of unimodal single-cell analysis, the question remains: How can we actually analyze the stored data? In the scverse ecosystem, several tools exist for analyzing specific omics data. For example, scanpy [Wolf et al., 2018] provides tooling for general RNA-Seq-focused analysis, squidpy [Palla et al., 2022] focuses on spatial transcriptomics, and scirpy [Sturm et al., 2020] provides tooling for the analysis of T-cell receptor (TCR) and B-cell receptor (BCR) data. Even though many scverse extensions for various data modalities exist, they usually rely on at least some of scanpy’s preprocessing and visualization capabilities.

More specifically, scanpy is a Python package that builds on top of AnnData to facilitate the analysis of single-cell gene expression data. Several methods for preprocessing, embedding, visualization, clustering, differential gene expression testing, pseudotime and trajectory inference, and simulation of gene regulatory networks are accessible through scanpy. The efficient implementation based on the Python data science and machine learning libraries allows scanpy to scale to millions of cells. Generally, best-practice single-cell data analysis is an interactive process. Many of the decisions and analysis steps depend on the results of previous steps and the potential input of experimental partners. Pipelines such as scflow [Khozoie et al., 2021] entirely automate some downstream analysis steps. These pipelines have to make assumptions and simplifications, which may not result in the most robust analysis. Scanpy is therefore designed for interactive analyses with, for example, Jupyter Notebooks [Jupyter, 2022].

scanpy Overview — Fig. 4.3 Scanpy overview. Image obtained from [Wolf *et al.*, 2018].#

4.3.1. Installation#

Scanpy is available on PyPI and Conda. It can be installed using either of the following commands.

pip install scanpy
conda install -c conda-forge scanpy

4.3.2. Scanpy API design#

The scanpy framework is designed so that functions belonging to the same analysis step are grouped into corresponding modules. For example, all preprocessing functions are available in the scanpy.preprocessing module, all transformations of a data matrix that are not preprocessing are available in scanpy.tools, and all visualizations are available in scanpy.plotting. After importing scanpy with import scanpy as sc, these modules are commonly accessed through the corresponding abbreviations sc.pp for preprocessing, sc.tl for tools, and sc.pl for plotting. Functions that read or write data are accessed directly from the top-level namespace, for example sc.read_h5ad. Further, a module for various datasets is available as scanpy.datasets. All functions with corresponding parameters and potential example plots are documented in the scanpy API documentation [scverse scanpy, 2022].

Note that this tutorial only covers a tiny subset of scanpy’s features and options. Readers are strongly encouraged to examine scanpy’s documentation for more details.

scanpy API — Fig. 4.4 Scanpy API overview. The API is divided into datasets, preprocessing (pp), tools (tl) and corresponding plotting (pl) functions.#

4.3.3. Scanpy example#

In the following cells, we briefly demonstrate the workflow of an analysis with scanpy. We explicitly do not conduct a full analysis because the specific analysis steps are covered in the corresponding chapters.

As a first step we import scanpy and define defaults for our following quick scanpy demo. We use scanpy’s setting object to set the Matplotlib plotting defaults for all of scanpy’s plots and finally print scanpy’s header. This header contains the versions of all relevant Python packages in the current environment including scanpy and AnnData. This output is especially useful when reporting bugs to the scverse team and for reproducibility reasons.

import scanpy as sc

sc.settings.set_figure_params(dpi=80, facecolor="white")
sc.logging.print_header()

The dataset of choice is a dataset of 2700 peripheral blood mononuclear cells of a healthy donor which were sequenced on the Illumina NextSeq 500. We can load the dataset from lamindb, although it is also available via scanpy.datasets.pbmc3k().

adata = ln.Artifact.get(
    key="introduction/fundamental_data_structures_and_frameworks.h5ad", is_latest=True
).load()
adata

AnnData object with n_obs × n_vars = 2700 × 32738
    var: 'gene_ids'

The returned AnnData object has 2700 cells with 32738 genes. The var slot further contains the gene IDs.

adata.var

	gene_ids
index
MIR1302-10	ENSG00000243485
FAM138A	ENSG00000237613
OR4F5	ENSG00000186092
RP11-34P13.7	ENSG00000238009
RP11-34P13.8	ENSG00000239945
...	...
AC145205.1	ENSG00000215635
BAGE5	ENSG00000268590
CU459201.1	ENSG00000251180
AC002321.2	ENSG00000215616
AC002321.1	ENSG00000215611

32738 rows × 1 columns

As mentioned above, all of scanpy’s analysis functions are accessible via sc.[pp, tl, pl]. As a first step to get an overview of our data, we use scanpy to show the genes that account for the highest fraction of counts per cell, across all cells. We simply call the scanpy.pl.highest_expr_genes() function, pass the AnnData object, which is the first parameter of almost every scanpy function, and specify that we want the top 20 expressed genes to be shown.

sc.pl.highest_expr_genes(adata, n_top=20)

Figure: top 20 genes by fraction of counts per cell (output of sc.pl.highest_expr_genes).

MALAT1 is the most highly expressed gene. It is frequently detected in poly-A captured scRNA-Seq data, independent of protocol. This gene has been shown to correlate inversely with cell health: dead or dying cells in particular show higher expression of MALAT1.

As a rough quality threshold, we now use scanpy’s preprocessing module to filter out cells with fewer than 200 detected genes and genes that were found in fewer than 3 cells.

sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)

A common step in single-cell RNA-Seq analysis is dimensionality reduction with, for example, PCA to unveil the main axes of variation. This also denoises the data. Scanpy offers PCA as both a preprocessing and a tools function. These are equivalent. Here, we use the version in tools for no particular reason.

sc.tl.pca(adata, svd_solver="arpack")

The corresponding plotting function allows us to pass genes to the color argument. The corresponding values are automatically extracted from the AnnData object.

sc.pl.pca(adata, color="CST3")

Figure: PCA embedding of the cells colored by CST3 expression.

A fundamental step for any advanced embedding and downstream calculation is the computation of the neighborhood graph using the PCA representation of the data matrix. The graph is automatically used by other tools that require it, such as the calculation of a UMAP.

sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)

We now use the calculated neighborhood graph to embed the cells with a UMAP, one of many advanced dimension reduction algorithms implemented in scanpy.

sc.tl.umap(adata)

sc.pl.umap(adata, color=["CST3", "NKG7", "PPBP"])

Figure: UMAP embeddings of the cells colored by CST3, NKG7, and PPBP expression.

Scanpy’s documentation also provides tutorials, which we recommend to all readers who are new to scanpy or need a refresher. Video tutorials are available on the scverse YouTube channel.

4.4. Questions#

4.4.1. Flipcards#

What is the fundamental data structure for single-cell analysis in the scverse?

AnnData

What is the fundamental framework for single-cell analysis in the scverse?

Scanpy

In single-cell RNA-seq data, which dimensions correspond to genes and cells?

Genes are stored in `.var` (columns), and cells are stored in `.obs` (rows) of the matrix.

4.4.2. Multiple-choice questions#

4.5. References#

[atBKS22]

Danila Bredikhin, Ilia Kats, and Oliver Stegle. Muon: multimodal omics analysis framework. Genome Biology, 23(1):42, Feb 2022. URL: https://doi.org/10.1186/s13059-021-02577-8, doi:10.1186/s13059-021-02577-8.

[atHHAN+21]

Yuhan Hao, Stephanie Hao, Erica Andersen-Nissen, William M. Mauck III, Shiwei Zheng, Andrew Butler, Maddie J. Lee, Aaron J. Wilk, Charlotte Darby, Michael Zagar, Paul Hoffman, Marlon Stoeckius, Efthymia Papalexi, Eleni P. Mimitou, Jaison Jain, Avi Srivastava, Tim Stuart, Lamar B. Fleming, Bertrand Yeung, Angela J. Rogers, Juliana M. McElrath, Catherine A. Blish, Raphael Gottardo, Peter Smibert, and Rahul Satija. Integrated analysis of multimodal single-cell data. Cell, 2021. URL: https://doi.org/10.1016/j.cell.2021.04.048, doi:10.1016/j.cell.2021.04.048.

[atHCG+15]

Wolfgang Huber, Vincent J. Carey, Robert Gentleman, Simon Anders, Marc Carlson, Benilton S. Carvalho, Hector Corrada Bravo, Sean Davis, Laurent Gatto, Thomas Girke, Raphael Gottardo, Florian Hahne, Kasper D. Hansen, Rafael A. Irizarry, Michael Lawrence, Michael I. Love, James MacDonald, Valerie Obenchain, Andrzej K. Oleś, Hervé Pagès, Alejandro Reyes, Paul Shannon, Gordon K. Smyth, Dan Tenenbaum, Levi Waldron, and Martin Morgan. Orchestrating high-throughput genomic analysis with bioconductor. Nature Methods, 12(2):115–121, Feb 2015. URL: https://doi.org/10.1038/nmeth.3252, doi:10.1038/nmeth.3252.

[atJup22]

Project Jupyter. Jupyter. https://jupyter.org/, 2022. Accessed: 2022-04-21.

[atKFM+21]

Combiz Khozoie, Nurun Fancy, Mahdi M. Marjaneh, Alan E. Murphy, Paul M. Matthews, and Nathan Skene. Scflow: a scalable and reproducible analysis pipeline for single-cell RNA sequencing data. bioRxiv, 2021. URL: https://www.biorxiv.org/content/early/2021/08/19/2021.08.16.456499.1, arXiv:https://www.biorxiv.org/content/early/2021/08/19/2021.08.16.456499.1.full.pdf, doi:10.1101/2021.08.16.456499.

[atMPY+25]

Luca Marconato, Giovanni Palla, Kevin A. Yamauchi, Isaac Virshup, Elyas Heidari, Tim Treis, Wouter-Michiel Vierdag, Marcella Toth, Sonja Stockhaus, Rahul B. Shrestha, Benjamin Rombaut, Lotte Pollaris, Laurens Lehner, Harald Vöhringer, Ilia Kats, Yvan Saeys, Sinem K. Saka, Wolfgang Huber, Moritz Gerstung, Josh Moore, Fabian J. Theis, and Oliver Stegle. Spatialdata: an open and universal data framework for spatial omics. Nature Methods, 22(1):58–62, 2025. URL: https://doi.org/10.1038/s41592-024-02212-x, doi:10.1038/s41592-024-02212-x.

[atPSK+22]

Giovanni Palla, Hannah Spitzer, Michal Klein, David Fischer, Anna Christina Schaar, Louis Benedikt Kuemmerle, Sergei Rybakov, Ignacio L. Ibarra, Olle Holmberg, Isaac Virshup, Mohammad Lotfollahi, Sabrina Richter, and Fabian J. Theis. Squidpy: a scalable framework for spatial omics analysis. Nature Methods, 19(2):171–178, Feb 2022. URL: https://doi.org/10.1038/s41592-021-01358-2, doi:10.1038/s41592-021-01358-2.

[atscv22]

scverse. Scverse. https://scverse.org, 2022. Accessed: 2022-04-21.

[atss22]

scverse scanpy. Scanpy api. https://scanpy.readthedocs.io/en/stable/api.html#, 2022. Accessed: 2022-04-21.

[atSSF+20]

Gregor Sturm, Tamas Szabo, Georgios Fotakis, Marlene Haider, Dietmar Rieder, Zlatko Trajanoski, and Francesca Finotello. Scirpy: a Scanpy extension for analyzing single-cell T-cell receptor-sequencing data. Bioinformatics, 36(18):4817–4818, 07 2020. URL: https://doi.org/10.1093/bioinformatics/btaa611, arXiv:https://academic.oup.com/bioinformatics/article-pdf/36/18/4817/34560298/btaa611.pdf, doi:10.1093/bioinformatics/btaa611.

[atVRT+21]

Isaac Virshup, Sergei Rybakov, Fabian J. Theis, Philipp Angerer, and F. Alexander Wolf. Anndata: annotated data. bioRxiv, 2021. URL: https://www.biorxiv.org/content/early/2021/12/19/2021.12.16.473007, arXiv:https://www.biorxiv.org/content/early/2021/12/19/2021.12.16.473007.full.pdf, doi:10.1101/2021.12.16.473007.

[atWAT18]

F. Alexander Wolf, Philipp Angerer, and Fabian J. Theis. Scanpy: large-scale single-cell gene expression data analysis. Genome Biology, 19(1):15, Feb 2018. URL: https://doi.org/10.1186/s13059-017-1382-0, doi:10.1186/s13059-017-1382-0.

4.6. Contributors#

We gratefully acknowledge the contributions of:

4.6.1. Authors#

Lukas Heumos
Luis Heinzlmeier

4.6.2. Reviewers#

Isaac Virshup