# Normalization

```{dropdown} <i class="fas fa-brain"></i>&nbsp;&nbsp;&nbsp;Key takeaways

:::{card}
:link: surface-protein-normalization-key-takeaway-1
:link-type: ref
ADT data in CITE-seq requires normalization methods like CLR or DSB to address noise and biases, with DSB removing ambient and technical noise using background controls.
:::

```

``````{dropdown} <i class="fa-solid fa-gear"></i>&nbsp;&nbsp;&nbsp;Environment setup
`````{tab-set}
   
````{tab-item} Steps
```{include} ../_static/default_text_env_setup.md
```
````

````{tab-item} yml
```{literalinclude} surface_protein.yml
:language: yaml
```
````

`````
``````

(surface-protein-normalization-key-takeaway-1)=
## Motivation

In contrast to UMI counts from scRNA-seq, which follow a negative binomial distribution, ADT data is less sparse and shows a negative peak from non-specific antibody binding and a positive peak reflecting enrichment of specific cell surface proteins{cite}`Zheng2022`.
The capture efficiency varies from cell to cell due to differences in biophysical properties. Since CITE-seq experiments enrich for a priori selected features, compositional biases are more severe than in scRNA-seq.
As with scRNA-seq data, many approaches to normalization exist.
We cover the two most widely used methods, which require different input data and starting points.

ADT data can be normalized with the Centered Log-Ratio (CLR) transformation {cite}`Stoeckius2017`. A more recent low-level normalization method tailored to the challenges of this modality is DSB (denoised and scaled by background). DSB normalization removes two kinds of noise. First, it uses empty droplets to estimate and remove the ambient background noise. Second, it uses the background population mean and isotype controls (antibodies that bind non-specifically to the cells) to define and remove cell-to-cell technical noise{cite}`Mulè2022`.

## Environment setup

In [1]:
import warnings

import muon as mu
import pandas as pd
import pooch
import scanpy as sc

warnings.filterwarnings("ignore")

sc.settings.verbosity = 0
sc.settings.set_figure_params(
    dpi=80,
    facecolor="white",
    frameon=False,
)

## Loading the data

In [2]:
cite_quality_control = pooch.retrieve(
    url="https://figshare.com/ndownloader/files/41452449",
    fname="cite_quality_control.h5mu",
    path=".",
    known_hash=None,
    progressbar=True,
)

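We load the MuData object saved at the end of the quality control chapter. The DSB step below additionally reads the unfiltered object from `cite_raw.h5mu`, which is assumed to be present in the working directory.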

In [3]:
mdata = mu.read("cite_quality_control.h5mu")
mdata

In [4]:
mdata_raw = mu.read("cite_raw.h5mu")
mdata_raw

## DSB normalization

We are ready to normalize the data. Here, the unfiltered (raw) data serves as the background distribution. We also have isotype controls to define and remove cell-to-cell technical variation.

Isotype controls are antibodies that bind non-specifically to the cells in this study, meaning you would not expect a significant difference in their abundance between cells. Thus, we can use the values of the isotype controls to normalize technical differences.

We call the normalization function `mu.prot.pp.dsb` with the filtered and raw MuData objects as well as the names of the isotype controls.
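
To make the ambient-correction idea concrete, here is a minimal NumPy sketch of the first DSB step only: log-transform the counts with a pseudocount and rescale each protein by the mean and standard deviation observed in empty droplets. The function name `dsb_ambient_step`, the simulated data, and the use of the population standard deviation are assumptions for illustration. The sketch does not reproduce the second, isotype-based denoising step and is not the implementation used by `mu.prot.pp.dsb`.

```python
import numpy as np


def dsb_ambient_step(cells, empty, pseudocount=10.0):
    """Rescale log counts of cells by the per-protein background in empty droplets.

    Hypothetical helper: rows are droplets, columns are proteins.
    """
    log_cells = np.log(cells + pseudocount)
    log_empty = np.log(empty + pseudocount)
    # Per-protein mean and standard deviation of the ambient background
    background_mean = log_empty.mean(axis=0)
    background_sd = log_empty.std(axis=0)
    return (log_cells - background_mean) / background_sd


# Simulated example: three proteins, only the second is truly expressed on cells
rng = np.random.default_rng(0)
empty = rng.poisson(lam=[2, 5, 3], size=(500, 3)).astype(float)
cells = rng.poisson(lam=[2, 200, 3], size=(50, 3)).astype(float)

normalized = dsb_ambient_step(cells, empty)
print(normalized.mean(axis=0))
```

After this step, a value can be read as the number of background standard deviations above the ambient level, so proteins that are absent on a cell end up near zero.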

In [5]:
isotype_controls = ["Mouse-IgG1", "Mouse-IgG2a", "Mouse-IgG2b", "Rat-IgG2b"]
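
Because the isotype controls are passed by name, a typo or a different naming scheme in the antibody panel is easy to miss. The following is a small sketch of a sanity check. The helper `missing_controls` and the example panel are hypothetical and not part of muon.

```python
def missing_controls(var_names, controls):
    """Return the control names that do not appear among the protein names."""
    present = set(var_names)
    return [name for name in controls if name not in present]


# Hypothetical panel that lacks one of the requested controls
panel = ["CD3", "CD4", "Mouse-IgG1", "Rat-IgG2b"]
controls = ["Mouse-IgG1", "Mouse-IgG2a", "Rat-IgG2b"]

missing = missing_controls(panel, controls)
print(missing)
```

In this chapter, the same check could be run with `mdata["prot"].var_names` and `isotype_controls`; an empty list means every control was found.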

In [6]:
# Keep an independent copy of the raw counts before X is overwritten by DSB
mdata["prot"].layers["counts"] = mdata["prot"].X.copy()

In [7]:
# Reset X to the raw counts so that the DSB cell below can be re-run safely
mdata["prot"].X = mdata["prot"].layers["counts"].copy()

In [8]:
mu.prot.pp.dsb(mdata, mdata_raw, isotype_controls=isotype_controls)

Let's have a look at counts before denoising and normalization.

In [9]:
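# Densify the sparse counts with .toarray() (the .A shorthand is deprecated in recent SciPy)
pd.Series(
    mdata["prot"].layers["counts"][:100, :100].toarray().flatten()
).value_counts()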

1.0      1090
0.0      1045
2.0       918
3.0       691
4.0       581
         ... 
350.0       1
706.0       1
296.0       1
970.0       1
763.0       1
Name: count, Length: 524, dtype: int64

After denoising and normalization, the values are no longer integer counts: they are continuous and include negative values.

In [10]:
pd.Series(mdata["prot"].X[:100, :100].flatten()).value_counts()

 0.311677    2
-1.002554    1
 2.573147    1
 1.804169    1
-0.403206    1
            ..
 0.142890    1
 0.268634    1
-0.078150    1
 0.258447    1
-0.271008    1
Name: count, Length: 9999, dtype: int64
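
Since nearly every normalized value is unique, `value_counts` says little about the new range. A compact summary can be more informative. The sketch below uses simulated stand-ins for the count and normalized matrices, and the helper `summarize` is hypothetical.

```python
import numpy as np
import pandas as pd


def summarize(values, label):
    """Collect a few range statistics of a matrix into a named Series."""
    flat = np.asarray(values, dtype=float).ravel()
    return pd.Series(
        {
            "min": flat.min(),
            "median": np.median(flat),
            "max": flat.max(),
            "frac_negative": (flat < 0).mean(),
        },
        name=label,
    )


# Simulated stand-ins, not the values from this chapter
rng = np.random.default_rng(0)
raw_like = rng.poisson(5, size=(100, 100))
normalized_like = rng.normal(0.5, 1.0, size=(100, 100))

summary = pd.concat(
    [summarize(raw_like, "counts"), summarize(normalized_like, "dsb")],
    axis=1,
)
print(summary)
```

With the real data, the two inputs would be the densified `counts` layer and `mdata["prot"].X`.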

## Centered Log-Ratio normalization

If you don't have the unfiltered data available, you can also normalize the ADT data with `mu.prot.pp.clr`, which implements **C**entered **L**og-**R**atio normalization. This type of normalization does no denoising. Instead, it assumes that the geometric mean is a good reference and expresses every value relative to it by division{cite}`Quinn_Erb_Richardson_Crowley_2018`. Concretely, we take the natural log ratio of each protein count in each cell relative to the geometric mean, computed either across the proteins of that cell or across all cells for that protein, depending on the implementation. CLR was originally applied across proteins and later across cells, which makes the normalization less dependent on the composition of the antibody panel{cite}`Mulè2022`.
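
The difference between the two directions can be illustrated with a minimal NumPy sketch. It uses a `log1p`-based variant of CLR so that zero counts stay finite. The function name `clr_sketch` and the toy matrix are assumptions for illustration, and the sketch is not claimed to match `mu.prot.pp.clr` exactly.

```python
import numpy as np


def clr_sketch(counts, axis=0):
    """log1p-based CLR for a cells x proteins matrix.

    axis=0: reference is computed across cells, separately for each protein.
    axis=1: reference is computed across proteins, separately for each cell.
    """
    counts = np.asarray(counts, dtype=float)
    # Log of the (log1p-based) geometric mean along the chosen axis
    log_geo_mean = np.log1p(counts).mean(axis=axis, keepdims=True)
    return np.log1p(counts / np.exp(log_geo_mean))


# Toy matrix: 3 cells (rows) x 3 proteins (columns)
counts = np.array(
    [
        [0, 10, 100],
        [5, 20, 400],
        [1, 40, 50],
    ]
)

across_cells = clr_sketch(counts, axis=0)
across_proteins = clr_sketch(counts, axis=1)
print(across_cells.round(2))
print(across_proteins.round(2))
```

Across cells, each protein is compared with its own typical level in the dataset, so adding or removing antibodies from the panel does not change the result for the remaining proteins. Across proteins, every value depends on which other antibodies were measured in the same cell.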

In [11]:
mdata

In [12]:
mdata.write("cite_normalization.h5mu")

## References

```{bibliography}
:filter: docname in docnames
```

## Contributors

We gratefully acknowledge the contributions of:

### Authors

* Daniel Strobl
* Ciro Ramírez-Suástegui

### Reviewers

* Lukas Heumos
* Anna Schaar