3. Raw data processing#

In this section, we discuss some of the fundamental issues surrounding what is commonly called “preprocessing” of single-cell and single-nucleus RNA-sequencing (sc/snRNA-seq) data. Though this is common terminology, it seems a bit of a misnomer, as this process involves several steps that make important decisions about how to deal with and represent the data that can enable or preclude subsequent analyses. Here, we will primarily refer to this phase of processing as “raw data processing”, and our focus will be on the phase of data analysis that begins with lane-demultiplexed FASTQ files, and ends with a count matrix representing the estimated number of distinct molecules arising from each gene within each quantified cell, potentially stratified by the inferred splicing status of each molecule (Fig. 3.1).

Chapter Overview

Fig. 3.1 An overview of the topics discussed in this chapter. In the plot, “txome” stands for transcriptome.#

This count matrix then serves as the input for the myriad methods that have been developed for various analyses carried out with scRNA-seq data [Zappia and Theis, 2021], from methods for normalization, integration, and filtering through methods for inferring cell types, developmental trajectories, and expression dynamics. Given that it serves as the starting point for all of these analyses, a robust and accurate estimation of this matrix is a foundational and critical step to support and enable accurate and reliable subsequent analyses. Fundamental misestimation in raw data processing can contribute to invalid inferences in higher-level analyses and can preclude discoveries based on the signal present in the raw data, which has been missed, diminished, or distorted by raw data processing. As we will see in this section, despite the intuitive nature of the input and output we seek from this step in the processing pipeline, several important and difficult challenges arise during raw data processing, and improved computational methodology for dealing with these challenges is still being actively developed. In particular, we will cover the fundamental steps in raw data processing, including read alignment/mapping, cell barcode (CB) identification and correction, and the estimation of molecule counts through the resolution of unique molecular identifiers (UMIs). We will also mention the choices and challenges that surface when performing these processing steps.

A note on what precedes raw data processing

The decision of where to begin discussing raw data processing is somewhat arbitrary. We note that while, here, we consider starting with lane-demultiplexed FASTQ files as the raw input, even this already represents data that has been processed and transformed from raw measurements. Further, some of the processing that precedes the generation of the FASTQ files is relevant to challenges that may arise in subsequent processing. For example, the process of base calling and base quality estimation affects the errors that arise in the FASTQ representation of the sequenced fragments, as well as the instrument’s estimation of the confidence of each called nucleotide. Further, issues can arise upstream of FASTQ generation, such as index hopping [Farouni et al., 2020], though these issues can be largely mitigated both with in silico approaches [Farouni et al., 2020] and through enhanced protocols such as dual indexing. In this section, however, we will not consider upstream effects such as these, and instead will consider the FASTQ files, derived from, e.g., BCL files via appropriate tools, as the raw input under consideration.

3.1. Raw data quality control#

Once raw FASTQ files have been obtained, the quality of the reads themselves can be quickly diagnosed by running a quality control (QC) tool, such as FastQC, to assess read quality, sequence content, etc. If run successfully, FastQC generates a QC report for each input FASTQ file. Overall, this report summarizes the quality score, base content, and certain relevant statistics to spot potential problems originating from library preparation or sequencing.

Although nowadays, single-cell raw data processing tools internally evaluate some quality checks that are important for single-cell data, such as the N content of sequences and the fraction of mapped reads, it is still a good habit to run a quick quality check before processing the data, as it evaluates other useful QC metrics. For readers who are interested in knowing what the FastQC report for a FASTQ file from an RNA-seq sample looks like, in the following toggle content, we use the example FastQC report of good together with bad Illumina data provided by the FastQC manual webpage, along with the tutorials and descriptions from the RTSF at MSU, the HBC training program, and the QC Fail website to demonstrate the modules in the FastQC report. The detailed description of these modules can be found on the FastQC manual webpage. Although these tutorials are not explicitly made for single-cell data, many of the results are still relevant for single-cell data, with a few caveats described below. In the toggle section, all graphs, except specifically mentioned, are taken from the example reports on the FastQC manual webpage. Finally, most of the metrics reported in such quality control reports make sense only for the “biological” reads of a single-cell dataset (i.e. the read being drawn from gene transcripts). For example, for 10x Chromium v2 and v3 datasets, this would be read 2 (the files containing R2 in their filename). This is because the technical reads are comprised primarily of barcode and UMI sequences which are not drawn from the underlying transcriptome and which we do not expect to have typical biologically plausible sequence or GC content, though certain metrics like the fraction of N base calls are still relevant.

0. Summary

The summary panel on the left-hand side of the HTML report shows the module names included in the report, together with a sign used for quick evaluation of the module results. However, because FastQC uses uniform thresholds for files generated from all kinds of sequencing platforms and underlying biological material, we sometimes see warnings (orange exclamation) and failures(red crosses) for good data and passes (green ticks) for questionable data. So, all modules should be carefully evaluated before concluding about the data quality.


Fig. 3.2 The summary panel of a bad example.#

1. Basic Statistics

The basic statistics module contains the overview information and statistics of the reads in the input FASTQ file, such as the filename, total number of sequences, number of poor quality sequences, sequence length, and the overall %GC of all bases in all sequences. Good single-cell data usually have a very little number of poor quality sequences and most often have uniform sequence length. The GC content should match the overall GC content of the genome or transcriptome of the sequenced species.

Basic Statistics

Fig. 3.3 A good basic statistics report example.#

2. Per Base Sequence Quality

The per base sequence quality view contains a BoxWhisker type plot for each position in the read, which shows the range of quality scores across all bases at each position along all reads in the file. The x-axis represents the positions in the read, and the y-axis shows the quality scores. For single-cell data of good quality, all yellow boxes in the view, which represent the inter-quantile range of the quality scores of positions, should fall into the green (calls of good quality) area. So do all the whiskers, which represent the 10% and 90% of the distribution. It is not unexpected to see that the quality scores drop along the body of the reads, and some base calls of the last positions fall into the orange (calls of reasonable quality) area because of the decrease in the signal-to-noise ratio that is common in most sequencing-by-synthesis methods. Still, boxes should fall outside of the red (calls of poor quality) area. If poor quality calls are observed, one may consider performing quality trimming of their reads. A detailed explanation of the sequencing error profiles is provided by the HBC training program.

per read sequence quality

Fig. 3.4 A good (left) and a bad (right) per-read sequence quality graph.#

3. Per Tile Sequence Quality

Using an Illumina library, the per tile sequence quality plot shows the deviation from the average quality for the reads in each flowcell tile that was sequenced. The “hotter” the color, the larger the deviation. For high-quality data, we should see blue over the entire plot, meaning that the flowcell has consistent quality in all tiles. If only part of a flowcell has poor quality, some hot colors will appear in the plot. In that case, the sequencing step might have encountered transient problems, such as bubbles going through the flowcell or smudges and debris inside the flowcell lane. For further diagnoses, please refer to QC Fail and common reasons for warnings from FastQC.

per tile sequence quality

Fig. 3.5 A good (left) and a bad (right) per tile sequence quality view.#

4. Per Sequence Quality Scores

The per sequence quality score plot shows the distribution of the average quality score of each read in the file. The x-axis shows the average quality scores, and the y-axis represents the frequency of each quality score. For good data, this plot should have only one peak at the tail. If some other peaks show up, a subset of reads might have some quality issue.

per sequence quality scores

Fig. 3.6 A good (left) and a bad (right) per sequence quality score plot.#

5. Per Base Sequence Content

The per base sequence content plot reports the percent of each base position of all reads in the file for which each of the four nucleotides has been called. For single-cell data, it is not uncommon to see fluctuations at the start of the lines because the first bases of reads represent the sequence of the priming sites, which may not be perfectly random, as explained on the QC Fail website. This happens frequently in RNA-seq libraries, even though FastQC will call a warning or failure.

per base sequence content

Fig. 3.7 A good (left) and bad (right) per base sequence content plot.#

6. Per Sequence GC Content

The per sequence GC content plot shows the distribution of GC content over all of the reads in red and a theoretical (expected) distribution in blue. The central peak of the observed distribution should correspond to the overall GC content of the underlying transcriptome. Sometimes we see a wider or narrower distribution than the theoretical distribution because the GC content of the transcriptome might differ from the genome, which, in theory, should follow the distribution shown in blue. Such differences in the spread of the distributions are not uncommon, even though it may trigger a warning or failure. However, a complex distribution usually indicates a contaminated library. It is worth noting, however, that a GC content plot can be somewhat complicated to interpret in transcriptomics experiments, as the expected GC content distribution depends not only on the sequence content of the underlying transcriptome, but also on the expression of the genes in the sample, which are unknown.

Per Sequence GC Content

Fig. 3.8 A good (left) and a bad (right) per sequence GC content plot. The plot on the left is from the RTSF at MSU. The plot on the right is taken from the HBC training program.#

7. Per Base N Content

The per base N content plot shows the percent of bases at each position for which an N was called, which will occur when the sequencer has insufficient confidence to make a base call. In a high-quality library, we should not expect any noticeable non-zero N content throughout the line.

Per Base N Content

Fig. 3.9 A good (left) and a bad (right) per base N content plot.#

8. Sequence Length Distribution

The Sequence length distribution graph shows the distribution of the read lengths. In most single-cell chemistries, all reads will be of the same length. If trimming was performed before quality assessment, there may be some small variation in read lengths.

Sequence Length Distribution

Fig. 3.10 A good (left) and a bad (right) sequence length distribution plot.#

9. Sequence Duplication Levels

The sequence duplication level plot shows the distribution of the degree of duplication for read sequences (the blue line) and those after deduplication. As most single-cell platform requires multiple rounds of PCR, highly expressed genes usually express a large number of transcripts, and FastQC itself is not UMI aware, it is common that a small subset of sequences may have a large number of duplications. This may trigger a warning or failure for this module, but it does not necessarily represent a quality problem with the data. Still, the majority of sequences should have a low duplication level.

Sequence Duplication Levels

Fig. 3.11 A good (left) and a bad (right) per sequence duplication levels plot.#

10. Overrepresented Sequences

The overrepresented sequences module lists all read sequences that make up more than \(0.1\%\) of the total number of sequences. We might see some overrepresented sequences from the highly expressed genes after PCR amplification, but most sequences should not be overrepresented. Note that if the possible source of the overrepresented sequences is not “No Hit”, it might indicate that the library is contaminated by the corresponding type of source.

Overrepresented Sequences

Fig. 3.12 An overrepresented sequence table.#

11. Adapter Content

The adapter content module shows the cumulative percentage of the fraction of reads in which each of the adapter sequences has been observed at each base position. Ideally, we should not see any abundant adapter sequences in the data.

Adapter Content

Fig. 3.13 A good (left) and a bad (right) per sequence quality score plot. The plot on the right is from the QC Fail website.#

3.2. Alignment and mapping#

Mapping or alignment is a fundamental step in single-cell raw data processing. It refers to the process of determining the potential loci of origin of each sequenced fragment (e.g., the set of genomic or transcriptomic loci that are similar to the read). Depending on the sequencing protocol, the resulting raw sequence file contains the cell-level information, commonly known as cell barcodes (CB), the unique molecule identifier (UMI), and the raw cDNA sequence (read sequence) generated from the molecule. As the first step in the raw data processing of a single-cell sample (Fig. 3.1), executing mapping or alignment accurately is instrumental for all downstream analyses, since incorrect read-to-transcript/gene mapping can lead to wrong or inaccurate matrices.

While mapping read sequences to reference sequences far predates the advent of scRNA-seq, the current and quickly growing scale of scRNA-seq samples (typically hundreds of millions to billions of reads) makes this stage particularly computationally intensive. Additionally, using pre-existing RNA-seq aligners that are agnostic to any specific scRNA-seq protocol – like the length and position of cell barcodes, UMI, etc. – requires making use of separate tools for performing other steps such as demultiplexing and UMI resolutions [Smith et al., 2017]. This added overhead can be computationally cumbersome. Further, it typically carries a substantial disk space requirement for storing intermediate files, and the extra input and output operations required for processing such intermediate files further increase runtime requirements.

To this end, several dedicated tools have been built specifically for aligning or mapping scRNA-seq data, which handle these additional processing requirements automatically and/or internally. Tools such as Cell Ranger (commercial software from 10x Genomics) [Zheng et al., 2017], zUMIs [Parekh et al., 2018], alevin [Srivastava et al., 2019], RainDrop [Niebler et al., 2020], kallisto|bustools [Melsted et al., 2021], STARsolo [Kaminow et al., 2021] and alevin-fry [He et al., 2022] provide dedicated treatment for aligning scRNA-seq reads along with parsing of technical read content (CBs and UMIs), as well as methods for demultiplexing and UMI resolution. While all provide relatively simplified interfaces to the user, these tools use a variety of different approaches internally, with some generating traditional intermediate files (e.g., BAM files) and subsequently processing them, and others either working entirely in memory or by making use of reduced intermediate representations to reduce the input/output burden.

While the specific algorithms, data structures, and time and space trade-offs made by different alignment and mapping approaches can vary greatly, we can roughly categorize existing approaches along two axes:

  • The type of mapping they perform and

  • The type of reference sequence against which they map reads.

3.2.1. Types of mapping#

Here we consider three main types of mapping algorithms that are most commonly applied in the context of mapping sc/snRNA-seq data: spliced alignment, contiguous alignment, and variations of lightweight mapping.

First, let us draw a distinction here between alignment-based approaches and lightweight mapping-based approaches (Fig. 3.14). Alignment approaches apply a range of different heuristics to determine the potential loci from which reads may arise and then subsequently score, at each putative locus, the best nucleotide-level alignment between the read and reference, typically using dynamic programming-based approaches. Using dynamic programming algorithms to solve the alignment problem has a long and rich history. The exact type of dynamic programming algorithm used depends on the type of alignment being sought. Global alignment is applicable for the case where the entirety of the query and reference sequence are to be aligned. On the other hand, local alignment is applicable when, possibly, only a subsequence of the query is expected to match a subsequence of the reference. Typically, the models most applicable for short-read alignment are neither fully global nor fully local, but fall into a category of semi-global alignment where the majority of the query is expected to align to some substring of the reference (this is often called “fitting” alignment). Moreover, it is sometimes common to allow soft-clipping of the alignment so that the penalties for mismatches, insertions, or deletions appearing at the very start or end of the read are ignored or down-weighted. This is commonly done using “extension” alignment. Though these modifications change the specific rules used in the dynamic programming recurrence and traceback, they do not fundamentally change its overall complexity.

Apart from these general alignment techniques, a number of more sophisticated modifications and heuristics have been designed to make the alignment process more practical and efficient in the context of aligning genomic sequencing reads. For example, banded alignment [Chao et al., 1992] is a popular heuristic included in the alignment implementation of many popular tools, which avoids computing large parts of the dynamic programming table if the user is uninterested in alignment scores below a certain threshold. Likewise, other heuristics like X-drop [Zhang et al., 2000] and Z-drop [Li, 2018] are popular for pruning un-promising alignments early. Recent algorithmic progress, such as wavefront alignment [Marco-Sola et al., 2022, Marco-Sola et al., 2020], allows for determining optimal alignments in asymptotically (and practically) shorter time and smaller space if high-scoring (low divergence) alignments exist. Finally, considerable effort has also been devoted to optimizing data layout and computation in a way that takes advantage of instruction-level parallelism [Farrar, 2007, Rognes and Seeberg, 2000, Wozniak, 1997],and to expressing the alignment dynamic programming recurrences in a manner that is highly amenable to data parallelism and vectorization (e.g., as in the difference encoding of Suzuki and Kasahara [2018]). Most widely-used alignment tools make use of highly-optimized and vectorized alignment implementations.

In addition to the alignment score, a backtrace of the actual alignment that yields this score may be obtained. Such information is typically encoded as a CIGAR string (short for “Concise Idiosyncratic Gapped Alignment Report”), a compressed alphanumeric representation of the alignment, within the SAM or BAM file that is output. An example of a CIGAR string is 3M2D4M, which represents that the alignment has three matches or mismatches, followed by a deletion of length two from the read (bases present in the reference but not the read), followed by four more matches or mismatches. Other variants of the CIGAR string can also delineate between matches or mismatches. For example, 3=2D2=2X is a valid extended CIGAR string encoding the same alignment as just described, except that it makes clear that the three bases before the deletion constitute matches and that after the deletion, there are two matched bases followed by two mismatched bases. A detailed description of the CIGAR string format can be found in the SAMtools manual or the SAM wiki page of UMICH.

At the cost of computing such scores, alignment-based approaches know the quality of each potential mapping of a read, which they can then use to filter reads that align well to the reference from mappings that arise as the result of low complexity or “spurious” matches between the read and reference. Alignment-based approaches include both traditional “full-alignment” approaches, as implemented in tools such as STAR[Dobin et al., 2013] and STARsolo[Kaminow et al., 2021] as well as approaches like selective-alignment implemented in salmon[Srivastava et al., 2020] and alevin[Srivastava et al., 2019], which score mappings but omit the computation of the optimal alignment’s backtrace.

Alignment vs Mapping

Fig. 3.14 An abstract overview of the alignment-based method and lightweight mapping-based method.#

Alignment-based approaches can be further categorized into spliced-alignment and contiguous-alignment approaches (currently, there are no lightweight-mapping approaches that perform spliced mapping). Spliced-alignment approaches are capable of aligning a sequence read over several distinct segments of a reference, allowing potentially large gaps between the regions of the reference that align well with the corresponding parts of the read. These alignment approaches are typically applied when aligning RNA-seq reads to the genome, since the alignment procedure must be able to align reads that span across one or more splice junctions of the transcript, where a sequence that is contiguous in the read may be separated by intron and exon subsequences (potentially spanning kilobases of sequence) in the reference. Spliced alignment is a challenging problem, particularly in cases where only a small region of the read spans a splicing junction, since very little informative sequence may be available to place the segment of the read that overhangs the splice junction by only a small amount. On the other hand, contiguous alignment seeks a contiguous substring of the reference that aligns well against the read. In such alignment problems, though small insertions and deletions may be allowed, large gaps such as those observed in spliced alignments are typically not tolerated.

Finally, we can draw a distinction between alignment-based methods such as those described above and lightweight-mapping methods, which include approaches such as pseudoalignment [Bray et al., 2016], quasi-mapping [Srivastava et al., 2016], and pseudoalignment with structural constraints [He et al., 2022]. Such approaches generally achieve superior speed by avoiding nucleotide-level alignment between the read and reference sequences. Instead, they base the determination of the reported mapping loci of a read on a separate set of rules and heuristics that may look only at the set of matching k-mers or other types of exact matches (e.g., via a consensus rule) and, potentially, their orientations and relative positions with respect to each other on both the read and reference (e.g., chaining). This can lead to substantially increased speed and mapping throughput, but also precludes easily-interpretable score-based assessments of whether or not the matches between the read and reference constitute a good match capable of supporting a high-quality alignment.

3.2.2. Mapping against different reference sequences#

While different choices can be made among the mapping algorithms themselves, different choices can also be made about the reference against which the read is mapped. We consider three main categories of reference sequence against which a method might map:

  • The full (likely annotated) reference genome

  • The annotated transcriptome

  • An augmented transcriptome

It is also worth noting that, currently, not all combinations of mapping algorithms and reference sequences are possible. For example, lightweight mapping-based algorithms do not currently exist that can perform spliced mapping of reads against a reference genome. Mapping to the full genome#

The first type of reference, against which a method may map, consists of the entire genome of the target organism, usually with the annotated transcripts considered during mapping. Tools like zUMIs [Parekh et al., 2018], Cell Ranger [Zheng et al., 2017] and STARsolo [Kaminow et al., 2021] take this approach. Since many of the reads arise from spliced transcripts, such an approach also requires the use of a splice-aware alignment algorithm where the alignment for a read can be split across one or more splicing junctions. The main benefits of this approach are that reads arising from any location in the genome, not just from annotated transcripts, can be accounted for. Since these approaches require constructing a genome-wide index, there is little marginal cost for reporting not only the reads that map to known spliced transcripts but also reads that overlap or align within introns, making the alignment cost when using such approaches very similar for both single-cell and single-nucleus data. A final benefit is that even reads residing outside of the annotated transcripts, exons, and introns can be accounted for by such methods, which can enable post hoc augmentation of the set of quantified loci (e.g., as is done by Pool et al. [2022] by adding expressed UTR extensions to transcript annotations in a sample-specific and data-driven manner) and potentially increase gene detection and quantification sensitivity.

On the other hand, the versatility of the strategy of performing spliced alignment against the full genome comes with certain trade-offs. First, the most commonly-used alignment tools that adopt this strategy in the single-cell space have substantial memory requirements. Due to its speed and versatility, most of these tools are based upon the STAR [Dobin et al., 2013] aligner. Yet, for a human-scale genome, constructing and storing the index can require upwards of \(32\) GB of memory. The use of a sparse suffix array can reduce the final index size by close to a factor of \(2\), but this comes at a reduction in alignment speed, and it still requires larger memory for the initial construction. Second, the increased difficulty of spliced alignment compared to contiguous alignment and the fact that current spliced-alignment tools must explicitly compute a score for each read alignment, means that this approach comes at an increased computational cost compared to the alternatives. Finally, such an approach requires the genome of the organism under study is available. While this is not problematic for the most commonly-studied reference organisms and is rarely an issue, it can make using such tools difficult for non-model organisms where only a transcriptome assembly may be available. Mapping to the spliced transcriptome#

To reduce the computational overhead of spliced alignment to a genome, the widely-adopted alternative is to use just the sequences of the annotated transcripts themselves as a reference. Since the majority of single-cell experiments are generated from model organisms (such as mouse or human), which have well-annotated transcriptomes, such transcriptome-based quantification methods may provide similar read coverage to their genome-based counterparts. Compared to the genome, transcriptome sequences are much smaller in size, minimizing the computational resources required for running the mapping pipeline. Additionally, since the relevant splicing patterns are already represented in the transcript sequences themselves, this approach dispenses with the need to solve the difficult spliced-alignment problem. Instead, one can simply search for contiguous alignments or mappings for the read. Since only contiguous mappings need to be discovered, both alignment and lightweight mapping techniques are amenable to transcriptome references, and both approaches have been used in popular tools that adopt the spliced transcriptome as the target of reference mapping.

However, while such approaches can greatly reduce the memory and time required for alignment and mapping, they will fail to capture reads that arise from outside of the spliced transcriptome. Such an approach is therefore not adequate when processing single-nucleus data. Even in single-cell experiments, reads arising from outside of the spliced transcriptome can constitute a substantial fraction of all data, and there is growing evidence that such reads should be incorporated into subsequent analysis [Pool et al., 2022, 10x Genomics, 2021]. Further, when paired with lightweight-mapping approaches, short matches shared between the spliced transcriptome and the reference sequences that truly give rise to a read may lead to spurious read mappings, which, in turn, can lead to spurious and even biologically implausible estimates of the expression of certain genes [Brüning et al., 2022, He et al., 2022, Kaminow et al., 2021]. Mapping to an augmented transcriptome#

To deal with the fact that sequenced reads may arise from outside of spliced transcripts, it is possible to augment the spliced transcript sequences with other reference sequences that may be expected to give rise to reads (e.g., full-length unspliced transcripts or excised intronic sequences). Transcriptome references, when augmented with further sequences such as introns, allow reference indices typically smaller than those required for the full genome while simultaneously retaining the ability to search only for contiguous read alignments. This means they can potentially enable both faster and less memory-hungry alignment than when mapping against the full genome while still accounting for many of the reads that arise from outside of the spliced transcriptome. Finally, by mapping to an expanded collection of reference sequences, not only are the mapping locations of more reads recovered compared to mapping against just the spliced transcriptome, but when using lightweight mapping-based approaches, spurious mappings can be greatly reduced [He et al., 2022]. Such an expanded or augmented transcriptome is commonly used among approaches (those that do not map to the full genome) when they need to quantify single-nucleus data or prepare data for RNA-velocity analysis [Soneson et al., 2021]. Therefore, such augmented references can be constructed for all of the common methods that don’t make use of spliced alignment to the full genome [He et al., 2022, Melsted et al., 2021, Srivastava et al., 2019].

He et al. [2022] argue that such an approach is valuable even when processing standard single-cell RNA-seq data and recommend constructing an augmented splici (meaning spliced + intronic) reference for mapping and quantification. The splici reference is constructed using the spliced transcriptome sequence along with the sequences containing the merged intervals corresponding to the introns of the annotated genes. Each reference is then labeled with its annotated splicing status, and the mapping to this reference is subsequently paired with splicing status-aware methods for UMI resolution. The main benefits of this approach are that it admits the use of lightweight-mapping methods while substantially reducing the prevalence of spurious mappings. It enables the detection of reads of both spliced and unspliced origin, which can increase the sensitivity of subsequent analysis [Pool et al., 2022, 10x Genomics, 2021], and, since splicing status is tracked during quantification and reported separately in the output, it unifies the processing pipeline for single-cell, single-nucleus, and RNA-velocity preprocessing.

3.3. Cell barcode correction#

Droplet-based single-cell segregation systems, such as those provided by 10x Genomics, have become an important tool for studying the cause and consequences of cellular heterogeneity. In this segregation system, the RNA material of each captured cell is extracted within a water-based droplet encapsulation along with a barcoded bead. These beads tag the RNA content of individual cells with unique oligonucleotides, called cell barcodes (CBs), that are later sequenced along with the fragments of the cDNAs that are reversely transcribed from the RNA content. The beads contain high-diversity DNA barcodes enabling parallel barcoding of the cell’s molecular content and in silico demultiplexing of the sequencing reads into individual cellular bins.

A note on alignment orientation

Depending on the chemistry of the sample being analyzed and the processing options provided by the user, it is not necessarily the case that all sequenced fragments aligning to the reference will be considered for quantification and barcode correction. One commonly-applied criterion for filtering is alignment orientation. Specifically, certain chemistries specify protocols such that the aligned reads should only derive from (i.e. map back to) the underlying transcripts in a specific orientation. For example, in 10x Genomics 3’ Chromium chemistries, we expect the biological read to align to the underlying transcript’s forward strand, though anti-sense reads do exist [10x Genomics, 2021]. Thus, mappings of the reads in the reverse-complement orientation to the underlying sequences may be ignored or filtered out at the user’s discretion. If a chemistry follows such a so-called “stranded” protocol, this should be documented.

3.3.1. Type of errors in barcoding#

The tag, sequence, and demultiplex method for single-cell profiling generally works well. However, the number of observed cell barcodes (CBs) in a droplet-based library can significantly differ from the number of originally encapsulated cells by several fold. A few main sources of error can lead to such observation:

  • Doublet / Multiplet: A single barcode can be associated with two or more cells and lead to undercounting of cells.

  • Empty Droplet: A droplet can be captured with no encapsulated cell, where the ambient RNA is tagged with the barcode and can be sequenced; this leads to overcounting of cells.

  • Sequence error: Errors can arise during the PCR amplification or sequencing process and can contribute to both under and over-counting of the cells.

Computational tools for demultiplexing the RNA-seq reads into cell-specific bins use a wide range of diagnostic indicators to filter out artefactual or low-quality data. For example, numerous methods exist for the removal of ambient RNA contamination [Lun et al., 2019, Muskovic and Powell, 2021, Young and Behjati, 2020], doublet detection [Bais and Kostka, 2019, DePasquale et al., 2019, McGinnis et al., 2019, Wolock et al., 2019] and cell barcodes correction for sequence errors based on nucleotide sequence similarity.

Several common strategies are used for cell barcode identification and correction.

  • Correction against a known list of potential barcodes: Certain chemistries, such as 10x Chromium, draw CBs from a known pool of potential barcode sequences. Thus, the set of barcodes observed in any sample is expected to be a subset of this known list, often called a “whitelist”, “permit list”, or “pass list”. In this case, a common strategy is to assume each barcode that exactly matches some element from the known list is correct and for all other barcodes to be correct against the known list of barcodes (i.e., to find barcodes from the permit list that are some small Hamming distance or edit distance away from the observed barcodes). This approach leverages the known permit list to allow efficient correction of many barcodes that have been potentially corrupted. However, one difficulty with this approach is that a given corrupted barcode may have multiple possible corrections in the permit list (i.e., its correction may be ambiguous). In fact, if one considers a barcode that is taken from the 10x Chromium v3 permit list and mutated at a single position to a barcode not in the list, there is a \(\sim 81\%\) chance that it sits at Hamming distance \(1\) from two or more barcodes in the permit list. The probability of such collisions can be reduced by only considering correcting against barcodes from the known permit list, which, themselves, occur exactly in the given sample (or even only those that occur exactly in the given sample above some nominal frequency threshold). Also, information such as the base quality at the “corrected” position can be used to potentially break ties in the case of ambiguous corrections. Yet, as the number of assayed cells increases, insufficient sequence diversity in the set of potential cell barcodes increases the frequency of ambiguous corrections, and reads tagged with barcodes having ambiguous corrections are most commonly discarded.

  • Knee or elbow-based methods: If a set of potential barcodes is unknown — or even if it is known, but one wishes to correct directly from the observed data itself without consulting an external list — one can adopt a method based on the observation that the list of “true” or high-quality barcodes in a sample is likely those associated with the greatest number of reads. To do this, one can construct the cumulative frequency plot of the barcodes, in which barcodes are sorted in descending order of the number of distinct reads or UMIs with which they are associated. Often, this ranked cumulative frequency plot will contain a “knee” or “elbow” – an inflection point that can be used to characterize frequently occurring barcodes from infrequent (and therefore likely erroneous) barcodes. Many methods exist for attempting to identify such an inflection point [He et al., 2022, Lun et al., 2019, Smith et al., 2017] as a likely point of discrimination between properly captured cells and empty droplets. Subsequently, the set of barcodes that appear “above” the knee can be treated as a permit list against which the rest of the barcodes may be corrected, as in the first method list above. Such an approach is flexible as it can be applied in chemistries that have an external permit list and those that don’t. Further parameters of the knee-finding algorithms can be altered to yield more or less restrictive selected barcode sets. Yet, such an approach can have certain drawbacks, like a tendency to be overly conservative and sometimes failing to work robustly in samples where no clear knee is present.

  • Filtering and correction based on an expected cell count provided by the user: These approaches seek to estimate a robust list of high-quality or present barcodes in the cases where the CB frequency distribution may not have a clear knee or exhibit bimodality due to technical artifacts. In such an approach, the user provides an estimate of the expected number of assayed cells. Then, the barcodes are ordered by descending frequency, the frequency \(f\) at a robust quantile index near the expected cell count is obtained, and all cells having a frequency within a small constant fraction of \(f\) (e.g., \(\ge \frac{f}{10}\)) are considered as valid barcodes. Again, the remaining barcodes are corrected against this valid list by attempting to correct uniquely to one of these valid barcodes based on sequence similarity.

  • Filtering based on a forced number of valid cells: Perhaps the simplest approach, although potentially problematic, is for the user to directly provide the index in the sorted frequency plot above which barcodes will be considered valid. All barcodes with a frequency greater than or equal to the frequency at the selected index are considered valid and treated as constituting the permit list. The remaining set of barcodes is then corrected against this list using the same approach described in the other methods above. If there are at least as many distinct barcodes as the number of cells the user requests, then this many cells will always be selected. Of course, such an approach is only reasonable when the user has a good reason to believe that the threshold frequency should be set around the provided index.

3.3.2. Future challenges#

While cellular barcoding of high-throughput single-cell profiling has been a tremendously successful approach, some challenges still remain, especially as the scale of experiments continues to grow. For example, the design of a robust method for selecting high-quality cell barcodes from the set of all the observations is still an active area of research, with distinct challenges arising, e.g., between single-cell and single-nucleus experiments. Also, as single-cell technologies have advanced to profile increasing numbers of cells, insufficient sequence diversity in the CB sequence can result in sequence corrections leading to CB collision. Addressing this latter problem may require more intelligent barcode design methods and continuing increases in the lengths of oligonucleotides used for cell barcoding.

3.4. UMI resolution#

After cell barcode (CB) correction, reads have either been discarded or assigned to a corrected CB. Subsequently, we wish to quantify the abundance of each gene within each corrected CB.

Because of the amplification bias as discussed in Transcript quantification, reads must be deduplicated, based upon their UMI, to assess the true count of sampled molecules. Additionally, several other complicating factors present challenges when attempting to perform this estimation.

The UMI deduplication step aims to identify the set of reads and UMIs derived from each original, pre-PCR molecule in each cell captured and sequenced in the experiment. The result of this process is to allocate a molecule count to each gene in each cell, which is subsequently used in the downstream analysis as the raw expression estimate for this gene. We refer to this process of looking at the collection of observed UMIs and their associated mapped reads and attempting to infer the original number of observed molecules arising from each gene as the process of UMI resolution.

To simplify the explanation, in the following text, the reads that map to a reference, for example, a genomic locus of the gene, are called the reads of that reference, and their UMI tags are called the UMIs of that reference; the set of reads that are tagged by a UMI is called the reads of that UMI. A read can only be tagged by one UMI but can belong to multiple references if it maps to multiple references. Furthermore, as the molecule barcoding process for each cell in scRNA-seq is usually isolated and independent (apart from the issues related to accurately resolving cell barcodes raised earlier), without loss of generality, UMI resolution will be explained for a specific cell. The same procedure will usually be applied to all cells independently.

3.4.1. The need for UMI resolution#

In the ideal case, where the correct (unaltered) UMIs tag reads, the reads of each UMI uniquely map to a common reference gene, and there is a bijection between UMIs and pre-PCR molecules. Consequently, the UMI deduplication procedure is conceptually straightforward: the reads of a UMI are the PCR duplicates from a single pre-PCR molecule. The number of captured and sequenced molecules of each gene is the number of distinct UMIs observed for this gene.

However, the problems encountered in practice make the simple rules described above insufficient for identifying the gene origin of UMIs in general and necessitate the development of more sophisticated models. Here, we concern ourselves primarily with two challenges.

  • The first problem we discuss is errors in UMIs. These occur when the sequenced UMI tag of reads contains errors introduced during PCR or the sequencing process. Common UMI errors include nucleotide substitutions during PCR and read errors during sequencing. Failing to address such UMI errors can inflate the estimated number of molecules [Smith et al., 2017, Ziegenhain et al., 2022].

  • The second issue we discuss is multimapping, which arises in cases where a read or UMI belongs to multiple references, for example, multi-gene reads/UMIs. This happens if different reads of a UMI map to different genes, a read maps to multiple genes, or both. The consequence of this issue is that the gene origin of the multi-gene reads/UMIs is ambiguous, which results in uncertainty about the sampled pre-PCR molecule count of those genes. Simply discarding multi-gene reads/UMIs can lead to a loss of data or a biased estimate among genes that tend to produce multimapping reads, such as sequence-similar gene families[Srivastava et al., 2019].

A note on UMI errors

UMI errors, especially those due to nucleotide substitutions and miscallings, are prevalent in single-cell experiments. Smith et al. [2017] establish that the average number of bases different (edit distance) between the observed UMI sequences in the tested single-cell experiments is lower than randomly sampled UMI sequences, and the enrichment of low edit distances is well correlated with the degree of PCR amplification. Multimapping also exists in single-cell data and, depending upon the gene being considered, can occur at a non-trivial rate. Srivastava et al. [2019] show that discarding the multimapping reads can negatively bias the predicted molecule counts.

There exist other challenges that we do not focus upon here, such as “convergent” and “divergent” UMI collisions. We consider the case where the same UMI is used to tag two different pre-PCR molecules arising from the same gene, in the same cell, as a convergent collision. When two or more distinct UMIs arise from the same pre-PCR molecule, e.g., due to the sampling of multiple priming sites from this molecule, we consider this a divergent collision. We expect convergent UMI collisions to be rare and, therefore, their effect typically small. Further, transcript-level mapping information can sometimes be used to resolve such collisions[Srivastava et al., 2019]. Divergent UMI collisions occur primarily among introns of unspliced transcripts[10x Genomics, 2021], and approaches to addressing the issues they raise are an area of active research[Gorin and Pachter, 2021, 10x Genomics, 2021].

Given that the use of UMIs is near ubiquitous in high-throughput scRNA-seq protocols and the fact that addressing these errors improves the estimation of gene abundances, there has been much attention paid to the problem of UMI resolution in recent literature [Bose et al., 2015, He et al., 2022, Islam et al., 2013, Kaminow et al., 2021, Macosko et al., 2015, Melsted et al., 2021, Orabi et al., 2018, Parekh et al., 2018, Smith et al., 2017, Srivastava et al., 2019, Tsagiopoulou et al., 2021].

3.4.2. Graph-based UMI resolution#

As a result of the problems that arise when attempting to resolve UMIs, many methods have been developed to address the problem of UMI resolution. While there are a host of different approaches for UMI resolution, we will focus on a framework for representing problem instances, modified from a framework initially proposed by Smith et al. [2017], that relies upon the notion of a UMI graph. Each connected component of this graph represents a sub-problem wherein certain subsets of UMIs are collapsed (i.e., resolved as evidence of the same pre-PCR molecule). Many popular UMI resolution approaches can be interpreted in this framework by simply modifying precisely how the graph is refined and how the collapse or resolution procedure carried out over this graph works.

In the context of single-cell data, a UMI graph \(G(V,E)\) is a directed graph with a node set \(V\) and an edge set \(E\). Each node \(v_i \in V\) represents an equivalence class (EC) of reads, and the edge set \(E\) encodes the relationship between the ECs. The equivalence relation \(\sim_r\) defined on reads is based on their UMI and mapping information. We say reads \(r_x\) and \(r_y\) are equivalent, \(r_x \sim_r r_y\), if and only if they have identical UMI tags and map to the same set of references. UMI resolution approaches may define a “reference” as a genomic locus[Smith et al., 2017], transcript[He et al., 2022, Srivastava et al., 2019] or gene[Kaminow et al., 2021, Zheng et al., 2017]. Other UMI resolution approaches exist, for example, the reference-free model[Tsagiopoulou et al., 2021] and the method of moments[Melsted et al., 2021], but they may not be easily represented in this framework and are not discussed in further detail here.

In the UMI graph framework, a UMI resolution approach can be divided into three major steps, each of which has different options that can be modularly composed by different approaches. The three steps are defining nodes, defining adjacency relationships, and resolving components. Additionally, these steps may sometimes be preceded (and/or followed) by filtering steps designed to discard or heuristically assign (by modifying the set of reference mappings reported) reads and UMIs exhibiting certain types of mapping ambiguity. Defining nodes#

As described above, a node \(v_i \in V\) is an equivalence class of reads. Therefore, \(V\) can be defined based on the full or filtered set of mapped reads and their associated uncorrected UMIs. All reads that satisfy the equivalence relation \(\sim_r\) based on their reference set and UMI tag are associated with the same vertex \(v \in V\). An EC is a multi-gene EC if its UMI is a multi-gene UMI. Some approaches will avoid the creation of such ECs by filtering or heuristically assigning reads prior to node creation, while other approaches will retain and process these ambiguous vertices and attempt and resolve their gene origin via parsimony, probabilistic assignment, or based on a related rule or model [He et al., 2022, Kaminow et al., 2021, Srivastava et al., 2019]. Defining the adjacency relationship#

After creating the node set \(V\) of a UMI graph, the adjacency of nodes in \(V\) is defined based on the distance, typically the Hamming or edit distance, between their UMI sequences and, optionally, the content of their associated reference sets.

Here we define the following functions on the node \(v_i \in V\):

  • \(u(v_i)\) is the UMI tag of \(v_i\).

  • \(c(v_i) = |v_i|\) is the cardinality of \(v_i\), i.e., the number of reads associated with \(v_i\) that are equivalent under \(\sim_r\).

  • \(m(v_i)\) is the reference set encoded in the mapping information, for \(v_i\).

  • \(D(v_i, v_j)\) is the distance between \(u(v_i)\) and \(u(v_j)\), where \(v_j \in V\).

Given these function definitions, any two nodes \(v_i, v_j \in V\) will be incident with a bi-directed edge if and only if \(m(v_i) \cap m(v_j) \ne \emptyset\) and \(D(v_i,v_j) \le \theta\), where \(\theta\) is a distance threshold and is often set as \(\theta=1\)[Kaminow et al., 2021, Smith et al., 2017, Srivastava et al., 2019]. Additionally, the bi-directed edge might be replaced by a directed edge incident from \(v_i\) to \(v_j\) if \(c(v_i) \ge 2c(v_j) -1\) or vice versa[Smith et al., 2017, Srivastava et al., 2019]. Though these edge definitions are among the most common, others are possible, so long as they are completely defined by the \(u\), \(c\), \(m\), and \(D\) functions. With \(V\) and \(E\) in hand, the UMI graph \(G = (V,E)\) is now defined. Defining the graph resolution approach#

Given the defined UMI graph, many different resolution approaches may be applied. A resolution method may be as simple as finding the set of connected components, clustering the graph, greedily collapsing nodes or contracting edges[Smith et al., 2017], or searching for a cover of the graph by structures following certain rules (e.g., monochromatic arboresences[Srivastava et al., 2019]) to reduce the graph. As a result, each node in the reduced UMI graph, or each element in the cover in the case that the graph is not modified dynamically, represents a pre-PCR molecule. The collapsed nodes or covering sets are regarded as the PCR duplicates of that molecule.

Different rules for defining the adjacency relationship and different approaches for graph resolution itself can seek to preserve different properties and can define a wide variety of distinct overall UMI resolution approaches. Note that for the approaches that aim to probabilistically resolve the ambiguity caused by multimapping, the resolved UMI graph might contain multi-gene ECs, and their gene origin will be resolved in the following step. Quantification#

The last step in UMI resolution is quantifying the abundance of each gene using the resolved UMI graph. For approaches that discard multi-gene ECs, the molecule count vector for the genes in the current cell being processed (or count vector for short) is generated by counting the number of ECs labeled with each gene. On the other hand, approaches that process, rather than discard, multi-gene ECs usually resolve the ambiguity by applying some statistical inference procedure. For example, Srivastava et al. [2019] introduce an expectation-maximization (EM) approach for probabilistically assigning multi-gene UMIs, and related EM algorithms have also been introduced as optional steps in subsequent tools[He et al., 2022, Kaminow et al., 2021, Melsted et al., 2021]. In this model, the collapsed-EC-to-gene assignments are latent variables, and the deduplicated molecule count of genes are the main parameters. Intuitively, evidence from gene-unique ECs will be used to help probabilistically apportion the multi-gene ECs. The EM algorithm seeks the parameters that together have the (locally) highest likelihood of generating the observed ECs.

Usually, the UMI resolution and quantification process described above will be performed separately for each cell, represented by a corrected CB, to create a complete count matrix for all genes in all cells. However, the relative paucity of per-cell information in high-throughput single-cell samples limits the evidence available when performing UMI resolution, which in turn limits the potential efficacy of model-based solutions like the statistical inference procedure described above. Thus, further research here is certainly warranted. For example, Srivastava et al. [2020] introduced an approach that allows sharing information among transcriptionally similar cells to improve the quantification result further.

3.5. Count matrix quality control#

Once a count matrix has been generated, it is important to perform a quality control (QC) assessment. There are several distinct assessments that generally fall under the rubric of quality control. Basic global metrics are often recorded and reported to help assess the overall quality of the sequencing measurement itself. These metrics consist of quantities such as the total fraction of mapped reads, the distribution of distinct UMIs observed per cell, the distribution of UMI deduplication rates, the distribution of detected genes per cell, etc. These and similar metrics are often recorded by the quantification tools themselves[He et al., 2022, Kaminow et al., 2021, Melsted et al., 2021, Zheng et al., 2017] since they arise naturally and can be computed during the process of read mapping, cell barcode correction, and UMI resolution. Likewise, there exist several tools to help organize and visualize these basic metrics, such as the Loupe browser, alevinQC, or a kb_python report, depending upon the quantification pipeline being used. Beyond these basic global metrics, at this stage of analysis, QC metrics are designed primarily to help determine which cells (CBs) have been sequenced “successfully”, and which exhibit artifacts that warrant filtering or correction.

In the following toggle section, we discuss an example alevinQC report taken from the alevinQC manual webpage.

Once alevin or alevin-fry quantifies the single-cell data, the quality of the data can be assessed through the R package alevinQC. The alevinQC report can be generated in PDF format or as R/Shiny applications, which summarizes various components of the single-cell library, such as reads, CBs, and UMIs.

1. Metadata and summary tables

AlevinQC Summary

Fig. 3.15 An example of the summary section of an alevinQC report.#

The first section of an alevinQC report shows a summary of the input files and the processing result, among which, the top left table displays the metadata provided by alevin (or alevin-fry) for the quantification results. For example, this includes the time of the run, the version of the tool, and the path to the input FASTQ and index files. The top right summary table provides the summary statistics for various components of the single-cell library, for example, the number of sequencing reads, the number of selected cell barcodes at various levels of filtering, and the total number of deduplicated UMIs.

2. Knee plot, initial whitelist determination

AlevinQC Plots

Fig. 3.16 The figure shows the plots in the alevinQC report of an example single-cell dataset, of which the cells are filtered using the “knee” finding method. Each dot represents a corrected cell barcode with its corrected profile.#

The first(top left) view in Fig. 3.16 shows the distribution of cell barcode frequency in decreasing order. In all plots shown above, each point represents a corrected cell barcode, with its x-coordinate corresponding to its cell barcode frequency rank. In the top left plot, the y-coordinate corresponds to the observed frequency of the corrected barcode. Generally, this plot shows a “knee”-like pattern, which can be used to identify the initial list of high-quality barcodes. The red dots in the plot represent the cell barcodes selected as the high-quality cell barcodes in the case that “knee”-based filtering was applied. In other words, these cell barcodes contain a sufficient number of reads to be deemed high-quality and likely derived from truly present cells. Suppose an external permit list is passed in the CB correction step, which implies no internal algorithm was used to distinguish high-quality cell barcodes. In that case, all dots in the plot will be colored red, as all these corrected cell barcodes are processed throughout the raw data processing pipeline and reported in the gene count matrix. One should be skeptical of the data quality if the frequency is consistently low across all cell barcodes.

3. Barcode collapsing

After identification of the barcodes that will be processed, either through an internal threshold (e.g., from the “knee”-based method) or through external whitelisting, alevin (or alevin-fry) performs cell barcode sequence correction. The barcode collapsing plot, the upper middle plot in the Fig. 3.16, shows the number of reads assigned to a cell barcode after sequence correction of the cell barcodes versus prior to correction. Generally, we would see that all points fall close to the line representing \(x = y\), which means that the reassignments in CB correction usually do not drastically change the profile of the cell barcodes.

4. Knee Plot, number of genes per cell

The upper right plot in Fig. 3.16 shows the distribution of the number of observed genes of all processed cell barcodes. Generally, a mean of \(2,000\) genes per cell is considered modest but reasonable for the downstream analyses. One should double-check the quality of the data if all cells have a low number of observed genes.

5. Quantification summary

Finally, a series of quantification summary plots, the bottom plots in Fig. 3.16, compare the cell barcode frequency, the total number of UMIs after deduplication and the total number of non-zero genes using scatter plots. In general, in each plot, the plotted data should demonstrate a positive correlation, and, if high-quality filtering (e.g., knee filtering) has been performed, the high-quality cell barcodes should be well separated from the rest. Moreover, one should expect all three plots to convey similar trends. If using an external permit list, all the dots in the plots will be colored red, as all these cell barcodes are processed and reported in the gene count matrix. Still, we should see the correlation between the plots and the separation of the dots representing high-quality cells to others. If all of these metrics are consistently low across cells or if these plots convey substantially different trends, then one should be concerned about the data quality.

3.5.1. Empty droplet detection#

One of the first QC steps is determining which cell barcodes correspond to “high-confidence” sequenced cells. It is common in droplet-based protocols[Macosko et al., 2015] that certain barcodes are associated with ambient RNA instead of the RNA of a captured cell. This happens when droplets fail to capture a cell. These empty droplets still tend to produce sequenced reads, though the characteristics of these reads look markedly different from those associated with barcodes corresponding to properly captured cells. Many approaches exist to assess whether a barcode likely corresponds to an empty droplet or not. One simple method is to examine the cumulative frequency plot of the barcodes, in which barcodes are sorted in descending order of the number of distinct UMIs with which they are associated. This plot often contains a “knee” that can be identified as a likely point of discrimination between properly captured cells and empty droplets[He et al., 2022, Smith et al., 2017]. While this “knee” method is intuitive and can often estimate a reasonable threshold, it has several drawbacks. For example, not all cumulative histograms display an obvious knee, and it is notoriously difficult to design algorithms that can robustly and automatically detect such knees. Finally, the total UMI count associated with a barcode may not, alone, be the best signal to determine if the barcode was associated with an empty or damaged cell.

This led to the development of several tools specifically designed to detect empty or damaged droplets, or cells generally deemed to be of “low quality” [Alvarez et al., 2020, Heiser et al., 2021, Hippen et al., 2021, Lun et al., 2019, Muskovic and Powell, 2021, Young and Behjati, 2020]. These tools incorporate a variety of different measures of cell quality, including the frequencies of distinct UMIs, the number of detected genes, and the fraction of mitochondrial RNA, and typically work by applying a statistical model to these features to classify high-quality cells from putative empty droplets or damaged cells. This means that cells can typically be scored, and a final filtering can be selected based on an estimated posterior probability that cells are not empty or compromised. While these models generally work well for single-cell RNA-seq data, one may have to apply several additional filters or heuristics to obtain robust filtering in single-nucleus RNA-seq data[He et al., 2022, Kaminow et al., 2021], like those exposed in the emptyDropsCellRanger function of DropletUtils[Lun et al., 2019].

3.5.2. Doublet detection#

In addition to determining which cell barcodes correspond to empty droplets or damaged cells, one may also wish to identify those cell barcodes that correspond to doublets or multiplets. When a given droplet captures two (doublets) or more (multiplets) cells, this can result in a skewed distribution for these cell barcodes in terms of quantities like the number of reads and UMIs they represent, as well as gene expression profiles they display. Many tools have also been developed to predict the doublet status of cell barcodes[Bais and Kostka, 2019, Bernstein et al., 2020, DePasquale et al., 2019, McGinnis et al., 2019, Wolock et al., 2019]. Once detected, cells determined to likely be doublets and multiplets can be removed or otherwise adjusted for in the subsequent analysis.

3.6. Count data representation#

As one completes initial raw data processing and quality control and moves on to subsequent analyses, it is important to acknowledge and remember that the cell-by-gene count matrix is, at best, an approximation of the sequenced molecules in the original sample. At several stages of the raw data processing pipeline, heuristics are applied, and simplifications are made to enable the generation of this count matrix. For example, read mapping is imperfect, as is cell barcode correction. The accurate resolution of UMIs is particularly challenging, and issues related to UMIs attached to multimapping reads are often ignored, as is the fact that multiple priming sites, particularly among unspliced molecules, can violate the one molecule-to-one UMI relationship that is often assumed.

In light of these challenges and the simplifications adopted to address them, there remains active research as to how best to represent the preprocessed data to downstream tools. For example, several quantification tools[He et al., 2022, Kaminow et al., 2021, Melsted et al., 2021, Srivastava et al., 2019] implement an optional EM algorithm, initially introduced in this context by Srivastava et al. [2019], that probabilistically apportions UMIs associated with reads that map to more than one gene. This, however, can result in non-integer count matrices that may be unexpected by certain downstream tools. Alternatively, one may choose to resolve UMIs to gene groups instead of individual genes, retaining the multimapping information in the preprocessed output (it is worth noting that a similar representation, but at the transcript level, has been used for over a decade as a succinct internal representation in bulk RNA-seq transcript quantification tools[Bray et al., 2016, Ju et al., 2017, Nicolae et al., 2011, Patro et al., 2017, Patro et al., 2014, Turro et al., 2011], and such a transcript-level representation has even been proposed for use in the clustering and dimensionality reduction of full-length single-cell RNA-seq data[Ntranos et al., 2016] and differential expression analysis of single-cell RNA-seq data[Ntranos et al., 2019]). In this case, instead of the resulting count matrix having dimensions \(C \times G\), where \(G\) is the number of genes in the quantified annotation, it will have dimension \(C \times E\), where \(E\) is the number of distinct gene groups (commonly called equivalence class labels) across all cells in the given sample. By propagating this information to the output count matrix, one can avoid the application of heuristic or probabilistic UMI resolution methods that can result in loss of data, or bias, in the counts used in downstream analyses. Of course, to make use of this information, downstream analysis methods must expect the information in this format. Further, those downstream methods must typically have a way to resolve these counts, eventually, to the level of genes, as the abundance of gene groups lacks the intuitive biological interpretability of gene abundance. Nonetheless, the benefits provided by such representations, in terms of conveying more complete and accurate information to downstream analysis tools, can be substantial, and tools taking advantage of such representations are being developed (e.g. DifferentialRegulation); this is still an active area of research.

3.7. Brief discussion#

To close this chapter, we convey some observations and suggestions that have arisen from recent benchmarking and review studies surrounding some of the common preprocessing tools described above [Brüning et al., 2022, You et al., 2021]. It is, of course, important to note that the development of methods and tools for single-cell and single-nucleus RNA-seq raw data processing, as well as the continual evaluation of such methods, is an ongoing community effort. It is therefore often useful and reasonable, when performing your own analyses, to experiment with several different tools.

At the coarsest level, the most common tools can process data robustly and accurately. It has been suggested that with many common downstream analyses like clustering, and the methods used to perform them, the choice of preprocessing tool typically makes less difference than other steps in the analysis process [You et al., 2021]. Nonetheless, it has also been observed that applying lightweight mapping restricted to the spliced transcriptome can increase the probability of spurious mapping and gene expression [Brüning et al., 2022].

Ultimately, the choice of a specific tool largely depends on the task at hand, and the constraints on available computational resources. If performing a standard single-cell analysis, lightweight mapping-based methods are a good choice since they are faster (often considerably so) and more memory frugal than existing alignment-based tools. If performing single-nucleus RNA-seq analysis, alevin-fry is an attractive option in particular, as it remains memory frugal and its index remains relatively small even as the transcriptome reference is expanded to include unspliced reference sequence. On the other hand, alignment-based methods are recommended if it is important to recover reads that map outside of the (extended) transcriptome or if the genomic mapping sites of reads are necessary for use in the relevant downstream tools or analyses (e.g., such as for differential transcript usage analysis with a tool like sierra [Patrick et al., 2020]). Among the alignment-based pipelines, according to Brüning et al. [2022], STARsolo should be favored over Cell Ranger because the former is much faster than the latter, and requires less memory, while it is also capable of producing almost identical results.

3.8. A real-world example#

Given that we have covered the concepts underlying various approaches for raw data processing, we now turn our attention to demonstrating how a specific tool (in this case, alevin-fry) can be used to process a small example dataset. To start, we need the sequenced reads from a single-cell experiment in FASTQ format and the reference (e.g., transcriptome) against which the reads will be mapped. Usually, a reference includes the genome sequences and the corresponding gene annotations of the sequenced species in the FASTA and GTF format, respectively.

In this example, we will use chromosome 5 of the human genome and its related gene annotations as the reference, which is a subset of the Human reference, GRCh38 (GENCODE v32/Ensembl 98) reference from the 10x Genomics reference build. Correspondingly, we extract the subset of reads that map to the generated reference from a human brain tumor dataset from 10x Genomics.

Alevin-fry[He et al., 2022] is a fast, accurate, and memory-frugal single-cell and single-nucleus data processing tool. Simpleaf is a program, written in rust, that exposes a unified and simplified interface for processing some of the most common protocols and data types using the alevin-fry pipeline. A nextflow-based workflow tool also exists to process extensive collections of single-cell data. Here we will first show how to process single-cell raw data using two simpleaf commands. Then, we describe the complete set of salmon alevin and alevin-fry commands to which these simpleaf commands correspond, to outline where the steps described in this section occur and to convey the possible different processing options. These commands will be run from the command line, and conda will be used for installing all of the software required for running this example.

3.8.1. Preparation#

Before we start, we create a conda environment in the terminal and install the required package. Simpleaf depends on alevin-fry, salmon and pyroe. They are all available on bioconda and will be automatically installed when installing simpleaf.

conda create -n af -y -c bioconda simpleaf
conda activate af

Note on using an Apple silicon-based device

Conda does not currently build most packages natively for Apple silicon. Therefore, if you are using a non-Intel-based Apple computer (e.g., with an M1(Pro/Max/Ultra) or M2 chip), you should make sure to specify that your environment uses the Rosetta2 translation layer. To do this, you can replace the above commands with the following (instructions adopted from here):

CONDA_SUBDIR=osx-64 conda create -n af -y -c bioconda simpleaf   # create a new environment
conda activate af
conda env config vars set CONDA_SUBDIR=osx-64  # subsequent commands use intel packages

Next, we create a working directory, af_xmpl_run, and download and uncompress the example dataset from a remote host.

# Create a working dir and go to the working directory
## The && operator helps execute two commands using a single line of code.
mkdir af_xmpl_run && cd af_xmpl_run

# Fetch the example dataset and CB permit list and decompress them
## The pipe operator (|) passes the output of the wget command to the tar command.
## The dash operator (-) after `tar xzf` captures the output of the first command.
## - example dataset
wget -qO- https://umd.box.com/shared/static/lx2xownlrhz3us8496tyu9c4dgade814.gz | tar xzf - --strip-components=1 -C .
## The fetched folder containing the fastq files is called toy_read_fastq.
## The fetched folder containing the human ref files is called toy_human_ref.

# Fetch CB permit list
## the right chevron (>) redirects the STDOUT to a file.
wget -qO- https://raw.githubusercontent.com/10XGenomics/cellranger/master/lib/python/cellranger/barcodes/3M-february-2018.txt.gz | gunzip - > 3M-february-2018.txt

With the reference files (the genome FASTA file and the gene annotation GTF file) and read records (the FASTQ files) ready, we can now apply the raw data processing pipeline discussed above to generate the gene count matrix.

3.8.2. Simplified raw data processing pipeline#

Simpleaf is designed to simplify the alevin-fry interface for single-cell and nucleus raw data processing. It encapsulates the whole processing pipeline into two steps:

  1. simpleaf index indexes the provided reference or makes a splici reference (spliced transcripts + introns) and index it.

  2. simpleaf quant maps the sequencing reads against the indexed reference and quantifies the mapping records to generate a gene count matrix.

More advanced usages and options for mapping with simpleaf can be found here.

When running simpleaf index, if a genome FASTA file (-f) and a gene annotation GTF file(-g) are provided, it will generate a splici reference and index it; if only a transcriptome FASTA file is provided (--refseq), it will directly index it. Currently, we recommend the splici index.

# simpleaf needs the environment variable ALEVIN_FRY_HOME to store configuration and data.
# For example, the paths to the underlying programs it uses and the CB permit list
mkdir alevin_fry_home && export ALEVIN_FRY_HOME='alevin_fry_home'

# the simpleaf set-paths command finds the path to the required tools and writes a configuration JSON file in the ALEVIN_FRY_HOME folder.
simpleaf set-paths

# simpleaf index
# Usage: simpleaf index -o out_dir [-f genome_fasta -g gene_annotation_GTF|--refseq transcriptome_fasta] -r read_length -t number_of_threads
## The -r read_lengh is the number of sequencing cycles performed by the sequencer to generate biological reads (read2 in Illumina).
## Publicly available datasets usually have the read length in the description. Sometimes they are called the number of cycles.
simpleaf index \
-o simpleaf_index \
-f toy_human_ref/fasta/genome.fa \
-g toy_human_ref/genes/genes.gtf \
-r 90 \
-t 8

In the output directory simpleaf_index, the ref folder contains the splici reference; The index folder contains the salmon index built upon the splici reference.

The next step, simpleaf quant, consumes an index directory and the mapping record FASTQ files to generate a gene count matrix. This command encapsulates all the major steps discussed in this section, including mapping, cell barcode correction, and UMI resolution.

# Collecting sequencing read files
## The reads1 and reads2 variables are defined by finding the filenames with the pattern "_R1_" and "_R2_" from the toy_read_fastq directory.

## The read files must be sorted and separated by a comma.
### The find command finds the files in the fastq_dir with the name pattern
### The sort command sorts the file names
### The awk command and the paste command together convert the file names into a comma-separated string.
reads1="$(find -L ${fastq_dir} -name "*$reads1_pat*" -type f | sort | awk -v OFS=, '{$1=$1;print}' | paste -sd,)"
reads2="$(find -L ${fastq_dir} -name "*$reads2_pat*" -type f | sort | awk -v OFS=, '{$1=$1;print}' | paste -sd,)"

# simpleaf quant
## Usage: simpleaf quant -c chemistry -t threads -1 reads1 -2 reads2 -i index -u [unspliced permit list] -r resolution -m t2g_3col -o output_dir
simpleaf quant \
-c 10xv3 -t 8 \
-1 $reads1 -2 $reads2 \
-i simpleaf_index/index \
-u -r cr-like \
-m simpleaf_index/index/t2g_3col.tsv \
-o simpleaf_quant

After running these commands, the resulting quantification information can be found in the simpleaf_quant/af_quant/alevin folder. Within this directory, there are three files: quants_mat.mtx, quants_mat_cols.txt, and quants_mat_rows.txt, which correspond, respectively, to the count matrix, the gene names for each column of this matrix, and the corrected, filtered cell barcodes for each row of this matrix. The tail lines of these files are shown below. Of note here is the fact that alevin-fry was run in the USA-mode (unspliced, spliced, and ambiguous mode), and so quantification was performed for both the spliced and unspliced status of each gene — the resulting quants_mat_cols.txt file will then have a number of rows equal to 3 times the number of annotated genes which correspond, to the names used for the spliced (S), unspliced (U), and splicing-ambiguous variants (A) of each gene.

# Each line in `quants_mat.mtx` represents
# a non-zero entry in the format row column entry
$ tail -3 simpleaf_quant/af_quant/alevin/quants_mat.mtx
138 58 1
139 9 1
139 37 1

# Each line in `quants_mat_cols.txt` is a splice status
# of a gene in the format (gene name)-(splice status)
$ tail -3 simpleaf_quant/af_quant/alevin/quants_mat_cols.txt

# Each line in `quants_mat_rows.txt` is a corrected
# (and, potentially, filtered) cell barcode
$ tail -3 simpleaf_quant/af_quant/alevin/quants_mat_rows.txt

We can load the count matrix into Python as an AnnData object using the load_fry function from pyroe. A similar function, loadFry, has been implemented in the fishpond R package.

import pyroe

quant_dir = 'simpleaf_quant/af_quant'
adata_sa = pyroe.load_fry(quant_dir)

The default behavior loads the X layer of the Anndata object as the sum of the spliced and ambiguous counts for each gene. However, recent work[Pool et al., 2022] and updated practices suggest that the inclusion of intronic counts, even in single-cell RNA-seq data, may increase sensitivity and benefit downstream analyses. While the best way to make use of this information is the subject of ongoing research, since alevin-fry automatically quantifies spliced, unspliced, and ambiguous reads in each sample, the count matrix containing the total counts for each gene can be simply obtained as follows:

import pyroe

quant_dir = 'simpleaf_quant/af_quant'
adata_usa = pyroe.load_fry(quant_dir, output_format={'X' : ['U','S','A']})

3.8.3. The complete alevin-fry pipeline#

Simpleaf makes it possible to process single-cell raw data in the “standard” way with a few commands. Next, we will show how to generate the identical quantification result by explicitly calling the pyroe, salmon, and alevin-fry commands. On top of the pedagogical value, knowing the exact command of each step will be helpful if only a part of the pipeline needs to be rerun or if some parameters not currently exposed by simpleaf need to be specified.

Please note that the commands in the Preparation section should be executed in advance. All the tools called in the following commands, pyroe, salmon, and alevin-fry, have already been installed when installing simpleaf. Building the index#

First, we process the genome FASTA file and gene annotation GTF file to obtain the splici index. The commands in the following code chunk are analogous to the simpleaf index command discussed above. This includes two steps:

  1. Building the splici reference (spliced transcripts + introns) by calling pyroe make-splici, using the genome and gene annotation file

  2. Indexing the splici reference by calling salmon index

# make splici reference
## Usage: pyroe make-splici genome_file gtf_file read_length out_dir
## The read_lengh is the number of sequencing cycles performed by the sequencer. Ask your technician if you are not sure about it.
## Publicly available datasets usually have the read length in the description.
pyroe make-splici \
${ref_dir}/fasta/genome.fa \
${ref_dir}/genes/genes.gtf \
90 \

# Index the reference
## Usage: salmon index -t extend_txome.fa -i idx_out_dir -p num_threads
## The $() expression runs the command inside and puts the output in place.
## Please ensure that only one file ends with ".fa" in the `splici_ref` folder.
salmon index \
-t $(ls splici_rl90_ref/*\.fa) \
-i salmon_index \
-p 8

The splici index can be found in the salmon_index directory. Mapping and quantification#

Next, we will map the sequencing reads recorded against the splici index by calling salmon alevin. This will produce an output folder called salmon_alevin that contains all the information we need to process the mapped reads using alevin-fry.

# Collect FASTQ files
## The filenames are sorted and separated by space.
reads1="$(find -L $fastq_dir -name "*$reads1_pat*" -type f | sort | awk '{$1=$1;print}' | paste -sd' ')"
reads2="$(find -L $fastq_dir -name "*$reads2_pat*" -type f | sort | awk '{$1=$1;print}' | paste -sd' ')"

# Mapping
## Usage: salmon alevin -i index_dir -l library_type -1 reads1_files -2 reads2_files -p num_threads -o output_dir
## The variable reads1 and reads2 defined above are passed in using ${}.
salmon alevin \
-i salmon_index \
-l ISR \
-1 ${reads1} \
-2 ${reads2} \
-p 8 \
-o salmon_alevin \
--chromiumV3 \

Then, we execute the cell barcode correction and UMI resolution step using alevin-fry. This procedure involves three alevin-fry commands:

  1. The generate-permit-list command is used for cell barcode correction.

  2. The collate command filters out invalid mapping records, corrects cell barcodes and collates mapping records originating from the same corrected cell barcode.

  3. The quant command performs UMI resolution and quantification.

# Cell barcode correction
## Usage: alevin-fry generate-permit-list -u CB_permit_list -d expected_orientation -o gpl_out_dir
## Here, the reads that map to the reverse complement strand of transcripts are filtered out by specifying `-d fw`.
alevin-fry generate-permit-list \
-u 3M-february-2018.txt \
-d fw \
-i salmon_alevin \
-o alevin_fry_gpl

# Filter mapping information
## Usage: alevin-fry collate -i gpl_out_dir -r alevin_map_dir -t num_threads
alevin-fry collate \
-i alevin_fry_gpl \
-r salmon_alevin \
-t 8

# UMI resolution + quantification
## Usage: alevin-fry quant -r resolution -m txp_to_gene_mapping -i gpl_out_dir -o quant_out_dir -t num_threads
## The file ends with `3col.tsv` in the splici_ref folder will be passed to the -m argument.
## Please ensure that there is only one such file in the `splici_ref` folder.
alevin-fry quant -r cr-like \
-m $(ls splici_rl90_ref/*3col.tsv) \
-i alevin_fry_gpl \
-o alevin_fry_quant \
-t 8

After running these commands, the resulting quantification information can be found in alevin_fry_quant/alevin. Other relevant information concerning the mapping, CB correction, and UMI resolution steps can be found in the salmon_alevin, alevin_fry_gpl, and alevin_fry_quant folders, respectively.

In the example given here, we demonstrate using simpleaf and alevin-fry to process a 10x Chromium 3’ v3 dataset. Alevin-fry and simpleaf provide many other options for processing different single-cell protocols, including but not limited to Dropseq[Macosko et al., 2015], sci-RNA-seq3[Cao et al., 2019] and other 10x Chromium platforms. A more comprehensive list and description of available options for different stages of processing can be found in the alevin-fry and simpleaf documentation. alevin-fry also provides a nextflow-based workflow, called quantaf, for conveniently processing many samples from a simply-defined sample sheet.

Of course, similar resources exist for many of the other raw data processing tools referenced and described throughout this section, including zUMIs[Parekh et al., 2018], alevin[Srivastava et al., 2019], kallisto|bustools[Melsted et al., 2021], STARsolo[Kaminow et al., 2021] and CellRanger. The scrnaseq pipeline from nf-core also provides a nextflow-based pipeline for processing single-cell RNA-seq data generated using a range of different chemistries and integrates several of the tools described in this section.

3.10. References#


Marcus Alvarez, Elior Rahmani, Brandon Jew, Kristina M. Garske, Zong Miao, Jihane N. Benhammou, Chun Jimmie Ye, Joseph R. Pisegna, Kirsi H. Pietiläinen, Eran Halperin, and Päivi Pajukanta. Enhancing droplet-based single-nucleus term`RNA`-seq resolution using the semi-supervised machine learning classifier DIEM. Scientific Reports, July 2020. URL: https://doi.org/10.1038/s41598-020-67513-5, doi:10.1038/s41598-020-67513-5.

[rawBK19] (1,2)

Abha S Bais and Dennis Kostka. Scds: computational annotation of doublets in single-cell term`RNA` sequencing data. Bioinformatics, 36(4):1150–1158, September 2019. URL: https://doi.org/10.1093/bioinformatics/btz698, doi:10.1093/bioinformatics/btz698.


Nicholas J. Bernstein, Nicole L. Fong, Irene Lam, Margaret A. Roy, David G. Hendrickson, and David R. Kelley. Solo: Doublet Identification in Single-Cell term`RNA`-Seq via Semi-Supervised Deep Learning. Cell Systems, 11(1):95–101.e5, July 2020. URL: https://doi.org/10.1016/j.cels.2020.05.010, doi:10.1016/j.cels.2020.05.010.


Sayantan Bose, Zhenmao Wan, Ambrose Carr, Abbas H. Rizvi, Gregory Vieira, Dana Pe'er, and Peter A. Sims. Scalable microfluidics for single-cell term`RNA` printing and sequencing. Genome Biology, June 2015. URL: https://doi.org/10.1186/s13059-015-0684-3, doi:10.1186/s13059-015-0684-3.

[rawBPMP16] (1,2)

Nicolas L Bray, Harold Pimentel, Páll Melsted, and Lior Pachter. Near-optimal probabilistic term`RNA`-seq quantification. Nature biotechnology, 34(5):525–527, 2016.


Ralf Schulze Brüning, Lukas Tombor, Marcel H Schulz, Stefanie Dimmeler, and David John. Comparative analysis of common alignment tools for single-cell term`RNA` sequencing. GigaScience, 2022.

[rawBTS+22] (1,2,3)

Ralf Schulze Brüning, Lukas Tombor, Marcel H Schulz, Stefanie Dimmeler, and David John. Comparative analysis of common alignment tools for single-cell term`RNA` sequencing. GigaScience, 2022. URL: https://doi.org/10.1093%2Fgigascience%2Fgiac001, doi:10.1093/gigascience/giac001.


Junyue Cao, Malte Spielmann, Xiaojie Qiu, Xingfan Huang, Daniel M. Ibrahim, Andrew J. Hill, Fan Zhang, Stefan Mundlos, Lena Christiansen, Frank J. Steemers, Cole Trapnell, and Jay Shendure. The single-cell transcriptional landscape of mammalian organogenesis. Nature, 566(7745):496–502, February 2019. URL: https://doi.org/10.1038/s41586-019-0969-x, doi:10.1038/s41586-019-0969-x.


Kun-Mao Chao, William R. Pearson, and Webb Miller. Aligning two sequences within a specified diagonal band. Bioinformatics, 8(5):481–487, 1992. URL: https://doi.org/10.1093/bioinformatics/8.5.481, doi:10.1093/bioinformatics/8.5.481.

[rawDSC+19] (1,2)

Erica A.K. DePasquale, Daniel J. Schnell, Pieter-Jan Van Camp, Íñigo Valiente-Aland\'ı, Burns C. Blaxall, H. Leighton Grimes, Harinder Singh, and Nathan Salomonis. DoubletDecon: deconvoluting doublets from single-cell term`RNA`-sequencing data. Cell Reports, 29(6):1718–1727.e8, November 2019. URL: https://doi.org/10.1016/j.celrep.2019.09.082, doi:10.1016/j.celrep.2019.09.082.

[rawDDS+13] (1,2)

Alexander Dobin, Carrie A Davis, Felix Schlesinger, Jorg Drenkow, Chris Zaleski, Sonali Jha, Philippe Batut, Mark Chaisson, and Thomas R Gingeras. Star: ultrafast universal rna-seq aligner. Bioinformatics, 29(1):15–21, 2013.

[rawFDF+20] (1,2)

Rick Farouni, Haig Djambazian, Lorenzo E Ferri, Jiannis Ragoussis, and Hamed S Najafabadi. Model-based analysis of sample index hopping reveals its widespread artifacts in multiplexed single-cell term`RNA`-sequencing. Nature communications, 11(1):1–8, 2020.


Michael Farrar. Striped Smith–Waterman speeds database searches six times over other SIMD implementations. Bioinformatics, 23(2):156–161, 2007.


Gennady Gorin and Lior Pachter. Length Biases in Single-Cell term`RNA` Sequencing of pre-mRNA. bioRxiv, 2021. URL: https://www.biorxiv.org/content/early/2021/07/31/2021.07.30.454514, arXiv:https://www.biorxiv.org/content/early/2021/07/31/2021.07.30.454514.full.pdf, doi:10.1101/2021.07.30.454514.

[rawHZS+22] (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16)

Dongze He, Mohsen Zakeri, Hirak Sarkar, Charlotte Soneson, Avi Srivastava, and Rob Patro. Alevin-fry unlocks rapid, accurate and memory-frugal quantification of single-cell RNA-seq data. Nature Methods, 19(3):316–322, 2022.


Cody N. Heiser, Victoria M. Wang, Bob Chen, Jacob J. Hughey, and Ken S. Lau. Automated quality control and cell identification of droplet-based single-cell data using dropkick. Genome Research, 31(10):1742–1752, April 2021. URL: https://doi.org/10.1101/gr.271908.120, doi:10.1101/gr.271908.120.


Ariel A Hippen, Matias M Falco, Lukas M Weber, Erdogan Pekcan Erkan, Kaiyang Zhang, Jennifer Anne Doherty, Anna Vähärautio, Casey S Greene, and Stephanie C Hicks. miQC: An adaptive probabilistic framework for quality control of single-cell term`RNA`-sequencing data. PLoS computational biology, 17(8):e1009290, 2021.


Saiful Islam, Amit Zeisel, Simon Joost, Gioele La Manno, Pawel Zajac, Maria Kasper, Peter Lönnerberg, and Sten Linnarsson. Quantitative single-cell term`RNA`-seq with unique molecular identifiers. Nature Methods, 11(2):163–166, December 2013. URL: https://doi.org/10.1038/nmeth.2772, doi:10.1038/nmeth.2772.


Chelsea J-T Ju, Ruirui Li, Zhengliang Wu, Jyun-Yu Jiang, Zhao Yang, and Wei Wang. Fleximer: accurate quantification of term`RNA`-Seq via variable-length k-mers. In Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, 263–272. 2017.

[rawKYD21] (1,2,3,4,5,6,7,8,9,10,11,12,13)

Benjamin Kaminow, Dinar Yunusov, and Alexander Dobin. STARsolo: accurate, fast and versatile mapping/quantification of single-cell and single-nucleus term`RNA`-seq data. bioRxiv, 2021.


Heng Li. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34(18):3094–3100, May 2018. URL: https://doi.org/10.1093/bioinformatics/bty191, doi:10.1093/bioinformatics/bty191.

[rawLRA+19] (1,2,3,4)

Aaron TL Lun, Samantha Riesenfeld, Tallulah Andrews, Tomas Gomes, John C Marioni, and others. EmptyDrops: distinguishing cells from empty droplets in droplet-based single-cell term`RNA` sequencing data. Genome biology, 20(1):1–9, 2019.

[rawMBS+15] (1,2,3)

Evan Z. Macosko, Anindita Basu, Rahul Satija, James Nemesh, Karthik Shekhar, Melissa Goldman, Itay Tirosh, Allison R. Bialas, Nolan Kamitaki, Emily M. Martersteck, John J. Trombetta, David A. Weitz, Joshua R. Sanes, Alex K. Shalek, Aviv Regev, and Steven A. McCarroll. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell, 161(5):1202–1214, May 2015. URL: https://doi.org/10.1016/j.cell.2015.05.002, doi:10.1016/j.cell.2015.05.002.


Santiago Marco-Sola, Jordan M Eizenga, Andrea Guarracino, Benedict Paten, Erik Garrison, and Miquel Moreto. Optimal gap-affine alignment in o(s) space. bioRxiv, April 2022. URL: https://doi.org/10.1101/2022.04.14.488380, doi:10.1101/2022.04.14.488380.


Santiago Marco-Sola, Juan Carlos Moure, Miquel Moreto, and Antonio Espinosa. Fast gap-affine pairwise alignment using the wavefront algorithm. Bioinformatics, September 2020. URL: https://doi.org/10.1093/bioinformatics/btaa777, doi:10.1093/bioinformatics/btaa777.

[rawMMG19] (1,2)

Christopher S. McGinnis, Lyndsay M. Murrow, and Zev J. Gartner. DoubletFinder: doublet detection in single-cell term`RNA` sequencing data using artificial nearest neighbors. Cell Systems, 8(4):329–337.e4, April 2019. URL: https://doi.org/10.1016/j.cels.2019.03.003, doi:10.1016/j.cels.2019.03.003.

[rawMBL+21] (1,2,3,4,5,6,7,8)

Páll Melsted, A. Sina Booeshaghi, Lauren Liu, Fan Gao, Lambda Lu, Kyung Hoi Min, Eduardo da Veiga Beltrame, Kristján Eldjárn Hjörleifsson, Jase Gehring, and Lior Pachter. Modular, efficient and constant-memory single-cell rna-seq preprocessing. Nature Biotechnology, 39(7):813–818, April 2021. URL: https://doi.org/10.1038/s41587-021-00870-2, doi:10.1038/s41587-021-00870-2.

[rawMP21] (1,2)

Walter Muskovic and Joseph E Powell. DropletQC: improved identification of empty droplets and damaged cells in single-cell term`RNA`-seq data. Genome Biology, 22(1):1–9, 2021.


Marius Nicolae, Serghei Mangul, Ion I Măndoiu, and Alex Zelikovsky. Estimation of alternative splicing isoform frequencies from term`RNA`-Seq data. Algorithms for molecular biology, 6(1):1–13, 2011.


Stefan Niebler, André Müller, Thomas Hankeln, and Bertil Schmidt. RainDrop: Rapid activation matrix computation for droplet-based single-cell term`RNA`-seq reads. BMC bioinformatics, 21(1):1–14, 2020.


Vasilis Ntranos, Govinda M Kamath, Jesse M Zhang, Lior Pachter, and David N Tse. Fast and accurate single-cell term`RNA`-seq analysis by clustering of transcript-compatibility counts. Genome biology, 17(1):1–14, 2016.


Vasilis Ntranos, Lynn Yi, Páll Melsted, and Lior Pachter. A discriminative learning approach to differential expression analysis for single-cell term`rna`-seq. Nature methods, 16(2):163–166, 2019.


Baraa Orabi, Emre Erhan, Brian McConeghy, Stanislav V Volik, Stephane Le Bihan, Robert Bell, Colin C Collins, Cedric Chauve, and Faraz Hach. Alignment-free clustering of UMI tagged term`DNA` molecules. Bioinformatics, 35(11):1829–1836, October 2018. URL: https://doi.org/10.1093/bioinformatics/bty888, doi:10.1093/bioinformatics/bty888.

[rawPZV+18] (1,2,3,4)

Swati Parekh, Christoph Ziegenhain, Beate Vieth, Wolfgang Enard, and Ines Hellmann. zUMIs - a fast and flexible pipeline to process term`RNA` sequencing data with UMIs. GigaScience, May 2018. URL: https://doi.org/10.1093/gigascience/giy059, doi:10.1093/gigascience/giy059.


Ralph Patrick, David T. Humphreys, Vaibhao Janbandhu, Alicia Oshlack, Joshua W.K. Ho, Richard P. Harvey, and Kitty K. Lo. Sierra: discovery of differential transcript usage from polyA-captured single-cell term`RNA`-seq data. Genome Biology, jul 2020. URL: https://doi.org/10.1186%2Fs13059-020-02071-7, doi:10.1186/s13059-020-02071-7.


Rob Patro, Geet Duggal, Michael I Love, Rafael A Irizarry, and Carl Kingsford. Salmon provides fast and bias-aware quantification of transcript expression. Nature methods, 14(4):417–419, 2017.


Rob Patro, Stephen M Mount, and Carl Kingsford. Sailfish enables alignment-free isoform quantification from term`RNA`-seq reads using lightweight algorithms. Nature biotechnology, 32(5):462–464, 2014.

[rawPPC+22] (1,2,3,4)

Allan-Hermann Pool, Helen Poldsam, Sisi Chen, Matt Thomson, and Yuki Oka. Enhanced recovery of single-cell term`RNA`-sequencing reads for missing gene expression data. bioRxiv, 2022. URL: https://www.biorxiv.org/content/early/2022/04/27/2022.04.26.489449, arXiv:https://www.biorxiv.org/content/early/2022/04/27/2022.04.26.489449.full.pdf, doi:10.1101/2022.04.26.489449.


Torbjørn Rognes and Erling Seeberg. Six-fold speed-up of Smith–Waterman sequence database searches using parallel processing on common microprocessors. Bioinformatics, 16(8):699–706, 2000.

[rawSHS17] (1,2,3,4,5,6,7,8,9,10,11)

Tom Smith, Andreas Heger, and Ian Sudbery. UMI-tools: modeling sequencing errors in Unique Molecular Identifiers to improve quantification accuracy. Genome research, 27(3):491–499, 2017.


Charlotte Soneson, Avi Srivastava, Rob Patro, and Michael B Stadler. Preprocessing choices affect RNA velocity results for droplet scRNA-seq data. PLoS computational biology, 17(1):e1008585, 2021.


Avi Srivastava, Laraib Malik, Hirak Sarkar, and Rob Patro. A Bayesian framework for inter-cellular information sharing improves dscRNA-seq quantification. Bioinformatics, 36(Supplement_1):i292–i299, July 2020. URL: https://doi.org/10.1093/bioinformatics/btaa450, doi:10.1093/bioinformatics/btaa450.


Avi Srivastava, Laraib Malik, Hirak Sarkar, Mohsen Zakeri, Fatemeh Almodaresi, Charlotte Soneson, Michael I Love, Carl Kingsford, and Rob Patro. Alignment and mapping methodology influence transcript abundance estimation. Genome Biology, 21(1):1–29, 2020.

[rawSMS+19] (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16)

Avi Srivastava, Laraib Malik, Tom Smith, Ian Sudbery, and Rob Patro. Alevin efficiently estimates accurate gene abundances from dscRNA-seq data. Genome biology, 20(1):1–16, 2019.


Avi Srivastava, Hirak Sarkar, Nitish Gupta, and Rob Patro. Rapmap: a rapid, sensitive and accurate tool for mapping term`rna`-seq reads to transcriptomes. Bioinformatics, 32(12):i192–i200, 2016.


Hajime Suzuki and Masahiro Kasahara. Introducing difference recurrence relations for faster semi-global alignment of long sequences. BMC Bioinformatics, February 2018. URL: https://doi.org/10.1186/s12859-018-2014-8, doi:10.1186/s12859-018-2014-8.

[rawTMP+21] (1,2)

Maria Tsagiopoulou, Maria Christina Maniou, Nikolaos Pechlivanis, Anastasis Togkousidis, Michaela Kotrová, Tobias Hutzenlaub, Ilias Kappas, Anastasia Chatzidimitriou, and Fotis Psomopoulos. UMIc: a preprocessing method for UMI deduplication and reads correction. Frontiers in Genetics, May 2021. URL: https://doi.org/10.3389/fgene.2021.660366, doi:10.3389/fgene.2021.660366.


Ernest Turro, Shu-Yi Su, Ângela Gonçalves, Lachlan JM Coin, Sylvia Richardson, and Alex Lewin. Haplotype and isoform specific expression estimation using multi-mapping term`RNA`-seq reads. Genome biology, 12(2):1–15, 2011.

[rawWLK19] (1,2)

Samuel L. Wolock, Romain Lopez, and Allon M. Klein. Scrublet: Computational Identification of Cell Doublets in Single-Cell Transcriptomic Data. Cell Systems, 8(4):281–291.e9, April 2019. URL: https://doi.org/10.1016/j.cels.2018.11.005, doi:10.1016/j.cels.2018.11.005.


Andrzej Wozniak. Using video-oriented instructions to speed up sequence comparison. Bioinformatics, 13(2):145–150, 1997.

[rawYTS+21] (1,2)

Yue You, Luyi Tian, Shian Su, Xueyi Dong, Jafar S. Jabbari, Peter F. Hickey, and Matthew E. Ritchie. Benchmarking UMI-based single-cell term`RNA`-seq preprocessing workflows. Genome Biology, dec 2021. URL: https://doi.org/10.1186%2Fs13059-021-02552-3, doi:10.1186/s13059-021-02552-3.

[rawYB20] (1,2)

Matthew D Young and Sam Behjati. SoupX removes ambient term`RNA` contamination from droplet-based single-cell term`RNA` sequencing data. GigaScience, December 2020. URL: https://doi.org/10.1093/gigascience/giaa151, doi:10.1093/gigascience/giaa151.


Luke Zappia and Fabian J. Theis. Over 1000 tools reveal trends in the single-cell rna-seq analysis landscape. Genome Biology, 22(1):301, Oct 2021. URL: https://doi.org/10.1186/s13059-021-02519-4, doi:10.1186/s13059-021-02519-4.


Zheng Zhang, Scott Schwartz, Lukas Wagner, and Webb Miller. A greedy algorithm for aligning term`DNA` sequences. Journal of Computational Biology, 7(1-2):203–214, February 2000. URL: https://doi.org/10.1089/10665270050081478, doi:10.1089/10665270050081478.

[rawZTB+17] (1,2,3,4)

Grace X. Y. Zheng, Jessica M. Terry, Phillip Belgrader, Paul Ryvkin, Zachary W. Bent, Ryan Wilson, Solongo B. Ziraldo, Tobias D. Wheeler, Geoff P. McDermott, Junjie Zhu, Mark T. Gregory, Joe Shuga, Luz Montesclaros, Jason G. Underwood, Donald A. Masquelier, Stefanie Y. Nishimura, Michael Schnall-Levin, Paul W. Wyatt, Christopher M. Hindson, Rajiv Bharadwaj, Alexander Wong, Kevin D. Ness, Lan W. Beppu, H. Joachim Deeg, Christopher McFarland, Keith R. Loeb, William J. Valente, Nolan G. Ericson, Emily A. Stevens, Jerald P. Radich, Tarjei S. Mikkelsen, Benjamin J. Hindson, and Jason H. Bielas. Massively parallel digital transcriptional profiling of single cells. Nature Communications, 8(1):14049, Jan 2017. URL: https://doi.org/10.1038/ncomms14049, doi:10.1038/ncomms14049.


Christoph Ziegenhain, Gert-Jan Hendriks, Michael Hagemann-Jensen, and Rickard Sandberg. Molecular spikes: a gold standard for single-cell term`RNA` counting. Nature Methods, 19(5):560–566, 2022.

[raw10xGenomics21] (1,2,3,4,5)

10x Genomics. Technical note - interpreting intronic and antisense reads in 10x genomics single cell gene expression data. https://www.10xgenomics.com/support/single-cell-gene-expression/documentation/steps/sequencing/interpreting-intronic-and-antisense-reads-in-10-x-genomics-single-cell-gene-expression-data, August 2021.

3.11. Contributors#

We gratefully acknowledge the contributions of:

3.11.1. Authors#

  • Dongze He

  • Avi Srivastava

  • Hirak Sarkar

  • Rob Patro

3.11.2. Reviewers#

  • Lukas Heumos