This tutorial is not in its final state. The content may change a lot in the next months.
Because of this status, it is also not listed in the topic pages.
Single-cell RNA-seq (scRNA-seq) is emerging as a promising technology for analysing variability in cell populations. However, the combination of technical noise and intrinsic biological variability makes detecting technical artefacts particularly challenging. Removal of low-quality cells and detection of technical artefacts are critical for accurate downstream analysis.
A number of factors should be examined before downstream analyses, many of which we’ll address here:
Low library size: When cells are very degraded or absent from the library preparation, the number of reads sequenced from that library will be very low. It’s important to remove these cells from downstream analyses.
Low number of expressed genes: A low number of expressed genes may be a result of poor-quality cells (e.g. dying, degraded, damaged, etc.), followed by high PCR amplification of the remaining RNA. Again, these cells should be removed from downstream analyses.
High mitochondrial gene content: A high proportion of mitochondrial gene expression is often a result of damaged cells where the endogenous RNA escapes or degrades. As mitochondria have their own membranes, their DNA/RNA is often the last in damaged cells to degrade and hence occurs in high quantities during sequencing.
Batch effect: Large scRNA-seq projects usually need to generate data across multiple batches due to logistical constraints. However, the processing of different batches is often subject to variation, e.g., changes in operator, differences in reagent quality and concentration, the sequencing machine used, etc. This results in systematic differences in the observed expression in cells from different batches, which we refer to as “batch effects”. Batch effects are problematic as they can be major drivers of variation in the data, masking the relevant biological differences and complicating interpretation of the results.
We will use scater (McCarthy et al. 2017) to visualise scRNA-seq data, obtain information about the factors mentioned above, filter out low-quality cells and confirm that the filtering has worked. We’ll then look at confounding factors such as batch effects to see whether the data are biased by any technical artefacts.
We will use a pre-calculated expression matrix, along with some additional metadata such as lists of mitochondrial genes and annotation of technical information for each sequencing library. We will plot the data and carry out quality control filtering based on the visualise-filter-visualise paradigm of this approach (see workflow below).
Click Upload Data at the top of the tool panel
Select Paste/Fetch Data
Paste the link(s) into the text field
Press Start
Close the window
As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:
Go into Libraries (left panel)
Navigate to the correct folder as indicated by your instructor.
On most Galaxies, tutorial data will be provided in a folder named GTN - Material -> Topic Name -> Tutorial Name.
Select the desired files
Click on Add to History near the top and select as Datasets from the dropdown menu
In the pop-up window, choose
“Select history”: the history you want to import the data to (or create a new one)
Click on Import
Visualise the data
Take a look at the uploaded data by clicking on the eye symbol for each dataset.
The counts.txt file is a 40-sample expression matrix. The samples (Cell_001 to Cell_040) are listed as the column headers, and each row starts with a gene name. The remaining values are the number of reads mapped to each gene in each sample.
annotation.txt is a file listing experimental information about each cell. Parameters here include Mutation_Status, Cell_Cycle, and Treatment. These will be useful for looking at batch effects later.
The mt_controls.txt file is a list of mitochondrial genes. This list will be used later to calculate the % of mitochondrial reads in each sequencing library.
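As a rough illustration of the layout described above, the following sketch parses a tiny tab-separated matrix shaped like counts.txt using only Python’s standard library. The gene names, the counts and the label in the first header cell are invented for illustration; the real file may differ in detail.

```python
import csv
import io

# Toy stand-in for counts.txt: tab-separated, one row per gene,
# one column per cell. Gene names and values are invented.
COUNTS_TSV = (
    "gene\tCell_001\tCell_002\tCell_003\n"
    "GeneA\t120\t0\t35\n"
    "GeneB\t4\t2\t0\n"
    "MT-X\t60\t1\t5\n"
)


def read_counts(handle):
    """Return (cell_names, {gene: [count per cell]}) from a tabular matrix."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    cells = header[1:]  # the first column holds the gene name
    counts = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        counts[row[0]] = [int(float(value)) for value in row[1:]]
    return cells, counts


cells, counts = read_counts(io.StringIO(COUNTS_TSV))
print(cells)
print(counts["GeneA"])
```

For a real file, the `io.StringIO(...)` object would be replaced by an opened file handle.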
Hands On: Calculate QC metrics
Scater: Calculate QC metrics (Galaxy version 1.12.2) with the following parameters:
“Expression matrix in tabular format”: counts.txt (Input dataset)
“Format dataset describing the features in tabular format”: annotation.txt (Input dataset)
“Dataset containing the list of the mitochondrial control genes”: mt_controls.txt (Input dataset)
Comment
The output of this tool is a SingleCellExperiment object in Loom format, which contains all the information from the input files, along with a host of other quality control metrics, calculated from the input data.
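To make the kind of metric involved more concrete, here is a minimal sketch that computes three per-cell values (library size, number of expressed genes, and percentage of mitochondrial reads) from a toy matrix. It is not the scater implementation; the function name `qc_metrics`, the gene names and all counts are assumptions made for illustration, and the mitochondrial percentage is computed here on reads.

```python
# Toy counts: gene -> reads per cell (three cells). All values are invented.
counts = {
    "GeneA": [120, 0, 35],
    "GeneB": [4, 2, 0],
    "MT-X": [60, 1, 5],
}
mito_genes = {"MT-X"}  # stand-in for the contents of mt_controls.txt


def qc_metrics(counts, mito_genes, min_reads=1):
    """Per-cell library size, expressed gene count and % mitochondrial reads."""
    n_cells = len(next(iter(counts.values())))
    metrics = []
    for i in range(n_cells):
        total = sum(row[i] for row in counts.values())
        # A gene counts as expressed when it has at least min_reads reads.
        expressed = sum(1 for row in counts.values() if row[i] >= min_reads)
        mito = sum(row[i] for gene, row in counts.items() if gene in mito_genes)
        pct_mito = 100.0 * mito / total if total else 0.0
        metrics.append(
            {"total_counts": total, "expressed_genes": expressed, "pct_mito": pct_mito}
        )
    return metrics


metrics = qc_metrics(counts, mito_genes)
for cell_metrics in metrics:
    print(cell_metrics)
```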
Next, let’s plot various properties to see what our data looks like.
Hands On: Plot library QC
Scater: plot library QC (Galaxy version 1.12.2) with the following parameters:
“Input SingleCellLoomExperiment dataset”: output_loom (output of the Scater: Calculate QC metrics tool)
“Plot on log scale”: No
If we have a large number of cells (500+), set the ‘Plot on log scale’ option to ‘Yes’. This will make it easier to pick cut-offs when dealing with large numbers. When the tool has finished running, click on the eye symbol to view the plots. If they don’t appear in the browser, you may have to download the file and view it externally. You should be presented with plots similar to those below.
Figure 2: Raw data QC plots
Comment
There are four plots: two distribution bar plots and two scatter plots.
The first distribution plot is the number of reads in each library (from a single cell).
The second plot is the distribution of feature counts per cell. Feature counts in this case refers to the number of genes expressed in each cell.
The third plot, Scatterplot of reads vs genes, combines the two bar plots: it plots both the read count and the expressed gene count for each cell.
The final scatter plot, % MT genes, plots the number of genes expressed versus the percentage of those genes that are mitochondrial genes.
Let’s look at each plot in turn and see what it tells us.
Read counts. You can see that there are a few cells that have fewer than ~200,000 reads, with other cells having up to one million reads. Although 200,000 reads is still quite a lot and we wouldn’t want to get rid of so much data, we might want to think about removing cells that fall below a lower cut-off (say, 100,000 reads).
Feature counts. Similar to the read counts plot, we see a few cells that have a very low number of expressed genes (<600), followed by a more even distribution.
Scatterplot of reads vs genes. This takes the information provided in the two distribution plots above and creates a scatterplot from them. The really poor-quality cells are represented by the points near the intersection of the x and y axes, that is, cells with both a low read count and a low gene count. These are the cells we want to remove during filtering.
% MT genes. You can see from the plot that there are a few cells outside the main “cloud” of datapoints. Some of these could be removed by filtering out cells with low feature counts, but others might need to be removed by mitochondrial content, such as the cell around 37.5%.
Filtering
Manual filtering
In the Scater: filter SCE Galaxy tool there are two filtering methods available. First, there’s an “automatic” filtering method that uses PCA to identify outlier cells and remove them from the data. This is particularly useful for very large datasets (hundreds of samples). Second, there’s a manual filtering method where users can set a range of filtering parameters, informed by the previous plotting tool.
Here, we’ll use the manual filtering method.
Hands On: Filtering with scater
Scater: filter SCE (Galaxy version 1.12.2) with the following parameters:
“Input SingleCellLoomExperiment dataset”: output_loom (output of the Scater: Calculate QC metrics tool)
“Type of filter”: manual
“Number of reads mapped to a gene for it to be counted as expressed”: 4.0
“Minimum library size (mapped reads) to filter cells on”: 100000
“Minimum number of expressed genes to filter cells on”: 500
“Maximum % of mitochondrial genes expressed per cell”: 35.0
Comment
Let’s have a look at the parameters and their values:
“Number of reads mapped to a gene for it to be counted as expressed”: by default, only one read needs to be mapped to a gene for it to be counted as “expressed”. We can be a little bit more stringent here and increase the number of reads that need to be mapped to a gene for it to be categorised as “expressed”.
“Minimum library size (mapped reads) to filter cells on”: the number of mapped reads a cell must have to be included in downstream analysis. We can see from our plots that a few cells have fewer than 200,000 reads. 200,000 can still be quite a lot of reads (depending on the experiment), so we start with a smaller value, 100,000, to see what the initial effect of filtering is.
“Minimum number of expressed genes”: You can see that some cells only express a few hundred genes, so we’ll remove these cells also.
“Maximum % of mitochondrial genes expressed per cell”: You can see from the % MT genes plot that there is one obvious outlier (~37%). Setting the maximum to 35% removes it.
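The four parameters above can be read as a single per-cell test. The sketch below shows one way to express that logic; it is not the code behind the Galaxy tool. The function name `passes_manual_filter`, the toy cells, and the choice of inclusive comparisons (`>=` and `<=`) are assumptions, and the mitochondrial percentage is computed on reads. The defaults mirror the values used in this tutorial, while the usage example scales the thresholds down to suit the tiny invented cells.

```python
def passes_manual_filter(
    cell_counts,
    mito_genes,
    detection_limit=4,
    min_library_size=100_000,
    min_expressed_genes=500,
    max_pct_mito=35.0,
):
    """Return True if a cell (gene -> read count) passes all manual thresholds."""
    total = sum(cell_counts.values())
    # A gene is "expressed" only if it reaches the detection limit.
    expressed = sum(1 for n in cell_counts.values() if n >= detection_limit)
    mito = sum(n for gene, n in cell_counts.items() if gene in mito_genes)
    # An empty library is treated as failing the mitochondrial check.
    pct_mito = 100.0 * mito / total if total else 100.0
    return (
        total >= min_library_size
        and expressed >= min_expressed_genes
        and pct_mito <= max_pct_mito
    )


# Toy cells with invented counts, filtered with scaled-down thresholds.
mito_genes = {"MT-X"}
healthy = {"GeneA": 50, "GeneB": 40, "MT-X": 10}
damaged = {"GeneA": 5, "GeneB": 1, "MT-X": 44}
toy_limits = {"min_library_size": 50, "min_expressed_genes": 2}

print(passes_manual_filter(healthy, mito_genes, **toy_limits))
print(passes_manual_filter(damaged, mito_genes, **toy_limits))
```

The damaged cell here reaches the library size and gene count limits but fails on mitochondrial content, which is the situation described for the ~37% outlier.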
Hands On: Plot library QC after filtering
Scater: plot library QC (Galaxy version 1.12.2) with the following parameters:
“Input SingleCellLoomExperiment dataset”: output_loom (output of the Scater: filter SCE tool)
How did the filtering go? Do you think it’s done a good job? Have you removed too many cells? Too few cells? About right?
Often, it’s a matter of trial and error, where you start off by being quite lenient (low parameter values) and then increase the stringency until you’re happy with the results. Using the initial Calculate QC metrics file, play around with the filtering parameters and visualise the output to see the effect different parameters have.
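The trial-and-error loop can also be sketched outside Galaxy: apply several sets of thresholds, from lenient to strict, to the same per-cell metrics and compare how many cells survive. Everything in this example is hypothetical, including the four cells, their metric values and the helper name `retained`; only the middle setting corresponds to the values used in this tutorial.

```python
# Invented per-cell QC metrics for four cells.
cells = [
    {"name": "Cell_001", "total_counts": 850_000, "expressed_genes": 4200, "pct_mito": 6.0},
    {"name": "Cell_002", "total_counts": 150_000, "expressed_genes": 900, "pct_mito": 12.0},
    {"name": "Cell_003", "total_counts": 60_000, "expressed_genes": 400, "pct_mito": 20.0},
    {"name": "Cell_004", "total_counts": 500_000, "expressed_genes": 3100, "pct_mito": 37.5},
]


def retained(cells, min_library_size, min_expressed_genes, max_pct_mito):
    """Names of the cells that pass all three thresholds."""
    return [
        c["name"]
        for c in cells
        if c["total_counts"] >= min_library_size
        and c["expressed_genes"] >= min_expressed_genes
        and c["pct_mito"] <= max_pct_mito
    ]


settings = [
    (50_000, 300, 40.0),    # lenient
    (100_000, 500, 35.0),   # the values used in this tutorial
    (200_000, 1000, 35.0),  # stricter
]
for setting in settings:
    kept = retained(cells, *setting)
    print(setting, len(kept), kept)
```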
Automatic (PCA) filtering
Another filtering approach is to identify outliers in the data and remove them. PCA can be run once a SingleCellExperiment object has been normalised, and outlier cells are then identified based on the pre-computed quality control metrics within the SingleCellExperiment object.
As we are using a rather small test dataset, it’s unlikely that PCA filtering will make any difference; for a larger, noisier dataset this is what we would perform instead:
Hands On: Automatic filtering with scater
Scater: filter SCE (Galaxy version 1.12.2) with the following parameters:
“Input SingleCellLoomExperiment dataset”: output_loom (output of the Scater: Calculate QC metrics tool)
“Type of filter”: automatic
Comment
The data will be normalised and PCA will then be run on it using the following information from the data:
pct_counts_top_100_features
total_features_by_counts
pct_counts_feature_control
total_features_feature_control
log10_total_counts_endogenous
log10_total_counts_feature_control
When using these filtering approaches, it is sometimes good to try running them in reverse - try PCA filtering first and then if that doesn’t remove enough low-quality cells then use the manual filtering. We could actually pipeline them together - use PCA filtering first and then use the output of that to do further manual filtering.
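The general idea of PCA-based outlier detection on QC metrics can be sketched with NumPy as follows. This is not the method used by the Galaxy tool, whose outlier rule is not described in this tutorial: the median/MAD cut-off below is a swapped-in assumption, the function name `pca_outliers` is hypothetical, and the toy data use only three invented metric columns rather than the six listed above.

```python
import numpy as np


def pca_outliers(metrics, n_components=2, cutoff=3.0):
    """Flag cells that lie unusually far from the centre of PCA space.

    metrics: array of shape (n_cells, n_metrics) holding QC metrics.
    Returns a boolean array, True for cells flagged as outliers.
    """
    x = np.asarray(metrics, dtype=float)
    # Standardise each metric so that no single one dominates the PCA.
    std = x.std(axis=0)
    std[std == 0] = 1.0
    z = (x - x.mean(axis=0)) / std
    # PCA via singular value decomposition of the standardised matrix.
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    # Robust rule (an assumption): distance from the origin of PCA space,
    # compared with the median distance and scaled by the MAD.
    dist = np.sqrt((scores ** 2).sum(axis=1))
    med = np.median(dist)
    mad = np.median(np.abs(dist - med))
    if mad == 0:
        return np.zeros(len(x), dtype=bool)
    return (dist - med) / (1.4826 * mad) > cutoff


# Twenty invented "good" cells plus one clearly aberrant cell in the last row.
rng = np.random.default_rng(0)
good = rng.normal(loc=[30.0, 4000.0, 5.0], scale=[3.0, 300.0, 1.0], size=(20, 3))
bad = np.array([[90.0, 300.0, 40.0]])
flags = pca_outliers(np.vstack([good, bad]))
print(flags[-1], int(flags.sum()))
```

Chaining the two approaches, as suggested above, would amount to dropping the flagged cells first and then applying manual thresholds to the remainder.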
As discussed previously, technical artefacts can bias scRNA-seq analyses. Strong batch effects can mask real biological differences in the data, so must be identified and removed from the data. Logging meta-data details such as date of library construction, sequencing batch, sample name, technical replicate, plate number, etc., is essential to identify batch effects in the data. We can use this information to visualise the data to examine it for clustering according to batch, rather than any real biological feature.
Hands On: PCA plot
Scater: PCA plot (Galaxy version 1.12.2) with the following parameters:
“Input SingleCellLoomExperiment dataset”: output_loom (output of the Scater: filter SCE tool)
Do any of the categories suggest some sort of technical artefact?
We can see that the S and G2M categories in Cell_Cycle cluster away from the rest of the data. Spend some time thinking about whether this might be a batch effect or biologically significant.
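One simple way to put numbers on what the PCA plot shows is to average a principal component per annotation category and see whether the groups sit apart. The sketch below does this with invented PC1 coordinates; the `G1` label, the values and the helper name `group_means` are assumptions for illustration only. A separation like this says nothing by itself about whether the cause is technical or biological.

```python
from collections import defaultdict
from statistics import mean

# Invented PC1 coordinates and Cell_Cycle labels for six cells.
pc1 = [-2.1, -1.8, -2.4, 1.9, 2.2, 2.3]
cell_cycle = ["G1", "G1", "G1", "S", "G2M", "S"]


def group_means(values, labels):
    """Mean of the values within each label."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)
    return {label: mean(vals) for label, vals in groups.items()}


means = group_means(pc1, cell_cycle)
for label, value in sorted(means.items()):
    print(label, round(value, 2))
```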
Conclusion
We have gone through the process of filtering low-quality data from an scRNA-seq expression matrix, using the visualise-filter-visualise paradigm, which proves to be a very effective way of quality-controlling scRNA-seq data. Cells that have low read-coverage, low expression values, or high mitochondrial gene expression have been filtered out. We have then examined ways of looking at confounding factors to examine batch effects in our data.
The workflow available from the “Supporting Materials” of this tutorial can be directly imported and used or adapted to a specific analysis.
Further information, including links to documentation and original publications, regarding the tools, analysis techniques and the interpretation of results described in this tutorial can be found here.
References
McCarthy, D. J., K. R. Campbell, A. T. L. Lun, and Q. F. Wills, 2017 Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R. Bioinformatics 33: 1179–1186. 10.1093/bioinformatics/btw777
Etherington, G. J., N. Soranzo, S. Mohammed, W. Haerty, R. P. Davey et al., 2019 A Galaxy-based training resource for single-cell RNA-sequencing quality control and analyses. GigaScience 8: giz144.
Hiltemann, Saskia, Rasche, Helena et al., 2023 Galaxy Training: A Powerful Framework for Teaching! PLOS Computational Biology 10.1371/journal.pcbi.1010752
Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012
@misc{single-cell-scrna-scater-qc,
author = "Graham Etherington and Nicola Soranzo",
title = "Single-cell quality control with scater (Galaxy Training Materials)",
year = "",
month = "",
day = "",
url = "\url{https://training.galaxyproject.org/training-material/topics/single-cell/tutorials/scrna-scater-qc/tutorial.html}",
note = "[Online; accessed TODAY]"
}
@article{Hiltemann_2023,
doi = {10.1371/journal.pcbi.1010752},
url = {https://doi.org/10.1371%2Fjournal.pcbi.1010752},
year = 2023,
month = {jan},
publisher = {Public Library of Science ({PLoS})},
volume = {19},
number = {1},
pages = {e1010752},
author = {Saskia Hiltemann and Helena Rasche and Simon Gladman and Hans-Rudolf Hotz and Delphine Larivi{\`{e}}re and Daniel Blankenberg and Pratik D. Jagtap and Thomas Wollmann and Anthony Bretaudeau and Nadia Gou{\'{e}} and Timothy J. Griffin and Coline Royaux and Yvan Le Bras and Subina Mehta and Anna Syme and Frederik Coppens and Bert Droesbeke and Nicola Soranzo and Wendi Bacon and Fotis Psomopoulos and Crist{\'{o}}bal Gallardo-Alba and John Davis and Melanie Christine Föll and Matthias Fahrner and Maria A. Doyle and Beatriz Serrano-Solano and Anne Claire Fouilloux and Peter van Heusden and Wolfgang Maier and Dave Clements and Florian Heyl and Björn Grüning and B{\'{e}}r{\'{e}}nice Batut},
editor = {Francis Ouellette},
title = {Galaxy Training: A powerful framework for teaching!},
journal = {PLoS Comput Biol}
}
Congratulations on successfully completing this tutorial!