ERGA post-assembly QC

Author(s)	Fabian Recktenwald Cristóbal Gallardo Tom Brown Delphine Lariviere
Reviewers

Overview
Questions:

What combination of tools can assess the quality of a genome assembly?

What metrics can help to analyse the quality of an assembly?

How do we evaluate the outputs?

Objectives:

Apply the post-assembly-QC-workflow using the necessary tools

Analyse and evaluate the results of the workflow

Requirements:

Introduction to Galaxy Analyses

slides Slides: Quality Control

tutorial Hands-on: Quality Control

Time estimation: 3 hours

Supporting Materials:

Workflows

FAQs

video Recordings

video Tutorial (October 2024) - 44m

video View All

instances Available on these Galaxies

Known Working

UseGalaxy.eu ✅ ⭐️

Possibly Working

UseGalaxy.org (Main)

UseGalaxy.org.au

Published: Jun 23, 2023

Last modification: Oct 8, 2024

License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT

purl PURL: https://gxy.io/GTN:T00355

rating Rating: 1.0 (1 recent ratings, 2 all time)

version Revision: 6

The European Reference Genome Atlas (ERGA, Mazzoni et al. 2023) is a large-scale network of researchers aiming to generate high-quality reference genomes for all eukaryotic life in Europe and build capacity to allow researchers anywhere to generate reference genomes and use them to answer questions regarding species conservation and biodiversity. ERGA uses state-of-the-art sequencing technologies and advanced bioinformatics tools to produce high-quality genome assemblies.

Reference genomes provide a baseline for understanding genetic diversity within and among populations, and can be used to identify populations at risk of genetic erosion. This information is crucial for developing effective conservation strategies and management plans for threatened and endangered species (Shafer et al. 2015). Additionally, by better understanding the genetic basis of important traits, such as disease resistance and adaptation to changing environments, researchers can develop targeted interventions to mitigate the effects of environmental change and prevent the loss of genetic diversity (FRANKHAM et al. 2011). The ERGA project has the potential to greatly benefit biodiversity conservation efforts and advance our understanding of the genetic basis of biodiversity.

Genome post-assembly quality control is a crucial step for evaluating the accuracy and completeness of newly assembled genomes. This involves assessing the contiguity, completeness, accuracy, and consistency of the genome assembly using various bioinformatic tools and methods (Koren et al. 2017, Hunt et al. 2015, Mikheenko et al. 2015, Vaser et al. 2017, Zimin et al. 2017). The QC process aims to ensure that genomic data is reliable and useful for downstream analyses such as annotation, comparative genomics, and functional studies.

In this tutorial you will learn how to implement the ERGA Genome Assembly QC pipeline, and how to interpretate the potential outcomes.

Agenda

In this tutorial, we will cover:

Study cases

Get data

What data have we imported?

Genome assembly overview with BlobToolKit

Generate read coverage data with Minimap2

Generate sequence similarity data with DIAMOND

Generate BUSCO report

Generate interactive plots with BlobToolKit

K-mer based genome profiling and evaluation

Generating k-mer profile with Meryl

K-mer profile analysis with GenomeScope

K-mer based assembly evaluation with Merqury

Providing assembly statistics with gfastats

Hi-C scaffolding

Pre-processing Hi-C data

Generate Hi-C contact map

Assembly graph

Conclusion

Study cases

In this tutorial we will evaluate three genome assemblies, belonging to three different taxonomic groups, in order to illustrate the different scenarios that can be identified. The characteristics of each of them are described briefly below.

Case 1: Chondrosia reniformis

Chondrosia reniformis is a slow-growing marine sponge cosmopolitan species which can be found in the Mediterranean Sea and the eastern Atlantic Ocean in shallow waters; it is considered to playing an important ecological role in the marine ecosystem by filtering large volumes of water and providing habitat for other species (Voultsiadou 2005). The members of this species are gonochoristic and oviparous, whose physiology and behavious seems to highly influence the presence of endosymbiosis heterotrophic bacteria (Sara Michele and Carlo 1998).

Figure 1: Chrondosia reniformis. — **Figure 1**: Example *Chrondosia reniformis* sponge.

The assembly is based on 70x PacBio HiFi and Arima2 Hi-C data generated by the Aquatic Symbiosis Genomics Project. The assembly process included the following sequence of steps: initial PacBio assembly generation with Hifiasm, retained haplotig separation with purge_dups, and Hi-C based scaffolding with YaHS. The mitochondrial genome was assembled using MitoHiFi. Finally, the primary assembly was analysed and manually improved using gEVAL.

Case 2: Erythrolamprus reginae

Erythrolamprus reginae is a species of colubrid snake found in South America. This species has been reported to include triploid individuals with parthenogenic reproduction; this type of seems to be associated with higher mutation rates and more tandem duplications in the genome (Bogart 1980).

Figure 2: Erythrolamprus reginae. — **Figure 2**: Example *Erythrolamprus reginae* snake.

The assembly used in this tutorial correspond to the curated primary assembly generated by the VGP project, based on 34x Pacbio HiFi and Arima2 Hi-C data, by using the VGP assembly pipeline.

Case 3: Eschrichtius robustus

Eschrichtius robustus, commonly known as the gray whale, is a species of whale found primarily in the North Pacific Ocean. Adult gray whales can reach lengths of up to 14.9 meters and weights of up to 36,000 kilograms. It is a diploid species, genetically characterized by high homozygosity (Brüniche-Olsen et al. 2018).

Figure 3: Eschrichtius robustus. — **Figure 3**: Example *Eschrichtius robustus* whale.

The assembly used in this tutorial correspond to the curated primary assembly generated by the VGP project, based on 29x Pacbio HiFi data, by using the VGP assembly pipeline.

Get data

As a first step we will get the data from Zenodo.

Hands On: Upload data

Create a new history for allocating temporally all the datasets. This history can be named as Datasets history

Import the Pacbio HiFi files from Zenodo:

Open the file galaxy-upload upload menu
Click on Rule-based tab
“Upload type”: Collections

Copy the tabular data, paste it into the textbox and press Build

dataset_01   https://zenodo.org/record/7786773/files/hifi.fastq.gz   fastqsanger.gz    HiFi    CReniformis_Pacbio
dataset_01   https://zenodo.org/record/7784764/files/m64055_210602_211608.hifi_reads.fastq.gz   fastqsanger.gz    HiFi    EReginae_PacBio
dataset_02   https://zenodo.org/record/7784764/files/m54306Ue_211020_191957.hifi_reads.fastq.gz   fastqsanger.gz    HiFi    EReginae_PacBio
dataset_03   https://zenodo.org/record/7784764/files/m54306Ue_211016_070813.hifi_reads.fastq.gz   fastqsanger.gz    HiFi    EReginae_PacBio
dataset_04   https://zenodo.org/record/7786773/files/m54306U_210529_225553.hifi_reads.fastq.gz   fastqsanger.gz    HiFi    EReginae_PacBio
dataset_01   https://zenodo.org/record/7781236/files/m54306Ue_220411_220734.demultiplex.bc1010--bc1010.hifi_reads.fastq.gz   fastqsanger.gz    HiFi    ERobustus_PacBio
dataset_02   https://zenodo.org/record/7781236/files/m54306Ue_220520_051743.demultiplex.bc1010--bc1010.hifi_reads.fastq.gz   fastqsanger.gz    HiFi    ERobustus_PacBio
dataset_03   https://zenodo.org/record/7781236/files/m64055e_220615_033108.demultiplex.bc1010--bc1010.hifi_reads.fastq.gz   fastqsanger.gz    HiFi    ERobustus_PacBio
dataset_04   https://zenodo.org/record/7786773/files/m64055e_220603_182128.demultiplex.bc1010--bc1010.hifi_reads.fastq.gz   fastqsanger.gz    HiFi    ERobustus_PacBio

Click on Build
From Rules menu select Add / Modify Column Definitions
- Click Add Definition button and select List Identifier(s): column A
- Click Add Definition button and select URL: column B
- Click Add Definition button and select Type: column C
- Click Add Definition button and select Group Tag: column D
- Click Add Definition button and select Collection Name: column E
Click Apply and press Upload

Import the remainig datasets from Zenodo:

Open the file galaxy-upload upload menu
Click on Rule-based tab
“Upload type”: Datasets

Copy the tabular data, paste it into the textbox and press Build

CReformitis_assembly    https://zenodo.org/record/7831298/files/assembly_sponge.fasta.gz    fasta.gz    assembly
CReformitis_metadata    https://zenodo.org/record/7781236/files/metadata_chon.yaml  yaml    metadata
CReniformis_Hi-C_F   https://zenodo.org/record/7786773/files/hiC_1.fastq.gz   fastqsanger.gz    Hi-C
CReniformis_Hi-C_R   https://zenodo.org/record/7786773/files/hiC_2.fastq.gz   fastqsanger.gz    Hi-C
EReginae_Hi-C_F_01   https://zenodo.org/record/7831762/files/rEryReg1_Royal_Ground_Snake_R1.fastq.gz.gz.001   fastqsanger.gz    Hi-C
EReginae_Hi-C_F_02   https://zenodo.org/record/7831762/files/rEryReg1_Royal_Ground_Snake_R1.fastq.gz.gz.002   fastqsanger.gz    Hi-C
EReginae_Hi-C_F_03   https://zenodo.org/record/7831763/files/rEryReg1_Royal_Ground_Snake_R1.fastq.gz.gz.003   fastqsanger.gz    Hi-C
EReginae_Hi-C_R_01   https://zenodo.org/record/7833514/files/rEryReg1_Royal_Ground_Snake_R2.fastq.gz.gz.001   fastqsanger.gz    Hi-C
EReginae_Hi-C_R_02   https://zenodo.org/record/7833514/files/rEryReg1_Royal_Ground_Snake_R2.fastq.gz.gz.002   fastqsanger.gz    Hi-C
EReginae_Hi-C_R_03   https://zenodo.org/record/7831763/files/rEryReg1_Royal_Ground_Snake_R2.fastq.gz.gz.003   fastqsanger.gz    Hi-C
EReginae_assembly    https://zenodo.org/record/7788734/files/rEryReg1.pri.cur.20230105.fasta.gz    fasta.gz    assembly
EReginae_metadata    https://zenodo.org/record/7781236/files/metadata_eryth.yaml  yaml    metadata
ERobustus_assembly  https://zenodo.org/record/7781236/files/mEscRob2.pri.cur.20221201.fasta.gz  fasta.gz    assembly
ERobustus_metadata  https://zenodo.org/record/7781236/files/metadata_esch.yaml  yaml    metadata
Taxonomy_data   https://zenodo.org/record/7781236/files/new_taxdump.tar.gz  gz  taxonomy
Diamond_db   https://zenodo.org/record/7788734/files/Diamond_makedb.dmnd  dmnd  database

From Rules menu select Add / Modify Column Definitions
- Click Add Definition button and select Name: column A
- Click Add Definition button and select URL: column B
- Click Add Definition button and select Type: column C
- Click Add Definition button and select Name Tag: column D
Click Apply and press Upload

What data have we imported?

For each species, we have uploaded the fasta file corresponding to the assembly, the raw sequencing data used to generate each assembly - in these cases PacBio HiFi longreads and Illumina Hi-C reads - and databases necessary to run the QC tools included in this tutorial. We will use the taxonomy database new_taxdump.tar.gz and the diamond database of protein sequences from the ncbi nt database to identify any sequence from contaminants or symbiont which are in our assembly using BlobToolKit.

Once we have imported all the datasets, we will move each one to its correspondent history.

Hands On: Upload data

Create three new empty histories, one for each specie.

Rename the histories as Case 1: *Chondrosia reniformis*, Case 2: *Erythrolamprus reginae* and Case 3: *Eschrichtius robustus*.

Click in History options and select Show Histories Side-by-Side

Click in Select histories, and include the histories corresponding to the three species.

Drag and drop to move the datasets to their correspondent histories.

Comment: Non-unique datasets

Both the Taxonomic_data and the Diamond_database should be included in all of them.

Click on galaxy-pencil (Edit) next to the history name (which by default is “Unnamed history”)

Type the new name

Click on Save

To cancel renaming, click the galaxy-undo “Cancel” button

If you do not have the galaxy-pencil (Edit) next to the history name (which can be the case if you are using an older version of Galaxy) do the following:

Click on Unnamed history (or the current name of the history) (Click to rename history) at the top of your history panel

Type the new name

Press Enter

Once all the datasets have been copied to their correspondent history, we should obtain something similar to this:

Figure 4: Distribution of data. — **Figure 4**: Histories corresponding to the three cases of study.

Genome assembly overview with BlobToolKit

BlobToolKit is a tool designed to assist researchers in analyzing and visualizing genome assembly data. The tool uses information from multiple data sources such as read coverage, sequence length, and taxonomic annotations to generate a comprehensive overview of genome assembly data (Challis et al. 2020). One of the key characteristics of BlobToolKit is its ability to provide with a user-friendly interactive interface for analyzing complex genome assembly data.

In this tutorial, we will use BlobToolKit in order to integrate the following data:

Read coverage data: BlobToolKit can use read coverage information to identify potential errors or gaps in the genome assembly. By comparing the depth of coverage across the genome, it can highlight regions that may be overrepresented or underrepresented, which can indicate potential issues with the assembly.
Taxonomic annotations: BlobToolKit can use taxonomic information to identify potential contaminants or foreign DNA in the genome assembly. It does this by comparing the taxonomic profile of the genome assembly to a reference database of known organisms.
Sequence similarity data: Sequence similarity data can be used to identify potential misassemblies or contaminants in the genome assembly. BlobToolKit can use BLAST/DIAMOND searches to compare the genome assembly to reference databases and identify regions that may be problematic.
BUSCO reports: BlobToolKit can use BUSCO data to provide additional information about the quality of a genome assembly. It can generate plots of the number of complete and partial BUSCO genes in the genome assembly, as well as the number of missing and fragmented genes.

Comment: Why should we evaluate contaminants?

A significant proportion of the genome sequences in both GenBank and RefSeq (0.54% and 0.34% of entries, respectively) include sequences from contaminants; the contamination primarily exists in the form of short contigs, flanking regions on longer contigs, or areas of larger scaffolds that are flanked by Ns, although a few longer sequences with contamination were also detected (Steinegger and Salzberg 2020).

In the next steps, we will generate the data required for generating the visualization plots with BlobToolKit.

Generate read coverage data with Minimap2

Read coverage is an essential metric for evaluating the quality of genome assemblies, and it provides valuable information for identifying regions of high and low quality, detecting misassemblies, and identifying potential contaminants. Thus, for example, unexpected regions of low coverage suggests potential errors, such as misassemblies, gaps, or low complexity regions (Koren et al. 2017).

In this tutorial we will use Minimap2 for generation the coverage data. Minimap2 is a multi-purpose aligner which is particularly efficient at aligning long-reads produced by sequencing machines from PacBio or Oxford Nanopore (Li 2018).

Comment: How is coverage information encoded in the BAM file?

Coverage information is encoded in the BAM file format through the number of reads that align to each position in the reference genome. Specifically, the BAM file stores each read’s start and end position in the genome, as well as the length and orientation of the read, among other information. To calculate coverage, the number of reads that overlap each position in the reference genome is counted.

Hands On: Generate BAM file with Minimap2

Collapse Collection ( Galaxy version 5.1.0) with the following parameters:

param-collection “Collection of files to collapse into single dataset”: CReniformis_Pacbio

Map with minimap2 ( Galaxy version 2.28+galaxy0) with the following parameters:

“Will you select a reference genome from your history or use a built-in index?”: Use a genome from history and build index

param-file “Select the reference genome”: CReformitis_assembly

“Single or Paired-end reads”: Single

param-file “FASTA/Q file”: output of Collapse Collection tool

“Select a profile of preset options”: PacBio HiFi reads vs reference mapping (-k19 -w19 -U50,500 -g10k -A1 -B4 -O6,26 -E2,1 -s200 ) (map-hifi)

Repeat the prevous steps with the datasets from the two remanining species.

Generate sequence similarity data with DIAMOND

DIAMOND is a sequence alignment tool that utilizes a more efficient algorithm compared to BLAST, allowing for much faster searches of large sequence databases. Specifically, DIAMOND uses a sensitive seed-extension approach that compares a set of small segments (seeds) from the query sequence to a database, and then extends the alignments based on the highest-scoring hits. This approach allows DIAMOND to perform up to 20,000 times faster than BLAST, with comparable or improved sensitivity and accuracy (Buchfink et al. 2014).

Hands On: Task description

Diamond ( Galaxy version 2.0.15+galaxy0) with the following parameters:

“Alignment mode”: DNA query sequences (blastx)

“Allow for frameshifts?”: yes

“restrict hit culling to overlapping query ranges”: Yes

“frame shift penalty”: 15

param-file “Input query file in FASTA or FASTQ format”: CReformitis_assembly

“Will you select a reference database from your history or use a built-in index?”: Use one from the history

param-file “Select the reference database”: Diamond_db

“Restrict search taxonomically?”: No

“Sensitivity Mode”: Fast (--fast)

“Block size in billions of sequence letters to be processed at a time”: 10.0

“Method to filter?”: Maximum e-value to report alignments

“Method to restrict the number of hits?”: Maximum number of target sequences

In “Output options”:

“Format of output file”: BLAST tabular

“Tabular fields”: Query Seq-id,Subject Seq -id, Start of alignment in query, End of alignment in query, Expected value, Bit score and Unique Subject Taxonomy ID(s), separated by a ',' (in numerical order)

Repeat the prevous steps with the datasets from the two remanining species.

Generate BUSCO report

BUSCO (Benchmarking Universal Single-Copy Orthologs) is a tool used for quantitative assessment of genome assembly, gene set, and transcriptome completeness. It is based on evolutionarily informed expectations of gene content derived from near-universal single-copy orthologs (Simão et al. 2015)

Comment: Orthologs

Orthologs are genes in different species which have usually the same function and have evolved from a common ancestral gene. They are important for new genome assemblies in order to predict gene functions and help with gene annotation (Gennarelli and Cattaneo 2010).

Hands On: Estimate single copy gene representation completeness

Busco ( Galaxy version 5.7.1+galaxy0) with the following parameters:

param-file “Sequences to analyse”: CReformitis_assembly

“Lineage data source”: Use cached lineage Data

“Cached database with lineage”: Busco v5 Lineage Datasets

“Mode”: Genome assemblies (DNA)

“Select a gene predictor”: Metaeuk

“Auto-detect or select lineage?”: Auto-detect (For Erythrolamprus reginae, select Select lineage then Vertebrata in “Lineage”)

“auto-lineage group”: All taxonomic groups (--auto-lineage)

“Which outputs should be generated”: Short summary text

BUSCO sets represent 3,023 genes for vertebrates, 2,675 for arthropods, 843 for metazoans, 1,438 for fungi and 429 for eukaryotes. An intuitive metric is provided in BUSCO notation - C:complete[S:single, D:duplicated], F:fragmented, M:missing, n:number of genes used.

Repeat the prevous steps with the datasets from the two remanining species.

Question

What are the values of complete BUSCO genes for each species?

Eschrichtius robustus: 96.0%

Erythrolamprus reginae: 91.3%

Chondrosia reniformis: 82.8%

Generate interactive plots with BlobToolKit

BlobToolKit is a tool designed to assist researchers in analyzing and visualizing genome assembly data. The tool uses information from multiple data sources such as read coverage, gene expression, and taxonomic annotations to generate a comprehensive overview of genome assembly data (Challis et al. 2020). One of the key characteristics of BlobToolKit is its ability to provide with a user-friendly interactive interface for analyzing complex genome assembly data.

To work with BlobToolKit we need to create a new dataset structure called BlobDir. Therefore the minimum requirement is a fasta file which contains the sequence of our assembly. A list of sequence identifiers and some statistics like length, GC proportion and undefined bases will then be generated.

To get a more meaningful analysis and therefore more useful information about our assembly, it is better to provide as much data as we can get. In our case we will also provide a metadata file if possible, NCBI taxonomy ID and the NCBI taxdump directory (Challis et al. 2020).

Hands On: Creating the BlobDir dataset

BlobToolKit ( Galaxy version 3.4.0+galaxy0) with the following parameters:

“Select mode”: Create a BlobToolKit dataset

param-file “Genome assembly file”: CReformitis_assembly

param-file “Metadata file”: CReformitis_metadata

“NCBI taxonomy ID”: 121349

param-file “NCBI taxdump directory”: Taxonomy_Data

BlobToolKit ( Galaxy version 3.4.0+galaxy0) with the following parameters:

“Select mode”: Add data to a BlobToolKit dataset

param-file “Blobdir.tgz file”: output of BlobToolKit tool

param-file “BUSCO full table file”: Full table (output of Busco tool)

“BLAST/Diamond hits”: Enabled

“BLAST/Diamond hits dataset”: output of Diamond tool

“BLAST/Diamond file column order”: 1=qseqid,7=staxids,6=bitscore,2=sseqid,3=qstart,4=qend,5=evalue

param-file “BAM/SAM/CRAM read alignment file”: output of Minimap2 tool

Interactive BlobToolKit with the following parameters:

param-file “Blobdir file”: output of BlobToolKit tool

Click on the galaxy-eye (eye) icon and inspect the output of Interactive BlobToolKit.

Repeat the prevous steps with the datasets from the two remanining species.

Now, we will evaluate three of plots that we can find in the Blobtoolkit interactve interface. First, we will start with the snailplot, which provides a holistic view of the assembly.

Figure 5: BlobToolKit snail plot. — **Figure 5**: Chondrosia reniformis snail plot summary. The grey line represents the assembly scaffolds, where the distance to the center of the circle indicates the length of the scaffolds. A scale line in the center of the circle helps to gauge the length. Longest contig: The red line in the plot indicates the longest contig in the assembly. N50 and N90: The dark orange and light orange lines represent the scaffolds contained in the N50 and N90 metrics, respectively. GC/AT content: The outer light and dark blue lines represent the GC and AT content, respectively. In an ideal assembly, the line between the two colors should be consistent, without much fluctuation. Gaps: Gaps in the assembly, if present, are denoted by the percent N in the bottom right of the image.

The main plot is divided into 1,000 size-ordered bins around the circumference with each bin representing 0.1% of the 117,390,217 bp assembly. The distribution of sequence lengths is shown in dark grey with the plot radius scaled to the longest sequence present in the assembly (10,413,042 bp, shown in red).

Orange and pale-orange arcs show the N50 and N90 sequence lengths (8,459,200 and 6,903,244 bp), respectively. The pale grey spiral shows the cumulative sequence count on a log scale with white scale lines showing successive orders of magnitude. The blue and pale-blue area around the outside of the plot shows the distribution of GC, AT and N percentages in the same bins as the inner plot. A summary of complete, fragmented, duplicated and missing BUSCO genes in the eukaryota_odb10 set is shown in the top right.

Question: Snailplot question

Which differences do you appreciate between the snailplots corresponding to Erythrolamprus reginae and Eschrichtius robustus?

Open image in new tab

Figure 6: Comparison between snailplots from E. reginae (A) and E. robustus (B).

(A) The snake (Erythrolamprus reginae) does have a relatively large proportion of the genome assembled into some few and long scaffolds (reflected in the high “longest scaffold” value and the high “Log10 scaffold count” value).

(B) The whale (Eschrichtius robustus) does have a relatively large proportion of the genome assembled into fragmented scaffolds (reflected in the low N90 value compared to the N50 value and the high “Log10 scaffold count” value)

Now, we are going to analyze the blob plot corresponding to Erythrolamprus reginae, which is especially useful for gaining insights into the composition of your genomic data and identify potential contaminants or endosymbionts.

Figure 7: BlobToolKit circle plot. — **Figure 7**: Blob plot of base coverage in input against GC proportion for sequences in *E. reginate assembly*. Sequences are coloured by phylum. Circles are sized in proportion to sequence length . Histograms show the distribution of sequence length sum along each axis

The blob plot is a two-dimensional scatter plot that helps in visualizing and analyzing genomic data for quality control, contaminant detection, and filtering. In the circle plot, each sequence is represented by a circle, with its diameter proportional to the sequence length. Circles are colored based on their taxonomic affiliation, and their positions on the X and Y axes are determined by their GC content and coverage, respectively. GC content is the proportion of G and C bases in the sequence, which can differ substantially between genomes. Coverage, on the other hand, is a measure of the number of times a particular sequence has been read during the sequencing process. The plot also includes coverage and GC histograms for each taxonomic group, weighted by the total span (cumulative length) of sequences occupying each bin.

Comment: Pre-curated assembly evaluation

Usually, the evaluation of the pre-curated assembly yield quite different results when compared with the post-curation evaluation. A frequent difference is the abudance of contigs that belong to different organisms, as a result of contamination in the reads. It can be clearly appreciated in the next figure, which correspond to the E. reginae pre-curated assembly. A striking element is the presence of 38 contigs corresponding to the phylum Artverviricota (dark green), a group of viruses that encode a reverse transcriptase.

Open image in new tab

Figure 8: Blob plot of base coverage in input against GC proportion for sequences in pre-curated E. reginate assembly. Sequences are coloured by phylum. Circles are sized in proportion to sequence length . Histograms show the distribution of sequence length sum along each axis

In addition, as we can appreciate, the samples included contamination from fungi, arthropoda and bacteria.

Question: Blob plot question

Is there significant contamination in the Eschrichtius robustus and Chondrosia reniformis assemblies?

Open image in new tab

Figure 9: Comparison between blob plots from E. robustus (A) and C. reniformis (B).

(A) The whale (Eschrichtius robustus) does have significant contamination. Approximately half of the assembled sequences of the whale does contain contaminants or cosymbionts (total count: 703, Chordata count: 316), however as the foreign sequences are smaller, they make up a small percentage of the total assembly.

(B) The sponge (Condrosia reniformis) doesn’t contain any foreign sequences or contaminatio (total count: 15, Chordata count: 15).

Finally, let’s have a look at the cumulative sequence length plot, which shows the curves for subsets of scaffolds assigned to each phylum relative to the overall assembly. It is useful for evaluating the contribution of the contigs from contaminated reads to the final assembly. In that case, first we will evaluate all sequences together, and then we will remove those correspoding to chordata or not classified (not-hit) in order to be able to evaluate in detail each of the contaminants. So, let’s start with the accumulative plot corresponding to all sequences (fig. 10).

In this kind of plot, the x-axis represent the number of contigs (sorted by species and length), and the y-axis correspond to the cumulative length in nucleotides. The gray line shows the cumulative length of all sequences. As we can appreciate, most sequences corespond to the taxa chordata (1.9Gb, distributed along 151 contigs). In addition, we can see that the final assembly includes 174 contigs corresponding to different taxa, or not classified at all. In order to be able to analyze the contribution of those sequences, we will filter the contigs corresponding to the chordata.

Figure 10: BlobToolKit cumulative plot. — **Figure 10**: Cumulative sequence length for assembly Blobdir. The grey line shows cumulative length for all sequences. Coloured lines show cumulative lengths of sequences assigned to each phylum using the bestsum taxrule. Plot axes are scaled to the filtered assembly.

Comment: How do I apply filters to a dataset?

Datasets can be filtered based on any category or variable using the “Filters” menu. To filter based on a category, click the tab above any bar of the preview histogram to hide/show records assigned to this category. To filter variables, use the sliders at either end of the preview histogram to change the maximum and minimum values or enter the numbers directly into the text boxes above the histogram.

By hidding the contings corresponding to chordata, we can have a detailed view of the contributions of each contaminant (fig. 11).

Figure 11: BlobToolKit cumulative plot. — **Figure 11**: The grey line shows cumulative length for all sequences. Coloured lines show cumulative lengths of sequences assigned to each phylum using the bestsum taxrule and are stacked by cumulative value on the x-axis to show the proportion of each phylum in the overall assembly. The assembly has been filtered to exclude sequences with phylum matches Chordata. Plot axes are scaled to the filtered assembly.

In figure 11 we can appreciate clearly the relative contribution of each contaminant with respect to the total contamination.

K-mer based genome profiling and evaluation

k-mers can be utilized to infer genome characteristics such as length and heterozygosity by analyzing their frequency distribution in raw sequencing reads (Manekar and Sathe 2018). In this tutorial we will make use of three different tools in order to perform the k-mer based genome profiling and assembly evaluation.

Comment: Genome profiling details

You can find more details about genome profiling in the VGP training.

Generating k-mer profile with Meryl

DNA is double stranded and normally only one strand is sequenced. For our assembly we want to consider the other strand as well. Therefore canonical k-mers are used in most counting tools, exactly like in Meryl. A full k-mer pair is a sequence and the reverse complement of the sequence (e.g. ATG/CAT). The canonical sequence of a k-mer pair is the lexicographically smaller of the two reverse complementary sequences. So if CAT appears it will be counted as ATG. (Clavijo)

Meryl is a powerful tool for counting k-mers in large-scale genomic datasets. It uses a sorting-based approach that sorts the k-mers in lexicographical order.

Hands On: Generating k-mer profile

Meryl ( Galaxy version 1.3+galaxy6) with the following parameters:

“Operation type selector”: Count operations

param-collection “Input sequences”: CReniformis_Pacbio

“K-mer size selector”: Set a k-mer size

“K-mer size”: 21

Comment: compute k

In general k can be computed as k=log4 (G(1-p)/p), with G as genome size and p as tolerable collision rate. In our case we set the k-mer size to 21.

Run Meryl ( Galaxy version 1.3+galaxy6) again with the following parameters:

“Operation type selector”: Operations on sets of k-mers

“Operations on sets of k-mers”: Union-sum: return k-mers that occur in any input, set the count to the sum of the counts

param-file “Input meryldb”: previous output of Meryl tool

Run Meryl ( Galaxy version 1.3+galaxy6) a third time with the following parameters:

“Operation type selector”: Generate histogram dataset

param-file “Input meryldb”: previous output of Meryl tool

Rename the output as Histogram dataset

K-mer profile analysis with GenomeScope

Genomescope is used for analysing genomes with the help of k-mer profile analysis. It estimates the overall genome characteristics and the overall read characteristics. The tool will use a given k-mer profile which is calculated only from raw reads sequencing data. It then generates a plot with the calculated data giving us information about the completeness and quality of the to be assembled data. (Vurture et al. 2017)

Hands On: Generate plots for analysis

GenomeScope ( Galaxy version 2.0+galaxy2) with the following parameters:

param-file “Input histogram file”: Histogram dataset

“Ploidy for model to use”: 2 for C. reniformis and E. robustus, 3 for E. reginae.

Comment: Plots

The generated plots will have the coverage on the x-axis and the frequency on the y-axis. It shows the fitted model in the observed k-mer data as well as the k-mer peaks and other information like some estimated parameters of the assembly.

Figure 12: Genomescope plot. — **Figure 12**: Genomescope 21-mer profile (k:21) of Chondrosia reniformis, diploid (p:2). The plot includes estimations about the genome length (len:124,167,659bp), unique sequences (uniq:61.5%), homozygous portion (aa: 98.7%), heterozygous portions (ab:1.32%), mean k-mer coverage for heterozygous bases (kcov:35.4), read error rate (err:0.752%) and average rate of read duplications (dup:0.378).

Figure 12 corresponds to the k-mer profile of the sponge (Chondrosia reniformis). Presumably, the large gene number in the sponge genome is due to regional gene duplication; so far evidence for a transposition in sponges has been presented. Data indicate that only 0.25 % of the total sponge genome comprises CA/TG microsatellites, and until now also no SINEs/transposable elements have been identified. The estimated genome size around 124,Mbp is relatively close to Chondrosia reniformis genome size (117,39Mbp).

Question: Genome profile question

Which differences can you appreciate between the genome profile of Erythrolamprus reginae and Eschrichtius robustus?

Open image in new tab

Figure 13: Genomescope genome profile. Erythrolamprus reginae (A). Eschrichtius robustus (B).

(A) The genome profile of the snake (Erythrolamprus reginae) suggests a high level of heterozygosity (reflected by the first peak at 11.1x coverage). 51.6% unique sequences have been estimated. In this case the low value is attributed to the parthenogenesis and the polyploidy of this snake which causes tandem duplications and higher DNA replication as a result of the triploid-diploid cycles.

(B) The genome profile of the whale (Eschrichtius robustus) suggests a high level of homozygosity (reflected by the second peak at 29x coverage). The homozygous k-mer distribution can be expected from the high grade of inbreeding.

K-mer based assembly evaluation with Merqury

Merqury is designed for evaluating the completeness and accuracy of long-read genome assemblies using short-read sequencing data. Thus the quality of assemblies which are generated by using third-generation sequencing technologies can be reviewed and assessed by the tool. Merqury works by comparing k-mers of an assembly to those from unassembled high-accuracy reads of the raw sequencing data. K-mer-based methods are also used to identify errors and missing sequences. (Rhie et al. 2020)

Hands On: Generating stats and plots

Merqury ( Galaxy version 1.3+galaxy2) with the following parameters:

“Evaluation mode”: Default mode

param-file “K-mer counts database”: read_db (output of Meryl tool)

“Number of assemblies”: One assembly (pseudo-haplotype or mixed-haplotype)

param-file “Genome assembly”: CReformitis_assembly

Comment: Output

Merqury will now generate following outputs:

stats with completeness statistics

QV stats with quality value statistics

plots

Let’s have a closer look at the copy number plots for each of the three species. Merqury will generate k-mers from the raw sequencing data (in the following called the ‘read set’) and will compare them to the assembly, in contrast to genomescope. Despite being diploid or triploid the species only got plotted with the primary assembly. This is why only one-copy k-mers get evaluated.

Figure 14: Merqury plot. — **Figure 14**: Merqury copy numbers (CN) plot of the whale (*Eschrichtius robustus*). The red area displays the k-mers of the assembly. The black area displays the k-mers only found in the read set. The black area can be indicative for sequencing error in the read set or missing sequences in the assembly.

The small black area indicates that most of the k-mers found in the read set are also found in the assembly (but not all since only the primary assembly got plotted). It reflects a highly homozygous assembly. This plot indicates a sequencing coverage at ~30x as seen before with genomescope.

Question: Genome profile question

Which differences can you appreciate between the genome profile of Chondrosia reniformis and Erythrolamprus reginae?

Open image in new tab

Figure 15: Merqury copy numbers (CN) plot of Chondrosia reniformis (A) and Erythrolamprus reginae (B). The red area displays the k-mers of the assembly. The black area displays the k-mers only found in the read set. The black area can be indicative for sequencing error in the read set or missing sequences in the assembly.

(A) The sponge (Chondrosia reniformis) does have two peaks as seen before with genomescope which is typical for a diploid species. The large black area indicates that there is a high amount of k-mers in the read set which is not used in the assembly since only the primary assembly got plotted. The secondary assembly is missing.

(B) The snake (Erythrolamprus reginae) does have three peaks as seen before with genomescope which is typical for a triploid species. The black area is even larger compared to the sponge since only the primary assembly got plotted. The secondary and third assembly are missing.

Providing assembly statistics with gfastats

gfastats is a tool for providing summary statistics and genome file manipulation. In our case it will generate genome assembly statistics in a tabular-format output. Metrics like N50/L50, GC-content and lengths of contigs, scaffolds and gaps as well as other statistical information are provided for assessing the contiguity of the assembly.

Hands On: Generate summary statistics

gfastats ( Galaxy version 1.2.0+galaxy0) with the following parameters:

param-file “Input file”: CReformitis_assembly

“Specify target sequences”: Disabled

“Tool mode”: Summary statistics generation

“Report mode”: Genome assembly statistics (--nstar-report)

Comment: N50/NG50

Consider taking all contigs and sorting them by size. Starting with the largest and ending with the smallest. Now add up the length of each contig beginning with the largest, then the second largest and so on. When reaching 50% of the total length of all contigs it’s done. The length of the contig you stopped is the N50 value. (Videvall)

NG50 or more general NG(X) is based on the same idea as N50. The difference is that in this case the whole genome size or estimated genome size is taken into account. Through this comparisons can be made over different assemblies and genome sizes.

Comment: L50

Remember adding up the length of each contig until reaching the 50%. The L50 value is the number of the contig you have stopped.

Example: The sum of all contigs together is 2000 kbp. The contig at 50% has length 300 kbp and is the third one and thus the third largest. Then N50 = 300 kbp and L50 = 3.

In the best case a high quality assembly should consist of just a few and large contigs to represent the genome as a whole. Therefore a good assembly should lead to a high N50 value and in contrast a low quality assembly with tiny, fragmented contigs would lead to a low N50 value. (Videvall)

The metric doesn’t only rely on measuring the 50% mark. The general case is N(X) where X ranges from 0 - 100 mostly in ten steps. However the NX metric is not suitable for comparing different species with different genome lengths.

The GC-content or guanine-cytosine ratio tells one about the occurrence of guanine and cytosine in a genome. It is stated in percent. The two nucleobases are held together by three hydrogen bonds. A high GC-content makes DNA more stable than a low GC-content. Because the ratio of most species and organisms has been found out by now, it is also a good metric to gauge completeness.(Wikipedia)

Another good metric for gauging contiguity is the CC ratio. The value is calculated by dividing contig counts by the chromosome pair number. A perfect score would be 1. Therefore the lower the value the better the contiguity of the assembly.

Hi-C scaffolding

To understand how chromosomes are arranged in their three-dimensional structure in the nucleus, high resolution and high throughput imaging techniques have been developed. Hi-C is a high throughput method to measure pairwise contacts between pairs of genomic loci. It is based on cross-linking the DNA in the nucleus and its histones (proteins around which the DNA is wrapped). However the DNA gets cut and marked and afterwards the cut parts get ligated together. Then the DNA gets fragmented and sequenced forwards and backwards generating paired reads. The resulting information can be used to assemble the reads to the corresponding chromosome. (Pal et al. 2018)

Pre-processing Hi-C data

The following analysis use Hi-C reads in single datasets: one for the forward reads and one for the reverse reads. If you have more than one fastq datasets, as we do for Erythrolamprus reginae you will need to merge them into single datasets. If you have only one pair of datasets, skip this step.

Hands On: Merge Erythrolamprus reginae Hi-C data into single datasets

Create a collection with the Hi-C forward reads name EReginae Hi-C forward reads.

Do the same with the reverse reads (R) and name it EReginae Hi-C reverse reads.

For species with more than one set of Hi-C reads, verify that the datasets are ordered the same way in both collections. If not, sort both collections with the tool Sort collection

Collapse Collection ( Galaxy version 5.1.0) with the following parameters:

param-collection “Collection of files to collapse into single dataset”: EReginae Hi-C forward Reads

Rename the dataset EReginae Hi-C forward Reads

Collapse Collection ( Galaxy version 5.1.0) with the following parameters:

param-collection “Collection of files to collapse into single dataset”: EReginae Hi-C reverse reads

Rename the dataset EReginae Hi-C reverse Reads

Click on galaxy-selector Select Items at the top of the history panel

Check the fastq.gz containing the Hi-C forward reads (F)

Click 1 of N selected and choose Build Dataset List

Enter a name for your collection to EReginae Hi-C forward Reads

Click Create collection to build your collection

Click on the checkmark icon at the top of your history again

Click on the galaxy-pencil pencil icon for the dataset to edit its attributes

In the central panel, change the Name field to Hi-C forward Reads

Click the Save button

Hands On: Mapping Hi-C reads against a reference genome

BWA-MEM2 ( Galaxy version 2.2.1+galaxy0) with the following parameters:

“Will you select a reference genome from your history or use a built-in index?”: Use a genome from history and build index

param-file “Use the following dataset as the reference sequence”: CReformitis_assembly

“Single or Paired-end reads”: Single

param-file “Select fastq dataset”: Hi-C forward Reads (output of Collapse Collection tool)

“Set read groups information?”: Do not set

“Select analysis mode”: 1.Simple Illumina mode

“BAM sorting mode”: Sort by read names (i.e., the QNAME field)

Rename the bam dataset Mapped Hi-C forward Reads

BWA-MEM2 ( Galaxy version 2.2.1+galaxy0) with the following parameters:

“Will you select a reference genome from your history or use a built-in index?”: Use a genome from history and build index

param-file “Use the following dataset as the reference sequence”: CReformitis_assembly

“Single or Paired-end reads”: Single

param-file “Select fastq dataset”: Hi-C reverse Reads (output of Collapse Collection tool)

“Set read groups information?”: Do not set

“Select analysis mode”: 1.Simple Illumina mode

“BAM sorting mode”: Sort by read names (i.e., the QNAME field)

Rename the bam dataset Mapped Hi-C reverse Reads

Filter and merge ( Galaxy version 1.0+galaxy0) with the following parameters:

param-file “First set of reads”: Mapped Hi-C forward Reads (output of BWA-MEM2 tool)

param-file “Second set of reads”: Mapped Hi-C reverse Reads (output of BWA-MEM2 tool)

Generate Hi-C contact map

PretextMap converts BAM/SAM files into genome contact maps. With those contact maps it is possible to analyse the 3D organisation of chromosomes. The resulting map will be used for visualisation and editing contigs/scaffolds. (Kumar et al. 2017)

Hands On: Generate a contact map with PretextMap and PretextSnapshot

PretextMap ( Galaxy version 0.1.9+galaxy1) with the following parameters:

param-file “Input dataset in SAM or BAM format”: outfile (output of Filter and merge tool)

“Sort by”: Ascending

Pretext Snapshot ( Galaxy version 0.0.3+galaxy1) with the following parameters:

param-file “Input Pretext map file”: pretext_map_out (output of PretextMap tool)

“Output image format”: PNG

“Show grid?”: Yes

Figure 16: Contact Map Snake. — **Figure 16**: Hi-C contact map of Erythrolamprus reginae generated by Pretext.

The contact map of the snake (Erythrolamprus reginae) does have a clear diagonal pattern. These interactions are commonly called “cis” interactions and occur between genomic loci that are spatially close to each other. Interactions beyond the diagonal pattern are called “trans” and indicate long-range interactions.

Question: Hi-C contact map question

What can you say about the Hi-C map corresponding to Chondrosia reniformis?

Open image in new tab

Figure 17: Hi-C contact map of Chondrosia reniformis generated by Pretext.

The contact map of the sponge (Chondrosia reniformis) does contain widespread interactions all over the genome. Additionally a diagonal pattern can be observed. The widespread interactions all over the genome can be the result of experimental and technical issues or it can have a biological and functional significance - such as interactions between chromosomes in the nucleus.

Assembly graph

Bandage is a tool to visualise de novo assembly graphs with connections. (Wick et al. 2015)

Hands On: Generate assembly graph

gfastats ( Galaxy version 1.3.6+galaxy0) with the following parameters:

param-file “Input file”: CReformitis_assembly (Input dataset)

“Specify target sequences”: Disabled

“Tool mode”: Genome assembly manipulation

“Output format”: GFA

“Terminal overlaps selection”: No (default)

Bandage Image ( Galaxy version 2022.09+galaxy4) with the following parameters:

param-file “Graphical Fragment Assembly”: output (output of gfastats tool)

Figure 18: Assembly Graph Sponge. — **Figure 18**: Assembly graph of the sponge (*Chrondrosia reniformis*). Contigs (nodes) are displayed in different colors and their connections (edges) are displayed as red dotted lines.

The Assembly graph of Chondrosia reniformis is relatively simple and doesn’t have a complex structure. The connectivity of the nodes is clear but incomplete and lacking some indicative regions of overlapping contigs. Repetitive regions can be investigated by searching for multiple paths or connections in the graph.

Question: Assembly graph question

Which differences can you appreciate between the assembly graph corresponding to Erythrolamprus reginae and Eschrichtius robustus?

Open image in new tab

Figure 19: Assembly graph of Erythrolamprus reginae (A) and Eschrichtius robustus (B). Contigs (nodes) are displayed in different colors and their connections (edges) are displayed as red dotted lines.

(A) The assembly graph of the snake (Erythrolamprus reginae) does have a confusing and complex structure. This can be an indicator for errors or misassemblies. The nodes are well-connected but relatively unclear in their path which is reflected by many small contigs and only a few large contigs. This assembly graph does contain a lot of repetitive sequences that result in multiple paths and connections between the nodes.

(B) The assembly graph of the whale (Eschrichtius robustus) is overall relatively simple and doesn’t have a too complex structure. The nodes are well-connected and most of the paths are clear which is reflected by many large contigs and only some small contigs. This assembly does also contain repetitive regions but clearly less than in the assembly of the snake.

Conclusion

In conclusion, it’s worth to run the a post-assembly QC workflow to assess the quality of genome assemblies. The following table contains metrics to sum up the quality control.

Figure 20: Final stats. — **Figure 20**: This table contains 9 indicators for quality evaluation of three different organisms: Chondrosia reniformis, Eschrichtius robustus and Erythrolamprus reginae.

You've Finished the Tutorial

Key points

The ERGA post-assembly pipeline allows users to assess and improve the quality of genome assemblies

The ERGA post-assembly pipeline consists of three main steps: Genome assembly decontamination and overview with BlobToolKit, providing analysis information and statistics, and Hi-C scaffolding.

Frequently Asked Questions

Have questions about this tutorial? Have a look at the available FAQ pages and support channels

References

Bogart, J. P., 1980 Evolutionary Implications of Polyploidy in Amphibians and Reptiles, pp. 341–378 in Polyploidy, Springer US. 10.1007/978-1-4613-3069-1_18
Sara Michele, C.-V. R., Bavestrello Giorgio, and C. Carlo, 1998 Endosymbiosis in Sponges: Relevance for Epigenesis and Evolution . Symbiosis 25: 57–68. https://dalspace.library.dal.ca/bitstream/handle/10222/77605/VOLUME%2025-NUMBERS%201-3-1998-PAGE%2057.pdf
Voultsiadou, E., 2005 Sponge diversity in the Aegean Sea: Check list and new information. Italian Journal of Zoology 72: 53–64. 10.1080/11250000509356653
Gennarelli, M., and A. Cattaneo, 2010 Genetic Variations and Association, pp. 129–151 in International Review of Neurobiology, Elsevier. 10.1016/b978-0-12-384976-2.00006-x
FRANKHAM, R. I. C. H. A. R. D., J. O. N. A. T. H. A. N. D. BALLOU, M. A. R. K. D. B. ELDRIDGE, R. O. B. E. R. T. C. LACY, K. A. T. H. E. R. I. N. E. RALLS et al., 2011 Predicting the Probability of Outbreeding Depression. Conservation Biology 25: 465–475. 10.1111/j.1523-1739.2011.01662.x
Buchfink, B., C. Xie, and D. H. Huson, 2014 Fast and sensitive protein alignment using DIAMOND. Nature Methods 12: 59–60. 10.1038/nmeth.3176
Hunt, M., N. D. Silva, T. D. Otto, J. Parkhill, J. A. Keane et al., 2015 Circlator: automated circularization of genome assemblies using long sequencing reads. Genome Biology 16: 10.1186/s13059-015-0849-0
Mikheenko, A., V. Saveliev, and A. Gurevich, 2015 MetaQUAST: evaluation of metagenome assemblies. Bioinformatics 32: 1088–1090. 10.1093/bioinformatics/btv697
Shafer, A. B. A., J. B. W. Wolf, P. C. Alves, L. Bergström, M. W. Bruford et al., 2015 Genomics and the challenging translation into conservation practice. Trends in Ecology &\mathsemicolon Evolution 30: 78–87. 10.1016/j.tree.2014.11.009
Simão, F. A., R. M. Waterhouse, P. Ioannidis, E. V. Kriventseva, and E. M. Zdobnov, 2015 BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31: 3210–3212. 10.1093/bioinformatics/btv351
Wick, R. R., M. B. Schultz, J. Zobel, and K. E. Holt, 2015 Bandage: interactive visualization of \lessi\greaterde novo\less/i\greater genome assemblies. Bioinformatics 31: 3350–3352. 10.1093/bioinformatics/btv383
Koren, S., B. P. Walenz, K. Berlin, J. R. Miller, N. H. Bergman et al., 2017 Canu: scalable and accurate long-read assembly via adaptive \lessi\greaterk\less/i\greater-mer weighting and repeat separation. Genome Research 27: 722–736. 10.1101/gr.215087.116
Kumar, R., H. Sobhy, P. Stenberg, and L. Lizana, 2017 Genome contact map explorer: a platform for the comparison, interactive visualization and analysis of genome contact maps. Nucleic Acids Research 45: e152–e152. 10.1093/nar/gkx644
Vaser, R., I. Sović, N. Nagarajan, and M. Šikić, 2017 Fast and accurate de novo genome assembly from long uncorrected reads. Genome Research 27: 737–746. 10.1101/gr.214270.116
Vurture, G. W., F. J. Sedlazeck, M. Nattestad, C. J. Underwood, H. Fang et al., 2017 GenomeScope: fast reference-free genome profiling from short reads (B. Berger, Ed.). Bioinformatics 33: 2202–2204. 10.1093/bioinformatics/btx153
Zimin, A. V., D. Puiu, R. Hall, S. Kingan, B. J. Clavijo et al., 2017 The first near-complete assembly of the hexaploid bread wheat genome, Triticum aestivum. GigaScience 6: 10.1093/gigascience/gix097
Brüniche-Olsen, A., R. Westerman, Z. Kazmierczyk, V. V. Vertyankin, C. Godard-Codding et al., 2018 The inference of gray whale (Eschrichtius robustus) historical population attributes from whole-genome sequences. BMC Evolutionary Biology 18: 10.1186/s12862-018-1204-3
Li, H., 2018 Minimap2: pairwise alignment for nucleotide sequences (I. Birol, Ed.). Bioinformatics 34: 3094–3100. 10.1093/bioinformatics/bty191
Manekar, S. C., and S. R. Sathe, 2018 A benchmark study of k-mer counting methods for high-throughput sequencing. GigaScience. 10.1093/gigascience/giy125
Pal, K., M. Forcato, and F. Ferrari, 2018 Hi-C analysis: from data generation to integration. Biophysical Reviews 11: 67–78. 10.1007/s12551-018-0489-1
Challis, R., E. Richards, J. Rajan, G. Cochrane, and M. Blaxter, 2020 BlobToolKit – Interactive Quality Assessment of Genome Assemblies. G3 Genes\vertGenomes\vertGenetics 10: 1361–1374. 10.1534/g3.119.400908
Rhie, A., B. P. Walenz, S. Koren, and A. M. Phillippy, 2020 Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biology 21: 10.1186/s13059-020-02134-9
Steinegger, M., and S. L. Salzberg, 2020 Terminating contamination: large-scale search identifies more than 2, 000, 000 contaminated entries in GenBank. Genome Biology 21: 10.1186/s13059-020-02023-1
Mazzoni, C. J., C. Ciofi, and R. M. Waterhouse, 2023 Biodiversity: an atlas of European reference genomes. Nature 619: 252–252. 10.1038/d41586-023-02229-w
Clavijo, B. J. k-mer counting, part I: Introduction. https://bioinfologics.github.io/post/2018/09/17/k-mer-counting-part-i-introduction/
Videvall, E. What’s N50? https://www.molecularecologist.com/2017/03/29/whats-n50/
Videvall, E. The N50 filtering problem. https://www.molecularecologist.com/2017/04/05/the-first-problem-with-n50/#: :text=In%20contrast%2C%20a%20poor%20assembly,indicative%20measures%20of%20better%20assemblies
Wikipedia GC-content. https://en.wikipedia.org/wiki/GC-content

Feedback

Did you use this material as an instructor? Feel free to give us feedback on how it went.
Did you use this material as a learner or student? Click the form below to leave feedback.

Citing this Tutorial

Fabian Recktenwald, Cristóbal Gallardo, Tom Brown, Delphine Lariviere, ERGA post-assembly QC (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/assembly/tutorials/ERGA-post-assembly-QC/tutorial.html Online; accessed TODAY
Hiltemann, Saskia, Rasche, Helena et al., 2023 Galaxy Training: A Powerful Framework for Teaching! PLOS Computational Biology 10.1371/journal.pcbi.1010752
Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012

@misc{assembly-ERGA-post-assembly-QC,
author = "Fabian Recktenwald and Cristóbal Gallardo and Tom Brown and Delphine Lariviere",
	title = "ERGA post-assembly QC (Galaxy Training Materials)",
	year = "",
	month = "",
	day = "",
	url = "\url{https://training.galaxyproject.org/training-material/topics/assembly/tutorials/ERGA-post-assembly-QC/tutorial.html}",
	note = "[Online; accessed TODAY]"
}
@article{Hiltemann_2023,
	doi = {10.1371/journal.pcbi.1010752},
	url = {https://doi.org/10.1371%2Fjournal.pcbi.1010752},
	year = 2023,
	month = {jan},
	publisher = {Public Library of Science ({PLoS})},
	volume = {19},
	number = {1},
	pages = {e1010752},
	author = {Saskia Hiltemann and Helena Rasche and Simon Gladman and Hans-Rudolf Hotz and Delphine Larivi{\`{e}}re and Daniel Blankenberg and Pratik D. Jagtap and Thomas Wollmann and Anthony Bretaudeau and Nadia Gou{\'{e}} and Timothy J. Griffin and Coline Royaux and Yvan Le Bras and Subina Mehta and Anna Syme and Frederik Coppens and Bert Droesbeke and Nicola Soranzo and Wendi Bacon and Fotis Psomopoulos and Crist{\'{o}}bal Gallardo-Alba and John Davis and Melanie Christine Föll and Matthias Fahrner and Maria A. Doyle and Beatriz Serrano-Solano and Anne Claire Fouilloux and Peter van Heusden and Wolfgang Maier and Dave Clements and Florian Heyl and Björn Grüning and B{\'{e}}r{\'{e}}nice Batut and},
	editor = {Francis Ouellette},
	title = {Galaxy Training: A powerful framework for teaching!},
	journal = {PLoS Comput Biol}
}

                   

Congratulations on successfully completing this tutorial!

You can use Ephemeris's shed-tools install command to install the tools used in this tutorial.

shed-tools install [-g GALAXY] [-a API_KEY] -t <(curl https://training.galaxyproject.org/training-material/api/topics/assembly/tutorials/ERGA-post-assembly-QC/tutorial.json | jq .admin_install_yaml -r)

Alternatively you can copy and paste the following YAML

---
install_tool_dependencies: true
install_repository_dependencies: true
install_resolver_dependencies: true
tools:
- name: blobtoolkit
  owner: bgruening
  revisions: 7d9ee8cd14b8
  tool_panel_section_label: Assembly
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: blobtoolkit
  owner: bgruening
  revisions: '392811378515'
  tool_panel_section_label: Assembly
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: diamond
  owner: bgruening
  revisions: e8ac2b53f262
  tool_panel_section_label: NCBI Blast
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: gfastats
  owner: bgruening
  revisions: 5799092ffdff
  tool_panel_section_label: FASTA/FASTQ
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: gfastats
  owner: bgruening
  revisions: 3ef480892a9f
  tool_panel_section_label: FASTA/FASTQ
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: text_processing
  owner: bgruening
  revisions: d698c222f354
  tool_panel_section_label: Text Manipulation
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: samtools_stats
  owner: devteam
  revisions: 3a0efe14891f
  tool_panel_section_label: SAM/BAM
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: smudgeplot
  owner: galaxy-australia
  revisions: 5e0825476fb7
  tool_panel_section_label: Assembly
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: bandage
  owner: iuc
  revisions: ddddce450736
  tool_panel_section_label: Graph/Display Data
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: bellerophon
  owner: iuc
  revisions: 25ca5d73aedf
  tool_panel_section_label: Assembly
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: busco
  owner: iuc
  revisions: bb935709506a
  tool_panel_section_label: Annotation
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: busco
  owner: iuc
  revisions: 2babe6d5c561
  tool_panel_section_label: Annotation
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: bwa_mem2
  owner: iuc
  revisions: b4a22d90cce9
  tool_panel_section_label: Mapping
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: genomescope
  owner: iuc
  revisions: 01210c4e9144
  tool_panel_section_label: Assembly
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: merqury
  owner: iuc
  revisions: f8113c25bc6b
  tool_panel_section_label: Assembly
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: meryl
  owner: iuc
  revisions: 29dabd8db6f2
  tool_panel_section_label: Assembly
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: minimap2
  owner: iuc
  revisions: 5cc34c3f440d
  tool_panel_section_label: Mapping
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: pretext_map
  owner: iuc
  revisions: dfb8a4497339
  tool_panel_section_label: Assembly
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: pretext_map
  owner: iuc
  revisions: c6cec5ab35c1
  tool_panel_section_label: Assembly
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: pretext_snapshot
  owner: iuc
  revisions: 44c66e8d21e6
  tool_panel_section_label: Graph/Display Data
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: collapse_collections
  owner: nml
  revisions: 90981f86000f
  tool_panel_section_label: Collection Operations
  tool_shed_url: https://toolshed.g2.bx.psu.edu/

t{ hist[0] | to_stars }} 1

t{ hist[0] | to_stars }} 1

April 2025

1 stars: Liked: not much, I reach this page from blobtools Disliked: I wanted to use blobtools, I got redirected to this page because I looked for help on how to run Blobtools. Specifically but I cannot understand what is “NCBI taxdump directory”: Taxonomy_Data, I do not understand what I have to upload exactly and how I got this Taxonomy_Data file, where do I upload it. Gallaxy is so frustrating. Lots of good things but it rarely works because of lack of info or bugs.

June 2023

5 stars: Liked: Everything! Disliked: It's awesome!