DNA Methylation data analysis
Overview
Questions:
- What is methylation and why can it not be recognised by a normal NGS procedure?
- Can differences in methylation influence the expression of a gene? How?
- Which tools can you use to analyse methylation data?
Objectives:
- Learn how to analyse methylation data
- Get a first intuition of the common pitfalls
Requirements:
- Introduction to Galaxy Analyses
- Slides: Quality Control
- Hands-on: Quality Control
- Slides: Mapping
- Hands-on: Mapping
Time estimation: 3 hours
Published: Feb 16, 2017
Last modification: Apr 3, 2025
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT
PURL: https://gxy.io/GTN:T00142
Rating: 3.7 (3 recent ratings, 13 all time)
Revision: 24
We will use a small subset of the original data, because running the computation on the full dataset would take too long for a tutorial. The subset is sufficient to show you all the steps necessary for Methyl-Seq. In a second step we use precomputed data from the study to show you different levels of methylation. We will consider samples from normal breast cells (NB), fibroadenoma (noncancerous breast tumor, BT089), two invasive ductal carcinomas (BT126, BT198) and a breast adenocarcinoma cell line (MCF7).
This tutorial is based on Lin et al. 2015. The data we use in this tutorial are available at Zenodo.
Data upload
We will start by loading the example dataset that will be used for the tutorial into Galaxy.
Hands On: Get the data into Galaxy
- Create a new history
To create a new history, click the new history icon at the top of the history panel.
- Import the two example datasets from Zenodo or the shared data library:
https://zenodo.org/record/557099/files/subset_1.fastq
https://zenodo.org/record/557099/files/subset_2.fastq
- Copy the link location
- Click Upload Data at the top of the tool panel
- Select Paste/Fetch Data
- Paste the link(s) into the text field
- Press Start
- Close the window
As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:
- Go into Libraries (left panel)
- Navigate to the correct folder as indicated by your instructor.
- On most Galaxies tutorial data will be provided in a folder named GTN - Material -> Topic Name -> Tutorial Name.
- Select the desired files
- Click on Add to History near the top and select as Datasets from the dropdown menu
- In the pop-up window, choose
- “Select history”: the history you want to import the data to (or create a new one)
- Click on Import
Quality Control
The first step in any analysis should always be quality control. We will use the Falco tool to assess the quality of our reads and determine whether we need to perform any data cleaning before proceeding with our analysis. Falco is an efficiency-optimized rewrite of FastQC.
Hands On: Quality Control
- Falco ( Galaxy version 1.2.4+galaxy0) with the following parameters:
- “Raw read data from your current history”:
subset_1.fastq.gz and subset_2.fastq.gz
- Click on Multiple datasets
- Select several files by keeping the Ctrl (or COMMAND) key pressed and clicking on the files of interest
- Go to the result web page and have a closer look at “Per base sequence content”
Question
- Note the GC distribution and the percentages of “T” and “C”. Why do they look so unusual?
- Is everything as expected?
- As covered in the theory part: during the bisulfite conversion every methylated C stays a C and every unmethylated C becomes a T. The reads are therefore depleted in C and enriched in T.
- Yes, it is. Always keep the specific characteristics of your data in mind when interpreting Falco results.
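The effect on the base composition can be illustrated with a small simulation. This is only a sketch of an idealised conversion, assuming a made-up sequence and a made-up set of methylated positions; `bisulfite_convert` and `base_fractions` are hypothetical helper names, not part of Falco or any other tool used here.

```python
from collections import Counter


def bisulfite_convert(seq, methylated_positions):
    """Idealised bisulfite conversion of a forward-strand sequence.

    Unmethylated C is read as T; methylated C stays C.
    """
    return "".join(
        "T" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(seq)
    )


def base_fractions(seq):
    """Fraction of each base A, C, G, T in the sequence."""
    counts = Counter(seq)
    return {base: counts.get(base, 0) / len(seq) for base in "ACGT"}


# Hypothetical example: only the C at index 1 is methylated.
original = "ACGTCCGATCGACCTGCA"
methylated = {1}

converted = bisulfite_convert(original, methylated)
print(converted)
print("before:", base_fractions(original))
print("after: ", base_fractions(converted))
```

In the converted sequence almost all C are gone and the T fraction rises accordingly, which is the pattern to expect in the “Per base sequence content” plot.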
Alignment
Hands On: Mapping with bwameth
We will now map the imported dataset against a reference genome.
- bwameth ( Galaxy version 0.2.7+galaxy0) with the following parameters:
- “Select a genome reference from your history or a built-in index?”:
Use a built-in index
- “Select a reference genome”:
Human (hg38full)
- “Is this library mate-paired”:
Paired-end
- “First read in pair”:
subset_1.fastq
- “Second read in pair”:
subset_2.fastq
Comment: Long compute times
Please note that mapping can take some time. If you want to skip this step, we provide a precomputed alignment. Import
https://zenodo.org/records/557099/files/aligned_subset.bam
into your history.
Question
Why do we need other alignment tools for bisulfite sequencing data?
You may have noticed that every C in a read is a methylated C, while a T can be either a genuine T or an unmethylated C. A mapper for methylation data needs to work out which is which.
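To make the ambiguity concrete, the following sketch classifies each reference C covered by a read. It assumes a forward-strand read that is already aligned to the reference without gaps, and the sequences are invented; `call_cytosines` is a hypothetical helper and does not reproduce how bwameth works internally.

```python
def call_cytosines(reference, read):
    """Classify every reference C covered by an ungapped forward-strand read.

    A read C at a reference C is taken as methylated, a read T at a
    reference C as unmethylated, anything else as a mismatch.
    """
    calls = {}
    for pos, (ref_base, read_base) in enumerate(zip(reference, read)):
        if ref_base != "C":
            continue  # a T over a reference T carries no methylation signal
        if read_base == "C":
            calls[pos] = "methylated"
        elif read_base == "T":
            calls[pos] = "unmethylated"
        else:
            calls[pos] = "mismatch"
    return calls


# Hypothetical reference and bisulfite-converted read.
reference = "ACGTCCGA"
read = "ACGTTTGA"

for pos, call in call_cytosines(reference, read).items():
    print(pos, call)
```

A standard aligner would count the two C-to-T differences as mismatches, which is why a bisulfite-aware mapper is needed.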
Methylation bias and metric extraction
Hands On: Methylation bias
In this step we will look at the distribution of the methylation along the reads and check for a possible bias.
- MethylDackel ( Galaxy version 0.5.2+galaxy0) with the following parameters:
- “Load reference genome from”:
Local cache
- “Using reference genome”:
Human (hg38)
- “Sorted BAM file”: output of bwameth tool
- “What do you want to do?”:
Determine the position-dependent methylation bias in the dataset, producing diagnostic SVG images (mbias)
- In “Advanced options”
- “Keep singletons”:
Yes
- “Keep discordant alignments”:
Yes
Question
- Consider the original top strand output. Is there a methylation bias?
- If we were to trim, what would be the start and the end positions?
- The methylation is distributed more or less evenly. We could trim a little at the start and the end, but a variation of ±5% is acceptable.
- To trim the reads, we would include only positions 0 to 145 for the first strand and positions 6 to 149 for the second.
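The reasoning behind choosing such bounds can be sketched as follows. This is a minimal illustration, assuming invented per-position methylation percentages and a simple rule (keep everything between the first and the last position that lie within 5 percentage points of the median); `inclusion_bounds` is a hypothetical helper and not the procedure MethylDackel itself applies.

```python
from statistics import median


def inclusion_bounds(per_position_meth, tolerance=5.0):
    """Return (first, last) read positions within `tolerance` of the median.

    `per_position_meth` holds one methylation percentage per read position.
    """
    centre = median(per_position_meth)
    within = [abs(value - centre) <= tolerance for value in per_position_meth]
    first = within.index(True)
    last = len(within) - 1 - within[::-1].index(True)
    return first, last


# Hypothetical methylation percentages along a 9 bp read.
values = [60, 68, 75, 76, 74, 75, 77, 75, 66]

print(inclusion_bounds(values))
```

Here the two positions at the start and the one at the end deviate by more than the tolerance, so only positions 2 to 7 would be kept.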
Hands On: Methylation extraction with MethylDackel
We will extract the methylation from the BAM file produced by the alignment step. We need this to create a methylation level plot in the next step.
- MethylDackel ( Galaxy version 0.5.2+galaxy0) with the following parameters:
- “Load reference genome from”:
Local cache
- “Using reference genome”:
Human (hg38)
- “Sorted BAM file”: output of bwameth tool
- “What do you want to do?”:
Extract methylation metrics from an alignment file in BAM/CRAM format (extract)
- “Merge per-Cytosine metrics”:
Yes
- “Output options”:
CpG methylation fractions (--fraction)
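What a methylation fraction means can be shown with a short calculation. This sketch assumes hypothetical read counts for the two cytosines of one CpG (one per strand); `merge_cpg` and `methylation_fraction` are illustrative names, and the code is not a reimplementation of MethylDackel.

```python
def merge_cpg(top, bottom):
    """Add up (methylated, unmethylated) read counts from both strands of a CpG."""
    return top[0] + bottom[0], top[1] + bottom[1]


def methylation_fraction(methylated, unmethylated):
    """Fraction of reads supporting methylation, or None without coverage."""
    total = methylated + unmethylated
    if total == 0:
        return None
    return methylated / total


# Hypothetical counts: (methylated reads, unmethylated reads) per strand.
top_strand = (8, 2)
bottom_strand = (7, 3)

merged = merge_cpg(top_strand, bottom_strand)
print(merged, methylation_fraction(*merged))
```

Merging the two strands first gives one value per CpG, based on more reads than either cytosine alone.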
Visualization
Hands On
In this step we want to visualize the methylation level around all TSS of our data. When located at gene promoters, DNA methylation is usually a repressive mark.
- Wig/BedGraph-to-bigWig with the following parameters:
- “Convert”:
fraction CpG
(result of MethylDackel tool)
It can happen that you cannot select the correct input file. In this case you have to add meta information about the genome used to the file.
- Click on the pencil of the correct history item.
- Change Database/Build: to the genome you used.
- In our case the correct genome is Human Dec. 2013 (GRCh38/hg38) (hg38).
- Import the BED file with CpG islands from Zenodo into the history
https://zenodo.org/records/557099/files/CpGIslands.bed
- computeMatrix ( Galaxy version 3.5.4+galaxy0) with the following parameters:
- “Regions to plot”:
CpGIslands.bed
- “Sample order matters”:
No
- “Score file”: Output of Wig/BedGraph-to-bigWig tool
- “computeMatrix has two main output options”:
reference-point
- plotProfile ( Galaxy version 3.5.4+galaxy0) with the following parameters:
- “Matrix file from the computeMatrix tool”:
Matrix
(output of computeMatrix tool)
The output should look like this:
Let's see how the methylation looks for a few provided files:
- Import the methylation bedGraph file for sample NB1 from Zenodo into the history
https://zenodo.org/records/557099/files/NB1_CpG.meth.bedGraph
- Wig/BedGraph-to-bigWig with the following parameters:
- “Convert”:
NB1_CpG.meth.bedGraph
Question
The execution fails. Do you have an idea why?
The conversion to bigWig fails at this point. If the dataset turns green anyway, its file size should be 0 bytes. The dataset info box probably shows an error message like
hashMustFindVal: '1' not found
The reason is the source of the reference genome that was used. Ensembl and UCSC are both sources, and they name the chromosomes differently. Ensembl uses just numbers, e.g. 1 for chromosome one. UCSC uses chr1 for the same chromosome. Be careful with this, especially if you have data from different sources. We need to convert the names.
Comment: UCSC - Ensembl convert
Download the file containing the mapping between the Ensembl and UCSC chromosome conventions of hg38
https://raw.githubusercontent.com/dpryan79/ChromosomeMappings/master/GRCh38_ensembl2UCSC.txt
- Replace column ( Galaxy version 0.2) with the following parameters:
- “File in which you want to replace some values”:
NB1_CpG.meth.bedGraph
- “Replace information file”:
GRCh38_ensembl2UCSC.txt
- “Which column should be replaced?”:
Column: 1
- “Skip this many starting lines”:
1
- “Delimited by”:
Tab
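The renaming done in this step amounts to a lookup on the first column. The sketch below assumes tab-separated input with a two-column mapping (Ensembl name, UCSC name) and one header line to skip. The inline data are invented, and dropping lines without a mapping is an assumption of this sketch, not necessarily what the Replace column tool does.

```python
def load_mapping(lines):
    """Build {ensembl_name: ucsc_name} from tab-separated mapping lines."""
    mapping = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[1]:
            mapping[fields[0]] = fields[1]
    return mapping


def rename_chromosomes(bedgraph_lines, mapping, skip=1):
    """Replace column 1 using `mapping`, leaving the first `skip` lines as-is."""
    renamed = []
    for index, line in enumerate(bedgraph_lines):
        line = line.rstrip("\n")
        if index < skip:
            renamed.append(line)  # header line is copied unchanged
            continue
        fields = line.split("\t")
        if fields[0] in mapping:  # assumption: unmapped contigs are dropped
            fields[0] = mapping[fields[0]]
            renamed.append("\t".join(fields))
    return renamed


# Hypothetical excerpts of the mapping file and of a bedGraph file.
mapping_lines = ["1\tchr1", "2\tchr2", "MT\tchrM"]
bedgraph = [
    "track type=bedGraph",
    "1\t100\t102\t0.75",
    "MT\t5\t7\t0.10",
    "GL000\t1\t3\t0.50",
]

for line in rename_chromosomes(bedgraph, load_mapping(mapping_lines)):
    print(line)
```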
To save compute time we prepared the converted files for you. Import the following files. Create a collection list and label it all_coverage_files. Strip the file extension from the name. For example, rename from NB1_CpG.meth_ucsc.bedGraph to NB1_CpG.
https://zenodo.org/records/557099/files/NB1_CpG.meth_ucsc.bedGraph
https://zenodo.org/records/557099/files/NB2_CpG.meth_ucsc.bedGraph
https://zenodo.org/records/557099/files/BT089_CpG.meth_ucsc.bedGraph
https://zenodo.org/records/557099/files/BT126_CpG.meth_ucsc.bedGraph
https://zenodo.org/records/557099/files/BT198_CpG.meth_ucsc.bedGraph
https://zenodo.org/records/557099/files/MCF7_CpG.meth_ucsc.bedgraph
- Click on Select Items at the top of the history panel
- Check all the datasets in your history you would like to include
- Click n of N selected and choose Build Dataset List
- Enter a name for your collection
- Click Create collection to build your collection
- Click on the checkmark icon at the top of your history again
Change the datatype to bedgraph and set the database to hg38
- Click on the pencil icon for the dataset to edit its attributes
- In the central panel, click the Datatypes tab on the top
- In Assign Datatype, select bedgraph from the “New type” dropdown
- Tip: you can start typing the datatype into the field to filter the dropdown menu
- Click the Save button
- Click the desired dataset’s name to expand it.
- Click on the “?” next to the database indicator
- In the central panel, change the Database/Build field
- Select your desired database key from the dropdown list: hg38
- Click the Save button
- Wig/BedGraph-to-bigWig with the following parameters:
- “Convert”:
all_coverage_files
- computeMatrix ( Galaxy version 3.5.4+galaxy0) with the following parameters:
- “Regions to plot”:
CpGIslands.bed
- “Sample order matters”:
No
- “Score file”: Output of previous Wig/BedGraph-to-bigWig tool
- “computeMatrix has two main output options”:
reference-point
- plotProfile ( Galaxy version 3.5.4+galaxy0) with the following parameters:
- “Matrix file from the computeMatrix tool”:
Matrix
(output of previous computeMatrix tool)
- “Show advanced options”:
Yes
- “Make one plot per group of regions”:
Yes
The output should look like this:
Metilene
Hands On: Metilene
With metilene it is possible to detect differentially methylated regions (DMRs), which is a necessary prerequisite for characterizing different epigenetic states.
- Import the following files from Zenodo into your history
https://zenodo.org/records/557099/files/NB1_CpG.meth.bedGraph
https://zenodo.org/records/557099/files/NB2_CpG.meth.bedGraph
https://zenodo.org/records/557099/files/BT198_CpG.meth.bedGraph
- Metilene ( Galaxy version 0.2.6.1) with the following parameters:
- “Input group 1”:
NB1_CpG.meth.bedGraph and NB2_CpG.meth.bedGraph
- “Input group 2”:
BT198_CpG.meth.bedGraph
- “BED file containing regions of interest”:
CpGIslands.bed
Question
Have a look at the produced PDF document. What is the data showing?
It shows the distribution of DMR differences, DMR length in nucleotides and number of CpGs, DMR differences vs. q-values, mean methylation of group 1 vs. mean methylation of group 2, and DMR length in nucleotides vs. length in CpGs.
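The quantity behind “mean methylation group 1 vs. mean methylation group 2” can be sketched for a single region. The example assumes invented CpG methylation fractions for two samples in group 1 and one sample in group 2. `region_mean_difference` is a hypothetical helper, and metilene's own segmentation and statistical testing are not reproduced here.

```python
from statistics import mean


def region_mean_difference(group1, group2):
    """Mean methylation of group 1 minus mean methylation of group 2.

    Each group is a list of samples; each sample is a list of CpG
    methylation fractions within the same region.
    """
    mean1 = mean(value for sample in group1 for value in sample)
    mean2 = mean(value for sample in group2 for value in sample)
    return mean1 - mean2


# Hypothetical fractions for one CpG island.
normal = [[0.2, 0.3], [0.1, 0.2]]  # e.g. two normal samples
tumor = [[0.8, 0.9]]               # e.g. one tumor sample

difference = region_mean_difference(normal, tumor)
print(round(difference, 2))
```

A strongly negative difference means the region is more methylated in group 2 than in group 1. Whether such a difference is significant is what the q-values in the metilene report address.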