Markov Models (nothing to hide so far)

Introduction

During this practical, we will explore different bioinformatics resources to

compute Markov models (transition matrices) from different sequence types
compute the probability of a given sequence with a given Markov models

Resources

Resource	Description	URL
UCSC genome Browser		genome.ucsc.edu/
UCSC table browser	Web tool to extract features from the UCUS genome browser	genome.ucsc.edu/cgi-bin/hgTables
RSAT	Regulatory Sequence Analysis Tools	rsat.eu
RSAT Metazoa	Metazoa server of the Regulatory Sequence Analysis Tools	metazoa.rsat.eu

Getting CpG islands from the Human genome

Open a connection to the UCSC table browser
Select the hg38 assembly of the Human genome
Set the following parameters
- feature group to Regulation and the feature type to CpG islands
- region to genome
- output format to bed
- Output file to hg38_CpG_islands.bed
- Optional: check the option gzip compressed. This will reduce the transfer time, but you will need to uncompress the downloaded result file.
Click on get output. This displays a second page with additional parameters. Leave all parameters to their default values and click get bed. The browser downloads a file named hg38_CpG_islands.bed.gz. Uncompress the result and check its content. It should look like this.

chr1    84934572        84935054        CpG:_47
chr1    63176547        63177427        CpG:_78
chr1    125435174       125435976       CpG:_67
chr1    183368926       183369826       CpG:_93
chr1    3531624 3531843 CpG:_27
chr1    3670619 3671074 CpG:_34

Read the specifications of the bed format on the formats pagr of the UCSC genome browser.

Exercise

Make a copy of the bed file named hg38_CpG_islands.tsv (the extension tsv stands for “tab-separated values”) and open it with a spreadsheet (e.g. LibreOffice calc or Microsoft Excel) or with the R statistical package.
Add a column with the sequence length (beware of the zero-based notation of coordinates in bed).
Compute the number of CpG islands, their min, mean and max lengths.
Compute the cumulative size of CpG islands (in Megabases) and the genome coverage (in percent).
Plot the distribution of lengths for the CpG islands.

Exercise

Use the same approach to retrieve the sequences of all CpG islands in the Human genome, in fasta format, and save them in a file named hg38_CpG_islands.fasta (you will need to uncompress the .gz file downloaded from UCSC genome browser).
Do the same with the following organisms (always take the latest assembly)
- Mus musculus (mouse)
- Rattus norvegicus (rat)

Computing a transition matrix from sequences

Computing k-mer frequencies

Open a connection to the Regulatory Sequence Analysis Tools (rsat.eu)
Choose the Metazoa server
In the left menu, click on the tool search box (magnifier icon) and start typing background. In In the list of matching tools that appears in the tool menu, select Create background.
In the Background sequences, click browse and find the file with your CpG island sequences (fasta format).
Set the Markov order to 1.
Click GO.
Save the result file to your computer as (it is a tab-separated value format, so I name it hg38_CpG_2nt-freq.tsv).
Open it with a spreadsheet.

What does it contain? Do you understand the relationship between this result and a first-order Markov Model?

Converting k-mer frequencies to a transition matrix

In the RSAT tool menu, open the tool Convert Background.
Check the option Custom background model and make sure the input format is oligo-analysis.
Check File upload and browse your computer to find the background model obtained in the previous step.
Check that the Output format is set to transitions.
Set the decimals to 5.
Click Go
Save the result file on your computer and open it with a spreadsheet.

Exercise

Do you understand the relationship between this file and the previous one?
Starting from the 2nt frequencies, recompute the transition probability from C to G and check that you obtain the same result as the convert background tool. First write the formula with the symbols for the relevant dinucleotide frequencies, then replace these symbols by their actual values, and compute the result.

Computing a Markov model for non-CpG island sequences

In order to distinguish between CpG island and other human genomic sequences, we would like to build a Markov model from all the genomic sequences that are not annotated as CpG islands. However, the since the Human genome is quite big (3e+9 nucleotides) we will use a proxy, by computing a Markov model from a random sampling of genomic sequences. These might include some pieces of CpG islands, but this effect should be negligible.

Importantly, we will select random genome fragments having the same sizes as the CpG islands, because in the next steps we will compare the properties of these two sequence files.

Open a connection to metazoa.rsat.eu.
Find and open the tool named random genome fragments.
Select the option Use template file, and select the bed file with the coordinates of the CpG islands (hg38_CpG_islands.bed) that you downloaded from the UCSC genome browser. Set the Template format to bed.
Set the organism to Homo sapiens GRCh38.
In the Output option, select Sequences in fasta format.
Optionally, you can choose the email output, which will send you a message with a link to the result file.
Click GO.

It will take a few minutes to generate the result.

In the result page, copy the link of the fasta file and save it in a file on your computer, we will use it several times for this tutorial.
Save a copy of this file on your computer, with an informative name (for example random-genome-fragments_hg38.fasta).
Open the tool create background.
In the option URL of the sequence, paste the link of your random genomic fragments.
Set the Markov order to 1.
Click GO.
Save the background file on your computer with the name hg38_genomic-bg_2nt-freq.tsv.

We now obtained dinucleotide frequencies of 10,000 random genomic fragments of 1000 pb. We can convert these dinucleotide frequencies into a transition matrix.

Open the tool convert background model
Select Custom background model, File Upload and choose the dinucleotide frequency file computed above (hg38_genomic-bg_2nt-freq.tsv).
Click GO and save the transition file with the name hg38_genomic-bg_transitions_m1.tsv.

Do the same exercise as above: starting from the dinucleotide frequencies, compute the transition probability from C to G and compare it to the value found in the CpG islands.

Computing the probability of the CpG islands given the genomic background model

We will now compute the probability of each CpG island, using the Markov model of genomic background estimated from the random genome fragments.

Open a connection to metazoa.rsat.eu.
Find and open the tool named sequence probability.
Upload the sequences of the CpG islands (fasta format).
Choose a custom background model and upload the dinucleotide frequencies estimated for the genomic background (those measured in the random selection of genomic fragments).
Click GO.
Download the result table on your computer, and open it with either a spreadsheet or R.
Plot and histogram with the logarithms of p-values.
Plot the log(p-value) as a function of the sequence length.
Write a few sentences to interpret the results.

Computing the probabilities of CpG islands given the CpG island background model

Run the same analyses as above but use the Markov model built from the CpG islands.

Computing the probability of the random genome fragments

Compute the probabilities of the random genome fragments as in the previous steps: first given the genomic background model and then given the CpG island background model).
Open the result file in a spreadsheet or R.
Do the same plots as for CpG islands and interpret the results.

Discriminating sequences

We will not compute the log-ratio of the sequence probabilities computed using Markov models respectively estimated from the genomic background and from CpG islands.

\[score(S) = log \left( \frac{P(S|CpG)}{P(S|Bg)} \right) \]

Markov Models (nothing to hide so far)

Computational biology

Jacques van Helden

2019-12-09

Introduction

Resources

Getting CpG islands from the Human genome

Exercise

Exercise

Computing a transition matrix from sequences

Computing k-mer frequencies

Converting k-mer frequencies to a transition matrix

Exercise

Computing a Markov model for non-CpG island sequences

Computing the probability of the CpG islands given the genomic background model

Computing the probabilities of CpG islands given the CpG island background model

Computing the probability of the random genome fragments

Discriminating sequences