Introduction
During this practical, we will explore different bioinformatics resources to
- compute Markov models (transition matrices) from different sequence types
- compute the probability of a given sequence with a given Markov models
Getting CpG islands from the Human genome
Open a connection to the UCSC table browser
Select the hg38 assembly of the Human genome
Set the following parameters
- feature group to Regulation and the feature type to CpG islands
- region to genome
- output format to bed
- Output file to
hg38_CpG_islands.bed
- Optional: check the option gzip compressed. This will reduce the transfer time, but you will need to uncompress the downloaded result file.
Click on get output. This displays a second page with additional parameters. Leave all parameters to their default values and click get bed. The browser downloads a file named hg38_CpG_islands.bed.gz. Uncompress the result and check its content. It should look like this.
chr1 84934572 84935054 CpG:_47
chr1 63176547 63177427 CpG:_78
chr1 125435174 125435976 CpG:_67
chr1 183368926 183369826 CpG:_93
chr1 3531624 3531843 CpG:_27
chr1 3670619 3671074 CpG:_34
- Read the specifications of the
bed format on the formats pagr of the UCSC genome browser.
Exercise
- Make a copy of the bed file named
hg38_CpG_islands.tsv (the extension tsv stands for “tab-separated values”) and open it with a spreadsheet (e.g. LibreOffice calc or Microsoft Excel) or with the R statistical package.
- Add a column with the sequence length (beware of the zero-based notation of coordinates in
bed).
- Compute the number of CpG islands, their min, mean and max lengths.
- Compute the cumulative size of CpG islands (in Megabases) and the genome coverage (in percent).
- Plot the distribution of lengths for the CpG islands.
Exercise
Use the same approach to retrieve the sequences of all CpG islands in the Human genome, in fasta format, and save them in a file named hg38_CpG_islands.fasta (you will need to uncompress the .gz file downloaded from UCSC genome browser).
Do the same with the following organisms (always take the latest assembly)
- Mus musculus (mouse)
- Rattus norvegicus (rat)
Computing a transition matrix from sequences
Computing k-mer frequencies
- Open a connection to the Regulatory Sequence Analysis Tools (rsat.eu)
- Choose the Metazoa server
- In the left menu, click on the tool search box (magnifier icon) and start typing background. In In the list of matching tools that appears in the tool menu, select Create background.
- In the Background sequences, click browse and find the file with your CpG island sequences (fasta format).
- Set the Markov order to 1.
- Click GO.
- Save the result file to your computer as (it is a tab-separated value format, so I name it
hg38_CpG_2nt-freq.tsv).
- Open it with a spreadsheet.
What does it contain? Do you understand the relationship between this result and a first-order Markov Model?
Converting k-mer frequencies to a transition matrix
- In the RSAT tool menu, open the tool Convert Background.
- Check the option Custom background model and make sure the input format is oligo-analysis.
- Check File upload and browse your computer to find the background model obtained in the previous step.
- Check that the Output format is set to transitions.
- Set the decimals to 5.
- Click Go
- Save the result file on your computer and open it with a spreadsheet.
Exercise
- Do you understand the relationship between this file and the previous one?
- Starting from the 2nt frequencies, recompute the transition probability from C to G and check that you obtain the same result as the convert background tool. First write the formula with the symbols for the relevant dinucleotide frequencies, then replace these symbols by their actual values, and compute the result.
Computing a Markov model for non-CpG island sequences
In order to distinguish between CpG island and other human genomic sequences, we would like to build a Markov model from all the genomic sequences that are not annotated as CpG islands. However, the since the Human genome is quite big (3e+9 nucleotides) we will use a proxy, by computing a Markov model from a random sampling of genomic sequences. These might include some pieces of CpG islands, but this effect should be negligible.
Importantly, we will select random genome fragments having the same sizes as the CpG islands, because in the next steps we will compare the properties of these two sequence files.
- Open a connection to metazoa.rsat.eu.
- Find and open the tool named random genome fragments.
- Select the option Use template file, and select the
bed file with the coordinates of the CpG islands (hg38_CpG_islands.bed) that you downloaded from the UCSC genome browser. Set the Template format to bed.
- Set the organism to Homo sapiens GRCh38.
- In the Output option, select Sequences in fasta format.
- Optionally, you can choose the email output, which will send you a message with a link to the result file.
- Click GO.
It will take a few minutes to generate the result.
- In the result page, copy the link of the fasta file and save it in a file on your computer, we will use it several times for this tutorial.
- Save a copy of this file on your computer, with an informative name (for example
random-genome-fragments_hg38.fasta).
- Open the tool create background.
- In the option URL of the sequence, paste the link of your random genomic fragments.
- Set the Markov order to 1.
- Click GO.
- Save the background file on your computer with the name
hg38_genomic-bg_2nt-freq.tsv.
We now obtained dinucleotide frequencies of 10,000 random genomic fragments of 1000 pb. We can convert these dinucleotide frequencies into a transition matrix.
- Open the tool convert background model
- Select Custom background model, File Upload and choose the dinucleotide frequency file computed above (
hg38_genomic-bg_2nt-freq.tsv).
- Click GO and save the transition file with the name
hg38_genomic-bg_transitions_m1.tsv.
Do the same exercise as above: starting from the dinucleotide frequencies, compute the transition probability from C to G and compare it to the value found in the CpG islands.
Computing the probability of the CpG islands given the genomic background model
We will now compute the probability of each CpG island, using the Markov model of genomic background estimated from the random genome fragments.
- Open a connection to metazoa.rsat.eu.
- Find and open the tool named sequence probability.
- Upload the sequences of the CpG islands (fasta format).
- Choose a custom background model and upload the dinucleotide frequencies estimated for the genomic background (those measured in the random selection of genomic fragments).
- Click GO.
- Download the result table on your computer, and open it with either a spreadsheet or R.
- Plot and histogram with the logarithms of p-values.
- Plot the log(p-value) as a function of the sequence length.
- Write a few sentences to interpret the results.
Computing the probabilities of CpG islands given the CpG island background model
Run the same analyses as above but use the Markov model built from the CpG islands.
Computing the probability of the random genome fragments
- Compute the probabilities of the random genome fragments as in the previous steps: first given the genomic background model and then given the CpG island background model).
- Open the result file in a spreadsheet or R.
- Do the same plots as for CpG islands and interpret the results.
Discriminating sequences
We will not compute the log-ratio of the sequence probabilities computed using Markov models respectively estimated from the genomic background and from CpG islands.
\[score(S) = log \left( \frac{P(S|CpG)}{P(S|Bg)} \right) \]