During this practical, we will develop in R a simplified version of the MACS peak calling algorithm, and test it on a relatively small dataset (ChIP-seq data for the FNR transcription factor in the genome of the Bacteria Escherichia coli).
Beware: this exercise only performs some basic part of the peak-calling processing. The result will certainly not be as good as the peaks detected by conventional peak calling algorithms. The essential goal of this tutorial is understand the basic statistics used for peak calling, by implementing ourselves a simplified way to test the enrichment of reads in a partition of non-overlapping windows covering the genome.
For each sample (FNR ChIP-seq and genomic input) we pre-computed bed files with the counts reads mapped in non-overlapping windows of 200bp, or 50bp, respectively (two files per sample).
FNR ChIP-seq sample (test)
Genomic input (control)
FNR ChIP-seq sample (test)
Genomic input (control)
These files are in bedGraph format.
Remember: the bed convention uses zero-based coordinates, with semi-open intervals. Thus, the coordinates 0 50
correspond to
[0:50[
in zero-based coordinates;[0:49]
in zero-based coordinates;[1:50]
in one-based (human understandable) coordinates.This section is optional. It explains the tricks we used to generate a data set for this exercise.
We obtained the original datasets (mapped reads for the ChIP-seq and input samples) by following Morgane Thomas-Chollier’s tutorial Hands-on introduction to ChIP-seq analysis.
Then, we used bedtools to count the number of reads per bins, i.e. non-overlapping windows of fixed size (200bp per window) covering the full genome of Escherichia coli.
The protocol to count reads per bin is the following:
bedtools makewindows \
-g Escherichia_coli_K_12_MG1655_genome.txt -w 200 \
> Escherichia_coli_K_12_MG1655_windows_200bp.bed
The file genome.txt is expected to be a 2-columns file indicating the ID and length of each chromosome. In the case of Escherichia coli, the file contains a single line:
gi|49175990|ref|NC_000913.2| 4639675
To compute the intersection between two sets of genomic regions (bedtools intersect), bedtools requires two bed files sorted by chromosome and by chromosomal position. We first need ton convert sam to bed format. For this we use an intermediate BAM format.
## Convert mapped reads of FNR ChIP-seq library (test)
samtools view -bS SRR576933.sam | bedtools bamtobed | \
sort -n -k 2 \> SRR576933_sorted.bed
## Convert mapped reads of control library (input)
samtools view -bS SRR576938.sam | bedtools bamtobed \
| sort -n -k 2 > SRR576938_sorted.bed
# Count the number of reads per window in the FNR ChIP-seq library
bedtools intersect -a Escherichia_coli_K_12_MG1655_windows_200bp.bed \
-b SRR576933_sorted.bed \
-c -sorted \
> FNR_ChIP-seq_Anaerobic_A_GSM1010219_reads_per_200bp.bedg
# Count the number of reads per window in the control library
bedtools intersect -a Escherichia_coli_K_12_MG1655_windows_200bp.bed \
-b SRR576938_sorted.bed \
-c -sorted \
> Escherichia_coli_K_12_MG1655_input_reads_per_200bp.bedg
Compare the distributions of counts per reads (normalized for the input) in the ChIP-seq and input samples.
For each bin, compute the following statistics, and collect them in a table (one row per bin, one column per statistics):
Note: each of these statistics can be computed with a single operation – in R, you should avoid loops whenever possible.
Discuss the results.
MACS method descripion:
Zhang, Y., Liu, T., Meyer, C.A., Eeckhoute, J., Johnson, D.S., Bernstein, B.E., Nussbaum, C., Myers, R.M., Brown, M., Li, W., et al. (2008) Model-based analysis of ChIP-Seq (MACS). Genome Biol, 9, R137.
To generate the data, we followed Morgane Thomas-Chollier’s tutorial Hands-on introduction to ChIP-seq analysis, and then applied some tricks to count reads in fixed-width windows over the whole genome of Escherichia coli.
We can directly load the data from the web site.
#url.course <- "~/stats_avec_RStudio_EBA/" # For JvH local test only
url.course <- "http://jvanheld.github.io/stats_avec_RStudio_EBA/"
url.data <- file.path(url.course, "practicals", "02_peak-calling", "data")
## Choose a window size (200 or 50)
window.size <- 200
## Load counts per window in chip sample
chip.bedg.file <- paste(sep="",
"FNR_",
window.size,"bp.bedg")
chip.bedg <- read.table(file.path(url.data, chip.bedg.file))
names(chip.bedg) <- c("chrom", "start", "end","counts")
## Load counts per window in the input sample
input.bedg.file <- paste(sep="", "input_",window.size,"bp.bedg")
input.bedg <- read.table(file.path(url.data, input.bedg.file))
names(input.bedg) <- c("chrom", "start", "end","counts")
Let us check the content of the bed files.
dim(chip.bedg) ## Check number of rows and columns
## [1] 23199 4
head(chip.bedg) ## check the content of the first rows
## chrom start end counts
## 1 gi|49175990|ref|NC_000913.2| 0 200 1594
## 2 gi|49175990|ref|NC_000913.2| 200 400 834
## 3 gi|49175990|ref|NC_000913.2| 400 600 222
## 4 gi|49175990|ref|NC_000913.2| 600 800 172
## 5 gi|49175990|ref|NC_000913.2| 800 1000 123
## 6 gi|49175990|ref|NC_000913.2| 1000 1200 116
dim(input.bedg) ## Check number of rows and columns (should be the same as for test file)
## [1] 23199 4
head(input.bedg) ## check the content of the first rows
## chrom start end counts
## 1 gi|49175990|ref|NC_000913.2| 0 200 514
## 2 gi|49175990|ref|NC_000913.2| 200 400 490
## 3 gi|49175990|ref|NC_000913.2| 400 600 392
## 4 gi|49175990|ref|NC_000913.2| 600 800 374
## 5 gi|49175990|ref|NC_000913.2| 800 1000 316
## 6 gi|49175990|ref|NC_000913.2| 1000 1200 352
For each library (FNR and input, respectively), the total number of reads is obtained by computing the sum of all values from the fourth column, which contains the number of reads for each window.
## The fourth column of the bedgraph file contains the counts
chip.counts <- chip.bedg[,4]
input.counts <- input.bedg[,4]
## Count total counts
chip.total <- sum(chip.counts)
input.total <- sum(input.counts)
## Compare total counts between test and input samples
print(chip.total)
## [1] 2622163
print(input.total)
## [1] 7480400
print (input.total/chip.total)
## [1] 2.852759
Let us already note that the input library contains 2.8 times more reads than the test library. The input library was generated by sequencing the raw genome of Escherichia coli.
We can thus compute the genome coverage.
G <- 4641652 ## Genome size
print(genome.coverage <- sum(input.total) * 35 / G)
## [1] 56.40535
The input library represents a 56-fold genome coverage. This is a rather good indication, since it will be used to estimate our random expectation for each positional window. It is thus important to have a sufficient depth in order to get robust estimations.
However, we need to keep in mind that the test and input libraries are of different sizes to define our statistics for the enrichment of each window.
Note: the sum of reads differs from the actual total counts. Indeed, the protocol used to generate the data counts the number of reads overlapping each window. So, if a read overlaps two adjacent windows, it will be counted twice. We will temporarily ignore this problem, which will be fixed in the future by counting reads only for the window that overlaps their central residue.
We will now normalize the input library in order to obtain the same sum as the test (FNR) library.
## Option 1: mormalize the data based on the library sum
norm.input.libsum <- input.counts * chip.total / input.total
head(norm.input.libsum)
## [1] 180.1764 171.7635 137.4108 131.1011 110.7699 123.3893
## Option 1: mormalize the data based on median counts per library.
## This is more robust to outliers, esp. in the ChIP sample,
## where the actual peaks bias the mean towards higher values
norm.input.median <- input.counts * median(chip.counts) / median(input.counts)
head(norm.input.median)
## [1] 162.2312 154.6562 123.7250 118.0438 99.7375 111.1000
# Check the normalizing effect:
# Libsum-scaled input counts should have the same sum as test counts.
sum(chip.counts)
## [1] 2622163
sum(norm.input.libsum)
## [1] 2622163
sum(norm.input.libsum)/sum(chip.counts)
## [1] 1
# Median-scaled input counts do not have the same sum as test counts.
sum(norm.input.median)
## [1] 2361001
sum(norm.input.median)/sum(chip.counts)
## [1] 0.9004022
## Make a choice between options 1 and 2
norm.input <- norm.input.median
## Display the first elements of the normalized input counts
head(norm.input)
## [1] 162.2312 154.6562 123.7250 118.0438 99.7375 111.1000
Whereas the input and test datasets consisted in natural numbers (read counts), the normalized input consists of decimal numbers. It can be interpreted as the number of reads that would be expected in each window from a library of the same size as the FNR (test) library, in absence of any specific binding. This is thus a genomic position-specific background model, estimated from the input library.
The principle of the following tests will be to define suitable statistics to compare the number of reads observed in each window of the FNR sample with this random expectation estimated from the normalized input.
Compare the distributions of counts per reads (normalized for the input) in the ChIP-seq and input samples.
A simple way to compare the counts is to draw a dot plot of the test versus input reads per window.
par(mfrow=c(1,1))
plot(norm.input, chip.counts,
xlab="Normalized input", ylab="Test",
col="#666666",
main="Counts per window")
grid()