Objectives

During this practical, we will develop in R a simplified version of the MACS peak calling algorithm, and test it on a relatively small dataset (ChIP-seq data for the FNR transcription factor in the genome of the Bacteria Escherichia coli).

Beware: this exercise only performs some basic part of the peak-calling processing. The result will certainly not be as good as the peaks detected by conventional peak calling algorithms. The essential goal of this tutorial is understand the basic statistics used for peak calling, by implementing ourselves a simplified way to test the enrichment of reads in a partition of non-overlapping windows covering the genome.


Data sets

For each sample (FNR ChIP-seq and genomic input) we pre-computed bed files with the counts reads mapped in non-overlapping windows of 200bp, or 50bp, respectively (two files per sample).

Read counts per 200bp windows

  1. FNR ChIP-seq sample (test)

    FNR_ChIP-seq_Anaerobic_A_GSM1010219_reads_per_200bp.bedg

  2. Genomic input (control)

    Escherichia_coli_K_12_MG1655_input_reads_per_200bp.bedg

Read counts per 50bp windows

  1. FNR ChIP-seq sample (test)

    FNR_ChIP-seq_Anaerobic_A_GSM1010219_reads_per_50bp.bedg

  2. Genomic input (control)

    Escherichia_coli_K_12_MG1655_input_reads_per_200bp.bedg

These files are in bedGraph format.

Remember: the bed convention uses zero-based coordinates, with semi-open intervals. Thus, the coordinates 0 50 correspond to

  • the semi-open interval [0:50[ in zero-based coordinates;
  • i.e. the closed interval [0:49] in zero-based coordinates;
  • i.e. the closed interval [1:50] in one-based (human understandable) coordinates.

How was this dataset generated?

This section is optional. It explains the tricks we used to generate a data set for this exercise.

We obtained the original datasets (mapped reads for the ChIP-seq and input samples) by following Morgane Thomas-Chollier’s tutorial Hands-on introduction to ChIP-seq analysis.

Then, we used bedtools to count the number of reads per bins, i.e. non-overlapping windows of fixed size (200bp per window) covering the full genome of Escherichia coli.

The protocol to count reads per bin is the following:

  1. Generate a bed file defining the bin limits (one row per bin).
  2. Convert the mapped reads from sam to bed file, sorted by chromosomal position.
  3. Use bedtools to compute the intersection between each bin and the read file (i.e. count the reads falling onto each bin).

Generating windows of equal size along the reference genome

bedtools makewindows \
  -g Escherichia_coli_K_12_MG1655_genome.txt -w 200 \
  > Escherichia_coli_K_12_MG1655_windows_200bp.bed

The file genome.txt is expected to be a 2-columns file indicating the ID and length of each chromosome. In the case of Escherichia coli, the file contains a single line:

gi|49175990|ref|NC_000913.2|    4639675

Converting bam to sorted bed

To compute the intersection between two sets of genomic regions (bedtools intersect), bedtools requires two bed files sorted by chromosome and by chromosomal position. We first need ton convert sam to bed format. For this we use an intermediate BAM format.

## Convert mapped reads of FNR ChIP-seq library (test)
samtools view -bS SRR576933.sam | bedtools bamtobed | \
  sort -n -k 2 \> SRR576933_sorted.bed

## Convert mapped reads of control library (input)
samtools view -bS SRR576938.sam | bedtools bamtobed \
  | sort -n -k 2 > SRR576938_sorted.bed

Counting the reads per bin

# Count the number of reads per window in the FNR ChIP-seq library
bedtools intersect -a Escherichia_coli_K_12_MG1655_windows_200bp.bed \
-b SRR576933_sorted.bed \
-c -sorted \
>  FNR_ChIP-seq_Anaerobic_A_GSM1010219_reads_per_200bp.bedg

# Count the number of reads per window in the control library
bedtools intersect -a Escherichia_coli_K_12_MG1655_windows_200bp.bed \
-b SRR576938_sorted.bed \
-c -sorted \
>  Escherichia_coli_K_12_MG1655_input_reads_per_200bp.bedg    

Questions

  1. Download the two bed files describing read maps (“test” and “input”, respectively), for a given window size.
  2. Load these two tables in R.
  3. Count the total reads for the FNR and input libraries, respectively.
  4. Normalize the input library in order to obtain the same sum as the test (FNR) library.
  5. Compare the distributions of counts per reads (normalized for the input) in the ChIP-seq and input samples.

  6. For each bin, compute the following statistics, and collect them in a table (one row per bin, one column per statistics):

    Note: each of these statistics can be computed with a single operation – in R, you should avoid loops whenever possible.

    1. Number of reads in the test (FNR ChIP-seq)
    2. Number of reads in the input
    3. Normalized number of reads in the input (norm.input)/
    4. Reads per base in the input
    5. Fold enrichment (ratio between test and normalized input)
    6. Log-fold enrichment (log10 of the fold enrichment)
    7. P-value of the test reads, using a Poisson distribution with local lambda estimate (explain your model).
    8. P-value of the test reads, using a binomial distribution (explain your model).
    9. P-value of the test reads, using a hypergeometric distribution (explain your model).
  7. Draw some dot plots to compare the different statistics (fold enrichment, log-fold, p-values with the different models).
  8. Discuss the results.


References

  1. MACS method descripion:
    Zhang, Y., Liu, T., Meyer, C.A., Eeckhoute, J., Johnson, D.S., Bernstein, B.E., Nussbaum, C., Myers, R.M., Brown, M., Li, W., et al. (2008) Model-based analysis of ChIP-Seq (MACS). Genome Biol, 9, R137.

  2. To generate the data, we followed Morgane Thomas-Chollier’s tutorial Hands-on introduction to ChIP-seq analysis, and then applied some tricks to count reads in fixed-width windows over the whole genome of Escherichia coli.


Solutions

Loading the two bed files describing read counts in R

We can directly load the data from the web site.

#url.course <- "~/stats_avec_RStudio_EBA/" # For JvH local test only
url.course <- "http://jvanheld.github.io/stats_avec_RStudio_EBA/"
url.data <- file.path(url.course, "practicals", "02_peak-calling", "data")

## Choose a window size (200 or 50)
window.size <- 200

## Load counts per window in chip sample
chip.bedg.file <- paste(sep="", 
                        "FNR_",
                        window.size,"bp.bedg")
chip.bedg <- read.table(file.path(url.data, chip.bedg.file))
names(chip.bedg) <- c("chrom", "start", "end","counts")

## Load counts per window in the input sample
input.bedg.file <- paste(sep="", "input_",window.size,"bp.bedg")
input.bedg <- read.table(file.path(url.data, input.bedg.file))
names(input.bedg) <- c("chrom", "start", "end","counts")

Let us check the content of the bed files.

dim(chip.bedg) ## Check number of rows and columns
## [1] 23199     4
head(chip.bedg) ## check the content of the first rows
##                          chrom start  end counts
## 1 gi|49175990|ref|NC_000913.2|     0  200   1594
## 2 gi|49175990|ref|NC_000913.2|   200  400    834
## 3 gi|49175990|ref|NC_000913.2|   400  600    222
## 4 gi|49175990|ref|NC_000913.2|   600  800    172
## 5 gi|49175990|ref|NC_000913.2|   800 1000    123
## 6 gi|49175990|ref|NC_000913.2|  1000 1200    116
dim(input.bedg) ## Check number of rows and columns (should be the same as for test file)
## [1] 23199     4
head(input.bedg) ## check the content of the first rows
##                          chrom start  end counts
## 1 gi|49175990|ref|NC_000913.2|     0  200    514
## 2 gi|49175990|ref|NC_000913.2|   200  400    490
## 3 gi|49175990|ref|NC_000913.2|   400  600    392
## 4 gi|49175990|ref|NC_000913.2|   600  800    374
## 5 gi|49175990|ref|NC_000913.2|   800 1000    316
## 6 gi|49175990|ref|NC_000913.2|  1000 1200    352

Counting the total reads per library

For each library (FNR and input, respectively), the total number of reads is obtained by computing the sum of all values from the fourth column, which contains the number of reads for each window.

## The fourth column of the bedgraph file contains the counts
chip.counts <- chip.bedg[,4] 
input.counts <- input.bedg[,4] 

## Count total counts
chip.total <- sum(chip.counts)
input.total <- sum(input.counts)

## Compare total counts between test and input samples
print(chip.total)
## [1] 2622163
print(input.total)
## [1] 7480400
print (input.total/chip.total)
## [1] 2.852759

Let us already note that the input library contains 2.8 times more reads than the test library. The input library was generated by sequencing the raw genome of Escherichia coli.

We can thus compute the genome coverage.

G <- 4641652 ## Genome size
print(genome.coverage <- sum(input.total) * 35 / G)
## [1] 56.40535

The input library represents a 56-fold genome coverage. This is a rather good indication, since it will be used to estimate our random expectation for each positional window. It is thus important to have a sufficient depth in order to get robust estimations.

However, we need to keep in mind that the test and input libraries are of different sizes to define our statistics for the enrichment of each window.

Note: the sum of reads differs from the actual total counts. Indeed, the protocol used to generate the data counts the number of reads overlapping each window. So, if a read overlaps two adjacent windows, it will be counted twice. We will temporarily ignore this problem, which will be fixed in the future by counting reads only for the window that overlaps their central residue.

Normalization of the input library

We will now normalize the input library in order to obtain the same sum as the test (FNR) library.

## Option 1: mormalize the data based on the library sum
norm.input.libsum <- input.counts * chip.total / input.total
head(norm.input.libsum)
## [1] 180.1764 171.7635 137.4108 131.1011 110.7699 123.3893
## Option 1: mormalize the data based on median counts per library.
## This is more robust to outliers, esp. in the ChIP sample, 
## where the actual peaks bias the mean towards higher values
norm.input.median <- input.counts *  median(chip.counts) /  median(input.counts)
head(norm.input.median)
## [1] 162.2312 154.6562 123.7250 118.0438  99.7375 111.1000
# Check the normalizing effect: 
# Libsum-scaled input counts should have the same sum as test counts.
sum(chip.counts)
## [1] 2622163
sum(norm.input.libsum)
## [1] 2622163
sum(norm.input.libsum)/sum(chip.counts)
## [1] 1
# Median-scaled input counts do not have the same sum as test counts.
sum(norm.input.median)
## [1] 2361001
sum(norm.input.median)/sum(chip.counts)
## [1] 0.9004022
## Make a choice between options 1 and 2
norm.input <- norm.input.median

## Display the first elements of the normalized input counts
head(norm.input)
## [1] 162.2312 154.6562 123.7250 118.0438  99.7375 111.1000

Whereas the input and test datasets consisted in natural numbers (read counts), the normalized input consists of decimal numbers. It can be interpreted as the number of reads that would be expected in each window from a library of the same size as the FNR (test) library, in absence of any specific binding. This is thus a genomic position-specific background model, estimated from the input library.

The principle of the following tests will be to define suitable statistics to compare the number of reads observed in each window of the FNR sample with this random expectation estimated from the normalized input.

Comparison between FNR and input count distributions

Compare the distributions of counts per reads (normalized for the input) in the ChIP-seq and input samples.

A simple way to compare the counts is to draw a dot plot of the test versus input reads per window.

par(mfrow=c(1,1))
plot(norm.input, chip.counts, 
     xlab="Normalized input", ylab="Test", 
     col="#666666",
     main="Counts per window")
grid()