Practical: analysis of ChIP-seq peaks

Introduction

In this practical, we will combine various bioinformatics resources to extract information from ChIP-seq experiments, and evaluate the quality of the peaks.

Resources

Resource Name	URL
ReMap	A database of ChIP-seq peaks (http://remap.cisreg.eu/)
Jaspar	A database of eukaryote TF binding motifs, mainly built from CHIP-seq peaks (http://jaspar.genereg.net/)
Hocomoco	A database of Human TF binding motifs, mainly built from CHIP-seq peaks (http://hocomoco11.autosome.ru/)
RSAT	Regulatory Sequence Analysis Tools http://rsat.eu/
RSAT Metazoa	http://metazoa.rsat.eu/
RSAT Teaching	http://teaching.rsat.eu/
RSAT server at IFB	http://rsat.france-bioinformatique.fr/rsat/
UCSC NGS file formats	https://genome.ucsc.edu/FAQ/FAQformat.html

Creating a workspace for this practical

Create a new folder to store the results of this practical (e.g. ~/LCG_BEII/chip-seq_practical/)
Set this folder as your working direct

Getting peaks from ReMap

ReMap (https://remap.univ-amu.fr/) is a database of TF binding regions based on an extensive re-analysis of published ChIP-seq data for human transcription factors. ory (setwd())

Open a connection to ReMap
Select Homo sapiens
Explore the interface
Choose a transcription factor and a tissue / cell type of interest.
- For a same factor, you wil generally find several datasets resulting from different studies, in different tissues (biotypes).
- For your exercise, I recommend to select a dataset with a reasonable number of peaks
- For this tutorial I will use Sox2 in glioma, where ReMap peaks were identified from the original GEO dataset GSE67282
Click on the Download icon tab, besides the number of peaks. This downloads a compressed (gzipped) bed file (tab-separated columns describing genomic features).
Check that the number of peaks in the downloaded file corresponds to the indication on the ReMap page for your factor in the tissue / cell type and in the study identifier (GSE or other) you selected.
Check the name column in the bed file to make sure that the peaks all belong to the same tissue / cell type and study identifier.
Compute
- the number of peaks
- the sum of peak lengths
- the mean peak length
- the median peak length

Beware: the bed format relies on a somewhat confusing convention to specify the coordinates of genomic features.

Coordinates are 0-based (the first position of a sequence is referred to as 0).
The feature start is (as expected) the first position within the feature, but the “end” is the first position after the feature.
Consequently, the feature length can be computed as $l = end - start$, whereas with other formats it should be $l = end - start = 1$.

Tips:

The peak statistics can be computed with a spreadsheet (Excel, LibreOffice Calc), R, or the Unix command line.
Click the “code” button to display a bash line to compute peak statistics (number, sum of lengths and mean length) from a bedfile.

RESULT_DIR=~/LCG_BEII/chip-seq_practical/SOX2_glioma_GSE67282
PEAKS=${RESULT_DIR}/GSE67282.SOX2.glioma_stem.bed.gz
gunzip -c ${PEAKS} \
   | awk '{len=$3 - $2; sum += len; n++; mean = sum / n ; print "n="n"\tsum="sum"\tmean="mean}' \
   | tail -n 1

Click the “code” button to display a chunk of R code to compute peak statistics.

resultDir <- "~/LCG_BEII/chip-seq_practical/SOX2_glioma_GSE67282"
peakBase <- "GSE67282.SOX2.glioma_stem.bed.gz"
peakFile <- file.path(resultDir, peakBase)
#peakFile <- "data/ReMap_GSE67282.SOX2.ESC_peaks.bed"
peaks <- read.delim(peakFile, header = FALSE)
peaks <- peaks[, 1:4] # Keep the fields used for this analysis
names(peaks) <- c("chr", "start", "end", "name")

peaks$len <- peaks$end - peaks$start
head(peaks) # print the first peaks

peak_stats <- data.frame(
  sum = sum(peaks$len),
  mean = mean(peaks$len),
  n = nrow(peaks))

library(knitr)
kable(peak_stats)

In our example (Sox2 peaks in embryonic stem cells from the GEO series GSE67282) we obtain:

Number of peaks: $n = 3,859$
Sum of peak lengths: $sum = 1,154,851$
Mean peak lengths: $mean = 299.3$

Getting motifs from reference databases

Open a connection to Jaspar
- Explore the database functionalities
- Find the transcription factor binding motifs for your TF of interest.
- Download the corresponding matrix in your working directory.
  - Note: JASPAR proposes several matrix formats. All these formats are supported yb RSAT, but for convenience you can choose the TRANSFAC format.
- Save this transfac-formatted motif in a text file named [YOUR_FACTOR_NAME]_reference-motifs.tf.
Optional: Do the same with the Hocomoco database.
- Note the classification of transcription factors on the home page. You can browse it to observe TF families, defined based on their binding domains.
- Find the transcripiton factor of interest and save its position-counts matrix (PCM) in your work directory.
- With the RSAT tool convert-matrix, convert Hocomoco motif of your factor to transfac format, and paste it below the Jaspar motif (make sure that you paste the whole record, including the // line at the end, which indicates the separation between two motifs in the same file).
- Tip: the format Hocomoco is not displayed yet in the list of input formats for convert-matrix, but it is actually identical to cluster-buster.

Discovering motifs with RSAT peak-motifs

Open a connection to RSAT
Use the tool fetch-sequences to get the sequences of your peaks.
- BEWARE: the peaks of all ChIP-seq datasets were computed by Ballester’s team based on the hg38 assembly of the Human genome. You should thus use this version of the genome.
Save a copy of the fasta file in your local folder (it might serve for further analyses).
At the bottom of the fetch-sequence result page, click the peak$motifs button to send the sequences as input to peak-motifs.
Set the following parameters for peak-mortifs
- Open the section Compare discovered motifs with databases or custom reference motifs.
- As motif databases, select both Jaspar core non-redundant vertebrates and Hocomoco Human TFs.
- Enter as reference motif the transfac-formatted file with the JASPAR and Hocomoco motifs for your TF of interest (the file [YOUR_FACTOR_NAME]_reference-motifs.tf you built in the previous section).
- Enter your email address
- Leave all other parameters unchanged an run the analysis.
Interpret the results:
- Did you obtain significant results? How significant ?
- If so, were they characterized by over-representation (oligo-analysis), positional bias (position-analysis) or both?
- Did some of the discovered motifs match the reference motifs for your TF in Jaspar and/or Hocomoco?
- Did you discover other motifs?

Negative control

Use *RSAT random genome fragments** to pick up random regions having the same sizes as your peaks
Run the same analyses as above with peak-motifs
Compare the results of this negative control with those obtained above with your peaks, and interpret them.

Motif encrihment analysis

Note : for the 2020 session we had no time to see the theory about matrix-quality and motif enrichment. This section can thus be skipped for the report.

On the RSAT server, open the tool matrix-quality
Submit your peak sequences
Enter the JASPAR and Hocomoco reference motifs
Run the analyses

Motif discovery with MEME

Analyse the same peak sequences with MEME-ChIP, and convert the discovered motifs in transfac format with the RSAT tool convert-matrix.

Also analyse with MEME the random selections of genmic regions (negative control).

Matrix clustering

The tool matrix-clustering can be used for different purposes. Here we will use it to - compare the motifs discovered respectively by MEME and RSAT peak-motifs in the peak-sequences - merge the similar motifs in order to obtain a non-redundant collection of motifs discoverd by the different tools in the peaks - compare this non-redundant collection of discovered motifs with a motif database in order to identify the associated TFs.

Run the RSAT tool matrix-clustering to merge and compare the two collections of motifs discovered by RSAT peak-motifs and MEME-ChIP, respectively.
From the matrix-clustering result, retrieve the root motifs, which will provide you with a non-redundant collection of the motifs resulting from RSAT peak-motifs and MEME-ChIP.
Use the RSAT tool compare-matrices to compare these non-redundant discovered motifs with JASPAR nonredundant vertebrate.
Use RSAT matrix-scan to measure the rate of coverage of the peaks (percentage of peaks with at least one match)
- for each one of the non-redundant discovered motifs
- for each reference motifs

Interpret the results

Do the motifs discovered in the peaks include the reference motifs (those correspinding to the immunoprecipitated factor)?
Are these motifs reported by all the motif discovery approaches or only by some of them?
Do you discover motifs associated with other transcription factors? If so, are these factors relevant to the ChIP-seq data set under study (co-factors of the immunoprecipitated factor, factors involved in this cell/tissue type, …)?
Which algorithms reported significant motifs in the random selections of genomic regions? How significant are they compared to those reported in the actual peaks? How do you interpret these motifs?

Instruction for the report

Volume: 4 - 5 pages of text (without counting the figures and the supp. material)
Structure: we ask you to return a report structured as a small research article (introduction, material and methods, results and discussion, conclusions an perspective).
Report template. Please use the template provided here. Read carefully the instructions.
- Rmd format: Rmd.
- Word format: docx.

Please submit both the original file (Rmd) and the report compiled in Word format (in RStudio, use the command Knit to Word), in order to enable me to annotate it in the margins.