In this practical, we will combine various bioinformatics resources to extract information from ChIP-seq experiments, and evaluate the quality of the peaks.
Resource Name | URL |
---|---|
ReMap | A database of ChIP-seq peaks (http://remap.cisreg.eu/) |
Jaspar | A database of eukaryote TF binding motifs, mainly built from CHIP-seq peaks (http://jaspar.genereg.net/) |
Hocomoco | A database of Human TF binding motifs, mainly built from CHIP-seq peaks (http://hocomoco11.autosome.ru/) |
RSAT | Regulatory Sequence Analysis Tools http://rsat.eu/ |
RSAT Metazoa | http://metazoa.rsat.eu/ |
RSAT Teaching | http://teaching.rsat.eu/ |
RSAT server at IFB | http://rsat.france-bioinformatique.fr/rsat/ |
UCSC NGS file formats | https://genome.ucsc.edu/FAQ/FAQformat.html |
~/LCG_BEII/chip-seq_practical/
)ReMap (https://remap.univ-amu.fr/) is a database of TF binding
regions based on an extensive re-analysis of published ChIP-seq data for
human transcription factors. ory (setwd()
)
Open a connection to ReMap
Select Homo sapiens
Explore the interface
Choose a transcription factor and a tissue / cell type of interest.
Click on the Download icon tab, besides the number of peaks. This downloads a compressed (gzipped) bed file (tab-separated columns describing genomic features).
Check that the number of peaks in the downloaded file corresponds to the indication on the ReMap page for your factor in the tissue / cell type and in the study identifier (GSE or other) you selected.
Check the name column in the bed file to make sure that the peaks all belong to the same tissue / cell type and study identifier.
Compute
Beware: the bed format relies on a somewhat confusing convention to specify the coordinates of genomic features.
Tips:
The peak statistics can be computed with a spreadsheet (Excel, LibreOffice Calc), R, or the Unix command line.
Click the “code” button to display a bash line to compute peak
statistics (number, sum of lengths and mean length) from a
bed
file.
RESULT_DIR=~/LCG_BEII/chip-seq_practical/SOX2_glioma_GSE67282
PEAKS=${RESULT_DIR}/GSE67282.SOX2.glioma_stem.bed.gz
gunzip -c ${PEAKS} \
| awk '{len=$3 - $2; sum += len; n++; mean = sum / n ; print "n="n"\tsum="sum"\tmean="mean}' \
| tail -n 1
<- "~/LCG_BEII/chip-seq_practical/SOX2_glioma_GSE67282"
resultDir <- "GSE67282.SOX2.glioma_stem.bed.gz"
peakBase <- file.path(resultDir, peakBase)
peakFile #peakFile <- "data/ReMap_GSE67282.SOX2.ESC_peaks.bed"
<- read.delim(peakFile, header = FALSE)
peaks <- peaks[, 1:4] # Keep the fields used for this analysis
peaks names(peaks) <- c("chr", "start", "end", "name")
$len <- peaks$end - peaks$start
peakshead(peaks) # print the first peaks
<- data.frame(
peak_stats sum = sum(peaks$len),
mean = mean(peaks$len),
n = nrow(peaks))
library(knitr)
kable(peak_stats)
In our example (Sox2 peaks in embryonic stem cells from the GEO series GSE67282) we obtain:
Open a connection to Jaspar
Explore the database functionalities
Find the transcription factor binding motifs for your TF of interest.
Download the corresponding matrix in your working directory.
Save this transfac-formatted motif in a text file named
[YOUR_FACTOR_NAME]_reference-motifs.tf
.
Optional: Do the same with the Hocomoco database.
Note the classification of transcription factors on the home page. You can browse it to observe TF families, defined based on their binding domains.
Find the transcripiton factor of interest and save its position-counts matrix (PCM) in your work directory.
With the RSAT tool convert-matrix, convert Hocomoco
motif of your factor to transfac format, and paste it below the Jaspar
motif (make sure that you paste the whole record, including the
//
line at the end, which indicates the separation between
two motifs in the same file).
Tip: the format Hocomoco is not displayed yet in the list of input formats for convert-matrix, but it is actually identical to cluster-buster.
Open a connection to RSAT
Use the tool fetch-sequences to get the sequences of your peaks.
Save a copy of the fasta file in your local folder (it might serve for further analyses).
At the bottom of the fetch-sequence result page, click the peak$motifs button to send the sequences as input to peak-motifs.
Set the following parameters for peak-mortifs
[YOUR_FACTOR_NAME]_reference-motifs.tf
you built in the
previous section).Interpret the results:
Use *RSAT random genome fragments** to pick up random regions having the same sizes as your peaks
Run the same analyses as above with peak-motifs
Compare the results of this negative control with those obtained above with your peaks, and interpret them.
Note : for the 2020 session we had no time to see the theory about matrix-quality and motif enrichment. This section can thus be skipped for the report.
On the RSAT server, open the tool matrix-quality
Submit your peak sequences
Enter the JASPAR and Hocomoco reference motifs
Run the analyses
Analyse the same peak sequences with MEME-ChIP, and convert the discovered motifs in transfac format with the RSAT tool convert-matrix.
Also analyse with MEME the random selections of genmic regions (negative control).
The tool matrix-clustering can be used for different purposes. Here we will use it to - compare the motifs discovered respectively by MEME and RSAT peak-motifs in the peak-sequences - merge the similar motifs in order to obtain a non-redundant collection of motifs discoverd by the different tools in the peaks - compare this non-redundant collection of discovered motifs with a motif database in order to identify the associated TFs.
Run the RSAT tool matrix-clustering to merge and compare the two collections of motifs discovered by RSAT peak-motifs and MEME-ChIP, respectively.
From the matrix-clustering result, retrieve the root motifs, which will provide you with a non-redundant collection of the motifs resulting from RSAT peak-motifs and MEME-ChIP.
Use the RSAT tool compare-matrices to compare these non-redundant discovered motifs with JASPAR nonredundant vertebrate.
Use RSAT matrix-scan to measure the rate of coverage of the peaks (percentage of peaks with at least one match)
Do the motifs discovered in the peaks include the reference motifs (those correspinding to the immunoprecipitated factor)?
Are these motifs reported by all the motif discovery approaches or only by some of them?
Do you discover motifs associated with other transcription factors? If so, are these factors relevant to the ChIP-seq data set under study (co-factors of the immunoprecipitated factor, factors involved in this cell/tissue type, …)?
Which algorithms reported significant motifs in the random selections of genomic regions? How significant are they compared to those reported in the actual peaks? How do you interpret these motifs?
Volume: 4 - 5 pages of text (without counting the figures and the supp. material)
Structure: we ask you to return a report structured as a small research article (introduction, material and methods, results and discussion, conclusions an perspective).
Report template. Please use the template provided here. Read carefully the instructions.
Please submit both the original file (Rmd) and the report compiled in Word format (in RStudio, use the command Knit to Word), in order to enable me to annotate it in the margins.