One of the expected result of a peak calling approach (see previous tutorial) is a set of peaks in bed format. The BED format also known as BED6 format is the most widely used. The BED6 contains the following information:
An example BED record is provided below.
chrom | start | end | name | score | strand |
chr1 | 123 | 789 | feature_1 | 0 | + |
An alternative is the the BED3 format that only contains the 3 first column.
chrom | start | end |
chr1 | 123 | 789 |
Finally a less conventional Bed format is the BED12 format that is most generally used to store transcript coordinates (including all exon coordinates).
Depending of the software producing the BED file, an additional first line (a header) containing a description can be found. See UCSC BED format description for more information.
Several conventions exist to describe genome coordinates. The BED file format is said to be “zero-based, half-open”.
Zero-based means that the coordinate of the first nucleotide of a chromosome is defined as 0 . half-open means that the “End” coordinate is not part of the feature. In the example above, corresponding to a “chromosome” (Z) containing 12 nucleotides, the coordinates of the ATG (in red) sequence would be [4,7[. The corresponding entry would be described by the following row in a bed file ChrZ 4 7.
This convention is somewhat confusing, since the ATG trinucleotide actually spans the region from the 5th to the 8th nucleotides of the sequence. However, this is the convention adopted by many genome centers, and we have to learn using it.
If you encounter issues with BED files in Galaxy you should try to:
Given a set of ChIP-seq peaks annotate them in order to find associated genes, associated genomic elements (promoter, UTR, introns,…) and associated functional terms (through GO term analysis).
For this practical session, the ChIP-seq data and peaks related to following publication will be used: “GATA3 acts upstream of FOXA1 in mediating ESR1 binding by shaping enhancer accessibility.”, Theodorou et al.
The following pre-processed datasets are available:
Each file contains 7 columns in “BED” format.
Sample | format |
---|---|
siGATA_ER_E2_r1 | bed |
siGATA_ER_E2_r2 | bed |
siGATA_ER_E2_r3 | bed |
siGATA_H3K4me1_E2_r1 | bed |
siNT_ER_E2_r1 | bed |
siNT_ER_E2_r2 | bed |
siNT_ER_E2_r3 | bed |
The BAM indexes are provided in the second table.
Sample | format |
---|---|
MCF_input_r3 | bam |
siGATA_ER_E2_r1 | bam |
siGATA_ER_E2_r2 | bam |
siGATA_ER_E2_r3 | bam |
siGATA_H3K4me1_E2_r1 | bam |
siNT_ER_E2_r1 | bam |
siNT_ER_E2_r2 | bam |
siNT_ER_E2_r3 | bam |
siNT_H3K4me1_E2_r1 | bam |
Sample | format |
---|---|
MCF_input_r3 | bam |
siGATA_ER_E2_r1 | bam |
siGATA_ER_E2_r2 | bam |
siGATA_ER_E2_r3 | bam |
siGATA_H3K4me1_E2_r1 | bam |
siNT_ER_E2_r1 | bam |
siNT_ER_E2_r2 | bam |
siNT_ER_E2_r3 | bam |
siNT_H3K4me1_E2_r1 | bam |
Sample | format |
---|---|
MCF_input_r3 | bigWig |
siGATA_ER_E2_r1 | bigWig |
siGATA_ER_E2_r2 | bigWig |
siGATA_ER_E2_r3 | bigWig |
siGATA_H3K4me1_E2_r1 | bigWig |
siNT_ER_E2_r1 | bigWig |
siNT_ER_E2_r2 | bigWig |
siNT_ER_E2_r3 | bigWig |
siNT_H3K4me1_E2_r1 | bigWig |
Reload IGV. Load the bigWig files and bed files in the genome browser. Use colored tracks to distinguish between GATA3 KO and WT condition (right-click on track name > Change track color). Reorder properly the files with peaks below corresponding bigwigs.
Reproducibility may be an issue in ChIP-Seq experiment. Select one condition (WT or GATA KO).
For that specific step we will use the GREAT annotation tools. Connect to GREAT web server and perform a GO annotation for the ESR1 peaks. Alternatively GREAT can be launch directly from UCSC web server (using Table browser Custom track and by selecting send to GREAT).
Examine the enriched functional categories ? Are they biologically meaningful given what you know about ESR1 ?
It may be interesting, for each peak, to get the name of the closest transcript. Two this aim we require both transcript coordinates and a set of peaks in BED format. We will search the closest feature using the closestBed tool from Bedtools suite.
In the Tool search box, look for UCSC main table browser
So far, we obtained the list of all the Human transcripts, in bed12 format. This format includes, in addition to the genomic coordinates (columns 1-3), the transcript ID, coordinates of the coding region, number of exons, list of exon starts and list of exon ends.
We would like to select the 6 first columns to clearly visualized in IGV the selected regions.
Bedtools requires the BED file to bed sorted.
In the TAGC galaxy instance. CEAS (Cis-regulatory Element Annotation System) is a tool for characterizing genome-wide protein-DNA interaction patterns from ChIP-chip and ChIP-Seq of both sharp and broad binding factors. It provides statistics on ChIP enrichment at important genome features such as specific chromosome, promoters, gene bodies, or exons, and infers genes most likely to be regulated by a binding factor.
What can we say about the enrichment.
Using the “deepTools heatmapper” we will try to visualize the local enrichment around the TSS for all known genes. Before drawing the heatmap we need to prepare the data by computing a summary matrix of the local ChIP enrichment using “deepTools computeMatrix”.
Do you see any differences between the two choosen dataset ?
In this part, we will use the ReMap (http://tagc.univ-mrs.fr/remap/index.php) software to compare the peaks obtained in the peaks calling tutorial to an extensive regulatory catalog of 8 million transcription factor binding regions defined by collecting all the peaks from ChIP-seq experiments from the ENCODE project + other public datasets from the GEO database. Note that on the ReMap Web site, the term “site” is used to denote a ChIP-seq peak, rather than the precise binding location of a transcription factor
What are the TFs associated to your peaks?