Stats with R and RStudio

Practical: basic stats for peak calling

Jacques van Helden, Hugo varet and Julie Aubert

2017-01-08

Introduction

Peak-calling: question

**The peak calling question**. Slide from Carl Herrmann.

The peak calling question. Slide from Carl Herrmann.

Data loading

Defining the data directory

We will first define the URL from which the data can be downloaded, by concatenating the URL fo the course with the path to our dataset.

To concatenate paths, it is recommended to use the R command file.path().

url.course <- "http://jvanheld.github.io/stats_avec_RStudio_EBA/"
url.data <- file.path(url.course, "practicals", "02_peak-calling", "data")

Loading a data table

R enables to download data directly from the Web.

Load counts per bin in chip sample.

## Define URL of the ChIP file
chip.bedg.file <- file.path(url.data, "FNR_200bp.bedg")

## Load the file content in an R data.frame
chip.bedg <- read.table(chip.bedg.file)

## Set column names
names(chip.bedg) <- c("chrom", "start", "end","counts")

Exploring a data frame: dim()

Before anything else, let us inspect the size of the data frame, in order to check that it was properly lodaded.

dim(chip.bedg)
[1] 23199     4

Checking the n first rows: head()

The function head() displays the first rows of a table.

head(chip.bedg, n = 5)
                         chrom start  end counts
1 gi|49175990|ref|NC_000913.2|     0  200   1594
2 gi|49175990|ref|NC_000913.2|   200  400    834
3 gi|49175990|ref|NC_000913.2|   400  600    222
4 gi|49175990|ref|NC_000913.2|   600  800    172
5 gi|49175990|ref|NC_000913.2|   800 1000    123

Checking the n last rows: tail()

The function tail() displays the last rows of a table.

tail(chip.bedg, n = 5)
                             chrom   start     end counts
23195 gi|49175990|ref|NC_000913.2| 4638800 4639000    155
23196 gi|49175990|ref|NC_000913.2| 4639000 4639200     93
23197 gi|49175990|ref|NC_000913.2| 4639200 4639400     90
23198 gi|49175990|ref|NC_000913.2| 4639400 4639600    226
23199 gi|49175990|ref|NC_000913.2| 4639600 4639675    186

Viewing a table

The function View() displays the full table in a user-friendly mode.

View(chip.bedg)

Selecting arbitrary rows

chip.bedg[100:105,] 
                           chrom start   end counts
100 gi|49175990|ref|NC_000913.2| 19800 20000     21
101 gi|49175990|ref|NC_000913.2| 20000 20200      0
102 gi|49175990|ref|NC_000913.2| 20200 20400      0
103 gi|49175990|ref|NC_000913.2| 20400 20600    108
104 gi|49175990|ref|NC_000913.2| 20600 20800    229
105 gi|49175990|ref|NC_000913.2| 20800 21000    245

Selecting arbitrary columns

chip.bedg[100:105, 2] 
[1] 19800 20000 20200 20400 20600 20800
chip.bedg[100:105, "start"] 
[1] 19800 20000 20200 20400 20600 20800
chip.bedg[100:105, c("start", "counts")] 
    start counts
100 19800     21
101 20000      0
102 20200      0
103 20400    108
104 20600    229
105 20800    245

Adding columns

We can add columns with the result of computations from other columns.

chip.bedg$midpos <- (chip.bedg$start + chip.bedg$end)/2
head(chip.bedg)
                         chrom start  end counts midpos
1 gi|49175990|ref|NC_000913.2|     0  200   1594    100
2 gi|49175990|ref|NC_000913.2|   200  400    834    300
3 gi|49175990|ref|NC_000913.2|   400  600    222    500
4 gi|49175990|ref|NC_000913.2|   600  800    172    700
5 gi|49175990|ref|NC_000913.2|   800 1000    123    900
6 gi|49175990|ref|NC_000913.2|  1000 1200    116   1100

Exploring the data with basic plots

Plotting a density profile

We can readily print a plot with the counts per bin.

plot(chip.bedg[, c("midpos", "counts")], type="h")

Plotting a density profile

Let us improve the plot

plot(chip.bedg[, c("midpos", "counts")], type="h", 
     col="darkgreen", xlab="Genomic position (200bp bins)", 
     ylab= "Reads per bin",
     main="FNR ChIP-seq")