Goals of this tutorial

In this tutorial, we will use the local sequence similarity search tool (BLAST) to searching sequence similarities for a protein of interest (an histone) in the proteome of a given organism (the yeast Saccharomyces cerevisiae).

This approach is very simple, fast and widely used (BLAST is the most highly bioinformatics tool worldwide), but suffers from some limitations. In particular, the resulting list of protein sequences is sensitive to the particular protein chosse as query.

In a next practical, we will rely on a more powerful method based on Hidden Markov Models (HMM) in order to build a profile for a protein family, and use this profile to scan the proteome and detect all the members of this family.

Technically, this tutorial will also lead us to submit jobs to a scheduler, monitor the output and error logs, and organise the results for further use.

We will also perform some exercise to evaluate the impact of some parameters on the results (in particular, the threshold on the e-value).

Operations

We will successively perform the following operations.

Select the full set of proteins (proteome) identified by translating the putatively coding regions of a given organism (e.g. the fruit fly Drosophila melanogaster), download its sequences to our computer, and upload these sequences to the High-Performance Computing server (the IFB core cluster).
Download a sequence of interest (e.g. Histone H3) on the HPC server.
Use the similarity search tool blast to find homologs of the protein of interest in the selected genome.

Prerequisite

This practical assumes that you are connected to a cluster of the IFB NNCR. This enables you to benefit from a software environment with pre-installed software tools for the different fields of bioinformatics.

You should have recieved a login before starting the course. If this is not the case, see the information on account request on the IFB Web site (https://www.france-bioinformatique.fr/en/cluster).

Before starting this tutorial, you should have followed the following introductory sessions

Warning about command-line use of the cluster

Please remember: the sequence analysis tasks below should by no way run on the login node of the computer. Each job has to be sent to the slurm job scheduler using srun (submit a single command) or sbacth (submit a bash script), as explained in the tutorials of the prerequisite section.

Bioinformatics resources used for this course

Resource	Description	URL
Ensembl	Genome database	www.ensembl.org
UniprotKB	Knowledge base of protein sequence and functional information	www.uniprot.org
PFAM	Database of protein families	pfam.xfam.org

Collecting the sequences

Loading the required modules on the IFB core cluster

The IFB core cluster comes with a modular software environment, which enables users to deploy the specific tools required for their analyses.

In this section, we will open a connection to the IFB cluster and load the modules required for the practicals below.

Task 1: use the command module availto explore the lists of available modules and identify the two modules used for this course.

blast: Basic Local Alignment Search Tool
emboss: a software suite implemented by the European bioinformatics network EMBNet

The solution can be seen by unfolding the code below.

## Open a connection to the IFB core cluster
ssh <username>@core.cluster.france-bioinformatique.fr

## List the modules of your environment
module list
## Hmm ... no module is currently loaded

## Get the list of available modules
## Use the option -t to get one module per row
module avail -t

## Identify the blast module
module avail -t | grep blast

## Identify the emboss module
module avail -t | grep emboss

Task 2: Load the required modules.

## Load required modules
module load blast
module load emboss

## List loaded modules
module list

Creating a workspace for this tutorial

In your home directory on the IFB core cluster, create a folder named cmb-comp-bio_ptracticals, and a sub-folder named proteome-blast_results.

## Define the result directory for this practical and export it in an environment variable.
export RESULT_DIR=~/cmb-comp-bio_ptracticals/proteome-blast_results

## Note: the -p command enables to create full path at once
mkdir -p ${RESULT_DIR}
cd ${RESULT_DIR}

Collecting the proteome of interest from Uniprot

We will use Uniprot Web interface to select a protein of interest, that will serve as query to search for homologous proteins in the selected proteome.

For this tutorial, we will use the fruit fly Drosophila melanogaster as organism of interest

Open a connection to uniprot.org
Use the advanced serarches to formulate structured queries
Select the Organism. Start typing Drosophila melanogaster and wait for a few seconds. Among the possible completions displayed, select the following one.

Drosophila melanogaster (Fruit fly) [7227]

Note: the number 7227 is the identifier of this species in the NCBI taxonomic database.
Click Search. This will select all the proteins of the selected organism. The result table only displays a summary information for 25 entries.
On the top of the result page, click Download and download all proteins in fasta format on your computer.

Note: by default the sequence file is compressed with the gzip algorithm, you should thus obtain a binary file with the extension .fasta.gz.
Locate the downloaded file on your computer, and rename if with a simpler name, for example Drosophila_melanogaster_proteome.fasta.gz¹.
Use the terminal to check the begining of this file

Transferring a sequence file to your account on IFB core cluster

Practical

Open a new terminal on your computer, and use the Unix command scp to copy the proteome sequence file in the result directory of the core cluster.

Note that an alternative is to use a user-friendly program (e.g. Filezilla) to transfer this file. For more information read the section on data transfer in the IFB cluster documentation.

## Go to your download folder. Note: this can vary depending on your operating system
cd ~/Downloads

## Check that the fasta file is there
ls -l Drosophila_melanogaster_proteome.fasta.gz

## Check the content of the file (use zless to directly see the content of the compressed file)
zless Drosophila_melanogaster_proteome.fasta.gz

## REPLACE <username> by your own username
scp Drosophila_melanogaster_proteome.fasta.gz \
  <username>@core.cluster.france-bioinformatique.fr:cmb-comp-bio_ptracticals/proteome-blast_results/

Come back to your ssh session on the IFB core cluster and check that the seque ce file is in the right directory, and contains the compressed proteome sequence file (extension .fasta.gz).

cd $RESULT_DIR ## Make sure you are in the result directory
ls -ltr   ## List the files sorted by date (most recent last)

Measure the size (in megabase) of the compressed sequence file (command: du), then uncompress it (command: gunzip) and measure the size of the uncompressed file.

## Measure the size of the compressed proteome file
du -sm Drosophila_melanogaster_proteome.fasta.gz

## Uncompress the proteome file
gunzip Drosophila_melanogaster_proteome.fasta.gz

## Measure the size of the uncompressed proteome file
du -sm Drosophila_melanogaster_proteome.fasta

Getting a query sequence

Click on the Advanced search again. Your previous query is still displayed. Click the big + symbol to add a criterion, select Protein name and type Histone H3.
Click on the link to the protine named Histone H3 (with ID P02299). You will see the full record of annotations for this protein.
Select Format \(\rightarrow\) Fasta. This displays the sequence of Histone 3 in fasta format. We will use this sequence as query for BLAST.
Copy the link of this sequence (https://www.uniprot.org/uniprot/P02299.fasta) in a file.
Come back to the IFB core cluster terminal, make sure you are in the results directory, and use the Unix command wget to download the sequence.

cd $RESULT_DIR

wget --no-clobber https://www.uniprot.org/uniprot/P02299.fasta

Note: we use the option –no-clobber. To know why, read the wget manual after typing the command man wget. In Unix man pages, you can find a given word (e.g. clobber) by typing / followed by the search string.

Searching sequence similarities with BLAST

Formatting the blast database for the reference genome

Before being able to run sequence similarity searches, BLAST needs to read the sequence database (in our case, the proteome) in order to build an index of the k-mers (oligopeptides). This is done with the command makeblastdb.

Practical

Display the makeblastdb options (-h) manual (-help). Well, the the manual is quite heavy, so we give you the equired options: -in and -dbtype. Read the usage of these options in the manual, built the command and run it.
Build a command (DO NOT RUN IT YET) to index the proteome of interest.
Send this command to the job scheduler via srun.
Check the result files.

## Make sure the blast module is loaded
module load blast

## Check the path of the program makeblastdb (it should be in your conda folder)
which makeblastdb

## Get the usage of makeblastdb (formal specification of the command syntax)
makeblastdb -h

## Get the help for makeblastdb (explanation of the options)
makeblastdb -help

## Build a BLAST database with all the protein sequences of D.melanogaster
srun makeblastdb -dbtype prot -in Drosophila_melanogaster_proteome.fasta

## Check the new files that were created with this commands. 
## For this we list the files in reverse (-r) temporal (-t) order. 
ls -1tr

The program makeblastdb created three files besides the input fasta file.

Drosophila_melanogaster_proteome.fasta.pin
Drosophila_melanogaster_proteome.fasta.phr
Drosophila_melanogaster_proteome.fasta.psq

These files contain an index of all the oligopeptites found in all the sequences of the fasta file. This “k-mer” index will be used by BLAST to rapidly find all the sequences in the database that match a query sequence, and use these short correspondences to start an alignment.

Read blastp manual

Since we want to search a protein database with protein query sequences, we will use the blastp command.

## Command  syntax
blastp -h

## Help
blastp -help

OK, the first contact is a bit frightening, because blastp has many options. This is because it allows you to run queries in different modes, with different parameters, and to export the results in different formats.

In this tutorial we will show you the most usual ways to use the tool, and you will then be able to refine your search by combing back to the manual and understanding the use of additional options.

Similarity search for one sequence against a database

Build a command (DO NOT RUN IT YET) that searches similarities for one or several query sequences provided in a fasta file (e.g. P02299.fasta) in the proteome of interest (Drosophila_melanogaster_proteome.fasta). By default, blastp prints the result on the screen, but we prefer to redirect it to a file (e.g. P02299_blastp_result.txt) in order to keep a trace of the result and inspect it later.
Send this command to the job scheduler with srun.

## Search similarities for histone 3 in D.melanogaster proteome
srun blastp \
  -db Drosophila_melanogaster_proteome.fasta  \
  -query P02299.fasta \
  > P02299_blastp_result.txt
  
## Check the result
less P02299_blastp_result.txt

Questions

How many hits were found in total?
How many of these have an E-value higher than 1?
What was the threshold on the E-value for this blastp search?
With this threshold, how many hits would we expeect if we use a random sequence as query?
How many alignments would you consider significant in this result?
How many of these would you qualify of homologs?
For each of these putative homologs, is it a paralog, ortholog or “other log”?
Do you see a relationship between the E-value and some properties of the alignments?

Setting a conservative cutoff on the e-value

In blast manual, find the option that enables you to set a threshold (cutoff) on the e-value. Redo the blast search with a threshold of 1e-5 on the e-value.

## Search similarities for histone 3 in D.melanogaster proteome
srun blastp \
  -db Drosophila_melanogaster_proteome.fasta  \
  -query P02299.fasta -evalue 1e-5 \
  > P02299_blastp_result_eval1e-5.txt
  
## Check the result
less P02299_blastp_result_eval1e-5.txt

Getting a tabular synthesis of blast results

By default, BLAST displays the detailed results of a similaty search, starting with a summary of the matches, followed by all the alignemnts. It can be convenient to generate a synthetic result indicating only the relevant statistics for each alignment. This can be done by tuning BLAST options.

Read blastp manual and find a way to tune the formatting options in order to get a tabular output with comments.
Refine the query by selecting the following fields in the tabilar option:
- ID of the “subject” sequence (the one found in the proteome of interest)
- length of the query sequence
- length of the subject sequence
- raw score of the alignment
- e-value

Solution

## Search similarities for histone 3 in D.melanogaster proteome.
## Return the result as a synthetic table with the matching statistics for each hit. 
srun blastp  \
  -db Drosophila_melanogaster_proteome.fasta  \
  -query P02299.fasta \
  -outfmt 7 \
  > P02299_blastp_summary.tsv

## Check the result
more P02299_blastp_summary.tsv

## Return table with selected fields
srun blastp \
  -db Drosophila_melanogaster_proteome.fasta  \
  -query P02299.fasta \
  -outfmt "6 sseqid qlen slen length score  evalue" \
  > P02299_blastp_summary.tsv

## Check the result
more P02299_blastp_summary.tsv

Negative control: blast against a shuffled sequence

In this section, we will perform a quick empirical test, by submitting a random sequence to blastp. For this, we will shuffle the aminoacids of the original query sequence.

Shuffling sequences with EMBOSS shuffleseq

Load the emboss module
Read the manual of the shuffleseq tool.
Send a command to the job scheduler to generate a shuffled version of the query sequence.

## Activate the module with the emboss software suite
module load emboss

## Check that shuffleseq is available
which shuffleseq

## Get the suffle manual 
shuffleseq -h

## Shuffle the apartokinase 3 sequence
srun shuffleseq -sequence P02299.fasta \
   -outseq shuffled_P02299.fasta

## Check the result
cat shuffled_P02299.fasta

## Compare the shuffled seq with the original one
cat P02299.fasta

Here is an example of result (I just simplified the header of the shuffle output fasta sequence for the sake of readability).

Of course, your result should differ since each shuffling produces a randomized sequence.

>shuffled_H3_DROME
LMKPEQMLIAKKKDTKRVLASVRGIHRIARKVARTERLQKVLFRKTIAFALAGDICGQPT
ESAPRTADLAQAVIFRHGTRPTRYTAPLSMQGKSYNLFGRQEDEQRASARASELTAIPKL
KVYERGAKAKQTRRLR

Exercise

Shuffle the histone 3 sequence with the tool shuffleseq (found in the conda environment emboss).
Run a blastp search with this shuffled sequence against the selected proteome.

Solution


## Run blast with the shuffled lysC sequence against all D.melanogaster proteins
srun blastp \
  -db Drosophila_melanogaster_proteome.fasta  \
  -query shuffled_P02299.fasta \
  > shuffled_P02299_blastp_result.txt

## Check the result
less shuffled_P02299_blastp_result.txt

## Re-shuffle the apartokinase 3 sequence
srun shuffleseq -sequence P02299.fasta \
   -outseq shuffled_P02299.fasta

## Run the same query with a tabular output
blastp -outfmt 7 \
  -db Drosophila_melanogaster_proteome.fasta  \
  -query shuffled_P02299.fasta \
  > shuffled_P02299_blastp_summary.tsv

## count the number of false positives
wc -l shuffled_P02299_blastp_summary.tsv

## Check the result
less shuffled_P02299_blastp_summary.tsv

Questions

How many alignments would you expect by chance with the default blastp parameters?
How many alignemnts did you get with the shuffled sequence?
How many of these had an E-value smaller than 1?
How many of these had an E-value smaller than 0.01?
How many of these would you qualify of homologs?

Homework

Supplementary information (not part of the practical)

Installing the conda environment in your own computer

Attention, this is a complement to the practical, which should not be run if you are working on the IFB core cluster. This section explains you how to install on your environment all the tools required for this tutorial.

## ATTENTION: DO NOT RUN THIS COMMAND IF YOU ARE ON THE NNCR CLUSTER,
## because this would occupy a big space to reinstall BLAST in your 
## own account, although it is already available for all users. 

## Check that conda is available on your computer
which conda

## Define an environment variable with the conda bin directory
export CONDA=`which conda`
export CONDA_BIN=`dirname $CONDA`

## Create a conda environment named blast/2.7.1 and install the corresponding blast version
conda create -n blast blast==2.7.1

## Create a conda environment named emboss/6.6.0 and install the corresponding emboss version
conda create -n emboss emboss==6.6.0

## List the conda environments available on the system
$CONDA_BIN/conda env list

## FInd the blast environment
$CONDA_BIN/conda env list | grep -i blast

## Activate the blast environment. 
conda activate blast

## Check the location of the blastp command
which blastp

## Deactivate the current environment. 
conda deactivate
## Note the change in the shell prompt

## Re-activate the blast environment and activate the emboss environment
conda activate blast
conda activate emboss
## Note the change in the shell prompt

## Find the path of the shuffleseq command (part of the EMBOSS package)
which shuffleseq

References

The file downloaded from Uniprot has a sophisticated name with parentheses and brackets, which will be hard to handle in unix commands: uniprot-organism__Drosophila+melanogaster+(Fruit+fly)+[7227]_.fasta.gz↩︎

Using BLAST on the command line

Jacques van Helden

Created: 8/10/2018; last update: 2019-11-29

Goals of this tutorial

Operations

Prerequisite

Warning about command-line use of the cluster

Bioinformatics resources used for this course

Collecting the sequences

Loading the required modules on the IFB core cluster

Creating a workspace for this tutorial

Collecting the proteome of interest from Uniprot

Transferring a sequence file to your account on IFB core cluster

Practical

Getting a query sequence

Searching sequence similarities with BLAST

Formatting the blast database for the reference genome

Practical

Read blastp manual

Similarity search for one sequence against a database

Questions

Setting a conservative cutoff on the e-value

Getting a tabular synthesis of blast results

Solution

Negative control: blast against a shuffled sequence

Shuffling sequences with EMBOSS shuffleseq

Exercise

Solution

Questions

Homework

Supplementary information (not part of the practical)

Installing the conda environment in your own computer

References