In this tutorial, we will use the local sequence similarity search tool (BLAST) to searching sequence similarities for a protein of interest (an histone) in the proteome of a given organism (the yeast Saccharomyces cerevisiae).
This approach is very simple, fast and widely used (BLAST is the most highly bioinformatics tool worldwide), but suffers from some limitations. In particular, the resulting list of protein sequences is sensitive to the particular protein chosse as query.
In a next practical, we will rely on a more powerful method based on Hidden Markov Models (HMM) in order to build a profile for a protein family, and use this profile to scan the proteome and detect all the members of this family.
Technically, this tutorial will also lead us to submit jobs to a scheduler, monitor the output and error logs, and organise the results for further use.
We will also perform some exercise to evaluate the impact of some parameters on the results (in particular, the threshold on the e-value).
We will successively perform the following operations.
Select the full set of proteins (proteome) identified by translating the putatively coding regions of a given organism (e.g. the fruit fly Drosophila melanogaster), download its sequences to our computer, and upload these sequences to the High-Performance Computing server (the IFB core cluster).
Download a sequence of interest (e.g. Histone H3) on the HPC server.
Use the similarity search tool blast
to find homologs of the protein of interest in the selected genome.
This practical assumes that you are connected to a cluster of the IFB NNCR. This enables you to benefit from a software environment with pre-installed software tools for the different fields of bioinformatics.
You should have recieved a login before starting the course. If this is not the case, see the information on account request on the IFB Web site (https://www.france-bioinformatique.fr/en/cluster).
Before starting this tutorial, you should have followed the following introductory sessions
Please remember: the sequence analysis tasks below should by no way run on the login node of the computer. Each job has to be sent to the slurm
job scheduler using srun
(submit a single command) or sbacth
(submit a bash script), as explained in the tutorials of the prerequisite section.
Resource | Description | URL |
---|---|---|
Ensembl | Genome database | www.ensembl.org |
UniprotKB | Knowledge base of protein sequence and functional information | www.uniprot.org |
PFAM | Database of protein families | pfam.xfam.org |
The IFB core cluster comes with a modular software environment, which enables users to deploy the specific tools required for their analyses.
In this section, we will open a connection to the IFB cluster and load the modules required for the practicals below.
Task 1: use the command module avail
to explore the lists of available modules and identify the two modules used for this course.
The solution can be seen by unfolding the code below.
## Open a connection to the IFB core cluster
ssh <username>@core.cluster.france-bioinformatique.fr
## List the modules of your environment
module list
## Hmm ... no module is currently loaded
## Get the list of available modules
## Use the option -t to get one module per row
module avail -t
## Identify the blast module
module avail -t | grep blast
## Identify the emboss module
module avail -t | grep emboss
Task 2: Load the required modules.
In your home directory on the IFB core cluster, create a folder named cmb-comp-bio_ptracticals
, and a sub-folder named proteome-blast_results
.
We will use Uniprot Web interface to select a protein of interest, that will serve as query to search for homologous proteins in the selected proteome.
For this tutorial, we will use the fruit fly Drosophila melanogaster as organism of interest
Open a connection to uniprot.org
Use the advanced serarches to formulate structured queries
Select the Organism. Start typing Drosophila melanogaster and wait for a few seconds. Among the possible completions displayed, select the following one.
Drosophila melanogaster (Fruit fly) [7227]
Note: the number 7227 is the identifier of this species in the NCBI taxonomic database.
Click Search. This will select all the proteins of the selected organism. The result table only displays a summary information for 25 entries.
On the top of the result page, click Download and download all proteins in fasta format on your computer.
Note: by default the sequence file is compressed with the gzip
algorithm, you should thus obtain a binary file with the extension .fasta.gz
.
Locate the downloaded file on your computer, and rename if with a simpler name, for example Drosophila_melanogaster_proteome.fasta.gz
1.
Use the terminal to check the begining of this file
scp
to copy the proteome sequence file in the result directory of the core cluster.Note that an alternative is to use a user-friendly program (e.g. Filezilla) to transfer this file. For more information read the section on data transfer in the IFB cluster documentation.
## Go to your download folder. Note: this can vary depending on your operating system
cd ~/Downloads
## Check that the fasta file is there
ls -l Drosophila_melanogaster_proteome.fasta.gz
## Check the content of the file (use zless to directly see the content of the compressed file)
zless Drosophila_melanogaster_proteome.fasta.gz
## REPLACE <username> by your own username
scp Drosophila_melanogaster_proteome.fasta.gz \
<username>@core.cluster.france-bioinformatique.fr:cmb-comp-bio_ptracticals/proteome-blast_results/
ssh
session on the IFB core cluster and check that the seque ce file is in the right directory, and contains the compressed proteome sequence file (extension .fasta.gz
).du
), then uncompress it (command: gunzip
) and measure the size of the uncompressed file.Click on the Advanced search again. Your previous query is still displayed. Click the big + symbol to add a criterion, select Protein name and type Histone H3.
Click on the link to the protine named Histone H3 (with ID P02299). You will see the full record of annotations for this protein.
Select Format → Fasta. This displays the sequence of Histone 3 in fasta format. We will use this sequence as query for BLAST.
Copy the link of this sequence (https://www.uniprot.org/uniprot/P02299.fasta) in a file.
Come back to the IFB core cluster terminal, make sure you are in the results directory, and use the Unix command wget
to download the sequence.
Note: we use the option –no-clobber. To know why, read the wget manual after typing the command man wget
. In Unix man pages, you can find a given word (e.g. clobber) by typing /
followed by the search string.
Before being able to run sequence similarity searches, BLAST needs to read the sequence database (in our case, the proteome) in order to build an index of the k-mers (oligopeptides). This is done with the command makeblastdb
.
Display the makeblastdb
options (-h
) manual (-help
). Well, the the manual is quite heavy, so we give you the equired options: -in
and -dbtype
. Read the usage of these options in the manual, built the command and run it.
Build a command (DO NOT RUN IT YET) to index the proteome of interest.
Send this command to the job scheduler via srun
.
Check the result files.
## Make sure the blast module is loaded
module load blast
## Check the path of the program makeblastdb (it should be in your conda folder)
which makeblastdb
## Get the usage of makeblastdb (formal specification of the command syntax)
makeblastdb -h
## Get the help for makeblastdb (explanation of the options)
makeblastdb -help
## Build a BLAST database with all the protein sequences of D.melanogaster
srun makeblastdb -dbtype prot -in Drosophila_melanogaster_proteome.fasta
## Check the new files that were created with this commands.
## For this we list the files in reverse (-r) temporal (-t) order.
ls -1tr
The program makeblastdb
created three files besides the input fasta file.
Drosophila_melanogaster_proteome.fasta.pin
Drosophila_melanogaster_proteome.fasta.phr
Drosophila_melanogaster_proteome.fasta.psq
These files contain an index of all the oligopeptites found in all the sequences of the fasta file. This “k-mer” index will be used by BLAST to rapidly find all the sequences in the database that match a query sequence, and use these short correspondences to start an alignment.
Since we want to search a protein database with protein query sequences, we will use the blastp
command.
OK, the first contact is a bit frightening, because blastp has many options. This is because it allows you to run queries in different modes, with different parameters, and to export the results in different formats.
In this tutorial we will show you the most usual ways to use the tool, and you will then be able to refine your search by combing back to the manual and understanding the use of additional options.
Build a command (DO NOT RUN IT YET) that searches similarities for one or several query sequences provided in a fasta file (e.g. P02299.fasta
) in the proteome of interest (Drosophila_melanogaster_proteome.fasta
). By default, blastp
prints the result on the screen, but we prefer to redirect it to a file (e.g. P02299_blastp_result.txt
) in order to keep a trace of the result and inspect it later.
Send this command to the job scheduler with srun
.
In blast
manual, find the option that enables you to set a threshold (cutoff) on the e-value. Redo the blast search with a threshold of 1e-5 on the e-value.
By default, BLAST displays the detailed results of a similaty search, starting with a summary of the matches, followed by all the alignemnts. It can be convenient to generate a synthetic result indicating only the relevant statistics for each alignment. This can be done by tuning BLAST options.
Read blastp manual and find a way to tune the formatting options in order to get a tabular output with comments.
Refine the query by selecting the following fields in the tabilar option:
## Search similarities for histone 3 in D.melanogaster proteome.
## Return the result as a synthetic table with the matching statistics for each hit.
srun blastp \
-db Drosophila_melanogaster_proteome.fasta \
-query P02299.fasta \
-outfmt 7 \
> P02299_blastp_summary.tsv
## Check the result
more P02299_blastp_summary.tsv
## Return table with selected fields
srun blastp \
-db Drosophila_melanogaster_proteome.fasta \
-query P02299.fasta \
-outfmt "6 sseqid qlen slen length score evalue" \
> P02299_blastp_summary.tsv
## Check the result
more P02299_blastp_summary.tsv
In this section, we will perform a quick empirical test, by submitting a random sequence to blastp. For this, we will shuffle the aminoacids of the original query sequence.
shuffleseq
tool.## Activate the module with the emboss software suite
module load emboss
## Check that shuffleseq is available
which shuffleseq
## Get the suffle manual
shuffleseq -h
## Shuffle the apartokinase 3 sequence
srun shuffleseq -sequence P02299.fasta \
-outseq shuffled_P02299.fasta
## Check the result
cat shuffled_P02299.fasta
## Compare the shuffled seq with the original one
cat P02299.fasta
Here is an example of result (I just simplified the header of the shuffle output fasta sequence for the sake of readability).
Of course, your result should differ since each shuffling produces a randomized sequence.
>shuffled_H3_DROME
LMKPEQMLIAKKKDTKRVLASVRGIHRIARKVARTERLQKVLFRKTIAFALAGDICGQPT
ESAPRTADLAQAVIFRHGTRPTRYTAPLSMQGKSYNLFGRQEDEQRASARASELTAIPKL
KVYERGAKAKQTRRLR
Shuffle the histone 3 sequence with the tool shuffleseq
(found in the conda environment emboss
).
Run a blastp
search with this shuffled sequence against the selected proteome.
## Run blast with the shuffled lysC sequence against all D.melanogaster proteins
srun blastp \
-db Drosophila_melanogaster_proteome.fasta \
-query shuffled_P02299.fasta \
> shuffled_P02299_blastp_result.txt
## Check the result
less shuffled_P02299_blastp_result.txt
## Re-shuffle the apartokinase 3 sequence
srun shuffleseq -sequence P02299.fasta \
-outseq shuffled_P02299.fasta
## Run the same query with a tabular output
blastp -outfmt 7 \
-db Drosophila_melanogaster_proteome.fasta \
-query shuffled_P02299.fasta \
> shuffled_P02299_blastp_summary.tsv
## count the number of false positives
wc -l shuffled_P02299_blastp_summary.tsv
## Check the result
less shuffled_P02299_blastp_summary.tsv
Attention, this is a complement to the practical, which should not be run if you are working on the IFB core cluster. This section explains you how to install on your environment all the tools required for this tutorial.
## ATTENTION: DO NOT RUN THIS COMMAND IF YOU ARE ON THE NNCR CLUSTER,
## because this would occupy a big space to reinstall BLAST in your
## own account, although it is already available for all users.
## Check that conda is available on your computer
which conda
## Define an environment variable with the conda bin directory
export CONDA=`which conda`
export CONDA_BIN=`dirname $CONDA`
## Create a conda environment named blast/2.7.1 and install the corresponding blast version
conda create -n blast blast==2.7.1
## Create a conda environment named emboss/6.6.0 and install the corresponding emboss version
conda create -n emboss emboss==6.6.0
## List the conda environments available on the system
$CONDA_BIN/conda env list
## FInd the blast environment
$CONDA_BIN/conda env list | grep -i blast
## Activate the blast environment.
conda activate blast
## Check the location of the blastp command
which blastp
## Deactivate the current environment.
conda deactivate
## Note the change in the shell prompt
## Re-activate the blast environment and activate the emboss environment
conda activate blast
conda activate emboss
## Note the change in the shell prompt
## Find the path of the shuffleseq command (part of the EMBOSS package)
which shuffleseq
The file downloaded from Uniprot has a sophisticated name with parentheses and brackets, which will be hard to handle in unix commands: uniprot-organism__Drosophila+melanogaster+(Fruit+fly)+[7227]_.fasta.gz
↩︎