In this tutorial, we will use R to detect over-represented motifs in the regulatory sequences of a set of genes involved in a biological process of interest.
Our study case: we exemplify the approach with the methionine biosynthetic process in the yeast Saccharomyces cerevisiae.
Each student will be invited to run the same analyses with a different biological process, and we will compare the results gathered by all of us in order to draw some general insight into the relevance of the approach.
I provide here some examples of biological processes of the yeast Saccharomyces cerevisiae. Note that these processes may have different names depending on the databases, so you will have to adapt them in different steps of the tutorial.
Resource | Description | URL |
---|---|---|
BioCyc | Database of metabolic pathways | https://biocyc.org/ |
Gene Ontology Consortium | Official site of the Gene Ontology, with tools to explore and retrieve data | http://www.geneontology.org/ |
Ensembl BioMart | Genomic platform maintained by the European Bioinformatics Institute | https://www.ensembl.org/ |
RSAT | Regulatory Sequence Analysis Tools | http://rsat.eu/ |
RSAT Fungi | Fungi server of the Regulatory Sequence Analysis Tools | https://rsat.france-bioinformatique.fr/fungi/ |
Students will store their results in a shared spreadsheet, which will be used to compare them and to get a broader picture of the results obtained with different transcription factors.
Shared folder (in edit mode): https://drive.google.com/drive/folders/139B72oJOoI5fDWn3FfxhK7u-K3TXhPEV
Shared result table (one row per student): https://docs.google.com/spreadsheets/d/1XuG4DcLtktlgY2GLzyS7SA09BzSTG7vh/edit#gid=711917257
Add a line with your name and email. You will progressively fill in the other columns during the practical.
Open a connection to BioCyc (https://biocyc.org/)
Click on the button change organism database and select Saccharomyces cerevisiae S288c.
In the Tools menu, select the function Metabolism > Generate a metabolic map poster to get an overview of the whole metabolism.
Select the tool Browse pathway ontology and select a pathway of your choice (not too big, not too small, let us say between 5 and 50 genes).
As an example I select two pathways:
Click on the pathway instances to display the pathway maps. Play with the options to increase or decrease the level of detail.
In your custom result folder, save a figure that provides a nice view of the pathway, and could be used for example in the introduction of a report.
Open a connection to the Gene Ontology Consortium (http://www.geneontology.org/)
In the search box, type the name of your biological process of interest (e.g. ‘methionine biosynthesis’). Check the option “Ontology” and start the search. This opens a page with the list of matching terms in the Gene Ontology.
Select the one you need (e.g. “L-methionine biosynthetic process”).
This opens the record describing the biological process you chose.
Before anything else, take note of the GO identifier (e.g. GO:0071265 for the L-methionine biosynthetic process).
Read the term definition and check that the description corresponds to your choice.
In the Annotations tab, check the number of genes associated with this term (for GO:0071265, there were 896 genes on Feb 10, 2023). Note that these genes belong to different organisms.
In the left menu, click Organisms and click on the “+” button beside “Saccharomyces cerevisiae S288C”. This will add a filter on the genes.
Click on the Download button. Drag the field Gene/product bioentity label from left to right in order to use it as the first field in the “Selected fields” box. This will provide us with the common gene name, which is more convenient for selecting genes in RSAT.
Question: how many genes are associated with your process in the budding yeast?
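To answer this question without counting by hand, the downloaded annotations can be scanned for distinct gene labels. The snippet below is a minimal sketch in Python, assuming that the download is tab-delimited text with the gene label in the first column (as arranged in the previous step). The function name `count_unique_genes` and the example rows are hypothetical placeholders, not real annotations.

```python
def count_unique_genes(tsv_text: str) -> int:
    """Count the distinct gene labels found in the first column of tab-delimited text."""
    labels = set()
    for line in tsv_text.splitlines():
        # The first tab-separated field is assumed to hold the gene label.
        label = line.split("\t")[0].strip()
        if label:
            labels.add(label)
    return len(labels)


# Placeholder rows (not real annotations): the same gene may appear on
# several lines, for instance when it is supported by several evidence codes.
example = "geneA\tGO:0071265\ngeneB\tGO:0071265\ngeneA\tGO:0071265\n"
print(count_unique_genes(example))  # 2 distinct labels: geneA and geneB
```

A gene can appear on several annotation lines, so the number of lines in the file may exceed the number of non-redundant genes.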
Open a connection to Ensembl (https://www.ensembl.org/) and select the BioMart tool.
Choose the database: Ensembl Genes [version] (note: the current version is 95 in Feb 2019)
Choose dataset: Saccharomyces cerevisiae genes (R-46-1-1)
In the left menu, click Filters and open the section Gene Ontology.
Type the GO term accession ID of your choice (e.g. GO:0071265).
In the top corner of the window, click Count and check that you have a reasonable number of genes (with GO:0071265, this returned 14 genes in Feb 2023).
We will now customize the attributes to be extracted from BioMart.
In the left menu, click Attributes, then Genes
Activate the following attributes:
Note: BioMart is a flexible way to quickly get a lot of information about genes. For other tasks you might be led to select additional attributes.
In the top left corner, click Results to get the selected genes.
Select Export all results to File and TSV (tab-separated values) and click Go. This will download a tab-delimited text file in the Download folder of your Web browser. Save this file on your computer.
Rename the downloaded file (e.g. Scerevisiae_methionine-biosynthetic-process_GO-0071265_genes.tsv).
Open the TSV file to check its content. The most convenient is to open it with a spreadsheet tool (e.g. LibreOffice Calc, Excel), but you can also view it with a simple text editor.
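The same check can be done programmatically. The sketch below reads a tab-delimited table with a header line and reports the number of rows and the column names. It is written in Python with the standard `csv` module. The column names `Gene stable ID` and `Gene name` and the two data rows are assumptions used for illustration only; your file will contain the attributes you selected in BioMart.

```python
import csv
import io


def read_gene_table(handle):
    """Read a tab-delimited table with a header line into a list of dictionaries."""
    return list(csv.DictReader(handle, delimiter="\t"))


# Placeholder content standing in for the exported file
# (column names and values are assumptions, not real BioMart output).
example_tsv = "Gene stable ID\tGene name\nid_1\tname_1\nid_2\tname_2\n"
rows = read_gene_table(io.StringIO(example_tsv))

print(len(rows), "rows")
print("columns:", list(rows[0].keys()))
```

To read the real file, replace the `io.StringIO(...)` object with `open(path, newline="", encoding="utf-8")`, where `path` is the name you gave to the downloaded file.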
At this stage, we have a table with the description of all the genes associated with the biological process of our choice.
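The point-and-click selection above (dataset, GO filter, attributes) can also be written down as a BioMart XML query, which is convenient to document the analysis in a report. The sketch below only builds the query text in Python and does not contact any server. The dataset name `scerevisiae_gene_ensembl`, the filter name `go` and the attribute names `ensembl_gene_id` and `external_gene_name` are assumptions: check them against the identifiers shown by BioMart for your own query before using them.

```python
import xml.etree.ElementTree as ET


def build_biomart_query(dataset, go_id, attributes, filter_name="go"):
    """Build the text of a BioMart XML query: one dataset, one GO filter, several attributes."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",   # tab-separated values, as in the export step
        "header": "1",        # include a header line
        "uniqueRows": "1",    # drop duplicated rows
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    ET.SubElement(ds, "Filter", {"name": filter_name, "value": go_id})
    for attribute in attributes:
        ET.SubElement(ds, "Attribute", {"name": attribute})
    return ET.tostring(query, encoding="unicode")


# All names below are assumptions to be verified in BioMart.
query_xml = build_biomart_query(
    dataset="scerevisiae_gene_ensembl",
    go_id="GO:0071265",
    attributes=["ensembl_gene_id", "external_gene_name"],
)
print(query_xml)
```

Keeping the query as text makes it easy to rerun the same selection later, when the database version has changed and the gene count may differ.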
In RSAT Fungi, open the tool Random gene selection, select Saccharomyces cerevisiae and set the number of genes to the size of the gene set analysed above (choose the one that gave the best result).
Run the same analyses as above with the random selection of genes, and take note of the scores of the most significant hexanucleotide.
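The principle of this negative control is simply to draw, without replacement, as many genes as in the set of interest from the whole gene list. The sketch below illustrates the idea in Python. It is not the RSAT implementation, and the function name `random_gene_selection` and the placeholder gene identifiers are hypothetical.

```python
import random


def random_gene_selection(all_genes, n, seed=None):
    """Draw n distinct genes at random from the list of all genes."""
    rng = random.Random(seed)  # a fixed seed makes the selection reproducible
    return sorted(rng.sample(list(all_genes), n))


# Placeholder gene universe (not real yeast identifiers).
all_genes = [f"gene_{i:04d}" for i in range(1, 101)]

# Same size as the gene set of interest (14 genes in the example above).
control = random_gene_selection(all_genes, 14, seed=1)
print(len(control), "randomly selected genes")
```

Since the control genes are not expected to share regulatory signals, any motif reported as significant in such a set gives an idea of the rate of false positives.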
Each student should return a scientific report containing
A summary of the principal results and their interpretation. This section should be concise, and must not exceed two pages.
A full list of the complete RSAT commands used for the successive steps of the analysis
As many figures and tables as useful
Figures and tables are not counted in the two pages, but must be numbered in order to enable cross-references from the text to these elements.
Each figure or table should have a title (in bold) and a legend sufficiently detailed to enable the reader to understand its contents.
Reports should be deposited before February 26, 2023 on the Moodle server of the LCG.
Completion of the result in the shared result table on GDrive.
Precision of the results in the report (e.g. number of non-redundant genes, e-value, p-value, significance, …)
Interpretation of results, e.g.
Interpretation of negative controls
Attention: The biological interpretation of the findings is crucial for your score.
Plagiarism will not be tolerated.