Practical: motif discovery in yeast regulons

Introduction

In this tutorial, we will use R to detect over-represented motifs in the regulatory sequences of a set of genes involved in a biological process of interest.

Case study: we illustrate the approach with the methionine biosynthetic process in the yeast Saccharomyces cerevisiae.

Each student will be invited to run the same analyses with a different biological process, and we will compare the results gathered by all of us in order to draw some general insight into the relevance of the approach.

Suggested biological processes

I provide here some examples of biological processes of the yeast Saccharomyces cerevisiae. Note that these processes may have different names depending on the databases, so you will have to adapt them in different steps of the tutorial.

L-methionine biosynthesis
L-arginine biosynthesis
L-lysine biosynthesis
any other amino acid metabolic pathway
Ergosterol biosynthesis
Galactose utilization
Phosphate utilization
Nitrogen utilization
… (don’t hesitate to be adventurous)

Resources

Resource	Description	URL
BioCyc	Database of metabolic pathways	https://biocyc.org/
Gene Ontology Consortium	Official site of the Gene Ontology, with tools to explore and retrieve data	http://www.geneontology.org/
Ensembl BioMart	Genomic platform maintained by the European Bioinformatics Institute	https://www.ensembl.org/
RSAT	Regulatory Sequence Analysis Tools	http://rsat.eu/
RSAT Fungi	Fungi server of the Regulatory Sequence Analysis Tools	https://rsat.france-bioinformatique.fr/fungi/

Collective table for the 2023 practical

Students will store their results in a shared spreadsheet, which will be used to compare them and to get a broader picture from the results obtained with different biological processes.

Shared folder (in edit mode): https://drive.google.com/drive/folders/139B72oJOoI5fDWn3FfxhK7u-K3TXhPEV
Shared result table (one row per student): https://docs.google.com/spreadsheets/d/1XuG4DcLtktlgY2GLzyS7SA09BzSTG7vh/edit#gid=711917257

Add a line with your name and email. You will progressively fill in the other columns during the practical.

Viewing metabolic pathways on BioCyc

Open a connection to BioCyc (https://biocyc.org/)
Click on the button change organism database and select Saccharomyces cerevisiae S288c.
In the Tools menu, select the function Metabolism > Generate a metabolic map poster to get an overview of the whole metabolism.
Select the tool Browse pathway ontology and select a pathway of your choice (not too big, not too small, let us say between 5 and 50 genes).
As an example I select two pathways:
- “superpathway of sulfur amino acid biosynthesis”
- “L-methionine de novo biosynthesis”
Click on the pathway instances to display the pathway maps. Play with the options to increase or decrease the level of detail.
In your custom result folder, save a figure that provides a nice view of the pathway, and could be used for example in the introduction of a report.

Finding a process in the Gene Ontology

Open a connection to the Gene Ontology Consortium (http://www.geneontology.org/)
In the search box, type the name of your biological process of interest (e.g. ‘methionine biosynthesis’). Check the option “Ontology” and start the search. This opens a page with the list of matching terms in the Gene Ontology.

Select the one you need (e.g. “L-methionine biosynthetic process”).

This opens the record describing the biological process you chose.

Before anything else, take note of the GO identifier (e.g. GO:0071265 for the L-methionine biosynthetic process).
Read the term definition and check that the description corresponds to your choice.
In the Annotations tab, check the number of genes associated with this term (for GO:0071265, there were 896 genes on Feb 10, 2023). Note that these genes belong to different organisms.
In the left menu, click Organisms and click on the “+” button beside “Saccharomyces cerevisiae S288C”. This will add a filter on the genes.
Click on the Download button. Drag the field Gene/product bioentity label from left to right in order to use it as first field in the “Selected fields” box. This will provide us with the common gene name, which is more convenient to select genes in RSAT.

Question: how many genes are associated with your process in the budding yeast?

Getting genes from BioMart

Open a connection to Ensembl (https://www.ensembl.org/) and select the BioMart tool.
Choose the database: Ensembl Genes [version] (note: the current version is 95 in Feb 2019)
Choose dataset: Saccharomyces cerevisiae genes (R64-1-1)
In the left menu, click Filters and open the section Gene Ontology.
Type the GO term accession ID of your choice (e.g. GO:0071265).
In the top corner of the window, click Count and check that you have a reasonable number of genes (with GO:0071265, this returned 14 genes in Feb 2023).

We will now customize the attributes to be extracted from BioMart.

In the left menu, click Attributes, then Genes
Activate the following attributes:
- Gene stable ID
- Protein stable ID
- Gene description
- Gene name

Note: BioMart is a flexible way to quickly get a lot of information about genes. For other tasks you might be led to select additional attributes.

In the top left corner, click Results to get the selected genes.
Select Export all results to File and TSV (tab-separated values) and click Go. This will download a tab-delimited text file to the Download folder of your Web browser. Save this file on your computer.
Rename the downloaded file (e.g. Scerevisiae_methionine-biosynthetic-process_GO-0071265_genes.tsv).
Open the TSV file to check its content. The most convenient is to open it with a spreadsheet tool (e.g. LibreOffice Calc, Excel), but you can also view it with a simple text editor.
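
The exported table can also be inspected with a short script. The following is a minimal Python sketch, not part of the original practical: it assumes that the column headers match the four attributes activated above, and it uses a small in-memory stand-in for the downloaded file, in which the descriptions and gene names are placeholders rather than real annotations.

```python
import csv
import io


def read_biomart_tsv(handle):
    """Return the rows of a tab-separated BioMart export as a list of dictionaries."""
    return list(csv.DictReader(handle, delimiter="\t"))


# Illustrative stand-in for the downloaded file.
# The descriptions and gene names are placeholders, not real annotations.
sample = io.StringIO(
    "Gene stable ID\tProtein stable ID\tGene description\tGene name\n"
    "YAL012W\tYAL012W\tplaceholder description\tNAME1\n"
    "YJR024C\tYJR024C\tplaceholder description\tNAME2\n"
    "YJR024C\tYJR024C\tplaceholder description\tNAME2\n"
)

rows = read_biomart_tsv(sample)
# A gene can appear on several rows, so count distinct identifiers separately.
gene_ids = sorted({row["Gene stable ID"] for row in rows})
print(len(rows), "rows,", len(gene_ids), "distinct genes")
```

To read your own export, replace the stand-in with the real file, e.g. open("Scerevisiae_methionine-biosynthetic-process_GO-0071265_genes.tsv", newline=""), and pass the resulting handle to read_biomart_tsv.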

At this stage, we have a table with the description of all the genes associated with the biological process of our choice.

Discovering motifs in promoters of GO-associated genes

We will try to discover over-represented motifs in the promoters of the genes involved in our pathway of interest.

Open the TSV file with the genes involved in your biological process.
Select the column with stable gene identifiers. In my case, the list below contains 13 distinct IDs (some of them appear more than once).

YAL012W
YPL273W
YJR148W
YHR137W
YJR024C
YJR024C
YEL038W
YEL038W
YMR009W
YMR009W
YHR112C
YGL202W
YGL184C
YGL184C
YHR208W
YPR118W
YLR017W
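
Since some identifiers are repeated in the list above, it can be useful to reduce it to a non-redundant list before submitting it to RSAT. A minimal Python sketch (an illustration, not a step of the original practical) that removes duplicates while keeping the original order:

```python
# Identifiers copied from the column of stable gene IDs, repeats included.
ids = """
YAL012W YPL273W YJR148W YHR137W YJR024C YJR024C YEL038W YEL038W YMR009W
YMR009W YHR112C YGL202W YGL184C YGL184C YHR208W YPR118W YLR017W
""".split()

# dict.fromkeys keeps the first occurrence of each ID and preserves the order.
unique_ids = list(dict.fromkeys(ids))

print(len(ids), "entries,", len(unique_ids), "distinct genes")
print("\n".join(unique_ids))
```

With the list above, this reports 17 entries and 13 distinct genes.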

Open a connection to the fungal RSAT server (http://fungi.rsat.eu/)
With the tool retrieve sequence, get the non-coding sequences located upstream of your genes.
In the Next Step box of the result page, click the button oligo-analysis.
For oligomer lengths select 6 (uncheck the values 7 and 8 that are selected by default).
Leave all other options unchanged and click GO.
Do the same analysis with dyad-analysis instead of oligo-analysis.
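
To give an intuition of what is counted in such an analysis, here is a deliberately simplified Python sketch that counts hexanucleotide occurrences in a set of sequences. It is not the RSAT implementation: it counts overlapping occurrences on a single strand only, whereas the RSAT tools offer further options that this sketch ignores. The two sequences are made up for the illustration.

```python
from collections import Counter


def count_oligos(sequences, k=6):
    """Count every k-letter word (overlapping, single strand) in the sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            oligo = seq[i:i + k]
            # Skip words containing ambiguous or masked nucleotides (e.g. N).
            if set(oligo) <= set("ACGT"):
                counts[oligo] += 1
    return counts


# Made-up toy sequences sharing one hexanucleotide.
toy_sequences = ["TTGACGATCC", "AAGACGATGG"]
counts = count_oligos(toy_sequences)
print(counts.most_common(1))  # [('GACGAT', 2)]
```

Counting alone is not sufficient: a word is reported as over-represented only if it occurs more often than expected from a background model, which is what the statistics returned by oligo-analysis evaluate.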

Questions

Did you obtain significant motifs?
If so, report the P-value, E-value and significance index of the most significant hexanucleotide in the collective result table.
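
The three scores are related to each other. In RSAT, the E-value is the P-value multiplied by the number of tested patterns, and the significance index is minus the decimal logarithm of the E-value. The Python sketch below illustrates these relations under a binomial model; the function names, the default of 4096 tested patterns (all hexanucleotides on a single strand) and the numbers in the example are assumptions chosen for the illustration, not values returned by RSAT.

```python
import math


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), with 0 < p < 1, summed in log space."""
    if k <= 0:
        return 1.0
    total = 0.0
    for i in range(k, n + 1):
        log_term = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p) + (n - i) * math.log1p(-p)
        )
        total += math.exp(log_term)
    return min(total, 1.0)


def oligo_scores(occurrences, positions, prior, tested_patterns=4096):
    """Return (P-value, E-value, significance index) for one oligonucleotide."""
    p_value = binomial_tail(occurrences, positions, prior)
    e_value = p_value * tested_patterns            # correction for multiple testing
    sig = -math.log10(e_value) if e_value > 0 else math.inf
    return p_value, e_value, sig


# Hypothetical numbers: 12 occurrences among 10000 positions,
# for a word whose prior probability is 1/4096.
p_value, e_value, sig = oligo_scores(12, 10000, 1 / 4096)
print(f"P-value={p_value:.2e}  E-value={e_value:.2e}  sig={sig:.2f}")
```

A positive significance index corresponds to an E-value below 1, i.e. less than one such pattern would be expected by chance among all the patterns tested.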

Negative control

In RSAT Fungi, open the tool Random gene selection, select Saccharomyces cerevisiae and set the number of genes to the size of the gene set analysed above (choose the one that gave the best result).
Run the same analyses as above with the random selection of genes, and take note of the scores of the most significant hexanucleotide.
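
The principle of this negative control can be sketched in a few lines of Python: draw, without replacement, as many genes as in the set of interest from the complete gene list of the organism. This is only an illustration of the idea, not the RSAT tool; the function name is hypothetical and the gene list below is a made-up placeholder that you would replace with real identifiers (e.g. an unfiltered BioMart export).

```python
import random


def random_gene_set(all_gene_ids, size, seed=None):
    """Draw `size` distinct gene IDs at random; a fixed seed makes the draw reproducible."""
    rng = random.Random(seed)
    return rng.sample(sorted(set(all_gene_ids)), size)


# Placeholder identifiers standing in for the complete gene list of the organism.
all_genes = [f"GENE{i:04d}" for i in range(1, 101)]

# Same size as the gene set analysed above (13 genes in the methionine example).
control = random_gene_set(all_genes, 13, seed=42)
print(len(control), "randomly selected genes")
```

Repeating the draw with different seeds gives an idea of the scores that can be reached by chance alone.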

Scientific report

Structure of the report

Each student should return a scientific report containing

A summary of the principal results and their interpretation. This section should be concise and must not exceed 2 pages.
A full list of the complete RSAT commands used for the successive steps of the analysis.
As many figures and tables as useful.
- Figures and tables are not counted in the two pages, but must be numbered to enable cross-references from the text;
- Each figure or table should have a title (in bold) and a legend sufficiently detailed to enable the reader to understand its contents.

Report submission date

Reports should be deposited before February 26, 2023 on the Moodle server of the LCG.

Evaluation criteria

Completion of the result in the shared result table on GDrive.
Precision of the results in the report (e.g. number of non-redundant genes, e-value, p-value, significance, …)
Interpretation of results, e.g.
- relevance of the genes found by the different gene selection approaches;
- statistical significance of the discovered motifs (including comparison between dyads and oligos);
- relevance of the discovered motifs relative to the motifs known to be involved in the regulation of the considered biological process;
- …
Interpretation of negative controls

Attention: The biological interpretation of the findings is crucial for your score.
Plagiarism will not be tolerated.