3.2 Functional annotation and gene ontology

4 min read • July 30, 2024

Functional annotation and gene ontology are crucial for understanding the roles of genes in organisms. They help researchers make sense of genomic data, identify drug targets, and unravel disease mechanisms by assigning biological functions to genes and gene products.

Gene Ontology provides a standardized framework for describing gene functions across species. It uses three main ontologies (biological process, molecular function, and cellular component) organized in a hierarchical structure to facilitate annotation and analysis of gene sets.

Functional annotation in genomics

Definition and importance

  • Functional annotation is the process of assigning biological functions, processes, and pathways to genes or gene products based on experimental evidence or computational predictions
  • Crucial for understanding the roles and interactions of genes within an organism
    • Enables researchers to make sense of the vast amount of genomic data generated by high-throughput sequencing technologies (RNA-seq, ChIP-seq)
  • Helps in identifying potential drug targets, understanding disease mechanisms (cancer, neurodegenerative disorders), and guiding further experimental studies
  • Relies on various sources of information
    • Sequence homology (BLAST)
    • Protein domains (Pfam, InterPro)
    • Expression patterns (tissue-specific expression)
    • Literature mining (PubMed)

Methods and approaches

  • Manual annotation by expert curators involves reviewing the literature and experimental data to assign functions
    • Ensures high-quality and reliable annotations but is time-consuming and labor-intensive
  • Automated annotation methods can be used to assign functions to large datasets
    • Sequence similarity-based approaches (orthology, paralogy)
    • Domain-based approaches (presence of conserved protein domains)
    • These annotations may require additional validation
  • Integration of multiple lines of evidence (sequence, structure, expression, interactions) improves the confidence and accuracy of functional annotations
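
The sequence similarity-based approach above can be sketched as a simple annotation transfer: copy the functions of the best-matching annotated gene when the match is strong enough. This is a minimal illustration, not any particular tool's method. The `Hit` record, the `KNOWN_FUNCTIONS` table, the gene names, and both thresholds are hypothetical and chosen only for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One similarity-search hit for a query gene (hypothetical record, not a BLAST API)."""
    subject: str             # ID of the matched, already-annotated gene
    evalue: float            # expectation value; smaller means a stronger match
    percent_identity: float  # percent of identical aligned residues


# Hypothetical functions already assigned to the subject genes.
KNOWN_FUNCTIONS = {
    "geneA": {"DNA binding"},
    "geneB": {"catalytic activity"},
}


def transfer_annotations(hits, max_evalue=1e-10, min_identity=40.0):
    """Return functions copied from the best hit that passes both thresholds.

    The default thresholds are illustrative assumptions, not recommended values.
    """
    passing = [h for h in hits
               if h.evalue <= max_evalue and h.percent_identity >= min_identity]
    if not passing:
        return set()  # no confident homolog, so the query stays unannotated
    best = min(passing, key=lambda h: h.evalue)
    return set(KNOWN_FUNCTIONS.get(best.subject, set()))


hits = [Hit("geneA", 1e-50, 72.5), Hit("geneB", 1e-3, 31.0)]
predicted = transfer_annotations(hits)
print(predicted)
```

Annotations produced this way are predictions, which is why the text notes that they may require additional validation.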

Gene ontology structure

Standardized vocabulary and framework

  • Gene Ontology (GO) is a standardized vocabulary and framework for describing the functions of genes and gene products across different species
  • Consists of three main ontologies
    • Biological Process (BP): describes the larger biological programs or objectives in which a gene or gene product is involved (cell cycle, signal transduction)
    • Molecular Function (MF): describes the specific molecular activities or tasks performed by a gene or gene product (DNA binding, catalytic activity)
    • Cellular Component (CC): describes the subcellular locations or macromolecular complexes where a gene or gene product is found (nucleus, ribosome)

Hierarchical organization and properties

  • GO terms are organized in a hierarchical structure
    • More specific terms are child terms of more general parent terms
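    • Forms a directed acyclic graph (DAG) rather than a strict tree, so a term can have more than one parent
  • Each GO term has a unique identifier (for example, GO:0007049 for cell cycle), a name, and a definition; the evidence supporting an annotation is recorded with the annotation rather than with the term itself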
  • Relationships between GO terms include
    • is_a: indicates that a child term is a subtype of a parent term
    • part_of: indicates that a child term is a component of a parent term
    • regulates: indicates that a child term modulates the occurrence or rate of a parent term
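
To make the graph structure concrete, here is a small sketch that stores child-to-parent relationships and collects every ancestor of a term. The graph is a toy fragment: the term names are real GO vocabulary, but the edges are simplified for illustration and should not be read as the actual ontology. The `PARENTS` table and the `ancestors` function are hypothetical names.

```python
# Toy fragment of a GO-like graph: child -> list of (relationship, parent).
# Edges are simplified for illustration and do not reproduce the real ontology.
PARENTS = {
    "mitochondrial ribosome": [("is_a", "ribosome"), ("part_of", "mitochondrion")],
    "ribosome": [("is_a", "cellular_component")],
    "mitochondrion": [("is_a", "organelle")],
    "organelle": [("is_a", "cellular_component")],
    "cellular_component": [],
}


def ancestors(term, relations=("is_a", "part_of")):
    """Collect all terms reachable from `term` by following the given relationships."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for relation, parent in PARENTS.get(current, []):
            if relation in relations and parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


all_ancestors = ancestors("mitochondrial ribosome")
is_a_only = ancestors("mitochondrial ribosome", relations=("is_a",))
print(sorted(all_ancestors))
print(sorted(is_a_only))
```

Because "mitochondrial ribosome" has two parents in this toy graph, the traversal reaches the root along more than one path, which is the practical difference between a DAG and a tree.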

GO term application

Annotation process

  • GO annotation involves assigning the most appropriate and specific GO terms to a gene or gene product based on the available evidence
  • Evidence codes are used to indicate the type and strength of evidence supporting the annotation
    • Experimental evidence (IDA: Inferred from Direct Assay, IPI: Inferred from Physical Interaction)
    • Computational predictions (IEA: Inferred from Electronic Annotation, ISS: Inferred from Sequence or Structural Similarity)
  • Annotations can be made at different levels of granularity, depending on the specificity of the available evidence
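
One way to picture an annotation is as a record that ties a gene to a GO term and an evidence code. The sketch below uses only the four codes named above. The `Annotation` class, the gene names, and the sample records are hypothetical.

```python
from dataclasses import dataclass

# Only the evidence codes mentioned in the text, grouped as the text groups them.
EXPERIMENTAL_CODES = {"IDA", "IPI"}
COMPUTATIONAL_CODES = {"IEA", "ISS"}


@dataclass(frozen=True)
class Annotation:
    """A gene-to-term assignment with its evidence code (hypothetical record)."""
    gene: str
    go_term: str
    evidence: str


# Made-up example records.
annotations = [
    Annotation("gene1", "DNA binding", "IDA"),
    Annotation("gene1", "nucleus", "IEA"),
    Annotation("gene2", "catalytic activity", "ISS"),
    Annotation("gene2", "ribosome", "IPI"),
]


def with_evidence(records, allowed_codes):
    """Keep only the annotations whose evidence code is in `allowed_codes`."""
    return [r for r in records if r.evidence in allowed_codes]


experimental = with_evidence(annotations, EXPERIMENTAL_CODES)
for record in experimental:
    print(record.gene, record.go_term, record.evidence)
```

Filtering by evidence code like this is a simple way to restrict an analysis to experimentally supported annotations when computational predictions are considered too uncertain.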

Enrichment analysis

  • Gene set enrichment analysis (GSEA) can be performed using GO annotations to identify overrepresented or underrepresented functional categories within a set of genes of interest
  • Enrichment analysis compares the frequency of GO terms in the gene set to their frequency in a background set (entire genome)
  • Helps identify biological processes, molecular functions, or cellular components that are significantly associated with the gene set
  • Can provide insights into the functional themes or pathways involved in a particular biological condition or experimental treatment
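
The comparison of term frequency in a gene set against a background is often tested with a one-sided hypergeometric test (equivalent to Fisher's exact test). The sketch below is one common formulation, not necessarily what any given tool implements. The function name, parameter names, and example counts are made up for illustration.

```python
from math import comb


def enrichment_p_value(hits_in_set, set_size, hits_in_background, background_size):
    """Probability of seeing at least `hits_in_set` annotated genes by chance.

    Hypergeometric upper tail: draw `set_size` genes without replacement from
    `background_size` genes, of which `hits_in_background` carry the GO term.
    """
    upper = min(set_size, hits_in_background)
    total = comb(background_size, set_size)
    favourable = sum(
        comb(hits_in_background, i)
        * comb(background_size - hits_in_background, set_size - i)
        for i in range(hits_in_set, upper + 1)
    )
    return favourable / total


# Made-up counts: 20 background genes, 5 carry the term,
# and 3 of the 4 genes in the set of interest carry it.
p = enrichment_p_value(hits_in_set=3, set_size=4,
                       hits_in_background=5, background_size=20)
print(f"p = {p:.4f}")
```

A small p-value means the term appears in the gene set more often than random sampling from the background would explain. The choice of background matters: using the whole genome when only a subset of genes could have been detected tends to overstate enrichment.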

Functional enrichment analysis interpretation

Statistical significance and biological relevance

  • Functional enrichment analysis using GO annotations helps identify statistically overrepresented or underrepresented GO terms within a gene set compared to a background set
  • Enrichment analysis tools (DAVID, PANTHER, topGO) calculate p-values or false discovery rates (FDR) to assess the significance of the enrichment
  • Overrepresented GO terms suggest that the gene set is enriched for specific functions or processes
    • Potentially indicates the biological mechanisms or pathways involved in the studied condition (disease, treatment response)
  • Underrepresented GO terms suggest that the gene set is depleted for specific functions or processes
    • Potentially indicates the biological mechanisms or pathways that are suppressed or not involved in the studied condition
  • Interpreting the results requires considering the biological context, the statistical significance of the enriched terms, and the potential biases or limitations of the annotation and analysis methods
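
Because many GO terms are tested at once, raw p-values are usually adjusted before interpretation. The sketch below applies the Benjamini-Hochberg procedure, one widely used way to control the FDR. It is offered as an illustration only, since the tools named above may use other corrections. The function name and input p-values are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.20]  # made-up p-values for four GO terms
adjusted = benjamini_hochberg(raw)
for p_raw, p_adj in zip(raw, adjusted):
    print(f"raw={p_raw:.2f}  adjusted={p_adj:.4f}")
```

Terms whose adjusted value falls below the chosen FDR threshold are reported as significant. Statistical significance alone does not establish biological relevance, so the adjusted list still needs to be read in context.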

Visualization and exploration

  • Visualization tools can help in exploring and interpreting the relationships between the enriched GO terms and their associated genes
  • GO term networks display the hierarchical relationships and connections between the enriched terms
    • Allows identification of broader functional themes and specific subprocesses
  • Treemaps or bar charts can be used to visualize the relative significance and overlap of the enriched terms
  • Interactive tools (AmiGO, QuickGO) enable users to navigate the GO hierarchy, view term definitions, and explore the evidence supporting the annotations
  • Integration with other biological databases (KEGG, Reactome) can provide additional context and insights into the biological pathways and processes associated with the enriched terms
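
A bar chart of enriched terms typically plots the negative log10 of each p-value, so that more significant terms get longer bars. The text-only sketch below shows the idea without a plotting library. The function name, the term-to-p-value mapping, and the bar width are hypothetical, and p-values are assumed to be greater than zero.

```python
from math import log10


def text_bar_chart(term_p_values, width=40):
    """Return one text line per term, most significant first.

    Bar length is proportional to -log10(p); assumes every p-value is > 0.
    """
    scores = {term: -log10(p) for term, p in term_p_values.items()}
    longest = max(scores.values()) or 1.0  # avoid dividing by zero if all p == 1
    lines = []
    for term, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        bar = "#" * max(1, round(width * score / longest))
        lines.append(f"{term:<20} {bar} {score:.1f}")
    return lines


# Made-up enrichment results.
chart = text_bar_chart({"DNA binding": 1e-3, "cell cycle": 1e-6, "ribosome": 0.1})
print("\n".join(chart))
```

Sorting by significance puts the strongest functional themes at the top, which is usually where interpretation starts.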

Key Terms to Review (23)

AmiGO: AmiGO is a web-based tool from the Gene Ontology Consortium for searching and browsing GO terms and the annotations that link them to gene products. Researchers use it to look up term definitions, navigate the ontology's hierarchy, and see which genes are annotated to a term and on what evidence.
Biological process: A biological process refers to a series of events or actions that occur within living organisms to maintain life, facilitate growth, reproduction, and respond to environmental changes. These processes encompass everything from metabolic pathways and gene expression to cellular signaling and development. Understanding these processes is essential for functional annotation and gene ontology, as they help categorize the roles of genes and proteins in various biological contexts.
Biomarker discovery: Biomarker discovery refers to the process of identifying biological markers that indicate a specific biological condition, disease, or physiological state. This process is crucial for advancing personalized medicine, enhancing diagnostic accuracy, and developing targeted therapies. Biomarker discovery often integrates various data types, such as genetic, transcriptomic, proteomic, and metabolomic data, to provide a comprehensive understanding of disease mechanisms and treatment responses.
Cellular component: A cellular component refers to the various structures and organelles within a cell that perform distinct functions necessary for the cell's survival and activity. These components include everything from the nucleus and mitochondria to smaller structures like ribosomes and lysosomes, each playing a specific role in maintaining cellular homeostasis and facilitating processes like metabolism, growth, and reproduction.
Data normalization: Data normalization is the process of adjusting and scaling data to bring it into a common format or range, allowing for more accurate comparisons and analyses. This technique is crucial in functional annotation and gene ontology, as it ensures that differences in gene expression levels are not due to technical biases or variations in measurement methods, but rather reflect true biological variations. By normalizing data, researchers can more effectively interpret gene functions and biological pathways across different studies and conditions.
DAVID: DAVID stands for Database for Annotation, Visualization, and Integrated Discovery. It's a comprehensive bioinformatics resource that provides tools and data for functional annotation of genes, particularly in the context of gene ontology and biological pathways. By integrating various genomic databases, DAVID enables researchers to identify biological meanings from large lists of genes and proteins, facilitating the interpretation of complex genomic data.
Ensembl: Ensembl is a comprehensive genomic database that provides access to genome sequences, gene annotations, and comparative genomics data for a wide range of species. It plays a crucial role in various genomic analyses, including whole genome alignments and synteny analysis, by offering tools and resources that facilitate functional annotation, gene prediction, and understanding genome structure and organization.
Functional Annotation: Functional annotation refers to the process of identifying and assigning biological functions to genes and their products based on sequence data. This includes determining the roles of proteins, RNA molecules, and other gene products in cellular processes, which is essential for understanding gene function and its implications in various biological contexts.
Functional characterization: Functional characterization is the process of determining the biological function of genes or proteins through various experimental methods and bioinformatics tools. This includes understanding how genes contribute to cellular processes, their interactions with other molecules, and their roles in various biological pathways. It connects to functional annotation and gene ontology by helping to assign specific roles to genes based on their sequences and observed characteristics.
Functional enrichment analysis: Functional enrichment analysis is a method used to identify biological functions, pathways, or processes that are statistically overrepresented in a given set of genes or proteins. This analysis helps researchers understand the biological significance of gene lists derived from experiments by linking them to known functions, often using databases like Gene Ontology. By revealing functional categories that are enriched, this approach aids in hypothesis generation and the interpretation of genomic data.
GenBank: GenBank is a comprehensive public database that houses nucleotide sequences and their associated annotations, facilitating access to genetic information for researchers worldwide. It plays a crucial role in bioinformatics by allowing scientists to perform sequence alignment, homology searches, and functional annotation, thus aiding in the understanding of genome structure and organization. GenBank's extensive data resources are invaluable for microbial genome assembly and annotation efforts as well.
Gene function prediction: Gene function prediction is the process of inferring the biological roles of genes based on various types of data, such as sequence homology, expression profiles, and experimental results. This process is essential for understanding how genes contribute to cellular processes and organismal functions, as well as for exploring the relationships between genes in pathways and networks.
Gene Ontology: Gene ontology (GO) is a framework for the representation of gene and gene product attributes across all species, providing a controlled vocabulary to describe the roles of genes in biological processes, molecular functions, and cellular components. This structured approach allows for standardized functional annotation of genes, facilitating the comparison of genetic information across different organisms. By utilizing gene ontology, researchers can gain insights into gene functions, interactions, and their involvement in various biological processes.
Gene Set Enrichment Analysis: Gene Set Enrichment Analysis (GSEA) is a statistical method used to determine whether a predefined set of genes shows statistically significant differences in expression levels between two biological conditions. This technique helps researchers understand the biological processes or pathways that are active or altered in different states, such as disease versus normal conditions, by analyzing groups of genes with similar functions or characteristics.
Genes: Genes are segments of DNA that contain the instructions for building proteins and functional RNA molecules, which play crucial roles in the structure and function of living organisms. They are the fundamental units of heredity and influence various traits and biological processes in an organism. Understanding genes helps in exploring functional annotation and gene ontology, as these concepts categorize genes based on their functions and relationships in biological systems.
KEGG: KEGG, which stands for Kyoto Encyclopedia of Genes and Genomes, is a comprehensive database resource that integrates genomic, chemical, and systemic functional information. It serves as a vital tool for researchers in understanding biological systems through the analysis of pathways and networks, linking genes to functions and cellular processes.
Metabolites: Metabolites are small molecules produced during metabolic processes, serving as intermediates and products of biochemical reactions within living organisms. They play crucial roles in various biological functions, including energy production, signaling, and cellular structure maintenance. Understanding metabolites is essential for functional annotation and gene ontology, as they provide insights into the physiological state of an organism and the pathways that are active under specific conditions.
Molecular Function: Molecular function refers to the specific biochemical activity of a gene product, such as a protein or RNA, at the molecular level. It encompasses the tasks or roles that these molecules perform in cellular processes, helping to understand how genes contribute to biological systems. This term connects deeply with the processes of functional annotation and gene ontology, where molecular functions are categorized and linked to specific biological roles and pathways.
Multi-omics integration: Multi-omics integration refers to the combined analysis of different omics data types, such as genomics, transcriptomics, proteomics, and metabolomics, to provide a more comprehensive understanding of biological systems. By integrating these various layers of biological information, researchers can gain insights into the complex interactions within cells and organisms that drive health and disease.
Over-representation analysis: Over-representation analysis is a statistical method used to determine whether certain gene sets or biological categories are enriched in a given list of genes, such as those differentially expressed in a specific condition. This approach helps researchers understand the biological significance of gene expression changes by assessing if certain functional annotations, often derived from gene ontology, are over-represented compared to what would be expected by chance.
Pathway Analysis: Pathway analysis is a computational method used to identify and interpret biological pathways that are associated with a set of genes or gene products. This process helps researchers understand the underlying mechanisms of biological functions and diseases by linking gene expression data to known biological pathways, thereby providing insights into the cellular processes that may be altered in different conditions.
Proteins: Proteins are large, complex molecules made up of long chains of amino acids, essential for the structure, function, and regulation of the body's cells, tissues, and organs. They play a critical role in numerous biological processes including enzymatic reactions, immune responses, and signal transduction. Understanding proteins is crucial for analyzing gene functions and the interactions within biological systems.
UCSC Genome Browser: The UCSC Genome Browser is a powerful web-based tool that provides a comprehensive and interactive platform for visualizing and analyzing genomic data across various species. It integrates vast amounts of genomic information, allowing researchers to explore whole genome alignments, conduct synteny analysis, visualize gene annotations, and access a wide array of genomic databases and resources, making it essential for modern genomic research.