3.4 Gene Ontology (GO) and KEGG databases

10 min read • August 20, 2024

Gene Ontology (GO) and KEGG databases are essential tools for understanding gene functions and biological systems. GO provides a standardized vocabulary for describing gene roles, while KEGG offers comprehensive pathway information and functional annotations.

These databases enable researchers to analyze large-scale genomic data, perform functional annotations, and explore biological pathways. They're crucial for interpreting experimental results, discovering new insights, and advancing our understanding of complex biological processes across species.

Gene Ontology (GO) database

The Gene Ontology (GO) database is a comprehensive resource that provides a standardized vocabulary for describing the functions of genes and gene products across all species
GO is crucial for understanding the roles and relationships of genes, enabling researchers to analyze and interpret large-scale genomic and proteomic data in the context of biological processes, molecular functions, and cellular components

Purpose of GO

Provides a structured, consistent, and machine-readable language for describing gene functions
Facilitates the integration and comparison of biological information from different sources
Enables the annotation of genes and gene products with functional terms, allowing for the interpretation of high-throughput experimental results
Supports computational analysis and data mining of genomic and proteomic data

GO terminology

GO uses a controlled vocabulary consisting of GO terms, which are precise, unambiguous descriptions of biological functions
Each GO term has a unique identifier, a name, and a definition that provides a clear and concise description of the term
GO terms are organized into three main categories: biological process, molecular function, and cellular component
The relationships between GO terms are represented using a directed acyclic graph (DAG) structure, which allows for the capture of hierarchical and semantic relationships
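
The DAG structure described above can be sketched in a few lines of Python. This is a minimal illustration, not the real ontology: the term names and edges in `PARENTS` are a simplified toy graph chosen for this example. It shows why a DAG differs from a tree: one term can have more than one parent, so collecting all ancestors requires a graph traversal.

```python
from collections import deque

# Toy child -> parents map. The edges are illustrative assumptions,
# not a copy of the real GO graph.
PARENTS = {
    "biological_process": [],
    "cellular process": ["biological_process"],
    "cell cycle": ["cellular process"],
    "cell division": ["cellular process"],
    "mitotic cell cycle": ["cell cycle"],
    # Two parents: this is what makes the structure a DAG rather than a tree.
    "mitotic cytokinesis": ["mitotic cell cycle", "cell division"],
}

def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links upward."""
    seen = set()
    queue = deque(parents[term])
    while queue:
        current = queue.popleft()
        if current not in seen:
            seen.add(current)
            queue.extend(parents[current])
    return seen

print(sorted(ancestors("mitotic cytokinesis")))
```

A gene annotated to a specific term is implicitly associated with all of that term's ancestors, which is why tools usually propagate annotations upward before counting genes per term.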

Biological process ontology

The biological process ontology describes the larger processes or 'programs' accomplished by multiple molecular activities
Examples of biological process terms include "cell cycle", "signal transduction", and "metabolic process"
Biological process terms are arranged in a hierarchical manner, with more specific terms being child terms of more general parent terms
Biological process annotations provide insight into the biological pathways and processes in which genes and gene products participate

Molecular function ontology

The molecular function ontology describes the elemental activities of a gene product at the molecular level, such as catalytic activity or binding activity
Examples of molecular function terms include "kinase activity", "DNA binding", and "transporter activity"
Molecular function terms are independent of the biological context in which they occur, allowing for the annotation of gene products across different species and biological systems
Molecular function annotations provide information about the biochemical activities of gene products

Cellular component ontology

The cellular component ontology describes the subcellular structures, locations, and macromolecular complexes where gene products are active
Examples of cellular component terms include "nucleus", "mitochondrion", and "ribosome"
Cellular component terms are organized in a hierarchical manner, reflecting the containment relationships between cellular structures
Cellular component annotations provide information about the spatial distribution and localization of gene products within the cell

GO annotations

GO annotations are statements that associate a specific GO term with a gene or gene product
Annotations are made based on various types of evidence, including experimental results, computational analysis, and literature curation
Each annotation includes information about the gene or gene product being annotated, the GO term assigned, the evidence supporting the annotation, and the source of the annotation
GO annotations are continuously updated and expanded based on new experimental findings and computational predictions
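
As a rough sketch of what one annotation record contains, the snippet below parses a single line in the tab-delimited GO Annotation File (GAF 2.x) layout, which has 17 columns. The `Annotation` class and `parse_gaf_line` function are names invented for this example, and the row in `EXAMPLE` is made up (hypothetical database, gene, and reference); only the column positions follow the GAF layout.

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    db: str           # column 1: source database of the gene product
    object_id: str    # column 2: gene product identifier
    symbol: str       # column 3: gene product symbol
    go_id: str        # column 5: assigned GO term
    reference: str    # column 6: supporting reference
    evidence: str     # column 7: evidence code
    aspect: str       # column 9: P, F, or C
    assigned_by: str  # column 15: group that made the annotation

def parse_gaf_line(line):
    """Parse one GAF 2.x data line into an Annotation."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 15:
        raise ValueError(f"expected at least 15 columns, got {len(cols)}")
    return Annotation(
        db=cols[0], object_id=cols[1], symbol=cols[2],
        go_id=cols[4], reference=cols[5], evidence=cols[6],
        aspect=cols[8], assigned_by=cols[14],
    )

# Made-up example row: hypothetical gene, placeholder reference.
EXAMPLE = "\t".join([
    "ExampleDB", "X00001", "GENE1", "", "GO:0003677", "PMID:0000000",
    "IDA", "", "F", "example protein", "", "protein", "taxon:9606",
    "20240101", "ExampleDB", "", "",
])

record = parse_gaf_line(EXAMPLE)
print(record.symbol, record.go_id, record.evidence)
```

Real GAF files also start with comment lines beginning with `!`, which a full reader would skip before calling the parser.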

Evidence codes in GO

Evidence codes are used to indicate the type of evidence supporting a GO annotation
Examples of evidence codes include "Inferred from Direct Assay (IDA)", "Inferred from Sequence Orthology (ISO)", and "Inferred from Electronic Annotation (IEA)"
Evidence codes provide transparency and allow users to assess the reliability and origin of the annotations
Different evidence codes carry different weights, with experimental evidence generally considered more reliable than computational predictions
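
One common use of evidence codes is to filter annotations before an analysis. The sketch below groups a subset of GO evidence codes into broad categories and keeps only the annotations in the categories a user trusts. The code lists are deliberately incomplete, the category names are this example's own labels, and the annotation tuples are hypothetical.

```python
# Partial lists of GO evidence codes, grouped for illustration only.
EXPERIMENTAL = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}
COMPUTATIONAL = {"ISS", "ISO", "ISA", "ISM", "IGC", "IBA", "RCA"}
ELECTRONIC = {"IEA"}

def evidence_category(code):
    """Map an evidence code to a broad category label."""
    if code in EXPERIMENTAL:
        return "experimental"
    if code in COMPUTATIONAL:
        return "computational"
    if code in ELECTRONIC:
        return "electronic"
    return "other"

def filter_by_category(annotations, allowed):
    """Keep (gene, go_id, code) tuples whose code falls in an allowed category."""
    return [a for a in annotations if evidence_category(a[2]) in allowed]

# Hypothetical annotations: (gene, GO term, evidence code).
annotations = [
    ("GENE1", "GO:0003677", "IDA"),
    ("GENE2", "GO:0003677", "IEA"),
    ("GENE3", "GO:0003677", "ISO"),
]

print(filter_by_category(annotations, {"experimental"}))
```

Restricting to experimental codes raises confidence but shrinks coverage, especially for less-studied organisms where most annotations are electronically inferred.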

Advantages of GO

Provides a standardized and consistent vocabulary for describing gene functions across different species and biological domains
Enables the integration and comparison of biological data from various sources, facilitating data mining and meta-analysis
Supports the functional interpretation of large-scale genomic and proteomic data, aiding in the discovery of biological insights
Allows for the development of computational tools and algorithms that leverage the structured nature of the GO database

Limitations of GO

The completeness and accuracy of GO annotations depend on the availability and quality of the underlying experimental evidence
GO annotations may be biased towards well-studied genes and biological processes, leading to an uneven representation of functional information
The hierarchical structure of the GO ontology may not always capture the complex and dynamic nature of biological systems
GO annotations do not provide information about the specific conditions, cell types, or developmental stages in which gene functions occur

KEGG database

The Kyoto Encyclopedia of Genes and Genomes (KEGG) is a comprehensive database resource for understanding high-level functions and utilities of biological systems
KEGG integrates information from various sources, including genomic, chemical, and systemic functional information, to provide a knowledge base for understanding the complex biological processes and their interactions

Purpose of KEGG

Provides a reference knowledge base for linking genomes to biological systems and functions
Facilitates the understanding of high-level functions and utilities of organisms and ecosystems
Enables the integration and interpretation of large-scale molecular datasets in the context of biological pathways and processes
Supports the discovery of novel insights into the relationships between genes, proteins, and metabolites in various biological contexts

KEGG pathway maps

KEGG pathway maps are graphical representations of molecular interaction and reaction networks for various biological processes
Pathway maps cover a wide range of biological processes, including metabolism, genetic information processing, environmental information processing, cellular processes, organismal systems, and human diseases
Each pathway map is manually curated and represents current knowledge about the molecular interactions and reactions involved in a specific biological process
Pathway maps are interactive and provide links to additional information about the genes, proteins, and compounds involved in the pathway

KEGG orthology

KEGG orthology (KO) is a database of orthologous genes and proteins across different species
KO groups are defined based on the conservation of molecular functions and biological roles, allowing for the functional annotation of genes and proteins in newly sequenced genomes
KO assignments facilitate the comparison of biological processes and pathways across different species and enable the identification of conserved and divergent functional modules
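
Once genes in two genomes have KO assignments, a cross-species comparison reduces to set operations on KO identifiers. The following sketch assumes each genome has already been summarized as a set of K numbers; the identifiers and the function name `compare_ko_sets` are placeholders for illustration, not real assignments for any organism.

```python
def compare_ko_sets(ko_a, ko_b):
    """Split two KO sets into shared and genome-specific groups."""
    return {
        "shared": ko_a & ko_b,
        "only_a": ko_a - ko_b,
        "only_b": ko_b - ko_a,
    }

# Placeholder K numbers for two hypothetical genomes.
genome_a = {"K00001", "K00002", "K00003"}
genome_b = {"K00002", "K00003", "K00004"}

result = compare_ko_sets(genome_a, genome_b)
print(sorted(result["shared"]))
```

The shared set points to conserved functions, while the genome-specific sets are candidates for divergent or lineage-specific functions.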

KEGG modules

KEGG modules are higher-level functional units that represent conserved and recurrent molecular subnetworks within pathway maps
Modules are defined based on the analysis of pathway maps and the identification of functionally related gene sets
Each module is associated with a specific biological function or process, such as a specific metabolic pathway or a signaling cascade
KEGG modules provide a more condensed and focused view of the biological processes and can be used for comparative analysis and functional characterization of gene sets
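
A simple way to use modules for functional characterization is to ask how much of a module a genome covers. The sketch below makes a simplifying assumption: a module is represented as an ordered list of steps, and each step is a set of alternative KO identifiers, any one of which satisfies the step. KEGG's actual module definitions use a richer notation, and the module and K numbers here are hypothetical.

```python
def module_completeness(steps, present_kos):
    """Fraction of module steps for which at least one alternative KO is present."""
    if not steps:
        raise ValueError("module has no steps")
    satisfied = sum(1 for alternatives in steps if alternatives & present_kos)
    return satisfied / len(steps)

# Hypothetical three-step module; step 2 has two alternative KOs.
toy_module = [
    {"K00001"},
    {"K00002", "K00003"},
    {"K00004"},
]

# KOs assumed to be found in a hypothetical genome.
genome_kos = {"K00001", "K00003"}

print(module_completeness(toy_module, genome_kos))
```

A value of 1.0 would suggest the genome encodes every step of the module, while lower values flag missing steps worth checking by hand.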

KEGG disease database

The KEGG disease database is a collection of molecular networks and pathways associated with human diseases
Disease entries are classified based on the International Classification of Diseases (ICD) and are linked to the corresponding KEGG pathway maps and drug information
The disease database integrates information from various sources, including genetic and environmental factors, molecular mechanisms, and therapeutic interventions
The disease database facilitates the understanding of the molecular basis of human diseases and supports the identification of potential drug targets and biomarkers

KEGG drug database

The KEGG drug database is a comprehensive resource for information about approved drugs, drug targets, and drug-target interactions
Drug entries are linked to their corresponding target proteins, metabolic pathways, and disease information
The drug database integrates information from various sources, including chemical structures, pharmacological actions, and therapeutic indications
The drug database supports the discovery of new drug targets and the repurposing of existing drugs for novel therapeutic applications

Advantages of KEGG

Provides a comprehensive and integrated view of biological systems, linking genomic information to higher-level functions and utilities
Offers manually curated and high-quality pathway maps that represent current knowledge about molecular interactions and reactions
Enables the functional annotation and comparison of genes and proteins across different species through the KEGG orthology system
Supports the analysis and interpretation of large-scale molecular datasets in the context of biological pathways and processes

Limitations of KEGG

The coverage and completeness of KEGG pathway maps and modules may vary depending on the biological domain and the availability of experimental evidence
KEGG primarily focuses on conserved and core biological processes, and may not capture all the species-specific or condition-specific variations in molecular networks
The manual curation process of KEGG data may result in a slower update frequency compared to automatically generated databases
KEGG's website is freely available for academic use, but bulk data downloads and commercial use require a paid subscription or license

Comparing GO vs KEGG

Both GO and KEGG are valuable resources for understanding the functions and roles of genes and proteins in biological systems, but they have different focuses and approaches

Scope of databases

GO provides a standardized vocabulary for describing gene functions and covers three main categories: biological processes, molecular functions, and cellular components
KEGG focuses on integrating genomic, chemical, and systemic functional information to provide a knowledge base for understanding high-level functions and utilities of biological systems

Ontology vs pathway focus

GO is an ontology-based database that organizes gene functions into a hierarchical structure, capturing the relationships between different functional terms
KEGG is primarily focused on pathway-based representation of molecular interactions and reactions, providing a more dynamic and interconnected view of biological processes

Manual vs automated curation

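GO annotations combine manual curation of experimental evidence and literature with automated electronic annotation (IEA); manually curated annotations are generally considered more reliable, but electronically inferred annotations make up the majority by volume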
KEGG data is generated through a combination of manual curation and automated methods, such as computational prediction and integration of information from various sources

Species coverage

GO aims to provide functional annotations for genes and gene products across all species, enabling cross-species comparisons and functional inference
KEGG covers a wide range of species, including prokaryotes, eukaryotes, and viruses, but the depth and completeness of coverage may vary depending on the organism and the availability of data

Update frequency

GO is continuously updated and expanded based on new experimental findings and community contributions, with regular releases of the database
KEGG updates its data periodically, but the update frequency may be slower compared to automatically generated databases due to the manual curation process

Applications of GO and KEGG

GO and KEGG are widely used resources in various areas of biological research and have numerous applications in genomics, proteomics, and systems biology

Functional annotation

GO and KEGG are extensively used for the functional annotation of genes and proteins in newly sequenced genomes
By assigning GO terms or KEGG pathway associations to genes and proteins, researchers can gain insights into their potential functions and roles in biological processes

Enrichment analysis

GO and KEGG are commonly used in enrichment analysis, where the goal is to identify overrepresented or underrepresented functional categories or pathways in a set of genes or proteins of interest
Enrichment analysis helps in understanding the biological themes and processes associated with a specific experimental condition or disease state
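
Over-representation is commonly tested with a one-sided hypergeometric test (equivalent to a one-sided Fisher's exact test). The sketch below is one way to compute it using only the standard library; the parameter names are chosen for this example. `N` is the number of background genes, `K` the number of background genes annotated to the term or pathway, `n` the size of the gene list of interest, and `k` the number of genes in that list carrying the annotation.

```python
from math import comb

def enrichment_pvalue(N, K, n, k):
    """P(X >= k) when drawing n genes without replacement from N, K of them annotated."""
    total = comb(N, n)
    upper = min(K, n)
    # comb() returns 0 when the lower argument exceeds the upper one,
    # so impossible combinations contribute nothing to the tail.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total

# Toy numbers: 10 background genes, 3 annotated, list of 3, all 3 annotated.
print(enrichment_pvalue(10, 3, 3, 3))
```

In practice this test is run for many terms or pathways at once, so the resulting p-values need a multiple-testing correction before they are interpreted.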

Network analysis

KEGG pathway maps and modules provide a framework for network analysis, allowing researchers to explore the interactions and relationships between genes, proteins, and metabolites in the context of biological pathways
Network analysis using KEGG data can reveal key regulatory nodes, functional modules, and potential drug targets in biological systems
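
As a minimal illustration of finding highly connected nodes, the sketch below counts the degree of each node in an undirected interaction list and reports the top-ranked ones. The edge list is a toy example with hypothetical gene names; in a real analysis the edges would be extracted from pathway data, and degree is only one of several centrality measures.

```python
from collections import defaultdict

def degree_by_node(edges):
    """Count how many interactions each node takes part in."""
    degree = defaultdict(int)
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return dict(degree)

def top_hubs(edges, n=1):
    """Return the n highest-degree nodes, breaking ties alphabetically."""
    degree = degree_by_node(edges)
    return sorted(degree, key=lambda node: (-degree[node], node))[:n]

# Hypothetical interactions between four genes.
toy_edges = [
    ("geneA", "geneB"),
    ("geneA", "geneC"),
    ("geneA", "geneD"),
    ("geneB", "geneC"),
]

print(top_hubs(toy_edges, n=2))
```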

Comparative genomics

GO and KEGG facilitate comparative genomics studies by enabling the comparison of functional annotations and pathway associations across different species
Comparative analysis using GO and KEGG can help identify conserved and divergent functional modules, evolutionary relationships, and species-specific adaptations

Biomarker discovery

GO and KEGG can be used in the discovery of potential biomarkers for disease diagnosis, prognosis, and treatment response
By integrating functional annotations and pathway information with experimental data, researchers can identify genes or proteins that are differentially expressed or associated with specific disease states

Drug target identification

KEGG's drug database and pathway maps provide valuable information for the identification of potential drug targets and the development of new therapeutic strategies
By mapping drug-target interactions and analyzing the effects of drugs on biological pathways, researchers can identify novel drug targets and predict potential side effects or drug repurposing opportunities

Accessing GO and KEGG data

GO and KEGG provide various ways to access and retrieve their data, catering to the needs of different users and applications

Web interfaces

Both GO and KEGG offer user-friendly web interfaces that allow users to browse, search, and visualize their data
The web interfaces provide interactive tools for exploring gene annotations, pathway maps, and associated information
Users can access the web interfaces through their respective websites: GO (http://geneontology.org/) and KEGG (https://www.kegg.jp/)

Programmatic access

GO and KEGG provide programmatic access to their data through various APIs and web services
Programmatic access allows users to retrieve data in a structured format and integrate it into their own analysis pipelines or software tools
Examples of programmatic access methods include REST APIs and libraries in different programming languages (e.g., Python, R, Java); KEGG's earlier SOAP-based API has been discontinued in favor of its REST API
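
The sketch below shows one way to work with KEGG's REST interface from Python using only the standard library. It assumes the URL pattern `https://rest.kegg.jp/<operation>/<arguments>` and a tab-delimited text response; the helper names (`kegg_url`, `parse_tab_listing`, `fetch`) are invented for this example, and `SAMPLE` is a hand-written line standing in for a real response, so check the current API documentation before relying on the exact format.

```python
from urllib.request import urlopen

KEGG_REST = "https://rest.kegg.jp"

def kegg_url(operation, *args):
    """Build a REST URL such as https://rest.kegg.jp/list/pathway/hsa."""
    return "/".join([KEGG_REST, operation, *args])

def parse_tab_listing(text):
    """Turn 'identifier<TAB>description' lines into a dictionary."""
    entries = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition("\t")
        entries[key] = value
    return entries

def fetch(url, timeout=30):
    """Download a URL as text (requires network access; not called below)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")

# Hand-written stand-in for a response, assumed format.
SAMPLE = "hsa00010\tGlycolysis / Gluconeogenesis - Homo sapiens (human)\n"

url = kegg_url("list", "pathway", "hsa")
pathways = parse_tab_listing(SAMPLE)
print(url)
print(pathways)
```

With network access, `parse_tab_listing(fetch(url))` would replace the sample string. Keeping URL building and parsing separate from the download makes the logic easy to test offline.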

File formats

GO and KEGG data can be downloaded in various file formats, depending on the specific database and data type
Common file formats include:
- GO: OBO (Open Biomedical Ontologies) format, OWL (Web Ontology Language) format, and tab-delimited files
- KEGG: KGML (KEGG Markup Language) format, BioPAX (Biological Pathway Exchange) format, and tab-delimited files
The choice of file format depends on the intended use and the compatibility with downstream analysis tools
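
To give a feel for the OBO format, the sketch below reads `[Term]` stanzas from a short OBO-style string and collects each term's name, namespace, and `is_a` parents. It handles only the few tags shown and is not a full OBO parser; `parse_obo_terms` is a name made up for this example, and `SAMPLE_OBO` is an abbreviated two-term excerpt written by hand.

```python
def parse_obo_terms(text):
    """Collect id, name, namespace, and is_a parents from [Term] stanzas."""
    terms = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are collected.
            current = {"is_a": []} if line == "[Term]" else None
        elif current is not None and ": " in line:
            tag, value = line.split(": ", 1)
            value = value.split(" ! ")[0]  # drop trailing "! comment" text
            if tag == "id":
                current["id"] = value
                terms[value] = current
            elif tag == "is_a":
                current["is_a"].append(value)
            elif tag in ("name", "namespace"):
                current[tag] = value
    return terms

# Abbreviated, hand-written excerpt in OBO style.
SAMPLE_OBO = """format-version: 1.2

[Term]
id: GO:0008150
name: biological_process
namespace: biological_process

[Term]
id: GO:0009987
name: cellular process
namespace: biological_process
is_a: GO:0008150 ! biological_process
"""

terms = parse_obo_terms(SAMPLE_OBO)
print(terms["GO:0009987"]["name"], terms["GO:0009987"]["is_a"])
```

For real analyses an established ontology library is usually a safer choice, since the full format includes obsolete terms, alternative identifiers, and relationship types beyond `is_a`.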

Integration with other tools

GO and KEGG data can be integrated with various bioinformatics tools and platforms to enhance their functionality and analysis capabilities
Many popular bioinformatics tools, such as Cytoscape, PathVisio, and DAVID, provide built-in support for GO and KEGG data integration
Integration with these tools allows users to perform advanced analyses, such as network visualization, pathway mapping, and functional enrichment analysis, using GO and KEGG data as a foundation

Key Terms to Review (19)

Apoptosis pathway: The apoptosis pathway is a series of biochemical events leading to programmed cell death, essential for maintaining cellular homeostasis and development. This pathway plays a crucial role in eliminating damaged or unnecessary cells, preventing diseases like cancer, and is tightly regulated by various proteins and signaling molecules. Understanding this pathway is fundamental for studying cellular processes and disease mechanisms.

Biological process: A biological process refers to a series of events or actions that occur within living organisms, leading to specific outcomes essential for life. These processes encompass a wide range of functions including metabolism, cell signaling, and gene expression, and are crucial for maintaining the health and functionality of cells and organisms. Understanding these processes allows researchers to identify how genes and proteins contribute to the overall functioning of biological systems.

Blast2GO: Blast2GO is a bioinformatics tool that combines sequence alignment and functional annotation to help researchers understand the biological roles of genes and proteins. By integrating BLAST searches with Gene Ontology (GO) terms, Blast2GO enables users to annotate gene sequences with relevant functional information, making it easier to interpret genomic data in the context of biological pathways.

Cell Differentiation: Cell differentiation is the process by which a less specialized cell becomes a more specialized cell type, acquiring distinct structures and functions. This process is crucial in the development of multicellular organisms, allowing for the formation of various tissues and organs, each with specific roles. It is influenced by genetic regulation and signaling pathways, which can be studied through various biological databases that categorize gene functions and metabolic pathways.

Cellular component: A cellular component refers to the specific parts or structures within a cell that carry out distinct functions necessary for the cell's survival and operation. These components include organelles, membranes, and other structures that contribute to the cell's organization and functionality. Understanding cellular components is crucial for interpreting how genes and pathways interact in biological processes, particularly when using databases like Gene Ontology (GO) and KEGG.

DAVID: DAVID (Database for Annotation, Visualization, and Integrated Discovery) is a comprehensive online resource used primarily in bioinformatics for gene functional annotation, pathway analysis, and data visualization. It provides tools and databases that facilitate the interpretation of genomic and proteomic data, helping researchers to understand biological functions and relationships. The platform integrates various datasets, including Gene Ontology (GO) terms and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways, which are essential for analyzing complex biological information.

Enzyme activity: Enzyme activity refers to the rate at which an enzyme catalyzes a biochemical reaction, indicating how efficiently it converts substrates into products. This concept is crucial for understanding metabolic pathways and biological functions, as variations in enzyme activity can significantly impact cellular processes and overall organism health.

Functional annotation: Functional annotation refers to the process of identifying the biological function of genes, proteins, and other genomic elements. This process is crucial for understanding how different components of an organism's genome contribute to its phenotype and biological processes, linking sequence data with functional insights across various research areas.

Gene enrichment analysis: Gene enrichment analysis is a computational method used to identify biological pathways or gene sets that are significantly overrepresented in a given list of genes, often derived from high-throughput experiments. This analysis helps researchers understand the biological context of their data by linking gene expression or genetic variation to specific functions, processes, or pathways, thereby highlighting areas of interest for further investigation.

Gene interaction network: A gene interaction network is a complex representation of the relationships and interactions between genes and their products within a biological system. These networks illustrate how genes communicate with each other, influencing various biological processes such as development, metabolism, and responses to environmental stimuli. By mapping these interactions, researchers can better understand the underlying mechanisms of cellular functions and disease pathways.

Gene Ontology: Gene Ontology (GO) is a framework for the standardized representation of gene and gene product attributes across all species. It provides a controlled vocabulary to describe the roles of genes and their products in biological processes, cellular components, and molecular functions. This system enables researchers to annotate genes and proteins consistently, facilitating data sharing and comparison across different studies, which is crucial for functional annotation, pathway analysis, and understanding gene expression through various techniques like RNA-seq and gene co-expression networks.

Glycolysis pathway: The glycolysis pathway is a series of biochemical reactions that convert glucose into pyruvate, producing energy in the form of ATP and NADH. This pathway is crucial for cellular metabolism, as it serves as the primary means of energy production in both aerobic and anaerobic conditions, linking carbohydrate metabolism to cellular respiration and fermentation processes.

Hierarchical Structure: A hierarchical structure is a way of organizing information where elements are ranked according to levels of importance or inclusiveness. In the context of biological databases, this structure helps categorize and represent complex relationships among genes, proteins, and biological processes in a clear and systematic manner, allowing users to easily navigate and retrieve relevant information.

KEGG: KEGG, which stands for Kyoto Encyclopedia of Genes and Genomes, is a comprehensive database that provides information on biological systems, including genomic, chemical, and systemic functional information. It serves as a critical resource for understanding the functions of genes and proteins in various organisms, and connects genetic information to biological pathways and diseases, making it vital for research in genomics and bioinformatics.

Metabolic pathway: A metabolic pathway is a series of chemical reactions in a cell that leads to the conversion of a substrate into a product, often involving multiple enzymes and intermediate compounds. These pathways are crucial for cellular processes such as energy production, biosynthesis, and degradation of molecules, and they can be mapped out using databases to understand how genes and proteins interact within various biological systems.

Molecular Function: Molecular function refers to the specific biochemical activity of a gene product, typically a protein, at the molecular level. It describes what the gene product does, such as binding to other molecules, catalyzing biochemical reactions, or transporting substances within cells. This concept is essential in understanding the role of proteins in various biological processes and is fundamental for annotations in databases that classify genes and proteins based on their functions.

Ontology: Ontology is a formal representation of knowledge as a set of concepts within a domain, and the relationships between those concepts. In computational genomics, ontologies are essential for standardizing biological data, allowing researchers to communicate findings clearly and efficiently by defining terms related to genes, proteins, and metabolic pathways.

Pathway mapping: Pathway mapping is a process used to visualize and analyze the biological pathways that involve genes, proteins, and other molecules to understand their interactions and functions within a biological context. This technique helps researchers identify the roles of specific genes and their products in cellular processes, aiding in the interpretation of complex biological data and the functional annotation of genomes.

Signal Transduction Pathway: A signal transduction pathway is a series of molecular events and interactions that occur within a cell in response to an external signal, leading to a cellular response. These pathways play a critical role in how cells communicate with their environment and can regulate various processes such as gene expression, metabolism, and cell growth. Understanding these pathways is crucial for deciphering biological functions and developing therapeutic strategies.

AP® and SAT® are trademarks registered by the College Board, which is not affiliated with, and does not endorse this website.


© 2024 Fiveable Inc. All rights reserved.
