Intro to Database Systems Unit 15 – Distributed Databases

Distributed databases spread data across multiple connected sites, offering improved performance, fault tolerance, and scalability. They enable parallel query execution, data replication, and localized access, addressing limitations of centralized systems while supporting global data views and collaboration. This unit explores distributed database types, architectures, and strategies for data distribution and query processing. It covers consistency mechanisms, replication protocols, and challenges like network latency and the CAP theorem, providing insights into managing large-scale, geographically dispersed data systems.

What's a Distributed Database?

  • Is a single logical database whose data is physically spread across computers at multiple locations, connected by a data communications network
  • Enables data to be stored and accessed from multiple sites, providing a global view of the data to users and applications
  • Allows for the parallel execution of queries to improve performance and reduce response time
  • Provides fault tolerance and high availability by replicating data across multiple sites
  • Supports scalability by allowing the addition of new sites to the network as the database grows
  • Avoids the single point of failure of a centralized database, since the remaining sites can keep serving requests when one site goes down
  • Enables data to be located close to the users who need it, reducing network traffic and latency

Why Go Distributed?

  • Enables organizations to scale their databases beyond the limitations of a single machine or data center
  • Provides fault tolerance and high availability by replicating data across multiple sites
  • Allows for load balancing and parallel processing of queries to improve performance
  • Enables data to be located close to the users who need it, reducing network traffic and latency
  • Supports the need for data sharing and collaboration across multiple sites or organizations
  • Offers better scalability and flexibility compared to centralized databases
  • Enables organizations to handle large volumes of data and high transaction rates

Types of Distributed Databases

  • Homogeneous distributed databases use the same DBMS software at all sites and have identical data structures and constraints
  • Heterogeneous distributed databases use different DBMS software at different sites and may have varying data structures and constraints
  • Federated databases integrate multiple autonomous databases, allowing them to share and exchange information while maintaining their autonomy
  • Multi-database systems provide a unified interface for accessing multiple heterogeneous databases without requiring them to be integrated
  • Peer-to-peer distributed databases distribute data and processing across a network of equal, autonomous nodes
  • Cloud-based distributed databases leverage the scalability and flexibility of cloud computing platforms to store and process data across multiple virtual machines or containers

Architecture and Components

  • Consists of a network of interconnected sites, each with its own local database management system (DBMS) and storage
  • Includes a global schema that defines the logical structure of the entire distributed database
  • Uses a distributed data dictionary to store metadata about the distribution of data across sites
  • Employs a distributed query processor to optimize and execute queries across multiple sites
  • Utilizes a distributed transaction manager to ensure the consistency and integrity of transactions that span multiple sites
  • Incorporates a distributed concurrency control mechanism to manage concurrent access to shared data
  • Relies on a distributed recovery system to handle failures and ensure the consistency of the database in the event of site or network failures
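
The role of the distributed data dictionary can be illustrated with a minimal sketch. It assumes a toy catalog that maps fragment names to the sites holding a copy; the class name `Catalog`, its methods, and the site and fragment names are hypothetical and not taken from any particular DBMS.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Catalog:
    """Toy data dictionary: fragment name -> sites that store a copy (hypothetical)."""

    placement: dict[str, list[str]] = field(default_factory=dict)

    def register(self, fragment: str, site: str) -> None:
        # Record that `site` stores a copy of `fragment`.
        self.placement.setdefault(fragment, []).append(site)

    def sites_for(self, fragment: str) -> list[str]:
        # Return every site known to hold the fragment.
        return list(self.placement.get(fragment, []))

    def pick_site(self, fragment: str, local_site: str) -> str:
        # Prefer a local copy; otherwise fall back to the first registered site.
        sites = self.sites_for(fragment)
        if not sites:
            raise KeyError(f"no site stores fragment {fragment!r}")
        return local_site if local_site in sites else sites[0]


catalog = Catalog()
catalog.register("customers_eu", "paris")
catalog.register("customers_eu", "berlin")
catalog.register("customers_us", "chicago")
```

A query processor running at `berlin` would call `catalog.pick_site("customers_eu", "berlin")` and read its local copy, while one at a site without a copy would be routed to a remote site. A real system would weigh network cost and load rather than taking the first entry.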

Data Distribution Strategies

  • Fragmentation involves dividing a relation or table into smaller fragments that are stored at different sites
    • Horizontal fragmentation partitions a relation by rows, with each fragment containing a subset of the rows
    • Vertical fragmentation partitions a relation by columns, with each fragment containing a subset of the columns
  • Replication involves storing copies of the same data at multiple sites to improve availability and performance
    • Full replication stores a complete copy of the database at each site
    • Partial replication copies only selected fragments to more than one site, so each site holds a subset of the data
  • Hybrid distribution strategies combine fragmentation and replication to balance the benefits and tradeoffs of each approach
  • Data allocation determines the optimal placement of fragments and replicas across sites based on factors such as network topology, data access patterns, and performance requirements
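
The two fragmentation styles above can be sketched on an in-memory relation. This is an illustration under stated assumptions: the sample rows, the column names, and the helper functions `horizontal_fragment`, `vertical_fragment`, and `rejoin_vertical` are all made up for this example.

```python
# A tiny "customers" relation, represented as a list of row dictionaries.
rows = [
    {"id": 1, "name": "Ana", "region": "EU", "balance": 120.0},
    {"id": 2, "name": "Bo", "region": "US", "balance": 75.5},
    {"id": 3, "name": "Chen", "region": "EU", "balance": 310.0},
]


def horizontal_fragment(relation, predicate):
    # Horizontal fragmentation: keep whole rows that satisfy the predicate.
    return [dict(row) for row in relation if predicate(row)]


def vertical_fragment(relation, columns, key="id"):
    # Vertical fragmentation: keep a subset of columns, always including the
    # key so the original rows can be reconstructed later.
    keep = [key] + [c for c in columns if c != key]
    return [{c: row[c] for c in keep} for row in relation]


def rejoin_vertical(left, right, key="id"):
    # Reconstruct full rows by joining two vertical fragments on the key.
    right_by_key = {row[key]: row for row in right}
    return [{**row, **right_by_key[row[key]]} for row in left]


# Horizontal fragments, e.g. one per regional site.
eu_rows = horizontal_fragment(rows, lambda r: r["region"] == "EU")
us_rows = horizontal_fragment(rows, lambda r: r["region"] == "US")

# Vertical fragments, e.g. profile data at one site, balances at another.
profile = vertical_fragment(rows, ["name", "region"])
account = vertical_fragment(rows, ["balance"])
```

The union of the horizontal fragments and the key-based join of the vertical fragments both give back the original relation, which is the property a fragmentation scheme needs in order to be lossless.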

Query Processing in Distributed Systems

  • Involves decomposing a global query into a set of local subqueries that can be executed at individual sites
  • Requires a query optimization phase to determine the most efficient execution plan for the query, considering factors such as data location, network bandwidth, and processing costs
  • Uses a query execution engine to coordinate the execution of subqueries across sites and combine the results into a final answer
  • Employs techniques such as semi-joins and Bloom filters to reduce the amount of data transferred between sites during query processing
  • Utilizes parallel processing and load balancing to improve query performance and response time
  • Adapts join algorithms (such as hash joins and nested-loop joins) to run across sites, for example by repartitioning rows on the join key so that matching rows meet at the same site
  • Handles distributed aggregation and grouping by computing partial aggregates at each site and combining them (AVG, for instance, is derived from per-site SUM and COUNT values rather than from per-site averages)
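
Two of the techniques in this list, the semi-join reduction and the combination of partial aggregates, can be sketched as follows. The sketch assumes two sites simulated as in-memory lists; the table contents and the function names are hypothetical.

```python
def semi_join_keys(rows, key):
    # Step 1 (site A): project only the join column; this small set is shipped.
    return {row[key] for row in rows}


def reduce_by_keys(rows, key, keys):
    # Step 2 (site B): keep only the rows that will actually find a join partner.
    return [row for row in rows if row[key] in keys]


def join(left, right, key):
    # Step 3 (site A): join the local rows with the reduced rows sent back.
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def partial_aggregate(values):
    # Each site returns a (SUM, COUNT) pair instead of its raw rows.
    return (sum(values), len(values))


def combine_avg(partials):
    # The coordinator derives the global AVG from the per-site partials.
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count if count else None


# Site A holds orders, site B holds customers.
orders = [
    {"order_id": 1, "cust_id": 10, "amount": 50},
    {"order_id": 2, "cust_id": 30, "amount": 20},
    {"order_id": 3, "cust_id": 10, "amount": 30},
]
customers = [
    {"cust_id": 10, "name": "Ana"},
    {"cust_id": 20, "name": "Bo"},
    {"cust_id": 30, "name": "Chen"},
    {"cust_id": 40, "name": "Dee"},
]

keys = semi_join_keys(orders, "cust_id")
reduced = reduce_by_keys(customers, "cust_id", keys)  # 2 of 4 rows are shipped
joined = join(orders, reduced, "cust_id")

# Amounts stored at two different sites.
site_partials = [partial_aggregate([10, 20, 30]), partial_aggregate([100])]
global_avg = combine_avg(site_partials)
```

Averaging the per-site averages here would weight the single-row site as heavily as the three-row site and give a different result from `global_avg`, which is why sites ship sums and counts instead.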

Consistency and Replication

  • Aims to keep all copies of replicated data consistent across sites; how quickly replicas become up-to-date depends on the replication protocol
  • Uses a replication protocol to propagate updates from one site to all other sites that store a copy of the data
    • Synchronous replication ensures strong consistency by requiring all replicas to be updated before a transaction is committed
    • Asynchronous replication provides eventual consistency by allowing updates to be propagated to replicas after a transaction is committed
  • Employs distributed concurrency control mechanisms (such as two-phase locking and timestamp ordering) to manage concurrent access to replicated data
  • Utilizes distributed commit protocols (such as two-phase commit) to ensure that all sites agree on the outcome of a transaction
  • Incorporates distributed recovery techniques to handle site and network failures and ensure the consistency of replicated data
  • Deals with the tradeoff described by the CAP theorem: when a network partition occurs, a system must give up either consistency or availability
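
The two-phase commit protocol mentioned above can be sketched in a few lines. This is a simplified, in-process illustration: the `Participant` class and the `two_phase_commit` function are hypothetical, and the logging, timeouts, and coordinator-failure handling that a real implementation needs are left out.

```python
class Participant:
    """A site taking part in a distributed transaction (simplified)."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # whether this site will vote yes
        self.state = "init"

    def prepare(self):
        # Phase 1: vote yes (and promise to commit if asked) or vote no.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    # Phase 1 (voting): the coordinator collects a vote from every site.
    votes = [p.prepare() for p in participants]
    decision = all(votes)

    # Phase 2 (decision): commit only if every vote was yes, otherwise abort.
    for p in participants:
        if decision:
            p.commit()
        else:
            p.abort()
    return "commit" if decision else "abort"


ok_sites = [Participant("A"), Participant("B"), Participant("C")]
ok_outcome = two_phase_commit(ok_sites)

mixed_sites = [Participant("A"), Participant("B", can_commit=False)]
mixed_outcome = two_phase_commit(mixed_sites)
```

A single no vote aborts the transaction at every site, which is how the protocol ensures that all sites agree on the outcome.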

Challenges and Tradeoffs

  • Network latency and bandwidth limitations can impact the performance of distributed queries and transactions
  • Ensuring consistency and integrity of data across multiple sites can be challenging, especially in the presence of failures and network partitions
  • Maintaining the security and privacy of data in a distributed environment requires additional measures, such as distributed access control and encryption
  • Designing an efficient data distribution strategy that balances the benefits of fragmentation and replication can be complex
  • Handling heterogeneous data sources and integrating legacy systems can be difficult in a distributed database environment
  • Dealing with the complexity of distributed query optimization and transaction management requires sophisticated algorithms and techniques
  • Choosing between consistency and availability when a network partition occurs (the tradeoff described by the CAP theorem) is a key design decision in distributed databases
  • Ensuring the scalability and elasticity of the distributed database system as the data volume and workload grow can be challenging


© 2024 Fiveable Inc. All rights reserved.
AP® and SAT® are trademarks registered by the College Board, which is not affiliated with, and does not endorse this website.