💾 Intro to Database Systems Unit 15 – Distributed Databases
Distributed databases spread data across multiple connected sites, offering improved performance, fault tolerance, and scalability. They enable parallel query execution, data replication, and localized access, addressing limitations of centralized systems while supporting global data views and collaboration.
This unit explores distributed database types, architectures, and strategies for data distribution and query processing. It covers consistency mechanisms, replication protocols, and challenges like network latency and the CAP theorem, providing insights into managing large-scale, geographically dispersed data systems.
What Is a Distributed Database?
Consists of a single logical database that is physically spread across computers at multiple sites connected by a data communications network
Enables data to be stored at and accessed from multiple sites, giving users and applications a global view of the data
Allows for the parallel execution of queries to improve performance and reduce response time
Provides fault tolerance and high availability by replicating data across multiple sites
Supports scalability by allowing the addition of new sites to the network as the database grows
Offers better reliability and availability compared to centralized databases
Enables data to be located close to the users who need it, reducing network traffic and latency
Why Go Distributed?
Enables organizations to scale their databases beyond the limitations of a single machine or data center
Improves fault tolerance and availability, since replicated copies keep data accessible when individual sites fail
Allows for load balancing and parallel processing of queries to improve performance
Keeps data close to the users who access it most, cutting network traffic and latency
Supports the need for data sharing and collaboration across multiple sites or organizations
Offers better scalability and flexibility compared to centralized databases
Enables organizations to handle large volumes of data and high transaction rates
Types of Distributed Databases
Homogeneous distributed databases use the same DBMS software at all sites and have identical data structures and constraints
Heterogeneous distributed databases use different DBMS software at different sites and may have varying data structures and constraints
Federated databases integrate multiple autonomous databases, allowing them to share and exchange information while maintaining their autonomy
Multi-database systems provide a unified interface for accessing multiple heterogeneous databases without requiring them to be integrated
Peer-to-peer distributed databases distribute data and processing across a network of equal, autonomous nodes
Cloud-based distributed databases leverage the scalability and flexibility of cloud computing platforms to store and process data across multiple virtual machines or containers
Architecture and Components
Consists of a network of interconnected sites, each with its own local database management system (DBMS) and storage
Includes a global schema that defines the logical structure of the entire distributed database
Uses a distributed data dictionary (catalog) to store metadata about how data is distributed across sites (see the sketch after this list)
Employs a distributed query processor to optimize and execute queries across multiple sites
Utilizes a distributed transaction manager to ensure the consistency and integrity of transactions that span multiple sites
Incorporates a distributed concurrency control mechanism to manage concurrent access to shared data
Relies on a distributed recovery system to handle failures and ensure the consistency of the database in the event of site or network failures
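To make the data dictionary concrete, here is a minimal Python sketch of a catalog that maps each fragment of a relation to the sites that store it; a query processor would consult it to decide which sites to contact. All table, fragment, and site names are hypothetical, and a real catalog tracks far more metadata (statistics, constraints, replica freshness).

```python
# Minimal sketch of a distributed data dictionary (catalog).
# All names are hypothetical; a real catalog stores much richer metadata.

from dataclasses import dataclass, field

@dataclass
class Fragment:
    name: str              # e.g. "customers_emea"
    predicate: str         # condition defining the horizontal fragment
    sites: list[str]       # sites holding a copy (replication)

@dataclass
class Catalog:
    fragments: dict[str, list[Fragment]] = field(default_factory=dict)

    def register(self, relation: str, frag: Fragment) -> None:
        self.fragments.setdefault(relation, []).append(frag)

    def sites_for(self, relation: str) -> set[str]:
        """Which sites might a query over `relation` need to contact?"""
        return {s for f in self.fragments.get(relation, []) for s in f.sites}

catalog = Catalog()
catalog.register("customers", Fragment("customers_emea", "region = 'EMEA'", ["paris", "berlin"]))
catalog.register("customers", Fragment("customers_apac", "region = 'APAC'", ["tokyo"]))

print(catalog.sites_for("customers"))   # {'paris', 'berlin', 'tokyo'} (order may vary)
```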
Data Distribution Strategies
Fragmentation involves dividing a relation or table into smaller fragments that are stored at different sites (see the sketch after this list)
Horizontal fragmentation partitions a relation by rows, with each fragment containing a subset of the rows
Vertical fragmentation partitions a relation by columns, with each fragment containing a subset of the columns
Replication involves storing copies of the same data at multiple sites to improve availability and performance
Full replication stores a complete copy of the database at each site
Partial replication stores only a subset of the data at each site
Hybrid distribution strategies combine fragmentation and replication to balance the benefits and tradeoffs of each approach
Data allocation determines the optimal placement of fragments and replicas across sites based on factors such as network topology, data access patterns, and performance requirements
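To illustrate fragmentation and allocation, here is a small, hypothetical Python sketch that fragments one relation horizontally by region and vertically by column group, then allocates the fragments to sites. The column and site names are invented for illustration only.

```python
# Hypothetical sketch: horizontal and vertical fragmentation of one relation.

customers = [
    {"id": 1, "name": "Ada",  "region": "EMEA", "credit_limit": 500},
    {"id": 2, "name": "Bo",   "region": "APAC", "credit_limit": 900},
    {"id": 3, "name": "Cruz", "region": "EMEA", "credit_limit": 300},
]

def horizontal_fragment(rows, predicate):
    """Keep whole rows that satisfy the predicate (a selection)."""
    return [r for r in rows if predicate(r)]

def vertical_fragment(rows, columns):
    """Keep only some columns, always including the key so the relation can be reconstructed."""
    cols = {"id", *columns}
    return [{c: r[c] for c in cols} for r in rows]

# Allocation: place each fragment at the site closest to its main users.
allocation = {
    "paris": horizontal_fragment(customers, lambda r: r["region"] == "EMEA"),
    "tokyo": horizontal_fragment(customers, lambda r: r["region"] == "APAC"),
    # Billing runs centrally, so it gets a vertical fragment of the financial columns.
    "hq":    vertical_fragment(customers, {"credit_limit"}),
}

for site, fragment in allocation.items():
    print(site, fragment)
```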
Query Processing in Distributed Systems
Involves decomposing a global query into a set of local subqueries that can be executed at individual sites
Requires a query optimization phase to determine the most efficient execution plan for the query, considering factors such as data location, network bandwidth, and processing costs
Uses a query execution engine to coordinate the execution of subqueries across sites and combine the results into a final answer
Employs techniques such as semi-joins and Bloom filters to reduce the amount of data transferred between sites during query processing (a semi-join sketch follows this list)
Utilizes parallel processing and load balancing to improve query performance and response time
Incorporates distributed join algorithms (such as hash joins and nested loop joins) to efficiently process joins across multiple sites
Handles distributed aggregation and grouping operations to compute aggregate functions (such as SUM, AVG, and COUNT) across multiple sites
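The semi-join idea can be sketched as: ship only the distinct join-key values from one site to the other, let the remote site filter its relation with them, and complete the join locally on the reduced result. The relations and sites in this Python sketch are hypothetical.

```python
# Hypothetical semi-join sketch: reduce data shipped between two sites.

# Site A holds orders, Site B holds customers.
orders_at_a = [
    {"order_id": 10, "cust_id": 1, "total": 40},
    {"order_id": 11, "cust_id": 3, "total": 75},
]
customers_at_b = [
    {"cust_id": 1, "name": "Ada"},
    {"cust_id": 2, "name": "Bo"},
    {"cust_id": 3, "name": "Cruz"},
]

# Step 1: Site A ships only the distinct join keys (small) to Site B.
keys_from_a = {o["cust_id"] for o in orders_at_a}

# Step 2: Site B returns only the customer rows that can possibly join.
reduced_customers = [c for c in customers_at_b if c["cust_id"] in keys_from_a]

# Step 3: Site A completes the join locally on the reduced relation.
by_id = {c["cust_id"]: c for c in reduced_customers}
result = [{**o, "name": by_id[o["cust_id"]]["name"]} for o in orders_at_a]

print(result)  # joined rows, without ever shipping the non-matching customers
```

The same reduction idea underlies Bloom-filter joins, where a compact bit array summarizing the keys is shipped instead of the keys themselves.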
Consistency and Replication
Ensures that all copies of replicated data are consistent and up-to-date across all sites
Uses a replication protocol to propagate updates from one site to all other sites that store a copy of the data
Synchronous replication ensures strong consistency by requiring all replicas to be updated before a transaction is committed
Asynchronous replication provides eventual consistency by allowing updates to be propagated to replicas after a transaction is committed
Employs distributed concurrency control mechanisms (such as two-phase locking and timestamp ordering) to manage concurrent access to replicated data
Utilizes distributed commit protocols (such as two-phase commit) to ensure that all sites agree on the outcome of a transaction (see the sketch after this list)
Incorporates distributed recovery techniques to handle site and network failures and ensure the consistency of replicated data
Deals with the tradeoff between consistency and availability in the presence of network partitions (known as the CAP theorem)
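Two-phase commit works in two rounds: the coordinator asks every participant to vote (prepare), and only a unanimous yes leads to a global commit; any no vote, failure, or timeout leads to a global abort. The Python sketch below is schematic and leaves out the logging, timeouts, and crash recovery a real protocol requires.

```python
# Schematic two-phase commit: omits logging, timeouts, and crash recovery.

class Participant:
    def __init__(self, name, will_vote_yes=True):
        self.name = name
        self.will_vote_yes = will_vote_yes

    def prepare(self) -> bool:
        # In a real system: force-write a prepare record, hold locks, then vote.
        print(f"{self.name}: vote {'YES' if self.will_vote_yes else 'NO'}")
        return self.will_vote_yes

    def commit(self):
        print(f"{self.name}: commit")

    def abort(self):
        print(f"{self.name}: abort")

def two_phase_commit(participants) -> bool:
    # Phase 1 (voting): ask every participant to prepare and collect its vote.
    votes = [p.prepare() for p in participants]
    if all(votes):
        # Phase 2 (decision): unanimous yes -> tell everyone to commit.
        for p in participants:
            p.commit()
        return True
    # Any no vote (or, in a real system, a timeout) -> tell everyone to abort.
    for p in participants:
        p.abort()
    return False

sites = [Participant("paris"), Participant("tokyo", will_vote_yes=False)]
print("committed:", two_phase_commit(sites))
```

A known drawback: if the coordinator fails after participants have voted yes, they must block until it recovers, which is why the logging and recovery steps omitted here matter in practice.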
Challenges and Tradeoffs
Network latency and bandwidth limitations can impact the performance of distributed queries and transactions
Ensuring consistency and integrity of data across multiple sites can be challenging, especially in the presence of failures and network partitions
Maintaining the security and privacy of data in a distributed environment requires additional measures, such as distributed access control and encryption
Designing an efficient data distribution strategy that balances the benefits of fragmentation and replication can be complex
Handling heterogeneous data sources and integrating legacy systems can be difficult in a distributed database environment
Dealing with the complexity of distributed query optimization and transaction management requires sophisticated algorithms and techniques
Balancing the tradeoffs between consistency, availability, and partition tolerance (as described by the CAP theorem) is a key challenge in distributed databases; the quorum sketch after this list shows one way the tradeoff is tuned in practice
Ensuring the scalability and elasticity of the distributed database system as the data volume and workload grow can be challenging
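One concrete way to see the consistency/availability tradeoff is quorum replication: with N replicas, a write must reach W of them and a read must contact R of them, and choosing R + W > N makes every read quorum overlap every write quorum, so reads see the latest write, at the cost of tolerating fewer unreachable replicas. The Python sketch below is a simplified, hypothetical illustration with no failure handling or conflict resolution.

```python
# Simplified quorum replication sketch: R + W > N gives overlapping quorums.

N_REPLICAS = 3
W = 2   # a write must be acknowledged by W replicas
R = 2   # a read must contact R replicas

# Each replica stores (version, value); versions stand in for timestamps.
replicas = [{"x": (0, None)} for _ in range(N_REPLICAS)]

def write(key, value, version):
    # The first W replicas stand in for an arbitrary write quorum.
    for rep in replicas[:W]:
        rep[key] = (version, value)

def read(key):
    # Read the *last* R replicas; because R + W > N they still overlap the writers,
    # so the highest version seen is the latest committed write.
    return max((rep[key] for rep in replicas[-R:]), key=lambda vv: vv[0])[1]

write("x", "hello", version=1)
print(read("x"))          # 'hello' -- the read quorum overlaps the write quorum
print(R + W > N_REPLICAS) # True: the overlap is what guarantees the fresh read
```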