📊Business Intelligence Unit 2 – Data Warehousing: Concepts & Architecture

Data warehousing is a critical component of modern business intelligence. It centralizes an organization's historical data from various sources, optimizing it for complex queries and analysis. This enables informed decision-making and long-term trend identification. The data warehouse architecture typically follows a three-tier model, separating data sources, the central repository, and user access tools. Key components include source systems, staging areas, ETL processes, and query tools. Understanding these elements is crucial for effective data management and analysis.

What's Data Warehousing?

  • Centralized repository for all an organization's historical data used for reporting and analysis
  • Integrates data from multiple heterogeneous sources (transactional databases, operational data stores, external sources) into a single unified schema
  • Optimized for querying and analysis rather than transaction processing enables complex queries and data mining
  • Stores large volumes of historical data (terabytes to petabytes) over extended periods (years to decades) for long-term trend analysis
  • Typically updated in batches (daily, weekly) rather than real-time to provide a stable, consistent view of data over time
  • Follows a subject-oriented design organizing data around key business subjects (customers, products, sales) rather than operational processes
  • Employs a non-volatile storage approach data is read-only once loaded and not changed, only added to over time

Why Do We Need It?

  • Enables informed strategic decision-making by providing a comprehensive, integrated view of enterprise-wide data
  • Overcomes limitations of operational databases not optimized for complex queries and analytics
  • Improves data quality and consistency through data integration and cleansing during ETL process
  • Enhances performance of analytical queries by separating them from transactional workloads
  • Provides a historical record of business data over time for identifying long-term trends and patterns
  • Facilitates data consolidation from multiple systems and business units into an enterprise-wide view
  • Supports ad hoc queries, data mining, and advanced analytics not feasible on operational databases
  • Reduces IT costs and complexity by offloading analytical workloads from transactional systems

Key Components of a Data Warehouse

  • Source systems originating operational databases and external data sources feeding the warehouse
  • Staging area temporary storage for data extraction, transformation, and cleansing before loading into warehouse
  • ETL (Extract, Transform, Load) tools and processes for extracting data from sources, converting it to warehouse format, and loading it
  • Central repository the core data warehouse database storing integrated, historical data in a denormalized, subject-oriented schema
    • Fact tables store quantitative metrics (sales amount, quantity) about business events and processes
    • Dimension tables store descriptive attributes (customer name, product category) providing context for facts
  • Data marts subsets of warehouse data focused on specific business functions (marketing, finance) or departments
  • Metadata repository stores information about the structure, meaning, and lineage of data in the warehouse
  • Query and reporting tools allow users to access, analyze, and visualize warehouse data (OLAP, data mining, dashboards)

Data Warehouse Architecture

  • Three-tier architecture separates data sources, warehouse repository, and end-user access and analysis
    • Bottom tier source systems and staging area for ETL
    • Middle tier central data warehouse repository and data marts
    • Top tier end-user query, reporting, and analysis tools
  • Hub-and-spoke architecture uses a central data warehouse (hub) to feed multiple dependent data marts (spokes)
  • Bus architecture uses conformed dimensions to create a standardized, reusable data model across data marts
  • Federated architecture leaves data in source systems and uses a virtual data warehouse to provide a unified view
  • Hybrid architecture combines elements of different architectures (hub-and-spoke fed by bus) to meet specific needs

ETL Process Explained

  • Extraction retrieves data from source systems into staging area can be full or incremental based on data volumes and update frequency
  • Transformation converts extracted data into warehouse format resolves data quality issues, applies business rules, and derives new calculated fields
    • Data cleansing corrects errors, removes duplicates, and standardizes formats
    • Data integration combines data from multiple sources into a unified schema
    • Data aggregation summarizes detailed data into higher-level totals for analysis
  • Loading moves transformed data into the warehouse tables and indexes are created to optimize query performance
  • ETL workflows are scheduled, logged, and monitored to ensure data freshness, quality, and availability
  • Changed Data Capture (CDC) identifies and extracts only new or modified data to streamline incremental loads
  • Data lineage tracks the sources, transformations, and dependencies of warehouse data for auditing and troubleshooting

Types of Data Warehouses

  • Enterprise Data Warehouse (EDW) provides a centralized, comprehensive view of data across the entire organization
  • Operational Data Store (ODS) contains near real-time data from operational systems for tactical reporting and analysis
  • Data Mart focuses on a specific business function (marketing) or department (sales) and contains a subset of EDW data
    • Dependent data marts source data from the EDW for a unified, consistent view
    • Independent data marts source data directly from operational systems for faster, more flexible deployment
  • Cloud Data Warehouse uses cloud computing and storage to provide scalability, elasticity, and cost-efficiency (Amazon Redshift, Google BigQuery)
  • Real-time Data Warehouse supports near real-time data ingestion and analysis for operational reporting and decision-making

Data Modeling for Warehouses

  • Dimensional modeling organizes data into facts (quantitative) and dimensions (qualitative) for intuitive querying and analysis
    • Star schema has a central fact table connected to surrounding dimension tables resembles a star
    • Snowflake schema normalizes dimension tables into multiple related tables resembles a snowflake
  • Fact tables store measurements, metrics, or events (sales amount, quantity sold) related to a business process
    • Composite key consists of foreign keys from connected dimension tables
    • Fact less tables have no measures only foreign keys used to analyze many-to-many relationships
  • Dimension tables store descriptive attributes (product name, category) that provide context for fact data
    • Denormalized and redundant to simplify queries and improve performance
    • Slowly Changing Dimensions (SCDs) handle changes to dimension attributes over time
  • Surrogate keys replace natural keys to simplify joins, improve performance, and handle changes

Real-World Applications

  • Retail analyzes sales, inventory, and customer data to optimize product assortment, pricing, and promotions
  • Healthcare integrates patient, clinical, and financial data to improve care quality, outcomes, and cost-effectiveness
  • Financial Services analyzes customer, account, and transaction data for fraud detection, risk management, and cross-selling
  • Telecommunications mines call detail records, network data, and customer data for churn analysis, network optimization, and marketing
  • Manufacturing integrates supply chain, production, and quality data to optimize inventory, throughput, and product quality
  • Government agencies integrate data from multiple departments to improve services, reduce fraud, and enhance security
  • Higher Education analyzes student, course, and alumni data to improve enrollment, retention, and fundraising

Challenges and Limitations

  • Data quality issues from source systems (inconsistencies, errors, duplicates) can propagate into the warehouse if not resolved during ETL
  • Data integration challenges arise when combining data from disparate sources with different schemas, formats, and semantics
  • Scalability and performance issues can occur as data volumes grow, requiring careful design and optimization of warehouse architecture and queries
  • High implementation and maintenance costs for hardware, software, and personnel can be a barrier for smaller organizations
  • Lack of real-time data can limit the warehouse's usefulness for operational decision-making and reporting
  • Resistance to change from business users accustomed to existing reporting and analysis tools can slow adoption
  • Data governance and security concerns arise from centralizing sensitive data from multiple sources into one repository
  • Keeping up with evolving data sources and business requirements requires ongoing warehouse maintenance and updates


© 2024 Fiveable Inc. All rights reserved.
AP® and SAT® are trademarks registered by the College Board, which is not affiliated with, and does not endorse this website.

© 2024 Fiveable Inc. All rights reserved.
AP® and SAT® are trademarks registered by the College Board, which is not affiliated with, and does not endorse this website.