Data Architecture Styles: Reference Models
Summary
In this post, I explore several architectural patterns and their applications in data processing and analytics. Lambda Architecture combines batch and real-time processing, integrating historical data with real-time insights; it suits scenarios that need past and present data analysed together, such as optimising inventory or personalising marketing based on customer purchase behaviour. Delta Lake Architecture improves data quality and processing efficiency through a layered approach: the Bronze, Silver, and Gold layers each serve a distinct purpose, from raw data ingestion to analytics-ready datasets, so data arrives at business intelligence applications clean and enriched. Data Lakehouse Architecture combines the strengths of data lakes and data warehouses, offering unified storage and processing with performance optimisations for both structured and unstructured data. Finally, ETL Pipeline Architecture provides a streamlined process for extracting, transforming, and loading data, while Kappa Architecture simplifies the overall design by treating all data as a stream and applying a single real-time processing path.
Data Architecture Styles
1. Lambda Architecture
- Batch Layer: Stores immutable, constantly growing master dataset. Data is processed in batches and stored in a data warehouse or data lake.
- Speed Layer: Handles real-time data processing and provides near real-time views. Uses stream-processing engines such as Apache Spark Structured Streaming.
- Serving Layer: Merges the batch and speed layer results to provide a comprehensive view for queries. Often uses databases like Cassandra or HBase.
Problem Solved: Combines historical (batch) and real-time data for comprehensive analytics and timely insights.
Example:
- Source Systems: Customer data from CRM and real-time sales transactions from point-of-sale systems.
- Scenario: Analysing customer purchase history and current sales trends to optimise inventory and personalise marketing.
Components Used:
- Batch Processing: Databricks Spark processes historical customer data.
- Batch Data Store: Delta Lake stores processed batch data.
- Real-Time Processing: Spark Structured Streaming on Databricks processes live sales transactions.
- Real-Time Data Store: Delta Lake stores real-time data.
- Serving Layer: Databricks merges batch and real-time data.
- Queries: Databricks SQL for analytics.
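The sketch below is a minimal PySpark illustration of the three layers, assuming hypothetical Delta table names (`sales_history`, `sales_live`, `sales_live_agg`): the batch layer aggregates historical sales, the speed layer maintains a continuously updated aggregate, and the serving layer unions the two views for queries.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lambda-sketch").getOrCreate()

# Batch layer: historical sales already landed in a Delta table
# (all table names here are illustrative placeholders).
batch_sales = spark.read.table("sales_history")

# Speed layer: live point-of-sale transactions arriving as a stream,
# aggregated and written continuously to their own Delta table.
(spark.readStream.table("sales_live")
    .groupBy("product_id")
    .agg(F.sum("amount").alias("amount"))
    .writeStream
    .format("delta")
    .outputMode("complete")
    .option("checkpointLocation", "/tmp/checkpoints/sales_live_agg")
    .toTable("sales_live_agg"))

# Serving layer: merge batch and speed results into one view,
# run on demand (e.g. from Databricks SQL).
serving = (
    batch_sales.groupBy("product_id").agg(F.sum("amount").alias("amount"))
    .unionByName(spark.read.table("sales_live_agg"))
    .groupBy("product_id")
    .agg(F.sum("amount").alias("amount"))
)
serving.show()
```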
2. Delta Lake Architecture
- Bronze Layer: Raw data ingestion. This layer stores the unprocessed data in a raw format.
- Silver Layer: Cleansed and enriched data. This layer processes and refines the data, adding schema validation and transformations.
- Gold Layer: Business-level aggregates and analytics-ready data. This layer contains data that is ready for use in business intelligence and analytics.
Problem Solved: Streamlines the data processing pipeline, ensuring data quality and readiness for analysis.
Example:
- Source Systems: Invoice data from ERP and logs from application servers.
- Scenario: Cleaning and transforming invoice data for accurate financial reporting, and analysing log data to monitor application performance.
Components Used:
- Bronze Layer: Raw data storage.
- Silver Layer: Cleaned and refined data storage.
- Gold Layer: Aggregated, analytics-ready data.
- BI Tools: Databricks for business intelligence.
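A minimal PySpark sketch of the Bronze → Silver → Gold flow for the invoice scenario; the storage paths and column names are illustrative placeholders, not a prescribed layout.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("medallion-sketch").getOrCreate()

# Bronze: ingest raw invoice files as-is, no transformation.
raw = spark.read.json("/mnt/raw/invoices/")
raw.write.format("delta").mode("append").save("/mnt/bronze/invoices")

# Silver: cleanse and validate -- deduplicate, drop malformed rows,
# enforce types on the columns the business cares about.
bronze = spark.read.format("delta").load("/mnt/bronze/invoices")
silver = (bronze
    .dropDuplicates(["invoice_id"])
    .filter(F.col("amount").isNotNull())
    .withColumn("amount", F.col("amount").cast("decimal(18,2)"))
    .withColumn("invoice_date", F.to_date("invoice_date")))
silver.write.format("delta").mode("overwrite").save("/mnt/silver/invoices")

# Gold: business-level aggregate, ready for BI dashboards.
gold = (silver
    .groupBy("customer_id", F.month("invoice_date").alias("month"))
    .agg(F.sum("amount").alias("monthly_revenue")))
gold.write.format("delta").mode("overwrite").save("/mnt/gold/monthly_revenue")
```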
3. Data Lakehouse Architecture
- Unified Data Storage: Combines the capabilities of a data lake and a data warehouse. Stores both structured and unstructured data in a single platform.
- Metadata Layer: Provides schema enforcement and governance. Utilises tools like Databricks Unity Catalog.
- Compute Layer: Executes both batch and real-time analytics using Apache Spark.
- Optimisation Layer: Uses Delta Engine to improve performance with features like indexing, caching, and data skipping.
Explanation:
- Data Sources: Data is ingested from various sources into Databricks.
- Unified Data Storage – Delta Lake: Data is stored in Delta Lake, which resides on cloud storage and provides reliable and scalable data storage.
- Metadata Layer – Databricks Unity Catalog: Databricks Unity Catalog provides schema enforcement, governance, and metadata management.
- Compute Layer – Databricks Spark: Databricks uses Apache Spark for distributed data processing and computation.
- Optimisation Layer – Delta Engine: Databricks’ Delta Engine optimises query performance with features like indexing, caching, and data skipping.
- Queries – Databricks SQL: Users perform interactive queries and analytics using Databricks SQL.
- BI Tools – Databricks: Databricks provides integrated BI tools for advanced analytics and reporting.
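As a rough illustration of the lakehouse idea, here is a PySpark sketch, assuming a hypothetical `lakehouse` database and placeholder landing paths, that stores structured and semi-structured data as Delta tables and queries them with plain SQL. Delta's schema enforcement rejects appends whose columns do not match the table, which is what gives the lake warehouse-like reliability.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakehouse-sketch").getOrCreate()
spark.sql("CREATE DATABASE IF NOT EXISTS lakehouse")

# Semi-structured and structured data land in the same Delta-backed store.
events = spark.read.json("/mnt/landing/events/")           # semi-structured
customers = spark.read.parquet("/mnt/landing/customers/")  # structured

events.write.format("delta").mode("append").saveAsTable("lakehouse.events")
customers.write.format("delta").mode("overwrite").saveAsTable("lakehouse.customers")

# Schema enforcement: appending a DataFrame whose columns do not match
# the existing table schema raises an AnalysisException instead of
# silently corrupting the table.

# The same tables are queryable with plain SQL, as they would be
# from Databricks SQL or a BI tool.
spark.sql("""
    SELECT c.segment, COUNT(*) AS event_count
    FROM lakehouse.events e
    JOIN lakehouse.customers c ON e.customer_id = c.customer_id
    GROUP BY c.segment
""").show()
```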
4. ETL Pipeline Architecture
- Extract: Pulls data from various sources (databases, APIs, etc.).
- Transform: Cleanses, enriches, and transforms the data using Apache Spark.
- Load: Loads the transformed data into a target data warehouse or data lake for further analysis.
Problem Solved: Provides a repeatable, auditable path from raw source systems to an analytics-ready store.
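A minimal PySpark sketch of the three ETL stages; the JDBC connection details, table names, and columns are placeholders, and in practice credentials would come from a secret store rather than being inlined.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Extract: pull source rows over JDBC (connection details are placeholders).
orders = (spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://source-db:5432/sales")
    .option("dbtable", "public.orders")
    .option("user", "etl_user")
    .option("password", "<from-secret-store>")
    .load())

# Transform: cleanse and enrich with Spark.
transformed = (orders
    .filter(F.col("status") == "COMPLETE")
    .withColumn("order_date", F.to_date("order_ts"))
    .withColumn("net_amount", F.col("amount") - F.col("discount")))

# Load: write the result to the target Delta table in the lake.
(transformed.write
    .format("delta")
    .mode("append")
    .save("/mnt/warehouse/orders_clean"))
```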
5. Kappa Architecture
- Data Ingestion: Captures real-time data streams from various sources using tools like Apache Kafka or Azure Event Hubs.
- Stream Processing: Processes the data streams in real-time using Spark Structured Streaming. The same stream processing logic is applied to both historical and real-time data, simplifying the architecture.
- Storage: Stores the processed data in a unified storage layer such as Delta Lake, which supports ACID transactions and scalable storage.
- Serving Layer: Provides processed data for real-time analytics and business intelligence using query engines like Databricks SQL.
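A minimal Spark Structured Streaming sketch of the Kappa flow, assuming a hypothetical Kafka topic `device-readings` and placeholder paths (the Kafka source also requires the spark-sql-kafka package on the classpath). Replaying the topic from the earliest offset reprocesses history with exactly the same logic as live data, which is the point of the architecture.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

spark = SparkSession.builder.appName("kappa-sketch").getOrCreate()

schema = StructType([
    StructField("device_id", StringType()),
    StructField("reading", DoubleType()),
    StructField("event_time", TimestampType()),
])

# Ingestion: a single stream is the system of record.
# startingOffsets=earliest means a restart replays all history
# through the same code path as live data.
raw = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "device-readings")
    .option("startingOffsets", "earliest")
    .load())

# Stream processing: parse the payload, then aggregate over
# event-time windows with a watermark to bound state.
parsed = (raw
    .select(F.from_json(F.col("value").cast("string"), schema).alias("r"))
    .select("r.*"))

agg = (parsed
    .withWatermark("event_time", "10 minutes")
    .groupBy(F.window("event_time", "5 minutes"), "device_id")
    .agg(F.avg("reading").alias("avg_reading")))

# Storage + serving: results land in Delta, queryable from Databricks SQL.
(agg.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/mnt/checkpoints/device_agg")
    .start("/mnt/serving/device_agg"))
```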