Designing a Modern Data Warehouse – Available Options


I recently spent some time with Microsoft looking at the various options for building our big data pipelines. Here are the options that emerged, for anyone considering doing the same.

Note that all ingesting and loading is carried out with Azure Data Factory.

Source Systems to Azure Data Warehouse to Power BI

In this architecture, data is pushed directly into Azure DW, where all transformation (cleansing, quality control, schema modelling) is carried out. There is no staging area; all load and transformation workloads are committed directly on the DW. Business users then connect directly to the DW and build their reports there.

Pros and cons are listed based on the assumption that the provisioned number of DWUs is absolutely required.


Pros:

  1. The number of interconnected components is reduced to a minimum of three (source system, DW and a reporting tool).
  2. The number of copies of the data is reduced. You only have one set of production data, so managing security and access is simplified.
  3. Compute resources can be scaled up or down when required. The DW can also be paused if required. This means that spending on the DW can be managed quite accurately.
  4. Shorter development pipelines mean faster deployment to production for business user consumption.


Cons:

  1. Because all workloads are committed on the DW, performance may be an issue (e.g. running a data-modelling workload while a business user is running a report). This will almost certainly mean that compute will need to be scaled up.
  2. Working directly in the DW also means that the number of queries executed by developers and business users will be higher. The maximum number of concurrent queries as of 2019 in Azure DW is 128 at the highest service tier (DW30000c), with the lowest tier (DW100c) serving 4 concurrent queries.
  3. Implementation will not be economically scalable as more developers work directly in the DW while growing numbers of business users query the DW for reports.
  4. It is not considered best practice to allow business users to directly access the DW for reporting purposes. A SQL data mart or Azure Analysis Service should be used.
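As a rough illustration of the concurrency concern above, a simple capacity check can flag when combined developer and report traffic would exceed a tier's concurrent-query ceiling. Only the DW100c and DW30000c figures come from the 2019 numbers quoted above; treat the mapping as a sketch, not an authoritative tier table.

```python
# Sketch: checking expected query concurrency against a tier's ceiling.
# Only the DW100c and DW30000c values come from the figures quoted in
# the text; this is illustrative, not a full Azure DW tier table.
CONCURRENT_QUERY_LIMITS = {
    "DW100c": 4,      # lowest tier (2019 figure)
    "DW30000c": 128,  # highest tier (2019 figure)
}

def has_headroom(tier: str, developers: int, report_users: int) -> bool:
    """True if the tier's concurrent-query slots cover the expected
    number of simultaneous developer and report queries."""
    return developers + report_users <= CONCURRENT_QUERY_LIMITS[tier]

print(has_headroom("DW100c", developers=3, report_users=5))      # False
print(has_headroom("DW30000c", developers=20, report_users=60))  # True
```

A check like this makes the point concrete: on the lowest tier, even a handful of developers and business users sharing the DW exhausts the query slots.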

*Note: It is often argued that performance in an Azure SQL DW is not a concern because it is MPP (Massively Parallel Processing). This is incorrect. If you provision 200 DWUs, that figure should be based on evidence that you are maximising the parallelism you actually need (applying transformations, processing data at petabyte scale). If you have provisioned 210 DWUs, the additional 10 DWUs come at a cost. If the answer is that the extra 10 DWUs are for business user consumption, i.e. Power BI, then there are better strategies for this. You may need to reconsider your architecture and understand the architectural separation of concerns and the single responsibility principle.
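The note's arithmetic can be sketched as follows. The hourly rate below is a made-up placeholder (real DWU pricing varies by region and tier), but it shows how every over-provisioned DWU is billed whether or not the MPP engine needs it.

```python
# Sketch of the cost argument in the note above: DWUs provisioned beyond
# what the transformation workload needs are still billed.
# The rate below is a hypothetical placeholder, not a real Azure price.
RATE_PER_100_DWU_HOUR = 1.50  # hypothetical $/hour per 100 DWU

def monthly_cost(dwu: int, hours_per_month: float = 730) -> float:
    """Cost of running at a given DWU level for a whole month."""
    return dwu / 100 * RATE_PER_100_DWU_HOUR * hours_per_month

needed = monthly_cost(200)       # the evidenced workload
provisioned = monthly_cost(210)  # what was actually provisioned
print(f"Extra spend for the unneeded 10 DWU: ${provisioned - needed:.2f}/month")
# → $109.50/month at this hypothetical rate
```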

Source Systems to Staging (Azure Data Lake) to Azure Data Warehouse with Reporting


Pros:

  1. The extraction and transformation are carried out in Azure Data Lake. This offloading reduces DWU consumption and helps to reduce costs.
  2. Developers do not need to directly work on the DW, leaving DWU consumption to Business Users. This provides a better metric for reporting on business-use consumption.
  3. Developers can maximise the benefits of Azure Data Lake by creating ‘zones’ in the lake. This allows developers to completely control their ALM process, creating containers (DEV, UAT, SIT, PROD) and only loading data into the DW when it is at production-level quality.
  4. In some scenarios, business users can report directly from the Azure Data Lake. Data Lake Gen1 requires data to be partitioned at around 2GB; Gen2 has no such limit.
  5. Reporting on historic data is also possible, depending on how the Data Lake has been constructed. Reporting on trends or forecasting using historic data is also possible using PolyBase.
  6. Security can be controlled using Role-Based Access Control in the Data Lake.
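The ‘zones’ idea above can be sketched as a simple path convention that the pipeline enforces. The environment and zone names below are illustrative assumptions, not a prescribed layout.

```python
# Sketch: building Data Lake folder paths for environment 'zones' as
# described above. Container and zone names are illustrative only.
ENVIRONMENTS = ("dev", "uat", "sit", "prod")
ZONES = ("raw", "cleansed", "curated")  # hypothetical zone names

def lake_path(env: str, zone: str, dataset: str) -> str:
    """Return the relative lake path for a dataset in a given
    environment and zone, rejecting unknown names."""
    if env not in ENVIRONMENTS or zone not in ZONES:
        raise ValueError(f"unknown environment or zone: {env}/{zone}")
    return f"{env}/{zone}/{dataset}"

print(lake_path("prod", "curated", "sales"))  # prod/curated/sales
```

Under a convention like this, only data that has reached the production curated zone is loaded onward into the DW.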


Cons:

  1. Data is duplicated in both the Data Lake and the Data Warehouse. This may present a legal or administrative challenge when maintaining security.
  2. Complexity has been increased by adding the Data Lake.
  3. Development pipelines are now longer.
  4. It is not considered best practice to allow business users to directly access the DW for reporting purposes. A SQL data mart or Azure Analysis Service should be used.

Source Systems to Staging (Azure Data Lake) to Azure Data Warehouse to Data Marts in SQL with Reporting


Pros:

  1. In addition to the pros of the previous architecture, cost is further reduced by loading data into data marts. In this architecture, data is loaded from the DW into smaller SQL DBs scaled correctly for business user consumption. This greatly reduces cost and complexity.
  2. Data can be partitioned into smaller units and offloaded into single-instance SQL DBs. This prevents the DW from being constantly re-modelled to fulfil business requirements.
  3. This also reduces the chance of hitting the concurrent query limit.
  4. The DW can be paused after the data has been loaded into the data marts, reducing cost.
  5. Encourages strong governance: the Data Lake and the Data Warehouse are controlled rigorously, while developers retain the flexibility to build out the SQL Server data marts.
  6. The data mart can be an Azure SQL elastic pool, Azure Analysis Services, or the Data Lake itself if import mode is not an issue in Power BI.
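As a minimal sketch of the partitioning idea above, the following splits warehouse rows into smaller per-department marts. In practice this would be an ADF copy activity or a stored procedure; the field names are hypothetical.

```python
# Sketch: splitting warehouse rows into smaller per-department data
# marts, as described above. Row shape and field names are hypothetical;
# in Azure this would be an ADF copy activity or stored procedure.
from collections import defaultdict

warehouse_rows = [
    {"dept": "sales",   "amount": 120},
    {"dept": "finance", "amount": 300},
    {"dept": "sales",   "amount": 80},
]

def partition_into_marts(rows):
    """Group rows by department, one group per target mart."""
    marts = defaultdict(list)
    for row in rows:
        marts[row["dept"]].append(row)
    return dict(marts)

marts = partition_into_marts(warehouse_rows)
print(sorted(marts))        # ['finance', 'sales']
print(len(marts["sales"]))  # 2
```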


Cons:

  1. While costs are reduced, introducing a SQL DB as a data mart will inherently increase development times.
  2. Can often lead to a big-ball-of-mud architecture, with multiple technologies involved and no clear governance.
  3. Managing deltas requires a strong strategy, as changes in the source systems need to be propagated through the pipeline.
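One common answer to the delta problem is a high-watermark pattern, sketched below in plain Python. Azure Data Factory's incremental-copy approach follows the same idea; the row shape here is hypothetical.

```python
# Sketch of a high-watermark strategy for propagating source deltas
# through the pipeline. Row shape is hypothetical; ADF's incremental
# copy pattern works the same way using a stored watermark value.
def extract_delta(rows, last_watermark):
    """Return rows modified after the stored watermark, plus the new
    watermark to persist for the next run."""
    delta = [r for r in rows if r["modified"] > last_watermark]
    new_watermark = max((r["modified"] for r in delta), default=last_watermark)
    return delta, new_watermark

source = [
    {"id": 1, "modified": "2019-05-01"},
    {"id": 2, "modified": "2019-05-03"},
    {"id": 3, "modified": "2019-05-07"},
]
delta, wm = extract_delta(source, last_watermark="2019-05-02")
print(len(delta), wm)  # 2 2019-05-07
```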

Source Systems to Staging (Azure Data Lake) to SQL Server Data Warehouse in a Virtual Machine to Data Marts

In this architecture, data is extracted from a variety of sources using Azure Data Factory. The data is ingested into Azure Data Lake and placed in zones. A SQL Server DW in a VM is used to transform the data using PolyBase. Once transformed, the data is loaded into data marts (Azure single-instance DB, elastic pool, or Azure Analysis Services).


Pros:

  1. Introducing the VM with SQL Server DW and PolyBase provides developers with a familiar environment to work from.
  2. Can often shorten data pipeline development times and costs.
  3. PolyBase allows external sources to be read.
  4. Using Azure Analysis Services may provide additional cost savings where data marts are offloaded to AAS, an in-memory OLAP cube. 
  5. VM can be turned off after workloads are completed.


Cons:

  1. Introducing a VM with a SQL warehouse means licence costs and a VM that requires management (updates, patches, etc.).
  2. Has the potential to act as a bottleneck if not provisioned carefully.
  3. Creates an additional layer in the data pipeline that has to be managed by a competent developer/sysadmin.

Source Systems to Azure Data Warehouse using Azure Databricks

In this architecture, all transformations are applied using Azure Databricks. There is no need for a staging area to shape the data into a star schema, as this is carried out in Databricks. The data is then loaded directly into Azure Data Warehouse for business user consumption.
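The star-schema shaping that Databricks performs here can be sketched in plain Python (in Databricks itself this would be PySpark or Spark SQL). The column names are hypothetical.

```python
# Sketch: shaping flat source records into a star schema (one dimension
# plus one fact table), as Databricks would do before loading the DW.
# In Databricks this would be PySpark/Spark SQL; names are hypothetical.
raw = [
    {"order_id": 1, "customer": "Acme", "amount": 250},
    {"order_id": 2, "customer": "Acme", "amount": 100},
    {"order_id": 3, "customer": "Bray", "amount": 75},
]

# Dimension: one surrogate key per distinct customer.
dim_customer = {
    name: key
    for key, name in enumerate(sorted({r["customer"] for r in raw}), start=1)
}

# Fact: measures plus foreign keys into the dimension.
fact_orders = [
    {"order_id": r["order_id"],
     "customer_key": dim_customer[r["customer"]],
     "amount": r["amount"]}
    for r in raw
]

print(dim_customer)  # {'Acme': 1, 'Bray': 2}
```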

Note that this architecture is flexible: Azure Databricks can be used to load data into Azure Data Warehouse, and then into data marts (SQL, Analysis Services, or Power BI dataflows).


Pros:

  1. Reduces the number of technologies being used.
  2. Provides developers with familiar sets of tools and languages (notebooks, SQL, Python, etc.).
  3. ALM and data pipelines are easier to manage, as are incremental loads.
  4. Reduced costs if provisioned correctly.
  5. New features are released to Databricks continuously.


Cons:

  1. The learning curve can be a challenge for traditional SQL developers.
  2. Costs can get high very quickly if provisioned incorrectly.
  3. Relatively new to the market so support may be a challenge.

Power BI Dataflows

Power BI Dataflows is relatively new and well suited to smaller workloads. It uses Azure Data Lake Gen2 and has data-wrangling and transformation capabilities, so you can build a star schema directly in the Power BI Service. Here, source systems are connected through the Common Data Service and data is landed in the Data Lake. The Data Lake is abstracted away, so it does not need to be provisioned. All modelling occurs using Power Query, and data is finally served to the Power BI Service.


Pros:

  1. Reduces the number of technologies and costs.
  2. Dataflows are continually being extended by Microsoft.
  3. Easier to use than tools built into Azure.
  4. Provides a familiar environment for non-technical users to rapidly start developing reports.
  5. While most technologies are abstracted from the consumer, Microsoft does provide access to the Data Lake if required.


Cons:

  1. Does not provide the power and capabilities required for large datasets.
  2. Power Query has some shortcomings which may leave some SQL developers frustrated with the experience.
  3. Relatively new so support is limited.

For more information on Power BI Dataflows, see the official Microsoft documentation.
