Integration testing with Data Factory, Data Lake and Dynamics 365 Finance & Operations

Summary

Just for fun, I decided to play around with Azure Data Factory (ADF), Azure Data Lake Storage (ADLS) and the Microsoft Dynamics 365 Finance & Operations VM.

I wanted to see how I could connect the pieces together to deliver some reporting functionality. My experimental plan is to integrate the technologies and see whether reports can be generated in Power BI. I’m not sure what these reports will look like, but as always I will document my steps and any challenges and errors I come across.

Experimental Business Scenario

The experiment is to find the top 5 users in a Dynamics 365 Finance & Operations on-premises VM. The rough architecture is as follows:

*Azure Data Factory has some native support for Dynamics 365 Finance & Operations. Although this is in preview and primarily aimed at the SaaS offering, our aim is to use an on-premises D365 FO instance: the Onebox implementation that can be downloaded from Lifecycle Services.

*Microsoft Dynamics does support Power BI integration out of the box. The purpose of this exercise is simply to document what a hybrid architecture might look like.

Dynamics 365 Finance & Operations

The first challenge is getting an instance of Microsoft Dynamics 365 Finance and Operations. This isn’t possible as a consumer, but luckily, as a consultant working on a D365 FO project, I was able to access Lifecycle Services and download the VM.

I downloaded the Microsoft Dynamics 365 Finance and Operations (10.0) virtual machine from Lifecycle Services. The VM is built for Hyper-V; however, the second challenge for me was to convert it to a VM supported by VMware ESXi. This was easily accomplished using the StarWind V2V Converter.

Once the VM was loaded and running, the first step was to determine where the User table in D365 FO lives. This is actually quite a simple task. For example, if you wanted to find the Customer table name, you would right-click on a view and go to 'Form Information'.

Clicking on the CustTable menu item reveals a panel, and the Administration tab shows the query that is run to display the view. In this case the underlying SQL table is called CustTable.

In my case I need the User table: USERINFO. Logging in via SQL Server Management Studio and running the following query:

SELECT * FROM dbo.USERINFO

Reveals the following list in SQL Server Management Studio.
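If you would rather pull the same list from a script than from SSMS, a minimal sketch along the following lines works. Note the assumptions: the Onebox database is named AxDB, the ODBC Driver 17 for SQL Server is installed on the machine, and how you rank the "top 5" is left illustrative.

# Hedged sketch: read the USERINFO table from Python via pyodbc.
# Assumptions: the Onebox database is called AxDB and the
# "ODBC Driver 17 for SQL Server" driver is installed.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;"              # or the VM's hostname/IP
    "DATABASE=AxDB;"                 # assumed Onebox database name
    "UID=AOUSER;PWD=AOSWebSite@123"  # the SQL credentials used later in this post
)

cursor = conn.cursor()
# ID and NAME are standard USERINFO columns; the ranking metric is up to you.
cursor.execute("SELECT TOP 5 ID, NAME FROM dbo.USERINFO")
for row in cursor.fetchall():
    print(row.ID, row.NAME)

conn.close()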

This is the list I want to track and report in Power BI. Now it’s time to move into orchestrating the workload in Azure Data Factory.

Azure Data Factory Setup

Creating the Data Factory resource is a relatively straightforward task, so I won’t go into it here. At the time of writing I opted for the V2 Data Factory resource, mainly because of the Self-hosted Integration Runtime. The other point to remember is access control:

  • To create and manage child resources in the Azure portal, you must belong to the Data Factory Contributor role at the resource group level or above.
  • To create and manage child resources with PowerShell or the SDK, the contributor role at the resource level or above is sufficient.

ADF offers three types of Integration Runtime, and you should choose the type that best serves the data integration capabilities and network environment you need. The three types are:

  • Azure
  • Self-hosted
  • Azure-SSIS

In ADF, an activity defines the action to be performed. A linked service defines a target data store or a compute service. An integration runtime provides the bridge between the activity and linked services. It is referenced by the linked service or activity, and provides the compute environment where the activity either runs or is dispatched from. This way, the activity can be performed in the region closest to the target data store or compute service, in the most performant way, while meeting security and compliance needs [REF].

The following table describes the capabilities and network support for each of the integration runtime types:

IR type     | Public network                              | Private network
----------- | ------------------------------------------- | ---------------------------------
Azure       | Data Flow, Data movement, Activity dispatch |
Self-hosted | Data movement, Activity dispatch            | Data movement, Activity dispatch
Azure-SSIS  | SSIS package execution                      | SSIS package execution

Pragmatic Works have a very interesting video that describes the differences between ADF versions 1 and 2. It’s worth a quick watch: https://www.youtube.com/watch?v=hAkcXiXr0DE

The key differences between versions 1 and 2, as outlined by Pragmatic Works, are:

ADF V1                                                      | ADF V2
----------------------------------------------------------- | --------------------------------------------
Datasets                                                    | Datasets
Linked Services                                             | Linked Services
Pipelines                                                   | Pipelines
Schedule on dataset availability & pipeline start/end times | Schedule triggers (time or tumbling window)
                                                            | Self-hosted integration runtime
                                                            | Host and execute SSIS packages
                                                            | Parameters
                                                            | New control flow activities

In my experiment, I will be using V2 with the Self-hosted integration runtime. A list of supported Data Stores and formats can be found here: https://docs.microsoft.com/en-us/azure/data-factory/copy-activity-overview#supported-data-stores-and-formats
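Before the copy can reach the on-premises SQL Server, a Self-hosted Integration Runtime has to exist in the factory. A hedged sketch of registering one with the SDK, reusing the adf_client, resource_group and factory_name from the factory sketch above (the runtime name is illustrative):

# Hedged sketch: register a Self-hosted Integration Runtime and fetch its keys.
# "SelfHostedIR-D365FO" is an illustrative name, not the one used in this post.
from azure.mgmt.datafactory.models import (
    IntegrationRuntimeResource,
    SelfHostedIntegrationRuntime,
)

ir_name = "SelfHostedIR-D365FO"
adf_client.integration_runtimes.create_or_update(
    resource_group,
    factory_name,
    ir_name,
    IntegrationRuntimeResource(
        properties=SelfHostedIntegrationRuntime(description="Bridge to the D365 FO VM")
    ),
)

# If you install the runtime manually (rather than via the Express Setup used
# later in this post), this is the authentication key the installer asks for.
keys = adf_client.integration_runtimes.list_auth_keys(resource_group, factory_name, ir_name)
print(keys.auth_key1)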

Azure Data Factory Copy Activity

The fastest way to deploy an ADF integration is to use the Copy Activity. This is a wizard that takes you from start to finish. The end result should look like this:

Using the Copy Activity I was able to quickly create a simple ADF Copy Pipeline. Some key points to remember are:

  1. The Self-hosted Integration Runtime is an executable that requires installation on the on-premises host machine.
  2. Firewall rules may need to be adjusted to allow the Self-hosted Integration Runtime to contact ADF.
  3. Azure Data Lake Storage needs to be set up prior to running the Copy Activity; the ADLS account will be used as the sink.
  4. If you are using Managed Identities, access for the factory’s identity must be configured in Azure Data Lake Storage itself.

If you are using Managed Identities to access the Azure Data Lake, then you need to ensure that you select the Data Lake Container itself and not the subfolders.

Search for and select the generated managed identity:

59345da9-2a4d-412b-afb6-298fadab91a1

Then add the correct permissions.
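Once the factory’s managed identity has been granted access on the container, the Data Lake linked service itself needs no explicit credentials; ADF falls back to the managed identity. A rough sketch, assuming ADLS Gen2 (the post refers to containers) and reusing the client from the earlier sketches; the storage account URL is a placeholder:

# Hedged sketch: ADLS Gen2 linked service that authenticates with the
# factory's managed identity (no account key or service principal supplied).
from azure.mgmt.datafactory.models import (
    LinkedServiceResource,
    AzureBlobFSLinkedService,
)

adls_ls_name = "ls_adls_sink"   # illustrative name
adf_client.linked_services.create_or_update(
    resource_group,
    factory_name,
    adls_ls_name,
    LinkedServiceResource(
        properties=AzureBlobFSLinkedService(
            url="https://<storageaccount>.dfs.core.windows.net"   # placeholder
        )
    ),
)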

Another important point to remember is the runtime trigger. You can run the pipeline once, or schedule it to run at intervals. This needs to be set up at the start of the Copy Activity wizard.
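If you opt for a scheduled run rather than a one-off, the wizard creates a trigger behind the scenes. A rough SDK equivalent for a daily schedule, reusing the client from the earlier sketches (the trigger name, pipeline name and recurrence are illustrative):

# Hedged sketch: a daily schedule trigger attached to the copy pipeline.
from datetime import datetime, timedelta
from azure.mgmt.datafactory.models import (
    TriggerResource,
    ScheduleTrigger,
    ScheduleTriggerRecurrence,
    TriggerPipelineReference,
    PipelineReference,
)

trigger = ScheduleTrigger(
    recurrence=ScheduleTriggerRecurrence(
        frequency="Day",
        interval=1,
        start_time=datetime.utcnow() + timedelta(minutes=5),
        time_zone="UTC",
    ),
    pipelines=[
        TriggerPipelineReference(
            pipeline_reference=PipelineReference(
                reference_name="CopyUserInfoPipeline",   # illustrative pipeline name
                type="PipelineReference",
            )
        )
    ],
)
adf_client.triggers.create_or_update(
    resource_group, factory_name, "DailyTrigger", TriggerResource(properties=trigger)
)
# Older SDK versions expose triggers.start() instead of begin_start().
adf_client.triggers.begin_start(resource_group, factory_name, "DailyTrigger").result()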

Azure Data Factory Source Settings

When selecting my source connection, I used the following details to log on to the VM’s SQL Server:

Username: AOUSER
Password: AOSWebSite@123

When creating the Linked Service using the Self-Hosted Integration Runtime, my settings were as follows:
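Roughly, the equivalent linked service definition ties the SQL Server connection to the self-hosted runtime via connect_via. This is a sketch only: the server and database names are assumptions (the Onebox database is usually AxDB), the credentials are the ones listed above, and in practice the password would live in Key Vault rather than inline.

# Hedged sketch: SQL Server linked service routed through the self-hosted IR.
from azure.mgmt.datafactory.models import (
    LinkedServiceResource,
    SqlServerLinkedService,
    IntegrationRuntimeReference,
)

sql_ls_name = "ls_d365fo_sql"   # illustrative name
adf_client.linked_services.create_or_update(
    resource_group,
    factory_name,
    sql_ls_name,
    LinkedServiceResource(
        properties=SqlServerLinkedService(
            # Server and database names are assumptions; prefer Key Vault for the password.
            connection_string="Server=<d365fo-vm>;Database=AxDB;User ID=AOUSER;Password=AOSWebSite@123;",
            # Route traffic through the runtime installed on the on-premises host.
            connect_via=IntegrationRuntimeReference(
                reference_name="SelfHostedIR-D365FO",   # name from the IR sketch above
                type="IntegrationRuntimeReference",     # required by newer SDK versions
            ),
        )
    ),
)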

For the integration runtime I selected the Express Setup. This was the easiest option, allowing me to download a file and install it on my local machine.

Interestingly, you are able to set a limit on the number of concurrent jobs, which essentially means that several ADF jobs can share a single runtime.

Azure Data Lake Destination Settings

If you are using the Azure portal UI, you should find that most of the settings are resolved for you automatically.

Finally, add the Azure Data Lake folder path.

I’ve set my folder path by year, so the directory output should look like:

2019/UserInfo.tsv

The file format settings are as follows:
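Pulling the destination settings together, the source and sink datasets roughly correspond to the sketch below: the USERINFO table on one side and a tab-delimited file at 2019/UserInfo.tsv on the other. The container name is a placeholder, the dataset names are illustrative, and this again assumes ADLS Gen2 and the client from the earlier sketches.

# Hedged sketch: source dataset (the USERINFO table) and sink dataset
# (tab-separated file at 2019/UserInfo.tsv in the Data Lake container).
from azure.mgmt.datafactory.models import (
    DatasetResource,
    SqlServerTableDataset,
    DelimitedTextDataset,
    AzureBlobFSLocation,
    LinkedServiceReference,
)

source_ds = DatasetResource(
    properties=SqlServerTableDataset(
        linked_service_name=LinkedServiceReference(
            reference_name="ls_d365fo_sql", type="LinkedServiceReference"
        ),
        table_name="dbo.USERINFO",
    )
)
adf_client.datasets.create_or_update(resource_group, factory_name, "ds_userinfo", source_ds)

sink_ds = DatasetResource(
    properties=DelimitedTextDataset(
        linked_service_name=LinkedServiceReference(
            reference_name="ls_adls_sink", type="LinkedServiceReference"
        ),
        location=AzureBlobFSLocation(
            file_system="<container>",   # the Data Lake container (placeholder)
            folder_path="2019",
            file_name="UserInfo.tsv",
        ),
        column_delimiter="\t",
        first_row_as_header=True,
    )
)
adf_client.datasets.create_or_update(resource_group, factory_name, "ds_userinfo_tsv", sink_ds)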

Azure Data Factory Pipeline Settings

There are some important settings for managing errors. In my pipeline I want to skip any errors but log them. Right now, I am wondering whether I can track these log files through instant notifications, using either Azure Event Grid or Logic Apps. It’s something I’ll have to investigate.

My chosen settings are as follows:

With everything set up, I can now trigger the pipeline. The summary of my ADF Copy Activity is as follows:
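For reference, a rough SDK equivalent of what the wizard produced: a pipeline with a single Copy activity reading USERINFO and writing to the TSV dataset, with incompatible rows skipped and redirected to storage to approximate the "skip and log errors" setting above, then run once and polled for status. It reuses the illustrative names and client from the earlier sketches.

# Hedged sketch: the copy pipeline, run once, then polled for status.
from azure.mgmt.datafactory.models import (
    PipelineResource,
    CopyActivity,
    DatasetReference,
    SqlServerSource,
    DelimitedTextSink,
    AzureBlobFSWriteSettings,
    DelimitedTextWriteSettings,
    RedirectIncompatibleRowSettings,
    LinkedServiceReference,
)

copy_activity = CopyActivity(
    name="CopyUserInfo",
    inputs=[DatasetReference(reference_name="ds_userinfo", type="DatasetReference")],
    outputs=[DatasetReference(reference_name="ds_userinfo_tsv", type="DatasetReference")],
    source=SqlServerSource(sql_reader_query="SELECT * FROM dbo.USERINFO"),
    sink=DelimitedTextSink(
        store_settings=AzureBlobFSWriteSettings(),
        format_settings=DelimitedTextWriteSettings(file_extension=".tsv"),
    ),
    # Approximation of the "skip errors but log them" fault-tolerance settings.
    enable_skip_incompatible_row=True,
    redirect_incompatible_row_settings=RedirectIncompatibleRowSettings(
        linked_service_name=LinkedServiceReference(
            reference_name="ls_adls_sink", type="LinkedServiceReference"
        ),
        path="errors",   # illustrative folder for redirected rows
    ),
)

adf_client.pipelines.create_or_update(
    resource_group,
    factory_name,
    "CopyUserInfoPipeline",
    PipelineResource(activities=[copy_activity]),
)

run = adf_client.pipelines.create_run(resource_group, factory_name, "CopyUserInfoPipeline")
status = adf_client.pipeline_runs.get(resource_group, factory_name, run.run_id)
print(status.status)   # Queued / InProgress / Succeeded / Failed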

Experiment Results

The copy activity worked as planned, and I saw my data in Azure Data Lake quite quickly.

Details of the running job:

Final Thoughts

The pieces work very well together, especially for PoC work. The cost of this implementation was minimal: it cost me £4.30 a month. With Azure you are entitled to 5 free activity runs.