Connect Azure Databricks to Azure Storage
Summary
The Microsoft documentation contains instructions for connecting Databricks to several storage resources when you create a Databricks resource directly in Azure.
In this post I’ll demonstrate how to connect Azure Databricks to an Azure Storage account inside a VNET.
Infrastructure Setup
I want my Azure Storage account and Databricks workspace to sit in the same network. This is purely so that I can restrict incoming/outgoing traffic to and from Azure Storage to that network. The three resources I need are:
- Azure Storage to host my CSV.
- Azure Virtual Network; I want my Databricks Cluster and my Storage to sit in the same network.
- Databricks to transform my data.
- All resources will be situated in UK South just to keep things simple.
Azure Storage
- I’ll create an Azure Storage account, General Purpose V2, for block blob storage.
- I’ll use Shared Access Signatures (SAS) for authentication/authorisation at the container level.
- Under Firewalls and Virtual Networks, I’ll allow traffic through only from ‘Selected Networks’. This is to prevent the error:
AzureException: hadoop_azure_shaded.com.microsoft.azure.storage.StorageException: This request is not authorized to perform this operation. Caused by: StorageException: This request is not authorized to perform this operation.
Azure Virtual Network
- Create an Azure Virtual Network called primarynetwork01, accepting all the defaults. I’ll make a note of my subnet address ranges, as I want my Databricks cluster subnets to sit in the same address space.
Azure Databricks
- I’ll create a new Azure Databricks under the UK South region.
- Under the Network settings, I’ll ensure that I select the primarynetwork01. This is the VNET where I want my Clusters to run from.
- I’ll add the necessary subnets in the Databricks wizard (db-private-subnet & db-public-subnet).
Create Azure Virtual Network
Create a new Virtual Network. Accept all defaults unless there is something specific you need to update. In my implementation, I have kept all the default settings.
Create Azure Blob Storage
The details below have been tested in an Azure Databricks deployment. For unknown reasons, these configurations do not work in the Databricks Community Edition. They have not been tested on Databricks on AWS.
Before creating an Azure Storage, you must consider the following:
- Azure Blob storage supports three blob types: block, append, and page. You can only mount block blobs to DBFS.
- Network Security must be configured to allow Databricks to access the storage; otherwise you will receive the following error:
AzureException: hadoop_azure_shaded.com.microsoft.azure.storage.StorageException: This request is not authorized to perform this operation. Caused by: StorageException: This request is not authorized to perform this operation.
- The implementation uses a SAS token; if security and governance are a concern, you should use Managed Identities or store the token in Azure Key Vault (see the sketch after this list).
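As a minimal sketch of the Key Vault approach, assuming a Key Vault-backed secret scope already exists (the scope name kv-scope and secret name storage-sas below are hypothetical placeholders), the SAS token can be retrieved at runtime rather than hard-coded in the notebook:
// Retrieve the SAS token from a Databricks secret scope backed by Azure Key Vault
// "kv-scope" and "storage-sas" are placeholders for your own scope and secret names
val sasFromVault = dbutils.secrets.get(scope = "kv-scope", key = "storage-sas")
The retrieved value can then be passed to extraConfigs in exactly the same way as the hard-coded token later in this post.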
Create a new Azure Storage account
- Under Networking, I have selected Public endpoint (selected networks).
- I selected the Virtual Network I created earlier.
Beyond the networking settings above, I left all the default options.
Azure Storage SAS Key
Create a new Container
Create a new SAS key
Right-click on the Container and generate a new SAS key.
The SAS key should look similar to this:
sp=r&st=2022-02-23T07:13:32Z&se=2022-02-23T15:13:32Z&spr=https&sv=2020-08-04&sr=c&sig=cbEhoGLvxp2GuMNNfDodFhwpFXz9cg%2BmRdCG0p8JylM%3D
Upload a CSV of any kind. This is the CSV that will be queried.
Databricks Instructions
- Create a new Azure Databricks workspace.
- Create a new Cluster.
Create Databricks Workspace
- Create the workspace in the same region as the storage account.
- Under Network, select the VNET you created earlier.
Continue creating the Databricks resource.
Create a new Cluster
Note that not all Node types are available in all regions. I had to select the Standard_D4a_v4 node for this particular demo.
Update the Network/Firewall Properties in the Storage Account
Under Add existing Virtual Network, add the two new Databricks subnets created during the provisioning of the resource.
At this point, you now have Storage and Databricks sitting in the same VNET.
Create the Notebook
// Connection details
val containerName = "container1"
val storageAccountName = "azstorageeax01"
val sas = "sp=r&st=2022-02-23T05:10:08Z&se=2022-02-23T13:10:08Z&spr=https&sv=2020-08-04&sr=c&sig=RKyeSXIAwVM8Wamzs9Hfe4bfKRKZhEAwNYGXRN450p4%3D"
val config = "fs.azure.sas." + containerName + "." + storageAccountName + ".blob.core.windows.net"
// Unmount the mount point if it already exists, otherwise mount() will fail with "Directory already mounted"
if (dbutils.fs.mounts().map(_.mountPoint).contains("/mnt/azstorage")) {
  dbutils.fs.unmount("/mnt/azstorage")
}
// Mount the Azure Blob container to a location in Databricks DBFS
// source       - the container to mount (DBFS mounts a container or folder, not a single file)
// mountPoint   - the DBFS directory where the container contents will appear
// extraConfigs - the SAS credentials used to connect
dbutils.fs.mount(
  source = "wasbs://container1@azstorageeax01.blob.core.windows.net/",
  mountPoint = "/mnt/azstorage",
  extraConfigs = Map(config -> sas)
)
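Mounting is optional. As an alternative sketch, using the same container, account and file names as above, you can set the SAS token in the session’s Spark configuration and read the file directly over wasbs://:
// Alternative to mounting: session-scoped access using the same SAS configuration key
spark.conf.set(config, sas)
val directDf = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("wasbs://container1@azstorageeax01.blob.core.windows.net/emp1.csv")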
%python
# Optional step to check if the file/directory exists
dbutils.fs.ls("dbfs:/mnt/")
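If you prefer to stay in Scala, a small sketch that checks the mount itself rather than listing the directory could look like this:
// Optional: confirm the mount point exists and show which storage location backs it
dbutils.fs.mounts()
  .filter(_.mountPoint == "/mnt/azstorage")
  .foreach(m => println(s"${m.mountPoint} -> ${m.source}"))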
// Read the CSV from the mount point and display the contents as a table
val mydf = spark.read
.option("header","true")
.option("inferSchema", "true")
.csv("/mnt/azstorage/emp1.csv")
display(mydf)
The result should be a table showing the contents of the CSV.
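From here the DataFrame can be queried like any other table. A brief sketch, where the view name emp is arbitrary:
// Register the DataFrame as a temporary view so it can be queried with Spark SQL
mydf.createOrReplaceTempView("emp")
display(spark.sql("SELECT * FROM emp LIMIT 10"))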