Read & Write to Azure Blob Storage from Databricks

Post by: syed hussain in All Azure Databricks

Summary

In this post I’ll demonstrate how to read from and write to Azure Blob Storage from within Databricks. The steps work on either Azure Databricks or the Databricks Community Edition.

I’ve since updated the approach for connecting to Azure Storage; you can read more about it here: https://eax360.com/connect-databricks-to-azure/

Cluster Details

Notebook Details

Notebook created with base language: Scala

Locate Azure Storage Details

The following variables are used throughout this post. Change them to match your environment: Storage Account Name, Storage Account Key and Storage Account Source Container.

%python
# Azure Storage Account Name
storage_account_name = "azurestorage"

# Azure Storage Account Key
storage_account_key = "1Vmkb3OQNgOoVI6MnhwerjhewrjhweFZVZ9w=="

# Azure Storage Account Source Container
container = "source"

# Set the configuration details to read/write
spark.conf.set("fs.azure.account.key.{0}.blob.core.windows.net".format(storage_account_name), storage_account_key)
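
The configuration key and the wasbs:// URL are built from the same two values in several cells of this post, so it can help to build them in one place. The following is a minimal sketch in plain Python; the helper names account_key_conf and wasbs_url are hypothetical and are not part of Spark or Databricks.

```python
def account_key_conf(storage_account_name):
    """Return the Spark configuration key that holds the storage account key."""
    return "fs.azure.account.key.{0}.blob.core.windows.net".format(storage_account_name)


def wasbs_url(container, storage_account_name, path=""):
    """Return the wasbs:// URL for a container, optionally with a path inside it."""
    base = "wasbs://{0}@{1}.blob.core.windows.net".format(container, storage_account_name)
    if not path:
        return base
    # Avoid a double slash when the caller passes a path that starts with "/"
    return base + "/" + path.lstrip("/")


# In a notebook cell these would be used with the variables defined above, e.g.:
# spark.conf.set(account_key_conf(storage_account_name), storage_account_key)
print(account_key_conf("azurestorage"))
print(wasbs_url("source", "azurestorage"))
```

Building both strings from the same variables makes it harder for the account name in the configuration key and the account name in the URL to drift apart.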

Mount the filesystem & Blob Storage

%python

# Mount the container only if the mount point does not already exist.
# Listing a mount point that has not been created yet raises an error,
# so check the list of existing mounts instead of listing the directory.
if not any(mount.mountPoint == "/mnt/azurestorage" for mount in dbutils.fs.mounts()):
  dbutils.fs.mount(
   source = "wasbs://{0}@{1}.blob.core.windows.net".format(container, storage_account_name),
   mount_point = "/mnt/azurestorage",
   extra_configs = {"fs.azure.account.key.{0}.blob.core.windows.net".format(storage_account_name): storage_account_key}
  )

# Unmount filesystem if required
# dbutils.fs.unmount("/mnt/azurestorage")

Check if files exist

%python
# Check that all files exist

dbutils.fs.ls("dbfs:/mnt/azurestorage")

The command should return a list of FileInfo entries, one for each file in the container.
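
Rather than reading the listing by eye, the check can be made explicit. This is a minimal sketch, assuming each listing entry exposes a name attribute as the entries returned by dbutils.fs.ls do; the missing_files helper and the FileEntry stand-in are hypothetical and exist only so the example runs outside a notebook.

```python
from collections import namedtuple


def missing_files(listing, required_names):
    """Return the required file names that do not appear in the listing."""
    present = {entry.name for entry in listing}
    return [name for name in required_names if name not in present]


# Hypothetical stand-in for the entries returned by dbutils.fs.ls
FileEntry = namedtuple("FileEntry", ["path", "name"])
sample_listing = [
    FileEntry("dbfs:/mnt/azurestorage/b_Contacts.csv", "b_Contacts.csv"),
]

# In a notebook, pass dbutils.fs.ls("dbfs:/mnt/azurestorage") instead of sample_listing
print(missing_files(sample_listing, ["b_Contacts.csv", "Master.xlsm"]))
```

An empty list means every required file was found in the mounted container.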

Install the xlrd library (optional)

I’m installing this library because I intend to manipulate Excel and CSV files. This step is optional.

%pip install xlrd

Read a file (optional)

To confirm that both the filesystem and the file are accessible, read a file.

%python
df = spark.read.text("/mnt/azurestorage/b_Contacts.csv")
df.show()
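
The same accessibility check can be done without Spark by reading the first few rows with Python's csv module. This is a minimal sketch, assuming the mount is also visible on the driver's local filesystem under /dbfs (this may not hold on every cluster type or edition); the preview_csv helper is hypothetical.

```python
import csv
from itertools import islice


def preview_csv(local_path, max_rows=5):
    """Return up to max_rows parsed rows from the start of a CSV file."""
    with open(local_path, newline="", encoding="utf-8") as handle:
        # islice stops reading after max_rows, so large files are not loaded fully
        return list(islice(csv.reader(handle), max_rows))


# In a notebook, assuming the /dbfs local path is available:
# for row in preview_csv("/dbfs/mnt/azurestorage/b_Contacts.csv"):
#     print(row)
```

If the call raises FileNotFoundError, either the mount is missing or the local /dbfs path is not exposed on the cluster.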

Write back to Azure Blob Storage container

%scala

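import org.apache.spark.sql.SaveMode

// Read the CSV file from the mounted container
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/mnt/azurestorage/b_Contacts.csv")

// Set the account key so Spark can write directly to the wasbs:// path
spark.conf.set("fs.azure.account.key.azurestorage.blob.core.windows.net", "1Vmkb3OQNgOoVI6MnhwerjhewrjhweFZVZ9w==")

// Write the data back to the source container as JSON, appending to any existing output.
// The account name in the URL must match the account name in the configuration key above.
df.write.mode(SaveMode.Append).json("wasbs://source@azurestorage.blob.core.windows.net/source/")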

// Display the output in a table
display(df)

Tags: blob storage databricks

12 Sep 2020