Microsoft Azure Databricks Best Practices v 1.0
Summary
On my last project, a few months ago, we received a very hefty Azure Databricks bill in our second month. Because of this, we decided that every sprint retrospective would include at least four lessons learnt, each focusing on a different area of Databricks, with the goals of maximising our investment, reducing TCO, improving our security posture and reducing technical debt. The list below is the accumulation of all our practical, real-world field experience, put together to form the best practice document that has now become part of our Databricks mandate: all new software and data engineers must follow the items below. The list is generalised enough to be flexed between projects.
Databricks Best Practices
1. Always ensure your Databricks workspace is in the same region as your data.
Keeping your Databricks workspace in the same region as your data isn’t just about shaving a few milliseconds off response times—it’s critical for reducing latency and ensuring data compliance. Different regions can have varying legal requirements concerning data residency and sovereignty, and choosing the same region for both your workspace and data ensures that you are not inadvertently violating these laws, especially if you are dealing with regulations like GDPR.
Additionally, when your workspace and data reside in the same region, you’ll notice a significant improvement in performance. Data movement across regions, especially when it involves large datasets or multiple read/write operations, can cause delays and result in higher egress costs. The Azure pricing structure charges extra for cross-region data traffic, so aligning both in one region avoids unnecessary expenses.
Azure’s region offerings are vast, so it’s worth doing your homework to determine the optimal region for your workloads. For example, if you’re operating globally, central regions might be better for evenly distributing access across regions. Always review Azure’s latest region availability and pricing when setting up your environment.
2. Use Role-Based Access Control (RBAC) to restrict user access.
RBAC, or Role-Based Access Control, is foundational for securing your Databricks environment. Instead of assigning permissions directly to individual users (which becomes a management nightmare as teams grow), RBAC allows you to group users by roles—each role gets assigned a specific set of permissions. For example, data engineers might have access to data pipelines and cluster configuration, while analysts might only need query permissions. This reduces the complexity of managing access at scale.
The key benefit here is least privilege access—users only get the permissions they need to perform their job functions, nothing more. In Azure, you can define roles and apply them across your resources via Azure Active Directory (AAD), and integrate these roles into Databricks by leveraging AAD authentication. This allows for a seamless, centralised user management system where you can manage access in one place for all your Azure resources, not just Databricks.
Here's a simple way to see role-based access in action. Access to mounted storage is governed by the permissions granted to the caller's identity, so listing a protected path is a quick check of what a given role can reach:
dbutils.fs.ls("/mnt/my-secure-storage")
If a user's role doesn't carry the right permissions, the call above fails and the data path stays inaccessible, ensuring sensitive data is protected.
3. Use Azure Active Directory (AAD) for secure authentication.
Azure Active Directory (AAD) should be your go-to for managing user authentication in Databricks. Since Databricks is part of the Azure ecosystem, it integrates directly with AAD, providing a secure and centralised method for managing identity and access. By using AAD, you avoid the hassle of maintaining separate credential stores or authentication mechanisms for Databricks.
AAD allows you to enforce policies like multi-factor authentication (MFA), which adds an additional layer of security. Additionally, conditional access policies can be applied to restrict access based on user location, device state, or other factors, giving you precise control over who can access your Databricks environment and when.
One of the key benefits of AAD is Single Sign-On (SSO). SSO streamlines the login process for users, allowing them to access Databricks and other Azure resources with a single set of credentials. This reduces the number of passwords users need to remember, which in turn reduces the risk of password-related security incidents. SSO itself is handled by AAD once the workspace is in place, but you can also point the Databricks CLI at AAD rather than a personal access token:
databricks configure --aad-token
This ensures that the CLI and any automation built on it authenticate with AAD credentials, making the environment more secure and easier to manage at scale.
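Beyond the CLI, the same AAD identity can be used to call the Databricks REST API directly. The sketch below is an assumption-laden example: it relies on the azure-identity package and the well-known Azure Databricks resource ID, both of which you should verify against the current documentation before relying on it.
import os
import requests
from azure.identity import DefaultAzureCredential

# The AzureDatabricks first-party application ID, used as the token scope
# (verify this value against the current Azure Databricks documentation)
DATABRICKS_RESOURCE_ID = "2ff814a6-3304-4ab8-85cb-cd0e6f879c1d"

credential = DefaultAzureCredential()
aad_token = credential.get_token(f"{DATABRICKS_RESOURCE_ID}/.default").token

host = os.environ["DATABRICKS_HOST"]  # e.g. https://adb-1234567890123456.7.azuredatabricks.net
resp = requests.get(
    f"{host}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {aad_token}"},
)
resp.raise_for_status()
print(resp.json())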
4. Apply IP whitelisting for workspace security.
IP whitelisting is a straightforward yet effective measure to restrict access to your Databricks workspace. By limiting access to trusted IP addresses, you can ensure that only users within approved networks (e.g., corporate VPNs, on-prem locations) can reach your Databricks environment. This adds a layer of security by keeping unwanted users or malicious actors out, even if they somehow obtain valid credentials.
Azure Databricks supports this through IP access lists, where you define the trusted IP addresses or CIDR ranges that are allowed to connect (the feature first has to be enabled for the workspace). Once enabled, an ALLOW list is defined with a payload like this against the IP Access Lists REST API:
{
  "label": "trusted-office-networks",
  "list_type": "ALLOW",
  "ip_addresses": [
    "203.0.113.0/24",
    "198.51.100.0/24"
  ]
}
This setup ensures that only traffic from the specified IP addresses can access your Databricks environment. Make sure you regularly review and update your whitelist, especially as your organisation’s network infrastructure evolves. Using IP whitelisting in combination with other network security features, like private endpoints, enhances the overall security posture of your Databricks environment.
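If you'd rather script this than click through the portal, a minimal sketch against the IP Access Lists REST API might look like the following; the workspace URL and admin token held in environment variables are assumptions you'd replace with your own secret management.
import os
import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]  # a token with workspace admin rights

# Create an ALLOW list containing the trusted CIDR ranges
resp = requests.post(
    f"{host}/api/2.0/ip-access-lists",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "label": "trusted-office-networks",
        "list_type": "ALLOW",
        "ip_addresses": ["203.0.113.0/24", "198.51.100.0/24"],
    },
)
resp.raise_for_status()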
5. Enable encryption for data at rest and in transit.
Encryption is the first line of defense in protecting sensitive data. For Databricks, encryption at rest ensures that any data stored on disk (whether in object storage like Azure Data Lake or within Databricks itself) is unreadable to unauthorised users or attackers. Azure automatically encrypts all data at rest with platform-managed keys by default, but you can add another layer by using customer-managed keys (CMK).
Encryption in transit protects data as it moves between systems—whether from storage to compute nodes or from your on-prem systems to Azure. Databricks uses TLS 1.2 for all network traffic to secure the transport layer. Customer-managed keys for data at rest are configured at the workspace level against your Azure Key Vault (via the portal or ARM templates), and a Key Vault-backed secret scope keeps the keys and connection strings your jobs use out of Databricks itself:
databricks secrets create-scope --scope myKeyVaultScope --scope-backend-type AZURE_KEYVAULT --resource-id <key-vault-resource-id> --dns-name <key-vault-dns-name>
With this setup, you ensure that both storage and communications are secured, greatly reducing the risk of data interception or theft. Data breaches often target unencrypted data, so this step is crucial for regulatory compliance and overall security.
6. Implement multi-factor authentication for added security.
Multi-Factor Authentication (MFA) is a critical security measure that adds an additional verification step to the login process, typically combining something you know (like a password) with something you have (like a phone or hardware token). For Azure Databricks, MFA is enforced via Azure Active Directory (AAD), which allows you to set up MFA policies and apply them across all your resources.
By requiring a second factor of authentication, MFA significantly reduces the risk of account compromises, especially in cases where user passwords are phished or stolen. Azure’s MFA service integrates seamlessly with Databricks and can be enabled through conditional access policies in AAD.
To implement MFA for Databricks users, configure it in AAD rather than in the workspace: create a conditional access policy that targets the Azure Databricks application and requires multi-factor authentication for the users or groups in scope. There is no per-workspace MFA switch in Databricks itself.
Once this is done, users attempting to access Databricks will need to verify their identity with an additional method, such as a one-time passcode sent to their phone. This extra step prevents unauthorised access, even if an attacker manages to acquire a user’s password.
7. Ensure workspace auditing is enabled.
Workspace auditing is essential for keeping tabs on who’s accessing your Databricks environment and what they’re doing. Auditing tracks activity within the workspace, logging important events like user logins, resource creation, modifications, and deletions. This provides an audit trail that can be invaluable for troubleshooting, compliance, and security investigations.
To enable auditing in Azure Databricks, you can configure diagnostic logs via the Azure portal or Azure Monitor. These logs track workspace actions, cluster events, and job executions, offering a full picture of what’s happening inside your environment. You can stream these logs to Azure Log Analytics, Azure Event Hubs, or other monitoring solutions for real-time alerts and in-depth analysis.
For example, you can enable workspace logging by creating a diagnostic setting on the workspace resource with the Azure CLI:
az monitor diagnostic-settings create \
  --name databricks-audit \
  --resource /subscriptions/{subscription-id}/resourceGroups/{resource-group}/providers/Microsoft.Databricks/workspaces/{workspace} \
  --storage-account /subscriptions/{subscription-id}/resourceGroups/{resource-group}/providers/Microsoft.Storage/storageAccounts/{storage-account} \
  --logs '[{"category": "clusters", "enabled": true}, {"category": "jobs", "enabled": true}]'
Regularly review audit logs to detect suspicious activity early, like unauthorised access attempts or unusual changes to critical resources. Compliance frameworks like GDPR often require you to maintain such audit trails, so this step not only protects your infrastructure but also helps you stay on the right side of regulatory obligations.
8. Manage user permissions at the group level.
Managing user permissions at the group level is one of the most scalable approaches for securing your Databricks environment. As your teams grow, individual permissions management can become cumbersome, error-prone, and time-consuming. Instead, group-based permissions allow you to assign roles and access levels to a collection of users with similar needs (e.g., data engineers, analysts, machine learning teams), ensuring consistency and ease of management.
Azure Active Directory (AAD) groups can be synchronised with Databricks, allowing you to leverage Azure’s existing identity management tools. For example, creating a “Data Science” group and assigning that group permissions to read certain datasets but write to others ensures that everyone in that group has the correct access from day one, without the need for individual adjustments. You can configure this via the Azure Portal:
az ad group create --display-name "DataScienceGroup" --mail-nickname "dsgroup"
Once set up, managing users becomes as simple as adding or removing them from the relevant group. This reduces administrative overhead and ensures that permissions are always aligned with user roles, particularly when people move between teams or projects.
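If you prefer to script the Databricks side rather than rely solely on AAD sync, the group can also be created through the workspace's SCIM API. Treat the sketch below as an assumption-heavy example: it presumes the standard SCIM endpoint and an admin-capable token stored in environment variables.
import os
import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

# Create a workspace group that mirrors the AAD group
resp = requests.post(
    f"{host}/api/2.0/preview/scim/v2/Groups",
    headers={
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/scim+json",
    },
    json={
        "schemas": ["urn:ietf:params:scim:schemas:core:2.0:Group"],
        "displayName": "DataScienceGroup",
    },
)
resp.raise_for_status()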
9. Use private endpoints to access Databricks securely.
Private endpoints allow you to access your Databricks workspace securely by ensuring that all traffic between your network and Azure stays within the Microsoft backbone network, never going over the public internet. This is a critical security measure for enterprises that handle sensitive or regulated data, as it reduces the risk of data interception.
With private endpoints, Databricks integrates with Azure Private Link, which enables you to create a private IP address for your Databricks instance. This means that all traffic to and from your Databricks environment will flow through your private Azure Virtual Network (VNet) instead of traversing the public web.
Here's how to create a private endpoint using the Azure CLI (the databricks_ui_api sub-resource covers the workspace front end):
az network private-endpoint create \
  --name myPrivateEndpoint \
  --resource-group myResourceGroup \
  --vnet-name myVnet \
  --subnet mySubnet \
  --private-connection-resource-id /subscriptions/{sub-id}/resourceGroups/{rg}/providers/Microsoft.Databricks/workspaces/{workspace} \
  --group-id databricks_ui_api \
  --connection-name myPrivateEndpointConnection
By using private endpoints, you significantly reduce exposure to network-based attacks, while also complying with strict data security policies, such as GDPR and HIPAA. It’s a no-brainer for any organisation serious about securing their Databricks deployment.
10. Regularly audit Databricks workspaces for unused clusters.
Clusters are the engine that powers your Databricks environment, but unused clusters can quickly turn into costly, resource-hogging liabilities. Regular auditing of your Databricks workspace to identify and terminate inactive clusters is a crucial step in optimising both cost and performance.
Azure provides several ways to monitor cluster activity. You can use Databricks' built-in cluster monitoring tools to see which clusters are active, which ones are underused, and which have been sitting idle for too long. You can also set automated policies for cluster auto-termination, ensuring clusters shut down after a specified period of inactivity. Here's an example cluster specification (as passed to the Clusters API or a cluster JSON definition) with auto-termination enabled:
databricks_cluster = {
"cluster_name": "example-cluster",
"spark_version": "7.3.x-scala2.12",
"node_type_id": "Standard_DS3_v2",
"autotermination_minutes": 30
}
Regular audits can also help you optimise cluster configurations, such as adjusting the size or removing underutilised clusters that no longer serve an active purpose. In the end, a leaner, more actively managed set of clusters saves both time and money.
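As a starting point for such an audit, a small script against the Clusters API can flag clusters that are running but have shown no recent activity. This is a rough sketch: the workspace URL and token in environment variables are assumptions, and fields such as last_activity_time should be checked against the API version you're on.
import os
import time
import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

resp = requests.get(
    f"{host}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {token}"},
)
resp.raise_for_status()

one_hour_ago_ms = (time.time() - 3600) * 1000
for cluster in resp.json().get("clusters", []):
    # Flag running clusters whose last recorded activity is more than an hour old
    if cluster.get("state") == "RUNNING" and cluster.get("last_activity_time", 0) < one_hour_ago_ms:
        print(f"Idle candidate: {cluster['cluster_name']} ({cluster['cluster_id']})")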
11. Keep secrets out of notebooks by using the secrets API.
Never hardcode credentials or sensitive information like API keys directly into your notebooks—this is a major security risk, especially in environments where multiple users have access to the same workspace. Databricks provides a secrets API that enables you to store and retrieve these sensitive values securely without exposing them in your code.
Secrets are stored in what’s called a “Secret Scope”, and access to these secrets can be managed via Azure Key Vault or directly within Databricks. For example, you might need to access a database using a password stored securely. Here’s how you can retrieve that secret in your notebook:
password = dbutils.secrets.get(scope="my-scope", key="db-password")
This way, the password is never hardcoded in the notebook and remains protected within the secret scope. Regularly review and rotate secrets to prevent unauthorised access, and always limit access to these secrets based on the principle of least privilege.
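As a usage example, the retrieved secret can be passed straight into a JDBC read; the server, database and table names below are purely illustrative.
# Use the secret fetched above when connecting to the database
df = (spark.read.format("jdbc")
      .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb")
      .option("dbtable", "dbo.customers")
      .option("user", "etl_user")
      .option("password", password)
      .load())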
12. Avoid hardcoding credentials in notebooks.
Hardcoding credentials like database passwords, API keys, or OAuth tokens directly in your Databricks notebooks is a recipe for disaster. Not only can this lead to accidental exposure of sensitive information (for example, if a notebook is shared or pushed to a version control system), but it also makes managing and rotating credentials much harder.
Instead, use environment-specific configurations or the Databricks secrets API to securely handle these values. For example, if you’re connecting to a database, the credentials should be stored in a secure location (such as Azure Key Vault) and accessed programmatically within the notebook:
db_password = dbutils.secrets.get(scope="my-scope", key="db-password")
By externalising these sensitive values, you ensure that no one can inadvertently access them, even if they gain access to your notebooks. Additionally, storing credentials securely makes it easier to rotate keys and tokens without needing to go back and edit every notebook or script that uses them. This reduces the risk of exposing sensitive data and keeps your environment compliant with security best practices.
13. Store secret keys securely using Databricks Secret Scope.
Storing secret keys securely is a non-negotiable practice when working with sensitive data in Databricks. The Databricks Secret Scope is designed to handle this exact need. It allows you to securely store API keys, passwords, or any other sensitive configuration data that can then be accessed from within your notebooks without exposing them in your code.
Secret scopes can be created either directly within Databricks or by integrating with Azure Key Vault. Once a secret is stored, it can be retrieved at runtime. This ensures that credentials aren’t hardcoded in notebooks or logs, drastically reducing the chances of exposing sensitive information.
Here’s an example of how to create and use a secret in Databricks:
databricks secrets create-scope --scope myScope
Then, to access the secret in a notebook:
db_password = dbutils.secrets.get(scope="myScope", key="db-password")
By using secret scopes, you not only enhance security but also simplify credential management. Regularly rotate secrets and keep access tightly controlled to ensure the integrity of your environment remains intact.
14. Choose the right cluster size based on workload requirements.
Cluster sizing is a balancing act between performance and cost. Over-provisioning a cluster means burning through your budget, while under-provisioning could result in slow runtimes and failed jobs. Choosing the right size depends on your workload—heavy data processing jobs require larger clusters with more cores and memory, while lighter, ad-hoc analytics tasks can be handled by smaller clusters.
A good starting point is to evaluate the size of your dataset, the complexity of your transformations, and how frequently you’ll be running jobs. Databricks provides several instance types (e.g., memory-optimised, compute-optimised), so you’ll want to match these to your workload characteristics. For instance, memory-bound tasks (e.g., large joins, aggregations) might need machines with higher RAM, while compute-heavy operations (e.g., machine learning model training) will benefit from more CPUs.
Databricks also supports autoscaling, which dynamically adjusts the number of workers based on demand. Autoscaling is configured on the cluster itself, via the autoscale block in the cluster definition (shown in the next item), rather than through a Spark configuration setting.
Regularly monitor your cluster performance and optimise accordingly. Start with a modest cluster and scale up as needed—remember, more resources don’t always equate to better performance if not used wisely.
15. Use Auto-scaling to optimise resources.
Auto-scaling is an essential feature in Databricks that allows your clusters to automatically adjust their size based on current workload demand. This ensures that you always have just the right amount of computational power—no more, no less—helping you optimise both performance and cost.
When auto-scaling is enabled, Databricks will add more worker nodes when the workload increases and scale back down when the job load decreases or the cluster becomes idle. This is particularly useful in environments with fluctuating workloads, such as development environments, where you may not need full-scale compute power all the time.
To enable auto-scaling in your cluster configuration:
"autoscale": {
"min_workers": 2,
"max_workers": 10
}
The minimum workers setting ensures that the cluster doesn’t scale down to zero, keeping it alive for incoming tasks. Auto-scaling is a set-it-and-forget-it solution, but keep an eye on your cluster metrics to ensure you’re not over or under-provisioning for sustained workloads.
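Putting the pieces together, here's a rough sketch of creating an autoscaling, auto-terminating cluster through the Clusters REST API; the workspace URL, token, runtime version and node type are assumptions to be replaced with your own.
import os
import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

cluster_spec = {
    "cluster_name": "autoscaling-example",
    "spark_version": "7.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "autoscale": {"min_workers": 2, "max_workers": 10},
    "autotermination_minutes": 30,
}

resp = requests.post(
    f"{host}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {token}"},
    json=cluster_spec,
)
resp.raise_for_status()
print("Created cluster:", resp.json()["cluster_id"])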
16. Automate cluster termination after inactivity.
Clusters left running with no active jobs are a silent budget killer. Automating cluster termination after a period of inactivity ensures that you’re not paying for idle resources. In Databricks, you can configure an auto-termination policy at the cluster level, specifying how long a cluster should remain idle before being shut down.
This is particularly useful for development and test environments where clusters are often spun up for short bursts of activity but then left running. It’s easy to forget to shut down a cluster after a workday or meeting, and without auto-termination, those resources continue to accrue costs.
Here’s how to set up auto-termination in Databricks:
{
"autotermination_minutes": 30
}
This setting automatically terminates the cluster if it remains idle for 30 minutes. You can adjust the time based on your needs, but keep it short enough to avoid unnecessary charges. Regularly review your termination policies and adjust them as your workflows evolve.
17. Leverage Databricks jobs for scheduled tasks.
Databricks jobs allow you to automate and schedule tasks like running notebooks, executing Spark jobs, or managing complex workflows. Instead of manually triggering processes, you can schedule jobs to run at specific times, ensuring that ETL pipelines, model training, and data ingestion processes run consistently and on time.
Databricks jobs support a range of triggers, including time-based schedules (e.g., run every night at midnight) and downstream job dependencies (e.g., job B runs only after job A completes). This flexibility allows you to build complex, multi-step workflows without constant manual intervention.
For example, you can schedule a Databricks job via the REST API:
curl -n -X POST https://<databricks-instance>/api/2.0/jobs/create \
-d '{
"name": "My ETL Job",
"new_cluster": {
"spark_version": "7.3.x-scala2.12",
"node_type_id": "Standard_DS3_v2"
},
"libraries": [],
"notebook_task": {
"notebook_path": "/Users/my_user/my_notebook"
},
"schedule": {
"quartz_cron_expression": "0 0 12 * * ?",
"timezone_id": "UTC"
}
}'
Jobs are an essential part of any production workflow in Databricks, as they ensure that your processes run smoothly and without requiring constant human oversight.
18. Monitor cluster usage to optimise cost.
Monitoring cluster usage is crucial for keeping your Databricks spend in check. Clusters that are over-provisioned for light workloads or left running longer than necessary can lead to inflated costs. Similarly, clusters that are under-provisioned for large workloads can result in performance bottlenecks and extended job runtimes.
Azure provides several monitoring tools to help you keep track of cluster usage, including Azure Monitor and Databricks’ built-in cluster metrics dashboards. These tools let you track key metrics like CPU and memory utilisation, node activity, and job runtimes. By regularly reviewing these metrics, you can identify underutilised clusters and downsize them, or spot performance bottlenecks and allocate additional resources.
To retrieve cluster events programmatically (the Cluster Events endpoint is a POST):
curl -n -X POST https://<databricks-instance>/api/2.0/clusters/events \
  -d '{
    "cluster_id": "my-cluster-id"
  }'
Set alerts for resource spikes or underuse and use auto-scaling policies to dynamically adjust cluster size as needed. Proactively managing your cluster usage ensures that you only pay for the compute power you truly need, maximising your cost efficiency.
19. Regularly update cluster libraries to maintain security.
Out-of-date libraries are security risks waiting to happen. They’re the virtual equivalent of leaving your door unlocked and wondering why you’re being robbed blind.
Databricks lets you install libraries at the cluster level, and these libraries—whether it’s Spark, TensorFlow, or some obscure, lovely little package you found on PyPI—need to be up to date. New vulnerabilities are discovered all the time, and an old library is an easy target.
You can update cluster libraries manually or, if you’ve got better things to do, automate it as part of your CI/CD pipeline:
databricks libraries install --cluster-id <cluster-id> --pypi-package requests==2.26.0
Of course, keep an eye on compatibility. The latest shiny version might not play nice with your current stack, so do some testing before you update your whole cluster farm. But when it works, it’s glorious. No vulnerabilities, no performance hits from old code. Just clean, lean, updated libraries keeping your clusters secure and efficient.
20. Use Delta Lake for efficient data storage.
Delta Lake adds transactional integrity and scalable performance to your data lake, transforming it from a mere storage location into an efficient, reliable data platform. The key features Delta Lake brings are ACID transactions and schema enforcement, ensuring data consistency across large-scale operations and preventing schema mismatches that can otherwise wreak havoc on your pipelines.
The format optimises storage by compacting smaller files into larger ones, reducing the overhead on file reads and writes. It also introduces versioning, which allows you to time-travel to previous data states—critical when you need to audit or correct historical data issues.
Here’s how you can store data in Delta format:
df.write.format("delta").mode("overwrite").save("/mnt/delta/my_data")
This layer of efficiency isn’t just about saving space—it speeds up queries by organising data efficiently on disk. As your datasets grow, Delta Lake’s ability to handle batch updates and streaming data makes it ideal for data engineering tasks, providing both storage efficiency and query optimisation.
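The version history also makes time travel a one-liner; reading version 0 here is just for illustration.
# Time-travel to an earlier version of the Delta table
df_v0 = (spark.read.format("delta")
         .option("versionAsOf", 0)
         .load("/mnt/delta/my_data"))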
21. Use caching to optimise repeated queries.
If you’re running the same queries repeatedly on the same data, caching is a critical optimisation strategy. It reduces the need to reprocess the same dataset, cutting down on execution times for frequently run queries. By caching data, you instruct Spark to store it in memory, making subsequent queries significantly faster.
Caching is particularly useful in exploratory data analysis or iterative workflows, where the same data is used across multiple steps. It speeds up interactive queries and ensures that data isn’t constantly being fetched from storage, which can be slow and resource-intensive.
To cache a table in Databricks, use:
spark.sql("CACHE TABLE my_table")
Once cached, Spark retrieves the data from memory instead of reading from disk, which optimises performance, especially for large datasets. However, be mindful of memory limitations—caching too much data can cause memory pressure and lead to unnecessary garbage collection cycles.
22. Repartition data strategically for better parallelism.
Partitioning is essential to achieving efficient parallelism in Spark, and how you manage partitions can make or break your cluster performance. Large datasets are split into smaller, more manageable partitions, allowing Spark to process different parts of the data in parallel. However, poor partitioning can lead to imbalanced workloads, with some tasks taking significantly longer than others.
Repartitioning involves reshuffling your data across a new number of partitions, ensuring that your workload is balanced and Spark can fully utilise the available cluster resources. Generally, aim for partition sizes of around 128MB, but adjust based on the size of your dataset and the complexity of your queries.
To repartition data:
df.repartition(200).write.format("delta").save("/mnt/delta/repartitioned_table")
This command redistributes your data into 200 partitions. Adjust the number of partitions based on cluster size and the type of job you’re running. Well-balanced partitioning improves resource utilisation and significantly reduces processing times.
23. Use Spot Instances for cost-effective compute.
Spot Instances are an effective way to lower your Databricks costs. These are spare Azure compute resources available at discounted rates, allowing you to run non-critical workloads for a fraction of the usual price. Spot Instances are ideal for development, testing, or batch jobs where intermittent interruptions are acceptable.
The trade-off is that Spot Instances can be reclaimed by Azure when demand spikes, so they shouldn’t be used for critical production tasks that require guaranteed uptime. However, for many workloads—such as ETL jobs that can be resumed if interrupted—they provide an excellent opportunity to optimise compute costs.
In Databricks, Spot capacity is requested through the cluster's azure_attributes block; a bid price of -1 means you'll pay at most the current on-demand price:
"node_type_id": "Standard_DS3_v2",
"driver_node_type_id": "Standard_DS3_v2",
"azure_attributes": {
  "first_on_demand": 1,
  "availability": "SPOT_WITH_FALLBACK_AZURE",
  "spot_bid_max_price": -1
}
By taking advantage of Spot Instances, you can scale compute power efficiently while keeping costs under control. Just be mindful of Azure reclaiming them and plan for resilience in your job execution strategies.
24. Optimise ETL processes using Databricks pipelines.
Databricks pipelines streamline the ETL (Extract, Transform, Load) process by leveraging Spark’s distributed computing capabilities, allowing you to efficiently move data from raw ingestion to structured, analytics-ready formats. With Databricks, ETL processes can be automated and scaled to handle large datasets, reducing manual intervention and improving data pipeline reliability.
One of the key strengths of Databricks pipelines is their integration with Delta Lake. Delta enables ACID transactions and schema enforcement, ensuring data consistency throughout the pipeline. This is critical when dealing with data from multiple sources that need to be transformed before being ingested into a centralised data lake.
Here’s a basic example of an ETL pipeline in Databricks:
df = spark.read.format("csv").option("header", "true").load("/mnt/raw_data/data.csv")
df_cleaned = df.filter(df["column"].isNotNull())
df_cleaned.write.format("delta").save("/mnt/processed/data")
By automating this process, you can transform raw data into structured formats at scale, while minimising manual overhead. Regularly review your ETL pipelines to optimise performance, using features like caching and partitioning to manage resources effectively.
25. Use Auto Loader to simplify incremental data ingestion.
Auto Loader is Databricks’ tool for simplifying the incremental ingestion of new files into your data lake. Instead of manually managing file discovery and ingestion, Auto Loader automatically detects new files and ingests them as they arrive. This is particularly useful for environments where data is ingested in real-time or where files are dropped continuously into storage.
Auto Loader works by monitoring directories in your cloud storage (like Azure Data Lake or S3) and using schema inference to automatically read new files in various formats (CSV, JSON, etc.). This reduces the complexity of building custom ingestion processes and makes it easier to scale as your data grows.
Here's how to set up Auto Loader (the schema location is where it keeps track of the inferred schema between runs):
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "parquet")
      .option("cloudFiles.schemaLocation", "/mnt/raw-data/_schemas")
      .load("/mnt/raw-data"))
By using Auto Loader, you can simplify the process of incrementally loading new data into your Databricks environment, ensuring that your data pipelines are always up to date without the need for constant manual adjustments.
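To complete the pipeline, the stream needs a sink and a checkpoint; a minimal sketch writing into a Delta location (the paths are illustrative) looks like this.
# Continuously append the ingested files into a Delta table
query = (df.writeStream
         .format("delta")
         .option("checkpointLocation", "/mnt/checkpoints/raw-data")
         .outputMode("append")
         .start("/mnt/bronze/raw_data"))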
26. Use Delta Lake for ACID transactions and schema enforcement.
If you’re not enforcing ACID transactions with Delta Lake, you’re essentially leaving your data out in the rain and hoping it doesn’t get soggy. ACID (Atomicity, Consistency, Isolation, Durability) transactions are what keep your data from turning into a chaotic mess when multiple users, updates, and jobs come at it from every angle.
Delta Lake provides the data stability you need by ensuring schema enforcement as well. No more sneak attacks from mismatched data types—if your data doesn’t fit the schema, Delta won’t allow it through. It’s like having a bouncer at your data club.
df.write.format("delta").mode("append").save("/mnt/delta/acme_table")
Delta’s ACID properties let you confidently update, delete, or merge records in bulk, knowing that everything will remain consistent and recoverable. And in the data world, where one rogue operation can send everything spiraling, that’s worth its weight in gold.
27. Avoid small file writes by batching data.
Small files are the data equivalent of plastic straws: they seem harmless, but before you know it, they’re everywhere, causing problems, and you’re wishing you’d dealt with them properly in the first place. Batching your data into larger chunks before writing to storage is the solution.
Writing data in small files leads to sluggish query performance, especially as you scale. Instead of making Spark sift through thousands of tiny files, consolidate them into fewer, larger ones. This not only improves read performance but also reduces the storage overhead.
df.repartition(10).write.format("delta").mode("overwrite").save("/mnt/delta/big_files")
Repartitioning ensures that data is written in larger, more manageable chunks. It’s a simple yet effective way to prevent your storage from becoming an ungovernable pile of file fragments, which, trust me, nobody wants to deal with later.
28. Use OPTIMIZE and VACUUM on Delta tables to improve query performance.
If your Delta Lake tables could talk, they’d probably be asking you to tidy up after yourself. OPTIMIZE and VACUUM are your housekeeping tools here, ensuring that your Delta tables remain lean and efficient as you pile on new data.
OPTIMIZE compacts those pesky small files, grouping them into larger, more query-friendly sizes. Think of it as reorganising your kitchen so all the pots are where they should be—no more rummaging around for things Spark can’t find quickly.
spark.sql("OPTIMIZE my_table")
Once you've tidied up with OPTIMIZE, it's time to clean out the trash with VACUUM. This command deletes old files that are no longer needed (but have been hanging around because of versioning). By default, Delta retains these files for 7 days, and shortening the window below that also requires disabling Delta's retention duration safety check, but you can tweak the retention period:
spark.sql("VACUUM my_table RETAIN 7 HOURS")
Regularly running these commands keeps your tables in tip-top shape and your queries running like a dream, rather than trudging through a minefield of old data.
29. Apply ZORDER for efficient data reads.
ZORDER is your answer to speeding up queries when you’re filtering on specific columns. In essence, it’s like sorting your filing cabinet so that all the important documents are grouped together, rather than scattered randomly. When your data is ZORDERed, Spark can quickly jump to the relevant chunks without reading through the entire dataset.
If your queries frequently filter by certain columns (e.g., date, customer ID), applying ZORDER can make those queries run significantly faster by optimising the file layout:
spark.sql("OPTIMIZE my_table ZORDER BY (customer_id)")
This helps reduce the number of data files Spark needs to scan, drastically improving performance. It’s especially useful for large datasets, where scanning everything would otherwise be slow and resource-heavy. ZORDER is essentially a time-saver for the overworked data engineer—use it wisely.
30. Use Bronze, Silver, and Gold layers for structured data processing.
The Bronze-Silver-Gold architecture is the perfect way to give your data the proper life cycle treatment it deserves.
- Bronze Layer: This is the raw data dump. Unprocessed, untouched, and full of errors, duplicates, and general weirdness. But you need it, so it lands here.
- Silver Layer: Here’s where the scrubbing happens. Data gets cleaned, transformed, and checked for quality, ready for more advanced analysis or enrichment.
- Gold Layer: By the time data reaches this stage, it’s the crown jewel—perfectly structured, ready for analytics or machine learning, and refined to your exact specifications.
This layered architecture not only keeps your data organised but also allows for reprocessing without touching the raw data. It’s scalable, efficient, and saves you from a complete redo when things go wrong upstream.
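To make the flow concrete, here's a minimal sketch of the three layers in PySpark; the paths, column names and aggregation are illustrative placeholders rather than a prescribed design.
# Bronze: land the raw files as-is
raw = spark.read.format("json").load("/mnt/landing/events/")
raw.write.format("delta").mode("append").save("/mnt/bronze/events")

# Silver: clean and de-duplicate
bronze = spark.read.format("delta").load("/mnt/bronze/events")
silver = bronze.filter(bronze["event_id"].isNotNull()).dropDuplicates(["event_id"])
silver.write.format("delta").mode("overwrite").save("/mnt/silver/events")

# Gold: aggregate into an analytics-ready table
gold = silver.groupBy("customer_id").count().withColumnRenamed("count", "event_count")
gold.write.format("delta").mode("overwrite").save("/mnt/gold/customer_event_counts")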
31. Implement data partitioning for large datasets.
Partitioning is one of those things you don’t realise you need until your queries start crawling like they’re wading through quicksand. Large datasets can become unwieldy, but partitioning them strategically helps Spark work with your data more efficiently.
When you partition a dataset, you break it up into smaller, more manageable pieces based on certain keys (e.g., date, region). This allows Spark to skip large chunks of irrelevant data during queries, speeding things up dramatically.
df.write.partitionBy("region").format("delta").save("/mnt/partitioned_table")
The key is to choose partition columns that match your query patterns. If you frequently filter by date, partition by date. Just be careful not to over-partition—too many partitions can create more overhead than performance gains.
32. Leverage the SQL Analytics service for high-performance queries.
If you’re running complex analytics, you don’t want to be waiting around for hours while your queries plod through terabytes of data. That’s where SQL Analytics steps in. It’s designed to run high-performance, fast SQL queries directly on your data lake—no need to pre-aggregate or move the data somewhere else.
The service integrates seamlessly with Delta Lake, so you get the benefits of scalable data management alongside SQL execution that’s optimised for speed. It supports BI tools like Power BI and Tableau, giving you the flexibility to build dashboards and reports directly on top of your lakehouse.
Here’s an example of a basic SQL query in Databricks:
SELECT customer_id, total_spend
FROM gold_layer
WHERE region = 'EMEA'
With SQL Analytics, the performance improvements can be significant, especially on large datasets where traditional SQL engines might struggle. If you’re dealing with real-time analytics or business reporting, this is the tool to make those processes as efficient as possible.
33. Ensure that your data pipelines are idempotent to handle failures.
In the world of data engineering, failures are inevitable. Machines go down, jobs fail, and sometimes, things simply don’t run as expected. That’s why your data pipelines need to be idempotent—meaning they can be run multiple times without unintended side effects. Whether a job fails halfway or you need to reprocess a batch, an idempotent pipeline ensures that each run produces the same result.
For example, if your pipeline is processing sales data and a failure occurs, you don’t want duplicate entries when it reruns. By making your pipeline idempotent, you ensure consistency. This might involve using techniques like merge operations in Delta Lake, which allow you to update or insert data based on conditions.
MERGE INTO sales_data
USING updates
ON sales_data.id = updates.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
The result? If the process runs again after failure, it won’t double your data or make things worse. This practice is key to building reliable, scalable pipelines.
34. Use streaming for near real-time data ingestion.
Batch processing is great, but for scenarios where data is flowing continuously—like IoT sensor feeds or real-time transaction logs—you need streaming. Databricks provides built-in support for Structured Streaming, allowing you to process data as it arrives, in near real-time.
Streaming lets you transform and store data incrementally, rather than waiting for a full batch to finish processing. This is critical when time is of the essence, whether for live analytics, monitoring, or time-sensitive decision-making.
Here’s a quick example of how to set up a simple streaming job:
df = (spark.readStream
.format("delta")
.table("bronze_table"))
(df.writeStream
  .format("delta")
  .outputMode("append")
  .option("checkpointLocation", "/mnt/checkpoints/silver_table")  # required so the stream can recover after failures
  .toTable("silver_table"))
By moving data through your pipeline as it’s generated, you reduce latency and keep your datasets up to date, making sure you’re never more than a few seconds behind live events.
35. Use caching to optimise repeated queries.
When it comes to performance, caching is an absolute must for repeated queries. If your team is running the same SQL statements or transformations on a dataset multiple times, caching that data in memory can drastically reduce query times. Without caching, Spark reads from disk each time, which is a costly operation, particularly on large datasets.
Spark’s caching mechanism stores data in memory the first time it’s accessed, making subsequent queries faster. You can cache tables or DataFrames with a simple command:
df.cache()
However, don’t overdo it. Memory is a finite resource, so only cache datasets that are frequently queried and small enough to fit comfortably within your cluster’s memory. Regularly monitor your memory usage to avoid overloading the system and causing garbage collection delays.
36. Optimise data layouts with ZORDER for faster access.
ZORDER is your tool for optimising data access when you’re querying on specific columns regularly. Unlike traditional indexing, ZORDER physically rearranges data within your Delta table so that related records are stored closer together. This reduces the number of files Spark has to scan during a query, making your data retrieval much faster.
For example, if you frequently filter on a customer ID or date range, ZORDER ensures that Spark reads fewer files to get to the relevant data. It’s particularly useful for large tables where scanning the whole dataset would take far too long.
Here’s how you apply ZORDER to optimise a Delta table:
spark.sql("OPTIMIZE sales_table ZORDER BY (customer_id)")
This simple step can lead to significant performance gains, especially when running queries on highly selective columns.
37. Use Delta Lake OPTIMIZE to compact small files.
As you process data over time, it’s easy to end up with a bunch of small files scattered across your data lake, which can slow down query performance. OPTIMIZE is Delta Lake’s built-in solution for compacting these small files into larger, more efficient ones.
This file consolidation reduces the number of data files that Spark needs to scan during a query, which can have a huge impact on performance, particularly for large datasets. After multiple data writes, running OPTIMIZE will make sure your data stays organised and easy to access.
spark.sql("OPTIMIZE my_delta_table")
Regularly running OPTIMIZE is a low-effort, high-impact way to maintain the health of your Delta tables. It's one of those things you don't think about until it's too late, so get it in your routine early.
38. Monitor and manage the execution of long-running queries.
Long-running queries are a reality when you’re working with big data, but you don’t want them running unchecked. In Databricks, monitoring these queries is key to identifying performance bottlenecks and understanding resource usage. The Spark UI is your best friend here—it shows you detailed information about job execution, including stages, tasks, and resource consumption.
If a query is taking longer than expected, start by checking the stages in Spark UI for things like data shuffling (which can kill performance) or skewed partitions (where one task is handling way too much data). Based on this information, you can take steps like repartitioning data or rewriting your query logic.
For example, if you see a lot of data shuffling, consider using broadcast joins to reduce the amount of data movement:
from pyspark.sql.functions import broadcast
df = df1.join(broadcast(df2), "id")
It’s also good practice to set timeouts for long-running queries so they don’t hog resources indefinitely. Regular monitoring helps you catch inefficiencies early, keeping your environment running smoothly.
39. Avoid wide transformations that involve large data shuffling.
Wide transformations—such as joins, groupBy operations, or anything that requires data to be moved between nodes—are often the culprits behind performance issues. These operations trigger data shuffling, which can drastically slow down your queries, especially with large datasets. Shuffling is expensive because it involves network IO, disk usage, and CPU, which means a poorly optimised wide transformation can bring your cluster to its knees.
To mitigate this, try to reduce the amount of data that needs to be shuffled. For joins, consider using broadcast joins for smaller tables, which keep the data local rather than sending it across the network:
from pyspark.sql.functions import broadcast
df = df1.join(broadcast(df2), "id")
Additionally, check if you can filter or aggregate your data before the shuffle, reducing the dataset size as early as possible in the query. Minimising wide transformations means your clusters spend less time moving data around and more time processing it.
40. Apply best practices for joins, using broadcast joins where applicable.
When it comes to joins, broadcast joins are a lifesaver for smaller datasets. If you’re joining a large table with a small one (the small table typically being under 10MB, but you can adjust this limit), Spark can broadcast the smaller dataset to every node, avoiding expensive shuffles and making the join much faster.
To enable a broadcast join explicitly:
from pyspark.sql.functions import broadcast
df = df1.join(broadcast(df2), "id")
This approach is particularly useful when the smaller dataset is static or relatively small (like lookup tables or reference data). Spark will automatically attempt to use broadcast joins when it thinks it’s appropriate, but manually forcing them where applicable can often yield better performance.
Be cautious when joining large tables, as broadcasting massive datasets can overwhelm memory and lead to failures. Always monitor query performance in the Spark UI to ensure you’re not pushing things too far.
41. Use coalesce to reduce the number of partitions before saving data.
Saving data with too many partitions can result in thousands of small files being written, which is a nightmare for storage costs and query performance. The solution? Coalesce your DataFrame to reduce the number of partitions before saving. This operation reduces the number of partitions without triggering a shuffle, making it more efficient than repartition()
when you’re just trying to combine output files.
Here’s how to coalesce your DataFrame before saving it:
df.coalesce(10).write.format("delta").save("/mnt/delta/optimized_output")
This will combine your data into 10 partitions, creating fewer, larger files. Keep an eye on the size of your partitions—too many will overload Spark, but too few will cause long job runtimes as partitions get too large to process efficiently. Strike a balance based on your data and workload.
42. Tune Spark’s memory configurations for large data processing tasks.
Spark’s default memory settings are fine for smaller tasks, but if you’re dealing with large datasets or heavy transformations, you’ll want to tweak the memory configuration to ensure smooth processing. The key parameters to focus on are executor memory and driver memory.
Start by setting an appropriate amount of memory for each executor, depending on the size of your data, so executors don't run out of memory and fail the job. In Databricks these values go into the cluster's Spark configuration (the equivalents of the classic spark-submit flags):
spark.executor.memory 8g
spark.driver.memory 4g
You can also adjust memory overhead for tasks that involve heavy shuffling or large joins. This allocates additional memory beyond the executor's JVM heap for off-heap storage:
spark.executor.memoryOverhead 2048m
Regularly monitor memory usage in the Spark UI to see if jobs are failing due to memory limits, and adjust these configurations accordingly. Tuning memory settings properly is critical for preventing out-of-memory errors and keeping your large processing tasks running smoothly.
43. Enable adaptive query execution for more efficient query plans.
Adaptive Query Execution (AQE) is a feature in Spark that dynamically optimises your query plans based on the actual data being processed. It adjusts things like join strategies, shuffling, and partitioning at runtime, helping to avoid inefficiencies that can’t be predicted during the initial query planning phase.
To enable AQE, you need to set the following configuration:
spark.conf.set("spark.sql.adaptive.enabled", "true")
AQE is especially useful when the size of your data can vary significantly between jobs or when you don't have perfect control over how data is partitioned. Spark adjusts the query plan on the fly, allowing for better performance and resource utilisation. Recent Databricks runtimes enable AQE by default, so the switch mainly matters on older clusters, but either way it can lead to significant performance gains in complex queries.
44. Optimise storage by cleaning up unused data regularly.
Data tends to accumulate. You might have outdated files, old snapshots, or orphaned data lying around that no one’s touched in months. All of this adds up, not just in terms of storage costs but also in terms of query performance, as Spark will have to sift through more files than necessary.
To avoid this, regularly clean up unused data. If you're using Delta Lake, this can be handled by running the VACUUM command, which removes files that are no longer needed after a certain retention period:
spark.sql("VACUUM delta_table RETAIN 168 HOURS")
For non-Delta formats, build a regular process to archive or delete old data. This not only saves you on storage costs but also improves performance by reducing the number of files Spark has to scan during queries.
45. Minimise the usage of UDFs where built-in functions suffice.
User Defined Functions (UDFs) can be powerful, but they’re often much slower than Spark’s native functions. This is because UDFs break Spark’s ability to optimise query plans, forcing it to move data between JVMs and perform operations row by row. Wherever possible, use Spark’s built-in functions, which are optimised for distributed processing.
For example, instead of writing a custom UDF to upper-case a string column, use the built-in function:
from pyspark.sql.functions import upper
df = df.withColumn("upper_name", upper(df["name"]))
This will always be faster than a UDF doing the same job, thanks to Spark’s optimisations. Only resort to UDFs when the built-in functions can’t accomplish what you need. Keep an eye on your code—UDF-heavy pipelines tend to lag behind their native function counterparts.
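For contrast, here's what the equivalent UDF version would look like; it produces the same column, but every row is shipped to a Python worker and processed one at a time, which is exactly the overhead you want to avoid.
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Same logic as the built-in upper(), but evaluated row by row in Python
upper_udf = udf(lambda s: s.upper() if s is not None else None, StringType())
df = df.withColumn("upper_name", upper_udf(df["name"]))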
46. Use shared notebooks for real-time collaboration.
Databricks notebooks are more than just code repositories—they’re collaborative workspaces where your team can share ideas, develop solutions, and document workflows in real-time. Shared notebooks enable multiple users to edit, comment, and review code simultaneously, making collaboration smoother and more productive.
You can leverage Git integration with notebooks to version control your work, ensuring that changes are tracked and can be rolled back if needed. Encourage your team to document their code in-line within the notebook, making it easier for others to pick up where someone left off. Version control is enabled from your Databricks user settings: link your GitHub or Azure DevOps account with a personal access token, then associate each notebook (or a Repos folder) with a repository and branch.
The key is to use notebooks as living documents where data, code, and results can coexist—providing transparency and improving team collaboration.
47. Document your notebooks clearly for team collaboration.
Clear documentation is a must when working with notebooks, especially when collaborating across teams. It’s easy to assume that everyone understands the logic behind your code, but reality rarely works that way. Adding clear markdown cells to explain your process, decisions, and expected outcomes is crucial.
Use headers to separate different sections of your notebook, and add comments to explain complex logic or any workarounds. This doesn’t just help others understand your work—it helps future you when you return to the notebook in six months and can’t remember why you did something a certain way.
# Data Cleaning Process
This section removes null values and filters out incomplete rows from the dataset.
Good documentation turns your notebooks into reusable assets that can be shared, reviewed, and built upon by others.
48. Version control notebooks with GitHub or Azure DevOps integration.
Keeping track of changes in your notebooks is essential, especially in collaborative environments where multiple people are working on the same projects. Without version control, you run the risk of losing work or overwriting important changes. That’s where GitHub or Azure DevOps integration comes in.
By linking your Databricks workspace with a Git repository, you ensure that all changes are tracked, and you can easily revert to previous versions if needed. This allows for a smooth workflow where multiple team members can contribute without stepping on each other’s toes.
Git integration is set up by linking your GitHub or Azure DevOps account (with a personal access token) under your Databricks user settings, and then connecting individual notebooks, or a whole Repos folder, to a repository and branch.
Version control also brings accountability to your notebooks—each change is documented, and you can enforce pull requests to ensure that code is reviewed before it goes live. Whether you’re working on data engineering pipelines or machine learning models, having version control in place is critical for maintaining a clean and reliable workflow.
49. Use cluster policies to enforce best practices and prevent misconfigurations.
Cluster policies are an underrated but powerful feature in Databricks, allowing you to enforce best practices and prevent common mistakes when creating clusters. With cluster policies, you can restrict certain configurations (like disabling auto-termination or using inappropriate instance types) to prevent costly misconfigurations that lead to resource wastage or security issues.
For instance, you can create a policy that restricts the number of nodes in a cluster or caps the idle time-out so clusters can't be configured to run indefinitely:
{
  "autotermination_minutes": {
    "type": "range",
    "maxValue": 60,
    "defaultValue": 30
  }
}
By applying these policies, you can ensure that all clusters conform to your organisational standards, saving both time and cost in the long run. It’s a simple way to safeguard your environment from user error while ensuring operational efficiency.
50. Manage permissions at a granular level for shared resources.
When working with shared resources in Databricks, managing permissions is critical to ensuring that users only have access to the data and functionality they need. Granular permission settings allow you to control who can view, edit, or execute jobs, notebooks, and clusters, minimising the risk of unauthorised changes or data breaches.
Databricks integrates with Azure Active Directory (AAD), allowing you to manage permissions through role-based access control (RBAC). You can set permissions at the group level, making it easier to manage larger teams with different responsibilities.
For example, data scientists might have full access to notebooks but only read access to production data, while admins have control over the cluster configuration. Permissions can be managed in the Databricks UI, or programmatically through the Permissions API:
curl -n -X PATCH https://<databricks-instance>/api/2.0/permissions/clusters/<cluster-id> \
  -d '{
    "access_control_list": [
      {
        "group_name": "DataScienceGroup",
        "permission_level": "CAN_RESTART"
      }
    ]
  }'
Implementing granular permission controls helps protect sensitive data and ensures that users can’t make changes outside their scope of responsibility.
51. Use GitHub or Azure DevOps for version control and collaboration.
When your team is spread across projects, it’s critical to keep everyone on the same page. Integrating GitHub or Azure DevOps for version control ensures that your code and notebooks are synchronised and accessible to all. This means changes are tracked, reviewed, and integrated properly, avoiding the chaos of unmanaged files floating around.
With Git integration in Databricks, you can keep your notebooks under version control, allowing multiple users to collaborate without overwriting each other's work. GitHub integration is enabled by adding a GitHub personal access token under Git integration in your Databricks user settings and linking your notebooks or Repos folders to the repository.
This setup encourages best practices such as branching, pull requests, and code reviews, ensuring that only approved changes make it into production. It’s also a safeguard against accidental deletions or errors, as you can always revert to an earlier version of your notebook.
52. Encourage teams to follow naming conventions for easier resource management.
Naming conventions might sound like a minor detail, but in complex environments with hundreds of clusters, jobs, and notebooks, a solid naming strategy is essential. Consistent naming makes it easier to identify resources, track costs, and manage permissions. Without it, good luck finding that one test notebook from three months ago buried in a sea of vaguely named “test1” and “final_v2” files.
Develop and enforce a naming convention that includes descriptive labels, project names, and version numbers. For example:
- Cluster name: projectX_dev_cluster
- Notebook: projectX_data_ingestion_v1
This approach makes it easy to understand the purpose of each resource and who it belongs to. It also helps streamline searchability and cost attribution across departments or teams.
53. Use tags to track resource utilisation and categorise costs.
Resource tags are a lifesaver when it comes to managing costs and understanding where your compute hours are being burned. By tagging clusters, jobs, and other resources, you can break down costs by department, project, or use case. This is particularly useful in multi-team environments where different groups share the same Databricks workspace.
You can add tags like cost_center or environment (dev, test, prod) to your clusters and jobs, making it easier to generate reports and track costs accurately. For clusters, this is done through the custom_tags block in the cluster definition, and the tags are propagated to the underlying Azure resources for billing breakdowns:
"custom_tags": {
  "environment": "prod",
  "cost_center": "12345"
}
Using tags also helps with resource governance and auditing. You’ll have better visibility into what’s being used, why, and by whom, enabling you to optimise your Databricks usage based on real data rather than assumptions.
54. Use MLflow for tracking machine learning experiments.
Machine learning experiments can get messy quickly—different models, parameters, and datasets can easily get lost in the shuffle. That’s where MLflow comes in. Integrated directly into Databricks, MLflow helps you track all aspects of your machine learning experiments, from code and data to metrics and model versions.
MLflow’s tracking API allows you to log every run, so you always know which model performed best and under what conditions. You can log model parameters and performance metrics with a few simple commands:
import mlflow
mlflow.log_param("alpha", 0.5)
mlflow.log_metric("rmse", 0.25)
Using MLflow means your experiments are documented and reproducible, making it easier to compare models, debug issues, and collaborate with other team members. It’s an essential tool for scaling machine learning workflows and maintaining control over your experiments.
55. Apply feature engineering techniques within the Databricks workspace.
Feature engineering is at the heart of every machine learning model, and doing it directly in Databricks makes the process seamless. You can use Spark’s distributed processing to perform feature extraction, scaling, encoding, and more across massive datasets, preparing your data for machine learning at scale.
Databricks also allows you to integrate libraries like pandas and scikit-learn alongside Spark, giving you flexibility in how you transform your data. For example, you can use VectorAssembler in Spark to create feature vectors:
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=["col1", "col2"], outputCol="features")
df = assembler.transform(df)
By performing feature engineering within Databricks, you keep your entire machine learning pipeline—from raw data to model training—inside one unified platform. This reduces data movement, simplifies debugging, and ensures consistency across experiments.
56. Use distributed machine learning libraries like SparkML or TensorFlow.
Training models on large datasets can be slow and resource-intensive if you’re running them on a single machine. SparkML and TensorFlow integrate natively with Databricks to provide distributed machine learning capabilities, allowing you to train models across a cluster of machines, drastically improving performance.
SparkML is ideal for classical machine learning models (e.g., linear regression, decision trees), and it operates seamlessly within the Spark ecosystem. For deep learning, TensorFlow on Databricks can distribute training across multiple nodes:
from pyspark.ml.regression import LinearRegression
lr = LinearRegression(featuresCol="features", labelCol="label")
model = lr.fit(df)
Using distributed machine learning allows you to handle far larger datasets than would be possible on a single machine, making it a powerful tool for scaling your data science operations.
57. Automate model retraining pipelines for continual improvement.
Once you’ve deployed a model, the work doesn’t stop. Over time, models degrade due to changes in data patterns, so you need to periodically retrain them. Automating your retraining pipelines ensures your models stay up-to-date without constant manual intervention.
You can use Databricks Jobs to schedule retraining tasks, automatically ingesting new data, refreshing feature engineering steps, and re-running model training. MLflow integrates well here, letting you track every retrained model:
from mlflow import log_metric, log_param
log_param("learning_rate", 0.01)
log_metric("accuracy", 0.92)
By automating retraining, you maintain model accuracy and performance over time, all while keeping your workflow efficient and scalable.
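As a sketch of what the scheduling side can look like, the job definition below would be POSTed to the Jobs API at /api/2.1/jobs/create to retrain a model every Sunday night; the job name, notebook path, cluster id and cron expression are placeholders:
# Weekly retraining job definition; POST to /api/2.1/jobs/create
retrain_job = {
    "name": "projectX_model_retraining",
    "tasks": [{
        "task_key": "retrain",
        "notebook_task": {"notebook_path": "/Repos/projectX/ml/retrain_model"},
        "existing_cluster_id": "<cluster-id>",
    }],
    "schedule": {
        "quartz_cron_expression": "0 0 2 ? * SUN",   # 02:00 every Sunday
        "timezone_id": "UTC",
    },
}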
58. Store models centrally for reproducibility.
Storing machine learning models centrally ensures that your team can easily access, reuse, and reproduce models when needed. Databricks integrates directly with MLflow, making it easy to log and store models in a central repository. This means that once a model is trained, it doesn’t just sit on someone’s local machine—it’s available for anyone on the team to use or improve upon.
MLflow provides a way to store not only the model but also the environment it was trained in (e.g., the exact Python packages and versions). This makes reproducing results much easier. Here’s how you can log a model in MLflow:
import mlflow.sklearn
mlflow.sklearn.log_model(my_model, "model")  # use the flavour that matches your model (sklearn, spark, xgboost, ...)
By storing models centrally, you ensure that all versions are tracked, models can be easily compared, and collaboration between data scientists is streamlined. This also prevents the dreaded “lost model” situation where no one knows which version was deployed.
59. Version your machine learning models with MLflow.
Machine learning models evolve over time, and keeping track of which version is in production, which one performed best, and which one should be retrained is crucial. MLflow provides built-in model versioning, ensuring you always know which model is being used in your pipeline and how it compares to previous versions.
By logging each version of your model during training, you can compare performance metrics like accuracy, precision, or recall across multiple experiments:
mlflow.log_metric("accuracy", accuracy)  # a numeric value computed during evaluation
mlflow.register_model("runs:/<run-id>/model", "model_name")
This creates a clear lineage of your models, enabling you to easily roll back to a previous version if necessary. Model versioning is a critical practice for building reproducible, reliable machine learning systems.
60. Monitor model drift and ensure consistent performance.
Machine learning models don’t live in a vacuum. Over time, the data they were trained on can change—this phenomenon is known as model drift. If left unchecked, model drift can lead to poor predictions and degraded performance in production environments. That’s why monitoring your models post-deployment is just as important as training them.
Using MLflow or third-party tools, you can monitor metrics such as prediction accuracy, data distributions, and feature importance in real-time. Set up alerts for when performance drops below a certain threshold, and automate retraining pipelines to counteract drift.
import mlflow
mlflow.log_metric("drift_score", drift_value)
By regularly evaluating your models, you ensure that they continue to perform consistently in production, reducing the risk of poor decision-making based on outdated predictions.
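The drift_value logged above has to come from somewhere. There is no single built-in drift metric, so as a rough illustration the sketch below compares the mean of one scoring feature against a baseline captured at training time and logs the gap to MLflow; the table name, column and threshold are made up for the example:
import mlflow
from pyspark.sql import functions as F

# Baseline statistic captured when the model was trained (stored however you prefer)
baseline_mean = 42.0

# Mean of the same feature over the most recent scoring data
recent_mean = (spark.table("prod.scoring_input")
                    .agg(F.avg("feature_1").alias("m"))
                    .collect()[0]["m"])

drift_value = abs(recent_mean - baseline_mean) / abs(baseline_mean)
mlflow.log_metric("drift_score", drift_value)

if drift_value > 0.2:          # illustrative threshold
    print("Feature drift detected - consider triggering the retraining job")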
61. Ensure explainability of models using tools like SHAP.
Explainability is no longer optional in machine learning, especially when your models are used to inform critical decisions. Stakeholders and regulators increasingly demand transparency into how models make predictions, and tools like SHAP (SHapley Additive exPlanations) help explain the contributions of each feature to the model’s output.
SHAP integrates easily with Databricks, allowing you to generate visual explanations for model predictions. For example, you can use SHAP to show which features are most important in determining a prediction:
import shap
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)
By providing detailed explanations, you improve trust in your models and ensure compliance with regulations like GDPR, which may require explainability in AI-driven decision-making.
62. Use model serving for scalable machine learning model deployment.
Getting a model into production is only half the battle. The other half is ensuring it scales. Model serving in Databricks provides a way to deploy your models at scale without worrying about the underlying infrastructure. This allows you to serve machine learning models as REST APIs, enabling real-time predictions while Databricks handles the scalability.
With Databricks, you can deploy models directly from the MLflow Model Registry and serve them with minimal configuration, either by enabling serving on a registered model in the workspace’s Serving UI or by creating an endpoint through the Serving Endpoints REST API. For quick local testing, MLflow’s own CLI can also stand up a registered model version as a REST API:
mlflow models serve -m "models:/my_model/1" -p 5000
Databricks-hosted model serving is designed for real-world production workloads, with built-in load balancing and scaling, making it ideal for applications requiring fast, low-latency predictions.
63. Ensure that your data is pre-processed consistently across experiments.
Inconsistent data pre-processing is one of the most common causes of issues in machine learning experiments. If your training data and production data are pre-processed differently, the model will likely perform poorly in real-world scenarios. Consistency in data transformations—whether it’s scaling, encoding, or handling missing values—is essential.
Use pipelines in Databricks to automate pre-processing steps and ensure that the same transformations are applied to both the training and inference stages:
from pyspark.ml.feature import StandardScaler
scaler = StandardScaler(inputCol="features", outputCol="scaled_features")
model = scaler.fit(training_data)
scaled_data = model.transform(test_data)
By enforcing consistent data pre-processing, you minimise the risk of data leakage and ensure that your model performs as expected in production.
64. Use the Databricks monitoring UI to keep track of cluster health.
Monitoring the health of your Databricks clusters is critical to ensuring optimal performance and avoiding costly downtime. The Databricks monitoring UI provides a real-time view of resource usage across your clusters, including CPU and memory utilisation, disk usage, and node status. This helps you identify bottlenecks, over-provisioning, or failing nodes before they impact your workflows.
Regularly reviewing this data allows you to make informed decisions about cluster scaling, configuration adjustments, and job scheduling. Set up alerts for critical thresholds (e.g., if memory usage exceeds 80%) to proactively address issues:
databricks clusters events --cluster-id <cluster-id>
By keeping an eye on your cluster’s health, you ensure that resources are being used efficiently, preventing performance degradation and unnecessary costs.
65. Monitor job execution via Spark UI for bottlenecks.
The Spark UI is a crucial tool for understanding how your jobs are running in Databricks. It provides detailed insights into the execution of Spark jobs, including stages, tasks, and memory usage. If a job is running slower than expected, the Spark UI can help you pinpoint the bottleneck—whether it’s skewed data, long-running stages, or excessive shuffling.
For example, if you see that a single task is taking disproportionately longer than others, it could indicate an unbalanced partition. You can address this by repartitioning the data or optimising the query logic.
Use the Spark UI to review job execution and optimise your workflows. It is opened from the workspace rather than the CLI: go to Compute, select the cluster, and open the Spark UI tab, or follow the Spark UI link from a job run’s detail page.
This helps keep jobs running efficiently and minimises costly delays in data processing.
66. Set up alerts for cluster over-provisioning.
Cluster over-provisioning can quickly become a silent cost killer in Databricks. If you’re allocating more resources than necessary for a job, you’re effectively burning money. Setting up alerts helps you monitor when clusters are over-provisioned or underutilised, allowing you to scale down when appropriate.
Databricks clusters run on Azure VMs, so you can use Azure Monitor to set up alerts based on metrics like CPU usage, memory utilisation, or job execution times. For example, you can trigger an alert if CPU usage stays below 20% for an extended period, scoping the alert to the VM resources in the workspace’s managed resource group (the command below is simplified):
az monitor metrics alert create --name "ClusterUnderutilised" --resource-group myResourceGroup --scopes <vm-resource-id> --condition "avg Percentage CPU < 20" --window-size 30m --evaluation-frequency 5m
By setting up alerts for over-provisioning, you ensure that you’re not paying for compute resources you don’t need, optimising both performance and cost.
67. Use the cluster event log to troubleshoot issues.
When things go wrong with a cluster, the event log is your first port of call for troubleshooting. Databricks provides detailed logs that record events such as node failures, job crashes, or configuration changes, giving you a clear picture of what happened and why.
By reviewing the cluster event log, you can trace issues back to their root cause—whether it’s a misconfigured Spark parameter, a node running out of memory, or an unexpected resource spike. Use the following command to access the logs:
databricks clusters events --cluster-id <cluster-id>
Regularly reviewing event logs helps you quickly diagnose and resolve issues, ensuring minimal disruption to your workflows and preventing similar problems in the future.
68. Apply Spark logs and job history for debugging failed jobs.
When a job fails, the first thing you need is visibility into what went wrong. Spark logs and job history in Databricks provide detailed insights into each step of your job execution, helping you track down the exact point of failure. The logs capture everything—from configuration settings to error messages—making them your go-to tool for debugging.
You can access Spark logs through the Databricks UI, or fetch the output (and any error message) of a specific job run with the Databricks CLI:
databricks runs get-output --run-id <run-id>
For a more granular view, use the Spark History Server, which tracks jobs and stages in detail, letting you dig into task execution times, memory usage, and shuffling issues. Regularly reviewing these logs allows you to optimise future jobs and avoid repeated failures.
69. Use logging for data pipeline jobs.
Logging isn’t just for troubleshooting—it’s a key component of observability in data pipelines. By implementing structured logging in your Databricks jobs, you gain insights into job progress, performance, and any unexpected behaviour that may occur along the way. Log important metrics, intermediate results, and critical steps to create an audit trail of your pipeline execution.
Using Python’s built-in logging module (optionally writing the log files out to DBFS or your data lake with dbutils.fs), you can add logs to your pipeline:
import logging
logging.basicConfig(level=logging.INFO)
logging.info("Starting data ingestion...")
This not only helps during debugging but also makes monitoring easier, as you can see at which stage your pipeline failed or lagged. Keep your logs clean, structured, and informative so they’re useful not only for error detection but also for performance tuning.
70. Regularly check for and upgrade deprecated runtimes.
Databricks regularly releases new runtimes that include performance enhancements, bug fixes, and support for newer Spark versions. Sticking to old, deprecated runtimes can result in missed performance improvements and potential security vulnerabilities. It’s good practice to check which runtimes you’re using and update them regularly.
You can view available runtimes and their support lifecycles in the Databricks documentation. When upgrading, test your workloads on the new runtime in a development environment before rolling it out to production. To change the runtime for an existing cluster, export its JSON definition, update the spark_version field, and apply it with the CLI (or simply edit the cluster in the UI):
databricks clusters edit --json-file cluster-config.json
By keeping your runtime up to date, you ensure that your environment benefits from the latest enhancements, without falling behind on security or functionality.
71. Leverage the metrics API for automated performance monitoring.
Keeping tabs on your job performance and resource utilisation is crucial for ensuring that your clusters aren’t wasting resources. Databricks exposes cluster and job information through its REST API, and cluster metrics and logs can be streamed to Azure Monitor / Log Analytics, enabling automated monitoring and alerting for your workloads.
With these APIs you can track cluster state, cluster events, and job run durations, and combine them with VM-level CPU and memory metrics. Integrate the results into your existing monitoring dashboards (e.g., Grafana or Azure Monitor) for real-time insights into how your jobs and clusters are performing.
Here’s a basic example of retrieving cluster information via the REST API:
curl -n -X GET "https://<databricks-instance>/api/2.0/clusters/get?cluster_id=<cluster-id>"
Automating performance monitoring allows you to spot inefficiencies early and adjust your cluster configurations or Spark jobs to improve performance. It’s essential for maintaining optimal resource usage, especially in large-scale environments.
72. Use spot instances to reduce cluster costs.
Spot instances are an effective way to lower your Databricks infrastructure costs. By taking advantage of unused Azure compute resources, you can access spot instances at a significant discount compared to on-demand pricing. The downside is that these resources can be reclaimed by Azure with little notice, so they’re best suited for non-critical workloads where interruptions are acceptable.
To use spot instances in Databricks, configure the azure_attributes block of your cluster definition to request spot capacity, with an optional fallback to on-demand nodes if spot capacity is reclaimed (a spot_bid_max_price of -1 means you pay at most the on-demand price):
"azure_attributes": {
  "availability": "SPOT_WITH_FALLBACK_AZURE",
  "first_on_demand": 1,
  "spot_bid_max_price": -1
}
Spot instances are ideal for tasks like development, testing, or batch jobs, where cost savings outweigh the risk of resource loss. However, always monitor job execution and set retries or use fault-tolerant architectures to handle potential instance terminations gracefully.
73. Shut down idle clusters promptly with automated cluster termination.
Leaving clusters running when they’re not in use is like leaving the lights on in an empty room—it’s a needless waste of resources. Databricks provides a simple solution with automated cluster termination, which ensures that clusters shut down after a defined period of inactivity.
You can set this in the cluster configuration, specifying how long a cluster should remain idle before shutting down:
"autotermination_minutes": 30
This not only cuts costs but also keeps your workspace clean by automatically terminating clusters that are no longer needed. Automating this process ensures you’re not left paying for clusters that are sitting idle, doing nothing but consuming compute time and racking up costs.
74. Optimise storage by cleaning up outdated data and snapshots.
Data storage costs can spiral out of control if outdated files, snapshots, or logs aren’t regularly cleaned up. In Delta Lake, you can use the VACUUM command to clean up old data files that are no longer referenced by the Delta transaction log, freeing up storage space.
For example, you can remove unreferenced files older than 30 days (VACUUM takes its retention period in hours, so 30 days is 720 hours):
spark.sql("VACUUM my_delta_table RETAIN 720 HOURS")
This helps reduce clutter and optimise your data storage costs, ensuring you’re only paying for what you actually need. Regularly running cleanup commands also prevents performance degradation in queries, as Delta Lake doesn’t have to process unnecessary historical data.
75. Avoid over-provisioning of clusters and ensure resource rightsizing.
Over-provisioning clusters—allocating more compute resources than you need—can quickly inflate your Databricks bill. The trick is to rightsize your clusters based on workload requirements. Monitor your jobs to understand resource usage patterns and adjust cluster sizes accordingly.
For lighter, short-lived jobs, consider using smaller instance types or fewer nodes. For heavy, data-intensive jobs, configure auto-scaling so that resources are dynamically allocated based on demand. This keeps your compute usage efficient without over-provisioning:
"autoscale": {
"min_workers": 2,
"max_workers": 10
}
By regularly reviewing resource usage, you can ensure clusters are appropriately sized, reducing the risk of over-provisioning and saving money in the process.
76. Monitor job runtimes and eliminate unnecessary delays.
Job runtimes can stretch longer than necessary due to poorly optimised queries, data shuffling, or misconfigured clusters. Monitoring job runtimes closely allows you to identify and eliminate unnecessary delays in your pipelines. The Spark UI provides visibility into how long each stage of your job takes, making it easier to pinpoint slow sections.
To optimise runtimes, focus on reducing data shuffling, repartitioning large datasets appropriately, and caching frequently accessed data. Also, consider using broadcast joins for smaller tables, as they can eliminate the need for data movement across nodes.
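For the broadcast-join point above, a minimal PySpark sketch looks like this (df_large and df_small are placeholders for your own DataFrames, and customer_id is an example join key):
from pyspark.sql.functions import broadcast

# Ship the small dimension table to every executor instead of shuffling the large fact table
joined = df_large.join(broadcast(df_small), on="customer_id", how="left")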
Tracking runtimes over time helps you spot trends and adjust configurations for faster, more efficient processing.
77. Use smaller clusters for development and testing.
For development and testing workloads, there’s no need to spin up large clusters. You can save costs by using smaller clusters for non-production environments. These clusters don’t need the same resources as production environments, so scale them down appropriately:
"autoscale": {
"min_workers": 1,
"max_workers": 4
}
Additionally, set a short auto-termination period to ensure they shut down quickly when idle. By using smaller, more efficient clusters for development, you can significantly reduce your overall compute costs without compromising on performance for critical production jobs.
78. Consider using ephemeral clusters for one-time or ad-hoc jobs.
Ephemeral clusters are ideal for one-off or ad-hoc jobs where you don’t need persistent infrastructure. They spin up quickly, do the job, and then terminate, ensuring you’re only paying for compute resources when you actually need them. This approach is particularly useful for development, testing, or analytics tasks that don’t require long-term cluster availability.
When running an ephemeral cluster, you simply configure it to auto-terminate after completing the task, which prevents the cost drain of idle clusters hanging around. It’s a good practice to run any ad-hoc analytics or ETL jobs this way instead of using a persistent production cluster.
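One way to get this behaviour programmatically is a one-time run submitted with its own new_cluster definition: the cluster exists only for the duration of the run and is torn down automatically. A minimal sketch of the payload (notebook path, runtime and VM size are placeholders) that you would POST to the Jobs API at /api/2.1/jobs/runs/submit:
# One-time run on an ephemeral job cluster; POST to /api/2.1/jobs/runs/submit
one_time_run = {
    "run_name": "adhoc_sales_backfill",
    "tasks": [{
        "task_key": "backfill",
        "notebook_task": {"notebook_path": "/Repos/projectX/etl/backfill_sales"},
        "new_cluster": {
            "spark_version": "13.3.x-scala2.12",
            "node_type_id": "Standard_DS3_v2",
            "num_workers": 2,
        },
    }],
}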
79. Use the Azure Cost Management tool to monitor Databricks costs.
Cost management is key to staying within budget, especially in cloud environments where resources can quickly become costly if left unchecked. Azure Cost Management provides a dashboard to monitor and analyse your Databricks spending, helping you keep track of where your money is going and spot patterns that could lead to overspending.
By setting budgets and alerts for your Databricks usage, you can avoid unexpected spikes in costs. The tool also allows you to break down spending by resource (e.g., clusters, storage) or even by tags, making it easier to understand which teams or projects are consuming the most resources. Regular cost reviews will help you adjust cluster sizes, optimise resource usage, and avoid runaway bills.
80. Optimise costs with Azure Reserved Instances for long-running clusters.
If your Databricks clusters are running consistently over long periods, you can significantly reduce costs by using Azure Reserved Instances (RIs). Reserved Instances offer discounts in exchange for committing to a set amount of compute usage over a one- or three-year period. This is particularly effective for clusters that are critical to production workloads or operate around the clock.
Using RIs allows you to lock in lower rates, and the savings can be substantial—up to 72% compared to pay-as-you-go pricing. It’s worth doing the math on which of your long-running clusters would benefit from this, as it can make a big difference to your overall budget.
81. Use the cluster event log to troubleshoot issues.
Databricks provides a cluster event log that tracks key events like job failures, configuration changes, and node issues. When troubleshooting cluster problems, the event log is your first stop for figuring out what went wrong. You can identify the exact moment a job failed, see if any nodes were terminated unexpectedly, or discover configuration errors that may have caused performance problems.
By regularly reviewing the event log, you can also identify trends that could indicate deeper issues, such as frequent node crashes or memory bottlenecks. It’s a valuable tool for proactive monitoring and fine-tuning your cluster configurations to avoid repeated issues.
82. Apply Spark logs and job history for debugging failed jobs.
Failed jobs are inevitable, but the important part is figuring out why they failed and preventing similar issues in the future. The Spark job history and logs in Databricks are essential for understanding what went wrong during job execution. You can dive into specific stages and tasks to pinpoint where a job stalled, crashed, or failed to process data correctly.
Once you’ve identified the problem, whether it’s related to out-of-memory errors, skewed data, or improper partitioning, you can take steps to optimise your jobs or adjust your cluster configuration. Regularly reviewing Spark job logs helps you catch inefficiencies and avoid repeated failures.
83. Use the error messaging system in Databricks notebooks to trace back issues.
When things go wrong in your notebooks, Databricks provides error messages that help trace back the issue to the exact line of code that caused it. While this might seem like a standard feature, it’s particularly useful in a distributed environment like Databricks, where debugging errors can be more complex due to the nature of parallel processing.
The error messaging system often includes stack traces and references to the specific operation or transformation that failed, making it easier to debug distributed jobs or data pipelines. Reviewing these error messages, rather than just restarting the notebook, helps you understand the root cause of failures and optimise your code.
84. Isolate issues by running code in smaller test clusters.
When a job fails on a large production cluster, it can be difficult to debug efficiently. One effective strategy is to run the same code on a smaller test cluster where you can isolate the issue without using up expensive resources. Test clusters allow you to iterate quickly, troubleshoot errors, and fine-tune your code before scaling it up to production.
This is especially useful when you suspect that the failure is caused by data partitioning or memory management issues, as these can be more easily diagnosed on smaller datasets. By isolating the problem in a controlled environment, you can resolve it more quickly without impacting larger production jobs.
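A simple way to reproduce a failure on a small test cluster is to work with a sample of the problematic data first; the fraction, table names and seed below are only illustrative:
# Pull a small, reproducible sample of the input data for debugging on a test cluster
sample_df = (spark.table("prod.events")
                  .sample(fraction=0.01, seed=42))

sample_df.write.mode("overwrite").saveAsTable("dev.events_debug_sample")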
85. Review job output logs for detailed information on job failures.
When a job fails, the output logs in Databricks provide detailed information about what happened during execution. These logs include the steps taken by Spark, any errors encountered, and resource utilisation metrics. Reviewing these logs allows you to diagnose issues such as data skew, out-of-memory errors, or problems with data sources (e.g., missing files or permission issues).
Once you’ve reviewed the logs, you can take targeted actions to resolve the problem, such as repartitioning data, increasing memory, or adjusting configurations. Job logs are also helpful for identifying systemic issues that may impact multiple jobs, giving you a chance to fix things at the root level.
86. Regularly check for package dependency conflicts.
One often-overlooked area of debugging in Databricks involves package dependencies. If you’re using third-party libraries or custom packages, dependency conflicts can cause jobs to fail unexpectedly. Regularly reviewing your library versions and ensuring compatibility across notebooks and jobs is essential for preventing conflicts.
When possible, pin specific package versions in your environment to avoid mismatches. Databricks allows you to manage libraries at the cluster level, making it easier to ensure consistency across jobs. Dependency conflicts are a common cause of job failures, but they can be resolved by maintaining a clear view of the versions you’re using.
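A lightweight way to keep versions consistent is to pin them in the cluster’s library configuration. The payload below is a sketch of what you would POST to the Libraries API at /api/2.0/libraries/install; the package versions are examples only:
# Pin exact library versions at the cluster level (POST to /api/2.0/libraries/install)
pinned_libraries = {
    "cluster_id": "<cluster-id>",
    "libraries": [
        {"pypi": {"package": "pandas==1.5.3"}},
        {"pypi": {"package": "scikit-learn==1.3.2"}},
    ],
}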
87. Monitor and visualise metrics using Databricks dashboards.
Databricks provides built-in tools for creating dashboards that can monitor job performance, resource utilisation, and data pipeline health in real-time. These dashboards let you visualise key metrics without diving into the logs or Spark UI every time you want to check on the status of a job.
By setting up dashboards to track metrics like CPU usage, memory consumption, or job execution times, you can quickly identify bottlenecks or areas for optimisation. Dashboards also make it easier to share insights with your team, providing a clear view of how well your jobs and clusters are performing without having to rely on ad-hoc analysis.
88. Ensure that compliance standards like GDPR are met with your Databricks architecture.
Compliance with regulations like GDPR (General Data Protection Regulation) is non-negotiable, especially when dealing with personal data in Databricks. This means ensuring that data processing aligns with regulations, including data minimisation, access control, and data deletion practices.
You should configure Role-Based Access Control (RBAC) to limit who can access sensitive data, and use Azure Active Directory (AAD) for authentication. Additionally, ensure you have workflows in place to respond to data subject requests (e.g., the right to be forgotten), which could involve deleting personal data from both active and historical records.
Delta Lake makes this manageable: you can delete individual records with a standard DELETE statement, and because time travel keeps old file versions around, you should follow up with VACUUM so the deleted personal data is also physically removed from history once the retention window has passed:
# Delete personal data from a Delta table, then purge old file versions
spark.sql("DELETE FROM user_data WHERE user_id = '12345'")
spark.sql("VACUUM user_data RETAIN 168 HOURS")
By regularly reviewing your Databricks architecture, you can ensure compliance with GDPR and other regulatory frameworks, protecting both your data and your organisation from hefty fines.
89. Ensure workspace auditing is enabled.
Auditing is crucial for understanding who is accessing your Databricks workspace and what they’re doing with the data. By enabling workspace auditing, you create an audit trail that captures key actions such as cluster creation, job execution, and data access, ensuring you have a clear record for both security and compliance purposes.
To enable auditing, use Azure Monitor to stream diagnostic logs from your Databricks environment to a storage account, event hub, or log analytics workspace. These logs allow you to track and review actions, providing insights into unusual activity or potential misuse of resources. It’s essential for both internal governance and meeting regulatory requirements.
90. Manage user permissions at the group level.
Managing user permissions at the group level is much more scalable than handling individual permissions, especially in larger teams. By grouping users based on their roles (e.g., data scientists, data engineers, analysts), you can assign access permissions in bulk, making it easier to control who can access certain data or resources.
Using Azure Active Directory (AAD), you can sync groups to your Databricks workspace and manage permissions directly through the interface. This reduces the administrative overhead of managing permissions on a per-user basis and ensures that your security policies are applied consistently across the organisation.
For example, you can assign group-level permissions using the Databricks CLI (syntax varies between CLI versions; with the newer CLI it is roughly as follows):
databricks permissions update clusters <cluster-id> --json '{"access_control_list": [{"group_name": "DataScienceGroup", "permission_level": "CAN_MANAGE"}]}'
By managing permissions at the group level, you maintain tighter control over your environment while ensuring that users have the appropriate access to the resources they need.
91. Regularly audit Databricks workspaces for unused clusters.
Clusters that are left running with no active jobs can quickly become a drain on resources. Regularly auditing your Databricks workspaces for unused clusters is an important best practice to avoid paying for compute power you’re not using. Clusters that are not being used should be terminated to prevent unnecessary costs.
You can set up a scheduled job that audits all clusters and sends alerts for those that have been idle beyond a certain threshold. Alternatively, use cluster auto-termination settings to ensure clusters shut down after a specified period of inactivity.
Here’s a simple query to list the clusters that are currently running (and therefore worth checking for idleness) using the Databricks CLI:
databricks clusters list --output JSON | jq '.clusters[] | select(.state == "RUNNING") | {cluster_name, cluster_id}'
Automating the audit process will help you keep costs down and ensure your environment is optimally configured.
92. Conduct periodic security audits.
Security threats evolve, and so should your security audits. Conducting regular security audits ensures that your Databricks workspace stays compliant with your organisation’s security policies and external regulations. This involves reviewing user access permissions, checking for unused or over-permissioned accounts, and ensuring that secrets are stored securely (e.g., using Azure Key Vault).
Part of this process includes reviewing audit logs, ensuring data is encrypted both at rest and in transit, and verifying that your clusters are configured according to best practices. Conducting periodic security audits helps mitigate risks and prevent data breaches.
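Part of that audit should confirm that no credentials are hard-coded in notebooks. With a Key Vault-backed (or Databricks-backed) secret scope in place, code should only ever reference secrets indirectly; the scope and key names below are placeholders:
# Fetch a credential from a secret scope instead of hard-coding it in the notebook
jdbc_password = dbutils.secrets.get(scope="kv-backed-scope", key="sql-db-password")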
93. Keep up-to-date with new Databricks features and integrate them when necessary.
Databricks is constantly evolving, with new features and updates that can improve performance, security, or usability. Staying informed about these updates ensures you’re taking advantage of the latest improvements and not relying on outdated workflows. Subscribe to release notes and review new features as they’re announced to see how they can benefit your use cases.
For instance, if Databricks releases a new runtime version that improves query performance or introduces a new Delta Lake feature, make sure you test it in your development environment and integrate it into your workflows where appropriate. Keeping your workspace modernised not only boosts efficiency but also ensures you’re not missing out on cost-saving or performance-enhancing features.
94. Implement logging for all access to Databricks resources.
Logging all access to your Databricks resources is an essential part of maintaining security and compliance. By enabling detailed logging, you can track who accessed your clusters, notebooks, and data, and what actions they performed. This information is critical for investigating security incidents or responding to compliance audits.
In Databricks, you can use Azure Monitor to collect diagnostic logs and store them in a central location for analysis. From there, you can set up alerts for any suspicious activity, such as unusual data access patterns or job failures, to ensure you’re always on top of potential security issues.
95. Use secure network connections to your data sources.
Ensuring that your Databricks clusters are using secure connections to your data sources is non-negotiable in a world where data breaches can cost millions. Whether you’re connecting to a data lake, SQL database, or an external API, always use encrypted connections (e.g., TLS) and private endpoints where possible.
Private endpoints route traffic over a private network rather than the public internet, significantly reducing the risk of man-in-the-middle attacks. You can configure private endpoints via Azure Private Link to secure your Databricks connections; a simplified example (you also need to point it at your VNet, subnet and the target workspace resource) looks like this:
az network private-endpoint create --name myPrivateEndpoint --resource-group myResourceGroup --vnet-name myVnet --subnet mySubnet --private-connection-resource-id <databricks-workspace-resource-id> --group-id databricks_ui_api --connection-name myConnection
By using secure networking practices, you ensure that your data remains protected both at rest and in transit.
96. Conduct regular data access reviews to maintain security compliance.
As your Databricks environment grows, so does the number of users and teams accessing your data. Regular data access reviews are necessary to ensure that only the right people have access to sensitive resources. This involves reviewing permissions across clusters, notebooks, and storage locations to ensure compliance with security policies and regulations.
Set up periodic audits of your RBAC configuration, removing users who no longer need access and tightening permissions where necessary. By conducting these reviews, you maintain a secure environment and reduce the risk of unauthorised access to sensitive data.
97. Track resource usage and optimise accordingly to maintain cost-efficiency.
Resource usage in Databricks can balloon if not monitored carefully. Tracking resource utilisation metrics—such as CPU, memory, and storage—allows you to optimise clusters and jobs, ensuring that you’re not over-provisioning or under-utilising resources.
You can use Azure Monitor or third-party tools like Datadog to gather detailed metrics on how resources are being consumed. By tracking these metrics over time, you can spot inefficiencies and take corrective action, such as scaling down clusters, optimising jobs, or adjusting data partitions. Resource tracking ensures that you maintain cost-efficiency without sacrificing performance.