In this article, I’ll show you how to build a comprehensive set of dashboards for monitoring Databricks Jobs, with a focus on tracking errors and performance degradation. This material is aimed at those who rely on the standard Databricks scheduler and already manage dozens of jobs. It will be relevant both for workspace administrators and for IT specialists responsible for their own domains in a Data Mesh architecture. The approaches described can serve as a basis for any kind of analytics on Databricks Jobs.
The main problem with viewing jobs through the UI is that the Databricks interface is great for diagnosing a single job, but it’s of little use when you need a comprehensive overview of the entire workspace. It’s difficult to quickly understand how many jobs are currently scheduled, which ones failed yesterday, which ones haven’t run for a long time, where execution times are gradually creeping up, and which tasks haven’t had their Spark runtime updated in a while. All of this can be seen, but only manually and piece by piece, not as a complete, visual overview.
Furthermore, production typically has dozens or hundreds of jobs, multiple teams, different SLAs, and different business criticality levels. You need a unified monitoring center for Databricks Jobs.
General architecture
The architecture is simple: one script that compiles a common table, plus, in my case, the standard Databricks dashboard functionality for visualization. You can also use Tableau or Microsoft Power BI.
Sources can be system tables or JSON retrieved via the REST API. I described how to extract data through the API in Databricks in my previous article — How to Monitor Databricks Jobs: API-Based Dashboard.
In my case, I rely on the API: it provides raw data with many columns and allows for near-real-time reporting. System tables, on the other hand, are significantly more compact in structure but offer the important advantage of history and versioning, which is convenient for long-term analytics. In more complex scenarios, it makes sense to combine both approaches: fetch operational metrics via the REST API and pull deep history and trends from the system tables.
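To make the extraction step concrete, here is a minimal sketch of how the raw JSON can be pulled from the Jobs API. It assumes the workspace URL and a personal access token are available in the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables (my own naming); the script then flattens these responses into the common table.

```python
# Minimal sketch of the API extraction step (Jobs API 2.1).
# Assumes DATABRICKS_HOST (e.g. "https://<workspace>.cloud.databricks.com")
# and DATABRICKS_TOKEN are set in the environment.
import os
import requests

HOST = os.environ["DATABRICKS_HOST"]
HEADERS = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

def list_all(endpoint: str, list_key: str, params: dict | None = None) -> list[dict]:
    """Page through a Jobs API list endpoint and return all items."""
    items, page_token = [], None
    while True:
        query = dict(params or {})
        if page_token:
            query["page_token"] = page_token
        resp = requests.get(f"{HOST}/api/2.1/{endpoint}", headers=HEADERS, params=query)
        resp.raise_for_status()
        body = resp.json()
        items.extend(body.get(list_key, []))
        if not body.get("has_more"):
            return items
        page_token = body["next_page_token"]

jobs = list_all("jobs/list", "jobs", {"expand_tasks": "true"})
# In production you would bound this with start_time_from instead of pulling all history.
runs = list_all("jobs/runs/list", "runs", {"expand_tasks": "true"})
print(f"Fetched {len(jobs)} jobs and {len(runs)} runs")
```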
For clarity, I’ve summarized all the data sources I use in a single table: on the left, what I need for monitoring (clusters, jobs, runs, task settings, and permissions); in the center, the corresponding REST API calls; on the right, the closest equivalents among the Databricks system tables. As you can see, not all of the necessary data is available in the system tables.
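For reference, here is roughly what the same “runs” source looks like when read from system tables instead of the API. I’m using the table and column names I see in the lakeflow system schema in my workspace (system.lakeflow.jobs and system.lakeflow.job_run_timeline); check them against your own workspace before relying on this sketch.

```python
# Runs over the last 60 days, joined with job names, from the system tables.
# Note: system.lakeflow.jobs keeps a row per job change (versioning), so in
# practice you'd keep only the latest row per job_id before joining.
latest_runs = spark.sql("""
    SELECT
        j.job_id,
        j.name               AS job_name,
        r.run_id,
        r.result_state,
        r.period_start_time,
        r.period_end_time
    FROM system.lakeflow.job_run_timeline AS r
    JOIN system.lakeflow.jobs             AS j
      ON r.workspace_id = j.workspace_id AND r.job_id = j.job_id
    WHERE r.period_start_time >= current_date() - INTERVAL 60 DAYS
""")
display(latest_runs)
```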
A set of dashboards
I only started exploring job monitoring capabilities a week ago, but based on my experience administering the Data Mesh approach, I’ve already sketched out a set of dashboards. Below, I’ll show you how to implement each of them in practice. If you have your own ideas for dashboards or metrics for monitoring Databricks Jobs, please share them. I’d be happy to expand this set.
Dashboard 1 — General statistics on jobs
I created the first dashboard to see every job defined in the Databricks scheduler. It displays the following information:
- how many jobs are there in total;
- how many of them are scheduled or have never been run;
- how successful the last run was;
- and which user they were run under.
There’s also a team filter on the left so you can see all this information only for your own domain of responsibility. This is especially convenient in the Data Mesh approach: each team can view the same metrics and errors, but only for their own jobs, without getting lost in the general noise of the entire workspace.
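The counters on this dashboard are simple aggregates over the compiled table my script produces. The sketch below shows the idea; the table name monitoring.jobs_overview and its columns (schedule_status, last_run_state, run_as_user, team) are my own naming, not anything Databricks provides out of the box.

```python
# Counter widgets for Dashboard 1, built on the compiled table (hypothetical schema).
# Named parameters in spark.sql require Spark 3.4+ (DBR 13+); in the dashboard
# itself this would simply be a widget-level filter instead.
display(spark.sql("""
    SELECT
        count(*)                                AS total_jobs,
        count_if(schedule_status = 'SCHEDULED') AS scheduled_jobs,
        count_if(last_run_state IS NULL)        AS never_run_jobs,
        count_if(last_run_state = 'SUCCESS')    AS last_run_succeeded,
        count(DISTINCT run_as_user)             AS distinct_run_as_users
    FROM monitoring.jobs_overview
    WHERE team = :team
""", args={"team": "analytics"}))
```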
Dashboard 2 — Scheduled Jobs & Errors
This dashboard displays scheduled jobs and the status of their latest run, with a table underneath providing the details. Conveniently, Databricks dashboards support links, so you can jump straight from a row to the specific run in the Databricks UI.
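The detail table is a plain SELECT over the same compiled table, with a link column glued together from the workspace URL, job_id, and the last run_id. The exact URL pattern may differ between workspaces, so treat this as a sketch rather than a guaranteed format.

```python
# Detail table for Dashboard 2 with a clickable link to the latest run.
workspace_url = "https://<workspace>.cloud.databricks.com"  # placeholder, replace with yours
display(spark.sql(f"""
    SELECT
        job_name,
        cron_schedule,
        last_run_state,
        last_run_start_time,
        concat('{workspace_url}/jobs/',
               cast(job_id AS STRING), '/runs/',
               cast(last_run_id AS STRING)) AS last_run_link
    FROM monitoring.jobs_overview
    WHERE schedule_status = 'SCHEDULED'
    ORDER BY last_run_start_time DESC
"""))
```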
Dashboard 3 — Daily Success/Failure
This dashboard shows job status by day for the past two months, using the Heatmap visualization. It’s visually convenient, but it still needs to be split into categories, because right now all the job names are lumped together on a single axis.
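The aggregation behind the heatmap is one row per job per day with success and failure counts. Again, monitoring.job_runs is my own run-level compiled table, and the result_state values are whatever the compile script stores.

```python
# Daily success/failure counts per job for the heatmap (last two months).
display(spark.sql("""
    SELECT
        job_name,
        date(run_start_time)                AS run_date,
        count_if(result_state = 'SUCCESS')  AS successful_runs,
        count_if(result_state <> 'SUCCESS') AS failed_runs
    FROM monitoring.job_runs
    WHERE run_start_time >= current_date() - INTERVAL 60 DAYS
    GROUP BY job_name, date(run_start_time)
"""))
```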
Dashboard 4 — Execution time and optimizations
I created this dashboard to view the current runtime configuration of job clusters and to understand where adjustments are needed to improve performance, as I discussed in a previous article. Upgrading the runtime version can improve performance with minimal time and cost, but it is not always sufficient.
I also displayed columns with the average job execution time and the last run duration.
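The underlying query is again a flat SELECT from the compiled table; spark_version, avg_duration_min, and last_run_duration_min are columns my script derives, not part of any standard schema.

```python
# Source query for Dashboard 4: runtime version plus average and last run duration.
display(spark.sql("""
    SELECT
        job_name,
        spark_version,                    -- job cluster runtime, e.g. '13.3.x-scala2.12'
        round(avg_duration_min, 1)        AS avg_duration_min,
        round(last_run_duration_min, 1)   AS last_run_duration_min
    FROM monitoring.jobs_overview
    ORDER BY avg_duration_min DESC
"""))
```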
From this, we noticed that there are scripts that take a long time to run and could be optimized for performance. I still have some work to do here, and I might write about it later. Job Name 1 is a real before-and-after example of script optimization; the real name has been replaced.
Dashboard 5 — Daily execution time
I also tried to build graphs showing execution time day by day. For this, I used the Scatter visualization, which doesn’t look great, but I couldn’t find anything more suitable.
I also highlighted how our Job Name 1 is displayed before and after optimization.
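For reference, the data behind this chart is just the average daily execution time per job, computed from the same hypothetical run-level table monitoring.job_runs used above.

```python
# Average daily execution time per job, in minutes, for the Dashboard 5 chart.
display(spark.sql("""
    SELECT
        job_name,
        date(run_start_time) AS run_date,
        round(avg((unix_timestamp(run_end_time)
                 - unix_timestamp(run_start_time)) / 60), 1) AS avg_duration_min
    FROM monitoring.job_runs
    WHERE result_state = 'SUCCESS'
    GROUP BY job_name, date(run_start_time)
"""))
```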
Later, when I’ve polished the solution, I might share the code separately. Subscribe to my blog so you don’t miss it.
The purpose of this section was primarily to demonstrate the capabilities of this approach. Feel free to share your own metrics and dashboards — things like this thrive on sharing best practices.