## The Gap in Spark Debugging
Until now, analyzing a Spark job's physical plan required running it again — either via a dry run or by re-triggering the actual job. That's expensive, slow, and often impossible for jobs that consume production data or have long runtimes.
But the information you need is already there. Every Databricks job writes a Spark event log to DBFS — a structured file that contains every stage, task, and query execution plan from the run. CatalystOps 0.9.0 taps directly into those logs.
## The Jobs Sidebar
A new Jobs tree view appears in the CatalystOps sidebar alongside your Clusters view. It lists all Databricks jobs in your workspace, each showing its last-run status:
- ✅ Success — job completed without errors
- ❌ Failed — job run ended in failure
- 🔄 Running — job currently in progress
- ⏭ Skipped / Cancelled — run was skipped or cancelled before completion
Click any job to trigger historical run analysis. No re-execution required.
## How the Analysis Works
When you click a job, CatalystOps:
- Fetches the most recent completed run via the Databricks Jobs API
- Reads the Spark event log from DBFS (`dbfs:/cluster-logs/<cluster-id>/eventlog/`)
- Parses the event log to extract physical query plans from all `SparkListenerSQLExecutionStart` events
- Runs the full CatalystOps plan analysis on those plans, detecting shuffles, sort aggregates, missing partition filters, single-partition bottlenecks, and more
- Correlates plan issues with the source notebook using the View Source button in the DAG
- Opens a markdown report in a new editor tab with the full findings
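The event-log parsing step above can be sketched in a few lines. Spark event logs are newline-delimited JSON, and each SQL execution announces itself with a `SparkListenerSQLExecutionStart` event carrying the plan text. The sketch below is a simplified illustration, not CatalystOps' actual parser; the `Event` and `physicalPlanDescription` field names follow Spark's event-log format:

```python
import json

SQL_START = "org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart"

def extract_physical_plans(event_log_text: str) -> list[str]:
    """Return the physical plan description of every SQL execution
    recorded in a Spark event log (one JSON object per line)."""
    plans = []
    for line in event_log_text.splitlines():
        if not line.strip():
            continue
        event = json.loads(line)
        if event.get("Event") == SQL_START:
            plans.append(event.get("physicalPlanDescription", ""))
    return plans
```

Given a log downloaded from DBFS, this yields one plan string per query, ready to feed into the plan-level detectors.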
The same plan-level detectors that run during a dry run apply to historical runs: SinglePartitionBottleneck, SortAggregate, GlobalWindow, missing partition filters, broadcast join opportunities, and more.
## Interactive DAG for Job Runs
0.9.0 also ships an improved DAG view that now renders for job run analysis results. The plan tree displays:
- Operator tree with `└─`/`├─` connectors
- Query groups collapsed into accordions with execution counts
- Human-friendly filter conditions (`col not null`, `a and b`)
- View Source button to open the originating notebook
- Raw Plans collapsible section for debugging
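Rendering an operator tree with those connectors is a small recursive walk. A minimal sketch, using a hypothetical `PlanNode` shape rather than the extension's real model:

```python
from dataclasses import dataclass, field

@dataclass
class PlanNode:
    """Hypothetical plan-tree node for illustration only."""
    name: str
    children: list["PlanNode"] = field(default_factory=list)

def render(root: PlanNode) -> str:
    """Render an operator tree with └─/├─ connectors, one operator per line."""
    lines = [root.name]

    def walk(node: PlanNode, prefix: str) -> None:
        for i, child in enumerate(node.children):
            last = i == len(node.children) - 1
            lines.append(prefix + ("└─ " if last else "├─ ") + child.name)
            walk(child, prefix + ("   " if last else "│  "))

    walk(root, "")
    return "\n".join(lines)
```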
## New Plan-Level Issue Detectors
0.9.0 ships three new detectors that apply to both live dry runs and historical job run analysis:
| Detector | What it catches |
|---|---|
| `SinglePartitionBottleneck` | `Exchange SinglePartition`, which collects all data onto a single executor (caused by global aggregations or global window functions). Often the root cause of driver OOM. |
| `SortAggregate` | Sort-based aggregation, which is slower than hash-based aggregation and prone to spilling on large datasets. Usually fixable by enabling AQE or rewriting the query. |
| `GlobalWindow` via `RunningWindowFunction` | Extends the existing global-window check to cover Databricks Photon's `RunningWindowFunction` operator, which generates a full-data shuffle. |
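In spirit, each detector is a pattern check over the physical plan. The toy sketch below uses plain text matching as a stand-in for CatalystOps' tree-based analysis; only the detector names come from this release:

```python
# Toy, text-based stand-ins for CatalystOps' tree-based detectors.
DETECTORS = [
    ("SinglePartitionBottleneck", lambda plan: "Exchange SinglePartition" in plan),
    ("SortAggregate", lambda plan: "SortAggregate" in plan),
    ("GlobalWindow", lambda plan: "RunningWindowFunction" in plan),
]

def find_issues(plan_text: str) -> list[str]:
    """Return the names of all detectors that fire on a physical plan."""
    return [name for name, fires in DETECTORS if fires(plan_text)]
```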
## AQE Initial Plan Support
When Adaptive Query Execution (AQE) is enabled, Databricks may not produce a "final" physical plan until partway through execution. Previously, CatalystOps skipped these plans entirely — silently missing every issue in AQE-enabled jobs.
0.9.0 now analyzes plans whose operator tree sits entirely under `== Initial Plan ==`, so AQE jobs are no longer invisible to the analyzer.
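The fallback can be sketched as a small string-handling step. This is illustrative only, not the extension's exact logic; the `== Initial Plan ==` marker matches Spark's `AdaptiveSparkPlan` string output:

```python
def analyzable_plan(plan_text: str) -> str:
    """Return the portion of a physical plan worth analyzing.

    If the plan only carries an AQE initial plan (no finalized
    operators yet), fall back to the operator tree under the
    '== Initial Plan ==' header instead of skipping the plan.
    """
    marker = "== Initial Plan =="
    if "== Final Plan ==" in plan_text:
        # A finalized AQE plan is present; analyze the full text.
        return plan_text
    if marker in plan_text:
        return plan_text.split(marker, 1)[1].strip()
    return plan_text
```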
## MCP: Get Last Job Run Analysis
The MCP server gains a new `get_last_job_run_analysis` tool that exposes the most recent job run's plan issues and physical plan text to Claude (or any MCP client) without fetching from Databricks again. Useful for asking "what was wrong with the last run?" directly in your AI assistant.
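Over MCP's JSON-RPC transport, invoking the tool is a standard `tools/call` request. The tool name comes from this release; the request shape follows the MCP specification, and the empty `arguments` object is an assumption:

```python
import json

def make_tool_call(request_id: int) -> str:
    """Build a JSON-RPC 2.0 'tools/call' request for the new MCP tool.
    The empty arguments object is assumed; check the server's tool schema."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "get_last_job_run_analysis",
            "arguments": {},
        },
    })
```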
## Use Cases
Historical job run analysis is especially useful for:
- Production incident review — a job failed or ran slow; analyze the plan from that specific run without touching production data
- Optimization without re-running — identify shuffle-heavy stages in a 3-hour job by analyzing the event log from the last run
- Scheduled job auditing — regularly check your scheduled jobs for plan regressions without needing to trigger them manually
- Onboarding — understand what a job is actually doing by reading its physical plan, not just its source code
## Getting Started
- Install CatalystOps from the VS Code Marketplace
- Configure your Databricks workspace connection
- Open the CatalystOps Jobs sidebar panel
- Click any job to analyze its most recent run
- The report opens in a new editor tab — no re-execution needed
Free & open source
Try CatalystOps today
Catch PySpark performance issues before they hit production — inline, in your editor.