## The Gap in Spark Debugging
Until now, analyzing a Spark job's physical plan required running it again — either via a dry run or by re-triggering the actual job. That's expensive, slow, and often impossible for jobs that consume production data or have long runtimes.
But the information you need is already there. Every Databricks job writes a Spark event log to DBFS — a structured file that contains every stage, task, and query execution plan from the run. CatalystOps 0.9.0 taps directly into those logs.
## The Jobs Sidebar
A new Jobs tree view appears in the CatalystOps sidebar alongside your Clusters view. It lists all Databricks jobs in your workspace, each showing its last-run status:
- ✅ Success — job completed without errors
- ❌ Failed — job run ended in failure
- 🔄 Running — job currently in progress
- ⏭ Skipped / Cancelled — run was skipped or cancelled before completion
Click any job to trigger historical run analysis. No re-execution required.
## How the Analysis Works
When you click a job, CatalystOps:
- Fetches the most recent completed run via the Databricks Jobs API
- Reads the Spark event log from DBFS (`dbfs:/cluster-logs/<cluster-id>/eventlog/`)
- Parses the event log to extract physical query plans from all `SparkListenerSQLExecutionStart` events
- Runs the full CatalystOps plan analysis on those plans, detecting shuffles, sort aggregates, missing partition filters, single-partition bottlenecks, and more
- Correlates plan issues with the source notebook using the View Source button in the DAG
- Opens a markdown report in a new editor tab with the full findings
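The event-log parsing step above can be sketched in a few lines. Spark event logs are newline-delimited JSON, and each SQL execution announces itself with a `SparkListenerSQLExecutionStart` event carrying the plan text. The sketch below is a simplified illustration, not CatalystOps' actual parser; the `Event` and `physicalPlanDescription` field names follow Spark's event-log format:

```python
import json

SQL_START = "org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart"

def extract_physical_plans(event_log_text: str) -> list[str]:
    """Return the physical plan description of every SQL execution
    recorded in a Spark event log (one JSON object per line)."""
    plans = []
    for line in event_log_text.splitlines():
        if not line.strip():
            continue
        event = json.loads(line)
        if event.get("Event") == SQL_START:
            plans.append(event.get("physicalPlanDescription", ""))
    return plans
```

Given a log downloaded from DBFS, this yields one plan string per query, ready to feed into the plan-level detectors.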
The same plan-level detectors that run during a dry run apply to historical runs: SinglePartitionBottleneck, SortAggregate, GlobalWindow, missing partition filters, broadcast join opportunities, and more.
## Interactive DAG for Job Runs
0.9.0 also ships an improved DAG view that now renders for job run analysis results. The plan tree displays:
- Operator tree with `└─`/`├─` connectors
- Query groups collapsed into accordions with execution counts
- Human-friendly filter conditions (`col not null`, `a and b`)
- View Source button to open the originating notebook
- Raw Plans collapsible section for debugging
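Rendering an operator tree with those connectors is a small recursive walk. A minimal sketch, using a hypothetical `PlanNode` shape rather than the extension's real model:

```python
from dataclasses import dataclass, field

@dataclass
class PlanNode:
    """Hypothetical plan-tree node for illustration only."""
    name: str
    children: list["PlanNode"] = field(default_factory=list)

def render(root: PlanNode) -> str:
    """Render an operator tree with └─/├─ connectors, one operator per line."""
    lines = [root.name]

    def walk(node: PlanNode, prefix: str) -> None:
        for i, child in enumerate(node.children):
            last = i == len(node.children) - 1
            lines.append(prefix + ("└─ " if last else "├─ ") + child.name)
            walk(child, prefix + ("   " if last else "│  "))

    walk(root, "")
    return "\n".join(lines)
```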
## New Plan-Level Issue Detectors
0.9.0 ships three new detectors that apply to both live dry runs and historical job run analysis:
| Detector | What it catches |
|---|---|
| `SinglePartitionBottleneck` | `Exchange SinglePartition`, which collects all data onto a single executor (caused by global aggregations or global window functions). Often the root cause of driver OOM. |
| `SortAggregate` | Sort-based aggregation, which is slower than hash-based aggregation and prone to spilling on large datasets. Usually fixable by enabling AQE or rewriting the query. |
| `GlobalWindow` via `RunningWindowFunction` | Extends the existing global-window check to cover Databricks Photon's `RunningWindowFunction` operator, which generates a full-data shuffle. |
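In spirit, each detector is a pattern check over the physical plan. The toy sketch below uses plain text matching as a stand-in for CatalystOps' tree-based analysis; only the detector names come from this release:

```python
# Toy, text-based stand-ins for CatalystOps' tree-based detectors.
DETECTORS = [
    ("SinglePartitionBottleneck", lambda plan: "Exchange SinglePartition" in plan),
    ("SortAggregate", lambda plan: "SortAggregate" in plan),
    ("GlobalWindow", lambda plan: "RunningWindowFunction" in plan),
]

def find_issues(plan_text: str) -> list[str]:
    """Return the names of all detectors that fire on a physical plan."""
    return [name for name, fires in DETECTORS if fires(plan_text)]
```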
## AQE Initial Plan Support
When Adaptive Query Execution (AQE) is enabled, Databricks may not produce a "final" physical plan until partway through execution. Previously, CatalystOps skipped these plans entirely — silently missing every issue in AQE-enabled jobs.
0.9.0 now analyzes plans whose operator tree sits entirely under `== Initial Plan ==`, so AQE jobs are no longer invisible to the analyzer.
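The fallback can be sketched as a small string-handling step. This is illustrative only, not the extension's exact logic; the `== Initial Plan ==` marker matches Spark's `AdaptiveSparkPlan` string output:

```python
def analyzable_plan(plan_text: str) -> str:
    """Return the portion of a physical plan worth analyzing.

    If the plan only carries an AQE initial plan (no finalized
    operators yet), fall back to the operator tree under the
    '== Initial Plan ==' header instead of skipping the plan.
    """
    marker = "== Initial Plan =="
    if "== Final Plan ==" in plan_text:
        # A finalized AQE plan is present; analyze the full text.
        return plan_text
    if marker in plan_text:
        return plan_text.split(marker, 1)[1].strip()
    return plan_text
```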
## MCP: Get Last Job Run Analysis
The MCP server gains a new `get_last_job_run_analysis` tool that exposes the most recent job run's plan issues and physical plan text to Claude (or any MCP client) without fetching from Databricks again. Useful for asking "what was wrong with the last run?" directly in your AI assistant.
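Over MCP's JSON-RPC transport, invoking the tool is a standard `tools/call` request. The tool name comes from this release; the request shape follows the MCP specification, and the empty `arguments` object is an assumption:

```python
import json

def make_tool_call(request_id: int) -> str:
    """Build a JSON-RPC 2.0 'tools/call' request for the new MCP tool.
    The empty arguments object is assumed; check the server's tool schema."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "get_last_job_run_analysis",
            "arguments": {},
        },
    })
```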
## Use Cases
Historical job run analysis is especially useful for:
- Production incident review — a job failed or ran slow; analyze the plan from that specific run without touching production data
- Optimization without re-running — identify shuffle-heavy stages in a 3-hour job by analyzing the event log from the last run
- Scheduled job auditing — regularly check your scheduled jobs for plan regressions without needing to trigger them manually
- Onboarding — understand what a job is actually doing by reading its physical plan, not just its source code
## Getting Started
- Install CatalystOps from the VS Code Marketplace
- Configure your Databricks workspace connection
- Open the CatalystOps Jobs sidebar panel
- Click any job to analyze its most recent run
- The report opens in a new editor tab — no re-execution needed
Free & open source
Try CatalystOps today
Catch PySpark performance issues before they hit production — inline, in your editor.