The Problem
Debugging a slow PySpark job has always meant leaving your editor, navigating to the Databricks UI, clicking through the Spark UI, squinting at a wall of plan text, and then manually figuring out what to do about it.
Even experienced data engineers find the raw physical plan hard to read. And once you've identified the problem — a bad join, an unnecessary shuffle, a repeated scan — you still have to remember the right fix and apply it by hand.
CatalystOps 0.8.0 changes that.
What's New in 0.8.0
🔍 Explain Plan View
After running a dry run against your Databricks cluster, CatalystOps now populates a sidebar tree showing your full physical query plan — broken down node by node, with a cost score on each operation.
No more digging through the Spark UI. The plan lives in your editor, right next to your code. Sort-merge joins, exchanges (shuffles), and repeated scans are flagged automatically, so you know exactly where your DBUs are going.
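To make the flagging concrete, here is a minimal, illustrative sketch of the idea (not CatalystOps internals): scan the text of a physical plan, such as what `df.explain()` prints, for the operators that usually dominate cost. The operator names are real Spark operators; the heuristic and the trimmed plan string are assumptions for illustration.

```python
# Operators that typically dominate cost in a Spark physical plan
EXPENSIVE_OPS = ("SortMergeJoin", "Exchange", "CartesianProduct")

def flag_expensive_nodes(plan_text: str) -> list[str]:
    """Return expensive operators found in a plan, in plan order."""
    flagged = []
    for line in plan_text.splitlines():
        for op in EXPENSIVE_OPS:
            if op in line:
                flagged.append(op)
    return flagged

# Trimmed example of the shape Spark's explain output takes
plan = """\
== Physical Plan ==
*(5) HashAggregate(keys=[region], functions=[sum(revenue)])
+- Exchange hashpartitioning(region, 200)
   +- *(4) SortMergeJoin [product_id], [product_id], Inner
"""

print(flag_expensive_nodes(plan))  # → ['Exchange', 'SortMergeJoin']
```

A real plan analyzer would parse the tree structure rather than grep lines, but the principle is the same: a handful of operator names account for most of the cost.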
🗺 Interactive DAG Visualization
Open the Plan DAG (CatalystOps: Show Plan DAG) to see your query rendered as an interactive graph in a VS Code webview panel. It makes it immediately obvious when your plan has unnecessary stages, fan-outs, or redundant operations that wouldn't be visible in the tree view.
```python
from pyspark.sql import functions as F

orders = spark.table("orders")
customers = spark.table("customers")
products = spark.table("products")

# Large join — no broadcast hints
result = (
    orders
    .join(customers, "customer_id")
    .join(products, "product_id")
    .groupBy("region")
    .agg(F.sum("revenue"))
)
```
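Under the hood, a graph view only needs the plan reduced to nodes and edges. A minimal sketch of that flattening (names and structure hypothetical, not the extension's actual model), using the three-way join above as the example:

```python
from dataclasses import dataclass, field

@dataclass
class PlanNode:
    name: str
    children: list["PlanNode"] = field(default_factory=list)

def to_edges(root: PlanNode) -> list[tuple[str, str]]:
    """Walk the plan tree and emit (child, parent) edges for a DAG renderer."""
    edges = []
    for child in root.children:
        edges.append((child.name, root.name))
        edges.extend(to_edges(child))
    return edges

# The three-way join above, reduced to its join skeleton
plan = PlanNode("Aggregate", [
    PlanNode("Join#2", [
        PlanNode("Join#1", [PlanNode("Scan orders"), PlanNode("Scan customers")]),
        PlanNode("Scan products"),
    ]),
])

edges = to_edges(plan)  # five edges: three scans feeding two joins, then the aggregate
```

Once the plan is an edge list, redundant stages and fan-outs show up as repeated sources or wide fan-in nodes in the graph.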
⚡ Context-Aware Quick Fixes
CatalystOps doesn't just show you the problem — it offers one-click fixes directly on plan tree nodes. These aren't generic suggestions; they're generated from the actual plan for your specific query:
| Problem detected | Quick fix offered |
|---|---|
| Inefficient join (missing broadcast) | Add broadcast hint |
| Unnecessary exchange / shuffle | Add repartition hint |
| Repeated scan without caching | Add .persist() |
| Sort-merge join with AQE disabled | Set AQE config |
| Cartesian product | Add join condition hint |
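The table above can be read as a pattern-to-fix mapping. A toy sketch of that idea (the pattern names are real Spark operators; the mapping and matching logic are illustrative, not the extension's actual rule engine):

```python
# Illustrative pattern → quick-fix mapping, mirroring the table above
QUICK_FIXES = {
    "SortMergeJoin": "Add broadcast hint",
    "Exchange": "Add repartition hint",
    "CartesianProduct": "Add join condition hint",
}

def suggest_fixes(plan_text: str) -> list[str]:
    """Return the quick fixes whose trigger pattern appears in the plan."""
    return [fix for op, fix in QUICK_FIXES.items() if op in plan_text]

fixes = suggest_fixes("SortMergeJoin ... Exchange hashpartitioning(region, 200)")
# → ['Add broadcast hint', 'Add repartition hint']
```

Applying the broadcast fix in PySpark itself is the standard `F.broadcast` pattern, e.g. `orders.join(F.broadcast(customers), "customer_id")` — how CatalystOps rewrites your source is its own implementation detail.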
⏱ Configurable Dry-Run Timeout
Set your own timeout via catalystops.dryRun.timeoutSeconds (default: 300s, minimum: 30s). No more jobs silently timing out mid-analysis on large plans.
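In your VS Code `settings.json`, that might look like the following (the value is illustrative):

```json
{
  "catalystops.dryRun.timeoutSeconds": 600
}
```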
Why This Matters
Most PySpark performance issues fall into a handful of categories: bad joins, unnecessary shuffles, repeated scans, missing statistics. They're common, they're expensive, and they're fixable — if you can spot them.
The problem has always been visibility. The Spark physical plan contains everything you need, but it's buried in a UI most developers don't open until something is already on fire. CatalystOps 0.8.0 brings that visibility into your editor, before you ship to production, with the fixes already written for you.
After your first dry run, open the Plan DAG alongside the plan tree — the graph view makes multi-stage shuffle chains immediately obvious at a glance.
Getting Started
- Install CatalystOps from the VS Code Marketplace or Open VSX
- Connect your Databricks workspace in the extension settings
- Open a PySpark file and run a dry run (CatalystOps: Run Dry Run)
- Open the Explain Plan panel in the sidebar
- Click any node to see cost details and available quick fixes
- Open the DAG view (CatalystOps: Show Plan DAG)
What's Next
0.8.0 lays the groundwork for deeper plan analysis. On the roadmap: multi-file plan correlation, historical plan comparison, and cost trend tracking across runs.
Got feedback? Open an issue on GitHub or drop a review on the VS Code Marketplace.
Free & open source
Try CatalystOps today
Catch PySpark performance issues before they hit production — inline, in your editor.