The Problem
Debugging a slow PySpark job has always meant leaving your editor, navigating to the Databricks UI, clicking through the Spark UI, squinting at a wall of plan text, and then manually figuring out what to do about it.
Even experienced data engineers find the raw physical plan hard to read. And once you've identified the problem — a bad join, an unnecessary shuffle, a repeated scan — you still have to remember the right fix and apply it by hand.
CatalystOps 0.8.0 changes that.
What's New in 0.8.0
🔍 Explain Plan View
After running a dry run against your Databricks cluster, CatalystOps now populates a sidebar tree showing your full physical query plan — broken down node by node, with a cost score on each operation.
No more digging through the Spark UI. The plan lives in your editor, right next to your code. Sort-merge joins, exchanges (shuffles), and repeated scans are flagged automatically, so you know exactly where your DBUs are going.
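To make the flagging concrete, here is a minimal, illustrative sketch of the idea (not CatalystOps internals): scan the text of a physical plan, such as what `df.explain()` prints, for the operators that usually dominate cost. The operator names are real Spark operators; the heuristic and the trimmed plan string are assumptions for illustration.

```python
# Operators that typically dominate cost in a Spark physical plan
EXPENSIVE_OPS = ("SortMergeJoin", "Exchange", "CartesianProduct")

def flag_expensive_nodes(plan_text: str) -> list[str]:
    """Return expensive operators found in a plan, in plan order."""
    flagged = []
    for line in plan_text.splitlines():
        for op in EXPENSIVE_OPS:
            if op in line:
                flagged.append(op)
    return flagged

# Trimmed example of the shape Spark's explain output takes
plan = """\
== Physical Plan ==
*(5) HashAggregate(keys=[region], functions=[sum(revenue)])
+- Exchange hashpartitioning(region, 200)
   +- *(4) SortMergeJoin [product_id], [product_id], Inner
"""

print(flag_expensive_nodes(plan))  # → ['Exchange', 'SortMergeJoin']
```

A real plan analyzer would parse the tree structure rather than grep lines, but the principle is the same: a handful of operator names account for most of the cost.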
🗺 Interactive DAG Visualization
Open the Plan DAG (CatalystOps: Show Plan DAG) to see your query rendered as an interactive graph in a VS Code webview panel. It makes it immediately obvious when your plan has unnecessary stages, fan-outs, or redundant operations that wouldn't be visible in the tree view.
```python
from pyspark.sql import functions as F

orders = spark.table("orders")
customers = spark.table("customers")
products = spark.table("products")

# Large join — no broadcast hints
result = (
    orders
    .join(customers, "customer_id")
    .join(products, "product_id")
    .groupBy("region")
    .agg(F.sum("revenue"))
)
```
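Under the hood, a graph view only needs the plan reduced to nodes and edges. A minimal sketch of that flattening (names and structure hypothetical, not the extension's actual model), using the three-way join above as the example:

```python
from dataclasses import dataclass, field

@dataclass
class PlanNode:
    name: str
    children: list["PlanNode"] = field(default_factory=list)

def to_edges(root: PlanNode) -> list[tuple[str, str]]:
    """Walk the plan tree and emit (child, parent) edges for a DAG renderer."""
    edges = []
    for child in root.children:
        edges.append((child.name, root.name))
        edges.extend(to_edges(child))
    return edges

# The three-way join above, reduced to its join skeleton
plan = PlanNode("Aggregate", [
    PlanNode("Join#2", [
        PlanNode("Join#1", [PlanNode("Scan orders"), PlanNode("Scan customers")]),
        PlanNode("Scan products"),
    ]),
])

edges = to_edges(plan)  # five edges: three scans feeding two joins, then the aggregate
```

Once the plan is an edge list, redundant stages and fan-outs show up as repeated sources or wide fan-in nodes in the graph.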
⚡ Context-Aware Quick Fixes
CatalystOps doesn't just show you the problem — it offers one-click fixes directly on plan tree nodes. These aren't generic suggestions; they're generated from the actual plan for your specific query:
| Problem detected | Quick fix offered |
|---|---|
| Inefficient join (missing broadcast) | Add broadcast hint |
| Unnecessary exchange / shuffle | Add repartition hint |
| Repeated scan without caching | Add .persist() |
| Sort-merge join with AQE disabled | Set AQE config |
| Cartesian product | Add join condition hint |
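The table above can be read as a pattern-to-fix mapping. A toy sketch of that idea (the pattern names are real Spark operators; the mapping and matching logic are illustrative, not the extension's actual rule engine):

```python
# Illustrative pattern → quick-fix mapping, mirroring the table above
QUICK_FIXES = {
    "SortMergeJoin": "Add broadcast hint",
    "Exchange": "Add repartition hint",
    "CartesianProduct": "Add join condition hint",
}

def suggest_fixes(plan_text: str) -> list[str]:
    """Return the quick fixes whose trigger pattern appears in the plan."""
    return [fix for op, fix in QUICK_FIXES.items() if op in plan_text]

fixes = suggest_fixes("SortMergeJoin ... Exchange hashpartitioning(region, 200)")
# → ['Add broadcast hint', 'Add repartition hint']
```

Applying the broadcast fix in PySpark itself is the standard `F.broadcast` pattern, e.g. `orders.join(F.broadcast(customers), "customer_id")` — how CatalystOps rewrites your source is its own implementation detail.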
⏱ Configurable Dry-Run Timeout
Set your own timeout via catalystops.dryRun.timeoutSeconds (default: 300s, minimum: 30s). No more jobs silently timing out mid-analysis on large plans.
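In your VS Code `settings.json`, that might look like the following (the value is illustrative):

```json
{
  "catalystops.dryRun.timeoutSeconds": 600
}
```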
Why This Matters
Most PySpark performance issues fall into a handful of categories: bad joins, unnecessary shuffles, repeated scans, missing statistics. They're common, they're expensive, and they're fixable — if you can spot them.
The problem has always been visibility. The Spark physical plan contains everything you need, but it's buried in a UI most developers don't open until something is already on fire. CatalystOps 0.8.0 brings that visibility into your editor, before you ship to production, with the fixes already written for you.
After your first dry run, open the Plan DAG alongside the plan tree — the graph view makes multi-stage shuffle chains immediately obvious at a glance.
Getting Started
- Install CatalystOps from the VS Code Marketplace or Open VSX
- Connect your Databricks workspace in the extension settings
- Open a PySpark file and run a dry run (CatalystOps: Run Dry Run)
- Open the Explain Plan panel in the sidebar
- Click any node to see cost details and available quick fixes
- Open the DAG view (CatalystOps: Show Plan DAG)
What's Next
0.8.0 lays the groundwork for deeper plan analysis. On the roadmap: multi-file plan correlation, historical plan comparison, and cost trend tracking across runs.
Got feedback? Open an issue on GitHub or drop a review on the VS Code Marketplace.
Free & open source
Try CatalystOps today
Catch PySpark performance issues before they hit production — inline, in your editor.