What is MCP?
Model Context Protocol (MCP) is an open standard created by Anthropic for connecting AI assistants to external tools and data sources. Instead of copy-pasting error messages or code snippets into a chat window, MCP lets the AI call tools directly — reading live data, triggering actions, and returning structured results.
MCP is now supported by Claude (Desktop and Claude.ai), GitHub Copilot in VS Code 1.99+, Cursor, Windsurf, and Kiro. Any of these can connect to the CatalystOps MCP server and access your Spark analysis in real time.
What the CatalystOps MCP Server Exposes
The server exposes three categories of capabilities: tools (actions), resources (live data), and prompts (structured workflows).
Tools
Tools are callable actions the AI can invoke during a conversation:
- `analyze_pyspark` — Run local static analysis on any PySpark snippet. No Databricks connection needed. Pass any code string and get back structured issues immediately.
- `get_active_file_issues` — Returns the current file's issues as structured JSON. The AI sees exactly what CatalystOps sees in your editor right now.
- `run_dry_run` — Triggers a full Databricks dry run for the active file, waits up to 5 minutes, and returns the physical plan plus any plan-level issues detected.
- `get_plan_analysis` — Returns the last dry run results without triggering a new one. Useful for follow-up questions about a plan you already ran.
- `get_billing_summary` — Returns Databricks spend: total dollars, DBUs, and breakdown by user, job, and workload type. Accepts a `period` parameter: `day`, `week`, or `month`.
- `refresh_billing` — Forces a live billing fetch, bypassing the cache. Use this when you want current numbers rather than the last cached snapshot.
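Under the hood, MCP clients invoke tools with JSON-RPC 2.0 `tools/call` requests. As a rough sketch (the exact argument names for `analyze_pyspark` are an assumption, not taken from the docs above), the payload a client POSTs to the server looks like this:

```python
import json

def tool_call_request(name: str, arguments: dict, request_id: int = 1) -> str:
    """Build a JSON-RPC 2.0 tools/call request body, as an MCP client would."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    })

# Ask the server to statically analyze a PySpark snippet (no Databricks needed).
# The "code" argument name is illustrative.
body = tool_call_request("analyze_pyspark", {
    "code": "df_a.join(df_b)  # no join condition: cartesian product",
})
print(body)
```

Your MCP client builds and sends these requests for you; the sketch is only to show that each tool call is an ordinary, inspectable JSON message.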
Resources
Resources are live data endpoints the AI can read at any time:
- `catalystops://issues/current` — Live Markdown issue list for the active file, formatted for readability in AI context.
- `catalystops://plans/last` — Raw physical and logical plan text from the last dry run.
- `catalystops://billing/summary` — Formatted billing snapshot, ready to inject into a conversation.
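Resources are fetched with the MCP `resources/read` method rather than `tools/call`; the client passes the resource URI and gets the content back. A minimal sketch of that request:

```python
import json

def resource_read_request(uri: str, request_id: int = 1) -> str:
    """Build a JSON-RPC 2.0 resources/read request for an MCP resource URI."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "resources/read",
        "params": {"uri": uri},
    })

# Read the live issue list for the active file.
issues_request = resource_read_request("catalystops://issues/current")
print(issues_request)
```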
Prompts
Prompts are pre-built workflows that combine tool calls with structured AI instructions:
- `pyspark_code_review` — Injects CatalystOps findings as context, then asks the AI for a holistic code review grounded in real issues rather than style preferences.
- `optimize_spark_plan` — Injects the raw Catalyst plan (logical + physical), then asks the AI to identify optimization opportunities based on the actual execution plan.
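Prompts are retrieved with the MCP `prompts/get` method. A sketch of the request for one of the workflows above (whether these prompts take arguments is an assumption; none are documented here):

```python
import json

# JSON-RPC 2.0 prompts/get request, as an MCP client would send it.
prompt_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "prompts/get",
    "params": {
        "name": "optimize_spark_plan",
        "arguments": {},  # prompt arguments, if the server defines any
    },
}
prompt_body = json.dumps(prompt_request)
```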
A Real Example
Here's what a conversation looks like when Claude has access to the MCP server. The user has a PySpark ETL file open in VS Code.
User: Review my PySpark ETL and tell me what will be expensive at scale.

Claude: Let me check your current file for issues first.

→ Calling get_active_file_issues

Found 3 issues: cartesian product in join on line 42, window function missing partitionBy on line 67, repeated DataFrame scan of "events" table (used 4 times without caching).

Let me also get the physical plan to confirm what Spark will actually execute.

→ Calling run_dry_run (waiting for Databricks dry run...)

Got the plan. Here's what will be expensive at scale:

1. SortMergeJoin on line 42 — The physical plan shows a SortMergeJoin with a full shuffle on both sides, no broadcast. At 10M+ rows this will be your biggest bottleneck. Consider broadcasting the smaller table if it fits in memory (<200MB).

2. Window function without partitionBy (line 67) — rank() over an unbounded window means all data goes to a single executor. This will cause an OOM at scale. Add a partitionBy clause — likely partitionBy("user_id") based on your schema.

3. Repeated scans of "events" table — The physical plan confirms 4 separate scans. Cache this DataFrame after the first read with .cache() or .persist() to avoid redundant I/O on every reference.

Fix the window function first — that's an OOM risk, not just a performance issue.
The key difference: Claude isn't guessing from generic PySpark knowledge. It's reading your actual code issues and your actual execution plan. The advice is specific to your file, your schema, and what Databricks will actually do.
How It Works Technically
When CatalystOps starts, it launches a lightweight HTTP server bound to 127.0.0.1 on an OS-assigned port. The server runs in stateless HTTP mode — each request is independent, no persistent connection required.
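The "OS-assigned port" detail uses a standard trick: bind to port 0 and the operating system picks a free ephemeral port for you. A minimal Python illustration of the idea (not CatalystOps' actual implementation, which runs inside the extension host):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Binding to port 0 asks the OS to assign any free ephemeral port.
server = HTTPServer(("127.0.0.1", 0), BaseHTTPRequestHandler)
host, port = server.server_address
print(f"server would listen at http://{host}:{port}/mcp")
server.server_close()
```

Because the port is chosen at bind time, it is different on every launch, which is why the discovery mechanisms described below exist.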
For VS Code 1.99+, the extension automatically writes a .vscode/mcp.json file to your workspace root. GitHub Copilot reads this file on startup and discovers the server without any manual configuration.
```json
{
  "servers": {
    "catalystops": {
      "type": "http",
      "url": "http://127.0.0.1:<port>/mcp"
    }
  }
}
```
For Claude Desktop or other external clients, the server URL is printed to the CatalystOps Output panel in VS Code when it starts. Copy that URL into your client's MCP configuration.
The port changes each time VS Code restarts. For VS Code 1.99+, this is handled automatically via .vscode/mcp.json. For external clients like Claude Desktop, you'll need to update the URL after restarting VS Code.
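To make the auto-discovery step concrete, here is a sketch of how a `.vscode/mcp.json` like the one above could be regenerated for a fresh port on each launch (the helper and its parameters are illustrative, not the extension's actual code):

```python
import json
import tempfile
from pathlib import Path

def write_mcp_config(workspace_root: str, port: int) -> Path:
    """Write a VS Code MCP discovery file pointing at the current server port."""
    config = {
        "servers": {
            "catalystops": {
                "type": "http",
                "url": f"http://127.0.0.1:{port}/mcp",
            }
        }
    }
    path = Path(workspace_root) / ".vscode" / "mcp.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=2))
    return path

# Demo in a temporary directory, with a hypothetical port number.
workspace = tempfile.mkdtemp()
config_path = write_mcp_config(workspace, 49823)
```

External clients have no equivalent hook, which is why their URL must be updated by hand after a restart.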
Setup
VS Code (GitHub Copilot)
Update to CatalystOps 0.8.2 or later from the VS Code Marketplace. The MCP server starts automatically when the extension loads. GitHub Copilot in VS Code 1.99+ will discover it via the auto-generated .vscode/mcp.json — no additional steps required.
Claude Desktop, Cursor, Windsurf, Kiro
Open the CatalystOps Output panel in VS Code (View → Output → CatalystOps). You'll see a line like:
```
CatalystOps MCP server running at http://127.0.0.1:49823/mcp
```
Copy that URL and add it to your client's MCP server configuration. For Claude Desktop, this goes in claude_desktop_config.json:
```json
{
  "mcpServers": {
    "catalystops": {
      "url": "http://127.0.0.1:49823/mcp"
    }
  }
}
```
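If you script this step to avoid hand-editing the config after every VS Code restart, the URL can be pulled from the Output panel line with a simple pattern match. A sketch (the log line format is taken from the example above; anything beyond it is an assumption):

```python
import re

LOG_LINE = "CatalystOps MCP server running at http://127.0.0.1:49823/mcp"

# Extract the localhost MCP URL from the Output panel line.
match = re.search(r"(http://127\.0\.0\.1:\d+/mcp)", LOG_LINE)
url = match.group(1) if match else None
print(url)
```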
Disabling the MCP server
If you don't want the MCP server running, add this to your VS Code settings:
"catalystops.mcp.enabled": false
Why This Matters
AI coding assistants are only as useful as the context they have. Ask an AI to review PySpark code without context, and you get generic advice: "avoid shuffles", "cache your DataFrames", "use broadcast joins". Technically correct. Practically useless without knowing whether those issues actually apply to your code.
CatalystOps gives AI assistants three layers of grounded context:
- Static analysis — rule-based issues detected locally, no Databricks required
- Runtime plan — the actual physical execution plan from Databricks, showing what Spark will really do
- Billing data — real spend numbers, so cost discussions are anchored to actual dollars, not estimates
Together, these turn a generic code review into a specific, actionable audit. The AI can say "this join will be a SortMergeJoin with a full shuffle, and your team spent $340 on shuffle-heavy jobs last week" instead of "shuffles can be expensive".
That's the difference between an AI that's read the documentation and an AI that's read your code.
Free and open source
Add AI to your PySpark workflow
Install CatalystOps 0.8.2 and connect Claude, Copilot, or Cursor to your live Spark analysis.