What is MCP?
Model Context Protocol (MCP) is an open standard created by Anthropic for connecting AI assistants to external tools and data sources. Instead of copy-pasting error messages or code snippets into a chat window, MCP lets the AI call tools directly — reading live data, triggering actions, and returning structured results.
MCP is now supported by Claude (Desktop and Claude.ai), GitHub Copilot in VS Code 1.99+, Cursor, Windsurf, and Kiro. Any of these can connect to the CatalystOps MCP server and access your Spark analysis in real time.
What the CatalystOps MCP Server Exposes
The server exposes three categories of capabilities: tools (actions), resources (live data), and prompts (structured workflows).
Tools
Tools are callable actions the AI can invoke during a conversation:
- `analyze_pyspark` — Run local static analysis on any PySpark snippet. No Databricks connection needed. Pass any code string and get back structured issues immediately.
- `get_active_file_issues` — Returns the current file's issues as structured JSON. The AI sees exactly what CatalystOps sees in your editor right now.
- `run_dry_run` — Triggers a full Databricks dry run for the active file, waits up to 5 minutes, and returns the physical plan plus any plan-level issues detected.
- `get_plan_analysis` — Returns the last dry run results without triggering a new one. Useful for follow-up questions about a plan you already ran.
- `get_billing_summary` — Returns Databricks spend: total dollars, DBUs, and breakdown by user, job, and workload type. Accepts a `period` parameter: `day`, `week`, or `month`.
- `refresh_billing` — Forces a live billing fetch, bypassing the cache. Use this when you want current numbers rather than the last cached snapshot.
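Under the hood, MCP clients invoke tools with JSON-RPC 2.0 `tools/call` requests. As a rough sketch (the exact argument names for `analyze_pyspark` are an assumption, not taken from the docs above), the payload a client POSTs to the server looks like this:

```python
import json

def tool_call_request(name: str, arguments: dict, request_id: int = 1) -> str:
    """Build a JSON-RPC 2.0 tools/call request body, as an MCP client would."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    })

# Ask the server to statically analyze a PySpark snippet (no Databricks needed).
# The "code" argument name is illustrative.
body = tool_call_request("analyze_pyspark", {
    "code": "df_a.join(df_b)  # no join condition: cartesian product",
})
print(body)
```

Your MCP client builds and sends these requests for you; the sketch is only to show that each tool call is an ordinary, inspectable JSON message.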
Resources
Resources are live data endpoints the AI can read at any time:
- `catalystops://issues/current` — Live Markdown issue list for the active file, formatted for readability in AI context.
- `catalystops://plans/last` — Raw physical and logical plan text from the last dry run.
- `catalystops://billing/summary` — Formatted billing snapshot, ready to inject into a conversation.
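Resources are fetched with the MCP `resources/read` method rather than `tools/call`; the client passes the resource URI and gets the content back. A minimal sketch of that request:

```python
import json

def resource_read_request(uri: str, request_id: int = 1) -> str:
    """Build a JSON-RPC 2.0 resources/read request for an MCP resource URI."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "resources/read",
        "params": {"uri": uri},
    })

# Read the live issue list for the active file.
issues_request = resource_read_request("catalystops://issues/current")
print(issues_request)
```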
Prompts
Prompts are pre-built workflows that combine tool calls with structured AI instructions:
- `pyspark_code_review` — Injects CatalystOps findings as context, then asks the AI for a holistic code review grounded in real issues rather than style preferences.
- `optimize_spark_plan` — Injects the raw Catalyst plan (logical + physical), then asks the AI to identify optimization opportunities based on the actual execution plan.
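Prompts are retrieved with the MCP `prompts/get` method. A sketch of the request for one of the workflows above (whether these prompts take arguments is an assumption; none are documented here):

```python
import json

# JSON-RPC 2.0 prompts/get request, as an MCP client would send it.
prompt_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "prompts/get",
    "params": {
        "name": "optimize_spark_plan",
        "arguments": {},  # prompt arguments, if the server defines any
    },
}
prompt_body = json.dumps(prompt_request)
```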
A Real Example
Here's what a conversation looks like when Claude has access to the MCP server. The user has a PySpark ETL file open in VS Code.
User: Review my PySpark ETL and tell me what will be expensive at scale.

Claude: Let me check your current file for issues first.

→ Calling get_active_file_issues

Found 3 issues: cartesian product in join on line 42, window function missing partitionBy on line 67, repeated DataFrame scan of "events" table (used 4 times without caching).

Let me also get the physical plan to confirm what Spark will actually execute.

→ Calling run_dry_run (waiting for Databricks dry run...)

Got the plan. Here's what will be expensive at scale:

1. SortMergeJoin on line 42 — The physical plan shows a SortMergeJoin with a full shuffle on both sides, no broadcast. At 10M+ rows this will be your biggest bottleneck. Consider broadcasting the smaller table if it fits in memory (<200MB).

2. Window function without partitionBy (line 67) — rank() over an unbounded window means all data goes to a single executor. This will cause an OOM at scale. Add a partitionBy clause — likely partitionBy("user_id") based on your schema.

3. Repeated scans of "events" table — The physical plan confirms 4 separate scans. Cache this DataFrame after the first read with .cache() or .persist() to avoid redundant I/O on every reference.

Fix the window function first — that's an OOM risk, not just a performance issue.
The key difference: Claude isn't guessing from generic PySpark knowledge. It's reading your actual code issues and your actual execution plan. The advice is specific to your file, your schema, and what Databricks will actually do.
How It Works Technically
When CatalystOps starts, it launches a lightweight HTTP server bound to 127.0.0.1 on an OS-assigned port. The server runs in stateless HTTP mode — each request is independent, no persistent connection required.
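The "OS-assigned port" detail uses a standard trick: bind to port 0 and the operating system picks a free ephemeral port for you. A minimal Python illustration of the idea (not CatalystOps' actual implementation, which runs inside the extension host):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Binding to port 0 asks the OS to assign any free ephemeral port.
server = HTTPServer(("127.0.0.1", 0), BaseHTTPRequestHandler)
host, port = server.server_address
print(f"server would listen at http://{host}:{port}/mcp")
server.server_close()
```

Because the port is chosen at bind time, it is different on every launch, which is why the discovery mechanisms described below exist.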
For VS Code 1.99+, the extension automatically writes a .vscode/mcp.json file to your workspace root. GitHub Copilot reads this file on startup and discovers the server without any manual configuration.
```json
{
  "servers": {
    "catalystops": {
      "type": "http",
      "url": "http://127.0.0.1:<port>/mcp"
    }
  }
}
```
For Claude Desktop or other external clients, the server URL is printed to the CatalystOps Output panel in VS Code when it starts. Copy that URL into your client's MCP configuration.
The port changes each time VS Code restarts. For VS Code 1.99+, this is handled automatically via .vscode/mcp.json. For external clients like Claude Desktop, you'll need to update the URL after restarting VS Code.
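To make the auto-discovery step concrete, here is a sketch of how a `.vscode/mcp.json` like the one above could be regenerated for a fresh port on each launch (the helper and its parameters are illustrative, not the extension's actual code):

```python
import json
import tempfile
from pathlib import Path

def write_mcp_config(workspace_root: str, port: int) -> Path:
    """Write a VS Code MCP discovery file pointing at the current server port."""
    config = {
        "servers": {
            "catalystops": {
                "type": "http",
                "url": f"http://127.0.0.1:{port}/mcp",
            }
        }
    }
    path = Path(workspace_root) / ".vscode" / "mcp.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=2))
    return path

# Demo in a temporary directory, with a hypothetical port number.
workspace = tempfile.mkdtemp()
config_path = write_mcp_config(workspace, 49823)
```

External clients have no equivalent hook, which is why their URL must be updated by hand after a restart.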
Setup
VS Code (GitHub Copilot)
Update to CatalystOps 0.8.2 or later from the VS Code Marketplace. The MCP server starts automatically when the extension loads. GitHub Copilot in VS Code 1.99+ will discover it via the auto-generated .vscode/mcp.json — no additional steps required.
Claude Desktop, Cursor, Windsurf, Kiro
Open the CatalystOps Output panel in VS Code (View → Output → CatalystOps). You'll see a line like:
```
CatalystOps MCP server running at http://127.0.0.1:49823/mcp
```
Copy that URL and add it to your client's MCP server configuration. For Claude Desktop, this goes in claude_desktop_config.json:
```json
{
  "mcpServers": {
    "catalystops": {
      "url": "http://127.0.0.1:49823/mcp"
    }
  }
}
```
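If you script this step to avoid hand-editing the config after every VS Code restart, the URL can be pulled from the Output panel line with a simple pattern match. A sketch (the log line format is taken from the example above; anything beyond it is an assumption):

```python
import re

LOG_LINE = "CatalystOps MCP server running at http://127.0.0.1:49823/mcp"

# Extract the localhost MCP URL from the Output panel line.
match = re.search(r"(http://127\.0\.0\.1:\d+/mcp)", LOG_LINE)
url = match.group(1) if match else None
print(url)
```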
Disabling the MCP server
If you don't want the MCP server running, add this to your VS Code settings:
"catalystops.mcp.enabled": false
Why This Matters
AI coding assistants are only as useful as the context they have. Ask an AI to review PySpark code without context, and you get generic advice: "avoid shuffles", "cache your DataFrames", "use broadcast joins". Technically correct. Practically useless without knowing whether those issues actually apply to your code.
CatalystOps gives AI assistants three layers of grounded context:
- Static analysis — rule-based issues detected locally, no Databricks required
- Runtime plan — the actual physical execution plan from Databricks, showing what Spark will really do
- Billing data — real spend numbers, so cost discussions are anchored to actual dollars, not estimates
Together, these turn a generic code review into a specific, actionable audit. The AI can say "this join will be a SortMergeJoin with a full shuffle, and your team spent $340 on shuffle-heavy jobs last week" instead of "shuffles can be expensive".
That's the difference between an AI that's read the documentation and an AI that's read your code.
Free and open source
Add AI to your PySpark workflow
Install CatalystOps 0.8.2 and connect Claude, Copilot, or Cursor to your live Spark analysis.