Local Analysis

CatalystOps runs 40+ anti-pattern checks on every Python file open and save — no cluster, no network call, no configuration required.

How It Works

When you open or edit a .py file, CatalystOps:

  1. Scans the code with regex-based pattern matchers
  2. Flags issues inline as VS Code diagnostics (squiggly underlines)
  3. Shows a hover card with a one-sentence explanation and a quick-fix snippet
  4. Updates the Issues sidebar with a full list
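
The scanning step above can be pictured as a small line-oriented matcher. This is only a minimal sketch, not the CatalystOps implementation: the two regexes, the `Diagnostic` type, and the `scan` function are illustrative assumptions. Only the rule IDs, severities, and messages are taken from the rule tables on this page.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Diagnostic:
    line: int       # 1-based line number in the scanned source
    rule_id: str
    severity: str
    message: str


# Hypothetical patterns; the real matchers are not documented on this page.
RULES = [
    ("CODE_COLLECT_001", "Critical", re.compile(r"\.collect\(\s*\)"),
     "collect() pulls all data to the driver"),
    ("CODE_PANDAS_001", "Warning", re.compile(r"\.toPandas\(\s*\)"),
     "toPandas() brings all data to the driver"),
]


def scan(source: str) -> list[Diagnostic]:
    """Return one Diagnostic per (line, rule) match, in source order."""
    found = []
    for lineno, text in enumerate(source.splitlines(), start=1):
        # Crude comment stripping: also cuts at a '#' inside a string literal.
        code = text.split("#", 1)[0]
        for rule_id, severity, pattern, message in RULES:
            if pattern.search(code):
                found.append(Diagnostic(lineno, rule_id, severity, message))
    return found


if __name__ == "__main__":
    sample = "rows = df.collect()\npdf = df.toPandas()\n"
    for d in scan(sample):
        print(d.line, d.rule_id, d.severity)
```

Because matching is purely textual, a scanner like this needs no Spark session or cluster, which is consistent with the "no cluster, no network call" claim above.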

Analysis also runs on Jupyter notebooks (.ipynb) — each cell is analyzed in context and issues are mapped back to the correct cell.
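
Mapping issues back to cells could work roughly as follows. This sketch assumes the standard `.ipynb` JSON layout (a top-level `cells` list whose entries have `cell_type` and `source`). The `scan_notebook` function and the single stand-in regex are hypothetical and do not attempt to model whatever cross-cell context CatalystOps uses.

```python
import json
import re

# Stand-in for a real rule; hypothetical pattern.
COLLECT = re.compile(r"\.collect\(\s*\)")


def scan_notebook(ipynb_text: str) -> list[tuple[int, int]]:
    """Return (cell_index, line_in_cell) for each match.

    cell_index is 0-based over all cells; line_in_cell is 1-based.
    """
    notebook = json.loads(ipynb_text)
    hits = []
    for cell_index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        source = cell.get("source", "")
        if isinstance(source, list):
            # The notebook format allows source as a list of line strings.
            source = "".join(source)
        for lineno, text in enumerate(source.splitlines(), start=1):
            if COLLECT.search(text):
                hits.append((cell_index, lineno))
    return hits
```

Keeping the cell index alongside the line number is what lets a diagnostic be attached to the right cell rather than to an offset in the raw JSON file.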

Severity Levels

| Level | Color | Meaning |
| --- | --- | --- |
| Critical | Red | Will likely cause OOM, data loss, or incorrect results |
| Warning | Yellow | Performance problem that will hurt at scale |
| Info | Blue | Best-practice recommendation |
| Suggestion | Grey | Minor improvement opportunity |

Suppressing a Rule

Add # noqa: catalystops to any line to suppress all rules on that line:

```python
df.collect()  # noqa: catalystops
```

Rule Categories

Spark Actions

| Rule ID | Severity | Description |
| --- | --- | --- |
| CODE_COLLECT_001 | Critical | collect() pulls all data to the driver; OOM risk on large datasets |
| CODE_PANDAS_001 | Warning | toPandas() brings all data to the driver |
| CODE_COUNT_001 | Warning | count() > 0 triggers a full Spark job; use isEmpty() instead |
| CODE_ITER_COLLECT_001 | Critical | for row in df.collect() iterates row-by-row on the driver |
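
As a rough illustration of what fixes for these rules might look like, the sketch below wraps three common alternatives in small helpers. The helper names are hypothetical. `DataFrame.isEmpty()` requires PySpark 3.3 or newer, and whether CatalystOps treats these exact forms as clean is an assumption.

```python
from typing import TYPE_CHECKING, Iterator

if TYPE_CHECKING:
    # Imported for type checkers only, so this file loads without Spark.
    from pyspark.sql import DataFrame, Row


def has_rows(df: "DataFrame") -> bool:
    # Instead of `df.count() > 0` (CODE_COUNT_001):
    # isEmpty() does not need to count every row.
    return not df.isEmpty()


def iter_rows(df: "DataFrame") -> "Iterator[Row]":
    # Instead of `for row in df.collect()` (CODE_ITER_COLLECT_001), when
    # driver-side iteration is unavoidable: toLocalIterator() fetches one
    # partition at a time rather than materializing the whole result.
    return df.toLocalIterator()


def preview(df: "DataFrame", n: int = 20) -> "list[Row]":
    # Instead of an unbounded collect() (CODE_COLLECT_001):
    # cap the number of rows before bringing them to the driver.
    return df.limit(n).collect()
```

Where possible, expressing the per-row logic as DataFrame transformations avoids driver-side iteration altogether. The helpers above only reduce the cost when some driver access is genuinely required.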

Joins

| Rule ID | Severity | Description |
| --- | --- | --- |
| CODE_CROSSJOIN_001 | Critical | crossJoin() or implicit cartesian product |
| CODE_COALESCE_001 | Warning | coalesce(1) funnels all data to one partition |

Shuffles & Partitioning

| Rule ID | Severity | Description |
| --- | --- | --- |
| CODE_ORDERBY_001 | Warning | Global orderBy() / sort() triggers a full shuffle |
| CODE_REPARTITION_001 | Warning | repartition(N) before a write; use coalesce() to avoid the full shuffle |
| CODE_WINDOW_001 | Warning | Window.orderBy() without partitionBy(): global window, single partition |

DataFrame Operations

| Rule ID | Severity | Description |
| --- | --- | --- |
| CODE_WITHCOLUMN_LOOP_001 | Warning | withColumn() inside a loop creates a new plan node per iteration |
| CODE_REPEATED_ACTIONS_001 | Warning | Multiple actions on the same DataFrame without .cache() |
| CODE_REPRO_001 | Info | Repeated source scan without .cache() or .persist() |
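
For the withColumn() loop rule, one common rewrite is to build all derived expressions first and apply them in a single projection. The sketch below is illustrative only: the helper name and the "doubled column" transformation are made up, and it assumes the listed columns are numeric.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Imported for type checkers only, so this file loads without Spark.
    from pyspark.sql import DataFrame


def add_doubled_columns(df: "DataFrame", columns: "list[str]") -> "DataFrame":
    # Flagged shape:
    #     for c in columns:
    #         df = df.withColumn(f"{c}_x2", df[c] * 2)
    #
    # Alternative: one selectExpr() call adds every derived column in a
    # single projection, so the plan gains one node instead of one per column.
    exprs = [f"`{c}` * 2 AS `{c}_x2`" for c in columns]
    return df.selectExpr("*", *exprs)
```

On PySpark 3.3 or newer, `DataFrame.withColumns()` with a dict of column expressions is another way to get the same single-projection effect.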

UDFs

| Rule ID | Severity | Description |
| --- | --- | --- |
| CODE_UDF_001 | Info | Python UDF prevents query plan optimization |
| CODE_UDF_FILTER_001 | Warning | UDF inside .filter() blocks predicate pushdown on Delta/partitioned tables |

Schema & Reads

| Rule ID | Severity | Description |
| --- | --- | --- |
| CODE_SCHEMA_001 | Info | inferSchema=true requires an extra data pass; provide an explicit schema |
| CODE_SELECT_STAR_001 | Info | SELECT * may read unnecessary columns |
| CODE_WRITE_MODE_001 | Info | Missing .mode() on write; defaults to error and will fail if data exists |
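
A read and write that avoid the schema and write-mode rules might look like the sketch below. The helper names, the column list, and the choice of CSV input and Parquet output are assumptions for illustration. `DataFrameReader.schema()` accepts a DDL-formatted string as used here.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Imported for type checkers only, so this file loads without Spark.
    from pyspark.sql import DataFrame, SparkSession

# Hypothetical columns; replace with the real layout of the source files.
EVENT_SCHEMA = "event_id BIGINT, user_id STRING, amount DECIMAL(18, 2), ts TIMESTAMP"


def load_events(spark: "SparkSession", path: str) -> "DataFrame":
    # An explicit schema replaces inferSchema=true, so Spark does not need
    # an extra pass over the data to guess types (CODE_SCHEMA_001).
    return spark.read.schema(EVENT_SCHEMA).option("header", "true").csv(path)


def save_events(df: "DataFrame", path: str) -> None:
    # Stating the mode makes the behavior on existing data explicit
    # (CODE_WRITE_MODE_001); "overwrite" is just one possible choice.
    df.write.mode("overwrite").parquet(path)
```

Whether "overwrite", "append", or "ignore" is right depends on the pipeline. The point of the rule is only that the choice is written down rather than left to the default.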

Security

| Rule ID | Severity | Description |
| --- | --- | --- |
| CODE_SQL_INJECT_001 | Critical | f-string interpolation in spark.sql(): SQL injection risk |
| CODE_KAFKA_COMMIT_001 | Critical | Kafka auto-commit enabled; can cause data loss or duplication |
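
One way to avoid f-string interpolation in spark.sql() is a parameterized query. The sketch below assumes PySpark 3.5 or newer, where `spark.sql()` accepts an `args` dict of Python values bound to named `:parameter` markers. The `users` table and the `find_user` helper are hypothetical.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Imported for type checkers only, so this file loads without Spark.
    from pyspark.sql import DataFrame, SparkSession


def find_user(spark: "SparkSession", user_id: str) -> "DataFrame":
    # Flagged shape (CODE_SQL_INJECT_001):
    #     spark.sql(f"SELECT id, name FROM users WHERE id = '{user_id}'")
    #
    # With a named parameter marker, the value is passed separately and
    # never becomes part of the SQL text.
    return spark.sql(
        "SELECT id, name FROM users WHERE id = :user_id",
        args={"user_id": user_id},
    )
```

Parameter markers stand in for values, not identifiers, so table or column names that vary at runtime still need separate validation, for example against an allow-list.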

Streaming

| Rule ID | Severity | Description |
| --- | --- | --- |
| CODE_STREAMING_TRIGGER_001 | Warning | No .trigger(); runs continuous micro-batches |
| CODE_STREAMING_WATERMARK_001 | Warning | groupBy() without watermark causes unbounded state accumulation |
| CODE_STREAMING_INNER_JOIN_001 | Warning | Streaming inner join silently drops late events |
| CODE_DYNAMIC_ALLOC_001 | Warning | Dynamic allocation on a streaming cluster causes instability |
| CODE_ROCKSDB_001 | Info | Stateful streaming without RocksDB state store |
| CODE_AUTOLOADER_RATE_001 | Warning | Auto Loader without maxBytesPerTrigger: unbounded ingestion rate |
| CODE_CHECKPOINT_DBFS_001 | Warning | Checkpoint stored on DBFS; use cloud storage for reliability |
| CODE_UNNAMED_QUERY_001 | Info | Streaming query has no name; harder to monitor |
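
Several of the streaming rules concern how the query is started. The sketch below shows a writer configuration that names the query, sets an explicit trigger, and takes the checkpoint location as a parameter so it can point at cloud storage. The query name, the one-minute interval, the Delta sink, and the `start_query` helper are all illustrative assumptions.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Imported for type checkers only, so this file loads without Spark.
    from pyspark.sql import DataFrame
    from pyspark.sql.streaming import StreamingQuery


def start_query(
    stream_df: "DataFrame", checkpoint_path: str, output_path: str
) -> "StreamingQuery":
    return (
        stream_df.writeStream
        # A name makes the query easy to find in monitoring
        # (CODE_UNNAMED_QUERY_001).
        .queryName("orders_ingest")
        # An explicit interval instead of back-to-back micro-batches
        # (CODE_STREAMING_TRIGGER_001).
        .trigger(processingTime="1 minute")
        # Pass a cloud storage path here rather than a DBFS path
        # (CODE_CHECKPOINT_DBFS_001).
        .option("checkpointLocation", checkpoint_path)
        .format("delta")
        .start(output_path)
    )
```

A call might look like `start_query(orders_stream, "s3://example-bucket/checkpoints/orders", "s3://example-bucket/tables/orders")`, where the DataFrame and both paths are placeholders.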

Delta Lake

| Rule ID | Severity | Description |
| --- | --- | --- |
| CODE_MERGE_DV_001 | Info | MERGE without Deletion Vectors enabled |
| CODE_MERGE_RLC_001 | Info | MERGE without Row-Level Concurrency |
| CODE_OPTIMIZE_MERGE_001 | Warning | OPTIMIZE after every MERGE causes latency spikes |
| CODE_DROP_CREATE_001 | Warning | DROP TABLE + CREATE TABLE is non-atomic; use CREATE OR REPLACE |
| CODE_ZORDER_001 | Info | ZORDER is deprecated for new tables; use Liquid Clustering |
| CODE_ANALYZE_001 | Info | Missing ANALYZE TABLE after large writes; statistics may be stale |
| CODE_FLOAT_FINANCIAL_001 | Warning | FLOAT/DOUBLE for financial columns loses precision; use DECIMAL |
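
As a sketch of the atomic-replace and DECIMAL recommendations, the statement below rebuilds a table in one step instead of issuing DROP TABLE followed by CREATE TABLE. The table names, columns, and the `rebuild_sales_daily` helper are hypothetical, and the statement assumes a Delta table in an environment that supports `CREATE OR REPLACE TABLE ... AS SELECT`.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Imported for type checkers only, so this file loads without Spark.
    from pyspark.sql import DataFrame, SparkSession

# A single statement: readers see either the old table or the new one,
# never a window in which the table is missing (CODE_DROP_CREATE_001).
# DECIMAL keeps currency amounts exact (CODE_FLOAT_FINANCIAL_001).
REBUILD_SALES_DAILY = """
CREATE OR REPLACE TABLE sales_daily AS
SELECT
    order_date,
    CAST(SUM(amount) AS DECIMAL(18, 2)) AS revenue
FROM sales
GROUP BY order_date
"""


def rebuild_sales_daily(spark: "SparkSession") -> "DataFrame":
    # Static SQL text: nothing is interpolated into the statement.
    return spark.sql(REBUILD_SALES_DAILY)
```

The precision and scale in `DECIMAL(18, 2)` are placeholders; pick values that match the currency and the largest totals the table is expected to hold.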

DLT Pipelines

| Rule ID | Severity | Description |
| --- | --- | --- |
| CODE_DLT_PARTITION_001 | Info | DLT PARTITIONED BY / partition_cols; prefer Liquid Clustering |
| CODE_DLT_CDC_ORDER_001 | Critical | APPLY AS DELETE WHEN in wrong clause order causes incorrect CDC |
| CODE_READ_FILES_SCHEMA_001 | Info | read_files() without schemaHints: schema inference on every run |

Released under the Elastic License 2.0.