All Rules Reference

This page lists every rule CatalystOps checks. For context and fix examples, see Local Analysis and Plan Analysis.

Static Code Rules

Critical

| Rule ID | Name | Description |
| --- | --- | --- |
| CODE_COLLECT_001 | collect() Usage | Brings all data to the driver; OOM risk |
| CODE_ITER_COLLECT_001 | for-loop over collect() | Row-by-row iteration on the driver |
| CODE_CROSSJOIN_001 | Cross Join | Cartesian product; output row count is the product of both input row counts |
| CODE_SQL_INJECT_001 | SQL Injection | f-string in spark.sql() |
| CODE_KAFKA_COMMIT_001 | Kafka Auto-Commit | Can cause data loss or duplication |
| CODE_DLT_CDC_ORDER_001 | DLT CDC Clause Order | APPLY AS DELETE WHEN in wrong order |
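
To make the idea of a static code rule concrete, here is a minimal sketch of how two of the critical patterns could be detected with Python's standard `ast` module. It is not CatalystOps's implementation, which this page does not describe; the `find_issues` helper and the sample snippet are hypothetical, and the matching is deliberately naive (any zero-argument `.collect()` call, any f-string passed as the first argument to a `.sql()` call).

```python
import ast


def find_issues(source: str) -> list[tuple[str, int]]:
    """Return (rule_id, line_number) pairs for two critical patterns.

    Hypothetical helper for illustration only; the rule IDs are taken
    from the table above, the matching logic is an assumption.
    """
    issues = []
    for node in ast.walk(ast.parse(source)):
        # Only method-style calls such as df.collect() or spark.sql(...).
        if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Attribute):
            continue
        # A zero-argument .collect() call is treated as CODE_COLLECT_001.
        if node.func.attr == "collect" and not node.args and not node.keywords:
            issues.append(("CODE_COLLECT_001", node.lineno))
        # An f-string (ast.JoinedStr) as the first argument to .sql(...)
        # is treated as CODE_SQL_INJECT_001.
        if (
            node.func.attr == "sql"
            and node.args
            and isinstance(node.args[0], ast.JoinedStr)
        ):
            issues.append(("CODE_SQL_INJECT_001", node.lineno))
    # ast.walk is breadth-first, so sort by line number for stable output.
    return sorted(issues, key=lambda issue: issue[1])


# The sample is only parsed, never executed, so no Spark session is needed.
sample = '''
rows = df.collect()
spark.sql(f"SELECT * FROM users WHERE name = '{name}'")
spark.sql("SELECT * FROM users WHERE name = :name", args={"name": name})
'''

print(find_issues(sample))
# [('CODE_COLLECT_001', 2), ('CODE_SQL_INJECT_001', 3)]
```

The third statement in the sample passes a plain string with a named parameter marker (supported by `spark.sql` in recent PySpark versions), so it is not flagged. A real checker would also need to resolve what object `.collect()` and `.sql()` are called on, which this sketch skips.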

Warning

| Rule ID | Name | Description |
| --- | --- | --- |
| CODE_PANDAS_001 | toPandas() | Brings all data to the driver |
| CODE_COUNT_001 | count() > 0 | Use isEmpty() instead |
| CODE_COALESCE_001 | coalesce(1) | Funnels all data to one partition |
| CODE_ORDERBY_001 | Global orderBy | Full shuffle |
| CODE_REPARTITION_001 | repartition before write | Use coalesce() |
| CODE_WINDOW_001 | Window without partitionBy | Global window |
| CODE_WITHCOLUMN_LOOP_001 | withColumn in loop | New plan node per iteration |
| CODE_REPEATED_ACTIONS_001 | Repeated actions without cache | Recomputes the DataFrame |
| CODE_UDF_FILTER_001 | UDF in filter() | Blocks predicate pushdown |
| CODE_STREAMING_TRIGGER_001 | No .trigger() | Continuous micro-batches |
| CODE_STREAMING_WATERMARK_001 | Streaming groupBy without watermark | Unbounded state |
| CODE_STREAMING_INNER_JOIN_001 | Streaming inner join | Silently drops late events |
| CODE_DYNAMIC_ALLOC_001 | Dynamic allocation on streaming | Cluster instability |
| CODE_AUTOLOADER_RATE_001 | Auto Loader without rate limit | Unbounded ingestion |
| CODE_CHECKPOINT_DBFS_001 | Checkpoint on DBFS | Use cloud storage |
| CODE_OPTIMIZE_MERGE_001 | OPTIMIZE after every MERGE | Latency spikes |
| CODE_DROP_CREATE_001 | DROP + CREATE TABLE | Non-atomic; use CREATE OR REPLACE |
| CODE_FLOAT_FINANCIAL_001 | FLOAT for financial data | Use DECIMAL |
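
The reasoning behind CODE_FLOAT_FINANCIAL_001 can be shown without Spark at all. The sketch below uses Python's 64-bit `float` and the standard `decimal` module as stand-ins for Spark's FLOAT and DECIMAL column types, which is an assumption made for illustration: binary floating point cannot represent most decimal fractions exactly, so sums of monetary amounts drift, while a decimal type keeps them exact.

```python
from decimal import Decimal

# Binary floating point: 0.1 and 0.2 have no exact representation,
# so their sum is not exactly 0.3.
float_total = 0.1 + 0.2

# Decimal arithmetic: amounts are stored as exact decimal digits.
decimal_total = Decimal("0.10") + Decimal("0.20")

print(float_total == 0.3)                # False
print(decimal_total == Decimal("0.30"))  # True
```

Spark's FLOAT is a 32-bit type, so its rounding error is larger than what Python's 64-bit `float` shows here.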

Info

| Rule ID | Name | Description |
| --- | --- | --- |
| CODE_UDF_001 | UDF Usage | Prevents query plan optimization |
| CODE_SCHEMA_001 | Schema Inference | Extra data pass |
| CODE_SELECT_STAR_001 | SELECT * | Reads unnecessary columns |
| CODE_WRITE_MODE_001 | Missing write mode | Defaults to error |
| CODE_REPRO_001 | Repeated source without cache | Multiple scans |
| CODE_ROCKSDB_001 | Stateful streaming without RocksDB | State store performance |
| CODE_UNNAMED_QUERY_001 | Unnamed streaming query | Hard to monitor |
| CODE_MERGE_DV_001 | MERGE without Deletion Vectors | Performance on large tables |
| CODE_MERGE_RLC_001 | MERGE without Row-Level Concurrency | Concurrency issues |
| CODE_ZORDER_001 | ZORDER | Deprecated; use Liquid Clustering |
| CODE_ANALYZE_001 | Missing ANALYZE TABLE | Stale statistics |
| CODE_DLT_PARTITION_001 | DLT PARTITIONED BY | Use Liquid Clustering |
| CODE_READ_FILES_SCHEMA_001 | read_files() without schemaHints | Schema inference per run |

Plan-Level Detectors

These run after a dry run or job run analysis and require an actual Catalyst execution plan:

| Name | Description |
| --- | --- |
| BroadcastHashJoin (missing) | Sort-merge join where broadcast would be faster |
| CartesianProduct | Cartesian join in the physical plan |
| ShuffleExchange | Unnecessary shuffle |
| SinglePartitionBottleneck | Exchange SinglePartition: all data on one executor |
| SortAggregate | Sort-based aggregation (prefer hash-based) |
| GlobalWindow / RunningWindowFunction | Global window, Photon-aware |
| RepeatedTableScan | Same table scanned multiple times |
| MissingPartitionFilter | Partition filters empty; full table scan |
| MissingTableStatistics | Table has no statistics |
| CacheSpill | Cached data spilling to disk |
| TooFewPartitions | Low parallelism for data size |
| CrossJoin | Cartesian join (also caught by local analysis) |
| UnionSchemaMismatch | Column order mismatch (also caught by local analysis) |
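
As a rough illustration of what a plan-level detector works from, the sketch below scans the text of a physical plan for three of the operator names listed above. It is an assumption-laden toy, not how CatalystOps analyzes plans: the `PATTERNS` mapping and `detect` function are hypothetical, the plan text is hand-written for the example rather than captured from a real run, and a real detector would presumably inspect the plan tree rather than match strings.

```python
import re

# Hypothetical mapping from detector name (from the table above) to a
# pattern expected in the physical plan text.
PATTERNS = {
    "CartesianProduct": re.compile(r"\bCartesianProduct\b"),
    "SinglePartitionBottleneck": re.compile(r"\bExchange SinglePartition\b"),
    "SortAggregate": re.compile(r"\bSortAggregate\b"),
}


def detect(plan_text: str) -> list[str]:
    """Return the names of detectors whose pattern appears in the plan."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(plan_text)]


# Hand-written plan text in the style of Spark's explain() output.
plan = """
== Physical Plan ==
*(3) SortAggregate(key=[id#1L], functions=[max(name#2)])
+- *(2) Sort [id#1L ASC NULLS FIRST], false, 0
   +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=42]
      +- *(1) Scan ExistingRDD[id#1L,name#2]
"""

print(detect(plan))
# ['SinglePartitionBottleneck', 'SortAggregate']
```

Detectors such as MissingTableStatistics or TooFewPartitions depend on statistics and sizes rather than operator names, so they could not be expressed as simple text matches like these.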

Released under the Elastic License 2.0.