Schema Validation
CatalystOps tracks DataFrame schemas across your file and validates column references and type usage at edit time — no cluster required.
What Gets Validated
Schema validation activates when a StructType or DDL schema string is defined in the same file. CatalystOps extracts the schema and propagates it through transformations.
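For intuition, extracting a schema from a DDL string can be sketched in a few lines of plain Python. This is a hypothetical simplification for illustration — `parse_ddl_schema` is not a CatalystOps API, and nested or parameterized types are out of scope:

```python
def parse_ddl_schema(ddl: str) -> dict:
    """Parse a flat DDL schema string like 'id INT, name STRING'
    into a {column: type} mapping (nested types not handled)."""
    schema = {}
    for field in ddl.split(","):
        name, dtype = field.strip().split(None, 1)
        schema[name] = dtype.upper()
    return schema

print(parse_ddl_schema("user_id STRING, event_ts TIMESTAMP"))
# {'user_id': 'STRING', 'event_ts': 'TIMESTAMP'}
```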
Column Existence
Unknown column names are flagged with a "did you mean?" suggestion when a close match exists:
schema = StructType([
StructField("user_id", StringType()),
StructField("event_ts", TimestampType()),
])
df = spark.createDataFrame(data, schema)
df.select("user_Id") # ⚠ Unknown column — did you mean 'user_id'?Type Mismatches
Using a function that expects a specific type on an incompatible column:
# event_ts is TimestampType, not a numeric type
df.filter(col("event_ts") > 100) # ⚠ Type mismatch

Union / Intersect Column Order
union() and intersect() align by position, not name. CatalystOps detects when the same column appears in different positions across two DataFrames — a silent wrong-result bug at runtime:
a = spark.createDataFrame([], "id INT, name STRING")
b = spark.createDataFrame([], "name STRING, id INT") # columns swapped
a.union(b) # ⚠ Column order mismatch — 'id' aligns with 'name'
# Use unionByName() instead

Join Condition Types
Type mismatches in join keys:
orders.join(users, orders["user_id"] == users["id"])
# ⚠ if user_id is STRING and id is INT

Schema Propagation
Schemas are tracked through common transformations:
- .filter() — schema unchanged
- .select() — schema updated to selected columns
- .drop() — column removed from schema
- .withColumn() — column added or replaced
- .withColumnRenamed() — column renamed
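These propagation rules can be modeled as simple dictionary operations. The sketch below is a hypothetical model of the behavior, not CatalystOps internals — all function names are illustrative:

```python
def select(schema: dict, *cols: str) -> dict:
    # schema narrows to the selected columns, in order
    return {c: schema[c] for c in cols}

def drop(schema: dict, col: str) -> dict:
    # column removed; dropping a missing column is a no-op, as in Spark
    return {c: t for c, t in schema.items() if c != col}

def with_column(schema: dict, col: str, dtype: str) -> dict:
    # column added, or its type replaced if it already exists
    return {**schema, col: dtype}

def with_column_renamed(schema: dict, old: str, new: str) -> dict:
    return {new if c == old else c: t for c, t in schema.items()}

schema = {"user_id": "STRING", "event_ts": "TIMESTAMP"}
schema = with_column(schema, "event_date", "DATE")
schema = drop(schema, "event_ts")
print(schema)  # {'user_id': 'STRING', 'event_date': 'DATE'}
```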
What Is Skipped
External sources without an explicit schema are skipped to avoid false positives:
spark.table("my_catalog.my_table") # skipped — schema unknown at edit time
spark.read.parquet("s3://...") # skipped — no .schema() call

Add .schema(your_schema) to a read call for CatalystOps to validate downstream operations.
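To sketch what an explicit schema buys you: once the columns and types are known, references can be checked at edit time, with the "did you mean" hint built from the standard library's difflib. This is an illustrative model, not the CatalystOps implementation — check_column is a hypothetical name:

```python
import difflib

def check_column(schema: dict, name: str):
    """Return a warning string for an unknown column, else None."""
    if name in schema:
        return None
    # suggest the closest known column name, if any is similar enough
    close = difflib.get_close_matches(name, schema, n=1)
    hint = f"; did you mean '{close[0]}'?" if close else ""
    return f"Unknown column '{name}'{hint}"

schema = {"user_id": "STRING", "event_ts": "TIMESTAMP"}
print(check_column(schema, "user_Id"))
# Unknown column 'user_Id'; did you mean 'user_id'?
```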