# Applying Software Engineering Best Practices in Databricks: A Modular PySpark Pipeline
Many teams adopt Databricks for large-scale data processing but quickly fall into a common trap: business logic living inside notebooks.
While notebooks are great for exploration, production pipelines require the same rigor as any software system. Without proper structure, Databricks projects become difficult to test, maintain, and extend.
This article shows how to apply software engineering and data engineering best practices in a Databricks project by:
- Separating orchestration from business logic
- Organizing code in a modular repository
- Keeping notebooks as thin entrypoints
- Structuring transformations as reusable functions
- Building maintainable PySpark pipelines
We will walk through a simple but production-style pipeline architecture.
## Core Principle: Notebooks Are Entry Points, Not Logic Containers
In many Databricks projects, notebooks contain everything — transformations, data validation, table creation, pipeline orchestration, and helper functions. This leads to large notebooks that are hard to test, difficult to reuse, hard to version, and fragile in production.
Instead, treat notebooks as entrypoints. Their responsibility should be limited to:
- Reading source data
- Calling transformation functions
- Creating tables if needed
- Writing the results
All business logic should live in Python modules inside a repository. This separation is the single most impactful change you can make to a Databricks project.
## A Production-Ready Repository Structure
A clean modular structure could look like this:
```text
data-pipeline/
│
├── notebooks/
│   └── pipeline_entrypoint.py
│
├── src/
│   ├── transformations/
│   │   └── sales_transformation.py
│   │
│   ├── tables/
│   │   └── table_manager.py
│   │
│   └── utils/
│       └── spark_utils.py
│
├── tests/
│   └── test_transformations.py
│
├── pyproject.toml
└── README.md
```

| Layer | Purpose |
|---|---|
| `notebooks/` | Orchestration entrypoints |
| `src/transformations/` | Business logic |
| `src/tables/` | Table management |
| `src/utils/` | Reusable helpers |
| `tests/` | Unit tests |
This structure allows the pipeline to be testable, reusable, maintainable, and production-ready.
## Writing Transformations as Pure Functions
Business logic should live in transformation modules. Each transformation takes a DataFrame in and returns a DataFrame out — no side effects, no Spark session creation, no IO operations.
`src/transformations/sales_transformation.py`

```python
from pyspark.sql import DataFrame
from pyspark.sql import functions as F


def transform_sales(df: DataFrame) -> DataFrame:
    """
    Clean and aggregate sales data by product.

    Filters out null prices and non-positive quantities,
    then aggregates total revenue per product.
    """
    cleaned = (
        df
        .filter(F.col("price").isNotNull())
        .filter(F.col("quantity") > 0)
    )
    aggregated = (
        cleaned
        .groupBy("product_id")
        .agg(F.sum("price").alias("total_revenue"))
    )
    return aggregated
```

Best practices for transformation functions:
- Pure functions: Given the same input, always produce the same output.
- No Spark session creation: The session is managed by the entrypoint.
- No IO operations: No reading from or writing to tables.
- No table writes: Writing is the entrypoint's responsibility.
- Type hints: Make the interface explicit with `DataFrame` annotations.
This pattern makes every transformation independently testable and composable. You can chain multiple transformations together, swap them out, or run them in isolation during development.
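As a sketch of what chaining can look like, here is a small, generic composition helper. The names `pipeline`, `double`, and `inc` are illustrative only; in a real pipeline each step would be a DataFrame-to-DataFrame function such as `transform_sales`, and the same helper would compose them unchanged:

```python
from functools import reduce
from typing import Callable, TypeVar

T = TypeVar("T")


def pipeline(*steps: Callable[[T], T]) -> Callable[[T], T]:
    """Compose transformation steps left to right into one callable."""
    def run(value: T) -> T:
        # Feed the output of each step into the next one.
        return reduce(lambda acc, step: step(acc), steps, value)
    return run


# Illustrated with plain integers; with Spark, each step would take
# and return a DataFrame (e.g. pipeline(clean_sales, aggregate_revenue)).
double = lambda x: x * 2
inc = lambda x: x + 1
run = pipeline(double, inc)
```

Because each step is a pure function, the composed pipeline is itself a pure function, and any prefix of it can be run in isolation during development.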
## Managing Tables Safely
Production pipelines often need to ensure tables exist before writing. Encapsulating this logic in a dedicated module keeps it reusable and out of the transformation layer.
`src/tables/table_manager.py`

```python
from pyspark.sql import SparkSession


def create_table_if_not_exists(spark: SparkSession, table_name: str, schema: str) -> None:
    """
    Ensure a Delta table exists with the given schema.

    If the table already exists, this is a no-op.
    """
    spark.sql(
        f"""
        CREATE TABLE IF NOT EXISTS {table_name}
        {schema}
        USING DELTA
        """
    )
```

In a more advanced setup, this module could also handle schema evolution, table properties, partitioning strategies, and OPTIMIZE/ZORDER operations. The key is that table management logic lives in one place, not scattered across notebooks.
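One way to grow the module in that direction, while keeping the logic unit-testable, is to separate DDL construction from execution. The helper below is a hypothetical sketch (the function name and parameters are not from the article); because it only builds a string, it can be tested without a Spark session:

```python
from typing import Optional


def build_create_table_ddl(
    table_name: str,
    columns: dict,
    partition_by: Optional[list] = None,
) -> str:
    """Build a CREATE TABLE IF NOT EXISTS statement for a Delta table.

    columns maps column names to SQL types, e.g. {"product_id": "STRING"}.
    """
    cols = ", ".join(f"{name} {dtype}" for name, dtype in columns.items())
    ddl = f"CREATE TABLE IF NOT EXISTS {table_name} ({cols}) USING DELTA"
    if partition_by:
        ddl += f" PARTITIONED BY ({', '.join(partition_by)})"
    return ddl
```

The table manager would then pass the built statement to `spark.sql(...)`, keeping the string-building logic covered by fast local tests.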
## The Notebook Entry Point
The notebook itself becomes a thin orchestration layer. It reads data, calls transformations, ensures the target table exists, and writes the results. Nothing more.
`notebooks/pipeline_entrypoint.py`

```python
from pyspark.sql import SparkSession

from src.transformations.sales_transformation import transform_sales
from src.tables.table_manager import create_table_if_not_exists

spark = SparkSession.builder.getOrCreate()

# Configuration
SOURCE_TABLE = "raw.sales"
TARGET_TABLE = "analytics.product_revenue"

# Step 1: Read source data
source_df = spark.table(SOURCE_TABLE)

# Step 2: Apply transformation
result_df = transform_sales(source_df)

# Step 3: Ensure target table exists
create_table_if_not_exists(
    spark,
    TARGET_TABLE,
    "(product_id STRING, total_revenue DOUBLE)",
)

# Step 4: Write results
(
    result_df
    .write
    .format("delta")
    .mode("overwrite")
    .saveAsTable(TARGET_TABLE)
)
```

The pipeline flow is immediately readable: read → transform → ensure table → write.
Anyone on the team can open this notebook and understand what it does in seconds. The details of how the transformation works live in the module, where they can be tested and reviewed independently.
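The module-level constants can grow into a small configuration object as the pipeline gains settings. As a sketch (the `PipelineConfig` class and `for_environment` helper are illustrative, not from the article), a frozen dataclass keeps configuration explicit and testable:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineConfig:
    """Pipeline settings collected in one immutable object.

    Defaults mirror the example tables above; in practice values might
    come from widgets, environment variables, or bundle variables.
    """
    source_table: str = "raw.sales"
    target_table: str = "analytics.product_revenue"
    write_mode: str = "overwrite"


def for_environment(env: str) -> PipelineConfig:
    """Hypothetical helper: prefix table namespaces for non-prod runs."""
    prefix = "" if env == "prod" else f"{env}_"
    return PipelineConfig(
        source_table=f"{prefix}raw.sales",
        target_table=f"{prefix}analytics.product_revenue",
    )
```

The entrypoint would then read `config.source_table` and `config.target_table` instead of bare constants, and tests can construct a `PipelineConfig` pointing at temporary tables.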
## Adding Unit Tests
With transformations extracted into pure functions, unit testing becomes straightforward. You can use pytest with a local Spark session to validate logic without connecting to any external systems.
`tests/test_transformations.py`

```python
import pytest
from pyspark.sql import SparkSession

from src.transformations.sales_transformation import transform_sales


@pytest.fixture(scope="session")
def spark():
    """Provide a local Spark session for tests; no cluster required."""
    return (
        SparkSession.builder
        .master("local[1]")
        .appName("unit-tests")
        .getOrCreate()
    )


def test_sales_transformation(spark):
    """
    Validate that the sales transformation:
    - Filters out null prices
    - Filters out non-positive quantities
    - Aggregates revenue by product
    """
    data = [
        ("A", 10.0, 2),
        ("A", 5.0, 1),
        ("B", None, 3),  # Should be filtered out
    ]
    df = spark.createDataFrame(data, ["product_id", "price", "quantity"])

    result = transform_sales(df)

    # Only product A should remain (B has null price)
    assert result.count() == 1

    # Total revenue for product A should be 15.0
    row = result.collect()[0]
    assert row["product_id"] == "A"
    assert row["total_revenue"] == 15.0
```

You can run these tests locally with `pytest` or in your CI/CD pipeline. This is one of the biggest advantages of modular code — you can validate business logic without deploying to Databricks.
## Deployment with Databricks Asset Bundles (DABs)
This modular architecture integrates naturally with Databricks Asset Bundles (DABs), the recommended way to deploy and manage Databricks projects.
DABs allow you to:
- Deploy pipelines as structured projects
- Version infrastructure and jobs alongside code
- Define jobs and tasks declaratively in YAML
- Promote pipelines across environments (dev → staging → production)
With DABs, your project layout extends to include a `databricks.yml` bundle configuration:

```text
data-pipeline/
│
├── databricks.yml
├── notebooks/
├── src/
├── tests/
└── resources/
```

The `databricks.yml` file defines your jobs, clusters, and deployment targets. Combined with modular source code, this gives you a fully versioned, reproducible deployment workflow.
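As a rough sketch, a minimal bundle configuration might look like the following. The workspace host, job name, and task key are placeholders, and a real configuration would typically add cluster specifications and per-target overrides:

```yaml
# Illustrative bundle configuration; all names are examples only
bundle:
  name: data-pipeline

targets:
  dev:
    mode: development
    workspace:
      host: https://<your-workspace>.cloud.databricks.com

resources:
  jobs:
    product_revenue_pipeline:
      name: product-revenue-pipeline
      tasks:
        - task_key: transform_sales
          notebook_task:
            notebook_path: ./notebooks/pipeline_entrypoint.py
```

Deploying with the Databricks CLI then promotes the same versioned bundle from development through to production targets.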
## Observability: Logging and Alerting
A modular architecture makes it easy to add observability without cluttering your business logic:
- Structured logging: Add log statements in your entrypoint without touching transformation code.
- Execution metrics: Track row counts, execution times, and data quality scores.
- Row-count validation: Compare input and output counts to detect data loss.
- Anomaly detection: Flag unexpected changes in data volume or distribution.
- Failure alerting: Integrate with PagerDuty, Slack, or email for pipeline failures.
Because orchestration and logic are separated, you can wrap any transformation call with logging and metrics without modifying the transformation itself.
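One lightweight way to do that wrapping is a decorator applied in the entrypoint. The sketch below is illustrative (`with_metrics` is not from the article); it works on any object that exposes a `.count()` method, which is how it logs row counts for Spark DataFrames:

```python
import functools
import logging
import time

logger = logging.getLogger("pipeline")


def with_metrics(transform):
    """Wrap a DataFrame-in/DataFrame-out transformation with
    timing and row-count logging.

    Note: calling .count() on a real Spark DataFrame triggers a job,
    so in production you might log counts only around critical steps.
    """
    @functools.wraps(transform)
    def wrapper(df):
        rows_in = df.count()
        start = time.perf_counter()
        result = transform(df)
        elapsed = time.perf_counter() - start
        logger.info(
            "%s: %d -> %d rows in %.2fs",
            getattr(transform, "__name__", "transform"),
            rows_in,
            result.count(),
            elapsed,
        )
        return result
    return wrapper
```

The entrypoint can then call `with_metrics(transform_sales)(source_df)` without any change to the transformation module, keeping observability concerns entirely in the orchestration layer.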
## Better Job Structure and Lineage
Splitting logic into modules directly improves your Databricks job structure:
- Well-defined job tasks: Each task maps to a clear step in the pipeline.
- Clear pipeline boundaries: Input and output contracts are explicit.
- Improved data lineage: Unity Catalog can trace data flow through well-structured tables.
- Easier monitoring: Smaller, focused tasks are easier to monitor and debug than monolithic notebooks.
## Easier Orchestration with External Systems
This architecture integrates cleanly with external orchestrators like Airflow, Dagster, or Prefect. Because your notebooks are thin entrypoints, orchestrators can trigger them as tasks in a broader pipeline:
```text
raw_ingestion → transformation → analytics_tables
```

Each stage is a separate, testable unit. If the transformation step fails, you can re-run it without re-ingesting raw data. If you need to add a new downstream consumer, you add a new task — not more code in an existing notebook.
## Final Thoughts
By treating notebooks as thin orchestration layers and moving logic into modular Python modules, teams can apply the same best practices used in modern software engineering.
The result is a data platform that is:
- Maintainable: Changes are localized and easy to review.
- Testable: Business logic can be validated without a running cluster.
- Scalable: New pipelines follow the same patterns without reinventing the wheel.
- Production-ready: Deployments are reproducible and version-controlled.
Good data engineering is ultimately good software engineering. The tools may be different — PySpark instead of REST APIs, Delta Lake instead of PostgreSQL — but the principles of modularity, testing, and separation of concerns are universal.