Keeping Databricks Declarative Automation Bundles (formerly Databricks Asset Bundles) Modular with Jinja2

    Learn how to use Jinja2 templating to keep Databricks Declarative Automation Bundles (formerly Databricks Asset Bundles / DABs) DRY, composable, and environment-aware with reusable fragments and conditional logic.

    By Nicola De Lillo · 14 min read
    Databricks
    Declarative Automation Bundles
    Databricks Asset Bundles
    DABs
    Jinja2
    YAML
    CI/CD
    data engineering
    infrastructure
    best practices


    Databricks Declarative Automation Bundles (formerly Databricks Asset Bundles, still commonly called DABs) are a great way to manage your data pipelines as code — version-controlled YAML that defines jobs, clusters, schedules, and permissions. But as your project grows beyond a handful of jobs, your bundle configuration gets unwieldy fast.

    Naming note: Databricks renamed Asset Bundles to Declarative Automation Bundles. In this article, we use both terms so teams can map legacy docs and current terminology.

    DABs support include: directives to pull in multiple YAML files, but that's about as far as native modularity goes. There's no inheritance, no conditionals, no environment-aware logic, and no reusable fragments. You end up copy-pasting cluster definitions, permission blocks, and git source configs across every job.

    This post walks through how we solved this by introducing Jinja2 templating as a pre-render step — keeping our DABs DRY, readable, and truly environment-aware.


    The Problem: YAML Repetition at Scale

    Consider a project with multiple jobs. Each one needs:

    • A cluster definition (same policy, same Spark version, same env vars)
    • A git source block (branch in dev/staging, tag in production)
    • Permissions (needed in dev, not in production)
    • A schedule (only in production)
    • Tags (same across all jobs)

    In vanilla DABs, every job file repeats all of this. A 5-job project means 5 copies of the cluster block, 5 copies of the git source, 5 copies of the tags. Change the Spark version? Touch 5 files.

    DABs include: only works at the top level — it merges separate YAML files into the bundle. But it doesn't give you reusable fragments within a job definition. There's no !include for nested blocks, no YAML anchors across files, and no way to say "use this cluster block, but only add a schedule in production."


    The Solution: Jinja2 as a Pre-Render Layer

    The idea is simple:

    1. Write your bundle resources as Jinja2 templates (.yml.j2)
    2. Extract shared fragments into include files
    3. Run a Python script that renders the templates into plain YAML
    4. Point databricks.yml at the rendered output

    The Databricks CLI never sees Jinja — it only gets the final, valid YAML, because Jinja runs before the bundle is validated or deployed.

    Project Structure

    .
    ├── databricks.yml                  # Root bundle config, includes rendered output
    ├── bundle/
    │   ├── includes/                   # Reusable Jinja fragments
    │   │   ├── base-job.yml.j2
    │   │   ├── clusters.yml.j2
    │   │   └── schedule.yml.j2
    │   └── workflows/                  # One template per job
    │       ├── my_etl_pipeline.yml.j2
    │       └── my_ml_pipeline.yml.j2
    ├── dist-bundle/                    # Rendered output (gitignored)
    │   ├── my_etl_pipeline.yml
    │   └── my_ml_pipeline.yml
    └── scripts/
        └── render_bundle.py            # The render script

    The key insight: bundle/ contains templates, dist-bundle/ contains rendered YAML. Only the rendered files are included by databricks.yml.


    Building the Include Fragments

    Base Job

    The base job fragment handles git source, permissions, and tags — the stuff every job needs:

    # bundle/includes/base-job.yml.j2
    base-job: &base-job
      {% if environment != 'production' %}
      permissions:
        - group_name: "my_team_group"
          level: "CAN_MANAGE"
      {% endif %}
    
      git_source:
        git_url: https://github.com/my-org/my-repo.git
        git_provider: gitHub
        {% if environment == 'production' %}
        git_tag: ${var.version}
        {% else %}
        git_branch: ${bundle.git.branch}
        {% endif %}
    
      tags:
        "team": DataEngineering
        "environment": {{ environment }}

    Notice how this single fragment handles three concerns:

    • Permissions: Only added outside production (developers need CAN_MANAGE to iterate; production is locked down)
    • Git source: Production pins to a release tag; other environments track the branch
    • Tags: Environment name is injected dynamically
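    To make the conditional behavior concrete, here is a small sketch that renders a trimmed copy of the fragment for two environments. The fragment text is abbreviated from base-job.yml.j2 above; ${var.version} and ${bundle.git.branch} are DABs variables, which Jinja passes through as plain text:

```python
# Render an abbreviated copy of the base-job fragment for two environments.
# ${var.version} / ${bundle.git.branch} are DABs interpolations — opaque to Jinja.
from jinja2 import Template

FRAGMENT = """\
base-job: &base-job
  {% if environment != 'production' %}
  permissions:
    - group_name: "my_team_group"
      level: "CAN_MANAGE"
  {% endif %}
  git_source:
    git_url: https://github.com/my-org/my-repo.git
    {% if environment == 'production' %}
    git_tag: ${var.version}
    {% else %}
    git_branch: ${bundle.git.branch}
    {% endif %}
"""

prod = Template(FRAGMENT).render(environment="production")
dev = Template(FRAGMENT).render(environment="development")

# Production pins a release tag and drops the permissions block;
# development tracks the branch and grants CAN_MANAGE.
print("git_tag" in prod, "permissions" in prod)
print("git_branch" in dev, "permissions" in dev)
```

    The same fragment produces two different shapes of YAML from one source of truth, which is exactly what vanilla DABs variables cannot do.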

    Clusters

    # bundle/includes/clusters.yml.j2
    my_clusters: &my_clusters
      job_clusters:
        - job_cluster_key: main_cluster
          new_cluster:
            policy_id: 000EE9580030D241
            spark_version: 16.4.x-scala2.12
            spark_env_vars:
              ENVIRONMENT: {{ environment }}
            autoscale:
              min_workers: 2
              max_workers: 8

    One place to update the Spark version, cluster policy, or scaling config.

    Schedule

    # bundle/includes/schedule.yml.j2
    everyday_5am: &everyday_5am
      quartz_cron_expression: "0 0 5 * * ?"
      timezone_id: "Europe/Rome"
      pause_status: "PAUSED"

    Composing a Workflow

    With the fragments in place, a workflow template is remarkably concise:

    # bundle/workflows/my_etl_pipeline.yml.j2
    {% include "includes/base-job.yml.j2" %}
    {% include "includes/clusters.yml.j2" %}
    {% include "includes/schedule.yml.j2" %}
    
    resources:
      jobs:
        my_etl_pipeline:
          name: "{{ job_prefix }} my_etl_pipeline"
          <<: [*base-job, *my_clusters]
          {% if environment == 'production' %}
          schedule: *everyday_5am
          {% endif %}
    
          tasks:
            - task_key: ingest_raw_data
              notebook_task:
                notebook_path: src/pipelines/ingest_raw_data
              job_cluster_key: main_cluster
    
            - task_key: transform_silver
              depends_on:
                - task_key: ingest_raw_data
              notebook_task:
                notebook_path: src/pipelines/transform_silver
              job_cluster_key: main_cluster

    The {% include %} directives pull in the YAML anchors (&base-job, &my_clusters, &everyday_5am), and the <<: merge key applies them. Adding a new job is a matter of copying this template and changing the job name and tasks — the infra boilerplate is inherited.

    How YAML Anchors + Jinja Include Work Together

    This is the trick that makes it all click. Jinja's {% include %} inserts the fragment's text verbatim into the template before YAML parsing. So the rendered output is a single YAML document where the anchors (&base-job) and aliases (*base-job) are all in scope.

    The <<: merge key is standard YAML — it merges all fields from the referenced anchor into the current mapping. By listing multiple anchors (<<: [*base-job, *my_clusters]), you compose multiple fragments into one job definition.
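    You can verify the merge behavior with PyYAML alone — a minimal sketch using a hand-written document shaped like the rendered output (anchor contents are trimmed for brevity):

```python
# Demonstrate that anchors defined earlier in a document compose into a
# mapping via the <<: merge key — PyYAML's safe_load resolves both.
import yaml

RENDERED = """
base-job: &base-job
  tags:
    team: DataEngineering
my_clusters: &my_clusters
  job_clusters:
    - job_cluster_key: main_cluster
job:
  <<: [*base-job, *my_clusters]
  name: demo_job
"""

job = yaml.safe_load(RENDERED)["job"]
# The merged mapping holds the fields of both anchors plus its own keys.
print(sorted(job))  # ['job_clusters', 'name', 'tags']
```

    This is the same structure the rendered workflow files have, so a quick yaml.safe_load is a cheap smoke test before databricks bundle validate.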


    The Render Script

    The render script is intentionally simple — just Jinja2 with a FileSystemLoader:

    # scripts/render_bundle.py
    import os
    import shutil
    from enum import StrEnum
    from pathlib import Path
    from typing import Any
    
    from jinja2 import Environment, FileSystemLoader
    
    RESOURCES_DIR = Path(__file__).parent.parent / "bundle"
    OUTPUT_DIR = Path(__file__).parent.parent / "dist-bundle"
    
    
    class BundleEnvironment(StrEnum):
        DEVELOPMENT = "development"
        STAGING = "staging"
        PRODUCTION = "production"
    
    
    def cleanup(directory: Path) -> None:
        """Recreate the output directory from scratch."""
        shutil.rmtree(directory, ignore_errors=True)
        directory.mkdir(parents=True, exist_ok=True)
    
    
    def build_context() -> dict[str, Any]:
        """Build template context from environment variables."""
        target = os.getenv("BUNDLE_TARGET")
        if target not in {e.value for e in BundleEnvironment}:
            raise ValueError(
                f"BUNDLE_TARGET must be one of {[e.value for e in BundleEnvironment]}, "
                f"got: {target!r}"
            )
    
        job_prefix_map = {
            "development": "[${var.version}]",
            "staging": "[${bundle.target}] [master]",
            "production": "[${bundle.target}] [${var.version}]",
        }
    
        return {
            "environment": target,
            "job_prefix": job_prefix_map[target],
        }
    
    
    def render(resources: Path, output: Path) -> None:
        """Render workflow templates into the output directory."""
        env = Environment(loader=FileSystemLoader(resources))
        ctx = build_context()
    
        for filename in resources.rglob("workflows/*.yml.j2"):
            print(f"Rendering {filename.stem}...")
            template_path = filename.relative_to(resources)
            template = env.get_template(template_path.as_posix()).render(ctx)
            (output / template_path.stem).write_text(template)
    
        print("Done!")
    
    
    if __name__ == "__main__":
        cleanup(OUTPUT_DIR)
        render(RESOURCES_DIR, OUTPUT_DIR)

    Key design choices:

    • BUNDLE_TARGET is the only input — set as an environment variable by the Makefile or CI
    • job_prefix_map controls how jobs are named per environment, mixing DABs variables (${bundle.target}, ${var.version}) with static text
    • cleanup() always starts fresh — no stale rendered files from a previous target
    • Templates are discovered automatically via rglob("workflows/*.yml.j2"), so adding a new job doesn't require touching the script

    Wiring It Into the Makefile

    The render step slots into the Makefile chain naturally:

    BUNDLE_TARGET ?= development
    
    .PHONY: bundle-render
    bundle-render: install-dependencies
    	BUNDLE_TARGET=$(BUNDLE_TARGET) uv run python scripts/render_bundle.py
    
    .PHONY: bundle-validate
    bundle-validate: bundle-render
    	databricks bundle validate --target $(BUNDLE_TARGET)
    
    .PHONY: bundle-deploy
    bundle-deploy: bundle-validate
    	databricks bundle deploy --target $(BUNDLE_TARGET)

    The dependency chain is: bundle-deploy → bundle-validate → bundle-render → install-dependencies. Templates are always rendered fresh before any bundle operation.


    Wiring It Into databricks.yml

    The root bundle config simply includes the rendered output:

    # databricks.yml
    bundle:
      name: my_project
    
    include:
      - "dist-bundle/*.yml"
    
    variables:
      version:
        description: >
          Release version extracted from git (branch prefix or tag).
    
    targets:
      development:
        mode: development
        default: true
        workspace:
          host: https://my-databricks-instance.cloud.databricks.com
    
      staging:
        mode: production
        workspace:
          host: https://my-databricks-instance.cloud.databricks.com
        run_as:
          service_principal_name: 00000000-0000-0000-0000-000000000000
    
      production:
        mode: production
        workspace:
          host: https://my-databricks-instance.cloud.databricks.com
        run_as:
          service_principal_name: 00000000-0000-0000-0000-000000000000

    The include: ["dist-bundle/*.yml"] picks up whatever the render script produced. databricks.yml itself stays clean — just bundle metadata and target definitions.


    Testing the Templates

    Since the render script is pure Python, it's straightforward to test:

    # tests/test_render_bundle.py
    import os
    from pathlib import Path
    from unittest.mock import patch
    
    import pytest
    import yaml
    
    from scripts.render_bundle import build_context, render
    
    # Path to the actual bundle templates
    RESOURCES = Path(__file__).parent.parent / "bundle"
    
    
    class TestBuildContext:
        def test_development_context(self):
            with patch.dict(os.environ, {"BUNDLE_TARGET": "development"}):
                ctx = build_context()
            assert ctx["environment"] == "development"
            assert ctx["job_prefix"] == "[${var.version}]"
    
        def test_staging_context(self):
            with patch.dict(os.environ, {"BUNDLE_TARGET": "staging"}):
                ctx = build_context()
            assert ctx["environment"] == "staging"
            assert "[master]" in ctx["job_prefix"]
    
        def test_production_context(self):
            with patch.dict(os.environ, {"BUNDLE_TARGET": "production"}):
                ctx = build_context()
            assert ctx["environment"] == "production"
            assert "${var.version}" in ctx["job_prefix"]
    
        def test_invalid_target_raises(self):
            with patch.dict(os.environ, {"BUNDLE_TARGET": "invalid"}):
                with pytest.raises(ValueError):
                    build_context()
    
    
    class TestRenderIntegration:
        """Integration tests that render real templates and validate the output."""
    
        def test_production_renders_valid_yaml_with_schedule(self, tmp_path):
            with patch.dict(os.environ, {"BUNDLE_TARGET": "production"}):
                render(RESOURCES, tmp_path)
    
            for yml_file in tmp_path.glob("*.yml"):
                content = yaml.safe_load(yml_file.read_text())
                job = list(content["resources"]["jobs"].values())[0]
                assert "schedule" in job
                # Verify production uses git_tag, not git_branch
                assert "git_tag" in yaml.dump(content)
    
        def test_development_renders_without_schedule(self, tmp_path):
            with patch.dict(os.environ, {"BUNDLE_TARGET": "development"}):
                render(RESOURCES, tmp_path)
    
            for yml_file in tmp_path.glob("*.yml"):
                content = yaml.safe_load(yml_file.read_text())
                job = list(content["resources"]["jobs"].values())[0]
                assert "schedule" not in job

    You're testing real template rendering, not mocked YAML — so you catch Jinja syntax errors, broken anchors, and missing variables before they hit databricks bundle validate.


    Why Not Just Use DABs Variables?

    DABs does have ${var.*} variables, and they're great for simple string substitution. But they can't:

    • Conditionally include/exclude blocks (e.g., permissions only in non-prod, schedule only in prod)
    • Compose reusable fragments across jobs (clusters, git source, tags)
    • Generate job names with complex per-environment logic

    Jinja handles all of these. And since it runs before DABs, you can still use ${var.*} and ${bundle.*} in the rendered output — they're just opaque strings to Jinja.
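    A quick sanity check of that last point — Jinja only interprets its own {{ ... }} and {% ... %} delimiters, so DABs interpolations survive rendering verbatim:

```python
# ${...} has no meaning to Jinja; only {{ environment }} is substituted.
from jinja2 import Template

out = Template('name: "[{{ environment }}] ${var.version}"').render(environment="staging")
print(out)  # name: "[staging] ${var.version}"
```

    DABs then resolves ${var.version} at deploy time, in a second, independent substitution pass.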


    Why Not Native include: in the Templates?

    DABs include: merges top-level YAML files into the bundle. It doesn't support:

    • Including a fragment inside a job definition
    • YAML anchors defined in one file and referenced in another
    • Any conditional logic

    Jinja's {% include %} is textual inclusion — it pastes the fragment inline before YAML parsing, so anchors and aliases work across fragments. It's a fundamentally different (and more powerful) mechanism.
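    A minimal sketch of this mechanism, using an in-memory DictLoader in place of the render script's FileSystemLoader (the fragment name here is illustrative): the anchor is defined in one fragment and the alias in another, and the composed text parses as one YAML document.

```python
# {% include %} pastes fragment text inline BEFORE YAML parsing, so an
# anchor from includes/schedule.yml.j2 resolves inside the workflow.
import yaml
from jinja2 import DictLoader, Environment

env = Environment(loader=DictLoader({
    "includes/schedule.yml.j2": 'everyday_5am: &everyday_5am\n  timezone_id: "Europe/Rome"\n',
    "workflow.yml.j2": (
        '{% include "includes/schedule.yml.j2" %}\n'
        "job:\n"
        "  schedule: *everyday_5am\n"
    ),
}))

rendered = env.get_template("workflow.yml.j2").render()
doc = yaml.safe_load(rendered)
print(doc["job"]["schedule"])  # {'timezone_id': 'Europe/Rome'}
```

    Native DABs include: could never do this, because it merges already-parsed files — by then, each file's anchors are long gone.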


    Recap

    Concern            | Without Jinja                    | With Jinja
    -------------------|----------------------------------|--------------------------------
    Cluster config     | Copied in every job file         | Defined once in clusters.yml.j2
    Git source         | Copied, manually toggled         | Conditional in base-job.yml.j2
    Permissions        | Copied or forgotten              | Conditional in base-job.yml.j2
    Schedule           | Copied or forgotten              | Conditional per workflow
    New job            | Copy-paste full YAML             | 3 includes + job-specific tasks
    Environment logic  | Manual file edits                | {{ environment }} and {% if %}
    Testing            | databricks bundle validate only  | Python unit tests + validate

    The Jinja layer adds one small Python script and a bundle/ directory to your project. In return, you get composable, testable, environment-aware bundle configurations that scale cleanly from 1 job to 50.

    About the Author

    Nicola De Lillo is a data engineer with 3 years of experience building reliable data pipelines for anti-fraud and analytics platforms. He works primarily with dbt, PySpark, AWS Redshift, and Databricks, and enjoys sharing practical lessons through technical writing.
