
Databricks Shifts From Spark SQL to Open Declarative Pipelines

For years, Apache Spark users stitched together ETL and streaming jobs using ad-hoc Spark SQL, Python notebooks, and external schedulers. That approach scales poorly: too much glue code, duplicated logic, and fragile orchestration across batch and streaming. Databricks has now open sourced its Declarative Pipeline framework (donated upstream as Spark Declarative Pipelines) and introduced a new real-time execution mode that brings sub-second streaming performance to a broader range of workloads. Together, these moves represent a major evolution in how data pipelines are authored and operated in the Spark ecosystem.

From Queries to Pipelines: Why Declarative Matters

Spark SQL made individual queries declarative — you describe the result set and Spark figures out the execution plan. Declarative Pipelines extends that idea to entire multi-table dataflows: you specify what datasets should exist, how they depend on upstream data, and quality / retention policies; the engine infers ordering, incremental runs, retries, checkpointing, and materialization. This removes large amounts of “undifferentiated heavy lifting” that teams previously re-implemented in notebooks and jobs.
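The ordering half of that inference can be illustrated with a toy sketch: given only the dependencies each table declares, a valid run order falls out of a topological sort. The table names and the dictionary shape below are illustrative, not the real Spark Declarative Pipelines API.

```python
# Hypothetical sketch: deriving execution order from declared dependencies
# alone, instead of hand-wiring DAG edges in an external scheduler.
from graphlib import TopologicalSorter

# Each target table declares only what it reads from.
declared_tables = {
    "bronze_raw_orders": [],                       # ingests external data
    "silver_orders_clean": ["bronze_raw_orders"],  # cleans bronze
    "gold_daily_revenue": ["silver_orders_clean"], # aggregates silver
}

def execution_order(tables):
    """Return a run order in which every table comes after all of its inputs."""
    return list(TopologicalSorter(tables).static_order())

print(execution_order(declared_tables))
```

Retries, checkpointing, and incremental scheduling layer on top of the same declared graph, which is why no separate orchestration DAG needs to be maintained.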

Root Problems Declarative Pipelines Aim to Solve

  • Excess glue code for incremental ingestion, backfills, error handling, and dependency wiring across jobs.
  • Duplicated patterns across teams; no common standard for pipeline quality, testing, or lineage.
  • Separate stacks for batch vs streaming ETL, increasing operational overhead and cost.

These pain points were repeatedly observed across large production Spark deployments and directly informed the design of the open Declarative Pipelines contribution.

Built on Real-World Experience (DLT Heritage)

The open source contribution draws heavily on lessons from Databricks’ commercial Delta Live Tables (DLT) service, which codified best practices for reliable pipeline development at scale. After years of customer usage, the core ideas — declarative table definitions, automatic dependency tracking, managed checkpointing, and unified batch/stream semantics — have been generalized and donated as Spark Declarative Pipelines so they can run anywhere Spark runs.

Unified Batch + Streaming

Instead of maintaining two separate code paths, the Declarative model lets you define once and execute in batch, streaming, or mixed (incremental) modes. The framework automatically handles incremental processing, checkpoint management, and backfill ordering, dramatically simplifying hybrid data architectures that ingest both historical and real-time feeds.
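The core idea can be sketched in a few lines of plain Python (assumed names, not the real API): the transform is written once, and only the checkpointed offset decides whether a run is a full backfill or an incremental update.

```python
# Minimal sketch of unified batch/incremental execution: one declared
# transform, with the checkpoint deciding how much input is processed.
def daily_totals(rows):
    """The single declared transform: sum order amounts per day."""
    totals = {}
    for day, amount in rows:
        totals[day] = totals.get(day, 0.0) + amount
    return totals

def run(source, checkpoint):
    """Process everything after the checkpointed offset, then advance it."""
    new_rows = source[checkpoint["offset"]:]
    checkpoint["offset"] = len(source)
    return daily_totals(new_rows)

events = [("2025-06-01", 10.0), ("2025-06-01", 5.0), ("2025-06-02", 7.5)]
ckpt = {"offset": 0}
print(run(events, ckpt))   # first run: checkpoint at 0, so a full backfill
events.append(("2025-06-02", 2.5))
print(run(events, ckpt))   # second run: only the newly appended event
```

In the real framework the checkpoint is a managed streaming checkpoint rather than a list offset, but the contract is the same: identical logic for historical and live data.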

Real-Time Mode: Sub-300ms Streaming

Databricks has also detailed a new real-time execution mode for Spark Structured Streaming that consistently delivers p99 latencies under ~300 ms across a broad class of stateless and stateful queries, enabled by a concurrent stage scheduler, non-blocking shuffle, and operator optimizations. Early adopters have reported up to 10× lower latencies, unlocking use cases like feature engineering for ML and real-time enrichment that previously required specialized streaming stacks.

Declarative Pipeline Anatomy

A pipeline is defined as a set of target tables with transformation logic, quality expectations, and dependencies. You focus on “what” — the engine figures out “how” and “when.” Below is a simplified Python-style definition sketching a medallion-style flow (Bronze → Silver → Gold) using the Declarative API concepts described by Databricks:

# Illustrative sketch; "spark_declarative" is a hypothetical module standing in
# for the Spark Declarative Pipelines decorator-style API.
from spark_declarative import pipeline, table, read_stream
from pyspark.sql import functions as F

@table(name="bronze_raw_orders", mode="streaming")
def bronze_raw_orders():
    # Ingest raw JSON order events as they land.
    return read_stream("s3://landing/orders_json")

@table(name="silver_orders_clean", depends_on="bronze_raw_orders")
def silver_orders_clean(bronze):
    # Cast to typed columns and drop invalid rows.
    return bronze.selectExpr(
        "cast(orderId as long) as order_id",
        "cast(customerId as long) as customer_id",
        "cast(amount as double) as amount",
        "to_timestamp(orderTimestamp) as ts",
    ).where("order_id is not null and amount >= 0")

@table(name="gold_daily_revenue", depends_on="silver_orders_clean", mode="incremental")
def gold_daily_revenue(silver):
    # Daily revenue rollup, recomputed incrementally as new orders arrive.
    return (silver
            .groupBy(F.date_trunc("day", "ts").alias("day"))
            .agg(F.sum("amount").alias("revenue")))

pipeline.build()

In a Declarative Pipeline, dependency ordering, incremental updates, retries, and checkpoint locations are managed automatically — no need to wire DAG edges manually in an external scheduler.

YAML / Config-Driven Pipelines

Many teams prefer metadata-driven pipelines. Spark Declarative Pipelines (and its DLT heritage) support config-based definitions that tools can generate. Here’s a conceptual YAML fragment mapping sources to derived tables and expectations:

tables:
  - name: bronze_raw_clicks
    source: kafka://clickstream
    mode: streaming

  - name: silver_clicks_clean
    input: bronze_raw_clicks
    transform: |
      SELECT user_id, url, ts
      FROM bronze_raw_clicks
      WHERE user_id IS NOT NULL

  - name: gold_daily_active_users
    input: silver_clicks_clean
    transform: |
      SELECT date_trunc('day', ts) AS day,
             COUNT(DISTINCT user_id) AS dau
      FROM silver_clicks_clean
      GROUP BY day
    mode: incremental

Metadata-first authoring helps enforce standards, testing, and governance across large data engineering teams, a benefit repeatedly cited by Databricks and early community adopters.
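One concrete payoff of config-driven definitions is that they can be validated before anything runs. Below is a hedged sketch of such a check; the dicts mirror the YAML fragment above (as any YAML loader would parse it), and the field names (`name`, `input`, `mode`, `transform`) come from that illustrative fragment rather than a formal schema.

```python
# Sketch of pre-run validation for a metadata-driven pipeline config:
# every mode must be known, and every declared input must be a defined table.
VALID_MODES = {"batch", "streaming", "incremental"}

def validate(tables):
    """Return a list of human-readable config errors (empty list = valid)."""
    errors = []
    names = {t["name"] for t in tables}
    for t in tables:
        if t.get("mode", "batch") not in VALID_MODES:
            errors.append(f"{t['name']}: unknown mode {t.get('mode')!r}")
        upstream = t.get("input")
        if upstream is not None and upstream not in names:
            errors.append(f"{t['name']}: unknown input {upstream!r}")
    return errors

config = [
    {"name": "bronze_raw_clicks", "source": "kafka://clickstream",
     "mode": "streaming"},
    {"name": "silver_clicks_clean", "input": "bronze_raw_clicks",
     "transform": "SELECT user_id, url, ts FROM bronze_raw_clicks "
                  "WHERE user_id IS NOT NULL"},
    {"name": "gold_daily_active_users", "input": "silver_clicks_clean",
     "mode": "incremental",
     "transform": "SELECT date_trunc('day', ts) AS day, "
                  "COUNT(DISTINCT user_id) AS dau "
                  "FROM silver_clicks_clean GROUP BY day"},
]
print(validate(config))  # empty list: the config is internally consistent
```

Checks like this are cheap to run in CI, which is where metadata-first teams typically enforce them.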

Comparing Spark SQL Jobs vs Declarative Pipelines

Where does Declarative Pipelines really save time? The comparison below highlights the key differences.

Dimension             | Ad-Hoc Spark SQL / Jobs         | Declarative Pipelines
Dependency Management | Manual DAG wiring in schedulers | Automatic table dependency tracking
Batch + Streaming     | Separate code paths             | Unified spec; incremental by default
Retries / Checkpoints | Custom glue code                | Managed by framework
Operational Overhead  | High; per-job tuning            | Lower; standardized runtime

Databricks reports significant reductions in pipeline build time and operational toil when teams move from individually orchestrated Spark SQL jobs to the declarative model, with early adopters citing faster onboarding and fewer failures in production.

Low-Latency Streaming + Declarative Pipelines = Real-Time Lakehouse

The combination of declarative multi-table pipelines and the new real-time execution mode closes a long-standing gap: historically, Spark Structured Streaming offered strong throughput but struggled to meet sub-second SLAs for certain operational use cases. Architectural improvements (concurrent scheduling, non-blocking shuffle, operator tuning) now enable p99 latencies below ~300 ms while preserving Spark’s fault tolerance — making it feasible to build real-time feature stores, alerting, and personalization pipelines directly in Spark.

Getting Started: Migration Path From Spark SQL

  1. Inventory existing jobs: Identify repeatable patterns (bronze ingestion, CDC merges, aggregates) across your Spark SQL notebooks.
  2. Model target tables: For each output dataset, define schema, refresh mode (batch, streaming, incremental), and upstream sources.
  3. Translate logic declaratively: Replace orchestration glue with table definitions in Python or SQL; let the pipeline compute ordering and backfills.
  4. Enable real-time mode (where needed): Apply low-latency execution to stages that power interactive dashboards or ML feature serving.
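Step 2 above benefits from being explicit: capturing each target table as a small spec before any logic is rewritten. A hypothetical sketch (the `TableSpec` fields are assumptions for illustration, not part of the framework):

```python
# Sketch of a migration inventory: one spec per target table, recording the
# refresh mode and upstream sources before logic is translated.
from dataclasses import dataclass, field

@dataclass
class TableSpec:
    name: str
    mode: str                        # "batch", "streaming", or "incremental"
    sources: list = field(default_factory=list)
    description: str = ""

inventory = [
    TableSpec("bronze_raw_orders", "streaming", ["s3://landing/orders_json"]),
    TableSpec("silver_orders_clean", "incremental", ["bronze_raw_orders"]),
    TableSpec("gold_daily_revenue", "incremental", ["silver_orders_clean"]),
]

# Ingestion (streaming) tables are usually migrated first, since everything
# downstream depends on them.
streaming_first = [t.name for t in inventory if t.mode == "streaming"]
print(streaming_first)
```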

Operational Tips

  • Validate pipeline definitions before runtime — Spark Declarative Pipelines supports pre-execution validation to catch schema errors early.
  • Use unified batch/stream configs to avoid drift between historical backfills and live feeds.
  • Monitor latency SLOs if adopting real-time mode; early users saw large gains but tuning still matters for stateful joins.
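For the last tip, an SLO check can be as simple as computing p99 over observed micro-batch latencies and comparing it against the ~300 ms figure cited earlier. A minimal sketch, with an illustrative threshold and made-up sample data:

```python
# Hedged sketch of a latency SLO check for real-time mode: compute the 99th
# percentile of observed latencies (ms) and compare against a target.
import statistics

def p99_ms(samples):
    """99th percentile via stdlib quantiles (needs at least 2 samples)."""
    return statistics.quantiles(samples, n=100, method="inclusive")[98]

def slo_ok(samples, threshold_ms=300.0):
    """True if the observed p99 latency is within the SLO threshold."""
    return p99_ms(samples) <= threshold_ms

# Illustrative micro-batch latencies in milliseconds.
latencies = [42.0, 55.0, 61.0, 48.0, 250.0, 90.0, 73.0, 120.0, 66.0, 51.0]
print(p99_ms(latencies), slo_ok(latencies))
```

The `method="inclusive"` choice keeps the percentile within the observed range; a production monitor would feed this from streaming query progress metrics instead of a static list.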

When to Adopt Declarative Pipelines

Consider moving if you maintain dozens (or hundreds) of production Spark jobs with overlapping logic, maintain separate streaming and batch stacks, or struggle with backfills and SLA drift. Organizations running governed lakehouse architectures stand to benefit the most because declarative specs integrate naturally with cataloged tables and lineage tracking.

Conclusion

Databricks’ donation of Declarative Pipelines to Apache Spark — coupled with a real-time execution mode — signals a new phase in data engineering: one where teams describe desired data states, and the platform continuously keeps those states correct, fresh, and low-latency. If your environment is full of brittle, hand-wired Spark SQL jobs, now is the time to evaluate the declarative model and modernize your streaming architecture.
