For years, Apache Spark users stitched together ETL and streaming jobs using ad-hoc Spark SQL, Python notebooks, and external schedulers. That approach scales poorly: too much glue code, duplicated logic, and fragile orchestration across batch and streaming. Databricks has now open sourced its Declarative Pipeline framework (donated upstream as Spark Declarative Pipelines) and introduced a new real-time execution mode that brings sub-second streaming performance to a broader range of workloads. Together, these moves represent a major evolution in how data pipelines are authored and operated in the Spark ecosystem.
Spark SQL made individual queries declarative — you describe the result set and Spark figures out the execution plan. Declarative Pipelines extends that idea to entire multi-table dataflows: you specify which datasets should exist, how they depend on upstream data, and what quality and retention policies apply; the engine infers ordering, incremental runs, retries, checkpointing, and materialization. This removes large amounts of "undifferentiated heavy lifting" that teams previously re-implemented in notebooks and jobs.
These pain points — glue code, duplicated logic, and fragile cross-job orchestration — were repeatedly observed across large production Spark deployments and directly informed the design of the open Declarative Pipelines contribution.
The open source contribution draws heavily on lessons from Databricks' commercial Delta Live Tables (DLT) service, which codified best practices for reliable pipeline development at scale. After years of customer usage, the core ideas — declarative table definitions, automatic dependency tracking, managed checkpointing, and unified batch/stream semantics — have been generalized and donated as Spark Declarative Pipelines so they can run anywhere Spark runs.
Instead of maintaining two separate code paths, the declarative model lets you define a dataset once and execute it in batch, streaming, or mixed (incremental) modes. The framework automatically handles incremental processing, checkpoint management, and backfill ordering, dramatically simplifying hybrid data architectures that ingest both historical and real-time feeds.
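As a rough illustration of the single-definition idea (plain Python, not the pipeline API — the function and variable names here are hypothetical), one transformation can serve both a batch run over full history and an incremental run over newly arrived micro-batches:

```python
# Conceptual sketch only: "define once, run in batch or incrementally"
# shown with plain Python. This is NOT the Spark Declarative Pipelines API;
# it just illustrates why a single definition can back both modes.

def clean_orders(rows):
    """Single transformation definition shared by both execution modes."""
    return [
        {"order_id": int(r["orderId"]), "amount": float(r["amount"])}
        for r in rows
        if r.get("orderId") is not None and float(r["amount"]) >= 0
    ]

# Batch mode: apply the definition to the full history at once.
history = [{"orderId": 1, "amount": 10.0}, {"orderId": None, "amount": 5.0}]
batch_result = clean_orders(history)

# Incremental mode: apply the SAME definition to each arriving micro-batch.
new_events = [[{"orderId": 2, "amount": -1.0}], [{"orderId": 3, "amount": 7.5}]]
incremental_result = []
for micro_batch in new_events:
    incremental_result.extend(clean_orders(micro_batch))

print(batch_result)        # invalid rows dropped in both modes
print(incremental_result)
```

In a real declarative engine the framework, not your code, decides when to replay history versus process only new data; the point is that the transformation logic itself is written once.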
Databricks has also detailed a new real-time execution mode for Spark Structured Streaming that consistently delivers p99 latencies under ~300 ms across a broad class of stateless and stateful queries, enabled by a concurrent stage scheduler, non-blocking shuffle, and operator optimizations. Early adopters have reported up to 10× lower latencies, unlocking use cases like feature engineering for ML and real-time enrichment that previously required specialized streaming stacks.
A pipeline is defined as a set of target tables with transformation logic, quality expectations, and dependencies. You focus on “what” — the engine figures out “how” and “when.” Below is a simplified Python-style definition sketching a medallion-style flow (Bronze → Silver → Gold) using the Declarative API concepts described by Databricks:
```python
# Pseudo-style example inspired by Spark Declarative Pipelines concepts.
# The spark_declarative module, decorators, and *Expr helpers are
# illustrative sketches, not a shipped API.
from spark_declarative import pipeline, table, read_stream, read_batch

@table(name="bronze_raw_orders", mode="streaming")
def bronze_raw_orders():
    # Bronze: ingest raw JSON order events from the landing zone
    return read_stream("s3://landing/orders_json")

@table(name="silver_orders_clean", depends_on="bronze_raw_orders")
def silver_orders_clean(bronze):
    # Silver: cast to typed columns and drop invalid rows
    df = bronze.selectExpr(
        "cast(orderId as long) as order_id",
        "cast(customerId as long) as customer_id",
        "cast(amount as double) as amount",
        "to_timestamp(orderTimestamp) as ts",
    ).where("order_id is not null and amount >= 0")
    return df

@table(name="gold_daily_revenue", depends_on="silver_orders_clean", mode="incremental")
def gold_daily_revenue(silver):
    # Gold: daily revenue rollup, maintained incrementally
    return (silver
            .groupByExpr("date_trunc('day', ts) as day")
            .aggExpr("sum(amount) as revenue"))

pipeline.build()
```
In a Declarative Pipeline, dependency ordering, incremental updates, retries, and checkpoint locations are managed automatically — no need to wire DAG edges manually in an external scheduler.
Many teams prefer metadata-driven pipelines. Spark Declarative Pipelines (and its DLT heritage) support config-based definitions that tools can generate. Here’s a conceptual YAML fragment mapping sources to derived tables and expectations:
```yaml
tables:
  - name: bronze_raw_clicks
    source: kafka://clickstream
    mode: streaming

  - name: silver_clicks_clean
    input: bronze_raw_clicks
    transform: |
      SELECT user_id, url, ts
      FROM bronze_raw_clicks
      WHERE user_id IS NOT NULL

  - name: gold_daily_active_users
    input: silver_clicks_clean
    transform: |
      SELECT date_trunc('day', ts) AS day,
             COUNT(DISTINCT user_id) AS dau
      FROM silver_clicks_clean
      GROUP BY day
    mode: incremental
```
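One way to see what an engine must infer from such a config is to derive the execution order from the declared `input` edges. Here is a minimal sketch (plain Python; the config is inlined as a dict rather than parsed from YAML, and this simple topological sort is a stand-in for the real engine's dependency resolution, not its actual implementation):

```python
# Sketch: infer table execution order from declared inputs.
# A simplified stand-in for the dependency resolution a declarative
# engine performs; not the Spark Declarative Pipelines implementation.
from graphlib import TopologicalSorter

config = {
    "tables": [
        # Deliberately listed out of order: the engine must not care.
        {"name": "gold_daily_active_users", "input": "silver_clicks_clean"},
        {"name": "bronze_raw_clicks", "source": "kafka://clickstream"},
        {"name": "silver_clicks_clean", "input": "bronze_raw_clicks"},
    ]
}

# Build name -> set of upstream dependencies (source tables have none).
deps = {
    t["name"]: {t["input"]} if "input" in t else set()
    for t in config["tables"]
}

# Topological order: every table runs after its inputs, regardless of
# where it appears in the config file.
order = list(TopologicalSorter(deps).static_order())
print(order)  # bronze before silver before gold
```

A production engine layers incremental state, retries, and quality expectations on top of this ordering, but the dependency graph itself falls directly out of the metadata.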
Metadata-first authoring helps enforce standards, testing, and governance across large data engineering teams, a benefit repeatedly cited by Databricks and early community adopters.
Where does Declarative Pipelines really save time? The comparison below highlights the key differences between ad-hoc orchestration and the declarative model.
| Dimension | Ad-Hoc Spark SQL / Jobs | Declarative Pipelines |
|---|---|---|
| Dependency Management | Manual DAG wiring in schedulers | Automatic table dependency tracking |
| Batch + Streaming | Separate code paths | Unified spec; incremental by default |
| Retries / Checkpoints | Custom glue code | Managed by framework |
| Operational Overhead | High; per-job tuning | Lower; standardized runtime |
Databricks reports significant reductions in pipeline build time and operational toil when teams move from individually orchestrated Spark SQL jobs to the declarative model, with early adopters citing faster onboarding and fewer failures in production.
The combination of declarative multi-table pipelines and the new real-time execution mode closes a long-standing gap: historically, Spark Structured Streaming offered strong throughput but struggled to meet sub-second SLAs for certain operational use cases. Architectural improvements (concurrent scheduling, non-blocking shuffle, operator tuning) now enable p99 latencies below ~300 ms while preserving Spark's fault tolerance — making it feasible to build real-time feature stores, alerting, and personalization pipelines directly in Spark.
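To make the latency claim concrete: p99 means 99% of measured events complete within the stated bound, so a single slow straggler does not define the SLA. A quick way to check such a bound against measured per-event latencies (plain Python; the sample numbers are made up for illustration):

```python
# Sketch: computing a p99 latency from per-event measurements and
# checking it against a sub-300 ms target. Sample data is illustrative.
import math

def percentile(latencies_ms, p):
    """Nearest-rank percentile: smallest value covering p% of the sample."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

# 100 simulated end-to-end latencies: mostly fast, with a few stragglers.
latencies = [120] * 95 + [180, 200, 250, 280, 950]
p99 = percentile(latencies, 99)
print(p99, p99 < 300)  # the single 950 ms outlier sits beyond p99
```

When validating a streaming SLA in practice, measure end-to-end (event ingestion to sink write), not per-operator, since scheduling and shuffle delays accumulate across the query.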
Consider moving if you maintain dozens (or hundreds) of production Spark jobs with overlapping logic, maintain separate streaming and batch stacks, or struggle with backfills and SLA drift. Organizations running governed lakehouse architectures stand to benefit the most because declarative specs integrate naturally with cataloged tables and lineage tracking.
Databricks' donation of Declarative Pipelines to Apache Spark — coupled with a real-time execution mode — signals a new phase in data engineering: one where teams describe desired data states, and the platform continuously keeps those states correct, fresh, and low-latency. If your environment is full of brittle, hand-wired Spark SQL jobs, now is the time to evaluate the declarative model and modernize your streaming architecture.