By Andreas Paech
Head of Data Engineering

Modern cloud data platforms promise scale with minimal operational overhead. Auto-scaling, serverless compute, and managed storage lower the barrier to “big data.” Yet the cloud still charges you for waste: inefficient SQL and poorly shaped data layouts translate directly into higher latency and higher spend because cost and performance scale with scanned bytes, shuffle volume, and join strategy.

This post argues for a pragmatic stance: SQL-first is not nostalgia. It is the lowest-friction way to share and version the executable definitions behind metrics and pipelines, so analysts, engineers, and power users can review intent, detect semantic drift, and keep velocity and trust intact.

SQL as the Lowest-Friction Interface for Shared Meaning

In most data organisations, “speed” problems are really interpretation problems: teams hesitate because they do not trust what a metric means, whether a change altered its semantics, or whether today’s number is comparable to yesterday’s. SQL’s advantage is that it is declarative: the semantics are readable, diffable, and testable across functions.

In practice, the query is the contract: it encodes join rules, filter boundaries, and null-handling. When treated as a versioned, code-reviewed artefact, SQL reduces context decay by keeping logic close to the number. SQL-first is therefore not only an engineering preference; it is a data governance primitive that makes trust reproducible.
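
As a minimal sketch of what such a contract can look like (the tables, columns, and filter values below are illustrative, not taken from any real pipeline):

```sql
-- Illustrative metric definition: "active paying customers per day".
-- The join rule, filter boundary, and null handling are all explicit,
-- so a reviewer can diff a change and see exactly what the metric means.
SELECT
    o.order_date,
    COUNT(DISTINCT c.customer_id) AS active_paying_customers
FROM orders AS o
INNER JOIN customers AS c
    ON c.customer_id = o.customer_id        -- join rule is explicit
WHERE o.status = 'PAID'                     -- filter boundary is explicit
  AND o.order_date >= DATE '2024-01-01'
  AND c.churned_at IS NULL                  -- null handling is deliberate
GROUP BY o.order_date;
```

Reviewed as code, a change to any of these lines is a visible semantic change rather than silent drift.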

Why the Cloud Makes Query Discipline More, Not Less, Important

Cloud elasticity changes failure modes. In on-premises systems, an inefficient query often fails loudly because capacity is fixed. In the cloud, inefficient queries often “succeed,” but at the cost of slow feedback loops and expensive runs. That is a more dangerous outcome: the system appears functional while quietly taxing the organisation with longer cycles, higher DBU/compute consumption, and slower time-to-insight.

Managed features also create a psychological trap. It is tempting to assume that auto-scaling and serverless compute compensate for suboptimal SQL. They do not. They can mask symptoms and inflate cost. Query planners cannot infer business semantics, and distributed engines cannot parallelise workloads effectively if the underlying logic produces unnecessary shuffles, Cartesian blow-ups, or oversized partitions. The engine can only optimise what you actually express. When the expression is sloppy, the bill and the latency will eventually make that visible.
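
A hedged illustration, with hypothetical table names, of logic the engine cannot fix on its own: a join that omits part of the key fans out into a near-Cartesian product, and no amount of auto-scaling changes what was expressed.

```sql
-- Hypothetical example: joining events to sessions on a partial key
-- multiplies rows per user and forces a massive shuffle downstream.
SELECT e.event_id, s.session_id
FROM events AS e
JOIN sessions AS s
    ON e.user_id = s.user_id;               -- many events x many sessions per user

-- The intended logic joins on the full key, so each event matches one session.
SELECT e.event_id, s.session_id
FROM events AS e
JOIN sessions AS s
    ON e.user_id = s.user_id
   AND e.session_id = s.session_id;
```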

A Real Incident: When a “Stable” ETL Job Turns Into a Six-Hour Outage

Treat query regressions as reliability incidents. In one production pipeline, an ETL job that normally ran 40-50 minutes spiked to 6.91 hours and 6.11 hours on consecutive days while processing a comparable volume (~20.2 million records). This was not a demand surge; it was a failure in disguise.

Instead of blaming transient infrastructure, we inspected the query execution plan. The transformation’s nested subqueries and join logic amplified skew under a specific file distribution, creating redundant scans and expensive shuffles.

The fix was boring and decisive: rewrite into explicit Common Table Expressions (CTEs), tighten join predicates, and harden null-handling; then align the table layout with the access pattern (partitioning). Runtime dropped to 2 minutes and 21 seconds, including a quick clustering step and a statistics update. The point is not the rewrite; it is that a SQL-first posture turns “cloud slowness” into deterministic levers.
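
The production query itself cannot be shared, but the shape of the fix looks roughly like the sketch below; the table names, keys, and columns are assumptions for illustration only.

```sql
-- Sketch only: explicit CTEs instead of nested subqueries, tightened join
-- predicates, and deliberate null handling. The result feeds a target
-- table partitioned to match the dominant access pattern (load_date).
WITH latest_source AS (
    SELECT record_id, attributes, load_date
    FROM source_records
    WHERE load_date = CURRENT_DATE          -- scan only the relevant slice
),
deduplicated AS (
    SELECT record_id, MAX(load_date) AS load_date
    FROM latest_source
    GROUP BY record_id
)
SELECT s.record_id,
       COALESCE(s.attributes, '{}') AS attributes,   -- explicit null handling
       s.load_date
FROM latest_source AS s
INNER JOIN deduplicated AS d
    ON  d.record_id = s.record_id
    AND d.load_date = s.load_date;          -- tight predicate avoids fan-out
```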

Data Layout Is Part of the Query

Query optimisation and data layout optimisation are two halves of the same performance problem. Clean SQL can still run poorly if files, partitions, or clustering prevent parallelism or amplify shuffles; a clean layout cannot rescue fundamentally inefficient logic.

A simple example is file distribution: a handful of oversized files caps parallelism; thousands of tiny files shift the bottleneck to metadata and scheduling. Practical layout literacy therefore includes file size distribution, locality (partitioning/clustering aligned to dominant filters and join keys), and statistics refresh when the optimiser depends on them.
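
Assuming a Delta Lake table on Databricks (which the DBU and Liquid Clustering references in this post suggest; other engines expose different commands), the basic layout checks and maintenance steps look like this:

```sql
-- Assumes a Delta Lake table on Databricks; other platforms differ.
DESCRIBE DETAIL sales.orders;               -- inspect numFiles and sizeInBytes

-- Compact small files and co-locate data on the dominant filter/join keys.
OPTIMIZE sales.orders
ZORDER BY (order_date, customer_id);

-- Refresh the statistics the optimiser relies on for scan and join decisions.
ANALYZE TABLE sales.orders
COMPUTE STATISTICS FOR COLUMNS order_date, customer_id;
```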

A Pragmatic Process Model: Optimise Logic, Then Layout, Finally Automate

| Signal | Likely Root Cause | First Move | When to Automate |
| --- | --- | --- | --- |
| Runtime spikes without volume change | Query plan instability, join explosion, skew sensitivity | Rewrite SQL for intent (CTEs, joins, null logic) | After runtime is stable and predictable |
| Consistently slow filters on common dimensions | Poor locality, oversized files, missing clustering | Fix physical data layout (file sizes, clustering, stats) | When writes are steady and layout drifts over time |
| Performance decays gradually under steady writes | Layout drift, small-file/large-file oscillation | Establish baseline layout and validate access patterns | Once baseline is proven and repeatable |
| Cold/static tables with rare updates | One-time poor layout, stale files | One-time optimisation and cleanup | Usually not needed; monitor only |
| Streaming or highly irregular ingestion | Unsupported automation paths, volatile patterns | Manual layout discipline + monitoring | Only when platform support matures for the workload |

Table 1: How to choose between SQL rewrites, baseline layout work, and automation, based on performance signals.

Teams often jump to automation too early. The sequence that scales is: optimise core logic first, stabilise physical layout second, then use automation to preserve a proven baseline.

Optimising logic means making the plan intentional: remove accidental Cartesian products, prefer explicit joins, rewrite nested subqueries that obscure intent, and be deliberate about null semantics. The goal is not clever SQL; it is predictable execution.
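
One place where null semantics quietly change results is the NOT IN anti-join; the example below uses hypothetical tables:

```sql
-- If blocked_customers contains a single NULL customer_id, this query
-- returns no rows at all, because NOT IN evaluates to UNKNOWN.
SELECT o.order_id
FROM orders AS o
WHERE o.customer_id NOT IN (SELECT customer_id FROM blocked_customers);

-- NOT EXISTS states the intent explicitly and is unaffected by NULL keys.
SELECT o.order_id
FROM orders AS o
WHERE NOT EXISTS (
    SELECT 1
    FROM blocked_customers AS b
    WHERE b.customer_id = o.customer_id
);
```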

Stabilising layout means shaping the physical reality to match access patterns: partition only when it reflects dominant filters, cluster/order when locality matters, keep files in a parallelism-friendly size range, and refresh statistics when needed.
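
As a sketch of what that can look like on a Delta table (the names, columns, and the 128 MB file-size target are illustrative assumptions, not a universal recommendation):

```sql
-- Partition only on the dominant filter column and keep files in a
-- parallelism-friendly size range. Assumes Delta Lake on Databricks.
CREATE TABLE sales.orders_by_day (
    order_id     BIGINT,
    customer_id  BIGINT,
    order_date   DATE,
    amount       DECIMAL(18, 2)
)
USING DELTA
PARTITIONED BY (order_date)
TBLPROPERTIES ('delta.targetFileSize' = '128mb');
```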

Only after those two layers are sound does automation compound value. Automation is a multiplier applied to a stable baseline.

Right-Sized Clusters and Auto-Scaling: The Third Lever

Logic and layout determine how much work the engine must do; compute determines how fast it can do it. Efficient queries can still crawl if clusters are undersized, contended, or slow to start, which often presents as “random bad days” in production.

The goal is controlled elasticity, not permanent overprovisioning. Autoscaling reduces tail latency under spikes; auto-termination prevents idle burn. Serverless compute can remove provisioning overhead for smaller teams and mixed interactive/job workloads, but it is not always the cheapest option at scale.

Compute scaling is not a substitute for efficient logic and layout. If SQL and layout are inefficient, adding workers only accelerates waste.

Platforms are adding features such as predictive optimisation, auto-compaction, adaptive clustering, and managed statistics refresh. These are useful, but they are not plug-and-play. The first limitation is column importance: not all columns are equal. Automation cannot reliably infer which identifiers, dimensions, and join keys are business-critical for your dashboards and downstream aggregates.

The second limitation is workload shape. Batch append, frequent updates, and streaming ingestion stress automation differently. Cold tables usually benefit from one-time optimisation plus monitoring; volatile streaming patterns often still require manual discipline until platform support matures.

Use automation to maintain a baseline you already understand. Evaluate features like Liquid Clustering and Predictive Optimization as multipliers on a correct design, not as fixes for unclear semantics or unstable plans.
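
On Databricks, those two features are declared roughly as follows (syntax is a sketch; verify against current documentation, and note that Liquid Clustering replaces rather than complements partitioning):

```sql
-- Liquid Clustering: declare the clustering keys you already know matter
-- on an unpartitioned Delta table (clustering replaces partitioning).
ALTER TABLE sales.order_events
CLUSTER BY (order_date, customer_id);

-- Predictive Optimization: let the platform schedule maintenance for a
-- schema whose baseline layout is already understood.
ALTER SCHEMA sales ENABLE PREDICTIVE OPTIMIZATION;
```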

SQL-First as a Trust Primitive

SQL-first data engineering is ultimately an organisational choice disguised as a technical preference. SQL is widely teachable, declarative, and reusable across Business Intelligence, analytics, data engineering, and advanced power users. That matters because high-velocity organisations cannot afford meaning to be locked inside a narrow specialist group. If the organisation wants trust to scale, the definitions that create truth must be legible, reviewable, and executable by the people who depend on them.

Treat SQL as a first-class artefact, and it becomes more than a data query language. It becomes the interface that preserves meaning under change, the contract that makes metrics auditable, and the stabiliser that reduces the time-to-trust tax. Optimise the core logic, shape the underlying data structures, and then automate where it compounds value. That is the practical sequence that keeps cloud platforms fast, costs predictable, and trust intact.
