By Andreas Paech
Head of Business Intelligence

Data engineering on cloud platforms increasingly depends on automation to keep performance predictable and costs under control. Databricks’ Liquid Clustering (LC) and Predictive Optimization (PO) promise to reduce the burden of manual table maintenance by continuously reclustering data, compacting files, and refreshing statistics. This article distills our experience applying these features to high-volume ETL pipelines, outlines the trade-offs against one-time manual optimisations, and proposes a phased process model for introducing LC and PO into a modern DataOps practice.

Cloud data platforms abstract cluster management, but they do not eliminate the need for well-maintained data layouts. In traditional settings, data engineers had to run periodic OPTIMIZE, VACUUM, and ANALYZE commands and carefully choose partitioning and clustering keys. At Exmox, where daily ETL jobs ingest and transform tens of millions of rows, manual maintenance quickly became operationally expensive.
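For reference, this manual routine boils down to recurring runs of the standard Delta maintenance commands. The sketch below uses a hypothetical table name rather than our production schema:

  -- Recurring manual maintenance (table name is illustrative)
  OPTIMIZE analytics.fact_events;                                          -- compact small files
  VACUUM analytics.fact_events RETAIN 168 HOURS;                           -- remove files no longer referenced (7-day retention)
  ANALYZE TABLE analytics.fact_events COMPUTE STATISTICS FOR ALL COLUMNS;  -- refresh statistics for the query planner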

In early 2025, we began evaluating Liquid Clustering and Predictive Optimization to automate these tasks. The goal was not to replace SQL-first discipline, but to reduce the operational load once workloads were stable and tables correctly modelled. What follows are two case studies that illustrate both the promise and the limits of automation, followed by a proposed process model and decision matrix for optimising data management.

_____________________

Concepts of Liquid Clustering and Predictive Optimization

Liquid Clustering represents a shift away from traditional static partitions and Z-order indexing towards a model of continuous, incremental reclustering. Rather than forcing engineers to define fixed partition columns, Databricks maintains the distribution of data dynamically, reorganising files as new records are written. This approach directly addresses two of the most common issues in large-scale data management: the “small-file problem,” where too many tiny files slow metadata operations, and the “monolithic file problem,” where a single oversized file prevents effective parallelism. In practice, Liquid Clustering can use up to four columns for clustering, with Databricks automatically selecting keys unless they are explicitly defined by the engineer. Reclustering happens incrementally during writes, reducing the need for disruptive, full-table OPTIMIZE operations. It is important to note, however, that Liquid Clustering is currently available only for Unity Catalog–managed Delta tables and does not yet support streaming ingestion or materialized views as of 2025.
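As a minimal sketch (table and column names are illustrative, not our production schema), clustering keys are declared with CLUSTER BY at table creation or added later to an existing Delta table:

  -- Declare clustering keys on creation (up to four columns)
  CREATE TABLE analytics.events (
    event_id   BIGINT,
    event_date DATE,
    user_id    BIGINT,
    payload    STRING
  ) CLUSTER BY (event_date, user_id);

  -- Or switch an existing table to Liquid Clustering;
  -- subsequent writes then recluster incrementally
  ALTER TABLE analytics.events CLUSTER BY (event_date, user_id);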

Predictive Optimization builds on this foundation by automating table maintenance tasks that previously required explicit scheduling. Instead of manually orchestrating jobs, engineers can rely on the platform to trigger actions based on observed usage patterns. Predictive Optimization automatically runs OPTIMIZE to compact fragmented or oversized files, VACUUM to clean up obsolete or deleted files, and ANALYZE to refresh table statistics and improve query planning. By relying on heuristics and workload monitoring rather than fixed schedules, Predictive Optimization delivers several practical benefits. Statistics remain consistently up to date without manual intervention, reducing the risk of performance regressions. Write amplification is lowered compared to frequent manual compactions, since maintenance happens only when it is likely to improve performance.
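Predictive Optimization is controlled at the account level and can be overridden per catalog, schema, or table. A minimal sketch, assuming illustrative catalog and schema names:

  -- Opt a schema in explicitly; its tables inherit the setting unless overridden
  ALTER SCHEMA main.analytics ENABLE PREDICTIVE OPTIMIZATION;

  -- Individual tables can opt out, or fall back to the parent setting
  ALTER TABLE main.analytics.staging_dump DISABLE PREDICTIVE OPTIMIZATION;
  ALTER TABLE main.analytics.staging_dump INHERIT PREDICTIVE OPTIMIZATION;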

Case Study 1: Nightly Rewrite of High-Volume Tables

One of the most demanding workloads in our environment involves the nightly rewrite of a table that ingests more than twenty million rows. Under normal conditions, this batch load was expected to complete in less than an hour. However, in early January runtimes suddenly stretched to more than six hours, creating significant downstream delays for dependent analytics pipelines. Initial investigation showed that inefficient SQL logic was partly to blame, but another critical factor was the physical layout of the destination table. Data had accumulated into a small number of very large files, preventing Spark from parallelising reads effectively and introducing heavy shuffling during execution.

A one-time optimisation provided immediate relief. By reorganising the table with an OPTIMIZE … ZORDER BY (…) operation, the oversized files were compacted into a balanced set of 260 files averaging around 135 megabytes each. This new distribution allowed the compute cluster to read files in parallel, avoiding the bottleneck of a monolithic data structure. The results were striking: runtimes collapsed from several hours to just 2 minutes and 21 seconds with the same input volume. Storage also shrank from 50 GB to 35 GB as redundant and obsolete records were eliminated during compaction.
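The repair itself amounted to a single compaction pass followed by a vacuum. The statements below are a sketch with illustrative table and column names rather than our exact production keys:

  -- Compact oversized and fragmented files, co-locating rows on the most-filtered columns
  OPTIMIZE analytics.nightly_facts ZORDER BY (load_date, partner_id);

  -- Drop files no longer referenced by the current table version
  VACUUM analytics.nightly_facts RETAIN 168 HOURS;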

The long-term challenge, however, is avoiding a return to this degraded state. Manually running OPTIMIZE can deliver quick wins, but it requires ongoing scheduling and operational effort. To maintain healthy file sizes and consistent performance without constant human intervention, we enabled Liquid Clustering on this workload. By reclustering incrementally during writes, Liquid Clustering has kept the table in a balanced state automatically for seven months now.
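Enabling Liquid Clustering on the existing table is a one-line change; on recent runtimes, an OPTIMIZE … FULL run can be used once to recluster the historical data, after which routine writes recluster incrementally. Again, the names and keys below are illustrative:

  -- Replace the Z-order layout with Liquid Clustering keys
  ALTER TABLE analytics.nightly_facts CLUSTER BY (load_date, partner_id);

  -- Optional one-time full recluster of existing data; later OPTIMIZE runs are incremental
  OPTIMIZE analytics.nightly_facts FULL;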

Case Study 2: Incremental Loads into Large Tracking Tables

Another common workload type is incremental ingestion. One of our most important tables accumulates around half a million new records every day, with the total volume now approaching a quarter of a billion rows. This scale poses a different set of challenges from full nightly rewrites. Historically, the table was partitioned by date, which made sense in theory but quickly produced thousands of tiny files that slowed metadata operations. When partitioning was removed, the pendulum swung the other way: new data was appended into a handful of very large files, which undermined query performance and increased the risk of skewed workloads.

Liquid Clustering provided a pragmatic middle ground. Instead of relying on a fixed partitioning strategy, Databricks automatically selected up to four clustering keys and continuously reclustered data during inserts. This incremental approach maintained a balance between too many small files and too few large ones. In early tests, query latency for downstream analytics dropped noticeably, as Spark executors could process more evenly sized tasks. Compute consumption also decreased, as less time was spent shuffling data across the cluster.
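Automatic key selection can also be requested explicitly with CLUSTER BY AUTO on Unity Catalog managed tables (on current runtimes this relies on Predictive Optimization being enabled). A hedged sketch with illustrative names:

  -- Let Databricks choose and adjust clustering keys from observed query patterns
  CREATE TABLE main.tracking.events_incremental (
    event_ts    TIMESTAMP,
    campaign_id BIGINT,
    user_id     BIGINT
  ) CLUSTER BY AUTO;

  -- Existing tables can be switched to automatic selection as well
  ALTER TABLE main.tracking.events_incremental CLUSTER BY AUTO;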

That said, Liquid Clustering is not a silver bullet. Automatic column selection may not always reflect the most important query patterns for the business. For example, if analysts frequently filter on a campaign identifier or customer dimension that is not chosen as a clustering key, performance gains may be less pronounced. In such cases, manual overrides and additional domain knowledge remain essential to guide clustering decisions. Nonetheless, for high-volume, continuously growing tables, Liquid Clustering significantly reduces the operational burden of maintaining optimal file distribution. It offers a way to keep ingestion pipelines efficient without constant manual intervention, provided teams continue to monitor outcomes and adjust where necessary.
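To verify whether the selected keys match real query patterns, the chosen clustering columns and file layout can be inspected and, where necessary, overridden. The snippet below is a sketch with illustrative names:

  -- DESCRIBE DETAIL reports clusteringColumns, numFiles, and sizeInBytes for the table
  DESCRIBE DETAIL main.tracking.events_incremental;

  -- Override the keys when analysts mostly filter on a dimension the automation did not pick
  ALTER TABLE main.tracking.events_incremental CLUSTER BY (campaign_id, event_ts);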

When and How to Automate DataOps

Liquid Clustering and Predictive Optimization are powerful tools, but their benefits depend heavily on context. On the positive side, they relieve engineers from routine maintenance tasks such as running OPTIMIZE, VACUUM, or ANALYZE. By reclustering incrementally during writes, they avoid the disruption of manual compaction cycles and adapt table layouts to evolving query patterns. This reduces operational toil and frees data teams to focus on higher-value engineering work. For large, frequently updated tables, automation can deliver measurable improvements in both query performance and cost efficiency.

At the same time, these features are not universally applicable. They are currently limited to Unity Catalog–managed Delta tables and do not yet extend to streaming ingestion or materialized views. Engineers also have limited control over how clustering keys are selected, which means that performance may not always align with domain-specific query requirements. For static or rarely updated tables, the overhead of continuous reclustering may provide little benefit compared to a one-time OPTIMIZE followed by routine monitoring. In short, automation compounds value once the basics are sound, but it cannot substitute for well-designed schemas and efficient SQL.

From our experience, the most effective adoption path follows a phased approach. The first step is to stabilise workloads by ensuring that queries are efficient and schemas are modelled appropriately; automation cannot compensate for flawed logic. The next step is to apply one-time optimisations to historic or static datasets, running OPTIMIZE and VACUUM to establish a clean baseline. Once this foundation is in place, Liquid Clustering can be introduced selectively on high-volume tables where manual maintenance would otherwise be frequent. For Unity Catalog–managed tables, Predictive Optimization can then be layered on top, allowing Databricks to decide when to compact files or refresh statistics based on observed usage. Finally, continuous monitoring remains essential: runtime metrics, DBU consumption, and file distribution patterns should be reviewed regularly to ensure that automation delivers sustained value.

In short, the adoption path looks like this:

  1. Stabilise workloads. Ensure SQL queries and schema design are sound. Automation cannot fix bad logic.
  2. Apply one-time optimisations. For historic/static tables, run OPTIMIZE + VACUUM to establish a clean baseline.
  3. Enable Liquid Clustering selectively. Use on tables with steady, high-volume writes where manual maintenance is frequent.
  4. Add Predictive Optimization for managed tables. Let Databricks trigger maintenance actions adaptively.
  5. Monitor & iterate. Track runtimes, DBU consumption, and file distribution histograms to validate gains.
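For the monitoring step, two lightweight checks have proven useful. The snippet below is a sketch: it assumes access to the Unity Catalog system tables (the predictive optimization history table may need to be enabled in your workspace), and the table name is illustrative.

  -- File-level health of a table: inspect numFiles and sizeInBytes
  DESCRIBE DETAIL analytics.nightly_facts;

  -- What Predictive Optimization has actually done, and at what cost
  SELECT *
  FROM system.storage.predictive_optimization_operations_history
  WHERE table_name = 'nightly_facts'
  ORDER BY start_time DESC
  LIMIT 20;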

Taken together, these practices show that Liquid Clustering and Predictive Optimization represent a significant step forward in automated DataOps. At Exmox, their introduction reduced costs and improved consistency in both nightly ETL and incremental tracking pipelines, cutting down the need for daily human intervention. Still, these technologies are not replacements for sound engineering. They are multipliers that deliver the greatest impact once core logic and data structures have already been optimised. The enduring lesson is clear: get the fundamentals right first, then enable automation where it compounds value most effectively.

Key Takeaways

In practice, the lessons from applying Liquid Clustering and Predictive Optimization can be distilled into a few clear principles and workload-specific recommendations that guide when automation adds value and when manual optimisation remains the better choice:

  • Liquid Clustering replaces static partitions and ZORDER with continuous reclustering during writes.
  • Predictive Optimization automates OPTIMIZE, VACUUM, and ANALYZE, reducing operational overhead.
  • Best fit: Large, frequently updated tables under Unity Catalog with steady workloads.
  • Not ideal: Cold/historic tables or streaming pipelines.
  • Process tip: Apply manual one-time optimisation first, then layer LC/PO selectively, and continuously monitor impact.


Workload Type             | Characteristics                                                                      | Recommended Action
Historic / Cold Tables    | Rarely updated; large static data; queries mainly read-only                         | One-time OPTIMIZE + VACUUM; no LC/PO needed
High-Volume Batch Loads   | Nightly/weekly rewrites of tens of millions of rows; stable schema/query patterns   | One-time optimisation → then enable LC
Incremental Loads         | Continuous inserts (100k+ per day); table size grows steadily; mixed query patterns | Enable LC for reclustering; add PO if UC-managed
Ad-hoc / Irregular Tables | Small or sporadic updates; unpredictable query workload                             | Manual checks; selective OPTIMIZE; avoid LC
Streaming Pipelines       | Structured Streaming or continuous ingestion                                        | LC/PO not supported (2025); rely on manual layout tuning
Critical Analytics Tables | Widely used dashboards; latency-sensitive queries; predictable access patterns      | Baseline optimisation → LC + PO for continuous maintenance

_____________________
