By Andreas Paech
Head of Business Intelligence

At exmox, we set out to modernize our Data Warehousing setup to support Business Reporting and unlock advanced analytics use cases. Over the past year, we built a fully functional proof-of-concept (PoC) on Databricks, leveraging AWS for web-analytics.
The next step was to extend our data architecture to empower our entire organization with self-service Business Reporting by migrating existing dashboards to a unified data warehouse. Our data analysts needed to craft transformations for metric calculation in SQL without worrying about back-end complexities. Meanwhile, our Data Engineers needed to focus on maintaining a stable, scalable Data Platform, which meant freeing them from repetitive day-to-day tasks.
Before finalizing Databricks as our main data platform, we decided to benchmark it against Snowflake, which is widely recognized as a leading cloud data warehouse and a significant competitor in the market. The goal was to determine whether Snowflake offered a more efficient path forward: higher delivery velocity and less maintenance overhead while keeping the data architecture scalable.
_____________________
Benchmarking Requirements
Our PoC was guided by several key objectives aimed at enhancing our data management and analytics capabilities. First and foremost, we prioritized velocity in implementing data domains end-to-end. We identified approximately 25 existing dashboards spanning several data domains such as marketing, sales, operations, and finance. Each required data quality improvements through unified metric calculation, and some also required near real-time data. Ensuring these dashboards receive timely updates is crucial for informed decision-making and maintaining operational efficiency across the whole organization.
Data ingestion capabilities were also a crucial factor. We required robust handling of various data sources, including MySQL, Postgres, and streaming data, as well as third-party API ingestion. Evaluating how each platform manages these diverse ingestion scenarios was key to ensuring comprehensive and efficient data integration across our systems. Data governance was another important consideration. We needed built-in functionality for data cataloging, lineage tracking, and security to maintain data integrity and compliance. Effective governance tools are vital for managing data assets, ensuring traceability, and safeguarding sensitive information against unauthorized access.
In terms of data engineering and observability, we examined how each platform handles ETL (Extract, Transform, Load) pipelines, monitoring, and version control. Effective management of ETL processes is essential for maintaining data quality and reliability, while observability tools aid in diagnosing and resolving issues promptly. We also emphasized the need for seamless integration with SQL-based workflows: our data analysts depend heavily on SQL transformations for analysis and reporting, although we have decided against adopting dbt at the current stage. Any new data architecture must therefore allow analysts to continue their work without significant adjustments or disruptions, so the team can maintain productivity and leverage existing skills without a steep learning curve.
Closely related was the ease of maintainability and migration. We needed to assess whether we could seamlessly migrate our existing dashboards to the new platform and replicate the necessary data transformations without significant effort or disruption. A smooth migration process is essential to minimize downtime and ensure continuity in our analytics operations.
Another significant objective was to minimize maintenance overhead. Our DevOps and cloud expertise is concentrated within the core Tech team, and we sought a solution that would reduce the complexity of cluster and resource management. By simplifying these aspects, we can allocate our limited technical resources more effectively and avoid potential bottlenecks associated with managing intricate infrastructure. Scalability and cluster management capabilities were evaluated to determine whether the platforms offer auto-scaling clusters or serverless options. Automated scaling ensures that our infrastructure can adjust to varying workloads without manual intervention, enhancing performance and cost-efficiency.
Support for machine learning (ML) use cases was an additional criterion. We looked for built-in capabilities that facilitate advanced analytics tasks such as fraud detection and ranking. Integrated ML features enable our data scientists to develop and deploy sophisticated models more efficiently, driving deeper insights and innovation.
Lastly, we considered the total cost of ownership (TCO). This assessment included not only the licensing costs of each platform but also the associated DevOps overhead and the required skill sets. A comprehensive understanding of TCO helps in making an informed decision that aligns with our budgetary constraints and resource availability. Ultimately, our goal was to establish a robust platform that scales without driving up complexity or cost. Scalability is essential to accommodate growing data volumes and increasing analytical demands. However, this scalability must not come at the expense of added complexity or prohibitive costs. We aimed to find a solution that balances performance and scalability with simplicity and cost-effectiveness, ensuring sustainable growth and efficient resource utilization.
Findings from Technical Benchmarking
Over the course of four weeks, we conducted the PoC on Snowflake, with a primary focus on data ingestion, transformations and run management, streaming and machine learning (ML), and overall observability. We discovered that, while Snowflake offers native connectors for MySQL and PostgreSQL, these connectors came with setup limitations. In particular, an external staging area was required for the ingestion process, which introduced an extra layer of overhead compared to our existing workflows. In addition, the native connectors could not read data from our read-only replicas, so ingestion via these connectors failed early in the PoC implementation.
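To make the staging requirement concrete, here is a minimal sketch of what that extra hop typically looks like when driven from Python with the snowflake-connector-python package. The bucket, stage, table, and warehouse names are illustrative assumptions rather than our actual setup, and the exact mechanics depend on how the connector is configured.
```python
# Minimal sketch of the extra staging hop described above, using
# snowflake-connector-python. All names (bucket, stage, table, warehouse)
# are illustrative: source rows are first exported to S3, exposed as an
# external stage, and only then copied into a Snowflake table.
import snowflake.connector

conn = snowflake.connector.connect(
    account="<account_identifier>",
    user="<user>",
    password="<password>",
    warehouse="INGEST_WH",
    database="RAW",
    schema="SALES",
)
cur = conn.cursor()

# External stage over the S3 drop zone the MySQL/Postgres export lands in
cur.execute("""
    CREATE STAGE IF NOT EXISTS mysql_orders_stage
      URL = 's3://example-exports/mysql/orders/'
      CREDENTIALS = (AWS_KEY_ID = '<aws_key_id>' AWS_SECRET_KEY = '<aws_secret_key>')
      FILE_FORMAT = (TYPE = PARQUET)
""")

# Bulk load the staged files into the target table
cur.execute("""
    COPY INTO orders
    FROM @mysql_orders_stage
    MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
""")

cur.close()
conn.close()
```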
When examining Snowflake’s capabilities for transformations and run management, we found its Streams and Tasks features, along with Dynamic Tables, promising for achieving near real-time data transformations while enabling data analysts to maintain transformations with as little involvement from data engineers as possible. Despite this potential, however, the scheduling and orchestration capabilities felt less mature than Databricks’ Data Engineering Workflows functionality. This was an important consideration for our team, as we wanted a robust solution that seamlessly integrates with the rest of our data infrastructure and processes. In terms of observability, Snowflake’s monitoring dashboards proved useful for tracking various aspects of the platform’s health and performance, but some elements, such as version control or detailed run histories, would still require additional tooling or processes to manage effectively.
During the same period, we continued to rely on Databricks and observed that it consistently demonstrated several advantages for our workflows. One such advantage was workflow simplicity: Databricks allowed us to effortlessly manage both our nightly ETL jobs and our near real-time clickstream ingestion, typically at 15-minute intervals or less. Databricks’ SQL notebooks and collaboration features also proved valuable, enabling our data analysts and engineers to work side by side in shared workspaces and easily switch between SQL-based transformations and Python Spark-based jobs depending on the task at hand. This gap in orchestration and collaboration maturity highlighted an opportunity for further improvement or integration within Snowflake’s ecosystem.
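For reference, a Dynamic Table of the kind we prototyped for near real-time metric refreshes looks roughly like the sketch below; the warehouse, schema, table names, and the 15-minute target lag are illustrative assumptions. The appeal is that the table re-materializes itself to stay within the declared lag, so analysts only have to maintain the SELECT statement.
```python
# Illustrative Dynamic Table definition, executed through the Python connector
# as in the previous sketch; the warehouse, schema, and 15-minute target lag
# are assumptions for the example, not a production configuration.
import snowflake.connector

conn = snowflake.connector.connect(
    account="<account_identifier>",
    user="<user>",
    password="<password>",
    warehouse="TRANSFORM_WH",
    database="ANALYTICS",
    schema="REPORTING",
)

conn.cursor().execute("""
    CREATE OR REPLACE DYNAMIC TABLE daily_revenue
      TARGET_LAG = '15 minutes'
      WAREHOUSE = TRANSFORM_WH
      AS
      SELECT order_date, campaign_id, SUM(amount) AS revenue
      FROM RAW.SALES.ORDERS
      GROUP BY order_date, campaign_id
""")

conn.close()
```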
Another factor we assessed was Snowflake’s support for streaming and machine learning. Snowflake markets itself as a platform that can readily handle real-time use cases and out-of-the-box ML solutions, such as anomaly detection. However, because we already relied on a well-established AWS Kinesis pipeline that feeds directly into Databricks, the effort required to replicate or modify that pipeline for Snowflake did not seem justified by the additional value it would bring. Consequently, we decided that our existing setup remained preferable for our streaming and ML needs.
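For context, our existing clickstream path follows the standard Structured Streaming pattern on Databricks, roughly as sketched below. The stream name, checkpoint location, and target table are assumptions for the example, and authentication is assumed to come from the cluster’s instance profile.
```python
# Simplified sketch of the Kinesis -> Delta clickstream ingestion on Databricks.
# Stream, checkpoint, and table names are illustrative; credentials are assumed
# to be provided by the cluster's instance profile.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

clickstream = (
    spark.readStream
    .format("kinesis")                        # Databricks-provided Kinesis source
    .option("streamName", "web-clickstream")
    .option("region", "eu-central-1")
    .option("initialPosition", "latest")
    .load()
)

query = (
    clickstream
    .selectExpr("CAST(data AS STRING) AS payload",
                "approximateArrivalTimestamp AS arrived_at")
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/web_clickstream")
    .trigger(processingTime="15 minutes")     # matches the near real-time cadence above
    .toTable("raw.web_clickstream")
)
```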
Finally, Databricks’ cluster management features, while not fully automated, provided flexible cluster policies and auto-termination settings that helped us maintain control over costs. Configuring clusters manually is more hands-on, but the ability to customize these settings suited our team’s needs, especially when managing a range of workloads across development, testing, and production environments. As Databricks’ serverless compute for notebooks, jobs, and pipelines gains traction, we are keen to evaluate this fully managed, auto-optimizing compute option so that workloads can run without time spent manually provisioning infrastructure.
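To illustrate the kind of guardrails we mean, the sketch below defines a hypothetical cluster policy with bounded auto-termination and autoscaling and registers it via the databricks-sdk; the policy name, node types, and limits are assumptions for the example rather than production values.
```python
# Hypothetical cluster policy enforcing auto-termination and bounded autoscaling,
# created with the databricks-sdk (assumes workspace authentication is already
# configured, e.g. via environment variables or a Databricks config profile).
import json

from databricks.sdk import WorkspaceClient

policy_definition = {
    # Clusters must auto-terminate: default after 30 idle minutes, at most 60
    "autotermination_minutes": {"type": "range", "maxValue": 60, "defaultValue": 30},
    # Cap autoscaling so ad-hoc workloads cannot grow unbounded
    "autoscale.max_workers": {"type": "range", "maxValue": 8},
    # Restrict instance types to a known, cost-controlled set
    "node_type_id": {"type": "allowlist", "values": ["m5.xlarge", "m5.2xlarge"]},
}

w = WorkspaceClient()
w.cluster_policies.create(
    name="analytics-dev-policy",
    definition=json.dumps(policy_definition),
)
```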
Overall, our PoC revealed that Snowflake has strong potential in key areas such as transformations with Dynamic Tables, data sharing, and user-friendly monitoring tools. However, the need for external staging in ingestion workflows, the less mature scheduling and orchestration, and the overhead of reworking our streaming pipeline limited the immediate benefits. Meanwhile, Databricks continued to offer a more straightforward and unified environment for our existing workloads, including ETL jobs, collaborative analytics, and flexible cluster management, which Databricks has since extended with a serverless option as well.
Conclusion
From our PoC, we encountered a few notable hurdles that ultimately steered us toward remaining with Databricks. In a nutshell, while Snowflake offers powerful features, especially around concurrency and data sharing, it did not sufficiently outweigh the benefits and momentum we already had with Databricks.
First, we faced compatibility issues with Snowflake’s MySQL and Postgres connectors; the recommended external staging introduced unwanted complexity and added workload for our DevOps resources. Second, Snowflake’s orchestration tools, while promising, still lacked the comprehensive job management and scheduling capabilities our team relies on for high-velocity data transformations. Third, transitioning our analysts to Snowflake’s UI or an external orchestration solution would require additional training and configuration, creating friction when we already had well-established workflows in Databricks, where analysts enjoy a seamless, collaborative SQL and Python development experience.
In the end, we concluded that scaling our data platform with Databricks minimized overhead, especially since we already had a reliable proof of concept with seamless near real-time ingestion for our AWS-based web-tracking pipelines. Although Snowflake boasts robust concurrency, hot storage availability, data sharing, and ML features, these advantages did not outweigh the maturity and total cost of ownership of our existing Databricks setup. As a result, we will continue leveraging Databricks for our Data Warehousing and Business Reporting, knowing that Snowflake remains a strong competitor should our needs evolve in the future.
If you’re looking to evaluate these platforms yourself, we recommend starting small, defining clear PoC benchmarks, and involving all key stakeholders early. This ensures that your final decision is rooted in practical experience and organizational readiness — just as ours was.
_____________________
Recognition: Throughout our engagement, the Snowflake account manager and technical sales engineer demonstrated a high level of professionalism — helping to refine the benchmarking requirements, guiding us through the PoC implementation, and conducting a thorough architecture review to ensure a comprehensive evaluation.