Data lake vs data lakehouse

1/23/2024

As a follow-up to my blog Data Lakehouse & Synapse, I wanted to talk about the various definitions I am seeing of what a data lakehouse is, including a recent paper by Databricks.

Databricks uses the term “Lakehouse” in their paper (see Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics), which argues that the data warehouse architecture as we know it today will wither in the coming years and be replaced by a new architectural pattern, the Lakehouse. Instead of the two-tier data lake + relational data warehouse model, you will just need a data lake, which is made possible by implementing data warehousing functionality over open data lake file formats. While I agree there may be some use cases where technical designs allow Lakehouse systems to completely replace relational data warehouses, I believe those use cases are much more limited than this paper suggests. It’s funny how when Hadoop first came out, I heard many say that the end of relational data warehouses was here and that you should just use a data lake for everything. This was clearly a mistake (see Is the traditional data warehouse dead?). Now we’re here once again!

For simplicity I’ll break down a data lakehouse into two types of architectures: one-tier, which is just a data lake (in the form of schema-on-read storage) and which I’ll call NoEDW, and two-tier, which is a data lake plus a relational database (in the form of an enterprise data warehouse, or EDW) and which I’ll call ProEDW. While Databricks touts NoEDW by using Delta Lake and SQL Analytics, Microsoft touts ProEDW with Azure Synapse Analytics.

For NoEDW, my thought process is: if you are trying to make a data lake work like a relational database, why not just use a relational database (RDBMS)? Then the data lake can do what it is good at, and the RDBMS can do what it is good at. The extra cost, complexity, and time to value of incorporating a relational database into a data lakehouse is worth it for many reasons, one of which is that a relational database combines the metadata with the data, making self-service BI much easier than with a data lake, where the metadata is in many cases separated from the data. This becomes even more apparent when you are dealing with data from many different sources. I can certainly see some use cases where you could be fine with a NoEDW: if you have a small amount of data, if the users are all data scientists (hence have advanced technical skills), if you just want to build a POC, or if you want to get a quick win with a report/dashboard. The additional prior reason, which was to save costs, carries much less weight now that the storage cost for Synapse has dropped around 80%, so it is about the same cost as data lake storage.

But I still see it being very difficult to manage a solution with just a data lake when you have data from many sources. Having the metadata along with the data in a relational database allows everyone to be on the same page as to what the data actually means, versus more of a wild west with a data lake. And a ProEDW gives you the additional benefits of speed, security, and features that I mentioned at Data Lakehouse & Synapse. In addition, the NoEDW option requires using Delta Lake, which adds a layer of complexity and requires every tool that uses the data lake to support Delta Lake. Also note that Delta Lake does not support cross-table transactions and that Databricks does not have a pay-per-query approach like Synapse serverless has. The paper also does not discuss how master data management (MDM) fits in; MDM solutions are almost always built on relational databases. You definitely do need a data lake (see reasons).

The bottom line is you can try to get by with just a NoEDW, but it is very likely that you will run into issues and will need to have some of the data in the data lake copied to a relational database. What I’m seeing customers do is adopt a lakehouse architecture that goes beyond the data lake and the data warehouse. They do this by using federated queries to integrate data across the data lake, data warehouses, and any purpose-built data services that are being used. Part of this architecture is making it easy to query data in multiple sources by building out a semantic layer using a distributed query engine like Presto or OPENROWSET in a serverless pool in Synapse. With Synapse you can do a federated query over ADLS Gen2, Spark Tables, and Cosmos DB, and eventually others such as Synapse dedicated pools, SQL Database and SQL Managed Instance.
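To make the OPENROWSET idea from this post a bit more concrete, here is a minimal Python sketch that only builds the kind of T-SQL statement a Synapse serverless SQL pool can run against Parquet files in ADLS Gen2. The storage account, container, and path used here (`examplelake`, `raw`, `sales/2024/*.parquet`) are made-up placeholders, and the helper name `build_openrowset_query` is my own. Actually running the statement would need a database driver and a real serverless endpoint, which this sketch deliberately leaves out.

```python
def build_openrowset_query(storage_account: str, container: str, path: str, top: int = 10) -> str:
    """Build a T-SQL statement that reads Parquet files in ADLS Gen2 via OPENROWSET.

    All arguments are illustrative placeholders, not real resources.
    """
    if top <= 0:
        raise ValueError("top must be a positive integer")

    # ADLS Gen2 URL pattern: https://<account>.dfs.core.windows.net/<container>/<path>
    url = f"https://{storage_account}.dfs.core.windows.net/{container}/{path}"
    # Double any single quotes so the URL stays inside the T-SQL string literal.
    url = url.replace("'", "''")

    return (
        f"SELECT TOP {top} *\n"
        "FROM OPENROWSET(\n"
        f"    BULK '{url}',\n"
        "    FORMAT = 'PARQUET'\n"
        ") AS [result];"
    )


if __name__ == "__main__":
    # Hypothetical account, container, and path, for illustration only.
    print(build_openrowset_query("examplelake", "raw", "sales/2024/*.parquet"))
```

The point of the sketch is that the files stay in the data lake and the schema is inferred at query time, which is the schema-on-read side of the NoEDW vs ProEDW trade-off discussed above.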
