The inevitable shift from Data Warehouses to Data Lakes

Organizations are increasingly facing one of two questions: ‘Is it time to move from the data warehouse to a data lake?’ or ‘We want to build a warehouse. Should we go for a data lake instead?’

A data lake is by no means a replacement for a data warehouse. Let’s look briefly at what each of these terms means:

A data warehouse is a repository where data is always maintained and processed in a highly structured manner – in a group of relational databases.

A data lake, on the other hand, is a repository that allows storing all kinds of data: raw, semi-structured, and structured. It is essentially built on the ‘big data’ (Hadoop) ecosystem, for the very reason that Hadoop allows data to be stored in any structure.

For decades, the data warehouse (DW) has been the main system for data discovery and business intelligence (BI), where BI was all about insights from historical data: What happened? Why did it happen? What should be done? The frequency of insights was occasional: quarterly, monthly, weekly, or (at most) daily. Essentially, a DW is built for current needs and not so much for predicting the future (although some BI tools do allow a bit of predictive analytics). The data sources were essentially databases (transaction databases, CRM databases, product databases) or structured data (e.g. CSV files). “Structured” is the key word. Not only was structured data the only kind that could be stored; even raw data had to be brought into a structure before being pushed into a warehouse.

Typical flow in a DW: extract data from multiple sources -> convert it to structured data -> load it into a staging area (a database) -> transform the data (segregate, aggregate, filter, rearrange, correlate) using database/SQL commands -> save the transformed data into data marts (databases) -> slice and dice the data to produce data cubes (database views) -> create reports -> pull the reports into the dashboards of a BI tool. From the start to the end of the journey, the data is highly structured.
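The flow above can be sketched as a toy extract-transform-load pipeline. This is a minimal, hypothetical sketch with in-memory sample data; a real warehouse would run SQL over staging tables and persist the result as a data mart:

```python
import csv
import io
from collections import defaultdict

# Extract: raw CSV as it might arrive from a source system (hypothetical data)
RAW_SALES = """order_id,region,product,amount
1,West,shoes,120.00
2,East,shoes,80.00
3,West,books,35.50
"""

def extract(raw: str) -> list[dict]:
    """Parse the source CSV into structured rows -- structure is imposed upfront."""
    return list(csv.DictReader(io.StringIO(raw)))

def transform(rows: list[dict]) -> dict:
    """Aggregate revenue by region, the kind of shaping done in the staging area."""
    mart = defaultdict(float)
    for row in rows:
        mart[row["region"]] += float(row["amount"])
    return dict(mart)

# Load: the aggregated 'data mart' a BI tool would then report on
data_mart = transform(extract(RAW_SALES))
print(data_mart)  # {'West': 155.5, 'East': 80.0}
```

Note that if the BI report later needed revenue by product instead of by region, the `transform` step (and every downstream stage) would have to change – exactly the rigidity the article describes.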

This explanation makes it easy to see why such a setup is good enough for weekly/monthly/quarterly insights into historical data. Also, since the data has to be shaped upfront, even a small change in the final BI report can trigger a change in the structure at every single stage of the data chain.

A data lake, on the other hand, is designed to store data in any format: raw, unstructured, semi-structured or structured. The very fact that data purity and structure can be of any nature gives this repository the name ‘lake’. Different data streams (similar to water streams) can carry structured data (analogous to pure water), semi-structured data (partially filtered water) or raw data (unfiltered, muddy water) into the data lake. The lake can harbour any and every kind of data (RDBMS tables, files, proprietary databases, CSVs, logs, XML, JSON, emails, documents, PDFs, social media, and even images, video and audio). The other important purpose is to store vast amounts of data at a much lower cost. Hence a data lake essentially comes into the picture when the data is ‘big data’, and the Hadoop ecosystem is the underlying infrastructure to store, compute and analyse it.

Since a data lake stores all kinds of data, it allows flexibility in correlating and slicing and dicing data late in the game. You can give structure to the data only when you need to. A data lake allows ad-hoc analysis and just-in-time slicing and dicing of data. It allows picking up data anywhere in the data chain and shaping it up for insights. In the world of big data (heterogeneous data at high volume), data discovery cannot be fathomed upfront. It has to be delayed until later – which means the structuring of the data has to be delayed.
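This ‘structure late’ idea is often called schema-on-read. A small sketch (with hypothetical event data; in a real lake the raw lines would sit in HDFS files) shows how raw records are stored as-is and parsed only when queried:

```python
import json

# Raw, heterogeneous events land in the lake untouched (newline-delimited JSON)
raw_events = [
    '{"type": "click", "user": "u1", "page": "/shoes"}',
    '{"type": "purchase", "user": "u1", "amount": 59.99}',
    '{"type": "tweet", "user": "u2", "text": "love these shoes"}',
]

def purchases(lines):
    """Apply structure at query time, and only to the fields this query needs."""
    for line in lines:
        event = json.loads(line)  # parsed on read, not on write
        if event.get("type") == "purchase":
            yield event["user"], event["amount"]

print(list(purchases(raw_events)))  # [('u1', 59.99)]
```

A new question (say, counting clicks per page) needs only a new reader function, not a change to how the data was stored.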

Let’s look at an example of when a data warehouse would be useful and when a data lake would be. Imagine an online retail company, TheUltraBasket. TheUltraBasket maintains all purchases in a transactions database; a CRM database maintains information about customers, and a ProductCatalog database maintains information on products and prices.

Part A: The BI team at TheUltraBasket wants the following insight: ‘revenue earned from the sale of each brand of shoes, across various age groups, across various geographies in the United States of America, in the month of January 2017’.

A data warehouse setup would seek the customer information from the CRM database, the shoe, brand, and pricing information from the ProductCatalog database, and the sale amount for each sale from the transactions database. The data would be picked up every month. Then correlations would be applied and data marts (a data mart is a subset of the data warehouse, usually oriented to a specific business line or team) would be created so that the aggregated data could be sliced and diced by geography, product category, shoe brand, and age group, for each month of 2017 (you get the point). Then reports would be published using a BI/visualization tool. The data warehouse would do this job aptly every month.
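As a toy illustration of that monthly job (all tables here are hypothetical samples; a real warehouse would express this as SQL joins with a GROUP BY, materialized into a data mart):

```python
from collections import defaultdict

# Hypothetical rows from the three source databases
crm = {"c1": {"age_group": "18-25", "state": "CA"},
       "c2": {"age_group": "26-35", "state": "NY"}}
catalog = {"p1": {"category": "shoes", "brand": "Acme"},
           "p2": {"category": "shoes", "brand": "Zoom"}}
transactions = [  # (customer_id, product_id, amount), January 2017
    ("c1", "p1", 100.0),
    ("c2", "p1", 90.0),
    ("c2", "p2", 70.0),
]

# Join the three sources and aggregate revenue by (brand, age group, state)
mart = defaultdict(float)
for cust, prod, amount in transactions:
    if catalog[prod]["category"] != "shoes":
        continue
    key = (catalog[prod]["brand"], crm[cust]["age_group"], crm[cust]["state"])
    mart[key] += amount

print(dict(mart))
# {('Acme', '18-25', 'CA'): 100.0, ('Acme', '26-35', 'NY'): 90.0,
#  ('Zoom', '26-35', 'NY'): 70.0}
```

Each tuple key is one cell of the cube the BI tool would slice and dice.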

Part B: Now let’s throw in some complexity. Suppose TheUltraBasket wants to do the following:

  1. Track each and every action of each of their 100,000 customers (what they browse, what they buy, what they add to their carts but do not actually buy).
  2. Monitor the purchase history of each customer.
  3. Based on the above parameters, predict what each of them is likely to buy the next time they log in.
  4. Track what the customers are tweeting about products sold on their portal.
  5. Calculate the discount on each product that each customer is likely to buy (note that this needs to be done for every customer).
  6. At the end of each month, measure the effectiveness of the discount campaigns (how much more revenue was generated due to the discounted offers).
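A tiny sketch of requirement 3 – predicting a customer’s next likely purchase from raw clickstream events. The logic here is purely illustrative (a naive frequency count); a real system would train a recommendation model over the lake:

```python
from collections import Counter

# Raw clickstream events as they might land in the lake (hypothetical)
events = [
    {"user": "u1", "action": "browse", "category": "shoes"},
    {"user": "u1", "action": "browse", "category": "shoes"},
    {"user": "u1", "action": "browse", "category": "books"},
    {"user": "u1", "action": "add_to_cart", "category": "shoes"},
]

def likely_next_buy(events, user):
    """Naive predictor: the category this user interacts with most often."""
    counts = Counter(e["category"] for e in events if e["user"] == user)
    return counts.most_common(1)[0][0] if counts else None

print(likely_next_buy(events, "u1"))  # shoes
```

Even this trivial version depends on keeping every raw interaction event, which is exactly what a warehouse’s upfront schema makes awkward.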

First, the data that needs to be ingested is quite huge (tracking every mouse click for every customer).

Second, the data is a mix of unstructured data (events, clickstream data, Twitter streams, etc.) and structured data (previous purchase history of customers, previous sale history of products).

Third, slicing and dicing of this huge data set has to be done for each user to offer her the best possible deals.

Fourth, there is a tremendous amount of reading and writing of data, since every user follows a separate path of data transformation.

Fifth, at the end of the month, historical data has to be published in a report indicating the effectiveness of the discount campaigns.

A data lake, of course, would solve both Part A and Part B, whereas a data warehouse would solve only Part A. A data lake also offers the advantage of storing huge amounts of raw or semi-cooked data at much lower storage cost. You can build custom analyses on this data in the future. This huge repository of data, and the flexible underlying technology, can help build predictive analytics that will drive future decisions with much more ease than a data warehouse.

Now, to answer the questions from the top of this article (‘Is it time to move from the data warehouse to a data lake?’ and ‘We want to build a warehouse. Should we go for a data lake instead?’): the approach of every organization should be based on the amount of data it wants to analyse and the type of analyses it wants to build.

  1. Organizations that do not have a warehouse yet should definitely plan to go the data lake way.
  2. Organizations that have data warehouses, have started dealing with much larger volumes of data, and feel a need for ad-hoc analysis should seriously consider converting to data lakes. Data lakes can of course have databases as part of their setup.

Comments 1

Anoop

May 31, 2017

Great writeup comparing data lakes to data warehouses. Can you elaborate on the databases as part of the DL setup? Can the high-performance DBs not be downstream from Hadoop? If one has an existing warehouse, how do you work towards a “unified data hub” that abstracts away the underlying polyglot?