Data Lake Definition

Let us define the data lake

Data lakes are in the news today. Increasingly, IT managers and enterprise architects are incorporating data lakes as a core part of their organization's future architecture. Many people think a data lake is simply a type of data warehouse. However, a data lake and a data warehouse are distinct, though they share similarities.

The short definition of a data lake is: "A data lake is a governed storage repository architecture holding a large amount of raw data (i.e., data in its native format) that is only defined and structured upon usage."

Let us investigate the differences.

What is the Difference Between a Data Lake and a Data Warehouse?

Both are repositories for storing data. That is where the resemblance ends.

A data lake holds structured, semi-structured, and unstructured data. The data's structure and requirements are not defined, and the data itself is not changed, until it is needed. Skipping that upfront work speeds up data extraction, loading, and processing.

A data warehouse is a large store of data accumulated from various sources within a company and used to guide management decisions.

Differences in Treating Data

A data lake stores all data without changing it. A data warehouse stores data that has already been defined and structured for storage.

Differences in Processing Data

The two repositories load data using different approaches. A data lake uses schema-on-read: data is loaded as-is, in its raw form, and a schema is applied only when the data is read. A data warehouse uses schema-on-write: the data is defined and structured before it is loaded.
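The contrast between the two loading approaches can be illustrated with a minimal Python sketch. It is not taken from any particular product: the `Warehouse` and `Lake` classes, the `apply_schema` helper, and the `SCHEMA` fields are hypothetical names chosen for illustration, and the "storage" is just an in-memory list.

```python
import json

# Hypothetical schema (an assumption for this sketch): field name -> type
# used to coerce the raw value.
SCHEMA = {"user_id": int, "amount": float}


def apply_schema(record, schema):
    """Keep only the schema's fields and coerce them; raises if the record does not fit."""
    return {field: cast(record[field]) for field, cast in schema.items()}


class Warehouse:
    """Schema-on-write: records are defined and structured before they are stored."""

    def __init__(self, schema):
        self.schema = schema
        self.rows = []

    def load(self, raw_line):
        # Parsing and validation happen at load time, so bad input is rejected here.
        self.rows.append(apply_schema(json.loads(raw_line), self.schema))

    def read(self):
        return list(self.rows)


class Lake:
    """Schema-on-read: raw lines are stored untouched; a schema is applied per read."""

    def __init__(self):
        self.raw = []

    def load(self, raw_line):
        # No parsing or validation: the data is kept in its native format.
        self.raw.append(raw_line)

    def read(self, schema):
        # Structure is imposed only now, by whichever schema the reader supplies.
        return [apply_schema(json.loads(line), schema) for line in self.raw]
```

In this sketch, loading a line such as `{"user_id": "1", "amount": "9.50", "note": "gift"}` into the warehouse keeps only `user_id` and `amount`, while the lake keeps the whole line, so a later reader can still ask for `note` by passing a different schema to `Lake.read`.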

An Agile Solution

A data lake is an agile solution because definitions and structures are created only when the data is needed, at which point models, queries, and apps can be built on top of them. A data warehouse is less agile: the business processes that depend on specific parts of the warehouse constrain how quickly its structure can change. As a result, data warehouses cannot be changed as quickly as data lakes.

A Secure Solution

A data lake is a relatively new technology, so data lake products tend to be designed around current security requirements and principles. A data warehouse is a fairly old technology, so products built on it may reflect older security requirements and principles.

Hadoop

A data lake is often, but not always, implemented using Hadoop.

Hadoop is an open-source software framework for distributed storage and processing of big data sets using the MapReduce programming model. It runs on computer clusters built from commodity hardware. Many data lake solutions use or are related to Hadoop. However, Hadoop is not the only framework for data lakes.
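To give a feel for the MapReduce programming model, here is a small word-count sketch in plain Python. It does not use Hadoop or any of its APIs and runs in a single process. The function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are illustrative assumptions. It only shows the three conceptual steps that a framework like Hadoop would distribute across a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an iterable of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

Because each call to `map_phase` looks at one line only and each call to `reduce_phase` looks at one key only, the work can in principle be split across many machines, which is the property that makes the model suitable for large data sets.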

Data Lake Architecture Principle

The architecture principle of the data lake concept is: by concentrating all data in one collection and placing smart governance on top of it, without spending time and resources on restructuring or defining data before usage, the business can be presented with a much better single, agile view of its data than otherwise.

The above visualization shows a data lake design pattern compliant with the principle.
