
Data Lake Definition

Dragon1 Icon for Data Lake

CREATED BY ANONYMOUS, CREATIVE COMMONS LICENSE

Dragon1 Definition for Data Lake: A data lake is a governed storage repository architecture holding a large amount of raw data (i.e. data in its native format) that is only defined and structured upon usage.

Data Lake Definition Summary

What is a Data Lake? What does it mean?

Data lakes are in the news today. More and more IT managers and enterprise architects make the data lake a core part of their organization's future architecture. Many people think that a data lake is just some kind of data warehouse, but the two really are different things, although they have similarities.

The short definition of a data lake is: "A data lake is a governed storage repository architecture holding a large amount of raw data (i.e. data in its native format) that is only defined and structured upon usage."

Let us investigate the differences.

What is the Difference Between a Data Lake and a Data Warehouse?

Both are repositories for storing data. That is their main resemblance.

A data lake holds structured, semi-structured and unstructured data. The data structure and requirements are not defined until the data is needed, and the data itself is not changed before then. This speeds up extracting and loading the data, and makes it faster to start working with it.

A data warehouse is a large store of data accumulated from a wide range of sources within a company and used to guide management decisions.

Differences in Treating Data

A data lake stores all data without changing it. A data warehouse stores data that has first been made fit for storage, meaning it has been defined and structured.

Differences in Processing Data

Data is loaded using two different approaches. In a data lake, data is loaded via schema-on-read, meaning the data is loaded as raw data, as-is, and a schema is applied only when the data is read. In a data warehouse, data is loaded via schema-on-write, meaning the data is defined and structured before it is loaded.
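The two loading approaches can be illustrated with a minimal Python sketch. It is not taken from any data lake or data warehouse product: the sample events, the schema dictionaries and the function names (load_into_warehouse, load_into_lake, read_from_lake) are all hypothetical and only serve to show where the schema is applied.

```python
import json

# Hypothetical raw events, as they might arrive from source systems.
RAW_EVENTS = [
    '{"customer": "Ann", "amount": "19.90", "channel": "web"}',
    '{"customer": "Bob", "amount": "5", "note": "gift"}',
]

# Schema-on-write (data warehouse style): the schema is fixed up front.
WAREHOUSE_SCHEMA = {"customer": str, "amount": float}


def load_into_warehouse(raw_lines):
    """Define and structure every record before it is stored."""
    table = []
    for line in raw_lines:
        record = json.loads(line)
        # Fields outside the schema (channel, note) are dropped at load time.
        table.append({name: cast(record[name])
                      for name, cast in WAREHOUSE_SCHEMA.items()})
    return table


def load_into_lake(raw_lines):
    """Schema-on-read (data lake style): store every record as-is."""
    return list(raw_lines)


def read_from_lake(lake, schema):
    """Apply a schema only at the moment the data is used."""
    rows = []
    for line in lake:
        record = json.loads(line)
        rows.append({name: cast(record[name]) if name in record else None
                     for name, cast in schema.items()})
    return rows


warehouse = load_into_warehouse(RAW_EVENTS)
lake = load_into_lake(RAW_EVENTS)

# Two different consumers read the same raw data with different schemas.
amounts = read_from_lake(lake, {"customer": str, "amount": float})
channels = read_from_lake(lake, {"customer": str, "channel": str})
```

In this sketch the warehouse table has lost the channel field, because it was not part of the schema at load time, while the lake can still answer a question about channels that nobody anticipated when the data was stored.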

An Agile Solution

A data lake is an agile solution because definitions and structures only have to be created at the moment the data is needed; models, queries and apps can then be generated. A data warehouse is less agile, because the business processes that depend on parts of it do not permit sudden changes to its structure. Data warehouses therefore cannot be changed as quickly as data lakes.

A Secure Solution

A data lake is a relatively new technology, so data lake products are built using the newest security requirements and principles. The data warehouse is a fairly old technology, so products built on it tend to reflect older security requirements and principles.

Hadoop

A data lake is often, though not always, implemented using Hadoop.

Hadoop is an open source software framework for the distributed storage and processing of big data sets using the MapReduce programming model. Hadoop runs on computer clusters built from commodity hardware. Many data lake solutions make use of or are related to Hadoop. But of course Hadoop is not the only software framework for data lakes.
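To give an impression of the MapReduce programming model, here is a minimal single-process sketch in plain Python that counts words. It does not use Hadoop or any Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration. In a real cluster the map and reduce steps would run in parallel on many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Each document could be mapped independently, on a different machine.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    # Each key could likewise be reduced independently.
    return dict(reduce_phase(key, values)
                for key, values in shuffle(pairs).items())


counts = word_count(["data lake", "data warehouse", "raw data"])
print(counts)
```

Because the map step looks at one document at a time and the reduce step at one key at a time, the work can be split over many machines, which is what makes the model suitable for clusters of commodity hardware.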

Data Lake Architecture Principle

The architecture principle of the data lake concept is: By concentrating all data in one collection and placing smart governance on top of it, without spending time and resources on restructuring or defining data prior to usage, the business can be presented with a much better single and agile data view than otherwise.

The above picture shows a data lake design pattern compliant with the principle.

If you have comments or remarks about this Dragon1 term or definition, please mail to specs@dragon1.com.