What is a data lake?

A data lake is a storage repository that holds a large amount of data in its native, raw format. Data lake stores are optimized for scaling their size to terabytes and petabytes of data. The data typically comes from multiple diverse sources and can include structured, semi-structured, or unstructured data. A data lake helps you store everything in its original, untransformed state. This method differs from a traditional data warehouse, which transforms and processes data at the time of ingestion.

A diagram that shows various data lake use cases.

Key data lake use cases include:

Consider the following advantages of a data lake:

A complete data lake solution consists of both storage and processing. Data lake storage is designed for fault tolerance, infinite scalability, and high-throughput ingestion of various shapes and sizes of data. Data lake processing involves one or more processing engines that can incorporate these goals and can operate on data that's stored in a data lake at scale.

When you should use a data lake

We recommend that you use a data lake for data exploration, data analytics, and machine learning.

A data lake can act as the data source for a data warehouse. When you use this method, the data lake ingests raw data and then transforms it into a structured queryable format. Typically, this transformation uses an extract, load, transform (ELT) pipeline in which the data is ingested and transformed in place. Relational source data might go directly into the data warehouse via an ETL process and skip the data lake.

You can use data lake stores in event streaming or IoT scenarios because data lakes can persist large amounts of relational and nonrelational data without transformation or schema definition. Data lakes can handle high volumes of small writes at low latency and are optimized for massive throughput.

The following table compares data lakes and data warehouses.

A table that compares data lake features with data warehouse features.

Challenges

Technology choices

When you build a comprehensive data lake solution on Azure, consider the following technologies:

Contributors

This article is maintained by Microsoft. It was originally written by the following contributors.

To see non-public LinkedIn profiles, sign in to LinkedIn.

Next steps

Related resources