Scalable Data Lake
Explore the concept of scalable data lakes on AWS using Amazon S3 and Lake Formation. Understand how these solutions differ from data warehouses and production databases, and how they support the storage and analysis of structured and unstructured data for business intelligence and machine learning.
A data lake is a centralized repository for data ingested from many different sources. The term was coined around 2011 to distinguish it from other forms of centralized data storage. Poorly managed data lakes have since earned the nickname “data swamps.”
In this lesson, we consider the AWS approach for setting up a data lake and how a data lake differs from data warehouses and production data stores.
AWS services for a scalable data lake
The AWS team suggests the following two services for setting up a scalable data lake: Simple Storage Service (S3) and Lake Formation.
Amazon S3
Amazon’s Data Lake on AWS architecture recommends S3 as the centralized location to store data of all formats.
Amazon S3 is a scalable and cost-effective way to store a variety of objects and is widely used by AWS customers of all sizes and industries.
Since its launch in 2006, S3 has grown to store over 100 trillion objects and can handle tens of millions of requests per second.
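Amazon S3 resembles a cloud-based file system, although it is an object store. It consists of buckets that hold objects, each identified by a key. Folders are not separate entities: a key such as raw/sales/2024-01.csv simply shares the prefix raw/sales/ with its neighbors, and tools present that prefix as a folder.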
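To make the bucket-and-object model concrete, the following is a minimal sketch in plain Python, not a call to the real S3 API. It assumes a bucket is a flat collection of object keys and that a "folder" is only a shared key prefix ending in a delimiter. The function name list_objects and the sample keys are hypothetical and chosen for illustration.

```python
def list_objects(keys, prefix="", delimiter="/"):
    """Toy model of a folder-style listing over a flat key namespace.

    Returns (files, folders): keys directly under `prefix`, and the
    distinct "folder" prefixes one level below it.
    """
    files, folders = [], set()
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter and delimiter in rest:
            # Keys sharing the next path segment collapse into one "folder"
            folders.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            files.append(key)
    return files, sorted(folders)


# Hypothetical keys stored in a single bucket
bucket_keys = [
    "raw/sales/2024-01.csv",
    "raw/sales/2024-02.csv",
    "raw/logs/app.json",
    "curated/report.parquet",
    "README.txt",
]

top_files, top_folders = list_objects(bucket_keys)
raw_files, raw_folders = list_objects(bucket_keys, prefix="raw/")
sales_files, sales_folders = list_objects(bucket_keys, prefix="raw/sales/")

print(top_files, top_folders)
print(raw_files, raw_folders)
print(sales_files, sales_folders)
```

Listing with no prefix shows one top-level file and two apparent folders, even though the bucket holds only five flat keys. This is why a data lake can keep raw and curated data of any format side by side in one bucket and separate them by key prefix alone.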