
Fundamentals of Data and Lineage

Understanding the fundamental characteristics of data—volume, velocity, and variety—guides AWS data engineers in making architectural decisions. These dimensions influence service selection, data partitioning, and processing methods. Choosing the appropriate storage format, such as Parquet for analytical workloads, is crucial for optimizing query performance. Data lineage ensures the traceability and trustworthiness of data, while schema evolution mechanisms in AWS facilitate adaptability to changing data structures. Rigorous data validation is essential to maintain data integrity and quality throughout the pipeline, preventing issues that could arise from schema changes.

Many AWS data architecture decisions start with understanding the data as it moves through your pipelines. Before you choose storage, design ETL, or configure streaming ingestion, you need to evaluate the data’s volume, velocity, variety, format, lineage requirements, and schema behavior. The AWS Certified Data Engineer – Associate exam expects you to apply this understanding across multiple question types and map data characteristics to appropriate AWS service choices. This lesson introduces the foundational concepts: the three Vs, storage formats, lineage tracking, and schema evolution that underpin many topics you’ll see on the exam and in production data engineering on AWS.

Understanding data characteristics

The three primary dimensions that shape AWS data architecture are volume, velocity, and variety, collectively referred to as the 3 Vs. These dimensions directly determine which services you provision, how you partition data, and whether your pipeline processes records in batch or real time.

  • Volume refers to the scale of data your pipeline must handle, ranging from gigabytes to petabytes. When volume is moderate, Amazon RDS or DynamoDB can serve as primary stores. At the petabyte scale, Amazon S3 becomes the foundation of a data lake, with Amazon Athena or Amazon Redshift providing the query layer.

  • Velocity describes how fast data arrives and how quickly it must be available for consumption. Batch workloads that run on hourly or daily schedules align with AWS Glue ETL or direct S3 uploads. Near-real-time requirements call for Amazon Kinesis Data Firehose, which provides managed buffering and automatic Parquet conversion. True real-time, low-latency processing demands Amazon Kinesis Data Streams with custom consumer applications.

  • Variety captures the structural diversity of your data. Structured data fits neatly into relational databases such as Amazon RDS and Redshift. Semi-structured data, such as JSON or nested event payloads, is best stored in Parquet on S3, with AWS Glue Crawlers inferring the schema. Unstructured data, including images, PDFs, and video, lives in S3 object storage with metadata tagging for discoverability.

These three dimensions collectively determine your storage format selection, partitioning strategy, and lineage requirements, all of which are explored in the sections that follow.
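One of those downstream choices, the partitioning strategy, often reduces to laying out S3 object keys so query engines can prune data by date. The helper below is a small sketch of that idea, assuming a Hive-style `year=/month=/day=` layout (which Athena and Glue can partition on); the prefix and file name are hypothetical.

```python
# Sketch: building a Hive-style partitioned S3 object key from an event
# timestamp. The "events" prefix and part file name are made-up examples.
from datetime import datetime, timezone

def partitioned_key(prefix: str, event_time: datetime, filename: str) -> str:
    """Return a year/month/day-partitioned key so query engines can
    skip objects outside a query's date range (partition pruning)."""
    return (
        f"{prefix}/year={event_time.year}"
        f"/month={event_time.month:02d}"
        f"/day={event_time.day:02d}"
        f"/{filename}"
    )

ts = datetime(2024, 3, 7, tzinfo=timezone.utc)
print(partitioned_key("events", ts, "part-0001.parquet"))
# → events/year=2024/month=03/day=07/part-0001.parquet
```

A query filtered to `year = 2024 AND month = 3` would then read only objects under that prefix instead of scanning the whole dataset.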

The following mind map shows how each of the 3 Vs maps to ...