
Fundamentals of Data and Lineage

Understanding the fundamental characteristics of data—volume, velocity, and variety—guides AWS data engineers in making architectural decisions. These dimensions influence service selection, data partitioning, and processing methods. Choosing the appropriate storage format, such as Parquet for analytical workloads, is crucial for optimizing query performance. Data lineage ensures the traceability and trustworthiness of data, while schema evolution mechanisms in AWS facilitate adaptability to changing data structures. Rigorous data validation is essential to maintain data integrity and quality throughout the pipeline, preventing issues that could arise from schema changes.

Many AWS data architecture decisions start with understanding the data as it moves through your pipelines. Before you choose storage, design ETL, or configure streaming ingestion, you need to evaluate the data’s volume, velocity, variety, format, lineage requirements, and schema behavior. The AWS Certified Data Engineer – Associate exam expects you to apply this understanding across multiple question types and map data characteristics to appropriate AWS service choices. This lesson introduces the foundational concepts: the three Vs, storage formats, lineage tracking, and schema evolution that underpin many topics you’ll see on the exam and in production data engineering on AWS.

Understanding data characteristics

The three primary dimensions that shape AWS data architecture are volume, velocity, and variety, collectively referred to as the 3 Vs. These dimensions directly determine which services you provision, how you partition data, and whether your pipeline processes records in batch or real time.

  • Volume refers to the scale of data your pipeline must handle, ranging from gigabytes to petabytes. When volume is moderate, Amazon RDS or DynamoDB can serve as primary stores. At the petabyte scale, Amazon S3 becomes the foundation of a data lake, with Amazon Athena or Amazon Redshift providing the query layer.

  • Velocity describes how fast data arrives and how quickly it must be available for consumption. Batch workloads that run on hourly or daily schedules align with AWS Glue ETL or direct S3 uploads. Near-real-time requirements call for Amazon Kinesis Data Firehose, which provides managed buffering and automatic Parquet conversion. True real-time, low-latency processing demands Amazon Kinesis Data Streams with custom consumer applications.

  • Variety captures the structural diversity of your data. Structured data fits neatly into relational databases such as Amazon RDS and Redshift. Semi-structured data, such as JSON or nested event payloads, is best stored in Parquet on S3, with AWS Glue Crawlers inferring the schema. Unstructured data, including images, PDFs, and video, lives in S3 object storage with metadata tagging for discoverability.

These three dimensions collectively determine your storage format selection, partitioning strategy, and lineage requirements, all of which are explored in the sections that follow.
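One of those downstream choices, the partitioning strategy, often reduces to laying out S3 object keys so query engines can prune data by date. The helper below is a small sketch of that idea, assuming a Hive-style `year=/month=/day=` layout (which Athena and Glue can partition on); the prefix and file name are hypothetical.

```python
# Sketch: building a Hive-style partitioned S3 object key from an event
# timestamp. The "events" prefix and part file name are made-up examples.
from datetime import datetime, timezone

def partitioned_key(prefix: str, event_time: datetime, filename: str) -> str:
    """Return a year/month/day-partitioned key so query engines can
    skip objects outside a query's date range (partition pruning)."""
    return (
        f"{prefix}/year={event_time.year}"
        f"/month={event_time.month:02d}"
        f"/day={event_time.day:02d}"
        f"/{filename}"
    )

ts = datetime(2024, 3, 7, tzinfo=timezone.utc)
print(partitioned_key("events", ts, "part-0001.parquet"))
# → events/year=2024/month=03/day=07/part-0001.parquet
```

A query filtered to `year = 2024 AND month = 3` would then read only objects under that prefix instead of scanning the whole dataset.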

The following mind map shows how each of the 3 Vs maps to ...