Data ingestion is the process of migrating data from its various sources to a centralized location or storage area.

Why would we need to ingest or migrate data? A primary reason is to create one place where analytics can be done on the entirety of the available data. Without data ingestion, analytics can still be done from individual data sources. However, this siloed (isolated) approach can get overly complex and doesn’t provide a holistic way to discover insights from relevant data.

AWS services for data ingestion

AWS offers a variety of services for ingesting data into a data lake.

AWS Database Migration Service

  • Allows the migration of data from one database to another.

  • Possible to migrate between the same database engine (e.g., from an Oracle database to another Oracle database) or to migrate between different database engines (e.g., from an Oracle database to a MySQL database).

  • To use AWS Database Migration Service (AWS DMS), at least one of the two databases must be hosted on AWS.
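
To make this concrete, here's a minimal sketch of starting a migration task with boto3 (the AWS SDK for Python). It assumes a source endpoint, a target endpoint, and a replication instance were already created in DMS; all ARNs and the schema name below are placeholders.

```python
import json

import boto3

dms = boto3.client("dms", region_name="us-east-1")

# Select every table in the (hypothetical) "sales" schema for migration.
table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-sales-schema",
            "object-locator": {"schema-name": "sales", "table-name": "%"},
            "rule-action": "include",
        }
    ]
}

# All three ARNs are placeholders for resources created beforehand in DMS.
response = dms.create_replication_task(
    ReplicationTaskIdentifier="oracle-to-mysql-full-load",
    SourceEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:SOURCE",
    TargetEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:TARGET",
    ReplicationInstanceArn="arn:aws:dms:us-east-1:123456789012:rep:INSTANCE",
    MigrationType="full-load",  # one-time copy; "cdc" replicates ongoing changes
    TableMappings=json.dumps(table_mappings),
)
print(response["ReplicationTask"]["Status"])
```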

AWS DataSync

  • Allows migration of data between storage systems such as Network File System (NFS) file servers, Server Message Block (SMB) file servers, Hadoop Distributed File System (HDFS), and object storage systems, as well as AWS services including Amazon S3, Amazon Elastic File System (EFS), and AWS Snowcone devices.

  • Also supports archiving cold data (rarely accessed data) to long-term storage on AWS through services such as S3 Glacier Flexible Retrieval or S3 Glacier Deep Archive.
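
As a rough illustration, the sketch below creates and starts a DataSync task with boto3, assuming an NFS source location and an S3 destination location were already registered with DataSync; both location ARNs are placeholders.

```python
import boto3

datasync = boto3.client("datasync", region_name="us-east-1")

# Both location ARNs are placeholders for an NFS source and an S3
# destination registered with DataSync ahead of time.
task = datasync.create_task(
    SourceLocationArn="arn:aws:datasync:us-east-1:123456789012:location/loc-src",
    DestinationLocationArn="arn:aws:datasync:us-east-1:123456789012:location/loc-dst",
    Name="nfs-to-s3-nightly",
)

# Run the task once; DataSync tasks can also be run on a schedule.
execution = datasync.start_task_execution(TaskArn=task["TaskArn"])
print(execution["TaskExecutionArn"])
```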

Amazon Kinesis

  • Allows the ingestion of real-time streaming data so that processing and analysis can happen as soon as the data arrives.

  • Streaming data (data that's continuously generated, including website clickstreams, video, audio, and application logs) can be captured and processed by Kinesis Data Streams, Kinesis Data Firehose, Kinesis Video Streams, and Amazon Managed Streaming for Apache Kafka.
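
Here's a minimal producer sketch with boto3, assuming a Kinesis data stream named "clickstream" already exists; the stream name and event fields are hypothetical.

```python
import json
import time

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# A hypothetical clickstream event.
event = {"user_id": "u-42", "page": "/pricing", "ts": time.time()}

# Write the event to the (hypothetical) "clickstream" stream. Records
# with the same partition key are routed to the same shard, which
# preserves per-user ordering.
kinesis.put_record(
    StreamName="clickstream",
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["user_id"],
)
```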

Amazon Managed Streaming for Apache Kafka

  • Allows the use of Apache Kafka, an open-source platform for ingesting and processing streaming data.

  • Amazon Managed Streaming for Apache Kafka (Amazon MSK) provides additional functionality for managing and configuring the servers that run Kafka-based applications (see the producer sketch below).

  • Amazon MSK also attempts to detect and automatically recover from common failure scenarios for Kafka clusters so that related applications can continue operating.
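
Because MSK exposes standard Apache Kafka, ordinary Kafka client libraries work unchanged. Below is a minimal producer sketch using the open-source kafka-python library; the bootstrap broker address and topic name are placeholders for values taken from a real MSK cluster.

```python
import json

from kafka import KafkaProducer  # pip install kafka-python

# The bootstrap broker is a placeholder; real values come from the MSK
# cluster's client information (production clusters typically use TLS).
producer = KafkaProducer(
    bootstrap_servers=["b-1.example.kafka.us-east-1.amazonaws.com:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Publish to a hypothetical "orders" topic, then block until delivery.
producer.send("orders", {"order_id": 1001, "total": 49.99})
producer.flush()
```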

AWS IoT Core

  • Allows the connection of Internet of Things (IoT) devices and their messages to AWS services. IoT refers to the inclusion of sensors and other technologies in physical devices that can then transmit data through a communications network (e.g., the public internet). A publishing sketch follows this list.

  • AWS IoT Core is designed to connect billions of IoT devices and route trillions of messages to AWS services for further processing and analysis.
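
For illustration, here's a server-side sketch that publishes a message to an MQTT topic through AWS IoT Core using boto3. The topic and payload are hypothetical; real devices would more commonly publish over MQTT via an AWS IoT device SDK.

```python
import json

import boto3

# The "iot-data" client talks to the AWS IoT Core message broker.
iot = boto3.client("iot-data", region_name="us-east-1")

# "sensors/greenhouse/temperature" is a hypothetical MQTT topic.
iot.publish(
    topic="sensors/greenhouse/temperature",
    qos=1,  # at-least-once delivery
    payload=json.dumps({"device_id": "gh-07", "celsius": 21.4}),
)
```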

Amazon AppFlow

  • Allows the migration of data between SaaS applications and AWS services without writing code.

  • Supported SaaS applications include Salesforce, Marketo, SAP, Zendesk, Slack, and ServiceNow.

  • Supported AWS services include Amazon S3 and Amazon Redshift. Each flow run can transfer up to 100 GB of data, which allows millions of SaaS records to be transferred for further processing and analysis.
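
Although flows themselves are defined without code (e.g., in the AppFlow console), a defined flow can still be triggered programmatically. Here's a minimal sketch with boto3, where the flow name is hypothetical.

```python
import boto3

appflow = boto3.client("appflow", region_name="us-east-1")

# "salesforce-accounts-to-s3" is a hypothetical flow defined earlier
# (without code) in the AppFlow console.
response = appflow.start_flow(flowName="salesforce-accounts-to-s3")
print(response["flowStatus"])
```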

AWS Data Exchange

  • Allows the integration of third-party data from an AWS-hosted data marketplace into AWS services such as Amazon S3.

  • Data providers include Reuters, Foursquare, Change Healthcare, Equifax, and many others. The data products span industries, including healthcare, media and entertainment, financial services, and more.

  • After subscribing to a data product, customers can use the subscribed data from within other AWS services and can also be alerted by an Amazon CloudWatch Events notification when updates to the data become available.
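
As a small illustration, the sketch below uses boto3 to list the data sets an account is entitled to through its AWS Data Exchange subscriptions.

```python
import boto3

dx = boto3.client("dataexchange", region_name="us-east-1")

# List the data sets this account is entitled to via its subscriptions.
paginator = dx.get_paginator("list_data_sets")
for page in paginator.paginate(Origin="ENTITLED"):
    for data_set in page["DataSets"]:
        print(data_set["Id"], data_set["Name"])
```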

Real-time vs. batch ingestion

Some of the available services, such as Amazon Kinesis, are designed to ingest data in real time as soon as it arrives. This approach is effective for use cases where it’s important to process and analyze data as soon as possible (i.e., within seconds)—for example, when decisions must be made based on up-to-the-minute information.

For many other use cases, it's sufficient to migrate data in batches. The migration schedule can be configured at various intervals according to the use case (e.g., every hour or every day). Batch ingestion reduces energy consumption and can be more cost-efficient than real-time ingestion, often with no significant degradation to the user experience.
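
The difference can be sketched in a few lines of Python: a real-time pipeline ships each record as it arrives, while a batch pipeline buffers records and flushes them on a schedule. The class below is a toy illustration of the batch side, not tied to any particular AWS service.

```python
import time


class BatchBuffer:
    """Toy batch ingester: buffer records, flush at most once per interval."""

    def __init__(self, flush_interval_s: float = 3600.0):
        self.records: list[dict] = []
        self.flush_interval_s = flush_interval_s
        self.last_flush = time.monotonic()

    def add(self, record: dict) -> None:
        self.records.append(record)
        if time.monotonic() - self.last_flush >= self.flush_interval_s:
            self.flush()

    def flush(self) -> None:
        # One bulk write per interval (instead of one write per record)
        # is what makes batch ingestion cheaper than real-time ingestion.
        print(f"flushing {len(self.records)} records in one bulk write")
        self.records.clear()
        self.last_flush = time.monotonic()
```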

Ingesting data beyond AWS

While AWS includes many services for data ingestion, there are even more options outside of the AWS ecosystem!

Instead of ingesting data into a data lake, companies can choose to set up a data warehouse and migrate data from various sources directly into the data warehouse.

Just a Few Places Where Ingested Data Can Go

  • Data lakes: Amazon S3, Azure Data Lake, Google Cloud Storage, and Databricks

  • Data warehouses: Snowflake, Google BigQuery, Azure Synapse Analytics, and Amazon Redshift

Databricks, whose product is built on the open-source Apache Spark data-processing platform, even started using the term “data lakehouse” to describe how that product can be a hybrid of both a data lake and a data warehouse.

The term Extract, Load, Transform (ELT) is another way to describe the migration of data from various sources to a centralized location. ELT tools include Fivetran and Airbyte. These tools can migrate (“extract” and “load”) data to data warehouses such as Snowflake, BigQuery, Azure Synapse Analytics, and Amazon Redshift, as well as to data lakes such as Amazon S3.
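
The ELT pattern itself fits in a few lines. The sketch below extracts records from a hypothetical source API and loads them, untransformed, into Amazon S3; the URL, bucket, and key are all placeholders.

```python
import urllib.request

import boto3

# Extract: pull raw records from a hypothetical source API.
with urllib.request.urlopen("https://api.example.com/orders") as resp:
    raw_records = resp.read()

# Load: land the data untransformed in the centralized store (here, S3).
# The bucket and key are placeholders.
boto3.client("s3").put_object(
    Bucket="example-data-lake",
    Key="raw/orders/latest.json",
    Body=raw_records,
)

# Transform: happens later inside the warehouse or lake engine (e.g., in
# SQL), which is what distinguishes ELT from traditional ETL.
```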

To illustrate the wide variety of data that can be ingested, here are just a few of the hundreds of data sources supported by Fivetran.

Just a Few Data Sources Where Fivetran Can Ingest Data From

  • Databases: Oracle, MySQL on AWS RDS, MongoDB, Amazon Aurora PostgreSQL, and Amazon DynamoDB

  • Marketing and Sales: Instagram Business, TikTok Ads, Salesforce, Google Analytics, and Shopify

  • Product and Engineering: GitHub, SurveyMonkey, Google Sheets, FTP and SFTP, and Azure Cloud Functions

  • Support and Operations: Zendesk, Stripe, Dropbox, Greenhouse, and Workday HCM

Fivetran offers data ingestion services at different price points depending on these factors:

  • Users: from a single user, to 10 users, to an unlimited number of users of the tool.

  • Usage level: from up to 500,000 monthly active rows migrated, to an unlimited number of rows.

  • Frequency of ingestion: from synchronizing every hour, to every 15 minutes, to every 5 minutes.

While the terminology around data ingestion might change, the core concepts are still the same: to be able to migrate data from various sources to a centralized location for further processing and analysis (and within the required time frames).