Data ingestion is the process of migrating data from its various sources to a centralized location or storage area.

Why would we need to ingest or migrate data? A primary reason is to create one place where analytics can be done on the entirety of the available data. Without data ingestion, analytics can still be done from individual data sources. However, this siloed (isolated) approach can get overly complex and doesn’t provide a holistic way to discover insights from relevant data.

AWS services for data ingestion

AWS offers a variety of services for ingesting data into a data lake.

AWS Database Migration Service

  • Allows the migration of data from one database to another.

  • Possible to migrate between the same database engine (e.g., from an Oracle database to another Oracle database) or to migrate between different database engines (e.g., from an Oracle database to a MySQL database).

  • To use AWS Database Migration Service (AWS DMS), at least one of the two databases must be hosted on AWS.
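
To make this concrete, here's a minimal sketch of starting a migration task with boto3 (the AWS SDK for Python). It assumes a source endpoint, a target endpoint, and a replication instance were already created in DMS; all ARNs and the schema name below are placeholders.

```python
import json

import boto3

dms = boto3.client("dms", region_name="us-east-1")

# Select every table in the (hypothetical) "sales" schema for migration.
table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-sales-schema",
            "object-locator": {"schema-name": "sales", "table-name": "%"},
            "rule-action": "include",
        }
    ]
}

# All three ARNs are placeholders for resources created beforehand in DMS.
response = dms.create_replication_task(
    ReplicationTaskIdentifier="oracle-to-mysql-full-load",
    SourceEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:SOURCE",
    TargetEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:TARGET",
    ReplicationInstanceArn="arn:aws:dms:us-east-1:123456789012:rep:INSTANCE",
    MigrationType="full-load",  # one-time copy; "cdc" replicates ongoing changes
    TableMappings=json.dumps(table_mappings),
)
print(response["ReplicationTask"]["Status"])
```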

AWS DataSync

  • Allows migration of data between storage systems such as Network File System (NFS) file servers, Server Message Block (SMB) file servers, Hadoop Distributed File System (HDFS), and object storage systems, as well as AWS services including Amazon S3, Amazon Elastic File System (EFS), and AWS Snowcone devices.

  • Also supports archiving cold data (rarely accessed data) to long-term storage on AWS through services such as S3 Glacier Flexible Retrieval or S3 Glacier Deep Archive.
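
As a rough illustration, the sketch below creates and starts a DataSync task with boto3, assuming an NFS source location and an S3 destination location were already registered with DataSync; both location ARNs are placeholders.

```python
import boto3

datasync = boto3.client("datasync", region_name="us-east-1")

# Both location ARNs are placeholders for an NFS source and an S3
# destination registered with DataSync ahead of time.
task = datasync.create_task(
    SourceLocationArn="arn:aws:datasync:us-east-1:123456789012:location/loc-src",
    DestinationLocationArn="arn:aws:datasync:us-east-1:123456789012:location/loc-dst",
    Name="nfs-to-s3-nightly",
)

# Run the task once; DataSync tasks can also be run on a schedule.
execution = datasync.start_task_execution(TaskArn=task["TaskArn"])
print(execution["TaskExecutionArn"])
```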

Amazon Kinesis

  • Allows the ingestion of real-time streaming data so that processing and analysis can happen as soon as the data arrives.

  • Streaming data (data that's continuously generated, including website clickstreams, video, audio, and application logs) can be captured and processed by Kinesis Data Streams, Kinesis Data Firehose, Kinesis Video Streams, and Amazon Managed Streaming for Apache Kafka.
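
Here's a minimal producer sketch with boto3, assuming a Kinesis data stream named "clickstream" already exists; the stream name and event fields are hypothetical.

```python
import json
import time

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# A hypothetical clickstream event.
event = {"user_id": "u-42", "page": "/pricing", "ts": time.time()}

# Write the event to the (hypothetical) "clickstream" stream. Records
# with the same partition key are routed to the same shard, which
# preserves per-user ordering.
kinesis.put_record(
    StreamName="clickstream",
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["user_id"],
)
```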

Amazon Managed Streaming for Apache Kafka

  • Allows the use of Apache Kafka, an open-source platform for ingesting and processing streaming data.

  • Amazon Managed Streaming for Apache Kafka (Amazon MSK) provides additional functionality for managing and configuring the servers that run Kafka-based applications (see the producer sketch below).

  • Amazon MSK also attempts to detect and automatically recover from common failure scenarios for Kafka clusters so that related applications can continue operating.
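
Because MSK exposes standard Apache Kafka, ordinary Kafka client libraries work unchanged. Below is a minimal producer sketch using the open-source kafka-python library; the bootstrap broker address and topic name are placeholders for values taken from a real MSK cluster.

```python
import json

from kafka import KafkaProducer  # pip install kafka-python

# The bootstrap broker is a placeholder; real values come from the MSK
# cluster's client information (production clusters typically use TLS).
producer = KafkaProducer(
    bootstrap_servers=["b-1.example.kafka.us-east-1.amazonaws.com:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Publish to a hypothetical "orders" topic, then block until delivery.
producer.send("orders", {"order_id": 1001, "total": 49.99})
producer.flush()
```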

AWS IoT Core

  • Allows the connection of Internet of Things (IoT) devices and their messages to AWS services. IoT refers to the inclusion of sensors and other technologies in physical devices that can then transmit data through a communications network (e.g., the public internet). A publishing sketch follows this list.

  • AWS IoT Core is designed to connect billions of IoT devices and route trillions of messages to AWS services for further processing and analysis.
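
For illustration, here's a server-side sketch that publishes a message to an MQTT topic through AWS IoT Core using boto3. The topic and payload are hypothetical; real devices would more commonly publish over MQTT via an AWS IoT device SDK.

```python
import json

import boto3

# The "iot-data" client talks to the AWS IoT Core message broker.
iot = boto3.client("iot-data", region_name="us-east-1")

# "sensors/greenhouse/temperature" is a hypothetical MQTT topic.
iot.publish(
    topic="sensors/greenhouse/temperature",
    qos=1,  # at-least-once delivery
    payload=json.dumps({"device_id": "gh-07", "celsius": 21.4}),
)
```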

Amazon AppFlow

  • Allows the migration of data between SaaS applications and AWS services without writing code.

  • Supported SaaS applications include Salesforce, Marketo, SAP, Zendesk, Slack, and ServiceNow.

  • Supported AWS services include Amazon S3 and Amazon Redshift. Each flow run can transfer up to 100 GB of data, which allows millions of SaaS records to be transferred for further processing and analysis.
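
Although flows themselves are defined without code (e.g., in the AppFlow console), a defined flow can still be triggered programmatically. Here's a minimal sketch with boto3, where the flow name is hypothetical.

```python
import boto3

appflow = boto3.client("appflow", region_name="us-east-1")

# "salesforce-accounts-to-s3" is a hypothetical flow defined earlier
# (without code) in the AppFlow console.
response = appflow.start_flow(flowName="salesforce-accounts-to-s3")
print(response["flowStatus"])
```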

AWS Data Exchange

  • Allows the integration of third-party data from an AWS-hosted data marketplace into AWS services such as Amazon S3.

  • Data providers include Reuters, Foursquare, Change Healthcare, Equifax, and many others. The data products span industries, including healthcare, media and entertainment, financial services, and more.

  • After subscribing to a data product, customers can use the subscribed data from within other AWS services and can also be alerted by an Amazon CloudWatch Events notification when updates to the data become available.
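
As a small illustration, the sketch below uses boto3 to list the data sets an account is entitled to through its AWS Data Exchange subscriptions.

```python
import boto3

dx = boto3.client("dataexchange", region_name="us-east-1")

# List the data sets this account is entitled to via its subscriptions.
paginator = dx.get_paginator("list_data_sets")
for page in paginator.paginate(Origin="ENTITLED"):
    for data_set in page["DataSets"]:
        print(data_set["Id"], data_set["Name"])
```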

Real-time vs. batch ingestion

Some of the available services, such as Amazon Kinesis, are designed to ingest data in real time as soon as it arrives. This approach is effective for use cases where it’s important to process and analyze data as soon as possible (i.e., within seconds)—for example, when decisions must be made based on up-to-the-minute information.

For many other use cases, it's sufficient to migrate data in batches. The migration schedule can be configured at various intervals according to the use case (e.g., every hour or every day). Batch ingestion reduces energy consumption and can be more cost-efficient than real-time ingestion, often with no significant degradation to the user experience.
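
The difference can be sketched in a few lines of Python: a real-time pipeline ships each record as it arrives, while a batch pipeline buffers records and flushes them on a schedule. The class below is a toy illustration of the batch side, not tied to any particular AWS service.

```python
import time


class BatchBuffer:
    """Toy batch ingester: buffer records, flush at most once per interval."""

    def __init__(self, flush_interval_s: float = 3600.0):
        self.records: list[dict] = []
        self.flush_interval_s = flush_interval_s
        self.last_flush = time.monotonic()

    def add(self, record: dict) -> None:
        self.records.append(record)
        if time.monotonic() - self.last_flush >= self.flush_interval_s:
            self.flush()

    def flush(self) -> None:
        # One bulk write per interval (instead of one write per record)
        # is what makes batch ingestion cheaper than real-time ingestion.
        print(f"flushing {len(self.records)} records in one bulk write")
        self.records.clear()
        self.last_flush = time.monotonic()
```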

Ingesting data beyond AWS

While AWS includes many services for data ingestion, there are even more options outside of the AWS ecosystem!

Instead of ingesting data into a data lake, companies can choose to set up a data warehouse and migrate data from various sources directly into the data warehouse.

Just a Few Places Where Ingested Data Can Go

  • Data lakes: Amazon S3, Azure Data Lake, Google Cloud Storage, and Databricks

  • Data warehouses: Snowflake, Google BigQuery, Azure Synapse Analytics, and Amazon Redshift

Databricks, whose product is built on the open-source Apache Spark data-processing platform, even started using the term “data lakehouse” to describe how that product can be a hybrid of both a data lake and a data warehouse.

The term Extract, Load, Transform (ELT) is another way to describe the migration of data from various sources to a centralized location. ELT tools include Fivetran and Airbyte. These tools can migrate (“extract” and “load”) data to data warehouses such as Snowflake, BigQuery, Azure Synapse Analytics, and Amazon Redshift, as well as to data lakes such as Amazon S3.
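
The ELT pattern itself fits in a few lines. The sketch below extracts records from a hypothetical source API and loads them, untransformed, into Amazon S3; the URL, bucket, and key are all placeholders.

```python
import urllib.request

import boto3

# Extract: pull raw records from a hypothetical source API.
with urllib.request.urlopen("https://api.example.com/orders") as resp:
    raw_records = resp.read()

# Load: land the data untransformed in the centralized store (here, S3).
# The bucket and key are placeholders.
boto3.client("s3").put_object(
    Bucket="example-data-lake",
    Key="raw/orders/latest.json",
    Body=raw_records,
)

# Transform: happens later inside the warehouse or lake engine (e.g., in
# SQL), which is what distinguishes ELT from traditional ETL.
```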

To illustrate the wide variety of data that can be ingested, here are just a few of the hundreds of data sources supported by Fivetran.

Just a Few Data Sources Where Fivetran Can Ingest Data From

  • Databases: Oracle, MySQL on AWS RDS, MongoDB, Amazon Aurora PostgreSQL, and Amazon DynamoDB

  • Marketing and Sales: Instagram Business, TikTok Ads, Salesforce, Google Analytics, and Shopify

  • Product and Engineering: GitHub, SurveyMonkey, Google Sheets, FTP and SFTP, and Azure Cloud Functions

  • Support and Operations: Zendesk, Stripe, Dropbox, Greenhouse, and Workday HCM

Fivetran offers data ingestion services at different price points depending on these factors:

  • Users: from a single user, to 10 users, to an unlimited number of users of the tool.

  • Usage level: from up to 500,000 monthly active rows migrated, to an unlimited number of rows.

  • Frequency of ingestion: from synchronizing every hour, to every 15 minutes, to every 5 minutes.

While the terminology around data ingestion might change, the core concepts are still the same: to be able to migrate data from various sources to a centralized location for further processing and analysis (and within the required time frames).