How does serverless data processing work?
Serverless data processing is a computing model that allows us to execute code and process data without managing the underlying infrastructure. It's typically used for data processing tasks like ETL pipelines, data analytics, or batch processing.
How serverless data processing works
Here are the key steps that describe how serverless data processing works:
- Event triggers: Serverless data processing is event-driven. It responds to events such as file uploads, database changes, or scheduled intervals. These events serve as triggers that initiate the data processing workflow.
- Function execution: A serverless function, that is, a function-as-a-service (FaaS), is invoked when an event is triggered. Functions are short-lived units of code designed to perform a specific task or process a chunk of data, and they aren't bound to any particular programming language.
- Scaling: The serverless platform automatically scales function instances based on the incoming workload. It provisions and assigns resources to handle the processing requirements. If there is a rise in events or data volume, the platform can scale up by creating additional function instances to parallelize the processing and ensure efficient execution.
- Data retrieval: The function retrieves the necessary data from the event source or other storage systems like databases, message queues, or object storage. This data can be in various formats, such as files, streams, or database records.
- Data processing: The function performs the required tasks based on our application logic. This may involve transforming data, aggregating information, filtering records, running calculations, or executing complex algorithms. We can leverage libraries and tools in our chosen programming language to simplify the data processing tasks.
- Output and storage: The function generates the desired output once the data processing is complete. It can store the processed data in a persistent storage system like databases, data lakes, or object storage. It can also trigger downstream actions or notifications like sending results to other systems, invoking APIs, or generating reports/results.
- Billing and resource management: Serverless platforms charge based on the actual usage of resources and the execution time of the functions. We are billed for the number of requests, the duration of each function invocation, and the resources consumed. The platform abstracts away the underlying infrastructure management, allowing us to focus on the code and the data processing logic.
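The steps above can be sketched in a few lines of Python. The ServerlessPlatform class below is purely illustrative (real providers such as AWS Lambda handle registration, scaling, and metering for us); it shows how an event trigger looks up a registered function, invokes it, and records the request count and duration that pay-per-use billing is based on:

```python
import json
import time

class ServerlessPlatform:
    """Hypothetical in-memory stand-in for a serverless platform."""

    def __init__(self):
        self.handlers = {}     # event type -> registered function
        self.invocations = 0   # request count, used for billing
        self.total_ms = 0.0    # accumulated execution time, used for billing

    def register(self, event_type, handler):
        # Deploy a function and bind it to an event type
        self.handlers[event_type] = handler

    def trigger(self, event_type, payload):
        # Event trigger: look up and invoke the registered function
        handler = self.handlers[event_type]
        start = time.perf_counter()
        result = handler(payload)
        elapsed_ms = (time.perf_counter() - start) * 1000
        # Billing: track number of requests and execution duration
        self.invocations += 1
        self.total_ms += elapsed_ms
        return result

def uppercase_etl(payload):
    # Data retrieval + processing: parse JSON and transform the values
    data = json.loads(payload)
    return {key: value.upper() for key, value in data.items()}

platform = ServerlessPlatform()
platform.register('file_uploaded', uppercase_etl)
output = platform.trigger('file_uploaded', '{"name": "educative"}')
print(output)                # {'name': 'EDUCATIVE'}
print(platform.invocations)  # 1
```

The function itself contains only processing logic; the trigger wiring, invocation, and usage metering live in the platform, which is exactly the separation that lets serverless providers bill per request rather than per server.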
Example of processing data
import json

def process_data(event, context):
    # Retrieve data by using events
    data = json.loads(event['body'])

    # Data processing task
    processed_data = process(data)

    # Return response
    response = {
        'statusCode': 200,
        'body': json.dumps(processed_data)
    }
    return response

def process(data):
    # Converting each value to uppercase
    processed_data = {}
    for key, value in data.items():
        processed_data[key] = value.upper()
    return processed_data

def main():
    # Events
    event = {
        'body': '{"1": "abc", "2": "bdc", "3": "xyz", "4": "mno", "5": "educative" }'  # JSON data to process
    }

    # Set context
    context = None

    # Call the function
    result = process_data(event, context)

    # Print result
    print(result['body'])

# Main function
if __name__ == '__main__':
    main()
Explanation
Line 1: We import the json library.
Lines 3-5: The process_data function retrieves the data from the event by using json.loads() to parse the JSON data.
Line 8: We call the process function and pass it the data created from the event above.
Lines 11-15: We generate the response to return.
Lines 17-22: We define the process function, which converts the data values from lowercase to uppercase.
Lines 24-40: We define the main function to run the above code.
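To see the transformation on its own, the snippet below replicates the uppercase step from the example on the same payload; running the full script prints the same JSON body:

```python
import json

# Same payload as the event in the example above
payload = '{"1": "abc", "2": "bdc", "3": "xyz", "4": "mno", "5": "educative" }'

# Replicate the process() logic: uppercase every value
processed = {key: value.upper() for key, value in json.loads(payload).items()}
print(json.dumps(processed))
# {"1": "ABC", "2": "BDC", "3": "XYZ", "4": "MNO", "5": "EDUCATIVE"}
```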