How does serverless data processing work?
Serverless data processing is a computing model that allows us to execute code and process data without managing the underlying infrastructure. It's typically used for data processing tasks like ETL pipelines, data analytics, or batch processing.
How serverless data processing works
Here are the key steps that describe how serverless data processing works:
- Event triggers: Serverless data processing is event-driven. It responds to events such as file uploads, database changes, or scheduled intervals. These events serve as triggers that initiate the data processing workflow.
- Function execution: A serverless function, that is, a function-as-a-service (FaaS), is invoked when an event is triggered. Functions are short-lived units of code designed to perform a specific task or process a chunk of data, and they aren't bound to any particular programming language.
- Scaling: The serverless platform automatically scales function instances based on the incoming workload. It provisions and assigns resources to handle the processing requirements. If there is a rise in events or data volume, the platform can scale up by creating additional function instances to parallelize the processing and ensure efficient execution.
- Data retrieval: The function retrieves the necessary data from the event source or other storage systems like databases, message queues, or object storage. This data can be in various formats, such as files, streams, or database records.
- Data processing: The function performs the required tasks based on our application logic. This may involve transforming data, aggregating information, filtering records, running calculations, or executing complex algorithms. We can leverage libraries and tools in our chosen programming language to simplify the data processing tasks.
- Output and storage: The function generates the desired output once the data processing is complete. It can store the processed data in a persistent storage system like databases, data lakes, or object storage. It can also trigger downstream actions or notifications like sending results to other systems, invoking APIs, or generating reports/results.
- Billing and resource management: Serverless platforms charge based on the actual usage of resources and the execution time of the functions. We are billed for the number of requests, the duration of each function invocation, and the resources consumed. The platform abstracts away the underlying infrastructure management, allowing us to focus on the code and the data processing logic.
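The steps above can be sketched in a few lines of Python. The ServerlessPlatform class below is purely illustrative (real providers such as AWS Lambda handle registration, scaling, and metering for us); it shows how an event trigger looks up a registered function, invokes it, and records the request count and duration that pay-per-use billing is based on:

```python
import json
import time

class ServerlessPlatform:
    """Hypothetical in-memory stand-in for a serverless platform."""

    def __init__(self):
        self.handlers = {}     # event type -> registered function
        self.invocations = 0   # request count, used for billing
        self.total_ms = 0.0    # accumulated execution time, used for billing

    def register(self, event_type, handler):
        # Deploy a function and bind it to an event type
        self.handlers[event_type] = handler

    def trigger(self, event_type, payload):
        # Event trigger: look up and invoke the registered function
        handler = self.handlers[event_type]
        start = time.perf_counter()
        result = handler(payload)
        elapsed_ms = (time.perf_counter() - start) * 1000
        # Billing: track number of requests and execution duration
        self.invocations += 1
        self.total_ms += elapsed_ms
        return result

def uppercase_etl(payload):
    # Data retrieval + processing: parse JSON and transform the values
    data = json.loads(payload)
    return {key: value.upper() for key, value in data.items()}

platform = ServerlessPlatform()
platform.register('file_uploaded', uppercase_etl)
output = platform.trigger('file_uploaded', '{"name": "educative"}')
print(output)                # {'name': 'EDUCATIVE'}
print(platform.invocations)  # 1
```

The function itself contains only processing logic; the trigger wiring, invocation, and usage metering live in the platform, which is exactly the separation that lets serverless providers bill per request rather than per server.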
Example of processing data
import json

def process_data(event, context):
    # Retrieve data by using events
    data = json.loads(event['body'])

    # Data processing task
    processed_data = process(data)

    # Return response
    response = {
        'statusCode': 200,
        'body': json.dumps(processed_data)
    }
    return response

def process(data):
    # Converting each value to uppercase
    processed_data = {}
    for key, value in data.items():
        processed_data[key] = value.upper()
    return processed_data

def main():
    # Events
    event = {
        'body': '{"1": "abc", "2": "bdc", "3": "xyz", "4": "mno", "5": "educative" }'  # JSON data to process
    }

    # Set context
    context = None

    # Call the function
    result = process_data(event, context)

    # Print result
    print(result['body'])

# Main function
if __name__ == '__main__':
    main()
Explanation
Line 1: We import the json library.
Lines 3-5: The process_data function retrieves the data from the event by using json.loads() to parse the JSON data.
Line 8: We call the process function and pass it the data created from the event above.
Lines 11-15: We generate the response to return.
Lines 17-22: We define the process function, which converts the data values from lowercase to uppercase.
Lines 24-40: We define the main function to run the above code.
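To see the transformation on its own, the snippet below replicates the uppercase step from the example on the same payload; running the full script prints the same JSON body:

```python
import json

# Same payload as the event in the example above
payload = '{"1": "abc", "2": "bdc", "3": "xyz", "4": "mno", "5": "educative" }'

# Replicate the process() logic: uppercase every value
processed = {key: value.upper() for key, value in json.loads(payload).items()}
print(json.dumps(processed))
# {"1": "ABC", "2": "BDC", "3": "XYZ", "4": "MNO", "5": "EDUCATIVE"}
```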