ETL Pipeline: Fraud Detection Preprocessing
In this project, we’ll aid the data science department of a retail tech company in developing an effective fraud detection model. The model involves the streamlined processing and transfer of data from the company’s data warehouse to a dedicated data mart.
The company is called “TechSwift Retail Solutions” and it is a retail tech company that sells various products such as phones, tablets, watches, and more. Lately, the company has observed a surge in fraudulent transactions. Subsequent investigations by our data analysts revealed a common trend: a significant portion of these fraudulent activities involved the purchase of tablets. To address this issue, the company has requested the data science team to create a fraud detection machine learning model to alert the company in real-time when a possible fraudulent transaction is being made.
The data science team requires clean and relevant data to test the model. Therefore, the company has requested us to create a daily ETL pipeline designed to efficiently transfer clean and relevant data from the company’s data warehouse to a dedicated data mart, exclusively crafted for the data science team. The data mart will store the most recent 24 hours of tablet related data identified as potential fraud.
The final product of this project will be an ETL pipeline scheduled to run once every 24 hours to fill a data mart with the latest tablet-related transactions. The pipeline will contain the following tasks:
Here’s the sample of the transaction data from the data warehouse:
SaleID,Product,Category,Price,Quantity,Discount,SaleDate,Customer,CustomerAge,CustomerGender,CustomerLocation,IsNewCustomer,ProductPopularity,ProductCategory,ProductReviews,TransactionHour1074,Watch,D,300.0,3,0.0,2023-10-01,Emily Moore,33,Female,Boston,Yes,Medium,Electronics,4.4,121075,Phone,A,200.0,5,0.0,2023-10-02,John Wilson,29,Male,Chicago,Yes,High,Electronics,4.6,131076,Tablet,B,500.0,7,0.05,2023-10-03,Patricia Davis,37,Female,New York,No,High,Electronics,4.8,151077,E-Book,C,100.0,2,0.0,2023-10-04,Robert Hernandez,42,Male,Los Angeles,Yes,Low,Electronics,3.8,111078,Watch,D,300.0,4,0.0,2023-10-05,Linda Smith,31,Female,Houston,Yes,Medium,Electronics,4.2,141079,Phone,A,200.0,8,0.0,2023-10-06,William Johnson,36,Male,San Francisco,Yes,High,Electronics,4.7,171080,Tablet,B,500.0,5,0.05,2023-10-07,Susan Davis,28,Female,Miami,Yes,High,Electronics,4.5,161081,E-Book,C,100.0,3,0.0,2023-10-07,David Wilson,30,Male,Seattle,Yes,Low,Electronics,3.7,121082,Watch,D,300.0,7,0.0,2023-10-08,Michelle Anderson,39,Female,Los Angeles,No,Medium,Electronics,4.3,13
Here are some key insights about the project:
Each row in the data warehouse is a single transaction of an electronic device.
We’ll create various functions in a helper file called helper.py and use these functions to build our ETL pipeline using Apache Airflow in a file called sales_pipeline.py.
Throughout the project, we’ll work only on helper.py.