Time Series Data
Practice implementing ARIMA, incorporating seasonality with SARIMA, and handling outliers using statistical techniques.
Time series data plays a vital role in monitoring sensor-driven environments like manufacturing. From forecasting machine behavior to catching unusual patterns, the ability to model trends and clean noisy signals is a core data science skill. Let’s get started.
Implement ARIMA for manufacturing sensor data
In manufacturing environments, sensors continuously monitor equipment conditions such as temperature, vibration, and pressure. To support predictive maintenance and identify anomalies, you’re asked to forecast future sensor readings and detect trends using a classic statistical model: ARIMA.
This is a question frequently asked by industrial analytics companies like GE Digital, Siemens, and manufacturing-focused AI startups.
Can you show how you would implement an ARIMA model for manufacturing sensor data? The implement_arima() function starts at line 8.
Implement an ARIMA model to forecast future values of a manufacturing sensor time series.
Ensure the data is stationary before modeling.
Generate predictions and evaluate performance using MSE.
```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

def implement_arima(sensor_data, order=(1,1,1)):
    """
    Implement ARIMA model for manufacturing sensor data to predict future values

    Parameters:
    sensor_data (pd.Series): Time series of sensor readings
    order (tuple): ARIMA order (p,d,q)

    Returns:
    tuple: (model, predictions, mse)
    """
    #TODO - your implementation here
    # Ensure data is stationary

    # Difference data if not stationary

    # Fit ARIMA model

    # Make predictions

    # Calculate MSE for the overlapping period

    return fitted, predictions, mse

def initialize_arima_data():
    """Initialize sample manufacturing sensor data"""
    np.random.seed(42)
    dates = pd.date_range(start='2024-01-01', periods=1000, freq='H')
    base_signal = np.sin(np.linspace(0, 100, 1000)) * 10
    noise = np.random.normal(0, 1, 1000)
    sensor_data = pd.Series(base_signal + noise, index=dates)
    return sensor_data

# Example usage
sensor_data = initialize_arima_data()
model, predictions, mse = implement_arima(sensor_data)
print(f"ARIMA MSE: {mse:.2f}")
```
Sample answer
Here’s how you might structure your response:
Preprocess the sensor data
Start by verifying if the time series is stationary using the Augmented Dickey-Fuller (ADF) test.
If it’s not stationary, apply differencing iteratively until the test indicates stationarity.
Ensure missing values introduced by differencing are handled, e.g., drop NaNs.
Select ARIMA order parameters
Choose (p, d, q) manually or use domain knowledge/defaults for simplicity.
In interviews, be prepared to explain why d=1 is often a good starting point for non-stationary data.
Fit the model
Fit an ARIMA model to the original (not differenced) series, as the library handles differencing internally.
Catch fitting errors or convergence issues, and mention fallback strategies if the model doesn’t converge, e.g., simplifying the order parameters.
Make and evaluate predictions
Predict a short future window, e.g., the next 10-20 steps.
Compare predictions against known values to compute performance metrics like MSE.
Mention how residuals or prediction intervals could also be used for model evaluation or anomaly detection.
Explainability and trade-offs
In interviews, highlight that ARIMA is interpretable and well-suited for short-term forecasting, but may struggle with long-range trends or complex seasonality.
Here’s the solution code:
```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller

def implement_arima(sensor_data, order=(1,1,1)):
    """
    Implement ARIMA model for manufacturing sensor data to predict future values

    Parameters:
    sensor_data (pd.Series): Time series of sensor readings
    order (tuple): ARIMA order (p,d,q)

    Returns:
    tuple: (model, predictions, mse)
    """
    # Ensure data is stationary
    def check_stationarity(data):
        result = adfuller(data)
        return result[1] < 0.05

    # Difference data if not stationary
    data = sensor_data.copy()
    while not check_stationarity(data):
        data = data.diff().dropna()

    # Fit ARIMA model
    model = ARIMA(sensor_data, order=order)
    fitted = model.fit()

    # Make predictions
    predictions = fitted.predict(start=len(sensor_data)-10, end=len(sensor_data)+10)

    # Calculate MSE for the overlapping period
    mse = np.mean((predictions[:10] - sensor_data[-10:])**2)

    return fitted, predictions, mse

def initialize_arima_data():
    """Initialize sample manufacturing sensor data"""
    np.random.seed(42)
    dates = pd.date_range(start='2024-01-01', periods=1000, freq='H')
    base_signal = np.sin(np.linspace(0, 100, 1000)) * 10
    noise = np.random.normal(0, 1, 1000)
    sensor_data = pd.Series(base_signal + noise, index=dates)
    return sensor_data

# Example usage
sensor_data = initialize_arima_data()
model, predictions, mse = implement_arima(sensor_data)
print(f"ARIMA MSE: {mse:.2f}")
```
In the solution above:
Lines 18–20: Define a helper function check_stationarity using the Augmented Dickey-Fuller (ADF) test. It returns True if the p-value is below 0.05, indicating the time series is stationary.
Lines 23–25: We copy the original sensor data and repeatedly apply differencing (.diff().dropna()) until the series passes the stationarity check. The differenced copy serves only as a diagnostic; it is not passed to the model.
Line 28: Fit the ARIMA model to the original (non-differenced) sensor data using the provided (p,d,q) order. The d term tells the library how much differencing to apply internally.
Line 32: Predict values starting from 10 steps before the end of the data through index len(sensor_data)+10. Because the end index is inclusive, this yields 10 in-sample predictions followed by 11 out-of-sample forecasts.
Line 35: Compute Mean Squared Error (MSE) for the last 10 time steps by comparing predictions to the actual sensor values, providing a quick evaluation of model accuracy on recent data. Since the model was fitted on these points, this is an in-sample check rather than a true holdout backtest.
Augment the model with seasonality
Manufacturing processes often show cyclical patterns due to factors like shift changes, daily temperature variations, or equipment warm-up/cooldown cycles. Can you augment your answer to Question 1 with seasonality? The function implement_sarima() starts at line 6.
```python
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.stattools import adfuller

def implement_sarima(sensor_data, order=(1,1,1), seasonal_order=(1,1,1,24)):
    """
    Implement Seasonal ARIMA model for manufacturing sensor data

    Parameters:
    sensor_data (pd.Series): Time series of sensor readings
    order (tuple): ARIMA order (p,d,q)
    seasonal_order (tuple): Seasonal order (P,D,Q,s)

    Returns:
    tuple: (model, predictions, mse)
    """
    #TODO - your implementation here
    # Fit SARIMA model

    # Make predictions

    # Calculate MSE

    return fitted, predictions, mse

def initialize_seasonal_data():
    """Initialize sample seasonal manufacturing sensor data"""
    np.random.seed(42)
    dates = pd.date_range(start='2024-01-01', periods=1000, freq='H')
    # Base signal with daily seasonality
    base_signal = np.sin(np.linspace(0, 100, 1000)) * 10
    daily_pattern = np.sin(np.linspace(0, 2*np.pi*41.67, 1000)) * 5 # 41.67 cycles for 1000 hours
    noise = np.random.normal(0, 1, 1000)
    sensor_data = pd.Series(base_signal + daily_pattern + noise, index=dates)
    return sensor_data

# Example usage
seasonal_data = initialize_seasonal_data()
seasonal_model, seasonal_predictions, seasonal_mse = implement_sarima(seasonal_data)
print(f"SARIMA MSE: {seasonal_mse:.2f}")
```
Sample answer
Here’s how you may structure your response:
Prepare seasonal time series
Ensure the dataset shows repeating patterns, e.g., daily cycles every 24 hours.
Mention that we can optionally visualize it to confirm seasonality before proceeding.
Define SARIMA parameters
Choose a seasonal order (P, D, Q, s).
P = seasonal AR terms
D = seasonal differencing (often 1)
Q = seasonal MA terms
s = season length (e.g., 24 for hourly data with daily cycles)
Combine this with the standard ARIMA order (p, d, q).
Fit the SARIMA model
Use the seasonal_order argument in your model implementation.
Be aware that fitting SARIMA may take longer; mention this trade-off.
Generate predictions
Forecast a reasonable window, keeping in mind that seasonality may introduce lag or delay in how the model responds.
Evaluate and compare
Use MSE for evaluation, but also highlight whether the model captures seasonal peaks and valleys accurately.
In interviews, explain how SARIMA improves over standard ARIMA for periodic data, and when it’s worth the added complexity.
Here’s the solution code:
```python
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.stattools import adfuller

def implement_sarima(sensor_data, order=(1,1,1), seasonal_order=(1,1,1,24)):
    """
    Implement Seasonal ARIMA model for manufacturing sensor data

    Parameters:
    sensor_data (pd.Series): Time series of sensor readings
    order (tuple): ARIMA order (p,d,q)
    seasonal_order (tuple): Seasonal order (P,D,Q,s)

    Returns:
    tuple: (model, predictions, mse)
    """
    # Fit SARIMA model
    model = SARIMAX(sensor_data, order=order, seasonal_order=seasonal_order, enforce_stationarity=False, enforce_invertibility=False)
    fitted = model.fit(disp=False)

    # Make predictions
    predictions = fitted.predict(start=len(sensor_data)-10, end=len(sensor_data)+10)

    # Calculate MSE
    mse = np.mean((predictions[:10] - sensor_data[-10:])**2)

    return fitted, predictions, mse

def initialize_seasonal_data():
    """Initialize sample seasonal manufacturing sensor data"""
    np.random.seed(42)
    dates = pd.date_range(start='2024-01-01', periods=1000, freq='H')
    # Base signal with daily seasonality
    base_signal = np.sin(np.linspace(0, 100, 1000)) * 10
    daily_pattern = np.sin(np.linspace(0, 2*np.pi*41.67, 1000)) * 5 # 41.67 cycles for 1000 hours
    noise = np.random.normal(0, 1, 1000)
    sensor_data = pd.Series(base_signal + daily_pattern + noise, index=dates)
    return sensor_data

# Example usage
seasonal_data = initialize_seasonal_data()
seasonal_model, seasonal_predictions, seasonal_mse = implement_sarima(seasonal_data)
print(f"SARIMA MSE: {seasonal_mse:.2f}")
```
Key components of this implementation:
Line 19: Create a SARIMA model instance using both the regular order and the seasonal_order parameters. The seasonal_order=(P,D,Q,s) captures repeating seasonal patterns; in this case, hourly data with daily (24-hour) seasonality. We disable the stationarity and invertibility constraints to allow the model more flexibility during fitting.
Line 20: Fit the model to the input sensor data using .fit() and suppress output with disp=False for a cleaner run. This estimates the model parameters internally.
Line 23: Use .predict() to generate values from 10 time steps before the end of the dataset through index len(sensor_data)+10 (the end index is inclusive), giving both a short in-sample check and a short future prediction.
Line 26: Compute the Mean Squared Error (MSE) for the 10 in-sample steps by comparing predictions to actual values from the sensor data.
Line 28: Return the fitted model, full prediction series, and the MSE for evaluation.
Handling outliers with sensor data
Sensor data frequently contains anomalies due to measurement errors, equipment malfunctions, or genuine process deviations. How will you handle outliers in sensor data?
Sample answer
Let’s explore some key components of handling outliers in sensor data. We will need:
Rolling stats: Use a rolling window (e.g., 24 hours) to calculate mean and standard deviation.
Z-score detection: Flag values whose absolute z-score exceeds a threshold (commonly three standard deviations).
Replacement strategy: Replace outliers with the local rolling mean.
Reporting: Return the cleaned data, index of outliers, and replaced values for validation.
Let’s look at a sample snippet that allows us to demonstrate this using numpy and pandas.
```python
import numpy as np
import pandas as pd

def handle_outliers(sensor_data, threshold=3):
    """
    Handle outliers in manufacturing sensor data using robust statistical methods

    Parameters:
    sensor_data (pd.Series): Time series of sensor readings
    threshold (float): Z-score threshold for outlier detection

    Returns:
    tuple: (cleaned_data, outlier_indices, replaced_values)
    """

    # Calculate rolling statistics
    rolling_mean = sensor_data.rolling(window=24, center=True).mean()
    rolling_std = sensor_data.rolling(window=24, center=True).std()

    # Calculate z-scores
    z_scores = np.abs((sensor_data - rolling_mean) / rolling_std)

    # Identify outliers
    outlier_indices = z_scores > threshold

    # Replace outliers with rolling mean
    cleaned_data = sensor_data.copy()
    cleaned_data[outlier_indices] = rolling_mean[outlier_indices]

    # Store replaced values for verification
    replaced_values = sensor_data[outlier_indices]

    return cleaned_data, outlier_indices, replaced_values

def initialize_outlier_data():
    """Initialize sample manufacturing sensor data with outliers"""
    np.random.seed(42)
    dates = pd.date_range(start='2024-01-01', periods=1000, freq='H')
    base_signal = np.sin(np.linspace(0, 100, 1000)) * 10
    noise = np.random.normal(0, 1, 1000)

    # Add artificial outliers
    outlier_indices = np.random.choice(1000, 50, replace=False)
    noise[outlier_indices] *= 10

    sensor_data = pd.Series(base_signal + noise, index=dates)
    return sensor_data

# Example usage for Question 3
outlier_data = initialize_outlier_data()
cleaned_data, outliers, replaced = handle_outliers(outlier_data)
print(f"Number of outliers detected: {outliers.sum()}")
```
Let’s look at it in detail:
handle_outliers(sensor_data, threshold=3): This function detects and corrects outliers in a time series of manufacturing sensor data using rolling statistics and z-score analysis.
Lines 17–18: Compute the rolling mean and standard deviation using a 24-hour window (assuming hourly data) with center=True to align the window around each point. Points near the start and end of the series lack a full window, so their rolling statistics are NaN and they are never flagged.
Line 21: Calculate z-scores for each data point, i.e., how many standard deviations away each value is from the local mean.
Line 24: Mark data points as outliers if their absolute z-score is greater than the specified threshold. The default threshold is three, which aligns with the “3-sigma rule” often used in statistical quality control.
Lines 27–28: Create a copy of the original data and replace outlier values with the corresponding values from the rolling mean.
Line 31: Store the original outlier values that were replaced.
Line 33: Return the cleaned data, a boolean mask of outlier positions, and a Series of the original values that were replaced.
initialize_outlier_data(): This function simulates a noisy sensor dataset with injected outliers for testing purposes.
Lines 37–40: Create a sine-wave-based signal simulating a manufacturing process. Add standard Gaussian noise for realism.
Lines 43–44: Randomly select 50 time steps and amplify their noise by a factor of 10 to create artificial outliers. This step simulates sensor glitches or anomalies.
Line 46: Combine the base signal and noise into a pandas Series with hourly timestamps as the index.
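One limitation worth raising in an interview is that a large outlier inflates the rolling mean and standard deviation of its own window, which can mask it. A common alternative, swapped in here for comparison and not part of the solution above, is a Hampel-style filter based on the rolling median and median absolute deviation (MAD). The function name hampel_filter, the 25-point window, and the injected spikes are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def hampel_filter(sensor_data, window=25, threshold=3.0):
    """Flag points far from the rolling median, scaled by the rolling MAD."""
    rolling = sensor_data.rolling(window=window, center=True, min_periods=1)
    rolling_median = rolling.median()

    # MAD of each window: median of absolute deviations from that window's median
    rolling_mad = rolling.apply(
        lambda w: np.median(np.abs(w - np.median(w))), raw=True
    )

    # 1.4826 scales the MAD to match the standard deviation for Gaussian data
    robust_sigma = 1.4826 * rolling_mad
    outlier_mask = (sensor_data - rolling_median).abs() > threshold * robust_sigma

    # Replace flagged points with the local median
    cleaned = sensor_data.copy()
    cleaned[outlier_mask] = rolling_median[outlier_mask]
    return cleaned, outlier_mask


# Synthetic signal with three obvious spikes injected at known positions (illustrative)
rng = np.random.default_rng(42)
values = 10 * np.sin(np.linspace(0, 20, 500)) + rng.normal(0, 0.5, 500)
spike_positions = [50, 200, 350]
values[spike_positions] += 25
series = pd.Series(values, index=pd.date_range("2024-01-01", periods=500, freq="h"))

cleaned, mask = hampel_filter(series)
print(f"Flagged {int(mask.sum())} points")
print(f"All injected spikes flagged: {bool(mask.iloc[spike_positions].all())}")
```

Because the median and MAD are barely moved by a single extreme value, the threshold stays tied to the typical local spread rather than to the outlier itself.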