From Stationarity to ARIMA
A quick orientation so you see how the two courseworks connect to the module.
| Assessment | Weight | When | What we’re assessing |
|---|---|---|---|
| CW1 | 30% | Week 6 (6 March) | Responsible Data Science presentation |
| CW2 | 70% | Week 13 | Applied Data Science report + scaffold notebook |
Brief for CW1 is released this week; use it to choose a FinTech application and plan your 10-minute presentation.
Why this design: We want you to evaluate data quality and bias in a real FinTech context, not just describe a business model.
Why this design: We provide scaffold notebooks so you can focus on critical analysis and reflection rather than coding from scratch.
Financial observations arrive ordered in time:
The challenge: Classical statistics assumes independence. Time series methods handle temporal dependence.
Before learning time series methods, we must ask: where is the signal?
Financial prediction divides into three distinct problems:
| Problem | Target | Signal (R²) | Methods |
|---|---|---|---|
| The Mean | Future returns | ~1-2% | ARIMA rarely beats naive |
| The Variance | Volatility | ~15-40% | GARCH family |
| The Cross-Section | Which assets | ~5-15% | Factors, ML |
Key insight: ARIMA targets the mean. In efficient markets, the mean has almost no signal.
Before fitting any time series model to returns, ask:
Can this model beat the naive forecast?
The naive forecast is simply: \(\hat{r}_{t+1} = 0\) (or the historical mean)
If your ARIMA(1,1,1) achieves R² = 0.5% vs naive’s R² = 0%:
- Statistically: ✓ You “won”
- Economically: ✗ Useless after transaction costs
Tsay’s insight: “For most asset return series… building a mean equation amounts to removing the sample mean from the data.”
Near-zero autocorrelation in returns is not a failure of our models; it is a success of markets.
This is the efficient market hypothesis at work: competition destroys predictability in the conditional mean.
But competition does not destroy predictability in variance:
This is why GARCH succeeds where ARIMA fails: it targets a phenomenon (variance) that has genuine, exploitable signal.
By the end of this session, you should be able to:
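A weakly stationary series has a constant mean, a constant finite variance, and autocovariances that depend only on the lag \(k\), not on time \(t\).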
Why it matters: If the process is changing over time, historical relationships may not hold going forward.
Mean return: 0.000519
Std deviation: 0.0123
Two fundamental reasons we model returns rather than prices:
Statistical rationale: Returns are (approximately) stationary, with constant mean and stable variance. Prices have a unit root, violating assumptions of most statistical methods.
Economic rationale: Returns represent the complete round-trip of an investment. A 5% return on Apple is directly comparable to a 3% return on HSBC, regardless of their price levels.
| Type | Formula | Use Case |
|---|---|---|
| Simple return | \((P_t - P_{t-1}) / P_{t-1}\) | Single-period performance |
| Log return | \(\ln(P_t) - \ln(P_{t-1})\) | Multi-period aggregation (log returns sum) |
| Excess return | \(r_t - r_f\) | Isolates skill from market exposure |
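The three definitions in the table can be computed in a few lines. This is a minimal sketch on a made-up price path; the prices and the per-period risk-free rate are assumptions chosen only for illustration.

```python
import numpy as np

# Hypothetical closing prices, used only for illustration.
prices = np.array([100.0, 102.0, 99.0, 103.0])

simple_returns = prices[1:] / prices[:-1] - 1.0   # (P_t - P_{t-1}) / P_{t-1}
log_returns = np.diff(np.log(prices))             # ln(P_t) - ln(P_{t-1})

# Log returns add up across periods; simple returns compound.
total_log_return = log_returns.sum()
total_simple_return = np.prod(1.0 + simple_returns) - 1.0

# Excess return over an assumed per-period risk-free rate.
risk_free = 0.0001
excess_returns = simple_returns - risk_free

print(total_log_return, np.log(prices[-1] / prices[0]))
print(total_simple_return, prices[-1] / prices[0] - 1.0)
```

Each printed pair should agree: summing log returns recovers the log of the total price ratio, while simple returns have to be compounded.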
The Autocorrelation Function (ACF) measures correlation between \(Y_t\) and \(Y_{t-k}\):
\[\rho_k = \frac{\text{Cov}(Y_t, Y_{t-k})}{\text{Var}(Y_t)} = \frac{\gamma_k}{\gamma_0}\]
Under weak stationarity: \(\rho_0 = 1\), \(\rho_k = \rho_{-k}\), \(-1 \leq \rho_k \leq 1\). A linear time series model can be characterised by its ACF (Tsay 2010 Ch 2).
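The sample version of \(\rho_k = \gamma_k / \gamma_0\) can be sketched directly with NumPy. The helper name `sample_acf` and the two simulated series (white noise standing in for returns, its running sum standing in for prices) are assumptions for illustration, not the lab's code.

```python
import numpy as np

def sample_acf(y, max_lag):
    """Sample autocorrelations rho_0..rho_max_lag, computed as gamma_k / gamma_0."""
    y = np.asarray(y, dtype=float)
    dev = y - y.mean()
    gamma0 = np.dot(dev, dev) / len(y)
    acf = [1.0]
    for k in range(1, max_lag + 1):
        gamma_k = np.dot(dev[k:], dev[:-k]) / len(y)
        acf.append(gamma_k / gamma0)
    return np.array(acf)

rng = np.random.default_rng(0)
noise = rng.standard_normal(2000)   # stand-in for a return series
walk = np.cumsum(noise)             # stand-in for a price-like series

acf_noise = sample_acf(noise, 5)
acf_walk = sample_acf(walk, 5)
print(acf_noise.round(3))
print(acf_walk.round(3))
```

The noise series should show autocorrelations near zero at every lag, while the random walk's stay close to one, which is the visual signature of non-stationarity.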
Pattern identification:
The Augmented Dickey-Fuller (ADF) test evaluates:
\[\Delta Y_t = \alpha + \beta t + \gamma Y_{t-1} + \sum_{i=1}^{p} \delta_i \Delta Y_{t-i} + \varepsilon_t\]
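The null hypothesis is a unit root (\(\gamma = 0\)); the alternative is stationarity (\(\gamma < 0\)). If the test statistic is below the critical value → reject the null → treat the series as stationary.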
SPY Prices:
ADF Statistic: 0.0322
p-value: 0.9611
Critical value (5%): -2.8632
Conclusion: Non-stationary
SPY Returns:
ADF Statistic: -12.8742
p-value: 0.0000
Critical value (5%): -2.8632
Conclusion: Stationary
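To see the mechanics behind output like this, here is a simplified sketch of the test regression. It is a plain Dickey-Fuller regression with a constant only (no trend term and no lagged differences), so it is not the full ADF procedure; the function name and simulated data are assumptions. The 5% critical value is the one reported in the output above and in general depends on sample size and specification.

```python
import numpy as np

def dickey_fuller_stat(y):
    """t-statistic on gamma in: dY_t = alpha + gamma * Y_{t-1} + e_t."""
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)
    X = np.column_stack([np.ones(len(dy)), y[:-1]])
    coef, *_ = np.linalg.lstsq(X, dy, rcond=None)
    resid = dy - X @ coef
    sigma2 = resid @ resid / (len(dy) - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return coef[1] / np.sqrt(cov[1, 1])

rng = np.random.default_rng(1)
shocks = rng.standard_normal(1000)
stat_walk = dickey_fuller_stat(np.cumsum(shocks))   # unit-root series
stat_noise = dickey_fuller_stat(shocks)             # stationary series

CRITICAL_5PCT = -2.8632  # value reported in the output above
for name, stat in [("random walk", stat_walk), ("white noise", stat_noise)]:
    verdict = "stationary" if stat < CRITICAL_5PCT else "non-stationary"
    print(f"{name}: stat={stat:.3f} -> {verdict}")
```

The statistic must be compared with Dickey-Fuller critical values rather than the usual t-table, because its distribution under the null is non-standard.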
PACF measures correlation at lag \(k\) after controlling for intermediate lags (\(Y_{t-1}, \ldots, Y_{t-k+1}\)) (Brooks 2019 Ch 5). Equivalently, the added contribution of \(Y_{t-k}\) over an AR(\(k-1\)) model (Tsay 2010 Ch 2).
Why it matters: Helps identify AR vs MA processes.
| Process | ACF Pattern | PACF Pattern |
|---|---|---|
| AR(p) | Geometrically decaying | Cuts off after lag \(p\) |
| MA(q) | Cuts off after lag \(q\) | Geometrically decaying |
| ARMA(p,q) | Geometrically decaying | Geometrically decaying |
For AR(\(p\)), PACF has \(p\) non-zero points then cuts off; for MA(\(q\)), ACF has \(q\) non-zero points then cuts off (Brooks 2019 Ch 5).
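The "added contribution of \(Y_{t-k}\) over an AR(\(k-1\)) model" reading suggests one way to compute the PACF: fit AR(\(k\)) by least squares and keep the last coefficient. The sketch below does this for a simulated AR(1); the helper `pacf_ols` and the parameter \(\phi = 0.7\) are illustrative assumptions.

```python
import numpy as np

def pacf_ols(y, max_lag):
    """PACF at lag k = last coefficient of an AR(k) regression fitted by OLS."""
    y = np.asarray(y, dtype=float) - np.mean(y)
    out = []
    for k in range(1, max_lag + 1):
        # Columns are y_{t-1}, ..., y_{t-k}; rows are t = k, ..., n-1.
        X = np.column_stack([y[k - j:len(y) - j] for j in range(1, k + 1)])
        coef, *_ = np.linalg.lstsq(X, y[k:], rcond=None)
        out.append(coef[-1])
    return np.array(out)

rng = np.random.default_rng(2)
phi = 0.7
eps = rng.standard_normal(5000)
ar1 = np.zeros(5000)
for t in range(1, 5000):
    ar1[t] = phi * ar1[t - 1] + eps[t]

pacf = pacf_ols(ar1, 4)
print(pacf.round(3))
```

For an AR(1), only the first value should be clearly non-zero (close to \(\phi\)); the rest should sit near zero, which is the "cuts off after lag \(p\)" pattern in the table.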
Differencing removes trends by working with changes rather than levels:
\[\Delta Y_t = Y_t - Y_{t-1}\]
| Series | Differencing | Result |
|---|---|---|
| Prices (\(Y_t\)) | \(d = 1\) | Price changes; on log prices, log returns (approximately stationary) |
| Trending returns | \(d = 1\) | Changes in returns |
| Seasonal data | Seasonal differencing | Removes seasonal pattern |
The “I” in ARIMA stands for “Integrated”, meaning the original series is the sum (integral) of a stationary process. Differencing reverses this integration.
Practical rule: Most financial prices need \(d = 1\) (first differencing → returns). Rarely is \(d = 2\) needed; if so, question whether the model is appropriate.
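The integration/differencing relationship is easy to check numerically. In this small sketch the simulated series is an assumption; `np.cumsum` plays the role of integration and `np.diff` the role of \(\Delta\).

```python
import numpy as np

rng = np.random.default_rng(3)
stationary = rng.standard_normal(500)   # an I(0) series
integrated = np.cumsum(stationary)      # its running sum is I(1)

first_diff = np.diff(integrated)        # Delta Y_t = Y_t - Y_{t-1}
second_diff = np.diff(integrated, n=2)  # d = 2, rarely needed in practice

# Differencing undoes the summation (the first observation is lost).
print(np.allclose(first_diff, stationary[1:]))
```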
An AR(p) model uses past values to predict current value:
\[Y_t = c + \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \cdots + \phi_p Y_{t-p} + \varepsilon_t\]
Interpretation: For AR(1), the ACF satisfies \(\rho_k = \phi_1^k\), which decays exponentially (with alternating signs when \(\phi_1 < 0\)) (Tsay 2010 Ch 2). Stationarity requires all characteristic roots to be less than 1 in modulus; for AR(1) this means \(|\phi_1| < 1\).
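A short derivation of the AR(1) result, sketched under the assumptions that the process is stationary with mean \(\mu = c / (1 - \phi_1)\) and that \(\varepsilon_t\) is uncorrelated with past values of \(Y\). Write the model in deviations from the mean:

\[Y_t - \mu = \phi_1 (Y_{t-1} - \mu) + \varepsilon_t\]

Multiply both sides by \(Y_{t-k} - \mu\) for \(k \geq 1\) and take expectations; the error term drops out:

\[\gamma_k = \phi_1 \gamma_{k-1}\]

Dividing by \(\gamma_0\) and iterating from \(\rho_0 = 1\) gives:

\[\rho_k = \phi_1 \rho_{k-1} = \phi_1^k\]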
An MA(q) model uses past forecast errors:
\[Y_t = c + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \cdots + \theta_q \varepsilon_{t-q}\]
Key difference from AR:
- AR: depends on past values
- MA: depends on past errors
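A simulated MA(1) makes the short memory of moving-average processes visible. This is a sketch with \(c = 0\) and an assumed \(\theta_1 = 0.6\); the theoretical lag-1 autocorrelation \(\theta_1 / (1 + \theta_1^2)\) is the standard MA(1) result, and the helper `autocorr` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
theta = 0.6
eps = rng.standard_normal(20001)
ma1 = eps[1:] + theta * eps[:-1]   # Y_t = eps_t + theta * eps_{t-1}, with c = 0

def autocorr(y, k):
    """Sample autocorrelation of y at lag k."""
    dev = y - y.mean()
    return np.dot(dev[k:], dev[:-k]) / np.dot(dev, dev)

rho1_theory = theta / (1 + theta**2)
rho1, rho2 = autocorr(ma1, 1), autocorr(ma1, 2)
print(round(rho1_theory, 3), round(rho1, 3), round(rho2, 3))
```

The lag-1 value should be close to its theoretical counterpart and the lag-2 value close to zero: a shock affects an MA(1) for exactly one extra period, whereas in an AR model it decays gradually.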
ARIMA(p, d, q) = AutoRegressive Integrated Moving Average
Examples:
- ARIMA(0,1,0) = random walk
- ARIMA(1,0,0) = AR(1)
- ARIMA(1,1,1) = ARMA(1,1) fitted to the first-differenced series
The systematic approach to ARIMA modelling (Brooks 2019 Ch 5):
1. Identify → Examine ACF/PACF to determine plausible orders (\(p\), \(d\), \(q\))
2. Estimate → Fit the model using maximum likelihood
3. Diagnose → Check residuals are white noise (Ljung-Box test, residual ACF)
4. Forecast → Generate predictions with confidence intervals
If diagnosis fails (residuals show patterns), return to Step 1 and try different orders.
Key insight: This is an iterative process, not a one-shot procedure. Model building requires judgment, not just automated selection.
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const 19.6689 1.724 11.411 0.000 16.290 23.047
ar.L1 0.9631 0.003 325.088 0.000 0.957 0.969
sigma2 4.2887 0.040 106.399 0.000 4.210 4.368
==============================================================================
After fitting, check that residuals are white noise (no remaining autocorrelation):
The lab uses this check to validate ARIMA fits.
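The Ljung-Box statistic behind this check is \(Q = n(n+2)\sum_{k=1}^{m} \hat{\rho}_k^2 / (n-k)\), compared with a chi-square distribution. The sketch below computes \(Q\) by hand for two simulated residual series; the function name and data are assumptions, and 18.307 is the 5% chi-square critical value for 10 degrees of freedom. For residuals from a fitted ARMA(\(p\), \(q\)), the degrees of freedom are usually reduced by \(p + q\).

```python
import numpy as np

def ljung_box_q(resid, max_lag):
    """Ljung-Box Q = n(n+2) * sum_k rho_k^2 / (n - k) for k = 1..max_lag."""
    e = np.asarray(resid, dtype=float)
    e = e - e.mean()
    n = len(e)
    denom = np.dot(e, e)
    q = 0.0
    for k in range(1, max_lag + 1):
        rho_k = np.dot(e[k:], e[:-k]) / denom
        q += rho_k**2 / (n - k)
    return n * (n + 2) * q

rng = np.random.default_rng(5)
white = rng.standard_normal(1000)   # what good residuals look like
leftover = np.zeros(1000)           # residuals with AR(1) structure left in
for t in range(1, 1000):
    leftover[t] = 0.5 * leftover[t - 1] + white[t]

CHI2_95_DF10 = 18.307  # 5% critical value, chi-square with 10 degrees of freedom
q_white = ljung_box_q(white, 10)
q_leftover = ljung_box_q(leftover, 10)
print(q_white, q_leftover)
```

A \(Q\) far above the critical value, as for the second series, signals remaining autocorrelation and sends you back to the identification step.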
When multiple ARIMA orders seem plausible, information criteria formalise the trade-off between fit and parsimony:
| Criterion | Formula | Penalty | Preference |
|---|---|---|---|
| AIC | \(-2\ln(L) + 2k\) | Lighter | Better for prediction |
| BIC | \(-2\ln(L) + k\ln(n)\) | Heavier | Better for identification |
In practice: Use auto_arima (from pmdarima) or compare a grid of orders. But always check residual diagnostics; information criteria alone are not sufficient.
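The two formulas can disagree, which is the point of the "Preference" column. In this sketch the log-likelihoods and sample size are invented purely to illustrate the arithmetic; \(k\) counts every estimated parameter, including the constant and the error variance.

```python
import math

def aic(log_lik, k):
    return -2.0 * log_lik + 2.0 * k

def bic(log_lik, k, n):
    return -2.0 * log_lik + k * math.log(n)

# Hypothetical fits: (label, log-likelihood, number of parameters).
n_obs = 1000
candidates = [("ARIMA(1,0,0)", -1420.0, 3), ("ARIMA(2,0,2)", -1416.5, 6)]
for label, ll, k in candidates:
    print(label, round(aic(ll, k), 1), round(bic(ll, k, n_obs), 1))

best_aic = min(candidates, key=lambda c: aic(c[1], c[2]))[0]
best_bic = min(candidates, key=lambda c: bic(c[1], c[2], n_obs))[0]
print(best_aic, best_bic)
```

With these numbers AIC narrowly prefers the larger model while BIC prefers the smaller one, because the BIC penalty per parameter, \(\ln(n)\), exceeds 2 once \(n\) is larger than about 7.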
Before fitting complex models, establish baselines:
| Method | Formula | When It Works |
|---|---|---|
| Naive | \(\hat{y}_{T+h} = y_T\) | Random walk, efficient markets |
| Mean | \(\hat{y}_{T+h} = \bar{y}\) | Mean-reverting, no trend |
| Drift | \(\hat{y}_{T+h} = y_T + \frac{h}{T-1}(y_T - y_1)\) | Trending series |
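The three baselines are one-liners. In this sketch the function names and the toy history are assumptions; the horizon \(h\) only matters for the drift forecast.

```python
import numpy as np

def naive_forecast(y, h):
    # Last observed value, whatever the horizon.
    return float(y[-1])

def mean_forecast(y, h):
    # Historical average, whatever the horizon.
    return float(np.mean(y))

def drift_forecast(y, h):
    # Extrapolate the average historical change (y_T - y_1) / (T - 1).
    return float(y[-1] + h * (y[-1] - y[0]) / (len(y) - 1))

history = np.array([1.0, 2.0, 4.0, 7.0])   # toy series
print(naive_forecast(history, 2), mean_forecast(history, 2), drift_forecast(history, 2))
# prints: 7.0 3.5 11.0
```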
Critical rule: Never shuffle time series data.
Walk-forward validation: Train on expanding window, test on next period.
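An expanding-window split can be sketched as a small generator. The function name and arguments are hypothetical; scikit-learn's `TimeSeriesSplit` offers a similar ready-made splitter.

```python
def expanding_window_splits(n_obs, initial_train, horizon=1):
    """Yield (train_indices, test_indices) with the training window growing over time."""
    end = initial_train
    while end + horizon <= n_obs:
        yield list(range(0, end)), list(range(end, end + horizon))
        end += horizon

splits = list(expanding_window_splits(n_obs=6, initial_train=3))
for train_idx, test_idx in splits:
    print(train_idx, test_idx)
```

Every test index comes strictly after every training index, so no future information leaks into the fit.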
| Metric | Formula | When to Use |
|---|---|---|
| MAE | \(\frac{1}{n}\sum \lvert y_t - \hat{y}_t \rvert\) | Interpretable, robust to outliers |
| RMSE | \(\sqrt{\frac{1}{n}\sum (y_t - \hat{y}_t)^2}\) | Penalises large errors |
| MASE | \(\frac{\text{MAE}}{\text{MAE}_{\text{naive}}}\) | Scale-independent, compares to baseline |
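The metrics in the table translate directly into NumPy. This is a sketch on toy numbers; MASE follows the table's definition (model MAE divided by the naive forecast's MAE on the same observations), and all names are illustrative.

```python
import numpy as np

def mae(y, yhat):
    return float(np.mean(np.abs(np.asarray(y) - np.asarray(yhat))))

def rmse(y, yhat):
    return float(np.sqrt(np.mean((np.asarray(y) - np.asarray(yhat)) ** 2)))

def mase(y, yhat, yhat_naive):
    # A ratio below 1 means the model beats the naive forecast on MAE.
    return mae(y, yhat) / mae(y, yhat_naive)

actual = np.array([1.0, 2.0, 3.0, 4.0])
model = np.array([1.5, 2.0, 2.5, 4.0])
naive = np.array([0.0, 1.0, 2.0, 3.0])   # previous value used as the forecast
print(mae(actual, model), rmse(actual, model), mase(actual, model, naive))
```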
Time Series Cross-Validation Results:
Naive MAE: 0.011187
Mean MAE: 0.005827
Conclusion: Mean forecast wins
Why is financial prediction so difficult?
Typical monthly market statistics:
Even a perfect model would achieve R² of roughly \(0.2^2 = 4\%\). Published results claiming R² > 10% for return prediction are almost certainly overfit.
Out-of-sample R²:
\[R^2_{OOS} = 1 - \frac{\sum_{t}(y_t - \hat{y}_t)^2}{\sum_{t}(y_t - \bar{y})^2}\]
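A sketch of the out-of-sample R² calculation. One assumption to flag: \(\bar{y}\) is passed in as a `benchmark` argument (for example the training-sample mean) so the benchmark uses no test-period information; the returns below are toy values.

```python
import numpy as np

def r2_oos(y, yhat, benchmark):
    """1 - SSE(model) / SSE(benchmark); benchmark plays the role of y-bar."""
    y = np.asarray(y, dtype=float)
    yhat = np.asarray(yhat, dtype=float)
    sse_model = np.sum((y - yhat) ** 2)
    sse_bench = np.sum((y - benchmark) ** 2)
    return float(1.0 - sse_model / sse_bench)

test_returns = np.array([0.01, -0.02, 0.015, -0.005])   # toy out-of-sample returns
train_mean = 0.0005                                     # assumed historical mean

r2_benchmark = r2_oos(test_returns, np.full(4, train_mean), train_mean)
r2_perfect = r2_oos(test_returns, test_returns, train_mean)
r2_wrong_sign = r2_oos(test_returns, -test_returns, train_mean)
print(r2_benchmark, r2_perfect, r2_wrong_sign)
```

Unlike in-sample R², this measure can be negative: a model that forecasts worse than the benchmark mean is penalised.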
Two types of uncertainty:
| Type | What It Captures | Formula |
|---|---|---|
| Confidence Interval | Uncertainty about mean | \(\hat{\mu} \pm t_{\alpha/2} \cdot SE(\hat{\mu})\) |
| Prediction Interval | Uncertainty about next value | \(\hat{y} \pm t_{\alpha/2} \cdot \sqrt{SE^2 + \sigma^2}\) |
Prediction intervals are wider because they include residual variance.
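The difference in width is easy to quantify when forecasting with the sample mean. This sketch uses simulated daily returns and the normal value 1.96 as an approximation to \(t_{\alpha/2}\); both are assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
returns = rng.normal(0.0005, 0.012, size=250)   # simulated daily returns

n = len(returns)
mu_hat = returns.mean()
sigma_hat = returns.std(ddof=1)
se_mean = sigma_hat / np.sqrt(n)
z = 1.96   # normal approximation to t_{alpha/2} at the 95% level

conf_int = (mu_hat - z * se_mean, mu_hat + z * se_mean)
pred_half = z * np.sqrt(se_mean**2 + sigma_hat**2)
pred_int = (mu_hat - pred_half, mu_hat + pred_half)

ci_width = conf_int[1] - conf_int[0]
pi_width = pred_int[1] - pred_int[0]
print(ci_width, pi_width, pi_width / ci_width)
```

In this setup the prediction interval is wider by a factor of \(\sqrt{n + 1}\): collecting more data shrinks the uncertainty about the mean but not the noise in the next observation.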
When approaching any financial prediction problem, work through this hierarchy:
Start with the target: What am I predicting (mean, variance, or cross-section)?
Assess signal strength: What R² is plausible?
Choose appropriate complexity: Match model to signal. Do not use LSTM when naive wins.
Validate honestly: Time-aware CV, compare to naive benchmark, assess economic (not just statistical) significance.
Most “failed” financial models are not bad models; they are good models applied to the wrong problem.
Classical time series (ARIMA) assumes we can transform data to stationarity.
But what if:
Classical methods struggle when stationarity is fundamentally violated.
| Classical Concept | Sequence Learning Extension | Key Advance |
|---|---|---|
| AR(p) process | Recurrent Neural Networks (RNN, LSTM, GRU) | Non-linear dependencies |
| Differencing for stationarity | Direct modelling of non-stationary sequences | No transformation required |
| ARIMA forecasts | Transformer architectures | Attention over long horizons |
| Regime switching | Hidden state models | Data-driven regime detection |
Long Short-Term Memory (LSTM) networks and Transformers can:
The trade-off:
Two non-stationary series may share a common stochastic trend:
Formally: If \(Y_t\) and \(X_t\) are both I(1) but \(Y_t - \beta X_t\) is I(0), the series are cointegrated, meaning they share a long-run equilibrium.
Applications in finance:
Engle-Granger test: Regress \(Y_t\) on \(X_t\), then test whether residuals are stationary using ADF with modified critical values.
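The two Engle-Granger steps can be sketched on simulated data. Assumptions to note: the pair is constructed to be cointegrated with \(\beta = 2\), the second step is a plain Dickey-Fuller regression without lagged differences, and the 5% critical value of roughly -3.34 (two variables, constant in the first step) is quoted from memory of MacKinnon's tables and should be checked before use.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1000
x = np.cumsum(rng.standard_normal(n))    # I(1) driver
y = 2.0 * x + rng.standard_normal(n)     # shares x's stochastic trend (beta = 2)

# Step 1: cointegrating regression Y_t = a + beta * X_t + u_t.
A = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
beta_hat = coef[1]
resid = y - A @ coef

# Step 2: Dickey-Fuller regression on the residuals (no constant, no lags).
du, u_lag = np.diff(resid), resid[:-1]
gamma = (u_lag @ du) / (u_lag @ u_lag)
err = du - gamma * u_lag
se = np.sqrt((err @ err) / (len(du) - 1) / (u_lag @ u_lag))
eg_stat = gamma / se

EG_CRITICAL_5PCT = -3.34   # approximate; stricter than the ordinary ADF value
print(beta_hat, eg_stat, eg_stat < EG_CRITICAL_5PCT)
```

The critical value is more negative than for a standard ADF test because the residuals come from an estimated regression, which makes them look more stationary than they are.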
This course philosophy:
Preview: We return to sequence learning in Week 11, after building the classical foundations.
This week we modelled the mean. Next week we model variance:
FinTech & Data Science