Week 3: Time Series Foundations

From Stationarity to ARIMA

Module Assessments: What and Why

A quick orientation so you see how the two courseworks connect to the module.

Assessment	Weight	When	What we’re assessing
CW1	30%	Week 6 (6 March)	Responsible Data Science presentation
CW2	70%	Week 13	Applied Data Science report + scaffold notebook

Brief for CW1 is released this week; use it to choose a FinTech application and plan your 10-minute presentation.

CW1: Responsible Data Science in FinTech (30%)

Why this design: We want you to evaluate data quality and bias in a real FinTech context, not just describe a business model.

Task: Choose a FinTech application; present (10 min) on context, data quality issues (selection bias, validity, reliability), and responsible practice
Alignment: Builds directly on Week 2 (Data, Measurement & Quality)
Brief: Released this week; topic approval by Week 5
Criteria: Content (40%), Communication (30%), Visual aids (20%), Academic rigour (10%)

CW2: Applied Data Science with Critical Reflection (70%)

Why this design: We provide scaffold notebooks so you can focus on critical analysis and reflection rather than coding from scratch.

Task: Complete provided scaffold code (trees, factors, backtesting, sequence learning) and write a 2,500-word reflective report
Focus: Method rationale, data decisions, interpretation, limitations, and appropriate use, not implementation skill
Brief: Released Week 8; scaffolds available then
Due: Week 13

Part I: Why Time Series?

Financial Data Is Temporal

Financial observations arrive ordered in time:

Today’s return may depend on yesterday’s
Volatility clusters: calm follows calm, storms follow storms
Shocks persist but eventually decay
Structural breaks violate assumptions

The challenge: Classical statistics assumes independence. Time series methods handle temporal dependence.

What Can We Actually Predict?

Before learning time series methods, we must ask: where is the signal?

Financial prediction divides into three distinct problems:

Problem	Target	Signal (R²)	Methods
The Mean	Future returns	~1-2%	ARIMA rarely beats naive
The Variance	Volatility	~15-40%	GARCH family
The Cross-Section	Which assets	~5-15%	Factors, ML

Key insight: ARIMA targets the mean. In efficient markets, the mean has almost no signal.

The ARIMA Reality Check

Before fitting any time series model to returns, ask:

Can this model beat the naive forecast?

The naive forecast is simply: \(\hat{r}_{t+1} = 0\) (or the historical mean)

If your ARIMA(1,1,1) achieves R² = 0.5% vs naive’s R² = 0%:

Statistically: ✓ You “won”
Economically: ✗ Useless after transaction costs

Tsay’s insight: “For most asset return series… building a mean equation amounts to removing the sample mean from the data.”

Why Is the Mean Unpredictable?

Near-zero autocorrelation in returns is not a failure of our models; it is a success of markets:

Suppose positive autocorrelation exists (yesterday up → today up)
Traders would buy after up days, pushing prices up immediately
The predictable pattern disappears as it is arbitraged away

This is the efficient market hypothesis at work: competition destroys predictability in the conditional mean.

But competition does not destroy predictability in variance:

Volatility clustering reflects the arrival process of news (fundamentals)
You cannot “arbitrage away” volatility directly, only indirectly through options
Leverage effects and information diffusion are real and persistent

This is why GARCH succeeds where ARIMA fails: it targets a phenomenon (variance) that has genuine, exploitable signal.

Learning Objectives

By the end of this session, you should be able to:

Diagnose whether a time series is stationary
Interpret ACF and PACF plots to identify patterns
Fit AR, MA, and ARIMA models in Python
Validate forecasts using time series cross-validation
Connect classical methods to modern sequence learning

Part II: Stationarity and Why It Matters

What Is Stationarity?

A weakly stationary series has:

Constant mean: \(\mathbb{E}[Y_t] = \mu\) for all \(t\)
Constant variance: \(\text{Var}(Y_t) = \sigma^2\) for all \(t\)
Autocovariance depends only on lag: \(\text{Cov}(Y_t, Y_{t-k}) = \gamma_k\)

Why it matters: If the process is changing over time, historical relationships may not hold going forward.

Example: SPY Prices (Non-Stationary)

Example: SPY Returns (Stationary)

Mean return: 0.000519
Std deviation: 0.0123

Why Returns, Not Prices?

Two fundamental reasons we model returns rather than prices:

Statistical rationale: Returns are (approximately) stationary: constant mean, stable variance. Prices have a unit root, violating assumptions of most statistical methods.

Economic rationale: Returns represent the complete round-trip of an investment. A 5% return on Apple is directly comparable to a 3% return on HSBC, regardless of their price levels.

Type	Formula	Use Case
Simple return	\((P_t - P_{t-1}) / P_{t-1}\)	Single-period performance
Log return	\(\ln(P_t) - \ln(P_{t-1})\)	Multi-period aggregation (log returns sum)
Excess return	\(r_t - r_f\)	Return above the risk-free rate \(r_f\); compensation for bearing risk
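The three return definitions in the table can be checked on a toy example. This is a minimal sketch: the prices and the constant daily risk-free rate are made-up illustrative values, not market data.

```python
import numpy as np

# Hypothetical closing prices for four consecutive days (illustrative only).
prices = np.array([100.0, 102.0, 99.0, 101.0])

# Simple returns: (P_t - P_{t-1}) / P_{t-1}
simple_returns = prices[1:] / prices[:-1] - 1.0

# Log returns: ln(P_t) - ln(P_{t-1})
log_returns = np.diff(np.log(prices))

# Log returns add up across periods; simple returns compound instead.
total_log_return = log_returns.sum()
total_simple_return = np.prod(1.0 + simple_returns) - 1.0

# Excess returns over an assumed constant daily risk-free rate.
risk_free_daily = 0.0001
excess_returns = simple_returns - risk_free_daily

print(f"Sum of log returns:       {total_log_return:.6f}")
print(f"ln(P_end / P_start):      {np.log(prices[-1] / prices[0]):.6f}")
print(f"Compounded simple return: {total_simple_return:.6f}")
```

The first two printed numbers agree, which is the "log returns sum" property used for multi-period aggregation.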

Visual Diagnostics: ACF

The Autocorrelation Function (ACF) measures correlation between \(Y_t\) and \(Y_{t-k}\):

\[\rho_k = \frac{\text{Cov}(Y_t, Y_{t-k})}{\text{Var}(Y_t)} = \frac{\gamma_k}{\gamma_0}\]

Under weak stationarity: \(\rho_0 = 1\), \(\rho_k = \rho_{-k}\), \(-1 \leq \rho_k \leq 1\). A linear time series model can be characterised by its ACF (Tsay 2010 Ch 2).

Pattern identification:

Stationary: ACF drops quickly to zero
Non-stationary: ACF decays slowly
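To see the two ACF patterns side by side, the sketch below computes the sample ACF from the definition \(\rho_k = \gamma_k / \gamma_0\) for simulated white noise and for its cumulative sum (a random walk). The helper `sample_acf` and the simulated series are illustrative assumptions, not the SPY data used in the slides.

```python
import numpy as np

def sample_acf(series, max_lag):
    """Sample autocorrelations rho_0 .. rho_max_lag, using rho_k = gamma_k / gamma_0."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    gamma_0 = np.dot(x, x) / len(x)
    acf = [1.0]
    for k in range(1, max_lag + 1):
        gamma_k = np.dot(x[k:], x[:-k]) / len(x)
        acf.append(gamma_k / gamma_0)
    return np.array(acf)

rng = np.random.default_rng(0)
white_noise = rng.normal(size=2000)    # stationary by construction
random_walk = np.cumsum(white_noise)   # non-stationary (unit root)

acf_noise = sample_acf(white_noise, max_lag=20)
acf_walk = sample_acf(random_walk, max_lag=20)

print("White noise ACF, lags 1-3:", np.round(acf_noise[1:4], 3))
print("Random walk ACF, lags 1-3:", np.round(acf_walk[1:4], 3))
```

The white-noise autocorrelations should sit close to zero from lag 1 onwards, while the random-walk autocorrelations should stay close to one and decay only slowly.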

ACF: SPY Prices vs Returns

Unit Root Testing: The ADF Test

The Augmented Dickey-Fuller (ADF) test evaluates:

\[\Delta Y_t = \alpha + \beta t + \gamma Y_{t-1} + \sum_{i=1}^{p} \delta_i \Delta Y_{t-i} + \varepsilon_t\]

Null hypothesis: \(\gamma = 0\) (unit root, non-stationary) (Tsay 2010 Ch 2)
Alternative: \(\gamma < 0\) (stationary)

If the test statistic is below (more negative than) the critical value → reject the null → treat the series as stationary.

ADF Test: SPY Prices vs Returns

SPY Prices:
  ADF Statistic: 0.0322
  p-value: 0.9611
  Critical value (5%): -2.8632
  Conclusion: Non-stationary

SPY Returns:
  ADF Statistic: -12.8742
  p-value: 0.0000
  Critical value (5%): -2.8632
  Conclusion: Stationary

Example: VIX (Mean-Reverting)

Part III: Autocorrelation and PACF

The Partial Autocorrelation Function

PACF measures correlation at lag \(k\) after controlling for intermediate lags (\(Y_{t-1}, \ldots, Y_{t-k+1}\)) (Brooks 2019 Ch 5). Equivalently, it is the added contribution of \(Y_{t-k}\) over an AR(\(k-1\)) model (Tsay 2010 Ch 2).

ACF: total correlation (includes indirect effects)
PACF: unique contribution of lag \(k\); for AR(\(p\)), sample PACF cuts off at lag \(p\)

Why it matters: Helps identify AR vs MA processes.

ACF vs PACF Interpretation

Process	ACF Pattern	PACF Pattern
AR(p)	Geometrically decaying	Cuts off after lag \(p\)
MA(q)	Cuts off after lag \(q\)	Geometrically decaying
ARMA(p,q)	Geometrically decaying	Geometrically decaying

For AR(\(p\)), PACF has \(p\) non-zero points then cuts off; for MA(\(q\)), ACF has \(q\) non-zero points then cuts off (Brooks 2019 Ch 5).
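The cut-off property can be illustrated by estimating the PACF as the last coefficient of successive AR(\(k\)) regressions, which follows the "added contribution" definition. This is a sketch on a simulated AR(2) with assumed coefficients 0.5 and 0.3; the regression approach is one of several PACF estimators, chosen here because it needs only NumPy.

```python
import numpy as np

def simulate_ar2(phi1, phi2, n, rng, burn_in=200):
    """Simulate a zero-mean AR(2) with standard normal shocks, discarding a burn-in."""
    eps = rng.normal(size=n + burn_in)
    y = np.zeros(n + burn_in)
    for t in range(2, n + burn_in):
        y[t] = phi1 * y[t - 1] + phi2 * y[t - 2] + eps[t]
    return y[burn_in:]

def pacf_by_regression(series, max_lag):
    """PACF at lag k = coefficient on Y_{t-k} in an OLS regression on lags 1..k."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    values = []
    for k in range(1, max_lag + 1):
        # Column j holds lag j of the series, aligned with the target x[k:].
        lagged = np.column_stack([x[k - j: len(x) - j] for j in range(1, k + 1)])
        coeffs, *_ = np.linalg.lstsq(lagged, x[k:], rcond=None)
        values.append(coeffs[-1])
    return np.array(values)

rng = np.random.default_rng(123)
ar2_series = simulate_ar2(phi1=0.5, phi2=0.3, n=5000, rng=rng)
pacf_values = pacf_by_regression(ar2_series, max_lag=5)

for lag, value in enumerate(pacf_values, start=1):
    print(f"PACF lag {lag}: {value:+.3f}")
```

For this AR(2), the first two partial autocorrelations should be clearly non-zero and the remaining ones close to zero.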

SPY Squared Returns: Volatility Clustering

Differencing: Making Series Stationary

Differencing removes trends by working with changes rather than levels:

\[\Delta Y_t = Y_t - Y_{t-1}\]

Series	Differencing	Result
Prices (\(Y_t\))	\(d = 1\)	Returns (approximately stationary)
Trending returns	\(d = 1\)	Changes in returns
Seasonal data	Seasonal differencing	Removes seasonal pattern

The “I” in ARIMA stands for “Integrated”, meaning the original series is the sum (integral) of a stationary process. Differencing reverses this integration.

Practical rule: Most financial prices need \(d = 1\) (first differencing → returns). Rarely is \(d = 2\) needed; if so, question whether the model is appropriate.
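The integrate/difference relationship can be verified directly. In this sketch the stationary changes are simulated and the starting level is a hypothetical value; `np.cumsum` plays the role of integration and `np.diff` the role of differencing.

```python
import numpy as np

rng = np.random.default_rng(1)
stationary_changes = rng.normal(scale=0.01, size=500)  # I(0) building block

# Integrate: the level is the running sum of the changes, giving an I(1) series.
initial_level = 4.6  # hypothetical starting log price
log_prices = initial_level + np.concatenate([[0.0], np.cumsum(stationary_changes)])

# Difference once (d = 1) to recover the stationary changes exactly.
recovered_changes = np.diff(log_prices, n=1)

# Differencing again (d = 2) is unnecessary here and only adds noise.
second_difference = np.diff(log_prices, n=2)

print("d = 1 recovers the original changes:",
      np.allclose(recovered_changes, stationary_changes))
print(f"Std of first difference:  {recovered_changes.std():.4f}")
print(f"Std of second difference: {second_difference.std():.4f}")
```

The second difference has a larger standard deviation than the first, which is one symptom of over-differencing an already stationary series.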

Part IV: AR, MA, and ARIMA Models

Autoregressive (AR) Models

An AR(p) model uses past values to predict current value:

\[Y_t = c + \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \cdots + \phi_p Y_{t-p} + \varepsilon_t\]

Interpretation: For AR(1), ACF satisfies \(\rho_k = \phi_1^k\): exponential decay (alternating in sign for \(\phi_1 < 0\)) (Tsay 2010 Ch 2). Stationarity requires all characteristic roots to be less than 1 in modulus; for AR(1) this is \(|\phi_1| < 1\).

\(\phi_1 = 0\): white noise; \(\phi_1 = 1\): random walk; \(\phi_1 < 0\): oscillating

Simulating an AR(1)
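A minimal simulation sketch, assuming standard normal shocks and \(\phi_1 = 0.8\) (both illustrative choices). It compares the sample ACF with the theoretical value \(\phi_1^k\) and also simulates a negative coefficient to show the alternating pattern.

```python
import numpy as np

def simulate_ar1(phi, n, rng, c=0.0, burn_in=200):
    """Simulate Y_t = c + phi * Y_{t-1} + eps_t and drop the burn-in period."""
    eps = rng.normal(size=n + burn_in)
    y = np.zeros(n + burn_in)
    for t in range(1, n + burn_in):
        y[t] = c + phi * y[t - 1] + eps[t]
    return y[burn_in:]

def lag_autocorrelation(series, k):
    """Sample autocorrelation at lag k."""
    x = series - series.mean()
    return np.dot(x[k:], x[:-k]) / np.dot(x, x)

rng = np.random.default_rng(7)
phi = 0.8
persistent = simulate_ar1(phi, n=5000, rng=rng)
oscillating = simulate_ar1(-0.8, n=5000, rng=rng)

for k in (1, 2, 3):
    print(f"lag {k}: sample ACF = {lag_autocorrelation(persistent, k):.3f}, "
          f"phi^k = {phi ** k:.3f}")
print(f"phi = -0.8, lag 1 sample ACF = {lag_autocorrelation(oscillating, 1):.3f}")
```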

Moving Average (MA) Models

An MA(q) model uses past forecast errors:

\[Y_t = c + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \cdots + \theta_q \varepsilon_{t-q}\]

Key difference from AR:

AR: depends on past values
MA: depends on past errors
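An MA(1) can be simulated directly from the shocks. The sketch below assumes \(\theta_1 = 0.6\) and standard normal errors, and checks that the sample ACF is non-zero at lag 1 and close to zero afterwards. The theoretical lag-1 value \(\theta_1 / (1 + \theta_1^2)\) is the standard MA(1) result.

```python
import numpy as np

def simulate_ma1(theta, n, rng, c=0.0):
    """Simulate Y_t = c + eps_t + theta * eps_{t-1}."""
    eps = rng.normal(size=n + 1)
    return c + eps[1:] + theta * eps[:-1]

def lag_autocorrelation(series, k):
    """Sample autocorrelation at lag k."""
    x = series - series.mean()
    return np.dot(x[k:], x[:-k]) / np.dot(x, x)

rng = np.random.default_rng(99)
theta = 0.6
ma1_series = simulate_ma1(theta, n=5000, rng=rng)

theoretical_rho1 = theta / (1.0 + theta ** 2)
ma1_acf = [lag_autocorrelation(ma1_series, k) for k in (1, 2, 3)]

print(f"Theoretical rho_1: {theoretical_rho1:.3f}")
for k, value in zip((1, 2, 3), ma1_acf):
    print(f"Sample ACF lag {k}: {value:+.3f}")
```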

ARIMA: Putting It All Together

ARIMA(p, d, q) = AutoRegressive Integrated Moving Average

p: AR order (lags of \(Y\))
d: differencing order (to achieve stationarity)
q: MA order (lags of errors)

Examples:

ARIMA(0,1,0) = random walk
ARIMA(1,0,0) = AR(1)
ARIMA(1,1,1) = ARMA(1,1) fitted to the first-differenced series

The Box-Jenkins Methodology

The systematic approach to ARIMA modelling (Brooks 2019 Ch 5):

1. Identify → Examine ACF/PACF to determine plausible orders (\(p\), \(d\), \(q\))

2. Estimate → Fit the model using maximum likelihood

3. Diagnose → Check residuals are white noise (Ljung-Box test, residual ACF)

4. Forecast → Generate predictions with confidence intervals

If diagnosis fails (residuals show patterns), return to Step 1 and try different orders.

Key insight: This is an iterative process, not a one-shot procedure. Model building requires judgment, not just automated selection.

Fitting ARIMA to VIX

==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         19.6689      1.724     11.411      0.000      16.290      23.047
ar.L1          0.9631      0.003    325.088      0.000       0.957       0.969
sigma2         4.2887      0.040    106.399      0.000       4.210       4.368
==============================================================================

Residual Diagnostics: Ljung–Box

After fitting, check that residuals are white noise (no remaining autocorrelation):

Ljung–Box \(Q(m)\): joint test \(H_0\): \(\rho_1 = \cdots = \rho_m = 0\) on residuals
Reject \(H_0\) → residuals still autocorrelated → consider more lags or different model (Tsay 2010 Ch 2)

Lab uses this to validate ARIMA fits.

Model Selection: AIC and BIC

When multiple ARIMA orders seem plausible, information criteria formalise the trade-off between fit and parsimony:

Criterion	Formula	Penalty	Preference
AIC	\(-2\ln(L) + 2k\)	Lighter	Better for prediction
BIC	\(-2\ln(L) + k\ln(n)\)	Heavier	Better for identification

\(L\) = maximised likelihood, \(k\) = number of parameters, \(n\) = sample size
Lower is better for both criteria
BIC penalises complexity more heavily, favouring simpler models

In practice: Use auto_arima (from pmdarima) or compare a grid of orders. But always check residual diagnostics; information criteria alone are not sufficient.
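The grid comparison can be sketched without any modelling library. Note the substitution: instead of full maximum-likelihood ARIMA fits, this example fits pure AR(\(p\)) models by conditional least squares on a common sample and evaluates the Gaussian log-likelihood at the estimates. The simulated AR(2) and the helper name `ar_information_criteria` are illustrative assumptions.

```python
import numpy as np

def ar_information_criteria(x, p, max_p):
    """Fit AR(p) by least squares on a common sample and return (AIC, BIC)."""
    target = x[max_p:]
    columns = [np.ones(len(target))]
    columns += [x[max_p - j: len(x) - j] for j in range(1, p + 1)]
    design = np.column_stack(columns)
    coeffs, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ coeffs
    n = len(target)
    sigma2 = resid @ resid / n
    # Gaussian log-likelihood evaluated at the least-squares estimates.
    log_lik = -0.5 * n * (np.log(2.0 * np.pi * sigma2) + 1.0)
    k = p + 2  # constant, p AR coefficients, innovation variance
    return -2.0 * log_lik + 2.0 * k, -2.0 * log_lik + k * np.log(n)

# Simulated AR(2) with assumed coefficients 0.5 and 0.3.
rng = np.random.default_rng(8)
n_obs, burn_in = 3000, 200
eps = rng.normal(size=n_obs + burn_in)
y = np.zeros(n_obs + burn_in)
for t in range(2, n_obs + burn_in):
    y[t] = 0.5 * y[t - 1] + 0.3 * y[t - 2] + eps[t]
y = y[burn_in:]

max_p = 4
criteria = {p: ar_information_criteria(y, p, max_p) for p in range(max_p + 1)}
best_aic = min(criteria, key=lambda p: criteria[p][0])
best_bic = min(criteria, key=lambda p: criteria[p][1])

for p, (aic, bic) in criteria.items():
    print(f"AR({p}): AIC = {aic:.1f}, BIC = {bic:.1f}")
print(f"Order chosen by AIC: {best_aic}, by BIC: {best_bic}")
```

All candidate orders are fitted on the same observations (the first `max_p` points are dropped for every model) so that the likelihoods are comparable.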

Part V: Forecasting and Validation

Simple Forecasting Methods (Baselines)

Before fitting complex models, establish baselines:

Method	Formula	When It Works
Naive	\(\hat{y}_{T+h} = y_T\)	Random walk, efficient markets
Mean	\(\hat{y}_{T+h} = \bar{y}\)	Mean-reverting, no trend
Drift	\(\hat{y}_{T+h} = y_T + \frac{h}{T-1}(y_T - y_1)\)	Trending series
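The three baselines in the table are one-liners. A minimal sketch on a short made-up history, where the last observation is the forecast origin and `horizon` is the number of steps ahead:

```python
import numpy as np

def naive_forecast(history, horizon):
    """Repeat the last observed value."""
    return np.full(horizon, history[-1])

def mean_forecast(history, horizon):
    """Repeat the historical mean."""
    return np.full(horizon, np.mean(history))

def drift_forecast(history, horizon):
    """Extend the straight line joining the first and last observations."""
    n_history = len(history)
    slope = (history[-1] - history[0]) / (n_history - 1)
    return history[-1] + slope * np.arange(1, horizon + 1)

history = np.array([10.0, 11.0, 13.0, 12.0, 14.0])  # illustrative values

print(naive_forecast(history, 3))  # [14. 14. 14.]
print(mean_forecast(history, 3))   # [12. 12. 12.]
print(drift_forecast(history, 3))  # [15. 16. 17.]
```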

Train/Test Splits for Time Series

Critical rule: Never shuffle time series data.

Time Series Cross-Validation

Walk-forward validation: Train on expanding window, test on next period.
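Walk-forward validation can be sketched as a generator of index pairs in which the training window always ends before the test period begins. The helper name `expanding_window_splits`, the simulated returns, and the window sizes are illustrative assumptions.

```python
import numpy as np

def expanding_window_splits(n_obs, initial_train, step=1):
    """Yield (train_idx, test_idx) pairs; training data always precede the test block."""
    for train_end in range(initial_train, n_obs - step + 1, step):
        yield np.arange(0, train_end), np.arange(train_end, train_end + step)

# Small demonstration of the split structure.
demo_splits = list(expanding_window_splits(10, initial_train=6))
for train_idx, test_idx in demo_splits:
    print(f"train 0..{train_idx[-1]}, test {test_idx.tolist()}")

# Walk-forward comparison of two baselines on simulated (not SPY) returns.
rng = np.random.default_rng(11)
returns = rng.normal(loc=0.0005, scale=0.012, size=750)

naive_errors, mean_errors = [], []
for train_idx, test_idx in expanding_window_splits(len(returns), initial_train=250):
    train, actual = returns[train_idx], returns[test_idx]
    naive_errors.append(np.abs(actual - train[-1]).mean())     # last observed return
    mean_errors.append(np.abs(actual - train.mean()).mean())   # historical mean

naive_mae = float(np.mean(naive_errors))
mean_mae = float(np.mean(mean_errors))
print(f"Naive MAE: {naive_mae:.6f}")
print(f"Mean MAE:  {mean_mae:.6f}")
```

Because each forecast uses only data available at the time, the resulting errors are an honest estimate of out-of-sample performance.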

Accuracy Metrics

Metric	Formula	When to Use
MAE	\(\frac{1}{n}\sum \lvert y_t - \hat{y}_t \rvert\)	Interpretable, robust to outliers
RMSE	\(\sqrt{\frac{1}{n}\sum (y_t - \hat{y}_t)^2}\)	Penalises large errors
MASE	\(\frac{\text{MAE}}{\text{MAE}_{\text{naive}}}\)	Scale-independent, compares to baseline
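The three metrics on a small made-up example. MASE is computed as in the table, dividing the model's MAE by the MAE of a naive forecast; evaluating both on the same test points is an assumption of this sketch (some definitions scale by the in-sample naive MAE instead).

```python
import numpy as np

def mae(actual, predicted):
    """Mean absolute error."""
    return float(np.mean(np.abs(np.asarray(actual) - np.asarray(predicted))))

def rmse(actual, predicted):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((np.asarray(actual) - np.asarray(predicted)) ** 2)))

def mase(actual, predicted, naive_predicted):
    """Model MAE relative to the MAE of a naive forecast on the same points."""
    return mae(actual, predicted) / mae(actual, naive_predicted)

actual = np.array([2.0, 4.0, 3.0, 5.0])
predicted = np.array([2.5, 3.5, 3.0, 6.0])
# Naive forecast = previous observation (the value before the test window is 3.0).
naive_predicted = np.array([3.0, 2.0, 4.0, 3.0])

print(f"MAE:  {mae(actual, predicted):.3f}")                     # 0.500
print(f"RMSE: {rmse(actual, predicted):.3f}")                    # 0.612
print(f"MASE: {mase(actual, predicted, naive_predicted):.3f}")   # 0.333
```

A MASE below 1 means the model beats the naive forecast on these points; above 1 means it does worse.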

Example: Forecasting SPY Returns

Time Series Cross-Validation Results:
Naive MAE: 0.011187
Mean MAE: 0.005827

Conclusion: Mean forecast wins

Signal-to-Noise in Financial Returns

Why is financial prediction so difficult?

Typical monthly market statistics:

Mean return (signal): ~0.8%
Standard deviation (noise): ~4.0%
Signal-to-noise ratio: 0.2 (noise is 5× larger than signal)

Even a perfect model would achieve R² of roughly \(0.2^2 = 4\%\). Published results claiming R² > 10% for return prediction are almost certainly overfit.

Out-of-sample R²:

\[R^2_{OOS} = 1 - \frac{\sum_{t}(y_t - \hat{y}_t)^2}{\sum_{t}(y_t - \bar{y})^2}\]

\(R^2_{OOS} > 0\): model outperforms the historical mean
\(R^2_{OOS} < 0\): model is worse than predicting the mean (overfitting)
Realistic expectation: 1-3% for monthly returns is genuinely meaningful
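A small sketch of the out-of-sample R² calculation. The benchmark \(\bar{y}\) is taken to be the training-sample (historical) mean, which is an assumption consistent with the interpretation above. The numbers are made up to show the three cases: matching the benchmark, doing worse, and a perfect forecast.

```python
import numpy as np

def r2_out_of_sample(actual, predicted, benchmark_mean):
    """1 - SSE(model) / SSE(historical-mean benchmark), computed on test data."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    sse_model = np.sum((actual - predicted) ** 2)
    sse_benchmark = np.sum((actual - benchmark_mean) ** 2)
    return float(1.0 - sse_model / sse_benchmark)

train_returns = np.array([0.01, -0.02, 0.03, 0.00])  # illustrative values
test_returns = np.array([0.02, -0.01, 0.01])
historical_mean = train_returns.mean()

mean_only = np.full_like(test_returns, historical_mean)
overfit_model = np.array([-0.02, 0.03, -0.01])        # confidently wrong

r2_mean = r2_out_of_sample(test_returns, mean_only, historical_mean)
r2_overfit = r2_out_of_sample(test_returns, overfit_model, historical_mean)
r2_perfect = r2_out_of_sample(test_returns, test_returns, historical_mean)

print(f"Predicting the historical mean: {r2_mean:.3f}")
print(f"Overfit model:                  {r2_overfit:.3f}")
print(f"Perfect foresight:              {r2_perfect:.3f}")
```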

Part VI: Prediction Uncertainty

Confidence vs Prediction Intervals

Two types of uncertainty:

Type	What It Captures	Formula
Confidence Interval	Uncertainty about mean	\(\hat{\mu} \pm t_{\alpha/2} \cdot SE(\hat{\mu})\)
Prediction Interval	Uncertainty about next value	\(\hat{y} \pm t_{\alpha/2} \cdot \sqrt{SE^2 + \sigma^2}\)

Prediction intervals are wider because they include residual variance.
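The difference in width is easy to see for the simplest case, forecasting with the sample mean. This sketch simulates ten years of monthly returns with an assumed mean of 0.8% and standard deviation of 4%, and uses the normal critical value 1.96 as an approximation to \(t_{\alpha/2}\).

```python
import numpy as np

rng = np.random.default_rng(5)
sample = rng.normal(loc=0.008, scale=0.04, size=120)  # simulated monthly returns

n = len(sample)
mu_hat = sample.mean()
sigma_hat = sample.std(ddof=1)        # residual standard deviation
se_mean = sigma_hat / np.sqrt(n)      # standard error of the estimated mean
z = 1.96                              # normal approximation to t_{alpha/2}

# Confidence interval: uncertainty about the mean only.
ci_lower, ci_upper = mu_hat - z * se_mean, mu_hat + z * se_mean

# Prediction interval: adds the residual variance of the next observation.
pi_half_width = z * np.sqrt(se_mean ** 2 + sigma_hat ** 2)
pi_lower, pi_upper = mu_hat - pi_half_width, mu_hat + pi_half_width

print(f"95% confidence interval: [{ci_lower:.4f}, {ci_upper:.4f}]")
print(f"95% prediction interval: [{pi_lower:.4f}, {pi_upper:.4f}]")
print(f"Width ratio (PI / CI):   {(pi_upper - pi_lower) / (ci_upper - ci_lower):.2f}")
```

In this setting the ratio of widths equals \(\sqrt{n + 1}\), so collecting more data narrows the confidence interval but barely changes the prediction interval.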

ARIMA Forecast with Intervals

The Practitioner’s Hierarchy

When approaching any financial prediction problem, work through this hierarchy:

Start with the target: What am I predicting: mean, variance, or cross-section?
Assess signal strength: What R² is plausible?
- Mean: ~1-2% | Variance: ~15-40% | Cross-section: ~5-15%
Choose appropriate complexity: Match model to signal. Do not use an LSTM when the naive forecast wins.
Validate honestly: Time-aware CV, compare to naive benchmark, assess economic (not just statistical) significance.

Most “failed” financial models are not bad models; they are good models applied to the wrong problem.

Part VII: Beyond Stationarity

The Limitation of Classical Methods

Classical time series (ARIMA) assumes we can transform data to stationarity.

But what if:

Structural breaks occur (COVID crash, policy changes)?
Regimes shift over time?
Non-linear dependencies exist?

Classical methods struggle when stationarity is fundamentally violated.

Sequence Learning: Beyond Stationarity

Classical Concept	Sequence Learning Extension	Key Advance
AR(p) process	Recurrent Neural Networks (RNN, LSTM, GRU)	Non-linear dependencies
Differencing for stationarity	Direct modelling of non-stationary sequences	No transformation required
ARIMA forecasts	Transformer architectures	Attention over long horizons
Regime switching	Hidden state models	Data-driven regime detection

Why This Matters

Long Short-Term Memory (LSTM) networks and Transformers can:

Learn directly from non-stationary sequences
Capture patterns in how the data evolves
Detect regime changes automatically

The trade-off:

Require substantially more data
Less interpretable (black-box)
Risk of overfitting without proper validation

Cointegration: When Non-Stationary Series Move Together

Two non-stationary series may share a common stochastic trend:

Spot and futures prices wander, but their spread is stationary
Related stock prices (e.g., Shell and BP) drift apart but revert

Formally: If \(Y_t\) and \(X_t\) are both I(1) but \(Y_t - \beta X_t\) is I(0), the series are cointegrated: they share a long-run equilibrium.

Applications in finance:

Pairs trading: Trade the mean-reverting spread between cointegrated assets
Price discovery: Does the futures or spot market lead?
Yield curve: Testing the expectations hypothesis across maturities

Engle-Granger test: Regress \(Y_t\) on \(X_t\), then test whether residuals are stationary using ADF with modified critical values.

The Bridge: Classical to Modern

The course philosophy:

Classical methods (Weeks 3-4): Understand why they work
- Stationarity, parsimony, interpretability
- When assumptions hold, they’re hard to beat
Sequence learning (Week 11): Know when to go beyond
- Complex patterns, sufficient data, non-stationarity
- Modern architectures for modern problems

Preview: We return to sequence learning in Week 11, after building the classical foundations.

Part VIII: Key Takeaways

Summary: Five Core Concepts

Stationarity enables prediction (constant mean, variance, ACF)
ACF/PACF identify AR vs MA patterns
ARIMA combines differencing, AR, and MA
Validation requires time-aware splits (no shuffling)
Sequence learning extends beyond stationarity (preview)

Looking Ahead: Week 4 Volatility

This week we modelled the mean. Next week we model variance:

Returns are unpredictable → ACF near zero
But volatility is predictable → ACF of squared returns shows persistence
ARCH/GARCH models capture volatility clustering
Connection: GARCH is to variance what ARIMA is to the mean

Directed Learning

Core: Read Tsay (2010) Chapter 2 (linear time series) and Brooks (2019) Chapter 5 (univariate time series, Box–Jenkins); complete lab with Bloomberg data; experiment with ARIMA orders
Optional extension: Try Bayesian AR with PyMC; write reflection on classical vs modern methods

Exit Ticket

Identify one Bloomberg asset and predict whether it’s stationary (test your hypothesis with ADF)
Fit an AR(1) to VIX and interpret the coefficient
Note one limitation of ARIMA that sequence learning addresses

References

Brooks, Chris. 2019. Introductory Econometrics for Finance. 4th ed. Cambridge, UK: Cambridge University Press.

Tsay, Ruey S. 2010. Analysis of Financial Time Series. 3rd ed. Hoboken, NJ: Wiley.