The Conceptual Bridge
Opening : Where we have been, and where we are going
Part I : Two different problems, prediction versus causal inference
Part II : OLS as a prediction model, and its limits
Part III : The bias-variance tradeoff, the central tension in forecasting
Part IV : Regularisation, shrinking towards better predictions
Part V : The virtue of complexity, when more features help
Part VI : Ensemble methods, trees, forests, and boosting
Part VII : CW2 scaffold preview, what the assessment asks of you
You met this framework in Week 1. Seven weeks later, it should look different.
| Problem | Target variable | Typical R² | Where we covered it |
|---|---|---|---|
| Mean | Future returns | ~1–2% | Week 3: ARIMA rarely beats naive |
| Variance | Future volatility | ~15–40% | Week 4: GARCH succeeds |
| Cross-section | Which assets outperform | ~5–15% | Weeks 8–10: factors and ML |
The third prediction problem is where most of the action in financial machine learning lives.
We have now covered the foundational toolkit. This week is the pivot.
You are here: the conceptual bridge
The conceptual gap between the two blocks is significant. Today we bridge it.
Before choosing a method, ask: what is the question?
Two superficially similar tasks:
Both involve regressing returns on firm characteristics. The methods look similar. But the objectives are fundamentally different.
Mullainathan and Spiess (2017) draw the sharpest version of this distinction.
Causal inference asks: What is the effect of X on Y?
Prediction asks: Given X, what is the best forecast of Y?
This difference runs all the way through to how we evaluate success.
| | Causal inference | Prediction |
|---|---|---|
| Goal | Estimate β̂ accurately | Minimise forecast error |
| Key metric | Standard error of β̂ | Out-of-sample MSPE |
| Omitted variables | Bias, major problem | Less critical |
| Overfitting | Not the primary concern | The central problem |
| More features | Risk of multicollinearity | Can help with regularisation |
| Interpretability | Essential | Optional |
Can you classify each task?
Ordinary Least Squares minimises the in-sample sum of squared residuals:
\[\hat{\beta}_{OLS} = \arg\min_{\beta} \sum_{t=1}^{T} (y_t - \mathbf{x}_t'\beta)^2\]
This produces the Best Linear Unbiased Estimator (BLUE) under the Gauss-Markov assumptions.
But notice: the objective is in-sample. We minimise residuals on the data we already have, not on data we have not yet seen.
R², our usual measure of fit, never decreases as we add predictors, and in practice it almost always rises, even if the new predictors are pure noise.
Illustration with financial data:
| Predictors | Description | In-sample R² |
|---|---|---|
| 1 | Market beta only | 12% |
| 5 | + size, value, momentum, profitability | 28% |
| 20 | + 15 random noise variables | 41% |
| 50 | + 30 more noise variables | 63% |
The last model is almost certainly worse at predicting next month’s returns.
The right criterion for prediction is Mean Squared Prediction Error (MSPE):
\[MSPE = E\left[(y_{T+h} - \hat{y}_{T+h})^2\right]\]
This is the expected squared error on new, unseen data, not data used to estimate the model.
Stock and Watson (2002) (Chapter 14): OLS minimises in-sample squared error, not out-of-sample MSPE. Adding predictors never worsens in-sample fit but often hurts out-of-sample.
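The gap between in-sample fit and out-of-sample error can be seen in a small simulation. This is a minimal sketch on synthetic data, not the financial data behind the table above: the sample sizes, the single genuine predictor with coefficient 0.3, and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, max_noise = 120, 60   # assumed: 120 months, up to 60 noise predictors

def simulate(T):
    signal = rng.normal(size=T)                  # one genuine predictor
    noise = rng.normal(size=(T, max_noise))      # predictors unrelated to y
    y = 0.3 * signal + rng.normal(size=T)        # weak signal, lots of noise
    X = np.column_stack([np.ones(T), signal, noise])
    return X, y

X_train, y_train = simulate(T)
X_test, y_test = simulate(T)

results = []
for n_noise in (0, 5, 20, 60):
    cols = slice(0, 2 + n_noise)                 # intercept + signal + first n_noise noise columns
    beta, *_ = np.linalg.lstsq(X_train[:, cols], y_train, rcond=None)
    resid_in = y_train - X_train[:, cols] @ beta
    r2_in = 1 - resid_in @ resid_in / np.sum((y_train - y_train.mean()) ** 2)
    mspe_out = np.mean((y_test - X_test[:, cols] @ beta) ** 2)
    results.append((n_noise, r2_in, mspe_out))
    print(f"noise predictors={n_noise:2d}  in-sample R2={r2_in:.3f}  out-of-sample MSPE={mspe_out:.3f}")
```

Because the models are nested, in-sample R² cannot fall as columns are added; the out-of-sample MSPE on the fresh sample is free to rise.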
Shrinkage: A deliberately biased estimator can have lower MSPE than OLS if the reduction in variance more than compensates for the increase in bias.
Think of \(\lambda\) as the strength of shrinkage (regularisation).
Figure: total MSPE against shrinkage strength (λ), marking λ = 0 (OLS) and the MSPE-minimising value λ* (best).
Financial prediction has properties that stress OLS particularly hard.
Many predictors, few observations. Let \(P\) be the number of predictors (features) and \(T\) the number of observations (months). In many cross-sectional return prediction settings:
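\[ P \in [50, 200],\qquad T \approx 240,\qquad \frac{P}{T} \approx 0.2 \text{ to } 0.8 \]
This puts \(P/T\) close to 1, and a shorter estimation window (smaller \(T\)) can push it above 1, into the overparameterised regime.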
Highly correlated features. Size, value, and momentum factors are correlated. OLS coefficient estimates become unstable under multicollinearity.
Non-stationarity. Return distributions shift over time. A model estimated in one regime may not generalise to another.
Noise dominates signal. Typical monthly return R² values are 1–5%. Most of the variation in returns is unexplained. OLS will happily fit that noise.
In regression notation, the feature matrix \(X\) has shape:
\[ X \in \mathbb{R}^{T \times P} \]
Rows index the \(T\) observations; columns index the \(P\) predictors.
When \(P > T\), the matrix \(X^\top X\) is not invertible, so the OLS formula breaks:
\[ \hat{\beta}_{OLS} = (X^\top X)^{-1}X^\top y \]
Intuitively, you have more knobs than data points: many different coefficient vectors can fit the training data equally well, which makes estimates unstable and out-of-sample performance fragile.
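The "more knobs than data points" problem can be checked numerically. The sketch below uses simulated data with assumed dimensions (20 observations, 50 predictors) and shows that \(X^\top X\) is rank-deficient and that two different coefficient vectors fit the training data exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
T, P = 20, 50                          # assumed: fewer observations than predictors
X = rng.normal(size=(T, P))
y = rng.normal(size=T)

gram = X.T @ X                         # P x P, but its rank cannot exceed T
rank = np.linalg.matrix_rank(gram)
print(f"X'X is {P}x{P} with rank {rank}, so it has no inverse")

# Minimum-norm least-squares solution (what the pseudo-inverse returns).
beta_min = np.linalg.pinv(X) @ y

# Adding any vector from the null space of X leaves the fitted values unchanged.
_, _, Vt = np.linalg.svd(X)            # the last P - T rows of Vt span the null space of X
beta_alt = beta_min + 5.0 * Vt[-1]

print("both fit exactly:", np.allclose(X @ beta_min, y), np.allclose(X @ beta_alt, y))
print("distance between the two coefficient vectors:", np.linalg.norm(beta_alt - beta_min))
```

The two vectors agree on every training observation yet can disagree on new data, which is why out-of-sample performance is fragile without some rule for choosing among them.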
Any prediction error can be decomposed into three components:
Recall from Week 1: see the Foundations chapter section on the bias-variance tradeoff.
\[E[(y - \hat{f}(x))^2] = \underbrace{\text{Bias}(\hat{f})^2}_{\text{systematic error}} + \underbrace{\text{Var}(\hat{f})}_{\text{estimation noise}} + \underbrace{\sigma^2_\varepsilon}_{\text{irreducible}}\]
We can only control bias and variance, and they move in opposite directions.
As model complexity increases, bias falls but variance rises.
Figure: total error against model complexity (P/T). Error falls as complexity rises, reaches a minimum at the optimal complexity, then climbs again as variance takes over.
The “optimal” point balances these forces. OLS without regularisation will sit too far to the right in high-dimensional settings.
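The decomposition can be estimated by Monte Carlo. The sketch below is a generic illustration rather than a financial model: the "true" function, the noise level, and the use of polynomial degree as the complexity dial are all assumptions. It refits the model on many simulated training samples and measures squared bias and variance of the fitted values.

```python
import numpy as np

rng = np.random.default_rng(2)

def true_f(x):
    return np.sin(np.pi * x)            # assumed "true" relationship

x_train = np.linspace(-1, 1, 30)        # fixed design: only the noise changes between samples
x_grid = np.linspace(-0.9, 0.9, 50)     # points at which fitted models are evaluated
n_sims, sigma = 500, 0.5

decomposition = {}
for degree in (1, 3, 9):
    preds = np.empty((n_sims, x_grid.size))
    for s in range(n_sims):
        y = true_f(x_train) + rng.normal(scale=sigma, size=x_train.size)
        coefs = np.polyfit(x_train, y, degree)        # OLS on polynomial features
        preds[s] = np.polyval(coefs, x_grid)
    bias_sq = np.mean((preds.mean(axis=0) - true_f(x_grid)) ** 2)
    variance = np.mean(preds.var(axis=0))
    decomposition[degree] = (bias_sq, variance)
    print(f"degree={degree}  bias^2={bias_sq:.3f}  variance={variance:.3f}  "
          f"total={bias_sq + variance + sigma ** 2:.3f}")
```

The low-degree fit carries most of its error as bias and the high-degree fit as variance; the irreducible term `sigma ** 2` is the same for every model.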
High-bias model (underfitting): CAPM, a single market beta predictor.
High-variance model (overfitting): OLS with 100 firm characteristics.
The goal: A model that captures the true structure (low bias) without fitting the noise (low variance). Regularisation provides the mechanism.
When the number of predictors P equals the number of observations T, OLS does something catastrophic.
With P = T: OLS fits the training data perfectly (R² = 1 by construction). Every data point is exactly interpolated.
The out-of-sample forecast is essentially pure noise: the model has memorised the training set rather than learning the underlying pattern.
This P = T point is called the interpolation boundary. It is where classical statistical intuition says OLS completely fails.
Regularisation deliberately introduces bias into our coefficient estimates to reduce variance. The result is lower total prediction error.
The mechanism: add a penalty to the OLS objective function that discourages large coefficients.
\[\hat{\beta}_{reg} = \arg\min_{\beta} \left[ \sum_{t=1}^{T} (y_t - \mathbf{x}_t'\beta)^2 + \lambda \cdot \text{penalty}(\beta) \right]\]
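The parameter λ controls the bias-variance tradeoff: λ = 0 recovers OLS (no added bias, highest variance), and increasing λ shrinks the coefficients further, adding bias and reducing variance.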
Ridge penalty: sum of squared coefficients (squared L2 norm)
\[\hat{\beta}_{Ridge} = \arg\min_{\beta} \left[ \sum_{t=1}^{T} (y_t - \mathbf{x}_t'\beta)^2 + \lambda \sum_{j=1}^{P} \beta_j^2 \right]\]
Properties:
Financial interpretation: assume all factors contribute a little, shrink the big effects down.
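The ridge objective above has a closed-form solution, \((X^\top X + \lambda I)^{-1} X^\top y\), which makes the shrinkage easy to see. This is a minimal sketch on simulated data; the dimensions, the pair of nearly collinear predictors, and the helper name `ridge` are assumptions, and in applied work predictors would normally be standardised and the intercept left unpenalised.

```python
import numpy as np

rng = np.random.default_rng(3)
T, P = 60, 40
X = rng.normal(size=(T, P))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=T)   # two nearly collinear predictors
y = X[:, :5] @ np.full(5, 0.2) + rng.normal(size=T)

def ridge(X, y, lam):
    """Closed-form ridge solution (X'X + lam*I)^{-1} X'y; lam = 0 gives OLS."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

norms = {}
for lam in (0.0, 1.0, 10.0, 100.0):
    beta = ridge(X, y, lam)
    norms[lam] = np.linalg.norm(beta)
    print(f"lambda={lam:6.1f}  ||beta||={norms[lam]:.3f}  "
          f"coefficients on the collinear pair: {beta[0]:+.3f}, {beta[1]:+.3f}")
```

As λ grows the coefficient vector shrinks towards zero without any entry becoming exactly zero, and the estimates on the collinear pair become more stable.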
LASSO penalty: sum of absolute coefficients (L1 norm)
\[\hat{\beta}_{LASSO} = \arg\min_{\beta} \left[ \sum_{t=1}^{T} (y_t - \mathbf{x}_t'\beta)^2 + \lambda \sum_{j=1}^{P} |\beta_j| \right]\]
Properties:
Financial interpretation: from 200 factors, select the 10–20 that matter most and set the rest to zero.
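LASSO has no closed form, but coordinate descent with soft-thresholding solves the objective above one coefficient at a time. The sketch below is a hand-rolled illustration under stated assumptions (no intercept, simulated data with three genuine predictors out of thirty, hypothetical helper names), not a production solver.

```python
import numpy as np

def soft_threshold(z, gamma):
    """Shrink z towards zero by gamma, and set it to zero if |z| <= gamma."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for sum (y - X b)^2 + lam * sum |b_j| (no intercept)."""
    P = X.shape[1]
    beta = np.zeros(P)
    col_ss = np.sum(X ** 2, axis=0)
    for _ in range(n_iter):
        for j in range(P):
            partial_resid = y - X @ beta + X[:, j] * beta[j]   # residual ignoring feature j
            rho = X[:, j] @ partial_resid
            beta[j] = soft_threshold(rho, lam / 2) / col_ss[j]
    return beta

rng = np.random.default_rng(4)
T, P = 120, 30
X = rng.normal(size=(T, P))
true_beta = np.zeros(P)
true_beta[:3] = [0.8, -0.6, 0.5]          # assumed: only three predictors matter
y = X @ true_beta + rng.normal(size=T)

nonzero = {}
for lam in (0.0, 20.0, 80.0):
    nonzero[lam] = np.count_nonzero(lasso_cd(X, y, lam))
    print(f"lambda={lam:5.1f}  non-zero coefficients: {nonzero[lam]}")
```

The soft-threshold step is what produces exact zeros: any coefficient whose correlation with the partial residual falls below λ/2 is dropped from the model.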
How do we find the right λ? We hold out some data and evaluate out-of-sample performance.
K-fold cross-validation:
Terminology: a fold is one slice of the training sample. The held-out fold is a temporary validation set used to choose \(\lambda\).
In finance, cross-validation requires care. Standard K-fold randomly shuffles observations. With time series data this induces look-ahead bias: the held-out fold can contain observations before some training observations.
Solution: time-series cross-validation (walk-forward validation). Always use past data to train, future data to evaluate.
Chapter 06 compares two Ridge models on daily Bloomberg data (UKX next-day returns, IVIUK + gilt yield lags): a fixed penalty (\(\lambda = 10\), as a foil) and a nested-CV tuned penalty where an inner TimeSeriesSplit loop selects the best \(\lambda\) from a 50-point log-grid, using only past data.
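The inner tuning loop can be sketched without any library support. The code below is not the Chapter 06 notebook: it uses simulated data, a hand-rolled expanding-window splitter in the spirit of `TimeSeriesSplit`, and an assumed grid range of \(10^{-3}\) to \(10^{3}\). It shows only the selection of \(\lambda\) from past data; the outer evaluation loop of a nested design is omitted.

```python
import numpy as np

def walk_forward_splits(n_obs, n_splits):
    """Yield (train_idx, val_idx) with every training index before every validation index."""
    fold = n_obs // (n_splits + 1)
    for k in range(1, n_splits + 1):
        train_end = k * fold
        yield np.arange(0, train_end), np.arange(train_end, min(train_end + fold, n_obs))

def ridge(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(5)
T, P, n_splits = 600, 10, 5
X = rng.normal(size=(T, P))
y = 0.1 * X[:, 0] + rng.normal(size=T)      # weak signal, as with daily returns

lambda_grid = np.logspace(-3, 3, 50)        # 50-point log-grid; the range is an assumption
cv_mspe = np.zeros(lambda_grid.size)
for train_idx, val_idx in walk_forward_splits(T, n_splits):
    assert train_idx.max() < val_idx.min()  # training data always precedes validation data
    for i, lam in enumerate(lambda_grid):
        beta = ridge(X[train_idx], y[train_idx], lam)
        cv_mspe[i] += np.mean((y[val_idx] - X[val_idx] @ beta) ** 2) / n_splits

best_lambda = lambda_grid[np.argmin(cv_mspe)]
print(f"selected lambda: {best_lambda:.4g}")
```

Each validation block lies strictly after its training block, so the chosen \(\lambda\) never depends on future observations.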
Classical statistical wisdom: parsimonious models generalise better. Keep it simple.
Kelly, Malamud, and Zhou (2024) challenge this directly.
Using Random Matrix Theory, they characterise ridge (and “ridgeless”) return prediction when the number of predictors \(P\) is large relative to the sample size \(T\). Performance deteriorates near the interpolation boundary \(P \approx T\), but can recover for \(P/T > 1\) under shrinkage.
This is called the “double descent” phenomenon.
Figure: out-of-sample R² against P/T (axis from 0.5 to 5.0). R² falls as P/T approaches 1, bottoms out at P = T (danger zone), and recovers for P/T > 1.
The first descent is classical overfitting: as P approaches T, variance explodes.
The subsequent recovery is the new finding: once P >> T with ridge regularisation, the implicit shrinkage of the estimator increases and out-of-sample R² recovers. It can exceed low-complexity benchmarks.
The key insight from Kelly, Malamud, and Zhou (2024): in the overparameterised regime, ridge regularisation acts as implicit shrinkage that becomes stronger as P/T increases.
As you add more predictors:
The crucial condition: regularisation must be present. Without ridge, OLS in the overparameterised regime is useless. With ridge, high complexity models can outperform.
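The shape of this curve can be explored in a toy simulation. The sketch below is not a replication of Kelly, Malamud, and Zhou (2024): the data-generating process (500 weak signals), the fixed penalty of 50, and the choice to benchmark R² against a forecast of zero are all assumptions. It compares a "ridgeless" minimum-norm fit with a ridge fit as \(P/T\) passes through 1.

```python
import numpy as np

rng = np.random.default_rng(6)
T_train, T_test, P_max = 100, 1000, 500
theta = rng.normal(size=P_max) / np.sqrt(P_max)      # assumed: many weak signals

def simulate(T):
    X = rng.normal(size=(T, P_max))
    return X, X @ theta + rng.normal(size=T)

X_tr, y_tr = simulate(T_train)
X_te, y_te = simulate(T_test)

def oos_r2(beta, cols):
    err = y_te - X_te[:, cols] @ beta
    return 1 - np.mean(err ** 2) / np.mean(y_te ** 2)   # benchmark: forecast of zero

lam = 50.0                                            # fixed ridge penalty, chosen arbitrarily
curve = []
for P in (10, 50, 90, 100, 110, 200, 500):
    cols = slice(0, P)
    Xp = X_tr[:, cols]
    ridgeless = np.linalg.pinv(Xp) @ y_tr             # OLS if P < T, min-norm interpolator if P >= T
    ridge = np.linalg.solve(Xp.T @ Xp + lam * np.eye(P), Xp.T @ y_tr)
    curve.append((P / T_train, oos_r2(ridgeless, cols), oos_r2(ridge, cols)))
    print(f"P/T={P / T_train:4.1f}  ridgeless R2={curve[-1][1]:10.3f}  ridge R2={curve[-1][2]:7.3f}")
```

The row at \(P/T = 1\) is the one to watch: without a penalty the fit interpolates the training data and the out-of-sample R² collapses, while the penalised fit stays well behaved.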
The Kelly et al. result motivates the approach taken in CW2 Scaffold B: can a high-dimensional, non-linear model beat a parsimonious linear benchmark?
Linear approach: Select 3–5 well-known factors (Fama-French), run OLS with HAC standard errors. Sparse, interpretable, disciplined.
Machine learning approach (Scaffold B): Use 20–100 firm characteristics simultaneously, apply tree-based methods (which implicitly regularise), evaluate out-of-sample. Potentially better predictions, but harder to interpret.
Neither is universally right. The question is empirical: which approach performs better for UK equities in the JKP dataset?
Ridge and LASSO are still linear: the prediction is a weighted sum of features. Sometimes the true relationship is non-linear.
Decision trees provide a flexible non-linear alternative. From a statistical viewpoint, they are trying to reduce the same objective as before: out-of-sample prediction error (MSPE).
A tree builds a piecewise-constant approximation to \(E[r_{t+1} \mid X_t]\) by repeatedly splitting the data to reduce squared error.
How to read this tree:
Example decision rules:
Is momentum > 0.05?
├── Yes: Is P/B < 1.2?
│   ├── Yes: Predict +2.1%
│   └── No: Predict +0.8%
└── No: Is size > median?
    ├── Yes: Predict -0.3%
    └── No: Predict -1.2%
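Two concrete paths: a stock with momentum > 0.05 and P/B < 1.2 is assigned a predicted return of +2.1%; a stock with momentum ≤ 0.05 and size at or below the median is assigned -1.2%.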
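Written as code, the example tree is nothing more than nested if/else rules that return a constant in each leaf. The function and argument names below are hypothetical; the thresholds and leaf values are the ones in the example.

```python
def predict_return(momentum, price_to_book, size, median_size):
    """Piecewise-constant forecast (in %) from the example decision rules."""
    if momentum > 0.05:
        return 2.1 if price_to_book < 1.2 else 0.8
    return -0.3 if size > median_size else -1.2

# Two illustrative stocks (the input values are made up).
print(predict_return(momentum=0.08, price_to_book=1.0, size=10.0, median_size=5.0))
print(predict_return(momentum=0.01, price_to_book=3.0, size=2.0, median_size=5.0))
```

A fitted tree differs only in that the split variables, thresholds, and leaf values are chosen by the algorithm to reduce squared error.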
A random forest addresses the high variance of single trees through two innovations:
Bootstrap aggregation (bagging):
Random feature subsets:
Key property: the averaged prediction has lower variance than any single tree, while preserving the non-linear flexibility.
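Why both ingredients matter can be seen from a standard result: if \(B\) tree predictions each have variance \(\sigma^2\) and pairwise correlation \(\rho\), their average has variance \(\rho\sigma^2 + (1-\rho)\sigma^2/B\). Averaging removes the second term; decorrelating the trees (random feature subsets) attacks the first. The sketch below checks the formula by simulation, modelling each "tree" as a shared component plus an idiosyncratic one, which is an assumption made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
sigma2, n_trees, n_draws = 1.0, 100, 20000

def variance_of_average(rho):
    """Simulated variance of the mean of n_trees predictions with pairwise correlation rho."""
    common = rng.normal(size=(n_draws, 1)) * np.sqrt(rho * sigma2)           # shared across trees
    own = rng.normal(size=(n_draws, n_trees)) * np.sqrt((1 - rho) * sigma2)  # tree-specific
    return (common + own).mean(axis=1).var()

for rho in (0.9, 0.5, 0.1):
    theory = rho * sigma2 + (1 - rho) * sigma2 / n_trees
    print(f"rho={rho:.1f}  simulated={variance_of_average(rho):.3f}  theory={theory:.3f}")
```

With 100 trees the averaged variance is close to \(\rho\sigma^2\), so lowering the correlation between trees is what delivers most of the remaining gain.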
Where random forests reduce variance, gradient boosting reduces bias.
The algorithm:
Each tree corrects the systematic mistakes of the previous ensemble. The model progressively reduces its bias.
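The sequential correction can be sketched with the simplest possible tree, a one-split "stump", fitted repeatedly to the current residuals. This is a minimal illustration for squared-error loss on simulated one-dimensional data; the helper names, learning rate, and number of rounds are assumptions, and libraries such as XGBoost add many refinements.

```python
import numpy as np

def fit_stump(x, r):
    """Best single split on x for squared error; returns (threshold, left_mean, right_mean)."""
    best_sse, best = np.inf, None
    for thr in np.unique(x)[:-1]:                 # assumes x has at least two distinct values
        left, right = r[x <= thr], r[x > thr]
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if sse < best_sse:
            best_sse, best = sse, (thr, left.mean(), right.mean())
    return best

def stump_predict(stump, x):
    thr, left_mean, right_mean = stump
    return np.where(x <= thr, left_mean, right_mean)

rng = np.random.default_rng(8)
x = rng.uniform(-2, 2, 200)
y = np.sin(2 * x) + 0.3 * rng.normal(size=200)

learning_rate, n_rounds = 0.1, 100
prediction = np.full_like(y, y.mean())            # round 0: constant forecast
train_mse = [np.mean((y - prediction) ** 2)]
for _ in range(n_rounds):
    residual = y - prediction                     # what the current ensemble still gets wrong
    stump = fit_stump(x, residual)
    prediction += learning_rate * stump_predict(stump, x)
    train_mse.append(np.mean((y - prediction) ** 2))

print(f"training MSE: start={train_mse[0]:.3f}, after {n_rounds} rounds={train_mse[-1]:.3f}")
```

Training error can only fall from round to round, which is exactly why boosting needs out-of-sample validation to decide when to stop.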
| | Random Forest | Gradient Boosting |
|---|---|---|
| Core idea | Average many trees | Correct residuals sequentially |
| Bias | Similar to a single deep tree (averaging does not reduce it) | Lower (explicitly corrects) |
| Variance | Lower (averaging) | Higher (can overfit) |
| Training | Parallelisable | Sequential |
| Tuning | Fewer hyperparameters | More careful tuning required |
| Speed | Faster | Slower (but XGBoost is efficient) |
| Finance use | Factor importance, robust baseline | Higher performance ceilings |
Both methods appear in the recent financial ML literature (Gu, Kelly, and Xiu 2020).
Tree-based models are non-linear and cannot be interpreted via coefficients alone. SHAP values (Shapley Additive Explanations) provide a principled solution.
For each prediction, SHAP decomposes the prediction into contributions from each feature:
\[\hat{y}_i = \underbrace{\phi_0}_{\text{baseline}} + \underbrace{\phi_1^{(i)}}_{\text{momentum}} + \underbrace{\phi_2^{(i)}}_{\text{value}} + \underbrace{\phi_3^{(i)}}_{\text{size}} + \ldots\]
Properties:
SHAP helps you describe how a fitted model behaves. It does not identify what truly drives returns.
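The additive decomposition can be verified by computing exact Shapley values for a tiny model. The sketch below does not use the `shap` library: it enumerates every coalition of three features, replaces "absent" features with rows from a background sample, and averages the model output. The toy model, the feature ordering (momentum, value, size), and the function names are assumptions.

```python
import numpy as np
from itertools import combinations
from math import factorial

def shapley_values(f, x, background):
    """Exact Shapley values for one instance x. Features outside a coalition are
    taken from the background rows and the model output is averaged over them."""
    n = x.size

    def value(coalition):
        data = background.copy()
        idx = list(coalition)
        if idx:
            data[:, idx] = x[idx]
        return f(data).mean()

    phi = np.zeros(n)
    for j in range(n):
        others = [k for k in range(n) if k != j]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for S in combinations(others, size):
                phi[j] += weight * (value(S + (j,)) - value(S))
    return value(()), phi

def model(data):
    """Toy non-linear model with a momentum x size interaction."""
    momentum, value_, size = data[:, 0], data[:, 1], data[:, 2]
    return 0.5 * momentum - 0.3 * value_ + 0.4 * momentum * size

rng = np.random.default_rng(9)
background = rng.normal(size=(200, 3))
x = np.array([1.0, -0.5, 2.0])

phi0, phi = shapley_values(model, x, background)
print("baseline:", round(phi0, 4), " contributions:", np.round(phi, 4))
print("baseline + contributions:", round(phi0 + phi.sum(), 4),
      " model prediction:", round(model(x[None, :])[0], 4))
```

The contributions always sum to the prediction minus the baseline, but they describe this fitted model only; exact enumeration is feasible here because there are just three features.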
It is not: a test of your Python coding ability.
It is: a test of your capacity to complete provided code, interpret outputs, and reflect critically on what the results mean.
The scaffold notebooks provide:
You provide:
Scaffolds are released next week (Week 8). Today is a preview.
| Scaffold | Topic | Methods | Data |
|---|---|---|---|
| A | Blockchain Fraud Detection | Logistic regression, random forest, walk-forward CV, cost-sensitive thresholds | Elliptic Bitcoin (46K labelled txns) |
| B | Tree-Based Factor Investing | Random forest, gradient boosting, SHAP | JKP UK monthly factors |
| C | Volatility Forecasting | GARCH, GJR-GARCH, Mincer-Zarnowitz evaluation | Bloomberg equity indices |
All three use real data from professional sources. All three require genuine analytical judgement, not just running code.
The question: Can we reliably detect illicit Bitcoin transactions, and how does model performance degrade as fraud patterns evolve?
Why this matters: The Elliptic dataset contains 46,564 labelled Bitcoin transactions across 49 time steps. Illicit transactions (money laundering, scams) make up roughly 10% of labels, but the rate fluctuates dramatically over time.
The analytical challenge: Walk-forward temporal validation versus shuffled CV. The gap between them reveals how much published fraud detection benchmarks overstate real-world performance. Cost-sensitive threshold selection also matters, because the default 0.5 threshold catches almost nothing when fraud is rare.
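Threshold selection can be sketched as a simple cost minimisation. Everything below is hypothetical and unrelated to the Elliptic data or the scaffold code: the 2% fraud rate, the simulated score distributions, and the assumption that a missed fraud costs 50 times as much as a false alarm.

```python
import numpy as np

rng = np.random.default_rng(10)
n = 5000
is_fraud = rng.random(n) < 0.02                       # assumed 2% positive rate
# Assumed model scores: fraud scores higher on average, but rarely above 0.5.
score = np.clip(np.where(is_fraud,
                         rng.normal(0.30, 0.12, n),
                         rng.normal(0.08, 0.06, n)), 0, 1)

cost_fn, cost_fp = 50.0, 1.0                          # assumed cost of a miss vs a false alarm

def total_cost(threshold):
    flagged = score >= threshold
    false_neg = np.sum(is_fraud & ~flagged)
    false_pos = np.sum(~is_fraud & flagged)
    return cost_fn * false_neg + cost_fp * false_pos

thresholds = np.linspace(0.01, 0.99, 99)
costs = np.array([total_cost(t) for t in thresholds])
best_threshold = thresholds[np.argmin(costs)]

recall_default = np.mean(score[is_fraud] >= 0.5)
recall_best = np.mean(score[is_fraud] >= best_threshold)
print(f"cost-minimising threshold: {best_threshold:.2f}")
print(f"recall at 0.5: {recall_default:.2f}   recall at chosen threshold: {recall_best:.2f}")
```

In a walk-forward setting the threshold would be chosen on past data only and then applied to the next time step, like any other tuned parameter.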
The data quality thread: Only labelled transactions are included (selection bias). Only detected illicit activity appears (survivorship bias). Features are anonymised, limiting interpretability. How do these constraints affect your conclusions?
The question: Does a non-linear tree-based model predict cross-sectional returns better than the linear Fama-French model?
Why this matters: If OLS with 5 factors is a biased but stable model, and gradient boosting with 50 factors is a lower-bias but higher-variance model, which wins in the JKP data?
The analytical challenge: Evaluate out-of-sample prediction performance. Compute SHAP values. Interpret which factors matter and whether their importance is stable over time.
The data quality thread: How does look-ahead bias arise in accounting-based factors (book-to-market, profitability)? What is the “factor zoo” problem and how might data snooping affect your results?
The question: Does asymmetric GARCH (GJR-GARCH) systematically outperform symmetric GARCH(1,1) for UK and US equity volatility?
Why this matters: Asymmetric GARCH accounts for the leverage effect. Bad news increases volatility more than good news of the same magnitude. This is a real empirical regularity.
The analytical challenge: Mincer-Zarnowitz forecast evaluation. Regress realised volatility on forecasted volatility. A well-calibrated forecast should have intercept ≈ 0 and slope ≈ 1.
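The Mincer-Zarnowitz regression is an ordinary least-squares fit of realised on forecast values. The sketch below uses simulated volatility numbers, not GARCH output or Bloomberg data, and the function name is hypothetical; it contrasts a well-calibrated forecast with one that overreacts.

```python
import numpy as np

def mincer_zarnowitz(realised, forecast):
    """OLS of realised on a constant and the forecast; returns (intercept, slope)."""
    Z = np.column_stack([np.ones_like(forecast), forecast])
    (intercept, slope), *_ = np.linalg.lstsq(Z, realised, rcond=None)
    return intercept, slope

rng = np.random.default_rng(11)
n = 2000
good_forecast = rng.uniform(0.5, 2.0, n)              # assumed volatility forecasts (in %)
realised = good_forecast + 0.1 * rng.normal(size=n)   # realised = forecast + noise

# A forecast that overreacts: it moves twice as much as realised volatility does.
overreacting_forecast = 2.0 * good_forecast - 1.0

a_good, b_good = mincer_zarnowitz(realised, good_forecast)
a_bad, b_bad = mincer_zarnowitz(realised, overreacting_forecast)
print(f"calibrated forecast:   intercept={a_good:+.3f}  slope={b_good:.3f}")
print(f"overreacting forecast: intercept={a_bad:+.3f}  slope={b_bad:.3f}")
```

A slope well below 1 signals that the forecast moves too much relative to outcomes; a formal evaluation would add a joint test of intercept = 0 and slope = 1 with appropriate standard errors.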
The data quality thread: Realised volatility is measured from returns data. How does data quality affect the benchmark you are comparing against?
There is no universally correct choice. Consider three questions.
What topic interests you most? Blockchain forensics, factor investing, and volatility forecasting are all active research areas. The scaffold you find most intellectually engaging is usually the one you will write about most convincingly.
What methods from this module do you understand best? Scaffold A builds on Week 8 (fraud detection, rare-event classification, temporal CV). Scaffold B builds on today’s content and Week 9 (factor investing, SHAP). Scaffold C extends Week 4 (volatility modelling).
What data quality issues do you find most tractable? CW2 Section B of the report assesses your ability to identify bias in your chosen dataset. Review your CW1 skills and ask which scaffold presents the most interesting data quality challenge.
Your CW1 developed three skills that transfer directly to CW2.
Data risk register thinking. Every dataset has data generating process assumptions that can fail. The Elliptic data has selection bias (only labelled transactions) and survivorship bias (only detected illicit activity). The JKP factor data has survivorship bias (only surviving firms). Bloomberg volatility data has measurement error. Identifying these risks is Section B of the report.
Look-ahead bias detection. You analysed look-ahead bias in CW1. Each scaffold has its own version: temporal drift in fraud patterns (Scaffold A); accounting data timing (Scaffold B); realised volatility measurement (Scaffold C). The walk-forward validation in each scaffold is designed to prevent it; your report should explain why.
Responsible practice framing. A fraud detection model, a factor investing strategy, or a volatility forecasting model all have downstream consequences. Who uses these models? What happens when they fail? Who bears the cost of false positives? This is the professional accountability thread running through both assessments.
Week 8 topic: Cryptocurrency and Fraud Detection
CW2 scaffolds released at the start of Week 8. Instructions on Blackboard.
Lab session (Week 8): Introduction to scaffolds. You will:
Recommendation for this week: Read through the scaffold descriptions again. Think about which topic you find most interesting. Look back at your CW1 feedback.
Today’s journey in six steps:
Two problems: Prediction and causal inference require different tools and different evaluation criteria.
OLS limits: In-sample fit optimisation is not the same as out-of-sample forecast accuracy.
Bias-variance tradeoff: Complex models overfit; simple models underfit. The goal is balance.
Regularisation: Ridge and LASSO deliberately introduce bias to reduce variance and improve MSPE.
Virtue of complexity: With proper regularisation, very high-dimensional models can outperform parsimonious benchmarks.
Ensemble methods: Random forests reduce variance through averaging; gradient boosting reduces bias through sequential correction.
Week 8: Cryptocurrency markets and fraud detection, plus the CW2 scaffold release. Come ready to choose.
Week 9: Factor investing in depth. Fama-French factors, the factor zoo problem, and tree-based methods applied to the JKP dataset.
Week 10: Backtesting and validation. Walk-forward testing, combinatorially symmetric cross-validation, and the five pitfalls of statistical significance (the false discovery rate problem in factor research).
The thread: Everything from Week 7 onwards is an application of the bias-variance tradeoff under the specific constraints of financial data.
FinTech & Data Science