Lab 7: When Does Complexity Pay?

Factor Investing and the Bias-Variance Tradeoff with UK Equity Data

Note: Expected time
  • Core lab (Parts 1–4): approximately 75 minutes
  • Extension: +25 minutes

1 Before You Code: The Research Question

This lab is grounded in a live research question. A working paper by faculty at Ulster University and Queen’s University Belfast asks: when does a sophisticated machine learning method for constructing factor portfolios outperform a simple traditional approach in UK equity markets?

The short answer, which we will demonstrate with the same data: less often than you might expect. In a small equity market like the UK, complex methods sometimes struggle to justify their additional estimation cost. Simple, well-designed strategies are surprisingly hard to beat out-of-sample.

This is not a reason to dismiss machine learning in finance. It is a reason to take seriously the bias-variance tradeoff, the central concept from this week’s lecture. Today you will see it in action rather than on a slide.

By the end of this lab you will have:

  • Computed Sharpe ratios for individual UK equity factors from the JKP dataset
  • Built a simple multi-factor portfolio and compared it to individual factors
  • Demonstrated the in-sample versus out-of-sample gap using OLS
  • Seen how decision tree depth directly controls the bias-variance tradeoff
  • Connected these findings to the CW2 scaffold choices

2 Setup

Show code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_squared_error
import warnings
warnings.filterwarnings('ignore')

plt.rcParams.update({
    'figure.dpi': 120,
    'axes.spines.top': False,
    'axes.spines.right': False,
    'font.size': 11
})

print("All libraries loaded.")

If you are running in Google Colab, you need to fetch the data first:

Show code
# Run this cell only if you are in Google Colab
import os
if 'COLAB_GPU' in os.environ or 'COLAB_BACKEND_VERSION' in os.environ:
    # Prefer the public Colab-notebooks repo, with a website fallback.
    urls = [
        "https://raw.githubusercontent.com/quinfer/fin510-colab-notebooks/main/labs/jkp_master_global_monthly.csv",
        "https://quinfer.github.io/financial-data-science/labs/jkp_master_global_monthly.csv",
    ]
    import urllib.request
    last_err = None
    for url in urls:
        try:
            urllib.request.urlretrieve(url, "jkp_master_global_monthly.csv")
            DATA_PATH = "jkp_master_global_monthly.csv"
            break
        except Exception as e:
            last_err = e
    else:
        raise RuntimeError(
            "Could not download jkp_master_global_monthly.csv. "
            "Tried the public notebooks repo and the course website."
        ) from last_err
else:
    from pathlib import Path
    candidates = [
        Path("jkp_master_global_monthly.csv"),
        Path("../jkp_master_global_monthly.csv"),          # if running from labs/notebooks
        Path("labs/jkp_master_global_monthly.csv"),
        Path("../../labs/jkp_master_global_monthly.csv"),  # if running from labs/notebooks
    ]
    for p in candidates:
        if p.exists():
            DATA_PATH = str(p)
            break
    else:
        raise FileNotFoundError(
            "Could not find jkp_master_global_monthly.csv. "
            "Expected it in the current directory, its parent, or in labs/."
        )

3 Part 1: UK Factor Data

3.1 Load and Inspect

The Jensen, Kelly, and Pedersen (2024) dataset provides factor returns for equity markets across the world. We focus on the United Kingdom, which has factor data running from 1986 to 2023.

Show code
df = pd.read_csv(DATA_PATH)
df['date'] = pd.to_datetime(df['date'])

uk = df[df['country'] == 'gbr'].copy().sort_values('date').reset_index(drop=True)
uk = uk.set_index('date')

factors = ['HML', 'SMB', 'MOM', 'RMW', 'CMA']
uk_factors = uk[factors].dropna()  # keep only months where all five factors are observed

print(f"UK factor data: {uk_factors.index.min().date()} to {uk_factors.index.max().date()}")
print(f"Observations: {len(uk_factors)}")
print()
print(uk_factors.describe().round(4))

The five factors are:

  • HML (High Minus Low): value premium, high book-to-market firms minus low book-to-market firms
  • SMB (Small Minus Big): size premium, small-cap firms minus large-cap firms
  • MOM (Momentum): recent winners minus recent losers
  • RMW (Robust Minus Weak): profitability premium, profitable firms minus unprofitable firms
  • CMA (Conservative Minus Aggressive): investment premium, low-investment firms minus high-investment firms

Each return is the monthly return to a long-short portfolio that is long the first-named group and short the second (for SMB, long small caps and short large caps).

3.2 Factor Performance Over Time

A useful starting point is to compare the cumulative return to each factor. This tells us which factors have been persistently positive over time, which have been erratic, and whether any have reversed.

Show code
cumulative = (1 + uk_factors).cumprod()

fig, ax = plt.subplots(figsize=(11, 5))
for col in factors:
    ax.plot(cumulative.index, cumulative[col], label=col, linewidth=1.5)

ax.axhline(1, color='black', linewidth=0.8, linestyle='--', alpha=0.5)
ax.set_title("Cumulative returns: UK equity factors, 1986–2023", fontweight='bold')
ax.set_ylabel("Growth of £1 invested")
ax.legend(loc='upper left', ncol=5, fontsize=9)
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y'))
plt.tight_layout()
plt.show()

Exercise 1.1: Looking at the cumulative return chart, which factor has been the most consistent? Which has had the largest drawdown? Does any factor look like it reversed after 2010?
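
To make "largest drawdown" in Exercise 1.1 concrete, the sketch below computes the maximum peak-to-trough decline of a return series. The helper name `max_drawdown` and the toy returns are illustrative assumptions, not part of the lab code.

```python
import pandas as pd

def max_drawdown(returns: pd.Series) -> float:
    """Largest peak-to-trough fall in cumulative wealth, as a negative fraction."""
    wealth = (1 + returns).cumprod()                 # growth of 1 unit invested
    running_peak = wealth.cummax().clip(lower=1.0)   # the starting unit counts as the first peak
    drawdown = wealth / running_peak - 1             # 0 at a new peak, negative below it
    return float(drawdown.min())

# Toy monthly returns (made up for illustration): +10%, -20%, -10%, +5%
toy = pd.Series([0.10, -0.20, -0.10, 0.05])
toy_mdd = max_drawdown(toy)
print(f"Maximum drawdown: {toy_mdd:.2%}")
```

On the lab data, one way to apply it would be column by column, for example `uk_factors[factors].apply(max_drawdown)`.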

3.3 Sharpe Ratios: Who Wins?

The Sharpe ratio adjusts returns for the risk taken. We calculate it here using the full sample, treating each factor as a standalone investment. We omit the risk-free rate because these are already long-short returns (zero net investment in principle).

Show code
def annualised_sharpe(returns):
    mean = returns.mean() * 12
    vol = returns.std() * np.sqrt(12)
    return mean / vol if vol > 0 else np.nan

sharpe = uk_factors.apply(annualised_sharpe)

fig, ax = plt.subplots(figsize=(7, 4))
colours = ['#2196F3' if s > 0 else '#F44336' for s in sharpe.values]
bars = ax.bar(sharpe.index, sharpe.values, color=colours, edgecolor='white', linewidth=0.5)
ax.axhline(0, color='black', linewidth=0.8)
ax.set_title("Annualised Sharpe ratios: UK equity factors\n(full sample, 1986–2023)", fontweight='bold')
ax.set_ylabel("Sharpe ratio")
for bar, val in zip(bars, sharpe.values):
    ax.text(bar.get_x() + bar.get_width()/2, val + (0.02 if val >= 0 else -0.06),
            f'{val:.2f}', ha='center', fontsize=10, fontweight='bold')
plt.tight_layout()
plt.show()

print(sharpe.round(3).to_string())

Exercise 1.2: The UK SMB factor has historically had a weak or negative Sharpe ratio. Why might the size premium behave differently in the UK compared to the US?


4 Part 2: The Simple Benchmark, an Equal-Weighted Multi-Factor Portfolio

Before we ask whether machine learning can improve on individual factors, we need a sensible benchmark. The simplest multi-factor strategy is to take an equal-weighted average across factors.

This might seem naive. But it embeds an important insight from statistics: averaging across signals reduces variance without necessarily reducing expected return. This is the same logic behind random forests (which average many trees).

Show code
positive_factors = ['HML', 'MOM', 'RMW', 'CMA']
uk_factors['EW'] = uk_factors[positive_factors].mean(axis=1)

ew_sharpe = annualised_sharpe(uk_factors['EW'])
print(f"Equal-weighted multi-factor Sharpe ratio: {ew_sharpe:.3f}")
print()
for f in positive_factors:
    print(f"  {f}: {annualised_sharpe(uk_factors[f]):.3f}")

The equal-weighted portfolio often achieves a higher Sharpe ratio than any individual factor. This is diversification at work: the factors are imperfectly correlated, so combining them reduces volatility more than it reduces the average return.
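
The diversification arithmetic can be sketched in a stylised case where every factor has the same volatility σ and every pair of factors has the same correlation ρ. Both are simplifying assumptions, and the values below (10% volatility, ρ = 0.3) are illustrative rather than estimated from the JKP data. Under those assumptions an equal-weighted mix of N factors has volatility σ·sqrt(1/N + (1 − 1/N)·ρ), which the code cross-checks against the full covariance matrix.

```python
import numpy as np

def ew_volatility(sigma: float, rho: float, n: int) -> float:
    """Closed-form volatility of an equal-weighted mix of n assets
    sharing one volatility (sigma) and one pairwise correlation (rho)."""
    return float(sigma * np.sqrt(1.0 / n + (1.0 - 1.0 / n) * rho))

def ew_volatility_from_cov(sigma: float, rho: float, n: int) -> float:
    """Same quantity computed as sqrt(w' Σ w), as a cross-check."""
    corr = np.full((n, n), rho)
    np.fill_diagonal(corr, 1.0)
    cov = corr * sigma**2
    w = np.full(n, 1.0 / n)          # equal weights
    return float(np.sqrt(w @ cov @ w))

sigma, rho, n = 0.10, 0.3, 4         # illustrative values, not JKP estimates
vol_formula = ew_volatility(sigma, rho, n)
vol_matrix = ew_volatility_from_cov(sigma, rho, n)
print(f"Closed form: {vol_formula:.4f}, covariance matrix: {vol_matrix:.4f}")
```

If the factors have similar average returns, averaging leaves the mean roughly unchanged while volatility falls, which is where the higher Sharpe ratio comes from.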

4.1 Rolling Performance: Is the Equal-Weighted Portfolio Consistently Better?

Full-sample Sharpe ratios can mask a lot of time variation. A strategy that looks good over 35 years might have long periods of underperformance. Rolling 36-month Sharpe ratios reveal this.

Show code
window = 36

def rolling_sharpe(returns, window):
    roll_mean = returns.rolling(window).mean() * 12
    roll_vol = returns.rolling(window).std() * np.sqrt(12)
    return roll_mean / roll_vol

fig, ax = plt.subplots(figsize=(11, 5))
ax.plot(rolling_sharpe(uk_factors['EW'], window), label='Equal-Weighted', color='black', linewidth=2)
for f in positive_factors:
    ax.plot(rolling_sharpe(uk_factors[f], window), label=f, linewidth=1, alpha=0.6)

ax.axhline(0, color='black', linewidth=0.8, linestyle='--', alpha=0.4)
ax.set_title(f"Rolling {window}-month Sharpe ratios: factors vs equal-weighted", fontweight='bold')
ax.set_ylabel("Sharpe ratio")
ax.legend(ncol=5, fontsize=9)
plt.tight_layout()
plt.show()

Exercise 2.1: Does the equal-weighted portfolio ever underperform all individual factors? Are there periods where a single-factor approach would have done better? What does this imply about the value of diversification across factors?


5 Part 3: Can OLS Improve on Equal Weighting?

The equal-weighted approach treats all factors the same. What if we used past performance to decide which factors to up-weight? This is a prediction task: can we predict which factor will have the highest return next month based on lagged returns?

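Rather than rank individual factors, the code below takes the simplest version of this task: an OLS model that uses the previous month’s factor returns to predict the current month’s equal-weighted return. This is deliberately simple. The in-sample versus out-of-sample comparison will teach us something important.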

5.1 The In-Sample Versus Out-of-Sample Gap

Show code
predictors = positive_factors
target = 'EW'

X = uk_factors[predictors].shift(1).dropna()
y = uk_factors[target].loc[X.index]

split = int(len(X) * 0.6)
X_train, X_test = X.iloc[:split], X.iloc[split:]
y_train, y_test = y.iloc[:split], y.iloc[split:]

ols = LinearRegression()
ols.fit(X_train, y_train)

r2_in = ols.score(X_train, y_train)
r2_out = ols.score(X_test, y_test)

print("OLS: using last month's factor returns to predict this month's equal-weighted return")
print()
print(f"  In-sample R²  (training data): {r2_in:.4f}  ({r2_in*100:.2f}%)")
print(f"  Out-of-sample R² (test data):  {r2_out:.4f}  ({r2_out*100:.2f}%)")
print()
print("Coefficients:")
for name, coef in zip(predictors, ols.coef_):
    print(f"  {name}: {coef:.4f}")

The gap between in-sample and out-of-sample R² is the central fact of financial prediction. The in-sample number can look encouraging. The out-of-sample number reveals whether the model has learned genuine structure or fitted noise.
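
One detail worth knowing when reading these numbers: scikit-learn's `score` and `r2_score` benchmark the model against the mean of the data being scored, which for test data is not known when the forecast is made. A common alternative in return prediction benchmarks against the training-sample mean instead. The sketch below is one way to compute that version; the function name `oos_r2_vs_train_mean` and the simulated returns are assumptions for illustration, not the lab's method.

```python
import numpy as np

def oos_r2_vs_train_mean(y_test, y_pred, y_train) -> float:
    """1 - SSE(model) / SSE(benchmark), where the benchmark forecasts the training mean."""
    y_test = np.asarray(y_test, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    benchmark = np.mean(y_train)                  # known before the test period starts
    sse_model = np.sum((y_test - y_pred) ** 2)
    sse_bench = np.sum((y_test - benchmark) ** 2)
    return float(1.0 - sse_model / sse_bench)

rng = np.random.default_rng(0)
y_tr_sim = rng.normal(0.005, 0.03, size=200)      # simulated "training" returns
y_te_sim = rng.normal(0.005, 0.03, size=100)      # simulated "test" returns

# Forecasting the training mean itself scores exactly zero by construction.
r2_benchmark = oos_r2_vs_train_mean(y_te_sim, np.full_like(y_te_sim, y_tr_sim.mean()), y_tr_sim)
# A forecast made of pure noise, for comparison.
r2_noise = oos_r2_vs_train_mean(y_te_sim, rng.normal(0, 0.03, size=100), y_tr_sim)
print(f"Benchmark forecast: {r2_benchmark:.4f}, noise forecast: {r2_noise:.4f}")
```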

Exercise 3.1: What does an out-of-sample R² close to zero mean? What would a negative out-of-sample R² mean?

5.2 Does Ridge Regression Help?

Ridge regression adds a penalty on coefficient size. It will shrink the OLS coefficients towards zero. With only four predictors this may not make a dramatic difference, but it illustrates the principle.
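
The shrinkage effect can be seen directly on simulated data. The sketch below fits `Ridge` at increasing penalty values and records the length of the coefficient vector. The simulated predictors, the "true" coefficients, and the penalty grid are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(1)
X_sim = rng.normal(size=(120, 4))                           # four simulated predictors
y_sim = X_sim @ np.array([0.5, -0.3, 0.2, 0.0]) + rng.normal(scale=1.0, size=120)

# Unpenalised benchmark
ols_norm = np.linalg.norm(LinearRegression().fit(X_sim, y_sim).coef_)

# Coefficient norm under progressively heavier penalties
alpha_grid = [0.1, 1.0, 10.0, 100.0, 1000.0]
ridge_norms = [np.linalg.norm(Ridge(alpha=a).fit(X_sim, y_sim).coef_) for a in alpha_grid]

print(f"OLS coefficient norm: {ols_norm:.4f}")
for a, norm in zip(alpha_grid, ridge_norms):
    print(f"  alpha = {a:>7.1f}: coefficient norm = {norm:.4f}")
```

scikit-learn calls the penalty `alpha`; it plays the role of λ in the lecture notation.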

Show code
from sklearn.linear_model import RidgeCV

alphas = np.logspace(-3, 3, 50)
ridge = RidgeCV(alphas=alphas, cv=TimeSeriesSplit(n_splits=5))
ridge.fit(X_train, y_train)

r2_ridge_in = ridge.score(X_train, y_train)
r2_ridge_out = ridge.score(X_test, y_test)

print(f"Ridge (λ = {ridge.alpha_:.4f})")
print(f"  In-sample R²:    {r2_ridge_in:.4f}")
print(f"  Out-of-sample R²: {r2_ridge_out:.4f}")
print()
print("Ridge vs OLS: out-of-sample improvement:")
print(f"  {r2_ridge_out - r2_out:+.4f}")

Notice that Ridge uses time-series cross-validation (TimeSeriesSplit) to choose the shrinkage parameter. This is critical: as discussed in the lecture, using standard K-fold cross-validation on time-series data would allow the model to “see the future.”
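
The difference between the two schemes is easiest to see by printing the fold indices. The sketch below uses twelve integers as a stand-in for twelve consecutive months; the fold count of three is an arbitrary choice for readability.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

months = np.arange(12)   # stand-in for 12 consecutive months of data

print("TimeSeriesSplit (training always precedes validation):")
ts_folds = list(TimeSeriesSplit(n_splits=3).split(months))
for train_idx, val_idx in ts_folds:
    print(f"  train {train_idx.tolist()} -> validate {val_idx.tolist()}")

print("KFold (some training months come after the validation months):")
kf_folds = list(KFold(n_splits=3).split(months))
for train_idx, val_idx in kf_folds:
    print(f"  train {train_idx.tolist()} -> validate {val_idx.tolist()}")
```

In every `TimeSeriesSplit` fold the last training month comes before the first validation month, whereas unshuffled `KFold` trains on later months to validate on earlier ones.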


6 Part 4: Decision Trees and the Depth-Complexity Tradeoff

Now we move from linear models to decision trees. The question is identical: can past factor returns predict the next period’s equal-weighted return? But now we allow non-linear interactions between predictors.

The key parameter is tree depth: a deeper tree makes more splits, fitting the training data more tightly. This is a concrete dial that moves us along the bias-variance curve.

6.1 Fitting Trees at Different Depths

Show code
depths = [1, 2, 3, 5, 8, 12]
results = []

for depth in depths:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    r2_in_tree = tree.score(X_train, y_train)
    r2_out_tree = tree.score(X_test, y_test)
    results.append({
        'Depth': depth,
        'In-sample R²': r2_in_tree,
        'Out-of-sample R²': r2_out_tree,
        'Gap': r2_in_tree - r2_out_tree
    })

results_df = pd.DataFrame(results)
print(results_df.round(4).to_string(index=False))
Show code
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

ax1.plot(results_df['Depth'], results_df['In-sample R²'],
         'o-', color='steelblue', linewidth=2, label='In-sample R²')
ax1.plot(results_df['Depth'], results_df['Out-of-sample R²'],
         's--', color='crimson', linewidth=2, label='Out-of-sample R²')
ax1.axhline(0, color='black', linewidth=0.8, linestyle=':', alpha=0.5)
ax1.set_xlabel("Tree depth")
ax1.set_ylabel("R²")
ax1.set_title("In-sample fit vs out-of-sample performance\nas tree complexity increases", fontweight='bold')
ax1.legend()

ax2.bar(results_df['Depth'], results_df['Gap'], color='darkorange', alpha=0.8)
ax2.set_xlabel("Tree depth")
ax2.set_ylabel("R² gap (in-sample minus out-of-sample)")
ax2.set_title("The overfitting gap\ngrows with tree depth", fontweight='bold')
ax2.axhline(0, color='black', linewidth=0.8)
plt.tight_layout()
plt.show()

This chart is the bias-variance tradeoff made visible. At depth 1 the tree underfits: it can make only one binary split and captures very little structure. As depth increases, the in-sample R² rises steadily because the tree memorises the training data, but the out-of-sample R² does not follow. The gap between the two lines is the overfitting tax.
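
To see that a rising in-sample R² is not in itself evidence of signal, the sketch below fits trees to simulated data in which the target is independent of the predictors, so the true predictive R² is zero by construction. The data and variable names are assumptions for illustration and are separate from the lab's JKP data.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X_noise = rng.normal(size=(300, 4))     # simulated predictors
y_noise = rng.normal(size=300)          # target drawn independently of the predictors

# Chronological-style split: first 200 rows to fit, last 100 held out
X_fit, X_hold = X_noise[:200], X_noise[200:]
y_fit, y_hold = y_noise[:200], y_noise[200:]

noise_results = {}
for depth in [1, 3, None]:              # None lets the tree grow until every leaf is pure
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_fit, y_fit)
    noise_results[depth] = (tree.score(X_fit, y_fit), tree.score(X_hold, y_hold))
    print(f"depth={depth}: in-sample R² = {noise_results[depth][0]:.3f}, "
          f"out-of-sample R² = {noise_results[depth][1]:.3f}")
```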

Exercise 4.1: At what depth does out-of-sample performance appear to peak? Is this the same depth where in-sample performance peaks?

Exercise 4.2: At depth 12, the in-sample R² may approach 1.0. Why? Is this a good thing?

6.2 The Research Context

This pattern, complexity helping in-sample but not always out-of-sample, is precisely what the research paper "When Simplicity Beats Sophistication" investigates using much richer UK equity microdata. The paper constructs factor portfolios using three approaches:

  • Fama-French (fixed percentile breakpoints, traditional method)
  • Decision Tree (tree-learned breakpoints, more flexible)
  • Asset Pricing Trees (machine learning methods that directly optimise the stochastic discount factor)

The finding, consistent across different sample periods: in the UK equity market, which is smaller and less liquid than the US market, the more sophisticated methods provide incremental rather than transformational gains. The simple Fama-French benchmark, properly designed, is surprisingly competitive.

This does not mean ML methods are useless in finance. It means sample size matters, estimation error matters, and the bias-variance tradeoff is a genuine constraint, not just a theoretical concept.

Exercise 4.3: Why might UK equity markets be a harder testing ground for ML methods than US equity markets? Think about sample size (fewer listed companies), factor data frequency, and estimation noise.


7 Part 5: Walk-Forward Validation

The train-test split we used is the simplest form of out-of-sample evaluation. A more realistic evaluation uses walk-forward validation with an expanding window: repeatedly retrain the model on all data available up to a point, then predict one period ahead.

This is closer to what a practitioner would actually do, and it matches the spirit of CW2 validation: temporal walk-forward in Scaffolds A and B, and forecast evaluation in Scaffold C.

Show code
predictions_ols = []
predictions_tree = []
actuals = []

min_train = 60

for t in range(min_train, len(X)):  # include the final month: X.iloc[t:t+1] and y.iloc[t] both exist at t = len(X) - 1
    X_tr = X.iloc[:t]
    y_tr = y.iloc[:t]
    X_next = X.iloc[t:t+1]
    y_next = y.iloc[t]

    ols_wf = LinearRegression()
    ols_wf.fit(X_tr, y_tr)
    predictions_ols.append(ols_wf.predict(X_next)[0])

    tree_wf = DecisionTreeRegressor(max_depth=3, random_state=42)
    tree_wf.fit(X_tr, y_tr)
    predictions_tree.append(tree_wf.predict(X_next)[0])

    actuals.append(y_next)

actuals = np.array(actuals)
pred_ols = np.array(predictions_ols)
pred_tree = np.array(predictions_tree)
pred_zero = np.zeros_like(actuals)  # naive baseline: always forecast a zero return

from sklearn.metrics import r2_score

# r2_score compares each forecast with the mean of the realised returns,
# so the zero forecast can never score above zero.
print("Walk-forward out-of-sample R²:")
print(f"  Zero forecast (naive baseline):  {r2_score(actuals, pred_zero):.4f}")
print(f"  OLS (lagged factors):            {r2_score(actuals, pred_ols):.4f}")
print(f"  Decision tree (depth=3):         {r2_score(actuals, pred_tree):.4f}")

The walk-forward R² is likely to be low or negative for all three models. This is normal for monthly return prediction in equity markets. Recall from the lecture: the typical out-of-sample R² for monthly return prediction is 1–5%, often less.

What matters more in practice is whether the sign of the prediction is correct often enough to generate a profitable trading strategy. This is assessed in the CW2 scaffolds.
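
A simple way to measure sign accuracy is the hit rate: the share of months in which the forecast and the realised return have the same sign. The helper below is a hypothetical sketch with made-up numbers; whether the CW2 scaffolds use this exact metric is not stated here.

```python
import numpy as np

def hit_rate(actual, predicted) -> float:
    """Share of periods in which forecast and outcome have the same sign."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.sign(actual) == np.sign(predicted)))

# Toy example (made-up numbers): the signs agree in three of the four months
toy_actual = np.array([0.02, -0.01, 0.03, -0.02])
toy_pred = np.array([0.01, -0.02, -0.01, -0.01])
toy_hits = hit_rate(toy_actual, toy_pred)
print(f"Hit rate: {toy_hits:.2f}")
```

With the walk-forward arrays from this section it could be called as `hit_rate(actuals, pred_ols)`. A forecast of exactly zero has no sign, so it only counts as a hit in months where the realised return is also exactly zero.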


8 Extension: Factor Correlations and Diversification Maths

The equal-weighted portfolio works partly because factors are imperfectly correlated. This extension demonstrates the diversification arithmetic directly.

Show code
corr = uk_factors[positive_factors].corr()

fig, ax = plt.subplots(figsize=(6, 5))
im = ax.imshow(corr.values, cmap='RdBu_r', vmin=-1, vmax=1)
ax.set_xticks(range(len(positive_factors)))
ax.set_yticks(range(len(positive_factors)))
ax.set_xticklabels(positive_factors)
ax.set_yticklabels(positive_factors)
plt.colorbar(im, ax=ax, label='Correlation')
ax.set_title("UK factor return correlations\n(monthly, 1986–2023)", fontweight='bold')
for i in range(len(positive_factors)):
    for j in range(len(positive_factors)):
        ax.text(j, i, f'{corr.values[i,j]:.2f}', ha='center', va='center',
                fontsize=9, color='white' if abs(corr.values[i,j]) > 0.5 else 'black')
plt.tight_layout()
plt.show()
Show code
mean_individual_vol = uk_factors[positive_factors].std().mean() * np.sqrt(12)
ew_vol = uk_factors['EW'].std() * np.sqrt(12)
diversification_benefit = (mean_individual_vol - ew_vol) / mean_individual_vol

print(f"Average individual factor volatility: {mean_individual_vol:.4f} ({mean_individual_vol*100:.2f}% p.a.)")
print(f"Equal-weighted portfolio volatility:  {ew_vol:.4f} ({ew_vol*100:.2f}% p.a.)")
print(f"Volatility reduction from combining:  {diversification_benefit*100:.1f}%")
print()
print("This is why the equal-weighted portfolio often achieves a higher Sharpe ratio than its components.")

Extension Exercise: If all four factors had correlation of exactly 1.0, what would the equal-weighted portfolio volatility equal? What about correlation of exactly 0? What does this tell you about the role of correlation in portfolio construction?


9 Connecting to CW2

This lab has demonstrated three ideas that run directly through the CW2 scaffolds.

The in-sample versus out-of-sample gap is the central evaluation discipline across all three scaffolds. Scaffold A (fraud detection) uses walk-forward temporal validation; Scaffold B (factor investing) uses walk-forward prediction; Scaffold C (volatility forecasting) uses Mincer-Zarnowitz evaluation. In every case, in-sample fit is not a reliable guide to out-of-sample performance. When you complete the TODO sections and write your report, you should explain why walk-forward validation is used and what the out-of-sample results mean.

Complexity does not automatically improve prediction. The decision tree experiment showed that deeper trees fit the training data better but do not necessarily predict better. Scaffold B (tree-based factor investing) asks you to fit random forests and gradient boosting to the same JKP data. The bias-variance tradeoff will be visible in those results too. In-sample R² will likely be high; out-of-sample R² will be much lower.

Simple benchmarks are hard to beat. The equal-weighted multi-factor portfolio achieved a competitive Sharpe ratio without any statistical estimation. All three scaffolds include a simple benchmark comparison. Your report should discuss whether the more complex approach adds genuine value and under what conditions you would expect it to do so.


10 Summary

| Concept | What you demonstrated | Lab section |
|---|---|---|
| Factor premia | UK HML, MOM, RMW, CMA all positive on average | Part 1 |
| Diversification benefit | Equal-weighting reduces volatility, raises Sharpe | Part 2 |
| In-sample vs out-of-sample gap | OLS R² drops sharply from training to test data | Part 3 |
| Bias-variance tradeoff | Deeper trees increase the gap | Part 4 |
| Walk-forward validation | Realistic evaluation with no look-ahead bias | Part 5 |
| Correlation and diversification | Low factor correlations explain the EW advantage | Extension |

The research question, when does complexity pay, does not have a universal answer. The answer depends on sample size, signal-to-noise ratio, and market liquidity. In the UK equity market, with roughly 38 years of monthly data and a smaller investable universe than the US, the simple benchmark is a formidable competitor. This is not a failure of machine learning; it is a correct application of the bias-variance principle.